Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
150
1
Share
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
130
Apache Spark: Unified Platform for Big Data
rxin
1
250
Advanced Spark @ Spark Summit 2014
rxin
4
350
Apache Spark: Easier and Faster Big Data
rxin
2
300
GraphX at Spark User Meetup
rxin
0
160
Shark SIGMOD research deck
rxin
2
540
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
720
Other Decks in Research
See All in Research
Ankylosing Spondylitis
ankh2054
0
160
ウェブ・ソーシャルメディア論文読み会 第36回: The Stepwise Deception: Simulating the Evolution from True News to Fake News with LLM Agents (EMNLP, 2025)
hkefka385
0
210
はじまりの クエスチョンブック —余暇と豊かさにあふれた社会とは?
culturaltransition
PRO
0
280
生成AI による論文執筆サポート・ワークショップ 論文執筆・推敲編 / Generative AI-Assisted Paper Writing Support Workshop: Drafting and Revision Edition
ks91
PRO
0
170
Sequences of Logits Reveal the Low Rank Structure of Language Models
sansantech
PRO
1
140
2026年1月の生成AI領域の重要リリース&トピック解説
kajikent
0
910
Unified Audio Source Separation (Defense Slides)
kohei_1979
1
560
2026.01ウェビナー資料
elith
0
330
2026 東京科学大 情報通信系 研究室紹介 (大岡山)
icttitech
0
1.7k
ForestCast: Forecasting Deforestation Risk at Scale with Deep Learning
satai
3
650
SREはサイバネティクスの夢をみるか? / Do SREs Dream of Cybernetics?
yuukit
3
460
ドメイン知識がない領域での自然言語処理の始め方
hargon24
1
280
Featured
See All Featured
Tips & Tricks on How to Get Your First Job In Tech
honzajavorek
1
480
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
1
340
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
250
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
27
3.4k
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.2k
Between Models and Reality
mayunak
3
260
KATA
mclloyd
PRO
35
15k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.8k
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
190
Leadership Guide Workshop - DevTernity 2021
reverentgeek
1
260
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
85
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com