Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
“The bar for open source software is at historical low.” i.e. “This is the right time to do grad school.”
Iterative machine learning, OLAP: strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
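One way to picture the question is to keep partitions in memory and, when one is lost, rebuild it by re-running the deterministic task that produced it instead of keeping replicas. The sketch below is a toy illustration under that assumption; it is not Spark's API, and `CachedDataset`, `squares`, and every other name in it are hypothetical.

```python
# Toy illustration only: not Spark's API. All names here are hypothetical.

class CachedDataset:
    """Keeps partitions in memory and rebuilds lost ones by re-running a task."""

    def __init__(self, num_partitions, compute_partition):
        self.num_partitions = num_partitions
        self.compute_partition = compute_partition  # assumed to be deterministic
        self.cache = {}                             # partition id -> rows held in memory
        self.recomputations = 0

    def partition(self, pid):
        # Serve from memory when possible; otherwise re-run the task.
        if pid not in self.cache:
            self.cache[pid] = self.compute_partition(pid)
            self.recomputations += 1
        return self.cache[pid]

    def drop(self, pid):
        # Simulate losing the node that held this partition.
        self.cache.pop(pid, None)

    def total(self):
        return sum(sum(self.partition(p)) for p in range(self.num_partitions))


def squares(pid, size=4):
    # Deterministic: the same partition id always yields the same rows.
    start = pid * size
    return [x * x for x in range(start, start + size)]


data = CachedDataset(num_partitions=3, compute_partition=squares)
first = data.total()   # computes and caches all three partitions
data.drop(1)           # "failure": partition 1 is lost
second = data.total()  # only partition 1 is recomputed
```

Because the task is deterministic, the second pass returns the same answer as the first while recomputing only the lost partition; the other two are still served from memory.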
Shark: How to do SQL query processing efficiently in “MapReduce” style. SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata).
“You need to beat Hadoop by at least 100X to publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
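As a rough picture of what an in-memory columnar store buys, the sketch below lays a tiny table out as one typed array per column, so a query scans only the columns it needs. It is a minimal illustration and not Shark's implementation; the table, the column names, and `sum_latency_where_status` are made up for the example.

```python
# Toy illustration only: not Shark's implementation. Table and column names are made up.
from array import array

rows = [
    ("a.html", 200, 1.5),
    ("b.html", 404, 0.2),
    ("c.html", 200, 3.0),
]

# Columnar layout: one compact, typed array per column instead of one tuple per row.
columns = {
    "url": [r[0] for r in rows],
    "status": array("i", (r[1] for r in rows)),
    "latency": array("d", (r[2] for r in rows)),
}


def sum_latency_where_status(cols, status):
    # Only the two columns the query needs are scanned; "url" is never touched.
    return sum(l for s, l in zip(cols["status"], cols["latency"]) if s == status)


ok_latency = sum_latency_where_status(columns, 200)
```

Typed arrays also avoid a separate Python object per value, which is the same kind of memory saving a columnar layout is after.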
[Bar chart: Runtime (seconds) on a 100-node EC2 cluster for Query 1, Query 2, and Log Regress, comparing Shark/Spark with Hive/Hadoop. Bar labels: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work! But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org; SQL query processing: shark.cs.berkeley.edu; graph computation: www.wait-another-year.com