Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
140
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
120
Apache Spark: Unified Platform for Big Data
rxin
1
230
Advanced Spark @ Spark Summit 2014
rxin
4
330
Apache Spark: Easier and Faster Big Data
rxin
2
290
GraphX at Spark User Meetup
rxin
0
140
Shark SIGMOD research deck
rxin
2
520
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
700
Other Decks in Research
See All in Research
Mechanistic Interpretability:解釈可能性研究の新たな潮流
koshiro_aoki
1
440
20250605_新交通システム推進議連_熊本都市圏「車1割削減、渋滞半減、公共交通2倍」から考える地方都市交通政策
trafficbrain
0
840
RHO-1: Not All Tokens Are What You Need
sansan_randd
1
190
日本語新聞記事を用いた大規模言語モデルの暗記定量化 / LLMC2025
upura
0
230
20250725-bet-ai-day
cipepser
2
470
長期・短期メモリを活用したエージェントの個別最適化
isidaitc
0
160
Generative Models 2025
takahashihiroshi
25
13k
Adaptive Experimental Design for Efficient Average Treatment Effect Estimation and Treatment Choice
masakat0
0
120
診断前の病歴テキストを対象としたLLMによるエンティティリンキング精度検証
hagino3000
1
150
AI エージェントを活用した研究再現性の自動定量評価 / scisci2025
upura
1
160
超高速データサイエンス
matsui_528
1
140
Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation
satai
3
210
Featured
See All Featured
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
9.7k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
33
2.5k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.7k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
Designing Experiences People Love
moore
142
24k
A designer walks into a library…
pauljervisheath
209
24k
Java REST API Framework Comparison - PWX 2021
mraible
33
8.8k
Bash Introduction
62gerente
615
210k
Git: the NoSQL Database
bkeepers
PRO
431
66k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
30
2.9k
Rails Girls Zürich Keynote
gr2m
95
14k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com