Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
120
Apache Spark: Unified Platform for Big Data
rxin
1
220
Advanced Spark @ Spark Summit 2014
rxin
4
310
Apache Spark: Easier and Faster Big Data
rxin
2
280
GraphX at Spark User Meetup
rxin
0
140
Shark SIGMOD research deck
rxin
2
500
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
700
Other Decks in Research
See All in Research
インドネシアのQA事情を紹介するの
yujijs
0
190
ウッドスタックチャン:木材を用いた小型エージェントロボットの開発と印象評価 / ec75-sato
yumulab
1
290
SpectralMamba: Efficient Mamba for Hyperspectral Image Classification
satai
3
320
実行環境に中立なWebAssemblyライブマイグレーション機構/techtalk-2025spring
chikuwait
0
160
JSAI NeurIPS 2024 参加報告会(AI アライメント)
akifumi_wachi
5
980
Individual tree crown delineation in high resolution aerial RGB imagery using StarDist-based model
satai
3
210
言語モデルLUKEを経済の知識に特化させたモデル「UBKE-LUKE」について
petter0201
0
360
Mathematics in the Age of AI and the 4 Generation University
hachama
0
150
データサイエンティストの採用に関するアンケート
datascientistsociety
PRO
0
640
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
satai
3
440
Batch Processing Algorithm for Elliptic Curve Operations and Its AVX-512 Implementation
herumi
0
150
博士論文公聴会: Scaling Telemetry Workloads in Cloud Applications: Techniques for Instrumentation, Storage, and Mining / PhD Defence
yuukit
1
130
Featured
See All Featured
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
14
1.5k
Building an army of robots
kneath
305
45k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
120
52k
Designing for humans not robots
tammielis
253
25k
Embracing the Ebb and Flow
colly
85
4.7k
Fontdeck: Realign not Redesign
paulrobertlloyd
84
5.5k
Automating Front-end Workflow
addyosmani
1370
200k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
227
22k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
179
53k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
790
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
5
600
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com