Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
120
Apache Spark: Unified Platform for Big Data
rxin
1
220
Advanced Spark @ Spark Summit 2014
rxin
4
310
Apache Spark: Easier and Faster Big Data
rxin
2
270
GraphX at Spark User Meetup
rxin
0
140
Shark SIGMOD research deck
rxin
2
490
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
700
Other Decks in Research
See All in Research
TRIPOD+AI Expandedチェックリスト 有志翻訳による日本語版 version.1.1
shuntaros
0
120
Mathematics in the Age of AI and the 4 Generation University
hachama
0
140
IM2024
mamoruk
0
250
DeepSeek を利用する上でのリスクと安全性の考え方
schroneko
3
1.3k
90 分で学ぶ P 対 NP 問題
e869120
14
6.1k
インドネシアのQA事情を紹介するの
yujijs
0
180
AWS 音声基盤モデル トーク解析AI MiiTelの音声処理について
ken57
0
200
Data-centric AI勉強会 「ロボットにおけるData-centric AI」
haraduka
0
560
【NLPコロキウム】Stepwise Alignment for Constrained Language Model Policy Optimization (NeurIPS 2024)
akifumi_wachi
3
570
定性データ、どう活かす? 〜定性データのための分析基盤、はじめました〜 / How to utilize qualitative data? ~We have launched an analysis platform for qualitative data~
kaminashi
6
770
ドローンやICTを活用した持続可能なまちづくりに関する研究
nro2daisuke
0
200
Introduction of NII S. Koyama's Lab (AY2025)
skoyamalab
0
240
Featured
See All Featured
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
26k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
Optimising Largest Contentful Paint
csswizardry
36
3.2k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
104
19k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
9
740
Stop Working from a Prison Cell
hatefulcrawdad
268
20k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
129
19k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
119
51k
Done Done
chrislema
183
16k
Art, The Web, and Tiny UX
lynnandtonic
298
20k
A designer walks into a library…
pauljervisheath
205
24k
Scaling GitHub
holman
459
140k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com