Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
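To illustrate why deterministic, idempotent tasks make fault tolerance simple, here is a minimal Python sketch. It is not taken from the talk or from any MapReduce implementation: `word_count_task`, `run_with_retries`, and the `fail_first_n` crash simulation are all hypothetical names invented for this example.

```python
def word_count_task(chunk):
    """A deterministic, side-effect-free task: same input, same output."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_with_retries(task, chunk, fail_first_n=0, max_attempts=5):
    """Hypothetical scheduler loop: rerun the task until one attempt succeeds.

    fail_first_n simulates that many worker crashes before a clean run.
    Returns (result, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if attempt <= fail_first_n:
                raise RuntimeError("simulated worker crash")
            return task(chunk), attempt
        except RuntimeError:
            continue  # safe to retry: the task has no side effects
    raise RuntimeError("task failed after max_attempts")


clean, _ = run_with_retries(word_count_task, "a b a")
retried, attempts = run_with_retries(word_count_task, "a b a", fail_first_n=2)
```

Because the task is deterministic and has no side effects, the retried run produces the same result as the clean run, so the scheduler can rerun it on any free worker.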
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning, OLAP: strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
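One well-known answer, and the one Spark's resilient distributed datasets take, is to remember how each in-memory partition was derived (its lineage) and recompute it on loss, rather than replicating the data. The toy Python sketch below shows only that idea. `ToyDataset` and all of its methods are hypothetical names for this example, not Spark's API.

```python
class ToyDataset:
    """Hypothetical stand-in for a partitioned dataset that tracks lineage."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition  # lineage: how to rebuild partition i
        self.num_partitions = num_partitions
        self._cache = {}                   # partitions currently held in memory
        self.computations = 0              # how many times a partition was built

    @classmethod
    def from_lists(cls, partitions):
        """Wrap stable source data, one inner list per partition."""
        return cls(lambda i: list(partitions[i]), len(partitions))

    def map(self, fn):
        """Derive a new dataset; only the recipe is stored, not the data."""
        parent = self
        return ToyDataset(
            lambda i: [fn(x) for x in parent.partition(i)],
            self.num_partitions,
        )

    def partition(self, i):
        """Return partition i, rebuilding it from lineage if it is missing."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate losing the in-memory copy of partition i."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyDataset.from_lists([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
first = squared.collect()
squared.lose_partition(1)   # simulated failure
second = squared.collect()  # only partition 1 is recomputed
```

In this sketch a lost partition costs one recomputation of that partition alone; the surviving partition is served from memory.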
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
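As a rough illustration of what a columnar in-memory layout means, the Python sketch below stores each column of a small table as one packed primitive array, so a query that touches one column scans only that array. This is an assumption-laden toy and not Shark's implementation: the two-column schema and the names `to_columnar` and `average` are hypothetical.

```python
from array import array


def to_columnar(rows):
    """Turn (user_id, latency_ms) row tuples into one packed array per column."""
    return {
        "user_id": array("q", (r[0] for r in rows)),     # 64-bit signed ints
        "latency_ms": array("d", (r[1] for r in rows)),  # 64-bit floats
    }


def average(column):
    """Aggregate over a single column without touching the others."""
    return sum(column) / len(column)


rows = [(1, 10.0), (2, 20.0), (3, 30.0)]
columns = to_columnar(rows)
mean_latency = average(columns["latency_ms"])
```

Compared with a list of row tuples, the packed arrays avoid one object per field, which is the general motivation for columnar storage in memory.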
[Bar chart: Runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Bar labels: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com