Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
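To illustrate why deterministic, idempotent tasks make fault-tolerance simple, here is a minimal sketch in Python. It is not MapReduce or Hadoop code: `word_count_task`, `run_with_retries`, and the `fail_first_n` parameter are hypothetical names invented for this illustration, and worker failures are only simulated. The point is that a failed attempt can be thrown away and rerun, and the final result is the same as if nothing had failed.

```python
def word_count_task(partition):
    """Deterministic task: the output depends only on the input partition."""
    counts = {}
    for word in partition.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_with_retries(task, partition, fail_first_n=0, max_attempts=4):
    """Run a task, simulating `fail_first_n` lost attempts before one succeeds."""
    for attempt in range(max_attempts):
        if attempt < fail_first_n:
            continue  # simulated worker crash: drop this attempt and retry
        return task(partition)
    raise RuntimeError("task failed on every attempt")


partitions = ["a b a", "b c", "a c c"]

# One run with no failures, one run where every task loses two attempts.
clean = [run_with_retries(word_count_task, p) for p in partitions]
flaky = [run_with_retries(word_count_task, p, fail_first_n=2) for p in partitions]

# Because the task is deterministic and has no side effects, retries are safe.
assert clean == flaky
```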
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning, OLAP: strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
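The slide only poses the question. One answer, the one associated with Spark's resilient distributed datasets, is to keep data in memory and remember how each dataset was derived, so a lost copy can be recomputed instead of replicated. The sketch below illustrates that idea in plain Python. The `Dataset` class and its `collect`, `map`, and `evict` methods are hypothetical and are not Spark's API; failure is simulated by clearing the cache.

```python
class Dataset:
    """Hypothetical in-memory dataset that remembers how it was derived."""

    def __init__(self, compute):
        self._compute = compute  # recipe (lineage) for rebuilding the data
        self._cache = None       # in-memory copy, which may be lost
        self.computations = 0    # how many times the data was (re)built

    def collect(self):
        """Return the data, building it from the recipe only if needed."""
        if self._cache is None:
            self._cache = list(self._compute())
            self.computations += 1
        return self._cache

    def map(self, fn):
        """Derive a new dataset; only the recipe is recorded here."""
        return Dataset(lambda: (fn(x) for x in self.collect()))

    def evict(self):
        """Simulate losing the in-memory copy, e.g. when a node fails."""
        self._cache = None


base = Dataset(lambda: range(5))
squares = base.map(lambda x: x * x)

first = squares.collect()      # computed once and kept in memory
again = squares.collect()      # served from memory, no recomputation
squares.evict()                # simulated failure
recovered = squares.collect()  # rebuilt from the recorded recipe
```

In this toy version, repeated reads cost nothing after the first, which is what iterative workloads want, and recovery costs one recomputation of the lost dataset only.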
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
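As a rough picture of what an in-memory columnar store buys, the sketch below holds the same small table in a row layout and in a column layout. It is plain Python, not Shark's implementation, and the table contents and function names are made up for illustration. An aggregate over one column only has to touch that column's compact, typed array.

```python
from array import array

# Row layout: one tuple per record (name, age, spend).
rows = [
    ("alice", 34, 120.0),
    ("bob", 27, 80.5),
    ("carol", 45, 310.25),
]

# Columnar layout: one array per column; numeric columns are stored
# as compact typed arrays instead of one Python object per field.
columns = {
    "name": [r[0] for r in rows],
    "age": array("i", (r[1] for r in rows)),
    "spend": array("d", (r[2] for r in rows)),
}


def total_spend_rows(rows):
    """Row layout: every record is visited even though one field is needed."""
    return sum(r[2] for r in rows)


def total_spend_columns(columns):
    """Column layout: only the 'spend' array is scanned."""
    return sum(columns["spend"])


assert total_spend_rows(rows) == total_spend_columns(columns)
```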
[Bar chart: runtime (seconds) on a 100-node EC2 cluster for Query 1, Query 2, and Log Regress, comparing Shark/Spark with Hive/Hadoop. Bar labels: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com