Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
1.9k
Interface Design for Spark Community
rxin
12
1.3k
Spark Committer Night meetup @ NYC
rxin
1
110
Apache Spark: Unified Platform for Big Data
rxin
1
210
Advanced Spark @ Spark Summit 2014
rxin
4
300
Apache Spark: Easier and Faster Big Data
rxin
2
270
GraphX at Spark User Meetup
rxin
0
130
Shark SIGMOD research deck
rxin
2
460
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
690
Other Decks in Research
See All in Research
最近のVisual Odometryと Depth Estimation
sgk
1
270
12
0325
0
190
非ガウス性と非線形性に基づく統計的因果探索
sshimizu2006
0
370
Large Vision Language Model (LVLM) に関する最新知見まとめ (Part 1)
onely7
21
3.5k
[依頼講演] 適応的実験計画法に基づく効率的無線システム設計
k_sato
0
130
LLM時代にLabは何をすべきか聞いて回った1年間
hargon24
1
500
文書画像のデータ化における VLM活用 / Use of VLM in document image data conversion
sansan_randd
2
200
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
eumesy
PRO
7
1.2k
新規のC言語処理系を実装することによる 組込みシステム研究にもたらす価値 についての考察
zacky1972
0
140
多様かつ継続的に変化する環境に適応する情報システム/thesis-defense-presentation
monochromegane
1
540
[CV勉強会@関東 CVPR2024] Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation / kantocv 61th CVPR 2024
shunk031
1
460
MIRU2024_招待講演_RALF_in_CVPR2024
udonda
1
330
Featured
See All Featured
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
8
900
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
27
4.3k
Java REST API Framework Comparison - PWX 2021
mraible
PRO
28
8.2k
Statistics for Hackers
jakevdp
796
220k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
The MySQL Ecosystem @ GitHub 2015
samlambert
250
12k
Fantastic passwords and where to find them - at NoRuKo
philnash
50
2.9k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
191
16k
GraphQLとの向き合い方2022年版
quramy
43
13k
Intergalactic Javascript Robots from Outer Space
tanoku
269
27k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
28
2k
[RailsConf 2023] Rails as a piece of cake
palkan
52
4.9k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com