Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
1.9k
Interface Design for Spark Community
rxin
12
1.3k
Spark Committer Night meetup @ NYC
rxin
1
110
Apache Spark: Unified Platform for Big Data
rxin
1
210
Advanced Spark @ Spark Summit 2014
rxin
4
300
Apache Spark: Easier and Faster Big Data
rxin
2
270
GraphX at Spark User Meetup
rxin
0
130
Shark SIGMOD research deck
rxin
2
470
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
690
Other Decks in Research
See All in Research
ECCV2024読み会: Minimalist Vision with Freeform Pixels
hsmtta
1
300
ミニ四駆AI用制御装置の事例紹介
aks3g
0
180
アプリケーションから知るモデルマージ
maguro27
0
180
メールからの名刺情報抽出におけるLLM活用 / Use of LLM in extracting business card information from e-mails
sansan_randd
2
270
The Fellowship of Trust in AI
tomzimmermann
0
150
熊本から日本の都市交通政策を立て直す~「車1割削減、渋滞半減、公共交通2倍」の実現へ~@公共交通マーケティング研究会リスタートセミナー
trafficbrain
0
180
Weekly AI Agents News! 9月号 プロダクト/ニュースのアーカイブ
masatoto
2
170
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
satai
2
130
Weekly AI Agents News! 8月号 論文のアーカイブ
masatoto
1
220
リモートワークにおけるパッシブ疲労
matsumoto_r
PRO
4
2.7k
The Relevance of UX for Conversion and Monetisation
itasohaakhib1
0
120
Introducing Research Units of Matsuo-Iwasawa Laboratory
matsuolab
0
1.3k
Featured
See All Featured
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
95
17k
Side Projects
sachag
452
42k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
8
1.2k
Visualization
eitanlees
146
15k
Optimizing for Happiness
mojombo
376
70k
BBQ
matthewcrist
85
9.4k
Making the Leap to Tech Lead
cromwellryan
133
9k
Being A Developer After 40
akosma
87
590k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
656
59k
Writing Fast Ruby
sferik
628
61k
Building Better People: How to give real-time feedback that sticks.
wjessup
365
19k
Build The Right Thing And Hit Your Dates
maggiecrowley
33
2.4k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com