Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
1.9k
Interface Design for Spark Community
rxin
12
1.3k
Spark Committer Night meetup @ NYC
rxin
1
110
Apache Spark: Unified Platform for Big Data
rxin
1
210
Advanced Spark @ Spark Summit 2014
rxin
4
300
Apache Spark: Easier and Faster Big Data
rxin
2
270
GraphX at Spark User Meetup
rxin
0
130
Shark SIGMOD research deck
rxin
2
480
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
690
Other Decks in Research
See All in Research
研究を支える拡張性の高い ワークフローツールの提案 / Proposal of highly expandable workflow tools to support research
linyows
0
250
医療支援AI開発における臨床と情報学の連携を円滑に進めるために
moda0
0
140
書き手はどこを訪れたか? - 言語モデルで訪問行動を読み取る -
hiroki13
0
110
地理空間情報と自然言語処理:「地球の歩き方旅行記データセット」の高付加価値化を通じて
hiroki13
1
160
20240918 交通くまもとーく 未来の鉄道網編(太田恒平)
trafficbrain
0
440
eAI (Engineerable AI) プロジェクトの全体像 / Overview of eAI Project
ishikawafyu
0
190
LLM 시대의 Compliance: Safety & Security
huffon
0
490
20240918 交通くまもとーく 未来の鉄道網編(こねくま)
trafficbrain
0
400
Human-Informed Machine Learning Models and Interactions
hiromu1996
2
550
Weekly AI Agents News! 11月号 プロダクト/ニュースのアーカイブ
masatoto
0
260
文書画像のデータ化における VLM活用 / Use of VLM in document image data conversion
sansan_randd
2
420
ダイナミックプライシング とその実例
skmr2348
3
530
Featured
See All Featured
The Art of Programming - Codeland 2020
erikaheidi
53
13k
Fashionably flexible responsive web design (full day workshop)
malarkey
406
66k
Site-Speed That Sticks
csswizardry
3
270
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2k
How to Ace a Technical Interview
jacobian
276
23k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
192
16k
Why You Should Never Use an ORM
jnunemaker
PRO
54
9.1k
What's in a price? How to price your products and services
michaelherold
244
12k
Stop Working from a Prison Cell
hatefulcrawdad
267
20k
The Power of CSS Pseudo Elements
geoffreycrofte
74
5.4k
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com