Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
160
1
Share
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
140
Apache Spark: Unified Platform for Big Data
rxin
1
250
Advanced Spark @ Spark Summit 2014
rxin
4
360
Apache Spark: Easier and Faster Big Data
rxin
2
310
GraphX at Spark User Meetup
rxin
0
170
Shark SIGMOD research deck
rxin
2
560
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
720
Other Decks in Research
See All in Research
社内データ分析AIエージェントを できるだけ使いやすくする工夫
fufufukakaka
1
1.1k
セマンティック通信勉強会 6Gに向けたデバイス間効率的な通信の技術紹介・課題・今後展望
satai
2
140
Research Engineerという仕事 / Research Engineering: Bridging Research and Business
chck
1
180
LLMアプリケーションの透明性について
fufufukakaka
0
230
討議:RACDA設立30周年記念都市交通フォーラム2026
trafficbrain
0
930
Scalable dynamic origin-destination demand estimation enhanced by high-resolution satellite imagery data
satai
2
240
[BlackHatAsia2026] Hidden Telemetry: Uncovering TraceLogging ETW Providers You're Not Using (Yet)
asuna_jp
1
490
ペットのかわいい瞬間を撮影する オートシャッターAIアプリへの スマートラベリングの適用
mssmkmr
0
510
存立危機事態の再検討
jimboken
0
290
Dual Quadric表現を用いた動的物体追跡とRGB-D・IMU制約の密結合によるオドメトリ推定
nanoshimarobot
0
400
IEEE AIxVR 2026 Keynote Talk: "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"
miso2024
0
190
老舗ものづくり企業でリサーチが変革を起こすまで - 三菱重工DXの実践
skydats
0
170
Featured
See All Featured
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
350
For a Future-Friendly Web
brad_frost
183
10k
Git: the NoSQL Database
bkeepers
PRO
432
67k
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
4
2.8k
The Mindset for Success: Future Career Progression
greggifford
PRO
0
350
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
130
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
2
1.5k
Leveraging LLMs for student feedback in introductory data science courses - posit::conf(2025)
minecr
1
270
Building Flexible Design Systems
yeseniaperezcruz
330
40k
GraphQLの誤解/rethinking-graphql
sonatard
75
12k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
600
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com