Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley
with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning; OLAP; strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
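Spark's published answer to this question is lineage: each in-memory dataset remembers how it was derived, so a lost partition can be recomputed instead of being replicated up front. The sketch below is a toy, single-process illustration of that idea in plain Python. `ToyDataset` and its methods (`map`, `partition`, `evict`, `collect`) are hypothetical names invented for this example; they are not Spark's API, and real systems add scheduling, shuffles, and much more.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """A partitioned dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyDataset"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source  # stands in for durable input storage
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: per-element function applied to the parent
        self.num_partitions = (
            len(source) if source is not None else parent.num_partitions
        )
        # In-memory copies of each partition; None means "not in memory".
        self.cache: List[Optional[List[int]]] = [None] * self.num_partitions
        self.computations = 0  # how many partition builds were needed

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Nothing is computed here; only the lineage is recorded.
        return ToyDataset(parent=self, fn=fn)

    def partition(self, i: int) -> List[int]:
        if self.cache[i] is None:
            if self.source is not None:
                data = list(self.source[i])
            else:
                data = [self.fn(x) for x in self.parent.partition(i)]
            self.cache[i] = data
            self.computations += 1
        return self.cache[i]

    def evict(self, i: int) -> None:
        # Simulate losing the machine that held partition i.
        self.cache[i] = None

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyDataset(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

first = squared.collect()   # builds and caches both partitions
squared.evict(1)            # pretend partition 1 was lost
second = squared.collect()  # only partition 1 is rebuilt from its lineage
```

After the simulated failure, `second` equals `first`, and only the lost partition is rebuilt; the surviving partition is served from memory.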
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
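To make the "in-memory columnar store" idea concrete, here is a small sketch contrasting a row layout with a column layout in plain Python. The table contents, the column names `user_id` and `amount`, and the helper `to_columns` are made up for illustration. This is not Shark's storage format, only the general technique of storing each column contiguously in a compact typed array so that a query touching a few columns scans only those arrays.

```python
from array import array
from typing import Dict, List, Tuple

# Row layout: one Python tuple (user_id, amount) per record.
rows: List[Tuple[int, float]] = [(1, 9.5), (2, 20.0), (1, 0.5), (3, 7.0)]


def to_columns(records: List[Tuple[int, float]]) -> Dict[str, array]:
    """Pivot row tuples into one typed, contiguous array per column."""
    return {
        "user_id": array("q", (r[0] for r in records)),  # signed 64-bit ints
        "amount": array("d", (r[1] for r in records)),   # 64-bit floats
    }


columns = to_columns(rows)

# Roughly: SELECT SUM(amount) FROM t
# Only the "amount" array is scanned.
grand_total = sum(columns["amount"])

# Roughly: SELECT SUM(amount) FROM t WHERE user_id = 1
# Two arrays are scanned in lockstep; no per-row objects are created.
total_for_user_1 = sum(
    a for u, a in zip(columns["user_id"], columns["amount"]) if u == 1
)
```

The typed arrays avoid one object per field, which is one reason columnar layouts tend to be more memory-efficient than row-of-objects layouts.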
[Figure: bar chart of runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Bar labels as extracted: 110, 94, 64 and 0.96, 1, 0.7.]
iterative machine learning; SQL query processing; graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com