Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
260
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
330
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
47
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
230
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.7k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
障害対応指揮の意思決定と情報共有における価値観 / Waroom Meetup #2
arthur1
5
490
10XにおけるData Contractの導入について: Data Contract事例共有会
10xinc
6
660
VideoMamba: State Space Model for Efficient Video Understanding
chou500
0
190
CysharpのOSS群から見るModern C#の現在地
neuecc
2
3.5k
AWS Lambdaと歩んだ“サーバーレス”と今後 #lambda_10years
yoshidashingo
1
180
テストコード品質を高めるためにMutation Testingライブラリ・Strykerを実戦導入してみた話
ysknsid25
7
2.7k
AWS Lambda のトラブルシュートをしていて思うこと
kazzpapa3
2
180
Lambdaと地方とコミュニティ
miu_crescent
2
370
RubyのWebアプリケーションを50倍速くする方法 / How to Make a Ruby Web Application 50 Times Faster
hogelog
3
950
Why App Signing Matters for Your Android Apps - Android Bangkok Conference 2024
akexorcist
0
130
Application Development WG Intro at AppDeveloperCon
salaboy
0
190
データプロダクトの定義からはじめる、データコントラクト駆動なデータ基盤
chanyou0311
2
330
Featured
See All Featured
Typedesign – Prime Four
hannesfritz
40
2.4k
How STYLIGHT went responsive
nonsquared
95
5.2k
Site-Speed That Sticks
csswizardry
0
28
Into the Great Unknown - MozCon
thekraken
32
1.5k
GraphQLとの向き合い方2022年版
quramy
43
13k
Build The Right Thing And Hit Your Dates
maggiecrowley
33
2.4k
5 minutes of I Can Smell Your CMS
philhawksworth
202
19k
How to Ace a Technical Interview
jacobian
276
23k
GraphQLの誤解/rethinking-graphql
sonatard
67
10k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
1.9k
Optimizing for Happiness
mojombo
376
70k
Bash Introduction
62gerente
608
210k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None