Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
270
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
49
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
Evolving Architecture
rainerhahnekamp
3
260
Formal Development of Operating Systems in Rust
riru
1
420
20250116_JAWS_Osaka
takuyay0ne
2
200
2025年に挑戦したいこと
molmolken
0
160
AIアプリケーション開発でAzure AI Searchを使いこなすためには
isidaitc
1
120
新卒1年目、はじめてのアプリケーションサーバー【IBM WebSphere Liberty】
ktgrryt
0
130
GoogleのAIエージェント論 Authors: Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic
customercloud
PRO
0
160
商品レコメンドでのexplicit negative feedbackの活用
alpicola
2
370
2025年の挑戦 コーポレートエンジニアの技術広報/techpr5
nishiuma
0
150
2024AWSで個人的にアツかったアップデート
nagisa53
1
110
深層学習と3Dキャプチャ・3Dモデル生成(土木学会応用力学委員会 応用数理・AIセミナー)
pfn
PRO
0
460
AWSサービスアップデート 2024/12 Part3
nrinetcom
PRO
0
140
Featured
See All Featured
Building Your Own Lightsaber
phodgson
104
6.2k
The Power of CSS Pseudo Elements
geoffreycrofte
74
5.4k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
3
240
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
7
570
Making the Leap to Tech Lead
cromwellryan
133
9k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
232
17k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
Stop Working from a Prison Cell
hatefulcrawdad
267
20k
The Pragmatic Product Professional
lauravandoore
32
6.4k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
Speed Design
sergeychernyshev
25
740
KATA
mclloyd
29
14k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None