Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
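To make predicate and projection pushdown concrete, here is a toy sketch in plain Python. It is not Parquet's API: `RowGroup`, `scan`, and the sample columns are hypothetical names invented for illustration. It assumes each chunk of rows keeps per-column min/max statistics computed when the data is written, so a reader can skip chunks that cannot match a filter and materialize only the requested columns.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of rows stored column by column."""
    columns: dict                      # column name -> list of values
    stats: dict = field(init=False)    # column name -> (min, max)

    def __post_init__(self):
        # Statistics are computed once, when the group is "written".
        self.stats = {name: (min(vals), max(vals))
                      for name, vals in self.columns.items()}


def scan(row_groups, select, where_column, lower_bound):
    """Return the selected columns for rows where where_column >= lower_bound.

    Also returns how many row groups were skipped using statistics alone.
    """
    rows, skipped = [], 0
    for group in row_groups:
        _, max_value = group.stats[where_column]
        if max_value < lower_bound:
            # Predicate pushdown: no row in this group can match.
            skipped += 1
            continue
        matches = [i for i, v in enumerate(group.columns[where_column])
                   if v >= lower_bound]
        # Projection pushdown: only the columns in `select` are read.
        rows.extend(tuple(group.columns[c][i] for c in select)
                    for i in matches)
    return rows, skipped


groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "payload": ["x", "y", "z"]}),
    RowGroup({"ts": [10, 11, 12], "user": ["d", "e", "f"], "payload": ["p", "q", "r"]}),
]
result, skipped_groups = scan(groups, select=["user"], where_column="ts", lower_bound=11)
```

In this sketch the first group is skipped without touching its values, and the `payload` column is never read at all.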
Columnar Storage 101
Advantages of columnar storage:
- Limits I/O to only the data needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
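The advantages above can be sketched with a few lines of plain Python. This is an illustration only, not how Parquet lays out bytes on disk: the sample rows and the helpers `to_columns` and `run_length_encode` are made up for the example. It shows that an aggregate over one column touches only that column, and that similar values stored side by side compress well with a simple run-length encoding.

```python
from itertools import groupby

# Row-oriented data: every record carries every field.
rows = [
    {"user": "a", "country": "FR", "clicks": 3},
    {"user": "b", "country": "FR", "clicks": 5},
    {"user": "c", "country": "FR", "clicks": 1},
    {"user": "d", "country": "US", "clicks": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Only the "clicks" column is read to answer this query.
total_clicks = sum(columns["clicks"])

# Values of one column sit next to each other, so runs are common.
country_rle = run_length_encode(columns["country"])
```

Here the four `country` values shrink to two (value, count) pairs, while the same values interleaved with other fields in a row layout would offer no such runs.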
Parquet Model
Example Parquet Schema
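Definition and Repetition Levels
- Definition level: records how many optional or repeated fields along the column's path are actually defined, which pins down the level at which a value is null.
- Repetition level: records at which repeated field in the path a value starts a new list; 0 marks the start of a new record.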
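A minimal sketch of the two levels for the simplest possible case, assuming a schema with a single top-level repeated field (so the maximum definition level and the maximum repetition level are both 1). The function names `shred_repeated` and `assemble_repeated` are hypothetical, and real Parquet writers handle arbitrary nesting; this only shows how the levels let a flat column be turned back into records.

```python
def shred_repeated(records):
    """Flatten a list of lists into (repetition, definition, value) triples.

    Repetition 0 starts a new record; repetition 1 continues the current list.
    Definition 0 means the list is empty (no value); definition 1 means a value
    is present.
    """
    triples = []
    for values in records:
        if not values:
            # Empty list: keep a placeholder so the record is not lost.
            triples.append((0, 0, None))
            continue
        for index, value in enumerate(values):
            repetition = 0 if index == 0 else 1
            triples.append((repetition, 1, value))
    return triples


def assemble_repeated(triples):
    """Rebuild the original list of lists from the triples."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            records.append([])          # a new record begins
        if definition == 1:
            records[-1].append(value)   # the value is actually defined
    return records


records = [["a", "b"], [], ["c"]]
triples = shred_repeated(records)
roundtrip = assemble_repeated(triples)
```

The empty list in the middle survives the round trip only because its placeholder carries a definition level of 0.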
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project