Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
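A minimal sketch of the idea behind predicate pushdown, assuming each chunk of a column carries min/max statistics that a reader can check before decoding the values. The `RowGroup` class and both helper functions are hypothetical illustrations, not part of any Parquet library.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of one column plus its statistics."""
    values: list
    min_value: int
    max_value: int


def make_row_groups(values, size):
    """Split a column into fixed-size chunks and record min/max for each."""
    groups = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return the matching values and how many chunks were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the statistics prove nothing here can match: skip it
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = make_row_groups(list(range(100)), size=10)
matches, groups_read = scan_greater_than(groups, threshold=84)
print(len(matches), "matches found by reading", groups_read, "of", len(groups), "chunks")
```

Here the filter only has to read the chunks whose statistics overlap the predicate; the others are skipped without looking at their values.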
Columnar Storage 101
Columnar Storage 101
Advantages:
- Limits I/O to only the data that is needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
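The advantages above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (made-up sample rows, a hand-written run-length encoder), not how Parquet itself lays out bytes.

```python
# Made-up sample data in row-oriented form: one dict per record.
rows = [
    {"user_id": 1, "country": "FR", "clicks": 3},
    {"user_id": 2, "country": "FR", "clicks": 5},
    {"user_id": 3, "country": "US", "clicks": 2},
    {"user_id": 4, "country": "US", "clicks": 7},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows)

# Projection: the aggregate touches only the 'clicks' column,
# not the user_id and country values stored alongside it.
total_clicks = sum(columns["clicks"])

# Values of one column sit together, so repeated values compress well.
country_rle = run_length_encode(columns["country"])

print(total_clicks, country_rle)
```

Because each column holds values of a single type stored contiguously, simple encodings like this one become effective, and a query reads only the columns it names.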
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null.
- Repetition level: stores the level at which a new list starts in the column values.
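A small sketch of how the two levels could be computed, assuming a hypothetical schema with an optional group `contact` containing an optional field `phone`, and a top-level repeated field `tags`. The function names and sample records are made up for illustration; real writers handle arbitrary nesting.

```python
def definition_level(record):
    """Definition level for column contact.phone (both optional, max level 2)."""
    contact = record.get("contact")
    if contact is None:
        return 0  # null at the 'contact' level
    if contact.get("phone") is None:
        return 1  # 'contact' is defined, 'phone' is null
    return 2      # fully defined: an actual value is stored


def shred_repeated(records, field):
    """Shred a top-level repeated field into (repetition, definition, value) triples."""
    triples = []
    for record in records:
        items = record.get(field) or []
        if not items:
            # Empty list: a placeholder entry keeps the record boundary visible.
            triples.append((0, 0, None))
            continue
        for i, item in enumerate(items):
            # Repetition 0 starts a new record; 1 continues the current list.
            triples.append((0 if i == 0 else 1, 1, item))
    return triples


records = [
    {"contact": {"phone": "555-1234"}, "tags": ["a", "b"]},
    {"contact": {}, "tags": []},
    {"tags": ["c"]},
]

phone_levels = [definition_level(r) for r in records]
tag_triples = shred_repeated(records, "tags")

print(phone_levels)
print(tag_triples)
```

With these levels stored next to the values, the nested records can be reassembled from flat columns without storing the nulls or list boundaries explicitly.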
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES