Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
260
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
330
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
47
Big data and Machine learning APIs
samklr
4
240
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
230
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.7k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
いまさらのStorybook
ikumatadokoro
0
150
新卒1年目が向き合う生成AI事業の開発を加速させる技術選定 / ai-web-launcher
cyberagentdevelopers
PRO
7
1.5k
10分でわかるfreeeのQA
freee
1
3.4k
MAMを軸とした動画ハンドリングにおけるAI活用前提の整備と次世代ビジョン / abema-ai-mam
cyberagentdevelopers
PRO
1
120
Gradle: The Build System That Loves To Hate You
aurimas
2
150
IaC運用を楽にするためにCDK Pipelinesを導入したけど、思い通りにいかなかった話
smt7174
1
120
急成長中のWINTICKETにおける品質と開発スピードと向き合ったQA戦略と今後の展望 / winticket-autify
cyberagentdevelopers
PRO
1
160
マネジメント視点でのre:Invent参加 ~もしCEOがre:Inventに行ったら~
kojiasai
0
470
pandasはPolarsに性能面で追いつき追い越せるのか
vaaaaanquish
4
4.7k
Forget efficiency – Become more productive without the stress
ufried
0
150
チームを主語にしてみる / Making "Team" the Subject
ar_tama
4
310
【若手エンジニア応援LT会】AWS Security Hubの活用に苦労した話
kazushi_ohata
0
170
Featured
See All Featured
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
RailsConf 2023
tenderlove
29
880
Building Better People: How to give real-time feedback that sticks.
wjessup
363
19k
Why Our Code Smells
bkeepers
PRO
334
57k
Side Projects
sachag
452
42k
Code Review Best Practice
trishagee
64
17k
The Art of Programming - Codeland 2020
erikaheidi
51
13k
Adopting Sorbet at Scale
ufuk
73
9k
The World Runs on Bad Software
bkeepers
PRO
65
11k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
226
22k
GraphQLの誤解/rethinking-graphql
sonatard
66
10k
Designing Experiences People Love
moore
138
23k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None