Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
270
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
170
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
51
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
220
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
280
Other Decks in Technology
See All in Technology
名刺メーカーDevグループ 紹介資料
sansan33
PRO
0
740
カンファレンスのつくりかた / The Conference Code: What Makes It All Work
tomzoh
7
870
【5分でわかる】セーフィー エンジニア向け会社紹介
safie_recruit
0
24k
Cloud Run を解剖して コンテナ監視を考える / Breaking Down Cloud Run to Rethink Container Monitoring
aoto
PRO
0
110
Azure Developer CLI と Azure Deployment Environment / Azure Developer CLI and Azure Deployment Environment
nnstt1
1
110
Oracle Database オプティマイザ・ヒントの活用
oracle4engineer
PRO
1
130
金融システムをモダナイズするためのAmazon Elastic Kubernetes Service(EKS)ノウハウ大全
daitak
0
120
Oracle Cloud Infrastructure:2025年5月度サービス・アップデート
oracle4engineer
PRO
0
320
Streamline Cloud-Native App Development Using CDEs
saeedzf
0
660
TypeScript と歩む OpenAPI の discriminator / OpenAPI discriminator with TypeScript
kaminashi
1
120
令和トラベルQAのAI活用
seigaitakahiro
0
470
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
8
65k
Featured
See All Featured
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.5k
How STYLIGHT went responsive
nonsquared
100
5.6k
Building a Modern Day E-commerce SEO Strategy
aleyda
40
7.3k
The Cult of Friendly URLs
andyhume
78
6.4k
Building Adaptive Systems
keathley
41
2.6k
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
Build The Right Thing And Hit Your Dates
maggiecrowley
35
2.7k
The Language of Interfaces
destraynor
158
25k
Building Better People: How to give real-time feedback that sticks.
wjessup
368
19k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Raft: Consensus for Rubyists
vanstee
137
7k
Agile that works and the tools we love
rasmusluckow
329
21k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None