Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
190
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
800
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
270
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
大規模プロダクトで実践するAI活用の仕組みづくり
k1tikurisu
4
1.6k
AI エージェントを評価するための温故知新と Spec Driven Evaluation
icoxfog417
PRO
1
350
"おまじない"はもう卒業! デバッガで探るSpring Bootの裏側と「学び方」の学び方
takeuchi_132917
0
190
LINEヤフー バックエンド組織・体制の紹介
lycorptech_jp
PRO
0
810
Lazy Constant - finalフィールドの遅延初期化
skrb
0
240
なぜインフラコードのモジュール化は難しいのか - アプリケーションコードとの本質的な違いから考える
mizzy
59
20k
旧から新へ: 大規模ウェブクローラの Perl から Go への移行 / YAPC::Fukuoka 2025
motemen
3
1.1k
Spring Boot利用を前提としたJavaライブラリ開発方法の提案
kokihoshihara
PRO
2
240
ABEJA FIRST GUIDE for Software Engineers
abeja
0
3.2k
【M3】攻めのセキュリティの実践!プロアクティブなセキュリティ対策の実践事例
axelmizu
0
170
Bedrock のコスト監視設計
fohte
2
180
身近なCSVを活用する!AWSのデータ分析基盤アーキテクチャ
koosun
0
1.9k
Featured
See All Featured
Measuring & Analyzing Core Web Vitals
bluesmoon
9
670
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.2k
Rebuilding a faster, lazier Slack
samanthasiow
84
9.3k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Fireside Chat
paigeccino
41
3.7k
Making the Leap to Tech Lead
cromwellryan
135
9.6k
A Modern Web Designer's Workflow
chriscoyier
697
190k
How STYLIGHT went responsive
nonsquared
100
5.9k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
36
6.1k
Documentation Writing (for coders)
carmenintech
76
5.1k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
37
2.6k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None