Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
320
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
370
High Performance RPC with Finagle
samklr
1
220
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
820
Datageeks_27-05.pdf
samklr
0
73
Big data and Machine learning APIs
samklr
4
280
Scalable Machine Learning
samklr
2
260
mesos.devoxx.2014
samklr
2
290
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
3k
Algebra for analytics
samklr
1
300
Other Decks in Technology
See All in Technology
開発チームとQAエンジニアの新しい協業モデル -年末調整開発チームで実践する【QAリード施策】-
kaomi_wombat
0
260
出版記念イベントin大阪「書籍紹介&私がよく使うMCPサーバー3選と社内で安全に活用する方法」
kintotechdev
0
110
Blue/Green Deployment を用いた PostgreSQL のメジャーバージョンアップ
kkato1
0
160
データマネジメント戦略Night - 4社のリアルを語る会
ktatsuya
1
430
スケールアップ企業でQA組織が機能し続けるための組織設計と仕組み〜ボトムアップとトップダウンを両輪としたアプローチ〜
qa
0
370
BFCacheを活用して無限スクロールのUX を改善した話
apple_yagi
0
130
MCPで決済に楽にする
mu7889yoon
0
160
「お金で解決」が全てではない!大規模WebアプリのCI高速化 #phperkaigi
stefafafan
5
2.4k
なぜarray_firstとarray_lastは採用、 array_value_firstとarray_value_lastは 見送りだったか / Why array_value_first and array_value_last was declined, then why array_first and array_last was accpeted?
cocoeyes02
0
230
互換性のある(らしい)DBへの移行など考えるにあたってたいへんざっくり
sejima
PRO
0
260
PostgreSQL 18のNOT ENFORCEDな制約とDEFERRABLEの関係
yahonda
0
140
TUNA Camp 2026 京都Stage ヒューリスティックアルゴリズム入門
terryu16
0
600
Featured
See All Featured
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.8k
DevOps and Value Stream Thinking: Enabling flow, efficiency and business value
helenjbeal
1
150
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
37
6.3k
The Illustrated Children's Guide to Kubernetes
chrisshort
51
52k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
The Cost Of JavaScript in 2023
addyosmani
55
9.8k
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.2k
The untapped power of vector embeddings
frankvandijk
2
1.6k
Noah Learner - AI + Me: how we built a GSC Bulk Export data pipeline
techseoconnect
PRO
0
150
Building AI with AI
inesmontani
PRO
1
820
Agile Leadership in an Agile Organization
kimpetersen
PRO
0
120
Heart Work Chapter 1 - Part 1
lfama
PRO
5
35k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None