Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
300
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
360
High Performance RPC with Finagle
samklr
1
200
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
810
Datageeks_27-05.pdf
samklr
0
57
Big data and Machine learning APIs
samklr
4
270
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
270
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
300
Other Decks in Technology
See All in Technology
新 Security HubがついにGA!仕組みや料金を深堀り #AWSreInvent #regrowth / AWS Security Hub Advanced GA
masahirokawahara
1
1.5k
エンジニアリングマネージャー はじめての目標設定と評価
halkt
0
260
会社紹介資料 / Sansan Company Profile
sansan33
PRO
11
390k
Haskell を武器にして挑む競技プログラミング ─ 操作的思考から意味モデル思考へ
naoya
6
1.1k
Ruby で作る大規模イベントネットワーク構築・運用支援システム TTDB
taketo1113
1
220
Noを伝える技術2025: 爆速合意形成のためのNICOフレームワーク速習 #pmconf2025
aki_iinuma
2
2.1k
手動から自動へ、そしてその先へ
moritamasami
0
290
エンジニアリングをやめたくないので問い続ける
estie
0
200
[デモです] NotebookLM で作ったスライドの例
kongmingstrap
0
110
AI時代の開発フローとともに気を付けたいこと
kkamegawa
0
2.5k
Lessons from Migrating to OpenSearch: Shard Design, Log Ingestion, and UI Decisions
sansantech
PRO
1
100
エンジニアとPMのドメイン知識の溝をなくす、 AIネイティブな開発プロセス
applism118
4
1.1k
Featured
See All Featured
[RailsConf 2023] Rails as a piece of cake
palkan
58
6.1k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
390
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
RailsConf 2023
tenderlove
30
1.3k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
34k
Agile that works and the tools we love
rasmusluckow
331
21k
GitHub's CSS Performance
jonrohan
1032
470k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Bootstrapping a Software Product
garrettdimon
PRO
307
120k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
970
How GitHub (no longer) Works
holman
316
140k
Visualization
eitanlees
150
16k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None