Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Sam Bessalah
April 06, 2016
Technology
0
300
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
360
High Performance RPC with Finagle
samklr
1
210
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
810
Datageeks_27-05.pdf
samklr
0
63
Big data and Machine learning APIs
samklr
4
280
Scalable Machine Learning
samklr
2
250
mesos.devoxx.2014
samklr
2
280
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
300
Other Decks in Technology
See All in Technology
Databricks Free Edition講座 データサイエンス編
taka_aki
0
230
Amazon ElastiCacheのコスト最適化を考える/Elasticache Cost Optimization
quiver
0
220
サイボウズ 開発本部採用ピッチ / Cybozu Engineer Recruit
cybozuinsideout
PRO
10
72k
フロントエンド開発者のための「厄払い」
optim
0
170
「AIでできますか?」から「Agentを作ってみました」へ ~「理論上わかる」と「やってみる」の隔たりを埋める方法
applism118
9
6.5k
AWS Devops Agent ~ 自動調査とSlack統合をやってみた! ~
kubomasataka
2
250
漸進的過負荷の原則
sansantech
PRO
3
410
それぞれのペースでやっていく Bet AI / Bet AI at Your Own Pace
yuyatakeyama
1
650
VRTと真面目に向き合う
hiragram
1
490
かわいい身体と声を持つ そういうものに私はなりたい
yoshimura_datam
0
550
Zephyr RTOS の発表をOpen Source Summit Japan 2025で行った件
iotengineer22
0
280
Exadata Database Service ソフトウェアのアップデートとアップグレードの概要
oracle4engineer
PRO
1
1.2k
Featured
See All Featured
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.4k
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
66
36k
What's in a price? How to price your products and services
michaelherold
247
13k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
From π to Pie charts
rasagy
0
120
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.5k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Between Models and Reality
mayunak
1
170
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
0
300
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
150
Designing for Performance
lara
610
70k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None