Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
280
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
230
mesos.devoxx.2014
samklr
2
250
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
280
Other Decks in Technology
See All in Technology
AI時代の経営、Bet AI Vision #BetAIDay
layerx
PRO
0
360
増え続ける脆弱性に立ち向かう: 事前対策と優先度づけによる 持続可能な脆弱性管理 / Confronting the Rise of Vulnerabilities: Sustainable Management Through Proactive Measures and Prioritization
nttcom
1
230
完璧を目指さない小さく始める信頼性向上
kakehashi
PRO
0
130
LLM開発を支えるエヌビディアの生成AIエコシステム
acceleratedmu3n
0
350
バクラクによるコーポレート業務の自動運転 #BetAIDay
layerx
PRO
0
240
AIエージェントを支える設計
tkikuchi1002
12
2.6k
メモ整理が苦手な者による頑張らないObsidian活用術
optim
1
160
MCPと認可まわりの話 / mcp_and_authorization
convto
2
340
LLMをツールからプラットフォームへ〜Ai Workforceの戦略〜 #BetAIDay
layerx
PRO
0
200
FAST導入1年間のふりかえり〜現実を直視し、さらなる進化を求めて〜 / Review of the first year of FAST implementation
wooootack
1
220
Power Automate のパフォーマンス改善レシピ / Power Automate Performance Improvement Recipes
karamem0
0
280
【CEDEC2025】『Shadowverse: Worlds Beyond』二度目のDCG開発でゲームをリデザインする~遊びやすさと競技性の両立~
cygames
PRO
1
160
Featured
See All Featured
Designing for humans not robots
tammielis
253
25k
It's Worth the Effort
3n
185
28k
Automating Front-end Workflow
addyosmani
1370
200k
Building Applications with DynamoDB
mza
95
6.5k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
18
1k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
283
13k
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
50k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
35
2.5k
The Art of Programming - Codeland 2020
erikaheidi
54
13k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
Visualization
eitanlees
146
16k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None