Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
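To make "predicate and projection pushdown" concrete, here is a minimal sketch in plain Python. It is a toy model, not Parquet's API: the `row_groups` structure, the `scan` function, and the sample data are all assumptions made for illustration. The idea is that per-column min/max statistics let a reader skip whole row groups (predicate pushdown), and the columnar layout lets it load only the requested columns (projection pushdown).

```python
# Toy model (hypothetical): a file split into row groups, each keeping
# min/max statistics for a column next to the column data itself.
row_groups = [
    {
        "stats": {"year": (2012, 2013)},
        "columns": {"year": [2012, 2013, 2013], "url": ["a", "b", "c"], "bytes": [10, 20, 30]},
    },
    {
        "stats": {"year": (2014, 2015)},
        "columns": {"year": [2014, 2015, 2015], "url": ["d", "e", "f"], "bytes": [40, 50, 60]},
    },
]


def scan(row_groups, projection, column, lower_bound):
    """Return projected values for rows where `column` >= `lower_bound`.

    Also returns how many row groups were actually read.
    """
    results = []
    groups_read = 0
    for group in row_groups:
        _, max_value = group["stats"][column]
        if max_value < lower_bound:
            # Predicate pushdown: the statistics prove no row can match,
            # so the whole row group is skipped without reading its data.
            continue
        groups_read += 1
        filter_values = group["columns"][column]
        # Projection pushdown: only the requested columns are touched.
        projected = [group["columns"][name] for name in projection]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                results.append(tuple(col[i] for col in projected))
    return results, groups_read


results, groups_read = scan(row_groups, ["url"], "year", 2015)
print(results, groups_read)
```

In this sketch the first row group is never read, and the `bytes` column is never touched in either group.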
Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression and faster, low-overhead encodings
- Enables vectorized execution engines
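The first two advantages can be illustrated with a small sketch in plain Python. The table, the column names, and the helper functions are hypothetical and only meant to show the principle: once values are grouped by column, a query reads just the columns it needs, and similar neighbouring values compress well with a simple encoding such as run-length encoding.

```python
from itertools import groupby

# Hypothetical toy table stored row by row.
rows = [
    {"user": "a", "country": "FR", "clicks": 3},
    {"user": "b", "country": "FR", "clicks": 5},
    {"user": "c", "country": "FR", "clicks": 2},
    {"user": "d", "country": "US", "clicks": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Limited I/O: summing clicks only touches the "clicks" column.
total_clicks = sum(columns["clicks"])

# Better compression: repeated values sit next to each other in a column.
encoded_country = run_length_encode(columns["country"])

print(total_clicks, encoded_country)
```

In a row layout the same sum would have to read every field of every row, and the repeated country values would be interleaved with unrelated data.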
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null
- Repetition level: stores the level at which a new list starts in the column values
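A worked example may help here. The sketch below assumes a hypothetical address-book schema (it is not necessarily the schema shown on the slides): `contacts` is a repeated group and `phoneNumber` is an optional field inside it. For the column `contacts.phoneNumber` the maximum definition level is then 2 and the maximum repetition level is 1. The function name and the sample records are illustrative only.

```python
# Assumed schema (hypothetical, for illustration):
#   message AddressBook {
#     required string owner;
#     repeated group contacts {
#       required string name;
#       optional string phoneNumber;
#     }
#   }


def shred_contacts_phone(records):
    """Return (repetition, definition, value) triples for contacts.phoneNumber."""
    triples = []
    for record in records:
        contacts = record.get("contacts", [])
        if not contacts:
            # No contacts at all: nothing on the path is defined (level 0),
            # and the entry starts a new record (repetition 0).
            triples.append((0, 0, None))
            continue
        for index, contact in enumerate(contacts):
            # Repetition 0 starts a new record; 1 continues the contacts list.
            repetition = 0 if index == 0 else 1
            phone = contact.get("phoneNumber")
            if phone is None:
                # contacts is defined, phoneNumber is not: level 1.
                triples.append((repetition, 1, None))
            else:
                # The full path is defined: level 2, and the value is stored.
                triples.append((repetition, 2, phone))
    return triples


records = [
    {
        "owner": "Julien Le Dem",
        "contacts": [
            {"name": "Dmitriy Ryaboy", "phoneNumber": "555 987 6543"},
            {"name": "Chris Aniszczyk"},
        ],
    },
    {"owner": "A. Nonymous"},
]

triples = shred_contacts_phone(records)
print(triples)
```

Only the non-null value needs to be stored in the column; the two small integers per entry are enough to rebuild the nesting and the nulls when records are reassembled.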
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES