Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Parquet: a binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
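The predicate and projection pushdown mentioned above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Parquet's actual API: `RowGroup`, `scan`, and the sample columns are hypothetical names invented for illustration. It assumes each chunk of rows keeps min/max statistics per column, so a scan can skip whole chunks that cannot match a filter (predicate pushdown) and touch only the requested columns (projection pushdown).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of rows stored column by column."""
    columns: dict  # column name -> list of values

    def stats(self, name):
        # A real file would store these statistics in metadata;
        # here they are computed on the fly to keep the sketch short.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, wanted, column, lower):
    """Return rows restricted to `wanted` columns where row[column] >= lower,
    plus the number of row groups that actually had to be read."""
    rows = []
    groups_read = 0
    for group in row_groups:
        _, highest = group.stats(column)
        if highest < lower:
            continue  # predicate pushdown: no row in this group can match
        groups_read += 1
        selected = [group.columns[name] for name in wanted]  # projection pushdown
        for i, value in enumerate(group.columns[column]):
            if value >= lower:
                rows.append(tuple(col[i] for col in selected))
    return rows, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "payload": ["x", "y", "z"]}),
    RowGroup({"ts": [10, 11, 12], "user": ["d", "e", "f"], "payload": ["u", "v", "w"]}),
]

result, groups_read = scan(groups, wanted=["user"], column="ts", lower=11)
print(result, groups_read)
```

In this sketch the first group is skipped from its statistics alone, and the `payload` column is never read.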
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression and faster, low-overhead encodings
- Enables vectorized execution engines
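To make the compression point concrete, here is a small sketch, assuming a made-up table of ad events (the column names and values are hypothetical). It pivots rows into columns and run-length encodes one column. Values of the same type and similar content sit next to each other in a column, so simple encodings such as run-length encoding become effective. This illustrates the idea only, not Parquet's actual encoders.

```python
from itertools import groupby

# Hypothetical row-oriented data: (id, country, event)
rows = [
    (1, "FR", "click"),
    (2, "FR", "view"),
    (3, "FR", "view"),
    (4, "US", "view"),
    (5, "US", "click"),
]


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows, ["id", "country", "event"])
encoded_country = rle_encode(columns["country"])
print(encoded_country)  # [('FR', 3), ('US', 2)]
```

A query that only needs `country` reads `columns["country"]` and nothing else, which is the I/O saving listed above.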
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores the level at which the field is null
- Repetition level: stores the level at which a new list starts in the column values
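The two levels can be sketched for one small case. This is a minimal illustration, assuming a hypothetical schema with a single leaf column nested in two repeated fields (a list of lists of strings); the function names `shred` and `assemble` are invented here and are not a Parquet API. With this schema the maximum definition level is 2 and the repetition level is 0 for a new record, 1 for a new inner list, and 2 for another value in the same inner list.

```python
def shred(records):
    """Flatten list-of-lists records into (repetition, definition, value) triples."""
    triples = []
    for record in records:
        if not record:
            triples.append((0, 0, None))  # outer list empty: nothing defined
            continue
        for i, inner in enumerate(record):
            r_outer = 0 if i == 0 else 1  # 0 starts a record, 1 starts a new inner list
            if not inner:
                triples.append((r_outer, 1, None))  # inner list exists but is empty
                continue
            for j, value in enumerate(inner):
                r = r_outer if j == 0 else 2  # 2 continues the current inner list
                triples.append((r, 2, value))
    return triples


def assemble(triples):
    """Rebuild the nested records from the triples produced by shred()."""
    records = []
    for r, d, value in triples:
        if r == 0:
            records.append([])      # a new record starts
        if d == 0:
            continue                # empty record, nothing more to add
        if r <= 1:
            records[-1].append([])  # a new inner list starts
        if d == 2:
            records[-1][-1].append(value)
    return records


records = [
    [["a", "b", "c"], ["d", "e", "f", "g"]],
    [["h"], ["i", "j"]],
]
triples = shred(records)
print([r for r, _, _ in triples])  # [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
```

Because the levels alone mark where records and lists begin and where values are missing, the column can be stored as flat arrays and still be reassembled into the nested structure.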
Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES