Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
280
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
230
mesos.devoxx.2014
samklr
2
250
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
280
Other Decks in Technology
See All in Technology
自社製CMSからmicroCMSへのリプレースがプロダクトグロースを加速させた話
nextbeatdev
0
130
「守る」から「進化させる」セキュリティへ ~AWS re:Inforce 2025参加報告~ / AWS re:Inforce 2025 Participation Report
yuj1osm
1
110
新卒(ほぼ)専業Kagglerという選択肢
nocchi1
1
2.1k
認知戦の理解と、市民としての対抗策
hogehuga
0
310
あなたの知らない OneDrive
murachiakira
0
230
イオン店舗一覧ページのパフォーマンスチューニング事例 / Performance tuning example for AEON store list page
aeonpeople
2
260
広島銀行におけるAWS活用の取り組みについて
masakimori
0
130
Figma + Storybook + PlaywrightのMCPを使ったフロントエンド開発
yug1224
2
110
LLMエージェント時代に適応した開発フロー
hiragram
1
410
ソフトウェア エンジニアとしての 姿勢と心構え
recruitengineers
PRO
2
590
AIエージェントの開発に必須な「コンテキスト・エンジニアリング」とは何か──プロンプト・エンジニアリングとの違いを手がかりに考える
masayamoriofficial
0
360
Yahoo!ニュースにおけるソフトウェア開発
lycorptech_jp
PRO
0
330
Featured
See All Featured
Unsuck your backbone
ammeep
671
58k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
18
1.1k
A Tale of Four Properties
chriscoyier
160
23k
Building Better People: How to give real-time feedback that sticks.
wjessup
367
19k
Designing for humans not robots
tammielis
253
25k
Writing Fast Ruby
sferik
628
62k
Producing Creativity
orderedlist
PRO
347
40k
How to Think Like a Performance Engineer
csswizardry
25
1.8k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.6k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
33
2.4k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
131
19k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
358
30k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None