Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
230
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
ブロックテーマ時代における、テーマの CSS について考える Toro_Unit / 2025.09.13 @ Shinshu WordPress Meetup
torounit
0
130
人工衛星のファームウェアをRustで書く理由
koba789
15
8.3k
職種の壁を溶かして開発サイクルを高速に回す~情報透明性と職種越境から考えるAIフレンドリーな職種間連携~
daitasu
0
190
S3アクセス制御の設計ポイント
tommy0124
3
210
Codeful Serverless / 一人運用でもやり抜く力
_kensh
7
460
slog.Handlerのよくある実装ミス
sakiengineer
4
480
「Linux」という言葉が指すもの
sat
PRO
4
140
いま注目のAIエージェントを作ってみよう
supermarimobros
0
360
フルカイテン株式会社 エンジニア向け採用資料
fullkaiten
0
8.8k
Unlocking the Power of AI Agents with LINE Bot MCP Server
linedevth
0
120
RSCの時代にReactとフレームワークの境界を探る
uhyo
11
3.5k
Bedrock で検索エージェントを再現しようとした話
ny7760
2
110
Featured
See All Featured
Done Done
chrislema
185
16k
Designing for humans not robots
tammielis
253
25k
Mobile First: as difficult as doing things right
swwweet
224
9.9k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
18
1.1k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
Intergalactic Javascript Robots from Outer Space
tanoku
272
27k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.7k
BBQ
matthewcrist
89
9.8k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
A better future with KSS
kneath
239
17k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None