Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
280
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
170
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
790
Datageeks_27-05.pdf
samklr
0
52
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
230
mesos.devoxx.2014
samklr
2
250
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
280
Other Decks in Technology
See All in Technology
Uniadex__公開版_20250617-AIxIoTビジネス共創ラボ_ツナガルチカラ_.pdf
iotcomjpadmin
0
160
急成長を支える基盤作り〜地道な改善からコツコツと〜 #cre_meetup
stefafafan
0
110
Javaで作る RAGを活用した Q&Aアプリケーション
recruitengineers
PRO
1
100
A2Aのクライアントを自作する
rynsuke
1
170
強化されたAmazon Location Serviceによる新機能と開発者体験
dayjournal
2
200
Oracle Audit Vault and Database Firewall 20 概要
oracle4engineer
PRO
3
1.7k
_第3回__AIxIoTビジネス共創ラボ紹介資料_20250617.pdf
iotcomjpadmin
0
150
BigQuery Remote FunctionでLooker Studioをインタラクティブ化
cuebic9bic
3
260
AWS Summit Japan 2025 Community Stage - App workflow automation by AWS Step Functions
matsuihidetoshi
1
220
CI/CD/IaC 久々に0から環境を作ったらこうなりました
kaz29
1
150
Navigation3でViewModelにデータを渡す方法
mikanichinose
0
220
[TechNight #90-1] 本当に使える?ZDMの新機能を実践検証してみた
oracle4engineer
PRO
3
160
Featured
See All Featured
Optimising Largest Contentful Paint
csswizardry
37
3.3k
Building Applications with DynamoDB
mza
95
6.5k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
33
5.9k
Docker and Python
trallard
44
3.4k
Producing Creativity
orderedlist
PRO
346
40k
Git: the NoSQL Database
bkeepers
PRO
430
65k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
181
53k
The Cost Of JavaScript in 2023
addyosmani
51
8.4k
Product Roadmaps are Hard
iamctodd
PRO
53
11k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
31
1.2k
Facilitating Awesome Meetings
lara
54
6.4k
Optimizing for Happiness
mojombo
379
70k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None