Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
260
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
330
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
47
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
新機能VPCリソースエンドポイント機能検証から得られた考察
duelist2020jp
0
230
Amazon VPC Lattice 最新アップデート紹介 - PrivateLink も似たようなアップデートあったけど違いとは
bigmuramura
0
200
複雑性の高いオブジェクト編集に向き合う: プラガブルなReactフォーム設計
righttouch
PRO
0
120
統計データで2024年の クラウド・インフラ動向を眺める
ysknsid25
2
850
KubeCon NA 2024 Recap / Running WebAssembly (Wasm) Workloads Side-by-Side with Container Workloads
z63d
1
250
どちらを使う?GitHub or Azure DevOps Ver. 24H2
kkamegawa
0
870
Microsoft Azure全冠になってみた ~アレを使い倒した者が試験を制す!?~/Obtained all Microsoft Azure certifications Those who use "that" to the full will win the exam! ?
yuj1osm
2
110
プロダクト開発を加速させるためのQA文化の築き方 / How to build QA culture to accelerate product development
mii3king
1
270
マイクロサービスにおける容易なトランザクション管理に向けて
scalar
0
140
5分でわかるDuckDB
chanyou0311
10
3.2k
社外コミュニティで学び社内に活かす共に学ぶプロジェクトの実践/backlogworld2024
nishiuma
0
270
フロントエンド設計にモブ設計を導入してみた / 20241212_cloudsign_TechFrontMeetup
bengo4com
0
1.9k
Featured
See All Featured
Measuring & Analyzing Core Web Vitals
bluesmoon
4
170
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
48
2.2k
We Have a Design System, Now What?
morganepeng
51
7.3k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
29
2k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
5
450
Speed Design
sergeychernyshev
25
670
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
A better future with KSS
kneath
238
17k
Agile that works and the tools we love
rasmusluckow
328
21k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
330
21k
Building Adaptive Systems
keathley
38
2.3k
Building Better People: How to give real-time feedback that sticks.
wjessup
365
19k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None