Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
800
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
可観測性は開発環境から、開発環境にもオブザーバビリティ導入のススメ
layerx
PRO
4
1.8k
From Natural Language to K8s Operations: The MCP Architecture and Practice of kubectl-ai
appleboy
0
340
アウトプットから始めるOSSコントリビューション 〜eslint-plugin-vueの場合〜 #vuefes
bengo4com
3
1.8k
プロダクト開発と社内データ活用での、BI×AIの現在地 / Data_Findy
sansan_randd
1
620
コンパウンド組織のCRE #cre_meetup
layerx
PRO
1
290
SRE × マネジメントレイヤーが挑戦した組織・会社のオブザーバビリティ改革 ― ビジネス価値と信頼性を両立するリアルな挑戦
coconala_engineer
0
290
仕様駆動開発を実現する上流工程におけるAIエージェント活用
sergicalsix
7
3.6k
プレイドのユニークな技術とインターンのリアル
plaidtech
PRO
1
490
QA業務を変える(!?)AIを併用した不具合分析の実践
ma2ri
0
160
個人でデジタル庁の デザインシステムをVue.jsで 作っている話
nishiharatsubasa
3
5.2k
Retrospectiveを振り返ろう
nakasho
0
130
AWS DMS で SQL Server を移行してみた/aws-dms-sql-server-migration
emiki
0
260
Featured
See All Featured
Principles of Awesome APIs and How to Build Them.
keavy
127
17k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.7k
Building Applications with DynamoDB
mza
96
6.7k
Gamification - CAS2011
davidbonilla
81
5.5k
Facilitating Awesome Meetings
lara
57
6.6k
The Invisible Side of Design
smashingmag
302
51k
Navigating Team Friction
lara
190
15k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
What's in a price? How to price your products and services
michaelherold
246
12k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.7k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None