Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
290
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
350
High Performance RPC with Finagle
samklr
1
180
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
800
Datageeks_27-05.pdf
samklr
0
53
Big data and Machine learning APIs
samklr
4
260
Scalable Machine Learning
samklr
2
240
mesos.devoxx.2014
samklr
2
260
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.9k
Algebra for analytics
samklr
1
290
Other Decks in Technology
See All in Technology
「Verify with Wallet API」を アプリに導入するために
hinakko
1
240
LLM時代にデータエンジニアの役割はどう変わるか?
ikkimiyazaki
1
560
リーダーになったら未来を語れるようになろう/Speak the Future
sanogemaru
0
280
Function calling機能をPLaMo2に実装するには / PFN LLMセミナー
pfn
PRO
0
940
Why Governance Matters: The Key to Reducing Risk Without Slowing Down
sarahjwells
0
110
AI Agentと MCP Serverで実現する iOSアプリの 自動テスト作成の効率化
spiderplus_cb
0
500
AIが書いたコードをAIが検証する!自律的なモバイルアプリ開発の実現
henteko
1
350
AI駆動開発を推進するためにサービス開発チームで 取り組んでいること
noayaoshiro
0
190
スタートアップにおけるこれからの「データ整備」
shomaekawa
0
150
英語は話せません!それでも海外チームと信頼関係を作るため、対話を重ねた2ヶ月間のまなび
niioka_97
0
120
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
2
5.4k
ACA でMAGI システムを社内で展開しようとした話
mappie_kochi
1
270
Featured
See All Featured
A better future with KSS
kneath
239
17k
Large-scale JavaScript Application Architecture
addyosmani
514
110k
Documentation Writing (for coders)
carmenintech
75
5k
Visualization
eitanlees
148
16k
A Tale of Four Properties
chriscoyier
160
23k
Speed Design
sergeychernyshev
32
1.1k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
15k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.5k
The Art of Programming - Codeland 2020
erikaheidi
56
14k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
9
850
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None