Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
270
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
340
High Performance RPC with Finagle
samklr
1
160
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
780
Datageeks_27-05.pdf
samklr
0
49
Big data and Machine learning APIs
samklr
4
250
Scalable Machine Learning
samklr
2
210
mesos.devoxx.2014
samklr
2
240
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
2.8k
Algebra for analytics
samklr
1
270
Other Decks in Technology
See All in Technology
運用しているアプリケーションのDBのリプレイスをやってみた
miura55
1
740
SA Night #2 FinatextのSA思想/SA Night #2 Finatext session
satoshiimai
1
140
全文検索+セマンティックランカー+LLMの自然文検索サ−ビスで得られた知見
segavvy
2
110
デスクトップだけじゃないUbuntu
mtyshibata
0
160
2025-02-21 ゆるSRE勉強会 Enhancing SRE Using AI
yoshiiryo1
1
370
自動テストの世界に、この5年間で起きたこと
autifyhq
10
8.6k
君も受託系GISエンジニアにならないか
sudataka
2
440
リーダブルテストコード 〜メンテナンスしやすい テストコードを作成する方法を考える〜 #DevSumi #DevSumiB / Readable test code
nihonbuson
11
7.3k
抽象化をするということ - 具体と抽象の往復を身につける / Abstraction and concretization
soudai
20
8.1k
Goで作って学ぶWebSocket
ryuichi1208
3
1.6k
偶然 × 行動で人生の可能性を広げよう / Serendipity × Action: Discover Your Possibilities
ar_tama
1
1.1k
Larkご案内資料
customercloud
PRO
0
650
Featured
See All Featured
Docker and Python
trallard
44
3.3k
Code Review Best Practice
trishagee
67
18k
Designing for Performance
lara
604
68k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.7k
The Language of Interfaces
destraynor
156
24k
Faster Mobile Websites
deanohume
306
31k
Side Projects
sachag
452
42k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
GraphQLの誤解/rethinking-graphql
sonatard
68
10k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.1k
Agile that works and the tools we love
rasmusluckow
328
21k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None