Over the last decade, Apache Parquet has become the standard format for storing tabular data on disk, regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability, and, not least, its strong out-of-the-box performance with default settings.
While these defaults and common access patterns already provide decent performance, understanding the format in more detail and using recent developments lets one achieve much better performance, produce smaller files, and utilise Parquet's newer partial-reading features to read even smaller subsets of a file for a given query.
This talk aims to provide insight into the Parquet format and recent developments that are useful in end users' daily workflows. The only prerequisite is knowing what a DataFrame, or tabular data in general, is.