ODSC: Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine

Pandas 2, Polars or Dask?
ODSC 2023 Talk
@IanOzsvald – ianozsvald.com
@GilesWeaver

We are Ian Ozsvald & Giles Weaver
Interim Chief Data Scientist
2nd Edition!
Data Scientist
By [ian]@ianozsvald[.com] and [email protected] @gilesweaver

3 interesting DataFrame libraries
Lots of change in the ecosystem in recent years
Which library should you use? What do you use?
We learned Polars in 2 weeks
We benchmark. All benchmarks are lies

Motoscape Charity Rally
Ian - “Let’s do something silly”
September 2023 (4 mo)
2,000 mile round trip
<£1k car
Ideally it shouldn’t explode
https://bit.ly/JustGivingIan

Car Test Data (UK DVLA)
17 years of roadtest pass or fails
30M vehicles/year, [C|T]SV text files
Text→Parquet made easy with Dask
600M rows in total

Pandas 2 – what’s new?
Pandas 15 years old, NumPy based
PyArrow first class alongside NumPy
Internal clean-ups so less RAM used
Copy on Write (off by default)

PyArrow vs NumPy – which to use?
NumExpr & bottleneck both installed
Checks for identical results in notebook
String dtype
Nullable integer dtype
Backend
NumPy strings expensive in RAM e.g. 82M rows 39GB NumPy, 11GB Arrow

Pandas+Arrow, query, Seaborn
You can optimise by hand – mask, then choose columns to go faster
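
A sketch of the two styles on a toy frame: `query` versus building a boolean mask and selecting only the needed columns in one `loc` call. The column names are hypothetical, and whether the mask version is faster on your data is something to measure.

```python
import pandas as pd

# Hypothetical stand-in for the car test data.
df = pd.DataFrame(
    {
        "make": ["VOLVO", "VW", "VOLVO", "FORD"],
        "test_mileage": [120_000, 90_000, 150_000, 60_000],
        "test_result": ["P", "F", "P", "P"],
    }
)

# Convenient: filter all columns with query, then pick the column.
via_query = df.query("make == 'VOLVO'")[["test_mileage"]]

# By hand: build the mask, then pull only the rows and columns we need.
mask = df["make"] == "VOLVO"
via_mask = df.loc[mask, ["test_mileage"]]

print(via_mask)
```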

Polars – what’s in it?
Rust based, Python front-end, 3 years old
Arrow (not NumPy)
Inherently multi-core and parallelised
Eager and Lazy API (+Query Planner)
Beta out-of-core (medium data) support

Polars – same query & Seaborn
(Lazy df is even faster)

Manual Query Planning

A more advanced query
Polars eager (no “lazy() / collect()” call) takes 6s
Pandas+NumPy takes 25s (i.e. slower)
Possibly we can further optimise this by hand (?)
Enables the Query Planner optimisations

First conclusions
Pandas+Arrow probably faster than Pandas+NumPy
Polars seems to be faster than Pandas+Arrow
Maybe you can make Pandas “as fast”, but you have to experiment – Polars is “just fast”
All benchmarks are lies – your mileage will vary

Should I buy Volvo V50 – mileage?

Volvo V50 lasts <24 hours
https://bit.ly/JustGivingIan

Resampling a timeseries
This dataset is in-RAM (2021-2022)
There’s a limit to how much we can instantiate into memory, even if we’re careful with sub-selection and dtypes
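
A sketch of resampling an in-RAM timeseries, here with Pandas `resample` counting tests per week. The slide does not say which library or frequency was used, so both are assumptions, as are the `test_date` and `test_id` columns.

```python
import pandas as pd

# Hypothetical in-RAM subset: one row per test.
tests = pd.DataFrame(
    {
        "test_date": pd.to_datetime(
            ["2021-01-04", "2021-01-05", "2021-01-12", "2021-02-01"]
        ),
        "test_id": [1, 2, 3, 4],
    }
)

# Count tests per calendar week; weeks without tests appear with a count of 0.
weekly = tests.set_index("test_date").resample("W")["test_id"].count()
print(weekly)
```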

Scanning 640M rows of larger dataset
Implicit Lazy DataFrame
11 seconds, 640M rows, circa 850 partitions (files)

April drop was due to lockdown

Vehicle ownership increases, Hybrids growing
We have to touch all Parquet files, so we can’t easily use Pandas
MOT after 3 years of age for all vehicles

Dask scales Pandas (and lots more)

Adding columns=[...] saves 20s

For the rally we bought a ‘99 Passat
Dead before 2023
Still alive
Us
https://bit.ly/JustGivingIan

3min+ with default 4 workers (*4 threads)
1min with 12 workers (*1 thread), hand tuned
Giles had to push directives to the Arrow read, set shuffle on set_index and agg
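
A sketch of choosing the worker and thread counts by hand with `dask.distributed`. The helper `make_cluster` is ours for illustration and defaults to the 12 workers with 1 thread each noted above. The demo starts a much smaller in-process cluster so that it runs anywhere, and the Arrow read and shuffle directives are not shown.

```python
from dask.distributed import Client, LocalCluster


def make_cluster(
    n_workers: int = 12, threads_per_worker: int = 1, processes: bool = True
) -> LocalCluster:
    """Start a local cluster with an explicit worker and thread layout."""
    return LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        processes=processes,
        dashboard_address=None,  # no diagnostics dashboard for this demo
    )


# Small in-process demo; drop the arguments to get 12 worker processes.
with make_cluster(n_workers=2, processes=False) as cluster, Client(cluster) as client:
    thread_counts = sorted(client.nthreads().values())
    print(thread_counts)
```

The best layout depends on the machine and the workload, so treat 12 workers with 1 thread as a starting point to benchmark against the defaults.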

Issues encountered

Thoughts on our testing
Haven’t checked to_numpy(), Numba, apply, rolling, writing partitioned Parquet (Polars)
NaN / Missing behaviour different Polars/Pandas
sklearn partial support (sklearn assumes Pandas API) – but maybe Pandas+Arrow has copy issues too?
Arrow timeseries/str different to Pandas NumPy?

Arrow RAM usage great!
Polars is similar as it uses Arrow

Pandas vs Polars conclusions
Polars easy to use, Pandas we all know
Arrow in both is great (fast+low RAM footprint)
Differences in Polars API (day of week starts at 1 not 0, no `sample` on LazyDF, different verb names)
Clear Polars API design makes thinking easier

Medium-data conclusions
Dask ddf and Polars can perform similarly
Dask learning curve harder, especially for performance
Dask does a lot more (e.g. Bag, ML, NumPy, clusters, diagnostics)

Summary
Experiment, we have options!
I love receiving postcards (email me)
Follow our journey->
I’m happy to discuss after
https://bit.ly/JustGivingIan

Appendix

3min+ with default 4 workers (*4 threads)
1min with 12 workers (*1 thread) – hand tuned
Giles had to sort the Parquet (6 mins) & change groupby agg shuffle, else performance much worse
