ianozsvald
June 06, 2023

ODSC: Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine

Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores. Polars is a new competitor to Pandas designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a "just about fits in RAM" data task using all 3 solutions, talking through the pros and cons so you can make the best choice for your research workflow. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool to invest in.
Do you still need 5x working RAM for Pandas operations (probably not!)? Can Pandas string operations actually be fast (sure)? Since Polars uses Arrow data structures, can we easily use tools like Scikit-learn and matplotlib (yes-maybe)? What limits do we still face?


Transcript

  1. We are Ian Ozsvald & Giles Weaver

    Interim Chief Data Scientist. 2nd Edition! Data Scientist. By [ian]@ianozsvald[.com] and [email protected] @gilesweaver Ian Ozsvald
  2. 3 interesting DataFrame libraries

    Lots of change in the ecosystem in recent years. Which library should you use? What do you use? We learned Polars in 2 weeks. We benchmark. All benchmarks are lies.
  3. Motoscape Charity Rally

    Ian: “Let’s do something silly”. September 2023 (4 mo). 2,000 mile round trip. <£1k car. Ideally it shouldn’t explode. https://bit.ly/JustGivingIan
  4. Car Test Data (UK DVLA)

    17 years of roadtest passes or fails. 30M vehicles/year, [C|T]SV text files. Text→Parquet made easy with Dask. 600M rows in total.
  5. Pandas 2 – what’s new?

    Pandas is 15 years old and NumPy based. PyArrow is first class alongside NumPy. Internal clean-ups, so less RAM used. Copy on Write (off by default).
  6. PyArrow vs NumPy – which to use?

    NumExpr & bottleneck both installed. Checks for identical results in notebook. String dtype. Nullable integer dtype. Backend. NumPy strings expensive in RAM, e.g. 82M rows: 39GB NumPy, 11GB Arrow.
  7. Pandas+Arrow, query, Seaborn

    You can optimise by hand: mask, then choose columns to go faster.
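
One way to read "mask, then choose columns" is sketched below. The column names and data are hypothetical, and the sketch only shows that the two forms give the same result; whether the hand-written form is faster needs timing on your own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000
# Hypothetical columns standing in for the car test data.
df = pd.DataFrame(
    {
        "make": rng.choice(["VOLVO", "FORD", "VW"], size=n_rows),
        "test_mileage": rng.integers(0, 200_000, size=n_rows),
        "test_result": rng.choice(["P", "F"], size=n_rows),
    }
)

wanted = ["test_mileage", "test_result"]

# Convenient form: query() filters rows on every column, then we pick columns.
via_query = df.query("make == 'VOLVO'")[wanted]

# By hand: build a boolean mask, then select rows and columns in one step.
mask = df["make"] == "VOLVO"
by_hand = df.loc[mask, wanted]

results_match = via_query.equals(by_hand)
print(results_match, len(by_hand))
```
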
  8. Polars – what’s in it?

    Rust based, Python front-end, 3 years old. Arrow (not NumPy). Inherently multi-core and parallelised. Eager and Lazy API (+Query Planner). Beta out-of-core (medium data) support.
  9. A more advanced query

    Polars eager (no “lazy() / collect()” call) takes 6s. Pandas+NumPy takes 25s (i.e. slower). Possibly we can further optimise this by hand (?). Adding the lazy() / collect() call enables the Query Planner optimisations.
  10. First conclusions

    Pandas+Arrow probably faster than Pandas+NumPy. Polars seems to be faster than Pandas+Arrow. Maybe you can make Pandas “as fast”, but you have to experiment; Polars is “just fast”. All benchmarks are lies: your mileage will vary.
  11. Volvo V50 lasts <24 hours

    https://bit.ly/JustGivingIan
  12. Resampling a timeseries

    This dataset is in-RAM (2021-2022). There’s a limit to how much we can instantiate into memory, even if we’re careful with sub-selection and dtypes.
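
The slide does not include its code in the transcript, so the following is only a sketch of what an in-RAM resample can look like in Pandas. The column names and the four rows are hypothetical; the example computes a monthly pass rate.

```python
import pandas as pd

# Hypothetical test records; "P" is a pass and "F" is a fail.
tests = pd.DataFrame(
    {
        "test_date": pd.to_datetime(
            ["2021-01-05", "2021-01-20", "2021-02-10", "2021-02-25"]
        ),
        "test_result": ["P", "F", "P", "P"],
    }
)
tests["passed"] = (tests["test_result"] == "P").astype(int)

# Resample on the datetime index; "MS" groups rows by calendar month
# and labels each group with the month start.
monthly_pass_rate = tests.set_index("test_date").resample("MS")["passed"].mean()

print(monthly_pass_rate)
```
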
  13. Scanning 640M rows of larger dataset

    Implicit Lazy DataFrame. 11 seconds, 640M rows, circa 850 partitions (files).
  14. Vehicle ownership increases, Hybrids growing

    We have to touch all parquet files, so we can’t easily use Pandas. MOT after 3 years of age for all vehicles.
  15. For the rally we bought a ‘99 Passat

    Dead before 2023. Still alive. Us. https://bit.ly/JustGivingIan
  16. 3min+ with default 4 workers (*4 threads); 1min with 12 workers (*1 thread), hand tuned

    Giles had to push directives to the Arrow read, and set shuffle on set_index and agg.
  17. Thoughts on our testing

    Haven’t checked to_numpy(), Numba, apply, rolling, writing partitioned Parquet (Polars). NaN / Missing behaviour different Polars/Pandas. sklearn partial support (sklearn assumes Pandas API), but maybe Pandas+Arrow has copy issues too? Arrow timeseries/str different to Pandas NumPy?
  18. Pandas vs Polars conclusions

    Polars easy to use, Pandas we all know. Arrow in both is great (fast+low RAM footprint). Differences in Polars API (day of week starts at 1 not 0, no `sample` on LazyDF, different verb names). Clear Polars API design makes thinking easier.
  19. Medium-data conclusions

    Dask ddf and Polars can perform similarly. Dask learning curve harder, especially for performance. Dask does a lot more (e.g. Bag, ML, NumPy, clusters, diagnostics).
  20. Summary

    Experiment, we have options! I love receiving postcards (email me). Follow our journey: https://bit.ly/JustGivingIan. I’m happy to discuss after.
  21. 3min+ with default 4 workers (*4 threads); 1min with 12 workers (*1 thread), hand tuned

    Giles had to sort the Parquet (6 mins) and change the groupby agg shuffle, else performance was much worse.