Sprinting Pandas (London Python)

ianozsvald
October 22, 2020



Transcript

  1. Introductions (By [ian]@ianozsvald[.com] Ian Ozsvald)
    - Interim Chief Data Scientist
    - 19+ years experience
    - Team coaching & public courses – I’m sharing from my Higher Performance Python course
    - 2nd Edition!
  2. Today’s goal
    - Pandas
      - Saving RAM to fit in more data
      - Calculating faster by dropping to NumPy
    - Advice for “being highly performant”
    - Has Covid 19 affected UK Company Registrations?
  3. Categoricals – over 10x speed up (on this data)!
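A minimal sketch of the categorical idea, not the code from the talk. The labels and row count are made-up stand-ins for the Company Registrations data, and the speed-up you see will depend on your data and Pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data: a string column with only a few distinct values.
rng = np.random.default_rng(0)
labels = ["Private Limited Company", "PLC", "Charity", "LLP"]
ser_obj = pd.Series(rng.choice(labels, size=100_000))
ser_cat = ser_obj.astype("category")  # integer codes plus a small lookup table


def time_value_counts(ser, repeats=5):
    """Return the best wall-clock time (seconds) of ser.value_counts()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ser.value_counts()
        best = min(best, time.perf_counter() - start)
    return best


# Both representations give the same counts.
counts_obj = ser_obj.value_counts()
counts_cat = ser_cat.value_counts()

print(f"strings:     {time_value_counts(ser_obj):.5f}s, "
      f"{ser_obj.memory_usage(deep=True):,} bytes")
print(f"categorical: {time_value_counts(ser_cat):.5f}s, "
      f"{ser_cat.memory_usage(deep=True):,} bytes")
```

Categoricals pay off when a column has few distinct values relative to its length; on a high-cardinality column the lookup table can cost more than it saves.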
  4. Make choices to save RAM
    Including the index (previously we ignored it), we still save circa 50% RAM, so you can fit in more rows of data
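A small sketch of measuring RAM with the index included. The column names and value ranges here are hypothetical, and the saving on real data will differ from the circa 50% quoted on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one low-cardinality text column, one small-integer column.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "company_category": rng.choice(
        ["Private Limited Company", "PLC", "LLP"], size=n_rows
    ),
    "n_directors": rng.integers(0, 200, size=n_rows),  # int64 by default
})

# Choose narrower dtypes: values 0..199 fit in uint8, text becomes categorical.
df_small = df.assign(
    company_category=df["company_category"].astype("category"),
    n_directors=df["n_directors"].astype("uint8"),
)


def total_bytes(frame):
    """Total bytes used by the frame, counting the index and string contents."""
    return int(frame.memory_usage(index=True, deep=True).sum())


bytes_before = total_bytes(df)
bytes_after = total_bytes(df_small)
print(f"before: {bytes_before:,} bytes, after: {bytes_after:,} bytes")
```

Passing `deep=True` matters for text columns, since the default only counts the pointers and not the strings themselves.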
  5. Drop to NumPy if you know you can
    Caveat – Pandas mean is not np.mean; the fair comparison is to np.nanmean, which is slower – see my blog or PyDataAmsterdam 2020 talk for details
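The caveat can be seen on a tiny Series with one missing value. This is an illustrative sketch with made-up numbers, not the benchmark from the talk.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])

pandas_mean = ser.mean()                 # skips NaN by default (skipna=True)
numpy_mean = np.mean(ser.values)         # NaN propagates, so the result is NaN
numpy_nanmean = np.nanmean(ser.values)   # ignores NaN, matching Pandas here

print(pandas_mean, numpy_mean, numpy_nanmean)
```

So swapping `ser.mean()` for `np.mean(ser.values)` is only safe when you know the data has no missing values.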
  6. NumPy vs Pandas overhead (ser.sum())
    - 25 files, 83 functions
    - Very few NumPy calls!
    - Thanks! https://github.com/ianozsvald/callgraph_james_powell
  7. Overhead with ser.values.sum()
    - 18 files, 51 functions
    - Many fewer Pandas calls (but still a lot!)
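A rough way to see the call overhead for yourself, assuming a short float Series where overhead dominates the arithmetic. The Series length and repeat count are arbitrary choices, and timings vary by machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Short Series so per-call overhead, not the summation itself, dominates.
ser = pd.Series(np.arange(1_000, dtype="float64"))

total_pandas = ser.sum()         # goes through Pandas' machinery
total_numpy = ser.values.sum()   # calls sum on the underlying NumPy array

t_pandas = timeit.timeit(ser.sum, number=1_000)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=1_000)

print(f"ser.sum():        {t_pandas:.4f}s for 1,000 calls")
print(f"ser.values.sum(): {t_numpy:.4f}s for 1,000 calls")
```

The two totals agree here because the data has no missing values; with NaNs present the results differ, as with mean on the earlier slide.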
  8. Is Pandas unnecessarily slow – NO!
    https://github.com/pandas-dev/pandas/issues/34773 - the truth is a bit complicated!
  9. Being highly performant
    - Install optional (but great!) Pandas dependencies
      - bottleneck
      - numexpr
    - Investigate https://github.com/ianozsvald/dtype_diet
    - Investigate my ipython_memory_usage (PyPI/Conda)
    https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
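One way to check whether the optional dependencies are available in the current environment. This is a sketch; the `status` dictionary is just an illustrative structure, while `compute.use_bottleneck` and `compute.use_numexpr` are existing Pandas options.

```python
import importlib.util

import pandas as pd

status = {}
for name in ("bottleneck", "numexpr"):
    # find_spec returns None when the package cannot be imported.
    installed = importlib.util.find_spec(name) is not None
    # The Pandas option says whether Pandas is allowed to use the library.
    enabled = pd.get_option(f"compute.use_{name}")
    status[name] = {"installed": installed, "enabled_in_pandas": enabled}

for name, info in status.items():
    print(name, info)
```

If a package shows as not installed, it can be added with pip or conda in the usual way.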
  10. Pure Python is “slow” and expressive
    Deliberately poor function – pretend this is clever but slow!
  11. Parallelise with Dask for multi-core
    - Make plain-Python code multi-core
    - Note I had to drop text index column due to speed-hit
    - Data copy cost can overwhelm any benefits so (always) profile & time
  12. Being highly performant
    - Mistakes slow us down (PAY ATTENTION!)
      - Try nullable Int64 & boolean, forthcoming Float64
      - Write tests (unit & end-to-end)
      - Lots more material & my newsletter on my blog IanOzsvald.com
      - Time saving docs:
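A short sketch of the nullable dtypes mentioned above, using made-up values. It shows how a missing value silently turns an integer column into floats unless the nullable dtype is requested.

```python
import pandas as pd

# Default behaviour: a missing value forces the integers to float64 with NaN.
ser_float = pd.Series([1, 2, None])

# Nullable dtypes keep the intended type and mark the gap as pd.NA.
ser_int = pd.Series([1, 2, None], dtype="Int64")
ser_bool = pd.Series([True, False, None], dtype="boolean")

print(ser_float.dtype, ser_int.dtype, ser_bool.dtype)
print(ser_int.sum(), ser_int.isna().sum())
```

Note the capital "I" in `"Int64"`; lowercase `"int64"` is the ordinary NumPy dtype, which cannot hold missing values.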
  13. Vaex / Modin
    - Memory mapped & lazy computation
      - New string dtype (RAM efficient)
    - Modin sits on Pandas, new “algebra” for dfs
      - Drop in replacement, easy to try
    See talks on my blog:
  14. When to try Vaex
    - You have a huge dataset on a single hard drive
    - Memory mapped files (HDF5) are best
    - NumPy types and simpler Pandas-like functions
    - Investment – similar but different API to Pandas
    https://github.com/vaexio/vaex/issues/968
  15. When to try Dask
    - You want Pandas but ran out of RAM on 1 machine
    - You want multi-machine cluster scalability
    - You want multi-core support for operations like groupby on parallelisable datasets
    - Investment – quick start then a learning curve
  16. When to try Modin
    - You want all of Pandas
    - You have lots of RAM and many CPUs
    - You’re doing groupby operations on many columns
    - Investment – easy to try
    https://github.com/modin-project/modin/issues/1390
  17. Covid 19’s effect on UK Economy?
    Sharp decline in corporate registrations after Lockdown – then an apparent surge (perhaps just backed-up paperwork?). Will the recovery “last”? All open data, you can do similar things!
  18. Summary
    - Make it right then make it fast
    - Think about being performant
    - See blog for my classes
    - I’d love a postcard if you learned something new!