
Data Engineering SG Meetup

Polars DataFrame Library

Sivakumar

April 12, 2025

Transcript

  1. Polars
     ▪ Open-source library for data manipulation
     ▪ Written in the Rust language
     ▪ Python, R & NodeJS support
     ▪ Started in 2020, created by Ritchie Vink
  2. Polars – Features
     ▪ Very fast data processing on a single machine
     ▪ Supports both eager (default) & lazy execution (see the sketch after this list)
     ▪ Uses the Apache Arrow memory format in its query engine
     ▪ Multi-threaded
     ▪ Query optimizer
     ▪ Expression API
     ▪ Lazy execution with streaming capabilities (when datasets do not fit in memory, the query engine executes the query in batches)
     ▪ GPU support
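     A minimal sketch (not from the slides) contrasting eager and lazy execution; the file name sales.csv and its columns are hypothetical:

     import polars as pl

     # Eager (default): read_csv loads the whole file immediately.
     df = pl.read_csv("sales.csv")  # hypothetical file
     eager_result = df.filter(pl.col("amount") > 100).select("customer_id", "amount")

     # Lazy: scan_csv only builds a query plan; nothing runs until collect().
     lazy_result = (
         pl.scan_csv("sales.csv")
           .filter(pl.col("amount") > 100)
           .select("customer_id", "amount")
           .collect()  # the query optimizer runs here
     )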
  3. Polars – I/O Formats
     CSV, Excel, AVRO, Parquet, Database, Hugging Face, Google BigQuery, Cloud Storage Systems, DeltaLake, Iceberg, OpenOffice, JSON
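     A brief sketch of common read/write calls for a few of the formats above (file names are hypothetical; Excel, database and cloud-storage readers need optional extra dependencies):

     import polars as pl

     df = pl.read_csv("trips.csv")         # hypothetical input file
     df.write_parquet("trips.parquet")     # write columnar Parquet

     df_pq = pl.read_parquet("trips.parquet")
     df_js = pl.read_json("trips.json")    # hypothetical JSON file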
  4. Polars – Data Types
     ▪ Numeric
     ▪ String
     ▪ Temporal
     ▪ Boolean
     ▪ Enums
     ▪ Lists, Arrays
     ▪ Struct
     ▪ Objects
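     A small illustrative sketch (not from the slides) touching several of these type families; the column names and values are made up:

     import polars as pl
     import datetime as dt

     df = pl.DataFrame({
         "id": [1, 2],                                             # numeric
         "name": ["Alice", "Ben"],                                 # string
         "joined": [dt.date(2020, 1, 1), dt.date(2021, 6, 15)],    # temporal
         "active": [True, False],                                  # boolean
         "tier": ["gold", "silver"],                               # cast to Enum below
         "scores": [[1, 2, 3], [4, 5]],                            # list
     }).with_columns(pl.col("tier").cast(pl.Enum(["silver", "gold"])))
     print(df.schema)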
  5. Polars – Interoperability
     Library   Description
     Pandas    Convert data to/from pandas DataFrames/Series
     NumPy     Convert data to/from NumPy arrays
     PyArrow   Convert data to/from PyArrow tables/arrays
     Pydantic  Convert data from Pydantic models to Polars
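     A quick conversion sketch (not from the slides); to_pandas() and to_arrow() require pyarrow to be installed:

     import polars as pl

     df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

     pdf = df.to_pandas()    # Polars -> pandas DataFrame
     arr = df.to_numpy()     # Polars -> 2-D NumPy array
     tbl = df.to_arrow()     # Polars -> PyArrow Table

     df_from_pandas = pl.from_pandas(pdf)                    # pandas -> Polars
     df_from_numpy = pl.from_numpy(arr, schema=["a", "b"])   # NumPy -> Polars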
  6. Polars – DataFrame
     ▪ 2-dimensional data structure, useful for data manipulation and analytics
     ▪ The default Polars installation supports up to ~4.3 billion rows in a DataFrame
     ▪ When installed as polars-u64-idx, it supports a maximum of ~18 quintillion rows

     import polars as pl
     import datetime as dt

     df = pl.DataFrame({
         "name": ["Alice Archer", "Ben Brown"],
         "birthdate": [dt.date(1997, 1, 10), dt.date(1985, 2, 15)],
         "weight": [57.9, 72.5],
         "height": [1.56, 1.77],
     })
     print(df)
  7. Polars – Expressions and Contexts
     ▪ An expression is a lazy representation of a data transformation
     ▪ Modular and flexible
     ▪ No computation takes place until the expression is used inside a context
     ▪ Commonly used contexts: select, with_columns, filter, group_by

     import polars as pl

     # The expression only describes the BMI computation; nothing is evaluated yet
     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     print(bmi_expr)
  8. Polars – select Context
     ▪ Applies expressions over columns
     ▪ May produce new columns from combinations of other columns or from literals
     ▪ Each expression must return a result with the same length as the source DataFrame, or a scalar (which is broadcast)

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.select(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  9. Polars – with_columns Context
     ▪ Creates a new DataFrame containing all columns from the source DataFrame plus the new columns defined by the expressions
     ▪ Can overwrite an existing column's data when an expression reuses its name
     ▪ The resulting DataFrame has the same length as the source DataFrame

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.with_columns(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  10. Polars – filter Context
     ▪ Filters the rows of a DataFrame
     ▪ Based on one or more expressions that evaluate to the Boolean data type

     import polars as pl
     from datetime import date

     result = df.filter(
         pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
         pl.col("height") > 1.7,
     )
     print(result)
  11. Polars – group_by Context
     ▪ Rows are grouped according to the unique values of the grouping expressions
     ▪ Use agg() to apply aggregating expressions to the groups

     import polars as pl

     result = df.group_by(
         (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
         (pl.col("height") < 1.7).alias("short?"),
     ).agg(pl.col("name"))
     print(result)
  12. Polars – Expression Expansion
     ▪ Shorthand notation for applying the same transformation to multiple columns

     pl.col("weight", "height").min().name.prefix("min_")

     ▪ The expression above is equivalent to:

     [
         pl.col("weight").min().alias("min_weight"),
         pl.col("height").min().alias("min_height"),
     ]
  13. Polars – Join
     ▪ Combines columns from two DataFrames into a new DataFrame
     ▪ Supports all common join types (inner, full, left outer, right outer, etc.)

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     df_dept = pl.read_csv("dept.csv")
     result = df_emp.join(df_dept, on="department_id")
     print(result)
  14. Polars – Horizontal Concatenation
     ▪ Concatenates two DataFrames with non-overlapping columns horizontally

     import polars as pl

     df_h1 = pl.DataFrame({"l1": [1, 2], "l2": [3, 4]})
     df_h2 = pl.DataFrame({"r1": [5, 6], "r2": [7, 8], "r3": [9, 10]})
     df_horizontal_concat = pl.concat([df_h1, df_h2], how="horizontal")
     print(df_horizontal_concat)
  15. Polars – Vertical Concatenation
     ▪ Concatenates two DataFrames with the same columns vertically (stacks rows)

     import polars as pl

     df_v1 = pl.DataFrame({"a": [1], "b": [3]})
     df_v2 = pl.DataFrame({"a": [2], "b": [4]})
     df_vertical_concat = pl.concat([df_v1, df_v2], how="vertical")
     print(df_vertical_concat)
  16. Polars – Lazy API
     ▪ Automatic query optimization: the query planner performs various optimizations (predicate pushdown, projection pushdown, slice pushdown, etc.)
     ▪ pl.scan_*() functions are available for various file types; they load files lazily and return a LazyFrame (see the sketch below)
     ▪ The query is evaluated only when it is collected using the collect() function
     ▪ Allows users to work with larger-than-memory datasets via streaming (call collect() with engine="streaming")
     ▪ Use explain() to see the query plan that Polars will use
     ▪ Call .lazy() to convert a DataFrame into a LazyFrame
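     A minimal Lazy API sketch (not from the slides; the file name and columns are hypothetical). Recent Polars versions select the streaming engine with collect(engine="streaming"); older versions used collect(streaming=True):

     import polars as pl

     lf = (
         pl.scan_csv("transactions.csv")       # hypothetical file; returns a LazyFrame
           .filter(pl.col("amount") > 0)       # candidate for predicate pushdown
           .select("customer_id", "amount")    # candidate for projection pushdown
           .group_by("customer_id")
           .agg(pl.col("amount").sum().alias("total_amount"))
     )

     print(lf.explain())                       # inspect the optimized query plan
     result = lf.collect(engine="streaming")   # execute in batches for large data
     print(result)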
  17. Polars – SQL Support
     ▪ Uses a SQLContext object to manage SQL queries
     ▪ Register LazyFrame and DataFrame objects from the global namespace or via a dictionary mapping
     ▪ SQL queries are always executed in lazy mode
     ▪ Polars translates SQL queries into expressions, which are then executed

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     with pl.SQLContext(register_globals=True, eager=True) as ctx:
         df_top = ctx.execute("SELECT * FROM df_emp LIMIT 5")
     print(df_top)
  18. Demo Use Case #1
     ▪ Analyzing the Singapore HDB resale dataset (data.gov.sg), ~1 million records from 1990 to date
     ▪ Get the min & max resale values for each flat type
     ▪ Use pivot & join to get min & max resale values for each town, decade (1990-1999, 2000-2009, 2010-2019, 2020-to date) & flat type (3RM, 4RM, 5RM, EXEC & MULTI-GEN)
     ▪ Pipeline: SG HDB Dataset → Filter Records → Data Standardization → Data Aggregation → Pivot & Join → Write Output (sketched below)
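     A rough Polars sketch of that pipeline, not the presenter's actual code: the file name, the column names (month, town, flat_type, resale_price), the "YYYY-MM" month format and the flat-type labels are assumptions about the data.gov.sg schema:

     import polars as pl

     lf = (
         pl.scan_csv("hdb_resale.csv")                     # hypothetical file name
           .filter(pl.col("flat_type").is_in(
               ["3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE", "MULTI-GENERATION"]))
           .with_columns(
               (pl.col("month").str.slice(0, 4).cast(pl.Int32) // 10 * 10)
               .alias("decade"))                           # standardize a decade column
           .group_by("town", "decade", "flat_type")
           .agg(
               pl.col("resale_price").min().alias("min_resale"),
               pl.col("resale_price").max().alias("max_resale"),
           )
     )
     agg = lf.collect()

     # Pivot min and max separately, then join them on town & decade.
     min_pivot = agg.pivot(on="flat_type", index=["town", "decade"], values="min_resale")
     max_pivot = agg.pivot(on="flat_type", index=["town", "decade"], values="max_resale")
     result = min_pivot.join(max_pivot, on=["town", "decade"], suffix="_max")
     result.write_parquet("hdb_min_max.parquet")           # hypothetical output path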
  19. Demo Use Case #2
     ▪ Generated synthetic customer data (14 columns) of various sizes (1, 5, 10, 25 & 50 million rows) using the ficto library
     ▪ Simple data pipeline (filter, standardize & aggregate) implemented in Polars, Pandas & PySpark

     Customer Data Processing – execution time in seconds (lower is better):

     Dataset Size    Polars    Pandas    PySpark
     1 Million         0.62      1.90      15.90
     5 Million         1.67      8.71      12.82
     10 Million        3.16     17.00      15.44
     25 Million        7.48     51.25      26.29
     50 Million       51.70    134.39      47.26

     Environment: Windows 11, i7 processor, 8 cores & 16 GB RAM; Polars v1.27.1, Pandas v2.2.0, PySpark v3.5.0