
Data Engineering SG Meetup

Polars DataFrame Library

Sivakumar

April 12, 2025

Transcript

  1. Polars
     ▪ Open-source library for data manipulation
     ▪ Written in the Rust language
     ▪ Python, R & NodeJS support
     ▪ Started in 2020, created by Ritchie Vink
  2. Polars – Features
     ▪ Very fast data processing on a single machine
     ▪ Supports both eager (default) & lazy execution (see the sketch after this list)
     ▪ Uses the Apache Arrow memory format in its query engine
     ▪ Multi-threaded
     ▪ Query optimizer
     ▪ Expression API
     ▪ Lazy execution with streaming capabilities (when datasets do not fit in memory, the query engine executes the query in batches)
     ▪ GPU support
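     A minimal sketch (not from the slides) contrasting eager and lazy execution; the file name sales.csv and its columns are hypothetical:

     import polars as pl

     # Eager (default): read_csv loads the whole file immediately.
     df = pl.read_csv("sales.csv")  # hypothetical file
     eager_result = df.filter(pl.col("amount") > 100).select("customer_id", "amount")

     # Lazy: scan_csv only builds a query plan; nothing runs until collect().
     lazy_result = (
         pl.scan_csv("sales.csv")
           .filter(pl.col("amount") > 100)
           .select("customer_id", "amount")
           .collect()  # the query optimizer runs here
     )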
  3. Polars – I/O Formats
     CSV, Excel, AVRO, Parquet, Database, Hugging Face, Google BigQuery, Cloud Storage Systems, DeltaLake, Iceberg, OpenOffice, JSON
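     A brief sketch of common read/write calls for a few of the formats above (file names are hypothetical; Excel, database and cloud-storage readers need optional extra dependencies):

     import polars as pl

     df = pl.read_csv("trips.csv")         # hypothetical input file
     df.write_parquet("trips.parquet")     # write columnar Parquet

     df_pq = pl.read_parquet("trips.parquet")
     df_js = pl.read_json("trips.json")    # hypothetical JSON file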
  4. Polars – Data Types
     ▪ Numeric
     ▪ String
     ▪ Temporal
     ▪ Boolean
     ▪ Enums
     ▪ Lists, Arrays
     ▪ Struct
     ▪ Objects
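     A small illustrative sketch (not from the slides) touching several of these type families; the column names and values are made up:

     import polars as pl
     import datetime as dt

     df = pl.DataFrame({
         "id": [1, 2],                                             # numeric
         "name": ["Alice", "Ben"],                                 # string
         "joined": [dt.date(2020, 1, 1), dt.date(2021, 6, 15)],    # temporal
         "active": [True, False],                                  # boolean
         "tier": ["gold", "silver"],                               # cast to Enum below
         "scores": [[1, 2, 3], [4, 5]],                            # list
     }).with_columns(pl.col("tier").cast(pl.Enum(["silver", "gold"])))
     print(df.schema)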
  5. Polars – Interoperability
     Library   Description
     Pandas    Convert data to/from pandas DataFrames/Series
     NumPy     Convert data to/from NumPy arrays
     PyArrow   Convert data to/from PyArrow tables/arrays
     Pydantic  Convert data from Pydantic models to Polars
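     A quick conversion sketch (not from the slides); to_pandas() and to_arrow() require pyarrow to be installed:

     import polars as pl

     df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

     pdf = df.to_pandas()    # Polars -> pandas DataFrame
     arr = df.to_numpy()     # Polars -> 2-D NumPy array
     tbl = df.to_arrow()     # Polars -> PyArrow Table

     df_from_pandas = pl.from_pandas(pdf)                    # pandas -> Polars
     df_from_numpy = pl.from_numpy(arr, schema=["a", "b"])   # NumPy -> Polars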
  6. Polars – DataFrame
     ▪ 2-dimensional data structure, useful for data manipulation and analytics
     ▪ The default Polars installation supports up to ~4.3 billion rows in a DataFrame
     ▪ When installed as polars-u64-idx, it supports a maximum of ~18 quintillion rows

     import polars as pl
     import datetime as dt

     df = pl.DataFrame({
         "name": ["Alice Archer", "Ben Brown"],
         "birthdate": [dt.date(1997, 1, 10), dt.date(1985, 2, 15)],
         "weight": [57.9, 72.5],
         "height": [1.56, 1.77],
     })
     print(df)
  7. Polars – Expressions and Contexts
     ▪ An expression is a lazy representation of a data transformation
     ▪ Modular and flexible
     ▪ No computation takes place until the expression is used inside a context
     ▪ Commonly used contexts: select, with_columns, filter, group_by

     import polars as pl

     # The expression only describes the BMI computation; nothing is evaluated yet
     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     print(bmi_expr)
  8. Polars – select Context
     ▪ Applies expressions over columns
     ▪ May produce new columns from combinations of other columns or from literals
     ▪ Each expression must return a result with the same length as the source DataFrame, or a scalar (which is broadcast)

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.select(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  9. Polars – with_columns Context
     ▪ Creates a new DataFrame containing all columns from the source DataFrame plus the new columns defined by the expressions
     ▪ Can overwrite an existing column's data when an expression reuses its name
     ▪ The resulting DataFrame has the same length as the source DataFrame

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.with_columns(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  10. Polars – filter Context
     ▪ Filters the rows of a DataFrame
     ▪ Based on one or more expressions that evaluate to the Boolean data type

     import polars as pl
     from datetime import date

     result = df.filter(
         pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
         pl.col("height") > 1.7,
     )
     print(result)
  11. Polars – group_by Context
     ▪ Rows are grouped according to the unique values of the grouping expressions
     ▪ Use agg() to apply aggregating expressions to the groups

     import polars as pl

     result = df.group_by(
         (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
         (pl.col("height") < 1.7).alias("short?"),
     ).agg(pl.col("name"))
     print(result)
  12. Polars – Expression Expansion
     ▪ Shorthand notation for applying the same transformation to multiple columns

     pl.col("weight", "height").min().name.prefix("min_")

     ▪ The expression above is equivalent to:

     [
         pl.col("weight").min().alias("min_weight"),
         pl.col("height").min().alias("min_height"),
     ]
  13. Polars – Join
     ▪ Combines columns from two DataFrames into a new DataFrame
     ▪ Supports all common join types (inner, full, left outer, right outer, etc.)

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     df_dept = pl.read_csv("dept.csv")
     result = df_emp.join(df_dept, on="department_id")
     print(result)
  14. Polars – Horizontal Concatenation
     ▪ Concatenates two DataFrames with non-overlapping columns horizontally

     import polars as pl

     df_h1 = pl.DataFrame({"l1": [1, 2], "l2": [3, 4]})
     df_h2 = pl.DataFrame({"r1": [5, 6], "r2": [7, 8], "r3": [9, 10]})
     df_horizontal_concat = pl.concat([df_h1, df_h2], how="horizontal")
     print(df_horizontal_concat)
  15. Polars – Vertical Concatenation
     ▪ Concatenates two DataFrames with the same columns vertically (stacks rows)

     import polars as pl

     df_v1 = pl.DataFrame({"a": [1], "b": [3]})
     df_v2 = pl.DataFrame({"a": [2], "b": [4]})
     df_vertical_concat = pl.concat([df_v1, df_v2], how="vertical")
     print(df_vertical_concat)
  16. Polars – Lazy API
     ▪ Automatic query optimization: the query planner performs various optimizations (predicate pushdown, projection pushdown, slice pushdown, etc.)
     ▪ pl.scan_*() functions are available for various file types; they load files lazily and return a LazyFrame (see the sketch below)
     ▪ The query is evaluated only when it is collected using the collect() function
     ▪ Allows users to work with larger-than-memory datasets via streaming (call collect() with engine="streaming")
     ▪ Use explain() to see the query plan that Polars will use
     ▪ Call .lazy() to convert a DataFrame into a LazyFrame
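     A minimal Lazy API sketch (not from the slides; the file name and columns are hypothetical). Recent Polars versions select the streaming engine with collect(engine="streaming"); older versions used collect(streaming=True):

     import polars as pl

     lf = (
         pl.scan_csv("transactions.csv")       # hypothetical file; returns a LazyFrame
           .filter(pl.col("amount") > 0)       # candidate for predicate pushdown
           .select("customer_id", "amount")    # candidate for projection pushdown
           .group_by("customer_id")
           .agg(pl.col("amount").sum().alias("total_amount"))
     )

     print(lf.explain())                       # inspect the optimized query plan
     result = lf.collect(engine="streaming")   # execute in batches for large data
     print(result)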
  17. Polars – SQL Support
     ▪ Uses a SQLContext object to manage SQL queries
     ▪ Register LazyFrame and DataFrame objects from the global namespace or via a dictionary mapping
     ▪ SQL queries are always executed in lazy mode
     ▪ Polars translates SQL queries into expressions, which are then executed

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     with pl.SQLContext(register_globals=True, eager=True) as ctx:
         df_top = ctx.execute("SELECT * FROM df_emp LIMIT 5")
     print(df_top)
  18. Demo Use Case #1
     ▪ Analyzing the Singapore HDB resale dataset (data.gov.sg), ~1 million records from 1990 to date
     ▪ Get the min & max resale values for each flat type
     ▪ Use pivot & join to get min & max resale values for each town, decade (1990-1999, 2000-2009, 2010-2019, 2020-to date) & flat type (3RM, 4RM, 5RM, EXEC & MULTI-GEN)
     ▪ Pipeline: SG HDB Dataset → Filter Records → Data Standardization → Data Aggregation → Pivot & Join → Write Output (sketched below)
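     A rough Polars sketch of that pipeline, not the presenter's actual code: the file name, the column names (month, town, flat_type, resale_price), the "YYYY-MM" month format and the flat-type labels are assumptions about the data.gov.sg schema:

     import polars as pl

     lf = (
         pl.scan_csv("hdb_resale.csv")                     # hypothetical file name
           .filter(pl.col("flat_type").is_in(
               ["3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE", "MULTI-GENERATION"]))
           .with_columns(
               (pl.col("month").str.slice(0, 4).cast(pl.Int32) // 10 * 10)
               .alias("decade"))                           # standardize a decade column
           .group_by("town", "decade", "flat_type")
           .agg(
               pl.col("resale_price").min().alias("min_resale"),
               pl.col("resale_price").max().alias("max_resale"),
           )
     )
     agg = lf.collect()

     # Pivot min and max separately, then join them on town & decade.
     min_pivot = agg.pivot(on="flat_type", index=["town", "decade"], values="min_resale")
     max_pivot = agg.pivot(on="flat_type", index=["town", "decade"], values="max_resale")
     result = min_pivot.join(max_pivot, on=["town", "decade"], suffix="_max")
     result.write_parquet("hdb_min_max.parquet")           # hypothetical output path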
  19. Demo Use Case #2
     ▪ Generated synthetic customer data (14 columns) of various sizes (1, 5, 10, 25 & 50 million rows) using the ficto library
     ▪ Simple data pipeline (filter, standardize & aggregate) implemented in Polars, Pandas & PySpark

     Customer Data Processing – execution time in seconds (lower is better):

     Dataset Size    Polars    Pandas    PySpark
     1 Million         0.62      1.90      15.90
     5 Million         1.67      8.71      12.82
     10 Million        3.16     17.00      15.44
     25 Million        7.48     51.25      26.29
     50 Million       51.70    134.39      47.26

     Environment: Windows 11, i7 processor, 8 cores & 16 GB RAM; Polars v1.27.1, Pandas v2.2.0, PySpark v3.5.0