Slides from my talk at PyCon US 2025, about PyArrow -- what it is, how it is related to Pandas, how we can use it now, and how we'll use it in the future.
• Corporate training • Online courses at LernerPython.com • Newsletters, including Bamboo Weekly (Pandas puzzles on current events) • YouTube I teach Python and Pandas! 2
• Do you have the same code on multiple lines? • Don't repeat yourself: Use a loop! • Do you have the same code in multiple places in a program? • Don't repeat yourself: Use a function! DRY: Don’t repeat yourself! 4
• Do you have the same code in multiple programs? • Don’t repeat yourself: Use a library • In Python, we call this a "module" or “package” • A module helps the future you • It also helps other people avoid repeating your solution DRY: Don’t repeat yourself! 5
• Don’t implement your own data-analysis routines • If you use Pandas, the hard stuff is done for you • Reading data • Cleaning data • Analyzing data • Visualizing data • Writing data • Pandas is extremely convenient — and also popular Pandas 6
• Wes McKinney invented Pandas in 2008 • He built it on top of NumPy • Stable • Fast • Handles 1D and 2D data • Numerous data types Pandas used a package, too 7
• Automatic vs. manual transmission • Pandas series • A wrapper around a 1D NumPy array • Pandas data frames • A wrapper around a 2D NumPy array • Or, if you prefer: A dictionary of Pandas series Pandas and NumPy 8
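A minimal sketch of the wrapper relationship described above, with made-up values — the classic (NumPy-backed) Series really does hand you back an ndarray:

```python
import numpy as np
import pandas as pd

# A classic (NumPy-backed) Series is a thin wrapper around a 1D ndarray
s = pd.Series([10, 20, 30])

# A DataFrame behaves like a dictionary of Series, one per column
df = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]})

print(type(s.to_numpy()))   # the underlying 1D NumPy array
print(type(df['x']))        # each column is itself a Series
```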
• Much less memory usage than Python • Vectorization • Lots of analysis methods • Used by many people and projects, so you know it’s stable Lots of good news
• Storing data in Pandas (via NumPy) uses lots of memory • Storage is row oriented, rather than column oriented • We store all of the data precisely as it is • No compression • No use of zero-copy techniques • Not designed for batch processing or streaming • No complex data types • Strings • Dates and times • Nested types The bad news 10
• Many languages and frameworks work with 2D data • What if we had a single library that everyone could rely on? • Don’t reinvent the wheel; use a single, working data structure • Use columns, rather than rows, for fast retrieval • Reduce the overhead of exchanging data among systems • Take advantage of modern processors • Arrow was first released in 2016 • Latest version, 20.0.0, was released in April Arrow: DRY for data 12
• Python bindings for Arrow • You can use PyArrow in your programs! • Create arrays (1D) and tables (2D) • Retrieve particular rows and columns • Sorting and grouping PyArrow 13
• Pandas is moving toward PyArrow • Some functionality is already here • Much more is coming in the future • Using PyArrow can save you time and memory • And get ready: It’ll be required in Pandas 3.0 The PyArrow revolution 16
• “read_csv” is very flexible, useful, and popular • Also: Can be very slow! • Speed up by specifying dtypes • Speed up by setting low_memory=False • … but it’s still really slow • Solution: Use PyArrow for reading/writing CSV files • How? Specify engine='pyarrow' We use lots of CSV files 18
• Good: • It’s 10x faster! Does anything else really matter?!? • PyArrow reads the whole thing; no more low_memory=False • PyArrow (usually) detects datetime columns, so there’s less need for parse_dates • Bad: • Some CSV files are too weird for PyArrow • If the file is small, then PyArrow isn’t worthwhile Differences 21
• Most data is in: • CSV (text-based, slow, poorly specified) • Excel (handles dtypes, slow, proprietary) • Arrow defined two new columnar, binary formats • Feather • Fast reads and writes • No compression • Parquet • Slower reads and writes • Highly compressed What formats do we use? 24
• The same data, in three formats: • CSV: 2.2G • Feather: 1.4G • Parquet: 379M • Not only smaller! • Much faster to load • Binary format • No dtype guessing/hints • Other systems/languages support them, too Size comparison 25
• Let’s load the same data, in three different formats: • CSV, Python engine: 55.8 s • CSV, PyArrow engine: 11.8 s • Feather: 10.6 s • Parquet: 9.1 s How much faster? 26
• You can use these formats today! • Do a one-time translation from CSV to Feather/Parquet • Then read from the binary format Store data in Feather/Parquet 27
• Choose a PyArrow dtype, rather than one from NumPy • Usually, that just means putting [pyarrow] after the name import numpy as np
from pandas import Series, DataFrame

s = Series(np.random.randint(-50, 50, 10),
           index=list('abcdefghij'),
           dtype='int64[pyarrow]')

df = DataFrame(np.random.randint(-50, 50, [3, 4]),
               index=list('abc'),
               columns=list('wxyz'),
               dtype='int64[pyarrow]') Using PyArrow on the back end 30
Summons Number int64[pyarrow] Plate ID string[pyarrow] Registration State string[pyarrow] Plate Type string[pyarrow] Issue Date string[pyarrow] Violation Code int64[pyarrow] Vehicle Body Type string[pyarrow] Vehicle Make string[pyarrow] Issuing Agency string[pyarrow] Street Code1 int64[pyarrow] Street Code2 int64[pyarrow] Street Code3 int64[pyarrow] Vehicle Expiration Date int64[pyarrow] Violation Location int64[pyarrow] Violation Precinct int64[pyarrow] Issuer Precinct int64[pyarrow] Issuer Code int64[pyarrow] Issuer Command string[pyarrow] Issuer Squad string[pyarrow] Violation Time string[pyarrow] Time First Observed string[pyarrow] Violation County string[pyarrow] Violation In Front Of Or Opposite string[pyarrow] House Number string[pyarrow] Street Name string[pyarrow] Intersecting Street string[pyarrow] Date First Observed int64[pyarrow] Law Section int64[pyarrow] Sub Division string[pyarrow] Violation Legal Code string[pyarrow] Days Parking In Effect string[pyarrow] From Hours In Effect string[pyarrow] To Hours In Effect string[pyarrow] Vehicle Color string[pyarrow] Unregistered Vehicle? int64[pyarrow] Vehicle Year int64[pyarrow] Meter Number string[pyarrow] Feet From Curb int64[pyarrow] Violation Post Code string[pyarrow] Violation Description string[pyarrow] No Standing or Stopping Violation null[pyarrow] Hydrant Violation null[pyarrow] Double Parking Violation null[pyarrow] df.dtypes 33
%timeit df_np['Vehicle Color'].value_counts().head(5) 216 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit df_pa['Vehicle Color'].value_counts().head(5) 107 ms ± 716 μs per loop (mean ± std. dev. of 7 runs, 10 loops each) • PyArrow is about 2x faster Top 5 values in a column 37
%timeit df_np['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5) 2 s ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit df_pa['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5) 365 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) • PyArrow is about 5.5x faster Searching in strings with regex=True 38
%timeit df_np.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5) 306 ms ± 7.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit df_pa.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5) 36.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) • PyArrow is about 10x faster Most common states with blue cars 39
%timeit df_np.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5) 205 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit df_pa.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5) 179 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Most common states in March/July 41
%timeit df_np.groupby('Registration State')['Feet From Curb'].mean() 263 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit df_pa.groupby('Registration State')['Feet From Curb'].mean() 173 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Grouping 42
import numpy as np
from pandas import Series

s = Series('hello out there'.split(), dtype='string[pyarrow]')
s.loc[1] = np.nan
s 0 hello 1 <NA> 2 there dtype: string Nullable types — not just ints 51
• There is another, separate way to use a backend that isn’t NumPy, namely “extension types” • Their main advantage: They’re nullable • Otherwise, they have the same issues as NumPy dtypes: • Row-oriented storage • Python strings • No compression • Not interoperable with other systems Different from extension types! 55
• In the future, strings will be handled by PyArrow • What if you want that now? pd.options.future.infer_string = True • Now, all of your strings will be in PyArrow! • Faster creation/loading time • Far less memory usage Want PyArrow strings without a PyArrow backend? 56
• You can, of course, use PyArrow directly • It’s a fast, smart, capable data structure • If and when you want, you can convert it to a Pandas data frame: t.to_pandas() • You can also import a data frame into PyArrow: pa.Table.from_pandas(df_pa) • Also, when our backend uses PyArrow: • s.values returns a PyArrow array • df_pa['column'].values returns a PyArrow array • df_pa.values returns a NumPy array, for compatibility purposes Using raw PyArrow 57
• Right now, Pandas is a powerful package • It’s becoming a powerful platform • Swappable back ends (NumPy and PyArrow) • It’s setting the standard for data-analysis APIs • Other libraries (e.g., Polars) are partly emulating it • It’s becoming something that other software can work with • Via PyArrow, it can exchange data with R and Apache Spark • In memory, DuckDB can query Pandas data frames The real Pandas revolution 58
• PyArrow is revolutionizing Pandas • Faster file loading today • Faster, more efficient back-end storage tomorrow • (Or you can try it today!) • Pandas is becoming a platform • PyArrow is part of that move • You’ll be able to choose how much complex efficiency you want vs. simple, inefficient clarity Summary 59
• Courses: https://LernerPython.com • YouTube: https://YouTube.com/reuvenlerner • Bluesky: https://bsky.app/profile/lernerpython.com • Deep-dive Pandas challenges: https://BambooWeekly.com • Stop by and say “hi” at my booth! Questions? 60