PyData Sofia May 2024 - Intro to Apache Arrow

Apache Arrow  Exploring the tech that powers the modern data
(science) stack Uwe Korn – QuantCo – May 2024

About me • Uwe Korn  https://mastodon.social/@xhochy / @xhochy  https://www.linkedin.com/in/uwekorn/ •
CTO at Data Science startup QuantCo • Previously worked as a Data Engineer • A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge • PyData Südwest Co-Organizer

Agenda 1. Why do we need this? 2. What is
it? 3. What’s its impact?

Why do we need this? • Di ff erent Ecosystems
• PyData / R space • Java/Scala „Big Data“ • SQL Databases • Di ff erent technologies • Pandas / SQLite

Why solve it? • We build pipelines to move data
• We want to use all tools we can leverage • Avoid working on converters or waiting for the data to be converted

Introducing Apache Arrow • Columnar representation of data in main
memory • Provide libraries to access the data structures • Building blocks for various ecosystems to use them • Implements adopters for existing structures

Columnar?

All the languages! 1. „Pure“ implementations in  C++, Java, Go,
JavaScript, C#, Rust, Julia, Swift, C(nanoarrow) 2. Wrappers on-top of them in  Python, R, Ruby, C/GLib, Matlab

There is a social component 1. A standard is only
as good as its usage 2. Di ff erent communities came together to form Arrow 3. Nowadays even more use it to connect

Arrow Basics 1. Array: a sequence of values of the
same type in contiguous bu ff ers 2. ChunkedArray: a sequence of arrays of the same type 3. Table: a sorted dictionary of ChunkedArrays of the same length

Arrow Basics: valid masks 1. Track null_count per Array 2.
Each array has a bu ff er of bits indicating whether a value is valid,  i.e. non-null

Arrow Basics: int array Python array: [1, null, 2, 4,
8] Length: 5, Null count: 1 Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00011101 | 0 (padding) | Value Buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 | |-------------|-------------|-------------|-------------|-------------|-----------------------| | 1 | unspecified | 2 | 4 | 8 | unspecified (padding) |

Arrow Basics: string array Python array: ['joe', null, null, 'mark']
Length: 4, Null count: 2 Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001001 | 0 (padding) | Offsets buffer: | Bytes 0-19 | Bytes 20-63 | |----------------|-----------------------| | 0, 3, 3, 3, 7 | unspecified (padding) | Value buffer: | Bytes 0-6 | Bytes 7-63 | |----------------|-----------------------| | joemark | unspecified (padding) |

Impact! Arrow is now used in all „edges“ where data
passes through: • Databases, either in clients or in UDFs • Data Engineering tooling • Machine Learning libraries • Dashboarding and BI applications

Examples of Arrow’s massive Impact If it ain’t a 10x-100x+
speedup, it ain’t worth it.

Parquet

Anatomy of a Parquet fi le

Parquet 1. This was the fi rst exposure of Arrow
to the Python world 2. End-users only see pandas.read_parquet 3. Actually, it is: A. C++ Parquet->Arrow reader B. C++ Pandas<->Arrow Adapter C. Small Python shim to connect both and give a nice API

DuckDB Interop 1. Load data in Arrow 2. Process in
DuckDB 3. Convert back to Arrow 4. Hand over to another tool All the above happened without any serialization overhead

DuckDB Interop

Fast Database Access

Fast Database Access Nowadays, you get even more speed with
• ADBC – Arrow DataBase Connector • arrow-odbc

Should you use Arrow? 1. Actually, No. 2. Not directly,
but make sure it is used in the backend. 3. If you need performance, but the current exchange is slow; then dive deeper. 4. If you want to write high-performance, framework-agnostic code.

The ecosystem …and many more.…  https://arrow.apache.org/powered_by/

Questions? Follow me:  https://mastodon.social/@xhochy / @xhochy  https://www.linkedin.com/in/uwekorn/

PyData Sofia May 2024 - Intro to Apache Arrow

PyData Sofia May 2024 - Intro to Apache Arrow

Uwe L. Korn

More Decks by Uwe L. Korn

Other Decks in Programming

Featured

Transcript

Apache Arrow  Exploring the tech that powers the modern data

About me • Uwe Korn  https://mastodon.social/@xhochy / @xhochy  https://www.linkedin.com/in/uwekorn/ •

Agenda 1. Why do we need this? 2. What is

Why do we need this? • Di ff erent Ecosystems

Why solve it? • We build pipelines to move data

Introducing Apache Arrow • Columnar representation of data in main

Columnar?

All the languages! 1. „Pure“ implementations in  C++, Java, Go,

There is a social component 1. A standard is only

Arrow Basics 1. Array: a sequence of values of the

Arrow Basics: valid masks 1. Track null_count per Array 2.

Arrow Basics: int array Python array: [1, null, 2, 4,

Arrow Basics: string array Python array: ['joe', null, null, 'mark']

Impact! Arrow is now used in all „edges“ where data

Examples of Arrow’s massive Impact If it ain’t a 10x-100x+

Parquet

Anatomy of a Parquet fi le

Parquet 1. This was the fi rst exposure of Arrow

DuckDB Interop 1. Load data in Arrow 2. Process in

DuckDB Interop

Fast Database Access

Fast Database Access Nowadays, you get even more speed with

Should you use Arrow? 1. Actually, No. 2. Not directly,

The ecosystem …and many more.…  https://arrow.apache.org/powered_by/

Questions? Follow me:  https://mastodon.social/@xhochy / @xhochy  https://www.linkedin.com/in/uwekorn/