CTO at Data Science startup QuantCo • Previously worked as a Data Engineer • A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge • PyData Südwest Co-Organizer
memory • Provide libraries to access the data structures • Building blocks for various ecosystems to use them • Implements adapters for existing structures
same type in contiguous buffers 2. ChunkedArray: a sequence of arrays of the same type 3. Table: a sorted dictionary of ChunkedArrays of the same length
to the Python world 2. End-users only see pandas.read_parquet 3. Actually, it is: A. C++ Parquet->Arrow reader B. C++ Pandas<->Arrow Adapter C. Small Python shim to connect both and give a nice API
but make sure it is used in the backend. 3. If you need performance but the current exchange is slow, then dive deeper. 4. If you want to write high-performance, framework-agnostic code.