

PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet

Uwe L. Korn

October 25, 2017



Transcript

  1. Connecting PyData to other Big Data Landscapes using Arrow and Parquet
     Uwe L. Korn, PyCon.DE 2017
  2. About me
     • Data Scientist & Architect at Blue Yonder (@BlueYonderTech)
     • Apache {Arrow, Parquet} PMC
     • Work in Python, Cython, C++11 and SQL
     • Heavy Pandas user
     • xhochy, [email protected]
  3. Python is a good companion for a Data Scientist …but there are other ecosystems out there.
  4. Why do I care?
     • Large set of files on a distributed filesystem
     • Non-uniform schema
     • Execute a query (not in Python)
     • Only a subset is interesting
  5. All are amazing, but… how do I get my data out of Python and back in again?
     Use Parquet! …but there was no fast Parquet access from Python two years ago.
  6. A general problem
     • Great interoperability inside ecosystems, often based on a common backend (e.g. NumPy)
     • Poor integration with other systems: CSV is your only resort
     • „We need to talk!“
     • Memory copy runs at about 10 GiB/s, and (de-)serialisation comes on top
  7. About Parquet
     1. Columnar on-disk storage format
     2. Started in fall 2012 by Cloudera & Twitter
     3. July 2013: 1.0 release
     4. Top-level Apache project
     5. Fall 2016: Python & C++ support
     6. State-of-the-art format in the Hadoop ecosystem, often used as the default I/O option
  8. Why use Parquet?
     1. Columnar format → vectorized operations
     2. Efficient encodings and compressions → small size without the need for a fat CPU
     3. Predicate push-down → bring computation to the I/O layer
     4. Language-independent format → libs in Java / Scala / C++ / Python / …
  9. Compression
     1. Shrinks data size independent of its content
     2. More CPU-intensive than encoding
     3. Encoding plus compression performs better than compression alone, at lower CPU cost
     4. LZO, Snappy, GZIP, Brotli → if in doubt, use Snappy
     5. Example sizes: GZIP 174 MiB (11 %), Snappy 216 MiB (14 %)
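To make the codec choice concrete, here is a minimal sketch of writing the same table with Snappy and GZIP using pyarrow; the data and file names are made up for illustration and are not the 174 MiB / 216 MiB dataset from the slide.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data; the slide's size comparison used a much larger dataset.
df = pd.DataFrame({"product": ["a", "b", "c"] * 1000,
                   "price": [1.0, 2.5, 3.75] * 1000})
table = pa.Table.from_pandas(df)

# Snappy: fast with moderate compression (the "if in doubt" default above).
pq.write_table(table, "data_snappy.parquet", compression="snappy")

# GZIP: smaller files at a noticeably higher CPU cost.
pq.write_table(table, "data_gzip.parquet", compression="gzip")
```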
  10. Predicate pushdown
      1. Only load the data that is used: skip columns that are not needed and skip (chunks of) rows that are not relevant
      2. Saves I/O load, as the data is not transferred
      3. Saves CPU, as the data is not decoded
      Example question: which products are sold in $?
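As a hedged sketch of what this looks like from Python with pyarrow: column projection skips unneeded columns, and row-group filters skip chunks whose statistics cannot match the predicate. The file and column names are illustrative, and the `filters` argument requires a newer pyarrow release than existed at the time of the talk.

```python
import pyarrow.parquet as pq

# Column projection: only the listed columns are read and decoded.
table = pq.read_table("sales.parquet", columns=["product", "currency"])

# Row-group filtering: chunks whose min/max statistics rule out the
# predicate are skipped, answering "which products are sold in $?"
dollar_sales = pq.read_table(
    "sales.parquet",
    columns=["product"],
    filters=[("currency", "=", "$")],
)
```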
  11. Read & Write Parquet
      Pandas 0.21 will bring pd.read_parquet(…) and df.to_parquet(…).
      http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
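A minimal sketch of that API; the file name is a placeholder and a Parquet engine such as pyarrow is assumed to be installed.

```python
import pandas as pd

df = pd.DataFrame({"product": ["a", "b"], "price": [1.0, 2.5]})

# Write and read Parquet directly from pandas (0.21+), with pyarrow underneath.
df.to_parquet("example.parquet", engine="pyarrow")
roundtripped = pd.read_parquet("example.parquet", engine="pyarrow")

assert roundtripped.equals(df)
```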
  12. Apache Arrow
      • Specification for an in-memory columnar data layout
      • No overhead for cross-system communication
      • Designed for efficiency (exploits SIMD, cache locality, …)
      • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript and the JVM
      • This brought Parquet to Pandas without any Python code in parquet-cpp
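A small sketch of the pandas/Arrow exchange with pyarrow; the data is made up, and this conversion is the building block the Parquet and IPC paths below rely on.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# pandas -> Arrow: columns become typed Arrow arrays under a schema.
table = pa.Table.from_pandas(df)
print(table.schema)

# Arrow -> pandas: convert back for analysis.
df_again = table.to_pandas()
```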
  13. Dissecting Arrow C++
      • General zero-copy memory management, with jemalloc as the base allocator
      • Columnar memory format & metadata: Schema & DataType, Columns & Table
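The same Schema, DataType and Table building blocks are exposed through the Python bindings; a sketch with invented column names and values:

```python
import pyarrow as pa

# Schema & DataType describe the columnar layout.
schema = pa.schema([
    pa.field("product", pa.string()),
    pa.field("price", pa.float64()),
])

# Columns & Table: typed arrays assembled under that schema.
table = pa.Table.from_arrays(
    [pa.array(["a", "b"]), pa.array([1.0, 2.5])],
    schema=schema,
)
print(table.num_rows, table.schema)
```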
  14. Dissecting Arrow C++
      • Structured data IPC (inter-process communication)
        • used in Spark for the JVM <-> Python exchange
        • future extensions include a gRPC backend, shared-memory communication, …
      • Columnar in-memory analytics
        • will be the backbone of Pandas 2.0
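A sketch of the IPC stream format from Python; the API names follow current pyarrow (older releases spelled this RecordBatchStreamWriter/Reader), and the data is invented.

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Serialize record batches into the Arrow IPC stream format in memory.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
buf = sink.getvalue()

# Another process (or another language) can read the same bytes back
# without any row-wise conversion step.
reader = pa.ipc.open_stream(buf)
received = reader.read_all()
assert received.equals(table)
```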
  15. Converting 1 million longs from Spark to PySpark with Arrow: 0.05 s
      https://github.com/apache/spark/pull/15821#issuecomment-282175163
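The linked pull request became the Arrow-backed toPandas() path in Spark 2.3; a hedged sketch of switching it on (the configuration key was later renamed to spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Enable the Arrow-based columnar transfer for toPandas() (Spark 2.3+).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1000000)   # 1 million longs
pdf = df.toPandas()         # transferred as Arrow record batches
```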
  16. Apache Arrow – Real-life improvement
      Real-life example: retrieve a dataset from an MPP database and analyze it in Pandas.
      1. Run a query in the DB
      2. Pass the result in columnar form to the DB driver
      3. The ODBC layer transforms it into row-wise form
      4. Pandas makes it columnar again
      Ugly real-life workaround: export as CSV and bypass ODBC.
  17. Apache Arrow – Real-life improvement
      Better solution: Turbodbc with Arrow support
      1. Retrieve columnar results
      2. Pass them in a columnar fashion to Pandas
      More systems will follow in the future (without the ODBC overhead).
      See also Michael’s talk tomorrow: “Turbodbc: Turbocharged database access for data scientists”.
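A sketch of the columnar fetch with turbodbc, assuming it was built with Arrow support; the DSN and query are placeholders.

```python
import turbodbc

# DSN and query are placeholders for a real ODBC data source.
connection = turbodbc.connect(dsn="my_database")
cursor = connection.cursor()
cursor.execute("SELECT product, price FROM sales")

# Fetch the result set as an Arrow table (columnar, no row-wise detour) ...
table = cursor.fetchallarrow()

# ... and hand it to pandas column by column.
df = table.to_pandas()
```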
  18. Get involved!
      Apache Arrow – cross-language DataFrame library
      • Website: https://arrow.apache.org/
      • ML: [email protected]
      • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
      • Slack: https://apachearrowslackin.herokuapp.com/
      • GitHub: https://github.com/apache/arrow
      Apache Parquet – famous columnar file format
      • Website: https://parquet.apache.org/
      • ML: [email protected]
      • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
      • Slack: https://parquet-slack-invite.herokuapp.com/
      • GitHub: https://github.com/apache/parquet-cpp
  19. Blue Yonder – Best decisions, delivered daily
      Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0
      Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360
      Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA