PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet


Uwe L. Korn

October 25, 2017

Transcript

  1. Connecting PyData to other Big Data Landscapes using Arrow and Parquet – Uwe L. Korn, PyCon.DE 2017
  2. About me • Data Scientist & Architect at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Work in Python, Cython, C++11 and SQL • Heavy Pandas user • xhochy • [email protected]
  3. Python is a good companion for a Data Scientist …but there are other ecosystems out there.
  4. Why do I care? • Large set of files on a distributed filesystem • Non-uniform schema • Execute a query, but not in Python • Only a subset is interesting
  5. All are amazing, but… how do I get my data out of Python and back in again? Use Parquet! …but there was no fast Parquet access from Python two years ago.
  6. A general problem • Great interoperability inside ecosystems • Often based on a common backend (e.g. NumPy) • Poor integration with other systems • CSV is your only resort • „We need to talk!“ • Memory copy runs at about 10 GiB/s • (De-)serialisation comes on top
  7. About Parquet 1. Columnar on-disk storage format 2. Started in fall 2012 by Cloudera & Twitter 3. July 2013: 1.0 release 4. 2015: top-level Apache project 5. Fall 2016: Python & C++ support 6. State-of-the-art format in the Hadoop ecosystem, often used as the default I/O option
  8. Why use Parquet? 1. Columnar format → vectorized operations 2. Efficient encodings and compression → small size without the need for a fat CPU 3. Predicate push-down → bring computation to the I/O layer 4. Language-independent format → libs in Java / Scala / C++ / Python / … (a minimal write/read sketch follows below)
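
     To make the language-independence point concrete, here is a minimal, hedged sketch of writing a
     pandas DataFrame to Parquet with pyarrow and reading it back; the file and column names are made up
     for illustration, and the resulting file could just as well be opened by the Java/Scala libraries or Spark.

         import pandas as pd
         import pyarrow as pa
         import pyarrow.parquet as pq

         # Build a small DataFrame and convert it into an Arrow Table.
         df = pd.DataFrame({"product": ["apple", "pear"], "units_sold": [3, 5]})
         table = pa.Table.from_pandas(df)

         # Write it as a Parquet file; the same file can be read from
         # Java/Scala (parquet-mr), Spark, Impala, Drill, ...
         pq.write_table(table, "sales.parquet")

         # ...and read it back into pandas.
         df_roundtrip = pq.read_table("sales.parquet").to_pandas()
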
  9. Compression 1. Shrinks data size independent of its content 2. More CPU-intensive than encoding 3. Encoding plus compression performs better than compression alone, at lower CPU cost 4. LZO, Snappy, GZIP, Brotli → if in doubt, use Snappy 5. GZIP: 174 MiB (11 %), Snappy: 216 MiB (14 %) (see the codec sketch below)
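
     To illustrate choosing a codec, a small sketch using pyarrow's compression argument; the data and
     file names are invented, so the sizes will not reproduce the 174 MiB / 216 MiB figures from the slide.

         import pyarrow as pa
         import pyarrow.parquet as pq

         # A toy column; any real dataset works the same way.
         table = pa.Table.from_arrays([pa.array(list(range(100000)))], names=["value"])

         # Snappy: fast and light on the CPU -- the "if in doubt" default.
         pq.write_table(table, "data.snappy.parquet", compression="snappy")

         # GZIP: smaller files at a noticeably higher CPU cost.
         pq.write_table(table, "data.gzip.parquet", compression="gzip")
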
  10. Predicate pushdown 1. Only load the data that is used • skip columns that are not needed • skip (chunks of) rows that are not relevant 2. Saves I/O load, as the data is not transferred 3. Saves CPU, as the data is not decoded. Example query: Which products are sold in $? (a pyarrow sketch follows below)
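
     A sketch of the column-pruning half of predicate pushdown with pyarrow; the file and column names
     are assumptions carried over from the earlier sketch, and the row-group statistics in the footer are
     what a reader uses to skip irrelevant chunks.

         import pyarrow.parquet as pq

         # Column pruning: only the listed columns are read and decoded;
         # the bytes of all other columns never leave the storage layer.
         table = pq.read_table("sales.parquet", columns=["product"])

         # Row-group skipping is driven by the min/max statistics kept in
         # the file footer, available via the Parquet metadata.
         metadata = pq.ParquetFile("sales.parquet").metadata
         print(metadata.num_row_groups, metadata.row_group(0).num_rows)
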
  11. Read & Write Parquet: Pandas 0.21 will bring pd.read_parquet(…) and df.to_parquet(…) http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
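
     A hedged usage sketch of the new pandas 0.21 entry points; the explicit engine="pyarrow" choice and
     the file name are illustrative only.

         import pandas as pd

         df = pd.DataFrame({"product": ["apple", "pear"], "units_sold": [3, 5]})

         # New in pandas 0.21: Parquet I/O backed by pyarrow (or fastparquet).
         df.to_parquet("sales.parquet", engine="pyarrow")
         df_back = pd.read_parquet("sales.parquet", engine="pyarrow")
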
  12. Apache Arrow • Specification for an in-memory columnar data layout • No overhead for cross-system communication • Designed for efficiency (exploit SIMD, cache locality, …) • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript and the JVM • This brought Parquet to Pandas without any Python code in parquet-cpp
  13. Dissecting Arrow C++ • General zero-copy memory management • jemalloc as the base allocator • Columnar memory format & metadata • Schema & DataType • Columns & Table (see the sketch below)
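
     The Schema/DataType/Column/Table pieces map directly onto the Python bindings; a minimal sketch with
     made-up column names and values:

         import pyarrow as pa

         # Columns: plain Arrow arrays, each with an explicit DataType.
         ids = pa.array([1, 2, 3], type=pa.int64())
         names = pa.array(["a", "b", "c"], type=pa.string())

         # Table: a named collection of columns sharing one Schema.
         table = pa.Table.from_arrays([ids, names], names=["id", "name"])
         print(table.schema)    # id: int64, name: string
         print(table.num_rows)  # 3
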
  14. Dissecting Arrow C++ • Structured data IPC (inter-process communication) • used in Spark for JVM <-> Python • future extensions include: gRPC backend, shared-memory communication, … • Columnar in-memory analytics • will be the backbone of Pandas 2.0 (an IPC sketch follows below)
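
     A small sketch of the IPC piece using pyarrow's stream format; the in-memory buffer stands in for a
     socket or shared-memory segment, and the helper names (pa.ipc.new_stream / pa.ipc.open_stream) assume
     a reasonably recent pyarrow release.

         import pyarrow as pa

         table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

         # Serialize the table into the Arrow IPC stream format in memory.
         sink = pa.BufferOutputStream()
         writer = pa.ipc.new_stream(sink, table.schema)
         writer.write_table(table)
         writer.close()

         # Another process -- or another language -- can consume the very
         # same bytes without any per-value conversion.
         reader = pa.ipc.open_stream(sink.getvalue())
         assert reader.read_all().equals(table)
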
  15. Converting 1 million longs from Spark to PySpark with Arrow: 0.05 s https://github.com/apache/spark/pull/15821#issuecomment-282175163
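
     A hedged sketch of what the Arrow-accelerated path looks like from the PySpark side once the linked
     work has shipped; the configuration key below was introduced in Spark 2.3 (and later renamed to
     spark.sql.execution.arrow.pyspark.enabled), so treat it as an assumption about the Spark version.

         from pyspark.sql import SparkSession

         spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

         # Opt in to Arrow-based columnar transfer between the JVM and Python.
         spark.conf.set("spark.sql.execution.arrow.enabled", "true")

         # toPandas() now ships columnar Arrow batches instead of pickled rows.
         df = spark.range(1000 * 1000)
         pdf = df.toPandas()
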
  16. Apache Arrow – Real-life improvement. Real-life example: retrieve a dataset from an MPP database and analyze it in Pandas. 1. Run a query in the DB 2. Pass the result in columnar form to the DB driver 3. The ODBC layer transforms it into row-wise form 4. Pandas makes it columnar again. Ugly real-life workaround: export as CSV, bypass ODBC
  17. Apache Arrow – Real-life improvement. Better solution: Turbodbc with Arrow support 1. Retrieve columnar results 2. Pass them in a columnar fashion to Pandas. More systems to follow (without the ODBC overhead). See also Michael’s talk tomorrow: “Turbodbc: Turbocharged database access for data scientists” (a usage sketch follows below)
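
     A minimal usage sketch, assuming an already configured ODBC data source; the DSN, table and column
     names are placeholders.

         import turbodbc

         connection = turbodbc.connect(dsn="my_mpp_database")
         cursor = connection.cursor()
         cursor.execute("SELECT product, units_sold FROM sales")

         # Results arrive as a columnar pyarrow.Table -- no row-wise detour
         # through classic ODBC buffers before reaching pandas.
         table = cursor.fetchallarrow()
         df = table.to_pandas()
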
  18. Get involved!
      Apache Arrow – cross-language DataFrame library • Website: https://arrow.apache.org/ • ML: [email protected] • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW • Slack: https://apachearrowslackin.herokuapp.com/ • GitHub: https://github.com/apache/arrow
      Apache Parquet – famous columnar file format • Website: https://parquet.apache.org/ • ML: [email protected] • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET • Slack: https://parquet-slack-invite.herokuapp.com/ • GitHub: https://github.com/apache/parquet-cpp
  19. Blue Yonder – Best decisions, delivered daily. Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0 • Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360 • Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA