PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet

Apache Parquet is the most widely used columnar data format in the big data processing space and recently gained Pandas support. It leverages various techniques to store data in a CPU- and I/O-efficient way and provides capabilities to push queries down to the I/O layer. This talk shows how to use it in Python, details its structure, and presents its portable usage with other tools.

Uwe L. Korn

May 07, 2017
Transcript

  1. About me
     • Data Scientist at Blue Yonder (@BlueYonderTech)
     • Apache {Arrow, Parquet} PMC
     • Work in Python, Cython, C++11 and SQL
     • Heavy Pandas user
     • xhochy • [email protected]
  2. About Parquet
     1. Columnar on-disk storage format
     2. Started in fall 2012 by Cloudera & Twitter
     3. July 2013: 1.0 release
     4. Top-level Apache project
     5. Fall 2016: Python & C++ support
     6. State-of-the-art format in the Hadoop ecosystem, often used as the default I/O option
  3. Why use Parquet?
     1. Columnar format —> vectorized operations
     2. Efficient encodings and compressions —> small size without the need for a fat CPU
     3. Query push-down —> bring computation to the I/O layer
     4. Language-independent format —> libs in Java / Scala / C++ / Python / …
  4. Who uses Parquet?
     • Query engines: Hive, Impala, Drill, Presto, …
     • Frameworks: Spark, MapReduce, …
     • Pandas
     • Dask
  5. Encodings
     • Know the data
     • Exploit the knowledge
     • Cheaper than universal compression
     • Example dataset:
       • NYC TLC Trip Record data for January 2016
       • 1629 MiB as CSV
       • columns: bool(1), datetime(2), float(12), int(4)
       • Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
  6. Encodings — PLAIN
     • Simply write the binary representation to disk
     • Simple to read & write
     • Performance limited by I/O throughput
     • —> 1499 MiB
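A minimal sketch of what PLAIN means for a float64 column (an illustration of the idea, not the real parquet-cpp writer): every value is written back-to-back as its fixed-width little-endian binary representation, so the encoded size is simply 8 bytes per value.

```python
import struct

def plain_encode(values):
    """Write each float64 as its raw 8-byte little-endian representation."""
    return b"".join(struct.pack("<d", v) for v in values)

def plain_decode(buf):
    """Read the values back; trivially simple, limited only by I/O speed."""
    return [v for (v,) in struct.iter_unpack("<d", buf)]

column = [1.5, 2.25, 3.0]
encoded = plain_encode(column)
assert len(encoded) == 8 * len(column)  # no size reduction at all
assert plain_decode(encoded) == column
```

This is why PLAIN barely shrinks the example dataset: it trades zero CPU work for zero compression.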
  7. Encodings — RLE & Bit Packing
     • Bit packing: only use the necessary bits per value
     • Run-length encoding: store e.g. 378 repetitions of „12“ as a single run
     • Hybrid: dynamically choose the better of the two
     • Used for definition & repetition levels
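The two techniques above can be sketched in a few lines (a toy version in the spirit of Parquet's hybrid RLE/bit-packing encoding, not its exact wire format): a run of identical values is stored once with its repeat count, and small integers only need as many bits as their maximum value requires.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

# 378 repetitions of 12 collapse into a single pair.
runs = rle_encode([12] * 378)
assert runs == [(12, 378)]
assert rle_decode(runs) == [12] * 378

# Bit packing: values in [0, 7] need only 3 bits each instead of 32.
bit_width = max([5, 3, 7]).bit_length()
assert bit_width == 3
```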
  8. Encodings — Dictionary
     • PLAIN_DICTIONARY / RLE_DICTIONARY
     • Every value is assigned a code
     • Dictionary: store a map of code —> value
     • Data: store only the codes, use RLE on them
     • —> 329 MiB (22%)
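A sketch of the dictionary idea (illustrative only; real Parquet stores the dictionary in a separate page per column chunk): each distinct value gets an integer code, and the column itself stores only the codes, which then compress very well under RLE.

```python
def dict_encode(values):
    """Assign each distinct value a code; return (dictionary, codes)."""
    mapping = {}
    codes = []
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    # list(mapping) yields values in insertion order: index i <-> code i
    return list(mapping), codes

def dict_decode(dictionary, codes):
    """Look every code up in the dictionary to restore the column."""
    return [dictionary[c] for c in codes]

column = ["cash", "card", "cash", "cash", "card"]
dictionary, codes = dict_encode(column)
assert dictionary == ["cash", "card"]
assert codes == [0, 1, 0, 0, 1]   # repetitive codes: ideal RLE input
assert dict_decode(dictionary, codes) == column
```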
  9. Compression
     1. Shrinks the data independent of its content
     2. More CPU-intensive than encoding
     3. Encoding + compression performs better than compression alone, at lower CPU cost
     4. LZO, Snappy, GZIP, Brotli —> if in doubt: use Snappy
     5. GZIP: 174 MiB (11%), Snappy: 216 MiB (14%)
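The content-independence of general-purpose compression can be seen with the standard library (zlib stands in here for Parquet's GZIP codec; Snappy and Brotli make the same ratio-vs-CPU trade-off the slide's numbers show): repetitive row-oriented text shrinks dramatically, at the price of CPU time.

```python
import zlib

# Repetitive CSV-like bytes, roughly in the spirit of the TLC dataset.
raw = b"2016-01-01,12,1.5\n" * 10_000
compressed = zlib.compress(raw, level=6)  # level 6 is the zlib default

assert len(compressed) < len(raw) // 10  # far more than 10x smaller here
print(f"{len(raw)} -> {len(compressed)} bytes")
```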
  10. Query push-down
     1. Only load the data that is used
        1. skip columns that are not needed
        2. skip (chunks of) rows that are not relevant
     2. Saves I/O load, as the data is not transferred
     3. Saves CPU, as the data is not decoded
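The row-skipping half of push-down can be sketched with per-chunk statistics (a simplified model of Parquet row-group min/max statistics; the helper names are made up for illustration): a filter such as `value > 90` can prove, from the statistics alone, that a whole chunk contains no matches and never read or decode it.

```python
def make_chunks(values, size):
    """Split a column into chunks, keeping min/max statistics per chunk."""
    chunks = []
    for i in range(0, len(values), size):
        data = values[i:i + size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks

def read_greater_than(chunks, threshold):
    """Evaluate `value > threshold`, skipping chunks via statistics."""
    result = []
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # statistics prove no match: chunk is never decoded
        result.extend(v for v in chunk["data"] if v > threshold)
    return result

chunks = make_chunks(list(range(100)), size=10)  # 10 chunks of 10 values
assert read_greater_than(chunks, 90) == list(range(91, 100))
# Only the last chunk was actually scanned; the other nine were skipped.
```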
  11. Apache Arrow?
     • Specification for an in-memory columnar data layout
     • No overhead for cross-system communication
     • Designed for efficiency (exploit SIMD, cache locality, …)
     • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
     • This brought Parquet to Pandas without any Python code in parquet-cpp
     • Just released: 0.3
  12. Get Involved!
     • Apache Arrow — cross-language DataFrame library
       • Website: https://arrow.apache.org/
       • ML: [email protected]
       • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
       • Slack: https://apachearrowslackin.herokuapp.com/
       • GitHub mirror: https://github.com/apache/arrow
     • Apache Parquet — famous columnar file format
       • Website: https://parquet.apache.org/
       • ML: [email protected]
       • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
       • Slack: https://parquet-slack-invite.herokuapp.com/
       • C++ GitHub mirror: https://github.com/apache/parquet-cpp
  13. Blue Yonder — Best decisions, delivered daily
     • Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0
     • Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360
     • Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA