Blosc/bcolz: Compressing Beyond the Limits of Memory

Blosc is an extremely fast compressor, while bcolz is a columnar data container that supports compression. Together they can change the current rules of the game in data storage and processing.

FrancescAlted

November 22, 2015

Transcript

  1. Blosc / bcolz: Compressing Data Beyond the Limits of Memory. Francesc Alted, Freelance Consultant (Department of Geosciences, University of Oslo). Talk for PyConES 2015, Valencia, 2015
  2. About Me
    • Creator of libraries such as PyTables, Blosc and bcolz. I have maintained Numexpr for years.
    • Developer and instructor in areas such as:
      • Python (almost 15 years of experience)
      • High-performance computing and storage
    • Consultant on data processing projects
  3. The MovieLens Dataset
    • Datasets for movie ratings
    • Different sizes: 100K, 1M, 10M ratings (the 10M will be used in benchmarks ahead)
    • The datasets were collected over various periods of time
  4. Querying the MovieLens Dataset

    import pandas as pd
    import bcolz

    # Parse and load CSV files using pandas
    # Merge some files in a single dataframe
    lens = pd.merge(movies, ratings)

    # The pandas way of querying
    result = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']

    # Convert the dataframe into a compressed bcolz ctable
    zlens = bcolz.ctable.fromdataframe(lens)

    # The bcolz way of querying (notice the use of the `where` iterator)
    result = [r.user_id for r in zlens.where(
        "(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['user_id'])]
  5. What? Queries on compressed data running faster than on uncompressed data? Really? (Diagram labels: Data input, Data output, Decompression, Compression, Data process)
  6. See my article: “Why Modern CPUs Are Starving And What You Can Do About It”. Huge speed gap between CPUs and memory!
  7. Hierarchy of Memory by 2017 (Educated Guess). 9 levels will be common! (Diagram levels: L1, L2, L3, L4, RAM (addressable), XPoint (persistent), SSD PCIe (persistent), SSD SATA (persistent), HDD (persistent))
  8. The same data take less storage. Transmission + decompression faster than direct transfer? (Diagram labels: Original Dataset, Compressed Dataset, Persistent (disk) or ephemeral (RAM) storage, Disk or Memory Bus, Decompression, CPU Cache)
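The question on this slide can be turned into simple arithmetic. The sketch below is a back-of-the-envelope model, not something taken from the talk: it assumes that transfer and decompression happen one after the other (no overlap), and every function name and figure in it is hypothetical.

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to move a buffer over a bus or from disk at a given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def compressed_read_seconds(size_bytes, ratio, bandwidth_bytes_per_s,
                            decompress_bytes_per_s):
    """Time to move the compressed buffer and then decompress it.

    `ratio` is original_size / compressed_size; `decompress_bytes_per_s`
    is measured on the decompressed output.
    """
    transfer = transfer_seconds(size_bytes / ratio, bandwidth_bytes_per_s)
    decompress = size_bytes / decompress_bytes_per_s
    return transfer + decompress


def compression_wins(ratio, bandwidth_bytes_per_s, decompress_bytes_per_s):
    """True when compressed transfer + decompression beats direct transfer."""
    size = 1.0  # the comparison does not depend on the buffer size
    return (compressed_read_seconds(size, ratio, bandwidth_bytes_per_s,
                                    decompress_bytes_per_s)
            < transfer_seconds(size, bandwidth_bytes_per_s))


# Hypothetical figures, chosen only to exercise the formula.
GB = 1e9
slow_link = compression_wins(ratio=4, bandwidth_bytes_per_s=0.5 * GB,
                             decompress_bytes_per_s=2 * GB)
fast_link = compression_wins(ratio=4, bandwidth_bytes_per_s=10 * GB,
                             decompress_bytes_per_s=2 * GB)
print(slow_link, fast_link)
```

Under this model compression pays off whenever the decompressor is fast relative to the link: the slower the bus or disk, and the higher the compression ratio, the larger the gain.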
  9. Blosc Outstanding Features
    • Uses multi-threading
    • The shuffle part is accelerated using SSE2 and AVX2 (if available)
    • Supports different compressor backends: blosclz, lz4, snappy and zlib
    • Fine-tuned for using internal caches (mainly L1 and L2)
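The shuffle step mentioned above rearranges bytes so that the first byte of every item comes first, then the second byte of every item, and so on, which tends to give the compressor long runs of similar bytes. The following is a plain-Python sketch of that idea, using the standard library's zlib as a stand-in backend; it is not Blosc's SIMD implementation, and the helper names are hypothetical.

```python
import struct
import zlib


def shuffle(buf, itemsize):
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    assert len(buf) % itemsize == 0
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def unshuffle(buf, itemsize):
    """Invert shuffle(): interleave the byte planes back into items."""
    nitems = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * nitems:(i + 1) * nitems]
    return bytes(out)


# 10,000 consecutive 32-bit little-endian integers.
values = range(10000)
raw = struct.pack("<%dI" % len(values), *values)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffle(raw, 4)))
print(len(raw), plain_size, shuffled_size)
```

On slowly varying numeric data such as this, the shuffled buffer compresses to fewer bytes than the original one, because three of the four byte planes are almost constant.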
  10. Blosc: (de-)compressing faster than memory. Reads from Blosc chunks up to 5x faster than memcpy() (on synthetic data)
  11. Compression matters! “Blosc compressors are the fastest ones out there at this point; there is no better publicly available option that I'm aware of. That's not just ‘yet another compressor library’ case.” - Ivan Smirnov (advocating for Blosc inclusion in h5py)
  12. Blosc ecosystem: small, but with big impact (thanks mainly to PyTables/pandas). (Diagram: Blosc, PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, C / C++ world (e.g. OpenVDB))
  13. Blosc In OpenVDB And Houdini: “Blosc compresses almost as well as ZLIB, but it is much faster” - Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation
  14. What is bcolz?
    • Provides a storage layer that is both chunked and compressible
    • It is meant for both memory and persistent storage (disk)
    • Containers come in two flavors: carray (multidimensional, homogeneous arrays) and ctable (tabular data, made of carrays)
  15. carray: Multidimensional Container for Homogeneous Data. (Diagram: a NumPy container in contiguous memory next to a carray container made of chunk 1, chunk 2, ..., chunk N in discontiguous memory)
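A chunked, compressed container of the kind described on the last two slides can be sketched in a few lines. The class below is a toy written for this transcript, not bcolz's carray: it is one-dimensional, uses zlib rather than Blosc, and its name and attributes are hypothetical.

```python
import zlib
from array import array


class ChunkedArray:
    """Toy 1-D container of doubles stored as compressed chunks."""

    def __init__(self, chunklen=1024):
        self.chunklen = chunklen
        self.chunks = []            # compressed, full chunks
        self.tail = array("d")      # uncompressed leftover (< chunklen items)

    def append(self, values):
        """Add values; compress every chunk as soon as it is full."""
        self.tail.extend(values)
        while len(self.tail) >= self.chunklen:
            full = self.tail[:self.chunklen]
            self.chunks.append(zlib.compress(full.tobytes()))
            self.tail = self.tail[self.chunklen:]

    def __len__(self):
        return len(self.chunks) * self.chunklen + len(self.tail)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        nchunk, offset = divmod(index, self.chunklen)
        if nchunk == len(self.chunks):
            return self.tail[offset]
        # Only the chunk holding the requested item is decompressed.
        chunk = array("d")
        chunk.frombytes(zlib.decompress(self.chunks[nchunk]))
        return chunk[offset]


ca = ChunkedArray(chunklen=1000)
ca.append(float(i) for i in range(2500))
print(len(ca), ca[0], ca[1999], ca[2499])
```

Because the chunks are independent objects, appending never requires reallocating and copying one large contiguous buffer, which is the contrast with the NumPy container that the diagram draws.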
  16. “The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays” - Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group. Author of scikit-allel.
  17. The ctable Object (Diagram: columns stored as carrays split into chunks, with new rows to append at the end)
    • Chunks follow column order
    • Very efficient for querying
    • Adding or removing columns is cheap too
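The bullets above follow from storing each column separately. A minimal sketch, assuming plain Python lists as columns (no chunking or compression) and hypothetical class and method signatures, shows why a query reads only the columns it names and why adding or dropping a column leaves the others untouched.

```python
class ColumnTable:
    """Toy column-oriented table: one Python list per column."""

    def __init__(self, **columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = {name: list(col) for name, col in columns.items()}

    def __len__(self):
        for col in self.columns.values():
            return len(col)
        return 0

    def append(self, **row):
        """Append one row; each value goes to the end of its own column."""
        if set(row) != set(self.columns):
            raise ValueError("row must provide every column")
        for name, value in row.items():
            self.columns[name].append(value)

    def addcol(self, name, values):
        """Add a column without rewriting the existing ones."""
        values = list(values)
        if self.columns and len(values) != len(self):
            raise ValueError("new column has the wrong length")
        self.columns[name] = values

    def delcol(self, name):
        """Drop a column without rewriting the remaining ones."""
        del self.columns[name]

    def where(self, predicate, incols, outcols):
        """Yield tuples of `outcols` for rows where predicate(*incols) holds.

        Columns not named in `incols` or `outcols` are never read.
        """
        tests = zip(*(self.columns[c] for c in incols))
        outs = zip(*(self.columns[c] for c in outcols))
        for test, out in zip(tests, outs):
            if predicate(*test):
                yield out


t = ColumnTable(title=["A", "B", "A"], rating=[5, 3, 4], user_id=[1, 2, 3])
t.append(title="A", rating=5, user_id=4)
t.addcol("year", [1995, 1996, 1995, 1995])
hits = list(t.where(lambda title, rating: title == "A" and rating == 5,
                    incols=["title", "rating"], outcols=["user_id"]))
print(hits)
```

In a row-oriented layout the same query would have to step over every field of every row, and adding a column would mean rewriting all rows.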
  18. Persistency
    • carray and ctable objects can live on disk, not only in memory
    • bcolz allows every operation to be executed either in-memory or on-disk (out-of-core operations)
    • The recipe is to provide high performance iterators for carray and ctable, and then implement operations with these iterators
  19. bcolz And The Memory Hierarchical Model
    • All the components of bcolz (including Blosc) are designed with the memory hierarchy in mind to get the best performance
    • Basically, bcolz uses the blocking technique extensively so as to leverage the temporal and spatial localities all along the hierarchy
  20. Streaming analytics with bcolz. bcolz is meant to be simple: note the modular approach! (Diagram layers: map(), filter(), groupby(), sortby(), reduceby(), join() from itertools, Dask, bquery, …; bcolz iterators/filters with blocking: iter(), iterblocks(), where(), whereblocks(), __getitem__(); bcolz container (disk or memory))
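The layering in this slide can be imitated with ordinary generators. In the sketch below the container is just a list of zlib-compressed blocks, and `iterblocks` and `whereblocks` borrow their names from the slide but have hypothetical signatures; they are not bcolz's API. The point is only that generic tools (here `sum` and `itertools.chain`) can run on top of block iterators while a single block is decompressed at a time.

```python
import zlib
from array import array
from itertools import chain


def compress_blocks(values, blocklen):
    """Split values into blocks and compress each one (stand-in container)."""
    values = array("q", values)
    return [zlib.compress(values[i:i + blocklen].tobytes())
            for i in range(0, len(values), blocklen)]


def iterblocks(container):
    """Yield one decompressed block at a time, so memory use stays bounded."""
    for compressed in container:
        block = array("q")
        block.frombytes(zlib.decompress(compressed))
        yield block


def whereblocks(container, predicate):
    """Yield, for each block, only the items that satisfy the predicate."""
    for block in iterblocks(container):
        yield [x for x in block if predicate(x)]


container = compress_blocks(range(100000), blocklen=4096)

# A block-wise reduction: sum each block, then sum the partial results.
total = sum(sum(block) for block in iterblocks(container))

# A block-wise filter flattened into a plain stream of items.
evens = chain.from_iterable(whereblocks(container, lambda x: x % 2 == 0))
n_even = sum(1 for _ in evens)
print(total, n_even)
```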
  21. bquery - On-Disk GroupBy. In-memory (pandas) vs on-disk (bquery+bcolz) groupby. “Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” - Carst Vaartjes, co-founder visualfabriq
  22. Introducing Blosc2: next generation for Blosc. (Diagram: a Blosc2 header followed by Chunk 1 (Blosc1), Chunk 2 (Blosc1), Chunk 3 (Blosc1), ..., Chunk N (Blosc1))
  23. Blosc2
    • Blosc1 only works with fixed-length, equal-sized chunks (blocks)
    • This can lead to a poor use of space to accommodate variable-length data (potentially large zero-paddings)
    • Blosc2 addresses this shortcoming by using superchunks of variable-length chunks
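The padding cost described above is easy to quantify. The sketch below is one reading of the slide, assuming that each variable-length item must occupy a whole number of fixed-size chunks; the function names and item sizes are hypothetical.

```python
def padded_bytes(item_sizes, chunk_size):
    """Bytes used when every item fills a whole number of fixed-size chunks."""
    used = 0
    for size in item_sizes:
        nchunks = -(-size // chunk_size)   # ceiling division
        used += nchunks * chunk_size
    return used


def variable_bytes(item_sizes):
    """Bytes used when every chunk is exactly as long as its item."""
    return sum(item_sizes)


sizes = [10, 900, 33, 1024, 1]   # hypothetical item sizes, in bytes
fixed = padded_bytes(sizes, chunk_size=1024)
exact = variable_bytes(sizes)
print(fixed, exact, fixed - exact)
```

With these made-up sizes more than half of the fixed-size storage is padding before compression; a compressor will shrink runs of zeros, but they still have to be written and scanned.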
  24. ARM/NEON: a first-class citizen for Blosc2 (Plot panels: Not using NEON, Using NEON)
    • At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios in the market
    • Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM)
  25. Other planned features for Blosc2
    • Looking into inter-chunk redundancies (delta filter)
    • Support for more codecs and filters
    • Serialized version of the super-chunk (disk, network)
    • …
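The slide does not say how the delta filter would work. As an illustration of what exploiting inter-chunk redundancy can mean, the sketch below assumes one possible scheme: XOR each chunk against a reference chunk before compressing, so that a near-copy turns into mostly zeros. The function names are hypothetical and zlib stands in for the codec.

```python
import random
import zlib


def delta_encode(chunk, reference):
    """XOR a chunk against a same-sized reference chunk."""
    assert len(chunk) == len(reference)
    return bytes(a ^ b for a, b in zip(chunk, reference))


# XOR is its own inverse, so decoding reuses the same operation.
delta_decode = delta_encode

# A reference chunk of pseudo-random (hence incompressible) bytes.
rng = random.Random(0)
reference = bytes(rng.randrange(256) for _ in range(4096))

# A second chunk that is a near-copy: only three bytes differ.
chunk = bytearray(reference)
for pos in (10, 500, 3000):
    chunk[pos] ^= 0xFF
chunk = bytes(chunk)

alone = len(zlib.compress(chunk))
with_delta = len(zlib.compress(delta_encode(chunk, reference)))
print(alone, with_delta)
```

Compressed on its own, the second chunk is as incompressible as the first; after the delta step it shrinks to a small fraction of its size, at the cost of needing the reference chunk to decode it.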
  26. Summary
    • Due to the evolution of modern architectures, compression can be effective for two reasons:
      • You can work with more data using the same resources
      • The cost of compression can be brought down to zero, and even beyond!