Blosc/bcolz: Compressing Beyond the Limits of Memory

Blosc / bcolz: Compressing Data Beyond the Limits of Memory
Francesc Alted, Freelance Consultant (Department of Geosciences, University of Oslo). Talk for PyConES 2015, Valencia, 2015

About Me • Creator of libraries such as PyTables, Blosc and bcolz. Maintainer of Numexpr for years. • Developer and instructor in areas such as: • Python (almost 15 years of experience) • High-performance computing and storage. • Consultant on data processing projects.

Motivation: The MovieLens Dataset. Materials at: https://github.com/FrancescAlted/PyConES2015

The MovieLens Dataset • Datasets for movie ratings • Different
sizes: 100K, 1M, 10M ratings (the 10M will be used in benchmarks ahead) • The datasets were collected over various periods of time

Querying the MovieLens Dataset
import pandas as pd
import bcolz

# Parse and load CSV files using pandas (the loading of `movies` and `ratings` is not shown on the slide)
# Merge some files in a single dataframe
lens = pd.merge(movies, ratings)

# The pandas way of querying
result = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']

# The bcolz way of querying (notice the use of the `where` iterator)
zlens = bcolz.ctable.fromdataframe(lens)
result = [r.user_id for r in zlens.where(
    "(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['user_id'])]

bcolz vs pandas (size) bcolz can store up to 20x more data than pandas

Query Times 3-year old laptop (Intel Ivy-Bridge, 2 cores) Compression
speeds things up

What? Queries on compressed data running faster than on uncompressed data? Really? (Diagram labels: Data input, Data output, Decompression, Compression, Data process)

Query Times 5-year old laptop (Intel Core2, 2 cores) Compression still slows things down

Why?

See my article: “Why Modern CPUs Are Starving And What You Can Do About It”. Huge speed gap between CPUs and memory!

Hierarchy of Memory By 2017 (Educated Guess) • L1 • L2 • L3 • L4 • RAM (addressable) • XPoint (persistent) • SSD PCIe (persistent) • SSD SATA (persistent) • HDD (persistent) 9 levels will be common!

How can compression help?

The same data take less storage. Transmission + decompression faster than direct transfer? (Diagram labels: Original Dataset, Compressed Dataset, Persistent (disk) or ephemeral (RAM) storage, Disk or Memory Bus, Decompression, CPU Cache)
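The question on this slide can be put into numbers with a very simple cost model. The sketch below is only an illustration: the function names and every figure in it (bandwidth, compression ratio, decompression speed) are assumptions chosen for the example, not measurements from the talk. It also assumes that transfer and decompression happen one after the other, with no overlap.

```python
def direct_transfer_time(nbytes, bandwidth):
    """Seconds needed to move nbytes of uncompressed data over a channel."""
    return nbytes / bandwidth


def compressed_transfer_time(nbytes, bandwidth, ratio, decompress_speed):
    """Seconds needed to move the compressed data and then decompress it.

    ratio is uncompressed size / compressed size; decompress_speed is
    measured in bytes of decompressed output per second.
    """
    transfer = (nbytes / ratio) / bandwidth
    decompress = nbytes / decompress_speed
    return transfer + decompress


# Hypothetical figures, chosen only to illustrate the trade-off.
nbytes = 1_000_000_000      # 1 GB dataset
bandwidth = 500e6           # 500 MB/s disk or memory bus (assumed)
ratio = 4.0                 # 4x compression ratio (assumed)
decompress_speed = 2e9      # 2 GB/s decompressor (assumed)

direct = direct_transfer_time(nbytes, bandwidth)
compressed = compressed_transfer_time(nbytes, bandwidth, ratio, decompress_speed)
print(f"direct: {direct:.2f} s, compressed: {compressed:.2f} s")
```

Under these assumed numbers the compressed path wins. With a slower decompressor or a lower compression ratio the balance tips the other way, which is consistent with the older-laptop result shown earlier.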

Getting to Know Blosc: A Compressor Designed for Modern CPUs

Blosc Outstanding Features • Uses multi-threading • The shuffle part is accelerated using SSE2 and AVX2 (if available) • Supports different compressor backends: blosclz, lz4, snappy and zlib • Fine-tuned for using internal caches (mainly L1 and L2)

Blosc: (de-)compressing faster than memory Reads from Blosc chunks up
to 5x faster than memcpy() (on synthetic data)

How Blosc Works Multithreading & SIMD at work! Figure attr: Valentin Haenel

How Shuffling Works
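Since the figure for this slide did not survive in the transcript, here is a minimal sketch of the general byte-shuffle idea, written with the Python standard library only. It is not Blosc's SIMD implementation: the `shuffle` and `unshuffle` helpers are hypothetical names, and zlib stands in for the codec. The idea is to regroup the bytes of fixed-size elements so that all first bytes sit together, then all second bytes, and so on, which tends to create long runs that a compressor can exploit.

```python
import struct
import zlib


def shuffle(data: bytes, typesize: int) -> bytes:
    """Group the i-th byte of every element together (byte transposition)."""
    assert len(data) % typesize == 0
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into elements."""
    assert len(data) % typesize == 0
    n = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * n:(i + 1) * n]
    return bytes(out)


# 10,000 consecutive 64-bit integers: only the low bytes of each element vary.
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}q", *values)
shuffled = shuffle(raw, typesize=8)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(plain_size, shuffled_size)
```

On this kind of slowly varying numeric data the shuffled buffer compresses to a smaller size than the original one. On data without that regularity the gain can shrink or vanish.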

Compression matters! “Blosc compressors are the fastest ones out there
at this point; there is no better publicly available option that I'm aware of. That's not just ‘yet another compressor library’ case.” — Ivan Smirnov (advocating for Blosc inclusion in h5py)

Blosc ecosystem Small, but with big impact (thanks mainly to PyTables/pandas) (Diagram labels: Blosc, PyTables, pandas, bcolz, Castra, h5py, Bloscpack, scikit-allel, bquery, C / C++ world (e.g. OpenVDB))

Blosc In OpenVDB And Houdini “Blosc compresses almost as well as ZLIB, but it is much faster” – Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation

What is bcolz? • Provides a storage layer that is both chunked and compressible • It is meant for both memory and persistent storage (disk) • Containers come in two flavors: carray (multidimensional, homogeneous arrays) and ctable (tabular data, made of carrays)

carray: Multidimensional Container for Homogeneous Data (Diagram: a NumPy container uses contiguous memory; a carray container is made of chunk 1, chunk 2, …, chunk N in discontiguous memory)
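The contrast in this diagram can be sketched with a toy container. The code below is not bcolz's carray: `ChunkedArray` is a hypothetical class, zlib replaces Blosc, and only one-dimensional float64 data is handled. It just shows the shape of the idea, where data is appended into independent compressed chunks and read back one chunk at a time.

```python
import zlib
from array import array


class ChunkedArray:
    """Toy append-only container of float64 values stored as compressed chunks."""

    def __init__(self, chunklen=1024):
        self.chunklen = chunklen
        self.chunks = []              # compressed, full chunks
        self.leftover = array("d")    # uncompressed tail, shorter than chunklen
        self.nitems = 0

    def append(self, values):
        """Append values, compressing every chunk as soon as it is full."""
        values = array("d", values)
        self.leftover.extend(values)
        self.nitems += len(values)
        while len(self.leftover) >= self.chunklen:
            chunk = self.leftover[:self.chunklen]
            self.chunks.append(zlib.compress(chunk.tobytes()))
            del self.leftover[:self.chunklen]

    def iterblocks(self):
        """Yield one decompressed chunk at a time, then the uncompressed tail."""
        for blob in self.chunks:
            block = array("d")
            block.frombytes(zlib.decompress(blob))
            yield block
        if self.leftover:
            yield self.leftover

    def __len__(self):
        return self.nitems

    def __iter__(self):
        for block in self.iterblocks():
            yield from block


ca = ChunkedArray(chunklen=1000)
ca.append(range(2500))
total = sum(ca)   # only one chunk is held decompressed at any moment
print(len(ca), len(ca.chunks), total)
```

Because chunks are independent, appending never requires reallocating or copying the existing data, unlike growing a single contiguous buffer.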

“The future for me clearly involves lots of block-wise processing of multidimensional bcolz carrays” – Alistair Miles, Head of Epidemiological Informatics for the Kwiatkowski group. Author of scikit-allel.

The ctable Object • Chunks follow column order • Very efficient for querying • Adding or removing columns is cheap too (Diagram labels: carray, chunk, new rows to append)
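Why a column-wise layout is efficient for querying can be illustrated with a toy column store. This is a sketch under stated assumptions, not the ctable implementation: `ColumnTable`, `append_rows` and the signature of its `where` method are hypothetical, zlib stands in for Blosc, and the sample rows are made up. The point is that a query only decompresses the columns it actually needs.

```python
import zlib
from array import array


class ColumnTable:
    """Toy column store: each column is a list of zlib-compressed chunks."""

    def __init__(self, typecodes, chunklen=4):
        self.typecodes = dict(typecodes)   # column name -> array typecode
        self.chunklen = chunklen
        self.columns = {name: [] for name in self.typecodes}
        self.nrows = 0

    def append_rows(self, rows):
        """Append a list of dicts, one compressed chunk per column and batch."""
        for start in range(0, len(rows), self.chunklen):
            batch = rows[start:start + self.chunklen]
            for name, code in self.typecodes.items():
                chunk = array(code, (row[name] for row in batch))
                self.columns[name].append(zlib.compress(chunk.tobytes()))
            self.nrows += len(batch)

    def _chunks(self, name):
        for blob in self.columns[name]:
            chunk = array(self.typecodes[name])
            chunk.frombytes(zlib.decompress(blob))
            yield chunk

    def where(self, predicate, incols, outcols):
        """Yield outcols values for rows where predicate(*incols values) holds.

        Only the columns named in incols and outcols are decompressed.
        """
        needed = list(dict.fromkeys(list(incols) + list(outcols)))
        for chunks in zip(*(self._chunks(name) for name in needed)):
            block = dict(zip(needed, chunks))
            for i in range(len(chunks[0])):
                if predicate(*(block[name][i] for name in incols)):
                    yield tuple(block[name][i] for name in outcols)


# Made-up sample rows, loosely shaped like the MovieLens ratings.
table = ColumnTable({"user_id": "q", "movie_id": "q", "rating": "d"})
table.append_rows([
    {"user_id": 1, "movie_id": 8, "rating": 5.0},
    {"user_id": 2, "movie_id": 8, "rating": 3.0},
    {"user_id": 3, "movie_id": 9, "rating": 5.0},
    {"user_id": 4, "movie_id": 8, "rating": 5.0},
    {"user_id": 5, "movie_id": 8, "rating": 5.0},
])
fans = [uid for (uid,) in table.where(
    lambda movie_id, rating: movie_id == 8 and rating == 5.0,
    incols=["movie_id", "rating"], outcols=["user_id"])]
print(fans)
```

In this layout adding a column amounts to adding one more entry to the `columns` dictionary, which is why column additions and removals are cheap in a column store.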

Persistency • carray and ctable objects can live on disk,
not only in memory • bcolz allows every operation to be executed either in-memory or on-disk (out-of-core operations) • The recipe is to provide high performance iterators for carray and ctable, and then implement operations with these iterators

bcolz And The Memory Hierarchical Model • All the components
of bcolz (including Blosc) are designed with the memory hierarchy in mind to get the best performance • Basically, bcolz uses the blocking technique extensively so as to leverage the temporal and spatial localities all along the hierarchy

Streaming analytics with bcolz bcolz is meant to be simple: note the modular approach! • bcolz container (disk or memory) • bcolz iterators/filters with blocking: iter(), iterblocks(), where(), whereblocks(), __getitem__() • itertools, Dask, bquery, … • map(), filter(), groupby(), sortby(), reduceby(), join()
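The modular approach on this slide can be sketched with plain Python generators. The helpers below borrow the names `iterblocks`, `where` and `reduceby` from the slide, but they are hypothetical stand-ins written for this example, not the bcolz, bquery or Dask functions, and the input data is synthetic. Each stage consumes the previous one lazily, so only one block of rows is held in memory at a time.

```python
from collections import defaultdict
from itertools import islice


def iterblocks(rows, blen):
    """Yield successive lists of at most blen rows from any iterable."""
    it = iter(rows)
    while True:
        block = list(islice(it, blen))
        if not block:
            return
        yield block


def where(blocks, predicate):
    """Stream the rows that satisfy predicate, one block at a time."""
    for block in blocks:
        yield from (row for row in block if predicate(row))


def reduceby(rows, key, value):
    """Group-wise sum computed in a single pass, keeping one total per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += value(row)
    return dict(totals)


# Synthetic (movie_id, rating) pairs standing in for a large on-disk table.
ratings = ((i % 3, float(i % 5 + 1)) for i in range(30))
blocks = iterblocks(ratings, blen=8)
high = where(blocks, lambda row: row[1] >= 4.0)
totals = reduceby(high, key=lambda row: row[0], value=lambda row: row[1])
print(totals)
```

Swapping the synthetic generator for an iterator over an on-disk container would leave the rest of the pipeline unchanged, which is the appeal of building operations on top of iterators.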

bquery - On-Disk GroupBy In-memory (pandas) vs on-disk (bquery+bcolz) groupby
“Switching to bcolz enabled us to have a much better scalable  architecture yet with near in-memory performance”  — Carst Vaartjes, co-founder visualfabriq

Introducing Blosc2 Next generation for Blosc Blosc2 Header Chunk 1
(Blosc1) Chunk 2 (Blosc1) Chunk 3 (Blosc1) Chunk N (Blosc1)

Blosc2 • Blosc1 only works with fixed-length, equal-sized chunks (blocks) • This can lead to a poor use of space when accommodating variable-length data (potentially large zero-paddings) • Blosc2 addresses this shortcoming by using superchunks of variable-length chunks
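The padding problem described above can be made concrete with a small calculation. The sketch below does not reproduce the Blosc2 format or API: `SuperChunk` and `padded_size` are hypothetical names, zlib stands in for the codec, and the record sizes are invented. It compares the uncompressed space needed when every record is padded to one fixed size against a container that stores each record at its own length.

```python
import zlib


class SuperChunk:
    """Toy container of independently compressed, variable-length chunks."""

    def __init__(self):
        self.chunks = []    # compressed payloads
        self.nbytes = []    # uncompressed length of each chunk

    def append(self, payload: bytes) -> int:
        """Compress and store one chunk; return its index."""
        self.chunks.append(zlib.compress(payload))
        self.nbytes.append(len(payload))
        return len(self.chunks) - 1

    def get(self, index: int) -> bytes:
        """Decompress and return a single chunk by index."""
        return zlib.decompress(self.chunks[index])


def padded_size(lengths, chunksize):
    """Bytes needed if every record is zero-padded to a fixed chunksize."""
    return len(lengths) * chunksize


# Invented variable-length records: 10, 1000 and 50 bytes.
records = [b"a" * 10, b"b" * 1000, b"c" * 50]
lengths = [len(r) for r in records]

sc = SuperChunk()
for record in records:
    sc.append(record)

fixed = padded_size(lengths, chunksize=max(lengths))
variable = sum(sc.nbytes)
print(fixed, variable)
```

With these invented sizes, fixed-size storage needs 3000 bytes before compression, of which 1940 are padding, against 1060 bytes for the variable-length layout.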

ARM/NEON: a first-class citizen for Blosc2 • At 3 GB/s, Blosc2 on ARM achieves one of the best bandwidth/Watt ratios in the market • Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM) (Plot panels: Not using NEON, Using NEON)

Other planned features for Blosc2 • Looking into inter-chunk redundancies (delta filter) • Support for more codecs and filters • Serialized version of the super-chunk (disk, network) …

Summary • Due to the evolution of modern architectures, compression can be effective for two reasons: • You can work with more data using the same resources • The cost of compression can be reduced to zero, and even beyond!

Questions? [email protected]
