into account not only the computer as it is, but the computer as it will be.” — My own rephrasing “No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.” — Isaac Asimov
of an idea. Not in the idea. There is not much left just from an idea.” “Real artists ship” –Seth Godin, writer Why Open Source Projects? • Nice way to realize yourself while helping others
for speed: storing and processing as much data as possible with your existing resources • Blosc & bcolz as examples of compressor and data containers for large datasets that follow the principles of the newer computer architectures
ns / ttrans (1 KB): ~100 ns Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms But essentially, a blocked data access is mandatory for speed! The slower the media, the larger the block that is worth to transmit
many data containers focused on blocking access • No silver bullet: we won’t be able to find a single container that makes everybody happy; it’s all about tradeoffs • With blocked access we can use persistent media (disk) as it is ephemeral (memory) and the other way around -> independency of media!
transmitted to the CPU Disk bus Decompression Disk CPU Cache Original Dataset Compressed Dataset Transmission + decompression faster than direct transfer?
2010’s Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current Mechanical disk Mechanical disk Mechanical disk Speed Capacity Solid state disk Main memory Level 3 cache Level 2 cache Level 1 cache Level 2 cache Level 1 cache Main memory Main memory CPU CPU (a) (b) (c) Central processing unit (CPU)
at this point; there is no better publicly available option that I'm aware of. That's not just ‘yet another compressor library’ case.” — Ivan Smirnov (advocating for Blosc inclusion in h5py)
be used in a similar way than the ones in NumPy, Pandas • The main difference is that data storage is chunked, not contiguous • Two flavors: • carray: homogenous, n-dim data types • ctable: heterogeneous types, columnar
Float64 Int16 String … String Int32 Float64 Int16 String … String Int32 Float64 Int16 Interesting column Interesting Data: N * 4 bytes (Int32) Actual Data Read: N * 4 bytes (Int32) In-Memory Column-Wise Table (bcolz ctable) }N rows Less memory travels to CPU! Less entropy so much more compressible!
“Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” — Carst Vaartjes, co-founder visualfabriq
the best bandwidth/Watt ratios in the market • Profound implications for the density of data storage devices (e.g. arrays of disks driven by ARM) Not using NEON Using NEON
the compression can be effective for two reasons: • We can work with more data using the same resources. • We can reduce the overhead of compression to near zero, and even beyond than that!