It is increasingly important to understand the architecture of computers in order to design efficient data structures (or containers) for hosting large datasets. The bcolz case.
of an idea. Not in the idea. There is not much left just from an idea.” “Real artists ship” –Seth Godin, writer Dreams And Reality • Doing Open Source is a nice way to fulfill yourself while helping others
as much data as possible with your existing resources • New trends in computer hardware • bcolz: an example of data container for large datasets following the principles of newer computer architectures
Big Data is a no go • Designing code for data storage performance depends very much on the computer’s architecture • IMO, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures
[Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current ... Layers shown across panels (a) to (c), ordered by speed and capacity: mechanical disk, solid state disk, main memory, level 3, level 2 and level 1 caches, central processing unit (CPU).]
ns / ttrans (1 KB): ~100 ns • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms • This has profound implications for how you access storage! The slower the media, the larger the block that should be transmitted
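To make the block-size rule concrete, here is a minimal sketch of a latency-plus-transfer cost model. It assumes (this is not stated in the slides) that transfer time scales linearly with block size, and it uses only the solid state disk and mechanical disk figures quoted above. The names `fetch_time` and `transfer_fraction` are hypothetical.

```python
# Figures quoted above, as (t_ref in s, t_trans in s, block size in bytes).
SSD = (10e-6, 10e-6, 4 * 1024)
DISK = (10e-3, 10e-3, 1024 * 1024)


def fetch_time(media, block):
    """Time to fetch one block: fixed latency plus transfer time.

    Assumption: transfer time grows linearly with the block size.
    """
    t_ref, t_trans, ref_block = media
    return t_ref + t_trans * block / ref_block


def transfer_fraction(media, block):
    """Fraction of the fetch time spent actually moving data."""
    t_ref, t_trans, ref_block = media
    transfer = t_trans * block / ref_block
    return transfer / (t_ref + transfer)


if __name__ == "__main__":
    for name, media in (("SSD", SSD), ("mechanical disk", DISK)):
        for block in (4 * 1024, 1024 * 1024):
            share = transfer_fraction(media, block)
            print(f"{name}, {block} byte block: {share:.2%} of time is transfer")
```

Under these assumptions, a 4 KB read from the mechanical disk spends well under 1% of its time moving data; the rest is latency, which is why slower media call for larger blocks.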
be used in a similar way as the ones in NumPy, Pandas, DyND or others • In bcolz data storage is chunked, not contiguous, and chunks can be compressed! • Two flavors: • carray: homogeneous types, n-dim data • ctable: heterogeneous types, columnar
in chunks, so why bother? • Efficient enlarging and shrinking • Compression is feasible • Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk)
[Figure: a carray to be enlarged (chunk 1, chunk 2) receives data to append, which is compressed into 2 new chunk(s).] Only compression on new data required! Blosc Less memory travels to CPU!
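The append behaviour described above can be illustrated with a toy container. This is only a sketch of the idea, not bcolz's implementation: it uses the standard library's zlib in place of Blosc, stores 64-bit integers only, and the class name `ChunkedArray` and its `compressions` counter are hypothetical.

```python
import zlib
from array import array


class ChunkedArray:
    """Toy append-only container of 64-bit ints stored as compressed chunks."""

    def __init__(self, chunklen=1024):
        self.chunklen = chunklen
        self.chunks = []            # full chunks, each a compressed bytes object
        self.leftover = array("q")  # uncompressed tail, shorter than chunklen
        self.compressions = 0       # how many chunks have been compressed so far

    def append(self, values):
        """Add values; existing chunks are never touched or recompressed."""
        self.leftover.extend(values)
        while len(self.leftover) >= self.chunklen:
            full = self.leftover[:self.chunklen]
            self.chunks.append(zlib.compress(full.tobytes()))
            self.compressions += 1
            self.leftover = self.leftover[self.chunklen:]

    def __len__(self):
        return len(self.chunks) * self.chunklen + len(self.leftover)

    def __iter__(self):
        # Decompress one chunk at a time, so only one chunk is expanded at once.
        for blob in self.chunks:
            chunk = array("q")
            chunk.frombytes(zlib.decompress(blob))
            yield from chunk
        yield from self.leftover
```

Appending 2500 values with `chunklen=1000` compresses two chunks and keeps 500 values in the tail; a later append of 500 more values compresses only one additional chunk.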
the CPU [Figure labels: Disk or Memory (RAM), Disk or Memory Bus, CPU Cache, Decompression, Original Dataset, Compressed Dataset.] Transmission + decompression faster than direct transfer?
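The question above can be turned into a rough break-even estimate. The sketch below assumes a simplified model that is not in the slides: transmission and decompression happen one after the other with no overlap, the bus has a fixed bandwidth, and the decompressor produces output at a fixed speed. All function names and numbers are hypothetical.

```python
def direct_time(nbytes, bandwidth):
    """Seconds to move nbytes of uncompressed data over the bus."""
    return nbytes / bandwidth


def compressed_time(nbytes, bandwidth, ratio, decomp_speed):
    """Seconds to move the compressed data, then decompress it.

    ratio is uncompressed size / compressed size (must be > 1 to help);
    decomp_speed is in bytes of decompressed output per second.
    """
    return nbytes / (ratio * bandwidth) + nbytes / decomp_speed


def min_decomp_speed(bandwidth, ratio):
    """Decompression speed above which the compressed path wins."""
    return bandwidth * ratio / (ratio - 1)


if __name__ == "__main__":
    nbytes = 1e9       # hypothetical 1 GB dataset
    bandwidth = 10e9   # hypothetical 10 GB/s bus
    ratio = 2.0        # hypothetical 2x compression
    print("direct:", direct_time(nbytes, bandwidth))
    print("compressed:", compressed_time(nbytes, bandwidth, ratio, 40e9))
    print("break-even speed:", min_decomp_speed(bandwidth, ratio))
```

In this model the compressed path wins only when the decompressor is faster than `bandwidth * ratio / (ratio - 1)`, so a higher compression ratio lowers the speed the decompressor has to reach.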
“Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” — Carst Vaartjes, co-founder visualfabriq
container that fits your needs already out there (NumPy, DyND, Pandas, PyTables, bcolz…) • Pay attention to hardware and software trends and make informed decisions about your current development (which, btw, will be deployed in the future :) • Compression is a useful feature, not only to store more data, but also to process data faster under the right conditions.
dominant factor in Computer Sciences. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” — My own paraphrase of a quote by Isaac Asimov