New Trends In Storing Large Data Silos With Python
Continued changes in computer architectures require that data containers be rethought in order to leverage them. In this talk, modern architectures are described and a new data container designed for them is introduced.
About UberResearch
• …services for funding and research institutions
• Over 20 development partners and clients globally, from the smallest non-profits to large government agencies
• Portfolio company of Digital Science (Macmillan Publishers), the younger sibling of the Nature Publishing Group
http://www.uberresearch.com/
“…of an idea. Not in the idea. There is not much left from just an idea.”
“Real artists ship.” – Seth Godin, writer
Why Free/Libre Projects?
• A nice way to realize yourself while helping others
• …much data as possible with your existing resources
• Recent trends in computer hardware
• bcolz: an example of a data container for large datasets that follows the principles of newer computer architectures
• …Big Data is a no-go
• Designing code for data storage performance depends very much on computer architecture
• IMO, existing Python libraries need more effort to get the most out of existing and future computer architectures
[Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model: CPU and main memory. (b) The most common current model: CPU, level 1 and level 2 caches, main memory, and a mechanical disk. (c) A newer model that adds a level 3 cache and a solid state disk between main memory and the mechanical disk. Speed grows toward the CPU; capacity grows toward the disk.]
Approximate reference (tref) and transmission (ttrans) times per storage layer:
• Main memory: tref: ~100 ns / ttrans (1 KB): ~100 ns
• Solid state disk: tref: 10 µs / ttrans (4 KB): ~10 µs
• Mechanical disk: tref: 10 ms / ttrans (1 MB): ~10 ms
This has profound implications on how you access storage! The slower the media, the larger the block that is worth transmitting.
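To see why larger blocks pay off on slower media, here is a back-of-the-envelope sketch in Python using the approximate figures above (the effective_bw helper and the chosen block sizes are just for illustration):

# Effective throughput when every access pays a reference time up front.
# Figures are the approximate tref/bandwidth values from the list above.
MEDIA = {
    #  name:            (tref in seconds, sustained bandwidth in bytes/s)
    "main memory":      (100e-9, 1024 / 100e-9),      # ~1 KB per ~100 ns
    "solid state disk": (10e-6, 4 * 1024 / 10e-6),    # ~4 KB per ~10 us
    "mechanical disk":  (10e-3, 2**20 / 10e-3),       # ~1 MB per ~10 ms
}

def effective_bw(tref, bw, block_size):
    """Bytes/s actually achieved for one access of `block_size` bytes."""
    return block_size / (tref + block_size / bw)

for name, (tref, bw) in MEDIA.items():
    for block in (2**10, 2**20):  # 1 KB vs. 1 MB blocks
        mb_s = effective_bw(tref, bw, block) / 2**20
        print("%-16s %8d B blocks -> %8.1f MB/s" % (name, block, mb_s))

On a mechanical disk, 1 KB blocks yield roughly 0.1 MB/s of effective throughput, while 1 MB blocks yield about 50 MB/s: the seek time dominates unless the block is large.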
• bcolz containers can be used in a similar way to the ones in NumPy and Pandas
• The main difference is that data storage is chunked, not contiguous!
• Two flavors:
  • carray: homogeneous, n-dimensional data types
  • ctable: heterogeneous types, columnar
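A minimal sketch of how the two flavors look in practice (carray and ctable are real bcolz names; the sizes and column names here are made up for illustration):

import numpy as np
import bcolz

# carray: homogeneous, chunked (and optionally compressed) container.
a = bcolz.carray(np.arange(1000000))
print(repr(a))  # shows nbytes vs. cbytes, i.e. the compression ratio

# ctable: heterogeneous, columnar container built from NumPy columns.
t = bcolz.ctable(columns=[np.arange(1000), np.linspace(0.0, 1.0, 1000)],
                 names=["i", "x"])
print(t["i"][:5])  # NumPy-like slicing, one column at a time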
Chunked storage: why bother?
• Efficient enlarging and shrinking (see the sketch after the figure below)
• Compression is possible
• Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk)
[Figure: Appending data to a carray. The carray to be enlarged holds chunk 1 and chunk 2; the data to append is compressed with Blosc into new chunk(s) added at the end. Only compression on new data is required! Less memory travels to the CPU!]
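A hedged sketch of the append path described above (assuming the bcolz carray API; the array sizes and compression parameters are illustrative):

import numpy as np
import bcolz

# Create a compressed carray; cparams selects the Blosc codec and level.
a = bcolz.carray(np.zeros(1000000),
                 cparams=bcolz.cparams(clevel=5, cname="blosclz"))

# Appending compresses only the new data into fresh chunk(s);
# the existing chunks are left untouched.
a.append(np.ones(100000))

# Shrinking is cheap as well: drop trailing items.
a.trim(50000)

print(len(a), a.nbytes, a.cbytes)  # logical size vs. compressed size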
[Figure: How Blosc brings data closer to the CPU. Instead of moving the original dataset from disk or memory (RAM) straight to the CPU, a compressed dataset travels over the bus and is decompressed on its way into the CPU cache. Transmission + decompression faster than direct transfer?]
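One way to check that question on your own machine is a rough timing sketch (using the python-blosc bindings; the dataset and parameters are arbitrary, and x.tobytes() adds a copy of its own, so treat the numbers as indicative only):

import time
import numpy as np
import blosc

x = np.arange(10000000, dtype=np.int64)  # very compressible data

t0 = time.time()
c = blosc.compress(x.tobytes(), typesize=8, clevel=5, cname="blosclz")
blosc.decompress(c)
t_blosc = time.time() - t0

t0 = time.time()
y = x.copy()  # a plain, uncompressed in-memory transfer
t_copy = time.time() - t0

print("ratio %.1fx  compress+decompress %.4fs  memcpy %.4fs"
      % (x.nbytes / float(len(c)), t_blosc, t_copy))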
“Switching to bcolz enabled us to have a much more scalable architecture, yet with near in-memory performance.” — Carst Vaartjes, co-founder of visualfabriq
Summary
• If an existing data container fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, PyTables, bcolz…)
• Pay attention to hardware and software trends and make informed decisions in your current developments (which, btw, will be deployed in the future :)
• Performance is needed for improving interactivity, so do not hesitate to optimize the hot spots in C if needed (via Cython or other means)
“It is change, continuing change, inevitable change, that is the dominant factor in Computer Sciences today. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” — Based on a quote by Isaac Asimov