is stored, computed, and visualized. • Provide open technologies for Data Integration on a massive scale. • Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
social data and unstructured data is knocking at the door, and we're starting to let it in. It's a scary place at the moment.” -- Unidentified bank IT executive, as quoted by “The American Banker”
and hence, it offers interactivity • Myth: “Python is slow, so why on earth would you use it for Big Data?” • Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations • ...and during this talk I will prove it!
use cases • However, it also has its own deficiencies: • It follows the Python evaluation order in complex expressions like (a * b) + c, materializing a full-size temporary for every intermediate result (see the sketch below) • It has no support for multiprocessors (except for BLAS computations)
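To make the first deficiency concrete, here is a minimal sketch (the variable names are illustrative, not from the slides) of what the Python evaluation order means for NumPy: each operator runs as a separate pass over memory, allocating a temporary array as large as the operands.

import numpy as np

N = 10*1000*1000
a = np.random.rand(N)
b = np.random.rand(N)
c = np.random.rand(N)

# NumPy evaluates (a * b) + c one operator at a time:
tmp = a * b        # first pass: a full N-element temporary is allocated
result = tmp + c   # second pass: only now can the addition proceed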
specialized virtual machine for evaluating expressions • It accelerates computations mainly through more efficient memory usage • It supports extremely easy-to-use multithreading (active by default)
1] with a step size of 2·10⁻⁷, using both NumPy and numexpr.
Note: use a single processor for numexpr:
numexpr.set_num_threads(1)
Polynomial: 0.25x³ + 0.75x² + 1.5x − 2
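A possible solution to the exercise, as a sketch (the variable names and the exact N are my choices, assuming the [-1, 1] range above):

import numpy as np
import numexpr

numexpr.set_num_threads(1)   # single processor, as the exercise asks

N = 10*1000*1000             # step size of ~2*10^-7 over [-1, 1]
x = np.linspace(-1, 1, N)

# NumPy: one temporary array per operator
y_numpy = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2

# numexpr: the whole expression is evaluated in one blocked pass
y_numexpr = numexpr.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")

assert np.allclose(y_numpy, y_numexpr)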
largely able to execute code faster than we can feed them with instructions and data.” – Richard Sites, from his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996
is much slower (between 250x and 500x) than processors. • Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (by between 30x and 100x).
Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.
Use this extensively to leverage spatial and temporal locality: when accessing disk or memory, fetch a contiguous block that fits in the CPU cache, operate upon it, and reuse it as much as possible.
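A minimal sketch of the blocking technique in pure Python/NumPy (the function name and block size are illustrative; a good block is one that fits comfortably in cache):

import numpy as np

def blocked_sum_of_squares(a, block_len=64*1024):
    # Walk the array in contiguous cache-sized chunks and reuse each
    # chunk while it is still hot in cache.
    total = 0.0
    for start in range(0, len(a), block_len):
        block = a[start:start + block_len]   # contiguous slice (a view)
        total += np.dot(block, block)        # operate on the cached block
    return total

a = np.random.rand(10*1000*1000)
assert np.isclose(blocked_sum_of_squares(a), np.dot(a, a))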
is evaluated as:

for i in range(N):
    c[i] = a[i] * b[i]

• In particular, it cannot deal with things like:

for i in range(N):
    c[i] = a[i-1] + a[i] * b[i]
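As a sketch of where that boundary lies (the operand names are my own): element-wise expressions map directly onto numexpr's virtual machine, while indexed accesses have no numexpr syntax. Since the example above reads only a and b (not c), it is a shifted access rather than a true recurrence, so one workaround is to pass sliced views as separate operands:

import numpy as np
import numexpr as ne

N = 1000*1000
a = np.random.rand(N)
b = np.random.rand(N)

# Element-wise: fine for numexpr
c = ne.evaluate("a * b")

# c[i] = a[i-1] + a[i]*b[i] cannot be written inside one expression,
# but shifted views can be passed in as independent operands:
c_tail = ne.evaluate("a_prev + a_cur * b_cur",
                     local_dict={"a_prev": a[:-1],
                                 "a_cur": a[1:],
                                 "b_cur": b[1:]})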
can translate a subset of the Python language into machine code • It uses the LLVM infrastructure behind the scenes • It can achieve performance similar to, or better than, numexpr, but with more flexibility
import numpy as np
import numba as nb

N = 10*1000*1000

x = np.linspace(-1, 1, N)
y = np.empty(N, dtype=np.float64)

@nb.jit(arg_types=[nb.f8[:], nb.f8[:]])   # JIT-compile for two float64 arrays
def poly(x, y):
    for i in range(N):
        # expanded form: y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
        # factorized (Horner) form avoids computing the powers explicitly:
        y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

poly(x, y)  # run through Numba!
(II)

Time to evaluate the polynomial (in seconds):

                 0.25x³ + 0.75x² + 1.5x − 2   ((0.25x + 0.75)x + 1.5)x − 2
NumPy            1.086                        0.505
numexpr          0.108                        0.096
Numba            0.055                        0.054
Pure C, OpenMP   0.215                        0.054

• Compilation time for Numba: 0.019 sec
• Run on Mac OS X, Core 2 Duo @ 2.13 GHz
times we are too focused on computing as fast as possible • But we have seen how important data access is • Hence, having an optimal data structure is critical for getting good performance when processing very large datasets
[Figure: appending data to a container — the array to be enlarged, the new data to append, and the resulting final array object]
• Normally an in-place realloc() will not succeed, so the data must be copied to a new location
• Both memory areas have to exist simultaneously
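The cost shows up directly in NumPy (a small illustrative sketch, not from the slides): every append allocates a brand-new array and copies both the old and the new data into it, whereas collecting chunks and concatenating once copies each element only one time.

import numpy as np

# O(len(a)) copy on *every* append:
a = np.zeros(0)
for i in range(100):
    a = np.concatenate([a, np.random.rand(1000)])

# Cheaper: gather the chunks first, then do a single concatenation
chunks = [np.random.rand(1000) for i in range(100)]
b = np.concatenate(chunks)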
Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

Reference: ...with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.

TABLE 1: Test Data Sets

#  Source        Identifier  Sequencer            Read Count   Read Length  ID Lengths  FASTQ Size
1  1000 Genomes  ERR000018   Illumina GA          9,280,498    36 bp        40–50       1,105 MB
2  1000 Genomes  SRR493233   Illumina HiSeq 2000  43,225,060   100 bp       51–61       10,916 MB
3  1000 Genomes  SRR497004   AB SOLiD 4           122,924,963  51 bp        78–91       22,990 MB

[Fig. 1. In-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long).]
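As a hedged sketch of what a Blosc-backed path looks like at the Python level (the compression parameters are illustrative choices, not SeqPack's actual settings), python-blosc can compress a NumPy buffer in memory and restore it transparently:

import numpy as np
import blosc

a = np.linspace(0, 100, 10*1000*1000)

packed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                        clevel=9, shuffle=blosc.SHUFFLE)
print("compression ratio:", a.nbytes / len(packed))

restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)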
• Nowadays you should be aware of the memory system in order to get good performance • Choosing appropriate data containers is of the utmost importance when dealing with Big Data
be able to look beyond the standard, and be able to understand the underlying hardware resources and the variety of available algorithms.” -- Oscar de Bustos, HPC Line of Business Manager at BULL