
It's the Memory, Stupid!

Or: How I Learned to Stop Worrying about CPU Speed and Love Memory Access

Stephen Diehl
January 09, 2013


Transcript

  1. It's The Memory, Stupid! or: How I Learned to Stop Worrying about CPU Speed and Love Memory Access. Francesc Alted, Software Architect. Big Data Spain 2012, Madrid (Spain), November 16, 2012
  2. About Continuum Analytics
     • Develop new ways to store, compute, and visualize data.
     • Provide open technologies for Data Integration on a massive scale.
     • Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
  3. Overview
     • The Era of ‘Big Data’
     • A few words about Python and NumPy
     • The Starving CPU problem
     • Choosing optimal containers for Big Data
  4. The Dawn of ‘Big Data’
     “A wind of streaming data, social data and unstructured data is knocking at the door, and we're starting to let it in. It's a scary place at the moment.” -- Unidentified bank IT executive, as quoted by “The American Banker”
  5. Challenges
     • We have to deal with as much data as possible using limited resources
     • So, we must use our computational resources optimally to get the most out of Big Data
  6. Interactivity and Big Data
     • Interactivity is crucial for handling data
     • Interactivity and performance are crucial for handling Big Data
  7. Python and ‘Big Data’
     • Python is an interpreted language and hence offers interactivity
     • Myth: “Python is slow, so why on earth would you use it for Big Data?”
     • Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations
     • ...and during this talk I will prove it!
  8. Operating with NumPy
     • array[2]; array[1, 1:5, :]; array[[3, 6, 10]]
     • (array1**3 / array2) - sin(array3)
     • numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
     • and much more... (a short sketch of these operations follows)
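For concreteness, here is a minimal NumPy sketch of the operations listed above; the array names and shapes are illustrative, not from the talk:

    import numpy as np

    # Illustrative arrays (names follow the slide; shapes are arbitrary)
    array = np.arange(90, dtype=np.float64).reshape(3, 6, 5)
    array1 = np.random.rand(100)
    array2 = np.random.rand(100) + 1.0   # offset avoids division by zero
    array3 = np.random.rand(100)

    # Indexing and slicing
    sub = array[2]                # a 6x5 sub-array
    view = array[1, 1:5, :]       # slicing returns a view, not a copy
    picked = array1[[3, 6, 10]]   # fancy indexing returns a copy

    # Element-wise expressions run as vectorized C loops
    result = (array1**3 / array2) - np.sin(array3)

    # numpy.dot dispatches to optimized BLAS (*GEMM) routines
    a = np.random.rand(200, 300)
    b = np.random.rand(300, 100)
    c = np.dot(a, b)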
  9. Nothing Is Perfect
     • NumPy is just great for many use cases
     • However, it also has its own deficiencies:
     • It follows the Python evaluation order in complex expressions like (a * b) + c, materializing a full temporary array for each intermediate result
     • It does not have support for multiprocessors (except for BLAS computations)
  10. Numexpr: Dealing with Complex Expressions
     • It comes with a specialized virtual machine for evaluating expressions
     • It accelerates computations mainly by using memory more efficiently
     • It supports multithreading (active by default) that is extremely easy to use (a usage sketch follows)
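A minimal usage sketch, with illustrative array names; numexpr.evaluate compiles the expression string for its internal virtual machine:

    import numpy as np
    import numexpr

    numexpr.set_num_threads(1)   # multithreading is active by default

    N = 10 * 1000 * 1000
    a = np.random.rand(N)
    b = np.random.rand(N)
    c = np.random.rand(N)

    # The expression is evaluated in cache-sized blocks, avoiding the
    # large temporaries that (a * b) + c would create in plain NumPy
    result = numexpr.evaluate("(a * b) + c")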
  11. Exercise (I)
     Evaluate the polynomial
     0.25x^3 + 0.75x^2 + 1.5x - 2
     in the range [-1, 1] with a step size of 2*10^-7, using both NumPy and numexpr.
     Note: use a single processor for numexpr: numexpr.set_num_threads(1)
  12. Exercise (II)
     Rewrite the polynomial in this notation:
     ((0.25x + 0.75)x + 1.5)x - 2
     and redo the computations. What happens? (A code sketch of both exercises follows.)
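A sketch of what the two exercises look like in code, with numexpr pinned to one thread as the slide requests:

    import numpy as np
    import numexpr

    numexpr.set_num_threads(1)
    x = np.arange(-1.0, 1.0, 2e-7)   # range [-1, 1], step 2*10^-7

    # Exercise (I): the expanded notation
    y1 = 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
    y2 = numexpr.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")

    # Exercise (II): the factored (Horner) notation
    y3 = ((0.25*x + 0.75)*x + 1.5)*x - 2
    y4 = numexpr.evaluate("((0.25*x + 0.75)*x + 1.5)*x - 2")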
  13. NumPy vs Numexpr (1 thread)
     Time to evaluate polynomial (1 thread), in seconds:
       ((.25*x + .75)*x - 1.5)*x - 2:   NumPy 0.301, Numexpr 0.110
       x:                               NumPy 0.052, Numexpr 0.045
       sin(x)**2 + cos(x)**2:           NumPy 0.715, Numexpr 0.559
     [Bar chart comparing NumPy and Numexpr evaluation times for these expressions]
  14. Power Expansion
     Numexpr expands the expression
     0.25x^3 + 0.75x^2 + 1.5x - 2
     to
     0.25*x*x*x + 0.75*x*x + 1.5*x - 2
     so there is no need to use the transcendental pow()
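The effect of avoiding pow() is easy to check by hand; a hedged timing sketch (results vary by machine and NumPy version):

    import timeit
    import numpy as np

    x = np.linspace(-1, 1, 10 * 1000 * 1000)

    # x**3 typically goes through pow(); x*x*x is just two multiplies
    t_pow = timeit.timeit(lambda: 0.25*x**3 + 0.75*x**2 + 1.5*x - 2, number=3)
    t_mul = timeit.timeit(lambda: 0.25*x*x*x + 0.75*x*x + 1.5*x - 2, number=3)
    print(t_pow, t_mul)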
  15. Pending Question
     • Why does numexpr continue to be 3x faster than NumPy, even when both are executing exactly the *same* number of operations?
  16. The Starving CPU Problem
     “Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data.” – Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996
  17. The Status of CPU Starvation in 2012
     • Memory latency lags processor speed by a large factor (between 250x and 500x)
     • Memory bandwidth is improving at a better rate than memory latency, but it still lags processors (between 30x and 100x)
  18. CPU Caches to the Rescue
     • CPU cache latency and throughput are much better than main memory's
     • However: the faster caches run, the smaller they must be
  19. CPU Cache Evolution
     Up to the end of the 80's | 90's and 2000's | 2010's
     Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what's coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.
     [Diagram: three memory hierarchies, from the CPU through cache levels and main memory down to disk; speed increases toward the CPU, capacity toward the disk]
  20. When Are CPU Caches Effective?
     Mainly in a couple of scenarios:
     • Temporal locality: when the dataset is reused
     • Spatial locality: when the dataset is accessed sequentially
     (A sketch demonstrating the effect follows.)
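A small sketch that makes spatial locality visible: summing contiguous rows of a C-ordered matrix is faster than summing strided columns, even though the arithmetic is identical (sizes are illustrative):

    import timeit
    import numpy as np

    m = np.random.rand(1000, 1000)   # C order: rows are contiguous in memory

    def by_rows():
        return sum(m[i, :].sum() for i in range(1000))   # sequential access

    def by_cols():
        return sum(m[:, j].sum() for j in range(1000))   # strided access

    print(timeit.timeit(by_rows, number=10))
    print(timeit.timeit(by_cols, number=10))   # slower: cache lines poorly used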
  21. The Blocking Technique
     When accessing disk or memory, fetch a contiguous block that fits in the CPU cache, operate upon it, and reuse it as much as possible.
     Use this technique extensively to leverage spatial and temporal locality.
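This is essentially what numexpr does internally. A minimal sketch of the same idea in plain NumPy; the block length is an illustrative cache-sized figure:

    import numpy as np

    def blocked_eval(x, out, block_len=64 * 1024):
        # Process one cache-sized block at a time; the block stays hot
        # in cache across all the multiply/add passes over it
        for start in range(0, len(x), block_len):
            b = x[start:start + block_len]
            out[start:start + block_len] = ((0.25*b + 0.75)*b + 1.5)*b - 2
        return out

    x = np.linspace(-1, 1, 10 * 1000 * 1000)
    y = blocked_eval(x, np.empty_like(x))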
  22. Time To Answer Pending Questions
     Time to evaluate polynomial (1 thread), in seconds:
     Expression                         NumPy    Numexpr
     .25*x**3 + .75*x**2 - 1.5*x - 2    1.613    0.138
     ((.25*x + .75)*x - 1.5)*x - 2      0.301    0.110
     x                                  0.052    0.045
     sin(x)**2 + cos(x)**2              0.715    0.559
  23. Numexpr Limitations
     • Numexpr only implements element-wise operations, i.e. ‘a*b’ is evaluated as:
       for i in range(N):
           c[i] = a[i] * b[i]
     • In particular, it cannot deal with things like:
       for i in range(N):
           c[i] = a[i-1] + a[i] * b[i]
  24. Numba: Overcoming numexpr Limitations
     • Numba is a JIT compiler that can translate a subset of the Python language into machine code
     • It uses the LLVM infrastructure behind the scenes
     • It can achieve similar or better performance than numexpr, but with more flexibility
  25. How Numba Works
     [Diagram: a Python function is passed through llvm-py into the LLVM 3.1 infrastructure, which emits machine code; the same infrastructure underlies CLANG, OpenCL, ISPC, CUDA, and OpenMP toolchains from Intel, Nvidia, Apple, and AMD]
  26. Numba Example: Computing the Polynomial
     import numpy as np
     import numba as nb

     N = 10*1000*1000

     x = np.linspace(-1, 1, N)
     y = np.empty(N, dtype=np.float64)

     @nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
     def poly(x, y):
         for i in range(N):
             # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
             y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

     poly(x, y)  # run through Numba!
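Note that @nb.jit(arg_types=...) is the early, 2012-era Numba decorator spelling; current releases use a different signature. A hedged timing sketch for reproducing the comparison, assuming poly, x, and y from the slide above are in scope:

    import timeit

    t_numpy = timeit.timeit(
        lambda: ((0.25*x + 0.75)*x + 1.5)*x - 2, number=10) / 10
    t_numba = timeit.timeit(lambda: poly(x, y), number=10) / 10
    print(t_numpy, t_numba)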
  27. Times for Computing the Polynomial (In Seconds)
     Poly version      (I)      (II)
     NumPy             1.086    0.505
     numexpr           0.108    0.096
     Numba             0.055    0.054
     Pure C, OpenMP    0.215    0.054
     • Compilation time for Numba: 0.019 sec
     • Run on Mac OSX, Core2 Duo @ 2.13 GHz
  28. Numba: LLVM for Python
     Python code can reach C speed without having to program in C itself (and without losing interactivity!)
  29. Optimal Containers for Big Data
     “If a datastore requires all data to fit in memory, it isn't big data.” -- Alex Gaynor (on Twitter)
  30. The Need for a Good Data Container
     • Too often we focus only on computing as fast as possible
     • But we have seen how important data access is
     • Hence, an optimal data structure is critical for getting good performance when processing very large datasets
  31. Appending Data in Large NumPy Objects
     [Diagram: enlarging an array means allocating new memory and copying both the array to be enlarged and the new data into the final array object]
     • Normally a realloc() call cannot grow the buffer in place, so the whole array must be copied
     • Both memory areas have to exist simultaneously
     (A sketch of the resulting cost follows.)
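A minimal sketch of why this hurts: growing an array with np.append copies the whole array on every call, while collecting chunks and concatenating once copies each element only once (sizes are illustrative):

    import numpy as np

    # Quadratic: every append allocates a new array and copies both operands
    a = np.empty(0)
    for _ in range(100):
        a = np.append(a, np.random.rand(1000))

    # Linear: one allocation, one copy per chunk
    chunks = [np.random.rand(1000) for _ in range(100)]
    b = np.concatenate(chunks)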
  32. Appending Data in Blaze
     [Diagram: the array to be enlarged is stored as compressed chunks (chunk 1, chunk 2, ...); appending compresses the new data into a fresh chunk that is linked into the final array object]
     • Only a small amount of data (the new chunk) has to be compressed
     (A minimal sketch of the idea follows.)
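A minimal sketch of the chunked-append idea in plain Python with zlib; this is not the Blaze API, and the ChunkedArray and chunk_len names are illustrative:

    import zlib
    import numpy as np

    class ChunkedArray:
        """Append-only container of compressed, fixed-length chunks."""

        def __init__(self, chunk_len=64 * 1024):
            self.chunk_len = chunk_len
            self.chunks = []          # zlib-compressed, sealed chunks
            self.tail = np.empty(0)   # uncompressed leftover

        def append(self, data):
            buf = np.concatenate([self.tail, data])
            while len(buf) >= self.chunk_len:
                head, buf = buf[:self.chunk_len], buf[self.chunk_len:]
                self.chunks.append(zlib.compress(head.tobytes()))
            self.tail = buf           # sealed chunks are never touched

    arr = ChunkedArray()
    arr.append(np.random.rand(100 * 1000))  # only new data gets compressed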
  33. Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)
     TABLE 1: Test Data Sets
     #   Source         Identifier   Sequencer             Read Count    Read Length   ID Lengths   FASTQ Size
     1   1000 Genomes   ERR000018    Illumina GA           9,280,498     36 bp         40-50        1,105 MB
     2   1000 Genomes   SRR493233    Illumina HiSeq 2000   43,225,060    100 bp        51-61        10,916 MB
     3   1000 Genomes   SRR497004    AB SOLiD 4            122,924,963   51 bp         78-91        22,990 MB
     [Figure: in-memory throughputs for several compression schemes applied to increasing block sizes, where each sequence is 256 bytes long]
     Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
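For reference, a hedged sketch of round-tripping a NumPy array through the python-blosc bindings; the typesize hint lets Blosc's shuffle filter exploit the element width:

    import blosc
    import numpy as np

    a = np.linspace(0, 100, 10 * 1000 * 1000)

    packed = blosc.compress(a.tobytes(), typesize=a.itemsize)
    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)

    assert (a == restored).all()
    print(len(packed) / a.nbytes)   # compression ratio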
  34. How Blaze Does Out-Of-Core Computations
     [Diagram: compressed chunks are read from storage, decompressed, and processed by a virtual machine: Python, numexpr, or Numba]
  35. Last Message for Today
     Big data is tricky to manage: look for the optimal containers for your data.
     Spending some time choosing an appropriate data container can be a big time saver in the long run.
  36. Summary
     • Python is a perfect language for Big Data
     • Nowadays you should be aware of the memory system for getting good performance
     • Choosing appropriate data containers is of the utmost importance when dealing with Big Data
  37. “Success with Big Data will be achieved by those developers who are able to look beyond the standard and to understand the underlying hardware resources and the variety of available algorithms.” -- Oscar de Bustos, HPC Line of Business Manager at BULL