High Performance Python Landscape (PyDataLondon Sept 2014)

The High Performance Python Landscape - profiling and fast calculation
Ian Ozsvald @IanOzsvald MorConsulting.com

[email protected] @IanOzsvald PyDataLondon February 2014

What is “high performance”?
• Profiling to understand system behaviour
• We often ignore this step...
• Speeding up the bottleneck
• Keeps you on 1 machine (if possible)
• Keeping team speed high

“High Performance Python”
• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com

line_profiler

Line #      Hits      Time  Per Hit  % Time  Line Contents
==============================================================
     9                                       @profile
    10                                       def calculate_z_serial_purepython(maxiter, zs, cs):
    12         1      6870   6870.0     0.0      output = [0] * len(zs)
    13   1000001    781959      0.8     0.8      for i in range(len(zs)):
    14   1000000    767224      0.8     0.8          n = 0
    15   1000000    843432      0.8     0.8          z = zs[i]
    16   1000000    786013      0.8     0.8          c = cs[i]
    17  34219980  36492596      1.1    36.2          while abs(z) < 2 and n < maxiter:
    18  33219980  32869046      1.0    32.6              z = z * z + c
    19  33219980  27371730      0.8    27.2              n += 1
    20   1000000    890837      0.9     0.9          output[i] = n
    21         1         4      4.0     0.0      return output
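To reproduce a report like the one above, a self-contained script is needed. The following is a minimal sketch: the function body follows the slide, while the helper `build_inputs`, its parameters, and the constant `C` are hypothetical additions for illustration and are not from the talk.

```python
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Count Julia-set iterations for each starting point in zs."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Iterate until the point escapes (|z| >= 2) or we hit maxiter
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def build_inputs(width, c, lo=-1.8, hi=1.8):
    """Hypothetical helper: a width x width grid of complex start points."""
    step = (hi - lo) / (width - 1)
    axis = [lo + k * step for k in range(width)]
    zs = [complex(x, y) for y in axis for x in axis]
    cs = [c] * len(zs)
    return zs, cs


if __name__ == "__main__":
    C = complex(-0.62772, -0.42193)  # assumed constant, chosen for illustration
    zs, cs = build_inputs(50, C)
    result = calculate_z_serial_purepython(300, zs, cs)
    print(len(result), max(result))
```

With `@profile` added above `calculate_z_serial_purepython` and the file saved under a name such as `julia.py` (a made-up name), line_profiler can be invoked as `kernprof -l -v julia.py`.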

memory_profiler

Line #    Mem usage    Increment   Line Contents
================================================
     9   89.934 MiB    0.000 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):
    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
    14  130.215 MiB    0.000 MiB           n = 0
    15  130.215 MiB    0.000 MiB           z = zs[i]
    16  130.215 MiB    0.000 MiB           c = cs[i]
    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
    18  130.215 MiB    0.000 MiB               z = z * z + c
    19  130.215 MiB    0.000 MiB               n += 1
    20  130.215 MiB    0.000 MiB           output[i] = n
    21  122.582 MiB    7.633 MiB       return output
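The 7.633 MiB increment reported for `output = [0] * len(zs)` can be sanity-checked with a little arithmetic. This sketch assumes a 64-bit CPython build and the one million elements implied by the hit counts in the line_profiler report. A list stores one pointer per element, and every slot here refers to the same `0` object.

```python
import struct
import sys

n = 1_000_000
output = [0] * n

pointer_bytes = struct.calcsize("P")   # size of a C pointer on this build
list_bytes = sys.getsizeof(output)     # list header plus n element pointers

print(f"pointer size: {pointer_bytes} bytes")
print(f"list size:    {list_bytes / 2**20:.3f} MiB")
```

On a 64-bit build this prints a figure a little over 7.6 MiB, the same order as the increment that memory_profiler reports. The two need not match exactly, because memory_profiler samples process memory rather than object sizes.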

Don't sacrifice unit tests
• It is possible (but not trivial) to maintain unit tests whilst profiling
• See my book for examples (you make no-op @profile decorators)
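One way to read the "no-op @profile decorator" idea is sketched below. It rests on the assumption that the profiler's runner (for example `kernprof`) injects a `profile` name into builtins while it runs the script; the book's exact code may differ. The function `count_iterations` is a made-up example, not taken from the slides.

```python
import builtins

# Under a profiler runner, `profile` is assumed to exist in builtins already.
# Under a plain interpreter or a test runner it does not, so we define a
# pass-through decorator that leaves the function unchanged.
if not hasattr(builtins, "profile"):
    def profile(func):
        """No-op stand-in for the profiler's decorator."""
        return func


@profile
def count_iterations(maxiter, z, c):
    """Hypothetical example: Julia update rule for a single point."""
    n = 0
    while abs(z) < 2 and n < maxiter:
        z = z * z + c
        n += 1
    return n


# A unit test can now import and call the decorated function directly.
assert count_iterations(5, 0j, 0j) == 5
assert count_iterations(5, 3 + 0j, 0j) == 0
```

The module then imports cleanly both under the profiler and under the test suite, so the `@profile` lines do not have to be added and removed by hand.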

ipython_memory_watcher.py

# approx 750MB per matrix
In [2]: a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)
'a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)' used 2288.8750 MiB RAM in 1.02s, peaked 0.00 MiB above current, total RAM usage 2338.06 MiB
In [3]: d=a*b+c
'd=a*b+c' used 762.9453 MiB RAM in 0.91s, peaked 667.91 MiB above current, total RAM usage 3101.01 MiB
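The "approx 750MB per matrix" figure follows from the element count and the element size. The sketch below does that arithmetic without allocating anything, assuming `np.ones` produces 8-byte float64 values (its default dtype).

```python
import struct

n_items = int(1e8)                    # elements per array, as on the slide
item_bytes = struct.calcsize("d")     # a C double, same width as numpy float64
array_mib = n_items * item_bytes / 2**20

print(f"one array:    {array_mib:.4f} MiB")
print(f"three arrays: {3 * array_mib:.4f} MiB")
```

The results (about 763 MiB and 2289 MiB) line up with the figures the watcher reports for `In [3]` and `In [2]`. The peak of 667.91 MiB above current during `d=a*b+c` is of the same order as one more array, which would be consistent with a temporary holding `a*b`; that reading is an assumption, not something the slide states.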

memory_profiler mprof
https://github.com/scikit-learn/scikit-learn/pull/2248
Before & After an improvement

Transforming memory_profiler into a resource profiler?

Profiling possibilities
• CPU (line by line or by function)
• Memory (line by line)
• Disk read/write (with some hacking)
• Network read/write (with some hacking)
• mmaps, File handles, Network connections
• Why not watch memory flows on machine?
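For the "by function" flavour of CPU profiling listed above, the standard library's `cProfile` is one option. The talk does not name a specific tool for this, so treat the following as an illustrative sketch; `inner_loop`, `run`, and the input values are invented for the example.

```python
import cProfile
import io
import pstats


def inner_loop(maxiter, z, c):
    """Julia update rule for a single point."""
    n = 0
    while abs(z) < 2 and n < maxiter:
        z = z * z + c
        n += 1
    return n


def run(maxiter, zs, cs):
    """Apply inner_loop to every (z, c) pair."""
    return [inner_loop(maxiter, z, c) for z, c in zip(zs, cs)]


zs = [complex(0.1 * k, 0.0) for k in range(200)]
cs = [complex(-0.5, 0.3)] * len(zs)

# Collect per-function statistics for one call to run()
profiler = cProfile.Profile()
profiler.enable()
result = run(50, zs, cs)
profiler.disable()

# Render the ten most expensive entries, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

A function-level view like this shows which function deserves attention; a line-by-line profiler can then be pointed at that function alone.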

Cython 0.20 (pyx annotations)

    #cython: boundscheck=False
    def calculate_z(int maxiter, zs, cs):
        """Calculate output list using Julia update rule"""
        cdef unsigned int i, n
        cdef double complex z, c
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n += 1
            output[i] = n
        return output

Pure CPython lists code 12s
Cython lists runtime 0.19s
Cython numpy runtime 0.16s

Cython + numpy + OMP nogil

    #cython: boundscheck=False
    from cython.parallel import prange
    import numpy as np

    def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
        cdef unsigned int i, length, n
        cdef double complex z, c
        cdef int[:] output = np.empty(len(zs), dtype=np.int32)
        length = len(zs)
        with nogil:
            for i in prange(length, schedule="guided"):
                z = zs[i]
                c = cs[i]
                n = 0
                while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                    z = z * z + c
                    n = n + 1
                output[i] = n
        return output

Runtime 0.05s

Pythran (0.40)

    #pythran export calculate_z_serial_purepython(int, complex list, complex list)
    def calculate_z_serial_purepython(maxiter, zs, cs):
        …

Support for OpenMP on numpy arrays
Author Serge made an overnight fix - superb support!
List Runtime 0.4s

    #pythran export calculate_z(int, complex[], complex[], int[])
    …
    #omp parallel for schedule(dynamic)

OMP numpy Runtime 0.10s

PyPy nightly (and numpypy)
• “It just works” on Python 2.7 code
• Clever list strategies (e.g. unboxed, uniform)
• Software Transactional Memory LOOKS INTERESTING
• Pure-py libs (e.g. pymysql) work fine
• Python list code runtime: 0.3s, faster on second run (if in same session)
• No support cost if pypy is in PATH
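The "faster on second run (if in same session)" remark can be checked by timing the same call several times within one process. The sketch below is one way to do that; `time_runs` and `workload` are hypothetical names, and the workload is a stand-in for the Julia loop rather than the talk's benchmark.

```python
import time


def workload(maxiter, points):
    """Stand-in workload: Julia update rule over a list of start points."""
    c = complex(-0.5, 0.3)
    total = 0
    for z in points:
        n = 0
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        total += n
    return total


def time_runs(func, args, repeats=3):
    """Call func(*args) `repeats` times and return each wall-clock duration."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return timings


points = [complex(0.01 * k, 0.0) for k in range(1000)]
timings = time_runs(workload, (100, points))
for run_number, seconds in enumerate(timings, start=1):
    print(f"run {run_number}: {seconds:.4f}s")
```

Under a JIT such as PyPy the later runs would be expected to be quicker once the hot loop has been compiled, whereas under CPython the runs should take roughly the same time.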

Numba 0.13

    from numba import jit

    @jit(nopython=True)
    def calculate_z_serial_purepython(maxiter, zs, cs, output):
        # couldn't create output, had to pass it in
        # output = numpy.zeros(len(zs), dtype=np.int32)
        for i in xrange(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
                z = z * z + c
                n += 1
            output[i] = n
        #return output

Runtime 0.4s (0.2s on subsequent runs)
Some Python 3 support, some GPU
Not a golden bullet yet but might be...

Tool Tradeoffs
• Always profile first – maybe you just need a better alg?
• Never sacrifice unit tests in the name of profiling
• PyPy no learning curve - easy (non-numpy) win
• Cython pure Py hours to learn – team cost low (and lots of online help)
• Cython numpy OMP days+ to learn – heavy team cost?
• [R&D?] Numba trivial to learn when it works (Anaconda only!)
• [R&D?] Pythran trivial to learn, OMP easy additional win, increases support cost

Wrap up
• Our profiling options should be richer
• 4-12 physical CPU cores commonplace
• JITs/AST compilers are getting fairly good, manual intervention still gives best results
• Automation should be embraced as CPUs cost less than humans and team velocity is probably higher

Thank You
• [email protected]
• @IanOzsvald
• ModelInsight.io / MorConsulting.com
• GitHub/IanOzsvald
• I'm training on this in October in London!
