Profiling to understand system behaviour
• We often ignore this step...
• Speeding up the bottleneck
• Keeps you on 1 machine (if possible)
• Keeping team speed high
by line or by function)
• Memory (line by line)
• Disk read/write (with some hacking)
• Network read/write (with some hacking)
• mmaps, file handles, network connections
• Why not watch memory flows on the machine?
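Function-level CPU profiling can be sketched with the standard library's cProfile and pstats modules. The workload below (`slow_sum_of_squares`, `run_workload`) and the helper `profile_by_function` are hypothetical names invented for illustration, not code from the talk. Line-by-line CPU and memory profiling need third-party tools and are not shown here.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical CPU-bound function standing in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_workload():
    """Hypothetical entry point that calls the bottleneck several times."""
    return [slow_sum_of_squares(50000) for _ in range(5)]


def profile_by_function(func):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()

    # Collect the report into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_by_function(run_workload)
    print(report)
```

The report lists call counts and cumulative time for each function, which is usually enough to decide where to look more closely before optimising anything.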
#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    # Typed loop counters and complex values let Cython emit plain C arithmetic
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Iterate until the point escapes (|z|^2 >= 4) or maxiter is reached
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

• Pure CPython lists code: 12s
• Cython lists runtime: 0.19s
• Cython numpy runtime: 0.16s
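The 12s pure CPython figure refers to the same loop run without any type declarations. A minimal sketch of such a baseline follows, with a small timing harness. The function name `calculate_z_purepython`, the helper `build_inputs`, the grid size, the iteration limit and the constant `c` are all assumptions chosen for illustration, so the timing it prints will not match the numbers quoted above.

```python
import time


def calculate_z_purepython(maxiter, zs, cs):
    """Julia update rule on plain Python lists, with no type declarations."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Iterate until the point escapes (|z|^2 >= 4) or maxiter is reached
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def build_inputs(width, c):
    """Hypothetical helper: a width x width grid of start points in [-1.8, 1.8]."""
    step = 3.6 / (width - 1)
    coords = [-1.8 + k * step for k in range(width)]
    zs = [complex(x, y) for y in coords for x in coords]
    cs = [c] * len(zs)
    return zs, cs


if __name__ == "__main__":
    # Arbitrary illustrative parameters, not the ones used for the slide timings
    zs, cs = build_inputs(200, complex(-0.6, -0.4))
    start = time.perf_counter()
    output = calculate_z_purepython(300, zs, cs)
    elapsed = time.perf_counter() - start
    print("points: %d, elapsed: %.2fs" % (len(output), elapsed))
```

Keeping a baseline like this around also gives the compiled versions something to be checked against: the same inputs should produce the same output list.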
nogil
#cython: boundscheck=False
from cython.parallel import prange
import numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    # Typed memoryview over an int32 array, so it can be written without the GIL
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil:
        # prange splits the iterations across OpenMP threads
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                # n = n + 1 rather than n += 1: an in-place operator inside
                # prange would make n a reduction variable
                n = n + 1
            output[i] = n
    return output

• Runtime: 0.05s
“It just works” on Python 2.7 code
• Clever list strategies (e.g. unboxed, uniform)
• Software Transactional Memory looks interesting
• Pure-Python libs (e.g. pymysql) work fine
• Python list code runtime: 0.3s, faster on second run (if in same session)
• No support cost if pypy is in PATH
from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output inside the function, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    # return output

• Runtime: 0.4s (0.2s on subsequent runs)
• Some Python 3 support, some GPU
• Not a silver bullet yet but might be...
first – maybe you just need a better algorithm?
• Never sacrifice unit tests in the name of profiling
• PyPy: no learning curve, easy (non-numpy) win
• Cython (pure Python): hours to learn, team cost low (and lots of online help)
• Cython + numpy + OMP: days+ to learn, heavy team cost?
• [R&D?] Numba: trivial to learn when it works (Anaconda only!)
• [R&D?] Pythran: trivial to learn, OMP is an easy additional win, increases support cost
options should be richer
• 4-12 physical CPU cores are commonplace
• JITs/AST compilers are getting fairly good, but manual intervention still gives the best results
• Automation should be embraced, as CPUs cost less than humans and team velocity is probably higher