$30 off During Our Annual Pro Sale. View Details »

Seven Strategies for Optimizing Numerical Code

Seven Strategies for Optimizing Numerical Code

Talk given at PyCon 2018

Abstract:
Python provides a powerful platform for working with data, but often the most straightforward data analysis can be painfully slow. When used effectively, though, Python can be as fast as even compiled languages like C. This talk presents an overview of how to effectively approach optimization of numerical code in Python, touching on tools like numpy, pandas, scipy, cython, numba, and more.

Jake VanderPlas

May 11, 2018
Tweet

More Decks by Jake VanderPlas

Other Decks in Programming

Transcript

  1. Performance Python
    7 Strategies for Optimizing
    Your Numerical Code
    Jake VanderPlas @jakevdp
    PyCon 2018

    View Slide

  2. Python is Fast.
    Dynamic, interpreted, & flexible: fast Development

    View Slide

  3. Python is Slow.
    CPython has constant overhead per operation

    View Slide

  4. Python is Slow.
    CPython has constant overhead per operation

    View Slide

  5. Fortran is 100x faster for this simple task!
    Python is Slow.
    CPython has constant overhead per operation

    View Slide

  6. The best of both worlds?

    View Slide

  7. Seven Strategies
    For Optimizing Your
    Numerical Python Code

    View Slide

  8. Example:
    K-means
    Clustering

    View Slide

  9. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  10. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  11. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  12. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  13. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  14. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  15. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  16. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  17. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  18. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  19. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  20. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  21. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  22. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  23. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  24. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  25. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  26. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  27. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  28. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  29. Example:
    K-means
    Clustering
    Algorithm:
    1. Choose some Cluster Centers
    2. Repeat:
    a. Assign points to nearest center
    b. Update center to mean of points
    c. Check if Converged

    View Slide

  30. Implementing K Means in Python

    View Slide

  31. Python Implementation
    def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels

    View Slide

  32. Python Implementation
    def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels

    View Slide

  33. Python Implementation
    def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels

    View Slide

  34. Python Implementation
    def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels

    View Slide

  35. Python Implementation
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]

    View Slide

  36. Python Implementation
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]

    View Slide

  37. Python Implementation
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]

    View Slide

  38. Python Implementation
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]

    View Slide

  39. Python Implementation
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]

    View Slide

  40. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  41. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  42. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  43. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  44. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  45. Python Implementation
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:].tolist()
    while True:
    old_centers = centers
    labels = find_labels(points, centers)
    centers = compute_centers(points, labels)
    if centers == old_centers:
    break
    return labels
    %timeit kmeans(points, 10)
    7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)

    View Slide

  46. What to do when Python is too slow?

    View Slide

  47. Seven Strategies:
    1. Line Profiling

    View Slide

  48. “Premature optimization is the root of all evil”
    - Donald Knuth
    Seven Strategies:
    1. Line Profiling

    View Slide

  49. Line Profiling
    %load_ext line_profiler
    %lprun -f kmeans kmeans(points, 10)
    Timer unit: 1e-06 s
    Total time: 11.8153 s
    File:
    Function: kmeans at line 27
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    27 def kmeans(points, n_clusters):
    28 1 16 16.0 0.0 centers = points[-n_clusters:].
    29 1 2 2.0 0.0 while True:
    30 54 55 1.0 0.0 old_centers = centers
    31 54 11012265 203930.8 93.2 labels = find_labels(points
    32 54 802873 14868.0 6.8 centers = compute_centers(p
    labels)
    33 54 116 2.1 0.0 if centers == old_centers:
    34 1 0 0.0 0.0 break
    35 1 1 1.0 0.0 return labels

    View Slide

  50. How can we optimize repeated
    operations on arrays?

    View Slide

  51. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization

    View Slide

  52. def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    Original Code

    View Slide

  53. def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    Original Code
    Numpy Code

    View Slide

  54. def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    Original Code
    Numpy Code

    View Slide

  55. def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    Original Code
    Numpy Code

    View Slide

  56. def dist(x, y):
    return sum((xi - yi) ** 2
    for xi, yi in zip(x, y))
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center)
    for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    Original Code
    Numpy Code

    View Slide

  57. Timer unit: 1e-06 s
    Total time: 0.960594 s
    File:
    Function: kmeans at line 23
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    23 def kmeans(points, n_clusters):
    24 1 11 11.0 0.0 centers = points[-n_clusters:].
    25 1 0 0.0 0.0 while True:
    26 54 50 0.9 0.0 old_centers = centers
    27 54 87758 1625.1 9.1 labels = find_labels(points
    28 54 872625 16159.7 90.8 centers = compute_centers(p
    29 54 149 2.8 0.0 if centers == old_centers:
    30 1 1 1.0 0.0 break
    31 1 0 0.0 0.0 return labels
    %lprun -f kmeans kmeans(points, 10)

    View Slide

  58. %lprun -f kmeans kmeans(points, 10)
    Timer unit: 1e-06 s
    Total time: 0.960594 s
    File:
    Function: kmeans at line 23
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    23 def kmeans(points, n_clusters):
    24 1 11 11.0 0.0 centers = points[-n_clusters:].
    25 1 0 0.0 0.0 while True:
    26 54 50 0.9 0.0 old_centers = centers
    27 54 87758 1625.1 9.1 labels = find_labels(points
    28 54 872625 16159.7 90.8 centers = compute_centers(p
    29 54 149 2.8 0.0 if centers == old_centers:
    30 1 1 1.0 0.0 break
    31 1 0 0.0 0.0 return labels

    View Slide

  59. def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]
    Original Code

    View Slide

  60. def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]
    Original Code
    Numpy Code
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    return np.array([points[labels == i].mean(0)
    for i in range(n_centers)])

    View Slide

  61. def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]
    Original Code
    Numpy Code
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    return np.array([points[labels == i].mean(0)
    for i in range(n_centers)])

    View Slide

  62. def compute_centers(points, labels):
    n_centers = len(set(labels))
    n_dims = len(points[0])
    centers = [[0 for i in range(n_dims)]
    for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]
    for label, point in zip(labels, points):
    counts[label] += 1
    centers[label] = [a + b for a, b in
    zip(centers[label], point)]
    return [[x / count for x in center]
    for center, count in zip(centers, counts)]
    Original Code
    Numpy Code
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    return np.array([points[labels == i].mean(0)
    for i in range(n_centers)])

    View Slide

  63. %timeit kmeans(points, 10)
    131 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs,
    10 loops each)
    Down from 7.44 seconds to 0.13 seconds!
    Key: repeated operations pushed into a compiled layer:
    Python overhead per array rather than per array element.

    View Slide

  64. Advantages:
    - Python overhead per array rather than per array
    element
    - Compact domain specific language for array
    operations
    - NumPy is widely available
    Disadvantages:
    - Batch operations can lead to excessive memory
    usage
    - Different way of thinking about writing code
    Recommendation: Use NumPy everywhere!

    View Slide

  65. Deeper dive into NumPy Vectorization
    “Losing your Loops” / PyCon 2015

    View Slide

  66. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    Scipy

    View Slide

  67. Scipy
    Numpy Code
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)

    View Slide

  68. Scipy
    from scipy.spatial import cKDTree
    def find_labels(points, centers):
    distances, labels = cKDTree(centers).query(points, 1)
    return labels
    KD-Tree Code Numpy Code
    import numpy as np
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    KD-Tree:
    Data structure designed for nearest neighbor searches

    View Slide

  69. Numpy Code
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    return np.array([points[labels == i].mean(0)
    for i in range(n_centers)])

    View Slide

  70. Pandas Code Numpy Code
    import pandas as pd
    def compute_centers(points, labels):
    df = pd.DataFrame(points)
    return df.groupby(labels).mean().values
    def compute_centers(points, labels):
    n_centers = len(set(labels))
    return np.array([points[labels == i].mean(0)
    for i in range(n_centers)])
    Pandas Dataframe:
    Efficient structure for group-wise operations

    View Slide

  71. %timeit kmeans(points, 10)
    102 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs,
    1 loop each)
    Compared to:
    - 7.44 Seconds in Python
    - 131 ms with NumPy
    Scipy

    View Slide

  72. Other Useful Data Structures
    scipy.spatial
    for spatial queries like distances, nearest neighbors, etc.
    pandas
    for SQL-like grouping & aggregation
    xarray
    for grouping across multiple dimensions
    scipy.sparse
    sparse matrices for 2-dimensional structured data
    sparse package
    for N-dimensional structured data
    scipy.sparse.csgraph
    for graph-like problems (e.g. finding shortest paths)

    View Slide

  73. Advantages:
    - Often fastest possible way to solve a particular
    problem
    Disadvantages:
    - Requires broad & deep understanding of both
    algorithms and their available implementations
    Recommendation: Use whenever possible!
    Scipy

    View Slide

  74. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython

    View Slide

  75. def dist(x, y):
    dist = 0
    for i in range(len(x)):
    dist += (x[i] - y[i]) ** 2
    return dist
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center) for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    122 ms ± 5.82 ms per loop (mean ± std. dev. of 7
    runs, 10 loops each)

    View Slide

  76. %%cython
    cimport numpy as np
    cdef double dist(double[:] x, double[:] y):
    cdef double dist = 0
    for i in range(len(x)):
    dist += (x[i] - y[i]) ** 2
    return dist
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center) for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7
    runs, 10 loops each)

    View Slide

  77. %%cython
    cimport numpy as np
    cdef double dist(double[:] x, double[:] y):
    cdef double dist = 0
    for i in range(len(x)):
    dist += (x[i] - y[i]) ** 2
    return dist
    def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center) for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7
    runs, 10 loops each)

    View Slide

  78. def find_labels(points, centers):
    labels = []
    for point in points:
    distances = [dist(point, center) for center in centers]
    labels.append(distances.index(min(distances)))
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7
    runs, 10 loops each)

    View Slide

  79. def find_labels(double[:, :] points, double[:, :] centers):
    cdef int n_points = points.shape[0]
    cdef int n_centers = centers.shape[0]
    cdef double[:] labels = np.zeros(n_points)
    cdef double distance, nearest_distance
    cdef int nearest_index
    for i in range(n_points):
    nearest_distance = np.inf
    for j in range(n_centers):
    distance = dist(points[i], centers[j])
    if distance < nearest_distance:
    nearest_distance = distance
    nearest_index = j
    labels[i] = nearest_index
    return np.asarray(labels)
    centers = points[:10]
    %timeit find_labels(points, centers)
    1.72 ms ± 27.3 µs per loop (mean ± std. dev. of 7
    runs, 1000 loops each)

    View Slide

  80. def find_labels(double[:, :] points, double[:, :] centers):
    cdef int n_points = points.shape[0]
    cdef int n_centers = centers.shape[0]
    cdef double[:] labels = np.zeros(n_points)
    cdef double distance, nearest_distance
    cdef int nearest_index
    for i in range(n_points):
    nearest_distance = np.inf
    for j in range(n_centers):
    distance = dist(points[i], centers[j])
    if distance < nearest_distance:
    nearest_distance = distance
    nearest_index = j
    labels[i] = nearest_index
    return np.asarray(labels)
    centers = points[:10]
    %timeit find_labels(points, centers)
    1.72 ms ± 27.3 µs per loop (mean ± std. dev. of 7
    runs, 1000 loops each)

    View Slide

  81. def find_labels(double[:, :] points, double[:, :] centers):
    cdef int n_points = points.shape[0]
    cdef int n_centers = centers.shape[0]
    cdef double[:] labels = np.zeros(n_points)
    cdef double distance, nearest_distance
    cdef int nearest_index
    for i in range(n_points):
    nearest_distance = np.inf
    for j in range(n_centers):
    distance = dist(points[i], centers[j])
    if distance < nearest_distance:
    nearest_distance = distance
    nearest_index = j
    labels[i] = nearest_index
    return np.asarray(labels)
    centers = points[:10]
    %timeit find_labels(points, centers)
    1.72 ms ± 27.3 µs per loop (mean ± std. dev. of 7
    runs, 1000 loops each)

    View Slide

  82. Advantages:
    - Python-like code at C-like speeds!
    Disadvantages:
    - Explicit type annotation can be cumbersome
    - Often requires restructuring code
    - Code build becomes more complicated
    Recommendation: use for operations that can’t easily
    be expressed in NumPy

    View Slide

  83. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython
    5. Numba

    View Slide

  84. def dist(x, y):
    dist = 0
    for i in range(len(x)):
    dist += (x[i] - y[i]) ** 2
    return dist
    def find_labels(points, centers):
    labels = []
    min_dist = np.inf
    min_label = 0
    for i in range(len(points)):
    for j in range(len(centers)):
    distance = dist(points[i], centers[j])
    if distance < min_dist:
    min_dist
    , min_label = distance, j
    labels.append(min_label)
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7
    runs, 10 loops each)

    View Slide

  85. import numba
    @numba.jit(nopython=True)
    def dist(x, y):
    dist = 0
    for i in range(len(x)):
    dist += (x[i] - y[i]) ** 2
    return dist
    @numba.jit(nopython=True)
    def find_labels(points, centers):
    labels = []
    min_dist = np.inf
    min_label = 0
    for i in range(len(points)):
    for j in range(len(centers)):
    distance = dist(points[i], centers[j])
    if distance < min_dist:
    min_dist
    , min_label = distance, j
    labels.append(min_label)
    return labels
    centers = points[:10]
    %timeit find_labels(points, centers)
    1.47 ms ± 14.2 µs per loop (mean ± std. dev. of 7
    runs, 1000 loops each)

    View Slide

  86. Advantages:
    - Python code JIT-compiled to fortran speeds!
    Disadvantages:
    - Heavy dependency chain (LLVM)
    - Some Python constructs not supported
    - Still a bit finnicky
    Recommendation: use for analysis scripts where
    dependencies are not a concern.
    See my blog post Optimizing Python in the Real World: NumPy, Numba, and the NUFFT
    http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

    View Slide

  87. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython
    5. Numba
    6. Dask

    View Slide

  88. Parallel Computation:
    http://dask.pydata.org/
    import numpy as np
    a = np.random.randn(1000)
    b = a * 4
    b_min = b.min()
    print(b_min)
    -13.2982888603
    Typical data manipulation with NumPy:

    View Slide

  89. Parallel Computation:
    http://dask.pydata.org/
    import dask.array as da
    a2 = da.from_array(a, chunks=200)
    b2 = a2 * 4
    b2_min = b2.min()
    print(b2_min)
    dask.arraydtype=float64, chunksize=()>
    Same operation with dask

    View Slide

  90. Parallel Computation:
    http://dask.pydata.org/
    import dask.array as da
    a2 = da.from_array(a, chunks=200)
    b2 = a2 * 4
    b2_min = b2.min()
    print(b2_min)
    dask.arraydtype=float64, chunksize=()>
    Same operation with dask
    “Task Graph”

    View Slide

  91. Parallel Computation:
    http://dask.pydata.org/
    import dask.array as da
    a2 = da.from_array(a, chunks=200)
    b2 = a2 * 4
    b2_min = b2.min()
    print(b2_min)
    dask.arraydtype=float64, chunksize=()>
    Same operation with dask
    b2_min.compute()
    -13.298288860312757

    View Slide

  92. def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    labels = find_labels(points, centers)

    View Slide

  93. from dask import array as da
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers) ** 2
    distances = diff.sum(-1)
    return distances.argmin(1)
    points = da.from_array(points, chunks=1000)
    centers = da.from_array(centers, chunks=5)
    labels = find_labels(points, centers)

    View Slide

  94. from dask import array as da
    def find_labels(points, centers):
    diff = (points[:, None, :] - centers)
    distances = (diff ** 2).sum(-1)
    return distances.argmin(1)
    points_dask = da.from_array(points, chunks=1000)
    centers_dask = da.from_array(centers, chunks=5)
    labels = find_labels(points_dask, centers_dask)

    View Slide

  95. def find_labels(points, centers):
    diff = (points[:, None, :] - centers)
    distances = (diff ** 2).sum(-1)
    return distances.argmin(1)
    def compute_centers(points, labels):
    points_df = dd.from_dask_array(points)
    labels_df = dd.from_dask_array(labels)
    return points_df.groupby(labels_df).mean()
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:]
    points = da.from_array(points, 1000)
    while True:
    old_centers = centers
    labels = find_labels(points, da.from_array(centers, 5))
    centers = compute_centers(points, labels)
    centers = centers.compute().values
    if np.all(centers == old_centers):
    break
    return labels.compute()
    %timeit kmeans(points, 10)
    3.28 s ± 192 ms per loop (mean ± std. dev. of 7 runs)
    Full, Parallelized K-Means

    View Slide

  96. def find_labels(points, centers):
    diff = (points[:, None, :] - centers)
    distances = (diff ** 2).sum(-1)
    return distances.argmin(1)
    def compute_centers(points, labels):
    points_df = dd.from_dask_array(points)
    labels_df = dd.from_dask_array(labels)
    return points_df.groupby(labels_df).mean()
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:]
    points = da.from_array(points, 1000)
    while True:
    old_centers = centers
    labels = find_labels(points, da.from_array(centers, 5))
    centers = compute_centers(points, labels)
    centers = centers.compute().values
    if np.all(centers == old_centers):
    break
    return labels.compute()
    %timeit kmeans(points, 10)
    3.28 s ± 192 ms per loop (mean ± std. dev. of 7 runs)
    Full, Parallelized K-Means

    View Slide

  97. def find_labels(points, centers):
    diff = (points[:, None, :] - centers)
    distances = (diff ** 2).sum(-1)
    return distances.argmin(1)
    def compute_centers(points, labels):
    points_df = dd.from_dask_array(points)
    labels_df = dd.from_dask_array(labels)
    return points_df.groupby(labels_df).mean()
    def kmeans(points, n_clusters):
    centers = points[-n_clusters:]
    points = da.from_array(points, 1000)
    while True:
    old_centers = centers
    labels = find_labels(points, da.from_array(centers, 5))
    centers = compute_centers(points, labels)
    centers = centers.compute().values
    if np.all(centers == old_centers):
    break
    return labels.compute()
    %timeit kmeans(points, 10)
    3.28 s ± 192 ms per loop (mean ± std. dev. of 7 runs)
    Full, Parallelized K-Means

    View Slide

  98. Advantages:
    - Easy distributed coding, often with no change to
    NumPy or Pandas code!
    - Even works locally on out-of-core data
    Disadvantages:
    - High overhead, so not suitable for smaller problems
    Recommendation: use when data size or
    computation time warrants
    See my blog post Out of Core Dataframes in Python: Dask and OpenStreetMap
    http://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/

    View Slide

  99. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython
    5. Numba
    6. Dask

    View Slide

  100. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython
    5. Numba
    6. Dask

    View Slide

  101. Seven Strategies:
    1. Line Profiling
    2. Numpy Vectorization
    3. Specialized Data Structures
    4. Cython
    5. Numba
    6. Dask
    7. Find an Existing Implementation!

    View Slide

  102. from sklearn.cluster import KMeans
    %timeit KMeans(4).fit_predict(points)
    28.5 ms ± 701 µs per loop (mean ± std.
    dev. of 7 runs, 10 loops each)
    from dask_ml.cluster import KMeans
    %timeit KMeans(4).fit(points).predict(points)
    8.7 s ± 202 ms per loop (mean ± std.
    dev. of 7 runs, 1 loop each)
    ML
    Recommendation: resist the urge to reinvent the wheel.

    View Slide

  103. You can implement it yourself . . .
    you can make your numerical code fast!
    But the community is Python’s greatest strength.

    View Slide

  104. Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/
    Thank You!
    Slides available at https://speakerdeck.com/jakevdp/
    Notebook with code from this talk: http:goo.gl/d8ZWwp

    View Slide