Jake VanderPlas - Performance Python: Seven Strategies for Optimizing Your Numerical Code

Python provides a powerful platform for working with data, but often the most straightforward data analysis can be painfully slow. When used effectively, though, Python can be as fast as even compiled languages like C. This talk presents an overview of how to effectively approach optimization of numerical code in Python, touching on tools like numpy, pandas, scipy, cython, numba, and more.

https://us.pycon.org/2018/schedule/presentation/112/

PyCon 2018

May 11, 2018

Transcript

  1. Fortran is 100x faster for this simple task! Python is Slow: CPython has constant overhead per operation.

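      That per-operation overhead can be seen without Fortran; a minimal sketch (not from the slides; the array size is illustrative and relative timings are only rough expectations):

      import numpy as np

      a = np.random.randn(1_000_000)

      def python_sum(arr):
          # Each iteration pays CPython's interpretation and boxing cost per element.
          total = 0.0
          for x in arr:
              total += x
          return total

      # In IPython/Jupyter:
      #   %timeit python_sum(a)   # pays Python overhead per element
      #   %timeit np.sum(a)       # typically orders of magnitude faster: one compiled call
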
  2. Example: K-means Clustering

     Algorithm:
     1. Choose some cluster centers
     2. Repeat:
        a. Assign points to nearest center
        b. Update center to mean of points
        c. Check if converged

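      Every timing below assumes a 2-D `points` array already exists. The transcript does not show how it was generated; a minimal sketch of comparable synthetic input (sizes and seed are assumptions, not from the talk):

      import numpy as np

      rng = np.random.RandomState(42)
      # ~10,000 points in 2 dimensions, loosely grouped around 10 random centers
      true_centers = rng.randn(10, 2) * 5
      points = np.concatenate([c + rng.randn(1000, 2) for c in true_centers])
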
  23. Python Implementation

      def dist(x, y):
          return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

      def find_labels(points, centers):
          labels = []
          for point in points:
              distances = [dist(point, center) for center in centers]
              labels.append(distances.index(min(distances)))
          return labels

  27. Python Implementation

      def compute_centers(points, labels):
          n_centers = len(set(labels))
          n_dims = len(points[0])
          centers = [[0 for i in range(n_dims)] for j in range(n_centers)]
          counts = [0 for j in range(n_centers)]
          for label, point in zip(labels, points):
              counts[label] += 1
              centers[label] = [a + b for a, b in zip(centers[label], point)]
          return [[x / count for x in center]
                  for center, count in zip(centers, counts)]

  32. Python Implementation

      def kmeans(points, n_clusters):
          centers = points[-n_clusters:].tolist()
          while True:
              old_centers = centers
              labels = find_labels(points, centers)
              centers = compute_centers(points, labels)
              if centers == old_centers:
                  break
          return labels

      %timeit kmeans(points, 10)
      7.44 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  38. “Premature optimization is the root of all evil” - Donald Knuth

      Seven Strategies:
      1. Line Profiling

  39. Line Profiling

      %load_ext line_profiler
      %lprun -f kmeans kmeans(points, 10)

      Timer unit: 1e-06 s
      Total time: 11.8153 s
      File: <ipython-input-1-0104920df0d9>
      Function: kmeans at line 27

      Line #   Hits        Time   Per Hit  % Time  Line Contents
      ==============================================================
          27                                       def kmeans(points, n_clusters):
          28      1          16      16.0     0.0      centers = points[-n_clusters:].
          29      1           2       2.0     0.0      while True:
          30     54          55       1.0     0.0          old_centers = centers
          31     54    11012265  203930.8    93.2          labels = find_labels(points
          32     54      802873   14868.0     6.8          centers = compute_centers(p labels)
          33     54         116       2.1     0.0          if centers == old_centers:
          34      1           0       0.0     0.0              break
          35      1           1       1.0     0.0          return labels

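      The %lprun magic above is the notebook interface to line_profiler. For completeness (not shown in the slides), the same profiler can be driven from the command line; script and function names here are illustrative:

      # profile_demo.py -- illustrative standalone use of line_profiler
      @profile            # this name is injected by kernprof at run time; no import needed
      def slow_sum(n):
          total = 0
          for i in range(n):
              total += i * i
          return total

      if __name__ == "__main__":
          slow_sum(1_000_000)

      # Run with:  kernprof -l -v profile_demo.py
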
  41. Original Code:

      def dist(x, y):
          return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

      def find_labels(points, centers):
          labels = []
          for point in points:
              distances = [dist(point, center) for center in centers]
              labels.append(distances.index(min(distances)))
          return labels

      Numpy Code:

      import numpy as np

      def find_labels(points, centers):
          diff = (points[:, None, :] - centers) ** 2
          distances = diff.sum(-1)
          return distances.argmin(1)

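      The NumPy version works by broadcasting: `points[:, None, :]` inserts a length-1 axis so the subtraction pairs every point with every center. A small shape check, with illustrative array sizes (not from the slides):

      import numpy as np

      pts = np.random.randn(5, 2)    # 5 points in 2-D (illustrative sizes)
      ctrs = np.random.randn(3, 2)   # 3 centers

      diff = pts[:, None, :] - ctrs        # shape (5, 1, 2) - (3, 2) -> (5, 3, 2)
      distances = (diff ** 2).sum(-1)      # shape (5, 3): one squared distance per point/center pair
      labels = distances.argmin(1)         # shape (5,): index of the nearest center for each point
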
  45. %lprun -f kmeans kmeans(points, 10)

      Timer unit: 1e-06 s
      Total time: 0.960594 s
      File: <ipython-input-5-72ba0feed022>
      Function: kmeans at line 23

      Line #   Hits        Time   Per Hit  % Time  Line Contents
      ==============================================================
          23                                       def kmeans(points, n_clusters):
          24      1          11      11.0     0.0      centers = points[-n_clusters:].
          25      1           0       0.0     0.0      while True:
          26     54          50       0.9     0.0          old_centers = centers
          27     54       87758    1625.1     9.1          labels = find_labels(points
          28     54      872625   16159.7    90.8          centers = compute_centers(p
          29     54         149       2.8     0.0          if centers == old_centers:
          30      1           1       1.0     0.0              break
          31      1           0       0.0     0.0          return labels

  48. Original Code:

      def compute_centers(points, labels):
          n_centers = len(set(labels))
          n_dims = len(points[0])
          centers = [[0 for i in range(n_dims)] for j in range(n_centers)]
          counts = [0 for j in range(n_centers)]
          for label, point in zip(labels, points):
              counts[label] += 1
              centers[label] = [a + b for a, b in zip(centers[label], point)]
          return [[x / count for x in center]
                  for center, count in zip(centers, counts)]

      Numpy Code:

      def compute_centers(points, labels):
          n_centers = len(set(labels))
          return np.array([points[labels == i].mean(0)
                           for i in range(n_centers)])

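      The NumPy compute_centers above still loops over clusters in Python. As a further, hedged illustration of the vectorization strategy (not shown in the talk), the same result can be obtained with np.add.at and np.bincount:

      import numpy as np

      def compute_centers_vectorized(points, labels):
          # Assumes labels is an integer array with values in [0, n_centers)
          n_centers = labels.max() + 1
          sums = np.zeros((n_centers, points.shape[1]))
          np.add.at(sums, labels, points)                  # accumulate point coordinates per label
          counts = np.bincount(labels, minlength=n_centers)
          return sums / counts[:, None]                    # mean per cluster
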
  51. %timeit kmeans(points, 10)
      131 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

      Down from 7.44 seconds to 0.13 seconds!
      Key: repeated operations pushed into a compiled layer: Python overhead per array rather than per array element.

  52. Advantages:
      - Python overhead per array rather than per array element
      - Compact domain-specific language for array operations
      - NumPy is widely available

      Disadvantages:
      - Batch operations can lead to excessive memory usage (see the sketch after this slide)
      - Different way of thinking about writing code

      Recommendation: Use NumPy everywhere!

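      To make the memory caveat concrete (the numbers are illustrative, not from the talk): the broadcast intermediate in the NumPy find_labels has shape (n_points, n_centers, n_dims), which can be far larger than the input:

      n_points, n_centers, n_dims = 1_000_000, 100, 3
      input_mb = n_points * n_dims * 8 / 1e6                      # float64 points: ~24 MB
      intermediate_mb = n_points * n_centers * n_dims * 8 / 1e6   # diff array: ~2400 MB
      print(input_mb, intermediate_mb)
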
  54. Scipy

      KD-Tree: data structure designed for nearest neighbor searches

      Numpy Code:

      import numpy as np

      def find_labels(points, centers):
          diff = (points[:, None, :] - centers) ** 2
          distances = diff.sum(-1)
          return distances.argmin(1)

      KD-Tree Code:

      from scipy.spatial import cKDTree

      def find_labels(points, centers):
          distances, labels = cKDTree(centers).query(points, 1)
          return labels

  55. Pandas DataFrame: efficient structure for group-wise operations

      Numpy Code:

      def compute_centers(points, labels):
          n_centers = len(set(labels))
          return np.array([points[labels == i].mean(0)
                           for i in range(n_centers)])

      Pandas Code:

      import pandas as pd

      def compute_centers(points, labels):
          df = pd.DataFrame(points)
          return df.groupby(labels).mean().values

  56. Scipy

      %timeit kmeans(points, 10)
      102 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      Compared to:
      - 7.44 seconds in Python
      - 131 ms with NumPy

  57. Other Useful Data Structures

      - scipy.spatial for spatial queries like distances, nearest neighbors, etc.
      - pandas for SQL-like grouping & aggregation
      - xarray for grouping across multiple dimensions
      - scipy.sparse sparse matrices for 2-dimensional structured data
      - sparse package for N-dimensional structured data
      - scipy.sparse.csgraph for graph-like problems, e.g. finding shortest paths
        (a small sketch follows below)

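      As a small, hedged example of the last item above (not from the slides): a weighted graph stored as a scipy sparse matrix, with shortest paths computed by scipy.sparse.csgraph:

      import numpy as np
      from scipy.sparse import csr_matrix
      from scipy.sparse.csgraph import shortest_path

      # Edge weights of a 4-node directed graph (illustrative values; 0 means no edge)
      graph = csr_matrix(np.array([[0, 1, 4, 0],
                                   [0, 0, 2, 6],
                                   [0, 0, 0, 3],
                                   [0, 0, 0, 0]]))
      dist_matrix = shortest_path(graph, method="D")  # Dijkstra's algorithm
      print(dist_matrix[0])                           # distances from node 0 to every node
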
  58. Scipy

      Advantages:
      - Often fastest possible way to solve a particular problem

      Disadvantages:
      - Requires broad & deep understanding of both algorithms and their available implementations

      Recommendation: Use whenever possible!

  59. def dist(x, y):
          dist = 0
          for i in range(len(x)):
              dist += (x[i] - y[i]) ** 2
          return dist

      def find_labels(points, centers):
          labels = []
          for point in points:
              distances = [dist(point, center) for center in centers]
              labels.append(distances.index(min(distances)))
          return labels

      centers = points[:10]
      %timeit find_labels(points, centers)
      122 ms ± 5.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

  60. %%cython
      cimport numpy as np

      cdef double dist(double[:] x, double[:] y):
          cdef double dist = 0
          for i in range(len(x)):
              dist += (x[i] - y[i]) ** 2
          return dist

      def find_labels(points, centers):
          labels = []
          for point in points:
              distances = [dist(point, center) for center in centers]
              labels.append(distances.index(min(distances)))
          return labels

      centers = points[:10]
      %timeit find_labels(points, centers)
      97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

  63. def find_labels(double[:, :] points, double[:, :] centers):
          cdef int n_points = points.shape[0]
          cdef int n_centers = centers.shape[0]
          cdef double[:] labels = np.zeros(n_points)
          cdef double distance, nearest_distance
          cdef int nearest_index
          for i in range(n_points):
              nearest_distance = np.inf
              for j in range(n_centers):
                  distance = dist(points[i], centers[j])
                  if distance < nearest_distance:
                      nearest_distance = distance
                      nearest_index = j
              labels[i] = nearest_index
          return np.asarray(labels)

      centers = points[:10]
      %timeit find_labels(points, centers)
      1.72 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

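      The typed find_labels above only runs together with a typed dist and the NumPy imports; a self-contained %%cython cell might look like the sketch below. The boundscheck/wraparound decorators and the combined layout are assumptions, not shown in the slides:

      %%cython
      import numpy as np
      cimport numpy as np
      cimport cython

      cdef double dist(double[:] x, double[:] y):
          cdef double d = 0
          cdef int i
          for i in range(x.shape[0]):
              d += (x[i] - y[i]) ** 2
          return d

      @cython.boundscheck(False)
      @cython.wraparound(False)
      def find_labels(double[:, :] points, double[:, :] centers):
          cdef int n_points = points.shape[0]
          cdef int n_centers = centers.shape[0]
          cdef double[:] labels = np.zeros(n_points)
          cdef double distance, nearest_distance
          cdef int i, j, nearest_index
          for i in range(n_points):
              nearest_distance = np.inf
              for j in range(n_centers):
                  distance = dist(points[i], centers[j])
                  if distance < nearest_distance:
                      nearest_distance = distance
                      nearest_index = j
              labels[i] = nearest_index
          return np.asarray(labels)
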
  66. Advantages:
      - Python-like code at C-like speeds!

      Disadvantages:
      - Explicit type annotation can be cumbersome
      - Often requires restructuring code
      - Code build becomes more complicated (a minimal build sketch follows below)

      Recommendation: use for operations that can’t easily be expressed in NumPy

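      Outside a notebook, the extra build step usually means compiling the .pyx file yourself. A minimal sketch, assuming the Cython functions above were saved (without the %%cython line) as kmeans_cy.pyx; all file and module names are illustrative:

      # setup.py -- minimal Cython build sketch
      from setuptools import setup
      from Cython.Build import cythonize
      import numpy as np

      setup(
          ext_modules=cythonize("kmeans_cy.pyx"),
          include_dirs=[np.get_include()],   # needed because the module cimports numpy
      )

      # Build in place with:  python setup.py build_ext --inplace
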
  67. def dist(x, y):
          dist = 0
          for i in range(len(x)):
              dist += (x[i] - y[i]) ** 2
          return dist

      def find_labels(points, centers):
          labels = []
          for i in range(len(points)):
              min_dist = np.inf
              min_label = 0
              for j in range(len(centers)):
                  distance = dist(points[i], centers[j])
                  if distance < min_dist:
                      min_dist, min_label = distance, j
              labels.append(min_label)
          return labels

      centers = points[:10]
      %timeit find_labels(points, centers)
      97.7 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

  68. import numba

      @numba.jit(nopython=True)
      def dist(x, y):
          dist = 0
          for i in range(len(x)):
              dist += (x[i] - y[i]) ** 2
          return dist

      @numba.jit(nopython=True)
      def find_labels(points, centers):
          labels = []
          for i in range(len(points)):
              min_dist = np.inf
              min_label = 0
              for j in range(len(centers)):
                  distance = dist(points[i], centers[j])
                  if distance < min_dist:
                      min_dist, min_label = distance, j
              labels.append(min_label)
          return labels

      centers = points[:10]
      %timeit find_labels(points, centers)
      1.47 ms ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

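      A practical aside not stated on the slide: %timeit reports steady-state speed, while the first call to a @numba.jit function also pays a one-time compilation cost. A minimal sketch:

      import time
      import numpy as np
      import numba

      @numba.jit(nopython=True)
      def total(x):
          s = 0.0
          for v in x:
              s += v
          return s

      a = np.random.randn(1_000_000)
      t0 = time.perf_counter(); total(a); t1 = time.perf_counter()   # includes JIT compilation
      t2 = time.perf_counter(); total(a); t3 = time.perf_counter()   # compiled call only
      print(f"first call: {t1 - t0:.3f} s, second call: {t3 - t2:.6f} s")
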
  69. Advantages:
      - Python code JIT-compiled to Fortran speeds!

      Disadvantages:
      - Heavy dependency chain (LLVM)
      - Some Python constructs not supported
      - Still a bit finicky

      Recommendation: use for analysis scripts where dependencies are not a concern.

      See my blog post “Optimizing Python in the Real World: NumPy, Numba, and the NUFFT”
      http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

  70. Parallel Computation: http://dask.pydata.org/

      Typical data manipulation with NumPy:

      import numpy as np
      a = np.random.randn(1000)
      b = a * 4
      b_min = b.min()
      print(b_min)
      -13.2982888603

  73. Parallel Computation: http://dask.pydata.org/

      Same operation with dask:

      import dask.array as da
      a2 = da.from_array(a, chunks=200)
      b2 = a2 * 4
      b2_min = b2.min()
      print(b2_min)
      dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=()>

      b2_min.compute()
      -13.298288860312757

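      Nothing is computed until .compute() is called; dask only records a task graph. As a small aside not in the transcript's text, that graph can be rendered with .visualize() (requires the optional graphviz dependency):

      # Render the task graph behind b2_min to an image file (graphviz must be installed)
      b2_min.visualize(filename="task_graph.png")
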
  74. def find_labels(points, centers):
          diff = (points[:, None, :] - centers) ** 2
          distances = diff.sum(-1)
          return distances.argmin(1)

      labels = find_labels(points, centers)

  75. from dask import array as da

      def find_labels(points, centers):
          diff = (points[:, None, :] - centers) ** 2
          distances = diff.sum(-1)
          return distances.argmin(1)

      points = da.from_array(points, chunks=1000)
      centers = da.from_array(centers, chunks=5)
      labels = find_labels(points, centers)

  76. from dask import array as da

      def find_labels(points, centers):
          diff = (points[:, None, :] - centers)
          distances = (diff ** 2).sum(-1)
          return distances.argmin(1)

      points_dask = da.from_array(points, chunks=1000)
      centers_dask = da.from_array(centers, chunks=5)
      labels = find_labels(points_dask, centers_dask)

  77. Full, Parallelized K-Means

      import numpy as np
      import dask.array as da
      import dask.dataframe as dd

      def find_labels(points, centers):
          diff = (points[:, None, :] - centers)
          distances = (diff ** 2).sum(-1)
          return distances.argmin(1)

      def compute_centers(points, labels):
          points_df = dd.from_dask_array(points)
          labels_df = dd.from_dask_array(labels)
          return points_df.groupby(labels_df).mean()

      def kmeans(points, n_clusters):
          centers = points[-n_clusters:]
          points = da.from_array(points, 1000)
          while True:
              old_centers = centers
              labels = find_labels(points, da.from_array(centers, 5))
              centers = compute_centers(points, labels)
              centers = centers.compute().values
              if np.all(centers == old_centers):
                  break
          return labels.compute()

      %timeit kmeans(points, 10)
      3.28 s ± 192 ms per loop (mean ± std. dev. of 7 runs)

  80. Advantages:
      - Easy distributed coding, often with no change to NumPy or Pandas code!
      - Even works locally on out-of-core data

      Disadvantages:
      - High overhead, so not suitable for smaller problems

      Recommendation: use when data size or computation time warrants

      See my blog post “Out of Core Dataframes in Python: Dask and OpenStreetMap”
      http://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/

  81. Seven Strategies:
      1. Line Profiling
      2. Numpy Vectorization
      3. Specialized Data Structures
      4. Cython
      5. Numba
      6. Dask
      7. Find an Existing Implementation!

  82. from sklearn.cluster import KMeans
      %timeit KMeans(4).fit_predict(points)
      28.5 ms ± 701 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

      from dask_ml.cluster import KMeans
      %timeit KMeans(4).fit(points).predict(points)
      8.7 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

      Recommendation: resist the urge to reinvent the wheel.

  83. You can implement it yourself . . . you can make your numerical code fast!
      But the community is Python’s greatest strength.

  84. Thank You!

      Email: [email protected]
      Twitter: @jakevdp
      Github: jakevdp
      Web: http://vanderplas.com/
      Blog: http://jakevdp.github.io/

      Slides available at https://speakerdeck.com/jakevdp/
      Notebook with code from this talk: http://goo.gl/d8ZWwp