Dask - Out-of-core NumPy/Pandas through Task Scheduling

Talk given at SciPy 2015.
Video: https://youtu.be/1kkFZ4P-XHg

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. In this talk we describe dask, dask.array, dask.dataframe, as well as task scheduling generally.

Docs: http://dask.pydata.org/en/latest/
Github: https://github.com/ContinuumIO/dask

Jim Crist

July 08, 2015

Transcript

  1. Ocean Temperature Data
     • Daily mean ocean temperature every 1/4 degree
     • 720 x 1440 array every day
     • http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
  2. One year’s worth

     from netCDF4 import Dataset
     import matplotlib.pyplot as plt
     from numpy import flipud

     data = Dataset("sst.day.mean.2015.v2.nc").variables["sst"]
     year_mean = data[:].mean(axis=0)

     plt.imshow(flipud(year_mean), cmap="RdBu_r")
     plt.title("Average Global Ocean Temperature, 2015")
  3-4. 36 years’ worth

     $ ls
     sst.day.mean.1981.v2.nc  sst.day.mean.1993.v2.nc  sst.day.mean.2005.v2.nc
     sst.day.mean.1982.v2.nc  sst.day.mean.1994.v2.nc  sst.day.mean.2006.v2.nc
     sst.day.mean.1983.v2.nc  sst.day.mean.1995.v2.nc  sst.day.mean.2007.v2.nc
     sst.day.mean.1984.v2.nc  sst.day.mean.1996.v2.nc  sst.day.mean.2008.v2.nc
     sst.day.mean.1985.v2.nc  sst.day.mean.1997.v2.nc  sst.day.mean.2009.v2.nc
     ...                      ...                      ...

     $ du -h
     15G    .

     720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed!
  5. Blocked Algorithms: blocked mean

     import h5py
     import numpy as np

     x = h5py.File('myfile.hdf5')['x']         # Trillion element array on disk
     sums = []
     counts = []
     for i in range(1000000):                  # One million times
         chunk = x[1000000*i: 1000000*(i+1)]   # Pull out chunk
         sums.append(np.sum(chunk))            # Sum chunk
         counts.append(len(chunk))             # Count chunk
     result = sum(sums) / sum(counts)          # Aggregate results
  6-7. Blocked algorithms allow for:
     • parallelism
     • lower RAM usage
     The trick is figuring out how to break the computation into blocks.
     This is where dask comes in.
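
     A hedged sketch (not from the slides): the blocked mean of slide 5 written
     with dask.array. The file name, dataset name, and chunk size are the same
     hypothetical values used there; dask generates the per-chunk sums and
     counts itself and runs them in parallel.

     import h5py
     import dask.array as da

     x = h5py.File('myfile.hdf5')['x']       # trillion element array on disk
     d = da.from_array(x, chunks=1000000)    # view it as million-element blocks
     result = d.mean().compute()             # per-block sums/counts, then aggregate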
  8-9. Dask is:
     • A parallel computing framework
     • That leverages the excellent Python ecosystem
     • Using blocked algorithms and task scheduling
     • Written in pure Python
  10-11. dask.array: an out-of-core, parallel, n-dimensional array library
     • Copies the numpy interface
     • Arithmetic: +, *, …
     • Reductions: mean, max, …
     • Slicing: x[10:, 100:50:-2]
     • Fancy indexing: x[:, [3, 1, 2]]
     • Some linear algebra: tensordot, qr, svd, …
     New operations:
     • Parallel algorithms (approximate quantiles, topk, …)
     • Slightly overlapping arrays
     • Integration with HDF5
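
     A brief illustrative sketch (not from the slides) of the interface listed
     above; the array contents and chunk sizes are arbitrary assumptions.

     import numpy as np
     import dask.array as da

     x = da.from_array(np.random.random((1000, 1000)), chunks=(250, 1000))

     y = (x + 1) * 2                 # elementwise arithmetic, evaluated lazily
     m = y.mean(axis=0)              # reduction
     s = y[10:, 100:50:-2]           # slicing
     f = y[:, [3, 1, 2]]             # fancy indexing
     q, r = da.linalg.qr(x)          # QR on an array chunked along one axis only

     print(m.compute()[:5])          # nothing is computed until .compute()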
  12. Out-of-core arrays

     import dask.array as da
     from netCDF4 import Dataset
     from glob import glob
     from numpy import flipud
     import matplotlib.pyplot as plt

     files = sorted(glob('*.nc'))
     data = [Dataset(f).variables['sst'] for f in files]
     arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data]
     x = da.concatenate(arrs, axis=0)
     full_mean = x.mean(axis=0)

     plt.imshow(flipud(full_mean), cmap='RdBu_r')
     plt.title('Average Global Ocean Temperature, 1981-2015')
  13. dask.dataframe
     • Out-of-core, blocked parallel DataFrame
     • Mirrors the pandas interface
     • Only implements a subset of pandas operations (currently)
  14. dask.dataframe: efficient operations
     • Elementwise operations: df.x + df.y
     • Row-wise selections: df[df.x > 0]
     • Aggregations: df.x.max()
     • groupby-aggregate: df.groupby(df.x).y.max()
     • Value counts: df.x.value_counts()
     • Drop duplicates: df.x.drop_duplicates()
     • Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
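
     A hedged sketch (not from the talk) tying a few of these efficient
     operations together; the file pattern and column names are placeholders.

     import dask.dataframe as dd

     df = dd.read_csv('data-*.csv')              # one pandas DataFrame per block

     df['total'] = df.x + df.y                   # elementwise operation
     positive = df[df.x > 0]                     # row-wise selection
     group_max = df.groupby(df.name).y.max()     # groupby-aggregate
     counts = df.name.value_counts()             # value counts

     print(group_max.compute())                  # blocks are only read here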
  15. dask.dataframe: less efficient operations (require a shuffle unless on the index)
     • Set index: df.set_index(df.x)
     • groupby-apply
     • Join not on the index: dd.merge(df1, df2, on='name')
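
     A hedged illustration (not from the talk) of the shuffle trade-off: a join
     on an ordinary column shuffles both dataframes, while setting the index
     pays that cost once and makes later on-index joins cheap. The toy frames
     and the 'name' column are placeholders.

     import pandas as pd
     import dask.dataframe as dd

     df1 = dd.from_pandas(pd.DataFrame({'name': ['a', 'b', 'c'], 'x': [1, 2, 3]}),
                          npartitions=2)
     df2 = dd.from_pandas(pd.DataFrame({'name': ['a', 'b', 'c'], 'y': [4, 5, 6]}),
                          npartitions=2)

     slow = dd.merge(df1, df2, on='name')        # requires a shuffle

     df1i = df1.set_index('name')                # shuffle once...
     df2i = df2.set_index('name')
     fast = dd.merge(df1i, df2i,                 # ...then join on the index
                     left_index=True, right_index=True)

     print(fast.compute())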
  16. Out-of-core dataframes
     • Yearly csvs of all American flights since 1990
     • Contains information on times, airlines, locations, etc…
     • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
  17. Out-of-core dataframes

     >>> import dask.dataframe as dd

     # Create a dataframe from csv files
     >>> df = dd.read_csv('*.csv', usecols=['Origin', 'DepTime', 'CRSDepTime', 'Cancelled'])

     # Get time series of non-cancelled and delayed flights
     >>> not_cancelled = df[df.Cancelled != 1]
     >>> delayed = not_cancelled[not_cancelled.DepTime > not_cancelled.CRSDepTime]

     # Count total and delayed flights per airport
     >>> total_per_airport = not_cancelled.Origin.value_counts()
     >>> delayed_per_airport = delayed.Origin.value_counts()

     # Calculate percent delayed per airport
     >>> percent_delayed = delayed_per_airport / total_per_airport

     # Remove airports that had less than 500 flights a year on average
     >>> out = percent_delayed[total_per_airport > 10000]
  18. Out-of-core dataframes

     # Convert to pandas dataframe, sort, and output top 10
     >>> result = out.compute()
     >>> result.sort(ascending=False)
     >>> result.head(10)
     ATL    0.538589
     PIT    0.515708
     ORD    0.513163
     PHL    0.508329
     DFW    0.506470
     CLT    0.501259
     DEN    0.474589
     JFK    0.453212
     SFO    0.452156
     CVG    0.452117
     dtype: float64
  19. Out-of-core dataframes
     • 10 GB on disk
     • Need to read ~4 GB subset to perform computation
     • Max memory during computation is only 0.75 GB
  20. • Collections build task graphs
      • Schedulers execute task graphs
      • Graph specification = uniting interface
  21. Dask Specification
     • Dictionary of {name: task}
     • Tasks are tuples of (func, args...) (lispy syntax)
     • Args can be names, values, or tasks

     Python Code:
     a = 1
     b = 2
     x = inc(a)
     y = inc(b)
     z = mul(x, y)

     Dask Graph:
     dsk = {"a": 1,
            "b": 2,
            "x": (inc, "a"),
            "y": (inc, "b"),
            "z": (mul, "x", "y")}
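
     A hedged sketch (not on the slide) of how such a graph executes: any dask
     scheduler's get function takes the dictionary and a key. inc and mul are
     defined here since the slide assumes them.

     from operator import mul
     import dask
     from dask.threaded import get as threaded_get

     def inc(i):
         return i + 1

     dsk = {"a": 1,
            "b": 2,
            "x": (inc, "a"),
            "y": (inc, "b"),
            "z": (mul, "x", "y")}

     print(dask.get(dsk, "z"))       # synchronous scheduler -> 6
     print(threaded_get(dsk, "z"))   # thread pool scheduler -> 6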
  22. Can create graphs directly

     def load(filename):
         ...

     def clean(data):
         ...

     def analyze(sequence_of_data):
         ...

     def store(result):
         with open(..., 'w') as f:
             f.write(result)

     dsk = {'load-1': (load, 'myfile.a.data'),
            'load-2': (load, 'myfile.b.data'),
            'load-3': (load, 'myfile.c.data'),
            'clean-1': (clean, 'load-1'),
            'clean-2': (clean, 'load-2'),
            'clean-3': (clean, 'load-3'),
            'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
            'store': (store, 'analyze')}
  23-25. Takeaways
     • Python can still handle large data using blocked algorithms
     • Dask collections form task graphs expressing these algorithms
     • Dask schedulers execute these graphs in parallel
     • Dask graphs can be directly created for custom pipelines