array on disk sums = [] counts = [] for i in range(1000000): # One million times chunk = x[1000000*i: 1000000*(i+1)] # Pull out chunk sums.append(np.sum(chunk)) # Sum chunk counts.append(len(chunk)) # Count chunk result = sum(sums) / sum(counts) # Aggregate results
from glob import glob from numpy import flipud import matplotlib.pyplot as plt files = sorted(glob('*.nc')) data = [Dataset(f).variables['sst'] for f in files] arrs = [da.from_array(x, chunks=(24, 360, 360)) for x in data] x = da.concatenate(arrs, axis=0) full_mean = x.mean(axis=0) plt.imshow(np.flipud(full_mean), cmap='RdBu_r') plt.title('Average Global Ocean Temperature, 1981-2015')
• Contains information on times, airlines, locations, etc… • Roughly 121 million rows, 11GB of data • http://www.transtats.bts.gov/Fields.asp?Table_ID=236
of (subreddit, body) b = db.from_castra('reddit.castra', columns=['subreddit', 'body'], npartitions=8) # Filter out comments not in r/MachineLearning matches_subreddit = b.filter(lambda x: x[0] == 'MachineLearning') # Convert each comment into a list of words, and concatenate words = matches_subreddit.pluck(1).map(to_words).concat() # Count the frequencies for each word, and take the top 100 top_words = words.frequencies().topk(100, key=1).compute() Example - Reddit Data http://blaze.pydata.org/blog/2015/09/08/reddit-comments/
... @do def analyze(sequence_of_data): ... @do def store(result): with open(..., 'w') as f: f.write(result) files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data'] loaded = [load(f) for f in files] cleaned = [clean(i) for i in loaded] analyzed = analyze(cleaned) stored = store(analyze)
a big-memory server may well provide better performance per dollar than a cluster.” “Nobody ever got fired for using Hadoop on a cluster” http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
algorithms • Dask collections form task graphs expressing these algorithms • Dask schedulers execute these graphs in parallel • Dask graphs can be directly created for custom pipelines