• Ph.D. from MIT, 2012
• Associate Prof. at Columbia / LDEO
• https://ocean-transport.github.io/
• Core developer of Xarray
• Core developer of Zarr
• Co-founder of Pangeo
• Open Source Advocate
This Talk
• Part I: … its benefits?
• Part II: Demo of Pangeo workflow in the Cloud
• Part III: Deep dive on the Pangeo technology stack
• Part IV: Future Challenges for Cloud Native Science
[Figure: "How do data scientists spend their time?" from the CrowdFlower Data Science Report (2016)]
• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Building training sets: 3%
• Other: 5%
The popular view of data scientists building algorithms, exploring data, and doing predictive analysis diverges significantly from reality: most of their time is spent cleaning and organizing data.
The "Download" Model
[Diagram: a) file-based approach — step 1: download files to local disk; step 2: clean / organize; step 3: analyze. b) database / API approach — query records from a remote service, save them as files on local disk, then clean / organize and analyze.]
Why Cloud Native?
Reproducibility • Creativity • Downstream Impacts • Access + Inclusion
• Massive computational resources available on demand; elastic scaling; high-throughput data storage for distributed processing.
• All collaborators around the world can access the same computational environment and data.
• Industry can exploit data more effectively if it's already in the cloud.
• Researchers are not constrained by local infrastructure (https://coessing.org/).
Cloud Native Science
• Analysis-Ready, Cloud-Optimized Data
• Data-Proximate Computing
• On-Demand, Scalable Distributed Computing
The Pangeo Community
✓ Students / Postdocs / Faculty / Software Devs / Data Scientists
✓ Academia / National Labs / Industry / NGOs
✓ Weather / Climate / Oceans / Geoscience
✓ US / UK / Europe / Australia
Participation in Pangeo is open to anyone! http://pangeo.io
The Pangeo Stack
• Zarr: cloud-optimized storage for multidimensional arrays.
• Dask: flexible, general-purpose parallel computing framework.
• Xarray: high-level API for analysis of multidimensional labelled arrays.
Plus: Kubernetes, object storage, and domain-specific packages.
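To make the stack concrete, here is a minimal sketch of how these layers combine in a typical workflow; the bucket path and variable name are placeholders, not a real Pangeo dataset:

```python
import xarray as xr
from dask.distributed import Client

# Start a Dask cluster (on Pangeo Cloud this would come from Dask Gateway instead)
client = Client()

# Open a Zarr store in object storage as a lazy, Dask-backed xarray Dataset
# (hypothetical bucket path; reading gs:// also requires gcsfs to be installed)
ds = xr.open_zarr("gs://example-bucket/example-dataset.zarr", consolidated=True)

# Analysis is expressed with labelled, high-level operations and evaluated lazily;
# Dask executes the chunked computation in parallel when .compute() is called
monthly_climatology = ds["sst"].groupby("time.month").mean("time")
result = monthly_climatology.compute()
```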
Pangeo Cloud Infrastructure
[Diagram: Compute — Dask Gateway launching clusters on autoscaling Kubernetes node pools (preemptible / spot instances plus normal instances). Data — Zarr datasets in object storage, each store holding metadata (.zarray, .zattrs) and chunk files (0.0, 0.1, 1.0, 1.1, 2.0, 2.1).]
http://catalog.pangeo.io
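A hedged sketch of the user-facing side of this infrastructure: requesting an on-demand Dask cluster through Dask Gateway, with the Kubernetes node pools autoscaling underneath (the cluster sizes here are illustrative):

```python
from dask_gateway import Gateway

# Connect to the deployment's Dask Gateway server (address is configured
# automatically on hubs like Pangeo Cloud)
gateway = Gateway()

# Request a cluster; workers are scheduled onto autoscaling node pools,
# typically cheaper preemptible / spot instances
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=20)  # grow and shrink with the workload

client = cluster.get_client()
print(client.dashboard_link)
```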
ARCO Data: Analysis Ready, Cloud Optimized
What is "Analysis Ready"?
• … for tedious homogenizing / cleaning steps
• Curated and cataloged
(Recall the CrowdFlower chart above: 60% of a data scientist's time goes to cleaning and organizing data.)
What is "Cloud Optimized"?
• … lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
ARCO Data example: sea surface height
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
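A hedged illustration of what those three properties look like from the user side; the store path and variable/coordinate names below are placeholders standing in for the sea surface height entry in the catalog linked above:

```python
import xarray as xr

# Placeholder path standing in for the catalog's sea surface height Zarr store
ds = xr.open_zarr("gs://example-bucket/sea_surface_height.zarr", consolidated=True)

print(ds)                 # one dataset object with rich metadata (coords, attrs, variables)
print(ds["sla"].chunks)   # chunked appropriately for parallel analysis

# Lazy, label-based subsetting: only the chunks that are actually needed
# get read from object storage
subset = ds["sla"].sel(time="2010", latitude=slice(-30, 30))
```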
Recap — the three pillars of Cloud Native Science: Analysis-Ready, Cloud-Optimized Data • Data-Proximate Computing • On-Demand, Scalable Distributed Computing
Challenge: Legacy Data Formats
• Legacy data formats (… FITS, ROOT, etc.) are NOT cloud-optimized (inefficient access on object storage)
• Adopting CO formats (e.g. Parquet, Zarr) is confusing to users and data providers
• Transcoding legacy data to ARCO format can be tedious and complicated
• Some clever hacks, e.g. kerchunk: https://github.com/fsspec/kerchunk (see the sketch below)
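For instance, a minimal kerchunk sketch (the file URL is hypothetical): scan a legacy NetCDF4/HDF5 file once and record the byte ranges of its internal chunks, so that Zarr-style readers can later access it in place on object storage:

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/legacy_file.nc"  # hypothetical NetCDF4/HDF5 file

# Scan the file and build a reference set mapping Zarr keys to byte ranges
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Save the references; fsspec's "reference" filesystem can then present the
# original file to xarray/Zarr as if it were a cloud-optimized store
with open("refs.json", "w") as out:
    json.dump(refs, out)
```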
Pangeo Forge
… data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format. Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages. We hope that Pangeo Forge can play the same role for datasets (see the sketch below).
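The recipe API itself has evolved over time, so rather than quote it, here is a rough sketch in plain xarray of the kind of ETL a Pangeo Forge recipe automates; the file names are hypothetical:

```python
import xarray as xr

# Extract: hypothetical NetCDF files pulled from a traditional data repository
paths = [f"downloads/sst_{year}.nc" for year in range(2000, 2005)]

# Transform: open lazily, combine along time, and rechunk for analysis
ds = xr.open_mfdataset(paths, combine="by_coords", chunks={"time": 100})
ds = ds.chunk({"time": 365})

# Load: write a single analysis-ready, cloud-optimized Zarr store
# (in Pangeo Forge the target would be a bucket in cloud object storage)
ds.to_zarr("sst_arco.zarr", mode="w", consolidated=True)
```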
Challenge: Storage and Egress Costs
• Fear of "lock in" to a specific cloud provider due to egress fees (~$50K to download 1 PB)
• Possible solutions: OSN / Internet2, Cloudflare
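As a sanity check on that number, assuming a typical list price of roughly $0.05 per GB egressed:

```python
# Back-of-the-envelope egress cost; the $/GB figure is an assumed typical list price
price_per_gb = 0.05                    # USD per GB (assumption)
gigabytes_in_pb = 1_000_000            # 1 PB in decimal units
print(f"~${price_per_gb * gigabytes_in_pb:,.0f} to egress 1 PB")  # ~$50,000
```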
Challenge: Funding Model
• … paid for by the local institution
• Since cloud infrastructure can scale to accommodate any number of users (up to the entire field), it's not clear who should pay for it
• My opinion: make it easily "franchisable" by allowing institutions to incrementally add capacity to a federation to support their users while still leveraging economies of scale