Charles Stern (Columbia / LDEO) • Joe Hamman (CarbonPlan) • Anderson Banihirwe (CarbonPlan) • Rachel Wegener (U. Maryland) • Chiara Lepore (GRO Intelligence) • Sean Harkins (Development Seed) • Alex Merose (Google Research) • Tom Augspurger (Microsoft) • Martin Durant (Anaconda) • Many recipe contributors
Funding: NSF EarthCube Program ($1.5M for 3 years)
The Open Science Vision
https://earthdata.nasa.gov/esds/open-science

for 👩🔬 in everyone_in_the_world:
    for 📄 in all_scientific_knowledge:
        👩🔬.verify(📄)
        discovery = 👩🔬.extend(📄)

This would transform the 🌎 by allowing all of humanity to participate in the scientific process. What are the barriers to realizing this vision?
What Is Needed to Reproduce a Scientific Discovery?
✅ The Code 📚 Git, GitHub, GitLab, Bitbucket, …
✅ The Environment 📦 PyPI, Conda, Mamba, conda-forge, Docker, …
✅ The Data 💾 DOIs, Domain Data Repositories, Zenodo, Figshare, Dataverse, …
What Is Needed to Reproduce a Scientific Discovery?
⚠️ The Data 💾 But what about big data?
Data-Intensive Science: Big Data → 💡 Insights • 💡 Understanding • 💡 Predictions
• open-ended problem
• exploratory analysis
• "human in the loop"
• visualization needed
• highly varied computational patterns / algorithms
• no standard architecture
Image: https://figshare.com/articles/figure/Earth_Data_Cube/4822930/2
Institutions create "Data Fortresses*"
*Coined by Chelle Gentemann
Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

# step 1: open data
open("/some/random/files/on/my/cluster")

# step 2: do mind-blowing AI! 🤦
Problems with the Status Quo
File-based data exchange creates lots of work for individual scientists (downloading, organizing, cleaning). Most file-based datasets are a mess, even simulation output.
😫 Yet the hard work of data wrangling is rarely collaborative. Outputs are not reusable and can't be shared.
💰 Doing data-intensive science requires either expensive local infrastructure or access to a big agency supercomputer. This really limits participation.
🏰 Data-intensive science is locked inside data fortresses. Limiting access to outsiders is a feature, not a bug. This restricts collaboration and reproducibility!
Cloud-Native Scientific Data Analytics
1. Analysis-Ready, Cloud-Optimized Data
2. Data-Proximate Computing
3. Elastic Distributed Processing
[Diagram: a compute environment made up of many compute nodes sitting next to the data.]
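As a concrete illustration of these three ingredients, here is a minimal sketch that opens an analysis-ready Zarr store directly from object storage and reduces it on an elastically scaled Dask cluster. The bucket path, the variable name "sla", and the use of Dask Gateway are assumptions for the sketch, not details from this deck.

# Minimal sketch: ARCO data + data-proximate, elastic compute (all specifics hypothetical).
import xarray as xr
from dask_gateway import Gateway

# Elastic distributed processing: workers scale with the workload.
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=20)
client = cluster.get_client()

# Analysis-ready, cloud-optimized data: open a Zarr store lazily from object storage.
ds = xr.open_dataset(
    "gs://hypothetical-bucket/sea_surface_height.zarr",  # placeholder path
    engine="zarr",
    chunks={},
)

# Data-proximate computing: the reduction runs on the cluster, next to the data.
time_mean = ds["sla"].mean("time").compute()  # "sla" is a placeholder variable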
Storage and Compute
Storage costs are steady. The data provider pays for storage, which may be subsidized by a cloud provider (thanks AWS, GCP, Azure!) or can live outside the cloud (e.g. Wasabi, OSN).
Compute costs for interactive data analysis are bursty and can take advantage of spot pricing.
Multi-tenancy: we can all use the same stack, but each institution pays for its own users.
This is completely different from the status-quo infrastructure!
The Cloud-Native Stack
• Cloud-optimized storage for multidimensional arrays
• Flexible, general-purpose parallel computing framework
• High-level API for analysis of multidimensional labelled arrays
• Rich interactive computing environment in the web browser
Cloud services: Kubernetes, object storage
Domain-specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
ARCO Data: Analysis Ready, Cloud Optimized
What is "Analysis Ready"?
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
[Chart: "How do data scientists spend their time?", CrowdFlower Data Science Report (2016): cleaning and organizing data 60%, collecting data sets 19%, mining data for patterns 9%, refining algorithms 4%, building training sets 3%, other 5%.]
ARCO Data
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
ARCO Data: Analysis Ready, Cloud Optimized
What is "Cloud Optimized"?
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
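To make "lazy access and intelligent subsetting" concrete, here is a minimal sketch using xarray against a Zarr store in object storage. The store path, variable name, and dimension names are hypothetical placeholders rather than references to an actual catalog entry.

import xarray as xr

# Opening the store reads only (consolidated) metadata, not the data itself.
ds = xr.open_zarr("gs://hypothetical-bucket/sea_surface_height.zarr", consolidated=True)

# Label-based subsetting is still lazy: nothing has been downloaded yet.
subset = ds["sla"].sel(time=slice("2015-01-01", "2015-12-31"))  # placeholder variable

# Only the chunks that overlap the selection are fetched when we compute.
timeseries = subset.mean(["latitude", "longitude"]).compute()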
ARCO Data is Hard!
To produce useful ARCO data, you must have:
• Domain Expertise: how to find, clean, and homogenize the data
• Tech Knowledge: how to efficiently produce cloud-optimized formats
• Compute Resources: a place to stage and upload the ARCO data
• Analysis Skills: to validate and make use of the ARCO data
😩 Data Scientist
Whose Job Is It to Make ARCO Data?
Data providers are concerned with preservation and archival quality.
Scientist users know what they need to make the data analysis-ready.
Pangeo Forge
• pangeo-forge-recipes: a Python package for describing and running data pipelines ("recipes"). https://github.com/pangeo-forge/pangeo-forge-recipes
• Pangeo Forge Cloud: a cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/
Pangeo Forge Recipes
• FilePattern: describes where to find the source files which are the inputs to the recipe
• StorageConfig: describes where to store the outputs of our recipe
• Recipe: a complete, self-contained representation of the pipeline
• Executor: knows how to run the recipe
https://pangeo-forge.readthedocs.io/
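To show how these pieces fit together, below is a minimal sketch of a recipe using the pre-Beam XarrayZarrRecipe API (the era this deck refers to later). The URL template, date range, and chunking are hypothetical, and storage / execution are only noted in comments; see the docs linked above for the current API.

# Minimal sketch of a Pangeo Forge recipe (pre-Beam API; all specifics hypothetical).
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# FilePattern: where to find the source files (one file per day; URL is a placeholder).
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    return f"https://data.example.org/sst/{time:%Y%m%d}.nc"  # hypothetical endpoint

pattern = FilePattern(make_url, ConcatDim("time", keys=dates, nitems_per_file=1))

# Recipe: a self-contained description of how those files become one Zarr store.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 40})

# A StorageConfig and an Executor are supplied at run time, either locally for
# testing or by a Bakery in Pangeo Forge Cloud.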
Pangeo Forge Cloud
https://pangeo-forge.org/
• Feedstock: contains the code and metadata for one or more recipes
• Bakery: runs the recipes in the cloud using elastic scaling clusters
• Storage: cloud object storage (e.g. GCS)
Collaborative Data Curation
Around a Feedstock:
🤓 Data User: "These data look weird…"
🤓 Data Producer: "…Oh, the metadata need an update."
🤓 Data Manager: "OK, I'll make a PR to the recipe."
Current Status
💣 Our recipes often contain 10,000+ tasks. We are hitting the limits of Prefect as a workflow engine and are currently refactoring to move to Apache Beam.
😫 Data has lots of edge cases! This is really hard.
🌎 But we remain very excited about the potential of Pangeo Forge to transform how scientists interact with data.
pangeo-forge.org