Charles Stern (Columbia / LDEO) • Joe Hamman (CarbonPlan) • Anderson Banihirwe (CarbonPlan) • Rachel Wegener (U. Maryland) • Chiara Lepore (GRO Intelligence) • Sean Harkins (Development Seed) • Alex Merose (Google Research) • Tom Augspurger (Microsoft) • Martin Durant (Anaconda) • Many recipe contributors
Funding: NSF EarthCube Program ($1.5M for 3 years)
The Open Science Vision
https://earthdata.nasa.gov/esds/open-science

for 👩🔬 in everyone_in_the_world:
    for 📄 in all_scientific_knowledge:
        👩🔬.verify(📄)
        discovery = 👩🔬.extend(📄)

This would transform the 🌎 by allowing all of humanity to participate in the scientific process. What are the barriers to realizing this vision?
What Is Needed to Reproduce a Scientific Discovery?
✅ The Code 📚 Git, GitHub, GitLab, Bitbucket, …
✅ The Environment 📦 PyPI, Conda, Mamba, conda-forge, Docker, …
✅ The Data 💾 DOIs, Domain Data Repositories, Zenodo, Figshare, Dataverse, …
What Is Needed to Reproduce a Scientific Discovery?
⚠️ The Data 💾 But what about big data?
Data-Intensive Science: Big Data → 💡 Insights • 💡 Understanding • 💡 Predictions
• open-ended problem
• exploratory analysis
• "human in the loop"
• visualization needed
• highly varied computational patterns / algorithms
• no standard architecture
Image: https://figshare.com/articles/figure/Earth_Data_Cube/4822930/2
Institutions create "Data Fortresses*"
*Coined by Chelle Gentemann
Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

# step 1: open data
open("/some/random/files/on/my/cluster")

# step 2: do mind-blowing AI! 🤦
Problems with the Status Quo
File-based data exchange creates lots of work for individual scientists (downloading, organizing, cleaning). Most file-based datasets are a mess, even simulation output.
😫 Yet the hard work of data wrangling is rarely collaborative. Outputs are not reusable and can't be shared.
💰 Doing data-intensive science requires either expensive local infrastructure or access to a big agency supercomputer. This really limits participation.
🏰 Data-intensive science is locked inside data fortresses. Limiting access to outsiders is a feature, not a bug. This restricts collaboration and reproducibility!
Cloud-Native Scientific Data Analytics
1. Analysis-Ready, Cloud-Optimized Data
2. Data-Proximate Computing
3. Elastic Distributed Processing
[Diagram: a compute environment made up of many compute nodes sitting next to the data.]
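As a concrete illustration of these three ingredients, here is a minimal sketch that opens an analysis-ready Zarr store directly from object storage and reduces it on an elastically scaled Dask cluster. The bucket path, the variable name "sla", and the use of Dask Gateway are assumptions for the sketch, not details from this deck.

# Minimal sketch: ARCO data + data-proximate, elastic compute (all specifics hypothetical).
import xarray as xr
from dask_gateway import Gateway

# Elastic distributed processing: workers scale with the workload.
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=20)
client = cluster.get_client()

# Analysis-ready, cloud-optimized data: open a Zarr store lazily from object storage.
ds = xr.open_dataset(
    "gs://hypothetical-bucket/sea_surface_height.zarr",  # placeholder path
    engine="zarr",
    chunks={},
)

# Data-proximate computing: the reduction runs on the cluster, next to the data.
time_mean = ds["sla"].mean("time").compute()  # "sla" is a placeholder variable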
Storage and Compute
Storage costs are steady. The data provider pays for storage, which may be subsidized by a cloud provider (thanks AWS, GCP, Azure!) or can live outside the cloud (e.g. Wasabi, OSN).
Compute costs for interactive data analysis are bursty and can take advantage of spot pricing.
Multi-tenancy: we can all use the same stack, but each institution pays for its own users.
This is completely different from the status-quo infrastructure!
The Cloud-Native Stack
• Cloud-optimized storage for multidimensional arrays
• Flexible, general-purpose parallel computing framework
• High-level API for analysis of multidimensional labelled arrays
• Rich interactive computing environment in the web browser
Cloud services: Kubernetes, object storage
Domain-specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
ARCO Data: Analysis Ready, Cloud Optimized
What is "Analysis Ready"?
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
[Chart: "How do data scientists spend their time?", CrowdFlower Data Science Report (2016): cleaning and organizing data 60%, collecting data sets 19%, mining data for patterns 9%, refining algorithms 4%, building training sets 3%, other 5%.]
ARCO Data
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
ARCO Data: Analysis Ready, Cloud Optimized
What is "Cloud Optimized"?
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
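To make "lazy access and intelligent subsetting" concrete, here is a minimal sketch using xarray against a Zarr store in object storage. The store path, variable name, and dimension names are hypothetical placeholders rather than references to an actual catalog entry.

import xarray as xr

# Opening the store reads only (consolidated) metadata, not the data itself.
ds = xr.open_zarr("gs://hypothetical-bucket/sea_surface_height.zarr", consolidated=True)

# Label-based subsetting is still lazy: nothing has been downloaded yet.
subset = ds["sla"].sel(time=slice("2015-01-01", "2015-12-31"))  # placeholder variable

# Only the chunks that overlap the selection are fetched when we compute.
timeseries = subset.mean(["latitude", "longitude"]).compute()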
ARCO Data is Hard!
To produce useful ARCO data, you must have:
• Domain Expertise: how to find, clean, and homogenize the data
• Tech Knowledge: how to efficiently produce cloud-optimized formats
• Compute Resources: a place to stage and upload the ARCO data
• Analysis Skills: to validate and make use of the ARCO data
😩 Data Scientist
Whose Job Is It to Make ARCO Data?
Data providers are concerned with preservation and archival quality.
Scientist users know what they need to make the data analysis-ready.
Pangeo Forge
• pangeo-forge-recipes: a Python package for describing and running data pipelines ("recipes"). https://github.com/pangeo-forge/pangeo-forge-recipes
• Pangeo Forge Cloud: a cloud platform for automatically executing recipes stored in GitHub repos. https://pangeo-forge.org/
Pangeo Forge Recipes
• FilePattern: describes where to find the source files which are the inputs to the recipe
• StorageConfig: describes where to store the outputs of our recipe
• Recipe: a complete, self-contained representation of the pipeline
• Executor: knows how to run the recipe
https://pangeo-forge.readthedocs.io/
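To show how these pieces fit together, below is a minimal sketch of a recipe using the pre-Beam XarrayZarrRecipe API (the era this deck refers to later). The URL template, date range, and chunking are hypothetical, and storage / execution are only noted in comments; see the docs linked above for the current API.

# Minimal sketch of a Pangeo Forge recipe (pre-Beam API; all specifics hypothetical).
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# FilePattern: where to find the source files (one file per day; URL is a placeholder).
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")

def make_url(time):
    return f"https://data.example.org/sst/{time:%Y%m%d}.nc"  # hypothetical endpoint

pattern = FilePattern(make_url, ConcatDim("time", keys=dates, nitems_per_file=1))

# Recipe: a self-contained description of how those files become one Zarr store.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 40})

# A StorageConfig and an Executor are supplied at run time, either locally for
# testing or by a Bakery in Pangeo Forge Cloud.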
Pangeo Forge Cloud
https://pangeo-forge.org/
• Feedstock: contains the code and metadata for one or more recipes
• Bakery: runs the recipes in the cloud using elastic scaling clusters
• Storage: cloud object storage (e.g. GCS)
Collaborative Data Curation
Around a Feedstock:
🤓 Data User: "These data look weird…"
🤓 Data Producer: "…Oh, the metadata need an update."
🤓 Data Manager: "OK, I'll make a PR to the recipe."
Current Status
💣 Our recipes often contain 10,000+ tasks. We are hitting the limits of Prefect as a workflow engine and are currently refactoring to move to Apache Beam.
😫 Data has lots of edge cases! This is really hard.
🌎 But we remain very excited about the potential of Pangeo Forge to transform how scientists interact with data.
pangeo-forge.org