Open Science with Pangeo - From community to climate science in the cloud

Julius Busecke | UCLA IDRE Open Science Workshop | Mar 24 2022
Contains slides adapted from Ryan Abernathey

Who am I?
• Physical Oceanographer: studies the role of ocean currents for the climate and ecosystems
• Associate Research Scientist - Columbia University
• Core-Developer of xgcm
• Core-Developer of cmip6_preprocessing
• Pangeo Fan, User, and Member
• Open Source/Open Science Advocate

Source: https://www.un.org/en/global-issues/climate-change

Credit: NASA's Goddard Space Flight Center
https://earthdata.nasa.gov/eosdis/cloud-evolution
Missions shown: SWOT, NISAR

How do we get to work with all this data?

Download Files (FTP / OPeNDAP / etc.)
MB 😀
GB 😐
TB 😖
PB 😱
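To put the MB-to-PB progression into numbers, here is a back-of-the-envelope sketch. The 100 Mbit/s sustained bandwidth is an assumed value chosen for illustration, not a figure from the talk, and the helper names are hypothetical.

```python
# Rough download times at an assumed, constant bandwidth.
# ASSUMPTION: 100 Mbit/s sustained throughput (illustrative only).

SIZES_BYTES = {
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
}


def download_time_seconds(size_bytes, bandwidth_mbit_per_s=100.0):
    """Seconds needed to transfer `size_bytes` at the given bandwidth."""
    bits = size_bytes * 8
    return bits / (bandwidth_mbit_per_s * 10**6)


def humanize(seconds):
    """Format a duration using the largest unit that fits."""
    units = (("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60))
    for unit, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.2f} seconds"


if __name__ == "__main__":
    for label, size in SIZES_BYTES.items():
        print(f"1 {label}: {humanize(download_time_seconds(size))}")
```

Under this assumed bandwidth a megabyte arrives in a fraction of a second, while a petabyte takes years, which is why downloading stops being a workable access pattern at that scale.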

Privileged Institutions create “Data Fortresses*”
Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons
*Coined by Chelle Gentemann
Data
❌ Results not reproducible outside fortress
❌ Barrier to collaboration
❌ Inefficient / duplicative
❌ Can’t scale to future data needs
❌ Limits inclusion and knowledge transfer

Bring the compute to the data! But how?

Use a “Platform”

The Trouble with “Platforms”
Scientists’ creativity often exceeds pre-baked capabilities.
  Desire to go under the hood
What if you want to access data that isn’t included?
  Data catalog is determined by provider, not users
Platforms are “single instance”:
  Fear of lock-in, possibility platform will disappear

OPEN Cloud Architecture
• Data Provider’s $
• Data Consumer’s $
• Interactive Computing
• Parallel Computing
• Analysis Ready Data, Cloud Optimized Formats

Open Ocean Cloud Will be a “Data Watering Hole*”
*Coined by Fernando Perez
Data Library, Compute Environment
👩💻👨💻👩💻 Group A: Air-Sea Interaction
👩💻👨💻👩💻 Group B: Seasonal Forecasting
Research, Education, Industry
✅ Faster science, more discoveries
✅ Inherently reproducible
✅ Allows seamless global collaboration
✅ Unleashes creativity
✅ Cost effective
✅ Accessible to all
✅ Connects with industry

https://openocean.cloud
Material adapted from Ryan Abernathey

What is Pangeo?
Community obsessed with efficient data processing. Founded in 2017.
• Scientists and software developers coming together. http://pangeo.io/
• Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
Interoperable Software
• Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr.
• Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
Data and Computing Infrastructure
• Deployment recipes for cloud and HPC.
• Open, public, cloud-based JupyterHubs and Binders for data-proximate computing.
• PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.

CMIP6 Cloud Dataset
• Pangeo partnered with ESGF and Google Cloud to provide a new public dataset
• > 1 PB and counting
• Data stored in Zarr format
• Google provides free hosting in GCS
• Mirrored on AWS
https://pangeo-data.github.io/pangeo-cmip6-cloud/
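A dataset of this size is usually navigated through a catalog of store locations rather than by browsing files. The following is a minimal sketch of that idea using only the standard library. The catalog excerpt, its column names, and the `gs://example-bucket/...` paths are made up for illustration; check the dataset documentation linked above for the real catalog layout. The function name `find_stores` is hypothetical.

```python
import csv
import io

# ASSUMPTION: a tiny, made-up excerpt in the style of a CSV catalog of Zarr stores.
# Column names and store paths are illustrative, not copied from the real catalog.
SAMPLE_CATALOG = """\
source_id,experiment_id,member_id,table_id,variable_id,zstore
MODEL-A,historical,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-A/historical/r1i1p1f1/Omon/tos/
MODEL-A,ssp585,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-A/ssp585/r1i1p1f1/Omon/tos/
MODEL-B,historical,r1i1p1f1,Omon,tos,gs://example-bucket/MODEL-B/historical/r1i1p1f1/Omon/tos/
MODEL-B,historical,r2i1p1f1,Amon,tas,gs://example-bucket/MODEL-B/historical/r2i1p1f1/Amon/tas/
"""


def find_stores(catalog_text, **facets):
    """Return the store path of every catalog row that matches all given facets."""
    rows = csv.DictReader(io.StringIO(catalog_text))
    return [
        row["zstore"]
        for row in rows
        if all(row[key] == value for key, value in facets.items())
    ]


if __name__ == "__main__":
    # One store per model and member that provides the requested variable.
    for store in find_stores(SAMPLE_CATALOG, experiment_id="historical", variable_id="tos"):
        print(store)
```

Each returned path would then typically be opened lazily with a Zarr-aware library such as Xarray (`xarray.open_zarr`), so the same analysis can be looped over every model and member.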

Analyzing Petabyte scale climate data in your browser with Pangeo
Custom Analysis applied to each model and member

What we want to do
💡 Have an idea
Write some code
Rock some science

Time to science?

Different dimension names in the CMIP data.
  Not quite analysis-ready
No! Time to homogenize data!
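The homogenizing step can be pictured as mapping each model's dimension names onto one shared vocabulary. This is a toy sketch under stated assumptions: the alias table and the target names (`lon`, `lat`, `lev`, `time`) are hypothetical choices, and the code is not taken from any particular package.

```python
# ASSUMPTION: hypothetical alias table; real model output may use other spellings,
# and the canonical names on the left are arbitrary choices for this sketch.
DIM_ALIASES = {
    "lon": {"lon", "longitude"},
    "lat": {"lat", "latitude"},
    "lev": {"lev", "olevel", "depth"},
    "time": {"time"},
}


def build_rename_map(dims):
    """Map each dimension name in `dims` to its canonical name, if it differs."""
    lookup = {
        alias: canonical
        for canonical, aliases in DIM_ALIASES.items()
        for alias in aliases
    }
    rename = {}
    for name in dims:
        canonical = lookup.get(name.lower())
        if canonical is None:
            raise KeyError(f"no canonical name known for dimension {name!r}")
        if canonical != name:
            rename[name] = canonical
    return rename


def homogenize(dims):
    """Return `dims` with every name replaced by its canonical form."""
    rename = build_rename_map(dims)
    return tuple(rename.get(name, name) for name in dims)


if __name__ == "__main__":
    # Two made-up models describing the same kind of field with different names.
    print(homogenize(("time", "latitude", "longitude")))
    print(homogenize(("time", "olevel", "lat", "lon")))
```

With Xarray, a mapping like the one returned by `build_rename_map` could be passed to `Dataset.rename`. Writing and maintaining such tables for every model is the kind of tedious work that competes with the science itself.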

🤔 💡 No! Time to clean data!

🤔 💡 Competition for brain power

🤔 Isn’t there a better way?

There is!
Analysis Ready Data in the cloud
+ Crowd-Sourced Data Cleaning (peer-to-peer learning)
= Less data wrangling, more 💡

ARCO Data: Analysis Ready, Cloud Optimized
What is “Analysis Ready”?
• Think in “Datasets” not “data files”
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
[Figure: How do data scientists spend their time? CrowdFlower Data Science Report (2016)]
• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Building training sets: 3%
• Other: 5%

Example of ARCO Data
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

ARCO Data: Analysis Ready, Cloud Optimized
What is “Cloud Optimized”?
• Compatible with object storage (access via HTTP)
• Supports lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
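Why chunked storage allows subsetting, and why the chunk layout matters for a given analysis, can be sketched with a little index arithmetic. This is a minimal illustration, not the implementation of any library. The array shape, the two chunk layouts, and the function names are assumptions; the dot-separated key format follows the default Zarr version 2 layout.

```python
from itertools import product


def chunks_for_selection(shape, chunks, selection):
    """Return the grid indices of every chunk that overlaps `selection`.

    shape:     array shape, e.g. (time, lat, lon)
    chunks:    chunk shape along each dimension
    selection: one (start, stop) pair per dimension, stop exclusive
    """
    per_dim = []
    for size, chunk, (start, stop) in zip(shape, chunks, selection):
        if not 0 <= start < stop <= size:
            raise ValueError(f"selection {(start, stop)} outside dimension of size {size}")
        first = start // chunk
        last = (stop - 1) // chunk
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


def total_chunks(shape, chunks):
    """Number of chunks (stored objects) needed to hold the whole array."""
    count = 1
    for size, chunk in zip(shape, chunks):
        count *= -(-size // chunk)  # ceiling division
    return count


def chunk_key(index):
    """ASSUMPTION: dot-separated chunk keys, as in the default Zarr v2 layout."""
    return ".".join(str(i) for i in index)


if __name__ == "__main__":
    # Hypothetical array: 1200 monthly time steps on a 180 x 360 grid.
    shape = (1200, 180, 360)
    # Query: the full time series at a single grid point.
    point_series = ((0, 1200), (90, 91), (180, 181))
    for layout in [(12, 180, 360), (1200, 30, 30)]:
        needed = chunks_for_selection(shape, layout, point_series)
        print(layout, "->", len(needed), "of", total_chunks(shape, layout), "chunks")
```

For the same query, the layout chunked along time needs every one of its 100 objects, while the layout chunked in space needs a single object out of 72. Only the overlapping objects have to be requested over HTTP, which is what makes lazy access and subsetting cheap when the chunking fits the analysis.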

Problem: Making ARCO Data is Hard!
To produce useful ARCO data, you must have:
• Domain Expertise: How to find, clean, and homogenize data
• Tech Knowledge: How to efficiently produce cloud-optimized formats
• Compute Resources: A place to stage and upload the ARCO data
• Communication Skills: To explain to others how to use the data
Data Scientist 😩

Pangeo Forge
Let’s democratize the production of ARCO data!
Domain Expertise: How to find, clean, and homogenize data
🤓 Data Scientist

Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”)
https://github.com/pangeo-forge/pangeo-forge-recipes
Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos
https://pangeo-forge.org/

Learn More
http://pangeo.io
https://discourse.pangeo.io/
https://github.com/pangeo-data/
https://medium.com/pangeo
@pangeo_data
@JuliusBusecke
jbusecke
[email protected]

ARCO Data is Fast!
https://doi.org/10.1109/MCSE.2021.3059437
