Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge

Julius Busecke | April 29th 2024 | NCAR ESDS Forum | M²LInES

Who am I?
• Physical Oceanographer
• Senior Staff Associate at Columbia University
• Manager of Data and Computing for LEAP
• Lead of Open Science for M²LInES
• Core Developer of xGCM, xMIP
• Pangeo Fan, User, and Member
• Open Source / Open Science Advocate
• Maintainer of the Pangeo CMIP6 zarr stores

Why even bother about open science?


Agile Science - Speed counts!
Idea 💡 ➡ Result ✅
• Tech/Infrastructure limited
• Understanding limited


Tech/Infrastructure limited. How to speed up:
- Open/Fast Access to data
- Community OSS tools
- Infrastructure Support from Private Sector?

Understanding limited. How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility

[Figure: iterative research cycle. Labels: Idea 💡; Collaborate with other researchers 💬; Revised Hypothesis 💡; Test on additional data 💽; Publish Results]

Open Data is the bedrock of Open Science

Because it's fast and easy!
Reproducible IPCC Science in Minutes
• IPCC Chapter 9: 2-10 minutes on LEAP-Pangeo JupyterHub
• Code Repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
• SciPy Talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ

Because it's inclusive!
Everyone can learn Climate Science from real climate data

Because more and more people need climate data
Collaboration on open data: Everybody wins!
[Figure: contributors from Academia, the Private Sector / Public Sector, and Public Stakeholders work on a shared dataset (fixes naming, corrects units, tunes compression, uploads new data). Outcomes shown: Science Paper 🎉🎓, Derived Proprietary Data Product 💵🎊, and "Everyone gets to explore the clean dataset!"]

People love this way of accessing (CMIP) data in the cloud!
• Lower barrier of entry + Quick exploration of ideas = New CMIPers!
• Teaching with the real data: Student ➡ Scientist 🚀
• Not just for academics. Public/private sector uses the cloud data!

The key ingredients
• Storage: Publicly accessible raw data via S3 interface
• Ingestion: Reproducible methods to transform data into Analysis-Ready Cloud-Optimized Formats
• Discovery: A reliable way for researchers to discover, cite, and learn more about datasets

Collaborative Science means Community Hopping
Portability of tools is key
[Figure: several research communities, each with its own researchers, compute 🖥, and data 💽. Two cases are compared: data maintained by the community but publicly accessible from outside, and data maintained by the community but only accessible from community resources.]

Future Vision
Community maintained/curated data
• Separation between Data and Compute!
  - We don't know what the demands in the future will be!
• Public access more important than fastest compute IMO!
  - The time saved in migrating/downloading data for many researchers is huge!
• The boundary of communities becomes more permeable for collaboration
  - I can run the same code on the same data even though I am in another part of the world

LEAP-Pangeo

ARCO Ingestion - The community Bakery
[Figure labels: Metadata, Recipe]
Beam Runners:
• Local
• Google Dataflow
• Amazon Flink
• PySpark (coming soon)
• Dask (coming soon)

Pangeo/ESGF CMIP6 Zarr Data
Request some new data!

The anatomy of the CMIP6 recipe

• Recipe Steps
  - Start with a list of unique identifiers of a dataset (made up of .nc files)


  - Dynamically request download URLs
  - Dynamic chunking (based on the size of each dataset)
  - Cataloging and testing inline
• Recipe Execution
  - Orchestrating thousands of single recipes in a dictionary
  - Currently runs on Google Dataflow but could run on any Beam runner!
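The orchestration idea can be pictured as building one recipe per dataset identifier and collecting them all in a dictionary keyed by that identifier. The sketch below only illustrates that pattern with the standard library. `RecipeSpec`, `make_recipe`, `sanitize_key`, and the example identifiers are hypothetical stand-ins, not the feedstock's actual code, and a real recipe would be an Apache Beam pipeline rather than a dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipeSpec:
    """Hypothetical placeholder for a single-dataset recipe."""
    instance_id: str
    target_store: str


def sanitize_key(instance_id: str) -> str:
    # Assumption for this sketch: dictionary keys should not contain dots.
    return instance_id.replace(".", "_")


def make_recipe(instance_id: str) -> RecipeSpec:
    # One recipe per dataset; the store name is derived from the identifier.
    return RecipeSpec(instance_id=instance_id, target_store=f"{instance_id}.zarr")


def build_recipes(instance_ids):
    # Drop duplicate identifiers (keeping order), then build one recipe each.
    unique_ids = list(dict.fromkeys(instance_ids))
    return {sanitize_key(iid): make_recipe(iid) for iid in unique_ids}


# Made-up identifiers that only mimic the dotted CMIP6 facet layout.
example_ids = [
    "CMIP6.CMIP.ExampleInst.ExampleModel.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.ExampleInst.ExampleModel.historical.r1i1p1f1.Omon.sos.gn.v20200101",
    "CMIP6.CMIP.ExampleInst.ExampleModel.historical.r1i1p1f1.Omon.tos.gn.v20200101",
]
recipes = build_recipes(example_ids)
```

Keeping the recipes in a plain dictionary means each dataset can be run, retried, or skipped on its own, which matters when thousands of them are processed.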

ESGF query: pangeo-forge-esgf
• Python client for the ESGF API


• Parse single instance_ids from wildcards
• Return HTTP download URLs for a given instance_id
• This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
• If there are better packages I should use, I'd love the feedback
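The two tasks listed above, expanding a wildcard into concrete instance_ids and looking up download URLs concurrently, can be sketched with the standard library alone. This is not the pangeo-forge-esgf API: `KNOWN_IDS`, `expand_wildcard`, `fetch_urls`, and `urls_for` are hypothetical names, the identifiers and URLs are made up, and the network request is replaced by a stub so the example runs offline.

```python
import asyncio
from fnmatch import fnmatchcase

# Made-up identifiers standing in for what a search API would return.
KNOWN_IDS = [
    "CMIP6.CMIP.ExampleInst.ModelA.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.ExampleInst.ModelB.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.ExampleInst.ModelA.historical.r1i1p1f1.Omon.sos.gn.v20200101",
    "CMIP6.ScenarioMIP.ExampleInst.ModelA.ssp585.r1i1p1f1.Omon.tos.gn.v20200101",
]


def expand_wildcard(pattern, known_ids):
    # Shell-style matching: "*" stands for any run of characters in a facet.
    return sorted(iid for iid in known_ids if fnmatchcase(iid, pattern))


async def fetch_urls(instance_id):
    # Stub for an HTTP request; a real client would await a network call here.
    await asyncio.sleep(0)
    return [f"https://example.org/{instance_id}/file_0.nc"]


async def urls_for(instance_ids):
    # Issue all lookups concurrently and pair each result with its identifier.
    results = await asyncio.gather(*(fetch_urls(iid) for iid in instance_ids))
    return dict(zip(instance_ids, results))


matched = expand_wildcard("CMIP6.*.*.*.historical.*.Omon.tos.gn.*", KNOWN_IDS)
url_map = asyncio.run(urls_for(matched))
```

`asyncio.gather` is what makes the lookups overlap in time; with a real HTTP client the waiting on many small requests is where most of the speedup would come from.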

Dynamic Chunking
Pangeo Forge `StoreToZarr` accepts a custom function
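A custom chunking function of this kind could look roughly like the sketch below, which picks a chunk length along one dimension so that each chunk stays near a target size. The function name `dynamic_time_chunks`, the 100 MiB target, and the choice to chunk only along `time` are assumptions made for illustration, not the logic used in the CMIP6 recipe. The function takes a plain dictionary of dimension sizes to stay dependency-free, whereas the real hook would work with a dataset object, so the expected signature should be checked against the pangeo-forge-recipes documentation.

```python
import math


def dynamic_time_chunks(dim_sizes, itemsize, target_chunk_bytes=100 * 1024**2,
                        chunk_dim="time"):
    """Return {dim: chunk_length}, splitting only along `chunk_dim`."""
    # Bytes needed for one step along chunk_dim (all other dims kept whole).
    other = math.prod(size for dim, size in dim_sizes.items() if dim != chunk_dim)
    bytes_per_step = other * itemsize
    # As many steps as fit in the target, at least 1, at most the full length.
    steps = max(1, target_chunk_bytes // bytes_per_step)
    steps = min(steps, dim_sizes[chunk_dim])
    chunks = dict(dim_sizes)
    chunks[chunk_dim] = steps
    return chunks


# Example: monthly data on a 1-degree grid stored as 4-byte floats.
chunks = dynamic_time_chunks({"time": 1980, "y": 180, "x": 360}, itemsize=4)
```

Sizing chunks per dataset rather than using one fixed chunk shape keeps small datasets in a single chunk and prevents very large grids from producing oversized chunks.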

Lessons learned:
• Data Ingestion for a community/public is hard work, but saves many people time and adds new members to the research discourse.
• It should be funded/acknowledged appropriately!
• Relying on short grant funding is risky.
• Pangeo-Forge has come a long way in part due to these efforts! It is a good time to get involved! https://github.com/pangeo-forge
• This vision is not dependent on commercial cloud storage, just the idea to expose data to the public in a 'cloud-like' fashion.

How to get involved
• https://github.com/leap-stc/data-management
• https://github.com/leap-stc/cmip6-leap-feedstock

I ❤ questions + comments
jbusecke | juliusbusecke.com | @JuliusBusecke | @[email protected] | @codeandcurrents.bsky.social

Discussion
