Pangeo Analysis-Ready Cloud-Optimized CMIP Data

JULIUS BUSECKE | NOV 21 2024 | ESGF WEBINAR Lessons
learned and future directions PANGEO ANALYSIS-READY CLOUD- OPTIMIZED CMIP DATA

WHO AM I? M²LInES jbusecke juliusbusecke.com @JuliusBusecke @[email protected] @codeandcurrents.bsky.social 🌊
Climate Scientist Ocean transport of Heat, Carbon Oxygen Impact of small scale processes on global climate variability. 🤓Developer/Data Nerd Pangeo CMIP6 Cloud Data xMIP/xGCM 🤝 Open Science Advocate Manager for Data and Computation - NSF- LEAP Lead of Open Research - m2lines 🙂↔💀 Screw you, Elon!

- ESGF "Power User" + Maintainer of Pangeo Cloud Data
- Trying to integrate the lessons learned in the past years into the new ESGF infrastructure - Representative of many di ff erent CMIP6 Users and their struggles - Many people fi nd it incredibly hard to work with CMIP6 data, and I believe we should do as much as we can to make it as easy as possible for users. - If work is shifted downstream to each user, toil is ampli fi ed, and results become harder to reproduce. This hurts science. - This can often be overlooked when we have privileged access to the data and are focused on deeply technical details - I learned a lot about CMIP data, ESGF and how users work with that data in the past years! - Thank you for all your work! WHAT HAT AM I WEARING TODAY?

- CMIP data is used more and more broadly -
"Traditional Users" at large orgs like universities and labs - "New Users" are increasingly interested in accessing CMIP data - Industry (insurance, climate service providers, ...) - Local Gov, Non-Pro fi t, Defense ... - The "Random person" not belonging to any of the above INTRO “Universal Declaration of Human Rights.” United Nations. Accessed August 19, 2024. https://www.un.org/en/about-us/universal-declaration-of-human-rights.

Download Clean and Combine Crunch the data Interpret Results

Download Clean and Combine Crunch the data Interpret Results ⏳💸🚫

Download Clean and Combine Crunch the data Interpret Results ❌

P r i v i l e g e d
I n s t i t u t i o n s c r e a t e “ D a t a F o r t r e s s e s * ” 8 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer

- Build prototypes of Analysis-Ready Cloud- Optimized CMIP6 "Caches" -
Experiment with di ff erent approaches: - Convert netcdfs to zarr (LDEO) - Google Cloud Storage - Rewrites data but enhances user experience and performance - Replicate netcdfs on cloud storage (GFDL) - AWS - Works as a ESGF replica node THE PANGEO / ESGF CLOUD DATA WORKING GROUP Centralize this cache in the cloud!

Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

• Recipe Steps • Start with a list of unique
identi f iers of a dataset (made up of .nc f iles) Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls • dynamic chunking (based on the size of each dataset) Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls • dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls • dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls • dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution • Orchestrating thousands of single recipes in dictionary Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

identi f iers of a dataset (made up of .nc f iles) • dynamically request download urls • dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution • Orchestrating thousands of single recipes in dictionary • Currently runs on Google Data f low but could run on any beam runner! Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

CMIP6 CLOUD DATA - WHAT WORKED WELL ESGF Ingestion Pipeline
A single data repository in the cloud serves all use cases Everybody rolls their own Custom Code Custom Code Custom Code University Lab Industry ❌ ✋🚫 Storage Provided by Google as Public Dataset

CMIP6 CLOUD DATA A single data repository in the cloud
serves all use cases Collaborative and agile Research Inclusive Education on real climate data Fast Iteration - Lower Barrier of Entry Portable Methods and Results not just for Academia

serves all use cases Collaborative and agile Research Inclusive Education on real climate data Portable Methods and Results not just for Academia Fast Iteration - Lower Barrier of Entry

serves all use cases Collaborative and agile Research Portable Methods and Results not just for Academia Inclusive Education on real climate data Fast Iteration - Lower Barrier of Entry

serves all use cases Portable Methods and Results not just for Academia Collaborative and agile Research Inclusive Education on real climate data Fast Iteration - Lower Barrier of Entry

serves all use cases Collaborative and agile Research Inclusive Education on real climate data Fast Iteration - Lower Barrier of Entry Portable Methods and Results not just for Academia

CMIP6 CLOUD DATA - WHAT COULD BE IMPROVED - Cataloging
- Simple CSV fi le with facets powering intake catalog - BUT this is workable and cataloging demands are very community speci fi c - Focusing on maximum performance and fl exibility could enable many di ff erent downstream cataloging approaches! - Usage Statistics - We have no usage statistics from Google - Could likely be remedied with a di ff erent agreement in the future - "Yet another way to access CMIP" - Users are already confused by the many options, +1 adds to the burden if docs are fragmented - Consolidated Docs and/or exposing ARCO cache data in the ESGF catalog would improve this!

- People love the Dataset/Datacube Access - Promote this entry
point to same level as fi le download in the base API and instructions/documentation! - Rewriting the data is really hard! - REST API docs unclear + synchronous clients slow. - Lots of QC issues to work around. - So lets not do it! - Creating virtual (zarr) references preserves a lot of the user experience that people love - Will make it easier to enable high performance (cloud storage based) and/or specialized (rechunked) 'caches' of the o ffi cial data! WHAT WE LEARNED AND WHERE TO GO NEXT

- Di ff erent fl avors of the same basic
idea: - Expose multiple logically connected fi les (e.g. timesteps) as a single dataset by combining metadata and referencing data chunks from each fi le - > User accesses the smallest scienti fi cally useful unit of data! - Kerchunk (requires fsspec/python) - References can be stored as json or parquet fi les - Works right now with s3 and http served CMIP6 fi les - Zarr V3 (e.g. implemented via icechunk) - Works with development branches - Would allow access - Aggregated NetCDF - ... - Each one of these approaches requires creating and hosting a few small additional fi les per dataset and pointing to a single fi le/url FILE AGGREGATION AND VIRTUAL ZARR REFS Kerchunk JSON example Xarray usage example

- Generation of Kerchunk json references has been demonstrated by
CEDA for ESGF-NG and will likely be a part of the new STAC catalog 🎉 - We can reference netcdfs over http and s3 - Open Questions: - When to generate/update references? - As part of the publishing or separate event? - Separate event could enable rebuilding di ff erent refs after publishing! - Can only publishers generate refs to 'their' fi les, or can I build references only? - Some caveats! - Additional QC/QA requirements: - Required: Consistent compression codecs and chunk size between fi les - Nice to have: Chunk size range checks FILE AGGREGATION AND VIRTUAL ZARR REFS Publish Node Reference Publish Node Reference Node ? Reference

- Using Virtualizarr to produce references (supports writing kerchunk and
icechunk) - Demonstrate work fl ow and test performance - Fully working examples of virtual zarr references from netcdf fi les on s3 and http - Using VirtualiZarr to combine fi les using xarray - Currently producing kerchunk references, but experimenting with icechunk - Please raise issues, reach out if you are interested. ONGOING WORK

- I think with small changes we can integrate many
of our lessons into the new ESGF Infrastructure - We can reproduce much of the UX from the Pangeo CMIP6 Cloud Data without rewriting to Zarr 🎉 - Requires: - ✅ Support for dataset level references/aggregations in the base data access API - ❓Support multiple references/aggregation methods? - ✅ Support replication to S3 storage - ❓QC on compression and chunking - Producing references using Virtualizarr can work as part of the publishing - This will signi fi cantly enhance the user experience and enable downstream e ff orts - Orgs can run cloud enabled replica nodes to enhance performance for parallel access (no commitment from ESGF) - Makes e ff orts to create specialized caches (rechunked native zarr, icechunk, ...) easier - Depending on QC/QA procedure these could be fully veri fi ed against published fi les too - Consolidated Docs 🙏! - Aside from the data access, having a single website to get info for new users, users who want to build downstream tools, maintainers of core infrastructure in one place would go a looooong way towards making access to climate data more inclusive and reduce time demand on ESGF maintainers. SUMMARY

I ❤ QUESTIONS AND COMMENTS

Pangeo Analysis-Ready Cloud-Optimized CMIP Data

Pangeo Analysis-Ready Cloud-Optimized CMIP Data

Julius Busecke

More Decks by Julius Busecke

Other Decks in Research

Featured

Transcript

JULIUS BUSECKE | NOV 21 2024 | ESGF WEBINAR Lessons

WHO AM I? M²LInES jbusecke juliusbusecke.com @JuliusBusecke @[email protected] @codeandcurrents.bsky.social 🌊

- ESGF "Power User" + Maintainer of Pangeo Cloud Data

- CMIP data is used more and more broadly -

Download Clean and Combine Crunch the data Interpret Results

Download Clean and Combine Crunch the data Interpret Results ⏳💸🚫

Download Clean and Combine Crunch the data Interpret Results ❌

P r i v i l e g e d

P r i v i l e g e d

- Build prototypes of Analysis-Ready Cloud- Optimized CMIP6 "Caches" -

Zarr Ingestion via Pangeo-Forge The anatomy of the CMIP6 recipe

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

• Recipe Steps • Start with a list of unique

CMIP6 CLOUD DATA - WHAT WORKED WELL ESGF Ingestion Pipeline

CMIP6 CLOUD DATA A single data repository in the cloud

CMIP6 CLOUD DATA A single data repository in the cloud

CMIP6 CLOUD DATA A single data repository in the cloud

CMIP6 CLOUD DATA A single data repository in the cloud

CMIP6 CLOUD DATA A single data repository in the cloud

CMIP6 CLOUD DATA - WHAT COULD BE IMPROVED - Cataloging

- People love the Dataset/Datacube Access - Promote this entry

- Di ff erent fl avors of the same basic

- Generation of Kerchunk json references has been demonstrated by

- Using Virtualizarr to produce references (supports writing kerchunk and

- I think with small changes we can integrate many

I ❤ QUESTIONS AND COMMENTS