
Pangeo Analysis-Ready Cloud-Optimized CMIP Data


Slides for "2024-11-21_Pangeo Analysis-Ready Cloud-Optimized CMIP data" presented at ESGF Webinar Series on November 21, 2024 by Julius Busecke.

Julius Busecke

November 21, 2024


Transcript

  1. PANGEO ANALYSIS-READY CLOUD-OPTIMIZED CMIP DATA

    Lessons learned and future directions
    JULIUS BUSECKE | NOV 21 2024 | ESGF WEBINAR
  2. WHO AM I?

    🌊 Climate Scientist - ocean transport of heat, carbon, and oxygen; impact of small-scale processes on global climate variability.
    🤓 Developer/Data Nerd - Pangeo CMIP6 Cloud Data, xMIP/xGCM
    🤝 Open Science Advocate - Manager for Data and Computation at NSF-LEAP; Lead of Open Research at M²LInES
    jbusecke | juliusbusecke.com | @JuliusBusecke @[email protected] @codeandcurrents.bsky.social
    🙂↔💀 Screw you, Elon!
  3. WHAT HAT AM I WEARING TODAY?

    - ESGF "power user" + maintainer of Pangeo Cloud Data
    - Trying to integrate the lessons learned over the past years into the new ESGF infrastructure
    - Representative of many different CMIP6 users and their struggles
      - Many people find it incredibly hard to work with CMIP6 data, and I believe we should do as much as we can to make it as easy as possible for users.
      - If work is shifted downstream to each user, toil is amplified and results become harder to reproduce. This hurts science.
      - This can often be overlooked when we have privileged access to the data and are focused on deeply technical details.
    - I learned a lot about CMIP data, ESGF, and how users work with that data in the past years!
    - Thank you for all your work!
  4. INTRO

    - CMIP data is used more and more broadly
    - "Traditional users" at large orgs like universities and labs
    - "New users" are increasingly interested in accessing CMIP data
      - Industry (insurance, climate service providers, ...)
      - Local government, non-profit, defense, ...
      - The "random person" not belonging to any of the above
    “Universal Declaration of Human Rights.” United Nations. Accessed August 19, 2024. https://www.un.org/en/about-us/universal-declaration-of-human-rights.
  5. Privileged institutions create "Data Fortresses*"

    ❌ Results not reproducible outside the fortress
    ❌ Barrier to collaboration
    ❌ Inefficient / duplicative
    ❌ Can't scale to future data needs
    ❌ Limits inclusion and knowledge transfer
    *Coined by Chelle Gentemann. Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons
  7. THE PANGEO / ESGF CLOUD DATA WORKING GROUP: Centralize this cache in the cloud!

    - Build prototypes of Analysis-Ready Cloud-Optimized CMIP6 "caches"
    - Experiment with different approaches:
      - Convert netCDF files to Zarr (LDEO) - Google Cloud Storage - rewrites data but enhances user experience and performance
      - Replicate netCDF files on cloud storage (GFDL) - AWS - works as an ESGF replica node
  8. Zarr Ingestion via Pangeo-Forge: the anatomy of the CMIP6 recipe

    • Recipe steps:
    • Start with a list of unique identifiers of a dataset (made up of .nc files)
    • Dynamically request download URLs
    • Dynamic chunking (based on the size of each dataset)
    • Cataloging and testing inline
    • Recipe execution
    • Orchestrating thousands of single recipes in a dictionary
    • Currently runs on Google Dataflow but could run on any Beam runner!
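The "dynamic chunking" step above can be sketched as a size-aware heuristic: pick a time-axis chunk length so each stored chunk lands near a target size, whatever the dataset's resolution. This is an illustration only, not the actual pangeo-forge implementation; `choose_time_chunks` and the 150 MiB target are made up.

```python
def choose_time_chunks(total_bytes: int, n_time: int,
                       target_chunk_bytes: int = 150 * 2**20) -> int:
    """Pick a time-axis chunk length so each chunk is near the target size.

    total_bytes: uncompressed size of the full variable.
    n_time: number of time steps in the dataset.
    """
    bytes_per_step = total_bytes / n_time
    steps = max(1, round(target_chunk_bytes / bytes_per_step))
    # Never chunk longer than the dataset itself.
    return min(steps, n_time)

# A 600 MiB, 600-step dataset gets 150-step chunks; a tiny dataset
# stays in a single chunk.
print(choose_time_chunks(600 * 2**20, 600))  # → 150
print(choose_time_chunks(10 * 2**20, 10))    # → 10
```

The point of doing this per dataset (rather than hard-coding a chunk shape) is that CMIP6 sources span many orders of magnitude in size, so a fixed chunking would be pathological for either the smallest or the largest models.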
  16. CMIP6 CLOUD DATA - WHAT WORKED WELL

    - ESGF ingestion pipeline
    - A single data repository in the cloud serves all use cases - no more "everybody rolls their own" custom code at each university, lab, and industry shop ❌ ✋🚫
    - Storage provided by Google as a Public Dataset
  17. CMIP6 CLOUD DATA

    A single data repository in the cloud serves all use cases:
    - Collaborative and agile research
    - Inclusive education on real climate data
    - Fast iteration - lower barrier to entry
    - Portable methods and results, not just for academia
  22. CMIP6 CLOUD DATA - WHAT COULD BE IMPROVED

    - Cataloging
      - Simple CSV file with facets powering an intake catalog
      - BUT this is workable, and cataloging demands are very community specific
      - Focusing on maximum performance and flexibility could enable many different downstream cataloging approaches!
    - Usage statistics
      - We have no usage statistics from Google
      - Could likely be remedied with a different agreement in the future
    - "Yet another way to access CMIP"
      - Users are already confused by the many options; +1 adds to the burden if docs are fragmented
      - Consolidated docs and/or exposing ARCO cache data in the ESGF catalog would improve this!
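To make the "simple CSV file with facets" concrete, here is a minimal sketch of querying such a catalog by facet. The rows, zstore paths, and the `search` helper are invented for illustration; in practice this CSV backs an intake-esm datastore rather than being filtered by hand.

```python
import csv
import io

# Illustrative rows in the style of a CMIP6 facet catalog
# (column names mirror CMIP6 facets; the store paths are made up).
CATALOG = """source_id,experiment_id,table_id,variable_id,zstore
GFDL-ESM4,historical,Omon,thetao,gs://bucket/GFDL-ESM4/historical/thetao/
GFDL-ESM4,ssp585,Omon,thetao,gs://bucket/GFDL-ESM4/ssp585/thetao/
CESM2,historical,Amon,tas,gs://bucket/CESM2/historical/tas/
"""

def search(catalog: str, **facets: str) -> list[dict]:
    """Return catalog rows whose facet columns match all given values."""
    rows = csv.DictReader(io.StringIO(catalog))
    return [row for row in rows
            if all(row[key] == value for key, value in facets.items())]

# Facet query: all historical ocean potential temperature stores.
hits = search(CATALOG, experiment_id="historical", variable_id="thetao")
print([h["zstore"] for h in hits])
```

The appeal of this scheme is exactly its simplicity: any tool that can read a CSV can build its own index on top, which is why the slide argues the format is workable even if it looks primitive.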
  23. WHAT WE LEARNED AND WHERE TO GO NEXT

    - People love the dataset/datacube access - promote this entry point to the same level as file download in the base API and instructions/documentation!
    - Rewriting the data is really hard!
      - REST API docs unclear + synchronous clients slow.
      - Lots of QC issues to work around.
      - So let's not do it!
    - Creating virtual (Zarr) references preserves a lot of the user experience that people love
      - Will make it easier to enable high-performance (cloud-storage based) and/or specialized (rechunked) 'caches' of the official data!
  24. FILE AGGREGATION AND VIRTUAL ZARR REFS

    - Different flavors of the same basic idea:
      - Expose multiple logically connected files (e.g. timesteps) as a single dataset by combining metadata and referencing data chunks from each file
      - → The user accesses the smallest scientifically useful unit of data!
    - Kerchunk (requires fsspec/Python)
      - References can be stored as JSON or Parquet files
      - Works right now with s3- and http-served CMIP6 files
    - Zarr V3 (e.g. implemented via icechunk)
      - Works with development branches
      - Would allow access
    - Aggregated NetCDF
    - ...
    - Each one of these approaches requires creating and hosting a few small additional files per dataset and pointing to a single file/URL
    (Slide shows a Kerchunk JSON example and an Xarray usage example)
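To make the reference idea concrete, here is a minimal kerchunk-style (version 1) reference set: the Zarr metadata is stored inline, while each data chunk points into an original netCDF file as a `[url, byte_offset, byte_length]` triple. All paths, shapes, and byte ranges here are invented for illustration; real references are generated by tooling, not written by hand.

```python
import json

# Two monthly netCDF files exposed as one Zarr-readable dataset.
# Metadata entries are inline JSON strings; chunk entries are
# [url, byte_offset, byte_length] pointers into the original files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/.zarray": json.dumps({
            "shape": [24, 96, 144], "chunks": [12, 96, 144],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "filters": None, "order": "C", "zarr_format": 2,
        }),
        "tas/0.0.0": ["s3://bucket/tas_000101-000112.nc", 8192, 663552],
        "tas/1.0.0": ["s3://bucket/tas_000201-000212.nc", 8192, 663552],
    },
}

def referenced_files(refs: dict) -> set[str]:
    """Collect the original files a reference set points into."""
    return {entry[0] for entry in refs["refs"].values()
            if isinstance(entry, list)}

print(sorted(referenced_files(refs)))
```

Note how small the reference file is compared to the data it describes: this is why the slide says each approach only needs "a few small additional files per dataset" while the bytes stay in the published netCDFs.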
  25. FILE AGGREGATION AND VIRTUAL ZARR REFS

    - Generation of Kerchunk JSON references has been demonstrated by CEDA for ESGF-NG and will likely be part of the new STAC catalog 🎉
    - We can reference netCDFs over http and s3
    - Open questions:
      - When to generate/update references? As part of publishing, or as a separate event?
        - A separate event could enable rebuilding different refs after publishing!
      - Can only publishers generate refs to 'their' files, or can I build references only?
    - Some caveats! Additional QC/QA requirements:
      - Required: consistent compression codecs and chunk size between files
      - Nice to have: chunk-size range checks
    (Diagram: references generated on the publish node vs. on a separate reference node)
  26. ONGOING WORK

    - Using VirtualiZarr to produce references (supports writing kerchunk and icechunk)
    - Demonstrate workflow and test performance
    - Fully working examples of virtual Zarr references from netCDF files on s3 and http
    - Using VirtualiZarr to combine files using xarray
    - Currently producing kerchunk references, but experimenting with icechunk
    - Please raise issues and reach out if you are interested.
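The combination step at the heart of this workflow can be sketched in plain Python: renumber each file's leading (time) chunk index so that many per-file reference sets appear as one logical dataset. This shows the core idea only, not VirtualiZarr's actual API; `concat_time_refs` is a hypothetical helper and assumes every file carries the same number of time chunks.

```python
def concat_time_refs(per_file_refs: list[dict], chunks_per_file: int) -> dict:
    """Merge per-file chunk references into one logical dataset by
    offsetting the leading (time) chunk index of each successive file.

    Chunk keys are Zarr-style "t.y.x" strings; values are whatever the
    reference format stores (e.g. [url, offset, length] pointers).
    """
    combined = {}
    for file_index, refs in enumerate(per_file_refs):
        for key, pointer in refs.items():
            time_index, rest = key.split(".", 1)
            new_time = int(time_index) + file_index * chunks_per_file
            combined[f"{new_time}.{rest}"] = pointer
    return combined

# Two single-chunk files become chunks 0 and 1 of one virtual dataset.
yearly_files = [
    {"0.0.0": ["s3://bucket/tas_year1.nc", 0, 663552]},
    {"0.0.0": ["s3://bucket/tas_year2.nc", 0, 663552]},
]
combined = concat_time_refs(yearly_files, chunks_per_file=1)
print(sorted(combined))  # → ['0.0.0', '1.0.0']
```

This is also why the slide before lists "consistent compression codecs and chunk size between files" as a hard QC requirement: the renumbering trick only yields a valid dataset if every file's chunks are interchangeable.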
  27. SUMMARY

    - I think with small changes we can integrate many of our lessons into the new ESGF infrastructure
    - We can reproduce much of the UX of the Pangeo CMIP6 Cloud Data without rewriting to Zarr 🎉
    - Requires:
      - ✅ Support for dataset-level references/aggregations in the base data access API
      - ❓ Support multiple reference/aggregation methods?
      - ✅ Support replication to S3 storage
      - ❓ QC on compression and chunking
    - Producing references using VirtualiZarr can work as part of publishing
    - This will significantly enhance the user experience and enable downstream efforts
      - Orgs can run cloud-enabled replica nodes to enhance performance for parallel access (no commitment from ESGF)
      - Makes efforts to create specialized caches (rechunked native Zarr, icechunk, ...) easier
      - Depending on the QC/QA procedure these could be fully verified against published files too
    - Consolidated docs 🙏! Aside from data access, having a single website with info for new users, users who want to build downstream tools, and maintainers of core infrastructure in one place would go a looooong way towards making access to climate data more inclusive and reducing the time demand on ESGF maintainers.