
Julius Busecke
December 13, 2024

Pangeo-ESGF CMIP6 Zarr Data 2.0 - Streaming access to CMIP6 data in the cloud that rocks!

Slides for "2024-12-13_Pangeo-ESGF CMIP6 Zarr Data 2.0 - Streaming access to CMIP6 data in the cloud that rocks!" presented at AGU 2024 on December 13, 2024 by Julius Busecke.

Transcript

  1. PANGEO-ESGF CMIP6 ZARR DATA 2.0: Streaming access to CMIP6 data in the cloud that rocks!

    JULIUS BUSECKE | AGU 2024 | DEC 13 2024
  2. WHAT IS CMIP?

    "THE INTERGOVERNMENTAL PANEL ON CLIMATE CHANGE (IPCC) IS THE UNITED NATIONS BODY FOR ASSESSING THE SCIENCE RELATED TO CLIMATE CHANGE." (www.ipcc.ch/)
    - Comparing Simulations to Observations
    - Future Predictions
    - Emission Scenarios
    - Model Spread
    62 ABSTRACTS AT AGU MENTION CMIP EXPLICITLY, BUT PROBABLY MANY MORE USE THE DATA!
  3. WHAT IS CMIP? - ORGANIZATION AND VOCABULARY

    - Many hundreds of thousands of individual datasets
    - Each dataset is identified by a unique id consisting of 'facets'
    - Facets are part of the CMIP controlled vocabulary (https://wcrp-cmip.github.io/CMIP6_CVs/)
    - Unfortunately, you still need to learn your vocabulary for now
    https://wcrp-cmip.org/cmip-data-access/#access-routes
    Example id: CMIP6.ScenarioMIP.NOAA-GFDL.GFDL-CM4.ssp585.r1i1p1f1.Omon.thetao.gn
    Facet template: mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version
    Facet labels, in id order: CMIP Cycle, MIP activity, Modelling Center, Model Code, Experiment/forcing scenarios, Ensemble member, MIP table used, Output Variable, Model Grid
    Diagram annotations: different simulations; components of a single simulation
  4. WHAT IS CMIP? - TONS OF DATA!

    - Reality: Large institutions create mirrors of parts of the archive, restricted to employees (data fortresses)
    - Large overhead; requires expertise, time, and funds
    - Individual access/data cleaning approaches might be incompatible, hindering reusability/reproducibility
    - Effectively limits conducting climate science to large legacy orgs
    Diagram labels: ESGF; Custom Code (x3); University, Lab, Industry ❌ ✋🚫
  5. WHAT DID WE DO?

    Convert to Zarr on Cloud Storage (https://pangeo-data.github.io/pangeo-cmip6-cloud/)
    - v1: manual ingestion
    - v2: automated pangeo-forge/beam pipelines
  7. CMIP6 CLOUD DATA

    A single data repository in the cloud serves all use cases
    Diagram labels: ESGF; Ingestion Pipeline; Storage Provided by Google as Public Dataset; Everybody rolls their own Custom Code (x3); University, Lab, Industry ❌ ✋🚫
  8. CMIP6 CLOUD DATA

    A single data repository in the cloud serves all use cases
    - Collaborative and agile Research
    - Inclusive Education on real climate data
    - Fast Iteration - Lower Barrier of Entry
    - Portable Methods and Results not just for Academia
  13. HOW CAN I REQUEST NEW DATA?

    Choose "New File Request" and submit a list of unique ids you want ingested!
  14. WHAT IS NEXT?

    - Ongoing work to bring some of the improved user experience to the new generation of ESGF and CMIP7
    - Using virtualization to avoid doubling demand for storage
    - Do we still need to convert data to native Zarr or create performance-optimized caches?
  15. TL;DR

    - We have a ton of data already in the cloud. Explore it today if you like! And let us know about the awesome stuff you do with it!
    - We ❤ to upload new data. Submit a request if your favorite data is missing.
    - And most importantly ...
  16. WE NEED A FUTURE WHERE WORKING WITH CMIP7 DATA FEELS LIKE ...

    ... THIS 🤘 ... AND NOT LIKE THIS ... for EVERYONE on this planet!
    Illustrations: https://github.com/zarr-developers/zarr-illustrations-falk-2022
  18. I ❤ QUESTIONS + FEEDBACK

    jbusecke | juliusbusecke.com | @JuliusBusecke | @[email protected] | @codeandcurrents.bsky.social
    🙂↔💀 Screw you, Elon!
    DEMO: IPCC PLOT FROM SCRATCH IN MINUTES
    All the links 👉
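
The facet vocabulary on slide 3 and the "list of unique ids" requested on slide 13 can be illustrated with a minimal Python sketch. It assumes that ids are plain dot-separated strings following the slide's facet template, and that the trailing `version` facet is optional because the slide's example id omits it. The helper names `parse_dataset_id` and `build_request_list` are hypothetical and are not part of any Pangeo or ESGF tool; the actual request form may expect a different format.

```python
from typing import Dict, List

# Facet names in the order given by the template on slide 3.
FACET_NAMES = (
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
)


def parse_dataset_id(dataset_id: str) -> Dict[str, str]:
    """Split a dot-separated CMIP6 dataset id into named facets.

    Hypothetical helper for illustration. The trailing ``version`` facet
    is treated as optional (an assumption based on the slide's example).
    """
    parts = dataset_id.strip().split(".")
    if not len(FACET_NAMES) - 1 <= len(parts) <= len(FACET_NAMES):
        raise ValueError(
            f"expected 9 or 10 facets, got {len(parts)}: {dataset_id!r}"
        )
    if any(not part for part in parts):
        raise ValueError(f"empty facet in {dataset_id!r}")
    # zip stops at the shorter sequence, so a 9-facet id has no "version" key.
    return dict(zip(FACET_NAMES, parts))


def build_request_list(dataset_ids: List[str]) -> List[str]:
    """Return well-formed ids with duplicates removed, preserving order."""
    seen = set()
    unique = []
    for raw in dataset_ids:
        cleaned = raw.strip()
        parse_dataset_id(cleaned)  # raises ValueError on malformed ids
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    example = "CMIP6.ScenarioMIP.NOAA-GFDL.GFDL-CM4.ssp585.r1i1p1f1.Omon.thetao.gn"
    facets = parse_dataset_id(example)
    print(facets["source_id"], facets["variable_id"])
    print(build_request_list([example, example + " "]))
```

Checking ids locally before submitting them catches truncated or misspelled entries early, but it does not verify that each facet value exists in the controlled vocabulary.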