
Julius Busecke
November 30, 2023

"How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!"

Slides for "How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!" presented at Pangeo Showcase on November 29, 2023 by Julius Busecke and Charles Stern.

Additional Material:
Recording of the presentation: https://www.youtube.com/watch?v=vZKlcsYNNbU


Transcript

  1. How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!
    Julius Busecke and Charles Stern | Nov 28th 2023 | Pangeo Showcase
    Award 8434; Awards 2026932, 2019625
  2. Who am I?
    • Physical Oceanographer
    • Senior Staff Associate at Columbia University
    • Manager of Data and Computing for LEAP
    • Lead of Open Science for M2LInES
    • Core Developer of xGCM, xMIP
    • Pangeo Fan, User, and Member
    • Open Source / Open Science Advocate
    • Maintainer of the Pangeo CMIP6 Zarr stores
  3. Because it's fast and easy! Reproducible IPCC Science in Minutes
    • IPCC Chapter 9: 2-10 minutes on LEAP-Pangeo JupyterHub
    • Code repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference
    • SciPy talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ
  4. Because more and more people need this data. Collaboration on open data: Everybody wins!
    [Diagram labels]
    • Participants: Academia; Private Sector / Public Sector; Public Stakeholder
    • Contributions: fixes naming, corrects units, tunes compression, uploads new data
    • Outcomes: Science Paper 🎉🎓; Derived Proprietary Data Product 💵🎊
    • Everyone gets to explore the clean dataset!
  5. Challenges
    • ~13 PB 👀
    • How to prioritize the most desired datasets?
    • Millions of datasets with variable size and naming need to be converted to Analysis-Ready Cloud-Optimized (ARCO) Zarr stores 🏗
    • Each dataset consists of possibly many nc files, some of which are not always available
    • Many different grids and timesteps -> vastly different array shapes
    • Ingestion is not just one-way. Some datasets are retracted, and any mirror needs to reflect that. ⚠
  6. The Proof of concept
    • Public Storage Buckets from Google and Amazon
    • Collaborative effort in the Pangeo / ESGF Cloud Data Working Group
    • Zarr stores manually processed by Naomi Henderson
    • ~150k datasets uploaded based on user requests, with tons of manual labor in Jupyter notebooks
    • Cataloging: intake-esm collection based on a large CSV file
    • Retraction: datasets were removed from the catalog but never deleted.
    • Early upload enabled downstream work to use and improve the usability of the cloud data early on.
    • But then Naomi retired 😱
    https://github.com/jbusecke/xMIP
    https://pangeo-data.github.io/pangeo-cmip6-cloud/
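The catalog-only retraction described on slide 6 can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the working group's actual tooling: the column names (`instance_id`, `zstore`), the example rows, the bucket paths, and the `drop_retracted` helper are all hypothetical.

```python
import csv
import io

# Hypothetical catalog contents; the real catalog is a much larger CSV file.
CATALOG_CSV = """instance_id,zstore
CMIP6.CMIP.INST-A.MODEL-A.historical.r1i1p1f1.Omon.tos.gn.v20200101,gs://example-bucket/a
CMIP6.CMIP.INST-B.MODEL-B.historical.r1i1p1f1.Omon.tos.gn.v20200202,gs://example-bucket/b
"""


def drop_retracted(rows, retracted_ids):
    """Return the catalog rows whose instance_id has not been retracted.

    Only the catalog entry disappears; the stored data is left untouched.
    """
    retracted = set(retracted_ids)
    return [row for row in rows if row["instance_id"] not in retracted]


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
retracted_ids = [
    "CMIP6.CMIP.INST-B.MODEL-B.historical.r1i1p1f1.Omon.tos.gn.v20200202"
]
clean_rows = drop_retracted(rows, retracted_ids)
print([row["zstore"] for row in clean_rows])
```

Filtering the catalog rather than deleting stores keeps retraction cheap and reversible, at the cost of retracted data lingering in the bucket.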
  7. Building a robot Naomi
    • Pangeo-Forge
    • Open source platform for data Extraction, Transformation, Loading (ETL)
    • Encodes all information needed to recreate an ARCO copy
    • Originally designed for few massive datasets
    • CMIP6 is a unique use-case: Massive amount of small-ish datasets
    • Massive refactor to Apache Beam finally enabled us to scale this ingestion
    • Big shout-out to Charles Stern and all Pangeo-Forge maintainers.
    https://github.com/pangeo-forge/pangeo-forge-recipes
    https://pangeo-forge.readthedocs.io/en/latest/index.html
  8. The anatomy of the CMIP6 recipe
    • Recipe Steps
      • Start with a list of unique identifiers of a dataset (made up of .nc files)
      • Dynamically request download URLs
      • Dynamic chunking (based on the size of each dataset)
      • Cataloging and testing inline
    • Recipe Execution
      • Orchestrating thousands of single recipes in a dictionary
      • Currently runs on Google Dataflow but could run on any Beam runner!
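The idea of holding one recipe per dataset in a dictionary can be sketched in plain Python. This is only an illustration: the placeholder recipes below are ordinary callables, not Pangeo Forge or Beam objects, and `make_recipe`, `run_all`, and the dataset names are hypothetical.

```python
def make_recipe(instance_id):
    """Build a stand-in recipe: a callable that returns a target store name.

    A real recipe would be a chain of Beam transforms; this placeholder only
    mimics success and failure so the orchestration pattern is visible.
    """
    def recipe():
        if instance_id.endswith("missing"):
            raise FileNotFoundError(f"no files available for {instance_id}")
        return f"{instance_id}.zarr"
    return recipe


# One entry per dataset, keyed by its unique identifier.
recipes = {
    iid: make_recipe(iid)
    for iid in ["dataset-a", "dataset-b", "dataset-missing"]
}


def run_all(recipes):
    """Run every recipe independently and record failures per dataset."""
    written, failed = {}, {}
    for iid, recipe in recipes.items():
        try:
            written[iid] = recipe()
        except Exception as err:  # one bad dataset must not stop the rest
            failed[iid] = str(err)
    return written, failed


written, failed = run_all(recipes)
print(f"{len(written)} written, {len(failed)} failed")
```

Keying by identifier means a dataset whose files are temporarily unavailable can be retried later without touching the others.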
  9. ESGF query: pangeo-forge-esgf
    • Python client for the ESGF API
    • Parse single instance_ids from wildcards
    • Return HTTP download URLs for a given instance_id
    • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
    • If there are better packages I should use, I'd love the feedback
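The two tasks on slide 9 (expanding wildcards into single instance_ids, then looking up URLs concurrently) can be sketched offline. This does not use or reproduce the pangeo-forge-esgf API: the real client queries ESGF, whereas here the wildcard is matched against a hypothetical local list (`KNOWN_IDS`), and `fake_lookup_urls` returns made-up example.org URLs. The facet names follow the usual ten-part CMIP6 instance_id layout and should be treated as an assumption.

```python
import asyncio
from fnmatch import fnmatchcase

# Assumed facet order of a CMIP6 instance_id.
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)

# Hypothetical identifiers standing in for an ESGF search result.
KNOWN_IDS = [
    "CMIP6.CMIP.INST-A.MODEL-A.historical.r1i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.INST-A.MODEL-A.historical.r2i1p1f1.Omon.tos.gn.v20200101",
    "CMIP6.CMIP.INST-B.MODEL-B.historical.r1i1p1f1.Amon.tas.gr.v20200202",
]


def expand_wildcard(pattern, known_ids):
    """Return every known instance_id that matches a shell-style wildcard."""
    return [iid for iid in known_ids if fnmatchcase(iid, pattern)]


def parse_facets(instance_id):
    """Split an instance_id into a facet-name -> value dictionary."""
    return dict(zip(FACETS, instance_id.split(".")))


async def fake_lookup_urls(instance_id):
    """Placeholder for a network request; yields control, returns fake URLs."""
    await asyncio.sleep(0)
    return [f"https://example.org/{instance_id}/file_0.nc"]


async def lookup_all(instance_ids):
    """Look up all datasets concurrently; gather preserves input order."""
    url_lists = await asyncio.gather(*(fake_lookup_urls(i) for i in instance_ids))
    return dict(zip(instance_ids, url_lists))


matches = expand_wildcard("CMIP6.*.*.*.historical.*.Omon.tos.*.*", KNOWN_IDS)
urls = asyncio.run(lookup_all(matches))
print(len(matches), "datasets matched")
```

Note that `*` in `fnmatchcase` also matches dots, so this pattern matching is looser than a facet-by-facet comparison would be.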
  10. Dynamic Chunking: Bring your own!
    • Preserve monthly chunks for some fancy calendar once Zarr allows unequal chunks.
    • Chunk one dimension only if a certain other dimension is available
    • ...
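One simple form of size-based dynamic chunking is to pick the chunk length along time so that each chunk lands near a target byte size. The sketch below is an assumption about what such a rule could look like, not the algorithm used in the CMIP6 recipe; the function name, the 100 MB target, and the example shapes are illustrative.

```python
import math


def time_chunk_length(shape, itemsize, target_bytes, time_axis=0):
    """Return how many time steps fit in one chunk of about target_bytes.

    All non-time dimensions are kept whole within each chunk, so small grids
    get long time chunks and large grids get short ones.
    """
    n_time = shape[time_axis]
    bytes_per_step = itemsize * math.prod(
        size for axis, size in enumerate(shape) if axis != time_axis
    )
    steps = max(1, target_bytes // bytes_per_step)
    return min(n_time, steps)


TARGET = 100_000_000  # assumed target of roughly 100 MB per chunk

# (time, lat, lon) arrays of 4-byte floats on three different grids
coarse = time_chunk_length((1980, 180, 360), itemsize=4, target_bytes=TARGET)
fine = time_chunk_length((1980, 1080, 1440), itemsize=4, target_bytes=TARGET)
tiny = time_chunk_length((12, 10, 10), itemsize=4, target_bytes=TARGET)
print(coarse, fine, tiny)
```

A "bring your own" rule, such as chunking one dimension only when another is present, would replace this function while keeping the same inputs.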
  11. Ok, so what's up with the "why we should never do it again"?
    • Lower barrier of entry + Quick exploration of ideas = New CMIPers!
    • Teaching with the real data
    • Student ➡ Scientist 🚀
    • Not just for academics. Public/private sector uses the cloud data!