Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge

Slides for "Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge" presented at ESDS Forum on April 29, 2024 by Julius Busecke.

Julius Busecke

April 29, 2024

Transcript

  1. Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge

    Julius Busecke | April 29th 2024 | NCAR ESDS Forum | M²LInES
  2. Who am I?

    • Physical Oceanographer • Senior Staff Associate at Columbia University • Manager of Data and Computing for LEAP • Lead of Open Science for M²LInES • Core Developer of xGCM, xMIP • Pangeo Fan, User, and Member • Open Source/Open Science Advocate • Maintainer of the Pangeo CMIP6 zarr stores
  3. Agile Science - Speed counts!

    Idea 💡 to Result ✅ • Tech/Infrastructure limited • Understanding limited • How to speed up: - Open/Fast Access to data - Community OSS tools - Infrastructure Support from Private Sector?
  4. Agile Science - Speed counts!

    Idea 💡 to Result ✅ • Tech/Infrastructure limited • Understanding limited • How to speed up: - Open/Fast Access to data - Community OSS tools - Infrastructure Support from Private Sector? • How to speed up: - Collaboration - More Stakeholders - Reproducibility
  5. Idea 💡

    [Diagram] Collaborate with other researchers 💬 • Revised Hypothesis 💡 • Revised Hypothesis 💡 • Publish Results • Test on additional data 💽
  6. Because it's fast and easy! Reproducible IPCC Science in Minutes

    IPCC Chapter 9 • 2-10 minutes on LEAP-Pangeo JupyterHub • Code Repository: https://github.com/jbusecke/presentation_wcrp_open_science_conference • SciPy Talk: https://www.youtube.com/watch?v=7niNfs3ZpfQ
  7. Because more and more people need climate data

    Collaboration on open data: Everybody wins! [Diagram labels] Science Paper 🎉🎓 • Derived Proprietary Data Product 💵🎊 • Fixes naming • Corrects units • Tunes Compression • Uploads new data • Academia • Private Sector / Public Sector • Everyone gets to explore the clean dataset! • Public Stakeholder
  8. People love this way of accessing (CMIP) data in the cloud!

    Lower barrier of entry + Quick exploration of ideas = New CMIPers! • Teaching with the real data: Student ➡ Scientist 🚀 • Not just for academics. Public/private sector uses the cloud data!
  9. The key ingredients

    • Storage: Publicly accessible raw data via S3 interface • Ingestion: Reproducible methods to transform data into Analysis-Ready Cloud-Optimized Formats • Discovery: A reliable way for researchers to discover, cite, and learn more about datasets
  10. Collaborative Science means Community Hopping: Portability of tools is key

    [Diagram: researchers, computers, and data stores] 😀 • Data maintained by the community but publicly accessible from outside • Data maintained by the community but only accessible from community resources
  11. Collaborative Science means Community Hopping: Portability of tools is key

    [Diagram: researchers, computers, and data stores] • Data maintained by the community but publicly accessible from outside • Data maintained by the community but only accessible from community resources 😡 🚫
  12. Collaborative Science means Community Hopping: Portability of tools is key

    [Diagram: researchers, computers, and data stores] • Data maintained by the community but publicly accessible from outside • Data maintained by the community but only accessible from community resources 💽 💵⏳
  13. Collaborative Science means Community Hopping: Portability of tools is key

    [Diagram: researchers, computers, and data stores] • Data maintained by the community but publicly accessible from outside • Data maintained by the community but only accessible from community resources 😀
  15. Collaborative Science means Community Hopping: Portability of tools is key

    [Diagram: researchers, computers, and data stores] 😀 😀 😀
  17. Future Vision: Community maintained/curated data

    • Separation between Data and Compute! - We don't know what the demands in the future will be! • Public access more important than fastest compute IMO! - The time saved in migrating/downloading data for many researchers is huge! • The boundary of communities becomes more permeable for collaboration - I can run the same code on the same data even though I am in another part of the world
  18. ARCO Ingestion - The community Bakery

    Metadata • Recipe • Beam Runners: Local, Google Dataflow, Amazon Flink, PySpark (coming soon), Dask (coming soon)
  20. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files)
  21. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs
  22. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs • Dynamic chunking (based on the size of each dataset)
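
A minimal sketch of what size-based dynamic chunking could look like, assuming chunks are cut along the time dimension and aim for a fixed byte size. The function name `dynamic_time_chunks`, the 100 MiB target, and the example grid are illustrative assumptions, not taken from the CMIP6 recipe itself.

```python
def dynamic_time_chunks(
    n_time: int,
    bytes_per_timestep: int,
    target_chunk_bytes: int = 100 * 1024**2,  # assumed target: 100 MiB
) -> int:
    """Return how many time steps to place in one chunk.

    The chunk is as large as possible without exceeding the target size,
    but always holds at least one time step and never more than the
    dataset has.
    """
    if n_time < 1 or bytes_per_timestep < 1:
        raise ValueError("n_time and bytes_per_timestep must be positive")
    # Largest number of whole time steps that fits into the target size.
    steps = max(1, target_chunk_bytes // bytes_per_timestep)
    # A short dataset simply becomes a single chunk.
    return min(n_time, steps)


if __name__ == "__main__":
    # Hypothetical monthly field on a 180 x 360 grid stored as float32.
    per_step = 180 * 360 * 4
    print(dynamic_time_chunks(n_time=1980, bytes_per_timestep=per_step))
```

With this rule a small dataset ends up in a single chunk, while a dataset whose individual time steps already exceed the target falls back to one time step per chunk.
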
  23. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs • Dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline
  24. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs • Dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution
  25. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs • Dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution • Orchestrating thousands of single recipes in a dictionary
  27. The anatomy of the CMIP6 recipe

    • Recipe Steps • Start with a list of unique identifiers of a dataset (made up of .nc files) • Dynamically request download URLs • Dynamic chunking (based on the size of each dataset) • Cataloging and Testing inline • Recipe Execution • Orchestrating thousands of single recipes in a dictionary • Currently runs on Google Dataflow but could run on any Beam runner!
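
The dictionary-based orchestration can be pictured roughly as follows. This is a sketch under assumptions: `make_recipe`, `build_recipes`, and `run_all` are hypothetical helpers, the instance_ids are made up, and each "recipe" is a placeholder callable rather than a real Beam pipeline.

```python
from typing import Callable, Dict, List, Tuple

Recipe = Callable[[], str]


def make_recipe(instance_id: str) -> Recipe:
    """Build a zero-argument callable standing in for one dataset's recipe."""

    def run() -> str:
        # A real recipe would fetch, combine, rechunk, and write the data.
        # Here we only return the name of the store it would produce.
        return f"{instance_id}.zarr"

    return run


def build_recipes(instance_ids: List[str]) -> Dict[str, Recipe]:
    """Map every dataset identifier to its own independent recipe."""
    return {iid: make_recipe(iid) for iid in instance_ids}


def run_all(recipes: Dict[str, Recipe]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run each recipe and keep failures separate so one bad dataset
    does not stop the others."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for iid, recipe in recipes.items():
        try:
            results[iid] = recipe()
        except Exception as err:
            failures[iid] = repr(err)
    return results, failures


if __name__ == "__main__":
    # Made-up identifiers in the dotted CMIP6 style, for illustration only.
    ids = [
        "CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL.historical.r1i1p1f1.Omon.tos.gn.v20200101",
        "CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL.historical.r1i1p1f1.Omon.so.gn.v20200101",
    ]
    done, failed = run_all(build_recipes(ids))
    print(len(done), "succeeded,", len(failed), "failed")
```

Keying the recipes by identifier keeps each dataset independent, so a failed entry can be retried without touching the rest.
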
  28. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API • Parse single instance_ids from wildcards
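
To make the wildcard idea concrete, here is a small offline sketch that turns a dotted instance_id pattern into search facets. The facet names follow the usual CMIP6 naming order but are an assumption here, and `facets_from_pattern` is a hypothetical helper, not part of the pangeo-forge-esgf API.

```python
from typing import Dict

# Assumed order of the dot-separated parts of a CMIP6 instance_id.
CMIP6_FACETS = (
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
)


def facets_from_pattern(pattern: str) -> Dict[str, str]:
    """Split a dotted pattern into facets, skipping '*' wildcards.

    The result contains only the facets that are pinned to a value and
    could be used to constrain a search.
    """
    parts = pattern.split(".")
    if len(parts) != len(CMIP6_FACETS):
        raise ValueError(
            f"expected {len(CMIP6_FACETS)} dot-separated parts, got {len(parts)}"
        )
    return {name: value for name, value in zip(CMIP6_FACETS, parts) if value != "*"}


if __name__ == "__main__":
    # Made-up pattern: any institution, member, and version of one variable.
    print(facets_from_pattern("CMIP6.*.*.EXAMPLE-MODEL.historical.*.Omon.tos.gn.*"))
```

Each concrete instance_id returned by such a search would then identify exactly one dataset.
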
  29. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API • Parse single instance_ids from wildcards • Return HTTP download URLs for a given instance_id
  30. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API • Parse single instance_ids from wildcards • Return HTTP download URLs for a given instance_id • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast.
  31. ESGF query: pangeo-forge-esgf

    • Python client for the ESGF API • Parse single instance_ids from wildcards • Return HTTP download URLs for a given instance_id • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast. • If there are better packages I should use, I'd love the feedback
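
The async lookup pattern described above can be sketched with the standard library alone. `fake_url_lookup` is a stand-in that sleeps instead of calling the ESGF API, the example.org URLs are placeholders, and none of these names come from pangeo-forge-esgf.

```python
import asyncio
from typing import Dict, List


async def fake_url_lookup(instance_id: str) -> List[str]:
    """Placeholder for one network request (hypothetical, no real I/O)."""
    await asyncio.sleep(0.01)  # simulates waiting on a server response
    return [f"https://example.org/{instance_id}/file.nc"]


async def lookup_all(
    instance_ids: List[str], max_concurrent: int = 5
) -> Dict[str, List[str]]:
    """Look up URLs for many instance_ids concurrently.

    A semaphore caps the number of requests in flight so a long list of
    identifiers does not flood the server.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(iid: str) -> List[str]:
        async with semaphore:
            return await fake_url_lookup(iid)

    # gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(limited(iid) for iid in instance_ids))
    return dict(zip(instance_ids, results))


if __name__ == "__main__":
    urls = asyncio.run(lookup_all(["dataset-a", "dataset-b", "dataset-c"]))
    for iid, files in urls.items():
        print(iid, files)
```

Because the requests spend most of their time waiting on the network, running them concurrently is what makes querying thousands of identifiers practical.
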
  32. Lessons learned:

    • Data Ingestion for a community/public is hard work, but saves many people time and adds new members to the research discourse. • It should be funded/acknowledged appropriately! • Relying on short grant funding is risky. • Pangeo-Forge has come a long way in part due to these efforts! It is a good time to get involved! https://github.com/pangeo-forge • This vision is not dependent on commercial cloud storage, just the idea to expose data to the public in a 'cloud-like' fashion.
  33. Discussion

    [Diagram: researchers, computers, and data stores] 😀 😀 😀