Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge
Slides for "Analysis-Ready Cloud-Optimized Data for your community and the entire world with Pangeo-Forge" presented at ESDS Forum on April 29, 2024 by Julius Busecke.
Associate at Columbia University • Manager of Data and Computing for LEAP • Lead of Open Science for M2LInES • Core Developer of xGCM, xMIP • Pangeo Fan, User, and Member • Open Source / Open Science Advocate • Maintainer of the Pangeo CMIP6 zarr stores
limited Understanding limited How to speed up: - Open/Fast Access to data - Community OSS tools - Infrastructure Support from Private Sector? How to speed up: - Collaboration - More Stakeholders - Reproducibility
open data: Everybody wins! 👨💼 👨🔬 🧑💼 🧑💻 Science Paper 👩🔬🎉🎓 Derived Proprietary Data Product 🧑💼💵🎊 Fixes naming Corrects units Tunes Compression Uploads new data Academia Private Sector / Public Sector 👷 👩🏫 🏄 Everyone gets to explore the clean dataset! Public Stakeholder
cloud! Lower barrier of entry + Quick exploration of ideas = New CMIPers! Teaching with the real data Student ➡ Scientist 🚀 Not just for academics. Public/private sector uses the cloud data!
Storage • Discovery: A reliable way for researchers to discover, cite, and learn more about datasets • Ingestion: Reproducible methods to transform data into Analysis-Ready Cloud-Optimized Formats
🧑💻 👩💻 👩💻 👨💻 🧑💻 🖥 💽 👩💻 👩💻 👨💻 🧑💻 🖥 👩💻 👩💻 👨💻 🧑💻 🖥 💽 😀 🧑💻 💽 💽 Data maintained by the community but publicly accessible from outside Data maintained by the community but only accessible from community resources
🧑💻 👩💻 👩💻 👨💻 🧑💻 🖥 💽 👩💻 👩💻 👨💻 🧑💻 🖥 👩💻 👩💻 👨💻 🧑💻 🖥 💽 🧑💻 💽 💽 Data maintained by the community but publicly accessible from outside Data maintained by the community but only accessible from community resources 😡 🚫
🧑💻 👩💻 👩💻 👨💻 🧑💻 🖥 💽 👩💻 👩💻 👨💻 🧑💻 🖥 👩💻 👩💻 👨💻 🧑💻 🖥 💽 🧑💻 💽 💽 Data maintained by the community but publicly accessible from outside Data maintained by the community but only accessible from community resources 💽 💵⏳
👩💻 👩💻 👨💻 🧑💻 🖥 💽 👩💻 👩💻 👨💻 🧑💻 🖥 👩💻 👩💻 👨💻 🧑💻 🖥 💽 🧑💻 💽 💽 Data maintained by the community but publicly accessible from outside Data maintained by the community but only accessible from community resources 😀
Compute! - We don't know what the demands in the future will be! • Public access more important than fastest compute IMO! - The time saved in migrating/downloading data for many researchers is huge! • The boundary of communities becomes more permeable for collaboration - I can run the same code on the same data even though I am in another part of the world
The anatomy of the CMIP6 recipe • identifiers of a dataset (made up of .nc files) • dynamically request download URLs • dynamic chunking (based on the size of each dataset) • Cataloging and testing inline • Recipe execution • Orchestrating thousands of single recipes in a dictionary • Currently runs on Google Dataflow but could run on any Beam runner!
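To make "dynamic chunking (based on the size of each dataset)" and the dictionary of recipes more concrete, here is a minimal sketch. It is an assumption about how such a step could look, not the code of the actual CMIP6 recipe: the function name `dynamic_time_chunks`, the 100 MiB target, the choice to chunk only along time, and the dataset keys are all hypothetical.

```python
def dynamic_time_chunks(n_time, bytes_per_timestep, target_chunk_bytes=100 * 1024**2):
    """Return how many time steps to put into one chunk.

    The chunk length is the largest number of time steps whose combined
    size stays at or below target_chunk_bytes, but at least 1 and never
    more than the dataset has.
    """
    if n_time < 1 or bytes_per_timestep < 1:
        raise ValueError("n_time and bytes_per_timestep must be positive")
    steps = max(1, target_chunk_bytes // bytes_per_timestep)
    return min(n_time, steps)


# Hypothetical datasets keyed by an identifier; sizes assume float32 (4 bytes).
datasets = {
    "hypothetical.monthly.coarse": {"n_time": 1980, "bytes_per_timestep": 180 * 360 * 4},
    "hypothetical.monthly.short": {"n_time": 12, "bytes_per_timestep": 180 * 360 * 4},
    "hypothetical.daily.fine": {"n_time": 60000, "bytes_per_timestep": 1440 * 2880 * 4},
}

# One entry per dataset, mirroring the "many single recipes in a dictionary" idea.
chunk_plan = {iid: dynamic_time_chunks(**meta) for iid, meta in datasets.items()}

for iid, steps in chunk_plan.items():
    print(f"{iid}: {steps} time steps per chunk")
```

Deriving the chunk length from each dataset's size keeps chunks in a similar byte range for coarse and fine grids alike, which a single fixed chunk length would not.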
• Parse single instance_ids from wildcards • Return HTTP download URLs for a given instance_id • This is my first async code, so it can probably be vastly improved, but it seems reasonably fast. • If there are better packages I should use, I'd love the feedback
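The two steps above (expanding wildcards into single instance_ids, then resolving download URLs asynchronously) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code shown in the talk: `KNOWN_IDS` stands in for a real catalog search, the version strings are illustrative, `fetch_urls` is a stub that makes no network request, and the `example.org` URLs are placeholders.

```python
import asyncio
from fnmatch import fnmatchcase

# Hypothetical stand-in for the result of a catalog search.
KNOWN_IDS = [
    "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.v20190726",
    "CMIP6.CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.so.gn.v20190726",
    "CMIP6.ScenarioMIP.NOAA-GFDL.GFDL-ESM4.ssp585.r1i1p1f1.Omon.tos.gn.v20180701",
]


def expand_wildcards(pattern, known_ids):
    """Return every known instance_id that matches a wildcard pattern."""
    return sorted(iid for iid in known_ids if fnmatchcase(iid, pattern))


async def fetch_urls(instance_id):
    """Stub for a network request; returns placeholder URLs."""
    await asyncio.sleep(0)  # yield control, as a real HTTP call would
    return [f"https://example.org/{instance_id}/file_{i}.nc" for i in range(2)]


async def resolve_all(instance_ids, max_concurrent=5):
    """Resolve URLs for all ids concurrently, with a cap on open requests."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(iid):
        async with semaphore:
            return iid, await fetch_urls(iid)

    pairs = await asyncio.gather(*(bounded(iid) for iid in instance_ids))
    return dict(pairs)


ids = expand_wildcards("CMIP6.*.*.GFDL-ESM4.*.r1i1p1f1.Omon.tos.gn.*", KNOWN_IDS)
urls = asyncio.run(resolve_all(ids))
for iid, files in urls.items():
    print(iid, len(files), "files")
```

The semaphore is one simple way to avoid overwhelming a data server when thousands of instance_ids are resolved at once; the limit of 5 is an arbitrary choice for the sketch.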
work, but saves many people time and adds new members to the research discourse. • It should be funded/acknowledged appropriately! • Relying on short grant funding is risky. • Pangeo-Forge has come a long way in part due to these efforts! It is a good time to get involved! https://github.com/pangeo-forge • This vision is not dependent on commercial cloud storage, just the idea to expose data to the public in a 'cloud-like' fashion.