Open Climate Data for Agile and Inclusive Science Communities
Julius Busecke | Sep 5 2024 | US Clivar P

Who am I?
🌊 Climate Scientist: Ocean transport of Heat, Carbon, Oxygen. Impact of small scale processes on global climate variability.
🤓 Open Science Nerd/Advocate. Maintainer: Pangeo CMIP6 Cloud Data, xMIP/xGCM
⚙ Integration Engineer. Manager for Data and Computation - NSF-LEAP. Lead of Open Research - M²LInES
jbusecke, juliusbusecke.com, @JuliusBusecke, @[email protected], @codeandcurrents.bsky.social

How do we advance science faster? Truly open access to ARCO data!
Idea 💡 Result ✅
Tech/Infrastructure limited, Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility
How to speed up:
- Open/Fast Access to data
- Community OSS tools

Analysis-Ready Cloud-Optimized (ARCO) data
Analysis-Ready:
• Think in "Datasets/Datacubes" not "files" and "folders"
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object

Cloud Optimized:
• Compatible with object storage (access via HTTP)
• Supports lazy access, intelligent subsetting, and streaming access
• Integrates with high-level analysis libraries and distributed frameworks for high parallel throughput
Abernathey et al., "Cloud-Native Repositories for Big Scientific Data," 2021, doi: 10.1109/MCSE.2021.3059437
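To make "lazy access" and "intelligent subsetting" concrete, here is a minimal pure-Python sketch. It assumes a one-dimensional array stored as fixed-size chunks, one stored object per chunk. The class `ChunkedArray`, the helper `make_array`, and the in-memory dict standing in for object storage are illustrative assumptions, not the API of Zarr or any other library. The point is only that reading a slice touches just the chunks that overlap it.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkedArray:
    """Toy 1-D array split into fixed-size chunks, one stored object per chunk."""

    length: int
    chunk_size: int
    store: dict  # hypothetical stand-in for object storage: chunk index -> values
    fetched: list = field(default_factory=list)  # records which chunks were read

    def chunks_for(self, start: int, stop: int) -> range:
        """Indices of the chunks that overlap the half-open interval [start, stop)."""
        if not 0 <= start <= stop <= self.length:
            raise IndexError("slice out of bounds")
        if start == stop:
            return range(0)
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        return range(first, last + 1)

    def read(self, start: int, stop: int) -> list:
        """Fetch only the overlapping chunks and trim them to the request."""
        out = []
        for idx in self.chunks_for(start, stop):
            self.fetched.append(idx)  # in a real store this would be one HTTP request
            chunk = self.store[idx]
            offset = idx * self.chunk_size
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(chunk))
            out.extend(chunk[lo:hi])
        return out


def make_array(values: list, chunk_size: int) -> ChunkedArray:
    """Split a list of values into chunks and wrap them in a ChunkedArray."""
    store = {
        i // chunk_size: values[i : i + chunk_size]
        for i in range(0, len(values), chunk_size)
    }
    return ChunkedArray(len(values), chunk_size, store)


# 100 values in chunks of 10; the request below overlaps only 3 of the 10 chunks.
arr = make_array(list(range(100)), chunk_size=10)
subset = arr.read(25, 42)
```

After the call, `arr.fetched` lists the chunk indices that were actually read. Choosing the chunk size to match typical analysis patterns is what "chunked appropriately for analysis" refers to in this toy picture.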

Pangeo CMIP6 Cloud Data
ESGF
Everybody rolls their own: Custom Code (University), Custom Code (Lab), Custom Code (Industry) ❌ ✋🚫
Ingestion Pipeline: A single data repository in the cloud serves all use cases

People love this way of accessing (CMIP) data in the cloud!
Lower barrier of entry + Quick exploration of ideas = New CMIPers!
Teaching with the real data: Student ➡ Scientist 🚀
Not just for academics. Public/private sector uses the cloud data!

Pangeo/ESGF CMIP6 Zarr Data Request some new data! https://github.com/leap-stc/cmip6-leap-feedstock

Why do process studies need ARCO data?
Planning: Improve observational design, resource allocation, more targeted simulations by having easy and fast access to data for researchers across institutions
Leverage Outside Expertise: Cross-discipline collaboration across scientific fields, engaging the ML community, industry and non-profit sector.
Legacy: Reuse of observational data and modeling beyond initial study

Collaborative Science means Community Hopping
Portability of tools is key
[Figure: researchers hopping between communities, each shown with compute (🖥) and data (💽). Legend: data maintained by the community but publicly accessible from outside; data maintained by the community but only accessible from community resources. Successive builds annotate the hop with 😡 🚫, then 💽 💵⏳, then 😀.]

Maintaining a community ARCO dataset
How we do it at LEAP/m2lines
Data Expert: Knows about the dataset (where files are stored, how they are named, how to add metadata to make the data more useful). Encodes this knowledge into a recipe.
Infrastructure Expert: Knows how to execute the recipe on pipelines and how to populate cloud storage.
Science Expert: Finds ARCO data in the catalog and uses the data for science. Provides feedback.
More info on the LEAP Data Ingestion: https://leap-stc.github.io/guides/data_guide.html#ingesting-datasets-into-cloud-storage
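The division of labor between the three roles can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how the LEAP ingestion pipeline is implemented: the `Recipe` class, the functions `execute_recipe` and `search`, the file names, and the dicts standing in for source files, cloud storage, and the catalog are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Recipe:
    """Hypothetical recipe: the data expert's knowledge, written down."""

    dataset_id: str
    source_files: list  # where the original files are stored
    metadata: dict = field(default_factory=dict)  # extra attributes that aid discovery


def execute_recipe(recipe, read_file, cloud_store, catalog):
    """Infrastructure expert: run the recipe and populate storage and catalog."""
    values = []
    for path in recipe.source_files:
        values.extend(read_file(path))  # combine many files into one dataset
    cloud_store[recipe.dataset_id] = {"values": values, "attrs": dict(recipe.metadata)}
    catalog[recipe.dataset_id] = dict(recipe.metadata)


def search(catalog, **criteria):
    """Science expert: find dataset ids whose metadata match all criteria."""
    return sorted(
        dataset_id
        for dataset_id, attrs in catalog.items()
        if all(attrs.get(key) == value for key, value in criteria.items())
    )


# Toy stand-ins for source files, cloud storage, and the catalog (all made up).
fake_files = {"sst_2000.nc": [14.1, 14.3], "sst_2001.nc": [14.2, 14.6]}
cloud_store, catalog = {}, {}

recipe = Recipe(
    dataset_id="toy-sst",
    source_files=["sst_2000.nc", "sst_2001.nc"],
    metadata={"variable": "sst", "units": "degC"},
)
execute_recipe(recipe, fake_files.__getitem__, cloud_store, catalog)
matches = search(catalog, variable="sst")
```

The design choice this is meant to highlight is the separation of concerns: the recipe holds only dataset knowledge, the executor holds only infrastructure knowledge, and the scientist interacts only with the catalog.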

Challenges
We have to admit that we are not even close yet!
• The reality still looks very different!
• The majority of our science is not accessible beyond institutional barriers.
• There is near-zero enforcement of meaningful requirements for open data by publishers.
• It is hard work to produce and maintain ARCO data!
• This should not be done by students and postdocs alone! Researchers should not have to become DevOps Engineers to do their work!
• But we need better ways of acknowledging the data engineering work within science!

Challenges ⚠ Opinionated Take ⚠
• The world around us is rapidly changing. Doing 'science the way it was always done' is not sufficient anymore!
• Academia and government labs are not the only players anymore.
• Working together as much as possible is imperative to deal with the climate crisis and interrelated crises.
• Access to science is not just a human right, it will also reduce the amount of toil in the scientific community!
• Investing in truly public open data for climate science is a long-term investment in science in general, and our collective mental health in particular.
"Everyone has the right to freely participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits." UN Declaration of Human Rights

Challenges (continued) ⚠ Opinionated Take ⚠
• Let's put our money and time where our mouth is!
• We should all lobby our employers, funding agencies, and colleagues to embrace open and cloud-native access to the datasets we produce!

We can do this, together!

LEAP-Pangeo https://leap-stc.github.io/intro.html
