Open Climate Data for Agile and Inclusive Science Communities
Slides for "2024-09-05_Open Climate Data for Agile and Inclusive Science Communities" presented at 2024 US Clivar PSMI Panel on September 05, 2024 by Julius Busecke.
Carbon Oxygen
Impact of small scale processes on global climate variability.
Open Science Nerd/Advocate
Maintainer: Pangeo CMIP6 Cloud Data, xMIP/xGCM
Integration Engineer
Manager for Data and Computation - NSF-LEAP
Lead of Open Research - m2lines
M²LInES
jbusecke
juliusbusecke.com
@JuliusBusecke
@[email protected]
@codeandcurrents.bsky.social
Tech/Infrastructure limited
Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility
How to speed up:
- Open/Fast Access to data
- Community OSS tools
ARCO data!
Idea
Result
"files" and "folders"
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged
Chunked appropriately for analysis
Rich metadata
Everything in one dataset object
Cloud Optimized:
• Compatible with object storage (access via HTTP)
• Supports lazy access, intelligent subsetting, and streaming access
• Integrates with high-level analysis libraries and distributed frameworks for high parallel throughput
Abernathey et al., "Cloud-Native Repositories for Big Scientific Data," 2021, doi: 10.1109/MCSE.2021.3059437
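The "lazy access" and "intelligent subsetting" points above can be illustrated with a small sketch. This is a toy model using only the Python standard library, not the API of any real storage library: `ChunkedStore` and `LazyArray` are hypothetical names, and a plain dict stands in for an object store where each chunk would be one object fetched over HTTP. In practice this role is usually played by a chunked array format read through a high-level analysis library.

```python
class ChunkedStore:
    """Toy stand-in for object storage: one 1-D array saved as separate chunks."""

    def __init__(self, values, chunk_size):
        self.chunk_size = chunk_size
        self.length = len(values)
        # One "object" per chunk, keyed by chunk index.
        self.objects = {
            start // chunk_size: list(values[start:start + chunk_size])
            for start in range(0, len(values), chunk_size)
        }
        self.requests = []  # log of chunk keys that were actually fetched

    def get(self, key):
        # In a real object store this would be one HTTP GET request.
        self.requests.append(key)
        return self.objects[key]


class LazyArray:
    """Hypothetical lazy view: no data is fetched until a slice is requested."""

    def __init__(self, store):
        self.store = store

    def __getitem__(self, sl):
        if not isinstance(sl, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = sl.indices(self.store.length)
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        if start >= stop:
            return []
        size = self.store.chunk_size
        first, last = start // size, (stop - 1) // size
        out = []
        # Fetch only the chunks that overlap the requested range.
        for key in range(first, last + 1):
            chunk = self.store.get(key)
            lo = max(start - key * size, 0)
            hi = min(stop - key * size, size)
            out.extend(chunk[lo:hi])
        return out


store = ChunkedStore(list(range(100)), chunk_size=10)
arr = LazyArray(store)   # nothing has been fetched yet
subset = arr[25:42]      # touches chunks 2, 3 and 4 only
print(store.requests)    # -> [2, 3, 4]
```

Only 3 of the 10 chunks are read for this subset, which is why chunking "appropriately for analysis" matters: the chunk layout decides how much data a typical query has to transfer.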
cloud!
Lower barrier of entry + Quick exploration of ideas = New CMIPers!
Teaching with the real data
Student → Scientist
Not just for academics. Public/private sector uses the cloud data!
design, resource allocation, more targeted simulations by having easy and fast access to data for researchers across institutions
Leverage Outside Expertise
Cross Discipline Collaboration across scientific fields, engaging the ML community, industry and non-profit sector.
Legacy
Reuse of observational data and modeling beyond initial study
LEAP/m2lines
Data Expert: Knows about the dataset (where files are stored, how they are named, how to add metadata to make the data more useful). Encodes this knowledge into a recipe.
Infrastructure Expert: Knows how to execute the recipe on pipelines and how to populate cloud storage.
Science Expert: Finds ARCO data in the catalog and uses the data for science. Provides feedback.
More info on the LEAP Data Ingestion: https://leap-stc.github.io/guides/data_guide.html#ingesting-datasets-into-cloud-storage
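The division of labor between the three roles can be sketched in code. This is a minimal illustration under stated assumptions, not the actual LEAP ingestion tooling: `Recipe`, `run_recipe`, and `open_from_catalog` are hypothetical names, the file pattern and metadata are made up, and plain dicts stand in for cloud storage and the catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Recipe:
    """Hypothetical recipe: the data expert's knowledge in machine-readable form."""
    dataset_id: str
    file_pattern: str                              # how the source files are named
    years: list                                    # which files exist
    metadata: dict = field(default_factory=dict)   # attributes that make the data useful

    def source_files(self):
        return [self.file_pattern.format(year=year) for year in self.years]


def run_recipe(recipe, cloud_storage, catalog):
    """Infrastructure expert: execute the recipe and populate storage and catalog."""
    target = f"{recipe.dataset_id}.store"
    # A real pipeline would read, combine and rechunk the files here.
    cloud_storage[target] = {
        "inputs": recipe.source_files(),
        "attrs": dict(recipe.metadata),
    }
    catalog[recipe.dataset_id] = target
    return target


def open_from_catalog(dataset_id, cloud_storage, catalog):
    """Science expert: find the dataset by name, no file paths needed."""
    return cloud_storage[catalog[dataset_id]]


# Data expert writes the recipe (all values are illustrative).
recipe = Recipe("sst_demo", "sst_{year}.nc", [2000, 2001], {"units": "degC"})

# Infrastructure expert runs it against (stand-in) storage and catalog.
storage, catalog = {}, {}
run_recipe(recipe, storage, catalog)

# Science expert only needs the catalog entry.
dataset = open_from_catalog("sst_demo", storage, catalog)
print(dataset["inputs"])   # -> ['sst_2000.nc', 'sst_2001.nc']
```

The point of the split is that the science expert never handles file names or storage layout; that knowledge lives in the recipe and is applied once by the pipeline.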
close yet!
• The reality still looks very different!
• The majority of our science is not accessible beyond institutional barriers
• There is near 0 enforcement of meaningful requirements for open data by publishers.
• It is hard work to produce and maintain ARCO data!
• This should not be done by students and postdocs alone! Researchers should not have to become DevOps Engineers to do their work!
• But we need better ways of acknowledging the data engineering work within science!
rapidly changing. Doing 'science the way it was always done' is not sufficient anymore!
• Academia and government labs are not the only players anymore.
• Working together as much as possible is imperative to deal with the climate crisis and interrelated crises
• Access to science is not just a human right, it will also reduce the amount of toil in the scientific community!
• Investing in truly public open data for climate science is a long term investment into science in general, and our collective mental health in particular.
"Everyone has the right to freely participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits." UN Declaration of Human Rights
money and time where our mouth is!
• We should all lobby our employers, funding agencies, colleagues to embrace open and cloud native access to the datasets we produce!