Columbia University
Manager of Data and Computing - LEAP NSF-STC
Lead of Open Science - M2LInES
Core Developer of xGCM, xMIP
Pangeo Fan, User, and Member
Open Source/Open Science Advocate
Maintainer of the Pangeo CMIP6 Zarr stores
Cloud to provide a new public dataset
• > 1 PB and counting
• Data stored in Zarr format
• Google provides free hosting in GCS
https://pangeo-data.github.io/pangeo-cmip6-cloud/
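A minimal sketch of how the Zarr stores could be located and opened from Python. The slides do not show any code, so everything here is an assumption: the catalog URL, the column names, and the helper names `read_catalog`, `select_stores`, and `open_store` are illustrative. The sample rows are placeholders, not real catalog entries. Opening a store needs `xarray`, `zarr`, and `gcsfs` plus network access, so that step is wrapped in a function and not called here.

```python
import csv
import io

# Assumption: location of the public catalog CSV (not stated in the slides).
CATALOG_URL = "https://storage.googleapis.com/cmip6/pangeo-cmip6.csv"

# Placeholder rows standing in for the real catalog.
# Model names and zstore paths are made up for illustration only.
SAMPLE_CATALOG = """\
source_id,experiment_id,table_id,variable_id,zstore
MODEL-A,historical,Omon,tos,gs://example-bucket/model-a/historical/tos/
MODEL-A,ssp585,Omon,tos,gs://example-bucket/model-a/ssp585/tos/
MODEL-B,historical,Amon,tas,gs://example-bucket/model-b/historical/tas/
"""


def read_catalog(text):
    """Parse catalog CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def select_stores(rows, **facets):
    """Return the zstore path of every row matching all given facets."""
    return [
        row["zstore"]
        for row in rows
        if all(row.get(key) == value for key, value in facets.items())
    ]


def open_store(zstore):
    """Lazily open one Zarr store with anonymous access.

    Requires xarray, zarr, and gcsfs; nothing is downloaded until
    values are actually computed.
    """
    import xarray as xr  # imported here so the rest runs without it

    return xr.open_zarr(
        zstore, consolidated=True, storage_options={"token": "anon"}
    )


rows = read_catalog(SAMPLE_CATALOG)
matches = select_stores(rows, experiment_id="historical", variable_id="tos")
print(matches)
```

With the real catalog, the same filtering would run on the downloaded CSV text, and each selected path could be passed to `open_store` to get a lazily loaded dataset.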
open data: Everybody wins!
[Diagram]
Contributions: Fixes naming • Corrects units • Tunes compression • Uploads new data
Outcomes: Science Paper 🎉🎓 • Derived Proprietary Data Product 💵🎊
Participants: Academia • Private Sector / Public Sector • Public Stakeholder
Everyone gets to explore the clean dataset!
store
Easy parallelization -> fast iteration on analysis
Flexible scientific data structure
Crowd-sourced cleaning + combining
Your favorite code to analyze/interpret 🎁
exploration of ideas = New CMIPers!
Teaching with the real data
Student ➡ Scientist 🚀
Not just for academics. Public/private sector uses the cloud data!
limited
How to speed up:
- Open/Fast Access to data
- Community OSS tools
- Infrastructure Support from Private Sector?

Understanding limited
How to speed up:
- Collaboration
- More Stakeholders
- Reproducibility