• Ph.D. from MIT, 2012
• Associate Prof. at Columbia / LDEO
• https://ocean-transport.github.io/
• Core developer of Xarray
• Core developer of Zarr
• Co-founder of Pangeo
• Open Source Advocate
This Talk
• Part I: … its benefits?
• Part II: Demo of Pangeo workflow in the Cloud
• Part III: Deep dive on the Pangeo technology stack
• Part IV: Future Challenges for Cloud Native Science
[Figure: "How do data scientists spend their time?" from the CrowdFlower Data Science Report (2016)]
• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Building training sets: 3%
• Other: 5%
The popular view of data scientists building algorithms, exploring data, and doing predictive analysis diverges significantly from reality: most of their time is spent cleaning and organizing data.
The "Download" Model
[Diagram: a) file-based approach — step 1: download files to local disk; step 2: clean / organize; step 3: analyze. b) database / API approach — query records from a remote service, save them as files on local disk, then clean / organize and analyze.]
Why Cloud Native?
Reproducibility • Creativity • Downstream Impacts • Access + Inclusion
• Massive computational resources available on demand; elastic scaling; high-throughput data storage for distributed processing.
• All collaborators around the world can access the same computational environment and data.
• Industry can exploit data more effectively if it's already in the cloud.
• Researchers are not constrained by local infrastructure (https://coessing.org/).
Cloud Native Science
• Analysis-Ready, Cloud-Optimized Data
• Data-Proximate Computing
• On-Demand, Scalable Distributed Computing
The Pangeo Community
✓ Students / Postdocs / Faculty / Software Devs / Data Scientists
✓ Academia / National Labs / Industry / NGOs
✓ Weather / Climate / Oceans / Geoscience
✓ US / UK / Europe / Australia
Participation in Pangeo is open to anyone! http://pangeo.io
The Pangeo Stack
• Zarr: cloud-optimized storage for multidimensional arrays.
• Dask: flexible, general-purpose parallel computing framework.
• Xarray: high-level API for analysis of multidimensional labelled arrays.
Plus: Kubernetes, object storage, and domain-specific packages.
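To make the stack concrete, here is a minimal sketch of how these layers combine in a typical workflow; the bucket path and variable name are placeholders, not a real Pangeo dataset:

```python
import xarray as xr
from dask.distributed import Client

# Start a Dask cluster (on Pangeo Cloud this would come from Dask Gateway instead)
client = Client()

# Open a Zarr store in object storage as a lazy, Dask-backed xarray Dataset
# (hypothetical bucket path; reading gs:// also requires gcsfs to be installed)
ds = xr.open_zarr("gs://example-bucket/example-dataset.zarr", consolidated=True)

# Analysis is expressed with labelled, high-level operations and evaluated lazily;
# Dask executes the chunked computation in parallel when .compute() is called
monthly_climatology = ds["sst"].groupby("time.month").mean("time")
result = monthly_climatology.compute()
```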
Pangeo Cloud Infrastructure
[Diagram: Compute — Dask Gateway launching clusters on autoscaling Kubernetes node pools (preemptible / spot instances plus normal instances). Data — Zarr datasets in object storage, each store holding metadata (.zarray, .zattrs) and chunk files (0.0, 0.1, 1.0, 1.1, 2.0, 2.1).]
http://catalog.pangeo.io
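A hedged sketch of the user-facing side of this infrastructure: requesting an on-demand Dask cluster through Dask Gateway, with the Kubernetes node pools autoscaling underneath (the cluster sizes here are illustrative):

```python
from dask_gateway import Gateway

# Connect to the deployment's Dask Gateway server (address is configured
# automatically on hubs like Pangeo Cloud)
gateway = Gateway()

# Request a cluster; workers are scheduled onto autoscaling node pools,
# typically cheaper preemptible / spot instances
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=20)  # grow and shrink with the workload

client = cluster.get_client()
print(client.dashboard_link)
```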
ARCO Data: Analysis Ready, Cloud Optimized
What is "Analysis Ready"?
• … for tedious homogenizing / cleaning steps
• Curated and cataloged
(Recall the CrowdFlower chart above: 60% of a data scientist's time goes to cleaning and organizing data.)
What is "Cloud Optimized"?
• … lazy access and intelligent subsetting
• Integrates with high-level analysis libraries and distributed frameworks
ARCO Data example: sea surface height
• Chunked appropriately for analysis
• Rich metadata
• Everything in one dataset object
https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
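A hedged illustration of what those three properties look like from the user side; the store path and variable/coordinate names below are placeholders standing in for the sea surface height entry in the catalog linked above:

```python
import xarray as xr

# Placeholder path standing in for the catalog's sea surface height Zarr store
ds = xr.open_zarr("gs://example-bucket/sea_surface_height.zarr", consolidated=True)

print(ds)                 # one dataset object with rich metadata (coords, attrs, variables)
print(ds["sla"].chunks)   # chunked appropriately for parallel analysis

# Lazy, label-based subsetting: only the chunks that are actually needed
# get read from object storage
subset = ds["sla"].sel(time="2010", latitude=slice(-30, 30))
```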
Recap — the three pillars of Cloud Native Science: Analysis-Ready, Cloud-Optimized Data • Data-Proximate Computing • On-Demand, Scalable Distributed Computing
Challenge: Legacy Data Formats
• Legacy data formats (… FITS, ROOT, etc.) are NOT cloud-optimized (inefficient access on object storage)
• Adopting CO formats (e.g. Parquet, Zarr) is confusing to users and data providers
• Transcoding legacy data to ARCO format can be tedious and complicated
• Some clever hacks, e.g. kerchunk: https://github.com/fsspec/kerchunk (see the sketch below)
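For instance, a minimal kerchunk sketch (the file URL is hypothetical): scan a legacy NetCDF4/HDF5 file once and record the byte ranges of its internal chunks, so that Zarr-style readers can later access it in place on object storage:

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/legacy_file.nc"  # hypothetical NetCDF4/HDF5 file

# Scan the file and build a reference set mapping Zarr keys to byte ranges
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Save the references; fsspec's "reference" filesystem can then present the
# original file to xarray/Zarr as if it were a cloud-optimized store
with open("refs.json", "w") as out:
    json.dump(refs, out)
```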
Pangeo Forge
… data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format. Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages. We hope that Pangeo Forge can play the same role for datasets (see the sketch below).
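The recipe API itself has evolved over time, so rather than quote it, here is a rough sketch in plain xarray of the kind of ETL a Pangeo Forge recipe automates; the file names are hypothetical:

```python
import xarray as xr

# Extract: hypothetical NetCDF files pulled from a traditional data repository
paths = [f"downloads/sst_{year}.nc" for year in range(2000, 2005)]

# Transform: open lazily, combine along time, and rechunk for analysis
ds = xr.open_mfdataset(paths, combine="by_coords", chunks={"time": 100})
ds = ds.chunk({"time": 365})

# Load: write a single analysis-ready, cloud-optimized Zarr store
# (in Pangeo Forge the target would be a bucket in cloud object storage)
ds.to_zarr("sst_arco.zarr", mode="w", consolidated=True)
```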
Challenge: Storage and Egress Costs
• Fear of "lock in" to a specific cloud provider due to egress fees (~$50K to download 1 PB)
• Possible solutions: OSN / Internet2, Cloudflare
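As a sanity check on that number, assuming a typical list price of roughly $0.05 per GB egressed:

```python
# Back-of-the-envelope egress cost; the $/GB figure is an assumed typical list price
price_per_gb = 0.05                    # USD per GB (assumption)
gigabytes_in_pb = 1_000_000            # 1 PB in decimal units
print(f"~${price_per_gb * gigabytes_in_pb:,.0f} to egress 1 PB")  # ~$50,000
```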
Challenge: Funding Model
• … paid for by the local institution
• Since cloud infrastructure can scale to accommodate any number of users (up to the entire field), it's not clear who should pay for it
• My opinion: make it easily "franchisable" by allowing institutions to incrementally add capacity to a federation to support their users while still leveraging economies of scale