Ryan Abernathey
November 10, 2021

Unlocking the Potential of Cloud Native Science with Pangeo

Keynote Talk given at JHU IDIES Symposium

https://idies.jhu.edu/news-events/events/idies-annual-symposium/

Transcript

  1. Unlocking the Potential of Cloud Native Science

    Ryan Abernathey, IDIES 2021
  2. Who Am I?

    • Physical Oceanographer
    • Ph.D. from MIT, 2012
    • Associate Prof. at Columbia / LDEO (https://ocean-transport.github.io/)
    • Core developer of Xarray
    • Core developer of Zarr
    • Co-founder of Pangeo
    • Open Source Advocate
  3. This Talk

    • Part I: What is “cloud native” science? What are its benefits?
    • Part II: Demo of Pangeo workflow in the cloud
    • Part III: Deep dive on the Pangeo technology stack
    • Part IV: Future challenges for cloud native science
  4. Two Papers

    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437
  5. Data is Exploding in All Fields!

    Examples: James Webb Space Telescope, Light Sheet Fluorescence Microscope
  6. What Science Do We Want to Do with All This Data?

  7. What Science Do We Want to Do with All This Data?

    Take the mean!
  8. What Science Do We Want to Do with All This Data?

    Analyze spatiotemporal variability
  9. What Science Do We Want to Do with All This Data?

    Machine learning! (Credit: Berkeley Lab)
  10. 80/20 Rule of Data Science

    Figure: “How do data scientists spend their time?”, CrowdFlower Data Science Report (2016)
    • Cleaning and organizing data: 60%
    • Collecting data sets: 19%
    • Mining data for patterns: 9%
    • Refining algorithms: 4%
    • Building training sets: 3%
    • Other: 5%
  11. “Data Work” in AI

    “…incentivizing data excellence as a first-class citizen of AI…”
  12. The “Download” Model

    a) File-based approach; b) database / API approach
    • Step 1: download (files, or records returned by a query, to local disk)
    • Step 2: clean / organize
    • Step 3: analyze
  13. The “Download” Model

    MB 😀
  14. The “Download” Model

    GB 😐
  15. The “Download” Model

    TB 😖
  16. The “Download” Model

    PB 😱
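To make the MB-to-PB progression concrete, here is a rough sketch of how long step 1 (download) takes at each size. The 100 Mbit/s sustained bandwidth and the decimal units (1 MB = 10^6 bytes) are assumptions chosen for illustration, not figures from the talk.

```python
# Rough download-time estimate for the "download" model.
# ASSUMPTION: 100 Mbit/s sustained bandwidth (illustrative, not from the talk).
BANDWIDTH_BITS_PER_S = 100e6

# ASSUMPTION: decimal units.
SIZES_BYTES = {"MB": 1e6, "GB": 1e9, "TB": 1e12, "PB": 1e15}


def download_seconds(n_bytes, bandwidth_bits_per_s=BANDWIDTH_BITS_PER_S):
    """Seconds needed to transfer n_bytes at the given bandwidth."""
    return n_bytes * 8 / bandwidth_bits_per_s


def humanize(seconds):
    """Format a duration using the largest unit that keeps the value >= 1."""
    units = (("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60))
    for unit, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.2f} seconds"


for label, size in SIZES_BYTES.items():
    print(f"1 {label}: {humanize(download_seconds(size))}")
```

Under these assumptions a megabyte arrives in a fraction of a second, while a petabyte takes about two and a half years, which is the point of the progression from 😀 to 😱.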
  17. Open Cloud Architecture

    Data Provider’s $ • Data Consumer’s $
  18. Open Cloud Architecture

    Data Provider’s $ • Data Consumer’s $ • Interactive Computing
  19. Open Cloud Architecture

    Data Provider’s $ • Data Consumer’s $ • Interactive Computing • Parallel Computing
  20. Open Cloud Architecture

    Data Provider’s $ • Data Consumer’s $ • Interactive Computing • Parallel Computing • Analysis Ready Data, Cloud Optimized Formats
  21. Open Cloud Architecture

    Data Provider’s $ • Data Consumer’s $ • Interactive Computing • Parallel Computing • Analysis Ready Data, Cloud Optimized Formats
  22. Why Cloud Native?

    • Performance
    • Reliability
    • Cost Effectiveness
    • Collaboration
    • Reproducibility
    • Creativity
    • Downstream Impacts
    • Access + Inclusion
  23. Why Cloud Native?

    Massive computational resources available on demand. Elastic scaling. High throughput data storage for distributed processing.
  24. Why Cloud Native?

    All collaborators around the world can access the same computational environment and data.
  25. Why Cloud Native?

    Industry can exploit data more effectively if it’s already in the cloud.
  26. Why Cloud Native?

    Researchers are not constrained by local infrastructure. https://coessing.org/
  27. Pillars of Cloud Native

    • Analysis-Ready, Cloud-Optimized Data
    • Data-Proximate Computing
    • On Demand, Scalable Distributed Computing
  28. What is Pangeo?

    “A community platform for Big Data geoscience”
    • Open Community
    • Open Source Software
    • Open Source Infrastructure
  29. Pangeo Community

    ✓ Students / Postdocs / Faculty / Software Devs / Data Scientists
    ✓ Academia / National Labs / Industry / NGO
    ✓ Weather / Climate / Oceans / Geoscience
    ✓ US / UK / Europe / Australia

    Participation in Pangeo is open to anyone! http://pangeo.io
  30. Pangeo Showcase

    https://pangeo.io/pangeo-showcase.html
  31. Pangeo Software Ecosystem

    Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
  32. Pangeo Cloud Stack

    • Zarr: Cloud-optimized storage for multidimensional arrays.
    • Dask: Flexible, general-purpose parallel computing framework.
    • Xarray: High-level API for analysis of multidimensional labelled arrays.

    Other layers in the diagram: Kubernetes, Object Storage, Domain-Specific Packages
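The division of labor in this stack (chunked storage, a parallel scheduler, a high-level array API) can be illustrated without any of the libraries themselves. The sketch below is a standard-library stand-in, not Dask or Xarray code: it computes a mean chunk by chunk in a thread pool and then combines the partial results. The names `chunked`, `partial_sum` and `parallel_mean` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_sum(chunk):
    """Per-chunk reduction: return (sum, count) so results can be combined."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=4, max_workers=3):
    """Mean computed as a map over chunks followed by a combine step.

    Assumes values is non-empty.
    """
    chunks = chunked(values, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(10))      # stand-in for a large array
print(parallel_mean(data))  # same value as sum(data) / len(data)
```

Each worker only ever sees one chunk, so the same pattern carries over when the chunks are objects in cloud storage and the workers are separate machines.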
  33. The Pangeo Cloud Stack

    • Cloud Object Store: Metadata (.zattrs) and Chunks (0.0, 1.0, 2.0)
    • Cloud Compute Cluster: Jupyter pod and Dask workers
    • The two are connected via HTTP GET

    http://pangeo.io/cloud.html
  34. Pangeo Cloud Infrastructure

    • Compute: Dask Gateway; Node Pools (Autoscaling), both normal and preemptible (spot instance)
    • Data: Zarr Datasets, each holding .zarray, .zattrs and chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1

    http://catalog.pangeo.io
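The object names in the diagram (`0.0`, `0.1`, through `2.1`) follow the Zarr v2 convention of naming each chunk by its grid indices joined with `.`, stored alongside the `.zarray` and `.zattrs` metadata objects. A minimal sketch of that naming scheme follows; the function name `chunk_keys` is hypothetical, and the array shape and chunk shape are invented to reproduce the 3 × 2 chunk grid on the slide.

```python
import itertools
import math


def chunk_keys(shape, chunks, separator="."):
    """List the chunk object names for an array split into a regular grid."""
    # Number of chunks along each dimension (the last chunk may be partial).
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    index_ranges = (range(g) for g in grid)
    return [separator.join(map(str, idx)) for idx in itertools.product(*index_ranges)]


# ASSUMPTION: shape and chunk shape are invented to give a 3 x 2 chunk grid.
keys = [".zarray", ".zattrs"] + chunk_keys(shape=(300, 200), chunks=(100, 100))
print(keys)
```

Because every key is a separate object in the store, any chunk can be fetched with its own independent HTTP GET.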
  35. Pangeo Cloud Data Catalog

    catalog.pangeo.io
  36. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Analysis Ready”?
    • Think in “Datasets” not “data files”
    • No need for tedious homogenizing / cleaning steps
    • Curated and cataloged

    Figure: “How do data scientists spend their time?”, CrowdFlower Data Science Report (2016)
  37. ARCO Data: Analysis Ready, Cloud Optimized

    What is “Cloud Optimized”?
    • Compatible with object storage (access via HTTP)
    • Supports lazy access and intelligent subsetting
    • Integrates with high-level analysis libraries and distributed frameworks
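“Intelligent subsetting” means a reader fetches only the chunks that overlap the requested region. The following is a minimal sketch of that index arithmetic for a regularly chunked array. The function name `chunks_for_selection` and the array dimensions are hypothetical; libraries such as Zarr handle this internally.

```python
import itertools


def chunks_for_selection(selection, chunks):
    """Return grid indices of the chunks overlapping a selection.

    selection: one half-open (start, stop) pair per dimension, with stop > start.
    chunks:    chunk length per dimension.
    """
    per_dim = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return list(itertools.product(*per_dim))


# ASSUMPTION: a 300 x 200 array stored in 100 x 100 chunks (6 chunks in total).
needed = chunks_for_selection(selection=[(150, 250), (0, 50)], chunks=(100, 100))
print(needed)  # only these chunks need to be read
```

In this example the selection touches two of the six chunks, so only two objects are requested from the store.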
  38. Example of ARCO Data

    • Chunked appropriately for analysis
    • Rich metadata
    • Everything in one dataset object

    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
  39. Cloud Optimized Scales!

    Chart labels: Xarray + Dask + Zarr; Legacy Server
  40. Pillars of Cloud Native

    • Analysis-Ready, Cloud-Optimized Data
    • Data-Proximate Computing
    • On Demand, Scalable Distributed Computing
  41. Challenges for Cloud Native Science

    😣 Legacy data formats
    😣 Storage and egress costs
    😣 Funding models
  42. Challenge: Legacy Data Formats

    • Most of our existing scientific data formats (e.g. HDF5, FITS, ROOT, etc.) are NOT cloud-optimized (inefficient access on object storage)
    • Adopting cloud-optimized (CO) formats (e.g. Parquet, Zarr) is confusing to users and data providers
    • Transcoding legacy data to ARCO format can be tedious and complicated
    • Some clever hacks, e.g. kerchunk: https://github.com/fsspec/kerchunk
  43. Pangeo Forge (https://pangeo-forge.org/)

    Pangeo Forge is an open source platform for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format. Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages. We hope that Pangeo Forge can play the same role for datasets.
  44. Challenge: Storage and Egress Costs

    • S3 is very expensive! ($250K / PB / year)
    • Fear of “lock in” to a specific cloud provider due to egress fees ($50K to download 1 PB)
    • Possible solutions: OSN / Internet2, CloudFlare
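The two dollar figures on this slide convert to per-gigabyte prices as sketched below. The only assumption added here is the use of decimal units (1 PB = 10^6 GB); the dollar amounts are the ones quoted on the slide.

```python
GB_PER_PB = 1_000_000  # ASSUMPTION: decimal units, 1 PB = 10**6 GB

storage_usd_per_pb_year = 250_000  # from the slide
egress_usd_per_pb = 50_000         # from the slide

# Convert to the per-GB units in which cloud prices are usually quoted.
storage_usd_per_gb_month = storage_usd_per_pb_year / GB_PER_PB / 12
egress_usd_per_gb = egress_usd_per_pb / GB_PER_PB

# Months of storage that cost the same as downloading the data once.
egress_in_storage_months = egress_usd_per_pb / (storage_usd_per_pb_year / 12)

print(f"storage: ${storage_usd_per_gb_month:.4f} per GB per month")
print(f"egress:  ${egress_usd_per_gb:.2f} per GB")
print(f"one full download costs about {egress_in_storage_months:.1f} months of storage")
```

With these figures, moving a dataset out of the provider once costs roughly as much as storing it for 2.4 months, which is why egress fees contribute to the fear of lock-in.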
  45. [No transcribed text]

  46. Challenge: Funding Model

    • Local infrastructure, used only by members, should clearly be paid for by the local institution
    • Since cloud infrastructure can scale to accommodate any number of users (up to the entire field), it’s not clear who should pay for it
    • My option: make it easily “franchisable”, i.e. allow institutions to incrementally add capacity to a federation to support their users while still leveraging economies of scale
  47. Vision

    Diagram labels: data pods, industry group, research group, HPC Centers
  48. Learn More

    • http://pangeo.io
    • https://discourse.pangeo.io/
    • https://github.com/pangeo-data/
    • https://medium.com/pangeo
    • @pangeo_data