VirtualiZarr + Icechunk talk at SciPy 2025

Tom Nicholas

July 10, 2025

Transcript

  1. VirtualiZarr & Icechunk: Build a cloud-optimised data cube in 3 lines

    Tom Nicholas - SciPy - July 2025
  2. About me: Tom Nicholas, PhD, Forward Engineer

    Open source contributions / prior experience:
    • Plasma physicist
    • Pivoted to geoscience data
    • Open source maintainer
    Mission statement: To empower people to use scientific data to solve humanity’s greatest challenges
  3. EARTHMOVER.IO: story

    • 18 months ago I was asked to cloud-optimize some data - I’m finally done 😅
    • Such a pain I wrote a new package - now it’s 3 lines of Python 💪
    • This approach (VirtualiZarr + Icechunk) should work for many datasets 🌐
    • Avoids copying the data - a cloud-native bridge for archival data 📜☁
    • Icechunk’s “multiplayer mode” means users can read data during updates!
    Collaborators:
    • VirtualiZarr: Max Jones, Sean Harkins, Aimee Barciauskas & Kyle Barron (DevSeed), Raphael Hagen (CarbonPlan), Julia Signell (Element84)
    • Icechunk: Sebastian Galkin & Deepak Cherian (Earthmover)
    • OAE Efficiency Map: Shane Loeffler & Kata Martin (CarbonPlan), Matt Long ([C]Worthy)
  4. dataset: [C]Worthy OAE Efficiency Atlas

    • Ensemble of climate model simulations
    • Arranged in a grid (“tiles”)
    • On S3 (in Source Cooperative’s bucket)
    🙄 NetCDF
    😬 50TB (200TB uncompressed)
    😱 500,000 netCDF4 files!
    😵💫 Logically 6-dimensional datacube
    ☝ (3D + time + 2 ensemble dimensions)
  5. beautiful geospatial visualization

    https://carbonplan.org/research/oae-efficiency/
    Shane Loeffler
  6. ugly distribution: “Pile of files / tiles”

    → What passes for a “cloud data archive”
    • Result of a naive “lift and shift”
  7. data has structure

    • “Level 3 data” in geospatial jargon
  8. problem: not “cloud-optimized”

    • Client code has to look at metadata spread throughout each file
    • Do this for every file, then check they can be combined
    • If you called xr.open_mfdataset that’s what it would do
    Time just to open: 1 minute per file x 500,000 files = an entire year!!! 🤯
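The arithmetic behind the “entire year” figure can be checked directly. This is only a back-of-the-envelope sketch, assuming the files are opened strictly one after another at the slide's rate of 1 minute per file:

```python
from datetime import timedelta

N_FILES = 500_000      # number of netCDF4 files (from the slide)
MINUTES_PER_FILE = 1   # time to read one file's metadata (from the slide)

# Total wall-clock time if every file is opened sequentially.
total = timedelta(minutes=N_FILES * MINUTES_PER_FILE)
total_days = total.total_seconds() / 86_400

print(f"{total_days:.0f} days")  # about 347 days, i.e. close to a year
```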
  9. rewrite as Zarr!

    50TB + 50TB
    • Time to open now <1 second! 🥳
    • Uses 2x the storage 👎
  10. Icechunk stores “virtual chunks”

    • Transactional, cloud-native storage engine for Zarr
    • Works together with Zarr Python 3 and Xarray
    • Supports virtual chunks
    • Core implemented in Rust; thin Python wrapper
    • 100% open source (Apache 2.0)
    https://github.com/earth-mover/icechunk/
    https://icechunk.io
  11. VirtualiZarr extracts virtual chunks

    • Combining chunk references from different files == array concatenation
    • So use xarray.concat()! (By wrapping ManifestArrays)
    • Can write references to Icechunk stores
    • (Or to the Kerchunk format)
    Diagram labels: import virtualizarr as vz; vz.open_virtual_dataset( ); “chunk manifests”; “Virtual” xr.Dataset( )
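To make “combining chunk references == array concatenation” concrete, here is a toy model in plain Python. It is not VirtualiZarr's implementation: `ChunkRef`, `grid_shape` and `concat_manifests` are hypothetical names, the file names and byte ranges are made up, and the sketch assumes every manifest is a complete chunk grid whose shape matches on the non-concatenated axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical reference to one chunk stored inside an existing file."""
    path: str    # file or object URL holding the chunk
    offset: int  # byte offset of the chunk within that file
    length: int  # byte length of the chunk


def grid_shape(manifest):
    """Number of chunks along each axis of a (complete) chunk grid."""
    ndim = len(next(iter(manifest)))
    return tuple(max(idx[d] for idx in manifest) + 1 for d in range(ndim))


def concat_manifests(manifests, axis=0):
    """Concatenate chunk grids along `axis` by shifting chunk indices.

    No chunk data is read or copied; only the index -> reference
    mapping is rewritten.
    """
    combined = {}
    shift = 0
    for manifest in manifests:
        for idx, ref in manifest.items():
            shifted = idx[:axis] + (idx[axis] + shift,) + idx[axis + 1:]
            combined[shifted] = ref
        shift += grid_shape(manifest)[axis]
    return combined


# Two files, each contributing a 1 x 2 grid of chunks (made-up byte ranges).
file_a = {(0, 0): ChunkRef("a.nc", 100, 50), (0, 1): ChunkRef("a.nc", 150, 50)}
file_b = {(0, 0): ChunkRef("b.nc", 100, 50), (0, 1): ChunkRef("b.nc", 150, 50)}

combined = concat_manifests([file_a, file_b], axis=0)
print(grid_shape(combined))  # the combined grid is 2 x 2
```

The point of the design is that the combined manifest still refers to byte ranges in the original files, so the concatenated array costs only the size of the references.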
  12. Scaling to 500,000 files

    • Map-reduce problem, called via vz.open_virtual_mfdataset(<filepaths>)
    • Serverlessly map vz.open_virtual_dataset over files as AWS Lambdas
    • (Max 1000 lambdas at once, so batch this)
    • Demo…
    Diagram: several vz.open_virtual_dataset( ) calls feeding one xarray.concat(…) 🧑💻
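The batching idea can be sketched without any cloud services. In the sketch below the per-file “map” step and the “reduce” step are hypothetical stubs (`open_virtual_stub`, `combine_stub`), standing in for opening one file virtually and for concatenation; only the 1000-at-once cap comes from the slide.

```python
from itertools import islice

MAX_CONCURRENT = 1000  # concurrency cap quoted on the slide


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def open_virtual_stub(path):
    """Hypothetical stand-in for the per-file 'map' step."""
    return {"path": path}


def combine_stub(results):
    """Hypothetical stand-in for the 'reduce' step (e.g. concatenation)."""
    return list(results)


def map_reduce_in_batches(paths, batch_size=MAX_CONCURRENT):
    partials = []
    for batch in batched(paths, batch_size):
        # In the real workflow each call in this batch would run in parallel.
        mapped = [open_virtual_stub(p) for p in batch]
        partials.append(combine_stub(mapped))  # reduce within the batch
    # Final reduce across the per-batch results.
    return combine_stub(r for part in partials for r in part)


paths = [f"file_{i}.nc" for i in range(2500)]
result = map_reduce_in_batches(paths)
print(len(result))
```

Keeping the per-batch results separate is also what would make a run resumable: a finished batch does not need to be redone if a later one fails.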
  13. Example

    • 10GB / s
    • From laptop
    • Batched
    • Resumable
    • General
    • 3 lines (ish…)
  14. Watch out for releases! ✨

    → Replicate demo yourself with upcoming VirtualiZarr 2.0 release
    • Scaling demo uses some unreleased features in VirtualiZarr 2.0
    • But VirtualiZarr and Icechunk work together today!
    • Icechunk 1.0 was also released today!
    • Production-ready
    • Stable, backwards-compatible on-disk format! (With open specification)
  15. Reading whilst writing?

    • Each loop iteration adds more data to the store
    • Real-world data often needs to do this periodically on an ongoing basis
    • e.g. each time a NASA imaging satellite transmits another image
    • Want a full history of changes
    • Also want users to be able to safely read previous data whilst new data is being ingested…
    Snapshot history (repo):
    22:05:21.173389+00:00 Repository initialized
    22:05:22.133229+00:00 Added arrays A & B
    22:05:31.750954+00:00 Updated incorrect data
    22:05:55.090858+00:00 Deleted metadata
  16. Zarr without Icechunk

    → Zarr is not designed for “multiplayer mode”
    Diagram: timeline with User A (start writing big update, finish writing big update) and User B (read data)
    These issues stem from the fact that Zarr is not a monolithic file format. Zarr data is spread over many files. (More like a database.)
  17. Zarr with Icechunk: Multiplayer mode

    → transactions enable safe collaboration
    • All updates occur with a transaction and create a new snapshot of the dataset
    • Serializable isolation between transactions; readers only ever see a committed snapshot
    • No locks required for reading or writing data
    • Optimistic concurrency control for detecting and resolving write conflicts
    These features make Zarr work more like a database.
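The combination of snapshots and optimistic concurrency control can be illustrated with a toy in-memory model. This is not Icechunk's API or storage format: `ToyRepo`, `Transaction` and `ConflictError` are hypothetical names, and a real store would need an atomic compare-and-swap on the branch pointer rather than a plain attribute check.

```python
import itertools


class ConflictError(Exception):
    """Raised when the branch moved after a transaction started."""


class ToyRepo:
    """Toy model: every commit creates a new snapshot; old ones stay readable."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.snapshots = {0: {}}  # snapshot id -> committed state
        self.head = 0             # id of the latest committed snapshot

    def read(self, snapshot_id=None):
        """Readers only ever see committed snapshots, never pending changes."""
        sid = self.head if snapshot_id is None else snapshot_id
        return dict(self.snapshots[sid])

    def begin(self):
        return Transaction(self, base=self.head)


class Transaction:
    def __init__(self, repo, base):
        self.repo = repo
        self.base = base   # snapshot this transaction started from
        self.changes = {}  # pending, uncommitted writes

    def set(self, key, value):
        self.changes[key] = value

    def commit(self):
        # Optimistic check: no lock was taken, so verify nobody else committed.
        if self.repo.head != self.base:
            raise ConflictError("branch moved since this transaction began")
        new_id = next(self.repo._ids)
        self.repo.snapshots[new_id] = {**self.repo.snapshots[self.base], **self.changes}
        self.repo.head = new_id
        return new_id


repo = ToyRepo()
tx1, tx2 = repo.begin(), repo.begin()
tx1.set("A", "written by user A")
tx2.set("A", "written by user B")

seen_during_write = repo.read()  # pending writes are invisible to readers
first = tx1.commit()

try:
    tx2.commit()
    conflict_detected = False
except ConflictError:
    conflict_detected = True     # user B must rebase or retry

print(seen_during_write, repo.read(), conflict_detected)
```

In this sketch the losing writer simply gets an error; how conflicts are actually resolved is a policy decision that the toy model leaves out.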
  18. Git-like version control for array data

    → Snapshots, Branches, Tags
    Diagram labels: Branch: main; Branch: dev; Tag: v1.0
  19. “Which file formats?”

    • Currently can parse netCDF4, HDF5, netCDF3, “native” Zarr (v3), FITS
    • TIFF, COG, GRIB coming soon
    • In VirtualiZarr v2.0 you can write your own custom “Parser” for a more niche format
    • e.g. WIP parser for HuggingFace’s SafeTensors format for ML model weights
  20. Aside: Parsers as a runtime Zarr translation layer

    • Parser returns a “ManifestStore”
    • Zarr-python/Xarray can read this store directly
    • Uses obstore Rust crate under the hood
    • So you don’t actually need to serialize to Kerchunk / Icechunk format to read data back…
    • i.e. use VirtualiZarr as a runtime translation layer to read any (parseable) format as Zarr!
    • (This only makes sense for already-cloud-optimized data, or when running outside of the cloud)
  21. Summary

    • Level 3/4 datasets often in archival formats with Zarr-like (array) structure
    • Virtual Icechunk stores point at files without copying the data
    • Build virtual datacubes using Xarray syntax via VirtualiZarr
    • Icechunk allows incremental updates as new data arrives

    Format                  NetCDF4    “Native” Zarr   Icechunk 🧊
    # of URLs               500,000    1               1
    Time to open            ~1 year    < 1 sec         < 1 sec
    Storage increase        0%         100%            <0.0004%
    Convert using Xarray?   N/a        Yes             Yes
    Version-controlled?     No         No              Yes
    Update-safe?            No         No              Yes
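For a sense of scale, the storage-increase figure in the table can be turned into bytes. This is a rough sketch that treats 0.0004% as an upper bound and assumes decimal units (1 TB = 10**12 bytes); the per-file number is simply the total divided evenly across files.

```python
TOTAL_BYTES = 50 * 10**12  # 50 TB of netCDF4 data, assuming 1 TB = 10**12 bytes
N_FILES = 500_000          # number of files (from the slides)
OVERHEAD_PPM = 4           # 0.0004% expressed as parts per million

# Integer arithmetic avoids floating-point rounding noise.
max_reference_bytes = TOTAL_BYTES * OVERHEAD_PPM // 1_000_000
per_file_bytes = max_reference_bytes // N_FILES

print(max_reference_bytes // 10**6, "MB of references at most")
print(per_file_bytes, "bytes per file on average")
```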
  22. “But you can’t change the chunks”

    • Correct. That is the primary downside of this approach.
    • You can choose to write native chunks alongside the virtual chunks though
    • Allows you to ensure small coordinate variables are all one chunk
    • Allows you to incrementally overwrite data with more suitable chunking after virtual ingestion
  23. Bonus: “What datasets can you virtualize?”

    • Currently some additional requirements imposed by Zarr data model
    • 1 chunk = 1 HTTP range request
    • Homogeneous chunk shapes
    • Homogeneous chunk codecs (e.g. compression)
    • No other per-chunk metadata, only per-array
    • Some of these could be relaxed by future Zarr development…
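The “1 chunk = 1 HTTP range request” requirement means that a chunk reference (byte offset plus length) must map onto a single contiguous byte range. A minimal sketch of that mapping, where `range_header` is a hypothetical helper and the offset and length are made-up values:

```python
def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header that fetches one chunk's bytes."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


header = range_header(offset=1024, length=4096)
print(header)  # {'Range': 'bytes=1024-5119'}
```

A chunk whose bytes are scattered across several non-contiguous regions of a file cannot be described this way, which is why it falls outside what can currently be virtualized.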
  24. Bonus: “What about Kerchunk?”

    Format                  NetCDF4    “Native” Zarr   Kerchunk    Icechunk 🧊
    # of URLs               500,000    1               1           1
    Time to open            ~1 year    < 1 sec         < 1 sec     < 1 sec
    Storage increase        0%         100%            0.0004%     0.0004%
    Convert using Xarray?   N/a        Yes             No          Yes
    Version-controlled?     No         No              No          Yes
    Update-safe?            No         No              No          Yes

    Kerchunk is two things:
    1. Python package
       • VirtualiZarr package replaces this
    2. Format for storing references
       • Icechunk format is an alternative
       • (Though VirtualiZarr can write to both)
  25. Bonus: “How does the parallelization work?”

    • vz.open_virtual_mfdataset accepts an Executor
    • Follows the concurrent.futures interface (idea from Cubed)
    • Comes with a LithopsExecutor and a DaskDelayedExecutor, and works with the Python ThreadPoolExecutor
    • Lithops
      • Open-source package abstracting over serverless APIs of various cloud providers
      • Had to build a runtime using Docker first
      • But then my Python function just runs on AWS Lambdas
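The benefit of following the concurrent.futures interface is that the parallel backend becomes a swappable argument. The sketch below uses only the standard library; `parse_stub`, `open_many` and `SerialExecutor` are hypothetical names for illustration and are not VirtualiZarr's own executors or signatures.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


def parse_stub(path):
    """Hypothetical stand-in for opening one file as a virtual dataset."""
    return {"path": path}


class SerialExecutor(Executor):
    """Minimal custom executor: runs each task immediately in the caller."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def open_many(paths, executor_cls=ThreadPoolExecutor, **executor_kwargs):
    """Map parse_stub over paths using any concurrent.futures-style executor."""
    with executor_cls(**executor_kwargs) as executor:
        # Executor.map returns results in input order.
        return list(executor.map(parse_stub, paths))


paths = [f"file_{i}.nc" for i in range(20)]
threaded = open_many(paths, ThreadPoolExecutor, max_workers=4)
serial = open_many(paths, SerialExecutor)
print(threaded == serial)
```

Because the calling code depends only on the shared interface, moving from threads on a laptop to a serverless backend is, in principle, a change of one argument.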