2025 UCSC LSD: NFS Must Die! (v2)

NFS* Must Die! And how to get Beyond file sharing in the cloud
*Not just “the” NFS, but any NFS.
Tom Lyon - @[email protected]
speakerdeck.com/pugs

Who is Tom Lyon? (aka Pugs)
• Programming since 1967
• UNIX since 1975 - Princeton, Bell Labs, Amdahl, Sun…
• #8 at Sun Microsystems - NFS, Automounter, Translucent File System, NSE
• Founder/CTO/Chief Scientist of
  ◦ Ipsilon Networks (-> Nokia 1998)
  ◦ Netillion (->⏚ 2006)
  ◦ Nuova Systems (-> Cisco 2009)
  ◦ DriveScale (-> Twitter 2021)
• Mostly retired 2022

Datasets = Layers = Filesystems = Volumes
This talk is about networks and datasets, especially in the cloud
A dataset is a set of related files with some application-determined consistency among the files
In Linux, a dynamically mountable file system can be considered a dataset
File systems live in volumes allocated from block/object storage providers
The way forward with BeyondFS: Layered, Immutable Datasets

Target/Example Use Case
AI training dataset with a billion files
A few hundred agents frequently updating the dataset with a few thousand files each time
Thousands of GPU systems consuming the dataset

NFS: The Good Parts
• Rapid access to arbitrarily large datasets
  ◦ No need to copy first!
• Access by network-oblivious programs - they think it’s local data
  ◦ Network-aware programs should use object storage
• 40 year success story - Widely implemented
• Amazing engineering, getting better all the time, but…

Existential Crisis: Shared Mutable Data
Most important lesson in concurrent & distributed programming: Shared Mutable Data is a BAD IDEA
What is the purpose of a network file system?
NFS provides Shared, Mutable Data
Breeding ground for races and inconsistencies.

There is no ‘N’ in POSIX
• POSIX file operations assume consistency and availability
  ◦ Consistency - concurrent clients see same data/results
  ◦ Availability - if the file system is unavailable, your client is also dead
• CAP Theorem
  ◦ When the network partitions, you can’t have both Consistency and Availability
  ◦ Partitions happen
• NFS hard mounts (Cons.) vs soft mounts (Avail.)

Semantics? What Semantics?
• POSIX? (Have you read the spec?)
• Protocol?
• Particular filesystem?
• Syscall interface?
• Every system chooses a subset of the above, and then adds special features
• Every interface is constantly evolving (protocols slowly)
• Result: Minefield of unsafe practices
• NFS is *not* transparent; can break oblivious programs

2 Hardest Problems in Computer Science
1. Naming
2. Cache Invalidation
3. Off-by-one errors
NFS provides no naming help
Caching is necessary for performance; NFS spec is mostly about invalidation
Cache invalidation: slow AND buggy!

Who is the Client?
• Traditional: People
  ◦ Care a lot about names
  ◦ Not Restartable
• Cloud: Functions (jobs, containers, serverless functions, whatever)
  ◦ Just want a handle/uuid
  ◦ Restartable (pure functional) is goal
• Developer: every environment is like my environment, right?
  ◦ Tested on one FS, expected to work everywhere

This Is The Way: Layered, Immutable Data
• Cloud scale dataset success stories
  ◦ Git
  ◦ Docker files
  ◦ Delta Lake, Pachyderm, LakeFS, many more
• Data CANNOT be shared unless/until it is immutable (commit point)
• Cache it! Replicate it! It won’t change!
• BUT…
  ◦ Too much copying!! (Poor man’s immutability)

NVMe Over Fabrics & Block Storage Providers
• Crazy fast remote block device access.
• Block semantics - very well defined!
• Blocks are aggressively cached by OS
• BeyondFS allows many different block storage providers
  ◦ AWS EBS, Google & Azure equivalents
  ◦ OpenEBS, Longhorn, Ceph
  ◦ DellEMC, NetApp, Pure
  ◦ LVM or ZFS volumes for easy dev/test
  ◦ Even files emulating block devices

Device Mapper Magic
• Snapshot - copy on write
• Extend - online FS expansion - no need to hit ENOSPC
• Multi-pathing - for better network availability
• Dm-crypt for privacy, dm-verity for integrity
• Many, many more dm capabilities

The solution: Beyond FS (byfs)
Layered, immutable, block-based file-system images
Accessed via network block protocols - NVMe-oF, iSCSI, …
Copy-on-write semantics for updating
But, no updates shared until a commit point for the dataset
Auto-mounted, cloud-scale naming & sharing
Per-domain control service to keep track of everything

Beyond FS Overview

Filesystem Layering

Copy-on-Write: Files or Blocks?
• OverlayFS - CoW for files - good for most types of files
• DM Snapshot - CoW for blocks - better for databases, log files, etc.
• Ext2/3/4 FS in each layer - Linux “native”, some support in every OS
• 2 regions in each layer - 1 for files, 1 for DB (manual placement req’d)
• DB region also used for …
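
A minimal sketch of the block-level half of this slide, assuming a much simpler model than dm-snapshot actually uses: each layer is a sparse map of block numbers, reads fall through to the layers below, and writes land only in the top, uncommitted layer. The `Layer` class, `BLOCK_SIZE`, and the `commit` method are illustrative names, not part of byfs or the device mapper.

```python
from typing import Dict, Optional

BLOCK_SIZE = 4096  # assumed block size for this sketch


class Layer:
    """One layer: a sparse map from block number to block contents."""

    def __init__(self, parent: Optional["Layer"] = None) -> None:
        self.parent = parent
        self.blocks: Dict[int, bytes] = {}
        self.committed = False

    def read(self, block_no: int) -> bytes:
        # Walk down the stack until some layer holds the block.
        layer: Optional[Layer] = self
        while layer is not None:
            if block_no in layer.blocks:
                return layer.blocks[block_no]
            layer = layer.parent
        return bytes(BLOCK_SIZE)  # never written anywhere: reads as zeros

    def write(self, block_no: int, data: bytes) -> None:
        # Copy-on-write: new data goes into this layer only; lower
        # layers are never modified.
        if self.committed:
            raise PermissionError("committed layers are immutable")
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.blocks[block_no] = data

    def commit(self) -> "Layer":
        # Freeze this layer and hand back a fresh writable layer on top.
        self.committed = True
        return Layer(parent=self)
```

Because a committed layer can no longer change, any number of readers can cache or replicate it without an invalidation protocol, which is the property the earlier slides argue for.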

The Billion File Problem
• On very large FSes, users running find can bring serious pain
• Change find to look for an actual database at the top of the block snapshot region
• Update on each commit
• Make it serious with SQLite3 - real SQL queries
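
As an illustration of the idea, here is a small sketch of a per-dataset file index kept in SQLite and queried instead of walking the tree. The table layout (`files` with `path`, `size`, `mtime`) and the helper names are assumptions made for this example; the talk does not specify a schema.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(conn: sqlite3.Connection,
                entries: Iterable[Tuple[str, int, int]]) -> None:
    """Create or update the index; would run once per commit."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " mtime INTEGER NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO files VALUES (?, ?, ?)", entries)
    conn.commit()


def find(conn: sqlite3.Connection, pattern: str, min_size: int = 0) -> List[str]:
    """Answer a find-style query from the index, without touching the tree."""
    rows = conn.execute(
        "SELECT path FROM files WHERE path GLOB ? AND size >= ? ORDER BY path",
        (pattern, min_size),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, [
    ("/data/a.jpg", 10, 0),
    ("/data/b.txt", 20, 0),
    ("/data/sub/c.jpg", 5000, 0),
])
print(find(conn, "*.jpg", min_size=1000))
```

Note that SQLite's `GLOB` lets `*` match across `/`, so `*.jpg` here behaves like a recursive search rather than a single-directory shell glob.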

File *and* block C.O.W.

The Small File Problem
• Every file comes with a huge amount of metadata
• Many files + ordering for consistency leads to terrible performance
• Journaling file-systems cause more delays
• In BeyondFS, files never cross the wire, only blocks
• Journaling not needed; commit/sync/umount is the only consistency point.
• Faster than “local” and Free transactions!

Security
• SSH-like key management for Authentication & Authorization
• Within a dataset, all files are readable
• Mount-time translation of userids to “nobody” equivalent
• Dm-crypt for at-rest and on-the-wire privacy
• Dm-verity for strong checksums

Naming
• For computers:
  ◦ /byfs/uuid[@domain]/x/y/z
    ▪ Enters a writable space with RO layers below
• For people (hard problem):
  ◦ /byfs/project[@domain]/dataset/x/y/z
• /byfs automount everywhere, including containers
• Well known DNS server/service: byfs.domain.com
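
The two path shapes above can be told apart mechanically. The sketch below is one hypothetical way to parse them; the `ByfsPath` type and `parse_byfs_path` function are invented for illustration, and treating "parses as a UUID" as the test for the computer-facing form is an assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ByfsPath:
    name: str               # uuid string, or human project name
    domain: Optional[str]   # None when no @domain suffix is given
    is_uuid: bool
    dataset: Optional[str]  # only present in the human-friendly form
    rest: str               # path inside the dataset, e.g. "x/y/z"


def parse_byfs_path(path: str) -> ByfsPath:
    parts = path.split("/")
    if len(parts) < 3 or parts[0] != "" or parts[1] != "byfs" or not parts[2]:
        raise ValueError(f"not a /byfs path: {path!r}")
    name, _, domain = parts[2].partition("@")
    try:
        uuid.UUID(name)
        is_uuid = True
    except ValueError:
        is_uuid = False
    if is_uuid:
        dataset, rest = None, parts[3:]
    else:
        dataset = parts[3] if len(parts) > 3 else None
        rest = parts[4:]
    return ByfsPath(name, domain or None, is_uuid, dataset, "/".join(rest))
```

An automounter hook could use the parsed `domain` to decide which control service to ask, but that lookup is outside what this sketch covers.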

Creation
byfs create [-C class] [Human name]
Returns uuid to empty dataset; uuid always means “latest”
Class determines
  Storage backend
  Replication/Retention policies
  Authorization & Access
Lots of commands to maintain the human-friendly name space

Commits & Conflicts
“byfs commit” - commit current dataset, return uuid
• Merge conflicts - commit fails - restart function
Various merge policies possible
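
One way to read "commit fails - restart function" is as optimistic concurrency: a commit succeeds only if nobody else committed since the function started. The sketch below assumes the strictest possible merge policy (any intervening commit is a conflict) and an in-memory stand-in for the per-domain control service; `ControlService`, `CommitConflict`, and `run_with_restart` are hypothetical names.

```python
import uuid
from typing import Callable, Dict


class CommitConflict(Exception):
    """Raised when the dataset moved on since the caller's base layer."""


class ControlService:
    def __init__(self) -> None:
        self.head: Dict[str, str] = {}  # dataset name -> latest committed uuid

    def create(self, name: str) -> str:
        self.head[name] = str(uuid.uuid4())
        return self.head[name]

    def commit(self, name: str, base: str) -> str:
        # Strictest policy: refuse if anyone else committed after `base`.
        if self.head[name] != base:
            raise CommitConflict(f"{name}: base {base} is no longer latest")
        self.head[name] = str(uuid.uuid4())
        return self.head[name]


def run_with_restart(svc: ControlService, name: str,
                     function: Callable[[str], None],
                     max_attempts: int = 3) -> str:
    """Run a restartable function and commit, retrying on conflict."""
    for _ in range(max_attempts):
        base = svc.head[name]
        function(base)  # does its work in a private writable layer
        try:
            return svc.commit(name, base)
        except CommitConflict:
            continue  # discard the private layer and start over
    raise CommitConflict(f"{name}: gave up after {max_attempts} attempts")
```

This only works because the clients are restartable functions, as the "Who is the Client?" slide assumes; a more forgiving merge policy would replace the equality check in `commit`.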

Ay, Ay, EIO
Even with block semantics, network partitions must be dealt with.
Network timeouts eventually cause I/O error - EIO
Only 1 program in existence that properly handles EIO
Kernel enhancement: mount flag MS_FAILHARD
Any EIO on mount point causes all following I/Os to fail
Byfs refuses commit on any failed dataset
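
MS_FAILHARD is a proposed kernel flag, so there is nothing to call today. The user-space sketch below only models the intended behavior: the first EIO latches the dataset into a failed state, every later I/O fails, and commit is refused. `FailHardDevice` and `commit` are illustrative names, and the backend is any callable that reads a block.

```python
import errno
from typing import Callable


class FailHardDevice:
    """Wraps a block reader; after one EIO, all further I/O fails."""

    def __init__(self, backend: Callable[[int], bytes]) -> None:
        self._backend = backend
        self.failed = False

    def read(self, block_no: int) -> bytes:
        if self.failed:
            raise OSError(errno.EIO, "dataset already failed")
        try:
            return self._backend(block_no)
        except OSError as exc:
            if exc.errno == errno.EIO:
                self.failed = True  # latch: no partial recovery later
            raise


def commit(dev: FailHardDevice) -> str:
    # A dataset that has seen an I/O error may hold inconsistent state,
    # so it must never become a shared, immutable layer.
    if dev.failed:
        raise OSError(errno.EIO, "refusing to commit a failed dataset")
    return "committed"
```

Latching the failure means a program that ignores one EIO cannot go on to produce, and then publish, a silently corrupt dataset.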

Garbage collection
Each layer keeps a reference count:
  Upper layer(s)
  Current users (lease based, with timeouts for lost clients)
  Tags
Garbage collection - If no tags or active users, a layer may be merged into the next layer up
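
The rule on this slide can be sketched as a small bookkeeping structure. The lease length, the `LayerRefs` name, and passing the current time in explicitly are assumptions for the example; upper layers are tracked but, following the slide, do not block a merge.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class LayerRefs:
    upper_layers: Set[str] = field(default_factory=set)
    tags: Set[str] = field(default_factory=set)
    leases: Dict[str, float] = field(default_factory=dict)  # client -> expiry

    def renew(self, client: str, now: float, ttl: float = 30.0) -> None:
        """A live client renews periodically; a lost client simply stops."""
        self.leases[client] = now + ttl

    def active_users(self, now: float) -> Set[str]:
        # Expired leases are dropped, so lost clients cannot pin a layer.
        self.leases = {c: exp for c, exp in self.leases.items() if exp > now}
        return set(self.leases)

    def mergeable(self, now: float) -> bool:
        """True when the layer may be merged into the next layer up."""
        return not self.tags and not self.active_users(now)
```

Leases make the count self-healing: a client that crashes or is partitioned away never has to send an explicit release.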

Lazy unmounts
Don’t unmount layers aggressively
High potential for layer reuse
Blocks remain cached

Summary
Network file systems are inherently unsafe
The CAP theorem *will* bite you
No one can agree on which semantics are safe; no tools exist to check or constrain you to a safe subset
Block storage has well defined semantics and an immense number of under-utilized capabilities
Let’s get beyond…

Status
This is a call for collaborators. I need help. Lots of help.
