about networks and datasets, especially in the cloud
• A dataset is a set of related files with some application-determined consistency among the files
• In Linux, can consider a dynamically mountable file-system as a dataset
• File systems live in volumes allocated from block/object storage providers
• The way forward with BeyondFS: Layered, Immutable Datasets
datasets
◦ No need to copy first!
• Access by network-oblivious programs - they think it’s local data
◦ Network aware programs should use object storage
• 40 year success story - Widely implemented
• Amazing engineering, getting better all the time, but…
& distributed programming: Shared Mutable Data is a BAD IDEA
• What is the purpose of a network file system?
• NFS provides Shared, Mutable Data
• Breeding ground for races and inconsistencies.
assume consistency and availability
◦ Consistency - concurrent clients see same data/results
◦ Availability - if the file system is unavailable, your client is also dead
• CAP Theorem
◦ In a network, you can’t have both Consistency and Availability once a partition occurs
◦ Partitions happen
• NFS hard mounts (Cons.) vs soft mounts (Avail.)
• Protocol?
• Particular filesystem?
• Syscall interface?
• Every system chooses a subset of the above, and then adds special features
• Every interface constantly evolving (protocols slowly)
• Result: Minefield of unsafe practices
• NFS is *not* transparent; can break oblivious programs
Invalidation
3. Off-by-one errors
• NFS provides no naming help
• Caching is necessary for performance; NFS spec is mostly about invalidation
• Cache invalidation: slow AND buggy!
lot about names
◦ Not Restartable
• Cloud: Functions (jobs, containers, serverless functions, whatever)
◦ Just want a handle/uuid
◦ Restartable (pure functional) is goal
• Developer: every environment is like my environment, right?
◦ Tested on one FS, expected to work everywhere
dataset success stories
◦ Git
◦ Docker files
◦ Delta Lake, Pachyderm, LakeFS, many more
• Data CANNOT be shared unless/until it is immutable (commit point)
• Cache it! Replicate it! It won’t change!
• BUT…
◦ Too much copying!! (Poor man’s immutability)
remote block device access.
• Block semantics - very well defined!
• Blocks are aggressively cached by OS
• BeyondFS allows many different block storage providers
◦ AWS EBS, Google & Azure equivalents
◦ OpenEBS, Longhorn, Ceph
◦ DellEMC, NetApp, Pure
◦ LVM or ZFS volumes for easy dev/test
◦ Even files emulating block devices
Extend - online FS expansion - no need to hit ENOSPC
• Multi-pathing - for better network availability
• Dm-crypt for privacy, dm-verity for integrity
• Many, many more dm capabilities
images
• Accessed via network block protocols - NVMe-oF, iSCSI, …
• Copy-on-write semantics for updating
• But, no updates shared until a commit point for the dataset
• Auto-mounted, cloud-scale naming & sharing
• Per-domain control service to keep track of everything
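The copy-on-write and commit-point rules above can be sketched as a toy model. This is only an illustration of the idea, not BeyondFS code: the class `LayeredDataset`, its method names, and the 4-byte block size are all assumptions made for the example.

```python
class LayeredDataset:
    """Toy model: a stack of committed block layers plus one private working layer."""

    BLOCK_SIZE = 4  # assumed toy block size, only for the zero-fill default

    def __init__(self):
        self.committed = []  # list of dicts (block number -> bytes), oldest first
        self.working = {}    # copy-on-write layer, private until commit

    def write(self, block_no, data):
        # Writes never touch committed layers; they land in the working layer.
        self.working[block_no] = data

    def read(self, block_no, include_working=True):
        # Search from the newest layer down; the first layer holding the block wins.
        layers = list(self.committed)
        if include_working:
            layers.append(self.working)
        for layer in reversed(layers):
            if block_no in layer:
                return layer[block_no]
        return b"\0" * self.BLOCK_SIZE  # never-written blocks read as zeros

    def commit(self):
        # The commit point: freeze the working layer and start a fresh one.
        self.committed.append(dict(self.working))
        self.working = {}
        return len(self.committed)
```

Here `read(..., include_working=False)` stands in for what another client would see: nothing written since the last commit is visible to it.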
- good for most types of files
• DM Snapshot - CoW for blocks - better for databases, log files, etc.
• Ext2/3/4 FS in each layer - Linux “native”, some support in every OS
• 2 regions in each layer - 1 for files, 1 for DB (manual placement req’d)
• DB region also used for …
running find can bring serious pain
• Change find to look for an actual database at the top of the block snapshot region
• Update on each commit
• Make it serious with SQLite3 - real SQL queries
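A minimal sketch of the idea, assuming a per-dataset SQLite index that is rebuilt or updated at commit time. The table name `files`, its columns, and the helper names are hypothetical; the slides do not specify a schema.

```python
import sqlite3


def build_index(entries):
    """Build an in-memory file index from (path, size, mtime) tuples.

    The schema is an assumption made for illustration only.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER)"
    )
    db.executemany("INSERT INTO files VALUES (?, ?, ?)", entries)
    db.commit()
    return db


def find_by_suffix(db, suffix):
    """Rough equivalent of `find -name '*SUFFIX'`, answered from the index.

    Note: SQLite LIKE treats % and _ in the suffix as wildcards and is
    case-insensitive for ASCII letters by default.
    """
    rows = db.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path",
        ("%" + suffix,),
    )
    return [row[0] for row in rows]
```

A query such as `find_by_suffix(db, ".csv")` then reads one small database instead of walking every directory in the dataset.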
huge amount of metadata
• Many files + ordering for consistency leads to terrible performance
• Journaling file-systems cause more delays
• In BeyondFS, files never cross the wire, only blocks
• Journaling not needed; commit/sync/umount is the only consistency point
• Faster than “local” and Free transactions!
Within a dataset, all files are readable
• Mount-time translation of userids to “nobody” equivalent
• Dm-crypt for at-rest and on-the-wire privacy
• Dm-verity for strong checksums
space with RO layers below
• For people (hard problem):
◦ /byfs/project[@domain]/dataset/x/y/z
• /byfs automount everywhere, including containers
• Well known DNS server/service: byfs.domain.com
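To make the human-facing name format concrete, here is a hedged sketch of a parser for `/byfs/project[@domain]/dataset/x/y/z`. The `ByfsName` type and `parse_byfs_path` function are hypothetical names for this example, and real validation rules (allowed characters, default domain) are not given in the slides.

```python
from typing import NamedTuple, Optional


class ByfsName(NamedTuple):
    project: str
    domain: Optional[str]  # None when the optional "@domain" part is omitted
    dataset: str
    subpath: str           # path inside the dataset, "" for the dataset root


def parse_byfs_path(path):
    """Split /byfs/project[@domain]/dataset/x/y/z into its components."""
    parts = path.split("/")
    # An absolute path splits into ["", "byfs", "project[@domain]", "dataset", ...]
    if len(parts) < 4 or parts[0] != "" or parts[1] != "byfs":
        raise ValueError(f"not a /byfs dataset path: {path!r}")
    project, _, domain = parts[2].partition("@")
    dataset = parts[3]
    if not project or not dataset:
        raise ValueError(f"missing project or dataset in: {path!r}")
    return ByfsName(project, domain or None, dataset, "/".join(parts[4:]))
```

For example, `parse_byfs_path("/byfs/genomics@example.org/reads/x/y/z")` yields project `genomics`, domain `example.org`, dataset `reads`, and subpath `x/y/z`.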
empty dataset; uuid always means “latest”
• Class determines
◦ Storage backend
◦ Replication/Retention policies
◦ Authorization & Access
• Lots of commands to maintain the human friendly name space
be dealt with.
• Network timeouts eventually cause I/O error - EIO
• Only 1 program in existence that properly handles EIO
• Kernel enhancement: mount flag MS_FAILHARD
◦ Any EIO on mount point causes all following I/Os to fail
• Byfs refuses commit on any failed dataset
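The fail-hard behaviour can be modelled in user space. MS_FAILHARD is described above as a proposed kernel enhancement, so nothing here calls a real mount flag; the `FailHardMount` class and its `backend` callable are assumptions used only to illustrate the semantics.

```python
import errno


class FailHardMount:
    """User-space model of fail-hard semantics: one EIO poisons the mount."""

    def __init__(self, backend):
        self.backend = backend  # callable(block_no) -> bytes; may raise OSError
        self.failed = False

    def read(self, block_no):
        if self.failed:
            # After the first EIO, every later I/O fails without touching the backend.
            raise OSError(errno.EIO, "mount previously failed")
        try:
            return self.backend(block_no)
        except OSError as exc:
            if exc.errno == errno.EIO:
                self.failed = True
            raise

    def commit(self):
        # A dataset that has seen any I/O error must never reach a commit point.
        if self.failed:
            raise OSError(errno.EIO, "refusing to commit a failed dataset")
        return True
```

The point of the design is that a program that ignores one EIO cannot go on to commit a silently damaged dataset: the failure is sticky and the commit is refused.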
Current users (lease based w timeouts for lost clients)
• Tags
• Garbage collection - If no tags or active users, a layer may be merged into the next layer up
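The garbage-collection rule can be sketched as follows, assuming each layer records its tags and a lease expiry time per client. The `Layer` dataclass, its field names, and `mergeable` are hypothetical; the actual control-service data model is not described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    tags: set = field(default_factory=set)
    # client id -> lease expiry time in seconds (assumed representation)
    leases: dict = field(default_factory=dict)

    def active_users(self, now):
        # A lost client simply stops renewing, so its lease times out.
        return [client for client, expiry in self.leases.items() if expiry > now]


def mergeable(layers, now):
    """Names of layers with no tags and no unexpired leases at time `now`."""
    return [
        layer.name
        for layer in layers
        if not layer.tags and not layer.active_users(now)
    ]
```

This only reports candidates; the merge into the next layer up, and any rule for the topmost layer, are left out of the sketch.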
*will* bite you
• No one can agree on which semantics are safe; no tools exist to check or constrain you to a safe subset
• Block storage has well defined semantics and an immense number of under-utilized capabilities
• Let’s get beyond…