Formal Methods at Amazon S3

Using lightweight formal methods to validate a key-value storage node
in Amazon S3 By James Bornholt, Rajeev Joshi, Vytautas Astrauskas, Brendan Cully, Bernhard Kragl, Seth Markle, Kyle Sauri, Drew Schleit, Grant Slatton, Serdar Tasiran, Jacob Van Ge ff en, Andrew War fi eld Presented by Andrey Satarin, @asatarin February, 2022

Outline • Introduction and ShardStore • Validating a Storage System
• Conformance Checking • Checking Crash Consistency • Checking Concurrent Executions • Other Properties • Experience and Lessons 2

Introduction and ShardStore 3

ShardStore and S3 • The core of S3 are storage
node servers • ShardStore — new key-value storage node • 40k lines of code in Rust • Crash consistency and concurrency in the implementation • Slowly rolling out to replace previous version 4

Validation goals • Functional API correcteness • Crash consistency of
on disk data • Concurrent correctness of API calls and background tasks Soundness-correctness trade-off — willing to accept weaker guarantees than formal methods 5

ShardStore • Log-Structured Merge Tree (LSM) • Data in chunks,
chunks in extents • More than one log complicates crash consistency • Garbage collection (GC) in the background 6 Figure 1. ShardStore’s on-disk layout

Validating a Storage System 7

Properties • Focus on durability and consistency • Performance and
availability is out of scope • Additional safety properties — undefined behavior, bounds checking, etc Results must outlive involvement of formal methods experts and be supported by development team in the future   => lightweight approach to formal methods 8

Three views on durability 9 Section Sequential Crash-free “4 Conformance
Checking” Sequential With crashes “5 Checking Crash Consistency” Concurrent Crash-free “6 Checking Concurrent Executions” Concurrent With crashes Out of scope

Reference model • Executable specification with the same interface in
Rust • 1% of the size of the implementation • For simplicity omits implementation failures (IO, resource exhaustion, etc) • Also used as a mock for unit tests, to help keep it up-to-date 10

Conformance Checking 11

Property-based testing • Implementation code refines the model • Argument
bias to steer into interesting states • Default to random selection, only bias if have quantitative evidence of the benefit • Code coverage to identify blind spots in tests 12

Failure injection • Fail-stop crash • Covered in “5 Checking
Crash Consistency” • Disk IO error • Relax check against the model • Resource exhaustion • Out of scope for property-based testing 13

Checking Crash Consistency 14

Write path Crash consistency is the primary motivation for this
effort Every put operation has three steps: 1. Write chunked data to an extent 2. Write index entry in the LSM tree 3. Update LSM tree metadata to point to new on-disk index data 15

Dependency graph • Inspired by soft updates • IO scheduler
respects dependencies • Next append only issued if dependency is persisted 16

17 Figure 2 (a) Dependency Graph

Two properties Persistence — if dependency is persisted, it should
be visible after the crash Forward progress — after non-crash shutdown every operation’s dependency is persistent 18

Extending property-based testing • Add new operations to model (e.g.
DirtyReboot, IndexFlush) • Adding block-level crash states proved to be slow and did not uncover new bugs • Block level crashes are not used by default 19

Checking Concurrent Executions 20

Checking Concurrent Executions Checking for linearizability Hand-written harness to validate
key properties • Loom model checker for Rust with sound model checking (slow) • Shuttle model checker with probabilistic algorithms (faster) Loom and Shuttle offer a soundness-scalability trade-off 21

Other Properties 22

Other properties • Undefined behavior • Miri interpreter for Rust
• Rust compiler dynamic analysis tools • Serialization • Crux symbolic execution engine to prove panic-freedom • Fuzzing 23

Experience and Lessons 24

Experience • Developing the reference model took ~ 2 x
9 months of FM experts • Non-experts contributed 18% of the model code so far Benefits: • Early detection is a great • Continuous integration/validation keeps the model up-to-date 25

Limitations • Hard to evaluate coverage by property-based tests •
Accidental complexity gluing with S3 not covered • Huge API surface — not everything is covered 26

Testing distributed systems Curated list of resources on testing distributed
systems   https://asatarin.github.io/testing-distributed-systems/ 27

The end 28

Contacts • Follow me on Twitter @asatarin • https://www.linkedin.com/in/asatarin/ •
https://asatarin.github.io/ 29

References • Self reference for this talk (slides, video, etc)
• "Using lightweight formal methods to validate a key-value storage node in Amazon S3" paper • Talk at SOSP 2021 • Blog post from Murat Demirbas 30

Formal Methods at Amazon S3

Formal Methods at Amazon S3

Andrey Satarin

More Decks by Andrey Satarin

Other Decks in Programming

Featured

Transcript

Using lightweight formal methods to validate a key-value storage node

Outline • Introduction and ShardStore • Validating a Storage System

Introduction and ShardStore 3

ShardStore and S3 • The core of S3 are storage

Validation goals • Functional API correcteness • Crash consistency of

ShardStore • Log-Structured Merge Tree (LSM) • Data in chunks,

Validating a Storage System 7

Properties • Focus on durability and consistency • Performance and

Three views on durability 9 Section Sequential Crash-free “4 Conformance

Reference model • Executable specification with the same interface in

Conformance Checking 11

Property-based testing • Implementation code refines the model • Argument

Failure injection • Fail-stop crash • Covered in “5 Checking

Checking Crash Consistency 14

Write path Crash consistency is the primary motivation for this

Dependency graph • Inspired by soft updates • IO scheduler

17 Figure 2 (a) Dependency Graph

Two properties Persistence — if dependency is persisted, it should

Extending property-based testing • Add new operations to model (e.g.

Checking Concurrent Executions 20

Checking Concurrent Executions Checking for linearizability Hand-written harness to validate

Other Properties 22

Other properties • Undefined behavior • Miri interpreter for Rust

Experience and Lessons 24

Experience • Developing the reference model took ~ 2 x

Limitations • Hard to evaluate coverage by property-based tests •

Testing distributed systems Curated list of resources on testing distributed

The end 28

Contacts • Follow me on Twitter @asatarin • https://www.linkedin.com/in/asatarin/ •

References • Self reference for this talk (slides, video, etc)