Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Formal Methods at Amazon S3

Formal Methods at Amazon S3

Presentation on "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" paper for distributed systems reading group.


Andrey Satarin

February 02, 2022

More Decks by Andrey Satarin

Other Decks in Programming


  1. Using lightweight formal methods to validate a key-value storage node

    in Amazon S3 By James Bornholt, Rajeev Joshi, Vytautas Astrauskas, Brendan Cully, Bernhard Kragl, Seth Markle, Kyle Sauri, Drew Schleit, Grant Slatton, Serdar Tasiran, Jacob Van Ge ff en, Andrew War fi eld Presented by Andrey Satarin, @asatarin February, 2022
  2. Outline • Introduction and ShardStore • Validating a Storage System

    • Conformance Checking • Checking Crash Consistency • Checking Concurrent Executions • Other Properties • Experience and Lessons 2
  3. ShardStore and S3 • The core of S3 are storage

    node servers • ShardStore — new key-value storage node • 40k lines of code in Rust • Crash consistency and concurrency in the implementation • Slowly rolling out to replace previous version 4
  4. Validation goals • Functional API correcteness • Crash consistency of

    on disk data • Concurrent correctness of API calls and background tasks Soundness-correctness trade-off — willing to accept weaker guarantees than formal methods 5
  5. ShardStore • Log-Structured Merge Tree (LSM) • Data in chunks,

    chunks in extents • More than one log complicates crash consistency • Garbage collection (GC) in the background 6 Figure 1. ShardStore’s on-disk layout
  6. Properties • Focus on durability and consistency • Performance and

    availability is out of scope • Additional safety properties — undefined behavior, bounds checking, etc Results must outlive involvement of formal methods experts and be supported by development team in the future 
 => lightweight approach to formal methods 8
  7. Three views on durability 9 Section Sequential Crash-free “4 Conformance

    Checking” Sequential With crashes “5 Checking Crash Consistency” Concurrent Crash-free “6 Checking Concurrent Executions” Concurrent With crashes Out of scope
  8. Reference model • Executable specification with the same interface in

    Rust • 1% of the size of the implementation • For simplicity omits implementation failures (IO, resource exhaustion, etc) • Also used as a mock for unit tests, to help keep it up-to-date 10
  9. Property-based testing • Implementation code refines the model • Argument

    bias to steer into interesting states • Default to random selection, only bias if have quantitative evidence of the benefit • Code coverage to identify blind spots in tests 12
  10. Failure injection • Fail-stop crash • Covered in “5 Checking

    Crash Consistency” • Disk IO error • Relax check against the model • Resource exhaustion • Out of scope for property-based testing 13
  11. Write path Crash consistency is the primary motivation for this

    effort Every put operation has three steps: 1. Write chunked data to an extent 2. Write index entry in the LSM tree 3. Update LSM tree metadata to point to new on-disk index data 15
  12. Dependency graph • Inspired by soft updates • IO scheduler

    respects dependencies • Next append only issued if dependency is persisted 16
  13. Two properties Persistence — if dependency is persisted, it should

    be visible after the crash Forward progress — after non-crash shutdown every operation’s dependency is persistent 18
  14. Extending property-based testing • Add new operations to model (e.g.

    DirtyReboot, IndexFlush) • Adding block-level crash states proved to be slow and did not uncover new bugs • Block level crashes are not used by default 19
  15. Checking Concurrent Executions Checking for linearizability Hand-written harness to validate

    key properties • Loom model checker for Rust with sound model checking (slow) • Shuttle model checker with probabilistic algorithms (faster) Loom and Shuttle offer a soundness-scalability trade-off 21
  16. Other properties • Undefined behavior • Miri interpreter for Rust

    • Rust compiler dynamic analysis tools • Serialization • Crux symbolic execution engine to prove panic-freedom • Fuzzing 23
  17. Experience • Developing the reference model took ~ 2 x

    9 months of FM experts • Non-experts contributed 18% of the model code so far Benefits: • Early detection is a great • Continuous integration/validation keeps the model up-to-date 25
  18. Limitations • Hard to evaluate coverage by property-based tests •

    Accidental complexity gluing with S3 not covered • Huge API surface — not everything is covered 26
  19. Testing distributed systems Curated list of resources on testing distributed

 https://asatarin.github.io/testing-distributed-systems/ 27
  20. References • Self reference for this talk (slides, video, etc)

    • "Using lightweight formal methods to validate a key-value storage node in Amazon S3" paper • Talk at SOSP 2021 • Blog post from Murat Demirbas 30