$30 off During Our Annual Pro Sale. View Details »

Formal Methods at Amazon S3

Formal Methods at Amazon S3

Presentation on "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" paper for distributed systems reading group.

https://asatarin.github.io/talks/2022-02-formal-methods-at-amazon-s3/

Andrey Satarin

February 02, 2022
Tweet

More Decks by Andrey Satarin

Other Decks in Programming

Transcript

  1. Using lightweight formal methods to validate
    a key-value storage node in Amazon S3
    By James Bornholt, Rajeev Joshi, Vytautas Astrauskas, Brendan Cully, Bernhard Kragl, Seth Markle, Kyle
    Sauri, Drew Schleit, Grant Slatton, Serdar Tasiran, Jacob Van Ge
    ff
    en, Andrew War
    fi
    eld

    Presented by Andrey Satarin, @asatarin

    February, 2022

    View Slide

  2. Outline
    • Introduction and ShardStore


    • Validating a Storage System


    • Conformance Checking


    • Checking Crash Consistency


    • Checking Concurrent Executions


    • Other Properties


    • Experience and Lessons
    2

    View Slide

  3. Introduction and ShardStore
    3

    View Slide

  4. ShardStore and S3
    • The core of S3 are storage node servers


    • ShardStore — new key-value storage node


    • 40k lines of code in Rust


    • Crash consistency and concurrency in the implementation


    • Slowly rolling out to replace previous version
    4

    View Slide

  5. Validation goals
    • Functional API correcteness


    • Crash consistency of on disk data


    • Concurrent correctness of API calls and background tasks


    Soundness-correctness trade-off — willing to accept weaker guarantees than
    formal methods
    5

    View Slide

  6. ShardStore
    • Log-Structured Merge Tree
    (LSM)


    • Data in chunks, chunks in
    extents


    • More than one log
    complicates crash consistency


    • Garbage collection (GC)
    in the background
    6
    Figure 1. ShardStore’s on-disk layout

    View Slide

  7. Validating a Storage System
    7

    View Slide

  8. Properties
    • Focus on durability and consistency


    • Performance and availability is out of scope


    • Additional safety properties — undefined behavior, bounds checking, etc


    Results must outlive involvement of formal methods experts and be
    supported by development team in the future

    => lightweight approach to formal methods
    8

    View Slide

  9. Three views on durability
    9
    Section
    Sequential Crash-free “4 Conformance Checking”
    Sequential With crashes “5 Checking Crash Consistency”
    Concurrent Crash-free
    “6 Checking Concurrent
    Executions”
    Concurrent With crashes Out of scope

    View Slide

  10. Reference model
    • Executable specification with the same interface in Rust


    • 1% of the size of the implementation


    • For simplicity omits implementation failures (IO, resource exhaustion, etc)


    • Also used as a mock for unit tests, to help keep it up-to-date
    10

    View Slide

  11. Conformance Checking
    11

    View Slide

  12. Property-based testing
    • Implementation code refines the model


    • Argument bias to steer into interesting states


    • Default to random selection, only bias if have quantitative evidence
    of the benefit


    • Code coverage to identify blind spots in tests
    12

    View Slide

  13. Failure injection
    • Fail-stop crash


    • Covered in “5 Checking Crash Consistency”


    • Disk IO error


    • Relax check against the model


    • Resource exhaustion


    • Out of scope for property-based testing
    13

    View Slide

  14. Checking Crash Consistency
    14

    View Slide

  15. Write path
    Crash consistency is the primary motivation for this effort


    Every put operation has three steps:


    1. Write chunked data to an extent


    2. Write index entry in the LSM tree


    3. Update LSM tree metadata to point to new on-disk index data
    15

    View Slide

  16. Dependency graph
    • Inspired by soft updates


    • IO scheduler respects dependencies


    • Next append only issued if dependency is persisted
    16

    View Slide

  17. 17
    Figure 2 (a) Dependency Graph

    View Slide

  18. Two properties
    Persistence — if dependency is persisted, it should be visible after the crash


    Forward progress — after non-crash shutdown every operation’s dependency
    is persistent
    18

    View Slide

  19. Extending property-based testing
    • Add new operations to model (e.g. DirtyReboot, IndexFlush)


    • Adding block-level crash states proved to be slow and did not uncover new
    bugs


    • Block level crashes are not used by default
    19

    View Slide

  20. Checking Concurrent Executions
    20

    View Slide

  21. Checking Concurrent Executions
    Checking for linearizability


    Hand-written harness to validate key properties


    • Loom model checker for Rust with sound model checking (slow)


    • Shuttle model checker with probabilistic algorithms (faster)


    Loom and Shuttle offer a soundness-scalability trade-off
    21

    View Slide

  22. Other Properties
    22

    View Slide

  23. Other properties
    • Undefined behavior


    • Miri interpreter for Rust


    • Rust compiler dynamic analysis tools


    • Serialization


    • Crux symbolic execution engine to prove panic-freedom


    • Fuzzing
    23

    View Slide

  24. Experience and Lessons
    24

    View Slide

  25. Experience
    • Developing the reference model took ~ 2 x 9 months of FM experts


    • Non-experts contributed 18% of the model code so far


    Benefits:


    • Early detection is a great


    • Continuous integration/validation keeps the model up-to-date
    25

    View Slide

  26. Limitations
    • Hard to evaluate coverage by property-based tests


    • Accidental complexity gluing with S3 not covered


    • Huge API surface — not everything is covered
    26

    View Slide

  27. Testing distributed systems
    Curated list of resources on testing distributed systems

    https://asatarin.github.io/testing-distributed-systems/
    27

    View Slide

  28. The end
    28

    View Slide

  29. Contacts
    • Follow me on Twitter @asatarin


    • https://www.linkedin.com/in/asatarin/


    • https://asatarin.github.io/
    29

    View Slide

  30. References
    • Self reference for this talk (slides, video, etc)


    • "Using lightweight formal methods to validate a key-value storage node in
    Amazon S3" paper


    • Talk at SOSP 2021


    • Blog post from Murat Demirbas
    30

    View Slide