$30 off During Our Annual Pro Sale. View Details »

The Verification of a Distributed System

The Verification of a Distributed System

Caitie McCaffrey

December 08, 2016
Tweet

More Decks by Caitie McCaffrey

Other Decks in Technology

Transcript

  1. The Verification of a
    Distributed System
    A Practitioner’s Guide to Increasing Confidence in System Correctness

    View Slide

  2. Distributed Systems Engineer
    Caitie McCaffrey
    caitiem.com
    @caitie

    View Slide

  3. Increase Confidence by Exploring More
    of the Input State Space
    All Inputs

    View Slide

  4. View Slide

  5. LESLIE LAMPORT
    “A Distributed System is one in which
    the failure of a computer you didn’t even
    know existed can render your own
    computer unusable”

    View Slide

  6. Service Service
    Service
    We Are All
    Building
    Distributed
    Systems

    View Slide

  7. Twitter
    Services

    View Slide

  8. View Slide

  9. Overview
    Formal Verification
    Provably Correct Systems
    Testing in the Wild
    Increase Confidence in System Correctness
    Research
    A New Hope

    View Slide

  10. References

    View Slide

  11. Provably
    Correct
    Formal
    Verification

    View Slide

  12. Formal Specifications
    Written description of what a system is supposed to do
    TLA+ Coq

    View Slide

  13. All Inputs
    State Space Explored by Formal Verification

    View Slide

  14. All Inputs
    Formal Verification
    State Space Explored by Formal Verification

    View Slide

  15. Hour Clock Specification
    ————————————— MODULE HourClock ————————————————
    EXTENDS Naturals
    VARIABLE hr
    HCini == hr \in (1 .. 12)
    HCnxt == hr’ = IF hr # 12 THEN hr + 1 ELSE 1
    HC == HCini /\ [][HCnxt] _hr
    ————————————————————————————————————————————-
    THEOREM HC => []HCini
    =============================================
    Leslie Lamport, Specifying Systems
    TLA+

    View Slide

  16. Use of
    Formal
    Methods at
    Amazon
    Web Services
    TLA+

    View Slide

  17. “Formal Methods Have Been a Big Success”
    S3 & 10+ Core Pieces of
    Infrastructure Verified
    2 Serious Bugs Found
    Increased Confidence to make
    Optimizations
    Use of Formal Methods at Amazon Web Services
    TLA+

    View Slide

  18. Leslie Lamport, Specifying Systems
    “Its a good idea to understand a system
    before building it, so its a good idea to write
    a specification of a system before
    implementing it”
    TLA+

    View Slide

  19. “Formal methods deal with
    models of systems, not the
    systems themselves”
    Use of Formal Methods at Amazon Web Services

    View Slide

  20. Program Extraction

    View Slide

  21. POPL
    2016
    “Our Verified
    Implementation is
    extracted to OCaml
    & runs on real
    networks”
    Program Extraction
    COQ

    View Slide

  22. POPL
    2016
    “We have developed &
    checked our framework
    in Coq, extracted it to
    OCaml, and built
    executable stores”
    Program Extraction
    COQ

    View Slide

  23. Distributed
    Systems Testing
    in the Wild
    “Seems Pretty Legit”

    View Slide

  24. Unit Tests
    Testing of Individual Software
    Components or Modules

    View Slide

  25. Simple
    Testing Can
    Prevent
    Most Critical
    Failures

    View Slide

  26. 77% of Production failures
    can be reproduced by a
    Unit Test
    Simple Testing can Prevent Most Critical Failures

    View Slide

  27. Testing error handling
    code could have
    prevented 58% of
    catastrophic failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  28. Error Handling Code is simply empty or
    only contains a Log statement
    Error Handler aborts cluster on an overly
    general exception
    Error Handler contains comments like
    FIXME or TODO
    35% of
    Catastrophic
    Failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  29. Scala
    Types
    Are Not
    Testing
    A Short Counter Example

    View Slide

  30. TCP Doesn’t Care About
    Your Type System

    View Slide

  31. Integration Tests
    Testing of integrated modules to
    verify combined functionality

    View Slide

  32. Three nodes or
    less can reproduce
    98% of failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  33. Property Based
    Testing

    View Slide

  34. QuickCheck ScalaCheck
    Haskell
    Erlang
    Scala
    Java
    & &
    C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl,
    Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, StandardML , Swift
    Languages with Quick Check Ports:

    View Slide

  35. ScalaCheck Examples

    View Slide

  36. Fault Injection
    Introducing faults into the system under test

    View Slide

  37. -The Verification of a Distributed System
    “Without explicitly forcing a system
    to fail, it is unreasonable to have any
    confidence it will operate correctly
    in failure modes”

    View Slide

  38. Netflix Simian Army
    • Chaos Monkey: kills instances
    • Latency Monkey: artificial latency
    induced
    • Chaos Gorilla: simulates outage
    of entire availability zone.

    View Slide

  39. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  40. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  41. CAUTION: Passing Tests
    Does Not Ensure Correctness

    View Slide

  42. All Inputs
    State Space
    Explored
    Unit Tests

    View Slide

  43. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests

    View Slide

  44. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests
    Property
    Tests

    View Slide

  45. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests
    Property
    Tests
    Fault
    Injection
    Tests

    View Slide

  46. GAME DAYS
    Resilience Engineering: Learning to Embrace Failure
    Breaking your services on purpose

    View Slide

  47. How to Run a GameDay
    1. Notify Engineering Teams that Failure is Coming
    2. Induce Failures
    3. Monitor Systems Under Test
    4. Observing Only Team Monitors Recovery Processes
    & Systems, Files Bugs
    5. Prioritize Bugs & Get Buy-In Across Teams
    Resilience Engineering: Learning to Embrace Failure

    View Slide

  48. Game Day at Stripe
    “During a recent game day, we
    tested failing over a Redis cluster by
    running kill -9 on its primary
    node, and ended up losing all
    data in the cluster”
    Game Day Exercises at Stripe: Learning from `kill -9`

    View Slide

  49. TESTING IN
    PRODUCTION
    Some thoughts on

    View Slide

  50. Monitoring
    Testing
    is not

    View Slide

  51. CANARIES
    “Verification” in production

    View Slide

  52. Probability
    of failure
    Rank
    Catastrophic
    Failures
    Classical

    engineering
    Reactive

    ops
    unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View Slide

  53. Probability
    of failure
    Rank
    Unit & Integration Tests
    Classical

    engineering
    Reactive

    ops
    unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View Slide

  54. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View Slide

  55. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Fault Injection Testing
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View Slide

  56. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Fault Injection Testing
    Canaries
    Game Days
    Monitoring
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View Slide

  57. Research
    Improving the Verification
    of Distributed Systems
    Lineage Driven Fault Injection
    ‘Cause I’m Strong Enough:
    Reasoning about Consistency Choices in Distributed Systems
    IronFleet:
    Proving Practical Distributed Systems Correct

    View Slide

  58. ‘Cause I’m
    Strong Enough
    POPL
    2016

    View Slide

  59. Bank Application
    Bank Account must be > 0
    Deposit Money
    Withdrawal Money

    View Slide

  60. ‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems

    View Slide

  61. Conclusion
    Use Formal Verification on
    Critical Components
    Unit Tests & Integration Tests find a
    multitude of Errors
    Increase Confidence via Property
    Testing & Fault Injection

    View Slide

  62. Camille Fournier
    “Enjoy the ride, have fun, and
    test your freaking code”

    View Slide

  63. Thank You
    Peter Alvaro
    Kyle Kingsbury
    Christopher
    Meiklejohn
    Alex Rasmussen
    Ines Sombra
    Nathan Taylor
    Alvaro Videla

    View Slide

  64. Questions
    @caitie
    https://github.com/CaitieM20/
    Talks/tree/master/
    TheVerificationOfADistributedSystem
    Resources:

    View Slide