Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Verification of a Distributed System

The Verification of a Distributed System

Caitie McCaffrey

May 24, 2016
Tweet

More Decks by Caitie McCaffrey

Other Decks in Technology

Transcript

  1. The Verification of a
    Distributed System
    A Practitioner’s Guide to Increasing Confidence in System Correctness

    View full-size slide

  2. Distributed Systems Engineer
    Caitie McCaffrey
    caitiem.com
    @caitie

    View full-size slide

  3. LESLIE LAMPORT
    “A Distributed System is one in which
    the failure of a computer you didn’t even
    know existed can render your own
    computer unusable”

    View full-size slide

  4. Service Service
    Service
    We Are All
    Building
    Distributed
    Systems

    View full-size slide

  5. Twitter
    Services

    View full-size slide

  6. Overview
    Formal Verification
    Provably Correct Systems
    Testing in the Wild
    Increase Confidence in System Correctness
    Research
    A New Hope

    View full-size slide

  7. Safety Liveness
    &

    View full-size slide

  8. Formal
    Verification

    View full-size slide

  9. Provably
    Correct
    Formal
    Verification

    View full-size slide

  10. Formal Specifications
    Written description of what a system is supposed to do
    TLA+ Coq

    View full-size slide

  11. Leslie Lamport, Specifying Systems
    “Its a good idea to understand a system
    before building it, so its a good idea to write
    a specification of a system before
    implementing it”
    TLA+

    View full-size slide

  12. Hour Clock Specification
    ————————————— MODULE HourClock ————————————————
    EXTENDS Naturals
    VARIABLE hr
    HCini == hr \in (1 .. 12)
    HCnxt == hr’ = IF hr # 12 THEN hr + 1 ELSE 1
    HC == HCini /\ [][HCnxt] _hr
    ————————————————————————————————————————————-
    THEOREM HC => []HCini
    =============================================
    Leslie Lamport, Specifying Systems
    TLA+

    View full-size slide

  13. Use of
    Formal
    Methods at
    Amazon
    Web Services
    TLA+

    View full-size slide

  14. “Formal Methods Have Been a Big Success”
    S3 & 10+ Core Pieces of
    Infrastructure Verified
    2 Serious Bugs Found
    Increased Confidence to make
    Optimizations
    Use of Formal Methods at Amazon Web Services
    TLA+

    View full-size slide

  15. “Formal methods deal with
    models of systems, not the
    systems themselves”
    Use of Formal Methods at Amazon Web Services

    View full-size slide

  16. Distributed
    Systems Testing
    in the Wild

    View full-size slide

  17. Distributed
    Systems Testing
    in the Wild
    “Seems Pretty Legit”

    View full-size slide

  18. Unit Tests
    Testing of Individual Software
    Components or Modules

    View full-size slide

  19. Scala
    Types
    Are Not
    Testing
    A Short Counter Example

    View full-size slide

  20. TCP Doesn’t Care About
    Your Type System

    View full-size slide

  21. Integration Tests
    Testing of integrated modules to
    verify combined functionality

    View full-size slide

  22. Simple
    Testing Can
    Prevent
    Most Critical
    Failures

    View full-size slide

  23. Three nodes or
    less can reproduce
    98% of failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  24. Testing error handling
    code could have
    prevented 58% of
    catastrophic failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  25. Error Handling Code is simply empty or
    only contains a Log statement
    Error Handler aborts cluster on an overly
    general exception
    Error Handler contains comments like
    FIXME or TODO
    35% of
    Catastrophic
    Failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  26. Property Based
    Testing

    View full-size slide

  27. QuickCheck ScalaCheck
    Haskell
    Erlang
    Scala
    Java
    & &
    C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl,
    Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, StandardML , Swift
    Languages with Quick Check Ports:

    View full-size slide

  28. ScalaCheck Examples

    View full-size slide

  29. Fault Injection
    Introducing faults into the system under test

    View full-size slide

  30. -The Verification of a Distributed System
    “Without explicitly forcing a system
    to fail, it is unreasonable to have any
    confidence it will operate correctly
    in failure modes”

    View full-size slide

  31. Netflix Simian Army
    • Chaos Monkey: kills instances
    • Latency Monkey: artificial latency
    induced
    • Chaos Gorilla: simulates outage
    of entire availability zone.

    View full-size slide

  32. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View full-size slide

  33. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View full-size slide

  34. CAUTION: Passing Tests
    Does Not Ensure Correctness

    View full-size slide

  35. GAME DAYS
    Resilience Engineering: Learning to Embrace Failure
    Breaking your services on purpose

    View full-size slide

  36. How to Run a GameDay
    1. Notify Engineering Teams that Failure is Coming
    2. Induce Failures
    3. Monitor Systems Under Test
    4. Observing Only Team Monitors Recovery Processes
    & Systems, Files Bugs
    5. Prioritize Bugs & Get Buy-In Across Teams
    Resilience Engineering: Learning to Embrace Failure

    View full-size slide

  37. Game Day at Stripe
    “During a recent game day, we
    tested failing over a Redis cluster by
    running kill -9 on its primary
    node, and ended up losing all
    data in the cluster”
    Game Day Exercises at Stripe: Learning from `kill -9`

    View full-size slide

  38. TESTING IN
    PRODUCTION
    Some thoughts on

    View full-size slide

  39. Monitoring
    Testing
    is not

    View full-size slide

  40. CANARIES
    “Verification” in production

    View full-size slide

  41. Verification
    Wild
    in the wild
    Unit & Integration Tests
    Property Based Testing
    Fault Injection
    Canaries

    View full-size slide

  42. Research
    Improving the Verification
    of Distributed Systems
    Lineage Driven Fault Injection
    ‘Cause I’m Strong Enough:
    Reasoning about Consistency Choices in Distributed Systems
    IronFleet:
    Proving Practical Distributed Systems Correct
    Towards Property Based
    Consistency Verification

    View full-size slide

  43. Lineage-driven
    Fault Injection

    View full-size slide

  44. Lineage-driven
    Fault Injection
    Molly In Action

    View full-size slide

  45. Netflix & Molly
    Distributed Tracing + FIT To
    construct call graphs
    Metric Systems to Determine
    if Call was a Success
    Used FIT to Inject Failures
    determined by Molly
    “Monkeys in Lab Coats”: Applied Failure Testing Research at Netflix

    View full-size slide

  46. Conclusion
    Use Formal Verification on
    Critical Components
    Unit Tests & Integration Tests find a
    multitude of Errors
    Increase Confidence via Property
    Testing & Fault Injection

    View full-size slide

  47. Camille Fournier
    “Enjoy the ride, have fun, and
    test your freaking code”

    View full-size slide

  48. Thank You
    Peter Alvaro
    Kyle Kingsbury
    Christopher
    Meiklejohn
    Alex Rasmussen
    Ines Sombra
    Nathan Taylor
    Alvaro Videla

    View full-size slide

  49. Questions
    @caitie
    http://github.com/CaitieM20/
    TheVerificationOfDistributedSystem
    Resources:

    View full-size slide