$30 off During Our Annual Pro Sale. View Details »

The Verification of a Distributed System

The Verification of a Distributed System

Caitie McCaffrey

November 30, 2016
Tweet

More Decks by Caitie McCaffrey

Other Decks in Technology

Transcript

  1. The Verification of a
    Distributed System
    A Practitioner’s Guide to Increasing Confidence in System Correctness

    View Slide

  2. Distributed Systems Engineer
    Caitie McCaffrey
    caitiem.com
    @caitie

    View Slide

  3. View Slide

  4. LESLIE LAMPORT
    “A Distributed System is one in which
    the failure of a computer you didn’t even
    know existed can render your own
    computer unusable”

    View Slide

  5. Service Service
    Service
    We Are All
    Building
    Distributed
    Systems

    View Slide

  6. Twitter
    Services

    View Slide

  7. View Slide

  8. Overview
    Formal Verification
    Provably Correct Systems
    Testing in the Wild
    Increase Confidence in System Correctness
    Research
    A New Hope

    View Slide

  9. References

    View Slide

  10. Provably
    Correct
    Formal
    Verification

    View Slide

  11. Formal Specifications
    Written description of what a system is supposed to do
    TLA+ Coq

    View Slide

  12. Hour Clock Specification
    ————————————— MODULE HourClock ————————————————
    EXTENDS Naturals
    VARIABLE hr
    HCini == hr \in (1 .. 12)
    HCnxt == hr’ = IF hr # 12 THEN hr + 1 ELSE 1
    HC == HCini /\ [][HCnxt] _hr
    ————————————————————————————————————————————-
    THEOREM HC => []HCini
    =============================================
    Leslie Lamport, Specifying Systems
    TLA+

    View Slide

  13. Use of
    Formal
    Methods at
    Amazon
    Web Services
    TLA+

    View Slide

  14. “Formal Methods Have Been a Big Success”
    S3 & 10+ Core Pieces of
    Infrastructure Verified
    2 Serious Bugs Found
    Increased Confidence to make
    Optimizations
    Use of Formal Methods at Amazon Web Services
    TLA+

    View Slide

  15. Leslie Lamport, Specifying Systems
    “Its a good idea to understand a system
    before building it, so its a good idea to write
    a specification of a system before
    implementing it”
    TLA+

    View Slide

  16. “Formal methods deal with
    models of systems, not the
    systems themselves”
    Use of Formal Methods at Amazon Web Services

    View Slide

  17. Program Extraction

    View Slide

  18. POPL
    2016
    “Our Verified
    Implementation is
    extracted to OCaml
    & runs on real
    networks”
    Program Extraction
    COQ

    View Slide

  19. POPL
    2016
    “We have developed &
    checked our framework
    in Coq, extracted it to
    OCaml, and built
    executable stores”
    Program Extraction
    COQ

    View Slide

  20. Distributed
    Systems Testing
    in the Wild
    “Seems Pretty Legit”

    View Slide

  21. Unit Tests
    Testing of Individual Software
    Components or Modules

    View Slide

  22. Simple
    Testing Can
    Prevent
    Most Critical
    Failures

    View Slide

  23. 77% of Production failures
    can be reproduced by a
    Unit Test
    Simple Testing can Prevent Most Critical Failures

    View Slide

  24. Error Handling Code is simply empty or
    only contains a Log statement
    Error Handler aborts cluster on an overly
    general exception
    Error Handler contains comments like
    FIXME or TODO
    35% of
    Catastrophic
    Failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  25. Testing error handling
    code could have
    prevented 58% of
    catastrophic failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  26. Scala
    Types
    Are Not
    Testing
    A Short Counter Example

    View Slide

  27. TCP Doesn’t Care About
    Your Type System

    View Slide

  28. Integration Tests
    Testing of integrated modules to
    verify combined functionality

    View Slide

  29. Three nodes or
    less can reproduce
    98% of failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  30. Property Based
    Testing

    View Slide

  31. QuickCheck ScalaCheck
    Haskell
    Erlang
    Scala
    Java
    & &
    C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl,
    Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, StandardML , Swift
    Languages with Quick Check Ports:

    View Slide

  32. ScalaCheck Examples

    View Slide

  33. Fault Injection
    Introducing faults into the system under test

    View Slide

  34. -The Verification of a Distributed System
    “Without explicitly forcing a system
    to fail, it is unreasonable to have any
    confidence it will operate correctly
    in failure modes”

    View Slide

  35. Netflix Simian Army
    • Chaos Monkey: kills instances
    • Latency Monkey: artificial latency
    induced
    • Chaos Gorilla: simulates outage
    of entire availability zone.

    View Slide

  36. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  37. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  38. CAUTION: Passing Tests
    Does Not Ensure Correctness

    View Slide

  39. GAME DAYS
    Resilience Engineering: Learning to Embrace Failure
    Breaking your services on purpose

    View Slide

  40. How to Run a GameDay
    1. Notify Engineering Teams that Failure is Coming
    2. Induce Failures
    3. Monitor Systems Under Test
    4. Observing Only Team Monitors Recovery Processes
    & Systems, Files Bugs
    5. Prioritize Bugs & Get Buy-In Across Teams
    Resilience Engineering: Learning to Embrace Failure

    View Slide

  41. Game Day at Stripe
    “During a recent game day, we
    tested failing over a Redis cluster by
    running kill -9 on its primary
    node, and ended up losing all
    data in the cluster”
    Game Day Exercises at Stripe: Learning from `kill -9`

    View Slide

  42. TESTING IN
    PRODUCTION
    Some thoughts on

    View Slide

  43. Monitoring
    Testing
    is not

    View Slide

  44. CANARIES
    “Verification” in production

    View Slide

  45. Verification
    Wild
    in the
    Unit & Integration Tests
    Property Based Testing
    Fault Injection
    Canaries

    View Slide

  46. Research
    Improving the Verification
    of Distributed Systems
    Lineage Driven Fault Injection
    ‘Cause I’m Strong Enough:
    Reasoning about Consistency Choices in Distributed Systems
    IronFleet:
    Proving Practical Distributed Systems Correct

    View Slide

  47. ‘Cause I’m
    Strong Enough
    POPL
    2016

    View Slide

  48. ‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems

    View Slide

  49. Conclusion
    Use Formal Verification on
    Critical Components
    Unit Tests & Integration Tests find a
    multitude of Errors
    Increase Confidence via Property
    Testing & Fault Injection

    View Slide

  50. Camille Fournier
    “Enjoy the ride, have fun, and
    test your freaking code”

    View Slide

  51. Thank You
    Peter Alvaro
    Kyle Kingsbury
    Christopher
    Meiklejohn
    Alex Rasmussen
    Ines Sombra
    Nathan Taylor
    Alvaro Videla

    View Slide

  52. Questions
    @caitie
    https://github.com/CaitieM20/
    Talks/tree/master/
    TheVerificationOfADistributedSystem
    Resources:

    View Slide