PWL NY: Simple Testing Can Prevent Most Critical Failures

Caitie McCaffrey

June 14, 2016

Transcript

  1. Simple Testing Can Prevent
    Most Critical Failures:
    An Analysis of Production Failures in
    Distributed Data-Intensive Systems
    Papers We Love New York - June 2016

  2. Caitie McCaffrey
    @caitie
    Distributed Systems Engineer
    CaitieM.com

  3. Analyzed Failures in Real
    World Systems

  4. Complexity of Failures
    “A majority (77%) of failures require more than one input event to
    manifest, but most of the failures (90%) require no more than 3”

  5. Complexity of Failures
    “The specific order of events is important in 88% of the failures that
    require multiple events”

  6. Complexity of Failures
    “3 Nodes or less can reproduce 98% of Failures”

  7. Unit Tests
    “A majority of production failures
    (77%) can be reproduced by a unit
    test”
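The findings above (a short, ordered sequence of input events on a few nodes) suggest what such a unit test might look like. The following is a minimal sketch, not an example from the paper: `TinyStore`, `FlakyDisk`, and `DiskFullError` are hypothetical names invented for illustration. The test replays three ordered events (a write, a flush that hits a non-fatal error, and a retry) against a single in-process component.

```python
class DiskFullError(Exception):
    """Hypothetical non-fatal error that the store is expected to survive."""


class FlakyDisk:
    """Hypothetical disk stub that rejects the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def write(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise DiskFullError("no space left")
        self.data[key] = value


class TinyStore:
    """Hypothetical store that buffers writes until flush() succeeds."""

    def __init__(self, disk):
        self.disk = disk
        self.buffer = []

    def put(self, key, value):
        self.buffer.append((key, value))

    def flush(self):
        try:
            for key, value in list(self.buffer):
                self.disk.write(key, value)
        except DiskFullError:
            # Keep the buffer so a later flush can retry the writes.
            return False
        self.buffer.clear()
        return True


def test_write_survives_one_failed_flush():
    disk = FlakyDisk(failures=1)
    store = TinyStore(disk)
    store.put("k", "v")            # event 1: client write
    assert store.flush() is False  # event 2: flush hits the disk-full error
    assert store.flush() is True   # event 3: retry after the disk recovers
    assert disk.data == {"k": "v"}
    assert store.buffer == []
```

The order of the events matters here: the retry only exercises the error-handling path because the failed flush comes first.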

  8. Top Down Fault Injection
    & State Space
    Exploration is Expensive

  9. Logging
    • 76% of the failures print explicit failure-related error messages
    • For 84% of the failures, all of the triggering events are logged
    • Logs are noisy: each failure prints 824 log messages (median)
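Since most failures do log an explicit error message but bury it among hundreds of other lines, a first triage step can be a simple severity filter. This is a rough sketch under assumptions: the `WARN`/`ERROR`/`FATAL` markers and the sample log lines are invented for illustration and do not come from the studied systems.

```python
import re

# Assumed severity markers; real systems may name their levels differently.
SEVERE = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def failure_related(lines):
    """Return (line_number, text) pairs for lines carrying a severe level marker."""
    return [(n, text) for n, text in enumerate(lines, start=1) if SEVERE.search(text)]


# Invented sample log, used only to show the filter in action.
SAMPLE_LOG = [
    "INFO  starting flush of region r1",
    "INFO  flushed 128 entries",
    "WARN  flush of region r2 timed out, retrying",
    "INFO  heartbeat ok",
    "ERROR flush of region r2 failed: no space left on device",
]

for number, text in failure_related(SAMPLE_LOG):
    print(number, text)
```

A filter like this only narrows the search. The surrounding lines are still needed to recover the triggering events and their order.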

  10. Catastrophic Failures

  11. Error Handling
    • 92% of catastrophic failures were the result of incorrect handling of
    non-fatal errors
    • In 58% of catastrophic failures, the underlying faults could have been
    detected via simple testing of error handling code
    • 35% of catastrophic failures were caused by bad practices in error
    handling code

  12. Bad Practices
    • Error Handling Code is simply empty or only contains a Log statement
    • Error Handler aborts cluster on an overly general exception
    • Error Handler contains comments like FIXME or TODO

  13. Aspirator
    Performs static analysis of Java bytecode to detect:
    • error handler is empty
    • error handler over-catches exceptions and aborts
    • error handler contains phrases like “TODO” or “FIXME”
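To make the three checks concrete, here is a small analogue for Python source built on the standard `ast` and `tokenize` modules. It is not Aspirator and does not reflect its implementation (Aspirator works on Java bytecode). The rule names, the set of "logging" method names, and the list of abort calls are all assumptions chosen for this sketch, which needs Python 3.9+ for `ast.unparse`.

```python
import ast
import io
import tokenize

# Assumptions for this sketch: what counts as "broad", "abort", and "logging".
BROAD = {"Exception", "BaseException"}
ABORT_CALLS = {"sys.exit", "os._exit", "os.abort"}
LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}


def _is_log_only(stmt):
    """True if the statement is a bare print(...) or <logger>.<level>(...) call."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    name = ast.unparse(stmt.value.func)
    return name == "print" or name.rsplit(".", 1)[-1] in LOG_METHODS


def check_handlers(source):
    """Return sorted (line, rule) findings for every except handler in source."""
    tree = ast.parse(source)
    comments = [
        (tok.start[0], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    ]
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Rule 1: the handler does nothing, or only logs.
        if all(isinstance(s, ast.Pass) or _is_log_only(s) for s in node.body):
            findings.append((node.lineno, "empty-or-log-only"))
        # Rule 2: the handler catches a very general exception and aborts.
        caught = "BaseException" if node.type is None else ast.unparse(node.type)
        aborts = any(
            isinstance(n, ast.Call) and ast.unparse(n.func) in ABORT_CALLS
            for s in node.body
            for n in ast.walk(s)
        )
        if caught in BROAD and aborts:
            findings.append((node.lineno, "over-catch-and-abort"))
        # Rule 3: a comment inside the handler mentions TODO or FIXME.
        if any(
            node.lineno <= line <= node.end_lineno
            and ("TODO" in text or "FIXME" in text)
            for line, text in comments
        ):
            findings.append((node.lineno, "todo-in-handler"))
    return sorted(findings)


# Invented sample input containing one instance of each bad practice.
SAMPLE = '''\
import sys

def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def serve(handle):
    try:
        handle()
    except Exception:
        sys.exit(1)

def flush(buf):
    try:
        buf.flush()
    except IOError as err:
        # TODO: retry the flush
        buf.reset()
'''

for line, rule in check_handlers(SAMPLE):
    print(line, rule)
```

Checks this simple will produce false positives, for example when ignoring an exception is deliberate, so the findings are prompts for review, not verdicts.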

  14. Aspirator Results
    • 500 New Bugs & Bad Practices
    • 115 False Positives
    • 171 bugs reported
    • 143 bugs confirmed or fixed

  15. Developer Reactions
    “I fail to see the reason to handle every exception”
    - developer

  16. “It is often much harder to reason about the
    correctness of a system’s abnormal path than
    its normal execution path”

  17. Moving Forward
    • Use a tool like Aspirator that is capable of identifying trivial bugs
    • Enforce code reviews of error handling code
    • High code coverage on error handling code
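One inexpensive way to raise coverage of error handling code is to inject the error directly at the call that can fail and then assert on what the handler does. The sketch below uses `unittest.mock` from the Python standard library. `read_config` and its fallback behaviour are hypothetical, invented for illustration.

```python
from unittest import mock


def read_config(fetch, cached):
    """Hypothetical function: fall back to the cached config if the fetch times out."""
    try:
        return fetch()
    except TimeoutError:
        # Only the expected, non-fatal error is handled; anything else propagates.
        return cached


def test_timeout_falls_back_to_cache():
    # Inject the fault: the mocked fetch raises instead of returning.
    fetch = mock.Mock(side_effect=TimeoutError("injected"))
    assert read_config(fetch, cached={"mode": "safe"}) == {"mode": "safe"}
    fetch.assert_called_once_with()


def test_unexpected_error_is_not_swallowed():
    fetch = mock.Mock(side_effect=ValueError("injected"))
    try:
        read_config(fetch, cached={})
    except ValueError:
        pass
    else:
        raise AssertionError("ValueError should propagate")
```

The second test guards against the over-catching pattern: it fails if the handler is later widened to swallow every exception.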

  18. Questions
    @caitie
