Understanding Partial Failures in Large Systems

Andrey Satarin

May 25, 2022

Transcript

  1. Understanding, Detecting and Localizing
    Partial Failures in Large System Software
    By Chang Lou, Peng Huang, and Scott Smith

    Presented by Andrey Satarin, @asatarin

    May, 2022

    https://asatarin.github.io/talks/2022-05-understanding-partial-failures/


  2. Outline
    • Understanding Partial Failures


    • Catching Partial Failures with Watchdogs


    • Generating Watchdogs with OmegaGen


    • Evaluation


    • Conclusions




  3. Understanding Partial Failures

  4. Partial Failure
    A partial failure is a failure in a process P where a fault does not crash P but
    causes a safety or liveness violation or severe slowness for some functionality


    • It’s process level, not node level


    • The process is still alive; this is not a fail-stop failure


    • Could be missed by usual health checks


    • Can lead to catastrophic outage

  5. Failure Hierarchy
    Fail-stop
    Omission failure
    Fail-recover
    Byzantine failure


  6. Failure Hierarchy
    Fail-stop
    Omission failure
    Fail-recover
    Byzantine failure
    Partial failure


  7. Questions
    • How do partial failures manifest in modern systems?


    • How to systematically detect and localize partial failures at runtime?

  9. Findings 1-2
    Finding 1: In all the five systems, partial failures appear throughout release
    history (Table 1). 54% of them occur in the most recent three years’ software
    releases.


    Finding 2: The root causes of studied failures are diverse. The top three
    (total 48%) root cause types are uncaught errors, indefinite blocking, and
    buggy error handling.

  10. Findings 3-5
    Finding 3: Nearly half (48%) of the partial failures cause some functionality to
    be stuck.


    Liveness violations are straightforward to detect


    Finding 4: In 13% of the studied cases, a module became a “zombie” with
    undefined failure semantics.


    Finding 5: 15% of the partial failures are silent (including data loss,
    corruption, inconsistency, and wrong results).

  11. Findings 6-7
    Finding 6: 71% of the failures are triggered by some specific environment
    condition, input, or faults in other processes.


    Hard to expose with testing => need runtime checking


    Finding 7: The majority (68%) of the failures are “sticky” — the process will
    not recover from the faults by itself.



  12. Catching Partial Failures with Watchdogs

  13. Current Checkers
    • Probe checkers


    • Execute external API to detect issues


    • Signal checkers


    • Monitor health indicator provided by the system

  14. Issues with Current Checkers
    • Probe checkers


    • Large API surface can’t be covered with probes


    • Partial failures might not be observable at the API level


    • Signal checkers


    • Susceptible to environment noise


    • Poor accuracy
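
    A minimal sketch of why a probe checker can miss a partial failure. The
    `Server` class, its `disk_stuck` flag, and `probe_check` are hypothetical
    names invented for this illustration; they do not come from the paper or
    from any of the studied systems.

```python
class Server:
    """Toy process with a client-facing path and a background flush path."""

    def __init__(self):
        self.data = {}
        self.flushed = 0
        self.disk_stuck = False  # hypothetical fault switch

    def put(self, key, value):
        # Client-facing path: only touches memory.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def flush_once(self):
        # Background path: makes no progress while the fault is active.
        if self.disk_stuck:
            return False
        self.flushed = len(self.data)
        return True


def probe_check(server):
    """Probe checker: exercises the public API the way an external client would."""
    server.put("probe", "ok")
    return server.get("probe") == "ok"
```

    With `disk_stuck` set, `probe_check` still returns `True` while
    `flush_once` keeps returning `False`: the failure is not observable at the
    API level that the probe exercises.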

  15. Mimic Checkers
    • Mimic-style checkers select some representative operations from each
    module of the main program, imitate them, and detect errors


    • Can pinpoint the faulty module and failing instructions
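
    A rough sketch of the mimic idea, assuming the main program has a module
    that writes snapshots to disk. The function name, the "snapshot" module
    label, and the report format are assumptions made for this example. The
    checker imitates the write on a scratch file and records which operation
    failed.

```python
import os
import tempfile


def mimic_write_check(directory):
    """Imitate a snapshot write (open, write, fsync) on a scratch file."""
    operation = "open"
    try:
        fd, scratch = tempfile.mkstemp(dir=directory, prefix="wd-")
        try:
            operation = "write"
            os.write(fd, b"watchdog")
            operation = "fsync"
            os.fsync(fd)
        finally:
            # Always clean up the scratch file created by the checker.
            os.close(fd)
            os.remove(scratch)
    except OSError as exc:
        # Report the module and the specific operation that failed.
        return {"ok": False, "module": "snapshot",
                "operation": operation, "error": repr(exc)}
    return {"ok": True, "module": "snapshot", "operation": None, "error": None}
```

    Because the check runs the same kind of operation as the main program, a
    failure report can name the operation, not just the process.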

  16. Intrinsic Watchdog
    • Synchronizes state with the main program via hooks in the program


    • Executes concurrently with the main program


    • Lives in the same address space as the main program


    • Generated automatically
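
    The properties above could be sketched as follows. This is an
    illustration only: `Context`, `Watchdog`, and their methods are
    hypothetical names, and the real generated watchdogs are produced by
    OmegaGen rather than written by hand. A hook in the main program calls
    `Context.update`, and the watchdog thread, living in the same process,
    reads that state before each check.

```python
import threading


class Context:
    """Latest arguments captured by a hook placed in the main program."""

    def __init__(self):
        self._lock = threading.Lock()
        self._args = None

    def update(self, **args):
        # Called from the hook to synchronize state with the watchdog.
        with self._lock:
            self._args = dict(args)

    def snapshot(self):
        with self._lock:
            return None if self._args is None else dict(self._args)


class Watchdog(threading.Thread):
    """Runs a checker concurrently with the main program, in the same process."""

    def __init__(self, context, checker, interval=1.0):
        super().__init__(daemon=True)
        self.context = context
        self.checker = checker
        self.interval = interval
        self.failures = []
        self._halt = threading.Event()

    def run_once(self):
        args = self.context.snapshot()
        if args is None:
            return None  # the hook has not fired yet, so skip this check
        try:
            self.checker(**args)
            return True
        except Exception as exc:
            self.failures.append(repr(exc))
            return False

    def run(self):
        # Wait `interval` seconds between checks until halt() is called.
        while not self._halt.wait(self.interval):
            self.run_once()

    def halt(self):
        self._halt.set()
```

    Skipping the check until the hook has fired is an assumption of this
    sketch; it keeps the watchdog from testing code paths the main program has
    not reached yet.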

  18. Generating Watchdogs with OmegaGen

  19. Generating Watchdogs
    • Identify long-running methods (1)


    • Locate vulnerable operations (2)


    • Reduce main program (3)


    • Encapsulate reduced program with context factory and hooks (4)


    • Add checks to catch faults (5)

  21. Validate Impact of Caught Faults
    • Runs validation step to reduce false alarms


    • Default validation is to re-run the check


    • Supports manually written validation
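
    A small sketch of the validation step, assuming a check is a callable that
    returns `True` on success. The function name, the `retries` parameter, and
    the validator convention (return `True` when the impact is confirmed) are
    hypothetical.

```python
def validated_alarm(check, validators=(), retries=1):
    """Return True only if a failed check is confirmed as a real problem."""
    if check():
        return False  # nothing to validate, the check passed
    # Default validation: re-run the check to filter transient failures.
    for _ in range(retries):
        if check():
            return False
    # Optional manually written validators must all confirm the impact.
    return all(validator() for validator in validators)
```

    A check that fails once and then passes is treated as transient and does
    not raise an alarm.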

  22. Preventing Side Effects
    • Redirect I/O for writes


    • Idempotent wrappers for reads


    • Re-write socket operations as ping


    • If I/O goes to another large system, it is better to apply OmegaGen to
    that system
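
    Write redirection could look roughly like this. The `scratch_dir`
    parameter, the `wd-` prefix, and both function names are assumptions of
    the sketch; the point is only that the watchdog repeats the write
    operation without touching the file the main program owns.

```python
import os


def redirect_path(original_path, scratch_dir):
    """Map a file written by the main program to a watchdog-only file."""
    return os.path.join(scratch_dir, "wd-" + os.path.basename(original_path))


def checked_write(original_path, payload, scratch_dir):
    """Repeat the main program's write against the redirected file."""
    target = redirect_path(original_path, scratch_dir)
    with open(target, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    os.remove(target)  # leave no trace of the check
    return target
```

    Any `OSError` raised here propagates to the caller, which can treat it as
    a detected fault.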

  23. Evaluation

  24. Questions
    • Does our approach work for large software?


    • Can the generated watchdogs detect and localize diverse forms of real-
    world partial failures?


    • Do the watchdogs provide strong isolation?


    • Do the watchdogs report false alarms?


    • What is the runtime overhead to the main program?

  25. Detection
    • Collected and reproduced 22 real-world failures in six systems


    • Built-in (baseline) detectors did not detect any partial failures


    • Detected 20 out of 22 partial failures with the median detection time
    of 5 seconds


    • Highly effective against liveness issues — deadlocks, indefinite blocking


    • Effective against explicit safety issues — exceptions, errors
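
    One way to see why liveness issues and explicit errors are the easy cases:
    both show up when an operation is run under a timeout. The sketch below is
    an assumption about how such a check might be written, with hypothetical
    names; it is not code from the paper.

```python
import threading


def run_with_timeout(operation, timeout):
    """Run `operation` in a thread and classify the outcome.

    Returns ("ok", None), ("error", detail) or ("liveness", None).
    """
    outcome = {}

    def target():
        try:
            operation()
            outcome["status"] = "ok"
        except Exception as exc:
            # Explicit safety issue: the operation raised.
            outcome["status"] = "error"
            outcome["detail"] = repr(exc)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # Still blocked after the timeout: treat it as a liveness violation.
        return ("liveness", None)
    return (outcome.get("status", "error"), outcome.get("detail"))
```

    A silent failure such as data corruption returns `"ok"` here, which is
    consistent with explicit issues being easier to catch than silent ones.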

  26. Localization
    • Directly pinpoint the faulty instruction for 55% (11/20) of the detected
    cases


    • For 35% (7/20) of detected cases, either localize to some program point
    within the same function or some function along the call chain


    • Probe or signal detectors can only pinpoint the faulty process

  27. False Alarms
    • The false alarm ratio is calculated as the total number of false failure
    reports divided by the total number of check executions.


    • The watchdogs and baseline detectors are all configured to run checks
    every second


    • Can false alarm ratio be traded for detection time? (Median detection time
    is 5 seconds)
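
    The ratio, and one possible way to explore the trade-off question, can be
    written down as a short sketch. Requiring several consecutive failed
    checks before reporting is an assumption of this example, not something
    the talk or the paper prescribes, and the function names are hypothetical.

```python
def false_alarm_ratio(false_reports, check_executions):
    """False failure reports divided by the total number of check executions."""
    if check_executions <= 0:
        raise ValueError("check_executions must be positive")
    return false_reports / check_executions


def first_alarm_index(results, k):
    """Index of the check at which an alarm fires when k consecutive
    failures are required; None if no alarm fires.

    `results` is a sequence of booleans, True meaning the check passed.
    """
    streak = 0
    for index, passed in enumerate(results):
        streak = 0 if passed else streak + 1
        if streak >= k:
            return index
    return None
```

    With checks every second, a larger `k` suppresses isolated transient
    failures but delays the report by at least `k - 1` extra checks.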

  30. Conclusions

  31. Conclusions
    • Study of 100 real-world partial failures in popular software


    • OmegaGen to generate watchdogs from code


    • Generated watchdogs detect 20/22 partial failures and pinpoint scope
    in 18/20 cases


    • Exposed new partial failure in ZooKeeper

  32. The End

  33. Contacts
    • Follow me on Twitter @asatarin


    • https://www.linkedin.com/in/asatarin/


    • https://asatarin.github.io/

  34. References
    • Self reference for this talk (slides, video, etc)


    • "Understanding, Detecting and Localizing Partial Failures in Large System
    Software" paper


    • Talk at NSDI 2020


    • Post from The Morning Paper blog