
Understanding Partial Failures in Large Systems


Andrey Satarin

May 25, 2022



  1. Understanding, Detecting and Localizing Partial Failures in Large System Software

    By Chang Lou, Peng Huang, and Scott Smith
    Presented by Andrey Satarin, @asatarin
    May 2022
    https://asatarin.github.io/talks/2022-05-understanding-partial-failures/
  2. Outline

    • Understanding Partial Failures
    • Catching Partial Failures with Watchdogs
    • Generating Watchdogs with OmegaGen
    • Evaluation
    • Conclusions
  3. Partial Failure

    A partial failure is a failure in a process P where a fault does not crash P but causes a safety or liveness violation or severe slowness for some functionality.
    • It’s process level, not node level
    • The process is still alive; this is not a fail-stop failure
    • Could be missed by the usual health checks
    • Can lead to a catastrophic outage
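The definition above can be illustrated with a minimal Python sketch, assuming a toy process whose background worker blocks forever while its liveness endpoint keeps answering. The names `Server`, `ping`, and `_flush_worker` are hypothetical and do not come from the talk or the paper.

```python
import threading


class Server:
    """Toy process with a liveness endpoint and a background flush worker."""

    def __init__(self):
        self.lock = threading.Lock()
        self.flushed = 0
        self.worker = threading.Thread(target=self._flush_worker, daemon=True)

    def _flush_worker(self):
        # Simulated fault: block forever on a lock that is never released.
        self.lock.acquire()
        self.flushed += 1

    def ping(self):
        # Usual health check: it only proves that the process is alive.
        return "ok"


server = Server()
server.lock.acquire()            # inject the fault before the worker starts
server.worker.start()
server.worker.join(timeout=0.2)  # give the worker a chance to make progress

health = server.ping()
worker_stuck = server.worker.is_alive() and server.flushed == 0
print(health, worker_stuck)
```

The process answers the health check while one of its functions makes no progress, which is the situation the slide describes as "could be missed by the usual health checks".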
  4. Questions

    • How do partial failures manifest in modern systems?
    • How to systematically detect and localize partial failures at runtime?

  6. Findings 1-2

    Finding 1: In all the five systems, partial failures appear throughout release history (Table 1). 54% of them occur in the most recent three years’ software releases.
    Finding 2: The root causes of studied failures are diverse. The top three (total 48%) root cause types are uncaught errors, indefinite blocking, and buggy error handling.
  7. Findings 3-5

    Finding 3: Nearly half (48%) of the partial failures cause some functionality to be stuck.
    Liveness violations are straightforward to detect.
    Finding 4: In 13% of the studied cases, a module became a “zombie” with undefined failure semantics.
    Finding 5: 15% of the partial failures are silent (including data loss, corruption, inconsistency, and wrong results).
  8. Findings 6-7

    Finding 6: 71% of the failures are triggered by some specific environment condition, input, or faults in other processes.
    Hard to expose with testing => need runtime checking.
    Finding 7: The majority (68%) of the failures are “sticky”: the process will not recover from the faults by itself.
  9. Current Checkers

    • Probe checkers
      • Execute external API to detect issues
    • Signal checkers
      • Monitor health indicators provided by the system
  10. Issues with Current Checkers

    • Probe checkers
      • Large API surface can’t be covered with probes
      • Partial failures might not be observable at the API level
    • Signal checkers
      • Susceptible to environment noise
      • Poor accuracy
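The two checker styles and their weaknesses can be sketched as follows. This is a toy illustration under stated assumptions: `KeyValueStore`, its `compaction_stuck` flag, the `heap_used_ratio` indicator, and the 0.90 threshold are all hypothetical, not taken from any real system.

```python
class KeyValueStore:
    """Toy system under monitoring (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.compaction_stuck = False  # internal fault, invisible at the API
        self.heap_used_ratio = 0.40    # health indicator exposed by the system

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def probe_check(store):
    """Probe checker: act as a client and exercise the external API."""
    store.put("__probe__", "1")
    return store.get("__probe__") == "1"


def signal_check(store, threshold=0.90):
    """Signal checker: compare a health indicator against a threshold."""
    return store.heap_used_ratio < threshold


store = KeyValueStore()
store.compaction_stuck = True  # partial failure in a background module

probe_ok = probe_check(store)      # passes: the fault is not visible via put/get
signal_ok = signal_check(store)    # passes: the indicator is unrelated to the fault
store.heap_used_ratio = 0.95       # transient spike, i.e. environment noise
signal_noisy = signal_check(store) # fails although this spike is not the fault
print(probe_ok, signal_ok, signal_noisy)
```

Both checkers miss the stuck module, and the signal checker additionally raises an alarm for a reason unrelated to it, which mirrors the coverage and accuracy issues listed above.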
  11. Mimic Checkers

    • Mimic-style checkers select some representative operations from each module of the main program, imitate them, and detect errors
    • Can pinpoint the faulty module and failing instructions
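A minimal sketch of the mimic idea, assuming the representative operation of a hypothetical "snapshot" module is "write a file and fsync it". The function names and the report format are illustrative only.

```python
import os
import tempfile


def mimic_snapshot_write(directory, payload):
    """Imitate the module's representative operation: write and fsync a file."""
    path = os.path.join(directory, "watchdog_probe.tmp")
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)


def run_mimic_checker(module, operation, *args):
    """Run one mimicked operation and report where it failed, if it did."""
    report = {"module": module, "operation": operation.__name__,
              "healthy": True, "error": None}
    try:
        operation(*args)
    except OSError as exc:
        report["healthy"] = False
        report["error"] = type(exc).__name__
    return report


with tempfile.TemporaryDirectory() as good_dir:
    ok_report = run_mimic_checker("snapshot", mimic_snapshot_write, good_dir, b"x")
    # Simulated environment fault: the target directory does not exist.
    missing_dir = os.path.join(good_dir, "missing")
    bad_report = run_mimic_checker("snapshot", mimic_snapshot_write, missing_dir, b"x")

print(ok_report)
print(bad_report)
```

Because the checker runs the same kind of operation the module runs, a failure report names the module and the operation rather than only the process.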
  12. Intrinsic Watchdog

    • Synchronizes state with the main program via hooks in the program
    • Executes concurrently with the main program
    • Lives in the same address space as the main program
    • Generated automatically
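The first three properties can be sketched in a few lines of Python. This is a hand-written illustration, not generated OmegaGen output: `WatchdogContext`, `serialize`, and `checker` are hypothetical names, and skipping a check when no state has been captured yet is an assumption of this sketch.

```python
import threading


class WatchdogContext:
    """Holds the latest arguments captured from the main program by a hook."""

    def __init__(self):
        self._lock = threading.Lock()
        self._args = None

    def set(self, **args):
        # Called by the hook inside the main program.
        with self._lock:
            self._args = dict(args)

    def get(self):
        # Called by the watchdog; returns a shallow copy of the captured state.
        with self._lock:
            return None if self._args is None else dict(self._args)


context = WatchdogContext()
reports = []


def serialize(record):
    """Main-program operation with a hook that syncs state to the watchdog."""
    context.set(record=record)  # hook: one-way state synchronization
    return repr(record).encode()


def checker():
    """Watchdog check: replay the operation on the captured state."""
    args = context.get()
    if args is None:
        reports.append("skipped: context not ready")
        return
    try:
        repr(args["record"]).encode()
        reports.append("ok")
    except Exception as exc:
        reports.append(f"fault in serialize: {type(exc).__name__}")


# The watchdog runs in its own thread inside the same process.
first = threading.Thread(target=checker)
first.start()
first.join()

serialize({"id": 1})  # the main program runs and the hook fires

second = threading.Thread(target=checker)
second.start()
second.join()
print(reports)
```

Sharing an address space is what makes the hook cheap: the watchdog reads the captured arguments directly instead of going through an external API.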

  14. Generating Watchdogs

    • Identify long-running methods (1)
    • Locate vulnerable operations (2)
    • Reduce main program (3)
    • Encapsulate reduced program with context factory and hooks (4)
    • Add checks to catch faults (5)
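Steps (1) to (3) can be approximated with a toy static analysis over Python source. This is only an analogy for the pipeline listed above, not how OmegaGen itself is implemented: the `while True` heuristic for "long-running", the `VULNERABLE` name list, and the sample `SOURCE` are all assumptions made for illustration.

```python
import ast

SOURCE = '''
def sync_loop(sock, log):
    while True:
        data = sock.recv(4096)
        total = len(data) + 1
        log.write(data)

def helper(x):
    return x + 1
'''

# Assumed list of method names treated as vulnerable (I/O, synchronization).
VULNERABLE = {"recv", "write", "acquire"}


def is_long_running(func):
    """Step 1 (toy heuristic): the function contains a `while True` loop."""
    return any(
        isinstance(node, ast.While)
        and isinstance(node.test, ast.Constant)
        and node.test.value is True
        for node in ast.walk(func)
    )


def vulnerable_calls(func):
    """Step 2: method calls in the function whose name is in VULNERABLE."""
    calls = [
        node for node in ast.walk(func)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in VULNERABLE
    ]
    calls.sort(key=lambda call: call.lineno)
    return [(call.lineno, call.func.attr) for call in calls]


# Step 3: the "reduced program" keeps only the vulnerable operations
# of long-running functions; everything else is dropped.
tree = ast.parse(SOURCE)
plan = {
    func.name: vulnerable_calls(func)
    for func in tree.body
    if isinstance(func, ast.FunctionDef) and is_long_running(func)
}
print(plan)
```

Here `helper` is dropped because it is not long-running, and `len(data) + 1` is dropped because it is not on the vulnerable list, so only the `recv` and `write` calls would be wrapped with a context and checks in steps (4) and (5).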

  16. Validate Impact of Caught Faults

    • Runs a validation step to reduce false alarms
    • Default validation is to re-run the check
    • Supports manually written validation
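A minimal sketch of the validation step, assuming that a check signals a fault by raising an exception and that a validator signals "no real impact" by returning normally. The function names and the three result labels are hypothetical.

```python
def check_with_validation(check, validate=None):
    """Run a check; on failure, run a validation step before raising an alarm."""
    try:
        check()
        return "healthy"
    except Exception:
        pass
    # Default validation: re-run the same check once.
    validator = validate or check
    try:
        validator()
        return "transient"  # the fault did not reproduce, suppress the alarm
    except Exception:
        return "alarm"


attempts = {"n": 0}


def flaky_check():
    """Fails only on its first invocation (simulated transient fault)."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("slow disk")


def broken_check():
    """Fails on every invocation (simulated sticky fault)."""
    raise OSError("disk gone")


def manual_validator():
    """Hypothetical hand-written validator that finds no real impact."""
    return None


results = (
    check_with_validation(flaky_check),
    check_with_validation(broken_check),
    check_with_validation(broken_check, validate=manual_validator),
)
print(results)
```

The re-run default filters out faults that do not reproduce, while a hand-written validator lets the operator encode what "impact" means for a specific check.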
  17. Preventing Side Effects

    • Redirect I/O for writes
    • Idempotent wrappers for reads
    • Rewrite socket operations as ping
    • If the I/O goes to another large system => better to apply OmegaGen to that system
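Write redirection, the first item above, can be sketched as follows. The path-mapping scheme (a watchdog-private directory and a `.wd` suffix) is an assumption of this example, not the scheme used by OmegaGen.

```python
import os
import tempfile


def redirect_write_path(original_path, watchdog_dir):
    """Map a main-program file path to a watchdog-private path."""
    return os.path.join(watchdog_dir, os.path.basename(original_path) + ".wd")


def checked_write(original_path, payload, watchdog_dir):
    """Replay a write against the redirected file instead of the real one."""
    target = redirect_write_path(original_path, watchdog_dir)
    with open(target, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return target


with tempfile.TemporaryDirectory() as root:
    # File owned by the main program.
    real_log = os.path.join(root, "txn.log")
    with open(real_log, "wb") as f:
        f.write(b"committed")

    # Watchdog-private directory on the same storage.
    wd_dir = os.path.join(root, "watchdog")
    os.mkdir(wd_dir)

    target = checked_write(real_log, b"probe", wd_dir)
    with open(real_log, "rb") as f:
        real_contents = f.read()
    redirected = target != real_log and os.path.exists(target)

print(real_contents, redirected)
```

The check still exercises the same storage path as the main program, so a stuck or failing disk is observed, but the main program's data is never modified.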
  18. Questions

    • Does our approach work for large software?
    • Can the generated watchdogs detect and localize diverse forms of real-world partial failures?
    • Do the watchdogs provide strong isolation?
    • Do the watchdogs report false alarms?
    • What is the runtime overhead to the main program?
  19. Detection

    • Collected and reproduced 22 real-world failures in six systems
    • Built-in (baseline) detectors did not detect any partial failures
    • Detected 20 out of 22 partial failures with a median detection time of 5 seconds
    • Highly effective against liveness issues: deadlocks, indefinite blocking
    • Effective against explicit safety issues: exceptions, errors
  20. Localization

    • Directly pinpoints the faulty instruction for 55% (11/20) of the detected cases
    • For 35% (7/20) of the detected cases, localizes either to some program point within the same function or to some function along the call chain
    • Probe or signal detectors can only pinpoint the faulty process
  21. False Alarms

    • The false alarm ratio is the total number of false failure reports divided by the total number of check executions
    • The watchdogs and baseline detectors are all configured to run checks every second
    • Can the false alarm ratio be traded for detection time? (Median detection time is 5 seconds)
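The ratio defined above is a single division. The sketch below uses made-up counts purely to show the arithmetic; they are not results from the paper.

```python
def false_alarm_ratio(false_reports, check_executions):
    """False failure reports divided by the total number of check executions."""
    if check_executions <= 0:
        raise ValueError("check_executions must be positive")
    return false_reports / check_executions


# Hypothetical numbers: one check per second for one hour, 9 false reports.
executions = 60 * 60
ratio = false_alarm_ratio(9, executions)
print(f"{ratio:.2%}")  # prints 0.25%
```

Because the denominator counts check executions, running checks more often lowers the ratio for the same number of false reports, which is one reason the ratio and the detection time have to be read together.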

  24. Conclusions

    • Study of 100 real-world partial failures in popular software
    • OmegaGen to generate watchdogs from code
    • Generated watchdogs detect 20/22 partial failures and pinpoint scope in 18/20 cases
    • Exposed a new partial failure in ZooKeeper
  25. References

    • Self reference for this talk (slides, video, etc.)
    • "Understanding, Detecting and Localizing Partial Failures in Large System Software" paper
    • Talk at NSDI 2020
    • Post from The Morning Paper blog