Understanding, Detecting and Localizing
Partial Failures in Large System Software
By Chang Lou, Peng Huang, and Scott Smith
Presented by Andrey Satarin, @asatarin
• Understanding Partial Failures
• Catching Partial Failures with Watchdogs
• Generating Watchdogs with OmegaGen
Understanding Partial Failures
A partial failure — a failure in a process P when a fault does not crash P, but
causes safety or liveness violation or severe slowness for some functionality
• It’s process level, not node level
• Process is still alive, this is not a fail-stop failure
• Could be missed by usual health checks
• Can lead to catastrophic outage
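A minimal sketch of what such a failure can look like, assuming a toy `Server` class that is not from the paper: one background thread blocks forever on a resource (here a hypothetical `disk_ready` event that is never set), while the process stays alive and its heartbeat keeps answering.

```python
import queue
import threading
import time


class Server:
    """Toy process with two modules: a heartbeat and a background flusher."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.flushed = []
        # Never set: stands in for a hung disk or an unresponsive peer.
        self.disk_ready = threading.Event()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def _flush_loop(self):
        while True:
            item = self.tasks.get()
            self.disk_ready.wait()  # blocks forever; the process does not crash
            self.flushed.append(item)

    def submit(self, item):
        self.tasks.put(item)

    def heartbeat(self):
        # Typical health check: answers as long as the process is alive.
        return "ok"


server = Server()
server.submit("record-1")
time.sleep(0.2)            # give the flusher time to pick up the task and hang
print(server.heartbeat())  # the health check still passes
print(server.flushed)      # but nothing was ever flushed
```

The heartbeat reports `ok` while the flushing functionality is stuck, which is the kind of failure a process-level liveness check misses.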
• How do partial failures manifest in modern systems?
• How to systematically detect and localize partial failures at runtime?
Finding 1: In all the five systems, partial failures appear throughout release
history (Table 1). 54% of them occur in the most recent three years’ software releases.
Finding 2: The root causes of studied failures are diverse. The top three
(total 48%) root cause types are uncaught errors, indefinite blocking, and
buggy error handling.
Finding 3: Nearly half (48%) of the partial failures cause some functionality to be stuck.
Liveness violations are straightforward to detect
Finding 4: In 13% of the studied cases, a module became a “zombie” with
undefined failure semantics.
Finding 5: 15% of the partial failures are silent (including data loss,
corruption, inconsistency, and wrong results).
Finding 6: 71% of the failures are triggered by some specific
condition, input, or faults in other processes.
Hard to expose with testing => need runtime checking
Finding 7: The majority (68%) of the failures are “sticky” — the process will
not recover from the faults by itself.
Catching Partial Failures with Watchdogs
• Probe checkers
• Execute external API to detect issues
• Signal checkers
• Monitor health indicator provided by the system
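The two checker styles above can be sketched roughly as follows. The function names, the in-memory `store`, and the indicator names and thresholds are illustrative assumptions, not part of any system studied in the paper.

```python
from typing import Callable, Dict


def probe_checker(api_call: Callable[[], object]) -> bool:
    """Probe checker: exercise an external API and report whether it worked."""
    try:
        api_call()
        return True
    except Exception:
        return False


def signal_checker(indicators: Dict[str, float], limits: Dict[str, float]) -> bool:
    """Signal checker: compare reported health indicators against thresholds."""
    return all(indicators.get(name, 0.0) <= limit for name, limit in limits.items())


# Hypothetical key-value store used as the probed API.
store = {"probe-key": "probe-value"}

probe_ok = probe_checker(lambda: store["probe-key"])
probe_bad = probe_checker(lambda: store["missing-key"])  # KeyError -> unhealthy

# Hypothetical indicators exported by the system.
signals_ok = signal_checker({"queue_length": 12, "heap_used_ratio": 0.4},
                            {"queue_length": 100, "heap_used_ratio": 0.9})
signals_bad = signal_checker({"queue_length": 450, "heap_used_ratio": 0.4},
                             {"queue_length": 100, "heap_used_ratio": 0.9})
```

Both checkers observe the system from outside: the probe sees only what the API exposes, and the signal checker sees only what the indicators report.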
Issues with Current Checkers
• Probe checkers
• Large API surface can’t be covered with probes
• Partial failures might not be observable at the API level
• Signal checkers
• Susceptible to environment noise
• Poor accuracy
• Mimic-style checkers: select some representative operations from each
module of the main program, imitate them, and detect errors
• Can pinpoint the faulty module and failing instructions
• Synchronizes state with the main program via hooks in the program
• Executes concurrently with the main program
• Lives in the same address space as the main program
• Generated automatically
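A minimal sketch of the mimic-checker idea, assuming hypothetical `Context` and `Checker` classes (OmegaGen itself generates watchdogs for the target system's own code; this is only an illustration of the structure). A hook in the main program copies the arguments of a vulnerable operation into the context, and the checker replays an imitation of that operation in a separate thread with a timeout, so that both errors and hangs are reported together with the name of the mimicked operation.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Context:
    """State copied from the main program by a hook (hypothetical design)."""
    args: Optional[tuple] = None
    lock: threading.Lock = field(default_factory=threading.Lock)

    def set(self, *args):
        # Called from the hook inserted into the main program.
        with self.lock:
            self.args = args

    def get(self) -> Optional[tuple]:
        with self.lock:
            return self.args


@dataclass
class Checker:
    name: str                       # module/operation this checker mimics
    operation: Callable[..., None]  # imitation of the vulnerable operation
    context: Context
    timeout: float = 1.0

    def run(self) -> Optional[str]:
        """Return None if healthy, otherwise a description of the failure."""
        args = self.context.get()
        if args is None:
            return None  # context not ready: the hook has not fired yet
        errors: List[str] = []

        def target():
            try:
                self.operation(*args)
            except Exception as exc:
                errors.append(f"{self.name}: {type(exc).__name__}: {exc}")

        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        worker.join(self.timeout)
        if worker.is_alive():
            return f"{self.name}: timed out after {self.timeout}s"
        return errors[0] if errors else None


def failing_write(path):
    raise OSError(f"cannot write {path}")


ctx = Context()
checker = Checker("SnapshotWriter.write", failing_write, ctx)
before_hook = checker.run()  # None: no context captured yet
ctx.set("/tmp/snapshot")     # the hook supplies the real argument
after_hook = checker.run()   # reports the failing operation by name
```

Because each checker is tied to one named operation, a failure report already says where in the program the problem is, rather than only which process is unhealthy.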
Generating Watchdogs with OmegaGen
• Identify long-running methods (1)
• Locate vulnerable operations (2)
• Reduce main program (3)
• Encapsulate reduced program with context factory and hooks (4)
• Add checks to catch faults (5)
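Steps (1) to (3) can be illustrated with a toy reduction over Python source using the standard `ast` module. This is only an analogy under stated assumptions: the heuristic "a function with a `while` loop is long-running" and the `VULNERABLE` set of call names are made up for the example and are not OmegaGen's actual analysis.

```python
import ast

SOURCE = '''
def serve(sock, log):
    while True:
        request = sock.recv(1024)
        count = len(request)
        log.write(request)

def parse_config(path):
    return open(path).read()
'''

# Hypothetical list of operations considered vulnerable (I/O, locking).
VULNERABLE = {"recv", "write", "acquire"}


def long_running_functions(tree):
    """Step 1 (toy heuristic): functions that contain a `while` loop."""
    return [fn for fn in ast.walk(tree)
            if isinstance(fn, ast.FunctionDef)
            and any(isinstance(node, ast.While) for node in ast.walk(fn))]


def vulnerable_calls(fn):
    """Step 2: method calls inside `fn` whose name is in VULNERABLE."""
    return [node.func.attr for node in ast.walk(fn)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in VULNERABLE]


# Step 3: the "reduced program" keeps only the vulnerable operations
# of long-running functions; everything else is dropped.
tree = ast.parse(SOURCE)
reduced = {fn.name: vulnerable_calls(fn) for fn in long_running_functions(tree)}
print(reduced)
```

Here `parse_config` is dropped because it runs once, and `len(request)` is dropped because it is pure computation; only `recv` and `write` in `serve` remain as candidates for checking.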
Validate Impact of Caught Faults
• Runs validation step to reduce false alarms
• Default validation is to re-run the check
• Supports manually written validation
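The validation step can be sketched as a small wrapper. The name `validated_report` and the convention that a check returns `None` when healthy are assumptions made for this example.

```python
from typing import Callable, Optional

Check = Callable[[], Optional[str]]  # returns None if healthy, else a description


def validated_report(check: Check, validate: Optional[Check] = None) -> Optional[str]:
    """Report a failure only if the validation step also sees a failure."""
    failure = check()
    if failure is None:
        return None
    # Default validation: simply re-run the same check.
    confirm = validate if validate is not None else check
    return failure if confirm() is not None else None


def flaky():
    """Fails on the first call only, like a transient glitch."""
    flaky.calls += 1
    return "timeout" if flaky.calls == 1 else None


flaky.calls = 0

transient = validated_report(flaky)                 # suppressed on re-run
persistent = validated_report(lambda: "deadlock")   # confirmed on re-run
```

A transient fault disappears on the second run and is not reported, while a persistent one is confirmed and reported.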
Preventing Side Effects
• Redirect I/O for writes
• Idempotent wrappers for reads
• Re-write socket operations as ping
• If I/O goes to another large system => better to apply OmegaGen to that system
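Write redirection can be sketched as follows, assuming a hypothetical `.watchdog` suffix and a helper named `redirected_write`. The checker writes to a sibling file in the same directory, so it exercises the same storage path as the main program without touching the real data.

```python
import os
import tempfile

WATCHDOG_SUFFIX = ".watchdog"  # hypothetical naming convention


def redirected_write(original_path: str, data: bytes) -> str:
    """Write the checker's data to a shadow file next to the real one."""
    shadow_path = original_path + WATCHDOG_SUFFIX
    with open(shadow_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # force the write to reach the device
    return shadow_path


workdir = tempfile.mkdtemp()
real_file = os.path.join(workdir, "snapshot.bin")
with open(real_file, "wb") as handle:
    handle.write(b"production data")

shadow = redirected_write(real_file, b"probe")
```

If the disk is full or hung, the redirected write fails or blocks just as the real write would, while the production file is left unchanged.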
• Does our approach work for large software?
• Can the generated watchdogs detect and localize diverse forms of real-
world partial failures?
• Do the watchdogs provide strong isolation?
• Do the watchdogs report false alarms?
• What is the runtime overhead to the main program?
• Collected and reproduced 22 real-world failures in six systems
• Built-in (baseline) detectors did not detect any partial failures
• Detected 20 out of 22 partial failures with a median detection time
of 5 seconds
• Highly effective against liveness issues — deadlocks, indefinite blocking
• Effective against explicit safety issues — exceptions, errors
• Directly pinpoint the faulty instruction for 55% (11/20) of the detected cases
• For 35% (7/20) of detected cases, either localize to some program point
within the same function or some function along the call chain
• Probe or signal detectors can only pinpoint the faulty process
• The false alarm ratio is calculated from total false failure reports divided by
the total number of check executions.
• The watchdogs and baseline detectors are all configured to run checks
• Can false alarm ratio be traded for detection time? (Median detection time
is 5 seconds)
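The ratio defined above is simple arithmetic; a small helper makes the definition explicit. The counts in the usage line are made-up numbers for illustration, not results from the paper.

```python
def false_alarm_ratio(false_reports: int, check_executions: int) -> float:
    """Total false failure reports divided by total check executions."""
    if check_executions <= 0:
        raise ValueError("check_executions must be positive")
    return false_reports / check_executions


# Hypothetical counts: 3 false reports over 10,000 check executions.
ratio = false_alarm_ratio(3, 10_000)
print(ratio)
```

Because the denominator is the number of check executions, running checks more often lowers the ratio for the same number of false reports, which is worth keeping in mind when comparing detectors.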
• Study of 100 real-world partial failures in popular software
• OmegaGen to generate watchdogs from code
• Generated watchdogs detect 20/22 partial failures and pinpoint scope
in 18/20 cases
• Exposed new partial failure in ZooKeeper
• Follow me on Twitter @asatarin
• Self reference for this talk (slides, video, etc)
• "Understanding, Detecting and Localizing Partial Failures in Large System Software"
• Talk at NSDI 2020
• Post from The Morning Paper blog