By Chang Lou, Peng Huang, and Scott Smith
Presented by Andrey Satarin, @asatarin
May 2022
https://asatarin.github.io/talks/2022-05-understanding-partial-failures/
process P when a fault does not crash P, but causes a safety or liveness violation or severe slowness for some functionality
• It’s process level, not node level
• The process is still alive; this is not a fail-stop failure
• Could be missed by the usual health checks
• Can lead to a catastrophic outage
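To make the "still alive, but missed by health checks" point concrete, here is a minimal Python sketch. It is not taken from the paper or the talk: `ToyServer`, its `fault` flag, and the `call` helper are hypothetical names invented for illustration. A worker thread blocks indefinitely after a fault is injected, a liveness-style check keeps passing, and only a check that exercises the functionality notices the problem.

```python
import queue
import threading


class ToyServer:
    """Hypothetical single-worker server used only to illustrate a partial failure."""

    def __init__(self):
        self.requests = queue.Queue()
        self.fault = threading.Event()     # set() injects the fault
        self._never = threading.Event()    # never set: stands in for a lock that is never released
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def _serve(self):
        while True:
            reply = self.requests.get()
            if self.fault.is_set():
                self._never.wait()         # indefinite blocking; the thread does not crash
            reply.put("ok")

    def is_alive(self):
        # Liveness-style health check: only proves the worker thread exists.
        return self.worker.is_alive()

    def call(self, timeout=0.2):
        # Functional check: send a request and wait a bounded time for the reply.
        reply = queue.Queue(maxsize=1)
        self.requests.put(reply)
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            return None


if __name__ == "__main__":
    server = ToyServer()
    print("before fault:", server.is_alive(), server.call())
    server.fault.set()
    print("after fault: ", server.is_alive(), server.call())
```

After the fault is injected, `is_alive()` still returns `True` while `call()` times out, which is the gap between a fail-stop failure and a partial failure.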
failures appear throughout release history (Table 1). 54% of them occur in the most recent three years’ software releases.
Finding 2: The root causes of studied failures are diverse. The top three (total 48%) root cause types are uncaught errors, indefinite blocking, and buggy error handling.
failures cause some functionality to be stuck. Liveness violations are straightforward to detect.
Finding 4: In 13% of the studied cases, a module became a “zombie” with undefined failure semantics.
Finding 5: 15% of the partial failures are silent (including data loss, corruption, inconsistency, and wrong results).
by some specific environment condition, input, or faults in other processes. Hard to expose with testing => need runtime checking
Finding 7: The majority (68%) of the failures are “sticky”: the process will not recover from the faults by itself.
surface can’t be covered with probes
• Partial failures might not be observable at the API level
• Signal checkers
  • Susceptible to environment noise
  • Poor accuracy
Can the generated watchdogs detect and localize diverse forms of real-world partial failures?
• Do the watchdogs provide strong isolation?
• Do the watchdogs report false alarms?
• What is the runtime overhead to the main program?
systems
• Built-in (baseline) detectors did not detect any partial failures
• Detected 20 out of 22 partial failures with a median detection time of 5 seconds
• Highly effective against liveness issues: deadlocks, indefinite blocking
• Effective against explicit safety issues: exceptions, errors
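The two categories above (liveness issues versus explicit safety issues) can be illustrated with a small Python sketch of a timeout-based check runner. This is not the code OmegaGen generates and not its API; `run_check` and `Report` are hypothetical names, and the mapping "timeout means liveness, exception means safety" is a simplifying assumption made for the example.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Report:
    status: str                    # "ok", "liveness", or "safety"
    detail: Optional[str] = None


def run_check(check: Callable[[], object], timeout: float) -> Report:
    """Run one check in its own thread so a hang cannot block the caller."""
    outcome = {}

    def target():
        try:
            check()
            outcome["status"] = "ok"
        except Exception as exc:   # explicit safety issue: exception or error
            outcome["status"] = "safety"
            outcome["detail"] = repr(exc)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():          # liveness issue: the check never returned in time
        return Report("liveness", f"check did not finish within {timeout}s")
    return Report(outcome["status"], outcome.get("detail"))


def failing_check():
    raise ValueError("corrupted record")   # hypothetical safety violation


if __name__ == "__main__":
    print(run_check(lambda: None, timeout=0.5))
    print(run_check(failing_check, timeout=0.5))
    print(run_check(threading.Event().wait, timeout=0.1))   # blocks forever
```

Running each check in a separate daemon thread is one simple way to keep a hung check from hanging the checker itself; it is a design choice of this sketch, not a claim about how the watchdogs in the paper are isolated.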
of the detected cases
• For 35% (7/20) of detected cases, the failure is localized either to some program point within the same function or to some function along the call chain
• Probe or signal detectors can only pinpoint the faulty process
total false failure reports divided by the total number of check executions.
• The watchdogs and baseline detectors are all configured to run checks every second
• Can false alarm ratio be traded for detection time? (Median detection time is 5 seconds)
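The ratio above, and the trade-off raised in the last bullet, can be worked through in a short Python sketch. The function names and the sample traces are hypothetical, and the "report only after N consecutive failed checks" rule is an assumption introduced here to illustrate the trade-off; the slides do not say that the watchdogs work this way.

```python
from typing import Optional, Sequence


def false_alarm_ratio(false_reports: int, check_executions: int) -> float:
    """Total false failure reports divided by the total number of check executions."""
    if check_executions <= 0:
        raise ValueError("check_executions must be positive")
    return false_reports / check_executions


def first_report_index(failed: Sequence[bool], consecutive: int) -> Optional[int]:
    """Index of the check at which a failure is reported, if a report
    requires `consecutive` failed checks in a row; None if never reported."""
    streak = 0
    for i, is_failed in enumerate(failed):
        streak = streak + 1 if is_failed else 0
        if streak >= consecutive:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical traces, one check per second (True = check failed).
    transient_noise = [False, True, False, False]
    real_failure = [False, False, True, True, True]

    print(false_alarm_ratio(2, 8))
    for n in (1, 3):
        print(n, first_report_index(transient_noise, n),
              first_report_index(real_failure, n))
```

With these made-up traces, requiring three consecutive failures suppresses the transient blip entirely but delays the report of the real failure from check index 2 to index 4, that is, two seconds later at one check per second.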
software
• OmegaGen to generate watchdogs from code
• Generated watchdogs detect 20/22 partial failures and pinpoint scope in 18/20 cases
• Exposed a new partial failure in ZooKeeper