Understanding Partial Failures in Large Systems

Understanding, Detecting and Localizing Partial Failures in Large System Software
By Chang Lou, Peng Huang, and Scott Smith
Presented by Andrey Satarin, @asatarin
May, 2022
https://asatarin.github.io/talks/2022-05-understanding-partial-failures/

Outline
• Understanding Partial Failures
• Catching Partial Failures with Watchdogs
• Generating Watchdogs with OmegaGen
• Evaluation
• Conclusions

Understanding Partial Failures

Partial Failure
A partial failure is a failure in a process P when a fault does not crash P, but causes a safety or liveness violation or severe slowness for some functionality.
• It’s process level, not node level
• The process is still alive; this is not a fail-stop failure
• Could be missed by usual health checks
• Can lead to a catastrophic outage
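
A minimal Python sketch of why a usual health check can miss a partial failure. The Service class, its heartbeat and write methods, and the injected fault are all hypothetical names made up for illustration; they are not taken from the paper or from any real system.

```python
import threading


class Service:
    """Toy process with a liveness endpoint and one piece of functionality."""

    def __init__(self) -> None:
        # Guards the (hypothetical) write path.
        self._lock = threading.Lock()

    def heartbeat(self) -> bool:
        # Typical health check: only proves the process can still respond.
        return True

    def write(self, timeout: float = 0.1) -> bool:
        # The write path needs the lock. If the lock is never released,
        # this functionality is stuck even though the process is alive.
        if not self._lock.acquire(timeout=timeout):
            return False
        try:
            return True
        finally:
            self._lock.release()

    def inject_fault(self) -> None:
        # Simulates a bug: the lock is taken and never released.
        self._lock.acquire()
```

After inject_fault(), heartbeat() still returns True while write() times out: the process is alive, but one of its functions has lost liveness.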

Failure Hierarchy
• Fail-stop
• Omission failure
• Fail-recover
• Byzantine failure
• Partial failure

Questions
• How do partial failures manifest in modern systems?
• How to systematically detect and localize partial failures at runtime?

Findings 1-2
Finding 1: In all the five systems, partial failures appear throughout release history (Table 1). 54% of them occur in the most recent three years’ software releases.
Finding 2: The root causes of studied failures are diverse. The top three (total 48%) root cause types are uncaught errors, indefinite blocking, and buggy error handling.

Findings 3-5
Finding 3: Nearly half (48%) of the partial failures cause some functionality to be stuck.
• Liveness violations are straightforward to detect
Finding 4: In 13% of the studied cases, a module became a “zombie” with undefined failure semantics.
Finding 5: 15% of the partial failures are silent (including data loss, corruption, inconsistency, and wrong results).

Findings 6-7
Finding 6: 71% of the failures are triggered by some specific environment condition, input, or faults in other processes.
• Hard to expose with testing, so runtime checking is needed
Finding 7: The majority (68%) of the failures are “sticky”: the process will not recover from the faults by itself.

Catching Partial Failures with Watchdogs

Current Checkers
• Probe checkers
  • Execute external API to detect issues
• Signal checkers
  • Monitor health indicator provided by the system

Issues with Current Checkers
• Probe checkers
  • Large API surface can’t be covered with probes
  • Partial failures might not be observable at the API level
• Signal checkers
  • Susceptible to environment noise
  • Poor accuracy
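
The two checker styles and the probe blind spot can be sketched in a few lines of Python. The ToyStore class, its methods, and the threshold values are hypothetical and exist only to make the example runnable; real probe and signal checkers are specific to each system.

```python
from typing import Callable


def probe_check(api_call: Callable[[], object]) -> bool:
    """Probe checker: call an external API and report whether it succeeded."""
    try:
        api_call()
        return True
    except Exception:
        return False


def signal_check(indicator: float, threshold: float) -> bool:
    """Signal checker: healthy while the indicator stays at or below a threshold."""
    return indicator <= threshold


class ToyStore:
    """Hypothetical store: reads work, internal compaction is broken."""

    def __init__(self) -> None:
        self.data = {"k": "v"}

    def get(self, key: str) -> str:
        # Client-facing API, reachable by a probe.
        return self.data[key]

    def compact(self) -> None:
        # Internal background task, not exposed to clients.
        raise IOError("compaction failed")
```

A probe such as probe_check(lambda: store.get("k")) reports healthy even though compact() fails on every run, because compaction has no client API to probe. The signal checker depends entirely on the chosen threshold, which is where environment noise hurts accuracy.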

Mimic Checkers
• Mimic-style checkers select some representative operations from each module of the main program, imitate them, and detect errors
• Can pinpoint the faulty module and failing instructions
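
As a rough illustration of the mimic idea, the sketch below runs a few representative operations per module and records which module and operation failed. The function name, the report format, and the module and operation names in the usage are assumptions for this example, not the paper's design.

```python
from typing import Callable, Dict, List, Tuple

# module name -> list of (operation name, callable that imitates the operation)
Checkers = Dict[str, List[Tuple[str, Callable[[], None]]]]


def run_mimic_checkers(checkers: Checkers) -> List[Tuple[str, str, str]]:
    """Run every imitated operation and report (module, operation, error type)."""
    reports: List[Tuple[str, str, str]] = []
    for module, operations in checkers.items():
        for op_name, op in operations:
            try:
                op()
            except Exception as exc:
                # The report names the module and operation, not just the process.
                reports.append((module, op_name, type(exc).__name__))
    return reports
```

Because each checker is tied to one operation of one module, a failure report already carries the location, which is what a process-level probe cannot provide.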

Intrinsic Watchdog
• Synchronizes state with the main program via hooks in the program
• Executes concurrently with the main program
• Lives in the same address space as the main program
• Generated automatically
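
The properties above can be sketched as a background thread that shares a context object with the main program. Context, Watchdog, and their methods are hypothetical names for this sketch; skipping checks until a hook has supplied state is an assumption of the example, and the watchdogs in the talk are generated automatically rather than written by hand.

```python
import threading
from typing import Any, Callable, Dict, List


class Context:
    """State copied from the main program by hooks placed in its code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, Any] = {}

    def hook(self, **values: Any) -> None:
        # Called from the main program to synchronize state.
        with self._lock:
            self._values.update(values)

    def snapshot(self) -> Dict[str, Any]:
        # The watchdog works on a copy, not on the live objects.
        with self._lock:
            return dict(self._values)


class Watchdog:
    """Runs a checker periodically in a thread of the same process."""

    def __init__(self, context: Context,
                 checker: Callable[[Dict[str, Any]], None],
                 interval: float = 1.0) -> None:
        self.context = context
        self.checker = checker
        self.interval = interval
        self.failures: List[str] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def run_once(self) -> None:
        state = self.context.snapshot()
        if not state:
            return  # no hook has fired yet, nothing to check
        try:
            self.checker(state)
        except Exception as exc:
            self.failures.append(repr(exc))

    def _run(self) -> None:
        # Event.wait returns False on timeout, so this loops until stop().
        while not self._stop.wait(self.interval):
            self.run_once()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

The main program would call context.hook(...) at the instrumented points and watchdog.start() once at startup.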

Generating Watchdogs with OmegaGen

Generating Watchdogs
• Identify long-running methods (1)
• Locate vulnerable operations (2)
• Reduce main program (3)
• Encapsulate reduced program with context factory and hooks (4)
• Add checks to catch faults (5)
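
To give a feel for step (2), here is a toy Python analogue that scans source code for calls whose names appear in a list of operations considered vulnerable. This is not OmegaGen's analysis: the VULNERABLE set and the name-matching heuristic are assumptions made up for this sketch.

```python
import ast
from typing import List, Tuple

# Hypothetical heuristic: call names treated as vulnerable operations.
VULNERABLE = {"write", "flush", "send", "acquire"}


def locate_vulnerable_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, call name) for each call that matches VULNERABLE."""
    found: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                name = func.attr          # method call such as out.write(...)
            else:
                name = getattr(func, "id", None)  # plain call such as send(...)
            if name in VULNERABLE:
                found.append((node.lineno, name))
    return sorted(found)
```

The located calls are the candidates that would be kept when the main program is reduced in step (3); everything else around them can be dropped from the checker.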

Validate Impact of Caught Faults
• Runs validation step to reduce false alarms
• Default validation is to re-run the check
• Supports manually written validation
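
The validation step can be sketched as a small wrapper around a check. The function name and the convention that a check returns True when healthy are assumptions of this example.

```python
from typing import Callable, Optional


def check_with_validation(check: Callable[[], bool],
                          validate: Optional[Callable[[], bool]] = None) -> bool:
    """Return True only if a failure should be reported.

    check and validate return True when the checked operation is healthy.
    """
    if check():
        return False  # nothing caught, nothing to validate
    # Default validation: simply re-run the same check.
    validator = validate if validate is not None else check
    return not validator()
```

A transient fault that disappears on the second run is not reported, while a persistent one is. Passing a custom validate callable corresponds to manually written validation.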

Preventing Side Effects
• Redirect I/O for writes
• Idempotent wrappers for reads
• Re-write socket operations as ping
• If the I/O goes to another large system, it is better to apply OmegaGen to that system
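
Write redirection could look roughly like the sketch below: the checker performs the same write and sync calls as the main program, but against a scratch file. The function names, the ".watchdog" suffix, and the scratch directory layout are assumptions of this example, not details from the talk.

```python
import os


def redirect(original_path: str, scratch_dir: str) -> str:
    """Map the main program's file path to a watchdog-private scratch path."""
    return os.path.join(scratch_dir, os.path.basename(original_path) + ".watchdog")


def checked_write(original_path: str, payload: bytes, scratch_dir: str) -> None:
    """Exercise the write and sync path without touching the real file."""
    target = redirect(original_path, scratch_dir)
    with open(target, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # surfaces disk-level errors or hangs
    os.remove(target)         # leave no trace of the check
```

Any exception raised here, or a call that never returns, points at the write path while the main program's data stays untouched.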

Evaluation

Questions
• Does our approach work for large software?
• Can the generated watchdogs detect and localize diverse forms of real-world partial failures?
• Do the watchdogs provide strong isolation?
• Do the watchdogs report false alarms?
• What is the runtime overhead to the main program?

Detection
• Collected and reproduced 22 real-world failures in six systems
• Built-in (baseline) detectors did not detect any partial failures
• Detected 20 out of 22 partial failures with a median detection time of 5 seconds
• Highly effective against liveness issues: deadlocks, indefinite blocking
• Effective against explicit safety issues: exceptions, errors
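
One way a watchdog driver could tell the two classes of issues apart is to run each check under a timeout: a check that raises signals an explicit safety issue, a check that never returns signals a liveness issue. The function name and the result strings are assumptions of this sketch.

```python
import threading
from typing import Callable, Dict


def run_check(check: Callable[[], object], timeout: float) -> str:
    """Run a check in a helper thread and classify the outcome."""
    outcome: Dict[str, str] = {}

    def target() -> None:
        try:
            check()
            outcome["status"] = "ok"
        except Exception as exc:
            # Explicit safety issue: the operation raised an error.
            outcome["status"] = "error: " + type(exc).__name__

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # Liveness issue: the operation is blocked or too slow.
        return "timeout"
    return outcome["status"]
```

Silent failures such as wrong results produce neither an exception nor a timeout, so a driver like this would not catch them without an additional explicit check.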

Localization
• Directly pinpoint the faulty instruction for 55% (11/20) of the detected cases
• For 35% (7/20) of detected cases, either localize to some program point within the same function or some function along the call chain
• Probe or signal detectors can only pinpoint the faulty process
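
For failures that surface as exceptions, localization can be illustrated with the traceback of the failed check: the innermost frame gives the failing line, and the remaining frames give the call chain. The localize function and its result format are hypothetical, chosen for this sketch only.

```python
import traceback
from typing import Callable, Dict, Optional


def localize(check: Callable[[], object]) -> Optional[Dict[str, object]]:
    """Return the failing function, line, and call chain, or None if healthy."""
    try:
        check()
        return None
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        innermost = frames[-1]  # frame where the exception was raised
        return {
            "function": innermost.name,
            "line": innermost.lineno,
            # Outermost to innermost; the first entry is localize itself.
            "call_chain": [frame.name for frame in frames],
        }
```

The "function" and "line" fields correspond to pinpointing the faulty instruction, and "call_chain" to the weaker case of localizing to some function along the call chain.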

False Alarms
• The false alarm ratio is calculated as the total number of false failure reports divided by the total number of check executions
• The watchdogs and baseline detectors are all configured to run checks every second
• Can false alarm ratio be traded for detection time? (Median detection time is 5 seconds)
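
The ratio defined above is simple arithmetic. In the sketch below the function names are hypothetical, and the numbers in the usage note are made up to show the calculation; they are not results from the paper.

```python
def executions(duration_s: float, interval_s: float = 1.0) -> int:
    """Number of check executions in a period, given the check interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return int(duration_s // interval_s)


def false_alarm_ratio(false_reports: int, check_executions: int) -> float:
    """Total false failure reports divided by total check executions."""
    if check_executions <= 0:
        raise ValueError("check_executions must be positive")
    return false_reports / check_executions
```

With checks every second, one hour gives executions(3600) == 3600 runs, so an illustrative 3 false reports in that hour would be a ratio of 3 / 3600.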

Conclusions

Conclusions
• Study of 100 real-world partial failures in popular software
• OmegaGen to generate watchdogs from code
• Generated watchdogs detect 20/22 partial failures and pinpoint scope in 18/20 cases
• Exposed new partial failure in ZooKeeper

The End

Contacts
• Follow me on Twitter @asatarin
• https://www.linkedin.com/in/asatarin/
• https://asatarin.github.io/

References
• Self reference for this talk (slides, video, etc.)
• "Understanding, Detecting and Localizing Partial Failures in Large System Software" paper
• Talk at NSDI 2020
• Post from The Morning Paper blog
