PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Papers We Love New York - June 2016
Caitie McCaffrey @caitie
Distributed Systems Engineer
CaitieM.com
Analyzed Failures in Real World Systems
Complexity of Failures
“A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
“The specific order of events is important in 88% of the failures that require multiple events”
“3 Nodes or less can reproduce 98% of Failures”
Unit Tests
“A majority of production failures (77%) can be reproduced by a unit test”
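The slides above say that most failures need only a few input events, in a specific order, on a handful of nodes. A minimal sketch of what such a unit-level reproduction could look like follows. The `Node`, `Client`, and `replay` names are hypothetical and are not taken from the talk or the paper; the bug is a swallowed non-fatal error that only causes data loss when a write arrives after the node steps down.

```python
class NotLeaderError(Exception):
    """Non-fatal error: this node can no longer accept writes."""


class Node:
    """Hypothetical single node, small enough to drive from a unit test."""

    def __init__(self):
        self.is_leader = True
        self.log = []

    def step_down(self):
        self.is_leader = False

    def append(self, entry):
        if not self.is_leader:
            raise NotLeaderError(entry)
        self.log.append(entry)


class Client:
    """Hypothetical client whose error handler drops the failure (the bug)."""

    def __init__(self, node):
        self.node = node
        self.acked = []

    def write(self, entry):
        try:
            self.node.append(entry)
        except NotLeaderError:
            pass  # BUG: error swallowed, so the entry is still acknowledged below
        self.acked.append(entry)


def replay(events):
    """Apply events in order to one node; return entries acked but never stored."""
    node = Node()
    client = Client(node)
    for event in events:
        if event == "step_down":
            node.step_down()
        else:
            client.write(event)  # any other event is treated as a write
    return [entry for entry in client.acked if entry not in node.log]


def test_write_after_step_down_loses_data():
    # Three events on one node, in this order, trigger the failure.
    assert replay(["a", "step_down", "b"]) == ["b"]


def test_same_events_in_other_order_are_fine():
    assert replay(["a", "b", "step_down"]) == []
```

The two tests use the same three events and differ only in order, which is the property the slides highlight.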
Top Down Fault Injection & State Space Exploration is Expensive
Logging
• 76% of the failures print explicit failure-related error messages
• For 84% of the failures, all of the triggering events are logged
• Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling
• 92% of failures were the result of incorrect handling of non-fatal errors
• 58% of faults could have been detected via simple testing
• 35% of failures caused by bad practices in error handling code
Bad Practices
• Error Handling Code is simply empty or only contains a Log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Aspirator
Performs static analysis of Java bytecode to detect:
• error handler is empty
• error handler over-catches exceptions and aborts
• error handler contains phrases like “TODO” or “FIXME”
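To make the three checks concrete, here is a rough analogue for Python source built on the standard `ast` module. It is not Aspirator and does not analyze Java bytecode. The function name `check_handlers`, the list of abort calls, and the set of "overly general" exception types are all assumptions chosen for illustration. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

# Assumed definitions of "abort" and "overly general"; adjust to taste.
ABORT_CALLS = {"sys.exit", "os._exit", "os.abort", "exit", "quit"}
BROAD_TYPES = {"Exception", "BaseException"}
MARKERS = ("TODO", "FIXME")


def check_handlers(source):
    """Return sorted (line, finding) pairs for each suspicious except block."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # 1. Handler is empty: its body is nothing but `pass`.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            findings.append((node.lineno, "empty handler"))
        # 2. Handler catches a very general exception and then aborts.
        broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD_TYPES
        )
        aborts = any(
            isinstance(sub, ast.Call) and ast.unparse(sub.func) in ABORT_CALLS
            for stmt in node.body
            for sub in ast.walk(stmt)
        )
        if broad and aborts:
            findings.append((node.lineno, "over-catch and abort"))
        # 3. Handler text mentions TODO or FIXME. Comments are not in the AST,
        #    so scan the source lines from `except` to the last statement.
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if any(marker in text for marker in MARKERS):
            findings.append((node.lineno, "TODO/FIXME in handler"))
    return sorted(findings)


SAMPLE = '''
import sys

def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def run(step):
    try:
        step()
    except Exception:
        sys.exit(1)

def sync(peer):
    try:
        peer.push()
    except TimeoutError:
        # TODO: retry with backoff
        return None
'''

if __name__ == "__main__":
    for line, finding in check_handlers(SAMPLE):
        print(f"line {line}: {finding}")
```

Each function in `SAMPLE` trips exactly one of the three checks. A trailing comment placed after the last statement of a handler would be missed by the third check, which is one of several simplifications in this sketch.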
Aspirator Results
• 500 New Bugs & Bad Practices
• 115 False Positives
• 171 bugs reported
• 143 bugs confirmed or fixed
Developer Reactions
“I fail to see the reason to handle every exception” - developer
“It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
Moving Forward
• Use a tool like Aspirator that is capable of identifying trivial bugs
• Enforce code reviews of error handling code
• High code coverage on error handling code
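One way to raise coverage on error handling code is to inject the error from a test double and assert on what the handler does. The sketch below uses `unittest.mock` from the standard library. `save_with_retry` and `TransientStoreError` are hypothetical names invented for this example, and the retry policy is only an assumption about what a reasonable handler might do.

```python
from unittest import mock


class TransientStoreError(Exception):
    """Hypothetical non-fatal error raised by a storage backend."""


def save_with_retry(store, record, attempts=3):
    """Write a record, retrying transient errors and re-raising the last one."""
    last_error = None
    for _ in range(attempts):
        try:
            store.write(record)
            return True
        except TransientStoreError as err:
            last_error = err  # handled: remember it and try again
    raise RuntimeError(f"write failed after {attempts} attempts") from last_error


def test_recovers_after_one_injected_error():
    store = mock.Mock()
    # First call raises the injected error, second call succeeds.
    store.write.side_effect = [TransientStoreError("disk busy"), None]
    assert save_with_retry(store, {"id": 1}) is True
    assert store.write.call_count == 2


def test_surfaces_error_when_retries_run_out():
    store = mock.Mock()
    store.write.side_effect = TransientStoreError("disk busy")  # always fails
    try:
        save_with_retry(store, {"id": 1}, attempts=2)
    except RuntimeError as err:
        assert isinstance(err.__cause__, TransientStoreError)
    else:
        raise AssertionError("expected RuntimeError")
    assert store.write.call_count == 2
```

Both branches of the handler are exercised: the path where the retry succeeds and the path where the error has to be surfaced to the caller.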
Questions @caitie