“Formal Methods Have Been a Big Success” S3 & 10+ Core Pieces of Infrastructure Verified 2 Serious Bugs Found Increased Confidence to make Optimizations Use of Formal Methods at Amazon Web Services TLA+
Leslie Lamport, Specifying Systems “Its a good idea to understand a system before building it, so its a good idea to write a specification of a system before implementing it” TLA+
Error Handling Code is simply empty or only contains a Log statement Error Handler aborts cluster on an overly general exception Error Handler contains comments like FIXME or TODO 35% of Catastrophic Failures Simple Testing can Prevent Most Critical Failures
-The Verification of a Distributed System “Without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure modes”
Kyle has used this tool to show us that many of the Distributed Systems we know seem stable but are really just this. (cut to tire fire photo) JEPSEN credit: @aphyr Fault Injection Tool that simulates network partitions in the system under test
Kyle has used this tool to show us that many of the Distributed Systems we know seem stable but are really just this. (cut to tire fire photo) JEPSEN credit: @aphyr Fault Injection Tool that simulates network partitions in the system under test
How to Run a GameDay 1. Notify Engineering Teams that Failure is Coming 2. Induce Failures 3. Monitor Systems Under Test 4. Observing Only Team Monitors Recovery Processes & Systems, Files Bugs 5. Prioritize Bugs & Get Buy-In Across Teams Resilience Engineering: Learning to Embrace Failure
Game Day at Stripe “During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster” Game Day Exercises at Stripe: Learning from `kill -9`
Probability of failure Rank Catastrophic Failures Classical engineering Reactive ops unk-unk Verifying System Complexity Architectural Patterns of Resilient Distributed Systems
Probability of failure Rank Unit & Integration Tests Classical engineering Reactive ops unk-unk Verifying System Complexity Architectural Patterns of Resilient Distributed Systems
Probability of failure Rank Unit & Integration Tests Property Based Testing Classical engineering Reactive ops unk-unk Verifying System Complexity Architectural Patterns of Resilient Distributed Systems
Probability of failure Rank Unit & Integration Tests Property Based Testing Fault Injection Testing Classical engineering Reactive ops unk-unk Verifying System Complexity Architectural Patterns of Resilient Distributed Systems
Probability of failure Rank Unit & Integration Tests Property Based Testing Fault Injection Testing Canaries Game Days Monitoring Classical engineering Reactive ops unk-unk Verifying System Complexity Architectural Patterns of Resilient Distributed Systems
Research Improving the Verification of Distributed Systems Lineage Driven Fault Injection ‘Cause I’m Strong Enough: Reasoning about Consistency Choices in Distributed Systems IronFleet: Proving Practical Distributed Systems Correct
Conclusion Use Formal Verification on Critical Components Unit Tests & Integration Tests find a multitude of Errors Increase Confidence via Property Testing & Fault Injection