Insights from Cook’s model
responding, adapting, and learning
System safety is about what can happen, where the operating point actually is, and what we do under pressure
Strategies to build resilience
Classical engineering / Reactive Operations / Unknown-Unknown
Code standards
Programming patterns
Full system testing
Metrics & monitoring
Convergence to good state
Hazard inventories
Redundancies
Feature flags
Dark deploys
Runbooks & docs
Canaries
System verification
Formal methods
Fault injection
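As a rough illustration of the fault injection strategy named on this slide, the sketch below wraps a dependency call so that a configurable fraction of calls fail, then exercises a caller's retry logic against it. All names here (`with_fault_injection`, `call_with_retries`, `fetch_config`, `InjectedFault`) are hypothetical and not from the talk; this is a minimal sketch, not a substitute for a real fault injection tool.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times; re-raise the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_config():
    # Hypothetical stand-in for a real dependency call.
    return {"replicas": 3}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a run is repeatable
    flaky = with_fault_injection(fetch_config, failure_rate=0.5, rng=rng)
    print(call_with_retries(flaky, attempts=5))
```

Passing the random generator in explicitly keeps the injected failures reproducible, which makes it easier to replay a failing scenario.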
Key insights from Chubby
of small data files with restricted operations
Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future.
They also don’t understand Distributed Systems
but you can dedicate effort to architecting them well and making them failure-tolerant
Restricting user behavior increased resilience
Consumers of your service are part of your UNK-UNK scenarios
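To make "restricting user behavior" concrete, here is a minimal sketch of a store that only allows small, whole-file reads and writes. The class name `SmallFileStore`, the `max_bytes` parameter, and the 8-byte cap in the usage below are illustrative assumptions, not Chubby's actual interface or limits.

```python
class FileTooLarge(Exception):
    """Raised when a write exceeds the store's size cap."""


class SmallFileStore:
    """Hypothetical store exposing only whole-file reads and writes."""

    def __init__(self, max_bytes):
        self._max_bytes = max_bytes
        self._files = {}

    def write(self, name, data):
        # Reject oversized payloads up front rather than degrading later.
        if len(data) > self._max_bytes:
            raise FileTooLarge(
                f"{name}: {len(data)} bytes exceeds cap of {self._max_bytes}"
            )
        self._files[name] = bytes(data)

    def read(self, name):
        # Whole-file read only; there is no partial read or append.
        return self._files[name]


if __name__ == "__main__":
    store = SmallFileStore(max_bytes=8)
    store.write("leader", b"node-1")
    print(store.read("leader"))
```

The narrow surface is the point: with no append or partial-write path, consumers have fewer ways to put the service into a state its operators did not anticipate.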
Are we resilient to our dependencies?
Use orthogonality & decomposition
Theory matters!
Am I providing enough control to my operators?
Operators impact resilience
Narrowing your API helps
The existence of this stresses diligence on the other two areas
tl;dr
The goal is to build failure domain independence