Things you may see in this talk: pugs, fast talking, life pondering, un-tweetable moments, rantifestos, what surprised me this year, wedding factoids and trivia.
to make the worst-case & average-case the same. Replication of high-priority data for greater harvest control. Degrading results based on client capability.
to harvest degradation but the application can continue if they fail. You can only provide strong consistency for the subsystems that need it. Orthogonal mechanisms (state vs functionality). ♥
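To make the harvest idea concrete, here is a minimal sketch, assuming a hypothetical fan-out query over several shards. The names `query_shards` and the shard callables are illustrative and not from the talk. A failing shard lowers harvest (the fraction of the data reflected in the answer), but the request still completes.

```python
from typing import Callable, List, Tuple

# Hypothetical shard interface: a callable that returns matches for a term
# and raises an exception when the shard is unavailable.
Shard = Callable[[str], List[str]]

def query_shards(shards: List[Shard], term: str) -> Tuple[List[str], float]:
    """Return (results, harvest), where harvest is the fraction of shards that answered."""
    results: List[str] = []
    answered = 0
    for shard in shards:
        try:
            results.extend(shard(term))
            answered += 1
        except Exception:
            # A failed shard degrades harvest; the request itself still succeeds.
            continue
    harvest = answered / len(shards) if shards else 0.0
    return results, harvest

def healthy_a(term: str) -> List[str]:
    return [f"a:{term}"]

def healthy_b(term: str) -> List[str]:
    return [f"b:{term}"]

def broken(term: str) -> List[str]:
    raise ConnectionError("shard unavailable")

results, harvest = query_shards([healthy_a, broken, healthy_b], "pugs")
```

Returning the harvest value alongside the results lets the caller decide whether a partial answer is good enough to show.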
of safety based on: mentoring, responding, adapting, and learning. System safety is about what can happen, where the operating point actually is, and what we do under pressure. Resilience is operator-community focused.
system!) Metrics & monitoring; Convergence to good state; Hazard inventories; Redundancies; Feature flags; Dark deploys; Runbooks & docs; Canaries; System verification; Formal methods; Fault injection; Classical engineering; Reactive Operations; Unknown-Unknown. The goal is to build failure domain independence.
latencies in the stack. Without fault tolerance: 30 dependencies, each with 99.99% uptime, could result in 2+ hours of downtime per month! Leveraged client libraries.
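The downtime figure can be checked with a few lines of arithmetic. This is a sketch under two stated assumptions that are not on the slide: the 30 dependencies fail independently, and a month is 30 days (720 hours).

```python
# Assumptions (not from the slide): failures are independent, every dependency
# is required for a request to succeed, and a month has 30 days.
HOURS_PER_MONTH = 30 * 24

def composite_availability(per_dependency: float, count: int) -> float:
    """Availability of a service that needs all `count` dependencies to be up."""
    return per_dependency ** count

def monthly_downtime_hours(availability: float, hours: float = HOURS_PER_MONTH) -> float:
    """Expected hours of downtime per month at the given availability."""
    return (1.0 - availability) * hours

availability = composite_availability(0.9999, 30)
downtime = monthly_downtime_hours(availability)
print(f"{availability:.4%} available, about {downtime:.2f} hours down per month")
```

Under those assumptions the composite availability drops to roughly 99.7%, which works out to a little over two hours per month.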
of Semaphores. Separate threads on per-dependency thread pools. Circuit-breakers to relieve pressure in underlying systems. Exceptions cause app to shed load until things are healthy.
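As an illustration of the circuit-breaker idea, here is a minimal single-threaded sketch. The class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after`) are hypothetical and do not reflect the API of any particular library. After enough consecutive failures the breaker rejects calls outright, which sheds load from the struggling dependency, and it lets one trial call through once a cool-down has passed.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; shedding load")
            # Cool-down elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would also need locking for concurrent callers and some form of metrics; both are left out here to keep the state machine visible.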
client library control + storage of small data files with restricted operations. Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand distributed systems.
construct but you can dedicate effort to architecting them well and making them failure-tolerant. Restricting user behavior increased resilience. Consumers of your service are part of your UNK-UNK scenarios.
from v1 to v3. Used Bimodal Multicast (a gossip protocol) to provide extremely fast purging speed. Design concerns & system evolution. Tyler McMullen, Bruce Spang.
NetSys. João Taveira Araújo looking suave. Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches. Do immediate or gradual host failure & recovery. Watch João’s talk.
stages of evolution. Resilient systems like Varnish, Powderhorn, and Faild have taught us many lessons, but some applications have availability problems. Why? But wait a minute! ♥
replication of data, replay of messages, anti-entropy build resilience. Gossip / epidemic protocols too. Capacity planning matters. Optimizations can make your system less resilient!
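Anti-entropy can be sketched as a pairwise reconciliation pass between replicas. The example below is a simplified illustration, assuming each replica is a dict mapping a key to a `(version, value)` pair and that the higher version wins. That conflict rule and the function name `anti_entropy` are assumptions for this sketch and are not described in the talk.

```python
from typing import Dict, Tuple

Replica = Dict[str, Tuple[int, str]]  # key -> (version, value)

def anti_entropy(a: Replica, b: Replica) -> None:
    """Reconcile two replicas in place so both hold the newest entry per key."""
    for key in set(a) | set(b):
        candidates = [entry for entry in (a.get(key), b.get(key)) if entry is not None]
        # Highest version wins; on a tie the entry from replica `a` is kept.
        newest = max(candidates, key=lambda entry: entry[0])
        a[key] = newest
        b[key] = newest

replica_a: Replica = {"x": (1, "old"), "y": (2, "kept")}
replica_b: Replica = {"x": (3, "new")}
anti_entropy(replica_a, replica_b)
```

Running such a pass periodically between randomly chosen pairs of nodes is the basic shape of a gossip-style repair loop: updates missed during a failure are eventually picked up without any central coordinator.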
TEST! Versioning from the start: provide an upgrade path from day 1. Upgrades & evolvability of systems are still tricky. Mixed-mode operations need to be common. Re-examine the way we prototype systems.
or yield? Orthogonality & decomposition FTW. Do we have enough redundancies in place? Are we resilient to our dependencies? Am I providing enough control to my operators? Would I want to be on call for this? Rank your services: what can be dropped, killed, deferred? Monitoring and alerting in place? The existence of this stresses diligence on the other two areas. Have we done everything we can? Abandon hope and resort to human sacrifices. ♥ ♥ Theory matters!
!= tests. Have both. Distrust client behavior, even if the clients are internal. Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations. Checksum all the things. Error handling, circuit breakers, backpressure, leases, timeouts. Automation shortcuts taken while in a rush will come back to haunt you. Release stability is often tied to system stability. Iron out your deploy process. Link alerts to playbooks. Consolidate system configuration (data bags, config file, etc). tl;dr ♥ ♥ Operators determine resilience
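"Checksum all the things" can be as small as prefixing each stored or transmitted record with a digest and verifying it on read. The sketch below uses SHA-256 from Python's standard `hashlib`. The record layout (32-byte digest followed by the payload) and the names `seal` and `unseal` are assumptions made for illustration.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload

def unseal(record: bytes) -> bytes:
    """Verify the digest and return the payload, or raise ValueError if corrupt."""
    digest, payload = record[:DIGEST_SIZE], record[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: record is corrupt")
    return payload

record = seal(b"purge /index.html")
payload = unseal(record)
```

Failing loudly on a mismatch, as `unseal` does, means corruption shows up as an error that can be handled and retried, not as silently wrong data.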
Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgard, Peter Alvaro, Peter Bailis, Bruce Spang, Matt Whiteley, Alex Rasmussen, Aysulu Greenberg, Elaine Greenberg, and Greg Bako.