200+ DEVELOPERS
500+ SERVERS
2 DATACENTERS
Ruby on Rails
ONE OF THE LARGEST RAILS DEPLOYMENTS IN THE WORLD
3000+ CONTAINERS RUNNING AT ANY TIME
10,000+ MAX CHECKOUTS PER MINUTE
12+ DEPLOYS PER DAY
300M unique visits/month
LEAGUE OF APPLE, EBAY AND AMAZON
= Post.all
end

private

def fetch_user
  User.find(session[:user_id])
# return nil if session store is down
# in this example sessions are in Redis
rescue Redis::BaseError
  nil
end
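The fallback on this slide can be tried outside Rails. The following is a minimal sketch, not the talk's code: `SessionStoreDown`, `FlakySessionStore`, `USERS` and `render_posts` are hypothetical stand-ins for `Redis::BaseError`, a Redis-backed session, the `User` model and the controller action.

```ruby
# Hypothetical stand-ins, not part of the talk: SessionStoreDown plays the
# role of Redis::BaseError, FlakySessionStore the role of a Redis session.
class SessionStoreDown < StandardError; end

class FlakySessionStore
  def initialize(data, up: true)
    @data = data
    @up = up
  end

  # Raises when the store is "down", as a Redis client would on a
  # connection failure.
  def [](key)
    raise SessionStoreDown, "session store unreachable" unless @up
    @data[key]
  end
end

USERS = { 1 => "alice" }.freeze

# Returns the user name, or nil when the session store is down, so the
# page can still render for a signed-out visitor.
def fetch_user(session)
  USERS[session[:user_id]]
rescue SessionStoreDown
  nil
end

def render_posts(session)
  user = fetch_user(session)
  greeting = user ? "Hello, #{user}" : "Hello, guest"
  "#{greeting}: 2 posts"
end

puts render_posts(FlakySessionStore.new({ user_id: 1 }))
puts render_posts(FlakySessionStore.new({ user_id: 1 }, up: false))
```

With the store down, the page degrades to the signed-out view instead of returning an error.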
and easily covers as many bugs as it uncovers
Production testing means it’s too late and difficult to reproduce
Network-level simulation in development and test environments would give full, reproducible confidence
Resiliency Matrix

Master              Available    Unavailable  Available
Kafka               Available    Degraded     Available
External HTTP API   Degraded     Available    Unavailable
redis-sessions      Unavailable  Unavailable  Degraded
def test_section_a_mq_a_down
  Toxiproxy[:message_queue_a].down do
    get '/section_a'
    assert_response :success
  end
end

def test_section_b_datastore_b
  Toxiproxy[:datastore_b].down do
    get '/section_b'
    assert_response 500
  end
end

# ... and every other cell
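One way to cover "every other cell" is to keep the matrix as data and loop over it. The sketch below is only an illustration under stated assumptions: `MATRIX`, `HARD_DEPENDENCIES`, `status_for` and `check_matrix` are hypothetical names, and `status_for` is a toy stand-in for the application so the example runs without Rails or a Toxiproxy server.

```ruby
# Hypothetical sketch: the resiliency matrix as data, one check per cell.
# Section and dependency names reuse the placeholders from the tests above.
MATRIX = {
  section_a: { message_queue_a: :available, datastore_b: :available },
  section_b: { message_queue_a: :available, datastore_b: :unavailable },
}.freeze

# Toy application model: a section fails only when one of its hard
# dependencies is down.
HARD_DEPENDENCIES = { section_a: [], section_b: [:datastore_b] }.freeze

def status_for(section, down:)
  (HARD_DEPENDENCIES.fetch(section) & down).empty? ? 200 : 500
end

EXPECTED_STATUS = { available: 200, unavailable: 500 }.freeze

# Takes each dependency down in turn and collects the cells whose observed
# status differs from what the matrix promises.
def check_matrix
  failures = []
  MATRIX.each do |section, dependencies|
    dependencies.each do |dependency, expectation|
      actual = status_for(section, down: [dependency])
      failures << [section, dependency] unless actual == EXPECTED_STATUS.fetch(expectation)
    end
  end
  failures
end

puts check_matrix.empty? ? "matrix holds" : "mismatch: #{check_matrix.inspect}"
```

In a real suite, the body of `status_for` would be the `Toxiproxy[...].down do ... end` block and request from the tests above.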
is low
Little’s law: capacity reduced
Need ways to fail even faster!
Setting timeouts is problematic: super low (<10ms) for frequent resources doesn’t account for natural outliers
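The capacity point follows from Little's law, L = λW: with the number of requests in flight (L) capped by the worker count, the sustainable throughput λ falls as the response time W grows. A rough worked example follows. The worker count and response times are illustrative assumptions rather than figures from the talk, and `max_throughput` is a hypothetical helper.

```ruby
# Little's law: L = lambda * W. With L capped by the number of workers,
# the maximum sustainable throughput is lambda = L / W.
def max_throughput(workers:, response_time_ms:)
  workers * 1000.0 / response_time_ms
end

# Illustrative numbers only, not from the talk.
healthy  = max_throughput(workers: 100, response_time_ms: 100)
degraded = max_throughput(workers: 100, response_time_ms: 2000)

puts "healthy:  #{healthy} req/s"
puts "degraded: #{degraded} req/s"
```

In this example, a resource that stalls requests until a 2 s timeout cuts capacity by a factor of 20, which is why failing faster matters.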
process-based servers
Ensures controlled access to resources under increase in response time; easy to reason about impact
Fails faster than circuit breakers when timeout is high
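Assuming this slide describes bulkheads, the idea can be sketched as a fixed pool of tickets per resource: a caller that cannot get a ticket is rejected immediately instead of waiting out a high timeout. The `Bulkhead` class below is a hypothetical, simplified in-process version using threads. It is not the API of any library, and a process-based server would need a cross-process primitive instead of a `Mutex`.

```ruby
# Minimal in-process bulkhead sketch (hypothetical class, for illustration).
class Bulkhead
  class Rejected < StandardError; end

  def initialize(tickets)
    @tickets = tickets
    @mutex = Mutex.new
  end

  # Runs the block if a ticket is free; otherwise raises right away rather
  # than queueing behind a slow resource.
  def acquire
    @mutex.synchronize do
      raise Rejected, "no tickets left" if @tickets.zero?
      @tickets -= 1
    end
    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @mutex.synchronize { @tickets += 1 }
    end
  end
end

bulkhead = Bulkhead.new(1)

result = bulkhead.acquire do
  # The only ticket is held here, so a second caller is turned away at once.
  begin
    bulkhead.acquire { :unreachable }
  rescue Bulkhead::Rejected
    :rejected
  end
end

puts result
```

The ticket count bounds how many workers a single slow resource can tie up, which is what makes the impact easy to reason about.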
for evented (especially) and multithreaded servers
You’ve observed these problems in production (you’re now equipped to)
High timeouts to some data stores and/or services because of legitimate outliers
Toxiproxy tests and matrix
Resiliency Patterns
Production Practice Days (Games)
Kill Nodes (Chaos Monkey)
Latency Monkey
Application-Specific Fallbacks
Region Gorilla
and implement application-specific fallbacks
Not everyone needs circuit breakers and bulkheads; they may be premature for your application
Be careful when introducing new dependencies