Chaos Engineering: When the network breaks @ NGINX Conf 2019

Chaos Engineering: When The Network Breaks
Tammy Butow, Principal SRE, Gremlin
@tammybutow

THE PROBLEM: Every system is becoming a distributed system.

Chaos Engineering: thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Inject something harmful to build an immunity.

We test proactively, instead of waiting for an outage.

Define the Blast Radius

What is the value of Chaos Engineering?

Improved Incident Management

Fire drills prepare us to respond quickly, calmly, and safely.

Measuring the Cost of Downtime
Cost = R + E + C + (B + A)
During the outage: R = Revenue Lost, E = Employee Productivity
After the outage: C = Customer Chargebacks (SLA Breaches)
Unquantifiable: B = Brand Defamation, A = Employee Attrition
Amazon is estimated to lose $220,000/min.
The average e-commerce site loses $6,800/min.
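As a rough illustration of the cost formula, the quantifiable terms can be added up for a given outage. This is a minimal sketch, not something shown in the talk: the function names and the 30-minute outage are assumptions for illustration, only the $6,800/min figure comes from the slide, and B and A are left out because the slide marks them as unquantifiable.

```python
def revenue_lost(loss_per_minute: float, outage_minutes: float) -> float:
    """R: revenue lost while the outage is ongoing."""
    return loss_per_minute * outage_minutes


def quantifiable_cost(r: float, e: float, c: float) -> float:
    """R + E + C. Brand defamation (B) and employee attrition (A)
    are not modelled because the slide treats them as unquantifiable."""
    return r + e + c


# Hypothetical 30-minute outage at the slide's average e-commerce rate.
# E and C are set to 0 only because the slides give no figures for them.
r = revenue_lost(6_800, 30)
print(quantifiable_cost(r, e=0, c=0))
```

Even with E and C unknown, the revenue term alone gives a lower bound that can be compared against the cost of running experiments.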

Network Chaos

Network Chaos Engineering Demos
01 Latency Injection
02 Packet Loss
03 Blackhole

Hipster Shop Architecture

35.238.163.103 Hipster Shop Demo

Latency Injection Demo

Experiment #4
Hipster Shop, Datadog
Latency Attack: 200 ms, 1 container (payments)
HTTP 400/500 errors

Latency Attack on Payment Container on AWS EKS

Experiment #5
Hipster Shop, Datadog
Latency Attack: 200 ms, 1 instance
HTTP 400/500 errors

Latency Attack on 1 instance on AWS EKS

Packet Loss Demo

Experiment #2
Kubernetes Dashboard, Datadog, Gremlin
Packet Loss Attack: 70%, 60 seconds, `kubernetes-dashboard`
Rise in errors (400/500s)
Slower responses, but ultimately success

Packet Loss Attack on 1 container on AWS EKS
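One way to see why 70% packet loss can produce slower responses that still succeed is a back-of-the-envelope retry model. The sketch below assumes each transmission is lost independently with the same probability, which real networks and TCP retransmission only approximate; the function names and the retry limit of 10 are illustrative assumptions, not values from the demo.

```python
def expected_attempts(loss_rate: float) -> float:
    """Mean number of transmissions until one gets through,
    assuming independent losses (geometric distribution)."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - loss_rate)


def delivery_probability(loss_rate: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts transmissions arrives."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be in [0, 1]")
    return 1.0 - loss_rate ** max_attempts


# 70% loss, as in the packet loss experiment on the slide.
print(expected_attempts(0.70))
print(delivery_probability(0.70, max_attempts=10))
```

Under this model each packet needs a little over three transmissions on average, so requests take longer, but with enough retries most of them still complete.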

Blackhole Attack Demo

Experiment #3
Hipster Shop, Datadog
Blackhole Attack: 120 seconds, 1 container (payments)
HTTP 400/500 errors

Blackhole Attack on Payment Container on AWS EKS

Experiment #3
Hipster Shop, Datadog
Blackhole Attack: 60 seconds, 1 container (catalogue)
HTTP 400/500 errors

Blackhole Attack on Catalogue Container on AWS EKS

How do you communicate the results of your Chaos Engineering experiments?

Was it expected? Chaos Engineering uncovers unknown side effects.
Was it detected? Ensuring that our monitoring is configured correctly is critical.
Was it mitigated? When possible, our systems should gracefully degrade.
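Graceful degradation can be sketched as a caller that falls back to a reduced answer when a dependency is unreachable, as it would be during a blackhole attack. This is a minimal illustration under assumptions: `with_fallback` and `unreachable_dependency` are hypothetical names, and the talk does not show how the demo application handles failures.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary() unless it fails with a network-style error,
    in which case return fallback() so the caller can keep serving."""
    try:
        return primary()
    except (ConnectionError, TimeoutError):
        return fallback()


def unreachable_dependency() -> list:
    """Stand-in for a call to a service whose traffic is being dropped."""
    raise ConnectionError("dependency unreachable")


# Degrade to an empty result instead of surfacing an error to the user.
result = with_fallback(unreachable_dependency, lambda: [])
print(result)
```

Only network-style exceptions are caught here, so programming errors in the primary call still surface instead of being hidden by the fallback.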

Fix the issues. Whether code, configuration, or process: iterate and improve.
Can you automate this? Regularly exercise past failures to prevent the drift into failure.
Share your results! Prepare an Executive Summary of what you learned.
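The three review questions (expected, detected, mitigated) can be captured in a small record so that follow-up work falls out of each experiment automatically. The sketch below is one possible shape, not a format from the talk: the class, field names, and follow-up wording are assumptions, and the True/False answers are placeholders rather than results reported in the demos.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    experiment: str
    attack: str
    target: str
    expected: bool   # Was it expected?
    detected: bool   # Was it detected?
    mitigated: bool  # Was it mitigated?

    def follow_ups(self) -> list:
        """List the actions implied by each question answered 'no'."""
        items = []
        if not self.expected:
            items.append("document the unexpected side effect")
        if not self.detected:
            items.append("fix monitoring and alerting")
        if not self.mitigated:
            items.append("add graceful degradation")
        return items


# Placeholder answers for illustration only; not outcomes from the talk.
result = ExperimentResult(
    experiment="Experiment #4",
    attack="latency, 200 ms",
    target="payments",
    expected=True,
    detected=True,
    mitigated=False,
)
print(result.follow_ups())
```

A list of such records is a reasonable starting point for an executive summary, since each entry already states what was tested and what remains to be fixed.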

Where can you get started?

Join us @ Chaos Conf
chaosconf.io
twitter.com/chaosconf
San Francisco, September 26, 2019
Special code: "insider" for $49 tickets
@tammybutow @gremlininc

Thank You
Tammy Butow, Principal SRE, Gremlin
@tammybutow
