
Chaos Engineering: When The Network Breaks - Tammy Bryant Butow (Gremlin) - ETE 2021

ABOUT THIS TALK
Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally break things on purpose to learn how to build more resilient systems.

In this session, Tammy leads a walk‑through of network chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Even if you’re already using chaos engineering, she illustrates new ways to use it to improve the resilience of your network and services. She describes how other companies are using chaos engineering and the positive results they have seen in building reliable distributed systems.

Tammy begins by explaining chaos engineering and its principles. She then looks at why many engineering teams (including Netflix, Gremlin, Dropbox, National Australia Bank, Twilio, and more) use chaos engineering and how every engineering team can use it to create reliable systems. She shows how to get started with chaos engineering on your own team, exploring the tools for measuring success, the available chaos tools, and the new chaos features built into cloud services. She explains how to use wargame environments to learn about chaos engineering and how to practice chaos engineering on Kubernetes, Redis, Kafka, and more.

Other topics include how to use monitoring tools combined with chaos engineering to help you create reliable distributed systems, where you can learn more, and how to join the chaos community.

Tammy Bryant Butow

May 05, 2021

Transcript

  1. Hello new and old friends. You’ll find me on twitter: @tambryantbutow and on LinkedIn. Happy to answer questions :) Principal SRE @ Gremlin. Co-Founder @ Girl Geek Academy. Previously SRE Manager @ Dropbox; DigitalOcean, National Australia Bank, Queensland University of Technology… and more.
  2. Have you ever thought the network was causing an incident but struggled to prove it?
  3. Thoughtful planned experiments designed to help you prove networking issues and get them resolved quickly.
  4. For example, say somebody is occasionally throttling your traffic. How do you discover and prove this? Answer: Network Chaos Engineering. Steps: generate network traffic, observe network traffic, run latency attack, review results.
  5. You realise this looks like misconfigured QoS: only throttling you during specific time windows.
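One way to turn "only throttling you during specific time windows" into evidence is to bucket latency measurements by hour of day and compare each bucket against the overall median. The sketch below is a minimal illustration, not the tooling used in the talk: the function name `throttled_hours`, the `factor` threshold, and the synthetic samples are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def throttled_hours(samples, factor=2.0):
    """Return the hours of day whose median latency exceeds factor x the overall median.

    samples: iterable of (datetime, latency_ms) pairs (hypothetical input shape).
    """
    samples = list(samples)
    if not samples:
        return []
    baseline = median(latency for _, latency in samples)
    by_hour = defaultdict(list)
    for timestamp, latency in samples:
        by_hour[timestamp.hour].append(latency)
    return sorted(
        hour for hour, values in by_hour.items()
        if median(values) > factor * baseline
    )


# Synthetic data: 20 ms normally, 200 ms between 09:00 and 11:00.
samples = []
for hour in range(24):
    for minute in (0, 15, 30, 45):
        latency_ms = 200.0 if 9 <= hour < 11 else 20.0
        samples.append((datetime(2021, 5, 5, hour, minute), latency_ms))

print(throttled_hours(samples))  # [9, 10]
```

Using the overall median as the baseline only works while the slow windows cover less than half of the samples; a longer throttling window would need a baseline taken from a known-good period instead.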
  6. What is QoS? Quality of service (QoS) controls and manages network resources by setting priorities for specific types of data on the network. If QoS is misconfigured, it can add latency, as the wrong packets may be getting delayed in a buffer when other traffic is present.
  7. Now we can resolve the issue by fixing the misconfigured QoS policy and putting in place a notification system for service owners. We then re-run our Chaos Engineering Scenario to ensure the fix works.
  8. There are lots of other benefits you can get from practicing Chaos Engineering on the network. Find monitoring and observability gaps, validate dependencies, train teams for on-call & get more sleep.
  9. My favourite benefit is a reduction in MTTD (mean time to detection). I care about this so much, I wrote a book on it with friends! gremlin.com/oreilly-reducing-mttd-for-high-severity-incidents/
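As a worked example of the metric, MTTD can be computed as the average gap between when each incident started and when it was detected. This is a minimal sketch with made-up incident timestamps; the function name and the data are illustrative assumptions, not figures from the talk or the book.

```python
from datetime import datetime, timedelta


def mean_time_to_detection(incidents):
    """Average of (detected_at - started_at) over a non-empty list of incidents."""
    gaps = [detected_at - started_at for started_at, detected_at in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incidents as (started_at, detected_at) pairs.
incidents = [
    (datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 4)),   # 4 minutes
    (datetime(2021, 5, 2, 14, 0), datetime(2021, 5, 2, 14, 10)),  # 10 minutes
    (datetime(2021, 5, 3, 9, 30), datetime(2021, 5, 3, 9, 31)),   # 1 minute
]

print(mean_time_to_detection(incidents))  # 0:05:00
```

Tracking this number before and after a round of chaos experiments is one way to check whether the monitoring gaps they exposed were actually closed.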
  10. Service Not Found Architecture: Does blackholing a non-critical path service like the Recommendation Service cause unexpected failures for critical services like the Product Catalogue or Frontend?
  11. Does blackholing a non-critical path service like the Ad Service result in graceful degradation of the customer experience?
  12. Does blackholing a non-critical path service like the Recommendations Service result in graceful degradation of the customer experience?
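To make "graceful degradation" concrete, the sketch below shows a product page that treats recommendations as optional: if the call to the recommendation service times out or cannot connect, the page still renders without them. It is a hypothetical illustration, not code from the demo application in the talk; `render_product_page`, `blackholed`, and `healthy` are invented names, and the blackhole is simulated by raising `TimeoutError` rather than by dropping real traffic.

```python
from typing import Callable, List


def render_product_page(
    product_id: str,
    fetch_recommendations: Callable[[str], List[str]],
) -> dict:
    """Build a product page; a failed recommendations call degrades the page, not the request."""
    page = {"product_id": product_id, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except (TimeoutError, ConnectionError):
        # Non-critical dependency is unreachable: serve the page without it.
        page["degraded"] = True
    return page


def blackholed(product_id: str) -> List[str]:
    """Stand-in for a blackholed service: the call never gets an answer."""
    raise TimeoutError("recommendation service unreachable")


def healthy(product_id: str) -> List[str]:
    """Stand-in for a working recommendation service."""
    return ["sku-2", "sku-3"]


print(render_product_page("sku-1", healthy))
print(render_product_page("sku-1", blackholed))
```

In a real client this pattern depends on a short timeout being set on the outbound call, because blackholed traffic typically hangs rather than failing fast.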
  13. Our requests for product pages are cancelled because the first product page request is stalled and unable to complete successfully. This continues for the duration of the recommendation catalogue outage due to a previously unknown dependency on product assets that are unavailable.
  14. Major Incident: Yes, our experiment was not successful and our results were not what we expected them to be. We’ll need to fix these dependency issues to ensure this doesn’t happen again.
  15. Due to the packet loss attack on the shopping cart, a user who tries to add items to the cart is given a 500 Internal Server Error: “Failed To Add To Cart”.
  16. Was it expected, was it detected, was it mitigated, can we fix the issues? Can we automate this? How can we best share our findings and results?
  17. Thanks new and old friends. You can find me on twitter: @tambryantbutow and on LinkedIn. Bonus sticker gift pack from Gremlin: gremlin.com/talk/ete