
Chaos Engineering: Why Breaking Things Should Be Practiced

As presented at the AWS Summit in Dubai - with Qais Ammouri, Head of Technology at Almosafer.

With the wide adoption of micro-services and large-scale distributed systems, architectures have grown increasingly complex and hard to understand. Worse, the software systems running them have become extremely difficult to debug and test, which increases the risk of outages. These new challenges call for new tools, and because failures have become more and more chaotic in nature, we must turn to chaos engineering to reveal failures before they become outages. This talk introduces chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more robust systems.

Adrian Hornsby

April 17, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT. Chaos Engineering: Why breaking things should be practiced. Adrian Hornsby (@adhorn), Sr. Technical Evangelist, Amazon Web Services. Qais Ammouri, Head of Technology, Almosafer.
  2. Been there?
  3. Distributed systems are hard (Amazon, Twitter, Netflix)
  4. “Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO, Amazon.com
  5. Resiliency: Ability for a system to handle and eventually recover from unexpected conditions
  6. Partial failure mode
  7. How do we build resilient software systems?
  8. People, Application, Network & Data, Infrastructure
  9. Building confidence through testing. Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???
  10. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  11. Chaos engineering https://github.com/Netflix/SimianArmy
  12. Failure injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attack • Human attack https://www.gremlin.com https://github.com/Netflix/SimianArmy https://chaostoolkit.org
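
Application-level failure injection from slide 12 could be sketched roughly as follows. This is an illustration only, not the API of Gremlin, Simian Army or Chaos Toolkit: the `inject_failure` decorator, the `InjectedFailure` exception and all parameter names are hypothetical.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def inject_failure(latency_s=0.0, error_rate=0.0, enabled=lambda: True,
                   rng=random.random, sleep=time.sleep):
    """Wrap a function so that calls can be delayed or made to fail on purpose.

    latency_s  -- extra delay added before each call while injection is enabled
    error_rate -- probability (0.0 to 1.0) of raising InjectedFailure
    enabled    -- callable acting as the kill switch for the injection
    rng, sleep -- injectable so the behavior can be tested deterministically
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)  # simulate a slow dependency
                if rng() < error_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage: add 300 ms of latency and fail 10% of the calls.
@inject_failure(latency_s=0.3, error_rate=0.1)
def fetch_hotel_prices():
    return {"hotel": "example", "price_sar": 450}
```

Keeping the `enabled` switch separate from the injection parameters means the attack can be turned off immediately without redeploying the wrapped code.
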
  13. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  14. Chaos engineering is NOT about breaking things randomly without a purpose. It is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application’s ability to withstand turbulent conditions.
  15. Chaos Engineering
  16. Build Resilient Systems (cycle): Steady State, Hypothesis, Design & Run Experiment, Verify & Learn, Fix
  17. Build Resilient Systems
  19. 2012: Our sales were less than 1 million SAR. It all started with a handful of people between Riyadh and Egypt. In 2012, Almosafer started between Egypt and Riyadh with a focus on hotels, working through social media and a call center.
  20. 2015: We grew to 70 employees and our sales reached 74 million SAR. Al Tayyar Travel Group (now Seera Group) acquired 60% of Almosafer…
  21. 2018: We grew to 1000+ employees and our sales exceeded 1.3 billion SAR. Crossing the billion line. Becoming the largest OTA in Saudi Arabia, fully acquired by Seera Group. In 2018, Almosafer became the largest OTA in Saudi Arabia in the flight market.
  22. AWS’s largest KSA client and the first on EKS in MENA
  23. Before Chaos Engineering
  26. Monitoring (Eagle Eye), Tech Capabilities, Culture
  27. Start with people • Try to avoid the word “Chaos” when talking to your business.
  28. Start with people • Try to avoid the word “Chaos” when talking to your business. • Embrace failure, and fix it.
  29. Start with people • Try to avoid the word “Chaos” with your business. • Embrace failure, and fix it. • Replace “if it fails” with “when it fails”.
  30. Start with people • Try to avoid the word “Chaos” when talking to your business. • Embrace failure, and fix it. • Replace “if it fails” with “when it fails”. • Everything fails, at least once!
  31. Start with people • Try to avoid the word “Chaos” when talking to your business. • Embrace failure, and fix it. • Replace “if it fails” with “when it fails”. • Everything fails, at least once! • Do fire drills, at least once a month.
  32. Resiliency in Almosafer • Monitor everything, or die trying.
  33. Resiliency in Almosafer • Monitor everything, or die trying. • Architect with failure in mind; it is not an edge case.
  34. Resiliency in Almosafer • Monitor everything, or die trying. • Architect with failure in mind; it is not an edge case. • Resiliency starts in the frontend: avoid blocking the UI.
  35. Resiliency in Almosafer • Monitor everything, or die trying. • Architect with failure in mind; it is not an edge case. • Resiliency starts in the frontend: avoid blocking the UI. • Automated testing is not a “nice to have”; it is a “must have”.
  36. Resiliency in Almosafer • Monitor everything, or die trying. • Architect with failure in mind; it is not an edge case. • Resiliency starts in the frontend: avoid blocking the UI. • Automated testing is not a luxury product. • Use circuit breaking: timeouts, retries and fallbacks.
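
The last bullet on slide 36 (circuit breaking with timeouts, retries and fallbacks) might look like the minimal sketch below. It is not Almosafer's implementation; the `CircuitBreaker` class and its parameters are hypothetical, and the wrapped function is assumed to enforce its own timeout by raising an exception such as `TimeoutError`.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: retry a few times, then open and use the fallback."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # how long the circuit stays open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def _is_open(self):
        if self.opened_at is None:
            return False
        # After the reset period the circuit is half-open: let a trial call through.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, func, fallback, retries=2):
        """Call func with retries; return fallback() if the circuit is open or func keeps failing."""
        if self._is_open():
            return fallback()
        for _ in range(retries + 1):
            try:
                result = func()  # func is assumed to raise on timeout or error
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()  # open (or re-open) the circuit
                    break
            else:
                self.failures = 0
                self.opened_at = None  # a success closes the circuit again
                return result
        return fallback()
```

While the circuit is open, callers get the fallback immediately instead of waiting on a dependency that is known to be failing, which is also what keeps a frontend from blocking.
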
  37. Redundancy is fundamental. • Don’t put all your eggs in one basket: be multi-region and multi-AZ.
  39. Steady State
  40. What is steady state? • “Normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  41. What is steady state? • “Normal” behavior of your system • Business metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
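
One way to turn the “normal” behavior on slides 40 and 41 into something checkable is to derive a tolerance band from a baseline of a business metric. The sketch below assumes a mean plus or minus `k` standard deviations band, which is a simplification chosen for illustration and not a method stated in the talk; the function names and the sample metric are hypothetical.

```python
from statistics import mean, stdev


def steady_state_band(baseline, k=3.0):
    """Return (low, high) bounds: baseline mean plus or minus k sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma


def within_steady_state(samples, band):
    """True if every observed sample stays inside the steady-state band."""
    low, high = band
    return all(low <= s <= high for s in samples)


# Hypothetical baseline: bookings per minute observed under normal conditions.
baseline = [100, 102, 98, 101, 99]
band = steady_state_band(baseline)
during_experiment = [99, 97, 101]
print(within_steady_state(during_experiment, band))
```

Real traffic is usually seasonal, so a baseline taken from the same hour and weekday is likely to give a more useful band than a global average.
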
  42. Business metrics at work. Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  43. Hypothesis
  44. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” “What if latency increases by 300 ms?” “What if the database stops?” Make it everyone’s problem!
  45. Disclaimer! Don’t test a hypothesis that you already know will break you!
  46. Design & Run Experiment
  47. Designing the experiment • Pick a hypothesis • Scope the experiment • Identify metrics • Notify the organization
  48. Rules of thumb • Start very small • As close as possible to production • Minimize the blast radius • Have an emergency STOP!
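
The “emergency STOP” rule on slide 48 can be sketched as an experiment loop that aborts as soon as the chosen metric leaves its steady-state band and always rolls the fault back. This is an assumption-laden illustration: `run_experiment`, its callbacks and the result dictionary are hypothetical names, not part of any tool mentioned in the talk.

```python
def run_experiment(inject, rollback, read_metric, band, steps=10):
    """Inject a fault, watch a metric, and stop early if it leaves the band.

    inject, rollback -- callables that start and undo the fault (hypothetical)
    read_metric      -- callable returning the current value of the chosen metric
    band             -- (low, high) steady-state bounds for that metric
    steps            -- number of metric readings to take before finishing
    """
    low, high = band
    inject()
    try:
        for step in range(steps):
            value = read_metric()
            if not (low <= value <= high):
                # Emergency stop: the hypothesis is disproved, abort right away.
                return {"aborted": True, "step": step, "value": value}
        return {"aborted": False, "step": steps, "value": None}
    finally:
        rollback()  # always undo the fault, even on abort or on an exception
```

Putting `rollback()` in the `finally` clause means the fault is removed even if reading the metric itself fails, which keeps the blast radius bounded.
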
  49. Running a chaos experiment: start with a canary deployment. 99% of users go to the normal version, 1% of users go to the canary.
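
The 99% / 1% split on slide 49 could be implemented by hashing a stable user identifier into buckets, so that the same user always lands in the same group. The slide does not say how the traffic was actually split; the `in_canary` function and the choice of SHA-256 are assumptions made for this sketch.

```python
import hashlib


def in_canary(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically assign roughly `percent` percent of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical usage: route a request based on the user's identifier.
target = "canary" if in_canary("user-12345") else "normal"
print(target)
```

Because the assignment depends only on the identifier, widening the experiment from 1% to 5% keeps the original canary users in the canary group.
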
  50. Verify & Learn
  51. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick in? • Time for self-healing to happen? • Time to recovery, partial and full? • Time to all-clear and stable?
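
The questions on slide 51 become numbers once each milestone of the experiment is timestamped. The sketch below assumes a simple dictionary of event times; the event names (`failure_injected`, `detected`, and so on) and the timestamps are hypothetical examples, not data from the talk.

```python
from datetime import datetime


def experiment_timings(events):
    """Return seconds elapsed from fault injection to every other recorded event."""
    start = events["failure_injected"]
    return {
        name: (ts - start).total_seconds()
        for name, ts in events.items()
        if name != "failure_injected"
    }


# Hypothetical timeline recorded during an experiment.
events = {
    "failure_injected": datetime(2019, 4, 17, 10, 0, 0),
    "detected": datetime(2019, 4, 17, 10, 0, 45),
    "notified": datetime(2019, 4, 17, 10, 2, 0),
    "recovered": datetime(2019, 4, 17, 10, 12, 30),
}
for name, seconds in experiment_timings(events).items():
    print(f"time to {name}: {seconds:.0f} s")
```

Tracking the same durations across repeated runs is one way to see whether the fixes made after each experiment are shortening detection and recovery.
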
  52. Postmortems – COE (Correction of Errors). The 5 Whys: Outage. Because of … Because of … Because of … Because of … NOT ENOUGH
  53. More questions to ask • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there? Are there places to dig deeper that could shine more light on this? • Did others step in to help, to advise, or to intercede?
  54. Rules to remember! 1. Failure requires multiple faults. 2. There is no isolated “cause” of an accident. 3. There are multiple contributors to accidents.
  55. DON’T blame that one person …
  56. Big challenges to chaos engineering: mostly cultural • No time or flexibility to simulate disasters. • Teams already spending all of their time fixing things. • Can be very political. • Might force deep conversations. • Deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  57. Changing culture takes time! Be patient…
  58. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy • https://medium.com/@adhorn
  60. Thank you! Adrian Hornsby @adhorn https://medium.com/@adhorn