Chaos Engineering: getting out of the starting blocks

Architectures are growing increasingly distributed and hard to understand. As a result, software systems have become extremely difficult to debug and test, which increases the risk of failure. With these new challenges, chaos engineering has become attractive to many organizations as a mechanism for understanding the behavior of systems under unexpected circumstances.

Whilst interest is growing, few organizations have managed to build sustainable chaos engineering practices. In this talk, I review the state of chaos engineering and the issues customers are facing, based on what I have learned as an AWS Solutions Architect and Technologist focusing on chaos engineering, and I explain why I started to build tools to help with failure injection.

Adrian Hornsby

January 23, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos Engineering: Getting out of the starting blocks. Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services
  2. What currently prevents the wide adoption of chaos engineering in your organization?
  3. Why is production chaos?
  4. #0 - DON’T CALL IT CHAOS ENGINEERING.
  5. #0 - DON’T CALL IT CHAOS ENGINEERING.
  6. #1 - DON’T FOCUS ON CHAOS ENGINEERING, LOOK AT THE BIGGER PICTURE.
  7. Good intentions never work [...]
  8. Because people already had good intentions
  9. If good intentions don’t work, what does?
  10. Toyota will not allow any defect that they know about to go down the manufacturing line.
  11. Source: http://www.autoexpress.co.uk/toyota/prius/34615/japanese-earthquake-hits-car-production
  12. Andon Customer Service
  13. • Erroneously listed recharge cable as included • Andon cord pulled and page corrected • Contacts per unit go from 33% to 3.7%
  14. "Good intentions never work, you need good mechanisms to make anything happen." Jeff Bezos
  15. People have good intentions to start with!
  16. Good Mechanisms ≈ Complete Processes: Tools, Adoption, Audit
  17. #2 - CHANGE BEGINS WITH UNDERSTANDING.
  18. What are the top 5 “painful” reasons for your fires?
  19. 1. It is always DNS 2. Configuration drift 3. SSL Certificate expiration 4. Deployment failure 5. Failed link to 3rd party provider
  20. Anatomy of a COE • What happened? • What was the impact on customers and your business? • What were the contributing factors? • What data do you have to support this? • What lessons did you learn? • What corrective actions are you taking?
  21. Audit: Weekly Operational Metrics Review • Continuous inspection mechanism • Maintains focus on operations • Foundation of a healthy operations program. Typical agenda (typically divided into fifteen-minute slots): • Share successes and failings • Action items follow up • Review COEs • Review key service metrics • Identify new best practices
  22. Policy Engine • Automated risk and opportunity analyzer • Identifies potential risks to availability, infrastructure, security and more • Highlights opportunities to optimize resource utilization • Extensible and configurable • Provides a view into policy compliance • Allows acknowledgment • Reports roll up the organization hierarchy. Mechanism to propagate local learnings globally
  23. #3 - CHOOSE YOUR TROJAN HORSE.
  24. Find the right team to start with: Not the best (improvements are harder) Not the worst (they have bigger problems)
  25. Choose the metrics to measure improvement: MTTR is __always__ a good default.
  26. #4 - OVER-INDEX ON THE HYPOTHESIS.
  27. STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE
  28. #5 - INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY.
  29. Start simple and local!! $ docker stop 94a214bbeebd
  30. DDoS yourself $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health
  31. Burn CPU with stress(-ng) $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s https://kernel.ubuntu.com/~cking/stress-ng/
  32. Adding latency to the network $ tc qdisc add dev eth0 root netem delay 300ms
  33. Blocks DNS resolution $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
  34. #6 - BLAST-RADIUS REDUCTION MINDSET.
  35. Verification: 1. Disaster Recovery & backups 2. Auto scaling 3. Multi-AZ 4. Fault tolerance & self healing 5. People
  36. Getting out of the starting blocks.
  37. Tools Processes Culture Technology
  38. Thank you! Adrian Hornsby https://medium.com/@adhorn adhorn
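
The experiment loop on slide 27 and the latency command on slide 32 can be combined into a small shell sketch. This is an illustration under stated assumptions, not the speaker's tooling: the helper names (`run`, `steady_state`, `inject_latency`, `rollback_latency`, `experiment`), the `DRY_RUN`, `IFACE` and `HEALTH_URL` variables, and the use of `curl` as the steady-state probe are all hypothetical. Only the `tc ... netem delay 300ms` command and the health URL come from the slides. By default the script only prints the fault commands, which keeps the blast radius at zero until you opt in.

```shell
#!/bin/sh
# Sketch: steady state -> inject fault -> verify -> roll back.
# DRY_RUN=1 (the default) prints fault commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Assumed steady-state probe: the health endpoint answers with a
# successful HTTP status within 2 seconds.
steady_state() {
  curl --silent --fail --max-time 2 --output /dev/null \
    "${HEALTH_URL:-http://127.0.0.1/api/health}"
}

# Fault from slide 32: add 300 ms of latency on the interface (needs root).
inject_latency() {
  run tc qdisc add dev "${IFACE:-eth0}" root netem delay 300ms
}

# Rollback: remove the netem qdisc again.
rollback_latency() {
  run tc qdisc del dev "${IFACE:-eth0}" root netem
}

experiment() {
  # Never inject a fault into a system that is already unhealthy.
  if ! steady_state; then
    echo "steady state not met, aborting before injecting anything"
    return 1
  fi
  inject_latency
  if steady_state; then
    result=0
    echo "hypothesis held"
  else
    result=1
    echo "hypothesis broken"
  fi
  # Always roll back, whatever the outcome.
  rollback_latency
  return "$result"
}
```

To try it, source the file and call `experiment`. Setting `DRY_RUN=0` and running as root would apply the latency for real, so start on a local or test machine, in line with "start simple and local".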