Building Resilient Applications using Chaos Engineering on AWS

Architectures have grown increasingly complex and hard to understand. Worse, the software systems running them have become extremely difficult to debug and test, increasing the risk of outages. These new challenges require new tools, and since failures have become more and more chaotic in nature, we must turn to chaos engineering in order to reveal failures before they become outages. In this talk, I will discuss chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more robust systems. I will demo the tools and methods used to inject failures and make systems more resilient.

Adrian Hornsby

January 30, 2020

Transcript

1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building Resilient Applications using Chaos Engineering on AWS. Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services
2. Because that is the average cost of one hour of downtime reported by an ITIC study this year. Source: Information Technology Intelligence Consulting Research
3. Jesse Robbins, “Master of Disaster”, GameDay at Amazon • A volunteer firefighter • Created GameDay in 2006 to purposefully create regular major failures • Founded Chef and the Velocity Web Performance & Operations Conference
4. Rise of the monkeys. “Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog. Set of scheduled agents: • shuts down services randomly • slows down performance • checks conformity • breaks an entire region • integrates with Spinnaker (CI/CD) https://github.com/Netflix/SimianArmy
5. Chaos engineering is NOT about breaking things randomly without a purpose; chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application's ability to withstand turbulent conditions.
6. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
7. What is steady state? • The “normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
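One way to make steady state concrete is to record a baseline of a key metric before any experiment runs. A minimal sketch, assuming a hypothetical /health endpoint (the URL is a placeholder, not from the deck), samples status code and latency once per second:
$ # record a steady-state baseline: HTTP status and response time, one sample per second
$ while true; do curl -s -o /dev/null -w "%{http_code} %{time_total}\n" https://example.com/health; sleep 1; done
Comparing the same samples taken during an experiment against this baseline is what tells you whether the steady state held.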
8. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” “What if latency increases by 300ms?” “What if the database stops?” Make it everyone’s problem!
9. Failure injection • Start small & build confidence • Application level (exceptions, errors, …) • Host level (services, processes, …) • Resource attacks (CPU, memory, IO, …) • Network attacks (dependencies, latency, packet loss, …) • AZ attack • Region attack • People attack
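As a hedged sketch of the host-level attack named above, a common starting point is stopping or killing a single service on one instance and watching how the system degrades and recovers; the service name nginx is only an illustrative placeholder:
$ sudo systemctl stop nginx      # stop one service cleanly on one host
$ sudo pkill -9 -f nginx         # or kill its processes abruptly
Starting with a single process on a single host keeps the blast radius small before moving on to resource, network, AZ, or region attacks.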
10. Rules of thumb • Start very small • As close as possible to production • Minimize the blast radius • Have an emergency STOP! • Be careful with state that can’t be rolled back (corrupt or incorrect data)
11. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick in? • Time for self-healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
12. Postmortems – COE (Correction of Errors) • What happened? • What was the impact on customers and your business? • What were the contributing factors? • What data do you have to support this? • Especially metrics and graphs • What lessons did you learn? • What corrective actions are you taking? • Action items • Related items (trouble tickets, etc.)
13. Burn CPU with stress(-ng) $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s https://kernel.ubuntu.com/~cking/stress-ng/
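The same tool covers other resource attacks from the failure injection list. A minimal sketch, assuming stress-ng is installed on the target host (worker counts and sizes are illustrative, not from the deck):
$ stress-ng --vm 2 --vm-bytes 75% -t 60s    # memory pressure: two workers using 75% of RAM
$ stress-ng --io 4 -t 60s                   # I/O pressure: four workers hammering sync()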
14. Other fun things to do • Fill up disk • Network packet loss (using traffic shaping) • Network packet corruption (using traffic shaping) • Kill random processes • Detach (force) all EBS volumes • Mess with /etc/hosts
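For the traffic-shaping items above, a hedged sketch using the standard Linux tc/netem tooling; the interface name eth0 and the values are placeholders:
$ sudo tc qdisc add dev eth0 root netem delay 300ms loss 5%    # inject latency and packet loss
$ sudo tc qdisc del dev eth0 root netem                        # emergency STOP: remove the rule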
15. Simian Army https://github.com/Netflix/SimianArmy Set of scheduled agents: • shuts down services randomly • slows down performance • checks conformity • breaks an entire region • integrates with Spinnaker (CI/CD)
16. The Chaos Toolkit • Simplifying Adoption of Chaos Engineering • An Open API to Chaos Engineering • Open source extensions for • Infrastructure/Platform Fault Injections • Application Fault Injections • Observability • Integrates easily into CI/CD pipelines
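A minimal usage sketch of the Chaos Toolkit CLI, assuming a local Python environment and an experiment file named experiment.json (the file name is a placeholder, not from the deck):
$ pip install chaostoolkit
$ chaos run experiment.json    # checks the steady-state hypothesis, runs the method, then checks again
The experiment file declares the steady-state probes and the fault-injecting actions, which is what lets the same definition run locally or from a CI/CD pipeline.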
17. ToxiProxy • HTTP API • Built with automated testing in mind • Not for production environments • Fast • Toxics for: timeouts, latency, connection and bandwidth limitation, etc. • CLI • Stable and well tested (used for 3 years at Shopify) • Open Source: https://github.com/Shopify/toxiproxy
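A hedged sketch of driving ToxiProxy from its CLI to slow down a Redis dependency; the proxy name, ports, and latency value are placeholders:
$ toxiproxy-cli create -l localhost:26379 -u localhost:6379 chaos_redis    # proxy in front of Redis
$ toxiproxy-cli toxic add -t latency -a latency=1000 chaos_redis           # add 1s of latency downstream
Pointing the application at the proxy port instead of Redis directly is what makes the “What if Redis becomes slow?” question testable.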
18. Fault Injection Queries for Amazon Aurora SQL commands issued to simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html
19. Fault Injection Queries for Amazon Aurora SQL commands issued to simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE [ IN DISK index | NODE index ] FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
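As a hedged usage sketch, a fault injection query like the one above can be issued with an ordinary MySQL client against the Aurora cluster endpoint; the endpoint, user, and percentage below are placeholders:
$ mysql -h <aurora-cluster-endpoint> -u admin -p \
    -e "ALTER SYSTEM SIMULATE 25 PERCENT DISK FAILURE FOR INTERVAL 1 MINUTE;"
Because the simulated failure runs for a bounded interval, the experiment stops itself, which pairs well with the “emergency STOP” rule of thumb.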
20. SSM Run (send) Command $ aws ssm send-command --document-name "cpu-stress" --document-version "1" --targets '[{"Key":"InstanceIds","Values":["i-094c8367024633d96","i-04d0976f9fb658c23"]}]' --parameters '{"duration":["60"],"cpu":["0"]}' --timeout-seconds 600 --max-concurrency "50" --max-errors "0" --output-s3-bucket-name "adhorn-chaos-ssm-output" --region eu-west-1
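A hedged follow-up to check how the Run Command invocation went on each target instance; the command ID is a placeholder for the value returned by send-command:
$ aws ssm list-command-invocations --command-id <command-id> --details --region eu-west-1    # per-instance status and output
The per-instance results feed directly into the time-to-detect and time-to-recovery questions from the earlier slide.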
21. Big challenges to chaos engineering • Chaos Engineering won’t make your system more robust, people will. • Chaos Engineering won’t replace __all__ the rest (testing, quality, …) • Chaos Engineering is NOT the only way to learn from failure. • Rollbacks are HARD because of state. • Your systems will continue to fail, sorry. • Starting is perceived as hard!
22. Big challenges to chaos engineering Mostly cultural • No time or flexibility to simulate disasters. • Teams already spending all of their time fixing things. • Can be very political. • Might force deep conversations. • Deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.
23. Thank you! © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved. Adrian Hornsby https://medium.com/@adhorn adhorn