Improve resiliency and performance with controlled chaos engineering - Stockholm Chaos & Reliability Engineering June 8 2021

Presented at Stockholm Chaos & Reliability Engineering, June 8th, 2021.

@gunnargrosch

AWS Fault Injection Simulator is a fully managed chaos engineering service that helps you improve application resiliency by making it easy and safe to perform controlled chaos engineering experiments on AWS. In this session, we'll see how chaos engineering with AWS FIS can help improve our application's resiliency and performance, uncover hidden issues, expose blind spots, and more. We will also look at how to automate experiments to run them continuously.

AWS Well-Architected Framework
https://aws.amazon.com/architecture/well-architected/

AWS Fault Injection Simulator
https://aws.amazon.com/fis/

AWS FIS Documentation
https://docs.aws.amazon.com/fis/

AWS FIS Samples
https://github.com/aws-samples/aws-fault-injection-simulator-samples

Gunnar Grosch

June 08, 2021

Transcript

  1. © 2021, Amazon Web Services, Inc. or its Affiliates.

    Gunnar Grosch, @gunnargrosch
    Improve resiliency and performance with controlled chaos engineering
  2. Agenda

    • Challenges with distributed systems
    • Why is chaos engineering hard?
    • Introducing AWS Fault Injection Simulator (FIS)
    • Key features
    • Use cases
    • Multiple demos along the way
  3. Distributed systems are complex

    Diagram labels: Client, Network, Server, Message, Reply
    https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
  4. Traditional testing is not enough

    TESTING = VERIFYING A KNOWN CONDITION
    • Unit testing of components: tested in isolation to ensure function meets expectations
    • Functional testing of integrations: each execution path tested to assure expected results
  5. And it can get more complicated…

    Log output (the same two lines repeated many times):
    IOError: No space left on device
    close failed in file object destructor:
    Log rotation diagram: logfile, logfile.0, logfile.1, logfile.2, logfile.3, logfile.n, each followed by ROTATE
  6. Chaos engineering

    Stress, Observe, Improve
    • Improve resilience and performance
    • Uncover hidden issues
    • Expose blind spots
    • Monitoring, observability, and alarm
    • And more
  7. Phases of chaos engineering

    • Steady state
    • Hypothesis
    • Run experiment
    • Verify
    • Improve
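The phases listed on this slide can be sketched as a small loop. This is a toy illustration only, not FIS code: the function name `run_chaos_experiment`, the `ExperimentResult` type, and the in-memory `system` dictionary are all hypothetical stand-ins for a real workload and its health checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool    # was the system healthy before the fault?
    hypothesis_held: bool  # did it stay healthy while the fault was active?


def run_chaos_experiment(
    check_steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Steady state: never start an experiment on an unhealthy system.
    if not check_steady_state():
        return ExperimentResult(steady_before=False, hypothesis_held=False)
    # Hypothesis: the steady state still holds while the fault is active.
    try:
        inject_fault()                # Run experiment
        held = check_steady_state()   # Verify
    finally:
        rollback()                    # always undo the fault
    return ExperimentResult(steady_before=True, hypothesis_held=held)


# Toy system: three replicas, healthy as long as at least two are up.
system = {"healthy_replicas": 3}
result = run_chaos_experiment(
    check_steady_state=lambda: system["healthy_replicas"] >= 2,
    inject_fault=lambda: system.update(healthy_replicas=system["healthy_replicas"] - 1),
    rollback=lambda: system.update(healthy_replicas=3),
)
```

The final phase, Improve, is the human step: when `hypothesis_held` is false, the finding feeds back into the design before the experiment is run again.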
  8. Why is chaos engineering difficult?

    1. Stitch together different tools and homemade scripts
    2. Agents or libraries required to get started
    3. Difficult to ensure safety
    4. Difficult to reproduce “real-world” events (multiple failures at once)
  9. AWS Fault Injection Simulator

    Fully managed chaos engineering service
    • Easy to get started
    • Real-world conditions
    • Safeguards
  10. Easy to get started

    • No need to integrate multiple tools and homemade scripts or install agents
    • Use the AWS Management Console or the AWS CLI
    • Use pre-existing experiment templates and get started in minutes
    • Easily share it with others
  11. Real-world conditions

    • Run experiments in sequence of events or in parallel
    • Target all levels of the system (host, infrastructure, network, etc.)
    • Real faults injected at the service control plane level!
  12. Safeguards

    • “Stop conditions” alarms
    • Integration with Amazon CloudWatch
    • Built-in rollbacks
    • Fine-grained IAM controls
  13. Components

    • Experiment templates
    • Experiments
    • Actions
    • Targets
  14. Actions

    Actions are the fault injection actions executed during an experiment.
    Action ID format: aws:<service-name>:<action-type>
    Actions include:
    • Fault type
    • Targeted resources
    • Timing relative to any other actions
    • Fault-specific parameters, such as duration, rollback behavior, or the portion of requests to throttle
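Two of the points on this slide, the `aws:<service-name>:<action-type>` ID format and timing relative to other actions, can be illustrated with a short sketch. It assumes that relative timing is expressed with a `startAfter` list, as in the deck's sample actions; the helper names `parse_action_id` and `start_order` are hypothetical, and the parser only reflects the three-part pattern shown here, not an official grammar. `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import NamedTuple


class ActionId(NamedTuple):
    service: str
    action_type: str


def parse_action_id(action_id: str) -> ActionId:
    """Split an ID of the form aws:<service-name>:<action-type>."""
    parts = action_id.split(":")
    if len(parts) != 3 or parts[0] != "aws" or not all(parts):
        raise ValueError(f"expected aws:<service-name>:<action-type>, got {action_id!r}")
    return ActionId(service=parts[1], action_type=parts[2])


def start_order(actions: dict) -> list:
    """Return one valid order in which the actions could start."""
    # Map each action name to the set of actions it must wait for.
    graph = {name: set(spec.get("startAfter", [])) for name, spec in actions.items()}
    # static_order() yields prerequisites first and raises CycleError on loops.
    return list(TopologicalSorter(graph).static_order())


# Sample actions modeled on the deck's "StopInstances" and "Wait" example.
actions = {
    "StopInstances": {"actionId": "aws:ec2:stop-instances"},
    "Wait": {"actionId": "aws:fis:wait", "startAfter": ["StopInstances"]},
}
order = start_order(actions)
parsed = [parse_action_id(actions[name]["actionId"]) for name in order]
```

Note that this produces a single sequential order for illustration; actions without a `startAfter` dependency between them may run in parallel.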
  15. Actions

    "actions": {
        "StopInstances": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {
                "startInstancesAfterDuration": "PT2M"
            },
            "targets": {
                "Instances": "RandomInstancesInAZ"
            }
        },
        "Wait": {
            "actionId": "aws:fis:wait",
            "parameters": {
                "duration": "PT1M"
            },
            "startAfter": [
                "StopInstances"
            ]
        }
    }
  16. Targets

    Targets define one or more AWS resources on which to carry out an action.
    Targets include:
    • Resource type
    • Resource IDs, tags, and filters
    • Selection mode (e.g., ALL, RANDOM)
  17. Targets

    "targets": {
        "RandomInstancesInAZ": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {
                "Env": "test"
            },
            "filters": [
                {
                    "path": "Placement.AvailabilityZone",
                    "values": ["us-east-1a"]
                },
                {
                    "path": "State.Name",
                    "values": ["running"]
                },
                {
                    "path": "VpcId",
                    "values": ["vpc-0123456789"]
                }
            ],
            "selectionMode": "COUNT(2)"
        }
    }
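To make the target definition above more concrete, here is a rough sketch of how tags, path filters, and a selection mode could narrow a list of resources. It is not how FIS is implemented: `resolve_targets` and `get_path` are hypothetical helpers, the instance records are invented, tags are simplified to a flat dictionary, and only the `ALL` and `COUNT(n)` modes are handled. Capping the count at the number of matches is a simplification of this sketch.

```python
import random
import re


def get_path(resource: dict, path: str):
    """Follow a dotted path such as 'State.Name' through nested dicts."""
    value = resource
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def resolve_targets(resources: list, target: dict, rng=None) -> list:
    rng = rng or random.Random()
    tags = target.get("resourceTags", {})
    filters = target.get("filters", [])
    # Keep resources that carry every tag and satisfy every filter.
    matched = [
        r for r in resources
        if all(r.get("Tags", {}).get(k) == v for k, v in tags.items())
        and all(get_path(r, f["path"]) in f["values"] for f in filters)
    ]
    mode = target["selectionMode"]
    if mode == "ALL":
        return matched
    count = re.fullmatch(r"COUNT\((\d+)\)", mode)
    if count:
        # Pick n matches at random (capped at the number available).
        return rng.sample(matched, min(int(count.group(1)), len(matched)))
    raise ValueError(f"unsupported selection mode in this sketch: {mode}")


def instance(instance_id, az, state, env):
    # Invented record shape, loosely following the filter paths on the slide.
    return {"InstanceId": instance_id, "Placement": {"AvailabilityZone": az},
            "State": {"Name": state}, "VpcId": "vpc-0123456789", "Tags": {"Env": env}}


instances = [
    instance("i-aaaa", "us-east-1a", "running", "test"),
    instance("i-bbbb", "us-east-1a", "running", "test"),
    instance("i-cccc", "us-east-1b", "running", "test"),
    instance("i-dddd", "us-east-1a", "stopped", "test"),
    instance("i-eeee", "us-east-1a", "running", "prod"),
]
target = {
    "resourceType": "aws:ec2:instance",
    "resourceTags": {"Env": "test"},
    "filters": [
        {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]},
        {"path": "State.Name", "values": ["running"]},
        {"path": "VpcId", "values": ["vpc-0123456789"]},
    ],
    "selectionMode": "COUNT(2)",
}
selected = resolve_targets(instances, target)
```

Only `i-aaaa` and `i-bbbb` pass the tag and all three filters, so `COUNT(2)` selects both of them here.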
  18. Experiment templates

    Experiment templates define an experiment and are used in the start-experiment request.
    Experiment templates include:
    • Actions
    • Targets
    • Stop condition alarms
    • IAM role
    • Description
    • Tags
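A template with the parts listed on this slide could be assembled and submitted roughly as follows. This is a sketch under assumptions: `build_template`, `undefined_targets`, and `create_and_start` are hypothetical helper names, the ARNs are the placeholder values used elsewhere in the deck, and the action parameter name `startInstancesAfterDuration` is taken from the deck's actions sample and should be checked against the current FIS documentation. The boto3 calls need the `boto3` package and AWS credentials, so they are kept in a function that is not run here.

```python
import json
import uuid


def build_template(role_arn: str, alarm_arn: str) -> dict:
    """Assemble keyword arguments for creating an experiment template."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop and Restart One Random Instance",
        "roleArn": role_arn,
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "myInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Env": "test"},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "StopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "description": "stop the instances",
                "parameters": {"startInstancesAfterDuration": "PT2M"},
                "targets": {"Instances": "myInstance"},
            }
        },
        "tags": {"Name": "StopAndRestartRandomInstance"},
    }


def undefined_targets(template: dict) -> list:
    """List target names that actions reference but the template never defines."""
    defined = set(template["targets"])
    return sorted(
        name
        for action in template["actions"].values()
        for name in action.get("targets", {}).values()
        if name not in defined
    )


def create_and_start(template: dict) -> str:
    """Create the template, start one experiment from it, return the experiment ID."""
    import boto3  # imported lazily: requires boto3 and AWS credentials

    fis = boto3.client("fis")
    created = fis.create_experiment_template(**template)
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=created["experimentTemplate"]["id"],
    )
    return started["experiment"]["id"]


template = build_template(
    role_arn="arn:aws:iam::0123456789:role/MyFISExperimentRole",
    alarm_arn="arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic",
)
template_json = json.dumps(template, indent=2)
```

Checking `undefined_targets(template)` before calling `create_and_start(template)` catches a misspelled target name locally instead of in the API response.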
  19. Experiment templates

    Callouts: Name, Description, IAM role, Stop conditions, Targets, Actions

    {
        "tags": {
            "Name": "StopAndRestartRandomInstance"
        },
        "description": "Stop and Restart One Random Instance",
        "roleArn": "arn:aws:iam::0123456789:role/MyFISExperimentRole",
        "stopConditions": [
            {
                "source": "aws:cloudwatch:alarm",
                "value": "arn:aws:cloudwatch:us-east-1:0123456789:alarm:No_Traffic"
            }
        ],
        "targets": {
            "myInstance": {
                "resourceTags": {
                    "Env": "test"
                },
                "resourceType": "aws:ec2:instance",
                "selectionMode": "COUNT(1)"
            }
        },
        "actions": {
            "StopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "description": "stop the instances",
                "parameters": {
                    "startInstancesAtEnd": "true",
                    "duration": "PT2M"
                },
                "targets": {
                    "Instances": "myInstance"
                }
            }
        }
    }
  20. Experiment templates

    Experiment template A
    • Stop conditions: Amazon CloudWatch alarm
    • Targets: specific EC2 instances (i-aaaa, i-bbbb, i-cccc)
    • Actions: Action 1, Action 2
    Experiment template B
    • Stop conditions: Amazon CloudWatch alarms
    • Targets: all EC2 instances with “chaos-ready” tag
    • Actions: Action 3, Action 1, Action 2
  21. Experiments

    Experiments are a snapshot of the experiment template as it was when the experiment was launched, with a couple of additions.
    Experiments include:
    • Snapshot of the experiment
    • Creation and start time
    • Status
    • Execution ID
    • Experiment template ID
    • IAM role ARN
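The snapshot idea on this slide can be illustrated with a deep copy: later edits to the template do not change an experiment that has already started. This is a conceptual sketch only; `snapshot_experiment` and the field names `experimentTemplateId`, `creationTime`, and `status` are assumptions chosen for the illustration rather than the service's actual data model.

```python
import copy
from datetime import datetime, timezone


def snapshot_experiment(template: dict, template_id: str, experiment_id: str) -> dict:
    """Freeze the template as it is right now and add run-specific fields."""
    experiment = copy.deepcopy(template)  # independent of later template edits
    experiment.update(
        {
            "id": experiment_id,
            "experimentTemplateId": template_id,
            "creationTime": datetime.now(timezone.utc).isoformat(),
            "status": "pending",
        }
    )
    return experiment


template = {
    "description": "Stop and Restart One Random Instance",
    "targets": {"myInstance": {"selectionMode": "COUNT(1)"}},
}
experiment = snapshot_experiment(template, template_id="T-1", experiment_id="E-1")

# Editing the template afterwards leaves the running experiment untouched.
template["targets"]["myInstance"]["selectionMode"] = "ALL"
```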
  22. Supported fault injections

    • Server error (EC2)
    • Stop, reboot, and terminate instance(s) (EC2)
    • API throttling
    • Increased memory or CPU load (EC2)
    • Kill process (EC2)
    • Latency injection (EC2)
    • Container instance termination (ECS)
    • Increase memory or CPU consumption per task (ECS)
    • Terminate nodes (EKS)
    • Database stop, reboot, and failover (RDS)
    • And more to come in 2021
  23. Use cases

    • One-off experiments
    • Periodic game days
    • Automated experiments
  27. Automated experiments

    • Recurring scheduled experiments
    • Event-triggered experiments
    • Continuous delivery experiments
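For the continuous delivery case, a pipeline stage could start an experiment, poll its status, and let the deployment proceed only if the experiment completes without a stop condition firing. The sketch below assumes the experiment statuses `completed`, `stopped`, and `failed` are terminal; `wait_for_experiment`, `pipeline_should_proceed`, and `fis_status_getter` are hypothetical helper names. The boto3 lookup needs the `boto3` package and AWS credentials, so the demonstration uses a fake status source instead.

```python
import time
from typing import Callable

# Statuses treated as final in this sketch.
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def wait_for_experiment(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the experiment reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("experiment did not finish within the polling budget")


def pipeline_should_proceed(status: str) -> bool:
    # "stopped" means the run was halted (for example by a stop condition),
    # so only a clean completion lets the deployment continue.
    return status == "completed"


def fis_status_getter(experiment_id: str) -> Callable[[], str]:
    """Build a status callable backed by the FIS API (not run in this sketch)."""
    import boto3  # imported lazily: requires boto3 and AWS credentials

    fis = boto3.client("fis")
    return lambda: fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]


# Demonstration with a fake status sequence instead of a real experiment.
fake_statuses = iter(["initiating", "running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(fake_statuses), sleep=lambda _: None)
proceed = pipeline_should_proceed(final_status)
```

Passing `get_status` and `sleep` in as callables keeps the gate logic testable without AWS access; in a pipeline, `fis_status_getter(experiment_id)` would supply the real status.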
  33. Resources

    AWS Well-Architected Framework
    https://aws.amazon.com/architecture/well-architected/
    AWS Fault Injection Simulator
    https://aws.amazon.com/fis/
    AWS FIS Documentation
    https://docs.aws.amazon.com/fis/
    AWS FIS Samples
    https://github.com/aws-samples/aws-fault-injection-simulator-samples