Disaster Recovery Testing vs Chaos Engineering

Yury Nino

January 25, 2026

Transcript

  1. Chaos is about complete disorder and confusion. Engineering is about designing, building, and using engines, machines, and structures.
  2. Chaos Engineering: It is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. https://principlesofchaos.org
  3. Disaster is an unexpected event that causes significant destruction or adverse consequences. Recovery is a return to a normal state of health, mind, or strength. Testing is the action of checking that someone or something is working as expected.
  4. Disaster Recovery Testing: It is a program performed internally at Google, in which a group of engineers plan and execute real and fictitious outages to test the effective response of the involved teams.
  5. If I joined to learn Chaos practices, am I here to see the same thing I could have found in a dictionary?
  6. Focus. Chaos Engineering: proactive identification of systemic weaknesses in complex systems under turbulent conditions. Disaster Recovery Testing: tests reactive response and recovery procedures in the event of a catastrophic failure or disaster.
  7. Method. Chaos Engineering: controlled experiments are conducted in production, introducing failures (e.g., network latency, server outages) to observe system behavior and uncover hidden vulnerabilities. Disaster Recovery Testing: involves simulating a major outage (e.g., a data center failure) and testing the ability to restore systems and data within a defined recovery time objective (RTO).
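
To make the Chaos Engineering method concrete, here is a minimal experiment skeleton in Python: verify a steady-state hypothesis (p95 latency under an SLO), inject a failure, observe, and always roll back. The service URL, the SLO value, and the inject/revert hooks are placeholder assumptions, not part of any particular tool.

```python
# A minimal chaos-experiment skeleton. SERVICE_URL, P95_SLO_SECONDS, and the
# inject/revert callbacks are hypothetical placeholders for this sketch.
import statistics
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"   # assumption: a local test service
P95_SLO_SECONDS = 0.5                          # steady-state hypothesis

def measure_p95(samples: int = 50) -> float:
    """Measure request latency and return the 95th percentile."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        urllib.request.urlopen(SERVICE_URL, timeout=5).read()
        latencies.append(time.monotonic() - start)
    latencies.sort()
    return latencies[int(0.95 * (len(latencies) - 1))]

def run_experiment(inject, revert):
    """Verify the steady state, inject a failure, observe, then roll back."""
    assert measure_p95() < P95_SLO_SECONDS, "steady state not met; abort experiment"
    inject()                     # e.g. add 100 ms of network latency via tc/toxiproxy
    try:
        degraded_p95 = measure_p95()
        print(f"p95 under failure: {degraded_p95:.3f}s "
              f"({'within' if degraded_p95 < P95_SLO_SECONDS else 'violates'} SLO)")
    finally:
        revert()                 # always restore the system, even if checks fail
```
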
  8. Goal. Chaos Engineering: to build confidence in the system's ability to withstand unexpected disruptions and improve overall resilience. Disaster Recovery Testing: to minimize downtime and data loss in the event of a major disruption, ensuring business continuity.
  9. It is not because I work for Google, but let me talk a little bit more about DiRT.
  10. Disaster Recovery Testing: Google tests software and systems, but also people, preparation, processes, and response tools. It's about learning and finding single points of failure; therefore the scope of services and systems is broad. The program intentionally disrupts services in order to know how to respond and provide reliability. Established in 2006 to exercise response to production emergencies.
  11. What does Google Test? Software: modifying live service configurations, or bringing up services with known bugs. Infrastructure: stress testing large complex architectures, validating SLOs, and ensuring resilience is maintained during disruption. Access Controls: including security, compliance, and privacy. People and Workflows: removing people who might have knowledge or experience.
  12. Disaster Recovery Testing tiers. Tier 3: testing the resilience of a specific system or product [no expected external impact]. Tier 2: testing the resilience of dependencies of a shared system or product. Tier 1: testing the resilience of the organizational response to an enterprise-level event.
  13. Tier 3 Example: Deploy a bad configuration file.
      Scenario: A bad configuration file is included in the next release, generating more CPU and Memory consumption. This impacts only the users of an experimental feature in a product.
      Response:
      • Incident management protocols from service’s owners.
      • The continuous testing of the services defined by the owners.
      • Validating disaster readiness and response of a service and the team.
      • Identification and expansion of standard tests that can be used to de-risk Tier 2 and Tier 1 testing.
      What can you learn?
      • If a team is able to effectively perform IMAG.
      • If a service is resilient to a specific class of failure.
      • If the service is not overly dependent on a specific resource.
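
A minimal sketch of how the Tier 3 scenario could be observed from the host side: sample CPU and memory during the experiment and record anything that should have tripped an alert. The thresholds, duration, and use of psutil are illustrative assumptions, not Google's tooling.

```python
# Watch host CPU/memory while the bad configuration is live and confirm that the
# extra consumption would actually trip an alert. Thresholds are assumptions.
import time
import psutil

CPU_ALERT_PCT = 80.0
MEM_ALERT_PCT = 85.0

def watch(duration_s: int = 300, interval_s: int = 5) -> list[str]:
    """Sample host CPU/memory and return any threshold breaches observed."""
    breaches = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        cpu = psutil.cpu_percent(interval=interval_s)   # blocks for interval_s
        mem = psutil.virtual_memory().percent
        if cpu > CPU_ALERT_PCT:
            breaches.append(f"CPU at {cpu:.0f}% (> {CPU_ALERT_PCT}%)")
        if mem > MEM_ALERT_PCT:
            breaches.append(f"memory at {mem:.0f}% (> {MEM_ALERT_PCT}%)")
    return breaches

if __name__ == "__main__":
    alerts = watch(duration_s=60)
    print("alerts raised:" if alerts else "no alerts raised", *alerts, sep="\n  ")
```
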
  14. Tier 2 Example: Run at Service Level.
      Scenario: An unusually large traffic spike degrades the latency of a heavily used shared internal service. The service remains barely within its published SLA for an extended period.
      Response:
      • Communication to key service consumers.
      • Incident management protocols from service’s owners.
      • Emergency serving capacity increases.
      • Graceful degradation and external messaging to customers.
      What can you learn?
      • Do service consumers tolerate worst-case scenarios, or do they assume the average experience as a baseline?
      • Do your alerting and monitoring systems behave the way you want for both service providers and consumers in this scenario?
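
A minimal sketch of the Tier 2 traffic spike: drive a burst of concurrent requests at a stand-in service and check whether the observed p99 latency stays within a published SLA. The URL, request counts, and the 300 ms SLA are assumptions for illustration.

```python
# Generate an unusually large burst of traffic and report latency percentiles
# against an assumed SLA; the target URL and numbers are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVICE_URL = "http://localhost:8080/api"
SLA_P99_SECONDS = 0.3

def timed_request(_):
    start = time.monotonic()
    urllib.request.urlopen(SERVICE_URL, timeout=10).read()
    return time.monotonic() - start

def spike(requests: int = 2000, concurrency: int = 200):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_request, range(requests)))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"p50={p50:.3f}s  p99={p99:.3f}s  "
          f"SLA {'met' if p99 <= SLA_P99_SECONDS else 'violated'}")

if __name__ == "__main__":
    spike()
```
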
  15. Tier 1 Example: Hacked!
      Scenario: Redeploy an application that uses Apache Log4j 2 version 2.14.0, which has a security vulnerability [CVE-2021-44228], launch a script that exploits this vulnerability, and validate that your controls are able to monitor and generate alerts.
      Response:
      • The security team invokes incident management protocols and the business continuity plan.
      • All impacted users are notified.
      • Support staff isolate impacted workstations and issue emergency alternative-OS laptops.
      • New policies on the fly. Higher demand on shared computing resources.
      What can you learn?
      • Creativity and a culture that promotes flexibility help a lot.
      • Communication matters, especially when time is limited.
      • Expect the unexpected. Back up (and restore) essential data automatically. Validate backup and restoration!
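
A minimal sketch of one control check for the Tier 1 scenario: scan application logs for the JNDI lookup strings associated with CVE-2021-44228 and exit non-zero if any are found, so a scheduler can page. The pattern is deliberately simplistic; production WAF or SIEM rules also cover obfuscated variants.

```python
# Scan log files for the JNDI lookup pattern used by Log4Shell to confirm that
# detection and alerting controls would fire. Simplified pattern for the sketch.
import re
import sys

JNDI_PATTERN = re.compile(r"\$\{[^}]{0,40}jndi\s*:", re.IGNORECASE)

def scan(path: str) -> int:
    hits = 0
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if JNDI_PATTERN.search(line):
                hits += 1
                print(f"{path}:{lineno}: possible Log4Shell lookup: {line.strip()[:120]}")
    return hits

if __name__ == "__main__":
    total = sum(scan(p) for p in sys.argv[1:])
    sys.exit(1 if total else 0)   # non-zero exit lets a scheduled job page on detection
```
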
  16. Practical Examples.
      Blackholing internal traffic: Add a VPC rule to a cloud project that routes traffic destined for the IP addresses of some of their hosts in-region (say, VMs running MySQL DBs) to some other project where the traffic is dropped. Reversion is by deleting the rule. The chief risk is that more traffic than expected is captured by the rule, so the outage is bigger than planned.
      Redirect traffic away from a region using a load balancer: The customer can remove a backend from a load balancer and drain connections. This can be used to simulate failover from one zone or region to another by removing a managed instance group that includes all the resources in one zone or region, forcing the LB to send the traffic to another region. The risk is that the resources making up the GCLB backends will remain up but will not serve any traffic.
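
A minimal pre-flight check for the blackholing example above, aimed at the stated risk that the rule captures more traffic than expected: list which known hosts fall inside the proposed destination range before the rule is applied. The CIDR and the host inventory are made-up assumptions for the sketch.

```python
# Before adding the blackhole rule, enumerate which inventoried hosts its
# destination range would capture, so the blast radius is no larger than planned.
import ipaddress

PROPOSED_RANGE = ipaddress.ip_network("10.128.4.0/24")   # CIDR the rule would blackhole

# Hypothetical inventory: name -> IP of in-region hosts (e.g. VMs running MySQL).
INVENTORY = {
    "mysql-primary": "10.128.4.10",
    "mysql-replica": "10.128.4.11",
    "web-frontend": "10.128.7.20",
}

def captured_hosts(cidr, inventory: dict[str, str]) -> list[str]:
    return [name for name, ip in inventory.items()
            if ipaddress.ip_address(ip) in cidr]

if __name__ == "__main__":
    hit = captured_hosts(PROPOSED_RANGE, INVENTORY)
    print(f"{len(hit)} host(s) in blast radius: {', '.join(hit) or 'none'}")
```
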
  17. You told me that this talk would be about CE vs DiRT. Yes, but I have to sell DiRT.
  18. Dimension: Objective. Disaster Recovery Testing: to ensure the system can recover from catastrophic failures (e.g., data center outages). Chaos Engineering: to understand system behavior and improve reliability by intentionally causing failures in a controlled manner.
  19. Dimension: Scope. Disaster Recovery Testing: focuses on business continuity plans, backup systems, and recovery processes. Chaos Engineering: focuses on distributed systems’ robustness, identifying weaknesses proactively.
  20. Dimension: Scenario. Disaster Recovery Testing: simulates severe, often rare, events such as data loss, hardware failures, or site outages. Chaos Engineering: simulates everyday failures like service crashes, latency spikes, or network disruptions.
  21. Dimension: Methodology. Disaster Recovery Testing: involves predefined tests and drills to validate recovery plans and timelines. Chaos Engineering: uses experiments designed to stress the system in various ways, often run continuously or periodically.
  22. Dimension: Automation. Disaster Recovery Testing: often relies on manual or semi-automated procedures and is usually not fully automated. Chaos Engineering: highly automated, with tools such as Chaos Monkey that run experiments autonomously.
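
To illustrate the automation contrast, here is a minimal Chaos Monkey-style loop: pick one opted-in instance at random and hand it to a terminate callback. This is not Netflix's implementation; the opt-in flag and the terminate hook are placeholders for whatever your platform provides.

```python
# A toy "chaos monkey" pass: randomly terminate one instance from groups that
# opted in, to verify the group self-heals. Fleet data and callback are stubs.
import random

def pick_victim(instances: list[dict]):
    """Choose one instance at random, but only from groups that opted in."""
    eligible = [i for i in instances if i.get("chaos_opt_in")]
    return random.choice(eligible) if eligible else None

def run_once(instances: list[dict], terminate) -> None:
    victim = pick_victim(instances)
    if victim is None:
        print("no opted-in instances; nothing to do")
        return
    print(f"terminating {victim['name']} to verify the group self-heals")
    terminate(victim)            # e.g. call your cloud API / orchestrator here

if __name__ == "__main__":
    fleet = [{"name": "web-1", "chaos_opt_in": True},
             {"name": "web-2", "chaos_opt_in": True},
             {"name": "db-1", "chaos_opt_in": False}]
    run_once(fleet, terminate=lambda inst: None)   # dry run: no-op terminate
```
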
  23. Dimension: Frequency. Disaster Recovery Testing: conducted periodically (e.g., annually or quarterly) as a planned event. Chaos Engineering: can be conducted frequently, even as part of normal operations, to gain ongoing insights.
  24. Dimension: Measurement. Disaster Recovery Testing: evaluates recovery metrics like Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Chaos Engineering: assesses system performance, resilience, and the ability to handle stress under failure scenarios.
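
A small worked example of the two recovery metrics: RTO measured as downtime from outage start to restoration, RPO measured as the data window back to the last good backup. The timestamps are invented for illustration.

```python
# RTO: how long the service was down. RPO: how much data (in time) was lost back
# to the last successful backup. All timestamps below are made up for the example.
from datetime import datetime

last_backup      = datetime(2026, 1, 25, 2, 0)    # last successful backup
outage_start     = datetime(2026, 1, 25, 3, 15)   # disaster strikes
service_restored = datetime(2026, 1, 25, 5, 45)   # service back online

rto_achieved = service_restored - outage_start    # downtime: 2 h 30 min
rpo_achieved = outage_start - last_backup         # data loss window: 1 h 15 min

print(f"RTO achieved: {rto_achieved}, RPO achieved: {rpo_achieved}")
# Compare against the objectives, e.g. RTO <= 4 h and RPO <= 2 h in this sketch.
```
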
  25. Dimension: Focus. Disaster Recovery Testing: ensures data integrity, service continuity, and minimal downtime during recovery. Chaos Engineering: tests the system’s ability to self-heal, maintain service levels, and minimize the blast radius of issues.
  26. Dimension: Example. Disaster Recovery Testing: testing if a backup system can restore data after a simulated data center failure. Chaos Engineering: simulating server outages to see if a distributed system can balance load and maintain performance.