Chaos Engineering - Google Next Community Summit 2018 - Tammy Butow

Tammy Bryant Butow

July 23, 2018

Transcript

  1. HELLO!
     @tammybutow
     Principal SRE @ Gremlin
     Previously: SRE Manager @ Dropbox (Databases, Block Storage, Code Workflows)
     Used Chaos Engineering to get a 10x reduction in incidents!
  2. GREMLIN
     • Gremlin launched in Dec 2017. github.com/gremlin
     • We are practitioners of Chaos Engineering. We break things on purpose!
     • We build software that helps engineers build more reliable systems through failure injection.
     gremlin.com @tammybutow #GoogleNext18
  3. CHAOS ENGINEERING
     “Thoughtful planned experiments designed to reveal the weaknesses in our systems.”
     - Kolton Andrus, CEO @ Gremlin, previously engineer @ Netflix & Amazon
  4. FULL-STACK CHAOS ENGINEERING
     • You can inject chaos at any layer of your stack to increase system resilience.
     • Injecting failure will also train your engineering teams for on-call.
     • Include engineers, engineering managers, designers, PMs, TPMs, VPs and more!
     Layers: API, Application, Caching, Database, Operating System, Hardware, Rack, Network / Power
  5. CONTROLLED CHAOS ENGINEERING
     • Eventually systems will break for you in many undesired ways.
     • Be proactive and break them first, on purpose, with controlled chaos.
     • Advanced Chaos Engineering involves doing Chaos Engineering with CI/CD.
  6. HOW TO RUN A CHAOS EXPERIMENT
     1. Form a hypothesis
     2. Baseline your metrics
     3. Consider the blast radius
     4. Run your Chaos Engineering experiment
     5. Measure the results of your experiment
     6. Find & fix issues or scale the experiment
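The six steps on this slide can be read as a small control loop. The following is a minimal sketch only, not Gremlin's API or the speaker's tooling: `run_chaos_experiment`, `ExperimentResult`, the `targets` list (standing in for the blast radius) and the toy error-rate "system" are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_chaos_experiment(
    measure: Callable[[], float],
    inject: Callable[[Sequence[str]], None],
    rollback: Callable[[Sequence[str]], None],
    targets: Sequence[str],
    max_allowed_increase: float,
) -> ExperimentResult:
    """Hypothesis (step 1): the metric rises by at most max_allowed_increase."""
    baseline = measure()          # step 2: baseline your metrics
    inject(targets)               # steps 3-4: inject failure only into `targets`
    try:
        observed = measure()      # step 5: measure the results
    finally:
        rollback(targets)         # always undo the injected failure
    held = (observed - baseline) <= max_allowed_increase
    return ExperimentResult(baseline, observed, held)


# Toy stand-in for a real system: an error rate that rises per affected host.
error_rate = {"value": 0.01}


def measure() -> float:
    return error_rate["value"]


def inject(targets: Sequence[str]) -> None:
    error_rate["value"] += 0.02 * len(targets)


def rollback(targets: Sequence[str]) -> None:
    error_rate["value"] -= 0.02 * len(targets)


result = run_chaos_experiment(
    measure, inject, rollback, targets=["host-1"], max_allowed_increase=0.05
)
```

Step 6 is left to the caller: if `result.hypothesis_held` is true, the experiment can be rerun with a longer `targets` list; if not, the weakness is fixed before going further.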
  7. HOW TO CHOOSE A CHAOS EXPERIMENT
     1. Identify your top 5 critical services
     2. Choose one of these services (e.g. Kafka)
     3. Whiteboard the service with your team
     4. Select the experiment: resource/state/network
     5. Determine the scope: number of machines/impact
  8. WHAT SHOULD YOU MEASURE?
     • Availability: 500s
     • Service specific KPIs
     • System metrics: CPU, IO, DISK
     • Customer complaints
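As one way to turn the availability bullet into a number, the sketch below computes availability from a sample of response codes. It assumes "500s" on the slide means HTTP 5xx server errors; the function name `availability` is hypothetical and not taken from the talk.

```python
from typing import Sequence


def availability(status_codes: Sequence[int]) -> float:
    """Fraction of requests that did not return an HTTP 5xx status.

    Assumption: "500s" on the slide refers to HTTP 5xx responses.
    """
    if not status_codes:
        raise ValueError("need at least one request to compute availability")
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return 1 - failures / len(status_codes)


# Three successful requests and one server error.
sample = availability([200, 200, 500, 200])
```

Comparing this value before and during an experiment gives a simple baseline-versus-observed signal.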
  9. POD/CONTAINER CHAOS ENGINEERING ⚡
     • Resources: CPU, DISK, IO & Memory
     • State: Processes, Shutdown & Time Travel
     • Network: Blackhole, DNS, Latency & Packet Loss
  10. HELLO WORLD OF CHAOS ENGINEERING
      chaos $ cd scripts
      chaos $ ./burncpu.sh
      github.com/tammybutow/chaosengineeringbootcamp
  11. RESOURCE CHAOS ENGINEERING
      • We can increase CPU, Disk, Memory & IO consumption.
      • Good to catch problems before they turn into high severity incidents and downtime for customers.
      • Chaos Engineering enables you to proactively monitor your monitoring for issues.
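A CPU experiment of this kind can be approximated with a time-bounded busy loop. This is a hedged sketch, not the contents of `burncpu.sh` from the bootcamp repository (which are not shown in the deck); `burn_cpu` is a hypothetical name, and it saturates a single core only.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin one core in a tight loop for `seconds`; return the loop count.

    The deadline bounds the experiment so it stops on its own.
    """
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


# Keep the duration short for a first run, then lengthen it gradually.
spins = burn_cpu(0.05)
```

While the loop runs, CPU dashboards and alerts for the host should react; if they do not, the experiment has found a gap in monitoring rather than in the service.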
  12. STATE CHAOS: PROCESS & TIME
      There are many ways to perform process chaos engineering experiments:
      • Kill one process
      • Loop kill a process
      • Spawn a new process
      • Fork bomb
      You can also do Time Travel Chaos Engineering!
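The first two bullets, killing one process and loop killing a process, can be sketched with Python's standard `subprocess` module. The "victim" here is a throwaway sleeping child process started by the script itself, and `start_victim`, `kill_once` and `loop_kill` are hypothetical helper names; restarting the victim inside the loop stands in for whatever supervisor would restart a real service.

```python
import subprocess
import sys
from typing import List


def start_victim() -> subprocess.Popen:
    """Start a harmless child process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_once(proc: subprocess.Popen) -> int:
    """Kill one process and return its exit code."""
    proc.kill()            # SIGKILL on POSIX, TerminateProcess on Windows
    proc.wait(timeout=5)   # reap the child so it does not linger as a zombie
    return proc.returncode


def loop_kill(rounds: int) -> List[int]:
    """Kill the victim `rounds` times, restarting it before each kill."""
    codes = []
    for _ in range(rounds):
        proc = start_victim()   # stand-in for a supervisor restart
        codes.append(kill_once(proc))
    return codes


exit_codes = loop_kill(2)
```

Pointing this at real services requires choosing the target by PID or name, which is deliberately left out of the sketch. The fork bomb bullet is omitted because it is unsafe to run outside an isolated, resource-limited environment.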
  13. STATE CHAOS: PODS/CONTAINERS
      • Kill self
      • Kill a container from the host
      • Use one container to kill another container
      • Use one container to kill several containers
      • Use several containers to kill several containers
  14. NETWORKING CHAOS: DNS
      • Perform regular DNS failover
      • Ensure you can handle DNS outages without impacting customers
      • Use Chaos Engineering to ensure your team is trained to handle DNS issues
  15. MOAR NETWORKING CHAOS
      • Latency: Inject latency into egress network traffic.
      • Packet Loss: Induce packet loss into egress network traffic.
      • Blackhole: Drops network traffic.
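The three attacks on this slide act on real network traffic, typically at the operating-system level. The sketch below swaps in a simpler, application-level technique to show the idea: it wraps an outbound call so that calls are delayed or dropped. `with_network_chaos`, `DroppedPacket` and the parameters are hypothetical names for illustration, not part of any tool mentioned in the talk.

```python
import random
import time
from typing import Callable, Optional


class DroppedPacket(Exception):
    """Raised when the wrapper simulates a lost request."""


def with_network_chaos(
    send: Callable[[str], str],
    latency_s: float = 0.0,
    loss_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap `send` so outbound calls are delayed and randomly dropped.

    loss_rate=1.0 behaves like a blackhole: every call is dropped.
    """
    rng = rng or random.Random()

    def chaotic_send(payload: str) -> str:
        if rng.random() < loss_rate:   # packet loss / blackhole
            raise DroppedPacket(payload)
        sleep(latency_s)               # added egress latency
        return send(payload)

    return chaotic_send


def echo(payload: str) -> str:
    return payload.upper()


slow_echo = with_network_chaos(echo, latency_s=0.01)
blackholed_echo = with_network_chaos(echo, loss_rate=1.0)
reply = slow_echo("ping")
```

The `rng` and `sleep` parameters exist so that the behaviour can be made deterministic in tests; callers of the wrapped function are then expected to handle `DroppedPacket` with timeouts or retries.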
  16. THANKS!
      @tammybutow, Principal SRE @ Gremlin
      Join us in the Chaos Slack: gremlin.com/slack
      Start breaking things: gremlin.com/community
      Come to chaosconf.io