Chaos Engineering Bootcamp

These are the slides from the Chaos Engineering Bootcamp I ran at Velocity 2017 in San Jose #VelocityConf

Tammy Bryant Butow

June 20, 2017

Transcript

  1. + LAYING THE FOUNDATION (9:00 - 10:30)
    + MORNING BREAK (10:30 - 11:00)
    + CHAOS TOOLS (11:00 - 11:30)
    + ADVANCED TOPICS + Q & A (11:30 - 12:30)
    THE CHAOS BOOTCAMP 4
  2. THANKS TO
    • DROPBOX
    • NETFLIX
    • DIGITALOCEAN
    • GOOGLE
    • AMAZON
    • NATIONAL AUSTRALIA BANK
    • DATADOG 5
  3. CHAOS ENGINEERING IS THE DISCIPLINE OF EXPERIMENTING ON A DISTRIBUTED

    SYSTEM IN ORDER TO BUILD CONFIDENCE IN THE SYSTEM’S CAPABILITY TO WITHSTAND TURBULENT CONDITIONS IN PRODUCTION. WHAT IS CHAOS ENGINEERING 7
  4. CHAOS ENGINEERING CAN BE THOUGHT OF AS THE FACILITATION OF

    EXPERIMENTS TO UNCOVER SYSTEMIC WEAKNESSES. 8
  5. 1. DEFINE STEADY STATE
    2. HYPOTHESIZE STEADY STATE WILL CONTINUE
    3. INTRODUCE VARIABLES THAT REFLECT REAL WORLD EVENTS
    4. TRY TO DISPROVE THE HYPOTHESIS
    PRINCIPLES OF CHAOS ENGINEERING 9
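    A minimal sketch of those four steps on one of the tutorial servers (the URL, request count and threshold are illustrative, not from the slides; burncpu.sh and pkill -u chaos appear later in the bootcamp):
    # 1. Define steady state: the share of HTTP requests that succeed.
    steady_state() {
      ok=0
      for i in $(seq 1 50); do
        code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/)
        [ "$code" = "200" ] && ok=$((ok + 1))
      done
      echo "$ok/50 requests returned 200"
    }
    steady_state        # 2. Hypothesis: this stays at ~50/50 while we inject failure.
    ./burncpu.sh        # 3. Introduce a variable that reflects a real-world event (CPU exhaustion).
    steady_state        # 4. Try to disprove the hypothesis: did availability drop?
    pkill -u chaos      # Clean up the injected fault.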
  6. DISTRIBUTED SYSTEMS HAVE NUMEROUS SYSTEM PARTS. HARDWARE AND FIRMWARE FAILURES ARE COMMON.
    OUR SYSTEMS AND COMPANIES SCALE RAPIDLY. HOW DO YOU BUILD A RESILIENT SYSTEM WHILE YOU SCALE?
    WE USE CHAOS!
    WHY DO DISTRIBUTED SYSTEMS NEED CHAOS? 10
  7. YOU CAN INJECT CHAOS AT ANY LAYER TO INCREASE SYSTEM RESILIENCE AND SYSTEM KNOWLEDGE.
    FULL-STACK CHAOS INJECTION: CACHING, HARDWARE, DATABASE, APPLICATION, RACK 11
  8. 1. NETFLIX 2. DROPBOX 3. GOOGLE 4. NATIONAL AUSTRALIA BANK

    5. JET WHO USES CHAOS ENGINEERING? 12
  9. HANDS-ON TUTORIAL (LET’S JUMP IN!) NOW IT IS TIME TO

    CREATE CHAOS. WE WILL ALL BE DOING A HANDS-ON ACTIVITY WHERE WE INJECT FAILURE. 14
  10. EVERYONE HAS A DIGITALOCEAN SERVER, USERNAME AND PASSWORD.
    1. LOGIN WITH TERMINAL
    2. VISIT YOUR IP IN YOUR BROWSER
    TIME TO USE YOUR SERVER 15
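    For example (the IP address below is a placeholder; the chaos username matches the shell prompt shown on the later slides):
    $ ssh chaos@203.0.113.10            # 1. login with terminal
    $ curl -I http://203.0.113.10/      # 2. or open http://203.0.113.10 in your browser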
  11. YOU MUST BE MEASURING METRICS AND REPORTING ON THEM TO

    IMPROVE YOUR SYSTEM RESILIENCE. 16
  12. THE LACK OF PROPER MONITORING IS NOT USUALLY THE SOLE

    CAUSE OF A PROBLEM, BUT IT IS OFTEN A SERIOUS CONTRIBUTING FACTOR. AN EXAMPLE IS THE NORTHEAST BLACKOUT OF 2003. COMMON ISSUES INCLUDE: + HAVING THE WRONG TEAM DEBUG + NOT ESCALATING + NOT HAVING A BACKUP ON-CALL 18
  13. 19

  14. A LACK OF ALARMS LEFT OPERATORS UNAWARE OF THE NEED

    TO RE-DISTRIBUTE POWER AFTER OVERLOADED TRANSMISSION LINES HIT UNPRUNED FOLIAGE. THIS TRIGGERED A RACE CONDITION IN THE CONTROL SOFTWARE. 20
  15. 1. AVAILABILITY — 500s
    2. SERVICE SPECIFIC KPIs
    3. SYSTEM METRICS: CPU, IO, DISK
    4. CUSTOMER COMPLAINTS
    WHAT SHOULD YOU MEASURE 21
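    As a rough illustration of measuring the first item, you can count 500s straight from a web server access log (the log path and format are assumptions; in this bootcamp Datadog is the long-term home for these metrics):
    $ awk '$9 >= 500 {errors++} END {print errors+0, "server errors out of", NR, "requests"}' /var/log/nginx/access.log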
  16. 1. UNDERSTAND SYSTEM
    2. DETERMINE SLAs/SLOs/KPIs
    3. SETUP MONITORING
    4. INJECT CHAOS
    5. MEASURE RESULTS
    6. LEARN
    7. INCREASE SYSTEM RESILIENCE
    CASE STUDY: KUBERNETES SOCK SHOP 22
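    A quick steady-state check for steps 1 and 5 might look like this (assuming the Sock Shop demo is deployed into its default sock-shop namespace):
    $ kubectl get pods -n sock-shop           # everything Running, nothing CrashLooping?
    $ kubectl get deployments -n sock-shop    # desired replicas == available replicas?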
  17. 1. DATADOG IS UP AND READY
    2. THE AGENT IS ALREADY REPORTING METRICS FOR YOU! LUCKY YOU.
    YOUR MONITORING IS ALREADY UP. 23
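    If you want to convince yourself the agent really is reporting, a couple of generic checks (exact commands differ between agent versions and init systems):
    $ sudo service datadog-agent status    # is the agent service running?
    $ ps aux | grep [d]atadog              # confirm the agent processes from the process list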
  18. 24

  19. 1. CHOOSE A SIMIAN ARMY SCRIPT
    LET’S INJECT KNOWN CHAOS
    $ cd ~/SimianArmy/src/main/resources/scripts 26
  20. 1. CHOOSE A SIMIAN ARMY SCRIPT
    LET’S INJECT KNOWN CHAOS
    $ cd ~/SimianArmy/src/main/resources/scripts
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
    burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
    burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
    faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh 27
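    As a taste of what the network scripts in that directory do, latency injection on Linux usually comes down to a tc netem rule like this (interface name and delay are illustrative, not copied from networklatency.sh):
    $ sudo tc qdisc add dev eth0 root netem delay 200ms    # add 200ms to every outgoing packet
    $ ping -c 3 8.8.8.8                                    # round-trip times should jump by ~200ms
    $ sudo tc qdisc del dev eth0 root netem                # remove the injected latency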
  21. $ vim burncpu.sh
    #!/bin/bash
    # Script for BurnCpu Chaos Monkey
    cat << EOF > /tmp/infiniteburn.sh
    #!/bin/bash
    while true; do openssl speed; done
    EOF
    # 32 parallel 100% CPU tasks should hit even the biggest EC2 instances
    for i in {1..32}
    do
      nohup /bin/bash /tmp/infiniteburn.sh &
    done 28
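    Once burncpu.sh is running you should see the burn both on the box and on your Datadog dashboards, for example:
    $ uptime               # load average climbs well past the core count
    $ top -bn1 | head -n 15   # openssl speed processes pinning every CPU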
  22. LET’S INJECT KNOWN CHAOS
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
    burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
    burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
    faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ chmod +x burncpu.sh
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ./burncpu.sh
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out' 29
  23. 1. KILL WHAT I RAN AS CHAOS USER
    LET’S STOP THE KNOWN CHAOS
    $ pkill -u chaos 32
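    To be thorough, verify nothing survived and clean up the files burncpu.sh left behind (paths are the ones used by the script and nohup above):
    $ pgrep -u chaos -f infiniteburn | wc -l   # should print 0 once the burn is gone
    $ rm -f /tmp/infiniteburn.sh nohup.out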
  24. 1. WE KILL MYSQL PRIMARY
    2. WE KILL MYSQL REPLICA
    3. WE KILL THE MYSQL PROXY
    WHAT KIND OF CHAOS DO WE INJECT AT DROPBOX? 35
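    Dropbox uses its own tooling for these kills, but the crude single-box version of "kill the primary" is just killing mysqld hard and watching what your failover machinery does (a hypothetical lab exercise, not Dropbox's method, and not for any database you care about):
    $ sudo pkill -9 mysqld                          # simulate the primary dying without a clean shutdown
    $ mysqladmin --host=<replica-or-proxy> status   # do clients still get answers from a healthy node?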
  25. WE USE SEMI SYNC, GROUP REPLICATION AND WE CREATED A

    TOOL CALLED AUTO REPLACE TO DO CLONES AND PROMOTIONS. HOW DO WE MAKE MYSQL RESILIENT TO KILLS? 36
  26. LET’S INJECT KNOWN CHAOS
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
    burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
    burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
    faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ chmod +x burncpu.sh
    chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ./burncpu.sh
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out'
    nohup: appending output to 'nohup.out' 39
  27. LET’S GO BACK IN TIME TO LOOK AT WORST OUTAGE

    STORIES WHICH THEN LED TO THE INTRODUCTION OF CHAOS ENGINEERING. 45
  28. + SO MANY WORST OUTAGE STORIES INVOLVE THE DATABASE.
    + I LEAD DATABASES AT DROPBOX & WE DO CHAOS.
    + FEAR WILL NOT HELP YOU SURVIVE “THE WORST OUTAGE”.
    + DO YOU TEST YOUR ALERTS & MONITORING? WE DO.
    + HOW VALUABLE IS A POSTMORTEM IF YOU DON’T HAVE ACTION ITEMS AND DO THEM? NOT VERY.
    QUICK THOUGHTS….. 48
  29. CHAOS @ UBER
    UBER’S WORST OUTAGE EVER:
    1. MASTER LOG REPLICATION TO S3 FAILED
    2. LOGS BACKED UP ON PRIMARY
    3. ALERTS FIRE TO ENGINEER BUT THEY ARE IGNORED
    4. DISK FILLS UP ON DATABASE PRIMARY
    5. ENGINEER DELETES UNARCHIVED WAL FILES
    6. ERROR IN CONFIG PREVENTS PROMOTION
    — Matt Ranney, UBER, YOW 2015 49
  30. 50

  31. CHAOS @ UBER
    + UBER BUILT UDESTROY TO SIMULATE FAILURES.
    + DIDN’T USE NETFLIX SIMIAN ARMY AS IT WAS AWS-CENTRIC.
    + ENGINEERS AT UBER DON’T LIKE FAILURE TESTING (ESP. DATABASES)
    ……THIS IS DUE TO THEIR WORST OUTAGE EVER:
    — Matt Ranney, UBER, YOW 2015 51
  32. + CHAOS MONKEY
    + JANITOR MONKEY
    + CONFORMITY MONKEY
    CHAOS @ NETFLIX
    SIMIAN ARMY CONSISTS OF SERVICES (MONKEYS) IN THE CLOUD FOR GENERATING VARIOUS KINDS OF FAILURES, DETECTING ABNORMAL CONDITIONS, AND TESTING THE ABILITY TO SURVIVE THEM. THE GOAL IS TO KEEP THE CLOUD SAFE, SECURE AND HIGHLY AVAILABLE. 52
  33. GITLAB’S WORST OUTAGE EVER… KEEPS REPEATING
    CHAOS @ GITLAB
    1. ACCIDENTAL REMOVAL OF DATA FROM PRIMARY DATABASE
    2. DATABASE OUTAGE DUE TO PROJECT_AUTHORIZATIONS HAVING TOO MUCH BLOAT
    3. CI DISTRIBUTED HEAVY POLLING AND EXCESSIVE ROW LOCKING FOR SECONDS TAKES GITLAB.COM DOWN
    4. SCARY DATABASE SPIKES
    https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ 53
  34. “RESILIENCE HAS TO BE DESIGNED. HAS TO BE TESTED. IT’S

    NOT SOMETHING THAT HAPPENS AROUND A TABLE AS A SLEW OF EXCEPTIONAL ENGINEERS ARCHITECT THE PERFECT SYSTEM. PERFECTION COMES THROUGH REPEATEDLY TRYING TO BREAK THE SYSTEM” — VICTOR KLANG, TYPESAFE CHAOS @ TYPESAFE 57
  35. DECIDED TO REDUCE DATABASE CAPACITY IN AWS. RESULTED IN AN

    OUTAGE AT 3:21AM. PAGERDUTY WAS MISCONFIGURED AND PHONES WERE ON SILENT. CHAOS @ BUILDKITE NOBODY WOKE UP DURING THE 4 HOUR OUTAGE….. 58
  36. “A DATABASE INDEX OPERATION RESULTED IN 90 MINUTES OF INCREASINGLY

    DEGRADED AVAILABILITY FOR THE STRIPE API AND DASHBOARD. IN AGGREGATE, ABOUT TWO THIRDS OF ALL API OPERATIONS FAILED DURING THIS WINDOW.” CHAOS @ STRIPE https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc 60
  37. INTRODUCING CHAOS IN A CONTROLLED WAY WILL RESULT IN ENGINEERS

    BUILDING INCREASINGLY RESILIENT SYSTEMS. HAVE I CONVINCED YOU? 61
  38. OUTAGES HAPPEN.
    THERE ARE MANY MORE YOU CAN READ ABOUT HERE:
    https://github.com/danluu/post-mortems 62
  39. CHAOS MONKEY
    YOU SET IT UP AS A CRON JOB THAT CALLS CHAOS MONKEY ONCE EACH WEEKDAY TO CREATE A SCHEDULE OF TERMINATIONS.
    HAS BEEN AROUND FOR MANY YEARS!
    USED AT BANKS, E-COMMERCE STORES, TECH COMPANIES + MORE 63
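    A sketch of that kind of crontab entry (the wrapper script path is hypothetical; the real cron wiring ships with the Simian Army repo):
    # m h dom mon dow  command
    0 9 * * 1-5  /opt/simianarmy/run-chaos-monkey.sh   # weekdays at 09:00, build today's termination schedule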
  40. 74

  41. 75

  42. CHAOS KONG TAKES DOWN AN ENTIRE AWS REGION.
    NETFLIX CREATED IT BECAUSE AWS HAD NOT YET BUILT THE ABILITY TO TEST THIS.
    AWS REGION OUTAGES DO HAPPEN! 78
  43. CHAOS FOR KUBERNETES
    ASOBTI, AN ENGINEER @ BOX CREATED https://github.com/asobti/kube-monkey
    IT RANDOMLY DELETES KUBERNETES PODS IN THE CLUSTER ENCOURAGING AND VALIDATING THE DEPLOYMENT OF FAILURE-RESILIENT SYSTEMS. 79
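    What kube-monkey automates is essentially the following, done at random and on a schedule (the sock-shop namespace is an assumption carried over from the earlier case study):
    $ kubectl get pods -n sock-shop -o name | shuf -n 1 | xargs kubectl delete -n sock-shop
    A healthy deployment should replace the deleted pod within seconds; if it does not, you have found a weakness.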
  44. A SUITE OF TOOLS FOR KEEPING YOUR CLOUD OPERATING IN TOP FORM.
    CHAOS MONKEY IS THE FIRST MEMBER. OTHER SIMIANS INCLUDE JANITOR MONKEY & CONFORMITY MONKEY.
    https://github.com/Netflix/SimianArmy
    SIMIAN ARMY 80
  45. GREMLIN PROVIDES “FAILURE AS A SERVICE”. IT FINDS WEAKNESSES IN YOUR SYSTEM BEFORE THEY END UP IN THE NEWS.
    LIKE A VACCINATION, THEY SAFELY INJECT HARM INTO YOUR SYSTEM TO BUILD IMMUNITY TO FAILURE.
    GREMLIN INC 81
    https://gremlininc.com/
  46. • GOOD TO USE:
    • MYSQL
    • ORCHESTRATOR
    • GROUP REPLICATION
    • SEMI SYNC
    CHAOS ENGINEERING FOR DATABASES
    https://github.com/github/orchestrator 84
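    As one concrete example, MySQL semi-synchronous replication is a pair of plugins plus two globals (MySQL 5.7 names shown; check the documentation for your version before copying this):
    $ mysql -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';"   # on the primary
    $ mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = ON;"
    $ mysql -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';"     # on each replica
    $ mysql -e "SET GLOBAL rpl_semi_sync_slave_enabled = ON;"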
  47. THINK ABOUT WHAT FAILURE YOU CAN INJECT AND THEN CATCH.
    WE DO THIS WITH MAGIC POCKET AT DROPBOX.
    CHAOS ENGINEERING WITH GO 86
  48. THIS PROJECT WAS STARTED FOR THE PURPOSE OF CONTROLLED FAILURE INJECTION DURING GAME DAYS.
    GO CLIENT TO THE CHAOS MONKEY REST API 87
    https://github.com/mlafeldt/chaosmonkey
    go get -u github.com/mlafeldt/chaosmonkey/lib
  49. A TOOL FOR “INTUITION ENGINEERING” TO HELP YOU VISUALIZE YOUR NETWORK AND TRAFFIC.
    CREATED BY NETFLIX.
    VIZCERAL 88
    https://github.com/Netflix/vizceral
  50. LOOK FORWARD TO SEEING YOU AT CHAOS COMMUNITY DAYS AND

    HEARING FROM YOU IN THE SLACK COMMUNITY AND ON THE MAILING LISTS. YOUR TOOL HERE! 98