[2016.05 Meetup #2][TALK #3] Hassy Veldstra - Chaos Llama – a modern take on the Netflix Chaos Monkey

Hassy Veldstra's talk slides: Chaos Llama – a modern take on the Netflix Chaos Monkey

All rights belong to Hassy Veldstra.

DevOps Lisbon

May 16, 2016
Transcript

  1. # whoami
     • Backend Node.js & DevOps engineer
     • Interested in reliability and performance
     • artillery.io
  3. Chaos Engineering
     • “The only way to prepare for system failures is to simulate them”
  8. Chaos Engineering
     • Modern systems are complex
     • Consequence: failure modes are complex
     • Common weaknesses:
       • Improper fallback settings when a subsystem is down
       • Retry storms (thundering herd) from improperly tuned timeouts
       • Cascading failures when a SPOF fails
       • Improperly tuned connection pools
       • Many more
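The "retry storm" weakness on this slide is easier to picture with code. The sketch below shows one common mitigation, exponential backoff with full jitter. It is an illustration rather than something taken from the talk; the function name `backoff_delays` and the default values are assumptions.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment do not all retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A client would sleep for `delays[i]` before retry `i`. Without the random component, every client that saw the same outage would retry in lockstep, which is the thundering herd the slide refers to.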
  9. Chaos Engineering
     • The only way to build a resilient system is to stress test it with failure and fix weaknesses
  13. Chaos Engineering
      1. Define the “steady state”
         1. P95 response time
         2. Error rates (e.g. HTTP 500)
         3. E-commerce app: completed orders
      2. Inject failure into the system
      3. Monitor changes in the steady state
      4. Fix weaknesses
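The steady-state steps above can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the helper names, the nearest-rank percentile method, and the 20% tolerance are all assumptions. It only handles metrics where a higher value is worse, so a metric like completed orders would need the comparison inverted.

```python
def p95(samples):
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses with a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def steady_state_violations(baseline, observed, tolerance=0.2):
    """Names of metrics that exceed their baseline by more than `tolerance`."""
    violations = []
    for name, base_value in baseline.items():
        limit = base_value * (1 + tolerance)
        if observed[name] > limit:
            violations.append(name)
    return violations
```

In an experiment, `baseline` would be measured before injecting failure and `observed` during it. A non-empty result points at a weakness to fix.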
  14. Advanced Chaos Engineering
      1. Do it in production
      2. Automate these chaos experiments to run continuously
  16. Chaos Monkey - Downsides
      • Large codebase (thousands of lines of Java)
      • Requires an EC2 instance
      • Lots of config to get going
      • Quick start guide is 16 pages
  18. Llama vs Monkey
      • Serverless vs Need an EC2 instance
      • Easy to set up vs Lots of config
      • Tiny (400 LoC) vs Huge (thousands)
  25. AWS Lambda 101
      • Run code in response to events (S3, Dynamo etc)
      • You give AWS a snippet of code
      • AWS takes care of running it (“serverless”)
      • The piece of code is a “lambda function”
      • JS (Node), Python, Java
      • A (Node) process in a container running the snippet behind the scenes
      • Basically CGI
      • 300s max execution, up to 100 concurrently
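To make "a snippet of code run in response to events" concrete, here is a minimal sketch of a Lambda function in Python, one of the runtimes the slide lists. The `(event, context)` signature is the standard one for Python handlers. The function body and the fake event are illustrative assumptions, not code from the talk.

```python
import json


def handler(event, context):
    """Entry point that Lambda calls once per event.

    `event` is the decoded JSON payload; `context` carries runtime
    metadata and is unused here.
    """
    records = event.get("Records", [])
    # Record-based triggers label their origin, e.g. "aws:s3" or "aws:dynamodb".
    sources = sorted({r.get("eventSource", "unknown") for r in records})
    return {"record_count": len(records), "sources": sources}


if __name__ == "__main__":
    # Local smoke test with a hand-written event; no AWS account needed.
    fake_event = {"Records": [{"eventSource": "aws:s3"}]}
    print(json.dumps(handler(fake_event, None)))
```

Because the handler is a plain function, it can be exercised locally with a hand-built dictionary before it is ever deployed.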
  28. AWS Lambda 101
      • Perfect for scripting things in your AWS environment
        • E.g. DIY dynamic DNS system for instances not behind an ELB
      • Also great for small one-off tasks
        • E.g. resizing images uploaded to S3
      • People are also building APIs with it
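For the image-resizing example, the first step of such a function is finding out which object was uploaded. The sketch below reads bucket and key from an S3 notification event and deliberately leaves out the resizing itself. The helper names and the `thumbnails/` prefix are hypothetical.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs for objects created in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        # Skip deletions and other non-upload notifications.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def thumbnail_key(key, prefix="thumbnails/"):
    """Hypothetical naming scheme for the resized copy."""
    return prefix + key
```

Writing the resized copy under a separate prefix (or into a separate bucket) matters in practice: if the output lands where the trigger is listening, the function can re-trigger itself.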
  32. Chaos Llama
      • Install: npm install -g llama-cli
      • Configure:
        • Create a role for the lambda function
      • Deploy! llama deploy -r $lambda-role-arn
      • This is SAFE
      • Llama will do nothing by default
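The "does nothing by default" property can be sketched as an opt-in selection step. This assumes, as with Chaos Monkey, that the experiment is terminating a randomly chosen instance. The config shape and the name `pick_victims` are hypothetical and are not Chaos Llama's actual format or code.

```python
import random


def pick_victims(groups, config, rng=None):
    """Choose at most one instance per opted-in group.

    `groups` maps a group name to a list of instance IDs.
    `config` maps a group name to {"enabled": bool, "probability": float}.
    Groups missing from `config` are left alone, so an empty config
    means nothing is ever selected.
    """
    rng = rng or random.Random()
    victims = []
    for name, instance_ids in sorted(groups.items()):
        settings = config.get(name, {})
        if not settings.get("enabled", False) or not instance_ids:
            continue
        if rng.random() < settings.get("probability", 0.0):
            victims.append(rng.choice(instance_ids))
    return victims
```

Making every group opt in explicitly is what makes a bare deploy safe: nothing is touched until someone enables a group.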
  33. Going further
      • Increase network latency (temporarily)
      • Increase memory usage
      • Increase CPU load
      • Can’t do this from a lambda function ☹
  36. Llama Agent
      • Simple RPC server (written in Go)
      • Listens on a port; accepts (whitelisted) commands
      • Commands correspond to shell scripts
      • Easy to install on a server
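The whitelisting idea can be sketched as a lookup table from command names to fixed argument lists. The real agent is written in Go, so this Python version only illustrates the principle. The `WHITELIST` contents and `run_command` are hypothetical, and the network listener is left out.

```python
import subprocess
import sys

# Hypothetical whitelist: command name -> argv to execute.
# A real agent would point these at shell scripts on disk.
WHITELIST = {
    "ping": [sys.executable, "-c", "print('pong')"],
}


def run_command(name, whitelist=WHITELIST, timeout=30):
    """Run a whitelisted command and return (exit_code, stdout).

    The caller only supplies a name; unknown names are rejected and
    nothing the caller sends is ever passed to a shell.
    """
    argv = whitelist.get(name)
    if argv is None:
        raise PermissionError(f"command not whitelisted: {name!r}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout
```

Because the request selects an entry rather than supplying a command line, a caller cannot smuggle in extra shell syntax.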
  40. Llama Agent - Security
      • Small codebase (<100 LoC)
      • Memory-safe (Go) – no RCE
      • Whitelisted commands – no shell injections
      • Restrict access to within a VPC only
      • The worst that can happen?
      • Very experimental – not for production use yet
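Restricting access to a VPC is normally done at the network layer, for example with security groups. As an assumed extra layer, an agent could also check the caller's address against the VPC's CIDR range before accepting a command. The function name and the example range below are illustrative only.

```python
import ipaddress


def is_allowed_peer(peer_ip, allowed_cidrs):
    """True if `peer_ip` falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(peer_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

For a VPC using `10.0.0.0/16`, a connection from `10.0.1.25` would pass and one from a public address would be refused.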
  41. Summary
      • Chaos Engineering = more resilient systems
      • Chaos Llama is a modern take on Chaos Monkey
      • Easier to run thanks to AWS Lambda
  43. Extra: Money Llama
      • Untagged instances (story)
      • Detached EBS volumes (story)
      • Cycle dev environments on a schedule
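The first two cost checks can be sketched as filters over EC2 API responses. The functions below take plain dictionaries shaped like the results of the `DescribeInstances` and `DescribeVolumes` calls, so they run without AWS access. The function names are hypothetical and this is not Money Llama's actual code.

```python
def untagged_instance_ids(describe_instances_response):
    """IDs of instances that carry no tags at all."""
    ids = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # The "Tags" key is absent (or empty) when nothing is tagged.
            if not instance.get("Tags"):
                ids.append(instance["InstanceId"])
    return ids


def detached_volume_ids(describe_volumes_response):
    """IDs of EBS volumes that exist but are attached to nothing."""
    return [
        volume["VolumeId"]
        for volume in describe_volumes_response.get("Volumes", [])
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]
```

A scheduled Lambda function could run checks like these and report the results, leaving the decision to delete anything to a person.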