
Reliably Scaling to your First Million Users


Ho Ming Li

May 02, 2019



Transcript

  1. Reliably Scaling to your First Million Users with Chaos Engineering
    Ho Ming Li, Principal Solutions Architect, Gremlin
    [email protected] @horeal
    Adobe CE Meetup, May 2019
  2. Ho Ming Li, Principal Solutions Architect, Gremlin
    [email protected] @HoReaL @GremlinInc
    Assist in “Digital Transformation”:
    • Share Architectural Best Practices
    • Share Operational Best Practices
    • Facilitate GameDays
    Quite possibly the only Solutions Architect who became a Technical Account Manager at AWS.
  3. I’m not Netflix. I’m not Amazon. I’m not _________ . I don’t have their problems. … okay.
  4. Do you want to scale? Do you want to be… the next Amazon, or the next Netflix?
  5. The Journey (Hardcore, aka Reality): Build “MVP”, find bugs, fix bugs, build new features, find bugs, fix bugs, build new features, P0 hard down, on-call hero saves the day, build new features, P1 incident, fix bugs, P0 hard down, customer complains, fix bugs, new bugs show up, fix new bugs, P2 issue, build new features, P1 incident, fix bugs, product release, P2 issue comes back as a P0 hard down… Frustrated customers look for alternatives, churn rate increases, the business struggles…
  6. Build and Operate so that your service actually works for your customers. A service that is down delivers no value to customers. Downtime sucks!
  7. Downtime can be costly... “The head of San Francisco’s Municipal Transportation Agency is stepping down amid the fallout from a 10-hour meltdown that choked the city on Friday, drawing anger from City Hall.”
  8. Let’s begin our journey. *Disclaimer: user numbers will vary for your particular service.
  9. In the beginning... 1 to 100 users, maybe? “M-V-P”
    DEPLOYMENT: rsync? Heroku?
    ENVIRONMENT: Your laptop
    ARCHITECTURE: Monolith
  10. Chaos Engineering? Back burner (low priority). Level: Undefined
    FOCUS: Time to market
    APPROACH: Functional MVP
    DESIRE: Proving out the idea
  11. First taste of scaling... 100 to 1000+ users. Scale
    DEPLOYMENT: CI/CD pipelines
    ENVIRONMENT: Dev → Stage → Prod
    ARCHITECTURE: 3-tier (front end, back end, data store)
  12. Do we know when a host goes away? How do we remove or patch hosts? Can we replace a smaller host with a bigger one? Can we scale out/in to add or remove hosts? Is there an auto-healing mechanism if hosts fail a health check? Is S.T.O.N.I.T.H. (“Shoot The Other Node In The Head”) in our toolbox?
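The detect-and-remediate loop these questions point at can be sketched in a few lines of shell. This is an illustrative assumption, not the deck's implementation: `HEALTH_URL` is a hypothetical health endpoint, `replace_host` is a stub for your cloud API or orchestrator, and `CHECK_CMD` is an override hook for testing, not a Gremlin feature.

```shell
#!/bin/sh
# Sketch: poll a host's health endpoint and trigger a stubbed
# replacement after repeated failed checks.

HEALTH_URL="${HEALTH_URL:-http://localhost:8080/healthz}"   # hypothetical endpoint
MAX_FAILURES=3

check_health() {
  # Override CHECK_CMD (e.g. with `true` or `false`) to simulate outcomes
  ${CHECK_CMD:-curl -fsS --max-time 2 "$HEALTH_URL"} > /dev/null 2>&1
}

replace_host() {
  # Stub: in practice, call your cloud provider or orchestrator here
  echo "replacing unhealthy host"
}

monitor() {
  failures=0
  for _ in 1 2 3 4 5; do            # five polls for the sketch; loop forever in practice
    if check_health; then
      failures=0
    else
      failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$MAX_FAILURES" ]; then
      replace_host
      return 0
    fi
  done
}
```

Requiring several consecutive failures before replacing avoids killing a host over a single flaky probe; auto-scaling groups and Kubernetes liveness probes apply the same idea.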
  13. Chaos Monkey-esque: Host Takedown. Level 0
    FOCUS: Detect and remediate (manual to auto)
    APPROACH: Random host failing
    DESIRE: Ability to replace hosts
  14. Takedown
    Shutting down and rebooting a host:
    $ shutdown -r now
    $ gremlin shutdown -r            # available with Gremlin
    Killing a process:
    $ pkill httpd
    $ gremlin attack process_killer -p httpd
  15. General Architectural Guidance:
    Set up (and verify!) monitoring & alerting
    Leverage multiple zones
    Identify stateful vs stateless hosts
    Replication of state
    Scale out for stateless
  16. Operational Challenges: 1,000 to 10,000+ users. Operational Excellence
    DEPLOYMENT: CI/CD pipelines
    ENVIRONMENT: Multiple, mixed, hybrid
    ARCHITECTURE: Monolith, 3-tier, managed services
  17. Which resource is the workload bound by? At what threshold do we trigger scaling? How long does it take to scale? What is the user experience upon encountering a failure? How can we improve that experience?
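One way to make the threshold question concrete: compare the 1-minute load average to the core count and emit a scale decision. The 0.7-per-core threshold below is an illustrative assumption, not a recommendation from the deck.

```shell
#!/bin/sh
# should_scale LOAD CORES prints "scale-out" or "hold" based on
# 1-minute load average per core (0.7 is an assumed threshold).
should_scale() {
  awk -v l="$1" -v c="$2" 'BEGIN {
    if (l / c > 0.7) print "scale-out"; else print "hold"
  }'
}

# Usage on a live Linux host:
#   load=$(cut -d" " -f1 /proc/loadavg)
#   should_scale "$load" "$(nproc)"
```

Whatever threshold you pick, benchmark it: the CPU attacks on the next slides are exactly how you verify that crossing it actually triggers scaling, and how long that scaling takes.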
  18. Other Host Failures: Resource Constraints. Level 1
    FOCUS: Alerting and basic operations
    APPROACH: Disciplined: benchmark, measure
    DESIRE: Prepare for host-level failures
  19. CPU
    $ gremlin attack cpu                        # available with Gremlin
    $ while :; do :; done                       # busy loop pinning one core
    $ stress --cpu 2 --timeout 60               # two CPU workers for 60 s
    $ dd if=/dev/zero of=/dev/null conv=sync    # keeps a core busy copying
    $ yes > /dev/null &                         # background CPU burner
  20. Disk (capacity and IO)
    $ gremlin attack disk
    $ gremlin attack io
    $ fallocate -l 10G outfile                  # consume 10 GB of capacity
    $ dd if=/dev/urandom of=/tmp/outfile bs=$((1024*1024)) count=1024   # generate write IO
    Memory
    $ gremlin attack memory
    $ stress -m 1 --vm-bytes 1G                 # one worker allocating 1 GB
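To know when a capacity attack like the `fallocate` above should trip an alert, a small check of a mount point's usage is enough. A sketch using POSIX `df -P` output; the helper names and the threshold are assumptions for illustration.

```shell
#!/bin/sh
# disk_pct MOUNT prints the used-capacity percentage of a mount point,
# parsed from POSIX-format df output (column 5, e.g. "45%").
disk_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_alert MOUNT THRESHOLD prints "alert" or "ok".
disk_alert() {
  pct=$(disk_pct "$1")
  if [ "$pct" -ge "$2" ]; then echo "alert"; else echo "ok"; fi
}
```

Run the attack, watch the check flip to "alert", and confirm your monitoring fired; if the attack fills the disk and nothing pages you, the experiment found the gap it was meant to find.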
  21. General Guidance: Establish an incident management practice and process. Discuss backup and recovery (BCP/DR). Run those exercises!!!
  22. Dependency Pain: 100,000+ users. “Digital Transformation”
    DEPLOYMENT: Centralize, or decentralize?
    ENVIRONMENT: Kubernetes is the new hotness
    ARCHITECTURE: Heavily adopting microservices
  23. What if THEY fail? Network Failures. Level 1.5
    FOCUS: Error handling
    APPROACH: Experiments (GameDays)
    DESIRE: Prepare for high-impact events
  24. Network
    Gremlin:
    $ gremlin attack latency
    Traffic Control (tc):
    $ tc qdisc add dev eth0 root netem delay 1000ms 500ms   # 1000 ms delay, ±500 ms jitter
    iptables:
    $ iptables -A OUTPUT -p tcp -d 157.240.0.0/16 -j DROP   # drop outbound TCP to the range
    PF (macOS), in pf.conf:
    block quick from any to 157.240.0.0/16
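After injecting latency with any of the tools above, verify the effect against a latency budget rather than eyeballing it. A sketch: `slo_check` is a hypothetical helper, and the commented `curl` line shows one way to obtain a measurement.

```shell
#!/bin/sh
# slo_check MEASURED_MS BUDGET_MS prints "breach" or "ok".
slo_check() {
  awk -v ms="$1" -v budget="$2" 'BEGIN {
    if (ms > budget) print "breach"; else print "ok"
  }'
}

# One way to measure (needs network access):
#   ms=$(curl -o /dev/null -s -w '%{time_total}' http://example.com \
#        | awk '{ printf "%d", $1 * 1000 }')
#   slo_check "$ms" 500
```

The interesting result is rarely the raw number: it is whether your timeouts, retries, and error handling kept the user experience acceptable while the dependency was slow.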
  25. “Control” Complexity: 1 million users. The Unknown
    DEPLOYMENT: Balancing agility and quality
    ENVIRONMENT: Mixed, Kubernetes on multiple providers
    ARCHITECTURE: Microservices, OSS, a bit of everything
  26. Putting it all together... Host takedown, resource limits, network failures: unknown → known
    FOCUS: Business metrics and user experience
    APPROACH: Automated experiments, plus new manual experiments
    DESIRE: Verifiable resilience!
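Moving from manual GameDays to automated experiments usually means wrapping each attack in a guardrail: watch a business metric while the attack runs and abort the moment it degrades. A sketch with generic commands standing in for the real attack and metric source; none of this is Gremlin's implementation.

```shell
#!/bin/sh
# run_experiment ATTACK_CMD METRIC_CMD LIMIT
# Starts the attack in the background, samples the metric a few times,
# and kills the attack ("aborted") if the metric exceeds the limit;
# otherwise lets it run to completion ("completed").
run_experiment() {
  attack_cmd=$1; metric_cmd=$2; limit=$3
  $attack_cmd &
  attack_pid=$!
  for _ in 1 2 3; do                 # three samples for the sketch
    value=$($metric_cmd)             # e.g. error rate, p99 latency, failed checkouts
    if [ "$value" -gt "$limit" ]; then
      kill "$attack_pid" 2>/dev/null
      echo "aborted"
      return 0
    fi
    sleep 1
  done
  wait "$attack_pid" 2>/dev/null
  echo "completed"
}
```

Tying the abort condition to a business metric, not a host metric, is what makes the resilience claim verifiable: the experiment proves the user experience survived, or stops itself before real damage is done.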
  27. Ho Ming Li, [email protected] @HoReaL @GremlinInc
    Thank you! Reliably Yours
    tinyurl.com/chaoseng
    meetup.com/pro/chaos