Better Reliability through Observability (and Experimentation)

In this presentation, Julie Gunderson (Sr. Reliability Advocate at Gremlin) and I look at how to improve your service's reliability through experimentation.

This version of the talk was given at KubeCon Europe (Valencia) in May 2022.

---

Companion Code: github.com/ksatirli/better-reliability-through-observability-and-experimentation

Kerim Satirli

May 19, 2022

Transcript

  1. Who we are: Julie Gunderson (@julie_gund) ▪ Sr. Reliability Advocate at Gremlin ▪ devopsdays Boise organizer ▪ avid mushroom-hunter ▪ came to Spain by plane
  2. Who we are: Kerim Satirli (@ksatirli) ▪ Sr. Developer Advocate ▪ recovering conference organizer ▪ aerial photography aficionado ▪ came to Spain by plane
  3. WHAT DOES "OBSERVABILITY" MEAN TO YOU? Options: OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ AUTOMATED VISIBILITY ▪ SURVEILLANCE STATE. Numbers shown: 29 ▪ 23 ▪ 11 ▪ 1. Labels: known-knowns ▪ known-unknowns ▪ uhm. ▪ marketing
  4. WHAT DOES "OBSERVABILITY" MEAN TO YOU? OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ INFORMATION YOU DIDN’T THINK YOU NEEDED BUT COULD ACTUALLY SOLVE YOUR PROBLEM. Labels: known-knowns ▪ known-unknowns ▪ unknown-unknowns
  5. If you can’t measure it, you can’t understand it. If you can’t interpret it, you can’t harden it.
  6. Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis (given a nominal state, does this function the way we expect it to?)
  7. Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius (what parts of the system will be impacted by an experiment?)
  8. Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions (when to stop experimenting and revert back to a nominal state?)
  9. Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results (what learnings can be derived from the experiment?)
  10. Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results ▪ learn and improve (share the results and derive actions from the data)
  11. Signals and Simulations: Latency, the time it takes to service a request. How to simulate: ▪ programmatically inject delays ▪ change DNS and network settings ▪ switch to different geo zones
  12. Signals and Simulations: Errors, the rate of requests that fail to complete correctly. How to simulate: ▪ terminate services ▪ revoke access credentials ▪ change system clock and timezones
  13. Signals and Simulations: Traffic, the demand that is placed on a system at any point. How to simulate: ▪ create traffic spikes with tooling ▪ change load-balancing to create hot spots ▪ re-deploy on over-subscribed compute
  14. Signals and Simulations: Saturation, the measure of system utilization and constraints. How to simulate: ▪ alter scaling logic to delay triggering ▪ fill up empty disk space and memory ▪ run stress or consume.exe
  15. Take-aways: ▪ this was never about any one tool ▪ codify resources and processes ▪ method over madness ▪ culture breeds reliability (words matter)
  16. Normal Accidents: Living with High-Risk Technologies (Charles Perrow, 1984) ▪ Fatal Defect: Chasing Killer Computer Bugs (Ivars Peterson, 1995) ▪ Accelerate: Building and Scaling High-Perf Orgs (Nicole Forsgren et al., 2018)
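
The "Science and Chaos" steps from slides 6 to 10, combined with the "programmatically inject delays" idea from slide 11, can be sketched roughly as follows. This is not taken from the talk or its companion code: `simulated_service`, `run_experiment`, and every threshold are hypothetical names and values chosen for illustration, and the service is a random-number stand-in rather than a real system.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_p95_ms: float
    observed_p95_ms: float
    aborted: bool
    hypothesis_held: bool


def simulated_service(rng):
    """Hypothetical stand-in for a service: returns a latency in milliseconds."""
    return rng.uniform(20.0, 40.0)


def p95(samples):
    # With n=20, statistics.quantiles returns 19 cut points; the last one is
    # the 95th percentile. It needs at least two samples on older Pythons.
    if len(samples) < 2:
        return samples[0]
    return statistics.quantiles(samples, n=20)[-1]


def run_experiment(injected_delay_ms, blast_radius, max_p95_ms, abort_ms,
                   requests=200, seed=1):
    rng = random.Random(seed)

    # Observe baseline metrics in the nominal state.
    baseline = [simulated_service(rng) for _ in range(requests)]

    # Hypothesis (assumed here): p95 latency stays at or below max_p95_ms
    # while a delay is injected into a share of the requests.
    observed = []
    aborted = False
    for _ in range(requests):
        latency = simulated_service(rng)
        # Blast radius: only this fraction of requests receives the delay.
        if rng.random() < blast_radius:
            latency += injected_delay_ms
        observed.append(latency)
        # Abort condition: stop as soon as one request is slower than abort_ms.
        if latency > abort_ms:
            aborted = True
            break

    # Analyze the results against the hypothesis.
    observed_p95 = p95(observed)
    return ExperimentResult(
        baseline_p95_ms=p95(baseline),
        observed_p95_ms=observed_p95,
        aborted=aborted,
        hypothesis_held=(not aborted) and observed_p95 <= max_p95_ms,
    )
```

A call such as `run_experiment(injected_delay_ms=100, blast_radius=0.1, max_p95_ms=60, abort_ms=500)` returns a result that can be shared and discussed, which is the "learn and improve" step. In a real experiment the delay would be injected by tooling and the metrics would come from an observability stack rather than from a list in memory.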