From here to resilience - a travel guide

This slide deck starts with the observation that many companies claim to be resilient but only a few of them really are.

Then it lays out a prototypical journey of a regular company IT department, from leaving availability to operations to becoming a truly resilient organization. Along the way, several interim stops are discussed in terms of their goals, leading questions, typical measures and trade-offs until the peak of advanced resilience is eventually reached.

Additionally, it discusses whether it is always necessary to aim for the peak or whether one of the interim stops may also be okay, depending on the context. Finally, a quick overview is presented that can help to figure out where an organization currently stands regarding resilience.

As always, the voice track of the presentation is missing. Nevertheless, I hope it still is useful for you and gives you a few ideas to ponder on your own journey towards resilience.

Uwe Friedrichsen

May 30, 2024

Transcript

  1. resilience: The ability to successfully cope with adverse events and situations, including

    1. handling expected adverse events and situations (robustness)
    2. handling unexpected adverse events and situations (surprise)
    3. improving due to adverse events and situations (anti-fragility)

    resilient software design: Designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above
  2. Sources of failure (examples) • Hardware failure • Central process

    becomes latent • Firmware bug in infrastructure component • Cyberattack • Critical software bug • Triple redundant data center cooling fails at once • Competitor launches a disruptive new product • …
  3. Is your IT prepared to handle all those sources of

    failure (and many more) swiftly, successfully and gracefully?
  4. Valley of feature-completeness • Status quo for many organizations •

    Dev is responsible for feature delivery • Ops is responsible for availability • Dev budget is reserved for implementing business features • NFRs besides maintainability are “outsourced” to ops
  5. Valley of feature-completeness • Core driver • Maximize business feature

    throughput • Leading questions • Is the business requirement implemented correctly? • How can we implement features faster?
  6. Valley of feature-completeness • Typical measures • Everything ops can

    influence • Redundant hardware and infrastructure components, load balancer with failover, cluster, HA hardware • Strict handover rules for dev artifacts • Long pre-production testing phases to check for potential production problems
  7. Valley of feature-completeness • Trade-offs • Nice for Dev (less

    to take care of) • Not so nice for Ops (expected to run software reliably that was created without availability as quality goal)
  8. Valley of feature-completeness • When to use • Monolithic and

    isolated IT systems, exchanging data via batch interfaces • When to avoid • Distributed, interconnected system landscapes, communicating over online interfaces (the default today)
  9. Valley of feature-completeness • Blind spot • Everything besides business

    features (and maintainability) • Reality of today’s system landscapes • Availability is treated as SEP (somebody else’s problem)
  10. Plateau of stability • Core driver • Avoid failure •

    Leading questions • How can I avoid the failure of my application/service? • How can I detect a failure and automatically fail over? • How can I avoid an overload situation? • How can I detect an overload situation and fix it by automatically scaling up?
  11. Plateau of stability • Typical measures • Redundant service deployment

    • Timeout, error checking, retry, circuit breaker, failover • Rate limiting, back pressure • Autoscaling • Measures often statically preconfigured, utilizing middleware whenever possible • Focus on technical measures only
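
Two of the measures named on this slide, retry and circuit breaker, can be sketched in a few lines. The following is a minimal illustration, not taken from the deck: the class and function names, the thresholds, and the injectable `clock` parameter are all hypothetical choices made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative, not from the deck)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call rejected")
            # Reset period is over: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def call_with_retry(breaker, func, attempts=3, delay=0.0):
    """Retry func through the breaker; stop immediately once the circuit is open."""
    last_error = None
    for _ in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

In this sketch the retry loop handles short glitches, while the breaker stops callers from hammering a service that keeps failing. Both thresholds are statically preconfigured, which matches the slide's remark about how such measures are typically set up.
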
  12. Plateau of stability • Impact radius • Technology only •

    Collaboration modes • Valley of feature-completeness → Ops alone • Plateau of stability (basic) → Dev | Ops (mostly independent) • Plateau of stability (advanced) → Dev & Ops (feedback loop)
  13. Plateau of stability • Trade-offs • Relatively easy to reach

    • Often supported by middleware and infrastructure means • Quite good availability achievable • If system parts fail, recovery (and detection) often takes a long time • Works best with ops-dev feedback loop • Works well with an economies of scale business model
  14. Plateau of stability • When stability is fine • Not

    too high availability needs (< 3 Nines) • Planned downtimes possible • System is not distributed internally • When stability is not sufficient • Higher availability needs (> 3 Nines) • System distributed internally (e.g., microservices) • Safety-critical systems
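
To make the "3 Nines" threshold concrete, the sketch below converts a number of nines into the downtime it allows per year. It assumes the common convention that N nines means an availability of 1 - 10^-N (so 3 nines is 99.9%) and a 365-day year; the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(nines: int) -> float:
    """Downtime per year permitted by an availability of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability


for n in (2, 3, 4):
    print(f"{n} nines: about {allowed_downtime_hours(n):.2f} hours of downtime per year")
```

Under these assumptions, three nines leave roughly 8.76 hours per year, which still fits planned downtimes, whereas four nines leave less than an hour.
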
  15. An example You: “How do you handle the situation if

    the service you call does not respond (or does not respond in time)?”
    Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.”
    Developer 2: “Actually, if that service were down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  16. Variants of the trap • Infrastructure components will never fail

    • E.g., OS, schedulers, routers, switches, … • Middleware components will never fail • E.g., message queues, databases, … • All encompassing applications and services will never fail • No message loss, latency, response failures, …
  17. The “100% available” trap in a nutshell “Everything works perfectly,

    all the time. Nothing ever fails.” Successor of the “Ops is responsible for availability” mindset
  18. Continuous partial failure is the normal state of affairs. --

    Michael Nygard Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing
  19. Availability = MTTF / (MTTF + MTTR)

    MTTF: Mean Time To Failure • MTTR: Mean Time To Recovery • Our overall aim is to maximize availability • Stability thinking assumes that MTTF can be increased without limit and thus MTTR can be ignored • Robustness thinking accepts that increasing MTTF is limited and thus MTTR must be reduced to further increase availability
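
The formula on this slide can be tried out with a few numbers. The sketch below is only an illustration: the MTTF and MTTR values are invented, and the function name is an assumption made for the example.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), both given in the same time unit."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Invented example values: a failure every 1000 hours, one hour to recover.
baseline = availability(1000, 1)       # roughly 0.9990
doubled_mttf = availability(2000, 1)   # roughly 0.9995 (stability thinking)
tenth_mttr = availability(1000, 0.1)   # roughly 0.9999 (robustness thinking)

print(f"baseline:     {baseline:.5f}")
print(f"MTTF doubled: {doubled_mttf:.5f}")
print(f"MTTR / 10:    {tenth_mttr:.5f}")
```

With these made-up numbers, cutting recovery time to a tenth gains more availability than doubling the time between failures, which is the point of robustness thinking.
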
  20. Failure modes (excerpt) • Crash failure • Overload failure •

    Omission failure • Timing failure • Response failure • Byzantine failure • Software bugs • Firmware bugs • Security vulnerabilities • …
  21. Effects of failure modes (excerpt) • Lost or incomplete messages

    • Duplicate messages • Latency up to complete standstill • Out-of-order message arrival • Partial, out-of-sync local memory • Split brain • Persistent malfunction • Data corruption or loss • Confidential information leak
  22. Plateau of robustness • Core driver • Maximize availability (embrace

    failure) • Leading questions • What can go wrong and how can I respond to it? • What can I do if a remote service is not available? • How can I detect and handle invalid requests (when being called) and responses (when calling)? • How can I fix bugs and other defects quickly?
  23. Plateau of robustness • Typical measures • Fallback • Complete

    parameter checking • Minimize startup time • Deployment automation (CI/CD, IaC/IfC, …) • Application and business level monitoring • Focus extended to business domain
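
Fallback and complete parameter checking can be illustrated together. The sketch below is a hypothetical example, not the author's code: the price lookup scenario, the function names, the quantity limits, and the cache used as a fallback are all assumptions.

```python
def validate_quantity(raw):
    """Check the incoming parameter completely: type first, then range."""
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise ValueError("quantity must be an integer")
    if not 1 <= raw <= 100:
        raise ValueError("quantity must be between 1 and 100")
    return raw


def price_with_fallback(fetch_price, item_id, quantity, cached_prices):
    """Return (total, source); use a cached price if the remote call fails."""
    quantity = validate_quantity(quantity)  # invalid requests are rejected early
    try:
        unit_price = fetch_price(item_id)
        # Responses are checked as well, not only requests.
        if not isinstance(unit_price, (int, float)) or unit_price < 0:
            raise ValueError("invalid response from price service")
        source = "live"
    except (ConnectionError, TimeoutError, ValueError):
        if item_id not in cached_prices:
            raise  # no fallback available, so the failure is passed on
        unit_price = cached_prices[item_id]
        source = "cached"
    return unit_price * quantity, source
```

Returning the source alongside the result lets callers and business-level monitoring see how often the fallback is in use, which ties in with the monitoring measure on the same slide.
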
  24. Plateau of robustness • Trade-offs • More effort needed to

    reach • Affects not only systems, but also processes a bit • Change of mindset required (from avoiding failures to embracing failures) • Tight ops-dev collaboration required • Very high availability achievable • Works well with an economies of speed business model
  25. Plateau of robustness • When robustness is fine • Higher

    availability needs (> 3 Nines) • System distributed internally (e.g., microservices) • When robustness is not sufficient • Safety-critical systems • Very high availability needs in highly uncertain technical environments • High innovation speed required in highly uncertain business environments
  26. Plateau of robustness • Impact radius • Technology, business domain,

    touching processes • Blind spot • The limits of perception
  27. Known knowns: Things we know and are aware of. We usually take these topics into account.

    Unknown knowns: Things we implicitly know but are not aware of. We may take these topics into account.
    Known unknowns: Things we do not know and are aware that we do not know. We may be aware we ignored these topics.
    Unknown unknowns: Things we do not know and are not aware that we do not know. We definitely miss these topics.
  28. Socio-technical system: The IT systems and the encompassing organization creating, running and changing them.

    Suitable to respond to adverse events and situations including surprises (resilience).
    Technical system: The IT system landscape. Suitable to respond to expected adverse events and situations (robustness).
  29. High-plateau of basic resilience • Core driver • Expect the

    unexpected • Leading questions • How can I maximize the odds of detecting and responding quickly to an unexpected error before it becomes a failure? • Which resources does my IT organization need to be able to respond quickly and successfully to adverse surprises? • How can I organize best to be able to respond quickly and successfully to adverse surprises? • How do I balance resilience and efficiency?
  30. High-plateau of basic resilience • Typical measures • Self-organized teams

    • Fire drills & chaos engineering • Slack in the system • Observability • Organic computing and residuality theory may support • Focus extended to whole socio-technical system
  31. High-plateau of basic resilience • Trade-offs • High effort needed

    to reach • Affects the whole socio-technical system • Usually needs to reshape collaboration at system boundaries • Allows for reliable very high availability even in the face of unexpected adverse situations • Enables very high innovation speed without compromising dependability even in highly uncertain environments
  32. High-plateau of basic resilience • When to use • Safety-critical

    systems • Very high availability in highly uncertain technical environments • High innovation speed in highly uncertain business environments
  33. High-plateau of basic resilience • Impact radius • Technology, business

    domain, processes, organization • Focus on withstanding and quick recovery • Blind spot • Standing still
  34. Resilience response types

    • Withstand: Resist adversities • Recover: Quickly recover • Adapt: Learn and improve • Transform: Radically change • Withstand and Recover: Covered by basic resilience • Adapt and Transform: Covered by advanced resilience (Anti-Fragility)
  35. Peak of advanced resilience • Core driver • Adversity as

    opportunity to improve • Leading questions • How do I need to adapt at all levels to improve my ability to handle adverse situations successfully? • Is adaptation enough or do I need a more radical change to reduce my vulnerability to adverse situations? • How can I establish a continually learning and improving organization? • How do I need to shift and change my organizational boundaries to become less vulnerable to adverse situations?
  36. Peak of advanced resilience • Typical measures • Culture of

    continuous learning and improvement • System thinking (improving the system, not only the parts) • Leapfrogging • Perception shift from nuisance to opportunity • Impact radius • Technology, business domain, processes, organization • Focus on all resilience response types (as needed)
  37. Peak of advanced resilience • Trade-offs • Effort required comparable

    to basic resilience • Requires different mindset regarding adverse situations • Adverse situations make you stronger • Will eventually affect the whole company (won't stop at the boundaries of the IT organization) • When to use • Prepare for a successful endless game in an increasingly uncertain world ("VUCA")
  38. (Table) Stops visited

    | Visited | Core driver | Blind spot | When (not) to go for it |
    |---|---|---|---|
    | Valley of feature-completeness | Maximize business feature throughput | Availability is SEP | • Okay for isolated systems • Not advisable for distributed, online communicating system landscapes (which is the norm) |
    | Plateau of stability | Avoid failure | 100% available trap | • Not too high availability demands (< 3 Nines) • Planned downtimes possible • System not distributed internally |
    | Plateau of robustness | Maximize availability | Limits of perception | • High availability demands (> 3 Nines) • System distributed internally (e.g., microservices) |
    | High-plateau of basic resilience | Expect the unexpected | Standing still | • Safety-critical systems • High availability in unpredictable technical environments • Uncertain business environments |
    | Peak of advanced resilience | Adversity as opportunity to improve | - | • Successful endless game in an increasingly uncertain world |

    Remember: Business and IT have become inseparable
  39. Depending on your task and your needs, you do not

    always need to aim for the peak
  40. However, in the long run only those will thrive and

    survive in an increasingly VUCA world who aim for the peak Remember: Business and IT have become inseparable
  41. (Table) Characteristics per stop

    | | Feature-complete | Stability | Robustness | Basic resilience | Advanced resilience |
    |---|---|---|---|---|---|
    | Collaboration | Ops alone | Dev \| Ops | Dev + Ops | → | → |
    | Availability | - | Maximize MTTF | Minimize MTTR | → | → |
    | Failures | - | Crash, Overload | All failure types | → | → |
    | Responses | - | Withstand | + Recover | → | + Adapt, Transform |
    | Systems | Technical | → | → | Socio-technical | → |
    | Surprises | - | Known | → | Unknown | → |
    | Impact | - | Technology | + Business logic, Processes | + Processes, Organization | → |
  42. Wrap-up • Resilience is not what you think it is

    • The difficult path up Mount Resilience • How far to go and when it is okay to stop • Understanding where you are