Resilient software design - The past, the present and the future

This slide deck takes stock of resilience in IT and resilient software design and looks ahead. It starts with the past: how resilience was treated inside and outside of IT.

Then it moves to the current state, where things have become a bit quiet around resilient software design, while at the same time microservices are everywhere and the complexity of the overall IT system landscape explodes, becoming a bit worse every day. One would expect the problems to become apparent to everyone, but surprisingly they do not.

Still, companies are starting to realize at the business level that they might need more resilience, because unexpected adverse events and uncertainty are becoming the norm. And as business and IT have become inseparable due to the ongoing digital transformation, this also affects IT. But as old habits die hard, investments are still scarce.

In the third part, the slide deck looks at the future of resilience and resilient software design. The prediction is that resilience, alongside sustainability, will become the most important topic of IT (and of the companies it supports) in the 21st century. Still, to get there, some homework needs to be done first. The slide deck lists a selection.

Finally, a few recommendations are given on where to start and how to organize your way towards a resilient IT organization, including a robust IT system landscape.

As always, the slide deck does not contain the voice track which means a lot of details are missing. Still, I hope it gives you a few ideas to ponder ...

Uwe Friedrichsen

November 17, 2022

Transcript

  1. Resilient Software Design: The past, the present and the future • Uwe Friedrichsen – codecentric AG – 2013-2022
  2. Fault tolerance (early days) • Fault tolerance started decades ago • SAPO (1950s) • NASA LLNM computing (1960s), e.g., for Apollo and Voyager • F14 CADC (1970s) • Telecommunication switches (1970s) • Tandem Computers, Inc. (1970s) • Fault tolerance typically solved at hardware and OS level • Software development usually only affected marginally
  3. Fault tolerance (continued) • Boom with rise of cloud and microservices (early 2010s) • E.g., Netflix OSS (especially Hystrix) • More software development attention • Called “resilience”, but still focus on fault-tolerance • Meanwhile more infrastructure-level support • E.g., service meshes, API gateways, cloud infrastructure • Often neglected for the sake of cost-efficiency
  4. Resilience outside of IT • Multidisciplinary field • Psychological resilience • Organizational resilience • Supply chain resilience • Ecological resilience • Resilience engineering (safety) • Materials resilience • Cyber resilience (security) • ...
  5. Resilience outside of IT (continued) • Often different focus • Robustness/fault tolerance: Handle known failure modes • Resilience: Adapt to unknown failure modes (“surprises”) • Still, no generally accepted definition • Definition depends on the field • Sometimes robustness is part of it, sometimes it is not • Often related to safety and dependability • Common ground: Resilience is more than just robustness
  6. Bottom line (past) • Broad multidisciplinary field • No generally accepted definition • Used as synonym for fault-tolerance in IT • Usually expected to be solved at infrastructure level • Often ignored for the sake of maximizing cost-efficiency
  7. Ignoring the effects of distribution • Architects ignore effects of distribution • Developers ignore effects of distribution • Everyone else expects things to become faster and cheaper • Development expects infrastructure to solve the problems • Operations curses and is stressed out
  8. You build it. You ignore it. Build things as you like and neglect the consequences of your actions!
  9. Effects of distributed systems • Distributed systems introduce non-determinism regarding • Execution completeness • Message ordering • Communication timing • You will be affected by this at the application level • Don’t expect your infrastructure to hide all effects from you • Better know how to detect if it hit you and how to respond
  10. Infrastructure level means • Detect if a peer does not respond (in time) • Retry accessing the peer • Try to access a different instance from the failover group • Try to fire up new instances (after instance loss is detected, or if load exceeds a certain level, i.e., “autoscale”) • Throttle incoming requests • Notify administrators if additional action is required • …
  11. Infrastructure level limitations • Not all failure modes supported (e.g., response failures) • Not all patterns supported (e.g., idempotency, fallback) • Not ubiquitously available (e.g., on-premises autoscale) • Often support from application level required (e.g., metrics) • Only undifferentiated, coarse-grained actions possible
  12. The question no longer is if failures will hit you. The only question left is when and how badly they will hit you.
  13. System landscape complexity • New development projects only focus on local optimization • Ignoring impact on complexity of whole system landscape • Leads to disproportionate increase in complexity • New paradigms only focus on their advantages • Ignoring effects on complexity of whole system landscape • Leads to disproportionate increase in complexity • Only a matter of time until IT will collapse beyond repair
  14. This requires resilience thinking beyond application robustness. But it also requires more focus on application robustness, i.e., resilient software design.
  15. Bottom line (present) • Understanding grows that resilience is needed at all levels • Complexity of IT landscapes has become a problem • Still, investments are scarce • “It’s going to be alright” mindset still prevalent
  16. Homework that needs to be done • Stop fighting about the “right” definition of resilience • In the end, all resilience proponents have the same goal • Debates about the “right” definition only confuse other people • Makes it harder to spread the ideas and their implementation
  17. resilience: The ability to successfully cope with adverse events and situations, including 1. handling expected adverse events and situations (robustness) 2. handling unexpected adverse events and situations (surprise) 3. improving due to adverse events and situations (anti-fragility) • resilient software design: Designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above
  18. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Maximizing efficiency cripples resilience
  19. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Maximizing efficiency cripples resilience • Short-term thinking compromises resilience • Focus on minimizing short-term development costs compromises resilient software design • Huge change of ingrained mindset
  20. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Understand resilience in IT • Resilience is a socio-technical topic • Cannot be solved at the technical level alone • Cannot be solved with tools or products • Technology can only support
  21. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Understand resilience in IT • Understand resilient software design • Cannot be solved at the infrastructure level • Requires tight ops-dev feedback loops to be effective • Without a proper functional design, nothing else matters
  22. Some recommendations • Regarding system design • Mind the functional design • Strive for functional independence of runtime units • Then augment with resilience patterns • Domain-driven design can support
  23. Some recommendations • Regarding system design • Regarding software landscape grooming • Simplify! – Complexity is the enemy of resilience • Coordinate infrastructure, application and organization level measures
  24. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Establish short feedback loops across the IT value chain • Make resilience a continuous improvement process • Include chaos engineering
  25. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Regarding product functionality • Simplify! – Keep the product simple • Regularly remove features that are rarely or not used • Implement business metrics
  26. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” -- Antoine de Saint-Exupéry
  27. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Regarding product functionality • Regarding humans • Provide great user experience for all types of users • Provide training for all parties along the IT value chain
  28. More to ponder • Organic computing • Interplay between resilience and sustainability • Interplay between resilience and security • Resilience beyond robustness, withstanding and recovery
  29. Summing up • Resilience is a huge multidisciplinary topic • Started as fault-tolerance in IT • Had a little hype a few years ago • Will become an essential topic of the 21st century • Much more than fault-tolerance or robustness alone • Awareness increases • Yet currently little investment • Lots of homework to be done
  30. The future is already here – it's just not evenly distributed. -- William Gibson
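
Slides 10 and 11 name retry, failover, fallback and idempotency as resilience measures, and note that fallback and idempotency usually need support from the application level. The following is a minimal Python sketch of how these patterns might look in application code. It is not taken from the deck. All names (`PeerError`, `call_with_resilience`, `IdempotentReceiver`) are hypothetical, and the in-memory result store is an assumption made for brevity; a real system would likely need a shared, persistent store.

```python
import random
import time
from typing import Callable, Dict, List


class PeerError(Exception):
    """Hypothetical error: the peer failed or did not respond in time."""


def call_with_resilience(
    instances: List[Callable[[], str]],
    fallback: Callable[[], str],
    retries_per_instance: int = 2,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry each instance of a failover group, then use a fallback."""
    for call in instances:                      # failover: try the next instance
        for attempt in range(retries_per_instance):
            try:
                return call()                   # success: stop immediately
            except PeerError:
                # Back off exponentially (with jitter) before the next attempt.
                delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
                sleep(delay)
    # All instances exhausted: degrade gracefully instead of failing.
    return fallback()


class IdempotentReceiver:
    """Runs the handler at most once per request id (assumed unique per request)."""

    def __init__(self, handler: Callable[[str], str]) -> None:
        self._handler = handler
        self._results: Dict[str, str] = {}      # assumption: in-memory store

    def handle(self, request_id: str, payload: str) -> str:
        # A retried request with a known id returns the stored result
        # instead of executing the side effect a second time.
        if request_id not in self._results:
            self._results[request_id] = self._handler(payload)
        return self._results[request_id]
```

Retries are only safe if the receiving side is idempotent, which is why the two pieces are shown together: the caller may send the same request more than once, and the receiver makes sure the effect happens only once.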