Chaos Conf
September 26, 2019

THE FUTURE OF CHAOS ENGINEERING: IN PURSUIT OF THE UNKNOWN UNKNOWNS

Crystal Hirschorn, Condé Nast

"Systems fail all the time," goes the popular mantra in the reliability and resilience engineering fields. Given this premise, the practices of industry-leading organizations have accelerated and matured considerably compared to where we were even a few years ago. Organizations are beginning to stretch beyond homegrown approaches to building organizational resilience: they are leveraging expertise from across the industry and integrating resilience approaches directly into the software deployment lifecycle through commoditized Chaos services.

However, our systems and organizations keep growing in complexity under ever-increasing pressure for efficiency and scale. Our architectural approaches and paradigms keep shifting to cope with this complexity, as seen in the wide adoption of microservices and serverless development approaches.

A current limiting factor in running Chaos experiments is their contrived nature: we must think ahead about what could go wrong. Is this true to experience? What about the sense of surprise that usually pervades failure situations? How can we facilitate more random, generative experiments?

In this talk, Crystal will outline where our Chaos and Resilience practices must evolve to keep pace with the challenges of growing complexity.

Transcript

  1. The Future of Chaos Engineering: In Pursuit of the Unknown Unknowns
     Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn
  2. "Complexity doesn't allow us to think in linear, unidirectional terms along which progress or regress could be plotted."
  3. BRAZIL, AUSTRALIA, ARABIA, CHINA, FRANCE, GERMANY, INDIA, ITALY, JAPAN, KOREA, LATIN AMERICA, NETHERLANDS, POLAND, PORTUGAL, SOUTH AFRICA, SPAIN, TAIWAN, THAILAND, TURKEY, UK, UKRAINE, HUNGARY, BULGARIA, ICELAND, ROMANIA, CZECH REP, SLOVAKIA, MEXICO, RUSSIA
  4. SIMPLE: Known knowns, Best Practice
     COMPLICATED: Known Unknowns, Good Practice
     COMPLEX: Unknown Unknowns, Emergent Practice
     CHAOTIC: Unknowables, Novel Practice
     Disorder
  5. Organisational Pressures and Constraints
     Outside influences, Internal (org) influences, Operator influences
     Regulators, Policies, Economics, Competition, Governance, Logistics, Management, Efficiency Trade Offs, Automation, Time criticality, Esoteric knowledge, Mental models, Ergonomics, OpEx vs CapEx pressures, Lacking details, Culture norms, Geopolitical, Vendors, Societal culture, Workload, Cognitive switching
  6. Actions. Other sources for learning opportunities.
     Action 1
       Description: Gaps identified in architectural knowledge. Mary will do a 2 weeks rotation to shadow and pair on team Orion.
       Artefacts: Whiteboard diagrams from post-incident review
       Owner: Orion
     Action 2
       Description: Incident Management process did not flow in expected order. Escalations were delayed. Schedule more role playing and game days.
       Artefacts: Game Day template, Incident Management Process
       Owner: SRE
     Action 3
       Description: Too many graphs are being displayed in single dashboard. Many are not easily discernible by product engineering. Zenith to work with Orion and Hydra teams on system metrics visualisation strategy.
       Artefacts: DataDog dashboard (timestamped to match incident timings)
       Owner: Zenith
  7. THANK YOU
     Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn