the incident?
• Mitigation Steps: What steps were performed to restore service health?
• Detection Failure: Why did monitoring not detect the incident?
• Mitigation Failure: What challenges delayed incident mitigation?
• Automation Opportunities: What automation can help improve service resilience?
• Lessons for Resiliency: What lessons were learnt about the service’s behavior and improving resiliency?
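A postmortem template like this can be modeled as a record with one field per question, which makes it straightforward to check whether a report is complete. The following is a minimal sketch, not the study's tooling: the class name `Postmortem`, the field names, and the helper methods are hypothetical, and only the five questions fully listed above are modeled.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record: one free-text answer per template question."""
    mitigation_steps: str = ""
    detection_failure: str = ""
    mitigation_failure: str = ""
    automation_opportunities: str = ""
    lessons_for_resiliency: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of questions that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        """A postmortem counts as complete when every question has an answer."""
        return not self.missing_fields()
```

A report with any blank answer would fail `is_complete()`, and `missing_fields()` names the questions that still need attention.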
and techniques to proactively mitigate many types of incidents
• About 35% of incidents were filtered out because they did not have a complete postmortem
• Microsoft Teams incidents only
code or configuration bugs, a majority (60%) were caused by non-code issues in infrastructure, deployment, and service dependencies.
• 40% ≈ Code Bug (27.0%) + Config Bug (12.5%)
traffic failover account for more than 40% of incidents, indicating their popularity for quick mitigation.
• >40%: Rollback (22.4%) + Infra Change (21.1%) = 43.5%
the lessons learnt, a significant ≈20% of feedback called for improved documentation, training, and practices for better incident management and service resiliency.
• 20% ≈ Behavioral Change (11.8%) + Documents/Training (7.9%)
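The headline figures on these slides (40%, more than 40%, ≈20%) are rounded sums of two categories each. The short sketch below simply adds up the percentages quoted on the slides to show the exact totals; the dictionary and function names are illustrative only.

```python
# Percentages exactly as quoted on the slides.
root_cause = {"Code Bug": 27.0, "Config Bug": 12.5}
mitigation = {"Rollback": 22.4, "Infra Change": 21.1}
lessons = {"Behavioral Change": 11.8, "Documents/Training": 7.9}


def combined_share(categories: dict) -> float:
    """Sum the category percentages, rounded to one decimal place."""
    return round(sum(categories.values()), 1)


for label, categories in [
    ("root cause", root_cause),
    ("mitigation", mitigation),
    ("lessons", lessons),
]:
    print(f"{label}: {combined_share(categories)}%")
```

The totals come out to 39.5%, 43.5%, and 19.7%, which is why the slides round them to 40%, "more than 40%", and ≈20%.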
root caused to code bugs, i.e., it is inherently difficult to monitor regressions introduced by code changes.
• => For code changes, we should improve testing rather than rely on monitoring.
by monitoring today, were associated with dependency failures.
• => There is a need to introduce/increase monitoring coverage and observability across related services.
rollback, compared with only 21% mitigated with a configuration fix; i.e., a large portion of misconfigurations are due to recent changes.
• => They can be identified by rigorous configuration testing.
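One way to read "rigorous configuration testing" is to validate every proposed configuration change against explicit rules before it is rolled out. The sketch below is an assumption about what such a check could look like, not something taken from the study: the keys `request_timeout_ms`, `replicas`, and `dependency_endpoint` and their rules are hypothetical.

```python
def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config passed."""
    problems = []

    # Hypothetical rule: timeouts must be positive integers.
    timeout = config.get("request_timeout_ms")
    if not isinstance(timeout, int) or timeout <= 0:
        problems.append("request_timeout_ms must be a positive integer")

    # Hypothetical rule: keep at least two replicas for redundancy.
    replicas = config.get("replicas")
    if not isinstance(replicas, int) or replicas < 2:
        problems.append("replicas must be an integer >= 2")

    # Hypothetical rule: dependencies are reached over HTTPS only.
    endpoint = config.get("dependency_endpoint")
    if not isinstance(endpoint, str) or not endpoint.startswith("https://"):
        problems.append("dependency_endpoint must be an https:// URL")

    return problems
```

Running such a check in the deployment pipeline would reject a bad change before it reaches production, instead of relying on a rollback afterwards.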
mitigation, expected improvements in documentation and training.
• => Just as with source code, we need to design new metrics and methods to monitor documentation quality. Also, automating repetitive mitigation tasks can reduce manual effort and on-call fatigue.
due to manual deployment steps, expected automated mitigation steps to manage service infrastructure (such as traffic failover, node reboot, and auto-scaling).
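Automated mitigation of this kind can be thought of as a runbook that maps an observed symptom to a predefined action. The sketch below is purely illustrative: the symptom names, the `RUNBOOK` table, and the action functions are hypothetical stubs that only describe the step they would take, and a real system would call the service's own infrastructure tooling instead.

```python
def traffic_failover(target: str) -> str:
    """Stub: describe shifting traffic away from an unhealthy target."""
    return f"shift traffic away from {target}"


def node_reboot(target: str) -> str:
    """Stub: describe rebooting an unresponsive node."""
    return f"reboot node {target}"


def auto_scale(target: str) -> str:
    """Stub: describe adding capacity to an overloaded target."""
    return f"add capacity to {target}"


# Hypothetical mapping from symptom to automated mitigation step.
RUNBOOK = {
    "region_unhealthy": traffic_failover,
    "node_unresponsive": node_reboot,
    "capacity_exhausted": auto_scale,
}


def plan_mitigation(symptom: str, target: str) -> str:
    """Pick the automated action for a symptom, or fall back to a human."""
    action = RUNBOOK.get(symptom)
    if action is None:
        return f"no automated action for {symptom!r}; page the on-call engineer"
    return action(target)
```

Keeping an explicit fallback to the on-call engineer means that unknown symptoms are escalated rather than handled by the wrong automated step.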
Mitigation strategy: Rollback (22.4%)
> We’ve rolled back a network change
Root cause: Database/Network (10.5%)
> We’re monitoring the service as the rollback takes effect
https://asatarin.github.io/talks/2023-01-how-to-fight-incidents/
• “How to fight production incidents?: an empirical study on a large-scale cloud service” paper: https://dl.acm.org/doi/10.1145/3542929.3563482
on Mastodon: https://discuss.systems/@asatarin
• Professional profile: https://www.linkedin.com/in/asatarin/
• Other public talks: https://asatarin.github.io/talks/