How to Fight Production Incidents?

Andrey Satarin

January 25, 2023
Transcript

  1. How to Fight Production Incidents?
    An Empirical Study on a Large-scale Cloud
    Service
    By Supriyo Ghosh, Manish Shetty, Chetan Bansal, Suman Nath

    Presented by Andrey Satarin (@asatarin)

    January, 2023

    https://asatarin.github.io/talks/2023-01-how-to-fight-incidents/


  2. Outline
    • Methodology


    • Root causes and mitigation


    • What causes delays in response?


    • Lessons learnt


    • Multi-dimensional analysis


    • Conclusions

  3. Methodology

  4. Incidents to study
    • 152 incidents from Microsoft Teams


    • Analyze root causes, detection and mitigation approaches


    • Only incidents with complete postmortem report


    • High severity only: 1 incident SEV0, ~30% SEV1, ~70% SEV2

  5. Factors to study
    • Root Cause — What issue caused the incident?


    • Mitigation Steps — What steps were performed to restore service health?


    • Detection Failure — Why did monitoring not detect the incident?


    • Mitigation Failure — What challenges delayed incident mitigation?


    • Automation Opportunities — What automation can help improve service resilience?


    • Lessons for Resiliency — What lessons were learnt about the service’s behavior and
    improving resiliency?

  6. Threats to validity
    • Microsoft already uses some effective tools and techniques to proactively
    mitigate many types of incidents


    • About 35% of incidents were filtered out because they did not have a
    complete postmortem


    • Microsoft Teams incidents only

  7. Root causes and mitigation

  8. Root causes
    • Code Bug — 27.0 %


    • Dependency Failure — 16.4 %


    • Infrastructure — 15.8 %


    • Deployment Error — 13.2 %


    • Config Bug — 12.5 %


    • Database/Network — 10.5 %


    • Auth Failure — 4.6 %


  9. Finding #1
    • While 40% of incidents were root-caused to code or configuration bugs,
    a majority (60%) were caused by non-code-related issues in
    infrastructure, deployment, and service dependencies.


    • 40 % = Code Bug (27.0 %) + Config Bug (12.5 %)

  10. Mitigation steps
    • Rollback - 22.4 %


    • Infra Change - 21.1 %


    • External Fix - 15.8 %


    • Config Fix - 13.2 %


    • Ad-hoc Fix - 11.8 %


    • Code Fix - 7.9 %


    • Transient - 7.9 %


  11. Finding #2
    • Although 40% of incidents were caused by code/configuration bugs, nearly
    80% of incidents were mitigated without a code or configuration fix.


    • 80 % = 100 % - Config Fix (13.2 %) - Code Fix (7.9 %)

  12. Finding #3
    • Mitigation via rollback, infrastructure scaling, and traffic failover accounts
    for more than 40% of incidents, indicating their popularity for quick mitigation
    (see the rollback sketch below).


    • 40 % = Rollback (22.4 %) + Infra Change (21.1 %)
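
The paper does not prescribe a rollback mechanism; as a minimal sketch of why rollback is such a quick mitigation, the Python snippet below redeploys the most recent known-good release instead of waiting for a forward fix. The `Release` record and `deploy` hook are assumptions made for illustration, not part of the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Release:
    version: str
    healthy: bool  # marked by post-deployment validation


def deploy(version: str) -> None:
    # Hypothetical deployment hook; a real implementation would call the
    # service's deployment pipeline (e.g., redeploy a container image).
    print(f"deploying {version}")


def rollback(history: List[Release]) -> str:
    """Mitigate by redeploying the most recent known-good release.

    This avoids writing and validating a new code/config fix under incident
    pressure, which is one reason rollback dominates the mitigation steps.
    """
    for release in reversed(history[:-1]):  # skip the current (bad) release
        if release.healthy:
            deploy(release.version)
            return release.version
    raise RuntimeError("no known-good release to roll back to")


# Example: current release 1.4.0 is bad; roll back to 1.3.2.
history = [Release("1.3.1", True), Release("1.3.2", True), Release("1.4.0", False)]
print("rolled back to", rollback(history))
```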

  13. What causes delays in response?

  14. Finding #5
    • The time-to-detect for code bugs and dependency failures is significantly
    higher than for other root causes, indicating inherent difficulties in
    monitoring such incidents.

  15. Finding #6
    • Manually fixing code and configuration takes a significantly higher
    time-to-mitigate than rolling back changes. This supports the popularity of
    the latter method for mitigation.

  16. Detection failure
    • Not Failed — 52.0 %


    • Unclear — 11.8 %


    • Monitor Bug — 10.5 %


    • No Monitors — 8.6 %


    • Telemetry Coverage — 8.6 %


    • Cannot Detect — 4.6 %


    • External Effect — 4.0 %


  17. Finding #7
    • 17 % of incidents either lacked monitors or telemetry coverage, both of
    which result in significant detection delays (see the monitoring sketch below).


    • 17 % = No Monitors (8.6 %) + Telemetry Coverage (8.6 %)
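
Neither the paper nor the talk specifies what the missing monitors would have looked like; the sketch below is a hedged, minimal example of the kind of telemetry-backed alert rule such incidents lacked. The metric, window size, and threshold are invented for illustration.

```python
from collections import deque
from statistics import mean


class ErrorRateMonitor:
    """Toy sliding-window monitor: alerts when the error rate over the last
    `window` requests exceeds `threshold`. Incidents in the "No Monitors" and
    "Telemetry Coverage" buckets lacked exactly this kind of signal, so
    detection fell back to manual or customer reports."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.samples.append(1.0 if failed else 0.0)

    def should_alert(self) -> bool:
        # Alert only once the window is full, to avoid noisy cold starts.
        return (len(self.samples) == self.samples.maxlen
                and mean(self.samples) > self.threshold)


# Example: simulate 100 requests with a 10% failure rate.
monitor = ErrorRateMonitor()
for i in range(100):
    monitor.record(failed=(i % 10 == 0))
print("alert:", monitor.should_alert())
```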

  18. Mitigation failure category
    • Not Failed — 27.6 %


    • Unclear — 27.6 %


    • Documents-Procedures — 10.5 %


    • Deployment Delay — 10.5 %


    • Manual Effort — 9.2 %


    • Complex Root Cause — 7.2 %


    • External Dependency — 7.2 %


  19. Finding #8
    • While complex root causes can affect time-to-mitigate, 30% of incidents
    had mitigation delays even after the root cause was identified, due to poor
    documentation, procedures, and manual deployment steps.

  20. Lessons learnt

  21. Automation opportunities
    • Unclear — 32.2 %


    • Manual Test — 25.7 %


    • None — 15.1 %


    • Auto Alert/Triage — 15.1 %


    • Config Test — 5.9 %


    • Auto Deployment — 5.9 %

  22. Finding #9
    • Improving testing was a more popular choice among automation opportunities
    than monitoring, indicating a need to reduce incidents by identifying issues
    before they reach production services.

  23. Lesson learnt category
    • Unclear — 37.5 %


    • Improve Monitoring — 15.8 %


    • Behavioral Change — 11.8 %


    • External Coordination — 10.5 %


    • Improve Testing — 9.9 %


    • Documents/Training — 7.9 %


    • Auto Mitigation — 6.6 %


  24. Finding #10
    • While improving monitoring/testing accounts for the majority of the lessons
    learnt, a significant share (≈20%) of the feedback called for improved
    documentation, training, and practices for better incident management and
    service resiliency.


    • 20 % = Behavioral Change (11.8 %) + Documents/Training (7.9 %)

  25. Multi-dimensional analysis

  26. Finding #11
    • 70% of incidents with no monitors were root-caused to code bugs, i.e., it is
    inherently difficult to monitor regressions introduced by code changes.


    • => For code changes, we should improve testing rather than relying on
    monitoring.

  27. Finding #12
    • 42% of incidents that cannot be detected by monitoring today were
    associated with dependency failures.


    • => There is a need to introduce/increase monitoring coverage and
    observability across related services.
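
The study stops at the recommendation; as a hedged illustration of cross-service observability, the sketch below runs a synthetic probe against a downstream dependency from the calling service's side. The health-check URL and timeout are placeholders, not values from the paper.

```python
import urllib.request


def probe_dependency(url: str, timeout: float = 2.0) -> bool:
    """Synthetic health check against a downstream dependency.

    Running probes like this from the calling service's side adds the
    cross-service coverage the finding asks for, instead of relying only on
    the dependency's own monitors.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and socket timeouts both derive from OSError
        return False


# Placeholder endpoint for illustration; not from the paper.
if not probe_dependency("https://dependency.example.com/health"):
    print("dependency unhealthy: alert the owning team or fail over")
```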

  28. Finding #13
    • 47% of configuration bugs were mitigated with a rollback, compared to
    only 21% mitigated with a configuration fix; i.e., a large portion of
    misconfigurations are due to recent changes.


    • => They can be identified by rigorous configuration testing.
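
The paper does not define what rigorous configuration testing looks like; the sketch below is a minimal, assumed example of a pre-deployment configuration check that would catch a bad recent change before it ships. The config keys and limits are invented.

```python
# Hypothetical service configuration and schema, invented for illustration.
CONFIG = {
    "max_connections": 500,
    "request_timeout_ms": 2000,
    "feature_flags": {"new_codec": False},
}


def validate_config(config: dict) -> list:
    """Return a list of violations; an empty list means the change may ship."""
    errors = []
    if not 1 <= config.get("max_connections", 0) <= 10_000:
        errors.append("max_connections out of range")
    if config.get("request_timeout_ms", 0) <= 0:
        errors.append("request_timeout_ms must be positive")
    if not isinstance(config.get("feature_flags"), dict):
        errors.append("feature_flags must be a mapping")
    return errors


# Gate the change in CI: fail the pipeline instead of shipping the bad config.
violations = validate_config(CONFIG)
assert not violations, violations
```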

  29. Finding #14
    • In the 21% of incidents where manual effort delayed mitigation, postmortems
    expected improvements in documentation and training.


    • => Just like with source code, we need to design new metrics and methods
    to monitor documentation quality. Also, automating repeating mitigation
    tasks can reduce manual effort and on-call fatigue.

  30. Finding #15
    • In the 25% of incidents where the mitigation delay was due to manual
    deployment steps, postmortems expected automated mitigation steps to manage
    service infrastructure (like traffic-failover, node reboot, and auto-scaling).
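
The automated mitigation steps are only named in the finding (traffic failover, node reboot, auto-scaling); the sketch below is a hypothetical dispatcher that maps alert types to such actions so the on-call engineer does not perform them by hand. All action names and signatures are assumptions made for illustration.

```python
from typing import Callable, Dict


# Hypothetical mitigation actions; real versions would call the cloud
# provider's or orchestrator's APIs.
def traffic_failover(region: str) -> None:
    print(f"shifting traffic away from {region}")


def reboot_node(node: str) -> None:
    print(f"rebooting {node}")


def scale_out(service: str, extra: int) -> None:
    print(f"adding {extra} instances to {service}")


# Map alert types to runbook automation instead of manual on-call steps.
RUNBOOK: Dict[str, Callable[..., None]] = {
    "region_unhealthy": traffic_failover,
    "node_unresponsive": reboot_node,
    "cpu_saturated": scale_out,
}


def mitigate(alert_type: str, **kwargs) -> None:
    action = RUNBOOK.get(alert_type)
    if action is None:
        raise ValueError(f"no automated mitigation for {alert_type}")
    action(**kwargs)


# Example: an auto-scaling mitigation triggered without a human in the loop.
mitigate("cpu_saturated", service="chat-backend", extra=5)
```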

  31. Conclusions

  32. Conclusions
    • 152 incident reports studied


    • Identified potential automation opportunities


    • Multi-dimensional analysis uncovers important insights for improving
    reliability




  33. https://twitter.com/MSFT365Status/status/1618178407316987905


  34. Today’s outage
    > We've rolled back a network change

    Mitigation strategy — Rollback (22.4 %)


    > We've rolled back a network change

    Root cause — Database/Network (10.5 %)


    > We’re monitoring the service as the rollback takes effect


  35. References

  36. References
    • Self-reference for this talk (slides, video, etc.)

    https://asatarin.github.io/talks/2023-01-how-to-fight-incidents/


    • “How to fight production incidents?: an empirical study on a large-scale
    cloud service” paper https://dl.acm.org/doi/10.1145/3542929.3563482

  37. Contacts
    • Follow me on Twitter @asatarin


    • Follow me on Mastodon https://discuss.systems/@asatarin


    • Professional profile https://www.linkedin.com/in/asatarin/


    • Other public talks https://asatarin.github.io/talks/