Incident Command

Ben Sheldon
September 04, 2025

Transcript

  1. Outline
    • Define Operations, Operators, and Incidents
    • Establish My Authority
    • Define Roles
    • Describe Practices for operationalizing roles
    • Intermission for Praxis
    • Offer Tools that support roles and practices
    • Laundry list Everything Else and Adjourn
  2. Crew management • Plane Maintenance • Business Process • Development • Infrastructure • Business Process
    Pilots fly the plane • Someone interacts with the production system.
    Aviation • IT/ICT
    Operations • Management • Operators
  3. “Incident”: when things go boom
    (small potatoes) Limited Impact Event
    (big melons) Business Continuity Event
    • A feature is inaccessible to a small number of users
    • Website is running slowly
    • Something is spelled wrong
    • Expectations don’t match reality
    • Entire service is temporarily unavailable
    • Temporary infrastructure or networking failure
    • Significant misconfiguration
    • Everything gets deleted
    • Major contract compliance failure
    • Unrecoverable data
    • Critical service partner (AWS, Twilio, GitHub) has business continuity-level failure
    Note: “Events” are ambiguous because all systems run with a certain level of accepted failure (P2s).
  4. Operator Roles
    Role: Commander
    • Responsibilities: Go-to person for reporting on the state of the incident. Establishes impact and manages expectations of recovery. Requests resources and sets objectives.
    • In practice: “I am IC”. To IC: “SitRep?”
    Role: Communications
    • Responsibilities: Maintains a regular cadence (30 min) of outbound updates. Coordinates messaging with internal and external partners (e.g. client/partner success teams).
    • In practice: “I’ll take comms”
    Role: Contributor
    • Responsibilities: Investigates and remediates the problem.
    • In practice: “I can help”
  5. Practices
    • Clearly declare that an incident (or incidents) is in progress. “GetCalfresh.org is unavailable.”
    • Declaratively narrate actions, assumption of roles, and handoffs. “I am logging into the production server”; “I am IC”; “You have comms”; “I have comms”
    • Establish command before requesting “all hands on deck”
    • Be aware of your limitations (experience, physical, external obligations) and no-shame escalate.
    • Do NOT gripe (now) about how previous bad decisions led to this
    • Write down the steps you take/took to recover for next time in a reliable place
    Act to restore confidence in the system-as-a-whole.
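The declaring, narrating, and handoff practices above, together with the 30-minute update cadence from the Communications role, can be sketched as a small incident log. This is only an illustration, not a tool from the talk: the `IncidentLog` class, its method names, and the people "alex" and "blair" are all hypothetical, and the generic "I have <role>" phrasing stands in for whatever wording a team actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 30-minute outbound update cadence from the Communications role.
UPDATE_CADENCE = timedelta(minutes=30)


@dataclass
class IncidentLog:
    """Hypothetical record of one incident: who holds which role, and what was said."""
    summary: str                      # the clear declaration that an incident is in progress
    declared_at: datetime
    roles: dict = field(default_factory=dict)    # role name -> person
    entries: list = field(default_factory=list)  # (timestamp, who, statement)
    last_update: Optional[datetime] = None

    def narrate(self, who: str, statement: str, at: datetime) -> None:
        # Write down each action as it happens, in one reliable place.
        self.entries.append((at, who, statement))

    def assume(self, role: str, who: str, at: datetime) -> None:
        # Record an explicit handoff when the role already has a different holder.
        previous = self.roles.get(role)
        if previous is not None and previous != who:
            self.narrate(previous, f"You have {role}", at)
        self.roles[role] = who
        self.narrate(who, f"I have {role}", at)

    def send_update(self, who: str, text: str, at: datetime) -> None:
        self.narrate(who, f"UPDATE: {text}", at)
        self.last_update = at

    def update_due(self, now: datetime) -> bool:
        # An update is due once a full cadence has passed since the last one
        # (or since the declaration, if no update has been sent yet).
        reference = self.declared_at if self.last_update is None else self.last_update
        return now - reference >= UPDATE_CADENCE


# Illustrative timeline; the timestamps and names are made up.
start = datetime(2025, 1, 1, 12, 0)
log = IncidentLog("GetCalfresh.org is unavailable.", declared_at=start)
log.assume("IC", "alex", start)
log.assume("comms", "alex", start)
log.assume("comms", "blair", start + timedelta(minutes=5))
log.send_update("blair", "Site is down; investigating.", start + timedelta(minutes=10))
```

Keeping roles and narration in one structure means a later question such as "who has comms, and when is the next update due?" has a single place to be answered, which is the point of narrating declaratively in the first place.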
  6. “Do what a person of good character would do” [1]
    [1] Pritchard, M; Broom, T. “The Concrete Sumo: Exigent Decision-Making in Engineering”. Science and Engineering Ethics 1999 October; 5(4): 541-567.
    Intermission
  7. “Working the problem” is NASA-speak for descending one decision tree after another, methodically looking for a solution until you run out of oxygen. We practice the “warn, gather, work” protocol for responding to fire alarms so frequently that it doesn’t just become second nature; it actually supplants our natural instincts. So when we heard the alarm on the Station, instead of rushing to don masks and arm ourselves with extinguishers, one astronaut calmly got on the intercom to warn that a fire alarm was going off – maybe the Russians couldn’t hear it in their module – while another went to the computer to see which smoke detector was going off. No one was moving in a leisurely fashion, but the response was one of focused curiosity; as though we were dealing with an abstract puzzle rather than an imminent threat to our survival. To an observer it might have looked a little bizarre, actually: no agitation, no barked commands, no haste.
    Chris Hadfield - “An Astronaut’s Guide to Life on Earth”
  8. ...and all the rest.
    • Monitoring
    • Forensic environments
    • Drilling and scenarios
    • Postmortems
    • SLOs / SLAs
    • Error budgets
    • Determining Impact
    • Risk management
    • Incident Prioritization
    • Cynefin
    Thanks!