Shit hit the fan—now what?
You know to build resilient systems and make small, planned changes, but computers (and humans) still fail. How do you deal with such failures? How do you recover?
Enter the Incident Commander. Adapted from government and military incident response processes, the Incident Commander (IC) role handles the technical triage and orchestration needed to reach a swift resolution during a crisis. The IC process focuses on clear communication, delegation, and trust between teams.
New Relic has used the IC process for over two years, iterating on and refining it as we go. We train all our engineers to be ICs and have used this process to handle everything from small deployment hiccups to network outages. We’ve built tools to support and archive our incident responses and have seen significant improvement in our understanding of, and response to, such situations.
This talk will discuss the IC role, why you want it, how we iterated on it, lessons learned in the field, and the tools we built to support it.