SRE at Google uses an incident management protocol to handle outages of the complex distributed systems we run. Outages are stressful and have a fundamental human factor: many people can be involved, and the regular organisational structure might not help unlock the full potential of those people.
From psychological safety to training and protocols, this talk discusses everything that goes on when services are not available.