engineering strategy and methodology. The term SRE was coined by Ben Treynor (Google) in 2003. Site Reliability Engineering involves both ops work -- tickets, on-call & manual tasks -- and development work -- internal tooling, SRE tools and building automatic systems. The percentage of time spent on ops/development depends on the needs of your organisation. It’s an important metric to track! Over time the ops % for each system should decrease. 4 @tammybutow
MITIGATION PREVENTION CLOSURE DETECTION Alert & page for SEV Discover source of SEV Introduce fix and mitigate impact of SEV TTD (Time to Detection) TTI (Total time of Impact) GameDay to replicate SEV and confirm fix is reliable Alert & page for SEV Understand root cause and complete all SEV action items TTR (Time to Recovery) TTD (Time to Detection) TBF (Time between failures) ROLES & RESPONSIBILITIES Incident Manager On-Call (IMOC) Tech Lead On-Call (TLOC) The IMOC leads and coordinate the SEV team through the SEV lifecycle. The TLOC settles in the trenches and stays laser-focused on technical problem solving
0 Slow Walrus Owner: IMOC (), TLOC () Status: Final/Draft Incident Date: Published Date: Executive Summary Impact: Root causes: Problem Summary: Duration of problem: Product(s) affected: % of product affected: User Impact: Revenue Impact: Detection: Resolution: Root Causes & Trigger: Timeline / Recovery efforts: Lessons Learned: What went well? What went poorly? • Outage • Recovery Where did we get lucky? Action Items: Glossary: Appendix: