What is SRE

What is SRE? Tammy Butow (@tammybutow), Principal SRE @ Gremlin

Agenda
1. What is SRE?
2. SRE Phases
3. SRE Use Cases
4. SRE Success Stories

Layers shown on the slide: Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring

What is SRE?

Site Reliability Engineering (SRE) is a software engineering strategy and methodology. The term SRE was coined by Ben Treynor (Google) in 2003. Site Reliability Engineering involves both ops work (tickets, on-call and manual tasks) and development work (internal tooling, SRE tools and building automated systems). The percentage of time spent on ops versus development depends on the needs of your organisation. It’s an important metric to track! Over time, the ops % for each system should decrease.
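One way to track the ops/dev split over time is sketched below. It is a minimal illustration, not a tool from the talk: the `TimeLog` record, the helper names and the hours are all hypothetical, and it assumes each system's time is logged as either ops or dev hours per period.

```python
from dataclasses import dataclass


@dataclass
class TimeLog:
    """Hours logged for one system in one period (hypothetical record)."""
    period: str
    ops_hours: float   # tickets, on-call, manual tasks
    dev_hours: float   # internal tooling, SRE tools, automation


def ops_percentage(log: TimeLog) -> float:
    """Share of logged time spent on ops work, as a percentage."""
    total = log.ops_hours + log.dev_hours
    if total == 0:
        raise ValueError(f"no hours logged for {log.period}")
    return 100.0 * log.ops_hours / total


def is_decreasing(values: list[float]) -> bool:
    """True if every value is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Made-up numbers, for illustration only.
logs = [
    TimeLog("Q1", ops_hours=320, dev_hours=80),
    TimeLog("Q2", ops_hours=200, dev_hours=200),
    TimeLog("Q3", ops_hours=100, dev_hours=300),
]
ops_trend = [ops_percentage(log) for log in logs]
print(ops_trend)                  # [80.0, 50.0, 25.0]
print(is_decreasing(ops_trend))   # True
```

A check like `is_decreasing` could feed a per-system report, so a system whose ops % stalls or rises stands out.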

“Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”
- Andrew Widdowson (SRE @ Google)

A day in the life of an SRE: Ops 50% of time, Dev 50% of time.

A day in the life of an SRE: Ops 80% of time, Dev 20% of time.

A day in the life of an SRE: Ops 25%, Dev 50%.

A day in the life of an SRE: Ops 25%, Dev 50%, unaccounted (?) 25%. This is time I can potentially share with another team!

SRE Phases

Phases shown on the slide: Plan, Code, Test, Build, Deploy, Operate, Productionize, Integration, Monitor

SRE Use Cases

Layers shown on the slide: Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring. The slide also carries the markers 1, 2 and 3.

SRE Use Case 1: Incident Response

SEV lifecycle
DETECTION: Alert & page for SEV
DIAGNOSIS: Discover source of SEV
MITIGATION: Introduce fix and mitigate impact of SEV
PREVENTION: GameDay to replicate SEV and confirm fix is reliable
CLOSURE: Understand root cause and complete all SEV action items
The slide then shows DETECTION again (alert & page for SEV).

Metrics
TTD (Time to Detection)
TTI (Total time of Impact)
TTR (Time to Recovery)
TBF (Time between failures)

ROLES & RESPONSIBILITIES
Incident Manager On-Call (IMOC): The IMOC leads and coordinates the SEV team through the SEV lifecycle.
Tech Lead On-Call (TLOC): The TLOC settles in the trenches and stays laser-focused on technical problem solving.
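The four metrics named on the slide can be derived from a few timestamps per SEV. The sketch below is only one possible reading: the slide gives the metric names but not their exact boundaries, so the definitions in the docstrings are assumptions, and the `Sev` record, its field names and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sev:
    """Timestamps for one SEV (field names are hypothetical)."""
    started: datetime     # impact begins
    detected: datetime    # alert fires and on-call is paged
    recovered: datetime   # fix is in place and impact ends


def ttd(sev: Sev) -> timedelta:
    """Time to Detection: impact start to alert (assumed definition)."""
    return sev.detected - sev.started


def ttr(sev: Sev) -> timedelta:
    """Time to Recovery: detection to recovery (assumed definition)."""
    return sev.recovered - sev.detected


def tti(sev: Sev) -> timedelta:
    """Total time of Impact: impact start to recovery (assumed definition)."""
    return sev.recovered - sev.started


def tbf(previous: Sev, current: Sev) -> timedelta:
    """Time between failures: previous recovery to next impact start (assumed)."""
    return current.started - previous.recovered


# Made-up incidents, for illustration only.
first = Sev(
    started=datetime(2019, 1, 7, 10, 0),
    detected=datetime(2019, 1, 7, 10, 5),
    recovered=datetime(2019, 1, 7, 10, 45),
)
second = Sev(
    started=datetime(2019, 1, 21, 9, 0),
    detected=datetime(2019, 1, 21, 9, 2),
    recovered=datetime(2019, 1, 21, 9, 30),
)

print(ttd(first))          # 0:05:00
print(ttr(first))          # 0:40:00
print(tti(first))          # 0:45:00
print(tbf(first, second))  # 13 days, 22:15:00
```

Under these definitions TTI is always TTD plus TTR. A team that measures TTR from the start of impact instead would need to adjust `ttr` accordingly.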

SRE Use Case 2: Postmortem Analysis

Postmortem: SEV 0 Slow Walrus
Owner: IMOC (), TLOC ()
Status: Final/Draft
Incident Date:
Published Date:

Executive Summary
Impact:
Root causes:

Problem Summary:
Duration of problem:
Product(s) affected:
% of product affected:
User Impact:
Revenue Impact:
Detection:
Resolution:

Root Causes & Trigger:
Timeline / Recovery efforts:

Lessons Learned:
What went well?
What went poorly?
• Outage
• Recovery
Where did we get lucky?

Action Items:
Glossary:
Appendix:
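A template like this can also be kept as structured data, which makes it easy to see which sections are still empty before a draft is marked final. The sketch below is illustrative only: the `Postmortem` class, its field names and the helper functions are hypothetical, and it covers only a subset of the template's sections.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    """A subset of the template's sections (field names are hypothetical)."""
    title: str
    status: str = "Draft"          # "Draft" or "Final"
    impact: str = ""
    root_causes: str = ""
    detection: str = ""
    resolution: str = ""
    what_went_well: list[str] = field(default_factory=list)
    what_went_poorly: list[str] = field(default_factory=list)
    where_we_got_lucky: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)


def missing_sections(pm: Postmortem) -> list[str]:
    """Names of sections that are still empty, in template order."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]


def ready_to_publish(pm: Postmortem) -> bool:
    """A draft is ready once no section is empty."""
    return not missing_sections(pm)


# Placeholder text, for illustration only.
pm = Postmortem(title="SEV 0 Slow Walrus", impact="placeholder impact summary")
print(missing_sections(pm))
print(ready_to_publish(pm))   # False
```

Treating every section as required is a design choice for this sketch. A real process might allow some sections, such as a glossary, to stay empty.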

Diagram labels: Incident Database, Postmortem Analysis Dashboard, Postmortems, Postmortem Database
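As a rough idea of what a postmortem analysis dashboard might compute from a postmortem database, the sketch below summarises a few rows. The row shape, the key names, the `summarize` function and the data are all hypothetical; the slide does not describe the schema or the dashboard's contents.

```python
from collections import Counter

# Hypothetical rows, as they might be read from a postmortem database.
postmortems = [
    {"sev": 0, "root_cause": "config change", "open_action_items": 2},
    {"sev": 1, "root_cause": "capacity", "open_action_items": 0},
    {"sev": 1, "root_cause": "config change", "open_action_items": 1},
]


def summarize(rows):
    """Aggregate counts a dashboard could display."""
    return {
        "by_sev": Counter(row["sev"] for row in rows),
        "by_root_cause": Counter(row["root_cause"] for row in rows),
        "open_action_items": sum(row["open_action_items"] for row in rows),
    }


summary = summarize(postmortems)
print(summary["by_sev"][1])                     # 2
print(summary["by_root_cause"].most_common(1))  # [('config change', 2)]
print(summary["open_action_items"])             # 3
```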

SRE Use Case 3: Incident Reproduction

Diagram labels: Postmortem, Gremlin Scenarios, Incident Reproduction, Results, Automate Gremlin Scenarios

SRE Success Stories

SRE Success Stories: Dropbox
10x reduction in incidents in 3 months
No SEV 0s for 12+ months
Reduction in on-call time %
Increase in team engagement

SRE Success Stories: Gremlin
Regular monthly GameDays
Identification of 10+ critical issues
Reduction in on-call training time
Increase in team knowledge

Join the community: gremlin.com/slack

Thank You linkedin.com/in/tammybutow/ @tammybutow


Tammy Bryant Butow
