SRE NEXT 2022: Sensible Incident Management for Software Startups

Transcript

  1. Who?

    Name: Takayuki Watanabe • Affiliation: Launchable, Inc. • Role: Software Engineer • SNS: Blog: blog.takanabe.tokyo, GitHub: takanabe, Twitter: @takanabe_w • Interests: Developer Productivity, Site Reliability Engineering, Sustainability Engineering. SRE NEXT 2022 / presented by Takayuki Watanabe@Launchable, Inc. (May 15, 2022)
  2. Your takeaways

    You can understand: • Incident management has a life cycle. • Incident response roles and structures exist to embody the 3T mental models. • Choosing strategies and tools makes incident management at startups sensible.
  3. Out of scope

    • Fundamental SRE terminology (e.g. SLO, SLI, error budget, postmortem)
  4. Disclaimer

    • This session refers to many existing incident management and SRE practices. • It also contains many opinionated ideas and philosophy. • So the ideas might contradict some people's views. • Let's discuss on Twitter using #srenext with @takanabe_w
  5. Today's agenda

    • About Launchable • Does a startup need incident management? • Dissect incident management practices • 3T mental models and life cycles • How can we improve incident management? • Choosing the right strategies and tools
  6. Chapter 1: About Launchable
  7. What is Launchable?

    A SaaS accelerating software development cycles.
  8. What is Launchable?

    The current focus is machine-learning-based test selection by: • Predicting a meaningful subset of tests. • Identifying flaky tests. • Visualizing test trends with metrics.
  9. What is Launchable?

    e.g. Reordering tests based on likelihood of failure.
  10. Our team size

    • Launchable is a startup • 2 CEOs + 15 employees • Software engineers (7 people) • Product manager • Marketing • Sales • etc.
  11. Phases and the number of software engineers

    Note: the numbers are estimated by the presenter based on previous experience. • Phase 0: Founding ~ 4 software engineers • Phase 1: 5 ~ 10 software engineers • Phase 2: 11 ~ software engineers
  12. My SRE NEXT 2022 is about ...
  13. Incident management at software startups
  14. Does a startup need incident management?
  15. Yes, it's obvious if products have customers.
  16. Do you have enough engineering members?
  17. Learning from previous careers [1]

    • I've worked at companies of various sizes and stages. • Company A: 300,000+ people • Company B: 400+ people (joined when they had 300+ people) • Company C: 150+ people (joined when they had fewer than 10 people) • Product development is always the highest-priority concern. • Operation improvement != product development velocity degradation. • We will never have enough engineering members to improve operations. Never. [1] SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
  18. Are speed and quality a trade-off?

    • I personally don't think so [2][3]. • I believe sensible incident management accelerates our development velocity. [2] martinFowler.com: Is High Quality Software Worth the Cost? [3] A Philosophy of Software Design, Chapter 3: Working Code Isn't Enough, pp. 13-18.
  19. Can we reframe the original question?

    • We want to reframe "Does a startup need incident management?" to: • Which incident management processes won't change even for rapid development? • Which processes should we improve? • Let's dissect incident management practices in the industry.
  20. Chapter 2: Dissect incident management practices
  21. What is incident management?

    Incident management • The high-level, overall process for handling incidents in an organization. Incident response • The part of incident management that covers the actual technical steps, including detection, reporting, mitigation, and recovery during incidents.
  22. Examples of terminology [4]

    • CAN Reports • Deputy • Executive Swoop • Grenade Thrower • Incident Commander (IC) • Resolver • Severity • Scribe • Subject Matter Expert (SME) [4] https://response.pagerduty.com/training/glossary/
  23. Examples of roles at Google [5][6]

    [5] Google SRE Workbook, Chapter 9: Incident Response [6] Anatomy of an Incident: Google's Approach to Incident Management for Production Services, Chapter 4: Mitigation and Recovery, pp. 31-32.
  24. Examples of roles at PagerDuty [7][8]

    [7] PagerDuty Incident Response Documentation, Different Roles [8] Google SRE Workbook, Chapter 9: Incident Response
  25. Can we translate these practices into higher-level concepts?
  26. Chapter 3: 3T mental models and life cycles
  27. Examples of roles at PagerDuty
  28. Command role

    • Responsibility: manage incident response so that it stays aligned across the organization. • Understand ongoing operations. • Understand who is doing what. • Delegate sub-commander responsibility to others if necessary. • Make incident response tangible.
  29. Liaison role

    • Responsibility: smooth reporting and communication, both internal and external. • Make incident response transparent.
  30. Operation role

    • Responsibility: the actual technical activities to solve issues. • Focus on triage, analysis, mitigation, and recovery. • Communication with the rest of the organization is not a primary concern. • In many cases operators are the ones who introduced the root cause of an incident, but don't blame them. • Nobody wants to cause incidents. • All participants focus on their assigned roles based on a chain of trust.
  31. 3T mental models for incident response

    The incident response roles embody the 3T mental models. • Transparency: keep information about incident responses reachable for everybody. • Tangibility: manage the status of incidents; manage who handles what. • Trust: believe everybody makes their best effort during incidents; don't blame anybody, because nobody wants to cause incidents.
  32. High level view of incident management cycles
  33. High level view of incident management cycles
  34. High level view of incident management cycles

    Examples: • Incident management policy • Documentation • Reporting mechanism • Observability • Alerting policy • Incident response training
  35. High level view of incident management cycles

    Examples: • Alerting • Triage • Root-cause analysis • Escalations • Opening war rooms
  36. High level view of incident management cycles

    Examples: • Rollback deployment (mitigation) • Kill slow queries (mitigation) • Fix bug (recovery) • Add index to tables (recovery)
  37. High level view of incident management cycles

    Examples: • Additional triage • Preparation for postmortems • Postmortems • Handle action items raised at postmortems
  38. Chapter 4: How can we improve incident management?
  39. Where should we invest our time?
  40. Where should we invest our time?
  41. Key times of incident response
  42. Key times of incident response

    • Time to detect (TTD) • Time to engagement (TTE) • Time to fix (TTF) • Time to repair/recovery (TTR) • Time between failures (TBF)
  43. Time to detect (TTD)
  44. Time to engagement (TTE)
  45. Time to fix (TTF)
  46. Time to recovery (TTR)
  47. Time between failures (TBF)
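
The slides that define these intervals are images, so the transcript does not spell out where each interval starts and ends. As a rough sketch, assuming each incident record carries five timestamps (the field names and the interval boundaries below are hypothetical, chosen for illustration and not taken from the talk), the per-incident times could be derived like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical field names; adapt them to whatever your tooling records.
    started: datetime    # failure begins
    detected: datetime   # alert fires or someone notices
    engaged: datetime    # a responder starts working on it
    fixed: datetime      # mitigation or fix is applied
    recovered: datetime  # service is confirmed healthy again


def key_times(incident: Incident) -> dict[str, timedelta]:
    """Split one incident into the intervals named on the slides.

    Assumed boundaries: TTD = start -> detection, TTE = detection ->
    engagement, TTF = engagement -> fix, TTR = start -> recovery.
    """
    return {
        "TTD": incident.detected - incident.started,
        "TTE": incident.engaged - incident.detected,
        "TTF": incident.fixed - incident.engaged,
        "TTR": incident.recovered - incident.started,
    }


def time_between_failures(previous: Incident, current: Incident) -> timedelta:
    """TBF, assumed here to run from the previous recovery to the next failure."""
    return current.started - previous.recovered
```

Recording the five timestamps per incident is the only prerequisite; everything else is subtraction, which makes these numbers cheap to collect even for a small team.
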
  48. Which time do we want to improve?
  49. TTD at Launchable

    Current status • We already have several detection mechanisms using Datadog and Sentry. Solution • Introducing SLOs and error budgets makes our alerting criteria clearer. • But don't forget the "law of diminishing returns" when making decisions.
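
To make the SLO and error-budget point concrete, here is a minimal sketch assuming a request-based availability SLO. The function names, the 50% alert threshold, and any numbers you feed in are illustration values, not Launchable's configuration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window under the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget; any failure exhausts it.
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget


def should_alert(slo_target: float, total_requests: int, failed_requests: int,
                 threshold: float = 0.5) -> bool:
    """Alert once a chosen share of the budget is gone (threshold is an assumption)."""
    return budget_consumed(slo_target, total_requests, failed_requests) >= threshold
```

With a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so an alert tied to budget consumption replaces ad hoc thresholds with a criterion the whole team can reason about.
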
  50. TTE at Launchable

    Current status • Easy enough to notice during office hours in Slack channels. • We don't have on-call rotations at the moment, which makes TTE uncontrollable. Solution • Apply a follow-the-sun strategy to cover a wide range of hours. • Introduce on-call rotations and a pager. • But we don't feel it's necessary now.
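
A follow-the-sun strategy can be reasoned about with a small coverage check. This is a sketch under stated assumptions: the two regions, their time zones, and the 9:00 to 18:00 weekday office hours are hypothetical (the talk only says that employees live in Japan and the US).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical regions; the talk only says employees live in Japan and the US.
REGIONS = {
    "japan": ZoneInfo("Asia/Tokyo"),
    "us-west": ZoneInfo("America/Los_Angeles"),
}
WORK_START, WORK_END = 9, 18  # assumed local office hours


def regions_on_duty(now_utc: datetime) -> list[str]:
    """Return the regions whose local time falls inside weekday office hours."""
    on_duty = []
    for name, tz in REGIONS.items():
        local = now_utc.astimezone(tz)
        if local.weekday() < 5 and WORK_START <= local.hour < WORK_END:
            on_duty.append(name)
    return on_duty
```

Hours for which the list comes back empty are the gaps that on-call rotations or a pager would have to cover.
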
  51. TTF at Launchable

    Current status • We don't have enough observability mechanisms. • It depends on each developer's debugging skill. • During this window, developers cannot spend time on product development. Solution • Introduce more team-shared observability dashboards. • Introduce more observability mechanisms to drill down to root causes.
  52. Which time do we want to improve?

    TTF improvement brings us high returns with small effort.
  53. Which do we optimize? MTTR vs MTBF

    • Short MTTR and long MTBF is the best. • Short MTTR but short MTBF = incidents occur frequently but are recovered quickly. • Long MTTR but long MTBF = incidents don't occur frequently, but once they occur, they aren't recovered soon.
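
The slide compares the two profiles qualitatively. The sketch below puts numbers on them using the textbook steady-state availability formula MTBF / (MTBF + MTTR), which is not on the slide; the helper names are illustrative.

```python
from datetime import timedelta


def mean(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


def availability(mtbf: timedelta, mttr: timedelta) -> float:
    """Steady-state availability, MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)
```

For example, with made-up figures, a service that fails once a day and recovers in 10 minutes and a service that fails once every 30 days and takes 5 hours to recover both come out at roughly 99.3% availability. The formula alone does not pick a winner, which is why the choice between the two profiles is a strategy decision.
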
  54. Startups should focus on MTTR improvement

    • There is no evolution without high-cadence iterations at startups. • TTD and TTE are difficult for us to improve. • Reducing TTF results in reducing MTTR.
  55. Do we have other times we haven't articulated?
  56. Hidden key times of incident life cycles
  57. Hidden key times of incident life cycles
  58. Hidden key times of incident life cycles
  59. Hidden key times of incident life cycles
  60. Hidden key times of incident life cycles
  61. Hidden key times of incident life cycles

    Don't underestimate the time we spend on post-incident activities. • Time to (additional) triage (TTT) • Time to learn (TTL) • Time to improvement (TTI) • Time to preparation (TTP)
  62. Power question: Which process do you hate?
  63. Which process do you hate?

    I personally don't want to spend time on the following processes. • Additional triage to dig into root causes. • Preparation for learning (!= I don't like joining postmortem sessions). • Maintenance of incident management processes.
  64. Why additional triage?

    • Startups don't have enough observability mechanisms. • We sometimes cannot find root causes (this is acceptable). • We tend to spend a lot of time here in that situation.
  65. Why preparation for learning?

    • Preparation for team-wide learning sessions takes time. • Documenting for postmortems. • Copy & paste dances to create a timeline from events scattered across various places. • The timeline needs to consider time zones. • There is a gravity which prevents people from announcing incidents casually. • For startups, the most important activity is learning as a team. • If TTL is long, people cannot announce incidents casually. • As a result, postmortems ruin short MTTR with high-cadence learning iterations.
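
The timeline work described above (collecting events from several places and presenting them across time zones) is mechanical enough to script. A minimal sketch, assuming events are already available as timezone-aware records; the `Event` type and the source labels are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


@dataclass(frozen=True)
class Event:
    at: datetime   # must be timezone-aware
    source: str    # e.g. "slack", "datadog" (labels are illustrative)
    text: str


def merge_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several places into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


def render(events: list[Event], zones: list[str]) -> str:
    """Render each event once, with its time shown in every requested zone."""
    lines = []
    for event in events:
        stamps = " / ".join(
            event.at.astimezone(ZoneInfo(zone)).strftime("%Y-%m-%d %H:%M %Z")
            for zone in zones
        )
        lines.append(f"{stamps} [{event.source}] {event.text}")
    return "\n".join(lines)
```

Storing aware timestamps and converting only at render time avoids the manual time-zone arithmetic, so a reader in Japan and a reader in the US see the same timeline in their own local time.
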
  66. Why maintenance of incident management processes?

    • Maintenance of incident management processes includes: updating the incident management policy; improving incident management structures; updating documents; training people to align with the updates. • Characteristically, incidents don't occur frequently. • It is too tough for everybody to memorize incident response processes. • In urgent situations, people don't read documents.
  67. Can we reduce TTI?

    • It depends on the action items coming from postmortems. • No team can handle all the action items discussed during postmortems. • Common anti-pattern: people create too many action items and assign them without priorities.
  68. Approach for unbalanced action items

    • Think of engineering members' capacity. • Prioritize and classify the work [9][10]. [9] Postmortem Action Items: Plan the Work and Work the Plan, USENIX SRECon 2017 [10] Anatomy of an Incident, Chapter 5
  69. Importance / Size / Urgency (ISU) Matrix

    • The assignee's confidence is also valuable to declare.
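
The matrix itself appears only as an image, so its scales and ordering rule are not in the transcript. The sketch below is one possible reading, with assumed 1 to 3 scales for importance and urgency, size in days, and a greedy capacity cut-off; none of these specifics come from the talk.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    importance: int  # 1 (low) .. 3 (high), assumed scale
    size: int        # estimated effort in days, assumed unit
    urgency: int     # 1 (low) .. 3 (high), assumed scale
    confidence: float = 1.0  # assignee's declared confidence; recorded, not used for ordering


def prioritize(items: list[ActionItem]) -> list[ActionItem]:
    """Order by importance, then urgency (both descending), then smaller size first."""
    return sorted(items, key=lambda i: (-i.importance, -i.urgency, i.size))


def plan(items: list[ActionItem], capacity_days: int) -> list[ActionItem]:
    """Take items in priority order, skipping any that no longer fit the capacity."""
    chosen, remaining = [], capacity_days
    for item in prioritize(items):
        if item.size <= remaining:
            chosen.append(item)
            remaining -= item.size
    return chosen
```

Items that do not fit stay visible in the prioritized list rather than being silently assigned, which addresses the anti-pattern of creating many action items without priorities.
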
  70. ISU Matrix on GitHub Projects
  71. My focus is reduction of TTT, TTL, and TTP
  72. Chapter 5: Choosing the right strategies and tools
  73. Phases and the number of software engineers

    Note: the numbers are estimated by the presenter based on previous experience. • Phase 0: Founding ~ 4 software engineers • Phase 1: 5 ~ 10 software engineers • Phase 2: 11 ~ software engineers
  74. Let's reframe the original question again

    • Reframe "Does a startup need incident management?" • At startups, how can we: • Build an incident management structure enforcing the 3T mental models? • Improve the "times" of the incident management life cycle?
  75. Evolution of incident management at Launchable

    Table columns: Improvement target / Actions from phase 0 to 1 / Actions from phase 1 to 2
    • Transparency: Encourage push communication; Encourage pull communication; Create war rooms; Share status pages
    • Tangibility: Automate parts of incident response flow; Automate entire incident response flow; Introduce incident lead role
    • Trust: Introduce blameless culture; Split lead and operation roles for complex incidents
    • Time to Engagement (TTE): Automate incident announcements; Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
    • Time to Fix (TTF): Introduce observability; Improve observability
    • Time to Triage (TTT): Introduce observability; Improve observability
    • Time to Learn (TTL): Introduce postmortem template; Generate postmortem
    • Time to Preparation (TTP): Create incident management policies; Enforce incident management policies; Self-service incident response training
  76. Phase 0: Founding ~ 4 software engineers
  77. No strategy

    • The product does not have customers. • We don't need incident responses. • Build an incident management structure based on product growth. • All members do everything if necessary.
  78. Incident management system
  79. Phase 1: 5 ~ 10 software engineers
  80. Environmental changes from phase 0 to 1

    • When products have customers, we need incident management. • The more software engineers join, the more incidents happen.
  81. Strategy

    Make everything simple and easy to follow.
  82. Incident management changes from phase 0 to 1

    Table columns: Improvement target / Actions from phase 0 to 1 / Actions from phase 1 to 2
    • Transparency: Encourage push communication; Encourage pull communication; Create war rooms; Share status pages
    • Tangibility: Automate parts of incident response flow; Automate entire incident response flow; Introduce incident lead role
    • Trust: Introduce blameless culture; Split lead and operation roles for complex incidents
    • Time to Engagement (TTE): Automate incident announcements; Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
    • Time to Fix (TTF): Introduce observability; Improve observability
    • Time to Triage (TTT): Introduce observability; Improve observability
    • Time to Learn (TTL): Introduce postmortem template; Generate postmortem
    • Time to Preparation (TTP): Create incident management policies; Enforce incident management policies; Self-service incident response training
  83. Incident management system (phase 0 to 1)
  84. Incident management system (phase 0 to 1)
  85. Foundation of incident management policies

    • We maintain policies on Confluence.
  86. Foundation of incident management policies

    • We maintain policies on Confluence.
  87. Automation of incident escalations

    • We escalate incidents using Slack Workflow. • We handle incidents in a Slack channel and Google Meet.
  88. Automation of incident escalations

    • We escalate incidents using Slack Workflow. • We handle incidents in a Slack channel and Google Meet.
  89. Automation of incident escalations

    • We escalate incidents using Slack Workflow. • We handle incidents in a Slack channel and Google Meet.
  90. Introduction of postmortem

    • We keep all postmortems on Confluence. • We create a new postmortem page using a Confluence template feature.
  91. Introduction of postmortem

    • We keep all postmortems on Confluence. • We create a new postmortem page using a Confluence template feature.
  92. Very simple and easy to follow
  93. Postmortems as strong fact-based data

    • Even if we cannot resolve root causes, we can use the postmortems as data.
  94. Can we improve the incident management?
  95. e.g. Is this what humans should take care of?
  96. e.g. Is this what humans should take care of?

    • We don't have a solid policy, but policy does not scale. • Employees live in Japan and the US. • Information shared only on Slack is easy to miss.
  97. e.g. Do we need roles?
  98. Phase 2: 11 ~ ?? software engineers
  99. Environmental changes from phase 1 to 2

    • Our products have more customers. • The more software engineers join, the more incidents happen. • The increase in employees and time-zone gaps makes sync and push-style communication tough. • In the first place, Launchable encourages async and written communication.
  100. Strategy

    • Enforce incident management policies by software, not by documents. • Involve appropriate people based on pull-style communication. • Use the current tool chains in the company. • Too many new tools degrade teams' performance. • Use Slack as the interactive communication place to keep flow info. • Use Confluence to keep stock info (non-urgent communication).
  101. Incident management system (phase 1 to 2)
  102. Incident management system (phase 1 to 2)
  103. Incident management system (phase 1 to 2)
  104. SaaS: incident.io

    • https://incident.io/
  105. SaaS: Grafana Incident

    • https://go2.grafana.com/incident-beta-interest.html
  106. OSS: monzo/response

    OSS version of incident.io • https://github.com/monzo/response • https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents/
  107. In-house tool: Slack App + Web App

    • It's not so difficult to implement a Slack App and a Web App for this purpose. • But... I want to use my time for other stuff.
  108. SaaS vs OSS vs in-house tool

    • We want to maximize developers' disposable time for product development. • We don't want to increase cognitive load. • OSS and in-house tools need code and document maintenance. • OSS and in-house tools need evangelism for this type of tool. • Use SaaS if money allows (Buy, Not Build). • Salaries for software engineers are way more expensive than SaaS costs. • SaaS vendors improve their features as their business. • SaaS vendors maintain documents as product features.
  109. incident.io covers wide improvement targets

    Table columns: Improvement target / Actions from phase 0 to 1 / Actions from phase 1 to 2
    • Transparency: Encourage push communication; Encourage pull communication; Create war rooms; Share status pages
    • Tangibility: Automate parts of incident response flow; Automate entire incident response flow; Introduce incident lead role
    • Trust: Introduce blameless culture; Split lead and operation roles for complex incidents
    • Time to Engagement (TTE): Automate incident announcements; Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
    • Time to Fix (TTF): Introduce observability; Improve observability
    • Time to Triage (TTT): Introduce observability; Improve observability
    • Time to Learn (TTL): Introduce postmortem template; Generate postmortem
    • Time to Preparation (TTP): Create incident management policies; Enforce incident management policies; Self-service incident response training
  110. Central channel for all incidents

    incident.io can share all incidents in the specified Slack channel.
  111. Dedicated war rooms (Slack channel)

    incident.io handles all the tasks we want to complete when initializing an incident response.
  112. Dedicated war rooms (Slack channel)

    incident.io can assist incident responses.
  113. Dedicated Slack channel (closing incident)

    At the end of an incident response, incident.io tells us what needs to be done next.
  114. Status updates at central channel

    incident.io automatically syncs the latest status of incidents to the central channel.
  115. Postmortem generation

    incident.io can collect timelines from war rooms and generate postmortems.
  116. Postmortem generation

    We can generate a postmortem document using incident.io.
  117. Postmortem generation

    We can collect timelines from dedicated Slack channels.
  118. Self-training mode

    incident.io has a mode to walk through dummy incident responses on Slack.
  119. Introduction of lead role

    • We need communication leads when incidents are complex. • However, for most incidents, a single person can be responsible for operations and communications. • So adding only a lead role is prudent, so that we don't make incident management overly complex.
  120. We have more room to improve!
  121. Recap

    • Incident management has a life cycle: Preparation -> Detection -> Recovery -> Post-incident actions -> Preparation. • Incident response roles and structures exist to embody 3T: Transparency, Tangibility, Trust. • Choosing strategies and tools makes incident management at startups sensible.
  122. Incident management

    1. Atlassian: Understanding incident response roles and responsibilities, https://www.atlassian.com/incident-management/incident-response/roles-responsibilities
    2. PagerDuty Incident Response Training, https://response.pagerduty.com/training/overview/
    3. Anatomy of an Incident, Ayelet Sachto, Adrienne Walcer, and Jessie Yang, 2022.
    4. US Federal Emergency Management Agency, Emergency Management Institute ICS Resource Center, https://training.fema.gov/emiweb/is/icsresource/
    5. The National Institute of Standards and Technology SP 800-61, Computer Security Incident Handling Guide, http://dx.doi.org/10.6028/NIST.SP.800-61r2
    6. Introduction: Incident Response overview, Gov UK National Cyber Security Centre, https://www.ncsc.gov.uk/collection/incident-management/incident-response
    7. Incident Review and Postmortem Best Practices, https://newsletter.pragmaticengineer.com/p/incident-review-best-practices
    8. Incident Review Practices [The Pragmatic Engineer Newsletter], https://docs.google.com/spreadsheets/d/1GPINipdf-l2H05iKOUbpkrqwlZ61ZCJDnwY5iE8LtRM/edit#gid=0
  123. SRE

    1. Google SRE book Chapter 14 - Managing Incidents, https://sre.google/sre-book/managing-incidents/
    2. Postmortem Action Items: Plan the Work and Work the Plan, Sue Lueder and Betsy Beyer (Google), USENIX SRECon 2017, https://www.usenix.org/conference/srecon17americas/program/presentation/lueder
    3. Google SRE book Chapter 15 - Postmortem Culture: Learning from Failure, https://sre.google/sre-book/postmortem-culture/
    4. Postmortem Metadata Index, https://postmortems.app/
    5. The Art of SLOs, Google Site Reliability Engineering, https://sre.google/resources/practices-and-processes/art-of-slos/
    6. danluu/post-mortems: A collection of postmortems, https://github.com/danluu/post-mortems
    7. Great Incident Review Examples, The Pragmatic Engineer, https://blog.pragmaticengineer.com/postmortem-best-practices/#great-incident-review-examples
  124. DevOps performance metrics

    1. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, 2018.
    2. GoogleCloudPlatform/fourkeys, https://github.com/GoogleCloudPlatform/fourkeys
    3. Are you an Elite DevOps performer? Find out with the Four Keys Project, Google Cloud, https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
    4. DORA DevOps Quick Check, https://www.devops-research.com/quickcheck.html
  125. SaaS and OSS

    1. Datadog, https://www.datadoghq.com/blog/incident-response-with-datadog/
    2. incident.io, https://incident.io/
    3. jeli, https://www.jeli.io/
    4. monzo/response, https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents
    5. Etsy/morgue, https://github.com/etsy/morgue