Observability — Extending Into Incident Response
Narimichi Takamura
October 27, 2025
Technology
Slides from my talk at Observability Conference Tokyo 2025.
https://o11ycon.jp/
Transcript
None
2
Topotal, Inc.
• https://topotal.com
• A startup centered on SRE
• Runs two businesses
  • SRE as a Service
  • SaaS for SRE (Waroom)
• Sponsor of this event
• We are demoing the SaaS at our booth, so please stop by!
3
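SRE as a Service
• https://sre-as-a-service.com
• A technical support service specializing in SRE
• Examples of support
  • Introducing SLIs/SLOs and improving how they are operated
  • Designing and implementing observability
  • Improving incident management
4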
Waroom
• https://waroom.com
• A SaaS for organized incident response
• Automates and reduces the effort of Slack-based response
5
6
7
8
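Session overview
• Improved incident response (IR) is often cited as an example of what observability (o11y) delivers.
• It feels as if things improve, but showing the effect quantitatively is hard.
=> As a builder of an IR SaaS and as an SRE, I will talk about practices for improving IR quantitatively (practical TTX metrics).
=> Toward the end, I will go into the theme of raising the observability of the IR process itself (not of the software).
9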
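Who this talk is for
• People interested in practices for quantitatively showing the effect of o11y improvements
• People interested in visualizing IR
• People interested in "extending o11y into the IR domain"
10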
Agenda
1. Motivation
2. The problem with MTTR
3. Defining practical TTX metrics
4. Putting TTX metrics to use
5. Applying o11y to the incident response domain
11
1. Motivation
12
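Question: is this hypothesis really true?
1. We improve the system's observability
2. We become able to infer and grasp the internal state of a complex system
3. When a failure occurs, the cause is identified faster and recovery time gets shorter ← this one
13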
There are only two things we want to know
• Where: what improved
• How much: by how much it improved
14
We want to express quantitatively how much observability improved incident response
15
It should help shorten recovery time
→ So can't we just measure MTTR?
16
2. The problem with MTTR
17
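What is MTTR (mean time to recovery)?
• The average time from when a failure occurs until it is repaired or restored
• Short for Mean Time To Recovery (Repair, Resolve, Restore)
• How it is calculated [1]
  • MTTR = total repair time / number of failures
[1] "What is MTTR (mean time to recovery)? How to calculate it and how it relates to MTBF, failures, and uptime" (title translated from Japanese)
18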
19
SREs should move away from defaulting to the assumption that MTTX can be useful.
20
Testing whether MTTR is a useful metric
• Hypothesis
  • If MTTR is a useful metric, then shortening TTR should shorten MTTR as well
21
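Testing whether MTTR is a useful metric
1. Randomly split the incident dataset [2] in two
2. Reduce the time to repair (TTR) in one of the two datasets by 10%
3. Compute the MTTR (mean time to repair) of each dataset
4. Take the difference in MTTR between the datasets
  • diff = MTTR(unmodified) - MTTR(modified)
5. Compute the MTTR reduction ratio (%)
  • = diff / MTTR(unmodified)
6. Repeat steps 1 to 4 100,000 times
[2] Unveiling the black box with observability stack
22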
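The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incident durations are synthetic log-normal values rather than the dataset used in the talk, the function names are hypothetical, and the trial count defaults to 10,000 instead of 100,000 to keep the run short.

```python
import random
from statistics import mean


def simulate_mttr_improvement(ttrs, reduction=0.10, trials=10_000, seed=0):
    """Return the MTTR reduction ratio observed in each trial."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        # Step 1: split the dataset in two at random.
        shuffled = ttrs[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        unmodified = shuffled[:half]
        # Step 2: shorten every TTR in the other half by `reduction`.
        modified = [t * (1 - reduction) for t in shuffled[half:]]
        # Steps 3-5: compare the two MTTRs.
        diff = mean(unmodified) - mean(modified)
        ratios.append(diff / mean(unmodified))
    return ratios


def synthetic_ttrs(n=200, seed=1):
    """Hypothetical heavy-tailed TTRs in minutes; not real incident data."""
    rng = random.Random(seed)
    return [rng.lognormvariate(3.0, 1.5) for _ in range(n)]


ratios = simulate_mttr_improvement(synthetic_ttrs())
share = sum(r >= 0.10 for r in ratios) / len(ratios)
print(f"trials showing >= 10% MTTR improvement: {share:.0%}")
```

Calling the function with `reduction=0.0` gives the "no improvement at all" baseline, which is a quick way to see how much MTTR moves from the split alone.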
23
Result: MTTR improves by 10% or more in 50 to 60% of trials
24
Why MTTR does not improve even when each incident does
• MTTR is vulnerable to skew in the data
• Meanwhile, incident data shows very large variation
25
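Characteristics of incident data [3]
• Most incidents are resolved fairly quickly
• A few turn into disastrous incidents
• → When the dataset is split at random, which side the disastrous incidents land on has a large effect on the computed MTTR
  • e.g. the apparent MTTR improvement changes drastically depending on which half receives an incident that takes 5,000 trillion hours to recover
[3] The VOID Report
26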
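A small worked example of that effect, with made-up durations in minutes. The median is printed only as a contrast of my own choosing; the slides do not propose it.

```python
from statistics import mean, median

# Hypothetical time-to-recover values in minutes.
fast = [10, 12, 15, 20, 25, 30, 35, 40]
disastrous = 3000  # one incident that drags on

# The long incident happens to land in the first half.
half_a = fast[:4] + [disastrous]
half_b = fast[4:]

print("half A: mean", mean(half_a), "median", median(half_a))
print("half B: mean", mean(half_b), "median", median(half_b))
```

Half A contains the shorter routine incidents, yet its mean is many times larger than half B's purely because of where the single long incident was placed.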
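For reference: the result of running the simulation without changing the repair times
→ Whether or not any improvement work was done, MTTR improves or worsens depending on the dataset
27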
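What "Incident Metrics in SRE" argues
• What the simulation shows
  • Incident durations vary so much that improvements are hard to see in MTTR
  • e.g. "MTTR improved 10% year over year!" may only mean there were fewer prolonged incidents
  • Note: it would be meaningful if incidents of exactly the same volume and recovery time occurred every year (impossible)
• Conclusion
  • MTTR is not useful as a metric for evaluating improvement
  • Because MTTR is vulnerable to skew and incident data varies wildly
28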
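So what was the problem?
Each element on its own is fine
• Incident durations are highly variable
• MTTR is used as a metric of some kind
• The results of improvement are checked against a metric
→ The problem is that the purpose and the metric do not fit together
29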
The flow of data analysis (hypothesis-testing style)
30
The flow of reasoning when MTTR is the metric
31
What was happening: an inconsistency in the hypothesis-testing logic
32
Solution: make clear what is being improved, and keep variability down
33
Solution: make clear what is being improved, and keep variability down
34
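Summary so far
• MTTR (recovery time) data is highly variable, so it is unsuitable as an improvement metric
• Clarifying what is being improved and using finer-grained TTX metrics makes it possible to keep variability down
→ This creates a need for metrics finer-grained than TTR
35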
3. Practical TTX metrics
36
What Waroom considers a practical metric
• Comprehensive
• Fine-grained
• Realistic to collect
37
Which TTX metrics should we collect?
38
39
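The challenge with TTX metrics
• There are several examples out there, but the definitions are not unified
• Trying to combine those examples still leaves overlaps and gaps
• → Aim for fine-grained, comprehensive definitions based on well-known literature
40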
Steps for defining TTX metrics
1. Learn the best practices
2. Define incident statuses
3. Define incident milestones (the boundaries between statuses)
4. Define the TTX metrics
41
Learn the best practices
42
Roughly define the incident statuses
43
44
45
Turn the milestones into TTX metrics
46
47
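Collecting metrics is hard
• With fine-grained metrics, a timestamp has to be recorded every time a milestone is reached
• Having people punch in timestamps one by one during response is unrealistic
• → Waroom collects them automatically with a Slack bot
48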
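Example of automatic collection triggered by events during response

Milestone                          Event during response
Detected                           Alert firing notification
Acknowledged                       Channel created, incident ticket filed
Identified (fix identified)        Runbook phase split (Precheck and Resolution)
Recovered                          AI judges from the Slack conversation
49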
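Once a timestamp exists for each milestone, every TTX value is the gap between two consecutive milestones. A minimal sketch, assuming the four milestone names above and made-up timestamps; the function name and key format are hypothetical and this is not Waroom's implementation.

```python
from datetime import datetime

# Milestone order taken from the slide; timestamps below are invented.
MILESTONES = ["Detected", "Acknowledged", "Identified", "Recovered"]


def ttx_minutes(timestamps):
    """Minutes elapsed between each pair of consecutive milestones."""
    result = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        delta = timestamps[end] - timestamps[start]
        result[f"{start}->{end}"] = delta.total_seconds() / 60
    return result


incident = {
    "Detected": datetime(2025, 10, 27, 10, 0),
    "Acknowledged": datetime(2025, 10, 27, 10, 4),
    "Identified": datetime(2025, 10, 27, 10, 30),
    "Recovered": datetime(2025, 10, 27, 10, 45),
}
print(ttx_minutes(incident))
```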
4. Putting TTX metrics to use
50
To use metrics effectively, match the purpose of the analysis to the characteristics of the metric
51
52
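Examples of metrics and improvement measures

TTX                              Issue                                        Improvement measure
TTDetect (detection)             It takes time from occurrence to detection   Improve monitoring
TTEngage (team formation)        It takes time to assemble the response team  Clarify shift assignments, introduce on-call
TTInvestigate (investigation)    It takes time to isolate the failure         Maintain runbooks and dashboards
TTFix (repair)                   It takes time to repair the failure          Make rollbacks faster
53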
54
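Using vague hypotheses as a starting point to find issues

Hypothesis                                                                   Example of a newly found issue
The environment is shared, so each TTX should be roughly constant            Performance differs by service and team
across the organization
Each TTX should be close to its expected value                               (In fact) starting work is slow across the board;
(e.g. TTA within about 10 minutes)                                           identifying a fix is slow across the board
55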
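Checking hypotheses like these mostly means grouping a TTX value by team or service and comparing it with the expected value. A minimal sketch with invented records; the team names, the 10-minute threshold taken from the TTA example, and the use of the median are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

EXPECTED_TTA_MIN = 10  # expected value from the hypothesis, in minutes

# (team, TTA in minutes); hypothetical data.
records = [
    ("payments", 4), ("payments", 7),
    ("search", 18), ("search", 25), ("search", 9),
]


def tta_by_team(rows):
    """Median TTA per team."""
    grouped = defaultdict(list)
    for team, tta in rows:
        grouped[team].append(tta)
    return {team: median(values) for team, values in grouped.items()}


summary = tta_by_team(records)
slow = sorted(team for team, m in summary.items() if m > EXPECTED_TTA_MIN)
print(summary, "over expectation:", slow)
```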
56
57
5. Applying o11y to incident response
58
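Applying o11y to IR [2]
• Raise the observability of the internal structure of incident response even further
• With the TTX definitions, metrics are already more or less in place
• Could we improve further by applying the practices of metrics, logs, and traces?
[2] Unveiling the black box with observability stack
59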
Metrics
60
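A faint sense that this is "one-sided"
• The TTX metrics introduced so far all merely decompose TTR
• In other words, they focus only on shortening system recovery time
• From an SRE viewpoint, service reliability matters
  • e.g. did we learn anything, and is recurrence prevented?
• From a product-operations viewpoint, customer reliability matters
  • e.g. is customer communication handled well enough?
=> Raising observability requires response-process metrics from more angles
61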
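What happens in parallel with system recovery
• Explaining to customers and sharing what happened
• Reporting on and analyzing the incident
• Considering and carrying out root-cause countermeasures
=> Currently, observing these activities is out of scope
62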
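Extending TTX metrics: widening what is observed
Extend the observed range to the whole of incident response and define metrics that serve as improvement indicators

Metric name                      Purpose
Incident Response Metrics        Identify issues in and track improvement of the recovery work itself
Customer Reliability Metrics     Identify issues in and track improvement of customer communication
Learning Metrics                 Track the activities up to the point where the organization has learned from the incident
Improvement Metrics              Analyze how far root-cause countermeasures have been carried out

=> This talk shows an example of Customer Reliability Metrics
63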
64
Log
65
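Recording events during response
• Collection
  • Turn who did what, when, with which command, and on which decision into structured logs
  • e.g. chat, status changes, events forwarded from external tools
• Example uses
  • Automatically generate timelines and status pages
66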
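One way to picture such a structured log is one JSON object per response event, from which a timeline can be rendered. The sketch below is an assumption-laden illustration: the `ResponseEvent` fields, the `kind` values, and both helper functions are hypothetical and do not describe Waroom's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ResponseEvent:
    incident_id: str
    actor: str   # who
    kind: str    # e.g. "chat", "status_change", "external_tool"
    detail: str  # which command or decision
    at: str      # when, as an ISO 8601 UTC timestamp


def to_log_line(event):
    """Serialize one event as a single JSON log line."""
    return json.dumps(asdict(event), ensure_ascii=False, sort_keys=True)


def timeline(events):
    """Render events in chronological order (same-format UTC strings sort correctly)."""
    ordered = sorted(events, key=lambda e: e.at)
    return [f"{e.at} {e.actor}: {e.detail}" for e in ordered]


events = [
    ResponseEvent("INC-1", "bob", "status_change", "status -> Identified", "2025-10-27T10:30:00Z"),
    ResponseEvent("INC-1", "alice", "chat", "rollback looks safe", "2025-10-27T10:12:00Z"),
]
for line in timeline(events):
    print(line)
print(to_log_line(events[0]))
```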
Preparation is under way behind the scenes at Waroom...
67
Trace
68
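Observing the flow and dependencies of the response process
• Collection
  • Create one span per incident status
  • Manage everything from detection to recovery as a single trace
  • Break spans down further per action and merge them in
• Example uses
  • Visualize what was done between status transitions and how long it took
  • Identify the step that became the bottleneck of the response
69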
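The "one span per status, one trace per incident" idea can be sketched without any tracing library: each status transition closes the previous span and opens the next one. The `Span` class, the function names, and the timestamps are hypothetical; a real setup would more likely emit spans through a tracing SDK.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Span:
    name: str        # the status the incident was in
    start: datetime
    end: datetime

    @property
    def minutes(self):
        return (self.end - self.start).total_seconds() / 60


def spans_from_transitions(transitions):
    """Build spans from (status, entered_at) pairs sorted by time.

    The last entry only closes the trace and gets no span of its own.
    """
    pairs = zip(transitions, transitions[1:])
    return [Span(status, start, nxt) for (status, start), (_, nxt) in pairs]


def bottleneck(spans):
    """Return the span that took the longest."""
    return max(spans, key=lambda s: s.minutes)


trace = spans_from_transitions([
    ("Detected", datetime(2025, 10, 27, 10, 0)),
    ("Acknowledged", datetime(2025, 10, 27, 10, 4)),
    ("Identified", datetime(2025, 10, 27, 10, 30)),
    ("Recovered", datetime(2025, 10, 27, 10, 45)),
])
slowest = bottleneck(trace)
print(slowest.name, slowest.minutes)
```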
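How to capture events when the work spans many tools
• Recovery work often cuts across a range of tools
  • e.g. PagerDuty → Slack → Datadog → AWS → GitHub...
• Currently, only the responders know what was done for a single incident
• Having responders save MELT by hand is unrealistic
→ In a world where response is AI-based, far more information becomes obtainable!
70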
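AI-based incident response
• Based on what it is told in natural language, the AI carries out various operations while working with MCP servers and external tools
• → Actions always go through Waroom, so fine-grained events can be saved automatically
71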
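Summary
1. MTTR is not useful as an improvement metric
2. When using metrics, consistency from the purpose through to the data analysis is what matters
3. To keep variability down, make the question concrete and the metrics fine-grained
4. How TTX metrics are defined and how they are used
5. Bringing in o11y practices gets us closer to more comprehensive observation
72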
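In closing
• Building a mechanism that collects metrics automatically is hard
• On top of that, building a visualization platform is hard
• On top of that, slicing metrics by category or label is hard
• → Please make use of Waroom
• If this caught your interest, come to the Topotal booth!
73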
Thank you very much