Observability — Extending Into Incident Response
Narimichi Takamura
October 27, 2025
Slides from my talk at Observability Conference Tokyo 2025.
https://o11ycon.jp/
Transcript
None
2
Topotal, Inc.
• https://topotal.com
• A startup centered on SRE
• Operates two businesses
  • SRE as a Service
  • SaaS for SRE (Waroom)
• Sponsor of this event
• We are demoing the SaaS at our booth, so please stop by!
3
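SRE as a Service
• https://sre-as-a-service.com
• A technical support service specializing in SRE
• Examples of support
  • Introducing SLIs/SLOs and improving how they are operated
  • Designing and implementing observability
  • Improving incident management
4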
Waroom
• https://waroom.com
• A SaaS for responding to incidents as an organization
• Automates and reduces the effort of Slack-based response
5
6
7
8
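Session overview
• Improved incident response (IR) is often cited as an example of what observability (o11y) delivers
• It feels as if things have improved, but showing the effect quantitatively is hard
=> As the builder of an IR SaaS and as an SRE, I will talk about practices for improving IR quantitatively (practical TTX metrics).
=> Toward the end, I will dig into the theme of raising the observability of the IR process itself (rather than of software).
9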
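Who this talk is for
• People interested in practices for showing the improvement from o11y quantitatively
• People interested in visualizing IR
• People interested in "extending o11y into the IR domain"
10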
Agenda
1. Motivation
2. The problem with MTTR
3. Defining practical TTX metrics
4. Putting TTX metrics to use
5. Applying o11y to the incident response domain
11
1. Motivation
12
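Question: is that hypothesis really true?
1. Improve the observability of the system
2. It becomes possible to infer and grasp the internal state of a complex system
3. When a problem occurs, the cause is identified faster and recovery time gets shorter ← this one
13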
There are only two things we want to know
• Where: what improved
• How much: to what degree it improved
14
We want to express quantitatively how much observability has improved incident response
15
It should help shorten recovery time → so why not just measure MTTR?
16
2. The problem with MTTR
17
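What is MTTR (mean time to recovery)?
• The average time from when a failure occurs until it is repaired or recovered
• Short for Mean Time To Recovery (also Repair, Resolve, Restore)
• How it is calculated [1]
  • MTTR = total repair time / number of failures
[1] "What is MTTR (mean time to recovery)? How to calculate it, and its relationship with MTBF in terms of failure rate and availability"
18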
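As a minimal sketch of the formula on this slide, MTTR is the total repair time divided by the number of failures. The repair times below are made-up values for illustration, not data from the talk, and the function name `mttr` is likewise an assumption.

```python
# Hypothetical repair times in minutes; illustrative values only.
repair_minutes = [12, 18, 25, 30, 45]


def mttr(durations):
    """MTTR = total repair time / number of failures."""
    return sum(durations) / len(durations)


print(mttr(repair_minutes))  # 26.0
```

Because the result is a plain arithmetic mean, a single very long incident in the list can move it a lot.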
19
SREs should move away from defaulting to the assumption that MTTX can be useful.
20
Verifying the validity of MTTR
• Hypothesis
  • If MTTR is a valid metric, then shortening TTR should shorten MTTR as well
21
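Verifying the validity of MTTR
1. Randomly split the incident dataset [2] into two halves
2. Reduce the repair time (TTR) of one half by 10%
3. Calculate the MTTR (mean time to repair) of each half
4. Take the difference in MTTR between the two halves
  • diff = MTTR(unmodified) - MTTR(modified)
5. Calculate the MTTR reduction ratio (%)
  • reduction ratio = diff / MTTR(unmodified)
6. Repeat steps 1 to 4 100,000 times
[2] Unveiling the black box with observability stack
22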
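The procedure above can be sketched roughly as follows. This is not the speaker's code. The incident durations are synthetic log-normal samples standing in for a real dataset, the trial count is far below the 100,000 on the slide, and the function and parameter names are all assumptions.

```python
import random
from statistics import mean


def simulate(ttrs, reduction=0.10, trials=1000, seed=0):
    """Return one MTTR reduction ratio per trial."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        shuffled = ttrs[:]
        rng.shuffle(shuffled)  # step 1: random split into two halves
        half = len(shuffled) // 2
        unmodified = shuffled[:half]
        # step 2: shorten every TTR in the other half by `reduction`
        modified = [t * (1 - reduction) for t in shuffled[half:]]
        mttr_unmodified = mean(unmodified)  # step 3
        mttr_modified = mean(modified)
        diff = mttr_unmodified - mttr_modified  # step 4
        ratios.append(diff / mttr_unmodified)  # step 5
    return ratios


# Synthetic heavy-tailed durations (an assumption, not real incident data).
data_rng = random.Random(42)
ttrs = [data_rng.lognormvariate(3.0, 1.5) for _ in range(200)]

ratios = simulate(ttrs)
share = sum(r >= 0.10 for r in ratios) / len(ratios)
print(f"trials where MTTR improved by at least 10%: {share:.0%}")
```

With durations that are all identical, every trial reports exactly the 10% that was applied. The spread only appears once the durations vary widely, which is the point the following slides make.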
23
Result: MTTR improves by 10% or more in 50 to 60% of the trials
24
Why MTTR does not improve even when each incident is improved
• MTTR is sensitive to skew in the distribution
• Meanwhile, incident data varies wildly
25
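Characteristics of incident data [3]
• Most incidents are resolved fairly quickly
• A few turn into disastrous incidents
• → When the dataset is split at random, how the disastrous incidents happen to be distributed has a large effect on the calculated MTTR
  • e.g. whichever half receives an incident that takes 5,000 trillion hours to recover from, the apparent MTTR improvement changes drastically
[3] The VOID Report
26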
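Reference: result of running the simulation without changing the repair times
→ Whether or not any improvement work was done, MTTR improves or worsens depending on the dataset
27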
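What "Incident Metrics in SRE" argues
• What the simulation shows
  • Because incident durations vary so widely, improvements are hard to see in MTTR
  • e.g. "MTTR improved 10% year over year!" may only mean that there were fewer prolonged incidents
  • Note: MTTR would be meaningful if incidents of exactly the same volume and recovery time occurred every time (not realistic)
• Conclusion
  • MTTR is not useful as a metric for evaluating improvement
  • Because MTTR is sensitive to skew in the distribution, and incident data varies wildly
28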
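So what was the problem? Each element is fine on its own
• Incident durations being highly variable
• Using MTTR as some kind of metric
• Checking the results of improvement against a metric
→ The problem is that the purpose and the metric do not fit together
29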
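The flow of (hypothesis-testing) data analysis
30
The line of thinking when MTTR is used as the metric
31
What was happening: an inconsistency in the hypothesis-testing logic
32
Solution: make clear what is being improved, and reduce variability
33
Solution: make clear what is being improved, and reduce variability
34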
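Summary so far
• MTTR (recovery time) is unsuitable as an improvement metric because the data is highly variable
• By clarifying what is being improved and using finer-grained TTX metrics, variability can be reduced
→ This creates demand for metrics finer-grained than TTR
35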
3. Practical TTX metrics
36
What Waroom considers a practical metric
• Comprehensive
• Fine-grained
• Realistic to collect
37
Which TTX metrics should we collect?
38
39
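Challenges with TTX metrics
• There are several examples out there, but the definitions are not unified
• Trying to combine examples leads to overlaps and gaps
• → Aim for fine-grained, comprehensive definitions based on well-known literature
40
How the TTX metrics are defined
1. Learn the best practices
2. Define the incident statuses
3. Define the incident milestones (the boundaries between statuses)
4. Define the TTX metrics
41
Learn the best practices
42
Roughly define the incident statuses
43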
44
45
Turn the milestones into TTX metrics
46
47
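Collecting metrics is hard work
• With fine-grained metrics, a timestamp has to be recorded every time a milestone is passed
• Having people punch in timestamps one by one during a response is unrealistic
• → Waroom collects them automatically with a Slack bot
48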
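Example of automatic collection triggered by events during a response (milestone: triggering event)
• Detected: an alert notification fires
• Acknowledged: a channel is created and an incident is filed
• Identified (solution identified): the Runbook phase split (Precheck and Resolution)
• Recovered: AI judges from the conversation in Slack
49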
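Once a timestamp exists for each milestone listed above, durations between milestones fall out by subtraction. The sketch below is an illustration under assumptions: the timestamps are made up, and the `*_to_*` key names are hypothetical rather than the TTX names the talk defines.

```python
from datetime import datetime

# Milestone names follow the slide; everything else here is hypothetical.
MILESTONES = ["detected", "acknowledged", "identified", "recovered"]


def ttx_durations(timestamps):
    """Seconds elapsed between each pair of consecutive milestones."""
    durations = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        delta = timestamps[end] - timestamps[start]
        durations[f"{start}_to_{end}"] = delta.total_seconds()
    return durations


# Made-up timestamps for a single incident.
incident = {
    "detected": datetime(2025, 10, 27, 10, 0),
    "acknowledged": datetime(2025, 10, 27, 10, 4),
    "identified": datetime(2025, 10, 27, 10, 30),
    "recovered": datetime(2025, 10, 27, 10, 45),
}

print(ttx_durations(incident))
```

Each value covers a narrower slice of the response than the full time to recovery, which is what makes the per-phase numbers easier to tie to a specific improvement.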
4. Putting TTX metrics to use
50
To use metrics effectively, align the purpose of the analysis with the characteristics of the metrics
51
52
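Examples of metrics and improvement measures (TTX: problem → improvement measure)
• TTDetect (detection): it takes time from occurrence to detection → improve monitoring
• TTEngage (team assembly): it takes time to assemble the response team → clarify shifts and roles, introduce an on-call system
• TTInvestigate (investigation): it takes time to isolate the failure → prepare Runbooks and dashboards
• TTFix (repair): it takes time to repair the failure → speed up rollbacks
53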
54
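Start from a vague hypothesis and find problems in the data (hypothesis → example of a newly found problem)
• Because the environment is shared, each TTX should be uniform across the organization → performance differs by service and by team
• Each TTX should be close to what we expect (e.g. TTA within about 10 minutes) → in reality, getting started is slow across the board, and identifying a solution is slow across the board
55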
56
57
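5. Applying o11y to incident response
58
Applying o11y to IR [2]
• Further raise the observability of the internal structure of incident response
• Thanks to the TTX definitions, metrics are already more or less in place
• Could we improve things by applying the practices of metrics, logs, and traces?
[2] Unveiling the black box with observability stack
59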
Metrics 60
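A faint sense that something is missing
• The TTX metrics introduced so far all merely break TTR down
• In other words, the focus is only on shortening system recovery time
• From an SRE perspective, the reliability of the service matters
  • e.g. Was anything learned? Will recurrence be prevented?
• From a product operations perspective, customer reliability matters
  • e.g. Is customer communication being handled adequately?
=> Raising observability requires metrics that cover the response process from more angles
61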
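What goes on in parallel with system recovery
• Explaining to customers and sharing what happened
• Reporting and analyzing the incident
• Considering and implementing root-cause countermeasures
=> As things stand, these activities are outside the scope of observation
62
Applying TTX metrics: widening the scope of observation
Extend the scope of observation to the whole of incident response, and define metrics that serve as improvement indicators (metric name: purpose)
• Incident Response Metrics: identifying problems in, and tracking improvement of, the recovery work itself
• Customer Reliability Metrics: identifying problems in, and tracking improvement of, customer communication
• Learning Metrics: tracking the activities up to the point where the organization has learned from the incident
• Improvement Metrics: analyzing how far root-cause countermeasures have been implemented
=> This talk introduces an example of Customer Reliability Metrics
63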
64
Log 65
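Record events during the response
• Collection
  • Record who did what, when, with which command, and which decision was made, as structured logs
  • e.g. chat, status changes, events relayed from external tools
• Example uses
  • Automatically generate timelines and status pages
66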
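A structured log entry for one response event could look something like the sketch below. The field names (`incident_id`, `actor`, `kind`, `detail`) and the `log_event` helper are assumptions made for illustration; they are not Waroom's actual schema or API.

```python
import json
from datetime import datetime, timezone


def log_event(incident_id, actor, kind, detail, at=None):
    """Serialize one response event as a single JSON line."""
    record = {
        "timestamp": (at or datetime.now(timezone.utc)).isoformat(),
        "incident_id": incident_id,  # which incident the event belongs to
        "actor": actor,              # who acted
        "kind": kind,                # e.g. chat, status_change, tool_event
        "detail": detail,            # command run or decision made
    }
    return json.dumps(record, ensure_ascii=False)


# Hypothetical event: a responder acknowledges the incident.
line = log_event(
    "INC-1",
    "alice",
    "status_change",
    {"from": "detected", "to": "acknowledged"},
    at=datetime(2025, 10, 27, 1, 4, tzinfo=timezone.utc),
)
print(line)
```

Sorting such lines by `timestamp` gives the raw material for a timeline without anyone reconstructing it by hand.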
Preparation is under way behind the scenes at Waroom...
67
Trace 68
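Observe the flow and dependencies of the response process
• Collection
  • Create a span per incident status
  • Manage everything from detection to recovery as a single trace
  • Break it down per action and integrate
• Example uses
  • Visualize what was done between status transitions and how long it took
  • Identify the step that became the bottleneck of the response
69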
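The idea of one trace per incident with one span per status can be sketched with plain data classes, without assuming any particular tracing library. The `Span` type, the span names, and the timings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str     # incident status this span covers
    start: float  # seconds since detection
    end: float

    @property
    def duration(self):
        return self.end - self.start


def bottleneck(trace):
    """Return the span with the longest duration in the trace."""
    return max(trace, key=lambda span: span.duration)


# One trace from detection to recovery, with made-up timings.
trace = [
    Span("detected", 0, 240),
    Span("acknowledged", 240, 1800),
    Span("identified", 1800, 2700),
]

print(bottleneck(trace).name)  # acknowledged
```

Finer spans per action would nest under these status spans in the same way, at the cost of needing an event source for each action.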
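How do we capture events when the work cuts across tools?
• During recovery, responders often work across many different tools
  • e.g. PagerDuty → Slack → Datadog → AWS → GitHub...
• Today, only the responders know what was done for a single incident
• Having responders save MELT (metrics, events, logs, traces) by hand is unrealistic
→ In a world where response is AI-based, much more information can be captured!
70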
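AI-based incident response
• Based on what it is told in natural language, the AI carries out a variety of operations while working with MCP servers and external tools
• → Actions always go through Waroom, so fine-grained events can be saved automatically
71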
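Summary
1. MTTR is not useful as an improvement metric
2. When using metrics, consistency all the way from the purpose to the data analysis matters
3. To reduce variability, it matters to make the question concrete and the metrics finer-grained
4. How TTX metrics are defined and how they are used
5. Bringing in o11y practices gets us closer to more comprehensive observation
72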
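In closing
• Building a mechanism that collects metrics automatically is hard work
• On top of that, building a visualization platform is hard work
• On top of that, filtering metrics by category or label is hard work
• → Please consider using Waroom
• If this caught your interest, come to the Topotal booth!
73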
Thank you very much