Narimichi Takamura
January 26, 2025
Technology
Improving Incident Response using Incident Key Metrics
Slides from my talk at SRE Kaigi 2025. TTX metrics are the main topic.
https://2025.srekaigi.net/
Transcript
Topotal, Inc.
• https://topotal.com
• A startup centered on SRE
• Runs two businesses
  • SRE as a Service
  • SaaS for SRE (Waroom)
• Platinum sponsor of this event
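SRE as a Service
• topotal.com/services/sre-as-a-service
• A technical support service specialized in SRE
• Examples of support
  • Introducing SLIs/SLOs and improving how they are operated
  • Building and improving CI/CD
  • Improving incident management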
Waroom
• waroom.com
• A SaaS for handling incident response as an organization
• Automates and reduces the effort of Slack-based incident response
Building feedback for improvement
Agenda
1. The problem with MTTR
2. Defining practical TTX metrics
3. Examples of using TTX metrics
4. Advanced metrics
1. The problem with MTTR
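What is MTTR (mean time to recovery)?
• The average time from when a failure occurs until it is repaired or recovered
• Short for Mean Time To Recovery (or Repair, Resolve, Restore)
• How it is calculated [1]
  • MTTR = total repair time / total number of failures
• It is also one of the Four Keys metrics
[1] Source: "What is MTTR (mean time to recovery)? How to calculate it and how it relates to MTBF in terms of failure and uptime"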
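As a small illustration of the formula MTTR = total repair time / total number of failures, here is a minimal Python sketch. The function name `mttr` and the four repair durations are made up for the example and do not come from the talk.

```python
from datetime import timedelta

def mttr(repair_times):
    """Mean time to recovery: total repair time divided by the number of failures."""
    if not repair_times:
        raise ValueError("need at least one incident")
    return sum(repair_times, timedelta()) / len(repair_times)

# Hypothetical repair durations for four incidents.
incidents = [
    timedelta(minutes=12),
    timedelta(minutes=8),
    timedelta(minutes=20),
    timedelta(hours=6),
]
print(mttr(incidents))  # 1:40:00
```

Three of the four incidents were over within 20 minutes, yet the single six-hour incident pulls the mean up to 100 minutes.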
SREs should move away from defaulting to the assumption that MTTX can be useful.
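Testing whether MTTR is a valid metric
• Hypothesis
  • If MTTR is a valid metric, then improving (shortening) TTR should also improve MTTR
• Outline of the test
  • Split the dataset 1:1, improve TTR by 10% in one half, leave the other half untouched, then compute and compare MTTR
  • Check whether MTTR improves by 10%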
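Testing whether MTTR is a valid metric
1. Randomly split the incident dataset [2] into two halves
2. Reduce the repair time (TTR) of one half by 10%
3. Compute the MTTR (mean time to repair) of each half
4. Take the difference in MTTR between the halves
  • diff = MTTR(unmodified) - MTTR(modified)
  • diff > 0 => MTTR improved
  • diff < 0 => MTTR worsened
5. Repeat steps 1 to 4 100,000 times
[2] The datasets were taken from the incident status dashboards of three well-known internet companies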
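The five steps can be sketched in Python as below. This is an illustration, not the code behind the talk's results. The repair times are synthetic log-normal samples rather than the real status-page data, the function name `simulate` is hypothetical, the success criterion (MTTR at least 10% lower) is one reading of the result slide, and the sketch runs 2,000 trials instead of 100,000 to keep it quick.

```python
import random
import statistics

def simulate(ttrs, improvement=0.10, trials=2_000, seed=0):
    """Share of trials where the improved half shows an MTTR at least
    `improvement` lower than the untouched half."""
    rng = random.Random(seed)
    reflected = 0
    for _ in range(trials):
        shuffled = list(ttrs)
        rng.shuffle(shuffled)                       # step 1: random 50/50 split
        half = len(shuffled) // 2
        modified = [t * (1 - improvement) for t in shuffled[:half]]  # step 2
        unmodified = shuffled[half:]
        mttr_mod = statistics.fmean(modified)       # step 3
        mttr_unmod = statistics.fmean(unmodified)
        diff = mttr_unmod - mttr_mod                # step 4
        if diff >= improvement * mttr_unmod:
            reflected += 1
    return reflected / trials                       # step 5: repeat and tally

# Synthetic, heavy-tailed repair times in minutes (an assumption, not real data).
data_rng = random.Random(42)
ttrs = [data_rng.lognormvariate(3.0, 1.5) for _ in range(200)]
share = simulate(ttrs)
print(f"10% improvement visible in MTTR in {share:.0%} of trials")
```

With an even split, the improved half clears the 10% bar only when its untouched mean would already have been the lower of the two, so the share comes out near one half however large the dataset's spread is.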
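Characteristics of incident data [3]
• Most incidents are resolved fairly quickly
• A few turn into disastrous incidents (black swan events)
• → When the dataset is split at random, how the disastrous incidents happen to be distributed has a large effect on the computed MTTR
[3] The VOID Report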
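Reference: black swan events
• An unpredictable event that causes catastrophic results
• In Europe, swans were long believed to be exclusively white
• "A large, unexpected event" came to be called a "black swan"
• The term took hold with the book "The Black Swan", published in 2007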
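Simulation results
Even though every incident's repair time was shortened by 10%, MTTR came out at least 10% shorter in only 49%, 50%, and 64% of cases
→ In roughly half of the cases, the shorter repair times are not reflected in MTTR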
Reference: results of the simulation with repair times left unchanged
→ Whether or not any improvement work was done, MTTR improves or worsens depending on the dataset
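What "Incident Metrics in SRE" argues
• What the simulation showed
  • Incident durations vary so widely that improvements are hard to see in MTTR
  • Cases where MTTR worsens despite an improvement are fairly common
• Conclusion
  • MTTR is not useful as a metric for evaluating improvement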
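So what was the problem?
• Incident durations are highly variable
• MTTR is used as some kind of metric
• The metric is used to confirm the results of improvement work
None of these is a problem on its own
→ The problem is that the purpose and the metric do not fit together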
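The flow of hypothesis-driven data analysis
The flow of thought when MTTR is used as the metric
What was happening: a mismatch in the hypothesis-testing logic
The fix: make clear what is being improved, and keep variability down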
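Supplement: what TTR is good for
The mean (MTTR) is too coarse → comparing distributions gives a starting point for finding issues
• ex. Fewer incidents that recover immediately
  • → The result of automatic recovery for minor failures?
  • → A defect in the failure detection mechanism?
• ex. More black swan events
  • → Declining quality in code or infrastructure?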
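One way to look at the shape of TTR rather than its mean is to count incidents per duration bucket and compare two periods. The sketch below assumes hypothetical bucket boundaries and made-up repair times in minutes. None of the numbers come from the talk.

```python
# Hypothetical duration buckets: (lower bound, upper bound, label), in minutes.
BUCKETS = [
    (0, 5, "under 5 min"),
    (5, 60, "5-60 min"),
    (60, float("inf"), "over 60 min"),
]

def bucket_counts(ttr_minutes):
    """Count how many incidents fall into each duration bucket."""
    counts = {label: 0 for _, _, label in BUCKETS}
    for t in ttr_minutes:
        for low, high, label in BUCKETS:
            if low <= t < high:
                counts[label] += 1
                break
    return counts

# Made-up repair times for two periods.
last_quarter = [1, 2, 3, 4, 30, 45, 200]
this_quarter = [20, 35, 50, 70, 300, 600]

print(bucket_counts(last_quarter))
print(bucket_counts(this_quarter))
```

In this made-up data the quick recoveries vanish and the long tail grows, which a single mean would report only as "worse" without saying where to look.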
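Summary so far
• MTTR (recovery time) data is highly variable, so it is unsuitable as an improvement metric
• Variability can be kept down by making clear what is being improved and using finer-grained TTX metrics
→ This creates a need for metrics finer than TTR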
2. Practical TTX metrics
What Waroom considers a practical metric
• It is comprehensive
• It is fine-grained
• Collecting it is realistic
Which TTX metrics should we collect?
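The challenge with TTX metrics
• There are several published examples, but the definitions are not unified
• Trying to combine the examples leads to overlaps and gaps
• → Aim for a fine-grained, comprehensive definition based on well-known literature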
How the TTX metrics were defined
1. Learn the best practices
2. Define the incident statuses
3. Define the incident milestones (the boundaries between statuses)
4. Define the TTX metrics
Learn the best practices
Define the statuses in broad strokes
Turn the milestones into TTX metrics
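Column: collecting metrics is hard work
• Once fine-grained metrics are defined, a timestamp has to be recorded every time a milestone is reached
• Having a person record each one by hand during the response is unrealistic
• → Waroom collects them automatically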
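Example: collecting timestamps automatically, triggered by events during the response
Milestone | Event during the response
Detected | Alert notification
Acknowledged | Channel created, incident ticket opened
Identified (fix identified) | Runbook phase split (Precheck and Resolution)
Recovered | AI judges from the Slack conversation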
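Once each milestone has a timestamp, a TTX value is simply the time between two milestones. The sketch below uses the four milestone names from the table. The `ttx` function, the "from->to" key format, and the example timestamps are assumptions for illustration, not Waroom's actual data model.

```python
from datetime import datetime

# Milestone order as listed in the table; names lowercased for use as keys.
MILESTONES = ["detected", "acknowledged", "identified", "recovered"]

def ttx(timestamps):
    """Durations between consecutive milestones, keyed as 'from->to'.

    Pairs where either milestone is missing are skipped.
    """
    result = {}
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        if start in timestamps and end in timestamps:
            result[f"{start}->{end}"] = timestamps[end] - timestamps[start]
    return result

# Hypothetical timestamps for a single incident.
incident = {
    "detected": datetime(2025, 1, 26, 10, 0),
    "acknowledged": datetime(2025, 1, 26, 10, 4),
    "identified": datetime(2025, 1, 26, 10, 30),
    "recovered": datetime(2025, 1, 26, 10, 45),
}
for name, duration in ttx(incident).items():
    print(name, duration)
```

Skipping missing milestones keeps the function usable for incidents that are still open or where one event was never captured.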
3. Using TTX metrics
To use metrics effectively
Match the characteristics of the metric to the purpose of the analysis
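Examples of metrics and improvement measures
TTX | Issue | Improvement measure
TTDetect (detection) | It takes a long time from occurrence to detection | Improve monitoring
TTEngage (team formation) | It takes a long time to assemble the response team | Clarify shifts and roles, introduce an on-call rotation
TTInvestigate (investigation) | It takes a long time to isolate the fault | Maintain runbooks and dashboards
TTFix (repair) | It takes a long time to repair the fault | Speed up rollbacks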
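Start from a loose hypothesis and find issues in the data
Hypothesis | Example of a newly found issue
Incidents within one company should show a roughly constant TTX | Performance differs by service and by team
Each TTX should be close to what we expect (ex. TTA within about 10 minutes) | (In reality) work starts late across the board, and identifying a fix is slow across the board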
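Checking a loose hypothesis such as "TTA should be within about 10 minutes everywhere" can start with a per-team median. The team names, the records, and the 10-minute target in the sketch below are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_by_team(records):
    """Median TTX value (in minutes) per team from (team, minutes) pairs."""
    grouped = defaultdict(list)
    for team, minutes in records:
        grouped[team].append(minutes)
    return {team: median(values) for team, values in grouped.items()}

def over_target(medians, target_minutes):
    """Teams whose median exceeds the expected value, sorted by name."""
    return sorted(team for team, m in medians.items() if m > target_minutes)

# Hypothetical time-to-acknowledge samples, in minutes.
tta_records = [
    ("payments", 4), ("payments", 6), ("payments", 8),
    ("search", 12), ("search", 25), ("search", 18),
]
medians = median_by_team(tta_records)
print(medians)
print(over_target(medians, target_minutes=10))
```

The median is used instead of the mean so that one unusually slow acknowledgement does not decide the comparison.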
4. Advanced metrics
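What matters besides service recovery
• The TTX metrics covered so far focus on system recovery
• Real incident response has to take care of people as well as systems
• Using metrics from the customer-support and business-operations perspectives brings an organization-wide response, one that includes members other than engineers, closer to reality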
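Examples of advanced metrics
Also focus on customer response and root-cause countermeasures, involve a variety of roles, and accelerate organizational incident response
Metric name | Target role | Purpose
Incident Response Metrics | Engineer | Identifying issues in, and measuring improvement of, the recovery work itself
Customer Reliability Metrics | Sales, CRE | Identifying issues in, and measuring improvement of, customer response
Learning Metrics | Manager, Engineer | Tracking the activities up to the point where the organization has learned from the incident
Improvement Metrics | Manager, Engineer | Analyzing how far root-cause countermeasures have been carried out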
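Summary
I covered the following five points. If anything was unclear, please come to Ask the Speaker!
1. MTTR is not useful as an improvement metric
  • Reason: incident data is highly variable
2. When using metrics, consistency from the purpose through to the data analysis matters
3. To keep variability down, make the question concrete and break the metrics down
4. How TTX metrics are defined and used in Waroom
5. Metrics that matter besides service recovery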
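In closing
• Building a mechanism to collect metrics automatically is hard work
• Building a visualization platform on top of that is also hard work
• Extracting subsets by cause category or arbitrary labels is hard work too
• → Please make use of Waroom
• If this caught your interest, please stop by the Topotal booth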
Thank you