Error Management in a Monorepo: Runbook Auto-Generation and Team Mention Optimization

Presentation slides for Japan Datadog User Group Meetup #12 @ Tokyo
https://datadog-jp.connpass.com/event/360923/

Shota Iwami

August 19, 2025

Transcript

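  1. Organizational Context: the case of newmo

    🚗 Multiple products are developed concurrently
    📦 Monorepo with high team fluidity
    🤝 Both business staff and engineers respond to incidents
    👀 Basically everyone watches the alert channel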
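  2. Key Pain Points & Solutions

    Labels on the slide: Problem, Impact
    • Channel noise
    • No culture of watching the alert channel
    • Troubleshooting procedure is unknown
    • Team members change frequently, so the person responsible is unclear
    • Error details are hard to understand, so the right response is unclear
    • Investigation cost is high
    • Ownership is opaque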
  3. Today's Takeaways and keywords

    🔔 Alerting: priority-based triage, especially in a monorepo
    📖 Runbooks: declarative and automatically delivered, no more silos
    🚚 GitOps: one workflow for all observability assets
  4. Broken Windows Theory

    "One unrepaired broken window is a signal that no one cares, and so breaking more windows costs nothing." (James Q. Wilson, George L. Kelling)
    Redefining the alert channel: an unrepaired window says nobody cares... an untriaged alert says the same!
    Image: AI-generated (OpenAI / Raycast)
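  5. About Runbook

    🚨 A world without runbooks
    😵 Nobody knows where to start
    💤 Alerts get ignored
    🤯 Cognitive load is high
    💡 A world with runbooks
    ✅ The place that needs attention is clear right away
    🚀 Many people can respond quickly
    🙌 It is easy for others to help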
  7. About Runbook

    "A runbook is an excellent way to quickly indicate the direction you should take when an alert comes in. As environments become more complex, not everyone on the team knows every system, and runbooks become an excellent way to spread knowledge." (Mike Julian, Practical Monitoring, O'Reilly, 2017)
    Having runbooks reduces siloing.
  8. Why Runbooks Still Fail

    But even with runbooks...
    ❌ Documentation gets scattered (not Git-managed)
    ❌ Looking up a runbook from a log is costly
    ❌ Making error messages richer increases log volume
  9. Scattered Information (e.g. Linked via Custom Code)

    A custom code embedded in the error plus separately kept runbooks → higher lookup and maintenance costs
    Error: Error Message, Stack Trace, Datadog URL Link, etc. + Custom Code for Runbooks
    Runbook with Custom Code: linked to the error through the custom code
    1. Extract the custom code from the error
    2. Find the runbook
    3. Check its contents
    4. Act
  10. Custom Code (We Call It "Reason Code")

    Reason Code: Runbook
    RC00000: A Foo error occurred. Check the resources at https://example.com and do xxx.
    RC00001: The bar system is returning 5xx errors. Their system may be having a problem. Check the logs and, if the problem is not temporary, contact the person in charge on Slack. BAR Inc. 555-1234-5678
    This data used to be managed in Notion, Confluence, GitHub Wiki, etc.
  11. Single Source of Truth: Proto File

    ------------------------------------------
    reason_code.proto:
    ------------------------------------------
    enum ReasonCode {
      RC00000 = 0 [(ext.reason_code) = {
        message: [
          "A foo error occurred.",
          "Please check resources at https://example.com.",
          "If it's bar, please proceed",
          "with the standard workaround."
        ]
      }];
      RC00001 = 1 [(ext.reason_code) = {
        message: [
          "5xx errors returned from foo's system.",
          "Their system might be experiencing issues.",
          "Check logs and if it's not temporary,",
          "contact the responsible party through PM.",
          "foo, Inc.",
          "080-1111-1111"
        ]
      }];
    }

    • Reason Code: a unique code that links to the runbook
    • Message (Runbook): the runbook body linked to the reason code
  14. Application Code: Go

    ------------------------------------------
    reason_code.pb.go:
    ------------------------------------------
    // Code generated by protoc-gen-go-reason-code.
    type ReasonCode string

    const (
        RC00000 ReasonCode = "RC00000" // Message: A foo...
        RC00001 ReasonCode = "RC00001" // Message: 5xx err...
    )
    ------------------------------------------
    error.go:
    ------------------------------------------
    // Custom Error
    type Error struct {
        error
        ...
        reasonCodes []ReasonCode
        ...
    }

    func WithReasonCode(rc ReasonCode) Option {
        return func(e *Error) {
            e.reasonCodes = append(e.reasonCodes, rc)
        }
    }

    • Reason Code Constants: constants generated automatically from the proto, which prevents typos
    • Custom Error: a custom error type for the Datadog integration
  17. Error JSON

    {
      "error": {
        "reason_codes": [ "RC00000" ],
        ...
        "message": "foo error"
      }
    }

    • Output of the structured log for an Error
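As a rough illustration of how such a log line could be produced, the sketch below marshals only the two fields visible on the slide with the standard `encoding/json` package. The struct names and the `encodeErrorLog` helper are hypothetical, and the fields elided on the slide are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorLog holds the two fields visible on the slide; elided fields are omitted.
type errorLog struct {
	ReasonCodes []string `json:"reason_codes"`
	Message     string   `json:"message"`
}

// logEntry nests the error under the "error" key, matching the slide's shape.
type logEntry struct {
	Error errorLog `json:"error"`
}

// encodeErrorLog is a hypothetical helper that renders one structured log line.
func encodeErrorLog(message string, reasonCodes ...string) (string, error) {
	b, err := json.Marshal(logEntry{
		Error: errorLog{ReasonCodes: reasonCodes, Message: message},
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := encodeErrorLog("foo error", "RC00000")
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
	// prints {"error":{"reason_codes":["RC00000"],"message":"foo error"}}
}
```

Keeping only the short code in the log, rather than the runbook text, is what keeps log volume flat while still letting the monitor pick the right runbook.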
  18. Message branching with is_match

    • Use is_match to branch on the reason code inside the error
    • Show the runbook that matches the reason code
    • The message that fits the context is selected automatically

    {{#is_match "log.attributes.error.reason_codes" "\"RC00000\""}}
    reason_code: RC00000
    ```
    A foo error occurred.
    Please check resources at https://example.com.
    If it's bar, please proceed
    with the standard workaround.
    ```
    {{/is_match}}
  20. Monitors as Code

    ------------------------------------------
    application_error_alert_message.pb.md:
    ------------------------------------------
    {{!-- Monitor generated by protoc-gen-go-reason-code. DO NOT EDIT. --}}
    {{ log.message }}
    {{#is_match "log.attributes.error.reason_codes" "\"RC00000\"" }}
    reason_code: `RC00000`
    A foo error occurred.
    Please check resources at https://example.com.
    If it's bar, please proceed
    with the standard workaround.
    {{/is_match}}
    {{#is_match "log.attributes.error.reason_codes" "\"RC00001\"" }}
    reason_code: `RC00001`
    5xx errors returned from bar's system.
    Their system might be experiencing issues.
    Check the logs, and if it's not temporary,
    contact the responsible party via Slack.
    bar, Inc.
    555-1234-5678
    {{/is_match}}
    ...
    {{#is_alert}}
    @${notification_alert_channel}
    {{/is_alert}}
    ------------------------------------------
    monitor.tf:
    ------------------------------------------
    resource "datadog_monitor" "application_error_alert" {
      ...
      message = templatefile(
        "${path.module}/messages/application_error_alert_message.pb.md",
        { notification_alert_channel = var.notification_alert_channel }
      )
    }

    • The Monitor's Markdown is generated automatically
    • The is_match branches are generated automatically
    • The file is loaded with templatefile
  21. Monitors as Code

    Complex, voluminous branching comes at zero cost because it is generated automatically.
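  22. Slack Notifications

    • The mention target cannot be switched dynamically based on the service name in a log attribute
    • A Slack group ID has to be specified
    • When mentioning a Slack group, you cannot specify it by group name; you must specify the group ID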
  23. About Reference Tables

    What are Reference Tables?
    • They let you add metadata to information that already exists in Datadog
    • They are written in CSV format
    • Besides direct CSV upload, cloud storage such as S3, GCS, and Azure Storage is also supported
  24. Linking Slack User Group IDs with Reference Tables

    • Define the mapping in a CSV
    • Manage the CSV on GitHub
    • Sync changes to GCS with Actions
    • Datadog automatically picks up the latest CSV
    ------------------------------------------
    slack-group-id.csv:
    ------------------------------------------
    service,id,name
    component.foo,aaaabbbb1234,alert-server-component-foo
    component.bar,cccdddd4567,alert-server-component-bar
    component.baz,eeefff8901,alert-server-component-baz
    ...

    GitHub → Cloud Storage → Reference Tables
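Since the CSV becomes the only place where services are tied to Slack group IDs, a small sanity check before the sync step may be worthwhile. The sketch below is an assumption about what such a check could look like, not something shown in the deck: `loadGroupIDs` is a hypothetical function that parses the file with the standard `encoding/csv` package and rejects a wrong header, empty values, and duplicate services.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// loadGroupIDs parses the mapping CSV and returns service -> Slack group ID.
// The csv reader already rejects rows whose field count differs from the header.
func loadGroupIDs(data string) (map[string]string, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 || strings.Join(rows[0], ",") != "service,id,name" {
		return nil, fmt.Errorf("unexpected or missing header")
	}
	ids := make(map[string]string, len(rows)-1)
	for i, row := range rows[1:] {
		service, id := row[0], row[1]
		if service == "" || id == "" {
			return nil, fmt.Errorf("record %d: empty service or id", i+1)
		}
		if _, dup := ids[service]; dup {
			return nil, fmt.Errorf("record %d: duplicate service %q", i+1, service)
		}
		ids[service] = id
	}
	return ids, nil
}

func main() {
	// Sample rows copied from the slide.
	const sample = "service,id,name\n" +
		"component.foo,aaaabbbb1234,alert-server-component-foo\n" +
		"component.bar,cccdddd4567,alert-server-component-bar\n" +
		"component.baz,eeefff8901,alert-server-component-baz\n"

	ids, err := loadGroupIDs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids["component.bar"])
}
```

A duplicate service would make the lookup ambiguous, so failing early in CI is cheaper than discovering a misrouted mention during an incident.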
  25. Linking Slack User Group IDs with Reference Tables

    • Specify the table in a Lookup Processor
    • Specify the id it adds in the Monitor message
  26. Wrap Up!

    📖 Automatic runbook generation: runbooks embedded in code and monitors, always up to date, always in sync
    🚚 Team mention optimization: the right alert reaches the right team, instantly, every time