Monorepo Error Management: Automated Runbooks and Team-Targeted Alert Distribution

Shota Iwami

June 12, 2025

Transcript

  1. Have you ever experienced…

    NEW ERROR notifications flooding your #alerts-channel that nobody checks until the next morning?
  2. Today's Takeaways and Keywords

    • 🔔 Alerting: priority-based triage, especially in a monorepo
    • 📖 Runbooks: declarative and auto-delivered, no more silos
    • 🚚 GitOps: one workflow for all observability assets
  6. A Colorful Journey with newmo

    newmo is a Japan-based mobility and FinTech start-up founded in January 2024. It operates taxi and ridesharing platforms and car-leasing services.
  7. Organizational Context: the Case of newmo

    • 🚗 Multi-product mobility startup
    • 📦 Monorepo, fluid team moves
    • 🤝 Business and Engineering co-own incidents
    • 👀 Alert channels visible to all
  8. Key Pain Points & Solutions

    • Problem: Alert channel noise. Impact: No culture of checking alert channels.
    • Problem: Unknown troubleshooting procedures. Impact: Notifications arrive without metadata, so responders do not understand the root cause or the fix.
    • Problem: Unclear ownership. Impact: Ambiguous responsibilities due to team changes.
  12. Broken Windows Theory

    "One unrepaired broken window is a signal that no one cares, and so breaking more windows costs nothing." (James Q. Wilson, George L. Kelling) Image: AI-generated (OpenAI / Raycast)
  13. Radically Re-Defining Alert Channels

    An unrepaired window says nobody cares… an untriaged alert says the same!
  14. About Runbooks

    • A world without procedures: 😵 Unclear where to start, 💤 Issues ignored, 🤯 High cognitive load
    • A world with procedures: ✅ Immediate starting point, 🚀 Fast action by more people, 🙌 Anyone can help
  16. About Runbooks

    “A runbook is an excellent way to quickly indicate the direction you should take when an alert comes in. As environments become more complex, not everyone on the team knows every system, and runbooks become an excellent way to spread knowledge.” (Mike Julian, Practical Monitoring, O’Reilly, 2017)

    Documenting procedures as “runbooks” helps prevent silos.
  17. Why Runbooks Still Fail

    But even with runbooks…

    • ❌ Scattered docs (not Git-managed)
    • ❌ Hard to reference from logs
    • ❌ Log volume limits rich error messages
  21. Scattered Information (e.g. Linked via Custom Code)

    Embedding custom codes in errors while keeping runbooks separate increases both lookup and maintenance costs.

    • Error side: error message, stack trace, Datadog URL link, etc., plus a custom code for runbooks
    • Runbook side: a runbook keyed by the same custom code
    • Responders take the custom code from the error, search for the runbook, check its content, and only then take action.
  22. Custom Code (We Call It “Reason Code”)

    • RC00000: A foo error occurred. Please check resources at https://example.com. If it's bar, please proceed with the standard workaround.
    • RC00001: 5xx errors returned from bar's system. Their system might be experiencing issues. Check the logs, and if it's not temporary, contact the responsible party via Slack. BAR, Inc. 555-1234-5678

    Previously, these were documented in Notion, Confluence, GitHub Wiki, etc.
  23. Architecture

    ✅ Here's our WHY: We want all the benefits of runbooks without the maintenance nightmare.
  24. Change it once → instantly reflected everywhere!
  25. Single Source of Truth: Proto File

    • Reason Code: unique code linking implementation and runbook
    • Message (Runbook): runbook associated with the corresponding reason code

    ------------------------------------------ reason_code.proto: ------------------------------------------

        enum ReasonCode {
          RC00000 = 0 [(ext.reason_code) = {
            message: [
              "A foo error occurred.",
              "Please check resources at https://example.com.",
              "If it's bar, please proceed",
              "with the standard workaround."
            ]
          }];
          RC00001 = 1 [(ext.reason_code) = {
            message: [
              "5xx errors returned from foo's system.",
              "Their system might be experiencing issues.",
              "Check logs and if it's not temporary,",
              "contact the responsible party through PM.",
              "foo, Inc.",
              "080-1111-1111"
            ]
          }];
        }
  28. Application Code: Go

    • Reason Code Constants: auto-generated reason codes from the proto file
    • Custom Error: custom errors with reason codes for Datadog integration

    ------------------------------------------ reason_code.pb.go: ------------------------------------------

        // Code generated by protoc-gen-go-reason-code.
        type ReasonCode string

        const (
            RC00000 ReasonCode = "RC00000" // Message: A foo...
            RC00001 ReasonCode = "RC00001" // Message: 5xx err...
        )

    ------------------------------------------ error.go: ------------------------------------------

        // Custom Error
        type Error struct {
            error
            ...
            reasonCodes []ReasonCode
            ...
        }

        func WithReasonCode(rc ReasonCode) Option {
            return func(e *Error) {
                e.reasonCodes = append(e.reasonCodes, rc)
            }
        }
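
The slide elides how the custom error is constructed and read back, so the following is only a minimal sketch of the functional-option pattern it hints at. `Option` as `func(*Error)`, the `NewError` constructor, the `Unwrap` method, and the `ReasonCodesOf` helper are assumptions made for illustration, not part of the original code.

```go
package main

import (
	"errors"
	"fmt"
)

// ReasonCode mirrors the generated string type shown on the slide.
type ReasonCode string

const (
	RC00000 ReasonCode = "RC00000"
	RC00001 ReasonCode = "RC00001"
)

// Error embeds the underlying error and carries reason codes.
type Error struct {
	error
	reasonCodes []ReasonCode
}

// Option mutates an Error during construction (assumed definition).
type Option func(*Error)

// WithReasonCode appends a reason code, as on the slide.
func WithReasonCode(rc ReasonCode) Option {
	return func(e *Error) {
		e.reasonCodes = append(e.reasonCodes, rc)
	}
}

// NewError is a hypothetical constructor that applies the options in order.
func NewError(err error, opts ...Option) *Error {
	e := &Error{error: err}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *Error) Unwrap() error { return e.error }

// ReasonCodesOf returns the reason codes of the first *Error found in the
// error chain, or nil when there is none (hypothetical helper).
func ReasonCodesOf(err error) []ReasonCode {
	var e *Error
	if errors.As(err, &e) {
		return e.reasonCodes
	}
	return nil
}

func main() {
	base := NewError(errors.New("foo error"), WithReasonCode(RC00000))
	wrapped := fmt.Errorf("handling request: %w", base)
	fmt.Println(wrapped, ReasonCodesOf(wrapped))
}
```

Because the lookup goes through `errors.As`, the reason codes survive ordinary `%w` wrapping, so a logging layer at the edge of the service could still read them.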
  31. Error JSON

    • Reason codes in error field (structured logs)

        {
          "error": {
            "reason_codes": [ "RC00000" ],
            ...
            "message": "foo error"
          }
        }
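
As a rough illustration of how such a log line could be produced, the sketch below marshals only the two fields visible on the slide with `encoding/json`. The struct names and the `marshalErrorLog` helper are hypothetical, and the fields elided by `...` on the slide are left out rather than guessed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReasonCode mirrors the generated string type from the deck.
type ReasonCode string

// errorPayload holds only the fields visible on the slide.
type errorPayload struct {
	ReasonCodes []ReasonCode `json:"reason_codes"`
	Message     string       `json:"message"`
}

// logEntry nests the payload under the "error" key.
type logEntry struct {
	Error errorPayload `json:"error"`
}

// marshalErrorLog renders one structured log line (hypothetical helper).
func marshalErrorLog(message string, codes ...ReasonCode) (string, error) {
	b, err := json.Marshal(logEntry{
		Error: errorPayload{ReasonCodes: codes, Message: message},
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := marshalErrorLog("foo error", "RC00000")
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Keeping `reason_codes` as a JSON array means one error can carry several codes, which matches the slice-typed `reasonCodes` field in the custom error.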
  32. Message Branching with is_match

    • Branch message by reason code with is_match
    • Show runbook content when reason code matches
    • Context-aware alert messages

        {{#is_match "log.attributes.error.reason_codes" "\"RC00000\""}}
        reason_code: RC00000
        ```
        A foo error occurred.
        Please check resources at https://example.com.
        If it's bar, please proceed with the standard workaround.
        ```
        {{/is_match}}
  34. 💭 OK, I get how this works with Datadog!! But so what?
  35. From Garbage Drift to Gift Delivery: BEFORE

    Alert channel = 🗑 Garbage floating in a digital ocean. Let's be honest: our alert channel was a digital wasteland…
  36. From Garbage Drift to Gift Delivery: AFTER

    Alert channel = 🎁 Smart gifts arriving with perfect timing. Now it's a treasure chest of actionable intelligence!!

    • A foo error occurred. Please check resources at…
    • 5xx errors returned from foo's system…
    • 5xx errors returned from bar’s system…
    • 429 errors returned from foo's system…
    • System A is down…
  37. Monitors as Code

    • Monitor markdown auto-generated
    • is_match branches auto-generated
    • Loaded via built-in templatefile function

    ------------------------------------------ application_error_alert_message.pb.md: ------------------------------------------

        {{!-- Monitor generated by protoc-gen-go-reason-code. DO NOT EDIT. --}}
        {{ log.message }}
        {{#is_match "log.attributes.error.reason_codes" "\"RC00000\"" }}
        reason_code: `RC00000`
        A foo error occurred.
        Please check resources at https://example.com.
        If it's bar, please proceed with the standard workaround.
        {{/is_match}}
        {{#is_match "log.attributes.error.reason_codes" "\"RC00001\"" }}
        reason_code: `RC00001`
        5xx errors returned from bar's system.
        Their system might be experiencing issues.
        Check the logs, and if it's not temporary, contact the responsible party via Slack.
        bar, Inc.
        555-1234-5678
        {{/is_match}}
        ...
        {{#is_alert}}
        @${notification_alert_channel}
        {{/is_alert}}

    ------------------------------------------ monitor.tf: ------------------------------------------

        resource "datadog_monitor" "application_error_alert" {
          ...
          message = templatefile(
            "${path.module}/messages/application_error_alert_message.pb.md",
            {
              notification_alert_channel = var.notification_alert_channel
            }
          )
        }
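
The deck does not show the internals of protoc-gen-go-reason-code, so the sketch below is not that plugin. It only illustrates, under that caveat, how a generator could emit one `is_match` block per reason code from a code-to-message mapping. The function `renderMonitorMessage` and its map input are hypothetical stand-ins for whatever the real plugin reads from the proto descriptors.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderMonitorMessage emits one is_match block per reason code.
// Codes are sorted so the generated file is stable across runs.
func renderMonitorMessage(runbooks map[string][]string) string {
	codes := make([]string, 0, len(runbooks))
	for code := range runbooks {
		codes = append(codes, code)
	}
	sort.Strings(codes)

	var b strings.Builder
	b.WriteString("{{ log.message }}\n")
	for _, code := range codes {
		// The second argument is the code wrapped in escaped quotes,
		// matching the form used on the slide.
		fmt.Fprintf(&b, `{{#is_match "log.attributes.error.reason_codes" "\"%s\"" }}`+"\n", code)
		fmt.Fprintf(&b, "reason_code: `%s`\n", code)
		for _, line := range runbooks[code] {
			b.WriteString(line + "\n")
		}
		b.WriteString("{{/is_match}}\n")
	}
	return b.String()
}

func main() {
	fmt.Print(renderMonitorMessage(map[string][]string{
		"RC00000": {
			"A foo error occurred.",
			"Please check resources at https://example.com.",
		},
		"RC00001": {"5xx errors returned from bar's system."},
	}))
}
```

Sorting the codes keeps the generated markdown deterministic, so a regenerated file only shows a diff when a runbook actually changed.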
  38. Complex, large-scale branching with zero manual cost, enabled by automation
  39. After: Alert with Runbook

    • Reduced divergence between runbooks and implementation code
    • One PR updates code, documentation, and monitors simultaneously
    • Runbooks provided directly in alert notifications, eliminating manual linking
  40. One proto to rule them all, One proto to find them, One proto to generate them all, and in the code bind them! 💍
  41. Slack Notifications

    • To mention a Slack group, you must specify the group ID, not the group name
  42. Slack Notifications

    • The mention target can't be switched dynamically from the service name in log attributes
    • A Slack group ID must be specified
  43. About Reference Tables

    What are Reference Tables?

    • Add metadata to information already in Datadog
    • Describe metadata in CSV format
    • Data sources can be direct CSV upload, S3, GCS, or Azure Storage
  44. Linking Slack User Group IDs with Reference Tables

    • Define mappings in CSV
    • Manage CSVs in GitHub
    • Changes are synced to GCS
    • Datadog automatically updates from the latest CSV

    Flow: GitHub → Cloud Storage → Reference Tables

    ------------------------------------------ slack-group-id.csv: ------------------------------------------

        service,id,name
        component.foo,aaaabbbb1234,alert-server-component-foo
        component.bar,cccdddd4567,alert-server-component-bar
        component.baz,eeefff8901,alert-server-component-baz
        ...
  45. Linking Slack User Group IDs with Reference Tables: Specify in Monitor Message

    • Specify the group ID from log attributes in the monitor message.
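
In the deck the lookup itself happens inside Datadog. The sketch below only mimics it locally, for example as a pre-merge check on the CSV. `parseSlackGroupIDs` and `slackGroupMention` are hypothetical helpers, and the `<!subteam^ID>` form is Slack's user-group mention syntax, which the monitor message is assumed to end up using.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseSlackGroupIDs reads "service,id,name" rows into a service -> group ID
// map, rejecting unexpected headers, empty IDs, and duplicate services.
func parseSlackGroupIDs(data string) (map[string]string, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	h := records[0]
	if len(h) != 3 || h[0] != "service" || h[1] != "id" || h[2] != "name" {
		return nil, fmt.Errorf("unexpected header: %v", h)
	}
	ids := make(map[string]string, len(records)-1)
	for _, rec := range records[1:] {
		service, id := rec[0], rec[1]
		if id == "" {
			return nil, fmt.Errorf("empty group ID for service %q", service)
		}
		if _, dup := ids[service]; dup {
			return nil, fmt.Errorf("duplicate service %q", service)
		}
		ids[service] = id
	}
	return ids, nil
}

// slackGroupMention formats a Slack user-group mention from a group ID.
func slackGroupMention(groupID string) string {
	return "<!subteam^" + groupID + ">"
}

func main() {
	const data = "service,id,name\n" +
		"component.foo,aaaabbbb1234,alert-server-component-foo\n" +
		"component.bar,cccdddd4567,alert-server-component-bar\n"
	ids, err := parseSlackGroupIDs(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(slackGroupMention(ids["component.foo"]))
}
```

Catching duplicate services or empty IDs before the CSV is synced would keep a bad row from silently routing alerts to the wrong team.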
  46. Wrap Up!

    • 📖 Automatic Runbooks Generation: runbooks embedded in code & monitors, always up to date, always in sync
    • 🚚 Team-Specific Alert Notification: get the right alert to the right team, instantly, every time