culture of checking alert channels Unknown troubleshooting procedures Impact Ambiguous responsibilities due to team changes Notifications arrive without metadata - responders do not understand root cause or the fix Unclear ownership Problem
that no one cares, and so breaking more windows costs nothing. (James Q. Wilson, George L. Kelling)
Radically Re-Defining Alert Channels
An unrepaired window says nobody cares…an untriaged alert says the same!
Image: AI-generated (OpenAI / Raycast)
Runbooks
A World without procedures 🚨
😵 Unclear where to start
💤 Issues ignored
🤯 High cognitive load
A World with procedures 💡
✅ Immediate starting point
🚀 Fast action by more people
🙌 Anyone can help
quickly indicate the direction you should take when an alert comes in. As environments become more complex, not everyone on the team knows every system, and runbooks become an excellent way to spread knowledge.” (Mike Julian, Practical Monitoring, O’Reilly, 2017)
Having procedures documented (aka “Runbooks”) helps prevent silos.
Embedding custom code in errors & keeping runbooks separate increases both lookup and maintenance costs.
Error: Error Message, Stack Trace, Datadog URL Link, etc… + Custom Code for Runbooks
Runbook with Custom Code: Linked via Custom Code
From the custom code in the error, search for the runbook, check the content, and take action…
Runbook
RC00000
A foo error occurred. Please check resources at https://example.com. If it's bar, please proceed with the standard workaround.
RC00001
5xx errors returned from bar's system. Their system might be experiencing issues. Check the logs, and if it's not temporary, contact the responsible party via Slack. BAR, Inc. 555-1234-5678
Previously, these were documented in Notion/Confluence/GitHub Wiki, etc…
Single Source of Truth: Proto File
Reason Code
• Unique code linking implementation and runbook
Message (Runbook)
• Runbook associated with the corresponding reason code

  … {
    message: [
      "A foo error occurred.",
      "Please check resources at https://example.com.",
      "If it's bar, please proceed",
      "with the standard workaround."
    ]
  }];
  RC00001 = 1 [(ext.reason_code) = {
    message: [
      "5xx errors returned from foo's system.",
      "Their system might be experiencing issues.",
      "Check logs and if it's not temporary,",
      "contact the responsible party through PM.",
      "foo, Inc.",
      "080-1111-1111"
    ]
  }];
}
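To make the "unique code linking implementation and runbook" idea concrete, here is a minimal Python sketch. It is not the talk's actual generated code: the `Runbook` dataclass, the `RUNBOOKS` table, the `ReasonedError` class, and `to_log_record` are all hypothetical names, and the table simply mirrors the proto entries above by hand. The only detail taken from the slides is that the log carries the codes under `error.reason_codes`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    """One reason code plus its runbook lines (hypothetical mirror of the proto)."""
    reason_code: str
    message: tuple[str, ...]


# Hand-written stand-in for what would be generated from the proto file.
RUNBOOKS = {
    "RC00000": Runbook("RC00000", (
        "A foo error occurred.",
        "Please check resources at https://example.com.",
        "If it's bar, please proceed",
        "with the standard workaround.",
    )),
    "RC00001": Runbook("RC00001", (
        "5xx errors returned from foo's system.",
        "Their system might be experiencing issues.",
        "Check logs and if it's not temporary,",
        "contact the responsible party through PM.",
        "foo, Inc.",
        "080-1111-1111",
    )),
}


class ReasonedError(Exception):
    """Application error that must carry a known reason code."""

    def __init__(self, message: str, reason_code: str) -> None:
        if reason_code not in RUNBOOKS:
            # Unknown codes are rejected so code and runbook cannot drift apart.
            raise ValueError(f"unknown reason code: {reason_code}")
        super().__init__(message)
        self.reason_code = reason_code


def to_log_record(err: ReasonedError) -> dict:
    """Build a structured log payload exposing error.reason_codes (assumed shape)."""
    return {"message": str(err), "error": {"reason_codes": [err.reason_code]}}


if __name__ == "__main__":
    print(json.dumps(to_log_record(ReasonedError("upstream call failed", "RC00001"))))
```

Because the error type refuses codes that have no runbook, every alert that reaches the channel can be matched to a procedure.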
Message Branching with is_match
runbook content when reason code matches
• Context-aware alert messages

{{#is_match "log.attributes.error.reason_codes" "\"RC00000\""}}
reason_code: RC00000
```
A foo error occurred.
Please check resources at https://example.com.
If it's bar, please proceed with the standard workaround.
```
{{/is_match}}
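Writing one `is_match` block per reason code by hand does not scale, so the blocks are a natural target for generation. The sketch below shows one way such a block could be rendered in Python. The function name `render_branch` is hypothetical, and the output layout simply copies the slide above. It is not the generator used in the talk.

```python
FENCE = "`" * 3  # markdown code fence used inside the monitor message


def render_branch(reason_code: str, runbook_lines: list[str]) -> str:
    """Render one Datadog is_match block for a reason code (hypothetical helper)."""
    # The comparison string wraps the code in escaped quotes, as on the slide.
    needle = '\\"' + reason_code + '\\"'
    header = '{{#is_match "log.attributes.error.reason_codes" "' + needle + '"}}'
    return "\n".join([
        header,
        f"reason_code: {reason_code}",
        FENCE,
        *runbook_lines,
        FENCE,
        "{{/is_match}}",
    ])


if __name__ == "__main__":
    print(render_branch("RC00000", [
        "A foo error occurred.",
        "Please check resources at https://example.com.",
        "If it's bar, please proceed with the standard workaround.",
    ]))
```

Calling `render_branch` once per entry and joining the results gives the full branching section of a monitor message.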
💭 OK, I get how this works with Datadog!! But so what?
Drift to Gift Delivery
AFTER
Alert channel = 🎁 Smart gifts arriving with perfect timing ✨
🎁 A foo error occurred. Please check resources at…
🎁 5xx errors returned from foo's system….
🎁 5xx errors returned from bar’s system….
🎁 429 errors returned from foo's system….
🎁 System A is down…
Monitors as Code
• Monitor markdown auto-generated
• is_match branches auto-generated
• Loaded via built-in templatefile function

… EDIT. --}}
{{ log.message }}
{{#is_match "log.attributes.error.reason_codes" "\"RC00000\"" }}
reason_code: `RC00000`
A foo error occurred. Please check resources at https://example.com. If it's bar, please proceed with the standard workaround.
{{/is_match}}
{{#is_match "log.attributes.error.reason_codes" "\"RC00001\"" }}
reason_code: `RC00001`
5xx errors returned from bar's system. Their system might be experiencing issues. Check the logs, and if it's not temporary, contact the responsible party via Slack. bar, Inc. 555-1234-5678
{{/is_match}}
...
{{#is_alert}}
@${notification_alert_channel}
{{/is_alert}}
------------------------------------------
monitor.tf:
------------------------------------------
resource "datadog_monitor" "application_error_alert" {
  ...
  message = templatefile(
    "${path.module}/messages/application_error_alert_message.pb.md",
    { notification_alert_channel = var.notification_alert_channel }
  )
}
Complex, large-scale branching with zero manual cost, enabled by automation
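The generated message mixes two template syntaxes: Datadog's `{{ ... }}` tags and Terraform's `${...}` interpolation. Only the latter is resolved when `templatefile` loads the file. The sketch below approximates that step with Python's `string.Template`, which happens to use the same `${name}` form. This is only a stand-in, since Terraform's template language has its own escaping rules and directives, and the channel name `slack-alerts-foo` is made up for illustration.

```python
from string import Template

# Shortened copy of the generated monitor message from the slide.
MONITOR_MESSAGE = """{{ log.message }}
{{#is_alert}}
@${notification_alert_channel}
{{/is_alert}}
"""


def render(template_text: str, channel: str) -> str:
    """Substitute ${notification_alert_channel}; Datadog {{...}} tags pass through."""
    return Template(template_text).substitute(notification_alert_channel=channel)


if __name__ == "__main__":
    # "slack-alerts-foo" is a hypothetical channel handle.
    print(render(MONITOR_MESSAGE, "slack-alerts-foo"))
```

The Datadog tags survive untouched, so the same generated file can be reused across environments by changing only the Terraform variable.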
Alert with Runbook
• One PR updates code, documentation, and monitors simultaneously
• Runbooks provided directly in alert notifications, eliminating manual linking
One proto to rule them all, One proto to find them, One proto to generate them all, and in the code bind them! 💍
Linking Slack User Group IDs with Reference Tables
Define mappings in CSV
• Manage CSVs in GitHub
• Changes are synced to GCS
• Datadog automatically updates from the latest CSV
------------------------------------------
slack-group-id.csv:
------------------------------------------
service,id,name
component.foo,aaaabbbb1234,alert-server-component-foo
component.bar,cccdddd4567,alert-server-component-bar
component.baz,eeefff8901,alert-server-component-baz
...
GitHub → Cloud Storage → Reference Tables
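Since the CSV lives in GitHub, a small check could run before changes are synced. The following Python sketch parses the mapping and rejects duplicate services or empty IDs. The function name `load_groups` and the validation rules are assumptions for illustration. The talk does not describe any such check.

```python
import csv
import io

# Sample rows copied from the slide.
SAMPLE_CSV = """service,id,name
component.foo,aaaabbbb1234,alert-server-component-foo
component.bar,cccdddd4567,alert-server-component-bar
component.baz,eeefff8901,alert-server-component-baz
"""


def load_groups(text: str) -> dict[str, dict[str, str]]:
    """Parse the service -> Slack user group mapping, failing on bad rows."""
    groups: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        service = row["service"]
        if service in groups:
            raise ValueError(f"duplicate service: {service}")
        if not row["id"]:
            raise ValueError(f"missing group id for service: {service}")
        groups[service] = {"id": row["id"], "name": row["name"]}
    return groups


if __name__ == "__main__":
    groups = load_groups(SAMPLE_CSV)
    print(groups["component.bar"]["id"])
```

Keying the lookup by `service` matches the table's first column, which is what an alert would use to find the owning group.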