yes! 3. Automatable: To some extent yeah! 4. Tactical: yesssssssss! 5. No enduring value: sometimes yes! unless you find a bug to fix, which stops the alert from next time! Let’s eliminate it!!!!!! 9
same filter condition for the same alert b. The pattern of alerts are limited. c. Refer to previous alert response: i. “This case can be ignorable.” ii. “This alert is same as this one.” iii. “I'll see if it recurs.” 2. Accumulate knowledge a. A thread in the alert channel can be very valuable and informative but it tends to be lost. 11
Slack bot 2. Generate filter condition a. Time range (30 mins ~ the event timestamp) b. app name e.g. backend c. module name e.g. graphql_server d. etc. 3. Search Logs/Trace a. Error count b. Top 5 gRPC method, graphql query/mutation c. Link New oncall members can easily know what to check when receiving an alert! 12
to read and interpret them. To-Be: LLM summarizes the past responses: 1. Status: Observing, Ticket Created, WIP, etc. 2. Error: error shared on Slack if exists 3. Link: 14 As-Is To-Be (sample)
Automate oncall handover generation. 2. Automate playbook/runbook (placeholder) generation based on Slack conversation. 3. etc. LLM is like a seasoning for me. It’s not always main dish (work) but it can make it more delicious (fun). Future Work 15