Escape RCA Writing Hell: Using MCP to Let AI Handle the Last Mile of Operations
告別 RCA 撰寫地獄:利用 MCP 讓 AI 替你完成維運最後一哩路 @ DevOpsDays 2026

(Presented at DevOpsDays 2026)

After a system goes live, “Day 2” operations pose two big challenges. The first is rapid, minute-by-minute firefighting; the second, and often tougher, is writing a thorough post-incident Root Cause Analysis (RCA).

Under tight SLAs, we spend most of our energy putting out fires, only to burn even more time afterward reconstructing events, stitching together logs, and drafting the report.
This talk skips the hype of fully autonomous AI operations and instead focuses on a pragmatic approach: using the Model Context Protocol to connect your ops tools so an LLM can ingest real-world data and help engineers generate a structured RCA in minutes.

Let AI handle the drudgery of data gathering and formatting, while we invest our precious time in fixing issues and improving the architecture.

Johnny Sung

June 26, 2026

Transcript

  1. Escape RCA Writing Hell: Using MCP to Let AI Handle the Last Mile of Operations
     告別 RCA 撰寫地獄:利用 MCP 讓 AI 替你完成維運最後一哩路
     Johnny Sung
  2. Johnny Sung (宋岡諺), full stack developer
     • https://fb.com/j796160836
     • https://blog.jks.coffee/
     • https://www.slideshare.net/j796160836
     • https://github.com/j796160836
  3. Agenda
     • The Day 2 reality & the RCA pain
     • MCP & the architecture
     • Demo: a sample app, bugs and AI-drafted RCAs
     • A real-world RCA for comparison
     • Strengths, limits, privacy
     • Takeaways
  4. Day 1 = ship it. 👶
     Day 2 = keep it alive. 🍼🍼🥁🥁💩💩💩
  5. The Day 2 Operations Reality
     • Incidents happen under pressure
     • SLA clocks are always ticking
     • Engineers focus on firefighting first
     • Documentation often comes later
     Image: https://medium.com/@CWSkelly/analysis-this-is-fine-meme-e8980ff61e78
  6. RCA is necessary, but painful
     • 70% of RCA time goes into “pulling data and structuring the context.”
     • 30% is left for real analysis and improvement suggestions.
  7. RCA is necessary, but painful
     • Context is scattered across tools
     • Logs are noisy (approximately 2-3 GB)
     • Chat history is fragmented
     • Details fade quickly after recovery
     Icon: created by darwis
  8. 3 types of MCP servers
     • Local stdio server ✅
     • Remote SSE (Server-Sent Events) server (deprecated)
     • Remote HTTP server
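
The practical difference between a local stdio server and a remote HTTP server is mostly in how the client is told to reach it. The sketch below builds a client configuration with one entry of each kind. It assumes a client that reads an `mcpServers`-style JSON file; that layout, the `command`/`args`/`url` keys, the server names, the command, and the URL are all illustrative assumptions, so check your own client's documentation for the real schema.

```python
import json


def build_mcp_config(local_command: str, remote_url: str) -> dict:
    """Build a client config with one stdio server and one remote HTTP server.

    The "mcpServers" layout and its key names are assumptions for illustration.
    """
    return {
        "mcpServers": {
            # Local stdio server: the client spawns the process and
            # exchanges messages over its stdin/stdout.
            "logs-local": {"command": local_command, "args": []},
            # Remote HTTP server: the client connects to an endpoint
            # that is already running somewhere else.
            "logs-remote": {"url": remote_url},
        }
    }


# Hypothetical command and URL, used only to show the shape of the output.
config = build_mcp_config("example-log-mcp", "https://mcp.example.internal/mcp")
print(json.dumps(config, indent=2))
```
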
  9. MCP servers
     • loki-mcp: https://github.com/grafana/loki-mcp
     • mcp-grafana: https://github.com/grafana/mcp-grafana
     (Which one fits depends on your system architecture. For this demo setup, choose one of the two options above.)
  10. MCP connects AI to real operational context
      • MCP lets LLMs access external tools safely
      • Tools can provide logs, metrics, alerts, tickets, and deployment records (depending on which MCP tools are connected)
      • AI becomes a structured assistant, not an autonomous operator
  11. Use MCP to fetch logs
      • Dump all the logs into the LLM context? No way.
      • With MCP, the LLM fetches data on demand and grabs only what it needs.
      Icon: https://thenounproject.com/icon/crane-7972970/
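
To make "fetch on demand" concrete, here is a minimal sketch of the kind of bounded query a log tool could expose to the model: a time window, an optional level filter, and a hard limit. The in-memory `LOG_STORE`, the `LogLine` type, and the `fetch_logs` name are hypothetical stand-ins for a real backend such as Loki, not the API of any actual MCP server.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class LogLine:
    ts: datetime
    level: str
    message: str


# Hypothetical in-memory stand-in for a log backend.
LOG_STORE = [
    LogLine(datetime(2026, 1, 1, 10, 0, 0), "INFO", "request ok"),
    LogLine(datetime(2026, 1, 1, 10, 0, 5), "ERROR", "db connection refused"),
    LogLine(datetime(2026, 1, 1, 10, 0, 9), "INFO", "request ok"),
    LogLine(datetime(2026, 1, 1, 10, 1, 2), "ERROR", "db connection refused"),
]


def fetch_logs(start: datetime, end: datetime,
               level: Optional[str] = None, limit: int = 100) -> List[LogLine]:
    """Return at most `limit` lines with start <= ts < end, optionally by level."""
    matches = [
        line for line in LOG_STORE
        if start <= line.ts < end and (level is None or line.level == level)
    ]
    return matches[:limit]


errors = fetch_logs(datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 1),
                    level="ERROR")
print(errors)
```

The limit matters as much as the filter: it caps how much text a single tool call can push into the context, whatever the model asks for.
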
  12. Today's demo app: Fortune-telling (運勢抽籤) App
      • Frontend: Vue.js
      • Backend: Spring Boot 4.x + Kotlin 2.2, Java 17 toolchain + OpenTelemetry
      • DB: PostgreSQL
      • Grafana LGTM for demo use
        • Grafana UI
        • Mimir/Prometheus (metric storage)
        • Loki (log storage)
        • Tempo (trace storage)
      • postgres_exporter collects DB metrics
      Demo source code: https://github.com/j796160836/devopsdays-fortune-telling-demo
  13. Architecture (diagram)
      • Components: Vue.js (Frontend), Spring Boot (Backend), PostgreSQL (Database), OpenTelemetry, Prometheus (Metrics), Loki (Logs), Tempo (Traces), Grafana (Dashboard)
      • People: Users, Managers, Ops, Devs
      • Flow labels: API access; Push metrics, Push logs, Push traces; Fetch metrics, Fetch logs, Fetch traces
      Image credits: Alice Design from the Noun Project; https://commons.wikimedia.org/wiki/File:Prometheus_software_logo.svg; https://www.nuget.org/profiles/OpenTelemetry; https://grafana.com/oss/tempo/; https://thenounproject.com/icon/user-4216248/; https://en.wikipedia.org/wiki/PostgreSQL; https://thenounproject.com/icon/robot-1030735/
  14. Architecture (diagram, with postgres-exporter added)
      • Components: Vue.js (Frontend), Spring Boot (Backend), PostgreSQL (Database) + postgres-exporter, OpenTelemetry, Prometheus (Metrics), Loki (Logs), Tempo (Traces), Grafana (Dashboard)
      • People: Users, Managers, Ops, Devs
      • Flow labels: API access; Push metrics, Push logs, Push traces; Fetch metrics, Fetch logs, Fetch traces
  15. Architecture (diagram, with mcp-grafana)
      • Components: Spring Boot (Backend), PostgreSQL (Database), OpenTelemetry, Prometheus (Metrics), Loki (Logs), Tempo (Traces), Grafana (Dashboard), mcp-grafana, AI / LLM
      • People: Managers, Ops, Devs
      • Flow labels: Push metrics, Push logs, Push traces; Fetch metrics, Fetch logs, Fetch traces
      Image credit (new on this slide): James Smith from the Noun Project
  16. Architecture (diagram, with loki-mcp)
      • Components: Spring Boot (Backend), PostgreSQL (Database), OpenTelemetry, Prometheus (Metrics), Loki (Logs), Tempo (Traces), Grafana (Dashboard), loki-mcp, AI / LLM
      • People: Managers, Ops, Devs
      • Flow labels: Push metrics, Push logs, Push traces; Fetch metrics, Fetch logs, Fetch traces; Fetch logs (loki-mcp, AI / LLM)
  17. A Real-World RCA Report
      The generated RCA may include:
      • Executive summary
      • Customer impact
      • Incident timeline
      • Root cause
      • Contributing factors
      • Detection and response
      • Resolution
      • Corrective and preventive actions
      • Follow-up owners
      Icon: https://thenounproject.com/icon/ai-report-6818652/
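
One way to keep drafts consistent is to pin the section list in code and let the model fill in fields rather than free text. The sketch below uses the section names from the list above. The `RcaDraft` class, the "TBD" default, and the sample root cause are illustrative assumptions, not the format used in the talk.

```python
from dataclasses import dataclass, fields


@dataclass
class RcaDraft:
    # Section names follow the slide; "TBD" marks sections not yet filled in.
    executive_summary: str = "TBD"
    customer_impact: str = "TBD"
    incident_timeline: str = "TBD"
    root_cause: str = "TBD"
    contributing_factors: str = "TBD"
    detection_and_response: str = "TBD"
    resolution: str = "TBD"
    corrective_and_preventive_actions: str = "TBD"
    follow_up_owners: str = "TBD"

    def render(self) -> str:
        """Render every section in a fixed order so drafts stay comparable."""
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            parts.append(f"{title}\n{getattr(self, f.name)}")
        return "\n\n".join(parts)


# Hypothetical content, only to show the rendering.
draft = RcaDraft(root_cause="Connection pool exhausted (hypothetical example)")
print(draft.render())
```

Sections left at "TBD" stay visible in the output, which makes gaps obvious to the reviewer instead of silently omitted.
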
  18. Incident workflow (without AI)
      1. Incident detected
      2. Operations (Ops) mitigates the issue
      3. Ops collects logs and reconstructs the incident timeline
      4. Developers (Devs) find the root cause using the material from Ops
      5. Someone writes the RCA report
  19. The AI-assisted investigation workflow
      1. The engineer provides the incident time range
      2. AI queries monitoring and log tools through MCP
      3. AI identifies abnormal events
      4. AI correlates evidence across multiple sources
      5. AI proposes possible causes
      6. AI generates a structured RCA draft
      7. The engineer validates the findings
      Icon: https://thenounproject.com/icon/ai-report-6818652/
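
The workflow can be sketched as a small pipeline. This is an illustration under stated assumptions, not the talk's implementation: `query_logs` is a stub standing in for an MCP tool call, "abnormal" is reduced to a simple per-minute error threshold, and the causes are left empty for the model to propose. The point of the sketch is the last field, which stays false until a human signs off.

```python
from collections import Counter
from datetime import datetime


def query_logs(start, end):
    """Stub for step 2; a real setup would call a log tool through MCP."""
    rows = [
        (datetime(2026, 1, 1, 10, 0, 5), "ERROR"),
        (datetime(2026, 1, 1, 10, 0, 40), "ERROR"),
        (datetime(2026, 1, 1, 10, 0, 50), "ERROR"),
        (datetime(2026, 1, 1, 10, 1, 10), "INFO"),
        (datetime(2026, 1, 1, 10, 2, 30), "ERROR"),
    ]
    return [row for row in rows if start <= row[0] < end]


def abnormal_minutes(rows, threshold=2):
    """Step 3: flag minutes whose ERROR count exceeds the threshold."""
    per_minute = Counter(
        ts.replace(second=0, microsecond=0)
        for ts, level in rows if level == "ERROR"
    )
    return sorted(minute for minute, count in per_minute.items() if count > threshold)


def draft_rca(start, end):
    rows = query_logs(start, end)
    return {
        "time_range": (start, end),                # step 1: given by the engineer
        "abnormal_minutes": abnormal_minutes(rows),
        "possible_causes": [],                     # step 5: proposed by the LLM
        "validated_by_engineer": False,            # step 7: set only after human review
    }


draft = draft_rca(datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 5))
print(draft["abnormal_minutes"])
```
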
  20. Add some automation
      • Use a watchdog (e.g. Ansible EDA) to monitor service status
      • Run mitigation scripts if the service crashes
      • Collect the required information
      • Inform Ops so they can check the issue
      • Asynchronously call the AI / LLM to investigate and draft the RCA
      • Restart the service
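
The ordering of those steps is the important part: evidence is captured before the restart wipes it, and the RCA drafting is handed off rather than waited on. A minimal sketch of that ordering follows. It is plain Python rather than an Ansible EDA rulebook, and the `handle_crash` function, its four hook parameters, and the `fortune-api` service name are all hypothetical.

```python
def handle_crash(service, capture, notify, enqueue_rca, restart):
    """Run the crash-handling steps in order and return the steps taken.

    All four callables are hypothetical hooks supplied by the caller.
    """
    steps = []
    evidence = capture(service)        # keep the required info before it is lost
    steps.append("capture")
    notify(f"{service} crashed; evidence captured")
    steps.append("notify")
    enqueue_rca(service, evidence)     # hand off to the LLM without waiting for it
    steps.append("enqueue_rca")
    restart(service)
    steps.append("restart")
    return steps


# Usage with in-memory stand-ins for the real hooks.
queue, messages, restarted = [], [], []
steps = handle_crash(
    "fortune-api",
    capture=lambda s: {"service": s, "last_logs": ["placeholder log line"]},
    notify=messages.append,
    enqueue_rca=lambda s, e: queue.append((s, e)),
    restart=restarted.append,
)
print(steps)
```
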
  21. Architecture (diagram, with watchdog)
      • Components: Vue.js (Frontend), Spring Boot (Backend), PostgreSQL (Database), Ansible EDA (WatchDog)
      • People: Users
      • Flow labels: API access; Monitoring
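  22. Architecture (diagram, automated flow)
      • Components: Spring Boot (Backend), Ansible EDA (WatchDog), AI / LLM, ticket system
      • Ansible EDA: monitors the service
      • If the service crashes: keep the required information, restart the service
      • Ansible EDA informs the AI / LLM
      • AI / LLM: analyzes the root cause, generates the RCA report draft, sends the draft to the ticket system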
  23. What AI Can Help With
      • Generate incident timelines
      • Summarize logs and alerts
      • Compare normal vs abnormal behavior
      • Identify related deployments or config changes
      • Draft structured RCA sections
      Icon: https://thenounproject.com/icon/robot-8347005/ (created by RF_Design)
  24. What AI Can Do Well 😎
      • Search across multiple operational tools
      • Reduce repetitive copy-and-paste work
      • Correlate events by timestamp
      • Translate technical evidence into readable summaries
      • Generate different versions for engineers and management
      • Maintain a consistent RCA format
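
Correlating events by timestamp is, at its simplest, a merge of several already-sorted streams into one timeline. The sketch below shows that with `heapq.merge` from the standard library. The three event lists and their contents are made-up examples.

```python
import heapq
from datetime import datetime

# Hypothetical events from three sources, each already sorted by time.
logs = [(datetime(2026, 1, 1, 10, 0, 5), "log", "db connection refused")]
alerts = [(datetime(2026, 1, 1, 10, 0, 30), "alert", "HTTP 5xx rate high")]
deploys = [(datetime(2026, 1, 1, 9, 58, 0), "deploy", "backend v2 rolled out")]


def build_timeline(*sources):
    """Merge pre-sorted event streams into one chronological timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


for ts, source, text in build_timeline(logs, alerts, deploys):
    print(ts.isoformat(), f"[{source}]", text)
```

A merged timeline only shows what happened in which order. Whether the deploy actually caused the errors that follow it is still a judgment for the engineer.
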
  25. What AI Cannot Reliably Do 🫨
      • It may confuse correlation with causation
      • It may miss context that was never recorded
      • It may trust misleading 🤨 or incomplete logs
      • It may generate confident but unsupported conclusions 🫨
      • It cannot decide organizational priorities
      • It cannot replace system ownership
  26. AI accelerates writing, but does not replace judgment
      • Data quality matters
      • Missing or wrong context leads to weak or wrong conclusions
      • AI may summarize overconfidently (hallucination)
      • Human review is mandatory
      • RCA should improve systems, not assign blame
  27. To avoid "Garbage in, garbage out"
      • Normalize logs
      • Write tools (custom scripts or a custom MCP server) that clean up logs and minimize context
      • Give tools clear semantics and descriptions
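
As one example of a cleanup step, a small script can strip timestamps, mask numbers, and collapse repeated lines into counted patterns before anything reaches the model. This is a minimal sketch assuming ISO-8601 timestamps at the start of each line. The regexes and the sample lines are illustrative and would need tuning for real logs.

```python
import re
from collections import Counter

# Assumes each line starts with an ISO-8601 timestamp such as 2026-01-01T10:00:05Z.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*\s+")
NUMBER = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Strip the leading timestamp and mask numbers so similar lines collapse."""
    without_ts = TIMESTAMP.sub("", line.strip())
    return NUMBER.sub("<n>", without_ts)


def summarize(lines):
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(normalize(line) for line in lines).most_common()


raw = [
    "2026-01-01T10:00:05Z ERROR request 1042 timed out after 3000 ms",
    "2026-01-01T10:00:07Z ERROR request 1043 timed out after 3000 ms",
    "2026-01-01T10:00:09Z INFO request 1044 ok",
]
for pattern, count in summarize(raw):
    print(count, pattern)
```

Masking numbers loses detail (request IDs, durations), so a real tool would likely keep a way to fetch the original lines for a given pattern.
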
  28. Practical Adoption Strategy 🔐
      • Begin with read-only MCP tools
      • Limit access to selected data sources
      • Start with RCA drafting, not automated remediation
      • Add mandatory human review
      • Evaluate accuracy using historical incidents
      • Expand only after trust is established
      Icon: https://thenounproject.com/icon/security-6255418/
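
"Read-only first" can be enforced in a thin dispatch layer instead of being left to the prompt. The sketch below refuses any tool that is not on an explicit allowlist. The tool names, the `registry` dict, and the `call_tool` function are hypothetical and do not correspond to any particular MCP client's API.

```python
# Hypothetical tool names; only these may be called.
READ_ONLY_TOOLS = {"query_logs", "query_metrics", "list_alerts"}


class ToolNotAllowed(Exception):
    """Raised when a tool outside the read-only allowlist is requested."""


def call_tool(name, registry, **kwargs):
    """Dispatch a tool call only if the tool is on the read-only allowlist."""
    if name not in READ_ONLY_TOOLS:
        raise ToolNotAllowed(f"tool {name!r} is not on the read-only allowlist")
    return registry[name](**kwargs)


# In-memory stand-ins: one read tool and one write tool.
registry = {
    "query_logs": lambda query: [f"result for {query}"],
    "restart_service": lambda service: f"restarted {service}",
}

print(call_tool("query_logs", registry, query="level=error"))
```

An allowlist fails closed: a newly added write tool stays unreachable until someone deliberately adds it.
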
  29. On-premises LLM
      • Which LLM you can run depends on your hardware
      • Provider: LM Studio (Mac), llama.cpp (Windows), or Ollama
      • Desktop app: Cherry Studio
      • CLI: OpenCode, or Claude Code with a custom provider
      • VS Code plugin: Continue
      Icon: https://thenounproject.com/icon/on-premise-7209053/
  30. Hardware
      • At least 48 GB of VRAM (or unified memory)
      • NVIDIA GPUs
      • Macs: prefer the MLX version of the model
  31. On-premises LLM models
      • Use a model that supports tool calling
        • google/gemma-4-26b-a4b-qat
        • qwen/qwen3.6-27b
      • Watch the context length (recommended: 64k or more; the default is 8192)
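
A back-of-the-envelope check shows why the context length matters. The sketch below compares the roughly 2 GB of logs mentioned on the RCA-pain slide with a 64k-token window. The 4-bytes-per-token ratio is a rough assumed average (real tokenizers vary by model and content), and the 8,000-token reserve for prompts and the answer is likewise an illustrative choice.

```python
def estimated_tokens(num_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Very rough token estimate; bytes_per_token is an assumed average."""
    return int(num_bytes / bytes_per_token)


def fits_in_context(num_bytes: int, context_tokens: int,
                    reserve_tokens: int = 8_000) -> bool:
    """True if the text fits once room is reserved for prompts and the answer."""
    return estimated_tokens(num_bytes) <= context_tokens - reserve_tokens


raw_logs = 2 * 1024**3   # about 2 GB of raw logs
context = 64 * 1024      # a 64k-token window

# How many full windows the raw logs would fill under these assumptions.
print(estimated_tokens(raw_logs) // context)
# Whether a 100 KB filtered excerpt fits.
print(fits_in_context(100 * 1024, context))
```

Under these assumptions the raw logs would fill thousands of windows, while a filtered excerpt of about 100 KB fits comfortably.
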
  32. Related CLIs
      • gcx: a CLI for managing Grafana Cloud resources. https://github.com/grafana/gcx
      • LogCLI: a command-line tool for querying and exploring logs in Grafana Loki. https://github.com/grafana/loki/releases
      Icon: https://thenounproject.com/icon/terminal-8126394/
  33. The CLI as you understand it isn’t LLM-friendly
      What the LLM likes:
      • One command, one action, no excuses.
      • Output that is ideally JSON.
      What the LLM doesn't like:
      • Interactive CLIs that pause for input
      • TUIs (terminal UIs)
      Icon: https://thenounproject.com/icon/bot-8249797/
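
A minimal sketch of a CLI shaped that way: every input arrives as an argument, nothing prompts, and the result is a single JSON object. The `logcount` name and its flags are made up for illustration and are unrelated to gcx or LogCLI.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="logcount", description="Count log lines that contain a level."
    )
    parser.add_argument("--level", required=True, help="log level to count")
    parser.add_argument("lines", nargs="*", help="log lines to inspect")
    return parser


def run(argv) -> str:
    """One command, one action: no prompts, one JSON object as the result."""
    args = build_parser().parse_args(argv)
    count = sum(1 for line in args.lines if args.level in line)
    return json.dumps({"level": args.level, "count": count})


print(run(["--level", "ERROR", "ERROR db down", "INFO ok", "ERROR timeout"]))
```

Keeping `run` as a pure function of its arguments also makes the tool easy to test without spawning a process.
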
  34. My experience (1/2)
      • With an on-prem LLM, MCP is the recommended route.
      • The on-prem LLM doesn’t seem to invoke the CLI.
      • Limited by its parameter count (and the hardware), the on-prem LLM can only provide simple answers.
      Icon: https://thenounproject.com/icon/idea-8365882/
  35. My experience (2/2)
      • Load only the MCPs that the context actually needs.
      • Loading too many MCPs will consume context and muddle the conversation.
      Icon: https://thenounproject.com/icon/idea-8365882/
  36. Takeaways
      • Day 2's hidden tax is the RCA, not just the outage
      • MCP turns your real ops tools into grounded context for the LLM
      • AI does the tedious aggregation; humans keep the judgment
      • Assistant, not autopilot; mind the data boundary
  37. Do not automate accountability. Automate the tedious work.
      • Let AI organize the evidence.
      • Let engineers focus on improving the system.