Escape RCA Writing Hell: Using MCP to Let AI Handle the Last Mile of Operations 告別 RCA 撰寫地獄：利用 MCP 讓 AI 替你完成維運最後一哩路 @ DevOpsDays 2026

Escape RCA Writing Hell: Using MCP to Let AI Handle the Last Mile of Operations
Johnny Sung

Full stack developer Johnny Sung (宋岡諺)
https://fb.com/j796160836
https://blog.jks.coffee/
https://www.slideshare.net/j796160836
https://github.com/j796160836

Agenda
• The Day 2 reality & the RCA pain
• MCP & the architecture
• Demo: a sample app, bugs and AI-drafted RCAs
• A real-world RCA for comparison
• Strengths, limits, privacy
• Takeaways

https://www.dynatrace.com/news/blog/what-is-devops/#&gid=0&pid=1

Day 1 = ship it. 👶
Day 2 = keep it alive. 🍼🍼🥁🥁💩💩💩

Section B — The Problem

The Day 2 Operations Reality
• Incidents happen under pressure
• SLA clocks are always ticking
• Engineers focus on firefighting first
• Documentation often comes later
https://medium.com/@CWSkelly/analysis-this-is-fine-meme-e8980ff61e78

Root Cause Analysis (RCA)

RCA is necessary, but painful
• 70% of RCA time goes into “pulling data and structuring the context.”
• 30% is left for real analysis and improvement suggestions.

RCA is necessary, but painful
• Context is scattered across tools
• Logs are noisy (approximately 2-3 GB)
• Chat history is fragmented
• Details fade quickly after recovery
Created by darwis

https://x.com/lcMenci/status/1991504102954815980

Section C — The Approach

Model Context Protocol (MCP)

https://mcpcn.com/en/blog/understanding-mcp-protocol/

https://www.unimedia.tech/mcp-model-context-protocol-the-usb-of-modern-artificial-intelligence/

https://pixabay.com/zh/illustrations/octopus-drawing-line-art-9025562/ LLM

https://pixabay.com/zh/illustrations/octopus-drawing-line-art-9025562/
[Diagram labels: LLM at the center; four MCP connections, each leading to Data; one also leading to Action]

3 types of MCP Servers
• Local stdio server ✅
• Remote SSE (Server-Sent Events) server (deprecated)
• Remote HTTP server
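
With the local stdio option, the client starts the server as a child process and the two exchange JSON-RPC 2.0 messages over stdin/stdout, one message per line. The sketch below only shows how a client might frame a `tools/call` request. The tool name `query_logs` and its arguments are hypothetical and do not come from any real server.

```python
import json
from itertools import count

_ids = count(1)  # each request in a session gets a fresh id


def frame_request(method: str, params: dict) -> bytes:
    """Serialize one JSON-RPC 2.0 request as a single newline-terminated line."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return (json.dumps(message, ensure_ascii=False) + "\n").encode("utf-8")


def parse_line(line: bytes) -> dict:
    """Decode one line read from the server's stdout."""
    return json.loads(line.decode("utf-8"))


# Hypothetical tool name and arguments, for illustration only.
request = frame_request(
    "tools/call",
    {"name": "query_logs", "arguments": {"query": "error", "limit": 50}},
)
```

In practice an MCP client library does this framing for you; the point is only that a stdio server needs no open network port.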

MCP Servers
• loki-mcp
  • https://github.com/grafana/loki-mcp
• mcp-grafana
  • https://github.com/grafana/mcp-grafana
(Which one fits depends on your system structure. For this demo's structure, choose one of the two options above.)

https://github.com/grafana/loki-mcp

https://github.com/grafana/mcp-grafana

MCP connects AI to real operational context
• MCP lets LLMs access external tools safely
• Tools can provide logs, metrics, alerts, tickets, and deployment records (depending on which MCP tools you connect)
• AI becomes a structured assistant, not an autonomous operator

Use MCP to fetch logs
• Dump all the logs into the LLM context? No way.
• With MCP, let the LLM fetch data on demand and grab only what it needs.
https://thenounproject.com/icon/crane-7972970/
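
To make "fetch on demand" concrete, here is a minimal sketch of the kind of narrow, time-boxed query a log tool might issue against Loki's `query_range` HTTP endpoint instead of exporting everything. The base URL, the `service_name` label, and its value are assumptions for illustration; the demo's real labels may differ.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_loki_query_url(base_url, logql, start, end, limit=200):
    """Build a Loki query_range URL for a small window and a capped result size."""
    params = {
        "query": logql,
        # Loki accepts Unix epoch nanoseconds for start and end.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": limit,           # cap the number of returned log lines
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Assumed incident window and label names, for illustration only.
start = datetime(2026, 1, 1, 10, 0, tzinfo=timezone.utc)
end = start + timedelta(minutes=15)
url = build_loki_query_url(
    "http://localhost:3100",
    '{service_name="fortune-backend"} |= "ERROR"',
    start,
    end,
    limit=50,
)
```

The LLM only ever sees the 50 matching lines for the 15-minute window, not the multi-gigabyte raw log.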

Section D — How to do

Today's Demo App: Fortune-telling (fortune drawing) demo

Today's Demo App: Fortune-telling (fortune drawing) App
• Frontend: Vue.js
• Backend: Spring Boot 4.x + Kotlin 2.2, Java 17 toolchain + OpenTelemetry
• DB: PostgreSQL
• Grafana LGTM for demo use
  • Grafana UI
  • Mimir/Prometheus (metric storage)
  • Loki (log storage)
  • Tempo (trace storage)
• postgres_exporter collects DB metrics
Demo source code: https://github.com/j796160836/devopsdays-fortune-telling-demo

Architecture
[Diagram: Users reach the Vue.js (Frontend), which calls the Spring Boot (Backend) via API access; the backend uses PostgreSQL (Database). Spring Boot + OpenTelemetry pushes metrics, logs, and traces to Prometheus (Metrics), Loki (Logs), and Tempo (Traces). Grafana (Dashboard) fetches metrics, logs, and traces for Managers, Ops, and Devs.]

Architecture
[Diagram: the same architecture, with postgres-exporter added next to PostgreSQL (Database).]

Architecture
[Diagram: the same observability stack (Spring Boot backend, PostgreSQL, OpenTelemetry, Prometheus, Loki, Tempo, Grafana), with an AI / LLM added and connected through mcp-grafana.]

Architecture
[Diagram: the same observability stack, with an AI / LLM added that fetches logs through loki-mcp.]

Section E — Real-World RCA Report

A Real-World RCA Report
• The generated RCA may include:
  • Executive summary
  • Customer impact
  • Incident timeline
  • Root cause
  • Contributing factors
  • Detection and response
  • Resolution
  • Corrective and preventive actions
  • Follow-up owners
https://thenounproject.com/icon/ai-report-6818652/
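
One way to keep AI-drafted reports in a consistent format is to fix the section list in code and let the draft fill it in. This is a minimal sketch, not the format used in the talk: the `RcaDraft` class and the "TODO" placeholder text are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Section names taken from the slide above.
RCA_SECTIONS = [
    "Executive summary",
    "Customer impact",
    "Incident timeline",
    "Root cause",
    "Contributing factors",
    "Detection and response",
    "Resolution",
    "Corrective and preventive actions",
    "Follow-up owners",
]


@dataclass
class RcaDraft:
    title: str
    sections: dict = field(default_factory=dict)

    def missing(self):
        """Sections the draft has not filled in yet."""
        return [s for s in RCA_SECTIONS if not self.sections.get(s, "").strip()]

    def render(self):
        """Render every section in a fixed order, flagging the empty ones."""
        lines = [self.title, ""]
        for name in RCA_SECTIONS:
            body = self.sections.get(name, "").strip() or "TODO (needs human input)"
            lines += [name, body, ""]
        return "\n".join(lines)


# Made-up content, for illustration only.
draft = RcaDraft("RCA draft: fortune-telling backend outage",
                 {"Root cause": "Connection pool exhausted."})
```

The `missing()` list gives the reviewing engineer a quick checklist of what the draft still lacks.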

Incident Workflow (without AI)
1. Incident detected
2. Operations (Ops) mitigates the issue
3. Ops collects logs and reconstructs the incident timeline
4. Developers (Devs) find the root cause from the materials Ops provides
5. Write the RCA report

The AI-Assisted Investigation Workflow
1. The engineer provides the incident time range
2. AI queries monitoring and log tools through MCP
3. AI identifies abnormal events
4. AI correlates evidence across multiple sources
5. AI proposes possible causes
6. AI generates a structured RCA draft
7. The engineer validates the findings
https://thenounproject.com/icon/ai-report-6818652/
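
Steps 1 and 4 boil down to filtering events by the incident window and merging them across sources in time order. The sketch below shows that idea on in-memory data; the source names and event messages are made up, and a real run would get them from MCP tool results.

```python
from datetime import datetime, timedelta, timezone


def build_timeline(sources, start, end):
    """Merge (timestamp, message) events from several sources into one ordered timeline."""
    merged = [
        (ts, name, message)
        for name, events in sources.items()
        for ts, message in events
        if start <= ts <= end  # keep only events inside the incident window
    ]
    return sorted(merged, key=lambda event: event[0])


# Made-up events, for illustration only.
t0 = datetime(2026, 1, 1, 10, 0, tzinfo=timezone.utc)
sources = {
    "loki": [
        (t0 + timedelta(seconds=30), "ERROR connection pool exhausted"),
        (t0 - timedelta(hours=1), "INFO scheduled job finished"),
    ],
    "deployments": [(t0 + timedelta(seconds=5), "deployed backend v2")],
}
timeline = build_timeline(sources, t0, t0 + timedelta(minutes=5))
```

A deployment followed 25 seconds later by errors is a correlation worth surfacing, but it remains a hypothesis until the engineer validates it in step 7.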

Adding some automation
• Use a watchdog (e.g. Ansible EDA) to monitor service status
• Run mitigation scripts if the service crashes
• Fetch the required information
• Inform Ops to check the issue
• Asynchronously call the AI / LLM to investigate and draft the RCA
• Restart the service
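
The watchdog flow can be sketched as a single handler. This is plain Python to illustrate the ordering only; it is not an Ansible EDA rulebook, and the `capture`, `restart`, and `notify` callables are hypothetical hooks. The order shown (capture evidence before restarting) is an assumption based on the "keep required infos, then restart" arrows in the diagram.

```python
import queue

# Drained by a separate worker that calls the LLM, so mitigation never waits on it.
rca_jobs = queue.Queue()


def handle_crash(service, capture, restart, notify):
    """Mitigate a crashed service and queue an RCA draft job."""
    evidence = capture(service)  # keep the required info first; a restart may wipe it
    restart(service)
    notify(f"{service} crashed and was restarted; please check")
    rca_jobs.put({"service": service, "evidence": evidence})  # asynchronous hand-off
    return evidence
```

Queueing the RCA job instead of calling the LLM inline keeps recovery time independent of how long the investigation takes.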

Architecture
[Diagram: Users reach the Vue.js (Frontend), which calls the Spring Boot (Backend) via API access; the backend uses PostgreSQL (Database). Ansible EDA (WatchDog) monitors the backend.]

Architecture
[Diagram: Ansible EDA (WatchDog) monitors the Spring Boot (Backend) service. If the service crashes, it keeps the required info and restarts the service.]

Architecture
[Diagram: Ansible EDA (WatchDog) monitors the Spring Boot (Backend) service. If the service crashes, it keeps the required info, restarts the service, and informs the AI / LLM, which analyzes the root cause, generates an RCA report draft, and sends the draft to the ticket system.]

What AI Can Help With
• Generate incident timelines
• Summarize logs and alerts
• Compare normal vs abnormal behavior
• Identify related deployments or config changes
• Draft structured RCA sections
https://thenounproject.com/icon/robot-8347005/
Created by RF_Design

What AI Can Do Well 😎
• Search across multiple operational tools
• Reduce repetitive copy-and-paste work
• Correlate events by timestamp
• Translate technical evidence into readable summaries
• Generate different versions for engineers and management
• Maintain a consistent RCA format

What AI Cannot Reliably Do 🫨
• It may confuse correlation with causation
• It may miss context that was never recorded
• It may trust misleading 🤨 or incomplete logs
• It may generate confident but unsupported conclusions 🫨
• It cannot decide organizational priorities
• It cannot replace system ownership

Section F — What I learned

AI accelerates writing, but does not replace judgment
• Data quality matters
• Missing or wrong context leads to weak or wrong conclusions
• AI may overconfidently summarize (hallucination)
• Human review is mandatory
• RCA should improve systems, not assign blame

https://www.linkedin.com/pulse/garbage-out-chanthoeun-chiv/

To avoid "Garbage in, garbage out"
• Log normalization
• Write tools (custom scripts or a custom MCP server) to clean up logs and minimize context
• Clear tool semantics and descriptions
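
A minimal sketch of what log normalization could look like: variable parts (timestamps, UUIDs, numbers) are replaced with placeholders so repeated messages collapse into one template with a count. The regular expressions and placeholder names are illustrative assumptions and would need tuning for a real log format.

```python
import re
from collections import Counter

# Applied in order: the more specific patterns run before the generic number pattern.
_PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<ts>"),
    (re.compile(r"\b\d+\b"), "<n>"),
]


def normalize(line):
    """Replace variable tokens so similar log lines share one template."""
    for pattern, token in _PATTERNS:
        line = pattern.sub(token, line)
    return line.strip()


def summarize(lines, top=20):
    """Return the most frequent log templates with their counts."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(top)


# Made-up log lines, for illustration only.
sample = [
    "2026-01-01T10:00:01Z ERROR timeout after 3000 ms user=42",
    "2026-01-01T10:00:05Z ERROR timeout after 3012 ms user=7",
    "2026-01-01T10:00:06Z INFO request ok",
]
summary = summarize(sample)
```

Sending a handful of templates with counts to the LLM uses far less context than sending every raw line.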

https://www.threads.com/@yeamao_31924/post/DD4SV9fS5Mc

Practical Adoption Strategy 🔐
• Begin with read-only MCP tools
• Limit access to selected data sources
• Start with RCA drafting, not automated remediation
• Add mandatory human review
• Evaluate accuracy using historical incidents
• Expand only after trust is established
https://thenounproject.com/icon/security-6255418/
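
The first two points can be sketched as a filter applied before any tool is exposed to the LLM. The tool names, the `source` field, and the verb-prefix convention below are hypothetical; a naming check like this is only a first line of defense, and real enforcement should rest on read-only credentials at the data source.

```python
READ_ONLY_VERBS = ("query", "list", "get", "search")


def select_tools(tools, allowed_sources):
    """Keep tools that look read-only and target an approved data source."""
    selected = []
    for tool in tools:
        verb = tool["name"].split("_", 1)[0]
        if verb in READ_ONLY_VERBS and tool["source"] in allowed_sources:
            selected.append(tool["name"])
    return selected


# Hypothetical tool catalog, for illustration only.
catalog = [
    {"name": "query_logs", "source": "loki"},
    {"name": "delete_dashboard", "source": "grafana"},
    {"name": "list_alerts", "source": "grafana"},
    {"name": "query_invoices", "source": "billing"},
]
exposed = select_tools(catalog, allowed_sources={"loki", "grafana"})
```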

Section G — Some supplements

How about an on-premises (self-hosted) LLM?

On-premises LLM
• Which LLM you can run depends on your hardware
• Provider: LM Studio (Mac), llama.cpp (Windows), or Ollama
• Desktop app: Cherry Studio
• CLI: OpenCode, or Claude Code with a custom provider
• VS Code plugin: Continue
https://thenounproject.com/icon/on-premise-7209053/

https://opencode.ai/

https://cherryai.com/

Hardware
• At least 48 GB of VRAM (or unified memory)
• NVIDIA GPUs
• Macs: prefer MLX models

On-premises LLM Models
• Use a model that supports tool calling
  • google/gemma-4-26b-a4b-qat
  • qwen/qwen3.6-27b
• Watch the context length (recommended: 64k or more; the default is 8192)
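
A back-of-the-envelope check shows why the 8192 default is tight for log analysis. The sketch assumes roughly 4 characters per token, which is a crude heuristic and not a real tokenizer; actual counts vary by model and are usually worse for non-English text. The reserved answer budget is also an assumption.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(chunks, context_length=65536, reserved_for_answer=4096):
    """Return (fits, used_tokens) for a list of text chunks against a context budget."""
    used = sum(estimate_tokens(chunk) for chunk in chunks)
    return used <= context_length - reserved_for_answer, used


# 100 made-up log lines of 400 characters each.
logs = ["x" * 400] * 100
fits_default, used = fits_in_context(logs, context_length=8192)
fits_64k, _ = fits_in_context(logs, context_length=65536)
```

Even this small batch overflows an 8192-token window under the heuristic, while it fits comfortably in a 64k one.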

It turned into a parameter description...

⚠ Error

MCP vs CLI?

https://x.com/alexxubyte/status/2041176691087937599/photo/1

Related CLI
• gcx: a CLI for managing Grafana Cloud resources
  • https://github.com/grafana/gcx
• LogCLI: a command-line tool for querying and exploring logs in Grafana Loki
  • https://github.com/grafana/loki/releases
https://thenounproject.com/icon/terminal-8126394/

https://github.com/grafana/gcx

https://grafana.com/docs/loki/latest/query/logcli/getting-started/

The CLI as you know it isn’t LLM-friendly
What LLMs like
• One command, one action, no excuses.
• The output should ideally be in JSON.
What LLMs don't like
• Interactive CLIs that pause for input
• TUIs (terminal UIs)
https://thenounproject.com/icon/bot-8249797/
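
As an illustration of "one command, one action, JSON out", here is a minimal sketch of a non-interactive wrapper. The command name `svc-status`, its flags, and the hard-coded result are hypothetical; a real tool would query the monitoring backend where the comment indicates.

```python
import argparse
import json


def run(argv):
    """Parse arguments, perform exactly one action, and return JSON text."""
    parser = argparse.ArgumentParser(
        prog="svc-status", description="Print one service's status as JSON."
    )
    parser.add_argument("service")
    parser.add_argument("--since", default="15m", help="look-back window")
    args = parser.parse_args(argv)
    # Hypothetical static result; a real tool would query the backend here.
    result = {"service": args.service, "since": args.since, "status": "unknown"}
    return json.dumps(result)


def main(argv=None):
    """Entry point: no prompts, no pager, no TUI, just one line of JSON."""
    print(run(argv))
    return 0
```

Everything the command needs comes from its arguments, so an LLM can call it once and parse the result without handling prompts.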

My experience (1/2)
• With an on-prem LLM, using MCP is recommended.
• The on-prem LLM doesn’t seem to invoke the CLI.
• Limited by its parameter count (and the hardware), the on-prem LLM can only provide simple answers.
https://thenounproject.com/icon/idea-8365882/

My experience (2/2)
• Load only the MCPs that the context actually needs.
• Loading too many MCPs will consume context and muddle the conversation.
https://thenounproject.com/icon/idea-8365882/

Takeaways
• Day 2's hidden tax is the RCA, not just the outage
• MCP turns your real ops tools into grounded context for the LLM
• AI does the tedious aggregation; humans keep the judgment
• Assistant, not autopilot: mind the data boundary

Do not automate accountability. Automate the tedious work.
• Let AI organize the evidence.
• Let engineers focus on improving the system.

Q & A
Thanks for listening
https://pixabay.com/photos/rca-plug-connection-cable-4737593/
