Incident Response Assistant

My teammate and I were the only two Site Reliability Engineers in the company.

We watched developers spend far too long during an incident trying to figure out why the issue was occurring. The exact same issue had happened twice before. Both post-mortems were in Confluence. Both Slack threads were still there. None of it was findable when the pager was going off.

That was the light-bulb moment.

We decided we should create a tool that would help developers find answers quickly.

Our Goal Was Simple

Give every developer (almost certainly the first responder at 2 a.m.) immediate, context-aware access to the mitigations that worked in past incidents.

Just type what you’re seeing and get the answer that worked last time.

What We Shipped

A clean OpenWebUI interface

You type something like:

“redis memory slowly climbing for two days”

and get back:

The post-mortem from 11 months ago
The exact Slack message where the fix was discussed
The Confluence runbook
The Jira ticket with the root cause
Three similar incidents with similarity scores
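
To make "similar incidents with similarity scores" concrete, here is a minimal sketch of the ranking idea. It is not our production code. In the real tool the vectors come from an embedding model and the search runs inside the vector database; in this sketch a toy bag-of-words vector stands in for the embedding, and the names `IncidentDoc`, `toy_embed`, and `top_matches`, along with the sample incidents, are made up for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentDoc:
    source: str  # illustrative labels such as "slack", "confluence", "jira"
    title: str
    text: str


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query: str, docs: list[IncidentDoc], k: int = 3) -> list[tuple[float, IncidentDoc]]:
    """Return the k documents most similar to the query, highest score first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(doc.text)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical sample data, invented for this sketch.
    docs = [
        IncidentDoc("confluence", "Redis memory growth post-mortem",
                    "redis memory climbing slowly until eviction started"),
        IncidentDoc("jira", "Checkout 502s",
                    "load balancer returned 502 after deploy"),
        IncidentDoc("slack", "Disk full on build agents",
                    "disk usage climbing on build agents"),
    ]
    for score, doc in top_matches("redis memory slowly climbing for two days", docs):
        print(f"{score:.2f}  [{doc.source}] {doc.title}")
```

A real embedding model captures meaning rather than exact word overlap, so it can match "memory leak" to "memory slowly climbing" where this toy version cannot. The ranking step, score every document against the query and keep the top few, is the same idea.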

The Stack

Frontend: OpenWebUI
Backend: ECS Fargate
Vector DB: Weaviate
Embeddings & LLM: AWS Bedrock
Data sources (all automated, zero manual work):
- Every historical incident Slack channel
- The relevant Confluence space (runbooks, post-mortems, architecture)
- Every closed incident ticket in Jira
- Every post-mortem
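
Before any of these sources can be searched, the text has to be cut into pieces small enough to embed, with a note of where each piece came from. The sketch below shows one common way to do that: overlapping word windows tagged with source metadata. Treat it as an assumption about how such a pipeline could look, not a description of ours. The `Chunk` fields, the `chunk_document` helper, and the default sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str    # illustrative labels such as "slack", "confluence", "jira"
    doc_id: str    # identifier of the original thread, page, or ticket
    position: int  # order of this chunk within the document
    text: str


def chunk_document(source: str, doc_id: str, text: str,
                   max_words: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split text into overlapping word windows so context survives chunk borders."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[Chunk] = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append(Chunk(source, doc_id, position, " ".join(window)))
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Keeping `source` and `doc_id` on every chunk is what lets a search result link back to the original Slack thread, Confluence page, or Jira ticket instead of returning an anonymous snippet.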

Final Thought

Sometimes the highest-leverage thing two SREs can do is build a tool that supports and enables developers.