My teammate and I were the only two Site Reliability Engineers in the company.
We watched developers spend far too long during an incident trying to figure out why the issue was occurring. The exact same issue had happened twice before. Both post-mortems were in Confluence. Both Slack threads were still there. None of it was findable when the pager was going off.
That was the light-bulb moment.
We decided we should create a tool that would help developers find answers quickly.
Our Goal Was Simple
Give every developer (almost certainly the first responder at 2 a.m.) immediate, context-aware access to the mitigations that worked before.
Just type what you’re seeing and get the answer that worked last time.
What We Shipped
A clean OpenWebUI interface
You type something like:
“redis memory slowly climbing for two days”
and get back:
- The post-mortem from 11 months ago
- The exact Slack message where the fix was discussed
- The Confluence runbook
- The Jira ticket with the root cause
- Three similar incidents with similarity scores
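To make the "similarity scores" idea concrete, here is a minimal sketch of ranking past incidents against a free-text query. It is not our production code: the `DOCS` records are made up for illustration, and the toy word-count `embed` function stands in for a real embedding model, while the in-memory loop stands in for a vector database query.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-written records standing in for indexed incident documents.
DOCS = [
    {"source": "confluence", "title": "Post-mortem: Redis memory growth",
     "text": "redis memory climbing slowly over days until eviction"},
    {"source": "slack", "title": "Incident channel thread",
     "text": "redis memory keeps climbing, fix was setting a ttl on session keys"},
    {"source": "jira", "title": "Closed incident ticket",
     "text": "postgres connection pool exhausted after deploy"},
]

def embed(text):
    """Toy embedding (word counts); an assumption, not a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, docs, top_k=3):
    """Return the top_k documents, each annotated with its similarity score."""
    q = embed(query)
    scored = [(cosine(q, embed(d["text"])), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": round(s, 3), **d} for s, d in scored[:top_k]]

results = search("redis memory slowly climbing for two days", DOCS)
for r in results:
    print(f'{r["score"]:.3f}  [{r["source"]}] {r["title"]}')
```

The point of the sketch is the shape of the answer: every hit carries its source and a score, so the responder can see at a glance which past incident is the closest match.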
The Stack
- Frontend: OpenWebUI
- Backend: ECS Fargate
- Vector DB: Weaviate
- Embeddings & LLM: AWS Bedrock
- Data sources (all automated, zero manual work):
  - Every historical incident Slack channel
  - The relevant Confluence space (runbooks, post-mortems, architecture)
  - Every closed incident ticket in Jira
  - Every post-mortem
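For a feel of what automated ingestion can look like, here is a rough sketch of turning raw text from any of those sources into uniform, overlapping chunks ready for embedding. The `Record` fields, the `chunk` and `to_records` helpers, and the window sizes are all assumptions made for illustration, not a description of our actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Record:
    source: str  # e.g. "slack", "confluence", or "jira"
    ref: str     # hypothetical identifier such as a channel name or ticket key
    text: str    # one chunk of the original document

def chunk(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    if len(words) <= max_words:
        return [" ".join(words)]
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words) - overlap, step)]

def to_records(source, ref, text, **chunk_args):
    """Wrap every chunk of a document in a Record that remembers its origin."""
    return [Record(source, ref, c) for c in chunk(text, **chunk_args)]

# Usage with made-up input: a 120-word document becomes three chunks.
sample = " ".join(str(i) for i in range(120))
records = to_records("confluence", "example-runbook", sample)
print(len(records), records[1].text.split()[0])
```

Keeping the source and reference on every chunk is what lets a search result link back to the original Slack thread, page, or ticket.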
Final Thought
Sometimes the highest-leverage thing two SREs can do is build a tool that supports and enables developers.