Every ops team has a Notion graveyard or a Confluence dumpster full of runbooks that nobody can find when the building is on fire.
I got tired of watching brilliant engineers fumble around trying to find the documentation and previous incidents (And mitigating steps) at 2 a.m.
So my team built an internal tool we called “SRE Buddy”.
Tech stack:
- Weaviate vector DB
- Cohere embeddings (AWS Bedrock)
- Claude (AWS Bedrock)
- OpenWebUI front-end
Now when someone types: “how do we restart the kafka connect cluster in ca-central-1 again?” (For Example)
It pulls the three most relevant runbook pages, the last five times we actually did it (from incident reports), then spits out the exact steps on what to do.
In the realiability world, and a motto I live by, “Automate the boring stuff”. It just makes sense to have this wealth of knowledge for quick and easy access in a centralized place and not floating around in a developers mind (Which does not help if they leave or the incident is at 2 am).
Pro tip: the secret sauce is chunking runbooks by H2 headings and storing both the raw markdown + a synthetic “TL;DR” version as metadata. Makes retrieval stupidly accurate.