From ae25468bc88aac6da7a9daf0e48fa9b427a78908 Mon Sep 17 00:00:00 2001
From: Eratostenes de Gitjabia
Date: Sat, 9 May 2026 18:47:32 +0000
Subject: [PATCH] docs(runbooks): placeholder index

---
 docs/07-runbooks/README.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 docs/07-runbooks/README.md

diff --git a/docs/07-runbooks/README.md b/docs/07-runbooks/README.md
new file mode 100644
index 0000000..c152f2f
--- /dev/null
+++ b/docs/07-runbooks/README.md
@@ -0,0 +1,28 @@
+# Runbooks
+
+Operational runbooks for common incidents. **Currently placeholders** — to be written as the platform reaches production.
+
+Planned runbooks:
+
+- `bridge-down.md` — when the MQTT bridge to hub fails to connect.
+- `gitops-conflict.md` — when reconcile fails due to merge conflict.
+- `cell-disk-critical.md` — when Cell disk usage exceeds 95%.
+- `camera-mass-offline.md` — when multiple cameras go offline simultaneously.
+- `wan-failover.md` — when primary WAN goes down and failover engages.
+- `cert-renewal-needed.md` — mTLS cert nearing expiration.
+- `frigate-crash-loop.md` — when Frigate fails to start repeatedly.
+- `evidence-export.md` — step-by-step for an operator to export an evidence pack for legal counsel.
+- `site-recovery.md` — full recovery from a Cell hardware failure.
+- `password-recovery.md` — operator forgot password, recovery procedures.
+
+Each runbook follows a standard structure:
+
+1. **Symptoms** — what you see that triggers this runbook.
+2. **Severity** — how urgent.
+3. **Triage** — first 60 seconds: what to check before acting.
+4. **Resolution** — step-by-step.
+5. **Verification** — how to confirm the fix.
+6. **Postmortem checklist** — what to document if the issue recurs.
+7. **See also** — related runbooks, ADRs, docs.
+
+Until these are written, on-call engineers will work from architecture docs + their own judgment. The first runbook to be written should be `bridge-down.md` since it's the most common foreseeable issue.
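
Once the runbooks exist, the seven-section structure listed in the README could be checked mechanically. The sketch below is one possible approach, not part of this patch. It assumes each section is written as a level-2 Markdown heading such as `## Symptoms`, which the README does not specify, and `check_runbook` is a hypothetical helper name.

```python
import re

# Section titles taken from the README's standard structure, in order.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Severity",
    "Triage",
    "Resolution",
    "Verification",
    "Postmortem checklist",
    "See also",
]

# Assumption: each section is a level-2 Markdown heading, e.g. "## Symptoms".
HEADING_RE = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the structure matches."""
    found = HEADING_RE.findall(text)
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in found
    ]
    # Compare the required headings that are present against their expected order.
    present = [h for h in found if h in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in present]
    if present != expected:
        problems.append("sections are out of order or duplicated")
    return problems
```

A check like this could be run over every file in `docs/07-runbooks/` in CI, failing the build when the returned list is non-empty. Headings outside the required set are ignored, so a runbook can still add extra sections.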