wdmUI/docs/07-runbooks/README.md
2026-05-09 18:47:32 +00:00


# Runbooks

Operational runbooks for common incidents. These are **currently placeholders**, to be written as the platform reaches production.

Planned runbooks:

- `bridge-down.md` — when the MQTT bridge to hub fails to connect.
- `gitops-conflict.md` — when reconcile fails due to merge conflict.
- `cell-disk-critical.md` — when Cell disk usage exceeds 95%.
- `camera-mass-offline.md` — when multiple cameras go offline simultaneously.
- `wan-failover.md` — when primary WAN goes down and failover engages.
- `cert-renewal-needed.md` — mTLS cert nearing expiration.
- `frigate-crash-loop.md` — when Frigate fails to start repeatedly.
- `evidence-export.md` — step-by-step for an operator to export an evidence pack for legal counsel.
- `site-recovery.md` — full recovery from a Cell hardware failure.
- `password-recovery.md` — operator forgot password, recovery procedures.

Each runbook follows a standard structure:

1. **Symptoms** — what you see that triggers this runbook.
2. **Severity** — how urgent.
3. **Triage** — first 60 seconds: what to check before acting.
4. **Resolution** — step-by-step.
5. **Verification** — how to confirm the fix.
6. **Postmortem checklist** — what to document if the issue recurs.
7. **See also** — related runbooks, ADRs, docs.
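
The standard structure above can be checked mechanically once runbooks start to land. The following is a minimal sketch, not part of the platform: it assumes each section is written as a level-2 Markdown heading (for example `## Symptoms`), and the names `render_skeleton` and `check_runbook` are hypothetical helpers introduced only for illustration.

```python
import re

# Section names, in the order this README lists them.
SECTIONS = [
    "Symptoms",
    "Severity",
    "Triage",
    "Resolution",
    "Verification",
    "Postmortem checklist",
    "See also",
]

# Assumption: every section is a level-2 heading such as "## Symptoms".
HEADING = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def render_skeleton(title: str) -> str:
    """Return an empty runbook containing all seven sections in order."""
    lines = [f"# {title}", ""]
    for name in SECTIONS:
        lines += [f"## {name}", "", "TODO", ""]
    return "\n".join(lines)


def check_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the structure matches."""
    # Keep only headings that name one of the standard sections.
    found = [h for h in HEADING.findall(text) if h in SECTIONS]
    problems = [f"missing section: {name}" for name in SECTIONS if name not in found]
    # The sections that are present must appear once each, in the standard order.
    expected_order = [name for name in SECTIONS if name in found]
    if found != expected_order:
        problems.append("sections are out of order or duplicated")
    return problems
```

A skeleton produced by `render_skeleton("Bridge down")` passes `check_runbook` with no problems, so the same pair could seed a new placeholder file and later guard it in review. If runbooks end up using a different heading level, only the `HEADING` pattern would need to change.
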

Until these are written, on-call engineers will work from the architecture docs and their own judgment. The first runbook to be written should be `bridge-down.md`, since a bridge failure is expected to be the most common incident.