docs(runbooks): placeholder index
@@ -0,0 +1,28 @@
# Runbooks

Operational runbooks for common incidents. These are **currently placeholders**, to be written as the platform reaches production.

Planned runbooks:
- `bridge-down.md`: when the MQTT bridge to the hub fails to connect.
- `gitops-conflict.md`: when reconciliation fails because of a merge conflict.
- `cell-disk-critical.md`: when Cell disk usage exceeds 95%.
- `camera-mass-offline.md`: when multiple cameras go offline simultaneously.
- `wan-failover.md`: when the primary WAN goes down and failover engages.
- `cert-renewal-needed.md`: when an mTLS certificate is nearing expiration.
- `frigate-crash-loop.md`: when Frigate repeatedly fails to start.
- `evidence-export.md`: step-by-step instructions for an operator exporting an evidence pack for legal counsel.
- `site-recovery.md`: full recovery from a Cell hardware failure.
- `password-recovery.md`: recovery procedures for when an operator forgets their password.

Each runbook follows a standard structure:
1. **Symptoms**: what you see that triggers this runbook.
2. **Severity**: how urgent the incident is.
3. **Triage**: the first 60 seconds, covering what to check before acting.
4. **Resolution**: the step-by-step fix.
5. **Verification**: how to confirm the fix worked.
6. **Postmortem checklist**: what to document if the issue recurs.
7. **See also**: related runbooks, ADRs, and docs.

Until these are written, on-call engineers will work from the architecture docs and their own judgment. The first runbook to be written should be `bridge-down.md`, because it covers the most common foreseeable issue.
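As a starting point for writing `bridge-down.md`, the seven-part structure above could be scaffolded with a small script. This is only a sketch: the function names `render_skeleton` and `scaffold`, the `##` heading level, and the `TODO` placeholder text are assumptions made for illustration, not conventions established by this repository.

```python
from pathlib import Path

# Section names copied from the standard structure in this index.
SECTIONS = [
    "Symptoms",
    "Severity",
    "Triage",
    "Resolution",
    "Verification",
    "Postmortem checklist",
    "See also",
]


def render_skeleton(filename: str) -> str:
    """Return a Markdown skeleton for the runbook stored as `filename`."""
    title = Path(filename).stem  # "bridge-down.md" -> "bridge-down"
    lines = [f"# {title}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        # Heading level and TODO marker are illustrative assumptions.
        lines += [f"## {number}. {section}", "", "TODO", ""]
    return "\n".join(lines)


def scaffold(directory: Path, filename: str) -> Path:
    """Write the skeleton into `directory` without overwriting an existing runbook."""
    path = directory / filename
    # Mode "x" raises FileExistsError if the runbook has already been started.
    with path.open("x", encoding="utf-8") as handle:
        handle.write(render_skeleton(filename))
    return path


if __name__ == "__main__":
    # Print the skeleton instead of writing it, so the sketch has no side effects.
    print(render_skeleton("bridge-down.md"))
```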