Runbooks
Operational runbooks for common incidents. Currently placeholders — to be written as the platform reaches production.
Planned runbooks:
- bridge-down.md — when the MQTT bridge to hub fails to connect.
- gitops-conflict.md — when reconcile fails due to merge conflict.
- cell-disk-critical.md — when Cell disk usage exceeds 95%.
- camera-mass-offline.md — when multiple cameras go offline simultaneously.
- wan-failover.md — when primary WAN goes down and failover engages.
- cert-renewal-needed.md — mTLS cert nearing expiration.
- frigate-crash-loop.md — when Frigate fails to start repeatedly.
- evidence-export.md — step-by-step for an operator to export an evidence pack for legal counsel.
- site-recovery.md — full recovery from a Cell hardware failure.
- password-recovery.md — operator forgot password, recovery procedures.
Each runbook follows a standard structure:
- Symptoms — what you see that triggers this runbook.
- Severity — how urgent.
- Triage — first 60 seconds: what to check before acting.
- Resolution — step-by-step.
- Verification — how to confirm the fix.
- Postmortem checklist — what to document if the issue recurs.
- See also — related runbooks, ADRs, docs.
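Once runbooks start landing, the standard structure above could be enforced with a small check. The following is a minimal sketch, not an existing tool in this repo: it assumes each section appears as a markdown heading (for example `## Symptoms`), and the function name `missing_sections` is hypothetical.

```python
import re

# Section names copied from the standard structure listed above.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Severity",
    "Triage",
    "Resolution",
    "Verification",
    "Postmortem checklist",
    "See also",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no heading in the runbook.

    Assumption: sections are written as markdown ATX headings of any
    level, e.g. "## Symptoms". Matching is case-insensitive.
    """
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", runbook_text, flags=re.MULTILINE
        )
    }
    # Preserve the canonical order so the report reads top to bottom.
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

Calling `missing_sections(open("bridge-down.md").read())` on a draft would return the list of sections still to be written, which could be wired into CI or a pre-commit hook if the team wants it.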
Until these are written, on-call engineers will work from the architecture docs and their own judgment. The first runbook to write should be bridge-down.md, since it is the most common foreseeable issue.