Runbooks

Operational runbooks for common incidents. These are currently placeholders, to be written as the platform reaches production.

Planned runbooks:

bridge-down.md — when the MQTT bridge to hub fails to connect.
gitops-conflict.md — when reconcile fails due to merge conflict.
cell-disk-critical.md — when Cell disk usage exceeds 95%.
camera-mass-offline.md — when multiple cameras go offline simultaneously.
wan-failover.md — when primary WAN goes down and failover engages.
cert-renewal-needed.md — when an mTLS certificate is nearing expiration.
frigate-crash-loop.md — when Frigate fails to start repeatedly.
evidence-export.md — step-by-step procedure for an operator to export an evidence pack for legal counsel.
site-recovery.md — full recovery from a Cell hardware failure.
password-recovery.md — recovery procedures for when an operator has forgotten their password.

Each runbook follows a standard structure:

Symptoms — what you see that triggers this runbook.
Severity — how urgent.
Triage — first 60 seconds: what to check before acting.
Resolution — step-by-step.
Verification — how to confirm the fix.
Postmortem checklist — what to document if the issue recurs.
See also — related runbooks, ADRs, docs.
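
To keep new runbooks consistent with the structure above, a skeleton file could be generated rather than copied by hand. The following is a minimal sketch, not an agreed tool: the function names (render_skeleton, scaffold), the use of Markdown "##" headings, and the HTML-comment hints are all assumptions made for illustration. Only the section names and their order come from the list above.

```python
from pathlib import Path

# Section names and hints, in the order given by the standard structure.
SECTIONS = [
    ("Symptoms", "What you see that triggers this runbook."),
    ("Severity", "How urgent."),
    ("Triage", "First 60 seconds: what to check before acting."),
    ("Resolution", "Step-by-step."),
    ("Verification", "How to confirm the fix."),
    ("Postmortem checklist", "What to document if the issue recurs."),
    ("See also", "Related runbooks, ADRs, docs."),
]


def render_skeleton(title: str) -> str:
    """Return a Markdown skeleton with one heading per standard section.

    The heading level and comment-style hints are assumptions, not a
    format the project has settled on.
    """
    lines = [f"# {title}", ""]
    for heading, hint in SECTIONS:
        # Each hint is written as an HTML comment so it stays out of rendered output.
        lines += [f"## {heading}", "", f"<!-- {hint} -->", ""]
    return "\n".join(lines)


def scaffold(directory: Path, filename: str, title: str) -> Path:
    """Write the skeleton to directory/filename, refusing to overwrite an existing file."""
    path = directory / filename
    if path.exists():
        raise FileExistsError(f"{path} already exists; refusing to overwrite")
    path.write_text(render_skeleton(title), encoding="utf-8")
    return path
```

For example, scaffold(Path("."), "bridge-down.md", "Bridge down") would create the first planned runbook in the current directory; the target directory is likewise an assumption. Refusing to overwrite is a deliberate choice so that a partially written runbook is never replaced by an empty skeleton.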

Until these are written, on-call engineers will work from the architecture docs and their own judgment. The first runbook to be written should be bridge-down.md, because a bridge connection failure is expected to be the most common issue.