wdmUI/docs/07-runbooks/README.md
2026-05-09 18:47:32 +00:00

Runbooks

Operational runbooks for common incidents. These are currently placeholders, to be written as the platform reaches production.

Planned runbooks:

  • bridge-down.md — when the MQTT bridge to hub fails to connect.
  • gitops-conflict.md — when reconcile fails due to merge conflict.
  • cell-disk-critical.md — when Cell disk usage exceeds 95%.
  • camera-mass-offline.md — when multiple cameras go offline simultaneously.
  • wan-failover.md — when primary WAN goes down and failover engages.
  • cert-renewal-needed.md — when an mTLS cert is nearing expiration.
  • frigate-crash-loop.md — when Frigate fails to start repeatedly.
  • evidence-export.md — step-by-step for an operator to export an evidence pack for legal counsel.
  • site-recovery.md — full recovery from a Cell hardware failure.
  • password-recovery.md — recovery procedures for when an operator forgets their password.

Each runbook follows a standard structure:

  1. Symptoms — what you see that triggers this runbook.
  2. Severity — how urgent.
  3. Triage — first 60 seconds: what to check before acting.
  4. Resolution — step-by-step.
  5. Verification — how to confirm the fix.
  6. Postmortem checklist — what to document if the issue recurs.
  7. See also — related runbooks, ADRs, docs.
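
Once runbooks start to appear, the standard structure above could be checked automatically. The following is a minimal sketch, not an existing tool in this repository: it assumes each section is written as a Markdown heading (for example `## Symptoms`), and the names `REQUIRED_SECTIONS` and `check_runbook` are hypothetical.

```python
import re

# Section names taken from the standard structure listed above, in order.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Severity",
    "Triage",
    "Resolution",
    "Verification",
    "Postmortem checklist",
    "See also",
]

# Assumption: every section is a Markdown heading such as "## Symptoms".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the structure is valid."""
    headings = [h.lower() for h in HEADING_RE.findall(text)]
    required = [s.lower() for s in REQUIRED_SECTIONS]

    # Report every required section that has no matching heading.
    problems = [
        f"missing section: {s}"
        for s in REQUIRED_SECTIONS
        if s.lower() not in headings
    ]

    # Compare the order of the required headings that are present
    # against the order defined by the standard structure.
    present = [h for h in headings if h in required]
    expected = [s for s in required if s in present]
    if present != expected:
        problems.append("sections are out of order or duplicated")

    return problems


if __name__ == "__main__":
    # Hypothetical, incomplete runbook used only to exercise the check.
    sample = "# bridge-down\n\n## Symptoms\n...\n\n## Severity\n...\n"
    for problem in check_runbook(sample):
        print(problem)
```

Headings outside the seven required names, such as the runbook title, are ignored, so a runbook can carry extra sections without failing the check.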

Until these are written, on-call engineers work from the architecture docs and their own judgment. The first runbook to write should be bridge-down.md, since a bridge failure is the most likely foreseeable incident.