Multi-site fleet deployment

Pattern for customers with 5+ sites managed centrally.

Topology

                   ┌──────────────────────────────┐
                   │  Blocao Hub (Hetzner DE/FI)  │
                   │  - mosquitto                 │
                   │  - keycloak                  │
                   │  - qdrant + timescaledb      │
                   │  - sites overview UI         │
                   └────────────┬─────────────────┘
                                │ MQTT bridges over TLS
                                │
        ┌───────────────┬───────┴───────┬────────────────┐
        │               │               │                │
   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
   │BL-LAB-1 │     │BL-LAB-2 │     │BL-WH-N  │     │BL-WH-S  │
   │ R+1     │     │ R+1     │     │ R+2     │     │ R+1     │
   └─────────┘     └─────────┘     └─────────┘     └─────────┘

Each site is independent: it keeps operating without the hub if the WAN goes down. The hub is a coordination layer, not a critical-path dependency.

Two GitOps repos

When fleet management is in play, the hub provisions two kinds of repo for each customer: a single fleet-config and one site-config per site.

fleet-config (org-wide common settings):
- Default firewall rules.
- Default Frigate model versions.
- Default retention policy.
- DNS allowlist baseline.
- Common operator role definitions.

site-config-<site_id> (per-site overrides):
- Site identity (BL-...).
- Camera definitions specific to this site.
- Retention overrides if different from fleet default.
- Network specifics.

Each site's router clones both repos. Reconcile applies fleet-config first, then layers the site-config overrides on top.
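The fleet-then-site ordering can be sketched as a layered merge. This is a minimal illustration, not the router's actual reconcile code: the function name `layer_config`, the dictionary keys, and the rule that nested mappings merge while lists are replaced wholesale are all assumptions made for the example.

```python
from copy import deepcopy
from typing import Any, Mapping


def layer_config(fleet: Mapping[str, Any], site: Mapping[str, Any]) -> dict[str, Any]:
    """Return fleet defaults with site overrides applied on top.

    Nested mappings are merged key by key; any other value (including
    lists) in the site config replaces the fleet value wholesale.
    """
    merged = deepcopy(dict(fleet))
    for key, value in site.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = layer_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of fleet-config and site-config-bl-lab-2.
fleet_config = {
    "frigate": {"version": "0.15"},
    "retention": {"days": 30},
    "dns_allowlist": ["example.org"],
}
site_config = {
    "site_id": "BL-LAB-2",
    "retention": {"days": 90},
    "cameras": [{"name": "dock-door"}],
}

effective = layer_config(fleet_config, site_config)
print(effective["retention"], effective["frigate"])
```

In this sketch the site keeps the fleet-wide Frigate version and overrides only its retention, and neither input dictionary is mutated.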

This separation means:

- "Update Frigate to v0.15 across the fleet" → one commit to fleet-config, propagates to all sites in the next reconcile.
- "Add a camera to BL-LAB-2" → one commit to site-config-bl-lab-2, only affects that site.

Operator workflow

A fleet operator (e.g., security ops at headquarters of a 30-store retailer) typically:

- Hub Sites Overview: see all sites with health/alerts.
- Drill down: click a site → Tailscale tunnel opens to that site's console.
- Investigation: query forensics either at the site (single-site context) or at the hub (cross-site context).
- Bulk policy: edit fleet-config repo for org-wide changes.

Cross-site forensic search

Planned for Epic 6 (post-MVP).

The hub maintains a consolidated embeddings index. Each site publishes its embeddings, tagged with site_id, to its bridge, and the hub merges them into that index. An operator query at hub level fans out:

- Direct lookup in the consolidated index for "find vehicle plate L-7234" or "find this face".
- Re-ranking with site-specific context.
- Deep dive into a site's full data via Tailscale tunnel.

Raw video stays at the sites; only embeddings and metadata reach the hub, so data sovereignty is preserved.
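The merge-then-lookup step can be sketched with an in-memory index. This is a stand-in for illustration only: the class and field names (`ConsolidatedIndex`, `Embedding`, `event_id`) are hypothetical, the vectors are toy values, and a real hub would presumably delegate storage and search to its vector store rather than scan a Python list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    site_id: str               # which site published this embedding
    event_id: str              # hypothetical pointer back to site-local data
    vector: tuple[float, ...]


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConsolidatedIndex:
    """Hub-side index holding embeddings from every site, never raw video."""

    def __init__(self):
        self._items = []

    def ingest(self, batch):
        """Merge a batch of embeddings received over one site's bridge."""
        self._items.extend(batch)

    def search(self, query, top_k=3):
        """Return the top_k (score, embedding) pairs, best match first."""
        scored = [(cosine(query, e.vector), e) for e in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = ConsolidatedIndex()
index.ingest([Embedding("BL-LAB-1", "ev-1", (1.0, 0.0))])
index.ingest([
    Embedding("BL-WH-N", "ev-7", (0.0, 1.0)),
    Embedding("BL-WH-N", "ev-8", (0.6, 0.8)),
])

hits = index.search((0.0, 1.0), top_k=2)
for score, emb in hits:
    print(f"{emb.site_id} {emb.event_id} score={score:.2f}")
```

Because every hit carries its site_id, the operator knows which site to tunnel into for the deep dive.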

Sites Overview UI

Separate from the per-site router console. Implemented in apps/hub (future code repo).

No mockup exists yet; the design is TBD post-MVP. Likely contents:

- Map of sites with status pins.
- Aggregated health panel (% of sites green/warn/err).
- Aggregated alerts panel (active across the fleet).
- Bulk actions (update fleet-config, push command to N sites).
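As a rough illustration of the aggregated health panel, the percentages could be derived from a per-site status map. The function name and the input shape (site_id mapped to one of "green", "warn", "err") are assumptions; the UI design itself is still TBD.

```python
from collections import Counter

STATUSES = ("green", "warn", "err")


def health_summary(site_status: dict[str, str]) -> dict[str, float]:
    """Return the percentage of sites in each status (all 0.0 if no sites)."""
    counts = Counter(site_status.values())
    total = len(site_status)
    return {s: (100.0 * counts[s] / total if total else 0.0) for s in STATUSES}


summary = health_summary({
    "BL-LAB-1": "green",
    "BL-LAB-2": "green",
    "BL-WH-N": "warn",
    "BL-WH-S": "err",
})
print(summary)
```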

Pricing model considerations

A 30-site customer should pay more in total than a 1-site customer, but less per site. Subscription tiers:

| Tier | Sites | Monthly price per site |
|---|---|---|
| Starter | 1-5 | €30-50 |
| Standard | 6-25 | €25-40 |
| Fleet | 26-100 | €20-30 |
| Enterprise | 101+ | Custom |

Hardware sold separately. Support tiers add a flat monthly.

(All numbers placeholder; finalize with sales lead.)
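To sanity-check the tiers, the arithmetic can be sketched as below. The prices are the placeholder figures from the table. The billing rule is an assumption made for the example, since the document does not specify one: every site is charged at the rate of the tier that the customer's total site count falls into, with no graduated pricing. Hardware and support fees are left out.

```python
# (tier name, min sites, max sites, (low, high) EUR per site per month)
TIERS = [
    ("Starter", 1, 5, (30, 50)),
    ("Standard", 6, 25, (25, 40)),
    ("Fleet", 26, 100, (20, 30)),
]


def monthly_subscription(sites: int):
    """Return (tier, (low_total, high_total)) in EUR; totals are None for Enterprise."""
    if sites < 1:
        raise ValueError("a customer needs at least one site")
    for name, lo, hi, (price_lo, price_hi) in TIERS:
        if lo <= sites <= hi:
            return name, (sites * price_lo, sites * price_hi)
    return "Enterprise", None  # custom pricing, negotiated per customer


print(monthly_subscription(1))    # single-site customer
print(monthly_subscription(30))   # 30-store retailer
```

With the placeholder numbers, a 30-site customer lands in the Fleet tier at €600-900 per month in total, against €30-50 for a single site.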

Operational considerations

- Hub HA: production hub should be at least 2 nodes (active-passive at minimum). For >50 sites, active-active with shared MinIO.
- Hub backup: daily snapshots to a second region (OVH France as standard secondary).
- Site offline handling: alerts after 5 min of bridge silence. Auto-resolve on reconnect.
- Cert management: each site's mTLS cert renews automatically every 6 months. Monitoring alerts at 90 days.
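The site offline rule (alert after 5 minutes of bridge silence, auto-resolve on reconnect) could be tracked roughly as follows. The class name `BridgeWatchdog` and the idea of feeding it a heartbeat for each message seen on a site's bridge are assumptions for illustration; the document does not describe how the hub detects silence.

```python
from datetime import datetime, timedelta, timezone

SILENCE_THRESHOLD = timedelta(minutes=5)


class BridgeWatchdog:
    """Tracks the last message seen per site and raises or clears offline alerts."""

    def __init__(self, threshold: timedelta = SILENCE_THRESHOLD):
        self.threshold = threshold
        self.last_seen: dict[str, datetime] = {}
        self.active_alerts: set[str] = set()

    def heartbeat(self, site_id: str, now: datetime) -> None:
        """Record bridge traffic from a site; clears any offline alert for it."""
        self.last_seen[site_id] = now
        self.active_alerts.discard(site_id)  # auto-resolve on reconnect

    def check(self, now: datetime) -> set[str]:
        """Return the sites currently alerting as offline."""
        for site_id, seen in self.last_seen.items():
            if now - seen >= self.threshold:
                self.active_alerts.add(site_id)
        return set(self.active_alerts)


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
watchdog = BridgeWatchdog()
watchdog.heartbeat("BL-LAB-1", t0)
watchdog.heartbeat("BL-LAB-2", t0)

after_4_min = watchdog.check(t0 + timedelta(minutes=4))       # nobody silent long enough
watchdog.heartbeat("BL-LAB-2", t0 + timedelta(minutes=4))     # BL-LAB-2 keeps talking
after_6_min = watchdog.check(t0 + timedelta(minutes=6))       # BL-LAB-1 silent for 6 min
watchdog.heartbeat("BL-LAB-1", t0 + timedelta(minutes=7))     # BL-LAB-1 reconnects
after_reconnect = watchdog.check(t0 + timedelta(minutes=7))
print(after_4_min, after_6_min, after_reconnect)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.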

Customer journey

- Pilot: 1-2 sites, hub provisioned, validate fleet workflow.
- Rollout: phased install of remaining sites.
- Optimization: 3 months in, review which fleet-config defaults to tighten.
- Steady state: ongoing ops, occasional new sites, regular fleet-config updates.

Total time from pilot to 30 sites in production: typically 6-9 months for a customer with established cabling/infra at each site.
