Multi-site fleet deployment

Pattern for customers with 5+ sites managed centrally.

Topology

                   ┌──────────────────────────────┐
                   │  Blocao Hub (Hetzner DE/FI)  │
                   │  - mosquitto                 │
                   │  - keycloak                  │
                   │  - qdrant + timescaledb      │
                   │  - sites overview UI         │
                   └────────────┬─────────────────┘
                                │ MQTT bridges over TLS
                                │
        ┌───────────────┬───────┴───────┬───────────────┐
        │               │               │               │
   ┌────▼────┐     ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
   │BL-LAB-1 │     │BL-LAB-2 │     │BL-WH-N  │     │BL-WH-S  │
   │ R+1     │     │ R+1     │     │ R+2     │     │ R+1     │
   └─────────┘     └─────────┘     └─────────┘     └─────────┘

Each site is independent: it keeps operating without the hub if the WAN goes down. The hub is a coordination layer, not a critical-path dependency.

Two GitOps repos

When fleet management is in play, the hub provisions two kinds of repo per customer, one shared fleet-config plus one site-config per site:

  1. fleet-config (org-wide common settings):

    • Default firewall rules.
    • Default Frigate model versions.
    • Default retention policy.
    • DNS allowlist baseline.
    • Common operator role definitions.
  2. site-config-<site_id> (per-site overrides):

    • Site identity (BL-...).
    • Camera definitions specific to this site.
    • Retention overrides if different from fleet default.
    • Network specifics.

Each site's router clones both repos. Reconcile applies fleet-config first, then the site overrides, so a site value wins wherever the two conflict.

This separation means:

  • "Update Frigate to v0.15 across the fleet" → one commit to fleet-config, propagates to all sites in the next reconcile.
  • "Add a camera to BL-LAB-2" → one commit to site-config-bl-lab-2, only affects that site.
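The "fleet first, then site overrides" order can be illustrated with a recursive merge. This is a minimal sketch, not the router's actual reconcile code: the function name merge_config and the key names and values in both dictionaries are hypothetical, and it assumes both repos parse into nested dictionaries.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(fleet: Dict[str, Any], site: Dict[str, Any]) -> Dict[str, Any]:
    """Return fleet defaults with site overrides applied on top."""
    merged = deepcopy(fleet)
    for key, value in site.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so a site can override one nested key without
            # restating the whole section.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical parsed contents of fleet-config and site-config-bl-lab-2.
fleet_config = {
    "frigate": {"version": "0.15"},
    "retention": {"days": 30},
    "cameras": {},
}
site_config = {
    "site_id": "BL-LAB-2",
    "retention": {"days": 90},
    "cameras": {"dock-1": {"rtsp": "rtsp://10.0.0.21/stream"}},
}

effective = merge_config(fleet_config, site_config)
```

Here effective keeps the fleet-wide Frigate version, takes the site's retention value, and gains the site-specific camera. The fleet dictionary itself is left unmodified, so the same defaults can be reused for every site.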

Operator workflow

A fleet operator (e.g., security ops at the headquarters of a 30-store retailer) typically works through these steps:

  1. Hub Sites Overview: see all sites with health/alerts.
  2. Drill down: click a site → Tailscale tunnel opens to that site's console.
  3. Investigation: query forensics either at the site (single-site context) or at the hub (cross-site context).
  4. Bulk policy: edit fleet-config repo for org-wide changes.

Cross-site forensics

Implemented in Epic 6 (post-MVP).

The hub maintains a consolidated embeddings index. Each site publishes embeddings (tagged with site_id) over its MQTT bridge, and the hub merges them. An operator query at hub level fans out as follows:

  • Direct lookup in consolidated index for "find vehicle plate L-7234" or "find this face".
  • Re-ranking with site-specific context.
  • Deep dive into a site's full data via Tailscale tunnel.

Raw video stays at the sites. Only embeddings and metadata reach the hub, which preserves data sovereignty.
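The lookup and re-ranking steps above can be sketched with an in-memory index. This is an illustration under stated assumptions, not the hub's implementation: the real hub would query its vector store rather than a Python list, and the Embedding class, the search function, and the per-site boost used for re-ranking are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Embedding:
    site_id: str   # which site published this embedding
    event_id: str  # hypothetical identifier of the source event
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    index: List[Embedding],
    query: Sequence[float],
    top_k: int = 3,
    site_boost: Optional[Dict[str, float]] = None,
) -> List[Embedding]:
    """Score every entry, apply an optional per-site weight, return the best."""
    boost = site_boost or {}
    scored = [
        (cosine(e.vector, query) * boost.get(e.site_id, 1.0), e) for e in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


# Consolidated index: embeddings from several sites merged at the hub.
index = [
    Embedding("BL-LAB-1", "evt-1", [1.0, 0.0]),
    Embedding("BL-LAB-2", "evt-2", [0.9, 0.1]),
    Embedding("BL-WH-N", "evt-3", [0.0, 1.0]),
]

hits = search(index, [1.0, 0.0], top_k=2)
boosted = search(index, [1.0, 0.0], top_k=2, site_boost={"BL-LAB-2": 1.5})
```

Each hit carries its site_id, which is what lets the operator decide which site to open a tunnel to for the full data.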

Sites Overview UI

Separate from the per-site router console. Implemented in apps/hub (future code repo).

No mockup exists yet; the design is TBD post-MVP. Likely elements:

  • Map of sites with status pins.
  • Aggregated health panel (% of sites green/warn/err).
  • Aggregated alerts panel (active across the fleet).
  • Bulk actions (update fleet-config, push command to N sites).
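As an illustration of the aggregated health panel, the percentages could be derived from per-site status as below. This is a sketch only: the function name health_summary and the status strings "green", "warn" and "err" are assumptions taken from the wording of the list, since the UI design is still open.

```python
from collections import Counter
from typing import Dict


def health_summary(site_status: Dict[str, str]) -> Dict[str, float]:
    """Return the percentage of sites in each health state."""
    counts = Counter(site_status.values())
    total = len(site_status)
    return {
        state: (100.0 * counts[state] / total if total else 0.0)
        for state in ("green", "warn", "err")
    }


# Hypothetical current status per site.
statuses = {
    "BL-LAB-1": "green",
    "BL-LAB-2": "green",
    "BL-WH-N": "warn",
    "BL-WH-S": "err",
}
summary = health_summary(statuses)
```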

Pricing model considerations

A 30-site customer should pay more in total than a 1-site customer, while the per-site price drops with volume. Proposed subscription tiers:

Tier        Sites    Monthly per site
Starter     1-5      €30-50
Standard    6-25     €25-40
Fleet       26-100   €20-30
Enterprise  100+     Custom

Hardware is sold separately. Support tiers add a flat monthly fee.

(All numbers placeholder; finalize with sales lead.)
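To make the tier arithmetic concrete, the sketch below computes the monthly subscription range for a given site count, using the placeholder figures from the tier table. It assumes the tier rate applies to every site of the customer (not a graduated scheme) and that exactly 100 sites still falls in the Fleet tier; neither assumption is settled here, and hardware and support fees are excluded.

```python
from typing import Optional, Tuple

# (tier name, max sites, low EUR per site, high EUR per site); placeholders.
TIERS = [
    ("Starter", 5, 30, 50),
    ("Standard", 25, 25, 40),
    ("Fleet", 100, 20, 30),
]


def monthly_range(sites: int) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return (tier, (low, high) monthly total in EUR), or None for custom."""
    if sites < 1:
        raise ValueError("a customer needs at least one site")
    for name, max_sites, low, high in TIERS:
        if sites <= max_sites:
            return name, (sites * low, sites * high)
    return "Enterprise", None  # priced individually


tier, totals = monthly_range(30)
```

For 30 sites this gives the Fleet tier at €600-900 per month, compared with €30-50 for a single Starter site.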

Operational considerations

  • Hub HA: production hub should be at least 2 nodes (active-passive at minimum). For >50 sites, active-active with shared MinIO.
  • Hub backup: daily snapshots to a second region (OVH France as standard secondary).
  • Site offline handling: alerts after 5 min of bridge silence. Auto-resolve on reconnect.
  • Cert management: each site's mTLS cert renews automatically every 6 months. Monitoring alerts when a cert has 90 days or less of validity left.
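The site-offline rule (alert after 5 min of bridge silence, auto-resolve on reconnect) can be sketched as a periodic check over last-seen timestamps. The function names, the last_seen mapping, and the timestamps are hypothetical; how the hub actually tracks bridge activity is not specified here.

```python
from datetime import datetime, timedelta
from typing import Dict, Set, Tuple

OFFLINE_AFTER = timedelta(minutes=5)


def offline_sites(last_seen: Dict[str, datetime], now: datetime) -> Set[str]:
    """Sites whose bridge has been silent for 5 minutes or more."""
    return {s for s, seen in last_seen.items() if now - seen >= OFFLINE_AFTER}


def diff_alerts(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (newly raised, auto-resolved) alerts between two checks."""
    return current - previous, previous - current


# First check: BL-WH-N has been silent for 7 minutes (arbitrary timestamps).
now = datetime(2025, 1, 1, 12, 0)
last_seen = {
    "BL-LAB-1": now - timedelta(seconds=30),
    "BL-WH-N": now - timedelta(minutes=7),
}
active = offline_sites(last_seen, now)
raised, resolved = diff_alerts(set(), active)

# Second check: BL-WH-N reconnected, so its alert resolves on its own.
last_seen["BL-WH-N"] = now + timedelta(minutes=1)
later = now + timedelta(minutes=2)
active_later = offline_sites(last_seen, later)
raised_later, resolved_later = diff_alerts(active, active_later)
```

Recomputing the offline set on every check keeps the logic stateless apart from the previous alert set, so a missed check cannot leave an alert stuck open.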

Customer journey

  1. Pilot: 1-2 sites, hub provisioned, validate fleet workflow.
  2. Rollout: phased install of remaining sites.
  3. Optimization: 3 months in, review which fleet-config defaults to tighten.
  4. Steady state: ongoing ops, occasional new sites, regular fleet-config updates.

Total time from pilot to 30 sites in production: typically 6-9 months for a customer with established cabling/infra at each site.