From 267d5ddfad03ca1a900889b90e7aabbc5783dac4 Mon Sep 17 00:00:00 2001
From: Eratostenes de Gitjabia
Date: Sat, 9 May 2026 12:37:51 +0000
Subject: [PATCH] docs(deployments): multi-site fleet pattern

---
 docs/04-deployments/multi-site-fleet.md | 112 ++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 docs/04-deployments/multi-site-fleet.md

diff --git a/docs/04-deployments/multi-site-fleet.md b/docs/04-deployments/multi-site-fleet.md
new file mode 100644
index 0000000..06702cc
--- /dev/null
+++ b/docs/04-deployments/multi-site-fleet.md
@@ -0,0 +1,112 @@
+# Multi-site fleet deployment
+
+Pattern for customers with 5+ sites managed centrally.
+
+## Topology
+
+```
+                  ┌──────────────────────────────┐
+                  │  Blocao Hub (Hetzner DE/FI)  │
+                  │  - mosquitto                 │
+                  │  - keycloak                  │
+                  │  - qdrant + timescaledb      │
+                  │  - sites overview UI         │
+                  └────────────┬─────────────────┘
+                               │ MQTT bridges over TLS
+                               │
+       ┌───────────────┬───────┴───────┬────────────────┐
+       │               │               │                │
+  ┌────▼────┐     ┌────▼────┐     ┌────▼────┐      ┌────▼────┐
+  │BL-LAB-1 │     │BL-LAB-2 │     │BL-WH-N  │      │BL-WH-S  │
+  │  R+1    │     │  R+1    │     │  R+2    │      │  R+1    │
+  └─────────┘     └─────────┘     └─────────┘      └─────────┘
+```
+
+Each site is independent: it keeps operating without the hub if the WAN goes down. The hub is a **coordination layer**, not a critical-path dependency.
+
+## Two GitOps repos
+
+When fleet management is in play, the hub provisions two repos per customer:
+
+1. **`fleet-config`** (org-wide common settings):
+   - Default firewall rules.
+   - Default Frigate model versions.
+   - Default retention policy.
+   - DNS allowlist baseline.
+   - Common operator role definitions.
+
+2. **`site-config-`** (per-site overrides):
+   - Site identity (BL-...).
+   - Camera definitions specific to this site.
+   - Retention overrides if different from fleet default.
+   - Network specifics.
+
+The router clones both. Reconcile applies the fleet config first, then the site overrides.
+
+This separation means:
+
+- "Update Frigate to v0.15 across the fleet" → one commit to fleet-config, propagates to all sites in the next reconcile.
+- "Add a camera to BL-LAB-2" → one commit to site-config-bl-lab-2, only affects that site.
+
+## Operator workflow
+
+A fleet operator (e.g., security ops at headquarters of a 30-store retailer) typically:
+
+1. **Hub Sites Overview**: see all sites with health/alerts.
+2. **Drill down**: click a site → Tailscale tunnel opens to that site's console.
+3. **Investigation**: query forensics either at the site (single-site context) or at the hub (cross-site context).
+4. **Bulk policy**: edit fleet-config repo for org-wide changes.
+
+## Cross-site forensic search
+
+Implemented in Epic 6 (post-MVP).
+
+The hub maintains a consolidated embeddings index. Each site publishes embeddings (with site_id) to its bridge. Hub merges them. Operator query at hub level fans out:
+
+- Direct lookup in consolidated index for "find vehicle plate L-7234" or "find this face".
+- Re-ranking with site-specific context.
+- Deep dive into a site's full data via Tailscale tunnel.
+
+Raw video stays at sites. Embeddings + metadata at hub. Sovereignty preserved.
+
+## Sites Overview UI
+
+Separate from the per-site router console. Implemented in `apps/hub` (future code repo).
+
+Mockup not yet created; design TBD post-MVP. Likely:
+
+- Map of sites with status pins.
+- Aggregated health panel (% of sites green/warn/err).
+- Aggregated alerts panel (active across the fleet).
+- Bulk actions (update fleet-config, push command to N sites).
+
+## Pricing model considerations
+
+A 30-site customer should pay more than a 1-site customer. Subscription tiers:
+
+| Tier | Sites | Monthly per site |
+|---|---|---|
+| Starter | 1-5 | €30-50 |
+| Standard | 6-25 | €25-40 |
+| Fleet | 26-100 | €20-30 |
+| Enterprise | 100+ | Custom |
+
+Hardware sold separately. Support tiers add a flat monthly.
+
+(All numbers placeholder; finalize with sales lead.)
+
+## Operational considerations
+
+- **Hub HA**: production hub should be at least 2 nodes (active-passive at minimum). For >50 sites, active-active with shared MinIO.
+- **Hub backup**: daily snapshots to a second region (OVH France as standard secondary).
+- **Site offline handling**: alerts after 5 min of bridge silence. Auto-resolve on reconnect.
+- **Cert management**: each site's mTLS cert renews automatically every 6 months. Monitoring alerts at 90 days.
+
+## Customer journey
+
+1. **Pilot**: 1-2 sites, hub provisioned, validate fleet workflow.
+2. **Rollout**: phased install of remaining sites.
+3. **Optimization**: 3 months in, review which fleet-config defaults to tighten.
+4. **Steady state**: ongoing ops, occasional new sites, regular fleet-config updates.
+
+Total time from pilot to 30 sites in production: typically 6-9 months for a customer with established cabling/infra at each site.
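
Review note: the "Site offline handling" rule under Operational considerations (alert after 5 min of bridge silence, auto-resolve on reconnect) could be sketched roughly as below. This is a minimal illustration, not the hub's actual implementation: the `SiteBridgeMonitor` class, its method names, the event strings, and the choice to treat exactly 300 seconds of silence as the trigger are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# From the doc: "alerts after 5 min of bridge silence".
# Treating ">= 300 s" as the trigger is an assumption.
SILENCE_THRESHOLD_S = 5 * 60


@dataclass
class SiteBridgeMonitor:
    """Hypothetical per-site tracker for MQTT bridge liveness."""

    site_id: str
    last_seen: float          # timestamp (seconds) of the last bridge message
    alert_active: bool = False

    def on_message(self, now: float) -> Optional[str]:
        """Record bridge traffic; auto-resolve an active alert on reconnect."""
        self.last_seen = now
        if self.alert_active:
            self.alert_active = False
            return "resolved"
        return None

    def check(self, now: float) -> Optional[str]:
        """Raise one alert once the bridge has been silent for the threshold."""
        silent_for = now - self.last_seen
        if not self.alert_active and silent_for >= SILENCE_THRESHOLD_S:
            self.alert_active = True
            return "raised"
        return None


# Usage: a site goes quiet, trips the alert once, then reconnects.
monitor = SiteBridgeMonitor(site_id="BL-LAB-2", last_seen=0.0)
events = [
    monitor.check(now=120.0),       # 2 min of silence: below threshold
    monitor.check(now=300.0),       # 5 min of silence: alert raised
    monitor.check(now=360.0),       # still silent: no duplicate alert
    monitor.on_message(now=400.0),  # bridge traffic resumes: alert resolved
]
```

Keeping the alert state on the monitor object is what prevents duplicate alerts during a long outage; a periodic job on the hub would call `check` for every site, while the bridge message handler would call `on_message`.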