# ADR-0004 · GitOps as source of truth
**Status**: accepted
**Date**: 2026-05
## Context
Configuration of a Blocao site involves: OpenWrt UCI files, Frigate config, MQTT bridge policy, firewall rules, container compose files, retention policy, custom DNS lists. This needs to be:
- Auditable (who changed what, when, why).
- Reproducible (a new router should reach the same state as an existing one).
- Transactional (changes apply together or not at all).
- Reversible (rollback to last known good).
LuCI's web-based config edits and ad-hoc SSH changes don't satisfy any of those. Configuration management tools (Ansible, Puppet) work but introduce a layer of indirection between intent and state.
## Decision
**Two repositories for each site**: `site-config` (per-site overrides) and `fleet-config` (organization-wide common config, shared by all sites). Configuration changes happen exclusively as commits to these repos.
The router clones both and reconciles every 5 minutes:
1. `git fetch` both repos.
2. Detect drift by comparing SHA-256 hashes of the live files against the applied config.
3. Apply in layers: `fleet-config` first, then `site-config` overrides on top.
4. If a change fails to apply, roll back to the last known good state and alert.
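
Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the router's actual implementation: it assumes each repo's config has already been parsed into nested dictionaries, and `merge_layers`, `reconcile`, and the `apply` and `alert` callbacks are hypothetical names introduced here.

```python
from copy import deepcopy
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def merge_layers(fleet: Config, site: Config) -> Config:
    """Return the fleet config with site overrides layered on top.

    Nested dictionaries are merged recursively; any other value in
    `site` replaces the fleet value outright. Inputs are not mutated.
    """
    merged = deepcopy(fleet)
    for key, value in site.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def reconcile(
    fleet: Config,
    site: Config,
    last_good: Config,
    apply: Callable[[Config], None],   # hypothetical: writes config to the device
    alert: Callable[[str], None],      # hypothetical: notifies the operator
) -> Config:
    """Apply the merged config; on failure, re-apply `last_good` and alert.

    Returns the config that is in effect afterwards.
    """
    desired = merge_layers(fleet, site)
    try:
        apply(desired)
    except Exception as exc:
        apply(last_good)  # roll back to the last known good state
        alert(f"apply failed, rolled back: {exc}")
        return last_good
    return desired
```

Keeping the merge a pure function means the desired state can be computed and inspected before anything touches the device.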
**Drift detection**: anything edited live (e.g. `vi /etc/config/firewall` outside the repo) is flagged in the UI as `DRIFTED` until it is either committed back to the repo or reverted.
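
A hash comparison of this kind might look like the sketch below. It assumes the reconciler records a manifest of relative path to SHA-256 hex digest at apply time; the manifest format and the names `sha256_of` and `find_drift` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(applied_hashes: Dict[str, str], root: Path) -> List[str]:
    """Return the relative paths whose live content no longer matches
    the hash recorded at apply time. A missing file counts as drifted.
    """
    drifted = []
    for rel_path, expected in sorted(applied_hashes.items()):
        live = root / rel_path
        if not live.is_file() or sha256_of(live) != expected:
            drifted.append(rel_path)
    return drifted
```

Any path returned by `find_drift` would be shown as `DRIFTED` until its content matches the repo again.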
The **GitOps panel** in the console exposes Applied/HEAD/Remote SHAs and lets the operator trigger fetch/reconcile/rollback. For deeper changes, the operator pushes commits via Gitea/GitHub UI or git CLI.
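
One way to read the three SHAs together is sketched below. The interpretation is an assumption, not something this ADR specifies: "Applied" is taken to be the commit last reconciled onto the device, "HEAD" the local clone's checked-out commit, and "Remote" the tip of the remote branch. The function name and status strings are hypothetical.

```python
def panel_status(applied: str, head: str, remote: str) -> str:
    """Summarize where the site stands relative to the repo.

    Assumed meanings: `applied` = commit last reconciled onto the device,
    `head` = local clone's commit, `remote` = remote branch tip.
    """
    if head != remote:
        return "fetch needed"        # local clone differs from the remote tip
    if applied != head:
        return "reconcile pending"   # fetched, but not yet applied
    return "in sync"
```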
## Consequences
**Good**:
- Anyone with read access to the repo can audit "what's running here".
- Rolling back a regression is a `git revert` followed by a reconcile.
- Same operator experience scales from 1 site to 1000 sites.
- The fleet-config repo enables "edit once, apply to all" for org-wide policy changes.
**Bad / trade-offs**:
- Steeper learning curve for ops who don't know git. Mitigated by the GitOps panel handling common operations.
- 5-minute reconcile lag means urgent changes feel slow. Mitigated by the **manual reconcile button**.
- Secrets in repos are a problem. Addressed by encrypting them with `sops` and decrypting at apply time.
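
Decrypting at apply time could be as simple as shelling out to the `sops` CLI, which prints the plaintext to stdout when invoked with `--decrypt`. The sketch below assumes `sops` is installed on the router and that its decryption key is already available there; the file name and the `decrypt_secret` helper are hypothetical.

```python
import subprocess
from typing import Callable, List


def sops_decrypt_command(path: str) -> List[str]:
    """Command line that asks sops to decrypt `path` to stdout."""
    return ["sops", "--decrypt", path]


def decrypt_secret(path: str, run: Callable = subprocess.run) -> str:
    """Return the decrypted contents of a sops-encrypted file.

    `run` is injectable so the function can be exercised without sops.
    Raises subprocess.CalledProcessError if sops exits non-zero.
    """
    result = run(
        sops_decrypt_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example (hypothetical file name):
# plaintext = decrypt_secret("secrets.enc.yaml")
```

Holding the plaintext only in memory during apply keeps decrypted secrets out of the working tree.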
## Alternatives considered
- **Ansible push**: requires a central control node and separate secrets management, and doesn't keep the audit trail in the same place as the code.
- **Salt or Puppet**: heavier than needed for our scale.
- **Direct UCI edits via API**: works for one-off changes but produces no history and no rollback.