
Architecture — Unified Configuration Platform (CVN-N008)

Status: Draft, awaiting Need-level committee review (CVN-N008)
Version: 0.1
Date: 2026-04-24
ADRs in scope: ADR-25 (no silent fallbacks), ADR-26 (Grafana single entry point), ADR-56 (FTF-testable changes), ADR-59 (config in PostgreSQL), ADR-65 (DAG params run-level only)
Structurizr: views ConfigurationPlatform and Containers in workspace.dsl


1. Purpose

This document defines the runtime architecture for the Unified Configuration Platform described in CVN-N008. It fixes the cross-container interactions, data model, distribution mechanism, and safe-boundary contract. Component-level decisions are deferred to the upcoming design docs (one per epic: EA, EB, EC, ED).

The platform replaces the legacy Streamlit admin tool and binds together four concerns: (1) a modern web Console, (2) a database-resident parameter catalog, (3) immutable Parameter Sets pinning every run, and (4) real-time distribution of configuration changes to long-lived trading workers under explicit safety invariants.


2. System context

The platform sits inside the existing CVNTrade system. It adds two new containers (Configuration Console, Configuration API) and extends the responsibilities of three existing containers (PostgreSQL, Runtime, Airflow). The Structurizr Containers view is the canonical model; the mermaid diagram below is a readable extract focused on the write path and the runtime-distribution path.

flowchart LR
    operator([Operator])

    subgraph cvntrade[CVNTrade Platform]
        direction LR
        console["Configuration Console<br/>Next.js · shadcn/ui"]
        configApi["Configuration API<br/>Zod validation · RBAC"]
        postgres[("PostgreSQL<br/>ftf_config<br/>variable_catalog<br/>parameter_sets<br/>config_history")]
        airflow["Airflow<br/>FTF · backtest · train"]
        runtime["Runtime<br/>paper + live workers"]
        grafana["Grafana<br/>observability"]
    end

    operator -- "edit, approve, rollback" --> console
    console -- "REST / JSON" --> configApi
    configApi -- "CRUD + NOTIFY config_changed" --> postgres
    airflow -- "freeze parameter_set on trigger" --> postgres
    runtime -- "LISTEN + read snapshot" --> postgres
    runtime -- "tagged metrics (config_version_id)" --> grafana
    configApi -- "audit + latency metrics" --> grafana
    airflow -- "tagged metrics" --> grafana

Three boundaries are normative:

  • Console ⇄ API — every write is validated twice (client via Zod, server via the same schemas imported into the API). No write reaches PostgreSQL without a typed request.
  • API ⇄ PostgreSQL — the single source of truth. No JSON file fallback in production. The API is the only process that mutates configuration rows.
  • PostgreSQL ⇄ Runtime — the distribution channel. Workers subscribe via LISTEN/NOTIFY and read snapshots; they never poll the Console or hold in-memory caches that diverge from the database.

3. Containers — responsibilities

| Container | Role | Technology | New vs. existing |
|---|---|---|---|
| Configuration Console | Parameter editor, catalog admin, run history with diff + re-run, audit trail with restore, real-time worker dashboard | Next.js (App Router), shadcn/ui, TypeScript, Zod, TanStack Query, TanStack Table | New (replaces the Streamlit part of the old Console + Frontend container) |
| Trading Frontend | Trading dashboards, reports | React SPA | Existing, unchanged |
| Configuration API | Typed writes to ftf_config, variable_catalog, parameter_sets; RBAC; approval workflow; publishes change events via NOTIFY config_changed | Next.js API Routes or FastAPI (design-doc decision) | New |
| Trading API | Internal operations for Trading Frontend | FastAPI | Existing, unchanged |
| PostgreSQL | Source of truth for config + history + parameter sets + catalog; transport via LISTEN/NOTIFY | Scaleway managed PG | Existing, schema extended |
| Airflow | Freezes a parameter_sets row at run trigger; stores parameter_set_hash on each finetune_runs row | Python, Airflow 2.10 | Existing, extended (task freeze_param_set added between validate_params and persist_run_start) |
| Runtime | Subscribes via LISTEN; applies hot-reloadable changes at safe boundaries; tags every metric with config_version_id | Python | Existing, extended with subscriber + boundary enforcer |
| Grafana | Audit events, distribution latency, per-worker config_version_id, lag-behind alerts | Grafana + Loki + Prometheus | Existing, dashboards added |

4. Data model

Five tables own the platform's state. The ERD below shows columns load-bearing for the architecture; the precise schema lives in the per-epic design docs.

erDiagram
    variable_catalog ||--o{ variable_catalog_history : "tracks edits"
    variable_catalog ||--o{ active_config_entries : "typed by"
    active_config ||--o{ active_config_entries : "composes"
    active_config ||--o{ config_history : "snapshots on save"
    parameter_sets ||--o{ finetune_runs : "binds"
    parameter_sets ||--o{ live_sessions : "binds"
    parameter_sets ||--o{ paper_sessions : "binds"
    config_history ||--o{ parameter_sets : "referenced by"
    variable_catalog_history ||--o{ parameter_sets : "referenced by"

    variable_catalog {
        text name PK
        text type
        jsonb allowed_values
        text default_value
        text description
        text group_name
        bool required
        bool deprecated
        bool hot_reloadable
        bool excluded_from_ftf
        numeric min_value
        numeric max_value
    }

    active_config {
        int id PK
        text environment
        text version
        text locked_by
        timestamptz updated_at
        text updated_by
    }

    active_config_entries {
        int config_id FK
        text name FK
        text value
    }

    config_history {
        bigint id PK
        int config_id
        jsonb base_env
        jsonb diff
        text changed_by
        text change_reason
        timestamptz changed_at
    }

    parameter_sets {
        char64 hash PK
        jsonb base_env
        jsonb variables_catalog
        jsonb factors
        jsonb code_version
        text config_version
        bigint catalog_version
        timestamptz created_at
        bool frozen
    }

    finetune_runs {
        text run_id PK
        char64 parameter_set_hash FK
        text status
    }

    paper_sessions {
        text session_id PK
        char64 parameter_set_hash FK
        text active_config_version
    }

    live_sessions {
        text session_id PK
        char64 parameter_set_hash FK
        text active_config_version
    }

Architectural invariants enforced at the database level:

  • finetune_runs.parameter_set_hash IS NOT NULL — CHECK constraint. Enforced at the DB so no application path can circumvent it.
  • parameter_sets is immutable past a 10-second grace window. A BEFORE UPDATE OR DELETE trigger raises on any mutation with created_at < NOW() - INTERVAL '10 seconds'.
  • variable_catalog.deprecated = TRUE does not remove the row — deleting is blocked by foreign keys in parameter_sets.variables_catalog snapshots (any deletion would break historical reproducibility).
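
The grace-window rule can be illustrated outside the database. The sketch below only mirrors the predicate the trigger is described as using; the function name `mutation_allowed` is hypothetical, and the real enforcement stays in the PostgreSQL trigger.

```python
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the 10-second grace window stated above. This is an
# illustration of the predicate only, not the trigger itself.
GRACE_WINDOW = timedelta(seconds=10)


def mutation_allowed(created_at: datetime, now: datetime) -> bool:
    """Return True while a parameter_sets row is still inside its grace window.

    The trigger raises when created_at < NOW() - 10 s, so a mutation is
    allowed only when created_at >= now - GRACE_WINDOW.
    """
    return created_at >= now - GRACE_WINDOW


now = datetime(2026, 4, 24, 12, 0, 0, tzinfo=timezone.utc)
print(mutation_allowed(now - timedelta(seconds=5), now))   # True
print(mutation_allowed(now - timedelta(seconds=11), now))  # False
```

A row created exactly 10 seconds ago is still mutable under this reading, because the trigger condition uses a strict `<`.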

5. Parameter lifecycle (state model)

Every parameter value goes through the states below. Transitions are guarded by RBAC, validation, and approval rules (for parameters tagged critical).

stateDiagram-v2
    [*] --> Draft: editor saves change<br/>(not yet active)
    Draft --> Rejected: approver rejects<br/>OR validator fails
    Draft --> PendingApproval: editor submits<br/>(if param is critical)
    PendingApproval --> Approved: approver signs
    PendingApproval --> Rejected: approver rejects
    Draft --> Approved: editor saves<br/>(if param is non-critical)
    Approved --> Active: API writes to<br/>active_config + NOTIFY
    Active --> Propagating: distribution channel<br/>delivers to subscribers
    Propagating --> AppliedByAllWorkers: every subscriber<br/>confirmed apply
    Propagating --> LaggingWorker: at least one worker<br/>not caught up
    LaggingWorker --> AppliedByAllWorkers: worker catches up
    LaggingWorker --> AlertPagesOnCall: lag > threshold
    AppliedByAllWorkers --> Active
    Active --> Superseded: newer value written
    Rejected --> [*]
    Superseded --> [*]

Notes:

  • Frozen in parameter_sets snapshots — an Active value is captured into every Parameter Set created while it is active. A later supersede does not alter past snapshots, so reproducibility of any historical run is guaranteed.
  • Approval workflow applies only to parameters tagged critical = TRUE in the catalog. Others go Draft → Approved directly.
  • Rollback from the audit trail re-enters the state machine at Draft with the restored value as payload, then follows the normal flow (validation + approval if critical).
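
As a minimal sketch of the write-side part of this state model, the transitions can be expressed as a lookup table plus a guard on the critical flag. The names `State`, `ALLOWED`, and `can_transition` are hypothetical, and the propagation states (Propagating, LaggingWorker, AppliedByAllWorkers) are deliberately left out to keep the example short.

```python
from enum import Enum


class State(Enum):
    DRAFT = "Draft"
    PENDING_APPROVAL = "PendingApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    ACTIVE = "Active"
    SUPERSEDED = "Superseded"


# Transitions copied from the state diagram (write path only).
ALLOWED = {
    State.DRAFT: {State.PENDING_APPROVAL, State.APPROVED, State.REJECTED},
    State.PENDING_APPROVAL: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.ACTIVE},
    State.ACTIVE: {State.SUPERSEDED},
    State.REJECTED: set(),
    State.SUPERSEDED: set(),
}


def can_transition(src: State, dst: State, critical: bool) -> bool:
    """Check a transition, applying the approval rule for critical parameters."""
    if dst not in ALLOWED[src]:
        return False
    if src is State.DRAFT and dst is State.APPROVED:
        return not critical  # only non-critical parameters skip approval
    if src is State.DRAFT and dst is State.PENDING_APPROVAL:
        return critical  # only critical parameters enter the approval queue
    return True


print(can_transition(State.DRAFT, State.APPROVED, critical=True))   # False
print(can_transition(State.DRAFT, State.APPROVED, critical=False))  # True
```

A rollback would enter this table at `State.DRAFT` like any other edit, which is why it needs no transitions of its own.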

6. Real-time distribution — sequence

This is the critical cross-container interaction. From the "Save" click to in-memory application on the worker, the SLO is 5 seconds at p95.

sequenceDiagram
    autonumber
    actor Operator
    participant Console as Configuration<br/>Console
    participant Api as Configuration<br/>API
    participant Pg as PostgreSQL
    participant Audit as config_history
    participant Worker as Runtime worker<br/>(paper/live)
    participant Grafana

    Operator->>Console: edit parameter, Save
    Console->>Console: Zod validation (client)
    Console->>Api: POST /api/config/{env}/{key}
    Api->>Api: Zod validation + RBAC + coherence rules
    Api->>Pg: BEGIN; SELECT FOR UPDATE active_config
    Api->>Pg: UPDATE active_config + INSERT config_history
    Api->>Pg: NOTIFY config_changed 'env=prod,key=CVN_X,version=N+1'
    Api->>Pg: COMMIT
    Api->>Audit: (implicit, same tx)
    Api-->>Console: 200 { version: N+1 }
    Console-->>Operator: success + diff preview

    par distribution
        Pg-->>Worker: NOTIFY payload
        Worker->>Pg: SELECT snapshot for parameter N+1
        Worker->>Worker: schedule apply at next safe boundary
        Worker->>Worker: boundary reached → swap in-memory config
        Worker->>Grafana: emit event config_version_applied N+1
    and observability
        Api->>Grafana: emit event config_change_saved (latency t0)
        Worker->>Grafana: emit metric distribution_latency_ms = t_apply - t0
    end

Key properties:

  • Atomicity of write + notify: NOTIFY is transactional in PostgreSQL; the subscribers receive the payload only if the transaction commits. A rollback cancels the notification with no orphan state.
  • Transactional serialization: SELECT FOR UPDATE active_config blocks concurrent writers; no two writes can race past the lock. Proven pattern, already used by the existing save_config in the Streamlit tool.
  • No polling — workers do not poll PostgreSQL. If the channel drops, the reconnect logic catches up by comparing last-applied version to the current active version (see §9 failure modes).
  • Back-pressure — if a worker is processing a decision cycle when the notification arrives, it queues the apply and swaps at the next safe boundary. The queue depth is bounded (1 pending apply per parameter); a second notification overwrites the first before it fires.
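
The back-pressure rule can be sketched as a small coalescing buffer keyed by parameter. Everything named here (`Change`, `parse_payload`, `PendingApplies`) is hypothetical. The sketch assumes the payload follows the `env=...,key=...,version=...` shape shown in the sequence diagram, that versions are integers, and that values contain no commas. The guard that ignores an older version arriving late is an added assumption, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    env: str
    key: str
    version: int


def parse_payload(payload: str) -> Change:
    """Parse 'env=prod,key=CVN_X,version=42' into a Change."""
    fields = dict(part.split("=", 1) for part in payload.split(","))
    return Change(env=fields["env"], key=fields["key"], version=int(fields["version"]))


class PendingApplies:
    """Holds at most one pending apply per parameter key."""

    def __init__(self) -> None:
        self._pending: dict[str, Change] = {}

    def enqueue(self, change: Change) -> None:
        # A newer notification overwrites the queued one before it fires.
        current = self._pending.get(change.key)
        if current is None or change.version > current.version:
            self._pending[change.key] = change

    def drain(self) -> list[Change]:
        """Called at the safe boundary: return and clear everything queued."""
        changes = list(self._pending.values())
        self._pending.clear()
        return changes


queue = PendingApplies()
queue.enqueue(parse_payload("env=prod,key=CVN_X,version=7"))
queue.enqueue(parse_payload("env=prod,key=CVN_X,version=8"))
print(queue.drain())  # [Change(env='prod', key='CVN_X', version=8)]
```

Keying the buffer by parameter is what keeps the queue depth bounded: however many notifications arrive during a decision cycle, the worker applies one value per parameter at the boundary.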

7. Safe-boundary contract

A trading worker must never apply a configuration change in the middle of: an open trade, an active decision cycle, a model inference call, or a critical section (order submission, position reconciliation). The contract below is enforced by the Runtime container; the Console / API only schedule the apply — they never reach into the worker's execution state.

flowchart TD
    event["Change notification<br/>from PostgreSQL"] --> classify{"Parameter<br/>category?"}
    classify -->|"hot_reloadable=false<br/>(model type, FE version)"| restart["Queue restart-required<br/>flag + alert dashboard.<br/>Apply only on next<br/>pod restart."]
    classify -->|category=threshold| boundaryDecision{"Decision cycle<br/>in flight?"}
    classify -->|category=filter_flag| boundaryDecision
    classify -->|category=cooldown| boundaryDecision
    classify -->|category=risk_limit| boundaryPosition{"Position open?"}
    boundaryDecision -->|yes| queueD["Queue apply.<br/>Fire at cycle end."]
    boundaryDecision -->|no| apply[Swap in-memory config]
    queueD --> cycleEnd[Cycle end event] --> apply
    boundaryPosition -->|yes| queueP["Queue apply.<br/>Fire at position close<br/>OR next heartbeat<br/>whichever later."]
    boundaryPosition -->|no| apply
    queueP --> close[Position close event] --> apply
    apply --> emit["Emit config_version_applied N<br/>to Grafana + structured log"]
    apply --> tag["Tag subsequent metrics<br/>with config_version_id = N"]
    restart --> alertDash["Dashboard: pending restart-required"]

Declaration surface:

  • Each catalog row carries hot_reloadable: boolean and category: text. The category drives the boundary choice per the flowchart.
  • The mapping category → boundary is declared in code (one canonical enum in the Runtime), unit-tested, and documented in the ED design doc.
  • Adding a new category is a code change + ADR, not a catalog change — the set is intentionally small and stable.
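
One way the canonical category → boundary mapping might look in the Runtime is sketched below. The enum `Boundary`, the dict `CATEGORY_BOUNDARY`, and the function `boundary_for` are hypothetical names; only the four categories and the `hot_reloadable` flag come from the flowchart. An unknown category raises instead of defaulting, in the spirit of ADR-25 (no silent fallbacks).

```python
from enum import Enum


class Boundary(Enum):
    CYCLE_END = "cycle_end"            # apply when the decision cycle ends
    POSITION_CLOSE = "position_close"  # apply once the position is closed
    RESTART = "restart"                # apply only on the next pod restart


# Mapping taken from the flowchart; intentionally small and closed.
CATEGORY_BOUNDARY = {
    "threshold": Boundary.CYCLE_END,
    "filter_flag": Boundary.CYCLE_END,
    "cooldown": Boundary.CYCLE_END,
    "risk_limit": Boundary.POSITION_CLOSE,
}


def boundary_for(category: str, hot_reloadable: bool) -> Boundary:
    """Return the safe boundary at which a change of this category may apply."""
    if not hot_reloadable:
        return Boundary.RESTART
    try:
        return CATEGORY_BOUNDARY[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None


print(boundary_for("risk_limit", hot_reloadable=True))   # Boundary.POSITION_CLOSE
print(boundary_for("threshold", hot_reloadable=False))   # Boundary.RESTART
```

Because the mapping is a plain dict in code, a unit test can assert that every category present in the catalog has an entry, which is the property the ED design doc is expected to document.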

8. Console component view

Internal structure of the Next.js Console. The view is not in workspace.dsl (Structurizr components are added per-ADR when load-bearing for a decision); mermaid is sufficient here.

flowchart TB
    subgraph console[Configuration Console — Next.js App Router]
        direction TB

        subgraph pages[Pages]
            pageEditor["/config/{env}<br/>Parameter editor"]
            pageCatalog["/admin/catalog<br/>Variable catalog admin"]
            pageRuns["/runs<br/>Run history + diff"]
            pageHistory["/history<br/>Audit trail + restore"]
            pageDashboard["/dashboard<br/>Worker real-time status"]
        end

        subgraph forms[Form layer]
            hookForm[React Hook Form]
            zodSchemas["Zod schemas<br/>shared client/server"]
            diffViewer["Diff viewer<br/>before / after"]
            approvalFlow["Approval flow<br/>for critical params"]
        end

        subgraph query[Data layer]
            tanstackQuery["TanStack Query<br/>caching + mutations"]
            tanstackTable["TanStack Table<br/>runs / history grids"]
        end

        subgraph realtime[Real-time layer]
            sseClient["SSE or WS client<br/>for worker status"]
            dashboardPanels["Live worker<br/>config_version panels"]
        end

        pages --> forms
        pages --> query
        pageDashboard --> realtime
        forms --> query
        realtime --> query
    end

    query -- "REST / JSON" --> configApi[Configuration API]
    realtime -- "SSE / WS" --> configApi

Key properties:

  • Zod schemas are the single validation source. The same .ts files are imported client-side (in React Hook Form resolvers) and server-side (in API route handlers). A type mismatch is a compile error, not a runtime drift.
  • Mutations via TanStack Query — optimistic updates with server reconciliation. On error, the UI rolls back to server state, no manual recovery dance.
  • Approval flow is a component, not a page — it overlays the editor when a critical parameter is touched, keeping the UX linear.
  • The live dashboard subscribes over SSE (server-sent events) or WS — the API translates PostgreSQL LISTEN payloads into a per-client event stream. No worker talks to the browser directly.

9. Failure modes and mitigations

| Failure mode | Detection | Mitigation |
|---|---|---|
| Worker loses LISTEN connection (TCP reset, PG restart) | Worker heartbeat records last notify timestamp; if stale > 30 s → auto-reconnect | On reconnect, compare last-applied version to active_config.version and replay missed changes. Emit listen_reconnect event |
| NOTIFY fires but the transaction rolls back | Not possible: NOTIFY is transactional in PG, released only on commit | Design feature; documented in §6 |
| Two operators write concurrently | SELECT FOR UPDATE active_config serializes writers; second waits | Operator sees their diff preview re-computed against the now-committed state before their save lands |
| Critical parameter saved without approval | API enforces; client-side UI also enforces | 403 from API, clear error in UI, no side effects |
| Parameter Set insertion race (same hash, two runs) | INSERT ... ON CONFLICT DO NOTHING | Second run reuses the existing row; de-duplication by design |
| Worker applies change in a forbidden section | Boundary enforcer queues apply; apply-in-section is never invoked by construction | Unit test: feed a stream of notifications while in section, assert apply count = 0 |
| Worker falls >N versions behind | Lag-behind alert on dashboard | PagerDuty / Slack alert; runbook instructs a pod restart if the gap is restart-required or a forced recompute trigger if transport is broken |
| Rollback races with a live write | Rollback path uses the same SELECT FOR UPDATE as a forward write | Rollback is a write; it serializes with others |
| Catalog row deleted while referenced by a parameter_set snapshot | Foreign key rejects deletion | Operator must deprecate instead of delete (documented in §4 and ADR introduced with EB) |
| Distribution latency exceeds 5 s p95 | Grafana SLO panel; alert fires on 3 consecutive minutes over target | Transport-level diagnosis: PG NOTIFY queue depth, worker CPU, network. Fallback: switch to Redis pub/sub per ED design doc |
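
The reconnect catch-up in the first row can be sketched as a pure function that lists the versions a worker missed. This assumes versions are monotonically increasing integers, which the `version=N+1` payload suggests but the document does not state. The name `versions_to_replay` is hypothetical.

```python
def versions_to_replay(last_applied: int, active: int) -> list[int]:
    """Return the versions a reconnecting worker still has to apply, in order.

    Assumption: versions are consecutive integers. An active version lower
    than the worker's own is treated as an error rather than silently ignored.
    """
    if active < last_applied:
        raise ValueError(
            f"active version {active} is behind last applied {last_applied}"
        )
    return list(range(last_applied + 1, active + 1))


print(versions_to_replay(5, 8))  # [6, 7, 8]
print(versions_to_replay(8, 8))  # []
```

An empty result means the worker missed nothing while disconnected and only needs to emit the listen_reconnect event.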

10. Deployment topology

flowchart LR
    subgraph ingress[Ingress — nginx + cert-manager]
        consoleRoute[/console.cvntrade.eu/]
        configApiRoute["/config-api.cvntrade.eu/<br/>or /console.cvntrade.eu/api"]
    end

    subgraph cluster[Scaleway Kapsule]
        subgraph platformNs[platform namespace]
            consolePod["Configuration Console pod<br/>Next.js"]
            configApiPod["Configuration API pod<br/>Next.js Routes or FastAPI"]
        end
        subgraph tradingNs[cvntrade namespace]
            airflowPods[Airflow scheduler + workers]
            paperPods[Paper trading workers]
            livePods[Live trading workers]
        end
    end

    pg[(Scaleway managed PG)]

    consoleRoute --> consolePod
    configApiRoute --> configApiPod
    consolePod --> configApiPod
    configApiPod --> pg
    airflowPods --> pg
    paperPods -- LISTEN --> pg
    livePods -- LISTEN --> pg

  • The Configuration Console and API deploy via a new Helm chart under infra/helm/configuration-platform/. The chart mirrors the MLflow chart's structure (image + service + ingress + TLS).
  • configApi holds a persistent connection to PG for NOTIFY publishing and for the LISTEN subscription behind SSE fan-out. The HPA is sized conservatively (2-4 replicas) because the write path is human-paced, not automation-paced.
  • Paper and live worker pods gain a LISTEN connection per pod (one open connection on top of their existing query traffic). Connection pool size on the PG side must accommodate len(workers) × 2 (read + listen).
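
The connection budget is simple arithmetic, shown here for a made-up fleet. The pod counts are hypothetical, and the figure covers worker connections only (Airflow and the Configuration API need their own allowance).

```python
def required_pg_connections(worker_pods: int, per_worker: int = 2) -> int:
    """Connections needed by workers alone: one read + one LISTEN per pod."""
    if worker_pods < 0:
        raise ValueError("worker_pods must be non-negative")
    return worker_pods * per_worker


# Hypothetical fleet: 6 paper pods + 4 live pods.
print(required_pg_connections(6 + 4))  # 20
```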

11. Observability

Reuses the existing Grafana + Loki + Prometheus stack (ADR-26). No new observability container.

  • Audit log — emitted by Configuration API as structured events: config_change_saved, config_change_approved, config_change_applied, parameter_set_frozen, listen_reconnect, worker_lag_behind. Logs land in Loki with config_version, changed_by, env, parameter_name fields.
  • Metrics — distribution latency histogram per env (saved→applied), lag-behind gauge per worker, active version counter per env, rollback counter.
  • Dashboards:
      • configuration-platform-overview: SLO panels, distribution latency, save rate, rollback rate
      • configuration-platform-workers: per worker current version, lag, last apply, pending restart-required
      • configuration-platform-audit: live log stream filtered by env
  • Alerts — worker lag > 3 versions or 60 s, distribution latency p95 > 5 s for 3 minutes, any config_change_applied_in_forbidden_section event (must be zero).
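
The alert conditions above can be written as plain predicates. In practice they would live in Grafana or Prometheus alert rules; the Python below is only a sketch of the logic, with hypothetical function names and a nearest-rank percentile chosen for simplicity.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def worker_lag_alert(versions_behind: int, seconds_behind: float) -> bool:
    """Fire when a worker is more than 3 versions or more than 60 s behind."""
    return versions_behind > 3 or seconds_behind > 60


def latency_alert(
    p95_per_minute_ms: list[float], target_ms: float = 5000, minutes: int = 3
) -> bool:
    """Fire when the last `minutes` per-minute p95 values all exceed the target."""
    recent = p95_per_minute_ms[-minutes:]
    return len(recent) == minutes and all(v > target_ms for v in recent)


print(p95([100.0] * 19 + [9000.0]))                # 100.0
print(worker_lag_alert(4, 0.0))                    # True
print(latency_alert([4000, 6000, 6000, 6000]))     # True
print(latency_alert([6000, 6000, 4000]))           # False
```

The first line shows why the SLO is stated at p95: a single 9-second outlier among twenty applies does not move the percentile.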

12. Security and RBAC

  • Authentication — MVP: basic auth over HTTPS, single shared operator credential rotated quarterly. Value-add: OIDC via an existing provider if available on the cluster.
  • RBAC roles:
      • viewer: read every surface, no writes
      • editor: write non-critical parameters, trigger rollbacks only on non-critical parameters
      • approver: approve / reject critical-parameter changes submitted by editors
      • admin: mutate the variable catalog (add / deprecate variables, edit LoV)
  • Enforcement — server-side in the Configuration API. The Console's UI state mirrors the server's role check; it never "hides" a capability the server would allow, and never allows a capability the server would refuse.
  • Audit — every mutation records (env, name, old, new, changed_by, approved_by, changed_at, reason) in config_history. changed_by is required; the server rejects writes without it.
  • Freeze windows — per env configurable intervals during which writes to production are refused. Enforced at the API; the Console shows a banner + disables save.
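
A server-side check combining the role table with freeze windows might look like the sketch below. The action names, the `PERMISSIONS` dict, and both functions are hypothetical. The sketch assumes roles are not hierarchical (an admin does not implicitly hold editor rights) and that every role can read; neither point is settled by the text.

```python
from datetime import datetime, timezone

# Assumption: flat, non-hierarchical roles; action names are illustrative.
PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write_non_critical", "rollback_non_critical", "submit_critical"},
    "approver": {"read", "approve_critical", "reject_critical"},
    "admin": {"read", "mutate_catalog"},
}


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unknown actions get nothing."""
    return action in PERMISSIONS.get(role, set())


def in_freeze_window(
    now: datetime, windows: list[tuple[datetime, datetime]]
) -> bool:
    """True when `now` falls inside any [start, end) freeze interval."""
    return any(start <= now < end for start, end in windows)


def write_permitted(
    role: str, action: str, now: datetime, windows: list[tuple[datetime, datetime]]
) -> bool:
    """A write needs both the role permission and an open (non-frozen) window."""
    return is_allowed(role, action) and not in_freeze_window(now, windows)


freeze = [(
    datetime(2026, 4, 24, 22, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 25, 6, 0, tzinfo=timezone.utc),
)]
noon = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
midnight = datetime(2026, 4, 25, 0, 0, tzinfo=timezone.utc)
print(write_permitted("editor", "write_non_critical", noon, freeze))      # True
print(write_permitted("editor", "write_non_critical", midnight, freeze))  # False
print(write_permitted("viewer", "write_non_critical", noon, freeze))      # False
```

The Console can call the same two predicates to decide whether to show the banner and disable save, which keeps its UI state aligned with what the server would accept.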

13. Open decisions (for design docs per epic)

| Decision | Owner | Default candidate |
|---|---|---|
| Configuration API implementation: Next.js API routes vs. separate FastAPI service | EA design doc | Next.js routes (simpler deploy, shared Zod schemas); revisit if RBAC complexity or long-poll fan-out grows |
| Real-time transport: PostgreSQL LISTEN/NOTIFY vs. Redis pub/sub vs. NATS | ED design doc | LISTEN/NOTIFY (transactional, low-ops, already available) |
| OIDC provider | EA/EB design doc | Defer to P1; ship MVP with basic auth |
| Dashboard fan-out to browsers: SSE vs. WebSocket | EA design doc | SSE (simpler, one-way, proxy-friendly) |
| Approval workflow UX: inline overlay vs. separate queue page | EB design doc | Inline overlay in the editor; separate queue page reachable from the dashboard |

Each decision becomes an ADR if it fixes an invariant future PRs must respect.


14. References

  • CVN-N008 Need — product-level framing and KPIs
  • Structurizr workspace — containers and relationships (canonical model)
  • ADR-25 — no silent fallbacks
  • ADR-26 — Grafana single entry point for observability
  • ADR-56 — every pipeline change must be FTF-testable
  • ADR-59 — configuration managed in PostgreSQL, Console-only write path
  • ADR-65 — Airflow DAG params run-level only
  • GitHub issue #675 — source discussion with the four expanded comments
  • GitHub issue #673 — triggering incident (PTE silent override)