Architecture: Unified Configuration Platform (CVN-N008)
Status: Draft — awaiting Need-level committee review (CVN-N008)
Version: 0.1
Date: 2026-04-24
ADRs in scope: ADR-25 (no silent fallbacks), ADR-26 (Grafana single entry point), ADR-56 (FTF-testable changes), ADR-59 (config in PostgreSQL), ADR-65 (DAG params run-level only)
Structurizr: views ConfigurationPlatform and Containers in workspace.dsl
1. Purpose
Deliver the runtime architecture for the Unified Configuration Platform as described in CVN-N008. The document fixes the cross-container interactions, data model, distribution mechanism, and safe-boundary contract. Component-level decisions are referenced to their upcoming design docs (one per epic EA / EB / EC / ED).
The platform replaces the legacy Streamlit admin tool and binds together four concerns: (1) a modern web Console, (2) a database-resident parameter catalog, (3) immutable Parameter Sets pinning every run, and (4) real-time distribution of configuration changes to long-lived trading workers under explicit safety invariants.
2. System context
The platform sits inside the existing CVNTrade system. It adds two new containers (Configuration Console, Configuration API) and extends the responsibilities of three existing containers (PostgreSQL, Runtime, Airflow). The Structurizr Containers view is the canonical model; the mermaid diagram below is a readable extract focused on the write path and the runtime-distribution path.
flowchart LR
operator([Operator])
subgraph cvntrade[CVNTrade Platform]
direction LR
console[Configuration Console<br/>Next.js · shadcn/ui]
configApi[Configuration API<br/>Zod validation · RBAC]
postgres[(PostgreSQL<br/>ftf_config<br/>variable_catalog<br/>parameter_sets<br/>config_history)]
airflow[Airflow<br/>FTF · backtest · train]
runtime[Runtime<br/>paper + live workers]
grafana[Grafana<br/>observability]
end
operator -- "edit, approve, rollback" --> console
console -- "REST / JSON" --> configApi
configApi -- "CRUD + NOTIFY config_changed" --> postgres
airflow -- "freeze parameter_set on trigger" --> postgres
runtime -- "LISTEN + read snapshot" --> postgres
runtime -- "tagged metrics (config_version_id)" --> grafana
configApi -- "audit + latency metrics" --> grafana
airflow -- "tagged metrics" --> grafana
Three boundaries are normative:
- Console ⇄ API — every write is validated twice (client via Zod, server via the same schemas imported into the API). No write reaches PostgreSQL without a typed request.
- API ⇄ PostgreSQL — the single source of truth. No JSON file fallback in production. The API is the only process that mutates configuration rows.
- PostgreSQL ⇄ Runtime: the distribution channel. Workers subscribe via LISTEN/NOTIFY and read snapshots; they never poll the Console or hold in-memory caches that diverge from the database.
3. Containers: responsibilities
| Container | Role | Technology | New vs. existing |
|---|---|---|---|
| Configuration Console | Parameter editor, catalog admin, run history with diff + re-run, audit trail with restore, real-time worker dashboard | Next.js (App Router), shadcn/ui, TypeScript, Zod, TanStack Query, TanStack Table | New (replaces the Streamlit part of the old Console + Frontend container) |
| Trading Frontend | Trading dashboards, reports | React SPA | Existing, unchanged |
| Configuration API | Typed writes to ftf_config, variable_catalog, parameter_sets; RBAC; approval workflow; publishes change events via NOTIFY config_changed | Next.js API Routes or FastAPI (design-doc decision) | New |
| Trading API | Internal operations for Trading Frontend | FastAPI | Existing, unchanged |
| PostgreSQL | Source of truth for config + history + parameter sets + catalog; transport via LISTEN/NOTIFY | Scaleway managed PG | Existing, schema extended |
| Airflow | Freezes a parameter_sets row at run trigger; stores parameter_set_hash on each finetune_runs row | Python, Airflow 2.10 | Existing, extended (task freeze_param_set added between validate_params and persist_run_start) |
| Runtime | Subscribes via LISTEN; applies hot-reloadable changes at safe boundaries; tags every metric with config_version_id | Python | Existing, extended with subscriber + boundary enforcer |
| Grafana | Audit events, distribution latency, per-worker config_version_id, lag-behind alerts | Grafana + Loki + Prometheus | Existing, dashboards added |
4. Data model
Five tables own the platform's state. The ERD below shows columns load-bearing for the architecture; the precise schema lives in the per-epic design docs.
erDiagram
variable_catalog ||--o{ variable_catalog_history : "tracks edits"
variable_catalog ||--o{ active_config_entries : "typed by"
active_config ||--o{ active_config_entries : "composes"
active_config ||--o{ config_history : "snapshots on save"
parameter_sets ||--o{ finetune_runs : "binds"
parameter_sets ||--o{ live_sessions : "binds"
parameter_sets ||--o{ paper_sessions : "binds"
config_history ||--o{ parameter_sets : "referenced by"
variable_catalog_history ||--o{ parameter_sets : "referenced by"
variable_catalog {
text name PK
text type
jsonb allowed_values
text default_value
text description
text group_name
bool required
bool deprecated
bool hot_reloadable
bool excluded_from_ftf
numeric min_value
numeric max_value
}
active_config {
int id PK
text environment
text version
text locked_by
timestamptz updated_at
text updated_by
}
active_config_entries {
int config_id FK
text name FK
text value
}
config_history {
bigint id PK
int config_id
jsonb base_env
jsonb diff
text changed_by
text change_reason
timestamptz changed_at
}
parameter_sets {
char64 hash PK
jsonb base_env
jsonb variables_catalog
jsonb factors
jsonb code_version
text config_version
bigint catalog_version
timestamptz created_at
bool frozen
}
finetune_runs {
text run_id PK
char64 parameter_set_hash FK
text status
}
paper_sessions {
text session_id PK
char64 parameter_set_hash FK
text active_config_version
}
live_sessions {
text session_id PK
char64 parameter_set_hash FK
text active_config_version
}
Architectural invariants enforced at the database level:
- finetune_runs.parameter_set_hash IS NOT NULL: CHECK constraint. Enforced at the DB so no application path can circumvent it.
- parameter_sets is immutable past a 10-second grace window. A BEFORE UPDATE OR DELETE trigger raises on any mutation with created_at < NOW() - INTERVAL '10 seconds'.
- variable_catalog.deprecated = TRUE does not remove the row. Deleting is blocked by foreign keys in parameter_sets.variables_catalog snapshots (any deletion would break historical reproducibility).
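The char64 primary key on parameter_sets suggests a hex-encoded SHA-256 digest, although this document does not name the algorithm. A minimal sketch, assuming the hash is computed over a canonical JSON encoding of the snapshot columns; the function name `parameter_set_hash` and the exact set of hashed fields are illustrative assumptions, not the confirmed implementation:

```python
import hashlib
import json
from typing import Any, Mapping


def parameter_set_hash(
    base_env: Mapping[str, Any],
    variables_catalog: Mapping[str, Any],
    factors: Mapping[str, Any],
    code_version: Mapping[str, Any],
) -> str:
    """Return a 64-character hex digest identifying a Parameter Set.

    Assumption: SHA-256 over canonical JSON (sorted keys, no whitespace),
    so two snapshots with identical content map to the same row.
    """
    payload = {
        "base_env": base_env,
        "variables_catalog": variables_catalog,
        "factors": factors,
        "code_version": code_version,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys makes the digest independent of dictionary ordering, which is what a content-addressed primary key needs if two runs with the same configuration are expected to share one row.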
5. Parameter lifecycle (state model)
Every parameter value goes through the states below. Transitions are guarded by RBAC, validation, and approval rules (for parameters tagged critical).
stateDiagram-v2
[*] --> Draft: editor saves change<br/>(not yet active)
Draft --> Rejected: approver rejects<br/>OR validator fails
Draft --> PendingApproval: editor submits<br/>(if param is critical)
PendingApproval --> Approved: approver signs
PendingApproval --> Rejected: approver rejects
Draft --> Approved: editor saves<br/>(if param is non-critical)
Approved --> Active: API writes to<br/>active_config + NOTIFY
Active --> Propagating: distribution channel<br/>delivers to subscribers
Propagating --> AppliedByAllWorkers: every subscriber<br/>confirmed apply
Propagating --> LaggingWorker: at least one worker<br/>not caught up
LaggingWorker --> AppliedByAllWorkers: worker catches up
LaggingWorker --> AlertPagesOnCall: lag > threshold
AppliedByAllWorkers --> Active
Active --> Superseded: newer value written
Rejected --> [*]
Superseded --> [*]
Notes:
- Frozen in parameter_sets snapshots: an Active value is captured into every Parameter Set created while it is active. A later supersede does not alter past snapshots, so reproducibility of any historical run is guaranteed.
- Approval workflow applies only to parameters tagged critical = TRUE in the catalog. Others go Draft → Approved directly.
- Rollback from the audit trail re-enters the state machine at Draft with the restored value as payload, then follows the normal flow (validation + approval if critical).
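The lifecycle can be expressed as a transition table. The sketch below transcribes the state diagram and adds the critical / non-critical guard on the two Draft exits; the names `State`, `TRANSITIONS` and `can_transition` are hypothetical and serve only to illustrate the rules:

```python
from enum import Enum


class State(Enum):
    DRAFT = "Draft"
    PENDING_APPROVAL = "PendingApproval"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    ACTIVE = "Active"
    PROPAGATING = "Propagating"
    LAGGING_WORKER = "LaggingWorker"
    APPLIED_BY_ALL_WORKERS = "AppliedByAllWorkers"
    ALERT_PAGES_ON_CALL = "AlertPagesOnCall"
    SUPERSEDED = "Superseded"


# Allowed transitions, transcribed from the state diagram.
TRANSITIONS: dict[State, frozenset[State]] = {
    State.DRAFT: frozenset({State.REJECTED, State.PENDING_APPROVAL, State.APPROVED}),
    State.PENDING_APPROVAL: frozenset({State.APPROVED, State.REJECTED}),
    State.APPROVED: frozenset({State.ACTIVE}),
    State.ACTIVE: frozenset({State.PROPAGATING, State.SUPERSEDED}),
    State.PROPAGATING: frozenset({State.APPLIED_BY_ALL_WORKERS, State.LAGGING_WORKER}),
    State.LAGGING_WORKER: frozenset({State.APPLIED_BY_ALL_WORKERS, State.ALERT_PAGES_ON_CALL}),
    State.APPLIED_BY_ALL_WORKERS: frozenset({State.ACTIVE}),
}


def can_transition(current: State, target: State, *, critical: bool) -> bool:
    """Return True if the diagram allows current -> target for this parameter."""
    if target not in TRANSITIONS.get(current, frozenset()):
        return False
    # Critical parameters must pass through PendingApproval;
    # non-critical ones go Draft -> Approved directly.
    if current is State.DRAFT and target is State.APPROVED:
        return not critical
    if current is State.DRAFT and target is State.PENDING_APPROVAL:
        return critical
    return True
```

States absent from the table (Rejected, Superseded, AlertPagesOnCall) have no outgoing transitions in the diagram, so any move out of them is refused.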
6. Real-time distribution: sequence
This is the critical cross-container interaction. From the "Save" click to in-memory application on the worker, the SLO is 5 seconds at p95.
sequenceDiagram
autonumber
actor Operator
participant Console as Configuration<br/>Console
participant Api as Configuration<br/>API
participant Pg as PostgreSQL
participant Audit as config_history
participant Worker as Runtime worker<br/>(paper/live)
participant Grafana
Operator->>Console: edit parameter, Save
Console->>Console: Zod validation (client)
Console->>Api: POST /api/config/{env}/{key}
Api->>Api: Zod validation + RBAC + coherence rules
Api->>Pg: BEGIN; SELECT FOR UPDATE active_config
Api->>Pg: UPDATE active_config + INSERT config_history
Api->>Pg: NOTIFY config_changed, 'env=prod,key=CVN_X,version=N+1'
Api->>Pg: COMMIT
Api->>Audit: (implicit, same tx)
Api-->>Console: 200 { version: N+1 }
Console-->>Operator: success + diff preview
par distribution
Pg-->>Worker: NOTIFY payload
Worker->>Pg: SELECT snapshot for parameter N+1
Worker->>Worker: schedule apply at next safe boundary
Worker->>Worker: boundary reached → swap in-memory config
Worker->>Grafana: emit event config_version_applied N+1
and observability
Api->>Grafana: emit event config_change_saved (latency t0)
Worker->>Grafana: emit metric distribution_latency_ms = t_apply - t0
end
Key properties:
- Atomicity of write + notify: NOTIFY is transactional in PostgreSQL; subscribers receive the payload only if the transaction commits. A rollback cancels the notification with no orphan state.
- Transactional serialization: SELECT FOR UPDATE active_config blocks concurrent writers; no two writes can race past the lock. Proven pattern, already used by the existing save_config in the Streamlit tool.
- No polling: workers do not poll PostgreSQL. If the channel drops, the reconnect logic catches up by comparing last-applied version to the current active version (see §9 failure modes).
- Back-pressure: if a worker is processing a decision cycle when the notification arrives, it queues the apply and swaps at the next safe boundary. The queue depth is bounded (1 pending apply per parameter); a second notification for the same parameter overwrites the first if it arrives before the first has been applied.
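The back-pressure rule (one pending apply per parameter, newest wins) can be sketched as follows. The payload format is taken from the sequence diagram; treating `version` as an integer, and the names `ConfigChange`, `parse_payload` and `PendingApplies`, are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    env: str
    key: str
    version: int  # assumption: versions are monotonically increasing integers


def parse_payload(payload: str) -> ConfigChange:
    """Parse a payload such as 'env=prod,key=CVN_X,version=42'."""
    fields = dict(item.split("=", 1) for item in payload.split(","))
    return ConfigChange(
        env=fields["env"], key=fields["key"], version=int(fields["version"])
    )


class PendingApplies:
    """Holds at most one pending apply per parameter key."""

    def __init__(self) -> None:
        self._pending: dict[str, ConfigChange] = {}

    def enqueue(self, change: ConfigChange) -> None:
        current = self._pending.get(change.key)
        # A newer notification replaces the queued one; stale or duplicate
        # notifications are ignored.
        if current is None or change.version > current.version:
            self._pending[change.key] = change

    def drain(self) -> list[ConfigChange]:
        """Called at a safe boundary: return pending changes oldest first and clear."""
        changes = sorted(self._pending.values(), key=lambda c: c.version)
        self._pending.clear()
        return changes
```

Keying the queue by parameter name bounds memory by the catalog size, no matter how many notifications arrive during a long decision cycle. The comma-separated format would break if a key or env ever contained a comma, so a real implementation might prefer a JSON payload.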
7. Safe-boundary contract
A trading worker must never apply a configuration change in the middle of: an open trade, an active decision cycle, a model inference call, or a critical section (order submission, position reconciliation). The contract below is enforced by the Runtime container; the Console / API only schedule the apply — they never reach into the worker's execution state.
flowchart TD
event[Change notification<br/>from PostgreSQL] --> classify{Parameter<br/>category?}
classify -->|"hot_reloadable=false<br/>(model type, FE version)"| restart[Queue restart-required<br/>flag + alert dashboard.<br/>Apply only on next<br/>pod restart.]
classify -->|category=threshold| boundaryDecision{Decision cycle<br/>in flight?}
classify -->|category=filter_flag| boundaryDecision
classify -->|category=cooldown| boundaryDecision
classify -->|category=risk_limit| boundaryPosition{Position open?}
boundaryDecision -->|yes| queueD[Queue apply.<br/>Fire at cycle end.]
boundaryDecision -->|no| apply[Swap in-memory config]
queueD --> cycleEnd[Cycle end event] --> apply
boundaryPosition -->|yes| queueP[Queue apply.<br/>Fire at position close<br/>OR next heartbeat<br/>whichever later.]
boundaryPosition -->|no| apply
queueP --> close[Position close event] --> apply
apply --> emit[Emit config_version_applied N<br/>to Grafana + structured log]
apply --> tag[Tag subsequent metrics<br/>with config_version_id = N]
restart --> alertDash[Dashboard: pending restart-required]
Declaration surface:
- Each catalog row carries hot_reloadable: boolean and category: text. The category drives the boundary choice per the flowchart.
- The mapping category → boundary is declared in code (one canonical enum in the Runtime), unit-tested, and documented in the ED design doc.
- Adding a new category is a code change + ADR, not a catalog change: the set is intentionally small and stable.
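A minimal sketch of what the category → boundary declaration and the classification step might look like. The four category names come from the flowchart; `Boundary`, `ApplyPlan`, `CATEGORY_BOUNDARY` and `plan_apply` are hypothetical names, and the real enum lives in the Runtime code described by the ED design doc:

```python
from enum import Enum


class Boundary(Enum):
    DECISION_CYCLE_END = "decision_cycle_end"
    POSITION_CLOSE = "position_close"


class ApplyPlan(Enum):
    APPLY_NOW = "apply_now"
    QUEUE_UNTIL_BOUNDARY = "queue_until_boundary"
    RESTART_REQUIRED = "restart_required"


# Category -> boundary, transcribed from the flowchart.
CATEGORY_BOUNDARY: dict[str, Boundary] = {
    "threshold": Boundary.DECISION_CYCLE_END,
    "filter_flag": Boundary.DECISION_CYCLE_END,
    "cooldown": Boundary.DECISION_CYCLE_END,
    "risk_limit": Boundary.POSITION_CLOSE,
}


def plan_apply(
    category: str,
    hot_reloadable: bool,
    *,
    cycle_in_flight: bool,
    position_open: bool,
) -> ApplyPlan:
    """Decide how a change notification is handled for one parameter."""
    if not hot_reloadable:
        return ApplyPlan.RESTART_REQUIRED
    # An unknown category raises KeyError rather than defaulting silently.
    boundary = CATEGORY_BOUNDARY[category]
    if boundary is Boundary.DECISION_CYCLE_END:
        blocked = cycle_in_flight
    else:
        blocked = position_open
    return ApplyPlan.QUEUE_UNTIL_BOUNDARY if blocked else ApplyPlan.APPLY_NOW
```

Raising on an unknown category, instead of picking a default boundary, is in the spirit of ADR-25 (no silent fallbacks) and matches the rule that new categories arrive through a code change.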
8. Console component view
Internal structure of the Next.js Console. The view is not in workspace.dsl (Structurizr components are added per-ADR when load-bearing for a decision); mermaid is sufficient here.
flowchart TB
subgraph console[Configuration Console — Next.js App Router]
direction TB
subgraph pages[Pages]
pageEditor["/config/{env}<br/>Parameter editor"]
pageCatalog["/admin/catalog<br/>Variable catalog admin"]
pageRuns["/runs<br/>Run history + diff"]
pageHistory["/history<br/>Audit trail + restore"]
pageDashboard["/dashboard<br/>Worker real-time status"]
end
subgraph forms[Form layer]
hookForm[React Hook Form]
zodSchemas[Zod schemas<br/>shared client/server]
diffViewer[Diff viewer<br/>before / after]
approvalFlow[Approval flow<br/>for critical params]
end
subgraph query[Data layer]
tanstackQuery[TanStack Query<br/>caching + mutations]
tanstackTable[TanStack Table<br/>runs / history grids]
end
subgraph realtime[Real-time layer]
sseClient[SSE or WS client<br/>for worker status]
dashboardPanels[Live worker<br/>config_version panels]
end
pages --> forms
pages --> query
pageDashboard --> realtime
forms --> query
realtime --> query
end
query -- "REST / JSON" --> configApi[Configuration API]
realtime -- "SSE / WS" --> configApi
Key properties:
- Zod schemas are the single validation source. The same .ts files are imported client-side (in React Hook Form resolvers) and server-side (in API route handlers). A type mismatch is a compile error, not a runtime drift.
- Mutations via TanStack Query: optimistic updates with server reconciliation. On error, the UI rolls back to server state, with no manual recovery step.
- Approval flow is a component, not a page: it overlays the editor when a critical parameter is touched, keeping the UX linear.
- The live dashboard subscribes over SSE (server-sent events) or WS: the API translates PostgreSQL LISTEN payloads into a per-client event stream. No worker talks to the browser directly.
9. Failure modes and mitigations
| Failure mode | Detection | Mitigation |
|---|---|---|
| Worker loses LISTEN connection (TCP reset, PG restart) | Worker heartbeat records last notify timestamp; if stale > 30 s → auto-reconnect | On reconnect, compare last-applied version to active_config.version and replay missed changes. Emit listen_reconnect event |
| NOTIFY fires but transaction rolls back | Not possible: NOTIFY is transactional in PG, released only on commit | Design feature; documented in §6 |
| Two operators write concurrently | SELECT FOR UPDATE active_config serializes writers; second waits | Operator sees their diff preview re-computed against the now-committed state before their save lands |
| Critical parameter saved without approval | API enforces; client-side UI also enforces | 403 from API, clear error in UI, no side effects |
| Parameter Set insertion race (same hash, two runs) | INSERT ... ON CONFLICT DO NOTHING | Second run reuses the existing row; de-duplication by design |
| Worker applies change in a forbidden section | Boundary enforcer queues apply; apply-in-section is never invoked by construction | Unit test: feed a stream of notifications while in section, assert apply count = 0 |
| Worker falls >N versions behind | Lag-behind alert on dashboard | PagerDuty / Slack alert; runbook instructs a pod restart if the gap is restart-required or a forced recompute trigger if transport is broken |
| Rollback races with a live write | Rollback path uses the same SELECT FOR UPDATE as a forward write | Rollback is a write; it serializes with others |
| Catalog row deleted while referenced by a parameter_set snapshot | Foreign key rejects deletion | Operator must deprecate instead of delete (documented in §4 and ADR introduced with EB) |
| Distribution latency exceeds 5 s p95 | Grafana SLO panel; alert fires on 3 consecutive minutes over target | Transport-level diagnosis: PG NOTIFY queue depth, worker CPU, network. Fallback: switch to Redis pub/sub per ED design doc |
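The first row of the table (stale LISTEN connection, then catch-up on reconnect) can be sketched as two small helpers. This assumes versions are consecutive integers, whereas the ERD declares active_config.version as text; `listen_is_stale` and `versions_to_replay` are hypothetical names:

```python
from datetime import datetime, timedelta, timezone

# Threshold from the failure-mode table: reconnect when stale for more than 30 s.
STALE_AFTER = timedelta(seconds=30)


def listen_is_stale(last_notify_at: datetime, now: datetime) -> bool:
    """True when the last recorded notify timestamp is older than the threshold."""
    return now - last_notify_at > STALE_AFTER


def versions_to_replay(last_applied: int, active: int) -> list[int]:
    """Versions a worker missed while disconnected, oldest first."""
    if active < last_applied:
        # The worker claims a version newer than the database: refuse to guess.
        raise ValueError(
            f"last applied version {last_applied} is ahead of active version {active}"
        )
    return list(range(last_applied + 1, active + 1))


if __name__ == "__main__":
    t0 = datetime(2026, 4, 24, 12, 0, 0, tzinfo=timezone.utc)
    print(listen_is_stale(t0, t0 + timedelta(seconds=45)))
    print(versions_to_replay(last_applied=5, active=8))
```

An empty list from `versions_to_replay` means the worker was already up to date and the reconnect was harmless.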
10. Deployment topology
flowchart LR
subgraph ingress[Ingress — nginx + cert-manager]
consoleRoute[/console.cvntrade.eu/]
configApiRoute["/config-api.cvntrade.eu/<br/>or /console.cvntrade.eu/api"]
end
subgraph cluster[Scaleway Kapsule]
subgraph platformNs[platform namespace]
consolePod[Configuration Console pod<br/>Next.js]
configApiPod[Configuration API pod<br/>Next.js Routes or FastAPI]
end
subgraph tradingNs[cvntrade namespace]
airflowPods[Airflow scheduler + workers]
paperPods[Paper trading workers]
livePods[Live trading workers]
end
end
pg[(Scaleway managed PG)]
consoleRoute --> consolePod
configApiRoute --> configApiPod
consolePod --> configApiPod
configApiPod --> pg
airflowPods --> pg
paperPods -- LISTEN --> pg
livePods -- LISTEN --> pg
- The Configuration Console and API deploy via a new Helm chart under infra/helm/configuration-platform/. The chart mirrors the MLflow chart's structure (image + service + ingress + TLS).
- configApi holds a persistent connection to PG for NOTIFY publishing and SSE fan-out. HPA sized conservatively (2-4 replicas): the write path is human-paced, not automation-paced.
- Paper and live worker pods gain a LISTEN connection per pod (one open connection on top of their existing query traffic). Connection pool size on the PG side must accommodate len(workers) × 2 (read + listen).
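The connection budget is simple arithmetic. The sketch below applies the len(workers) × 2 rule from the last bullet and, as an added assumption, reserves one persistent connection per Configuration API replica; `pg_connection_budget` is a hypothetical helper, not part of the platform:

```python
def pg_connection_budget(worker_pods: int, api_replicas: int = 0) -> int:
    """Minimum PG connections: two per worker pod (read + listen),
    plus one persistent connection per API replica (assumption)."""
    if worker_pods < 0 or api_replicas < 0:
        raise ValueError("pod counts must be non-negative")
    return worker_pods * 2 + api_replicas


if __name__ == "__main__":
    # Example: 10 worker pods and the upper HPA bound of 4 API replicas.
    print(pg_connection_budget(worker_pods=10, api_replicas=4))
```

The result is a lower bound only; Airflow tasks and the API's own query traffic also draw on the same managed instance.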
11. Observability
Reuses the existing Grafana + Loki + Prometheus stack (ADR-26). No new observability container.
- Audit log: structured events emitted by the Configuration API (config_change_saved, config_change_approved, config_change_applied, parameter_set_frozen, listen_reconnect, worker_lag_behind). Logs land in Loki with config_version, changed_by, env, parameter_name fields.
- Metrics: distribution latency histogram per env (saved→applied), lag-behind gauge per worker, active version counter per env, rollback counter.
- Dashboards:
  - configuration-platform-overview: SLO panels, distribution latency, save rate, rollback rate
  - configuration-platform-workers: per-worker current version, lag, last apply, pending restart-required
  - configuration-platform-audit: live log stream filtered by env
- Alerts: worker lag > 3 versions or 60 s, distribution latency p95 > 5 s for 3 minutes, any config_change_applied_in_forbidden_section event (must be zero).
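The alert thresholds above can be written as plain predicates. In practice they would more likely be Prometheus or Grafana alert rules; this Python sketch only pins down the logic, and the function names and the one-sample-per-minute input shape are assumptions:

```python
from typing import Sequence


def worker_lag_alert(versions_behind: int, seconds_behind: float) -> bool:
    """Fire when a worker lags by more than 3 versions or more than 60 s."""
    return versions_behind > 3 or seconds_behind > 60


def latency_slo_alert(
    p95_per_minute_s: Sequence[float],
    target_s: float = 5.0,
    minutes: int = 3,
) -> bool:
    """Fire when the last `minutes` p95 samples (one per minute) all exceed the target."""
    recent = list(p95_per_minute_s)[-minutes:]
    return len(recent) == minutes and all(value > target_s for value in recent)


def forbidden_section_alert(event_count: int) -> bool:
    """Any config_change_applied_in_forbidden_section event is an alert."""
    return event_count > 0
```

Requiring every one of the last three samples to exceed the target keeps a single slow minute from paging anyone.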
12. Security and RBAC
- Authentication: MVP is basic auth over HTTPS with a single shared operator credential rotated quarterly. Value-add: OIDC via an existing provider if available on the cluster.
- RBAC roles:
  - viewer: read every surface, no writes
  - editor: write non-critical parameters, trigger rollbacks only on non-critical parameters
  - approver: approve / reject critical-parameter changes submitted by editors
  - admin: mutate the variable catalog (add / deprecate variables, edit LoV)
- Enforcement: server-side in the Configuration API. The Console's UI state mirrors the server's role check; it never "hides" a capability the server would allow, and never allows a capability the server would refuse.
- Audit: every mutation records (env, name, old, new, changed_by, approved_by, changed_at, reason) in config_history. changed_by is required; the server rejects writes without it.
- Freeze windows: per-env configurable intervals during which writes to production are refused. Enforced at the API; the Console shows a banner + disables save.
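The role matrix can be sketched as a single server-side check. This assumes roles are not hierarchical (an admin does not inherit editor rights) and that every role may read; the document states neither explicitly. `Role`, `Action` and `is_allowed` are hypothetical names, and `WRITE_PARAM` stands for a direct write that needs no approval:

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    EDITOR = "editor"
    APPROVER = "approver"
    ADMIN = "admin"


class Action(Enum):
    READ = "read"
    WRITE_PARAM = "write_param"
    ROLLBACK = "rollback"
    APPROVE = "approve"
    MUTATE_CATALOG = "mutate_catalog"


def is_allowed(role: Role, action: Action, *, critical: bool = False) -> bool:
    """Return True if `role` may perform `action` on a (non-)critical parameter."""
    if action is Action.READ:
        return True  # assumption: every role can read every surface
    if role is Role.EDITOR:
        # Editors write and roll back non-critical parameters only.
        return action in (Action.WRITE_PARAM, Action.ROLLBACK) and not critical
    if role is Role.APPROVER:
        return action is Action.APPROVE
    if role is Role.ADMIN:
        return action is Action.MUTATE_CATALOG
    return False  # viewer: no writes
```

Because enforcement lives in the API, the Console would call the same predicate (or an equivalent endpoint) to decide which controls to enable, so the two cannot drift apart.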
13. Open decisions (for design docs per epic)
| Decision | Owner | Default candidate |
|---|---|---|
| Configuration API implementation: Next.js API routes vs. separate FastAPI service | EA design doc | Next.js routes (simpler deploy, shared Zod schemas) — revisit if RBAC complexity or long-poll fan-out grows |
| Real-time transport: PostgreSQL LISTEN/NOTIFY vs. Redis pub/sub vs. NATS | ED design doc | LISTEN/NOTIFY (transactional, low-ops, already available) |
| OIDC provider | EA/EB design doc | Defer to P1 — ship MVP with basic auth |
| Dashboard fan-out to browsers: SSE vs. WebSocket | EA design doc | SSE (simpler, one-way, proxy-friendly) |
| Approval workflow UX — inline overlay vs. separate queue page | EB design doc | Inline overlay in the editor; separate queue page reachable from the dashboard |
Each decision becomes an ADR if it fixes an invariant future PRs must respect.
14. References
- CVN-N008 Need — product-level framing and KPIs
- Structurizr workspace — containers and relationships (canonical model)
- ADR-25 — no silent fallbacks
- ADR-26 — Grafana single entry point for observability
- ADR-56 — every pipeline change must be FTF-testable
- ADR-59 — configuration managed in PostgreSQL, Console-only write path
- ADR-65 — Airflow DAG params run-level only
- GitHub issue #675 — source discussion with the four expanded comments
- GitHub issue #673 — triggering incident (PTE silent override)