AI Software-Engineering Studio: Architecture
Document status — target vs. in-place
This document describes the AI-augmented software-engineering studio
built around the CVNTrade algo-trading platform. Where a capability is
live today it is stated as fact and anchored to its ADR / script /
runbook. Where it is a target (the original note called this an
"architecture cible", French for "target architecture") it is explicitly flagged TARGET. The intent is
a faithful description corrected against the real system, not an
aspirational pitch.
1. Purpose
The studio combines, around a single algo-trading codebase:
- a heavily-tooled, modern software-delivery chain;
- product governance fused into the sprint workflow (OpenProject — "OP");
- a multi-agent expert committee built from specialised expertises;
- a test-as-code architecture;
- a full MLOps backbone;
- living, versioned documentation;
- application, ML and operational observability.
The studio is not a code-generation assistant. It is a distributed cognitive organisation: AI agents, CI/CD, ADRs, tests, dashboards and the OP/GitHub workflow jointly produce, control and validate the software.
2. General vision
A complex trading system cannot be steered by a single generalist model. The studio therefore runs a multi-agent Expert Committee (ADR-68) whose specialised members intervene at the key gates of a Story's life-cycle.
Each member is a skill representing one engineering expertise (ML engineer, data scientist, crypto trader, software architect, ops). Members are grounded by a domain RAG and each is instantiated by two independent LLMs (Mistral + Gemini) to obtain cross-readings and reduce blind spots. A consolidation step ("chair") aggregates the dual-LLM opinions into one auditable verdict.
Crucially, and contrary to a common misconception, the committee is
operator-invoked at well-defined gates via a CLI, not fired by an OP
webhook on every status change (see §5.4). Automatic OP-transition
triggering is a TARGET (§16).
3. Architecture principles
3.1 AI-native engineering
AI is a workflow actor, not a peripheral tool. Agents act on: plan review, PR review, experiment / verdict sign-off ("cold-eyes"), ADR consistency, test impact, MLOps impact, and trading / risk impact.
3.2 Human-in-the-loop
Agents produce advisory recommendations (advisory_only: true,
human_review_required: true are stamped on every verdict); the merge /
close decision stays human. The studio's job is to structure the
recommendations, rank the risks and make the trade-offs legible — avoiding
both blind automation and AI-opinion overload.
3.3 Documentation as code (ADR-77)
Documentation is a versioned software artefact and a single source of
truth: MkDocs (docs.cvntrade.eu, strict-mode CI gate — broken file
links fail the build), Structurizr for architecture views (lockstep when
load-bearing for an ADR), the ADR corpus (90+ records) as decision
memory, GitHub for versioning/review. Discrepancy ⇒ the docs site wins
(ADR-77); project memory SSoT is OpenProject (ADR-76). The software
catalog is also code — Backstage-schema YAML in git, every change a PR
(the PR is the audit trail; see §4.7).
3.4 Test as code (ADR-83 → ADR-88)
Tests are executable architectural guard-rails, governed by a locked taxonomy (see §7).
3.5 Observability by design (ADR-26, ADR-30→38, ADR-62)
Observability is part of the architecture, not bolted on. Grafana is the
single entry point (ADR-26); structured event=key=value logs are a stable
interface (ADR-30/32/33/38); every pipeline step emits OpenTelemetry spans
(ADR-62).
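A minimal sketch of what a structured event= line could look like, assuming space-separated key=value pairs with the event name first. The function names, the field names (tokens, usd_cost, model) and the exact layout are illustrative assumptions; the format actually fixed by ADR-30/32/33/38 is authoritative.

```python
import shlex


def format_event(event: str, **fields: object) -> str:
    """Render one log line as 'event=<name> key=value ...' (assumed layout)."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        # shlex.quote keeps values that contain spaces parseable as one token.
        parts.append(f"{key}={shlex.quote(str(value))}")
    return " ".join(parts)


def parse_event(line: str) -> dict[str, str]:
    """Parse a line produced by format_event back into a dict of strings."""
    pairs = (token.split("=", 1) for token in shlex.split(line))
    return {key: value for key, value in pairs}


# Hypothetical field names; only the event name committee_finops comes from this document.
line = format_event(
    "committee_finops", tokens=1234, usd_cost=0.42, model="mistral-large-latest"
)
print(line)
print(parse_event(line))
```

Keeping the writer and the parser symmetric is what lets a log line act as a stable interface: a dashboard query or an agent can rely on the keys without scraping free text.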
3.6 Reproducibility
For an algo-trading platform, reproducibility is critical across backtests, ML experiments, datasets, features, models, simulated decisions, test environments and orchestration pipelines. The combination of MLflow, Feast, Airflow, Hamilton, test-as-code and DVC (ADR-88) is the reproducibility substrate.
4. Component overview
4.1 Product & workflow layer: OpenProject (ADR-69, ADR-76, ADR-81)
OP is the orchestrator and project-memory SSoT. The Story life-cycle is the
8-state model of ADR-81 / process/STORY_WORKFLOW.md:
New → In specification → Specified → In progress → Developed → In testing
→ Tested → Closed (+ On hold, Rejected escapes).
Every gate requires a ritual + verdict + artefact. The committee is
invoked at the Specified gate (plan_review) and the PR gate
(pr_review); a verdict sign-off uses experiment_review ("cold-eyes",
S22 plan §6.4 / RCA discipline).
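The 8-state life-cycle above can be sketched as a small state machine. The state names come from this section; the transition rule (one step forward on the main path, or a jump to an escape state) is an assumption made for illustration, and ADR-81 / process/STORY_WORKFLOW.md remain the reference.

```python
from enum import Enum


class StoryState(Enum):
    NEW = "New"
    IN_SPECIFICATION = "In specification"
    SPECIFIED = "Specified"
    IN_PROGRESS = "In progress"
    DEVELOPED = "Developed"
    IN_TESTING = "In testing"
    TESTED = "Tested"
    CLOSED = "Closed"
    ON_HOLD = "On hold"
    REJECTED = "Rejected"


# The eight main states, in the order given above.
MAIN_PATH = [
    StoryState.NEW,
    StoryState.IN_SPECIFICATION,
    StoryState.SPECIFIED,
    StoryState.IN_PROGRESS,
    StoryState.DEVELOPED,
    StoryState.IN_TESTING,
    StoryState.TESTED,
    StoryState.CLOSED,
]
ESCAPES = {StoryState.ON_HOLD, StoryState.REJECTED}


def allowed_targets(state: StoryState) -> set[StoryState]:
    """Assumed rule: one step forward on the main path, or an escape state."""
    if state is StoryState.CLOSED or state in ESCAPES:
        # Assumption: leaving Closed or an escape state is not modelled here.
        return set()
    next_state = MAIN_PATH[MAIN_PATH.index(state) + 1]
    return {next_state} | ESCAPES


def can_transition(source: StoryState, target: StoryState) -> bool:
    return target in allowed_targets(source)


print(can_transition(StoryState.SPECIFIED, StoryState.IN_PROGRESS))
print(can_transition(StoryState.NEW, StoryState.SPECIFIED))
```

Under this assumed rule a Story cannot skip In specification, which is the kind of invariant a gate check would enforce.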
4.2 Repository & collaboration layer: GitHub
Issues, PRs, branches, reviews, CI/CD, secrets, branch-protection, checks.
GitHub is the anchor between code, agents, tests, ADRs and CI. No direct
push to main — every code change goes through a PR (CLAUDE.md
14-step process); docs/config/dashboards may be pushed directly.
4.3 Infrastructure layer: Scaleway Kapsule
Kubernetes on Scaleway Kapsule: a permanent PRO2-S pool + an
autoscaling PRO2-M compute pool. It hosts Airflow, MLflow, Grafana,
Prometheus, Loki, the Streamlit Console, ZenML (publish job only —
ADR-1: no ZenML Pro), Redis and PostgreSQL. Ingress: nginx + cert-manager
(Let's Encrypt), *.cvntrade.eu. Helm is the single source of runtime
config (infra/helm/.../values-prod.yaml).
4.4 Documentation & architecture layer (ADR-77)
MkDocs (textual) + Structurizr (architecture views). The ADR corpus is the architectural constitution: 90+ records explicit about why past choices were made and which invariants must hold — a first-class RAG corpus for the agents.
4.5 MLOps layer
- MLflow: ML backbone for experiment tracking, params, metrics, artefacts, model registry and run comparison (live; ADR-6 / architecture/MLFLOW.md).
- Feast: feature store, wired in the training pods (feature_repo), offline/online serving, training-serving-skew control (architecture/FEAST_INTEGRATION.md).
- Evidently AI / Giskard: TARGET ML-quality layer (data/model drift, bias, robustness, continuous validation). Today drift/quality is covered operationally by structured events + Grafana + the FTF statistical reports; Evidently/Giskard are the planned formalisation.
4.6 Data / ML orchestration layer
Two orchestrators, by design (ADR-60/61/89):
- Airflow: operational orchestration, i.e. when and in what order tasks run (ingestion, training, batch scoring, reports, quality gates, monitoring). Launchers only (ADR-11/19); schedule=None, manual / event-triggered (ADR-18/22).
- Hamilton: declarative compute graph, i.e. how a feature / metric is computed. Batch DAGs use Hamilton, not imperative code (ADR-61); the unified training harness is a Hamilton plugin registry (ADR-89).
4.7 Software register / Internal Developer Platform (ADR-78)
There is a software-register / service catalog — but it is not
Backstage. It is a lightweight in-house IDP, hosted inside console-next
(the Next.js 14 + shadcn cockpit), that adopts the Backstage catalog
schema (apiVersion: backstage.io/v1alpha1, kind: Component) and the
TechDocs convention as portable formats, without running Backstage
itself.
Decision (traced) — Epic CVN-N012-EA (IDP, wp#77); decision dossier
reviews/2026-04-29-idp-choice-plan.md (committee plan_review 316c39b3
PASSED, strong consensus, 0 blocker) → ADR-78 (stub). Backstage was
deliberately rejected as the primary IDP: console-next was already
ratified (ADR-66: Next.js 14 + shadcn + Storybook + DTCG tokens) and
in-flight as the Streamlit replacement; adopting Backstage would have
forced a costly fork / iframe-embed and a second docs surface
violating ADR-77 (MkDocs = docs SSoT). Verdict: "Extended console-next" —
keep the ratified stack, build the IDP capabilities into it, reuse only
Backstage's portable catalog/TechDocs formats. CVN-N008 + CVN-N012 were
consolidated into one Need to remove the boundary ambiguity.
What is live today:
- Catalog entries: documentation/catalog/*.yaml, one Backstage Component per service. 7 services registered: airflow, console-next, grafana, mlflow, postgresql, redis, s3 (each with spec.owner = an OIDC group, links to UI + Grafana).
- Catalog engine: console-next/lib/catalog/ (schema.ts, parser.ts, owners.ts + tests) with the detail UI at console-next/app/catalog/[name]/page.tsx.
- Governance: the catalog is read-only at runtime; every change is a git PR and the PR is the audit trail (ADR-78 invariant I7). Day-2 procedure: runbook runbooks/catalog-add-service.md (Story CVN-N012-EA-S02, wp#96).
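To make the catalog-entry shape concrete, here is a hedged Python sketch of the checks a validator might apply to one entry. It is not the TypeScript engine in console-next/lib/catalog/; the function name, the example owner group and the set of checked fields are assumptions, and the entry is shown as the dict a YAML loader would produce.

```python
REQUIRED_API_VERSION = "backstage.io/v1alpha1"
REQUIRED_KIND = "Component"


def validate_component(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    if entry.get("apiVersion") != REQUIRED_API_VERSION:
        problems.append("apiVersion must be backstage.io/v1alpha1")
    if entry.get("kind") != REQUIRED_KIND:
        problems.append("kind must be Component")
    if not (entry.get("metadata") or {}).get("name"):
        problems.append("metadata.name is required")
    if not (entry.get("spec") or {}).get("owner"):
        problems.append("spec.owner is required")
    return problems


# Hypothetical entry: the owner group and spec values are placeholders.
grafana = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "grafana"},
    "spec": {"type": "service", "lifecycle": "production", "owner": "platform-ops"},
}

print(validate_component(grafana))
print(validate_component({"kind": "Component"}))
```

Because the catalog is read-only at runtime, a check of this kind would naturally run in CI on the PR that changes the YAML.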
Scope caveat — this is a service/component register (7 infra services), not a per-Python-module registry. Code-level "referencing" is currently carried by the Hamilton-native module standard + OTel spans (ADR-62) + the 90+ ADR corpus, not by the catalog.
TARGET: a true module-level inventory (each application/ML module as a catalogued entity with owners, KPIs and dependency edges) does not exist yet; it is a natural console-next catalog extension under CVN-N012-EA and would give the multi-agent committee a queryable component graph (today the agents reconstruct it from ADRs + OTel + code RAG).
5. Multi-agent architecture
5.1 The Committee as orchestration unit (ADR-68)
The central object is the Expert Committee, invoked by
scripts/expert_committee.py (GUI on port 8502; runbook
OPERATIONS.md §15). It is the default channel for plan review and
substantial PR review.
The five locked skills (one expertise each):
| Skill | Focus |
|---|---|
| expert-ml-engineer | training harness, model code, leakage, metrics |
| expert-data-scientist | statistical validity, power, drift, datasets |
| expert-crypto-trader | trading risk, sizing, costs, market realism |
| expert-architect | ADR conformance, coupling, system invariants |
| expert-ops | infra, CI/CD, observability, runbooks, cost |
Each skill is grounded by a domain RAG (relevant ADRs, technical docs, code patterns, past decisions, test conventions, observability incidents, ML metrics, trading constraints).
5.2 Doubled LLMs
Each skill is run by two independent LLMs — Mistral
(mistral-large-latest) and Gemini — to compare diagnoses, detect
divergence, reduce single-model bias and separate strong consensus from
weak opinion. LiteLLM is the multi-LLM abstraction/routing layer
(model selection, fallback, cost/quota control); Langfuse is the prompt
registry + tracing/observability layer (commun.prompts.langfuse_client;
prompts in prompt-library/, synced to Langfuse labels staging/prod).
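The divergence detection described above can be illustrated with a deterministic sketch. The real consolidation is performed by an LLM prompt (see §13); the Opinion record, the function names and the rule "strong only when every opinion agrees" are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    skill: str    # e.g. "expert-architect"
    model: str    # e.g. "mistral" or "gemini"
    verdict: str  # "PASSED" or "REJECTED"


def divergent_skills(opinions: list[Opinion]) -> list[str]:
    """Skills whose model readings disagree on the verdict."""
    by_skill: dict[str, set[str]] = {}
    for opinion in opinions:
        by_skill.setdefault(opinion.skill, set()).add(opinion.verdict)
    return sorted(skill for skill, verdicts in by_skill.items() if len(verdicts) > 1)


def consensus_strength(opinions: list[Opinion]) -> str:
    """Assumed rule: 'strong' only if every opinion carries the same verdict."""
    return "strong" if len({o.verdict for o in opinions}) == 1 else "weak"


opinions = [
    Opinion("expert-architect", "mistral", "PASSED"),
    Opinion("expert-architect", "gemini", "PASSED"),
    Opinion("expert-ops", "mistral", "PASSED"),
    Opinion("expert-ops", "gemini", "REJECTED"),
]
print(divergent_skills(opinions))
print(consensus_strength(opinions))
```

Surfacing the skills on which the two models disagree is what turns the dual-LLM setup into a signal rather than two redundant opinions.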
5.3 Review types (the real --session-type values)
expert_committee.py accepts exactly four session types:
plan_review, pr_review, experiment_review, general. There is no
separate automatic "spec review" / "design review" stage: specification,
design and plan are reviewed together in one self-contained plan dossier
(documentation/reviews/YYYY-MM-DD-<slug>.md).
- plan_review: at the New → Specified gate. Checks: problem framing, acceptance criteria, hypotheses, dependencies, trading risk, data/ML impact, infra impact, ADR need, design coherence, execution-plan realism. REJECTED blocks implementation until blockers are addressed (or a written waiver on the issue).
- pr_review: for PRs touching src/commun/pipeline|finetune|cache, backtest, training, labels, prod trading (CLAUDE.md §8). Runs on top of the automated CodeRabbit incremental review (§5.5). Checks: scope conformance, code quality, test coverage, regression, ADR consistency, security, performance, ML & trading impact, docs updated.
- experiment_review: "cold-eyes" sign-off on a produced verdict (e.g. the S22 diagnostic chain: REPRODUCED → SEED_INDEPENDENT → CURVE_DEGENERATE → M1/M2 …). Confirms the verdict is correctly and defensibly derived from pre-specified criteria, not post hoc.
- general: ad hoc.
The committee question must be in English (RAG retrieval degrades on non-English input — a hard operating rule).
5.4 Invocation model (corrected)
The committee is operator-invoked via the CLI at two gates plus the verdict sign-off, immediately followed by mandatory traceability:
python scripts/expert_committee.py \
--artifact documentation/reviews/<dossier>.md \
--question "<English question>" \
--session-type plan_review|pr_review|experiment_review \
--issue "#<gh-issue>"
ADR-82 (mandatory) — every committee session, regardless of type, is logged as an OpenProject Meeting immediately after the CLI returns:
python scripts/op_save_committee_as_meeting.py \
--session committee/sessions/<id>_committee.json --linked-wp <wp>
The meeting is state=closed, with the author + the 5 locked expert users
as participants, and the session JSON + reviewed artefact attached. This
makes every AI arbitration auditable inside OP.
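The two-step discipline (committee first, OP meeting second) can be wrapped in a small helper that only builds the argument lists. The flags are the ones shown above; the helper names and the example paths, issue and work-package numbers are placeholders, and the sketch deliberately does not execute anything.

```python
SESSION_TYPES = {"plan_review", "pr_review", "experiment_review", "general"}


def committee_argv(artifact: str, question: str, session_type: str, issue: str) -> list[str]:
    """Argument list for the committee CLI; rejects unknown session types early."""
    if session_type not in SESSION_TYPES:
        raise ValueError(f"unknown session type: {session_type}")
    return [
        "python", "scripts/expert_committee.py",
        "--artifact", artifact,
        "--question", question,
        "--session-type", session_type,
        "--issue", issue,
    ]


def meeting_argv(session_json: str, linked_wp: str) -> list[str]:
    """Argument list for the ADR-82 traceability step that must follow."""
    return [
        "python", "scripts/op_save_committee_as_meeting.py",
        "--session", session_json,
        "--linked-wp", linked_wp,
    ]


# Placeholder values for illustration only.
argv = committee_argv(
    "documentation/reviews/example-plan.md", "Is the plan sound?", "plan_review", "#123"
)
print(" ".join(argv))
print(" ".join(meeting_argv("committee/sessions/example_committee.json", "77")))
```

Each list can be handed to subprocess.run(argv, check=True); passing a list rather than a shell string keeps the English question intact without quoting issues.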
TARGET: automatic triggering on OP transitions / GitHub events (webhook-driven) is not in place; it is the routing-matrix evolution listed in §16. Today the discipline is enforced by the 14-step process + ADR-81 gates, not by automation.
5.5 Automated PR review: CodeRabbit
Independently of the committee, CodeRabbit performs the automated, multi-round incremental PR review (the "CR cycle"). Practice rules learned operationally:
- let the automatic incremental review run on each fix push: it carries forward addressed-state and converges naturally;
- do not force @coderabbitai full review repeatedly: a full re-scan re-raises already-resolved items and invents new theoretical edge cases (it is an LLM re-reading the whole file fresh), causing non-convergence;
- every Major/Minor is fixed or explicitly justified; the cycle ends at 0 actionable (or on explicit operator authorisation), never auto-merged.
CodeRabbit (breadth, every PR) and the committee pr_review (depth, for
substantial PRs) are complementary inputs to the human merge decision.
6. Role of the ADRs
The 90+ ADRs are a central asset: decision memory, trade-off rationale, visible constraints, stable architecture principles, and a RAG-exploitable corpus. In the multi-agent system they act as the architectural constitution: agents check whether a proposal violates a past decision, whether a new decision needs an ADR, whether an existing ADR must be amended, or whether a change creates a systemic inconsistency. Each ADR defines invariants that must never be violated (CLAUDE.md section "Avant toute PR", i.e. "Before any PR").
7. Test as code (ADR-83 → ADR-88)
7.1 Canonical stack & taxonomy
The pinned stack is governed by ADR-84 (foundation stack pick: pytest
8.x + pytest-xdist + Testcontainers + DVC) with fixture-scope discipline
(ADR-85), the CI tier-promotion gate (ADR-86), story-phase test integration
(ADR-87, pytest --story=CVN-NXXX-EX-SYY) and versioned + provenance-tracked
test cases/datasets via DVC (ADR-88).
The marker catalogue is locked in pyproject.toml [tool.pytest.ini_options]
(--strict-markers --strict-config; a zero-type-marker test fails
collection):
| Tier | Markers | p95 budget |
|---|---|---|
| fast (PR-blocking) | unit, property, contract | ≤ 2 min |
| medium (PR + nightly) | cache, integration, dag_smoke | ≤ 10 min |
| nightly | data_quality, ml_behaviour, performance, system_e2e | ≤ 30 min |
| operator-driven | uat, post_deploy_smoke | n/a |
| scope | story("CVN-NXXX-EX-SYY") | n/a |
Complementary plugins support the trading-specific concerns the taxonomy targets: time control (timezone / market-session / scheduling determinism — a blocking concern for backtests), test-order randomisation (hidden cross-test coupling), and disciplined rerun of intermittent tests (CI hygiene — must surface, not mask, flakiness).
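The marker table above can be read as a lookup from markers to CI tier. The sketch below encodes it in plain Python; the marker names are those of the table, while the function name and the rule that a test carrying several type markers runs in its slowest tier are assumptions, not something ADR-86 is known to state.

```python
TIER_MARKERS = {
    "fast": {"unit", "property", "contract"},
    "medium": {"cache", "integration", "dag_smoke"},
    "nightly": {"data_quality", "ml_behaviour", "performance", "system_e2e"},
    "operator-driven": {"uat", "post_deploy_smoke"},
}
# Ordered from fastest to slowest.
TIER_ORDER = ["fast", "medium", "nightly", "operator-driven"]


def tier_of(markers: set[str]) -> str:
    """Slowest tier implied by a test's markers (assumed rule).

    The scope marker 'story' is not a type marker and is ignored here.
    """
    tiers = [tier for tier in TIER_ORDER if TIER_MARKERS[tier] & markers]
    if not tiers:
        # Mirrors the documented behaviour: a test without a type marker is rejected.
        raise ValueError("test carries no type marker")
    return tiers[-1]


print(tier_of({"unit", "story"}))
print(tier_of({"unit", "integration"}))
```

Rejecting a marker set with no type marker mirrors the collection failure that --strict-markers plus the locked catalogue is meant to produce.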
7.2 Blocking trading tests
Treated as gates: no data leakage, timezone coherence, backtest reproducibility, fees, slippage, order sizing, risk constraints, circuit breakers / kill-switch (ADR-71), exchange-error handling, training/inference parity (batch↔stream certificate, ADR / CVN-N005), portfolio invariants.
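As an example of one such gate, a minimal temporal-leakage check might look like the sketch below. It assumes leakage is defined as any training timestamp that reaches the test window once an embargo is added; the function name, the embargo parameter and the sample data are hypothetical and do not describe the platform's own test.

```python
from datetime import datetime, timedelta, timezone


def assert_no_temporal_leakage(
    train_ts: list[datetime],
    test_ts: list[datetime],
    embargo: timedelta = timedelta(0),
) -> None:
    """Raise AssertionError unless every training timestamp precedes the test window."""
    if not train_ts or not test_ts:
        raise ValueError("both splits must be non-empty")
    latest_train = max(train_ts)
    earliest_test = min(test_ts)
    if latest_train + embargo >= earliest_test:
        raise AssertionError(
            f"leakage: latest train sample {latest_train} (+{embargo}) "
            f"reaches the test window starting {earliest_test}"
        )


# Hypothetical hourly samples, kept timezone-aware (UTC) on purpose.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
train = [start + timedelta(hours=h) for h in range(24)]
test = [start + timedelta(days=2, hours=h) for h in range(24)]
assert_no_temporal_leakage(train, test, embargo=timedelta(hours=12))
print("no leakage detected")
```

Using timezone-aware timestamps in the check also exercises the timezone-coherence concern, since Python refuses to compare naive and aware datetimes.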
8. MLOps & ML backbone
- MLflow (live): must trace, per run, the dataset, params, metrics, artefacts, model, code version, environment, run-id and validation status; runs are linked to PRs/Stories when a change affects models. Models are named CVNTrade_{MODEL}_{SYMBOL}_{STRATEGY}.
- Feast (wired): feature definitions, offline/online serving, training-serving-skew control, feature↔model dependency tracing.
- Evidently AI / Giskard (TARGET): formalised data/prediction/target drift, bias, robustness, continuous validation. Drift is currently watched via structured events + Grafana + FTF statistical reports.
9. Hamilton-native modules + Airflow
9.0 Hamilton-native module standard (ADR-61/62/89)
Application / data / ML modules are structured Hamilton-native: each is an
observable, steerable functional unit with OpenTelemetry instrumentation
(ADR-62 — cross-system correlation, distributed debugging, agentic context
reconstruction), runtime parameters resolved from PostgreSQL ftf_config
and editable only via the Console (ADR-59 — never hard-coded Python
defaults that override PG; ADR-90 sharpens this for training
hyper-parameters via CVN_HPO_* keys + a fail-fast resolver + the G5 CI
grep gate), and structured KPIs feeding Grafana / Langfuse / gating /
alerting.
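The fail-fast resolution of hyper-parameters can be sketched as follows. A plain dict stands in for the rows read from PostgreSQL ftf_config; the key names, values, exception class and function signature are hypothetical, and only the CVN_HPO_ prefix and the "no Python default" rule come from this section.

```python
class MissingConfigError(KeyError):
    """Raised when a required hyper-parameter key is absent from the config source."""


def resolve_hpo(config: dict[str, str], key: str, cast=float):
    """Fail-fast lookup: no Python-side default is ever substituted."""
    if not key.startswith("CVN_HPO_"):
        raise ValueError(f"not a hyper-parameter key: {key}")
    if key not in config:
        raise MissingConfigError(key)
    return cast(config[key])


# Stand-in for rows read from ftf_config; names and values are hypothetical.
ftf_config = {"CVN_HPO_LEARNING_RATE": "0.05", "CVN_HPO_MAX_DEPTH": "6"}

print(resolve_hpo(ftf_config, "CVN_HPO_LEARNING_RATE"))
print(resolve_hpo(ftf_config, "CVN_HPO_MAX_DEPTH", int))
```

The absence of a default argument is the point of the design: a missing key stops the run instead of silently training with a value that PostgreSQL never held.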
9.1 Airflow
Operational workflows (ingestion, training, batch scoring, reports, quality
gates, monitoring). Launchers are the only authority (ADR-11/19); DAGs are
schedule=None, created paused, max_active_runs=1 (ADR-18/22); DAGs live
in the git-synced cvntrade-airflow-dags repo, auto-synced by the deploy
CI.
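The three DAG invariants above lend themselves to a simple lint over the keyword arguments a DAG is created with. The sketch assumes "created paused" maps to Airflow's is_paused_upon_creation=True and that schedule is the parameter name in use (Airflow 2.4 and later); the function name is hypothetical and this is not the project's actual check.

```python
def dag_violations(dag_kwargs: dict) -> list[str]:
    """List the ADR-18/22 invariants that a set of DAG keyword arguments breaks."""
    violations = []
    # A missing 'schedule' is treated as a violation: it must be explicitly None.
    if dag_kwargs.get("schedule", "missing") is not None:
        violations.append("schedule must be None")
    if dag_kwargs.get("is_paused_upon_creation") is not True:
        violations.append("DAG must be created paused")
    if dag_kwargs.get("max_active_runs") != 1:
        violations.append("max_active_runs must be 1")
    return violations


compliant = {"schedule": None, "is_paused_upon_creation": True, "max_active_runs": 1}
print(dag_violations(compliant))
print(dag_violations({"schedule": "@daily"}))
```

Requiring schedule to be present and None, rather than merely absent, makes the manual / event-triggered intent explicit in every DAG file.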
9.2 Hamilton
Declarative transformation graph — dependency traceability, modularity, testability, readable feature pipelines (ADR-61).
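The declarative idea (each function is a node, and its parameter names declare its upstream dependencies) can be shown with a toy resolver written in plain Python. This is an illustration of the style only, not the Hamilton API; the node functions and the compute helper are hypothetical.

```python
import inspect


def close_return(close: list[float]) -> list[float]:
    """Simple returns between consecutive close prices."""
    return [later / earlier - 1.0 for earlier, later in zip(close, close[1:])]


def mean_return(close_return: list[float]) -> float:
    """Average of the simple returns; depends on the node named 'close_return'."""
    return sum(close_return) / len(close_return)


# The graph is just the set of functions, keyed by name.
NODES = {fn.__name__: fn for fn in (close_return, mean_return)}


def compute(target: str, inputs: dict) -> object:
    """Resolve `target` by recursively computing the nodes its parameters name."""
    if target in inputs:
        return inputs[target]
    fn = NODES[target]
    kwargs = {name: compute(name, inputs) for name in inspect.signature(fn).parameters}
    return fn(**kwargs)


print(compute("mean_return", {"close": [100.0, 110.0, 121.0]}))
```

Because dependencies are carried by names rather than call order, each node can be unit-tested in isolation and the dependency graph can be read straight from the function signatures.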
9.3 Complementarity
Airflow drives the workflow; Hamilton structures the computation. Typical
chain: Airflow triggers a pipeline → Hamilton defines the feature graph →
Feast stores/serves features → MLflow traces the trained model →
Evidently/Giskard (TARGET) validate drift/quality → Grafana/Loki expose
observability.
10. Observability
- Grafana (ADR-26, single entry point): application, infra, ML, trading, CI/CD, cost, data-quality dashboards.
- Loki: log search, release correlation, error tracking, runtime behaviour, post-mortems. Structured event= logs (ADR-30→38) are the stable machine interface and the primary diagnostic surface (Loki query before any fix is a hard rule).
- Prometheus / OpenTelemetry (ADR-62): metrics + spans; OTel traces let the agents reconstruct the execution context of an incident.
- ML observability: drift, model performance, feature quality, inference latency, distribution anomalies, serving errors, business-metric drift.
- Trading observability: PnL, drawdown, exposure, turnover, slippage, fees, order latency, exchange-error rate, rejected orders, open positions, limit violations.
11. GitHub, CI/CD, secrets & guardrails
GitHub governs delivery (issues, branches, PRs, checks, CI/CD, secrets,
protection rules, review policies). Agent interaction rules: no direct push
to main; agents propose diffs via PR; critical zones need human review;
secrets are never exposed to prompts; agent actions are logged; infra
changes require CI + specialised review.
CI guardrails (live, PR-blocking) — the Story workflow guardrails
job:
- G1: PR title format;
- G2: Story / issue reference in the PR body (Closes #N / Story: CVN-… / OpenProject: wp#N / issue URL);
- G3: plan-dossier presence (documentation/reviews/*-plan.md);
- G4: MLOps-readiness artefact (documentation/stories/<cvn_id>/);
- G5: ADR-90 hyper-parameter-externalisation grep gate.
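A guardrail such as G2 boils down to pattern matching on the PR body. The sketch below accepts the four reference forms listed above; the regular expressions, the GitHub issue-URL shape and the function name are assumptions, not the patterns used by the actual CI job.

```python
import re

# Assumed patterns, one per accepted reference form.
REFERENCE_PATTERNS = [
    re.compile(r"\bCloses #\d+"),
    re.compile(r"\bStory: CVN-\S+"),
    re.compile(r"\bOpenProject: wp#\d+"),
    re.compile(r"https://github\.com/\S+/issues/\d+"),
]


def has_story_reference(pr_body: str) -> bool:
    """True if the PR body carries at least one accepted Story / issue reference."""
    return any(pattern.search(pr_body) for pattern in REFERENCE_PATTERNS)


print(has_story_reference("Fix resolver.\n\nStory: CVN-N012-EA-S02"))
print(has_story_reference("misc cleanup"))
```

Keeping the accepted forms in one list makes the gate easy to extend and keeps the failure message simple: the body matched none of the known reference shapes.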
12. Gating & decision rules
Informational opinions are separated from blocking gates. Blocking
examples: critical-ADR violation; missing tests on a critical module;
risk-engine / kill-switch change without specialised review; secrets/RBAC
change without security validation; feature-store change without
training/serving test; unexplained ML drift; negative backtest regression;
CI failure; a structural change without the architecture doc updated
(ADR-77); a Story passing Developed without its docs live on
docs.cvntrade.eu (no waiver).
13. Chair / consolidator agent
A consolidation step already exists: after the 5 skills × 2 LLMs
produce opinions, a consolidator (Langfuse-managed review-consolidator
prompt, run on gemini-2.5-flash) aggregates them into one verdict. Its
job: aggregate opinions, identify consensus, isolate disagreement, rank
risks, produce a recommended decision, separate blocking / non-blocking,
emit a human-actionable summary. It is the cognitive-governance layer of
the multi-agent system.
Observed verdict schema (committee/sessions/<id>_committee.json):
{
"status": "PASSED | REJECTED",
"code": "OK | …",
"consensus_strength": "strong | weak",
"reason": "…",
"areas_of_agreement": ["… (per-expert attribution)"],
"blockers": [],
"dissent": [],
"recommendations": ["… (advisory)"],
"advisory_only": true,
"human_review_required": true
}
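A consumer of the session JSON could check the observed schema before acting on it. The sketch below verifies the keys shown above and the two advisory flags; the function name and the example payload are hypothetical, and the key list is taken from the observed schema rather than from a published contract.

```python
import json

REQUIRED_KEYS = {
    "status", "code", "consensus_strength", "reason", "areas_of_agreement",
    "blockers", "dissent", "recommendations", "advisory_only", "human_review_required",
}


def check_verdict(raw: str) -> dict:
    """Parse a verdict and enforce the advisory, human-in-the-loop invariants."""
    verdict = json.loads(raw)
    missing = REQUIRED_KEYS - set(verdict)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if verdict["status"] not in {"PASSED", "REJECTED"}:
        raise ValueError(f"unexpected status: {verdict['status']}")
    if verdict["advisory_only"] is not True or verdict["human_review_required"] is not True:
        raise ValueError("verdicts must stay advisory with human review required")
    return verdict


# Hypothetical payload for illustration.
raw = """{
  "status": "PASSED", "code": "OK", "consensus_strength": "strong",
  "reason": "example only", "areas_of_agreement": [], "blockers": [],
  "dissent": [], "recommendations": [],
  "advisory_only": true, "human_review_required": true
}"""
verdict = check_verdict(raw)
print(verdict["status"], verdict["consensus_strength"])
```

Treating a verdict without the two flags as invalid keeps the human-in-the-loop rule of §3.2 machine-checkable rather than a convention.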
Recommended human-facing summary format:
## Review Summary
Decision: APPROVE / REQUEST CHANGES / BLOCK
### Blocking issues
### Non-blocking issues
### ADR impact
### Test impact
### ML impact
### Trading risk impact
### Recommended next actions
14. Risks of the multi-agent system
- Excessive noise: too many agents → too many comments. Mitigation: scope agents per story type, weight opinions, consolidator synthesis, criticality thresholds.
- False consensus: two models can agree and be wrong. Mitigation: strong CI, test-as-code, human review, golden scenarios, explicit contrarian prompting.
- Contextual hallucination: an agent invents a constraint / misreads an ADR. Mitigation: sourced RAG, internal-doc citations, mandatory ADR pointers, no decision without a source.
- Cost & latency. Mitigation: LiteLLM routing, small models for simple tasks, premium models only for critical tasks, analysis cache, incremental reviews, FinOps accounting (every committee session emits event=committee_finops with token + USD cost).
- Responsibility dilution: agents can feel like objective validation. Mitigation: final human decision, explicit owners, audit trail (ADR-82 meetings), documented gates.
15. Maturity model
| Level | Description |
|---|---|
| 1 — Assisted | An LLM occasionally helps write code/docs |
| 2 — Workflow-aware | Agents know issues/PR/tests/ADRs |
| 3 — Committee-based | Specialised agents intervene at OP gates |
| 4 — Governed multi-agent | Consolidator arbitrates, formal gates, traceable decisions |
| 5 — Self-improving | The system learns from its own incidents/CR-failures/post-mortems to improve prompts, tests, ADRs, gating |
The studio sits between levels 3 and 4 — committee + dual-LLM +
consolidator + ADR-82 audit trail + G1-G5 gates are live; full
event-driven routing and the self-improving loop are TARGET.
16. Evolution recommendations
- Event-driven routing matrix (TARGET): auto-trigger the right agents from OP transitions / changed files / labels / criticality / risk history (today: operator-invoked at gates).
- ADR linter (TARGET): auto-detect architectural change → missing / violated / stale ADR.
- Golden backtests: frozen reference scenarios, run before/after every critical change.
- Technical risk register: centralise trading / ML / infra / security / tech-debt / data / observability risks.
- MLflow ↔ PR linkage: every model-affecting PR references its MLflow run, metrics, artefacts, dataset, drift report, Giskard/Evidently validation.
- Per-agent review contracts: explicit scope / inputs / outputs / format / blocking criteria / limits per skill.
- Module-level software register (TARGET): extend the console-next Backstage-schema catalog (§4.7) from 7 services to a per-module inventory (owners, KPIs, dependency edges), giving the committee a queryable component graph instead of ADR+OTel+RAG reconstruction.
- Self-improving loop (TARGET, level 5): feed incidents, CR failures and post-mortems back into prompts, tests, ADRs and gating.
17. Synthesis
The studio's strength is that it does not reduce AI to code generation.
It articulates: the OP product workflow; specialised agents; domain RAG;
dual-LLM instantiation; GitHub; CI/CD + G1-G5 guardrails; the 90+ ADR
constitution; test-as-code; the MLflow/Feast MLOps backbone; the
Backstage-schema software register in console-next (ADR-78); Grafana /
Loki / OTel observability; living MkDocs/Structurizr documentation; and a
consolidator with full ADR-82 audit traceability. The result is a
distributed cognitive infrastructure that assists the design,
implementation, validation and operation of a complex algo-trading
platform. The next structuring steps are event-driven committee routing,
a module-level software register, and the self-improving loop (level 5).
18. Conceptual diagram
flowchart TD
H[Human Lead / Product Owner] --> OP[OP Story life-cycle - ADR-81 8 states]
OP -->|Specified gate| PR1[plan_review]
OP -->|PR gate| PR2[pr_review + CodeRabbit CR cycle]
OP -->|verdict sign-off| PR3[experiment_review cold-eyes]
PR1 --> C[Expert Committee - 5 skills x 2 LLMs Mistral+Gemini]
PR2 --> C
PR3 --> C
C --> CH[Consolidator - review-consolidator / gemini-2.5-flash]
CH --> M[ADR-82 OP Meeting - audit trail]
CH --> GH[GitHub Issue / PR - human decision]
GH --> CI[CI/CD + G1-G5 guardrails]
CI --> K8S[Scaleway Kapsule]
CI --> ML[MLflow / Feast]
CI --> DOC[ADR / MkDocs / Structurizr]
K8S --> OBS[Grafana / Loki / OTel / Prometheus]
ML --> OBS