AI Software-Engineering Studio — Architecture

Document status — target vs. in-place

This document describes the AI-augmented software-engineering studio built around the CVNTrade algo-trading platform. A capability that is live today is stated as fact and anchored to its ADR, script or runbook. A capability that is still a target (the original note called this an "architecture cible", i.e. a target architecture) is explicitly flagged TARGET. The intent is a faithful description corrected against the real system, not an aspirational pitch.

1. Purpose

The studio combines, around a single algo-trading codebase:

  • a heavily-tooled, modern software-delivery chain;
  • product governance fused into the sprint workflow (OpenProject — "OP");
  • a multi-agent expert committee in which each member embodies one specialised engineering expertise;
  • a test-as-code architecture;
  • a full MLOps backbone;
  • living, versioned documentation;
  • application, ML and operational observability.

The studio is not a code-generation assistant. It is a distributed cognitive organisation: AI agents, CI/CD, ADRs, tests, dashboards and the OP/GitHub workflow jointly produce, control and validate the software.

2. General vision

A complex trading system cannot be steered by a single generalist model. The studio therefore runs a multi-agent Expert Committee (ADR-68) whose specialised members intervene at the key gates of a Story's life-cycle.

Each member is a skill representing one engineering expertise (ML engineer, data scientist, crypto trader, software architect, ops). Members are grounded by a domain RAG and each is instantiated by two independent LLMs (Mistral + Gemini) to obtain cross-readings and reduce blind spots. A consolidation step ("chair") aggregates the dual-LLM opinions into one auditable verdict.

Crucially — and contrary to a common misconception — the committee is operator-invoked at well-defined gates via a CLI, not fired by an OP webhook on every status change (see §5.4). Automatic OP-transition triggering is a TARGET (§16.2).

3. Architecture principles

3.1 AI-native engineering

AI is a workflow actor, not a peripheral tool. Agents act on: plan review, PR review, experiment / verdict sign-off ("cold-eyes"), ADR consistency, test impact, MLOps impact, and trading / risk impact.

3.2 Human-in-the-loop

Agents produce advisory recommendations (advisory_only: true, human_review_required: true are stamped on every verdict); the merge / close decision stays human. The studio's job is to structure the recommendations, rank the risks and make the trade-offs legible — avoiding both blind automation and AI-opinion overload.

3.3 Documentation as code (ADR-77)

Documentation is a versioned software artefact and a single source of truth: MkDocs (docs.cvntrade.eu, strict-mode CI gate — broken file links fail the build), Structurizr for architecture views (lockstep when load-bearing for an ADR), the ADR corpus (90+ records) as decision memory, GitHub for versioning/review. Discrepancy ⇒ the docs site wins (ADR-77); project memory SSoT is OpenProject (ADR-76). The software catalog is also code — Backstage-schema YAML in git, every change a PR (the PR is the audit trail; see §4.7).

3.4 Test as code (ADR-83 → ADR-88)

Tests are executable architectural guard-rails, governed by a locked taxonomy (see §7).

3.5 Observability by design (ADR-26, ADR-30→38, ADR-62)

Observability is part of the architecture, not bolted on. Grafana is the single entry point (ADR-26); structured event=key=value logs are a stable interface (ADR-30/32/33/38); every pipeline step emits OpenTelemetry spans (ADR-62).
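
As an illustration of the event=key=value convention, the sketch below formats and parses one log line. The exact field grammar is set by ADR-30/32/33/38 and is not reproduced here: the space-separated key=value layout, the helper names format_event / parse_event and the sample field names are assumptions made for this example.

```python
import shlex


def format_event(event: str, **fields: object) -> str:
    """Render one structured log line: event=<name> key=value ...

    Assumed layout (not taken from the ADRs): space-separated pairs,
    values shell-quoted when they contain spaces or special characters.
    """
    pairs = [("event", event), *sorted(fields.items())]
    return " ".join(f"{key}={shlex.quote(str(value))}" for key, value in pairs)


def parse_event(line: str) -> dict[str, str]:
    """Parse a line produced by format_event back into a dict of strings."""
    result: dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip free-text tokens that are not key=value pairs
            result[key] = value
    return result


# Hypothetical fields on the committee FinOps event.
line = format_event("committee_finops", session_type="plan_review", tokens=1234)
print(line)
print(parse_event(line))
```

Keeping the line flat and machine-parseable is what lets a Loki query filter on a single key without a custom parser per service.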

3.6 Reproducibility

For an algo-trading platform reproducibility is critical: backtests, ML experiments, datasets, features, models, simulated decisions, test environments and orchestration pipelines. The MLflow + Feast + Airflow + Hamilton + test-as-code + DVC (ADR-88) combination is the reproducibility substrate.

4. Component overview

4.1 Product & workflow layer — OpenProject (ADR-69, ADR-76, ADR-81)

OP is the orchestrator and project-memory SSoT. The Story life-cycle is the 8-state model of ADR-81 / process/STORY_WORKFLOW.md:

New → In specification → Specified → In progress → Developed → In testing → Tested → Closed (+ On hold, Rejected escapes).

Every gate requires a ritual + verdict + artefact. The committee is invoked at the Specified gate (plan_review) and the PR gate (pr_review); a verdict sign-off uses experiment_review ("cold-eyes", S22 plan §6.4 / RCA discipline).

4.2 Repository & collaboration layer — GitHub

Issues, PRs, branches, reviews, CI/CD, secrets, branch-protection, checks. GitHub is the anchor between code, agents, tests, ADRs and CI. No direct push to main — every code change goes through a PR (CLAUDE.md 14-step process); docs/config/dashboards may be pushed directly.

4.3 Infrastructure layer — Scaleway Kapsule

Kubernetes on Scaleway Kapsule: a permanent PRO2-S pool + an autoscaling PRO2-M compute pool. It hosts Airflow, MLflow, Grafana, Prometheus, Loki, the Streamlit Console, ZenML (publish job only — ADR-1: no ZenML Pro), Redis and PostgreSQL. Ingress: nginx + cert-manager (Let's Encrypt), *.cvntrade.eu. Helm is the single source of runtime config (infra/helm/.../values-prod.yaml).

4.4 Documentation & architecture layer (ADR-77)

MkDocs (textual) + Structurizr (architecture views). The ADR corpus is the architectural constitution: 90+ records explicit about why past choices were made and which invariants must hold — a first-class RAG corpus for the agents.

4.5 MLOps layer

  • MLflow — ML backbone: experiment tracking, params, metrics, artefacts, model registry, run comparison (live; ADR-6 / architecture/MLFLOW.md).
  • Feast — feature store, wired in the training pods (feature_repo), offline/online serving, training-serving-skew control (architecture/FEAST_INTEGRATION.md).
  • Evidently AI / Giskard (TARGET): ML-quality layer (data/model drift, bias, robustness, continuous validation). Today drift and quality are covered operationally by structured events + Grafana + the FTF statistical reports; Evidently/Giskard are the planned formalisation.

4.6 Data / ML orchestration layer

Two orchestrators, by design (ADR-60/61/89):

  • Airflow — operational orchestration: when and in what order tasks run (ingestion, training, batch scoring, reports, quality gates, monitoring). Launchers only (ADR-11/19); schedule=None, manual / event-triggered (ADR-18/22).
  • Hamilton — declarative compute graph: how a feature / metric is computed. Batch DAGs use Hamilton, not imperative code (ADR-61); the unified training harness is a Hamilton plugin registry (ADR-89).
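
The "how a feature is computed" side can be illustrated without the library itself. In the Hamilton convention each function is named after the value it produces, and its parameter names declare the values it depends on. The toy resolver below is a stand-in written for this document, not Hamilton's API, and the feature names are hypothetical.

```python
import inspect
from typing import Any, Callable


# Each function name is an output; each parameter name is an upstream input.
def returns(close: list[float]) -> list[float]:
    """Simple period-over-period returns."""
    return [b / a - 1.0 for a, b in zip(close, close[1:])]


def mean_return(returns: list[float]) -> float:
    """Average of the returns series."""
    return sum(returns) / len(returns)


def resolve(target: str, nodes: dict[str, Callable[..., Any]], inputs: dict[str, Any]) -> Any:
    """Compute `target` by recursively resolving its declared dependencies.

    No caching and no cycle detection: this is only meant to show the idea.
    """
    if target in inputs:
        return inputs[target]
    fn = nodes[target]
    kwargs = {name: resolve(name, nodes, inputs) for name in inspect.signature(fn).parameters}
    return fn(**kwargs)


NODES = {fn.__name__: fn for fn in (returns, mean_return)}
result = resolve("mean_return", NODES, {"close": [100.0, 110.0, 99.0]})
print(result)
```

Because the dependency edges live in the function signatures, the graph can be inspected and tested node by node, which is the property the batch DAGs rely on.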

4.7 Software register / Internal Developer Platform (ADR-78)

There is a software-register / service catalog — but it is not Backstage. It is a lightweight in-house IDP, hosted inside console-next (the Next.js 14 + shadcn cockpit), that adopts the Backstage catalog schema (apiVersion: backstage.io/v1alpha1, kind: Component) and the TechDocs convention as portable formats, without running Backstage itself.

Decision (traced) — Epic CVN-N012-EA (IDP, wp#77); decision dossier reviews/2026-04-29-idp-choice-plan.md (committee plan_review 316c39b3 PASSED, strong consensus, 0 blocker) → ADR-78 (stub). Backstage was deliberately rejected as the primary IDP: console-next was already ratified (ADR-66: Next.js 14 + shadcn + Storybook + DTCG tokens) and in-flight as the Streamlit replacement; adopting Backstage would have forced a costly fork / iframe-embed and a second docs surface violating ADR-77 (MkDocs = docs SSoT). Verdict: "Extended console-next" — keep the ratified stack, build the IDP capabilities into it, reuse only Backstage's portable catalog/TechDocs formats. CVN-N008 + CVN-N012 were consolidated into one Need to remove the boundary ambiguity.

What is live today:

  • Catalog entries — documentation/catalog/*.yaml, one Backstage Component per service. 7 services registered: airflow, console-next, grafana, mlflow, postgresql, redis, s3 (each with spec.owner = an OIDC group, links to UI + Grafana).
  • Catalog engine — console-next/lib/catalog/ (schema.ts, parser.ts, owners.ts + tests) with the detail UI at console-next/app/catalog/[name]/page.tsx.
  • Governance — the catalog is read-only at runtime; every change is a git PR and the PR is the audit trail (ADR-78 invariant I7). Day-2 procedure: runbook runbooks/catalog-add-service.md (Story CVN-N012-EA-S02, wp#96).
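
A minimal sketch of the kind of check a catalog entry has to pass, written in Python for consistency with the other examples (the live engine is the TypeScript code under console-next/lib/catalog/). The function name, the file name mlflow.yaml, the owner group and the spec.type / spec.lifecycle values are assumptions; only apiVersion, kind and spec.owner come from the description above.

```python
REQUIRED_API_VERSION = "backstage.io/v1alpha1"


def validate_component(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    if entry.get("apiVersion") != REQUIRED_API_VERSION:
        problems.append("apiVersion must be backstage.io/v1alpha1")
    if entry.get("kind") != "Component":
        problems.append("kind must be Component")
    if not entry.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")
    if not entry.get("spec", {}).get("owner"):
        problems.append("spec.owner is required (an OIDC group)")
    return problems


# Hypothetical entry, as it might look once loaded from
# documentation/catalog/mlflow.yaml (file name assumed).
mlflow_entry = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "mlflow"},
    "spec": {"type": "service", "lifecycle": "production", "owner": "group:ml-platform"},
}
print(validate_component(mlflow_entry))
```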

Scope caveat — this is a service/component register (7 infra services), not a per-Python-module registry. Code-level "referencing" is currently carried by the Hamilton-native module standard + OTel spans (ADR-62) + the 90+ ADR corpus, not by the catalog.

TARGET — a true module-level inventory (each application/ML module as a catalogued entity with owners, KPIs and dependency edges) does not exist yet; it is a natural console-next catalog extension under CVN-N012-EA and would give the multi-agent committee a queryable component graph (today the agents reconstruct it from ADRs + OTel + code RAG).

5. Multi-agent architecture

5.1 The Committee as orchestration unit (ADR-68)

The central object is the Expert Committee, invoked by scripts/expert_committee.py (GUI on port 8502; runbook OPERATIONS.md §15). It is the default channel for plan review and substantial PR review.

The five locked skills (one expertise each):

| Skill | Focus |
|---|---|
| expert-ml-engineer | training harness, model code, leakage, metrics |
| expert-data-scientist | statistical validity, power, drift, datasets |
| expert-crypto-trader | trading risk, sizing, costs, market realism |
| expert-architect | ADR conformance, coupling, system invariants |
| expert-ops | infra, CI/CD, observability, runbooks, cost |

Each skill is grounded by a domain RAG (relevant ADRs, technical docs, code patterns, past decisions, test conventions, observability incidents, ML metrics, trading constraints).

5.2 Doubled LLMs

Each skill is run by two independent LLMs — Mistral (mistral-large-latest) and Gemini — to compare diagnoses, detect divergence, reduce single-model bias and separate strong consensus from weak opinion. LiteLLM is the multi-LLM abstraction/routing layer (model selection, fallback, cost/quota control); Langfuse is the prompt registry + tracing/observability layer (commun.prompts.langfuse_client; prompts in prompt-library/, synced to Langfuse labels staging/prod).
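
The cross-reading idea can be sketched as follows: group the opinions by skill and separate the skills where both models agree from those where they diverge. The Opinion structure and the "pass" / "block" labels are hypothetical and are not the committee's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    skill: str    # e.g. "expert-architect"
    model: str    # "mistral" or "gemini"
    verdict: str  # "pass" or "block" (hypothetical labels)


def split_consensus(opinions: list[Opinion]) -> tuple[list[str], list[str]]:
    """Return (skills where the models agree, skills where they diverge)."""
    by_skill: dict[str, set[str]] = {}
    for op in opinions:
        by_skill.setdefault(op.skill, set()).add(op.verdict)
    agreed = sorted(s for s, verdicts in by_skill.items() if len(verdicts) == 1)
    diverged = sorted(s for s, verdicts in by_skill.items() if len(verdicts) > 1)
    return agreed, diverged


opinions = [
    Opinion("expert-architect", "mistral", "pass"),
    Opinion("expert-architect", "gemini", "pass"),
    Opinion("expert-crypto-trader", "mistral", "pass"),
    Opinion("expert-crypto-trader", "gemini", "block"),
]
agreed, diverged = split_consensus(opinions)
print(agreed, diverged)
```

A diverging skill is not a failure in itself; it marks the place where the human reviewer should read both opinions rather than the summary.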

5.3 Review types (the real --session-type values)

expert_committee.py accepts exactly four session types: plan_review, pr_review, experiment_review, general. There is no separate automatic "spec review" / "design review" stage: specification, design and plan are reviewed together in one self-contained plan dossier (documentation/reviews/YYYY-MM-DD-<slug>.md).

  • plan_review — at the New → Specified gate. Checks: problem framing, acceptance criteria, hypotheses, dependencies, trading risk, data/ML impact, infra impact, ADR need, design coherence, execution-plan realism. REJECTED blocks implementation until blockers are addressed (or a written waiver on the issue).
  • pr_review — for PRs touching src/commun/pipeline|finetune|cache, backtest, training, labels, prod trading (CLAUDE.md §8). Runs on top of the automated CodeRabbit incremental review (§5.5). Checks: scope conformance, code quality, test coverage, regression, ADR consistency, security, performance, ML & trading impact, docs updated.
  • experiment_review ("cold-eyes"): sign-off on a produced verdict (e.g. the S22 diagnostic chain: REPRODUCED → SEED_INDEPENDENT → CURVE_DEGENERATE → M1/M2 …). Confirms that the verdict is correctly and defensibly derived from pre-specified criteria, not post hoc.
  • general — ad-hoc.

The committee question must be in English (RAG retrieval degrades on non-English input — a hard operating rule).

5.4 Invocation model (corrected)

The committee is operator-invoked via the CLI at two gates plus the verdict sign-off, immediately followed by mandatory traceability:

# --session-type takes exactly one value: plan_review, pr_review or experiment_review
python scripts/expert_committee.py \
  --artifact documentation/reviews/<dossier>.md \
  --question "<English question>" \
  --session-type plan_review \
  --issue "#<gh-issue>"

ADR-82 (mandatory) — every committee session, regardless of type, is logged as an OpenProject Meeting immediately after the CLI returns:

python scripts/op_save_committee_as_meeting.py \
  --session committee/sessions/<id>_committee.json --linked-wp <wp>

The meeting is state=closed, with the author + the 5 locked expert users as participants, and the session JSON + reviewed artefact attached. This makes every AI arbitration auditable inside OP.
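
The two-step discipline (review, then mandatory meeting log) could be wrapped as below. This is a sketch, not an existing script: the helper names are hypothetical, only the flags shown in the two commands above are used, and the exit-code behaviour of expert_committee.py is not documented here, so the wrapper deliberately does not depend on it.

```python
import subprocess

SESSION_TYPES = {"plan_review", "pr_review", "experiment_review", "general"}


def committee_cmd(artifact: str, question: str, session_type: str, issue: str) -> list[str]:
    """Build the argv for one committee session."""
    if session_type not in SESSION_TYPES:
        raise ValueError(f"unknown session type: {session_type!r}")
    return [
        "python", "scripts/expert_committee.py",
        "--artifact", artifact,
        "--question", question,
        "--session-type", session_type,
        "--issue", issue,
    ]


def meeting_cmd(session_json: str, linked_wp: str) -> list[str]:
    """Build the argv that logs the session as an OpenProject Meeting."""
    return [
        "python", "scripts/op_save_committee_as_meeting.py",
        "--session", session_json,
        "--linked-wp", linked_wp,
    ]


def run_review(committee: list[str], meeting: list[str]) -> int:
    """Run the committee, then log the meeting whatever the verdict was."""
    review = subprocess.run(committee)   # exit-code semantics are an unknown here
    subprocess.run(meeting, check=True)  # a failed audit-trail write should be loud
    return review.returncode
```

Passing argument lists rather than a shell string avoids quoting problems with the free-text question.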

TARGET — automatic triggering on OP transitions / GitHub events (webhook-driven) is not in place; it is the §16.2 routing-matrix evolution. Today the discipline is enforced by the 14-step process + ADR-81 gates, not by automation.

5.5 Automated PR review — CodeRabbit

Independently of the committee, CodeRabbit performs the automated, multi-round incremental PR review (the "CR cycle"). Practice rules learned operationally:

  • let the automatic incremental review run on each fix push — it carries forward addressed-state and converges naturally;
  • do not force @coderabbitai full review repeatedly — a full re-scan re-raises already-resolved items and invents new theoretical edges (it is an LLM re-reading the whole file fresh), causing non-convergence;
  • every Major/Minor is fixed or explicitly justified; the cycle ends at 0 actionable (or on explicit operator authorisation), never auto-merged.

CodeRabbit (breadth, every PR) and the committee pr_review (depth, for substantial PRs) are complementary inputs to the human merge decision.

6. Role of the ADRs

The 90+ ADRs are a central asset: decision memory, trade-off rationale, visible constraints, stable architecture principles, and a RAG-exploitable corpus. In the multi-agent system they act as the architectural constitution: agents check whether a proposal violates a past decision, whether a new decision needs an ADR, whether an existing ADR must be amended, and whether a change creates a systemic inconsistency. Each ADR defines invariants that must never be violated (CLAUDE.md, section "Avant toute PR", i.e. "Before any PR").

7. Test as code (ADR-83 → ADR-88)

7.1 Canonical stack & taxonomy

The pinned stack is governed by ADR-84 (foundation stack pick: pytest 8.x + pytest-xdist + Testcontainers + DVC) with fixture-scope discipline (ADR-85), the CI tier-promotion gate (ADR-86), story-phase test integration (ADR-87, pytest --story=CVN-NXXX-EX-SYY) and versioned + provenance-tracked test cases/datasets via DVC (ADR-88).

The marker catalogue is locked in pyproject.toml [tool.pytest.ini_options] (--strict-markers --strict-config; a zero-type-marker test fails collection):

| Tier | Markers | p95 budget |
|---|---|---|
| fast (PR-blocking) | unit, property, contract | ≤ 2 min |
| medium (PR + nightly) | cache, integration, dag_smoke | ≤ 10 min |
| nightly | data_quality, ml_behaviour, performance, system_e2e | ≤ 30 min |
| operator-driven | uat, post_deploy_smoke | n/a |
| scope | story("CVN-NXXX-EX-SYY") | n/a |
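
To make the table concrete, the sketch below maps a test's markers to its tier. The marker-to-tier mapping is copied from the table; the rule that a test carrying markers from several tiers runs in the slowest one is an assumption, as is the function name.

```python
TIER_OF_MARKER = {
    "unit": "fast", "property": "fast", "contract": "fast",
    "cache": "medium", "integration": "medium", "dag_smoke": "medium",
    "data_quality": "nightly", "ml_behaviour": "nightly",
    "performance": "nightly", "system_e2e": "nightly",
    "uat": "operator-driven", "post_deploy_smoke": "operator-driven",
}
TIER_ORDER = ["fast", "medium", "nightly", "operator-driven"]


def tier_for(markers: set[str]) -> str:
    """Return the slowest tier implied by a test's type markers.

    The `story` scope marker is not a type marker and is ignored here.
    A test without any type marker is rejected, mirroring the rule that
    such a test fails collection.
    """
    type_markers = markers.intersection(TIER_OF_MARKER)
    if not type_markers:
        raise ValueError("a test needs at least one type marker")
    return max((TIER_OF_MARKER[m] for m in type_markers), key=TIER_ORDER.index)


print(tier_for({"unit"}))
print(tier_for({"unit", "integration", "story"}))
```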

Complementary plugins support the trading-specific concerns the taxonomy targets: time control (timezone / market-session / scheduling determinism — a blocking concern for backtests), test-order randomisation (hidden cross-test coupling), and disciplined rerun of intermittent tests (CI hygiene — must surface, not mask, flakiness).

7.2 Blocking trading tests

Treated as gates: no data leakage, timezone coherence, backtest reproducibility, fees, slippage, order sizing, risk constraints, circuit breakers / kill-switch (ADR-71), exchange-error handling, training/inference parity (batch↔stream certificate, ADR / CVN-N005), portfolio invariants.
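
As one example of such a gate, a temporal no-leakage check might look like the sketch below. It is an illustration under stated assumptions, not the project's actual test: the function name and the optional embargo parameter are hypothetical.

```python
from datetime import datetime, timedelta


def check_no_lookahead(
    train_ts: list[datetime],
    test_ts: list[datetime],
    embargo: timedelta = timedelta(0),
) -> None:
    """Raise AssertionError unless the latest training timestamp precedes
    the earliest test timestamp by more than `embargo`."""
    if not train_ts or not test_ts:
        raise ValueError("both splits must be non-empty")
    latest_train, earliest_test = max(train_ts), min(test_ts)
    if latest_train + embargo >= earliest_test:
        raise AssertionError(
            f"leakage: train ends {latest_train}, test starts {earliest_test}, "
            f"embargo {embargo}"
        )


train = [datetime(2024, 1, d) for d in (1, 2, 3)]
test = [datetime(2024, 1, d) for d in (5, 6)]
check_no_lookahead(train, test, embargo=timedelta(days=1))  # passes silently
```

Checking the boundary on timestamps rather than on row indices keeps the gate valid when the dataset is re-sorted or resampled.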

8. MLOps & ML backbone

  • MLflow (live) — must trace, per run: dataset, params, metrics, artefacts, model, code version, environment, run-id, validation status; linked to PRs/Stories when a change affects models. Models named CVNTrade_{MODEL}_{SYMBOL}_{STRATEGY}.
  • Feast (wired) — feature definitions, offline/online serving, training-serving-skew control, feature↔model dependency tracing.
  • Evidently AI / Giskard (TARGET) — formalised data/prediction/target drift, bias, robustness, continuous validation. Drift is currently watched via structured events + Grafana + FTF statistical reports.
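
The registry naming convention CVNTrade_{MODEL}_{SYMBOL}_{STRATEGY} can be enforced with a small helper. This is a sketch: the function name, the sample values and the restriction of each part to letters and digits (so that the underscore stays an unambiguous separator) are assumptions.

```python
import re

_PART = re.compile(r"[A-Za-z0-9]+")


def registry_name(model: str, symbol: str, strategy: str) -> str:
    """Build a model-registry name following CVNTrade_{MODEL}_{SYMBOL}_{STRATEGY}."""
    parts = (model, symbol, strategy)
    for part in parts:
        # Assumed rule: no underscores or separators inside a part.
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "CVNTrade_{}_{}_{}".format(*parts)


# Hypothetical model, symbol and strategy values.
print(registry_name("XGB", "BTCUSDT", "trend"))
```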

9. Hamilton-native modules + Airflow

9.0 Hamilton-native module standard (ADR-61/62/89)

Application / data / ML modules are structured Hamilton-native: each is an observable, steerable functional unit with OpenTelemetry instrumentation (ADR-62 — cross-system correlation, distributed debugging, agentic context reconstruction), runtime parameters resolved from PostgreSQL ftf_config and editable only via the Console (ADR-59 — never hard-coded Python defaults that override PG; ADR-90 sharpens this for training hyper-parameters via CVN_HPO_* keys + a fail-fast resolver + the G5 CI grep gate), and structured KPIs feeding Grafana / Langfuse / gating / alerting.
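
The "fail-fast, no code default" rule can be sketched as follows. A plain dict stands in for the rows read from PostgreSQL ftf_config; the resolver name, the exception class and the key CVN_HPO_LEARNING_RATE are hypothetical, and only the CVN_HPO_ prefix comes from the text above.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


class MissingConfigError(KeyError):
    """Raised when a required hyper-parameter is absent from the config store."""


def resolve_hpo(config: Mapping[str, str], key: str, cast: Callable[[str], T]) -> T:
    """Return the configured value for `key`, or fail loudly.

    There is deliberately no `default` argument: a missing key must stop
    the run instead of silently falling back to a value defined in code.
    """
    if not key.startswith("CVN_HPO_"):
        raise ValueError(f"not a hyper-parameter key: {key!r}")
    try:
        raw = config[key]
    except KeyError:
        raise MissingConfigError(f"{key} is not set in ftf_config") from None
    return cast(raw)


# Stand-in for values loaded from ftf_config (hypothetical key and value).
ftf_config = {"CVN_HPO_LEARNING_RATE": "0.05"}
learning_rate = resolve_hpo(ftf_config, "CVN_HPO_LEARNING_RATE", float)
print(learning_rate)
```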

9.1 Airflow

Operational workflows (ingestion, training, batch scoring, reports, quality gates, monitoring). Launchers are the only authority (ADR-11/19); DAGs are schedule=None, created paused, max_active_runs=1 (ADR-18/22); DAGs live in the git-synced cvntrade-airflow-dags repo, auto-synced by the deploy CI.
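
The three DAG settings above correspond to the Airflow DAG constructor arguments schedule, is_paused_upon_creation and max_active_runs (assuming Airflow 2.4 or later, where the argument is named schedule). The lint-style check below is a hypothetical helper that works on a plain kwargs dict so it can run without Airflow installed; it is not part of the deploy CI described here.

```python
REQUIRED_DAG_KWARGS = {
    "schedule": None,                 # manual / event-triggered only
    "is_paused_upon_creation": True,  # created paused
    "max_active_runs": 1,             # no overlapping runs
}

_MISSING = object()


def dag_kwargs_violations(kwargs: dict) -> list[str]:
    """List the required settings that are absent or set to another value.

    An omitted key counts as a violation: the settings must be explicit.
    """
    return [
        f"{name} must be {expected!r}"
        for name, expected in REQUIRED_DAG_KWARGS.items()
        if kwargs.get(name, _MISSING) != expected
    ]


# Hypothetical DAG arguments; dag_id is illustrative.
good = {"dag_id": "training_launcher", "schedule": None,
        "is_paused_upon_creation": True, "max_active_runs": 1}
print(dag_kwargs_violations(good))
print(dag_kwargs_violations({"dag_id": "adhoc", "schedule": "@daily"}))
```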

9.2 Hamilton

Declarative transformation graph — dependency traceability, modularity, testability, readable feature pipelines (ADR-61).

9.3 Complementarity

Airflow drives the workflow; Hamilton structures the computation. Typical chain: Airflow triggers a pipeline → Hamilton defines the feature graph → Feast stores/serves features → MLflow traces the trained model → Evidently/Giskard (TARGET) validate drift/quality → Grafana/Loki expose observability.

10. Observability

  • Grafana (ADR-26 — single entry point): application, infra, ML, trading, CI/CD, cost, data-quality dashboards.
  • Loki: log search, release correlation, error tracking, runtime behaviour, post-mortems. Structured event= logs (ADR-30→38) are the stable machine interface and the primary diagnostic surface (Loki query before any fix is a hard rule).
  • Prometheus / OpenTelemetry (ADR-62): metrics + spans; OTel traces let the agents reconstruct the execution context of an incident.
  • ML observability: drift, model performance, feature quality, inference latency, distribution anomalies, serving errors, business-metric drift.
  • Trading observability: PnL, drawdown, exposure, turnover, slippage, fees, order latency, exchange-error rate, rejected orders, open positions, limit violations.

11. GitHub, CI/CD, secrets & guardrails

GitHub governs delivery (issues, branches, PRs, checks, CI/CD, secrets, protection rules, review policies). Agent interaction rules: no direct push to main; agents propose diffs via PR; critical zones need human review; secrets are never exposed to prompts; agent actions are logged; infra changes require CI + specialised review.

CI guardrails (live, PR-blocking) — the Story workflow guardrails job:

  • G1 — PR title format;
  • G2 — Story / issue reference in the PR body (Closes #N / Story: CVN-… / OpenProject: wp#N / issue URL);
  • G3 — plan-dossier presence (documentation/reviews/*-plan.md);
  • G4 — MLOps-readiness artefact (documentation/stories/<cvn_id>/);
  • G5 — ADR-90 hyper-parameter-externalisation grep gate.
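
A guardrail such as G2 reduces to a pattern search over the PR body. The sketch below shows one way to express the four accepted reference forms; the regular expressions, including the Story-id shape inferred from identifiers like CVN-N012-EA-S02, are assumptions and not the CI job's actual implementation.

```python
import re

STORY_REF_PATTERNS = [
    re.compile(r"\bCloses #\d+\b", re.IGNORECASE),
    re.compile(r"\bStory:\s*CVN-N\d{3}-E[A-Z0-9]-S\d{2}\b"),
    re.compile(r"\bOpenProject:\s*wp#\d+\b"),
    re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/issues/\d+"),
]


def has_story_reference(pr_body: str) -> bool:
    """True if the PR body contains at least one accepted reference form."""
    return any(pattern.search(pr_body) for pattern in STORY_REF_PATTERNS)


print(has_story_reference("Fix label leakage\n\nCloses #42"))
print(has_story_reference("Refactor only, no ticket"))
```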

12. Gating & decision rules

Informational opinions are separated from blocking gates. Blocking examples: critical-ADR violation; missing tests on a critical module; risk-engine / kill-switch change without specialised review; secrets/RBAC change without security validation; feature-store change without training/serving test; unexplained ML drift; negative backtest regression; CI failure; a structural change without the architecture doc updated (ADR-77); a Story passing Developed without its docs live on docs.cvntrade.eu (no waiver).

13. Chair / consolidator agent

A consolidation step already exists: after the 5 skills × 2 LLMs produce opinions, a consolidator (Langfuse-managed review-consolidator prompt, run on gemini-2.5-flash) aggregates them into one verdict. Its job: aggregate opinions, identify consensus, isolate disagreement, rank risks, produce a recommended decision, separate blocking / non-blocking, emit a human-actionable summary. It is the cognitive-governance layer of the multi-agent system.

Observed verdict schema (committee/sessions/<id>_committee.json):

{
  "status": "PASSED | REJECTED",
  "code": "OK | …",
  "consensus_strength": "strong | weak",
  "reason": "…",
  "areas_of_agreement": ["… (per-expert attribution)"],
  "blockers": [],
  "dissent": [],
  "recommendations": ["… (advisory)"],
  "advisory_only": true,
  "human_review_required": true
}
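
A consumer of this schema might validate a verdict before acting on it. The sketch below assumes the verdict is available as a standalone JSON object with exactly the keys listed above; the loader name is hypothetical and the real session file may wrap the verdict in a larger structure.

```python
import json

REQUIRED_KEYS = {
    "status", "code", "consensus_strength", "reason", "areas_of_agreement",
    "blockers", "dissent", "recommendations", "advisory_only",
    "human_review_required",
}


def load_verdict(raw: str) -> dict:
    """Parse a verdict JSON string and enforce the invariants of the schema."""
    verdict = json.loads(raw)
    missing = REQUIRED_KEYS.difference(verdict)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if verdict["status"] not in {"PASSED", "REJECTED"}:
        raise ValueError(f"unexpected status: {verdict['status']!r}")
    if verdict["consensus_strength"] not in {"strong", "weak"}:
        raise ValueError("consensus_strength must be 'strong' or 'weak'")
    # Verdicts are advisory: both flags must be literally true.
    if verdict["advisory_only"] is not True or verdict["human_review_required"] is not True:
        raise ValueError("verdict must be advisory_only and human_review_required")
    return verdict


# Hypothetical verdict content, for illustration only.
sample = {
    "status": "PASSED", "code": "OK", "consensus_strength": "strong",
    "reason": "example", "areas_of_agreement": [], "blockers": [],
    "dissent": [], "recommendations": [], "advisory_only": True,
    "human_review_required": True,
}
verdict = load_verdict(json.dumps(sample))
print(verdict["status"], verdict["consensus_strength"])
```

Rejecting a verdict whose advisory flags are not set keeps any downstream tooling from treating a committee output as an automatic approval.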

Recommended human-facing summary format:

## Review Summary
Decision: APPROVE / REQUEST CHANGES / BLOCK
### Blocking issues
### Non-blocking issues
### ADR impact
### Test impact
### ML impact
### Trading risk impact
### Recommended next actions

14. Risks of the multi-agent system

  • Excessive noise — too many agents → too many comments. Mitigation: scope agents per story type, weight opinions, consolidator synthesis, criticality thresholds.
  • False consensus — two models can agree and be wrong. Mitigation: strong CI, test-as-code, human review, golden scenarios, explicit contrarian prompting.
  • Contextual hallucination — an agent invents a constraint / misreads an ADR. Mitigation: sourced RAG, internal-doc citations, mandatory ADR pointers, no decision without a source.
  • Cost & latency — Mitigation: LiteLLM routing, small models for simple tasks, premium models only for critical tasks, analysis cache, incremental reviews, FinOps accounting (every committee session emits event=committee_finops with token + USD cost).
  • Responsibility dilution — agents can feel like objective validation. Mitigation: final human decision, explicit owners, audit trail (ADR-82 meetings), documented gates.

15. Maturity model

| Level | Description |
|---|---|
| 1: Assisted | An LLM occasionally helps write code/docs |
| 2: Workflow-aware | Agents know issues/PR/tests/ADRs |
| 3: Committee-based | Specialised agents intervene at OP gates |
| 4: Governed multi-agent | Consolidator arbitrates, formal gates, traceable decisions |
| 5: Self-improving | The system learns from its own incidents/CR-failures/post-mortems to improve prompts, tests, ADRs, gating |

The studio sits between levels 3 and 4 — committee + dual-LLM + consolidator + ADR-82 audit trail + G1-G5 gates are live; full event-driven routing and the self-improving loop are TARGET.

16. Evolution recommendations

  1. Event-driven routing matrix (TARGET) — auto-trigger the right agents from OP transitions / changed files / labels / criticality / risk history (today: operator-invoked at gates).
  2. ADR linter (TARGET) — auto-detect architectural change → missing / violated / stale ADR.
  3. Golden backtests — frozen reference scenarios, run before/after every critical change.
  4. Technical risk register — centralise trading / ML / infra / security / tech-debt / data / observability risks.
  5. MLflow ↔ PR linkage — every model-affecting PR references its MLflow run, metrics, artefacts, dataset, drift report, Giskard/Evidently validation.
  6. Per-agent review contracts — explicit scope / inputs / outputs / format / blocking criteria / limits per skill.
  7. Module-level software register (TARGET) — extend the console-next Backstage-schema catalog (§4.7) from 7 services to a per-module inventory (owners, KPIs, dependency edges), giving the committee a queryable component graph instead of ADR+OTel+RAG reconstruction.
  8. Self-improving loop (TARGET, level 5) — feed incidents, CR failures and post-mortems back into prompts, tests, ADRs and gating.

17. Synthesis

The studio's strength is that it does not reduce AI to code generation. It articulates: the OP product workflow; specialised agents; domain RAG; dual-LLM instantiation; GitHub; CI/CD + G1-G5 guardrails; the 90+ ADR constitution; test-as-code; the MLflow/Feast MLOps backbone; the Backstage-schema software register in console-next (ADR-78); Grafana / Loki / OTel observability; living MkDocs/Structurizr documentation; and a consolidator with full ADR-82 audit traceability. The result is a distributed cognitive infrastructure that assists the design, implementation, validation and operation of a complex algo-trading platform. The next structuring steps are event-driven committee routing, a module-level software register, and the self-improving loop (level 5).

18. Conceptual diagram

flowchart TD
    H[Human Lead / Product Owner] --> OP[OP Story life-cycle - ADR-81 8 states]
    OP -->|Specified gate| PR1[plan_review]
    OP -->|PR gate| PR2[pr_review + CodeRabbit CR cycle]
    OP -->|verdict sign-off| PR3[experiment_review cold-eyes]
    PR1 --> C[Expert Committee - 5 skills x 2 LLMs Mistral+Gemini]
    PR2 --> C
    PR3 --> C
    C --> CH[Consolidator - review-consolidator / gemini-2.5-flash]
    CH --> M[ADR-82 OP Meeting - audit trail]
    CH --> GH[GitHub Issue / PR - human decision]
    GH --> CI[CI/CD + G1-G5 guardrails]
    CI --> K8S[Scaleway Kapsule]
    CI --> ML[MLflow / Feast]
    CI --> DOC[ADR / MkDocs / Structurizr]
    K8S --> OBS[Grafana / Loki / OTel / Prometheus]
    ML --> OBS