CVNTrade ML Platform — World-Class Design

Status: Committee-reviewed (session cb99b702, 2026-04-18), PASSED, score 8.0/10, strong consensus
Version: 2.0 (2026-04-18), integrates committee feedback
Authors: Architecture team
Reviewers: 5 experts (architect 8.0, ops 9.0, data-scientist 7.5, crypto-trader 7.5, ml-engineer 8.0)

Review summary:

  • Unanimous strengths: tool separation, A/B 4-mode framework, immutable audit (MiFID II / SOC2 ready), 10-layer observability, defense in depth, Model Cards + lineage.
  • Dissent: (1) the 20-day timeline was disputed (3/5 judged it optimistic) and was revised to 30 days; (2) A/B statistical rigor was questioned (2/5 found 50 trades + Cohen's d insufficient for Sortino variance) and was replaced by power analysis with MDE.
  • Critical additions (v2): Feast as explicit feature store, cost model integrated everywhere, calibration as first-class, SR 11-7 Model Risk Management framework, leakage audit DAG, multi-person approval, incident response runbooks, upstream data observability, execution realism monitoring.


1. Executive Summary

This document proposes the target-state ML platform architecture for CVNTrade. It builds on three tools already in the stack — MLflow, Weights & Biases, Evidently — each with exclusive, non-overlapping responsibilities, and introduces A/B testing as a foundational pattern and a unified model observability layer.

The design is incremental: MLflow stays as the foundation. W&B and Evidently get promoted from "installed but under-used" to first-class roles. The production runtime gains a routing layer that makes champion/challenger real (not decorative). There is no single-vendor lock-in: all components are open-source or have free tiers.

Objective: pass an authority-level review (regulator, institutional investor, senior MLOps practitioner) on reproducibility, auditability, safety, and observability.

Effort: ~30 days of focused work (revised from 20 per committee dissent: 3/5 experts judged the original timeline optimistic given dependencies and validation requirements), broken into 14 issues (listed in §9.1).


2. Design Principles

  1. Separation of concerns. Each tool has one job; overlapping features are ignored.
  2. Defense in depth. Three independent monitoring surfaces so no single failure is silent.
  3. Reproducibility by construction. Any model in production can be rebuilt bit-for-bit from its artifacts + code SHA + dataset version.
  4. Immutable audit trail. Promotions, demotions, killswitch events are append-only, signed, timestamped.
  5. A/B testing is first-class. Not bolted on: the runtime has a routing layer; the registry supports multiple live variants; statistical significance is built into the decision flow.
  6. Observability is a product feature, not an afterthought. Metrics, logs, traces, drift reports are emitted at every layer and reach a single entry point (Grafana).
  7. Automated response with human override. Drift → auto-quarantine. Promotion → human approval. The right level of autonomy at each step.
  8. Regulatory readiness. Model cards, data lineage, decision explainability, audit log — all generated automatically, not manually.

3. Tool Responsibilities

MLflow — System of Record

Owns exclusively:

  • Model Registry (versions, aliases, tags)
  • Artifact storage (pickle, parquet, sklearn pipelines) on S3
  • Lineage metadata (which run produced which model version)
  • Promotion state (which version holds which alias at any time)
  • Source of truth for production routing decisions

Interface: CVNTrade_MLFlowManager (existing abstraction, enforced single entry point)

Storage:

  • Backend store: PostgreSQL (champollion DB)
  • Artifact store: S3 (cvntrade-artifacts/mlflow/)

Exposure: mlflow.cvntrade.eu with OAuth2 proxy (new; today it's internal only)

Does NOT own:

  • Training dashboards (→ W&B)
  • Drift detection (→ Evidently)
  • Monitoring metrics (→ Prometheus)
  • A/B routing rules (→ Runtime service)

Weights & Biases — Experiment Lab

Owns exclusively:

  • Training curves (loss, F1, Sortino per epoch / per trial)
  • HPO visualization (parallel coordinates, importance plots, sweep results)
  • Cross-experiment comparison (cross-crypto, cross-fold, cross-strategy)
  • Shareable Reports: narrative artifacts with interactive charts and markdown for stakeholders
  • Public read-only dashboards for external stakeholders (regulators, investors) with role-based access

Interface: wandb.init() at the start of every training run, wandb.log() during, wandb.summary at end.

Cross-reference with MLflow:

wandb.init(
    project="cvntrade-training",
    name=f"{symbol}_{strategy}_{ts}",
    config=hyperparams,
    tags=[symbol, strategy, model_type],
    notes=f"mlflow_run_id={mlflow_run.info.run_id}",
)
# At end, record the cross-link back to MLflow
wandb.summary["mlflow_model_version"] = registered_version
wandb.summary["mlflow_model_uri"] = f"models:/{model_name}/{registered_version}"

Workspace structure:

  • cvntrade-training: all training runs
  • cvntrade-sweeps: HPO sweeps
  • cvntrade-reports: curated reports for promotion decisions

Does NOT own:

  • Model artifacts (→ MLflow)
  • Production routing (→ Runtime)
  • Drift detection (→ Evidently)

Evidently — Production Monitor

Owns exclusively:

  • Data drift reports (PSI, KL divergence, Wasserstein on every feature)
  • Prediction drift (distribution shift on prob_buy, confidence)
  • Target drift (when labels arrive H hours after prediction)
  • Performance decay (Sortino live vs Sortino backtest, precision/recall by regime)
  • HTML reports archived in MLflow artifacts + S3 cold storage
  • Data quality checks (missing values, outliers, schema violations)

Interface: Evidently Report + TestSuite objects run daily and on-demand.

Execution:

  • Daily via dag_pte__6_monitoring at 06:00 UTC
  • On-demand via API endpoint (for operator or committee inspection)

Outputs:

  • Prometheus metrics (pushed via Pushgateway): evidently_drift_score, evidently_prediction_drift, sortino_decay_pct
  • HTML reports attached to MLflow as run artifacts (audit trail)
  • Alerts via Grafana alerting rules (e.g. drift_score > 0.25 for 24h → warning; full list in §6.3)
  • Auto-trigger of killswitch DAG when critical thresholds are breached

Does NOT own:

  • Experiment tracking (→ W&B)
  • Model storage (→ MLflow)
  • Raw log aggregation (→ Loki)

Feast — Feature Store (new in v2 per committee)

Owns exclusively:

  • Feature definitions (feature_view schemas) as single source of truth
  • Feature versioning (ADR-23 enforcement: features_hash pinned per model version)
  • Online serving (low-latency inference path via Redis online store)
  • Offline serving (training + backfill path via Parquet offline store)
  • Point-in-time correctness (prevents training-serving skew)
  • Feature lineage: which upstream tables/computations produced which feature

Cross-system integration:

  • MLflow: each model version tags features_hash = Feast feature view version at training time. Serving loads the same version.
  • W&B: training runs log Feast feature view references.
  • Evidently: monitors feature freshness (SLO: features < 1h stale) and value drift vs Feast offline reference.
  • Runtime: reads features from Feast online store (Redis) with fallback to re-compute on cache miss.

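Leakage prevention: Feast enforces temporal separation through point-in-time joins on each feature view's event timestamp column (timestamp_field); the created_timestamp column only breaks ties between rows that share an event timestamp. For every training row, the join returns the latest feature values whose event timestamp is at or before the row's own timestamp, so no future data leaks into training features.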
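The join semantics can be illustrated with a minimal standard-library sketch. This is not Feast's implementation; the function name `point_in_time_lookup` and the sample RSI values are hypothetical and only show the "latest value at or before the label timestamp" rule.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def point_in_time_lookup(feature_rows, as_of):
    """Return the latest feature value with event timestamp <= as_of, else None.

    feature_rows: list of (event_timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, as_of)  # count of rows with ts <= as_of
    return feature_rows[idx - 1][1] if idx else None

t0 = datetime(2026, 4, 18, 12, 0)
# Hypothetical hourly feature values
rsi_rows = [(t0, 41.0), (t0 + timedelta(hours=1), 55.0), (t0 + timedelta(hours=2), 63.0)]

# A label stamped 13:30 may only see the 13:00 feature, never the 14:00 one.
value_at_1330 = point_in_time_lookup(rsi_rows, t0 + timedelta(minutes=90))
# Before any feature exists, nothing is returned rather than a future value.
value_before_history = point_in_time_lookup(rsi_rows, t0 - timedelta(minutes=1))
```

A training row that received a value from after its own timestamp would be exactly the leak this rule is meant to exclude.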

Supporting Infrastructure

  • Airflow: orchestration — DAGs for training, promotion, monitoring, killswitch, rollback, decommission, leakage audit (new), auto-retraining (new)
  • Loki: log aggregation — 30-day retention, queryable via Grafana
  • Prometheus: metrics aggregation — hit rates, drift scores, training throughput, calibration metrics (new)
  • Grafana: single observability entry point (ADR-26) — metrics + logs + traces
  • Runtime service: model serving for paper/live trading
  • Console (Streamlit): operator UI — config, run history, PDF download, A/B experiment management (new)
  • PostgreSQL: operational data — finetune_runs, ftf_config, model_inferences, mlflow_promotion_audit, ab_routing, ab_experiments, pending_approvals, trade_outcomes
  • Vault or K8s External Secrets (new per committee): HMAC signing keys, OAuth2 credentials with automated rotation

4. Core Workflows

4.1 Training → Registration

Airflow DAG triggers training
┌───────────────────┐  mlflow.start_run() → run_id
│  Python training  │
│    pipeline       │  wandb.init(notes=mlflow_run_id)
│                   │
│  HPO loop         │  wandb.log({trial, f1, sortino, ...})  ─► W&B dashboard
│  (n trials)       │  optuna study also logged to W&B
│                   │
│  Fit + validate   │  mlflow.log_params(), mlflow.log_metrics()
│                   │
│  Save artifacts   │  mlflow.xgboost.log_model(...)
│                   │  mlflow.log_artifact(feature_list.json)
└───────────────────┘
MLflow Model Registry:
  register_model(CVNTrade_XGBoost_UNIUSDC_1h_ATR1.5_3.0_H5)
  set_alias("challenger", new_version)
W&B summary:
  mlflow_model_version = 42
  mlflow_model_uri = "models:/.../42"
  dvc_dataset_hash = "..."  (Phase 5 #575)
Slack notification:
  "New challenger: ...v42 — W&B report: <url>"

4.2 Validation → Promotion

Operator receives Slack notification with W&B Report URL
Reviews the W&B Report (training curves, HPO landscape, Sortino vs baseline)
Reviews the Evidently pre-deployment report (optional — run once on
reference OHLCV data to confirm no input schema drift)
Decision: Approve / Reject via Slack interactive button
        ├─ Reject → ModelVersion stays at "challenger" alias
        │          (or moved to "archived" after review period)
        └─ Approve
Webhook → Airflow → dag_pte__5_promotion
MLflow transition (atomic, ADR-42):
  - Get current champion version → set as "champion_previous"
  - Set new version alias to "champion"
  - Tag with promoted_at, promoted_by, approval_reason
Append to mlflow_promotion_audit (PG append-only):
  - id, timestamp, operator_slack_id, model_name, from_version,
    to_version, wandb_report_url, evidently_report_uri,
    artifact_sha256, signature_hmac
Slack notification to #trading-ops:
  "Promoted: ... v42 → champion (was v41).
   Report: <wandb>. Audit: <pg_id>"

4.3 Serving

Runtime service loads champion via MLflow URI:

model = mlflow.pyfunc.load_model(
    f"models:/{model_name}@champion"
)

Every inference logs to model_inferences table: - timestamp, symbol, model_name, model_version, features_hash, prob_buy, prob_sell, prob_hold, confidence, decision

This table is the ground truth for: - Evidently drift reports (compare with training distribution) - Target drift (join with trade_outcomes H hours later) - A/B analysis (see §5)

4.4 Monitoring (replaces stubbed DAG 6)

Daily at 06:00 UTC, dag_pte__6_monitoring runs:

  1. Collect reference data from MLflow: the training features DataFrame used for the current champion (artifact).
  2. Collect current data from model_inferences table for last 24h.
  3. Run Evidently Report:
       • DataDriftPreset → PSI, KL on every feature
       • TargetDriftPreset → conditional on available labels (prediction outcomes 5h later)
       • RegressionPerformancePreset (or custom) → Sortino live vs Sortino backtest
  4. Push metrics to Prometheus:
       • evidently_data_drift{crypto, model_version}
       • evidently_prediction_drift{crypto, model_version}
       • sortino_decay_pct{crypto, model_version}
  5. Archive HTML report as MLflow artifact tagged monitoring_report/evidently/{date}.html
  6. Evaluate alerting rules (Grafana):
       • drift_score > 0.25 for 24h → warning
       • drift_score > 0.40 for 6h → critical → auto-trigger killswitch DAG
       • sortino_decay > 50% for 72h → critical → auto-trigger killswitch DAG
  7. Write audit entry: monitoring_run_log table (ran, verdict, actions taken)
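For readers unfamiliar with PSI, the sketch below shows how the statistic behind the drift thresholds is computed for a single feature. Evidently computes it internally, and its binning and smoothing may differ. The bin edges, the `eps` smoothing constant, and the sample values here are illustrative assumptions.

```python
import math

def psi(reference, current, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    def proportions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= v < bin_edges[i + 1]
                if in_bin or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        # eps keeps empty bins from producing log(0)
        return [max(c / total, eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]      # training distribution
shifted = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.85]      # live values drifted upward

psi_same = psi(reference, reference, edges)
psi_shifted = psi(reference, shifted, edges)
```

Identical distributions give a PSI of zero, while the shifted sample lands well above the 0.25 warning threshold used in the alerting rules.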

4.5 A/B Testing (see §5)

4.6 Rollback / Killswitch

  • Killswitch (dag_pte__7_killswitch): tag champion with killswitch=true, lifecycle_phase=quarantine. Runtime checks this tag on every inference and returns HOLD. Append to audit table.
  • Rollback (dag_pte__8_rollback): promote champion_previous to champion, tag former champion as rolled_back. Append to audit table with reason.

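Manually initiated killswitch and rollback operations require human approval via Slack buttons (same flow as promotion). A killswitch auto-triggered by a critical alert (§4.4, §6.3) runs without prior approval and is recorded in the audit table with actor_type = 'automated'.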


5. A/B Testing Framework (First-Class)

A/B testing in trading is tricky because you cannot split the same market between two models in real time without taking positions with both. This design handles it via four complementary modes, each appropriate for a different risk level.

5.1 Four A/B Modes

Mode 1: Shadow (zero risk, zero blast radius)

Challenger runs alongside champion on the same candles, produces predictions, but never executes trades. Both predictions are logged to model_inferences. Comparison is statistical (hit rate, Sortino simulation on retro-trades).

Implemented by: the CandlePipeline shadow mode from #568 (Phase 3). Just needs extension to log challenger predictions with a flag is_shadow=true.

Use case: validate a brand-new challenger before any real exposure.

Mode 2: Canary (bounded risk) — revised per committee (v2)

Champion handles N-1 cryptos, challenger handles 1 crypto in production. Regime-aware stratification (per data-scientist + crypto-trader feedback) — the canary crypto is chosen by volatility quintile, not market cap, to reduce regime bias. Rollout ladder with power-analysis-derived sample sizes:

| Stage | Cryptos on challenger | Minimum sample size | Duration | Promotion gate |
|---|---|---|---|---|
| 1 | 1 (mid-cap, mid-volatility) | N = power analysis for MDE 0.3 Sortino | 7d or N trades (whichever longer) | No critical drift + calibration ECE < 0.05 + liquidity score ≥ 0.7 |
| 2 | 2 (different volatility quintiles) | N per crypto | 7d or N each | Sortino within 80% of champion + regime-balanced performance |
| 3 | 50% of cryptos | aggregate N | 14 days | BH p<0.05 on paired comparison + MDE of 15% Sortino improvement |
| 4 | All | | | Promote to sole champion |

Sample size: derived from power analysis (power=0.8, alpha=0.05, MDE = 0.3 Sortino units) based on observed Sortino variance in training backtests. 50 trades was a placeholder; actual minimum is typically 200-500 depending on strategy frequency.
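As a rough illustration of how such a minimum can be derived, the sketch below uses the standard normal-approximation formula for a two-sided paired test, n = ((z_{1-α/2} + z_{power}) · σ / MDE)². The document does not specify the exact procedure, so this is an assumption, and the σ values are hypothetical stand-ins for the Sortino dispersion observed in backtests.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, mde, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sided paired comparison.

    sigma: standard deviation of the paired difference (assumed input)
    mde:   minimum detectable effect, in the same units as sigma
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / mde) ** 2)

# Hypothetical dispersion values; MDE = 0.3 Sortino units as in the text.
n_low_variance = required_sample_size(sigma=1.5, mde=0.3)
n_high_variance = required_sample_size(sigma=2.3, mde=0.3)
```

With these illustrative inputs the result falls in the same few-hundred-trade range quoted above, and it grows with the square of σ, which is why a fixed 50-trade floor cannot fit every strategy.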

Routing layer with fallback (per architect + ml-engineer):

  • Runtime service reads ab_routing table on every inference
  • On query failure → default to champion alias (ADR-25 compliance: no silent fallback, but a safe default rather than no serving)
  • Automatic rollback: error rate > 2× champion baseline for 15 min → revert crypto to champion

Implemented by: ab_routing table + routing layer with fallback + auto-rollback daemon.
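The auto-rollback rule (error rate above 2× the champion baseline for 15 minutes) could be evaluated along the following lines. This is a sketch only: the function name, the one-minute sampling interval, and the choice to ignore windows with no data are assumptions, not part of the design.

```python
from datetime import datetime, timedelta

def should_auto_rollback(samples, baseline_error_rate, now,
                         window=timedelta(minutes=15),
                         interval=timedelta(minutes=1), factor=2.0):
    """True if the challenger error rate exceeded factor x baseline for the whole window.

    samples: list of (timestamp, error_rate) tuples, oldest first.
    """
    recent = [(ts, rate) for ts, rate in samples if now - window <= ts <= now]
    if not recent:
        return False  # missing data is not treated as a breach in this sketch
    # The first sample must sit near the window start, so a short spike does not qualify.
    spans_window = recent[0][0] - (now - window) <= interval
    return spans_window and all(rate > factor * baseline_error_rate for _, rate in recent)

now = datetime(2026, 4, 18, 12, 0)
sustained = [(now - timedelta(minutes=m), 0.05) for m in range(15, -1, -1)]
short_spike = sustained[-5:]                      # only the last 5 minutes are bad
recovered = sustained[:-1] + [(now, 0.015)]       # latest sample back under 2x baseline

rollback_sustained = should_auto_rollback(sustained, baseline_error_rate=0.01, now=now)
rollback_spike = should_auto_rollback(short_spike, baseline_error_rate=0.01, now=now)
rollback_recovered = should_auto_rollback(recovered, baseline_error_rate=0.01, now=now)
```

Requiring the breach to span the full window keeps a brief burst of errors from reverting an otherwise healthy canary.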

Mode 3: Time-Split (for temporal analysis)

Champion runs Mon-Wed, challenger runs Thu-Sat. Less statistically rigorous but useful for exploring regime sensitivity.

Use case: hypotheses about regime-dependent performance.

Mode 4: Paper Parallel (pre-production)

Challenger runs in paper trading on the same cryptos as champion in production. Full realistic trades are simulated. No real money is at risk, and paper orders never interact with live positions.

Use case: final validation before canary.

5.2 Routing Layer (new component)

New table ab_routing:

CREATE TABLE ab_routing (
    id SERIAL PRIMARY KEY,
    crypto TEXT NOT NULL,
    active_alias TEXT NOT NULL,  -- 'champion' | 'challenger' | 'champion_previous'
    experiment_id TEXT,          -- UUID of the A/B experiment
    assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    assigned_by TEXT NOT NULL,
    UNIQUE (crypto, experiment_id)
);

Runtime service reads this on every inference:

def get_model_for_crypto(crypto: str) -> str:
    """Return the MLflow alias to serve for this crypto (defaults to 'champion')."""
    # Most recent assignment wins; no row means the crypto stays on champion.
    row = pg.query(
        "SELECT active_alias FROM ab_routing WHERE crypto = %s AND assigned_at <= NOW() "
        "ORDER BY assigned_at DESC LIMIT 1",
        (crypto,),
    )
    return row["active_alias"] if row else "champion"

alias = get_model_for_crypto(candle.symbol)
model = mlflow.pyfunc.load_model(f"models:/{model_name}@{alias}")

5.3 Statistical Decision Framework (revised v2 per data-scientist + crypto-trader)

A challenger is promoted only if ALL gates pass:

| Gate | Criterion | Rationale |
|---|---|---|
| Sample size | N trades ≥ power analysis output for strategy-specific Sortino variance | Replaces 50 trades placeholder (insufficient for Sortino high variance). Computed per strategy based on observed σ. |
| Minimum Detectable Effect (MDE) | Promote only if observed effect ≥ MDE of 15% Sortino improvement OR 0.3 Sortino absolute | MDE prevents false positives from small samples. |
| Statistical significance | BH-corrected paired Wilcoxon p < 0.05 across folds + stratified by volatility regime | Paired test controls for market regime; Wilcoxon is non-parametric (Sortino isn't normal). |
| Confidence intervals | Sortino 95% CI (bootstrap, n=1000) must exclude champion's Sortino CI lower bound | Ensures improvement is robust, not a fluke. |
| Cost-aware | All metrics computed NET of fees + slippage + funding | Sortino optimization without costs is deceptive (crypto-trader). |
| No critical drift | Evidently PSI < 0.25 on top-20 features AND drift-performance correlation verified | Drift threshold must be tied to actual Sortino degradation observed OOS (ADR-15). |
| Calibration quality | Expected Calibration Error (ECE) < 0.05 | Uncalibrated probabilities break position sizing and cost filters (ml-engineer). |
| Liquidity sanity | All cryptos in challenger bucket have 24h volume ≥ $10M AND bid-ask spread ≤ 5bps | Avoids illiquid regime where backtests diverge from reality (crypto-trader). |
| Operational | Error rate ≤ champion + 2σ, latency p99 ≤ champion + 2σ | No regression in observability or safety. |
| No leakage | Leakage audit DAG (new) validates purging + embargo + feature version pinning | Blocks the experiment if temporal separation is violated (ml-engineer). |

Decision logic:

# scripts/ab_decision.py (weekly in dag_pte__ab_decision)
def decide(experiment_id):
    """Evaluate every promotion gate; return (all_passed, failed_gates)."""
    # Each check_* helper returns a result object exposing a boolean `.passed`.
    gates = [
        check_sample_size(mde_sortino=0.3, power=0.8, alpha=0.05),
        check_mde_achieved(min_mde_pct=15),
        check_statistical_significance(method="wilcoxon_paired_bh_corrected"),
        check_bootstrap_ci_excludes_baseline(n_bootstrap=1000),
        check_cost_net_metrics(),
        check_no_critical_drift(psi_threshold=0.25, linked_to_sortino=True),
        check_calibration_ece(threshold=0.05),
        check_liquidity_sanity(min_volume=10_000_000, max_spread_bps=5),
        check_operational_regression(sigma_threshold=2),
        check_leakage_audit_passed(),
    ]
    # Test the .passed flags: the result objects themselves are always truthy.
    failed = [g for g in gates if not g.passed]
    return not failed, failed
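The significance gate relies on a Benjamini-Hochberg (BH) correction across folds. A minimal sketch of the procedure is shown below; the per-fold p-values are hypothetical, and the paired Wilcoxon test that would produce them is out of scope here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, one per walk-forward fold
fold_p_values = [0.003, 0.045, 0.012, 0.30, 0.021]
decisions = benjamini_hochberg(fold_p_values)
```

Note that the fold with p = 0.045 would pass an uncorrected 0.05 threshold but is not rejected after correction, which is the false-discovery control the gate is meant to provide.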

5.4 A/B Experiment Lifecycle

1. Create experiment (UUID + name + hypothesis) via Console UI
2. Define rollout plan (list of stages, criteria, durations)
3. Deploy challenger as 'challenger' alias in MLflow
4. ab_routing assigns 1 crypto to challenger
5. Wait 7 days or until N trades are reached, whichever is longer (monitored by Evidently + Grafana)
6. Weekly analysis job evaluates promotion gate
7. If passed: next stage (more cryptos); if failed: rollback
8. Final stage: challenger becomes sole champion
9. W&B Report archived summarizing the experiment
10. Audit entry written to mlflow_promotion_audit

6. Model Observability (Unified Layer)

6.1 What is observed

| Layer | Metric | Source | Frequency |
|---|---|---|---|
| Input quality | Schema violations, missing values, outliers | Evidently tests | Per inference (sampled) |
| Input drift | PSI, KL on each feature | Evidently daily | Daily |
| Prediction drift | Distribution shift on prob_buy, confidence | Evidently daily | Daily |
| Target drift | Label distribution change (5h after prediction) | Evidently daily | Daily |
| Concept drift | Prediction vs outcome alignment over time | Custom query + Evidently | Weekly |
| Performance decay | Sortino live vs Sortino backtest | PG query on trade_outcomes | Daily |
| Explainability drift | SHAP value distribution per feature | Custom + W&B | Weekly |
| Operational | Latency p50/p99, error rate, throughput | Runtime service → Prometheus | Real-time |
| Business | Net PnL, win rate, Sharpe, drawdown | PG query on trade_outcomes | Daily |
| Cost | Compute hours, artifact storage, inference volume | Cloud cost API + PG | Daily |

6.2 Where it is observed

Single entry point: Grafana (ADR-26).

Dashboards:

  • Model Health Overview — one panel per active model with traffic-light status. Green = no drift + Sortino in spec + no errors. Yellow = warning. Red = auto-quarantined.
  • Drift Deep-Dive — per-feature drift evolution over last 90 days. Evidently reports embedded via iframe for detailed analysis.
  • Performance Decay — Sortino live (rolling 7d) vs Sortino backtest per model version. Gap > 30% = critical.
  • A/B Experiment Dashboard — one per active experiment. Statistical power, current metrics, gate status.
  • Training Lab (embedded W&B) — curves, HPO, comparisons.
  • Cost Dashboard — per model per crypto per day.

6.3 Alerting

Grafana alert rules trigger on:

  • drift_score > 0.25 for 24h → warning to #trading-ops
  • drift_score > 0.40 for 6h → critical + auto-killswitch
  • sortino_decay > 50% for 72h → critical + auto-killswitch
  • inference_error_rate > 1% for 10min → critical + auto-killswitch
  • inference_latency_p99 > 500ms for 10min → warning

Critical alerts reach Slack + PagerDuty (when deployed).

6.4 Explainability (revised v2)

SHAP values computed asynchronously on a sampling basis (default 10% of inferences) to avoid latency impact on the hot path. Configurable per crypto.

  • On-demand endpoint: operators can query /explain/{inference_id} for full SHAP computation of a specific decision (audit use case).
  • Aggregated reports: weekly W&B Reports show SHAP value distributions per feature, flag shifts in feature importance over time (explainability drift).
  • Cached top-10: top-10 contributing features per inference are stored synchronously in model_inferences.top_features (JSONB) for fast display in Grafana drill-down.

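Regulator-ready: any decision can be traced back to features, values, and marginal contribution. This supports the MiFID II algorithmic-trading obligations (Article 17 of Directive 2014/65/EU), under which a firm must be able to describe its trading algorithms and controls to the competent authority on request.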

6.5 Cost & Execution Realism Monitoring (new v2 per crypto-trader)

Bridges the gap between ML metrics and economic reality:

| Metric | Computation | Alert |
|---|---|---|
| Effective spread per trade | (exit_price - entry_price) / entry_price - gross_return_backtest | > 2× backtest assumption → warning |
| Slippage incurred | Actual fill price vs theoretical (close price at signal) | > 5bps deviation from assumption → investigate |
| Fill rate | % of orders filled vs placed | < 95% → flag partial-fill risk |
| Funding rate impact | Σ funding paid on open positions (derivatives only) | > 0.5% of notional / day → review leverage |
| Liquidity score | min(volume_24h / $10M, 1) × (1 - spread_bps / 50) | < 0.7 → flag liquidity-constrained |
| P&L reconciliation | Daily: model predictions → executed trades → realized P&L | Discrepancy > 10% → investigation |

The P&L sanity check (crypto-trader's ONE improvement request) is the reconciliation layer: each day, for each model version, compare the aggregate predicted edge, Σ (prob_buy − threshold), to the realized P&L adjusted for costs. A systematic deviation indicates model drift OR execution degradation; a single Grafana panel, "P&L Reconciliation", flags the gap.
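The liquidity score formula from the table above translates directly into code. The sketch below follows that formula as written; the function names and the sample inputs are illustrative, and the behaviour for spreads above 50 bps (where the score turns negative) is left unclamped because the document does not specify it.

```python
def liquidity_score(volume_24h_usd, spread_bps):
    """min(volume_24h / $10M, 1) x (1 - spread_bps / 50)."""
    volume_term = min(volume_24h_usd / 10_000_000, 1.0)  # capped at 1 above $10M
    spread_term = 1.0 - spread_bps / 50.0                # penalizes wide spreads
    return volume_term * spread_term

def is_liquidity_constrained(volume_24h_usd, spread_bps, threshold=0.7):
    """Apply the < 0.7 alert threshold from the monitoring table."""
    return liquidity_score(volume_24h_usd, spread_bps) < threshold

deep_market = liquidity_score(25_000_000, 5)   # volume term capped at 1, spread term 0.9
thin_market = liquidity_score(6_000_000, 10)   # 0.6 x 0.8
```

A crypto that just meets the liquidity gate (24h volume of $10M, 5 bps spread) scores 0.9, comfortably above the 0.7 alert line.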

6.6 Upstream Data Observability (new v2 per ops)

Monitors the data pipeline before features reach the model:

  • OHLCV ingestion freshness: last Binance candle timestamp vs now (SLO: < 2 min lag)
  • Schema drift: column types and names vs expected schema
  • Missing data: gaps in OHLCV, unexpected NULLs
  • Data quality score: rollup of validation tests (Great Expectations-compatible format)
  • Exchange API health: error rate, rate limit hits

Exposes to Grafana dashboard "Data Pipeline Health".

6.7 Calibration Monitoring (new v2 per ml-engineer)

Model probabilities must be calibrated to be usable for cost/Kelly filters (ADR-46).

  • Reliability diagrams computed weekly from model_inferences vs outcomes
  • Expected Calibration Error (ECE) exposed to Prometheus per model version
  • Brier Score tracked over time — regression in calibration triggers alert
  • Post-training calibration: isotonic regression or Platt scaling applied on held-out OOS fold (not train set to avoid leakage). Calibrated model is the one registered in MLflow.
  • Threshold calibration OOS (ADR-15): decision thresholds (threshold_buy) derived on walk-forward validation set, not HPO val set. Pinned per model version in model_inferences for audit.
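For reference, ECE can be computed as the sample-weighted gap between predicted probability and observed frequency across probability bins. The sketch below assumes a binary formulation (for example prob_buy against whether the buy signal was correct) with ten equal-width bins; the production metric may bin or aggregate differently.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: sample-weighted mean |observed frequency - mean predicted prob| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_acc = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_acc - avg_conf)
    return ece

# Well calibrated: predicts 0.8 and is right 8 times out of 10.
ece_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Overconfident: predicts 0.9 but is right only 5 times out of 10.
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

The overconfident case sits far above the 0.05 gate, which is the situation that would distort Kelly-style position sizing.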

7. Audit & Compliance

7.1 Immutable Audit Log

Table mlflow_promotion_audit is append-only (PG trigger blocks UPDATE/DELETE):

CREATE TABLE mlflow_promotion_audit (
    id BIGSERIAL PRIMARY KEY,
    event_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type TEXT NOT NULL CHECK (event_type IN (
        'promotion', 'rollback', 'killswitch', 'decommission',
        'ab_start', 'ab_promote', 'ab_rollback'
    )),
    model_name TEXT NOT NULL,
    from_version INT,
    to_version INT,
    from_alias TEXT,
    to_alias TEXT,
    actor_type TEXT NOT NULL CHECK (actor_type IN ('human', 'automated')),
    actor_id TEXT NOT NULL,  -- Slack user ID or system identity
    reason TEXT,
    wandb_report_url TEXT,
    evidently_report_uri TEXT,
    artifact_sha256 TEXT NOT NULL,
    signature_hmac TEXT NOT NULL,  -- HMAC of the row content with secret key
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

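-- Reject updates and deletes with an explicit error (append-only enforcement)
CREATE FUNCTION mlflow_promotion_audit_block_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'mlflow_promotion_audit is append-only (% blocked)', TG_OP;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mlflow_promotion_audit_no_mutation
    BEFORE UPDATE OR DELETE ON mlflow_promotion_audit
    FOR EACH ROW EXECUTE FUNCTION mlflow_promotion_audit_block_mutation();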

-- Daily export to S3 cold storage
-- (separate cron job; S3 bucket has Object Lock enabled for WORM compliance)
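The signature_hmac column could be populated as sketched below. The canonicalization (sorted-key compact JSON), the helper names, and the demo key are assumptions for illustration; the design only states that the row content is signed with a secret key held in Vault or K8s External Secrets.

```python
import hashlib
import hmac
import json

def sign_audit_row(row, secret_key):
    """HMAC-SHA256 over a canonical JSON encoding of the row (signature field excluded)."""
    payload = {k: v for k, v in row.items() if k != "signature_hmac"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret_key, canonical, hashlib.sha256).hexdigest()

def verify_audit_row(row, secret_key):
    """Recompute the signature and compare in constant time."""
    expected = sign_audit_row(row, secret_key)
    return hmac.compare_digest(expected, row.get("signature_hmac", ""))

key = b"demo-key-not-for-production"  # placeholder; real keys come from the secret store
row = {
    "event_type": "promotion", "model_name": "demo_model",
    "from_version": 41, "to_version": 42,
    "actor_type": "human", "actor_id": "U123",
    "artifact_sha256": "ab" * 32,
}
row["signature_hmac"] = sign_audit_row(row, key)

# Any change to a signed field invalidates the stored signature.
tampered = dict(row, to_version=43)
```

Because historical rows must stay verifiable after key rotation, a verifier along these lines would also need to know which key version signed each row.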

7.2 Model Cards (auto-generated)

For every promoted model, a Model Card is generated and stored alongside the MLflow artifact:

  • Model name, version, creation date
  • Training data summary (period, n_samples, cryptos included)
  • Features used (full list, top-10 by importance)
  • Hyperparameters
  • Evaluation metrics (train, val, test, walk-forward, holdout)
  • Known limitations and failure modes
  • Fairness considerations (cross-crypto performance disparity)
  • Link to W&B training report
  • Link to Evidently pre-deployment report
  • Link to MLflow artifact URI

Format: HTML + JSON. Stored in s3://cvntrade-artifacts/model_cards/.

7.3 Lineage

Every production prediction can be traced back through:

  1. Inference log (model_inferences) → features used, model version
  2. MLflow model version → MLflow run_id, artifact, training parameters
  3. MLflow run tags → W&B run URL, DVC dataset hash (Phase 5 #575), Git commit SHA
  4. DVC → exact training data
  5. Git commit → exact code version

This supports the MiFID II algorithmic trading record-keeping obligations (Article 17 of Directive 2014/65/EU and RTS 6) and SOC2 change control requirements.

7.4 Model Risk Management (SR 11-7 framework, new v2 per data-scientist)

Adopts the SR 11-7 Supervisory Guidance on Model Risk Management (Federal Reserve / OCC) as the formalized framework for validation, monitoring, and decommissioning. Key practices:

  • Independent validation: each production model undergoes validation by a reviewer NOT the one who built it (peer review, expert committee)
  • Challenger testing: continuous A/B framework (§5) satisfies the "effective challenge" requirement
  • Uncertainty quantification: all reported metrics carry 95% CI (bootstrap-derived)
  • Documentation: Model Card + validation report stored in s3://cvntrade-artifacts/model_cards/ and linked from MLflow
  • Ongoing monitoring: monitoring DAG (#584) with defined thresholds for drift-driven revalidation
  • Limits and compensating controls: position sizing limits via Kelly filter; max drawdown limits per model variant enforced at routing layer
  • Remediation triggers: drift > threshold → retraining; persistent underperformance → decommission

7.5 Leakage Audit DAG (new v2 per ml-engineer — ONE critical improvement)

A pre-promotion DAG validates temporal integrity of training data:

  • Purging audit: no overlap between train and validation candles (ADR-14/15)
  • Embargo audit: minimum N-hour gap between train end and validation start (per triple barrier horizon)
  • Feature version pinning: features_hash in model matches Feast feature view version at training time (ADR-23)
  • Timestamp monotonicity: all features computed from data ≤ inference timestamp (point-in-time correctness)
  • Label delay audit: labels used in training were not observable at the time of the corresponding features

Blocking behavior: if audit fails → model registration fails with explicit error. No promotion possible without audit pass.
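The purging and embargo checks reduce to simple timestamp comparisons. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 5-hour embargo assumes the H5 label horizon on 1h candles mentioned elsewhere in this document.

```python
from datetime import datetime, timedelta

def audit_purge_and_embargo(train_end, val_start, embargo):
    """Return a list of violations; an empty list means the split passes."""
    violations = []
    if val_start <= train_end:
        violations.append("purging: validation overlaps training window")
    elif val_start - train_end < embargo:
        violations.append("embargo: gap between train end and validation start is too short")
    return violations

horizon = timedelta(hours=5)  # assumed: 5-candle label horizon on 1h candles
train_end = datetime(2026, 3, 1, 0, 0)

ok = audit_purge_and_embargo(train_end, train_end + timedelta(hours=6), horizon)
too_close = audit_purge_and_embargo(train_end, train_end + timedelta(hours=2), horizon)
overlap = audit_purge_and_embargo(train_end, train_end - timedelta(hours=1), horizon)
```

In a blocking DAG, any non-empty result would raise and fail the registration task, so the violation text ends up in the task log.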


8. Incident Response Runbooks (new v2 per ops)

For each critical alert, a runbook in documentation/runbooks/ with:

  1. Immediate human actions (0-5 min)
  2. Escalation path (on-call → team lead → committee)
  3. Diagnostic steps (Grafana panels to check, Loki queries, DB queries)
  4. Recovery procedures (rollback, retrain, manual override)
  5. Post-incident (audit entry, root cause, preventive action)

Required runbooks (before production):

  • runbook_drift_critical.md — drift_score > 0.40 for 6h
  • runbook_sortino_decay.md — sortino decay > 50% for 72h
  • runbook_inference_errors.md — error rate > 1% for 10 min
  • runbook_latency_spike.md — p99 latency > 500ms for 10 min
  • runbook_killswitch_triggered.md — auto or manual killswitch
  • runbook_ab_rollback.md — A/B auto-rollback triggered
  • runbook_calibration_regression.md — ECE > 0.10
  • runbook_liquidity_degradation.md — liquidity_score < 0.5 on production crypto

Runbooks are validated via tabletop exercises before production deployment.


9. Migration Plan

9.1 Current state → Target state

| Component | Current | Target | Issue |
|---|---|---|---|
| MLflow UI | Internal only | Public via OAuth2 | #583 |
| Monitoring DAG | Stubbed | Evidently + auto-response | #584 |
| W&B integration | Minimal | First-class with cross-links | #585 |
| Audit trail | Mutable tags | Append-only table + WORM | #586 |
| Promotion | Direct Airflow trigger | Slack approval workflow (2-person) | #587 |
| A/B testing | None | Four-mode framework + routing | #588 |
| Model observability | Scattered | Unified Grafana dashboards | #589 |
| Model cards | None | Auto-generated | #590 |
| Feature store | Ad-hoc | Feast (versioning + online/offline) | #591 |
| Leakage prevention | Partial (ADR-14/15) | Blocking leakage audit DAG | #592 |
| Retraining | Manual | Auto-retraining DAG (drift-driven) | #593 |
| Runbooks | None | 8 runbooks + tabletop exercises | #594 |
| Model risk framework | Informal | SR 11-7 formalized | #595 |
| Calibration monitoring | None | ECE + Brier + reliability diagrams | #596 |

9.2 Sequencing (revised v2 — 30 days)

Phase 1: Observability foundation (9 days)

  • Issue #584 (monitoring DAG with Evidently): unblocks everything else
  • Issue #585 (W&B first-class integration): enables richer reports
  • Issue #586 (audit trail with HMAC + WORM): enables trust in subsequent phases
  • Issue #596 (calibration monitoring): required before A/B gates can depend on ECE

Phase 2: Safety & exposure (6 days)

  • Issue #583 (MLflow UI exposure with OAuth2)
  • Issue #587 (Slack approval workflow, multi-person for prod)
  • Issue #592 (leakage audit DAG: blocking gate for promotion)
  • Issue #594 (runbooks: before any critical auto-response ships)

Phase 3: Feature store & retraining (7 days)

  • Issue #591 (Feast feature store with online/offline)
  • Issue #593 (auto-retraining DAG driven by drift thresholds)

Phase 4: A/B and lifecycle (8 days)

  • Issue #588 (A/B framework: 4 modes + routing layer + fallback)
  • Issue #589 (unified observability dashboards + cost + P&L reconciliation)
  • Issue #590 (model cards)
  • Issue #595 (SR 11-7 model risk formalization)

Total: ~30 days, phased so each delivers value independently. Committee dissent on 20d resolved by explicit buffer for validation + runbook exercises.

9.3 Risks

  • W&B cost: free tier limit (100 GB artifacts, 5 users). Escalation path (v2): if monthly usage > 70% → upgrade to Team plan (~€50/month/user); if > 100 GB artifacts → migrate to self-hosted W&B on K8s.
  • Evidently performance: large DataFrames slow. Sample + parallelize reports if needed.
  • PG WORM trigger compatibility: must test on managed Scaleway PG that triggers are allowed.
  • A/B routing complexity: adds a critical path; fallback to champion alias on query failure (documented in §5.2); daemon auto-rollback on error-rate regression.
  • HMAC key management (v2 per ops): keys managed by Vault or K8s External Secrets; automated rotation every 90 days; old keys kept for signature verification of historical records for 7 years.
  • Feast adoption friction: training code refactor non-trivial. Mitigation: gradual — run Feast alongside existing feature loading for 2 weeks (shadow mode), then cut over.

9.4 Rollback

Each phase is additive; rollback is simple:

  • Disable monitoring DAG → back to current state
  • Skip W&B logging → no training impact
  • Revert audit table trigger → mutable as before
  • A/B routing: single row in ab_routing returns all cryptos to champion
  • Feast: keep legacy feature loading code paths for 2 sprints as fallback


10. What Authorities See

When this design is presented to an authority (regulator, institutional investor, senior MLOps practitioner), they see:

  1. Defense in depth — three independent monitoring surfaces (Evidently, W&B, Prometheus/Grafana) plus Feast feature-level observability plus upstream data observability, so no failure mode is silent.
  2. Deterministic reproducibility — any production prediction can be reconstructed exactly from its lineage chain (features_hash pinned to Feast view, model_version pinned to MLflow artifact, dataset pinned via DVC, code pinned via Git SHA).
  3. Immutable audit with cryptographic integrity — all lifecycle events in an append-only store with HMAC signatures (keys in Vault with automated 90d rotation), 7-year WORM cold storage for MiFID II compliance.
  4. A/B testing as a first-class pattern — formal rollout ladder with regime-aware stratification, 10-gate statistical decision framework (power analysis MDE + bootstrap CI + paired Wilcoxon BH + cost-net metrics + calibration ECE + liquidity sanity + leakage audit), routing layer with auto-rollback.
  5. Leakage prevention as a blocking gate — dedicated DAG validates purging, embargo, feature-version pinning, and point-in-time correctness before any promotion.
  6. Automated response with multi-person human authority — drift auto-quarantines (killswitch); promotions require 2-person approval (ML owner + operator) via signed Slack button; auto-retraining is gated (challenger only, never direct promotion).
  7. Model Risk Management per SR 11-7 — independent validation, challenger testing, uncertainty quantification, position-sizing limits, documented remediation triggers.
  8. Regulatory compliance-ready — Model Cards, lineage, explainability (async SHAP, on-demand per-inference), audit trail all generated automatically. Meets MiFID II Article 28 + SOC2.
  9. Economic realism — ML metrics always computed NET of fees/slippage/funding; P&L reconciliation dashboard catches divergence between predicted edge and realized return.
  10. Incident readiness — 8 runbooks covering all critical alerts, tabletop-exercised before production, linked from Grafana alert descriptions.
  11. Observability is end-to-end — Grafana is the single entry point (ADR-26); no tool-hopping required to diagnose an issue.
  12. No vendor lock-in — MLflow, W&B (free tier → self-host escalation path), Evidently, Feast, Grafana, Loki, Prometheus are all open-source or have clear migration paths.

11. Issues List

Issue #583 — Expose MLflow UI with OAuth2 (Major, 1 day)

  • Ingress mlflow.cvntrade.eu with TLS
  • OAuth2 proxy (same IdP as rest of stack)
  • Roles: viewer (read), operator (promote/rollback), admin (registry edit)
  • Grafana link-out to MLflow run pages

Issue #584 — Monitoring DAG with Evidently (Critical, 5 days)

  • Implement dag_pte__6_monitoring (currently stubbed)
  • model_inferences table for production predictions
  • Evidently Report daily: data drift, prediction drift, target drift, performance decay
  • Push metrics to Prometheus via Pushgateway
  • Grafana alerting rules (warning + critical)
  • Auto-trigger dag_pte__7_killswitch on critical breach
  • Archive HTML reports in MLflow artifacts + S3
  • Grafana dashboard "Production Model Health"

Issue #585 — W&B first-class integration (Major, 3 days)

  • Every training run calls wandb.init(notes=mlflow_run_id)
  • HPO trials logged to W&B (replaces raw Optuna visualization)
  • W&B summary includes mlflow_model_version cross-link
  • Auto-generated W&B Report per challenger (before promotion)
  • Slack promotion message includes W&B Report URL
  • Public dashboard for external stakeholders (read-only)
  • W&B workspace structure: cvntrade-training, cvntrade-sweeps, cvntrade-reports

Issue #586 — Immutable audit trail (Major, 2 days)

  • Table mlflow_promotion_audit with append-only trigger (PG rules)
  • HMAC signature per row using secret key from K8s
  • Writes on promote / rollback / killswitch / decommission / A/B events
  • Daily export to S3 cold storage (WORM with Object Lock)
  • Grafana panel "Audit Trail Viewer" (read-only)
  • Alembic migration
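
The per-row signature could be sketched as follows, assuming HMAC-SHA256 over a canonical JSON serialization of the row. The field names (`event`, `model`, `version`), the `key_id` column, and the in-memory `keyring` are illustrative assumptions, not the actual `mlflow_promotion_audit` schema; in the design the secret comes from K8s.

```python
import hashlib
import hmac
import json


def canonical_payload(row: dict) -> bytes:
    """Serialize deterministically so identical content always signs identically."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_row(row: dict, key: bytes, key_id: str) -> dict:
    """Return a copy of the row with key_id and an HMAC-SHA256 signature attached."""
    body = dict(row, key_id=key_id)  # key_id is covered by the signature
    signature = hmac.new(key, canonical_payload(body), hashlib.sha256).hexdigest()
    return dict(body, signature=signature)


def verify_row(signed: dict, keyring: dict) -> bool:
    """Verify with the key named by key_id; rotated keys stay in the keyring."""
    body = {k: v for k, v in signed.items() if k != "signature"}
    key = keyring.get(body.get("key_id"))
    if key is None:
        return False  # unknown or missing key id
    expected = hmac.new(key, canonical_payload(body), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, str(signed.get("signature", "")))
```

Storing a key identifier with each row is one way to keep historical signatures verifiable after the 90-day rotation described in §9.3; whether the real table carries such a column is an assumption here.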

Issue #587 — Slack approval workflow with multi-person gate (Major, 2 days)

  • Promotion request created in PG table pending_approvals
  • Slack interactive message (Block Kit) with Approve / Reject buttons
  • Webhook endpoint → Airflow API → dag_pte__5_promotion
  • 48h timeout → auto-reject
  • Multi-person approval for production promotions (2 distinct approvers required, per committee ops feedback): 1 ML owner + 1 operator
  • All approvals signed (HMAC) and logged to audit table
  • Emergency-override path (1 signer) for killswitch/rollback, logged with escalation trail
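
The approval rule above can be sketched as a pure function. The role names `ml_owner` and `operator`, the action names, and the `Approval` type are hypothetical labels chosen for illustration; the 48h timeout and HMAC signing are left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    approver: str  # hypothetical user identifier
    role: str      # assumed to be "ml_owner" or "operator"


REQUIRED_ROLES = {"ml_owner", "operator"}
EMERGENCY_ACTIONS = {"killswitch", "rollback"}


def is_approved(action: str, approvals: list[Approval]) -> bool:
    """Apply the gate: 1 signer for emergencies, 2 distinct signers otherwise."""
    if action in EMERGENCY_ACTIONS:
        return len(approvals) >= 1
    # Production promotion: two different people, one per required role.
    for first in approvals:
        for second in approvals:
            if (first.approver != second.approver
                    and {first.role, second.role} == REQUIRED_ROLES):
                return True
    return False
```

Checking distinct approvers separately from roles matters: one person holding both roles must not be able to satisfy the gate alone.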

Issue #588 — A/B testing framework with routing layer + fallback (Major, 6 days)

  • ab_routing table with per-crypto routing rules + query-failure fallback to champion
  • Runtime service reads ab_routing on every inference (cached 30s to bound PG load)
  • Shadow mode extension (from #568): log challenger predictions with is_shadow=true
  • Canary ladder: 1 crypto (mid-volatility quintile) → 2 (diverse quintiles) → 50% → 100%
  • Time-split mode (optional) + Paper-parallel mode
  • Statistical decision framework (scripts/ab_decision.py) — 10 gates from §5.3: power-analysis MDE, bootstrap CI, BH-corrected paired Wilcoxon, cost-net metrics, ECE calibration, liquidity sanity, leakage audit, operational
  • Console UI to create/view experiments with regime-aware stratification config
  • Auto-rollback daemon: error rate > 2× champion baseline for 15 min → revert to champion
  • Weekly cron dag_pte__ab_decision for stage gates
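
A minimal sketch of the 30-second routing cache with fallback to champion on query failure. `RoutingCache` and the `fetch` callable (standing in for the `ab_routing` query) are hypothetical names; the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable


class RoutingCache:
    """Resolve the model alias per crypto, caching lookups for a short TTL."""

    def __init__(self, fetch: Callable[[str], str], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # assumed to query ab_routing and return an alias
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def resolve(self, crypto: str) -> str:
        now = self._clock()
        cached = self._entries.get(crypto)
        if cached is not None and now - cached[0] < self._ttl_s:
            return cached[1]   # fresh entry: no database round trip
        try:
            alias = self._fetch(crypto)
        except Exception:
            # Query failure: serve the champion and do not cache the fallback,
            # so the next call retries the lookup.
            return "champion"
        self._entries[crypto] = (now, alias)
        return alias
```

In this sketch a failed lookup ignores any stale cached value and always returns the champion, which is the conservative reading of the fallback rule; serving the stale value instead would be a different design choice.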

Issue #589 — Unified observability dashboards + P&L reconciliation (Major, 4 days)

  • Grafana dashboard "Production Model Health" (status per model with traffic light)
  • Grafana dashboard "Drift Deep-Dive" (per-feature PSI over 90 days)
  • Grafana dashboard "Performance Decay" (Sortino live vs backtest)
  • Grafana dashboard "A/B Experiment" (one per active experiment)
  • Grafana dashboard "Cost per Model" (compute + storage + inference)
  • Grafana dashboard "P&L Reconciliation" (v2 per crypto-trader): daily predicted edge vs realized P&L, discrepancy > 10% alert
  • Grafana dashboard "Data Pipeline Health" (v2 per ops): upstream data observability (OHLCV freshness, schema drift, exchange API)
  • SHAP explainability aggregation in W&B Reports (async + sampling)
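
For readers unfamiliar with the PSI shown on the "Drift Deep-Dive" dashboard, here is a small pure-Python sketch of the usual formula, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), over equal-width bins taken from the reference sample. The binning scheme and the `eps` floor are assumptions made for illustration; in the design, drift metrics are produced by Evidently rather than by hand-written code.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` reference."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions give a PSI of zero, and the value grows as mass moves between bins.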

Issue #590 — Auto-generated Model Cards (Nice, 2 days)

  • Template (HTML + JSON) with all required fields
  • Generated on each promotion and stored in s3://cvntrade-artifacts/model_cards/
  • Linked from MLflow registry page
  • Format compatible with Google Model Cards spec

Issue #591 — Feast Feature Store integration (Major, 4 days — new v2)

  • Deploy Feast with Redis online store + Parquet offline store on S3
  • Define feature_view for every existing feature set (OHLCV, indicators, CUSUM diagnostics)
  • Training path reads from Feast offline store (point-in-time correctness enforced)
  • Runtime path reads from Feast online store (low-latency)
  • MLflow model version tags feast_feature_view_version for reproducibility
  • Shadow-run alongside existing feature loading for 2 weeks before cut-over
  • Evidently monitors feature freshness SLO (< 1h stale)

Issue #592 — Leakage Audit DAG (Critical, 2 days — new v2 per ml-engineer)

  • dag_pte__0_leakage_audit runs before every model registration
  • Purging check: no train/val candle overlap
  • Embargo check: N-hour gap between train end and val start (per triple barrier horizon)
  • Feature version pinning check: features_hash matches Feast version at training time
  • Point-in-time check: all features computed from data ≤ label timestamp
  • Blocking behavior: audit fail → model registration fails, no promotion path
  • Audit entry logged to leakage_audit_log table (append-only)
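
Three of the checks above can be sketched on plain timestamps. The function names and the `rows` structure (pairs of feature timestamp and label timestamp) are hypothetical, and the feature-version pinning check is omitted because it depends on the Feast registry.

```python
from datetime import datetime, timedelta


def check_purging(train_ts: list[datetime], val_ts: list[datetime]) -> bool:
    """Purging: no candle timestamp appears in both train and validation."""
    return set(train_ts).isdisjoint(val_ts)


def check_embargo(train_ts: list[datetime], val_ts: list[datetime],
                  embargo: timedelta) -> bool:
    """Embargo: validation starts at least `embargo` after training ends."""
    return min(val_ts) - max(train_ts) >= embargo


def check_point_in_time(rows: list[tuple[datetime, datetime]]) -> bool:
    """Point-in-time: every feature timestamp is <= its label timestamp."""
    return all(feature_ts <= label_ts for feature_ts, label_ts in rows)


def run_leakage_audit(train_ts, val_ts, rows, embargo) -> dict:
    """Run all checks; `passed` is False if any single check fails."""
    results = {
        "purging": check_purging(train_ts, val_ts),
        "embargo": check_embargo(train_ts, val_ts, embargo),
        "point_in_time": check_point_in_time(rows),
    }
    results["passed"] = all(results.values())
    return results
```

A DAG task could fail registration whenever `passed` is False and write the full result dictionary to the audit log, so the failing check is visible afterwards.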

Issue #593 — Auto-retraining DAG (Major, 3 days — new v2 per architect)

  • dag_pte__9_auto_retrain triggered by drift alerts (Evidently → Grafana → webhook)
  • Triggers on: PSI > 0.4 for 6h OR sortino_decay > 50% for 72h OR calibration ECE > 0.10
  • Budget governor: max 1 retrain per crypto per week to bound compute
  • New model enters A/B framework as challenger (§5) — not auto-promoted
  • Notification to #trading-ops with diff vs champion (hyperparams, feature importance)
  • Retraining uses latest Feast offline snapshot + pinned code SHA

Issue #594 — Incident Response Runbooks (Critical, 3 days — new v2 per ops)

  • 8 runbooks in documentation/runbooks/ for all critical alerts
  • Tabletop exercises with team before production deployment
  • Each runbook has: immediate actions / escalation / diagnostics / recovery / post-incident
  • Linked from Grafana alert descriptions (click-through)
  • Reviewed quarterly

Issue #595 — SR 11-7 Model Risk Management formalization (Major, 3 days — new v2 per data-scientist)

  • Validation report template per new model (independent reviewer required)
  • Uncertainty quantification (95% CI via bootstrap) on all metrics
  • Model inventory register in PG (model_risk_register table) — tier classification, limits, review date
  • Position sizing limits codified per model variant, enforced at runtime
  • Quarterly MRM review meeting output archived to s3://cvntrade-artifacts/mrm/
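
The 95% CI requirement can be illustrated with a percentile bootstrap. This is a sketch that assumes independent observations; trade returns are often autocorrelated, in which case a block bootstrap may be more appropriate. The function name and defaults are illustrative.

```python
import random
from statistics import mean


def bootstrap_ci(values, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)  # fixed seed keeps validation reports reproducible
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper
```

Any metric that maps a list of observations to a number (for example a Sortino ratio function) can be passed as `stat`.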

Issue #596 — Calibration Monitoring (Major, 2 days — new v2 per ml-engineer)

  • Compute ECE and Brier Score weekly from model_inferences vs trade_outcomes
  • Prometheus metrics per model version: ece, brier_score, reliability_bin_counts
  • Grafana panel "Calibration" with reliability diagrams
  • Alert: ECE > 0.10 → warning, ECE > 0.15 → trigger auto-retrain gate
  • Post-training calibration (isotonic) mandatory on held-out fold (ADR-15 enforcement)
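
Both calibration metrics are simple to compute. The sketch below assumes binary outcomes (1 = the predicted event occurred) and predicted probabilities of that event, with equal-width confidence bins for ECE; the alert levels mirror the thresholds listed above, and the function names are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between average confidence and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(frequency - confidence)
    return ece


def ece_alert_level(ece):
    """Map an ECE value to the alert levels of this issue."""
    if ece > 0.15:
        return "retrain_gate"
    if ece > 0.10:
        return "warning"
    return "ok"
```

The per-bin member counts are the same quantities a reliability diagram plots, so they could be exported alongside `ece` as `reliability_bin_counts`.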

12. Committee decisions (v2 — resolved from previous open questions)

The 7 open questions from v1 were submitted to the expert committee (session cb99b702). Resolutions:

  1. W&B tier → Start on Free, automate usage tracking in Prometheus, upgrade to Team (~€50/month/user) at 70% quota. Self-host if artifacts > 100 GB. (Issue #589 exposes cost dashboard.)
  2. Audit trail retention → 7 years in S3 WORM with Object Lock (MiFID II Article 28 record-keeping requirement).
  3. A/B canary starting crypto → Choose by volatility quintile (mid), NOT market cap. Regime-aware stratification (§5.1 Mode 2).
  4. Approval authorities → 2-person approval required for production champion promotion (1 ML owner + 1 operator). 1-person OK for killswitch/rollback (safety-first override).
  5. Public W&B dashboard → Read-only, role-based, external stakeholders (board/investors) via W&B workspace-level ACL. No PII; only aggregated metrics.
  6. Regulatory framework → Design for MiFID II + SOC2 intersection (superset of both). MRM follows SR 11-7 as an internal best-practice framework (not a binding regulation here, but a readiness signal). See §7.4.
  7. Cost tracking granularity → Per-model per-crypto per-day (business decisions) + per-trial for HPO sweeps (engineering cost optimization). §6.1 Cost layer.

13. Remaining open questions for future committee

  1. Feast deployment footprint: Redis online store scaling at 10+ cryptos × 1s cadence — do we need Redis Cluster?
  2. A/B crypto bucket overlap: if we have 30 cryptos, does 2-crypto canary produce statistically meaningful results, or do we need correlated-group analysis?
  3. Auto-retraining economics: retraining cost per trigger vs expected Sortino recovery — is the budget governor (1/week/crypto) the right ceiling?
  4. Runtime A/B cache TTL: 30s cache on ab_routing reads OK for candle cadence — but for tick-level runtime in the future, do we need pub/sub invalidation?
  5. SR 11-7 tier mapping: which of our model families qualify as "material" vs "non-material" — affects validation intensity.
  6. Multi-region failover: A/B routing is currently single-region (Paris) — multi-region adds routing table replication complexity.

14. Appendix — ADR impacts

This design reinforces and extends:

  • ADR-26: Grafana as single entry point — preserved, strengthened
  • ADR-29: Naive baseline — Evidently reports include comparison to naive baseline
  • ADR-30: Structured logs as stable interface — monitoring DAG uses log_event
  • ADR-31–38: Logging standards — all new code complies
  • ADR-40: Same kernel BT/paper/live — routing layer preserves this
  • ADR-42: Atomic promotion per crypto — A/B framework extends this with staging
  • ADR-56: A/B testable by design — every factor remains togglable

New ADRs proposed as follow-up:

  • ADR-63: Promotion requires immutable audit entry (no direct MLflow alias change without audit write)
  • ADR-64: A/B routing is the only way to deploy a model in production (no bypass)
  • ADR-65: Model observability metrics must reach Prometheus (no private monitoring)