CVNTrade ML Platform: World-Class Design

Status: Committee-reviewed (session cb99b702, 2026-04-18): PASSED, score 8.0/10, strong consensus
Version: 2.0 (2026-04-18), integrates committee feedback
Authors: Architecture team
Reviewers: 5 experts (architect 8.0, ops 9.0, data-scientist 7.5, crypto-trader 7.5, ml-engineer 8.0)

Review summary:

- Unanimous strengths: tool separation, A/B 4-mode framework, immutable audit (MiFID II / SOC2 ready), 10-layer observability, defense in depth, Model Cards + lineage.
- Dissent: (1) the 20-day timeline was disputed (3 of 5 reviewers judged it optimistic) and revised to 30 days; (2) A/B statistical rigor was questioned (2 of 5 found 50 trades + Cohen's d insufficient given Sortino variance) and replaced by power analysis with MDE.
- Critical additions (v2): Feast as explicit feature store, cost model integrated everywhere, calibration as first-class, SR 11-7 Model Risk Management framework, leakage audit DAG, multi-person approval, incident response runbooks, upstream data observability, execution realism monitoring.

1. Executive Summary

This document proposes the target-state ML platform architecture for CVNTrade. It builds on three tools already in the stack — MLflow, Weights & Biases, Evidently — each with exclusive, non-overlapping responsibilities, and introduces A/B testing as a foundational pattern and a unified model observability layer.

The design is incremental: MLflow stays as the foundation. W&B and Evidently get promoted from "installed but under-used" to first-class roles. The production runtime gains a routing layer that makes champion/challenger real (not decorative). There is no single-vendor lock-in: all components are open-source or offer free tiers.

Objective: pass an authority-level review (regulator, institutional investor, senior MLOps practitioner) on reproducibility, auditability, safety, and observability.

Effort: ~30 days of focused work, broken into 14 issues (listed in §9). The estimate was revised from 20 days after committee dissent: 3 of 5 experts judged the original timeline optimistic given dependencies and validation requirements.

2. Design Principles

Separation of concerns. Each tool has one job; overlapping features are ignored.
Defense in depth. Three independent monitoring surfaces so no single failure is silent.
Reproducibility by construction. Any model in production can be rebuilt bit-for-bit from its artifacts + code SHA + dataset version.
Immutable audit trail. Promotions, demotions, killswitch events are append-only, signed, timestamped.
A/B testing is first-class. Not bolted on: the runtime has a routing layer; the registry supports multiple live variants; statistical significance is built into the decision flow.
Observability is a product feature, not an afterthought. Metrics, logs, traces, drift reports are emitted at every layer and reach a single entry point (Grafana).
Automated response with human override. Drift → auto-quarantine. Promotion → human approval. The right level of autonomy at each step.
Regulatory readiness. Model cards, data lineage, decision explainability, audit log — all generated automatically, not manually.

3. Tool Responsibilities

MLflow: System of Record

Owns exclusively:

- Model Registry (versions, aliases, tags)
- Artifact storage (pickle, parquet, sklearn pipelines) on S3
- Lineage metadata (which run produced which model version)
- Promotion state (which version holds which alias at any time)
- Source of truth for production routing decisions

Interface: CVNTrade_MLFlowManager (existing abstraction, enforced single entry point)

Storage:

- Backend store: PostgreSQL (champollion DB)
- Artifact store: S3 (cvntrade-artifacts/mlflow/)

Exposure: mlflow.cvntrade.eu behind an OAuth2 proxy (new; today it is internal only)

Does NOT own:

- Training dashboards (→ W&B)
- Drift detection (→ Evidently)
- Monitoring metrics (→ Prometheus)
- A/B routing rules (→ Runtime service)

Weights & Biases: Experiment Lab

Owns exclusively:

- Training curves (loss, F1, Sortino per epoch / per trial)
- HPO visualization (parallel coordinates, importance plots, sweep results)
- Cross-experiment comparison (cross-crypto, cross-fold, cross-strategy)
- Shareable Reports: narrative artifacts with interactive charts and markdown for stakeholders
- Public read-only dashboards for external stakeholders (regulators, investors) with role-based access

Interface: wandb.init() at the start of every training run, wandb.log() during, wandb.summary at end.

Cross-reference with MLflow:

import wandb

wandb.init(
    project="cvntrade-training",
    name=f"{symbol}_{strategy}_{ts}",
    config=hyperparams,
    tags=[symbol, strategy, model_type],
    notes=f"mlflow_run_id={mlflow_run.info.run_id}",
)
# At the end of training, record the cross-link back to MLflow
wandb.summary["mlflow_model_version"] = registered_version
wandb.summary["mlflow_model_uri"] = f"models:/{model_name}/{registered_version}"

Workspace structure:

- cvntrade-training: all training runs
- cvntrade-sweeps: HPO sweeps
- cvntrade-reports: curated reports for promotion decisions

Does NOT own:

- Model artifacts (→ MLflow)
- Production routing (→ Runtime)
- Drift detection (→ Evidently)

Evidently: Production Monitor

Owns exclusively:

- Data drift reports (PSI, KL divergence, Wasserstein on every feature)
- Prediction drift (distribution shift on prob_buy, confidence)
- Target drift (when labels arrive H hours after prediction)
- Performance decay (Sortino live vs Sortino backtest, precision/recall by regime)
- HTML reports archived in MLflow artifacts + S3 cold storage
- Data quality checks (missing values, outliers, schema violations)

Interface: Evidently Report + TestSuite objects run daily and on-demand.

Execution:

- Daily via dag_pte__6_monitoring at 06:00 UTC
- On-demand via API endpoint (for operator or committee inspection)

Outputs:

- Prometheus metrics (pushed via Pushgateway): evidently_drift_score, evidently_prediction_drift, sortino_decay_pct
- HTML reports attached to MLflow as run artifacts (audit trail)
- Alerts via Grafana alerting rules (for example drift_score > 0.25 for 24h; full list in §6.3)
- Auto-trigger of killswitch DAG when critical thresholds are breached

Does NOT own:

- Experiment tracking (→ W&B)
- Model storage (→ MLflow)
- Raw log aggregation (→ Loki)

Feast: Feature Store (new in v2 per committee)

Owns exclusively:

- Feature definitions (feature_view schemas) as single source of truth
- Feature versioning (ADR-23 enforcement: features_hash pinned per model version)
- Online serving (low-latency inference path via Redis online store)
- Offline serving (training + backfill path via Parquet offline store)
- Point-in-time correctness (prevents training-serving skew)
- Feature lineage: which upstream tables/computations produced which feature

Cross-system integration:

- MLflow: each model version tags features_hash = Feast feature view version at training time. Serving loads the same version.
- W&B: training runs log Feast feature view references.
- Evidently: monitors feature freshness (SLO: features < 1h stale) and value drift vs Feast offline reference.
- Runtime: reads features from Feast online store (Redis) with fallback to re-compute on cache miss.

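Leakage prevention: Feast enforces temporal separation through point-in-time joins on each feature view's event timestamp column (timestamp_field), so only feature values observed at or before the label timestamp are joined. The created_timestamp_column only breaks ties between rows that share the same event timestamp. Provided event timestamps are recorded correctly, no future data leaks into training features.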
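To illustrate what a point-in-time join guarantees, here is a minimal pure-Python sketch. It is not Feast's implementation: the function name `point_in_time_lookup` and the sample rows are hypothetical, and it only shows the rule that a label may see the latest feature value whose event timestamp is not after the label timestamp.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, as_of):
    """Return the latest feature value with event_ts <= as_of, or None.

    feature_rows: list of (event_ts, value) tuples sorted by event_ts ascending.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, as_of)  # index of first row strictly after as_of
    return feature_rows[idx - 1][1] if idx > 0 else None


t0 = datetime(2026, 4, 1, 0, 0)
# Hypothetical hourly feature values: 40.0 at 00:00, 41.0 at 01:00, ...
rsi_rows = [(t0 + timedelta(hours=h), 40.0 + h) for h in range(4)]

# A label at 02:30 may only see the 02:00 value, never the 03:00 one.
value_at_0230 = point_in_time_lookup(rsi_rows, t0 + timedelta(hours=2, minutes=30))
# A label earlier than the first feature row has nothing to join against.
value_before = point_in_time_lookup(rsi_rows, t0 - timedelta(minutes=1))
```

The same rule is what the leakage audit in §7.5 checks after the fact: every joined feature must have been observable at the time of the row it is attached to.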

Supporting Infrastructure

Airflow: orchestration — DAGs for training, promotion, monitoring, killswitch, rollback, decommission, leakage audit (new), auto-retraining (new)
Loki: log aggregation — 30-day retention, queryable via Grafana
Prometheus: metrics aggregation — hit rates, drift scores, training throughput, calibration metrics (new)
Grafana: single observability entry point (ADR-26) — metrics + logs + traces
Runtime service: model serving for paper/live trading
Console (Streamlit): operator UI — config, run history, PDF download, A/B experiment management (new)
PostgreSQL: operational data — finetune_runs, ftf_config, model_inferences, mlflow_promotion_audit, ab_routing, ab_experiments, pending_approvals, trade_outcomes
Vault or K8s External Secrets (new per committee): HMAC signing keys, OAuth2 credentials with automated rotation

4. Core Workflows

4.1 Training → Registration

Airflow DAG triggers training
        │
        ▼
┌───────────────────┐  mlflow.start_run() → run_id
│  Python training  │
│    pipeline       │  wandb.init(notes=mlflow_run_id)
│                   │
│  HPO loop         │  wandb.log({trial, f1, sortino, ...})  ─► W&B dashboard
│  (n trials)       │  optuna study also logged to W&B
│                   │
│  Fit + validate   │  mlflow.log_params(), mlflow.log_metrics()
│                   │
│  Save artifacts   │  mlflow.xgboost.log_model(...)
│                   │  mlflow.log_artifact(feature_list.json)
└───────────────────┘
        │
        ▼
MLflow Model Registry:
  register_model(CVNTrade_XGBoost_UNIUSDC_1h_ATR1.5_3.0_H5)
  set_alias("challenger", new_version)
        │
        ▼
W&B summary:
  mlflow_model_version = 42
  mlflow_model_uri = "models:/.../42"
  dvc_dataset_hash = "..."  (Phase 5 #575)
        │
        ▼
Slack notification:
  "New challenger: ...v42 — W&B report: <url>"

4.2 Validation → Promotion

Operator receives Slack notification with W&B Report URL
        │
        ▼
Reviews the W&B Report (training curves, HPO landscape, Sortino vs baseline)
        │
        ▼
Reviews the Evidently pre-deployment report (optional — run once on
reference OHLCV data to confirm no input schema drift)
        │
        ▼
Decision: Approve / Reject via Slack interactive button
        │
        ├─ Reject → ModelVersion stays at "challenger" alias
        │          (or moved to "archived" after review period)
        │
        └─ Approve
              │
              ▼
Webhook → Airflow → dag_pte__5_promotion
              │
              ▼
MLflow transition (atomic, ADR-42):
  - Get current champion version → set as "champion_previous"
  - Set new version alias to "champion"
  - Tag with promoted_at, promoted_by, approval_reason
              │
              ▼
Append to mlflow_promotion_audit (PG append-only):
  - id, timestamp, operator_slack_id, model_name, from_version,
    to_version, wandb_report_url, evidently_report_uri,
    artifact_sha256, signature_hmac
              │
              ▼
Slack notification to #trading-ops:
  "Promoted: ... v42 → champion (was v41).
   Report: <wandb>. Audit: <pg_id>"

4.3 Serving

Runtime service loads champion via MLflow URI:

model = mlflow.pyfunc.load_model(
    f"models:/{model_name}@champion"
)

Every inference is logged to the model_inferences table with these fields:

- timestamp, symbol, model_name, model_version, features_hash, prob_buy, prob_sell, prob_hold, confidence, decision

This table is the ground truth for:

- Evidently drift reports (compare with training distribution)
- Target drift (join with trade_outcomes H hours later)
- A/B analysis (see §5)

4.4 Monitoring (replaces stubbed DAG 6)

Daily at 06:00 UTC, dag_pte__6_monitoring runs:

1. Collect reference data from MLflow: the training features DataFrame used for the current champion (artifact).
2. Collect current data from model_inferences table for last 24h.
3. Run Evidently Report:
   - DataDriftPreset → PSI, KL on every feature
   - TargetDriftPreset → conditional on available labels (prediction outcomes 5h later)
   - RegressionPerformancePreset (or custom) → Sortino live vs Sortino backtest
4. Push metrics to Prometheus:
   - evidently_data_drift{crypto, model_version}
   - evidently_prediction_drift{crypto, model_version}
   - sortino_decay_pct{crypto, model_version}
5. Archive HTML report as MLflow artifact tagged monitoring_report/evidently/{date}.html
6. Evaluate alerting rules (Grafana):
   - drift_score > 0.25 for 24h → warning
   - drift_score > 0.40 for 6h → critical → auto-trigger killswitch DAG
   - sortino_decay > 50% for 72h → critical → auto-trigger killswitch DAG
7. Write audit entry: monitoring_run_log table (ran, verdict, actions taken)
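As an illustration of the drift score behind these thresholds, the sketch below computes a Population Stability Index for one feature and maps it to the warning/critical levels listed above. It is a simplified stand-in, not Evidently's implementation: the equal-width binning, the `eps` floor, and the function names `psi` and `severity` are assumptions, and the "for 24h" / "for 6h" duration conditions are left to the Grafana alert rules.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of a single feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so out-of-range current values land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def severity(drift_score):
    """Map a drift score to the levels used by the alerting rules."""
    if drift_score > 0.40:
        return "critical"
    if drift_score > 0.25:
        return "warning"
    return "ok"


reference = [i / 100 for i in range(1000)]  # synthetic feature, roughly uniform
shifted = [v + 3.0 for v in reference]      # same shape, shifted to the right

unchanged_level = severity(psi(reference, reference))
shifted_level = severity(psi(reference, shifted))
```

An identical sample yields a PSI of zero, while the shifted sample empties the lowest bins and lands well above the critical threshold.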

4.5 A/B Testing (see §5)

4.6 Rollback / Killswitch

Killswitch (dag_pte__7_killswitch): tag champion with killswitch=true, lifecycle_phase=quarantine. Runtime checks this tag on every inference and returns HOLD. Append to audit table.
Rollback (dag_pte__8_rollback): promote champion_previous to champion, tag former champion as rolled_back. Append to audit table with reason.

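Manually initiated killswitch and rollback operations require human approval via Slack buttons (same flow as promotion). A killswitch auto-triggered by a critical alert (§6.3) executes immediately, without waiting for approval, and is recorded in the audit table with actor_type = 'automated'.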
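The runtime-side check described for the killswitch can be sketched as follows. This is a minimal illustration under assumptions: `route_decision` is a hypothetical helper, tags are modelled as a plain dict of strings (as MLflow stores them), and the "production" lifecycle value is invented for the example.

```python
def route_decision(model_tags, model_decision):
    """Return HOLD whenever the serving model version is quarantined."""
    if model_tags.get("killswitch") == "true":
        return "HOLD"
    return model_decision


quarantined = {"killswitch": "true", "lifecycle_phase": "quarantine"}
healthy = {"lifecycle_phase": "production"}  # hypothetical tag value

blocked = route_decision(quarantined, "BUY")   # forced to HOLD
allowed = route_decision(healthy, "BUY")       # model decision passes through
```

Keeping the check as a pure function of the tags makes it easy to unit-test independently of the registry lookup.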

5. A/B Testing Framework (First-Class)

A/B testing in trading is tricky because you cannot split the same market between two models in real time without taking positions with both. This design handles it via four complementary modes, each appropriate for a different risk level.

5.1 Four A/B Modes

Mode 1: Shadow (zero risk, zero blast radius)

Challenger runs alongside champion on the same candles, produces predictions, but never executes trades. Both predictions are logged to model_inferences. Comparison is statistical (hit rate, Sortino simulation on retro-trades).

Implemented by: the CandlePipeline shadow mode from #568 (Phase 3). Just needs extension to log challenger predictions with a flag is_shadow=true.

Use case: validate a brand-new challenger before any real exposure.

Mode 2: Canary (bounded risk), revised per committee (v2)

Champion handles N-1 cryptos, challenger handles 1 crypto in production. Regime-aware stratification (per data-scientist + crypto-trader feedback) — the canary crypto is chosen by volatility quintile, not market cap, to reduce regime bias. Rollout ladder with power-analysis-derived sample sizes:

| Stage | Cryptos on challenger | Minimum sample size | Duration | Promotion gate |
|---|---|---|---|---|
| 1 | 1 (mid-cap, mid-volatility) | N = power analysis for MDE 0.3 Sortino | 7d or N trades (whichever is longer) | No critical drift + calibration ECE < 0.05 + liquidity score ≥ 0.7 |
| 2 | 2 (different volatility quintiles) | N per crypto | 7d or N each | Sortino within 80% of champion + regime-balanced performance |
| 3 | 50% of cryptos | aggregate N | 14 days | BH p<0.05 on paired comparison + MDE of 15% Sortino improvement |
| 4 | All | n/a | n/a | Promote to sole champion |

Sample size: derived from power analysis (power=0.8, alpha=0.05, MDE = 0.3 Sortino units) based on observed Sortino variance in training backtests. 50 trades was a placeholder; actual minimum is typically 200-500 depending on strategy frequency.
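To make the power analysis concrete, the following sketch uses the standard normal-approximation formula for a two-sided paired comparison, n = ((z_{1-α/2} + z_{power}) · σ / MDE)². The function name `required_sample_size` and the two σ values are hypothetical; σ here is assumed to be the standard deviation of the paired Sortino difference estimated from training backtests.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, mde=0.3, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sided paired comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.8
    return math.ceil(((z_alpha + z_power) * sigma / mde) ** 2)


# Hypothetical standard deviations for a calmer and a noisier strategy.
n_low_variance = required_sample_size(sigma=1.5)
n_high_variance = required_sample_size(sigma=2.4)
```

With these illustrative inputs the result falls at roughly 200 and 500 observations, which is the order of magnitude quoted above; the required N grows with the square of σ / MDE, so noisier strategies need disproportionately more trades.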

Routing layer with fallback (per architect + ml-engineer):

- Runtime service reads ab_routing table on every inference
- On query failure: log the error and default to the champion alias (ADR-25 compliance: the fallback is never silent, and a safe default is preferred over no serving)
- Automatic rollback: error rate > 2× champion baseline for 15 min → revert crypto to champion

Implemented by: ab_routing table + routing layer with fallback + auto-rollback daemon.

Mode 3: Time-Split (for temporal analysis)

Champion runs Mon-Wed, challenger runs Thu-Sat. Less statistically rigorous but useful for exploring regime sensitivity.

Use case: hypotheses about regime-dependent performance.

Mode 4: Paper Parallel (pre-production)

Challenger runs in paper trading on the same cryptos as champion in production. Full realistic trades simulated. No real money, no correlation with live.

Use case: final validation before canary.

5.2 Routing Layer (new component)

New table ab_routing:

CREATE TABLE ab_routing (
    id SERIAL PRIMARY KEY,
    crypto TEXT NOT NULL,
    active_alias TEXT NOT NULL,  -- 'champion' | 'challenger' | 'champion_previous'
    experiment_id TEXT,          -- UUID of the A/B experiment
    assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    assigned_by TEXT NOT NULL,
    UNIQUE (crypto, experiment_id)
);

Runtime service reads this on every inference:

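import logging

logger = logging.getLogger(__name__)

def get_model_for_crypto(crypto: str) -> str:
    """Return the MLflow alias that should serve this crypto."""
    try:
        # Latest assignment wins; the column is assigned_at (see ab_routing DDL).
        row = pg.query(
            "SELECT active_alias FROM ab_routing WHERE crypto = %s AND assigned_at <= NOW() "
            "ORDER BY assigned_at DESC LIMIT 1",
            (crypto,),
        )
    except Exception:
        # Routing lookup failed: log loudly and fall back to champion (§5.1, Mode 2).
        logger.exception("ab_routing lookup failed for %s; defaulting to champion", crypto)
        return "champion"
    return row["active_alias"] if row else "champion"

alias = get_model_for_crypto(candle.symbol)
model = mlflow.pyfunc.load_model(f"models:/{model_name}@{alias}")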

5.3 Statistical Decision Framework (revised v2 per data-scientist + crypto-trader)

A challenger is promoted only if ALL gates pass:

| Gate | Criterion | Rationale |
|---|---|---|
| Sample size | N trades ≥ power analysis output for strategy-specific Sortino variance | Replaces `50 trades` placeholder (insufficient given the high variance of Sortino). Computed per strategy based on observed σ. |
| Minimum Detectable Effect (MDE) | Promote only if observed effect ≥ MDE of 15% Sortino improvement OR 0.3 Sortino absolute | MDE prevents false positives from small samples. |
| Statistical significance | BH-corrected paired Wilcoxon p < 0.05 across folds + stratified by volatility regime | Paired test controls for market regime; Wilcoxon is non-parametric (Sortino isn't normal). |
| Confidence intervals | Challenger's 95% bootstrap CI for Sortino (n=1000 resamples) must exclude the champion's Sortino, i.e. the CI lower bound sits above the champion baseline | Ensures improvement is robust, not a fluke. |
| Cost-aware | All metrics computed NET of fees + slippage + funding | Sortino optimization without costs is deceptive (crypto-trader). |
| No critical drift | Evidently PSI < 0.25 on top-20 features AND drift-performance correlation verified | Drift threshold must be tied to actual Sortino degradation observed OOS (ADR-15). |
| Calibration quality | Expected Calibration Error (ECE) < 0.05 | Uncalibrated probabilities break position sizing and cost filters (ml-engineer). |
| Liquidity sanity | All cryptos in challenger bucket have 24h volume ≥ $10M AND bid-ask spread ≤ 5bps | Avoids illiquid regime where backtests diverge from reality (crypto-trader). |
| Operational | Error rate ≤ champion + 2σ, latency p99 ≤ champion + 2σ | No regression in observability or safety. |
| No leakage | Leakage audit DAG (new) validates purging + embargo + feature version pinning | Blocks the experiment if temporal separation is violated (ml-engineer). |
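The Benjamini-Hochberg step named in the significance gate can be sketched in a few lines. This is an illustrative implementation of the standard step-up procedure; the function name and the per-regime p-values are hypothetical, and the Wilcoxon tests producing the p-values are assumed to have been run elsewhere.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical paired-Wilcoxon p-values, one per volatility regime.
regime_p_values = [0.010, 0.040, 0.030, 0.200]
decisions = benjamini_hochberg(regime_p_values)
```

In this example only the first regime survives correction, even though three of the four raw p-values are below 0.05, which is exactly the false-discovery control the gate is meant to provide.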

Decision logic:

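# scripts/ab_decision.py (weekly in dag_pte__ab_decision)
def decide(experiment_id):
    """Return (promote, failed_gates); promotion requires every gate to pass."""
    # Each check_* helper is expected to return a result object with a .passed flag.
    gates = [
        check_sample_size(experiment_id, mde_sortino=0.3, power=0.8, alpha=0.05),
        check_mde_achieved(experiment_id, min_mde_pct=15),
        check_statistical_significance(experiment_id, method="wilcoxon_paired_bh_corrected"),
        check_bootstrap_ci_excludes_baseline(experiment_id, n_bootstrap=1000),
        check_cost_net_metrics(experiment_id),
        check_no_critical_drift(experiment_id, psi_threshold=0.25, linked_to_sortino=True),
        check_calibration_ece(experiment_id, threshold=0.05),
        check_liquidity_sanity(experiment_id, min_volume=10_000_000, max_spread_bps=5),
        check_operational_regression(experiment_id, sigma_threshold=2),
        check_leakage_audit_passed(experiment_id),
    ]
    failed = [g for g in gates if not g.passed]
    # all(gates) would always be True for result objects, so test .passed explicitly.
    return not failed, failed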
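One of the gates above, the bootstrap confidence interval, can be sketched with the standard library alone. This is a simplified illustration rather than the project's implementation: `sortino`, `bootstrap_ci`, and `ci_excludes_baseline` are hypothetical names, returns are assumed to be per-trade and already net of costs, and a plain percentile bootstrap is used.

```python
import math
import random


def sortino(returns, target=0.0):
    """Mean excess return over downside deviation (per trade, not annualized)."""
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    return sum(excess) / len(excess) / downside if downside > 0 else float("inf")


def bootstrap_ci(returns, n_bootstrap=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the Sortino ratio."""
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    stats = sorted(
        sortino(rng.choices(returns, k=len(returns))) for _ in range(n_bootstrap)
    )
    lo = stats[int((1 - level) / 2 * n_bootstrap)]
    hi = stats[int((1 + level) / 2 * n_bootstrap) - 1]
    return lo, hi


def ci_excludes_baseline(challenger_returns, champion_sortino):
    """Gate passes only if the whole CI sits above the champion's Sortino."""
    lo, _ = bootstrap_ci(challenger_returns)
    return lo > champion_sortino


# Hypothetical net per-trade returns for a challenger (200 trades).
challenger = [0.02, -0.01, 0.03, 0.01, -0.005, 0.025, 0.015, -0.002] * 25
interval = bootstrap_ci(challenger)
```

Comparing the lower bound to the champion's point estimate is the simplest reading of the gate; a stricter variant would bootstrap the paired difference between challenger and champion returns instead.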

5.4 A/B Experiment Lifecycle

1. Create experiment (UUID + name + hypothesis) via Console UI
2. Define rollout plan (list of stages, criteria, durations)
3. Deploy challenger as 'challenger' alias in MLflow
4. ab_routing assigns 1 crypto to challenger
5. Wait 7 days or until N trades are reached, whichever is longer (monitored by Evidently + Grafana)
6. Weekly analysis job evaluates promotion gate
7. If passed: next stage (more cryptos); if failed: rollback
8. Final stage: challenger becomes sole champion
9. W&B Report archived summarizing the experiment
10. Audit entry written to mlflow_promotion_audit

6. Model Observability (Unified Layer)

6.1 What is observed

| Layer | Metric | Source | Frequency |
|---|---|---|---|
| Input quality | Schema violations, missing values, outliers | Evidently tests | Per inference (sampled) |
| Input drift | PSI, KL on each feature | Evidently daily | Daily |
| Prediction drift | Distribution shift on `prob_buy`, `confidence` | Evidently daily | Daily |
| Target drift | Label distribution change (5h after prediction) | Evidently daily | Daily |
| Concept drift | Prediction vs outcome alignment over time | Custom query + Evidently | Weekly |
| Performance decay | Sortino live vs Sortino backtest | PG query on `trade_outcomes` | Daily |
| Explainability drift | SHAP value distribution per feature | Custom + W&B | Weekly |
| Operational | Latency p50/p99, error rate, throughput | Runtime service → Prometheus | Real-time |
| Business | Net PnL, win rate, Sharpe, drawdown | PG query on `trade_outcomes` | Daily |
| Cost | Compute hours, artifact storage, inference volume | Cloud cost API + PG | Daily |

6.2 Where it is observed

Single entry point: Grafana (ADR-26).

Dashboards:

Model Health Overview — one panel per active model with traffic-light status. Green = no drift + Sortino in spec + no errors. Yellow = warning. Red = auto-quarantined.
Drift Deep-Dive — per-feature drift evolution over last 90 days. Evidently reports embedded via iframe for detailed analysis.
Performance Decay — Sortino live (rolling 7d) vs Sortino backtest per model version. Gap > 30% = critical.
A/B Experiment Dashboard — one per active experiment. Statistical power, current metrics, gate status.
Training Lab (embedded W&B) — curves, HPO, comparisons.
Cost Dashboard — per model per crypto per day.

6.3 Alerting

Grafana alert rules trigger on:

drift_score > 0.25 for 24h → warning to #trading-ops
drift_score > 0.40 for 6h → critical + auto-killswitch
sortino_decay > 50% for 72h → critical + auto-killswitch
inference_error_rate > 1% for 10min → critical + auto-killswitch
inference_latency_p99 > 500ms for 10min → warning

Critical alerts reach Slack + PagerDuty (when deployed).

6.4 Explainability (revised v2)

SHAP values computed asynchronously on a sampling basis (default 10% of inferences) to avoid latency impact on the hot path. Configurable per crypto.

On-demand endpoint: operators can query /explain/{inference_id} for full SHAP computation of a specific decision (audit use case).
Aggregated reports: weekly W&B Reports show SHAP value distributions per feature, flag shifts in feature importance over time (explainability drift).
Cached top-10: top-10 contributing features per inference are stored synchronously in model_inferences.top_features (JSONB) for fast display in Grafana drill-down.

Regulator-ready: any decision can be traced back to features, values, and marginal contribution. This is intended to support the record-keeping and explainability expectations for algorithmic trading under MiFID II Article 17.

6.5 Cost & Execution Realism Monitoring (new v2 per crypto-trader)

Bridges the gap between ML metrics and economic reality:

| Metric | Computation | Alert |
|---|---|---|
| Effective spread per trade | `(exit_price - entry_price) / entry_price - gross_return_backtest` | > 2× backtest assumption → warning |
| Slippage incurred | Actual fill price vs theoretical (close price at signal) | > 5bps deviation from assumption → investigate |
| Fill rate | % of orders filled vs placed | < 95% → flag partial-fill risk |
| Funding rate impact | Σ funding paid on open positions (derivatives only) | > 0.5% of notional / day → review leverage |
| Liquidity score | min(volume_24h / $10M, 1) × (1 - spread_bps / 50) | < 0.7 → flag liquidity-constrained |
| P&L reconciliation | Daily: model predictions → executed trades → realized P&L | Discrepancy > 10% → investigation |

The P&L sanity check (crypto-trader's ONE improvement request) is the reconciliation layer: each day, for each model version, compare the aggregate predicted edge, Σ (prob_buy − threshold), to the realized P&L, adjusted for costs. Systematic deviation indicates model drift OR execution degradation; a single Grafana panel, "P&L Reconciliation", flags the gap.
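The liquidity score formula from the table above translates directly into code. The sketch below is illustrative: the function name is hypothetical, and flooring the spread term at zero (so a spread above 50 bps cannot produce a negative score) is an added assumption, not something the table specifies.

```python
def liquidity_score(volume_24h_usd, spread_bps):
    """min(volume_24h / $10M, 1) x (1 - spread_bps / 50), kept within [0, 1]."""
    volume_term = min(volume_24h_usd / 10_000_000, 1.0)
    spread_term = max(1.0 - spread_bps / 50.0, 0.0)  # assumption: floor at zero
    return volume_term * spread_term


# A crypto that just meets the gate limits ($10M+ volume, 5 bps spread).
score_liquid = liquidity_score(25_000_000, 5)
# A thinner market: $8M volume and a 10 bps spread.
score_thin = liquidity_score(8_000_000, 10)
flagged = score_thin < 0.7
```

A crypto sitting exactly at the liquidity-sanity limits scores 0.9, comfortably above the 0.7 flag level, while the thinner market in the example scores about 0.64 and is flagged as liquidity-constrained.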

6.6 Upstream Data Observability (new v2 per ops)

Monitors the data pipeline before features reach the model:

OHLCV ingestion freshness: last Binance candle timestamp vs now (SLO: < 2 min lag)
Schema drift: column types and names vs expected schema
Missing data: gaps in OHLCV, unexpected NULLs
Data quality score: rollup of validation tests (Great Expectations-compatible format)
Exchange API health: error rate, rate limit hits

These checks are exposed on the Grafana dashboard "Data Pipeline Health".

6.7 Calibration Monitoring (new v2 per ml-engineer)

Model probabilities must be calibrated to be usable for cost/Kelly filters (ADR-46).

Reliability diagrams computed weekly from model_inferences vs outcomes
Expected Calibration Error (ECE) exposed to Prometheus per model version
Brier Score tracked over time — regression in calibration triggers alert
Post-training calibration: isotonic regression or Platt scaling applied on held-out OOS fold (not train set to avoid leakage). Calibrated model is the one registered in MLflow.
Threshold calibration OOS (ADR-15): decision thresholds (threshold_buy) derived on walk-forward validation set, not HPO val set. Pinned per model version in model_inferences for audit.
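For reference, Expected Calibration Error can be computed as below. This is a minimal sketch of the common equal-width-bin definition for a binary outcome (for example prob_buy against whether the buy was correct); the function name and the toy inputs are hypothetical, and the production metric may bin differently.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean of |hit rate - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(hit_rate - avg_conf)
    return ece


# Calibrated toy case: 80% confidence, 8 of 10 predictions correct.
ece_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Overconfident toy case: 90% confidence, only 5 of 10 correct.
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

The calibrated case scores essentially zero and passes the 0.05 gate, whereas the overconfident case scores about 0.4 and would fail both the gate and the 0.10 runbook threshold.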

7. Audit & Compliance

7.1 Immutable Audit Log

Table mlflow_promotion_audit is append-only (PG trigger blocks UPDATE/DELETE):

CREATE TABLE mlflow_promotion_audit (
    id BIGSERIAL PRIMARY KEY,
    event_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    event_type TEXT NOT NULL CHECK (event_type IN (
        'promotion', 'rollback', 'killswitch', 'decommission',
        'ab_start', 'ab_promote', 'ab_rollback'
    )),
    model_name TEXT NOT NULL,
    from_version INT,
    to_version INT,
    from_alias TEXT,
    to_alias TEXT,
    actor_type TEXT NOT NULL CHECK (actor_type IN ('human', 'automated')),
    actor_id TEXT NOT NULL,  -- Slack user ID or system identity
    reason TEXT,
    wandb_report_url TEXT,
    evidently_report_uri TEXT,
    artifact_sha256 TEXT NOT NULL,
    signature_hmac TEXT NOT NULL,  -- HMAC of the row content with secret key
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

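-- Prevent updates and deletes with a trigger that fails loudly.
-- (A rule with DO INSTEAD NOTHING would silently discard the statement.)
CREATE FUNCTION mlflow_promotion_audit_block_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'mlflow_promotion_audit is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER no_update_delete
    BEFORE UPDATE OR DELETE ON mlflow_promotion_audit
    FOR EACH ROW EXECUTE FUNCTION mlflow_promotion_audit_block_mutation();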

-- Daily export to S3 cold storage
-- (separate cron job; S3 bucket has Object Lock enabled for WORM compliance)
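The signature_hmac column can be populated along these lines. This is a sketch under assumptions: the canonical JSON encoding, the choice of HMAC-SHA256, the helper names, and the sample row are illustrative, and the key shown is a placeholder for one fetched from the secret store.

```python
import hashlib
import hmac
import json


def sign_audit_row(row, key):
    """HMAC-SHA256 over a canonical JSON encoding of the row content."""
    # Sorted keys and fixed separators make the encoding deterministic.
    payload = json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_audit_row(row, signature, key):
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_audit_row(row, key), signature)


key = b"placeholder-key"  # real keys would come from Vault / External Secrets
row = {
    "event_type": "promotion",
    "model_name": "example_model",  # hypothetical values for illustration
    "from_version": 41,
    "to_version": 42,
    "actor_type": "human",
    "actor_id": "U123",
    "artifact_sha256": "ab" * 32,
}
signature = sign_audit_row(row, key)
tampered = dict(row, to_version=43)
```

Any change to a signed field invalidates the signature, so a verification pass over the daily S3 export can detect rows that were altered outside the append-only path. Storing a key identifier next to each signature would let historical rows be verified after key rotation.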

7.2 Model Cards (auto-generated)

For every promoted model, a Model Card is generated and stored alongside the MLflow artifact:

Model name, version, creation date
Training data summary (period, n_samples, cryptos included)
Features used (full list, top-10 by importance)
Hyperparameters
Evaluation metrics (train, val, test, walk-forward, holdout)
Known limitations and failure modes
Fairness considerations (cross-crypto performance disparity)
Link to W&B training report
Link to Evidently pre-deployment report
Link to MLflow artifact URI

Format: HTML + JSON. Stored in s3://cvntrade-artifacts/model_cards/.

7.3 Lineage

Every production prediction can be traced back through:

Inference log (model_inferences) → features used, model version
MLflow model version → MLflow run_id, artifact, training parameters
MLflow run tags → W&B run URL, DVC dataset hash (Phase 5 #575), Git commit SHA
DVC → exact training data
Git commit → exact code version

This is intended to satisfy the algorithmic trading audit-trail requirements of MiFID II Article 17 and SOC2 change control requirements.

7.4 Model Risk Management (SR 11-7 framework, new v2 per data-scientist)

Adopts the SR 11-7 Supervisory Guidance on Model Risk Management (Federal Reserve / OCC) as the formalized framework for validation, monitoring, and decommissioning. Key practices:

Independent validation: each production model undergoes validation by a reviewer NOT the one who built it (peer review, expert committee)
Challenger testing: continuous A/B framework (§5) satisfies the "effective challenge" requirement
Uncertainty quantification: all reported metrics carry 95% CI (bootstrap-derived)
Documentation: Model Card + validation report stored in s3://cvntrade-artifacts/model_cards/ and linked from MLflow
Ongoing monitoring: monitoring DAG (#584) with defined thresholds for drift-driven revalidation
Limits and compensating controls: position sizing limits via Kelly filter; max drawdown limits per model variant enforced at routing layer
Remediation triggers: drift > threshold → retraining; persistent underperformance → decommission

7.5 Leakage Audit DAG (new v2 per ml-engineer: ONE critical improvement)

A pre-promotion DAG validates temporal integrity of training data:

Purging audit: no overlap between train and validation candles (ADR-14/15)
Embargo audit: minimum N-hour gap between train end and validation start (per triple barrier horizon)
Feature version pinning: features_hash in model matches Feast feature view version at training time (ADR-23)
Timestamp monotonicity: all features computed from data ≤ inference timestamp (point-in-time correctness)
Label delay audit: labels used in training were not observable at the time of the corresponding features

Blocking behavior: if audit fails → model registration fails with explicit error. No promotion possible without audit pass.
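The purging and embargo checks listed above can be expressed as a small pure function. The sketch below is illustrative only: the function name, the parameters, and the sample timestamps are hypothetical, and the label horizon is taken as 5 hours to match the H5 models mentioned elsewhere in this document.

```python
from datetime import datetime, timedelta


def audit_purge_and_embargo(train_end, val_start, horizon_hours, embargo_hours):
    """Return a list of violations; an empty list means the split passes."""
    violations = []
    # Labels of the last training candles look horizon_hours ahead, so they
    # must be fully resolved before validation begins (purging).
    if train_end + timedelta(hours=horizon_hours) > val_start:
        violations.append("purging: training labels overlap validation window")
    # Independent minimum gap between the two sets (embargo).
    if val_start - train_end < timedelta(hours=embargo_hours):
        violations.append("embargo: gap between train end and validation start too small")
    return violations


val_start = datetime(2026, 4, 1, 0, 0)
clean = audit_purge_and_embargo(datetime(2026, 3, 31, 18, 0), val_start, 5, 5)
leaky = audit_purge_and_embargo(datetime(2026, 3, 31, 22, 0), val_start, 5, 5)
```

A blocking DAG task would raise on any non-empty result, which is what makes registration fail with an explicit error rather than a silent warning.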

8. Incident Response Runbooks (new v2 per ops)

For each critical alert, a runbook in documentation/runbooks/ with:

Immediate human actions (0-5 min)
Escalation path (on-call → team lead → committee)
Diagnostic steps (Grafana panels to check, Loki queries, DB queries)
Recovery procedures (rollback, retrain, manual override)
Post-incident (audit entry, root cause, preventive action)

Required runbooks (before production):

runbook_drift_critical.md — drift_score > 0.40 for 6h
runbook_sortino_decay.md — sortino decay > 50% for 72h
runbook_inference_errors.md — error rate > 1% for 10 min
runbook_latency_spike.md — p99 latency > 500ms for 10 min
runbook_killswitch_triggered.md — auto or manual killswitch
runbook_ab_rollback.md — A/B auto-rollback triggered
runbook_calibration_regression.md — ECE > 0.10
runbook_liquidity_degradation.md — liquidity_score < 0.5 on production crypto

Runbooks are validated via tabletop exercises before production deployment.

9. Migration Plan

9.1 Current state → Target state

| Component | Current | Target | Issue |
|---|---|---|---|
| MLflow UI | Internal only | Public via OAuth2 | #583 |
| Monitoring DAG | Stubbed | Evidently + auto-response | #584 |
| W&B integration | Minimal | First-class with cross-links | #585 |
| Audit trail | Mutable tags | Append-only table + WORM | #586 |
| Promotion | Direct Airflow trigger | Slack approval workflow (2-person) | #587 |
| A/B testing | None | Four-mode framework + routing | #588 |
| Model observability | Scattered | Unified Grafana dashboards | #589 |
| Model cards | None | Auto-generated | #590 |
| Feature store | Ad-hoc | Feast (versioning + online/offline) | #591 |
| Leakage prevention | Partial (ADR-14/15) | Blocking leakage audit DAG | #592 |
| Retraining | Manual | Auto-retraining DAG (drift-driven) | #593 |
| Runbooks | None | 8 runbooks + tabletop exercises | #594 |
| Model risk framework | Informal | SR 11-7 formalized | #595 |
| Calibration monitoring | None | ECE + Brier + reliability diagrams | #596 |

9.2 Sequencing (revised v2: 30 days)

Phase 1: Observability foundation (9 days)

- Issue #584 (monitoring DAG with Evidently): unblocks everything else
- Issue #585 (W&B first-class integration): enables richer reports
- Issue #586 (audit trail with HMAC + WORM): enables trust in subsequent phases
- Issue #596 (calibration monitoring): required before A/B gates can depend on ECE

Phase 2: Safety & exposure (6 days)

- Issue #583 (MLflow UI exposure with OAuth2)
- Issue #587 (Slack approval workflow, multi-person for prod)
- Issue #592 (leakage audit DAG, the blocking gate for promotion)
- Issue #594 (runbooks, needed before any critical auto-response ships)

Phase 3: Feature store & retraining (7 days)

- Issue #591 (Feast feature store with online/offline)
- Issue #593 (auto-retraining DAG driven by drift thresholds)

Phase 4: A/B and lifecycle (8 days)

- Issue #588 (A/B framework: 4 modes + routing layer + fallback)
- Issue #589 (unified observability dashboards + cost + P&L reconciliation)
- Issue #590 (model cards)
- Issue #595 (SR 11-7 model risk formalization)

Total: ~30 days, phased so each delivers value independently. Committee dissent on 20d resolved by explicit buffer for validation + runbook exercises.

9.3 Risks

W&B cost: free tier limit (100 GB artifacts, 5 users). Escalation path (v2): if monthly usage > 70% → upgrade to Team plan (~€50/month/user); if > 100 GB artifacts → migrate to self-hosted W&B on K8s.
Evidently performance: large DataFrames slow. Sample + parallelize reports if needed.
PG WORM trigger compatibility: must test on managed Scaleway PG that triggers are allowed.
A/B routing complexity: adds a critical path; fallback to champion alias on query failure (documented in §5.2); daemon auto-rollback on error-rate regression.
HMAC key management (v2 per ops): keys managed by Vault or K8s External Secrets; automated rotation every 90 days; old keys kept for signature verification of historical records for 7 years.
Feast adoption friction: training code refactor non-trivial. Mitigation: gradual — run Feast alongside existing feature loading for 2 weeks (shadow mode), then cut over.

9.4 Rollback¶

Each phase is additive; rollback is simple:
Disable monitoring DAG → back to current state
Skip W&B logging → no training impact
Revert audit table trigger → mutable as before
A/B routing: single row in ab_routing returns all cryptos to champion
Feast: keep legacy feature loading code paths for 2 sprints as fallback

10. What Authorities See¶

When this design is presented to an authority (regulator, institutional investor, senior MLOps practitioner), they see:

Defense in depth — three independent monitoring surfaces (Evidently, W&B, Prometheus/Grafana) plus Feast feature-level observability plus upstream data observability, so no failure mode is silent.
Deterministic reproducibility — any production prediction can be reconstructed exactly from its lineage chain (features_hash pinned to Feast view, model_version pinned to MLflow artifact, dataset pinned via DVC, code pinned via Git SHA).
Immutable audit with cryptographic integrity — all lifecycle events in an append-only store with HMAC signatures (keys in Vault with automated 90d rotation), 7-year WORM cold storage for MiFID II compliance.
A/B testing as a first-class pattern — formal rollout ladder with regime-aware stratification, 10-gate statistical decision framework (power analysis MDE + bootstrap CI + paired Wilcoxon BH + cost-net metrics + calibration ECE + liquidity sanity + leakage audit), routing layer with auto-rollback.
Leakage prevention as a blocking gate — dedicated DAG validates purging, embargo, feature-version pinning, and point-in-time correctness before any promotion.
Automated response with multi-person human authority — drift auto-quarantines (killswitch); promotions require 2-person approval (ML owner + operator) via signed Slack button; auto-retraining is gated (challenger only, never direct promotion).
Model Risk Management per SR 11-7 — independent validation, challenger testing, uncertainty quantification, position-sizing limits, documented remediation triggers.
Regulatory compliance-ready — Model Cards, lineage, explainability (async SHAP, on-demand per-inference), audit trail all generated automatically. Meets MiFID II Article 28 + SOC2.
Economic realism — ML metrics always computed NET of fees/slippage/funding; P&L reconciliation dashboard catches divergence between predicted edge and realized return.
Incident readiness — 8 runbooks covering all critical alerts, tabletop-exercised before production, linked from Grafana alert descriptions.
Observability is end-to-end — Grafana is the single entry point (ADR-26); no tool-hopping required to diagnose an issue.
No vendor lock-in — MLflow, W&B (free tier → self-host escalation path), Evidently, Feast, Grafana, Loki, and Prometheus are all either open-source or have a clear migration path.

11. Issues List¶

Issue #583 — Expose MLflow UI with OAuth2 (Major, 1 day)¶

Ingress mlflow.cvntrade.eu with TLS
OAuth2 proxy (same IdP as rest of stack)
Roles: viewer (read), operator (promote/rollback), admin (registry edit)
Grafana link-out to MLflow run pages

Issue #584 — Monitoring DAG with Evidently (Critical, 5 days)¶

Implement dag_pte__6_monitoring (currently stubbed)
model_inferences table for production predictions
Evidently Report daily: data drift, prediction drift, target drift, performance decay
Push metrics to Prometheus via Pushgateway
Grafana alerting rules (warning + critical)
Auto-trigger dag_pte__7_killswitch on critical breach
Archive HTML reports in MLflow artifacts + S3
Grafana dashboard "Production Model Health"

Issue #585 — W&B first-class integration (Major, 3 days)¶

Every training run calls wandb.init(notes=mlflow_run_id)
HPO trials logged to W&B (replaces raw Optuna visualization)
W&B summary includes mlflow_model_version cross-link
Auto-generated W&B Report per challenger (before promotion)
Slack promotion message includes W&B Report URL
Public dashboard for external stakeholders (read-only)
W&B workspace structure: cvntrade-training, cvntrade-sweeps, cvntrade-reports

Issue #586 — Immutable audit trail (Major, 2 days)¶

Table mlflow_promotion_audit with append-only trigger (PG rules)
HMAC signature per row using secret key from K8s
Writes on promote / rollback / killswitch / decommission / A/B events
Daily export to S3 cold storage (WORM with Object Lock)
Grafana panel "Audit Trail Viewer" (read-only)
Alembic migration
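
The per-row HMAC signature could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: the function names, the `key_id` field, the JSON canonicalization, and the in-memory `keyring` (standing in for secrets fetched from K8s or Vault) are all assumptions. Carrying a `key_id` in each row is one way to keep historical rows verifiable after a key rotation.

```python
import hashlib
import hmac
import json


def canonical_bytes(row):
    """Serialize a row deterministically so identical content signs identically."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_row(row, key_id, keyring):
    """Return a copy of the row with key_id and an HMAC-SHA256 signature attached."""
    payload = dict(row, key_id=key_id)
    digest = hmac.new(keyring[key_id], canonical_bytes(payload), hashlib.sha256)
    return dict(payload, signature=digest.hexdigest())


def verify_row(signed, keyring):
    """Recompute the signature with the key named in the row; compare in constant time."""
    payload = {k: v for k, v in signed.items() if k != "signature"}
    key = keyring.get(payload.get("key_id"))
    if key is None:
        return False
    expected = hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(signed.get("signature", "")))


# Hypothetical keyring: the retired key stays available for verification only.
keyring = {"2026-q1": b"retired-secret", "2026-q2": b"active-secret"}
event = {"event": "promote", "crypto": "BTC", "model_version": "12", "actor": "ml-owner"}
signed = sign_row(event, "2026-q2", keyring)
print(verify_row(signed, keyring))
```

Any change to a signed field, including `key_id`, makes `verify_row` return `False`, which is what the daily export job could check before writing to cold storage.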

Issue #587 — Slack approval workflow with multi-person gate (Major, 2 days)¶

Promotion request created in PG table pending_approvals
Slack interactive message (Block Kit) with Approve / Reject buttons
Webhook endpoint → Airflow API → dag_pte__5_promotion
48h timeout → auto-reject
Multi-person approval for production promotions (2 distinct approvers required, per committee ops feedback): 1 ML owner + 1 operator
All approvals signed (HMAC) and logged to audit table
Emergency-override path (1 signer) for killswitch/rollback, logged with escalation trail
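
The approval rules above can be sketched as a pure function over the recorded votes. This is an illustration under assumptions: the tuple layout, the role names `ml_owner` and `operator`, and the rule that a single Reject click rejects the request are hypothetical choices, not confirmed behavior.

```python
from datetime import datetime, timedelta

APPROVAL_TIMEOUT = timedelta(hours=48)      # 48h timeout, then auto-reject
REQUIRED_ROLES = {"ml_owner", "operator"}   # assumed role identifiers


def approval_status(requested_at, approvals, now, emergency=False):
    """Return 'approved', 'pending' or 'rejected' for one promotion request.

    approvals: list of (user_id, role, decision), decision in {'approve', 'reject'}.
    emergency: True for killswitch/rollback, where one signer is enough.
    """
    if any(decision == "reject" for _, _, decision in approvals):
        return "rejected"  # assumption: any single rejection is final
    approvers = {}
    for user, role, decision in approvals:
        if decision == "approve":
            approvers.setdefault(user, role)  # one vote per person
    if emergency:
        approved = len(approvers) >= 1
    else:
        # Two distinct people who together cover both required roles.
        approved = len(approvers) >= 2 and REQUIRED_ROLES <= set(approvers.values())
    if approved:
        return "approved"
    if now - requested_at >= APPROVAL_TIMEOUT:
        return "rejected"
    return "pending"


t0 = datetime(2026, 4, 18, 9, 0)
votes = [("alice", "ml_owner", "approve"), ("bob", "operator", "approve")]
print(approval_status(t0, votes, t0 + timedelta(hours=2)))
```

Keying approvers by user means one person holding both roles cannot satisfy the gate alone.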

Issue #588 — A/B testing framework with routing layer + fallback (Major, 6 days)¶

ab_routing table with per-crypto routing rules + query-failure fallback to champion
Runtime service reads ab_routing on every inference (cached 30s to bound PG load)
Shadow mode extension (from #568): log challenger predictions with is_shadow=true
Canary ladder: 1 crypto (mid-volatility quintile) → 2 (diverse quintiles) → 50% → 100%
Time-split mode (optional) + Paper-parallel mode
Statistical decision framework (scripts/ab_decision.py) — 10 gates from §5.3: power-analysis MDE, bootstrap CI, BH-corrected paired Wilcoxon, cost-net metrics, ECE calibration, liquidity sanity, leakage audit, operational
Console UI to create/view experiments with regime-aware stratification config
Auto-rollback daemon: error rate > 2× champion baseline for 15 min → revert to champion
Weekly cron dag_pte__ab_decision for stage gates
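
The cached routing lookup with champion fallback might be sketched as follows. The class name, the `fetch_rules` callable (standing in for the query against `ab_routing`), and the variant labels are hypothetical; the 30-second TTL and the fallback-to-champion behavior come from the list above.

```python
import time


class AbRouter:
    """Resolve the model variant per crypto from routing rules cached for a short TTL."""

    def __init__(self, fetch_rules, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch_rules = fetch_rules  # callable returning {crypto: variant}
        self._ttl = ttl_seconds
        self._clock = clock
        self._rules = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return  # cache still fresh, no database round trip
        try:
            self._rules = dict(self._fetch_rules())
        except Exception:
            # Query failure: drop all rules so every crypto resolves to champion.
            self._rules = {}
        # Record the attempt even on failure, so a broken database is
        # retried once per TTL rather than on every inference.
        self._loaded_at = now

    def variant_for(self, crypto):
        self._refresh_if_stale()
        return self._rules.get(crypto, "champion")


router = AbRouter(lambda: {"BTC": "challenger"})
print(router.variant_for("BTC"), router.variant_for("ETH"))
```

Injecting the clock keeps the TTL logic testable without sleeping. A crypto with no routing row resolves to champion, so the safe path is also the default path.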

Issue #589 — Unified observability dashboards + P&L reconciliation (Major, 4 days)¶

Grafana dashboard "Production Model Health" (status per model with traffic light)
Grafana dashboard "Drift Deep-Dive" (per-feature PSI over 90 days)
Grafana dashboard "Performance Decay" (Sortino live vs backtest)
Grafana dashboard "A/B Experiment" (one per active experiment)
Grafana dashboard "Cost per Model" (compute + storage + inference)
Grafana dashboard "P&L Reconciliation" (v2 per crypto-trader): daily predicted edge vs realized P&L, discrepancy > 10% alert
Grafana dashboard "Data Pipeline Health" (v2 per ops): upstream data observability (OHLCV freshness, schema drift, exchange API)
SHAP explainability aggregation in W&B Reports (async + sampling)
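
For readers unfamiliar with the per-feature PSI shown in the "Drift Deep-Dive" dashboard, the sketch below computes a Population Stability Index in plain Python. The binning strategy (equal-width bins over the reference range) and the `eps` floor for empty bins are assumptions made for illustration; the monitoring tool's own computation may bin differently and give different values.

```python
import math
from bisect import bisect_right


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Uses equal-width bins spanning the reference range. Values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(1, n_bins)]  # interior bin edges

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(round(psi(reference, reference), 6), psi(reference, shifted) > 0.4)
```

An identical sample yields a PSI of zero, while the shifted sample, which leaves half of the reference bins empty, lands far above 0.4.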

Issue #590 — Auto-generated Model Cards (Nice, 2 days)¶

Template (HTML + JSON) with all required fields
Generated on each promotion and stored in s3://cvntrade-artifacts/model_cards/
Linked from MLflow registry page
Format compatible with Google Model Cards spec

Issue #591 — Feast Feature Store integration (Major, 4 days — new v2)¶

Deploy Feast with Redis online store + Parquet offline store on S3
Define feature_view for every existing feature set (OHLCV, indicators, CUSUM diagnostics)
Training path reads from Feast offline store (point-in-time correctness enforced)
Runtime path reads from Feast online store (low-latency)
MLflow model version tags feast_feature_view_version for reproducibility
Shadow-run alongside existing feature loading for 2 weeks before cut-over
Evidently monitors feature freshness SLO (< 1h stale)

Issue #592 — Leakage Audit DAG (Critical, 2 days — new v2 per ml-engineer)¶

dag_pte__0_leakage_audit runs before every model registration
Purging check: no train/val candle overlap
Embargo check: N-hour gap between train end and val start (per triple barrier horizon)
Feature version pinning check: features_hash matches Feast version at training time
Point-in-time check: all features computed from data ≤ label timestamp
Blocking behavior: audit fail → model registration fails, no promotion path
Audit entry logged to leakage_audit_log table (append-only)
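
The purging, embargo, and point-in-time checks can be illustrated with a few lines of Python. This is a simplified sketch assuming a single chronological train/validation split identified by candle timestamps; the function names and the tuple layout are hypothetical, and a real audit would also cover the feature-version pinning check.

```python
from datetime import datetime, timedelta


def split_violations(train_times, val_times, embargo):
    """Return human-readable violations; an empty list means the split passes."""
    violations = []
    overlap = set(train_times) & set(val_times)
    if overlap:
        violations.append(f"purging: {len(overlap)} candle(s) in both train and val")
    gap = min(val_times) - max(train_times)
    if gap < embargo:
        violations.append(f"embargo: gap {gap} is shorter than required {embargo}")
    return violations


def point_in_time_violations(rows):
    """rows: iterable of (feature_timestamp, label_timestamp).

    Returns the indices of rows whose features use data later than the label.
    """
    return [i for i, (feat_ts, label_ts) in enumerate(rows) if feat_ts > label_ts]


start = datetime(2026, 1, 1)
train = [start + timedelta(hours=h) for h in range(10)]      # candles 00:00 to 09:00
val = [start + timedelta(hours=h) for h in range(14, 20)]    # candles 14:00 to 19:00
print(split_violations(train, val, embargo=timedelta(hours=4)))
print(split_violations(train, val, embargo=timedelta(hours=6)))
```

With a 5-hour gap between the last training candle and the first validation candle, the split passes a 4-hour embargo and fails a 6-hour one. A blocking gate would refuse registration whenever either function returns a non-empty list.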

Issue #593 — Auto-retraining DAG (Major, 3 days — new v2 per architect)¶

dag_pte__9_auto_retrain triggered by drift alerts (Evidently → Grafana → webhook)
Triggers on: PSI > 0.4 for 6h OR sortino_decay > 50% for 72h OR calibration ECE > 0.10
Budget governor: max 1 retrain per crypto per week to bound compute
New model enters A/B framework as challenger (§5) — not auto-promoted
Notification to #trading-ops with diff vs champion (hyperparams, feature importance)
Retraining uses latest Feast offline snapshot + pinned code SHA
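
The trigger conditions and the budget governor listed above could be combined as in this sketch. The thresholds are the ones from the list; the function names, the "hours sustained" inputs, and the reading of "1 retrain per crypto per week" as a rolling 7-day cooldown are assumptions.

```python
from datetime import datetime, timedelta

RETRAIN_COOLDOWN = timedelta(days=7)  # assumed rolling window for the budget governor


def retrain_reasons(hours_psi_above_0_4, hours_sortino_decay_above_50pct, ece):
    """List which drift conditions currently hold for one crypto."""
    reasons = []
    if hours_psi_above_0_4 >= 6:
        reasons.append("psi")
    if hours_sortino_decay_above_50pct >= 72:
        reasons.append("sortino_decay")
    if ece > 0.10:
        reasons.append("ece")
    return reasons


def should_retrain(reasons, last_retrain_at, now):
    """Allow a retrain only if a condition holds and the cooldown has elapsed."""
    if not reasons:
        return False
    if last_retrain_at is not None and now - last_retrain_at < RETRAIN_COOLDOWN:
        return False  # budget governor: already retrained within the window
    return True


now = datetime(2026, 4, 18, 12, 0)
reasons = retrain_reasons(hours_psi_above_0_4=7, hours_sortino_decay_above_50pct=0, ece=0.05)
print(reasons, should_retrain(reasons, now - timedelta(days=3), now))
```

Separating "why" from "whether" lets the notification to #trading-ops report the triggering conditions even when the governor suppresses the retrain.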

Issue #594 — Incident Response Runbooks (Critical, 3 days — new v2 per ops)¶

8 runbooks in documentation/runbooks/ for all critical alerts
Tabletop exercises with team before production deployment
Each runbook has: immediate actions / escalation / diagnostics / recovery / post-incident
Linked from Grafana alert descriptions (click-through)
Reviewed quarterly

Issue #595 — SR 11-7 Model Risk Management formalization (Major, 3 days — new v2 per data-scientist)¶

Validation report template per new model (independent reviewer required)
Uncertainty quantification (95% CI via bootstrap) on all metrics
Model inventory register in PG (model_risk_register table) — tier classification, limits, review date
Position sizing limits codified per model variant, enforced at runtime
Quarterly MRM review meeting output archived to s3://cvntrade-artifacts/mrm/
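
As an illustration of the bootstrap uncertainty quantification mentioned above, here is a minimal percentile-bootstrap sketch using only the standard library. The function name and defaults are hypothetical. It resamples observations independently, which ignores serial correlation; for autocorrelated trade returns a block bootstrap may be more appropriate.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`.

    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    rng = random.Random(seed)  # fixed seed keeps validation reports reproducible
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-trade returns, for illustration only.
returns = [0.01 * i for i in range(-10, 31)]
low, high = bootstrap_ci(returns)
print(low < statistics.mean(returns) < high)
```

Passing a different `stat` callable (for example a Sortino ratio function) reuses the same resampling loop for any metric in the validation report.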

Issue #596 — Calibration Monitoring (Major, 2 days — new v2 per ml-engineer)¶

Compute ECE and Brier Score weekly from model_inferences vs trade_outcomes
Prometheus metrics per model version: ece, brier_score, reliability_bin_counts
Grafana panel "Calibration" with reliability diagrams
Alert: ECE > 0.10 → warning, ECE > 0.15 → trigger auto-retrain gate
Post-training calibration (isotonic) mandatory on held-out fold (ADR-15 enforcement)
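
The two calibration metrics can be computed as in the sketch below, assuming binary outcomes (1 = the predicted event occurred) and predicted probabilities for the positive class. The equal-width binning with 10 bins and the function names are assumptions; the alert levels mirror the thresholds listed above.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin-size-weighted mean of |observed frequency - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 belongs to the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(freq - mean_p)
    return ece


def calibration_alert(ece):
    """Map ECE to the alert levels described in the list above."""
    if ece > 0.15:
        return "retrain_gate"
    if ece > 0.10:
        return "warning"
    return "ok"


# Overconfident toy model: predicts 0.9 but is right only half the time.
probs = [0.9] * 10
outcomes = [1, 0] * 5
print(round(expected_calibration_error(probs, outcomes), 3), calibration_alert(0.4))
```

The per-bin member counts computed inside `expected_calibration_error` are also the raw material for a reliability diagram.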

12. Committee decisions (v2 — resolved from previous open questions)¶

The 7 open questions from v1 were submitted to the expert committee (session cb99b702). Resolutions:

W&B tier → Start on Free, automate usage tracking in Prometheus, upgrade to Team (~€50/month/user) at 70% quota. Self-host if artifacts > 100 GB. (Issue #589 exposes cost dashboard.)
Audit trail retention → 7 years in S3 WORM with Object Lock (MiFID II Article 28 record-keeping requirement).
A/B canary starting crypto → Choose by volatility quintile (mid), NOT market cap. Regime-aware stratification (§5.1 Mode 2).
Approval authorities → 2-person approval required for production champion promotion (1 ML owner + 1 operator). 1-person OK for killswitch/rollback (safety-first override).
Public W&B dashboard → Read-only, role-based, external stakeholders (board/investors) via W&B workspace-level ACL. No PII; only aggregated metrics.
Regulatory framework → Design for the union of MiFID II and SOC2 requirements (a superset of both). MRM follows SR 11-7 as an internal best-practice framework (not regulatorily binding here, but a readiness signal). See §7.4.
Cost tracking granularity → Per-model per-crypto per-day (business decisions) + per-trial for HPO sweeps (engineering cost optimization). §6.1 Cost layer.

13. Remaining open questions for future committee¶

Feast deployment footprint: Redis online store scaling at 10+ cryptos × 1s cadence — do we need Redis Cluster?
A/B crypto bucket overlap: if we have 30 cryptos, does 2-crypto canary produce statistically meaningful results, or do we need correlated-group analysis?
Auto-retraining economics: retraining cost per trigger vs expected Sortino recovery — is the budget governor (1/week/crypto) the right ceiling?
Runtime A/B cache TTL: a 30s cache on ab_routing reads is acceptable at candle cadence, but would a future tick-level runtime need pub/sub invalidation?
SR 11-7 tier mapping: which of our model families qualify as "material" vs "non-material" — affects validation intensity.
Multi-region failover: A/B routing is currently single-region (Paris) — multi-region adds routing table replication complexity.

14. Appendix — ADR impacts¶

This design reinforces and extends:

ADR-26: Grafana as single entry point — preserved, strengthened
ADR-29: Naive baseline — Evidently reports include comparison to naive baseline
ADR-30: Structured logs as stable interface — monitoring DAG uses log_event
ADR-31–38: Logging standards — all new code complies
ADR-40: Same kernel BT/paper/live — routing layer preserves this
ADR-42: Atomic promotion per crypto — A/B framework extends this with staging
ADR-56: A/B testable by design — every factor remains togglable

New ADRs proposed as follow-up:

ADR-63: Promotion requires immutable audit entry (no direct MLflow alias change without audit write)
ADR-64: A/B routing is the only way to deploy a model in production (no bypass)
ADR-65: Model observability metrics must reach Prometheus (no private monitoring)