
Architecture — Training Pipeline

Version: 3.3
Date: 2026-05-13 (V3.3, restructured for committee review: flat structure, harness-first, S18 isolated, legacy demoted)
Status: Canonical harness path in production; model promotion blocked pending S18-S19 regression diagnosis

Status decomposition (committee gate)

| Domain | Status | Anchor |
|---|---|---|
| Harness architecture | production canonical (since PR #891 + #898, 2026-05-10) | §4 Harness Path |
| Hyperparameters externalization | production via ADR-90 (since PR #904, 2026-05-12) | §5 Hyperparameter Resolution (ADR-90) |
| Harness-trained model promotion | frozen: promotion blocked until S18-S19 close | §2 Current Status & Promotion Freeze |
| S18 regression | diagnostic in flight (plan_review PASSED OK 4298520f, OP wp#154) | §6 Known Regression — S18 |

1. Executive Summary

The training pipeline is now canonically routed through the Hamilton-based training harness for XGBoost, LightGBM and CatBoost. Since PR #904, all numeric hyperparameter defaults and HPO ranges are resolved at runtime from PG ftf_config.base_env via ADR-90; hard-coded hyperparameter values in source code are forbidden, and CI gate G5 enforces the ban.

However, the post-harness training path is currently under diagnosis. The S17 canary FTF run exposed shallow-training behavior: LightGBM stops after 1-2 iterations across all trials, XGBoost shows a median best_iteration of about 17, and model quality is materially below the pre-#891 baseline. Current post-S17 f1_buy results on defi_top5 5m are CatBoost 0.39, LightGBM 0.37 and XGBoost 0.09, versus a pre-#891 baseline of about 0.42.

As a result, the harness architecture is production-canonical, but harness-trained model promotion is frozen until CVN-N001-EE-S18 and S19 are closed. The current leaderboard is informational only and must not be used for live trading promotion.

Pipeline flow (canonical, post-#898)

OHLCV → Enrichment → Labeling → Feature Engineering
    → Temporal Split → Harness DAG
    → { XGB | LGB | CB } via `harness.train_one(model_type, datasets, hpo_params)`
    → Evaluation → MLflow

CUSUM is applied at inference (per ADR-46 + committee reco) ; it is NOT a training filter. The legacy "Stage 7 HPO + XGBoost Training" name is retained in §8 Legacy Pipeline Analysis for historical context but the current canonical path is described in §4-§5.

Consolidated performance snapshot

Single source of truth, referenced by every other section:

| Baseline | State | Model | f1_buy | Notes |
|---|---|---|---|---|
| pre-#891 | legacy autonomous trainers | XGB+LGB+CB (3 trainers, ~2400 LOC) | ~0.42 | reference target: what we want to recover |
| post-#891 | harness migration | (all 3) | ~0.22 | the 2026-05-11 incident that motivated S17 |
| post-S17 | harness + Console-seeded HPs | CatBoost | 0.39 | within range, best_iter 12-155 mixed |
| post-S17 | harness + Console-seeded HPs | LightGBM | 0.37 | shallow + θ-sweep at θ=0.2 compensates (best_iter=1 in 53/53 trials) |
| post-S17 | harness + Console-seeded HPs | XGBoost | 0.09 | shallow + θ=0.5 fixed → near-empty buy classifier; best_iter p50=17 |

Promotion-freeze decision is anchored on this table : no harness-trained model goes live until f1_buy on defi_top5 5m ATR0.5_1.5_H4 returns to ≥ 0.40 on ≥ 4/5 cryptos for all 3 model types (mlops_readiness.md §4 canary criterion).
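The criterion above can be read as a small predicate. The sketch below is illustrative only: the function name, the input layout (model type mapped to per-crypto f1_buy) and the model-type strings are assumptions, not the project's actual API.

```python
from typing import Mapping

F1_BUY_FLOOR = 0.40        # threshold quoted in the criterion above
MIN_PASSING_CRYPTOS = 4    # "at least 4 of 5 cryptos"
REQUIRED_MODELS = ("xgboost", "lightgbm", "catboost")  # assumed identifiers


def f1_gate_passes(f1_buy: Mapping[str, Mapping[str, float]]) -> bool:
    """Return True when every required model clears the floor on >= 4 cryptos.

    f1_buy maps model_type -> {crypto_symbol: f1_buy on the canary}.
    """
    for model in REQUIRED_MODELS:
        per_crypto = f1_buy.get(model)
        if not per_crypto:
            return False  # a missing model type can never satisfy the gate
        passing = sum(1 for score in per_crypto.values() if score >= F1_BUY_FLOOR)
        if passing < MIN_PASSING_CRYPTOS:
            return False
    return True
```

The gate is conjunctive across model types, so a single weak model keeps the freeze in place even when the other two clear the floor.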


2. Current Status & Promotion Freeze

| Component | Production state | Promotion gate |
|---|---|---|
| Harness binary (`src/training/harness/`) | production canonical (PR #891 + #898) | n/a (architectural change, not a model) |
| Hyperparameter resolver (`commun.finetune.hyperparams.resolve`) | production canonical (PR #904 / ADR-90) | n/a (config plumbing) |
| Console seeding (455 keys in ftf_config.base_env) | seeded 2026-05-12T18:34 UTC, 100% coverage | n/a |
| Harness-trained XGB / LGB / CB models | trained, results in finetune_results | FROZEN: promotion blocked pending S18-S19 |
| FTF leaderboard | running daily | informational only, not a promotion signal |
| Live trading on harness-trained models | forbidden | gate = S18 verdict LOCK + S19 fix verified on canary |

Until both CVN-N001-EE-S18 (diagnostic) and the follow-up CVN-N001-EE-S19 (remediation) close with a verified canary, the operator MUST NOT promote any harness-trained model to live trading. Backtests and shadow runs are permitted ; auto-promotion is OFF (ADR-2).


3. Current Canonical Architecture

The pipeline below is the version in production as of 2026-05-13. See §8 Legacy Pipeline Analysis for the pre-#891 path (kept for historical bottleneck analysis ; do NOT reference for current implementation).

Binance OHLCV
Enrichment (technical indicators)
Labeling (triple-barrier ATR-dynamic)
Feature Engineering (Hamilton DAG, fitted at train, frozen at inference)
Temporal split (walk-forward, 5 folds × {train, val, test})
Harness DAG (Hamilton, plugin registry)
   ┌─ XGB ──┐
   ├─ LGB ──┤ → resolver `resolve(MODEL, TF, PARAM)` reads from PG ftf_config.base_env
   └─ CB ──┘   (455 keys, Console-editable, ADR-90)
Evaluation (per-fold f1_buy / sortino / expectancy / drawdown, with bootstrap CI)
MLflow registry (versioned model + feature contract pinned, ADR-23 / ADR-24)
{Promotion gate} ← currently FROZEN pending S18-S19 — see §2

Each upstream stage (OHLCV / Enrichment / Labeling / FE / Split) is unchanged from the legacy pipeline; only the training step (Stage 7 in legacy numbering) was replaced. The harness is therefore a drop-in replacement at the training boundary.



4. Harness Path

The legacy per-model autonomous trainer approach (Stages 7-9 of the legacy pipeline, see §8 Legacy Pipeline Analysis) has been replaced by a Hamilton-based training harness with plugin registries (ADR-89). This is the canonical path for every single-model and ensemble training in production since 2026-05-10 (PR #891 + #898). Promotion of harness-trained models is currently FROZEN — see §2 Current Status & Promotion Freeze.

High-level data flow

flowchart LR
    subgraph FTF[FTF runner / orchestrator]
        OR["CVNTrade_AutonomousOrchestrator<br/>._create_autonomous_trainer(model_type)"]
    end
    subgraph Harness[training.harness]
        REG{{"MODEL_REGISTRY<br/>ENSEMBLE_REGISTRY"}}
        AMT[CVNTrade_AutonomousModelTrainer<br/>generic, parameterised by model_type]
        AET[CVNTrade_AutonomousEnsembleTrainer<br/>generic, parameterised by ensemble_name]
        T1["train_one(model_type, datasets, hpo)<br/>Hamilton Driver"]
        TE["train_ensemble(name, datasets, hpo)<br/>Hamilton Driver"]
    end
    subgraph DAGs[harness.dags]
        XGD["xgboost_dag.py"]
        LGD["lightgbm_dag.py"]
        CBD["catboost_dag.py"]
        ENS1["blend_avg_dag.py"]
        ENS2["stack_logreg_dag.py"]
        ENS3["stack_xgb_meta_dag.py"]
    end
    subgraph Adapters[harness.adapters]
        XA[XGBAdapter]
        LA[LGBAdapter]
        CA[CBAdapter]
        EA[EnsembleAdapter]
    end
    OR --> REG
    REG -- single model --> AMT
    REG -- ensemble --> AET
    AMT --> T1
    AET --> TE
    T1 --> XGD & LGD & CBD
    TE --> ENS1 & ENS2 & ENS3
    XGD --> XA
    LGD --> LA
    CBD --> CA
    ENS1 & ENS2 & ENS3 --> EA
    EA -. composes .-> XA & LA & CA

Adding a new model

  1. Drop a new file src/training/harness/dags/models/<name>_dag.py declaring the Hamilton DAG nodes (<name>_native, <name>_adapter, <name>_metrics_val, <name>_metrics_test, <name>_artifact) + the per-model HPO search space. Each numeric default MUST be read via commun.finetune.hyperparams.resolve(_MODEL, current_timeframe(), "<PARAM>") (per ADR-90 + CI grep gate G5) ; HPO range MIN/MAX/SCALE via resolve_hpo_range(_MODEL, current_timeframe(), "<PARAM>").
  2. Append a register_model(...) call at the bottom of the file (or use the convention scan in harness/registry.py).
  3. Seed the Console with the model's HP defaults + HPO ranges for the 5 timeframes (1M / 5M / 15M / 30M / 1H). Add the values to scripts/seed_hyperparams_console.py::LEGACY_DEFAULTS + LEGACY_HPO_RANGES, update commun.finetune.hyperparams.expected_keys() to enumerate the new (model, param) pairs, then run : python scripts/seed_hyperparams_console.py --apply against prod PG (via kubectl exec on a scheduler pod, see documentation/runbooks/break-glass-hyperparams.md for the Console-bypass pattern when needed). The parity tests tests/unit/training_harness/test_hyperparams_seeding_parity.py MUST pass before merge.
  4. Done. The orchestrator picks the new model up automatically — no edits to any caller. Synthetic-pickup test : tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup.

Adding a new ensemble

Same pattern under dags/ensembles/. The EnsembleAdapter composes N base TrainedArtifacts ; Hamilton resolves the base model dependencies and caches them within a driver call. Ensembles do NOT own hyperparameters themselves today — they reuse the per-base-model seeded values from ftf_config.base_env. If an aggregator gains tunable HPs (e.g., LogRegShrinkAggregator.l2 floor), externalize via ADR-90 with a CVN_HPO_ENS_<NAME>_<PARAM> namespace extension (currently unused — file an ADR addendum if needed).

File map (current)

| Concern | File | LOC |
|---|---|---|
| Public API | `src/training/harness/__init__.py` | ~180 |
| Typed contracts | `src/training/harness/contracts.py` | ~150 |
| Plugin registries | `src/training/harness/registry.py` | ~120 |
| Per-model DAGs | `src/training/harness/dags/models/*_dag.py` | ~200 each |
| Per-ensemble DAGs | `src/training/harness/dags/ensembles/*_dag.py` | ~180 each |
| Per-model adapters | `src/training/harness/adapters/{xgb,lgb,cb}.py` | ~50 each |
| Ensemble adapter | `src/training/harness/adapters/ensemble.py` | ~80 |
| Reused Hamilton nodes | `src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py` | ~700 total |
| Generic single-model autonomous wrapper | `src/training/harness/autonomous_model_trainer.py` | ~280 |
| Generic ensemble autonomous wrapper | `src/training/harness/autonomous_ensemble_trainer.py` | ~240 |
| Hamilton-native wrappers (PR #896 wiring) | `src/training/harness/dags/wrappers/` | varies |
| Orchestrator dispatch | `src/training/cvntrade_autonomous_orchestrator.py` | ~210 |
| Hyperparameter resolver (ADR-90) | `src/commun/finetune/hyperparams.py` | ~410 |
| Console seeding script (ADR-90) | `scripts/seed_hyperparams_console.py` | ~570 |

Backtest contract

Every harness DAG returns a TrainedArtifact whose .adapter exposes the canonical predict_proba(X) -> (n, 2) Protocol. The backtest engine's create_model_adapter factory recognises harness adapters via duck-typing and wraps them through _HarnessAdapterShim to satisfy the legacy CVNTrade_ModelAdapter ABC. No type-sniffing in consumer code (regime trainer, backtest engine, OOS predictor).
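The duck-typing idea can be sketched with a `typing.Protocol`. The class and function names below are hypothetical stand-ins (the real `_HarnessAdapterShim` and `create_model_adapter` are not reproduced here), and plain lists replace arrays to keep the example dependency-free.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ProbaPredictor(Protocol):
    """Anything exposing predict_proba(X) -> n rows of [p_not_buy, p_buy]."""

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]: ...


class LegacyShim:
    """Hypothetical wrapper that presents a harness adapter to legacy callers."""

    def __init__(self, inner: ProbaPredictor) -> None:
        self._inner = inner

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        proba = self._inner.predict_proba(X)
        # Enforce the (n, 2) shape promised by the contract.
        if len(proba) != len(X) or any(len(row) != 2 for row in proba):
            raise ValueError("adapter must return an (n, 2) probability matrix")
        return proba


def wrap_if_harness_adapter(obj: object) -> LegacyShim:
    """Structural check only: no isinstance test against concrete model classes."""
    if isinstance(obj, ProbaPredictor):
        return LegacyShim(obj)
    raise TypeError(f"{type(obj).__name__} does not expose predict_proba")


class ConstantAdapter:
    """Toy adapter used only to exercise the sketch."""

    def predict_proba(self, X):
        return [[0.7, 0.3] for _ in X]
```

A `runtime_checkable` protocol only verifies that the method exists, which is why the shim re-validates the output shape itself.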


5. Hyperparameter Resolution (ADR-90)

Since PR #904 (CVN-N001-EE-S17, 2026-05-12) :

| Key family | Naming convention | Count today | Owner |
|---|---|---|---|
| Defaults | `CVN_HPO_<MODEL>_<TF>_<PARAM>` | 110 HPO-tunable + 15 EARLY_STOPPING_ROUNDS = 125 | Console (@dococeven) |
| HPO ranges | `CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_{MIN,MAX,SCALE}` | 330 (3 per HPO-tunable default × 110) | Console (@dococeven) |
| Total | | 455 canonical keys | |

The resolver contract :

  • Env key present → parse to typed value (per _PARAM_TYPES in commun/finetune/hyperparams.py)
  • Env key missing + fallback=None → RuntimeError with canonical message pointing to ADR-90 (operator MUST seed the key)
  • Env key missing + fallback=<value> → emit WARN event=hpo_fallback_applied then return the (type-coerced) fallback (transitional warning window per ADR-90 Clause 2)
  • NaN / ±Inf rejected at parse time (_parse fail-fasts)
  • HPO range MIN >= MAX rejected at resolve_hpo_range exit
  • Fallback type-coerced through the same _parse path as env-sourced values (string "150" → int 150 for EARLY_STOPPING_ROUNDS)
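The contract above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `commun/finetune/hyperparams.py`: the environment is passed in as an explicit mapping, `_PARAM_TYPES` is a two-entry stand-in, the error messages are invented, and SCALE is assumed to be a plain string.

```python
import logging
import math
from typing import Mapping, Optional, Tuple, Union

log = logging.getLogger("hpo_resolver_sketch")

# Illustrative subset; the real _PARAM_TYPES table is not shown in this document.
_PARAM_TYPES = {"EARLY_STOPPING_ROUNDS": int, "LEARNING_RATE": float}
Raw = Union[str, int, float]


def key_for(model: str, timeframe: str, param: str) -> str:
    """Build a key following the CVN_HPO_<MODEL>_<TF>_<PARAM> convention."""
    return f"CVN_HPO_{model}_{timeframe}_{param}".upper()


def _finite(name: str, raw: Raw) -> float:
    value = float(raw)
    if not math.isfinite(value):  # rejects NaN and +/-Inf at parse time
        raise ValueError(f"{name}: non-finite value {raw!r}")
    return value


def _parse(param: str, raw: Raw) -> Union[int, float]:
    value = _finite(param, raw)
    if _PARAM_TYPES.get(param, float) is int:
        if value != int(value):
            raise ValueError(f"{param}: expected an integer, got {raw!r}")
        return int(value)
    return value


def resolve(env: Mapping[str, str], model: str, timeframe: str, param: str,
            fallback: Optional[Raw] = None) -> Union[int, float]:
    key = key_for(model, timeframe, param)
    if key in env:
        return _parse(param, env[key])
    if fallback is None:
        raise RuntimeError(f"{key} is not seeded; see ADR-90")
    log.warning("event=hpo_fallback_applied key=%s fallback=%s reason=env_key_missing",
                key, fallback)
    return _parse(param, fallback)  # same coercion path as env-sourced values


def resolve_hpo_range(env: Mapping[str, str], model: str, timeframe: str,
                      param: str) -> Tuple[float, float, str]:
    base = key_for(model, timeframe, param)
    lo = _finite(f"{base}_RANGE_MIN", env[f"{base}_RANGE_MIN"])
    hi = _finite(f"{base}_RANGE_MAX", env[f"{base}_RANGE_MAX"])
    if lo >= hi:
        raise ValueError(f"{base}: RANGE_MIN must be < RANGE_MAX")
    return lo, hi, env[f"{base}_RANGE_SCALE"]
```

Routing the fallback through `_parse` is what makes a string fallback such as `"150"` come back as the integer 150.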

Operator workflow:

  • Tune a value: Console UI → Config → ftf_config → edit key → save (writes audit row to ftf_config_history). No deploy needed (rule 1 of ADR-90).
  • Inspect coverage: Grafana dashboard cvntrade-hp-coverage (5 panels: 24h fallback count, 6h fallback %, 7d timeseries, top missing keys, training stack health). Target = 0 fallback events.
  • Rollback bad write: the Console UI restore-from-history button is NOT IMPLEMENTED today (Epic CCP wp#149 scope); until then, use the break-glass PG path with the gating ritual in documentation/runbooks/break-glass-hyperparams.md.
  • Seed a new namespace: python scripts/seed_hyperparams_console.py --apply after extending LEGACY_DEFAULTS / LEGACY_HPO_RANGES / expected_keys(). Idempotent (re-running is a no-op on already-seeded keys).


6. Known Regression — S18

Post-S17 canary FTF ftf_20260512_184337_fdee27_ATR0.5_1.5_H4 (2026-05-12) exposed that harness training stops shallow for all 3 model types :

| Model | best_iteration p50 | f1_buy mean | Profile |
|---|---|---|---|
| LightGBM | 1 (53/53 trials hit iter 1-2) | 0.37 | Shallow + val-tuned θ-sweep at θ=0.2 grabs random-distribution tail |
| CatBoost | 14 (some trials reach 336) | 0.39 | Mixed, bimodal |
| XGBoost | 17 | 0.09 | Shallow + fixed θ=0.5 → near-empty buy classifier (1-44 trades vs 100-280 for LGB/CB) |
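The contrast between a fixed θ=0.5 and a validation-tuned θ can be illustrated with a toy sweep. This is a sketch, not the harness `theta_sweep` node: the candidate grid and tie-breaking rule are assumptions, and the probabilities are made-up values chosen to mimic a shallow model whose scores cluster near the base rate.

```python
from typing import Sequence, Tuple


def f1_buy(y_true: Sequence[int], proba_buy: Sequence[float], theta: float) -> float:
    """F1 on the BUY class (label 1) when predicting BUY for proba >= theta."""
    tp = fp = fn = 0
    for y, p in zip(y_true, proba_buy):
        predicted_buy = p >= theta
        if predicted_buy and y == 1:
            tp += 1
        elif predicted_buy:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold_on_val(y_val: Sequence[int], proba_val: Sequence[float],
                          candidates: Sequence[float] = (0.2, 0.3, 0.4, 0.5),
                          ) -> Tuple[float, float]:
    """Return (theta, f1_buy) maximising f1_buy; ties go to the higher theta."""
    best = max(candidates, key=lambda t: (f1_buy(y_val, proba_val, t), t))
    return best, f1_buy(y_val, proba_val, best)


# Made-up validation data: no score ever reaches 0.5.
y_val = [1, 0, 0, 1, 0, 0, 0, 1]
proba_val = [0.30, 0.22, 0.25, 0.28, 0.21, 0.24, 0.26, 0.31]
```

On this toy data a fixed θ=0.5 predicts no BUY at all, while the sweep settles on a low threshold, which is the mechanism the Profile column describes.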

Root cause is upstream of S17. Hyperparameter externalization (PR #904) is verified clean: the event=hpo_fallback_applied count is 0, Optuna picks are within seeded ranges, and the parity tests in test_hyperparams_seeding_parity.py confirm byte-for-byte legacy values. The regression is in the #891 harness migration itself; its diagnosis is scoped as a Story under the CVN-N001-EE-S18 plan dossier (plan_review committee session 4298520f PASSED OK with strong consensus, 0 blockers, OP wp#154, OP Meeting #133).

Diagnostic plan (S18) — 6 steps, 5.25-day budget

| Step | Action | Budget |
|---|---|---|
| 0 | Pre-validate captured fold reproduces FTF metrics in staging (committee reco #4) | 0.25 day |
| 1 | Capture reference fold AAVEUSDC fold=3 best_iter=1 from FTF run with verbose logging | 1 day |
| 2 | Reconstruct pre-#891 legacy train_with_fixed_params_lgbm from git (`git show e75418ca^:src/training/LightGBM/grid_utils.py`) | 0.5 day |
| 3 | Build parity reproducer `scripts/diagnostic_s18_harness_vs_legacy.py`: all 3 models + per-iter AUC + binary_logloss | 2 days |
| 4 | Bisect harness commits (#891 dc3d86c6, #896 f56fa33f, #899 e75418ca) | 1 day |
| 5 | Final dossier `documentation/missions/cvn-n001-ee-s18-diagnostic/report.md` + verdict | 0.5 day |

Verdict tree (per ADR-79, adapted for diagnostic Story) :

  • LOCK : root cause + fix < 30 LOC → open CVN-N001-EE-S19 1-week plan
  • KEEP_AVAILABLE : root cause + fix non-trivial → S19 with 2-3 sprint plan + design dossier
  • ABANDON : 30d no convergence OR fix > 3× revert surface → S19 reverts #891 + #896 + #899, accept ADR-89 loss, redo migration

Until S18 + S19 close, the harness FTF leaderboard is informational only ; no live trading promotion of harness-trained models is permitted (see §2).


7. Observability & Operator Workflow

Observability contract — Loki event schema

Every harness DAG emits the same event= schema, traceable in Loki via the {namespace="cvntrade"} stream :

| Event | When | Key labels |
|---|---|---|
| event=training_started | DAG entry | model_type, harness_schema_version, n_train, n_val, n_test, n_features, class_dist_train_buy_pct, n_estimators, learning_rate (golden fields; full bundle since CR round 1 PR #904 lgb_effective_hyperparams) |
| event=class_balance_applied | After class-weight computation | model_type, binary, scale_pos_weight, n_pos, n_neg, imbalance_ratio |
| event=theta_picked | After pick_threshold_on_val | model_type, threshold, f1_buy_at_threshold, n_candidates |
| event=training_complete | DAG exit | model_type, best_iteration, training_time_sec, theta_picked, f1_buy_val, f1_buy_test, auc_buy_val, auc_buy_test |
| event=autonomous_trained | Cache-aware wrapper success | model_type, crypto, strategy, run_id |
| event=hpo_fallback_applied (NEW, S17) | Resolver hit fallback path | model, timeframe, param, key, fallback, reason=env_key_missing. Grafana panel cvntrade-hp-coverage thresholds: > 1% rate of event=training_started → P3, > 5% → P2 |
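The alert thresholds on the fallback rate can be expressed as a small classifier. The function name and the decision to return `None` below the P3 threshold are assumptions for illustration; the actual alerting lives in the Grafana panel.

```python
from typing import Optional


def fallback_severity(n_fallback: int, n_training_started: int) -> Optional[str]:
    """Map the hpo_fallback_applied / training_started rate to an alert level.

    Thresholds follow the table above: > 5% -> P2, > 1% -> P3, else no alert.
    """
    if n_training_started <= 0:
        return None  # no training runs in the window, so no rate to evaluate
    rate = n_fallback / n_training_started
    if rate > 0.05:
        return "P2"
    if rate > 0.01:
        return "P3"
    return None
```

For example, 2 fallback events against 100 `training_started` events is a 2% rate and maps to P3.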



8. Legacy Pipeline Analysis

Historical reference only. The current canonical implementation is described in §3-§7 above. Stages 1-6 below remain technically accurate as inputs to the harness (Stage 7 is the part replaced). Stages 7-9 are the pre-#891 trainers — deprecated since 2026-05-10. This section is retained for the bottleneck analysis it informs (§9) and for the FIX A-H historical action plan (Appendix A, §12) that motivated the harness refactor.

8.1 Pipeline Stages 1-6 (still current as harness inputs)

Stage 1: Data Ingestion (OHLCV)

Aspect Detail
Source Binance Futures API via ccxt
File src/ETL/cvntrade_external_data.py:20-26
Timeframe 30m (default), configurable via CVN_TIMEFRAME
History 24 months (configurable via history_months)
Data Open, High, Low, Close, Volume + Funding Rate (hourly, resampled)
Cache Redis L1 (10 connections, 5s timeout) → Disk L2 (parquet)
Volume ~24,000 candles per crypto for 24m @ 30m

Checkpoint: Raw OHLCV dataframe, no NaN, timezone-aware index.

Stage 2: Enrichment (Technical Indicators)

Aspect Detail
File src/commun/pipeline/enrichment_api.py:82-90
Processor CVNTrade_Enrich.process(df, mode="inference")
Indicators SMA (multi-period), RSI (7,14,28), MACD (slow+signal), ATR, Stochastic (K,D), Volume MA, MFI, ADX, Bollinger Bands
Gate v5 features MA spectrum (5,10,20,50), rolling window stats, volatility regime, momentum patterns, support/resistance, volume flow, market structure
Output ~300+ columns (raw indicators + derived features)
Warmup ~500 candles consumed for indicator bootstrap
Parity guarantee enrich_batch(ohlcv).iloc[-1] == enrich_streaming(window).iloc[-1] (ADR)

Checkpoint: Enriched dataframe with ~300 columns, ~23,500 rows (after warmup).

Stage 3: Labeling (Triple Barrier)

Aspect Detail
File src/ETL/cvntrade_label.py:188-269
Method Triple barrier with ATR-dynamic levels
Strategy format ATR{sl}_{tp}_H{horizon} (e.g. ATR1.5_3.0_H5) or legacy SL{x}_TP{y}_H{z} (e.g. SL0.5_TP1_H4)
TP level open × (1 + atr × tp_mult / 100)
SL level open × (1 - atr × sl_mult / 100)
Horizon H candles forward (e.g. H5 = 5 hours @ 30m = 10 candles)
Labels +1 (BUY: TP hit first), -1 (SELL: SL hit first), 0 (HOLD: timeout)
Anti-look-ahead Uses T+1 to T+horizon window, decision at T based on T-1 return
ATR floor 0.6% minimum (prevents degenerate labels on low-vol periods)
NaN handling Labels that don't resolve (no TP/SL/timeout) → dropped (~10-20%)
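A minimal sketch of the labeling rule in the table, for a single decision bar. Several points are assumptions and not taken from `cvntrade_label.py`: ATR is expressed as a percentage of price (consistent with the `/ 100` in the level formulas), the horizon is counted in candles, and when both barriers fall inside the same candle the stop-loss is checked first.

```python
from typing import Optional, Sequence

ATR_FLOOR_PCT = 0.6  # minimum ATR, in percent, from the table above


def triple_barrier_label(opens: Sequence[float], highs: Sequence[float],
                         lows: Sequence[float], atr_pct: Sequence[float],
                         t: int, sl_mult: float, tp_mult: float,
                         horizon: int) -> Optional[int]:
    """Label bar t using only bars t+1 .. t+horizon (no look-ahead at t itself).

    Returns +1 (TP hit first), -1 (SL hit first), 0 (timeout),
    or None when the forward window is incomplete.
    """
    if t + horizon >= len(opens):
        return None  # unresolved: not enough future bars
    atr = max(atr_pct[t], ATR_FLOOR_PCT)
    tp_level = opens[t] * (1 + atr * tp_mult / 100)
    sl_level = opens[t] * (1 - atr * sl_mult / 100)
    for i in range(t + 1, t + horizon + 1):
        if lows[i] <= sl_level:   # assumption: SL wins same-bar ties
            return -1
        if highs[i] >= tp_level:
            return 1
    return 0
```

With `opens[t] = 100`, ATR = 1% and strategy ATR1.5_3.0, the barriers sit at roughly 98.5 (SL) and 103 (TP).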

Class distribution (typical for ATR1.5_3.0_H5):

BUY (+1):   ~15-25%  ← minority class
HOLD (0):   ~40-50%  ← majority class (timeout)
SELL (-1):  ~25-35%

Checkpoint: Labeled dataframe. ~20,000 rows after NaN drop. Imbalanced toward HOLD.

Known issue: High HOLD ratio means model defaults to HOLD → low BUY recall.

Stage 4: CUSUM Pre-Filter

Aspect Detail
File src/backtest/filters/cvntrade_cusum_filter.py:279-409
Applied BEFORE train/val/test split (cvntrade_autonomous_fe.py:125-200)
Algorithm Cumulative sum control chart on log returns
Formula S+[t] = max(0, S+[t-1] + (r[t-1] - μ - k)), alert if S+ > h or S- < -h
Parameters k=0.5σ (allowance), h=3.0σ (threshold, from CVN_CUSUM_THRESHOLD_H)
Warmup 100 bars (all marked as transition)
Cooldown 10 bars after each detection
Sigma Calibrated on training data, cached, NEVER re-fit on test (ADR)
Mode "Stable mode" — keeps only NON-transition bars
Output ~5% of candles survive (regime change events only)
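The recursion in the Formula row can be sketched as follows. This is a simplified illustration, not the filter in `cvntrade_cusum_filter.py`: warmup and cooldown are omitted, the lower statistic is assumed to mirror the upper one, and resetting both sums after a detection is an assumption.

```python
from typing import List, Sequence


def cusum_transitions(returns: Sequence[float], mu: float, sigma: float,
                      k_mult: float = 0.5, h_mult: float = 3.0) -> List[bool]:
    """Flag bars where the two-sided CUSUM statistic crosses the threshold.

    S+[t] = max(0, S+[t-1] + (r[t-1] - mu - k))
    S-[t] = min(0, S-[t-1] + (r[t-1] - mu + k))
    with k = k_mult * sigma and h = h_mult * sigma; alert if S+ > h or S- < -h.
    Using r[t-1] keeps the decision at bar t free of same-bar information.
    """
    if not returns:
        return []
    k, h = k_mult * sigma, h_mult * sigma
    s_pos = s_neg = 0.0
    flags = [False]  # bar 0 has no previous return to accumulate
    for t in range(1, len(returns)):
        x = returns[t - 1] - mu
        s_pos = max(0.0, s_pos + x - k)
        s_neg = min(0.0, s_neg + x + k)
        hit = s_pos > h or s_neg < -h
        flags.append(hit)
        if hit:
            s_pos = s_neg = 0.0  # assumption: restart accumulation after an alert
    return flags
```

In this sketch `mu` and `sigma` are passed in, mirroring the rule that sigma is calibrated on training data and never re-fit on test.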

Checkpoint: Filtered dataframe. ~1,000-1,500 rows from 20,000. 95% data loss.

CRITICAL BOTTLENECK #1

Input:  20,000 labeled candles
CUSUM:  ~1,000 survive (h=3.0σ)
Output: 1,000 rows → split 70/15/15 → Train: 700, Val: 150, Test: 150

With more aggressive h (e.g. h=4.0σ from env):
Output: ~300 rows → Train: 210, Val: 45, Test: 45

Stage 5: Feature Engineering

Aspect Detail
File src/commun/cache/components/cvntrade_autonomous_fe.py:125-170
Steps Imputation → Stationarization (ADF/KPSS) → Normalization (StandardScaler)
Feature selection Variance threshold + adaptive cap
Adaptive cap min(80, max(10, n_features / 20))
Example 300 features → cap = 15-80 → typically 30-50 selected
Fit scope Pipeline fitted on TRAIN set only, applied to val/test
Feature lag Disabled (feature_lag=0)
Label remapping 3-class: {-1,0,1} → {0,1,2}. Binary: {-1,0} → 0, {1} → 1
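The adaptive cap and the label remapping are simple enough to show directly. The sketch assumes integer division for `n_features / 20` (the table does not say how the quotient is rounded); function names are illustrative.

```python
def adaptive_feature_cap(n_features: int) -> int:
    """min(80, max(10, n_features / 20)), assuming integer division."""
    return min(80, max(10, n_features // 20))


def remap_label(y: int, binary: bool) -> int:
    """Map triple-barrier labels to model classes.

    3-class: {-1, 0, 1} -> {0, 1, 2}.  Binary: {-1, 0} -> 0 and {1} -> 1.
    """
    if y not in (-1, 0, 1):
        raise ValueError(f"unexpected label {y!r}")
    if binary:
        return 1 if y == 1 else 0
    return y + 1
```

Under this reading, 300 input features give a cap of 15, which is the lower end of the range quoted in the Example row; the floor of 10 applies below 200 features and the ceiling of 80 from 1,600 features upward.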

Checkpoint: Feature matrix X (n_samples × n_features), label vector y.

BOTTLENECK #2 — Aggressive dimensionality reduction

300+ raw indicators → 30-50 selected features
Potential signal loss: MACD variants, stochastic slopes, volume patterns removed
Model capacity limited by feature count

Stage 6: Train/Val/Test Split

Aspect Detail
File src/commun/cache/components/cvntrade_autonomous_fe.py:143-146
Method Temporal split 70/15/15 (sequential in time)
Purge 10 bars between train/val boundary (CVN_PURGE_BARS)
Embargo 5 bars between val/test boundary (CVN_EMBARGO_BARS)
Walk-forward Supported via FTF folds (5 folds, sliding window)
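A sketch of the 70/15/15 temporal split with purge and embargo gaps. Which side of each boundary loses the gap bars is not stated above; here the purge bars are dropped from the start of the validation block and the embargo bars from the start of the test block, which is an assumption.

```python
from typing import Tuple


def temporal_split(n_rows: int, train_pct: int = 70, val_pct: int = 15,
                   purge: int = 10, embargo: int = 5) -> Tuple[range, range, range]:
    """Return (train, val, test) index ranges in chronological order.

    Integer percentages avoid floating-point rounding on the boundaries.
    """
    train_end = n_rows * train_pct // 100
    val_end = n_rows * (train_pct + val_pct) // 100
    train = range(0, train_end)
    val = range(min(train_end + purge, val_end), val_end)      # purge gap after train
    test = range(min(val_end + embargo, n_rows), n_rows)       # embargo gap after val
    return train, val, test
```

With 1,000 rows this yields 700 / 140 / 145 samples, slightly fewer than the nominal 700 / 150 / 150 because the gap bars are discarded.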

Checkpoint: Three datasets (X_train, y_train), (X_val, y_val), (X_test, y_test).

BOTTLENECK #3 — Tiny datasets after CUSUM

With 1,000 post-CUSUM rows:
  Train: 700 samples
  Val: 150 samples
  Test: 150 samples

With 300 post-CUSUM rows (aggressive h):
  Train: 210 samples
  Val: 45 samples
  Test: 45 samples

XGBoost on 210 training samples with 30 features = severe overfitting risk

8.2 Pipeline Stages 7-9 — Deprecated, kept for historical analysis

Stage 7: HPO + XGBoost Training (legacy — superseded by harness in PR #891 + #898)

Aspect Detail
File src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py:52-250
Optimizer Optuna TPESampler (seed=42)
Trials 30 (from CVN_HPO_N_TRIALS), timeout 3600s
Pruner MedianPruner
Warm start 5 startup trials (random), then TPE

Hyperparameter search space (from config, timeframe-dependent):

max_depth:        [3, 8]
learning_rate:    [0.01, 0.3]
n_estimators:     [100, 500]
subsample:        [0.6, 1.0]
colsample_bytree: [0.6, 1.0]
min_child_weight: [1, 10]
gamma:            [0, 5]
reg_alpha:        [0.01, 5.0]  (timeframe-dependent)
reg_lambda:       [0.5, 7.0]   (timeframe-dependent)
threshold_buy:    [0.30, 0.50]
threshold_sell:   [0.30, 0.50]

Objective function (CVN_HPO_OBJECTIVE):

| Objective | Formula | Default weights |
|---|---|---|
| precision_recall_auc | w_prec × precision_buy + w_auc × auc + w_log × (1-logloss) | 0.45 / 0.35 / 0.20 |
| fbeta_buy | F_β(buy) with β from CVN_BUY_BETA | β=1.0 (F1) |
| logloss_auc | w × (1-logloss) + (1-w) × auc | w=0.5 (binary mode) |

Guard conditions (kill trial if):

if action_rate < 0.08 or action_rate > 0.60:   score = 0.0
if recall_buy < 0.15:                           score = 0.0
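Combining the `precision_recall_auc` objective with the two guards gives a scoring function along these lines. It is a sketch of the legacy behaviour as described here, not the code in the hyperoptimizer; the function name and argument order are assumptions.

```python
def hpo_score(precision_buy: float, auc: float, logloss: float,
              action_rate: float, recall_buy: float,
              w_prec: float = 0.45, w_auc: float = 0.35,
              w_log: float = 0.20) -> float:
    """Weighted objective with the guard conditions applied first."""
    # Guards: a trial outside the allowed behaviour scores zero.
    if action_rate < 0.08 or action_rate > 0.60:
        return 0.0
    if recall_buy < 0.15:
        return 0.0
    return w_prec * precision_buy + w_auc * auc + w_log * (1.0 - logloss)
```

Note that `1 - logloss` turns negative once logloss exceeds 1, so a badly calibrated trial is penalised even when it passes both guards.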

BOTTLENECK #4 — HPO guards constrain the solution space

action_rate [8%, 60%] forces model to predict BUY on 8-60% of samples
recall_buy > 15% forces minimum BUY detection
Combined: model must predict BUY often enough but not too often
Result: model converges to "safe middle" with f1_buy ~0.45

Class balancing (XGBoost trainer lines 900-911):

if _is_class_balancing_enabled():  # CVN_CLASS_BALANCING=1
    class_weights = compute_class_weight("balanced", classes, y_train)
    sample_weights = [class_weights[int(y)] for y in y_train]
else:
    sample_weights = ones(len(y_train))  # NO BALANCING (default)

BOTTLENECK #5 — Class balancing disabled by default

With 70% HOLD, 15% BUY, 15% SELL:
  Model learns to predict HOLD for most samples (safe bet)
  BUY precision high but recall low → f1_buy ~0.45
  Enabling balancing: 0.05-0.10 improvement expected

Early stopping (trainer line 957):

early_stopping_rounds = config.early_stopping_rounds  # default 150

Calibration (trainer lines 961-963):

if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)
# Options: "none" (default), "isotonic", "platt"

Stage 8: Model Evaluation (legacy — superseded by harness, kept for context)

Aspect Detail
File src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:1025-1114
Metrics f1_macro, f1_buy, precision_buy, recall_buy, auc_buy, logloss, accuracy, action_rate, fbeta_buy
Evaluation sets Val (for HPO selection) + Test (for reporting)
Overfit gap f1_val - f1_test (added in #521)

Typical metrics (ATR1.5_3.0_H5, defi_top5):

f1_buy:        0.43-0.49     ← weak
precision_buy: 0.50-0.65     ← decent
recall_buy:    0.20-0.35     ← very low (model too conservative)
auc_buy:       0.55-0.65     ← barely above random (0.50)
action_rate:   0.10-0.30     ← low (few BUY predictions)

Stage 9: MLflow Storage (legacy — superseded by harness, kept for context)

Aspect Detail
File src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:353-398
Model name CVNTrade_XGBoost_{CRYPTO}_{STRATEGY}_{TIMEFRAME}
Artifacts Model pickle, feature names (tag), feature count (tag), validation metrics (metadata)
Registry MLflow Model Registry with stages (Staging, Production)
Promotion Manual via lock_winner() (ADR-2: no auto-promotion)


9. Historical Bottlenecks

Historical reference only. The bottleneck analysis below was the V1 / V2 committee assessment that motivated the harness refactor (now done) and the FIX A-H action plan (Appendix A, §12). It describes the pre-#891 pipeline ; the harness path has different bottlenecks, currently under diagnosis in CVN-N001-EE-S18 (see §6).

Bottleneck Map

                     Raw Data (24,000 candles)
                    ┌─────────┴─────────┐
                    │  ENRICHMENT        │ → 300+ features
                    └─────────┬─────────┘   No data loss
                    ┌─────────┴─────────┐
                    │  LABELING          │ → ~20,000 labeled
                    └─────────┬─────────┘   10-20% NaN drop
                              │                              ← BOTTLENECK #6: label dropout
                    ┌─────────┴─────────┐
                    │  CUSUM FILTER      │ → ~1,000 rows     ← BOTTLENECK #1: 95% data loss
                    └─────────┬─────────┘   (h=3.0σ)
                    ┌─────────┴─────────┐
                    │  FEATURE ENGIN.    │ → 30-50 features   ← BOTTLENECK #2: signal loss
                    └─────────┬─────────┘   (adaptive cap)
                    ┌─────────┴─────────┐
                    │  SPLIT 70/15/15   │ → Train: 700
                    └─────────┬─────────┘   Val: 150          ← BOTTLENECK #3: tiny datasets
                              │             Test: 150
                    ┌─────────┴─────────┐
                    │  HPO (30 trials)  │ → f1_buy ~0.45      ← BOTTLENECK #4: guards limit
                    └─────────┬─────────┘                     ← BOTTLENECK #5: no class balance
                    ┌─────────┴─────────┐
                    │  EVALUATION        │ → 1-10 trades/fold
                    └─────────┬─────────┘   (tiny test set)
                    ┌─────────┴─────────┐
                    │  MLFLOW STORAGE    │
                    └───────────────────┘

Bottleneck Severity Ranking

| # | Bottleneck | Severity | Impact | Root Cause |
|---|---|---|---|---|
| 1 | CUSUM filters 95% of data | CRITICAL | Train/val/test too small for reliable ML | h=3.0-4.0σ too conservative for crypto volatility |
| 2 | Feature cap (30-50 features) | HIGH | Model can't learn complex patterns | Adaptive cap n/20 too aggressive |
| 3 | Tiny test sets (7-150 samples) | HIGH | Metrics statistically unreliable, few trades | Consequence of #1 |
| 4 | HPO guards (action_rate, recall) | MEDIUM | Solution space artificially constrained | Guards too tight for imbalanced data |
| 5 | Class balancing disabled | MEDIUM | Model biased toward HOLD | CVN_CLASS_BALANCING=0 default |
| 6 | Label dropout (~15%) | LOW | Reduced training set | Window too short for some candles |


10. Roadmap / Open Stories

Current in-flight + immediate-next Stories tied to this pipeline :

| Story | Status | Owner | Outcome gate |
|---|---|---|---|
| CVN-N001-EE-S16 (harness unification) | ✅ Closed: PR #891 merged 2026-05-09, full cutover #899 2026-05-10 | @dococeven | n/a (closed) |
| CVN-N001-EE-S17 (hyperparameters externalization, ADR-90) | ✅ Closed: PR #904 merged 2026-05-12 | @dococeven | n/a (closed, sunset milestone-anchored on Epic CCP wp#149) |
| CVN-N001-EE-S18 (harness shallow-training diagnostic) | 🔄 In flight: plan_review PASSED OK 4298520f, implementation starts post-PR-#910-merge | @dococeven | Verdict LOCK / KEEP_AVAILABLE / ABANDON per §6 decision tree |
| CVN-N001-EE-S19 (remediation, scope TBD by S18 verdict) | ⏸️ Pending S18 | @dococeven | Canary FTF f1_buy ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types |
| Epic CCP wp#149 | ⏸️ Planned (immediate successor to training stabilization) | @dococeven | Typed schemas + Console UI restore + scoped resolution + snapshots + approval flow + OpenFeature integration |

Promotion-gate criteria (S19 closure)

Recover the pre-#891 baseline AND demonstrate the harness path matches or exceeds it :

| Metric | pre-#891 baseline | S19 closure gate | Method |
|---|---|---|---|
| f1_buy (defi_top5 5m ATR0.5_1.5_H4) | ~0.42 on all 3 model types | ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types | FTF canary, 5 folds, bootstrap CI |
| best_iteration (LightGBM) | 100-500 typical | > 50 median | Loki event=training_complete query |
| event=hpo_fallback_applied rate | n/a (pre-ADR-90) | 0 over 7 days | Grafana cvntrade-hp-coverage dashboard |
| Trades per fold (XGB) | 30-100 | ≥ 30 mean across folds | finetune_results PG query |

Only when ALL four gates pass on the canary FTF does the operator unfreeze harness-trained model promotion (§2). If S18 returns ABANDON, the promotion-freeze persists indefinitely and the operator pivots to revert.
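The four-gate rule can be sketched as a single report so an operator sees which gate fails. The dataclass and field names are hypothetical; in practice the inputs would come from the FTF canary, Loki, Grafana and the finetune_results table listed in the Method column.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, Sequence


@dataclass(frozen=True)
class CanarySummary:
    f1_gate_passed: bool                  # f1_buy >= 0.40 on >= 4/5 cryptos, all 3 models
    lgb_best_iterations: Sequence[int]    # LightGBM best_iteration per trial
    fallback_events_7d: int               # hpo_fallback_applied count over 7 days
    xgb_trades_per_fold: Sequence[int]    # XGBoost trade count per fold


def gate_report(s: CanarySummary) -> Dict[str, bool]:
    """Evaluate each S19 closure gate separately."""
    return {
        "f1_buy": s.f1_gate_passed,
        "lgb_best_iteration": bool(s.lgb_best_iterations)
        and median(s.lgb_best_iterations) > 50,
        "hpo_fallback": s.fallback_events_7d == 0,
        "xgb_trades": bool(s.xgb_trades_per_fold)
        and mean(s.xgb_trades_per_fold) >= 30,
    }


def may_unfreeze(s: CanarySummary) -> bool:
    """Promotion is unfrozen only when all four gates pass."""
    return all(gate_report(s).values())
```

Empty inputs are treated as a failed gate, so missing telemetry cannot unfreeze promotion by accident.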


11. ADR Compliance

| ADR | Topic | V1 | V3 (target) |
|---|---|---|---|
| ADR-2 | No auto-promotion | COMPLIANT | COMPLIANT + trade count gate |
| ADR-4 | Cache explicit | COMPLIANT | COMPLIANT |
| ADR-14 | Multi-fold evaluation | COMPLIANT | COMPLIANT (5 folds) |
| ADR-15 | Theta calibrated OOS | COMPLIANT | COMPLIANT + all params OOS |
| ADR-16 | Labels coherent | COMPLIANT | COMPLIANT |
| ADR-23 | Features version-pinned | COMPLIANT | COMPLIANT |
| ADR-24 | Feature set = model contract | COMPLIANT | COMPLIANT |
| ADR-25 | No silent fallback | PARTIAL | COMPLIANT (all fallbacks documented + alerted) |
| ADR-28 | Binary classification | COMPLIANT | COMPLIANT + FTF ablation (#517) |
| ADR-29 | Naive baseline mandatory | COMPLIANT | COMPLIANT + edge validation (§1b) |
| ADR-46 | Class balancing | NON-COMPLIANT | COMPLIANT (Fix B: enabled by default) |
| ADR-47 | Meta-label validation | COMPLIANT | COMPLIANT |

12. Appendix A — Historical Action Plan (FIX A-H, V1+V2 Committee Recos)

Historical action plan that motivated the harness refactor. The FIX A-H items below were the V1 + V2 committee deliverables that drove the V3 plan. Some are now closed (e.g. FIX B — class balancing — integrated into the harness class_balance node ; FIX A — CUSUM decoupling — applied at inference per ADR-46). Others (FIX D trading-centric HPO, FIX F MLOps controls) remain partially open. Do NOT use this as the current action plan — the current open work is in §10. This appendix is kept for the rationale and traceability it provides.

4. Committee Verdict & Action Plan

V1 Committee Result: REJECTED (STARVATION, 4.0/10)

8 recommendations received. ALL addressed below with concrete fixes.

Action Plan — 3 Structural Fixes + 5 Improvements

PRIORITY 1 — CRITICAL (blocks everything)
┌─────────────────────────────────────────────────────────────────┐
│  FIX A: Decouple CUSUM from training                            │
│  FIX B: Enable class balancing by default                       │
│  FIX C: Integrate realistic cost model into HPO                 │
└─────────────────────────────────────────────────────────────────┘

PRIORITY 2 — HIGH (multiplier on Fix A/B/C)
┌─────────────────────────────────────────────────────────────────┐
│  FIX D: Trading-centric HPO objective (Sortino-based)           │
│  FIX E: Increase feature cap + regularization                   │
│  FIX F: MLOps operational controls                              │
└─────────────────────────────────────────────────────────────────┘

PRIORITY 3 — MEDIUM (validation & exploration)
┌─────────────────────────────────────────────────────────────────┐
│  FIX G: Test CUSUM/ML paradigm conflict                         │
│  FIX H: Binary classification (#517)                            │
└─────────────────────────────────────────────────────────────────┘

5. FIX A: Decouple CUSUM from Training (committee reco #1, #7)

Problem

CUSUM filters 95% of training data at h=3.0σ. Result: 20,000 candles → 1,000 → split → Train: 700 samples. XGBoost on 700 samples with 30 features = severe overfitting, unreliable metrics.

Solution: Train on ALL data, CUSUM only at inference

CURRENT (broken):
OHLCV → Enrichment → Label → CUSUM FILTER → FE → Split → Train
                              └── 95% data lost HERE

TARGET (fixed):
OHLCV → Enrichment → Label → FE → Split → Train  (CUSUM REMOVED from training)
                                                    ▼ at inference only:
                                            CUSUM → Inference → Filters → Trade

Implementation

File: src/commun/cache/components/cvntrade_autonomous_fe.py

Current (~line 125):

# Apply CUSUM before split (issue #295)
filtered_df = self._apply_cusum_before_split(enriched_df)
X_train, X_val, X_test = self._temporal_split(filtered_df)

Target:

# Train on ALL data — CUSUM only at inference (committee reco #1)
import os  # needed for the env-var lookup below

cusum_training_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "disabled")
if cusum_training_mode in ("enabled", "relaxed_1_5", "legacy_3_0"):
    # Opt-in A/B variants: keep the pre-split CUSUM filter
    filtered_df = self._apply_cusum_before_split(enriched_df)
else:
    filtered_df = enriched_df  # NO CUSUM filtering for training
X_train, X_val, X_test = self._temporal_split(filtered_df)

New env var: CVN_CUSUM_TRAINING_MODE — canonical variants: {disabled, relaxed_1_5, legacy_3_0} (default: disabled)

FTF factor: cusum_training_mode with variants {disabled, relaxed_1_5, legacy_3_0} to A/B test the impact.
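
The mapping from variant name to training-time threshold can be sketched as below. This is an illustration only: the helper name and the lookup table are hypothetical, and unlike the snippet above it rejects any value outside the three canonical variants. The thresholds are the h values quoted for the relaxed and legacy variants.

```python
import os
from typing import Mapping, Optional

# Canonical variants from the text, mapped to the CUSUM threshold (in sigma
# units) applied before the split. None means "no training-time filter".
CUSUM_TRAINING_THRESHOLDS: dict[str, Optional[float]] = {
    "disabled": None,
    "relaxed_1_5": 1.5,
    "legacy_3_0": 3.0,
}


def resolve_cusum_training_threshold(
    env: Mapping[str, str] = os.environ,
) -> Optional[float]:
    """Return the training-time CUSUM threshold, or None when filtering is off."""
    mode = env.get("CVN_CUSUM_TRAINING_MODE", "disabled")
    if mode not in CUSUM_TRAINING_THRESHOLDS:
        raise ValueError(f"unknown CVN_CUSUM_TRAINING_MODE: {mode!r}")
    return CUSUM_TRAINING_THRESHOLDS[mode]
```

Failing fast on an unknown value is a design choice of this sketch; it keeps a typo in the Helm values from silently falling back to the default.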

Expected Impact

| Metric | Before (h=3.0σ) | After (no CUSUM training) |
|---|---|---|
| Training samples | ~700 | ~14,000 (20×) |
| Val samples | ~150 | ~3,000 (20×) |
| Test samples | ~150 | ~3,000 (20×) |
| f1_buy | ~0.45 | Target: 0.55-0.65 |
| Trades per fold | 1-10 | Target: 30-100 |

CUSUM remains at inference

CUSUM is still applied at inference time (backtest candle loop, line 869). It gates which candles are processed by the ML model. The model is trained on ALL data but only makes predictions on CUSUM-validated candles.

Rationale: The model learns patterns from the full distribution but only acts during regime transitions. This resolves Hypothesis G (CUSUM/ML paradigm conflict) — the model is no longer trained on a biased subset.
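
To make the inference-time gate concrete, here is a minimal sketch of a symmetric CUSUM event filter. The repository's implementation (src/backtest/filters/cvntrade_cusum_filter.py) is not reproduced in this document, so this is the textbook form used as a stand-in; the function name and signature are hypothetical. Sigma is passed in explicitly so that it can be calibrated on past data only.

```python
from typing import List, Sequence


def cusum_event_indices(
    returns: Sequence[float], sigma: float, h: float = 3.0
) -> List[int]:
    """Indices of candles where cumulative drift exceeds h * sigma in either direction."""
    threshold = h * sigma
    s_pos = 0.0  # running sum of upward drift, floored at 0
    s_neg = 0.0  # running sum of downward drift, capped at 0
    events: List[int] = []
    for i, r in enumerate(returns):
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_pos > threshold:
            events.append(i)
            s_pos = 0.0  # reset only the side that fired
        elif s_neg < -threshold:
            events.append(i)
            s_neg = 0.0
    return events
```

Under this scheme the model would be asked for a prediction only at the returned indices, while training uses every candle.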

Paradigm Test (committee reco #7)

To validate this architectural change, run 3 variants via FTF:

| Variant | Training data | Inference CUSUM |
|---|---|---|
| cusum_disabled | ALL candles | YES (h=3.0) |
| cusum_relaxed | CUSUM h=1.5σ | YES (h=3.0) |
| cusum_legacy | CUSUM h=3.0σ | YES (h=3.0) |

Compare Sortino, f1_buy, trades/fold. If cusum_disabled wins → confirm architectural fix.


6. FIX B: Enable Class Balancing by Default (committee reco #2)

Problem

CVN_CLASS_BALANCING=0 (disabled). With 70% HOLD, 15% BUY, 15% SELL, the model learns to predict HOLD for safety. BUY recall = 0.25 (misses 75% of opportunities). ADR-46 non-compliant.

Solution

Change default to CVN_CLASS_BALANCING=1 in BASE_ENV (ablation_matrix.py).

File: src/commun/finetune/ablation_matrix.py

BASE_ENV = {
    ...
    "CVN_CLASS_BALANCING": "1",  # Changed from "0" — ADR-46 compliance
    ...
}

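Effect: compute_class_weight("balanced") assigns each class the weight n_samples / (n_classes × class_count). For the 70/15/15 split above this gives BUY ≈ 2.2×, SELL ≈ 2.2× and HOLD ≈ 0.48× ; the exact weights depend on the realized label counts.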
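
A minimal sketch of the "balanced" heuristic in plain Python, so the arithmetic is visible. It applies the formula scikit-learn documents for class_weight="balanced", n_samples / (n_classes × class_count). The helper name is hypothetical and the 70/15/15 label mix is taken from the Problem statement, not from real data.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative label mix: 70% HOLD, 15% BUY, 15% SELL.
example_labels = ["HOLD"] * 70 + ["BUY"] * 15 + ["SELL"] * 15
example_weights = balanced_class_weights(example_labels)
```

With equal BUY and SELL frequencies the two minority classes receive the same weight; any asymmetry between them has to come from the label counts themselves.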

Expected Impact

| Metric | Before (no balance) | After (balanced) |
|---|---|---|
| recall_buy | ~0.25 | Target: 0.40-0.55 |
| precision_buy | ~0.55 | May decrease to 0.45-0.50 |
| f1_buy | ~0.45 | Target: 0.50-0.55 |
| action_rate | ~0.15 | Target: 0.20-0.35 |

Trade-off: More BUY predictions → more trades → better statistical power BUT potentially more false positives. The filters (trend, meta-label, regime) catch false positives.


7. FIX C: Realistic Cost Model in HPO (committee reco #3)

Problem

HPO optimizes precision_recall_auc which ignores transaction costs entirely. A model with f1_buy=0.50 and 30 trades at 15bps cost may have negative expectancy. HPO doesn't know.

Solution: Cost-aware HPO objective

New objective: sortino_net — backtest Sortino after costs.

# In hyperoptimizer, after model training:
# 1. Quick backtest on validation set with cost model
# 2. Compute net Sortino
# 3. Use as HPO score

def _compute_sortino_net(model, X_val, y_val, ohlcv_val, cost_bps=15):
    """Quick backtest for HPO scoring — net of costs."""
    predictions = model.predict(X_val)
    trades = _simulate_trades(predictions, ohlcv_val, cost_bps)
    return _sortino_ratio(trades)
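
The helpers referenced above are not shown in this document. As an illustration of the scoring step only, a net-of-cost Sortino ratio over per-trade returns could look like the sketch below. The function names are hypothetical, and charging cost_bps once per trade as a round-trip cost is an assumption.

```python
import math
from typing import List, Optional, Sequence


def net_trade_returns(gross_returns: Sequence[float], cost_bps: float = 15.0) -> List[float]:
    """Subtract a flat per-trade cost (assumed round-trip, in basis points)."""
    cost = cost_bps / 10_000.0
    return [g - cost for g in gross_returns]


def sortino_ratio_sketch(trade_returns: Sequence[float], target: float = 0.0) -> Optional[float]:
    """Mean excess return divided by downside deviation; None when undefined."""
    if not trade_returns:
        return None
    excess = [r - target for r in trade_returns]
    mean_excess = sum(excess) / len(excess)
    # Downside deviation: root mean square of the negative excess returns only.
    downside_dev = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside_dev == 0.0:
        return None  # no losing trades: the ratio is undefined
    return mean_excess / downside_dev
```

Returning None for the degenerate cases lets the caller map them to a score of 0, as the objective handler does.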

Cost components:

- Base fee: CVN_TRADE_FEE_BPS (default 15)
- Slippage: base_bps + impact × √(size/volume) (from cost_model.py)
- Funding rate: CVN_FUNDING_RATE_BPS × expected hold duration (from Binance API)
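
The three components can be combined as in the sketch below. It follows the formulas stated above, but summing them into one per-trade cost is an assumption, as are the function name, the parameter names and the units (basis points, with hold duration expressed in funding periods). It is not the signature of compute_trade_cost in cost_model.py.

```python
import math


def trade_cost_bps(
    fee_bps: float,
    slippage_base_bps: float,
    impact_coeff: float,
    size: float,
    volume: float,
    funding_rate_bps: float,
    hold_periods: float,
) -> float:
    """Per-trade cost in bps: fee + non-linear slippage + funding over the hold."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    slippage = slippage_base_bps + impact_coeff * math.sqrt(size / volume)
    funding = funding_rate_bps * hold_periods
    return fee_bps + slippage + funding
```

The square-root term makes slippage grow sub-linearly with order size relative to traded volume, which is why the model is described as non-linear.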

FTF factor: hpo_objective already has variants {fbeta_buy, precision_recall_auc, f1_macro}. Add sortino_net.

Implementation

File: src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py

Add new objective handler:

elif objective == "sortino_net":
    from commun.config.cost_model import compute_trade_cost
    # Quick backtest on validation fold (np is numpy, assumed imported at module level)
    net_sortino = _quick_backtest_sortino(model, datasets, cost_bps=15)
    if net_sortino is None or np.isnan(net_sortino):
        return 0.0
    # Scale by 10 and clamp to [0, 1] ; without the upper clamp a Sortino above 10 would exceed 1
    return min(1.0, max(0.0, net_sortino / 10.0))


8. FIX D: Trading-Centric HPO & Relaxed Guards (committee reco #4)

Problem

HPO guards kill promising trials:

- action_rate < 0.08 → score = 0 (too few BUY predictions)
- action_rate > 0.60 → score = 0 (too many BUY predictions)
- recall_buy < 0.15 → score = 0 (not enough BUY detection)

These guards constrain the solution space. With class balancing (Fix B), the model will naturally have higher action_rate.

Solution

Relax guards and shift to trading-centric scoring:

Current guards:

if action_rate < 0.08 or action_rate > 0.60: return 0.0
if recall_buy < 0.15: return 0.0

Target guards (relaxed):

if action_rate < 0.03 or action_rate > 0.80: return 0.0  # Much wider
if recall_buy < 0.05: return 0.0  # Minimal floor

Objective shift: From classification-centric to trading-centric:

| Objective | Current | Target |
|---|---|---|
| Primary metric | precision × 0.45 + AUC × 0.35 + (1-logloss) × 0.20 | Sortino net (after costs) |
| Guard: action_rate | [0.08, 0.60] | [0.03, 0.80] |
| Guard: recall_buy | > 0.15 | > 0.05 |
| Guard: n_trades | none | > 10 (minimum trades for valid backtest) |
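
Put together, the target guards and the trading-centric score could be expressed as in the following sketch. The dataclass, the function name and the choice to floor the score at 0 are illustrative assumptions; the thresholds are the target values listed above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HpoGuards:
    """Target (relaxed) guard thresholds."""
    min_action_rate: float = 0.03
    max_action_rate: float = 0.80
    min_recall_buy: float = 0.05
    min_trades_exclusive: int = 10  # a trial needs strictly more trades than this


def guarded_trading_score(
    action_rate: float,
    recall_buy: float,
    n_trades: int,
    sortino_net: Optional[float],
    guards: HpoGuards = HpoGuards(),
) -> float:
    """Return the net Sortino as HPO score, or 0.0 when any guard fails."""
    if not (guards.min_action_rate <= action_rate <= guards.max_action_rate):
        return 0.0
    if recall_buy < guards.min_recall_buy:
        return 0.0
    if n_trades <= guards.min_trades_exclusive:
        return 0.0
    if sortino_net is None or math.isnan(sortino_net):
        return 0.0
    return max(0.0, sortino_net)
```

Keeping the thresholds in one object makes it straightforward to resolve them from configuration rather than from source code.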

9. FIX E: Feature Selection Strategy (committee reco #5)

Problem

Adaptive cap: min(80, max(10, n_features / 20)) → typically 30-50 features from 300+. Too aggressive — removes signal.

Solution

  1. Increase cap: min(150, max(30, n_features / 10)) → typically 80-150 features
  2. Rely on XGBoost regularization instead of pre-selection:
     - reg_alpha (L1) already in HPO range [0.01, 5.0]
     - reg_lambda (L2) already in HPO range [0.5, 7.0]
     - max_depth [3, 8] limits tree complexity
     - XGBoost handles feature selection internally via feature_importance
  3. OOS feature selection (strict): Feature importance computed ONLY on training fold, applied to val/test
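
The two cap formulas are easy to compare numerically. The sketch below evaluates both for a few candidate counts; the function names are hypothetical and integer (floor) division is an assumption about how the division is rounded. Note how strongly the result depends on the candidate count: at 300 candidates the legacy formula yields 15 and the proposed one 30, while the quoted 30-50 range corresponds to roughly 600-1,000 candidates.

```python
def legacy_feature_cap(n_features: int) -> int:
    """Current adaptive cap: min(80, max(10, n_features / 20))."""
    return min(80, max(10, n_features // 20))


def proposed_feature_cap(n_features: int) -> int:
    """Proposed adaptive cap: min(150, max(30, n_features / 10))."""
    return min(150, max(30, n_features // 10))


# Candidate counts chosen for illustration only.
cap_comparison = {
    n: (legacy_feature_cap(n), proposed_feature_cap(n))
    for n in (300, 600, 1000, 1500, 3000)
}
# cap_comparison == {300: (15, 30), 600: (30, 60), 1000: (50, 100),
#                    1500: (75, 150), 3000: (80, 150)}
```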

Implementation:

- src/commun/cache/components/cvntrade_autonomous_fe.py: Change cap formula
- BASE_ENV: CVN_MAX_FEATURES=0 (0 = auto) → keep, but change auto formula
- FTF factor n_features already tests {top_30, top_50, top_100, full}

Expected Impact

More features → more signal → BUT only if regularization prevents overfitting. With Fix A (20× more training data), overfitting risk is dramatically reduced. 700 samples + 150 features = overfit. 14,000 samples + 150 features = healthy ratio (93:1).


10. FIX F: MLOps Operational Controls (committee reco #6)

Kill-Switch

| Mechanism | Scope | Control |
|---|---|---|
| CVN_FTF_ENABLED=0 | All FTF runs | Helm ConfigMap |
| DAG pause | Stop specific DAG | Airflow UI |
| Model rollback | Revert to previous model | MLflow: promote previous Production stage |
| Filter disable | Bypass specific filter | CVN_USE_{FILTER}=0 |

Live Observability

| What | Where | Alert |
|---|---|---|
| Model f1/Sortino per run | Grafana FTF dashboard (§10 drift) | >10% drop → warning |
| CUSUM pass rate | Grafana infra dashboard | <1% → too aggressive |
| Funnel survival rate | Grafana FTF funnel panel | <5% → starvation |
| Training sample count | Grafana FTF stats | >15% drop → data issue |
| Action rate drift | Grafana FTF ML metrics | >20% change → model shift |

Drift Detection

| Type | Method | Trigger |
|---|---|---|
| Data drift | PSI on top features (30d rolling) | PSI > 0.2 |
| Concept drift | F1/Sortino trend across runs | >10% drop from 7d mean |
| Label drift | BUY/HOLD/SELL ratio shift | >20% change |
| Behavior drift | Action rate, filter block rates | >20% change |
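
For the data-drift row, the Population Stability Index compares the binned distribution of a feature in a reference window against a recent window. The sketch below uses equal-width bins fitted on the reference sample and a small floor to avoid log(0); both choices, and the function name, are assumptions rather than the project's implementation.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    recent: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (recent - reference) * ln(recent / reference)."""
    lo, hi = min(reference), max(reference)
    span = hi - lo
    width = span / n_bins if span > 0 else 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    rec_frac = bin_fractions(recent)
    return sum((r - e) * math.log(r / e) for e, r in zip(ref_frac, rec_frac))
```

A value above 0.2 on a monitored feature would then raise the data-drift trigger listed in the table.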

Staged Rollout

| Stage | Environment | Duration | Gate |
|---|---|---|---|
| FTF ablation | Backtest | 2-3h/factor | BH p < 0.05 |
| Committee review | Document | 1 session | Score ≥ 8 |
| Shadow mode | Paper trading | 7 days | No degradation |
| Canary | Live 10% capital | 14 days | Sortino ≥ baseline |
| Full rollout | Live 100% | | Monitoring continues |

11. FIX G: CUSUM/ML Paradigm Test (committee reco #7)

Experimental Design

3-arm study via FTF cusum_training_mode factor:

| Arm | Training CUSUM | Inference CUSUM | Hypothesis |
|---|---|---|---|
| A: disabled | None (all data) | h=3.0σ | ML learns full distribution, CUSUM gates inference |
| B: relaxed | h=1.5σ | h=3.0σ | Partial filter, more data than current |
| C: legacy | h=3.0σ | h=3.0σ | Current baseline |

Success Criteria

| Metric | Minimum acceptable | Target |
|---|---|---|
| f1_buy | 0.50 | 0.60 |
| Sortino (15bps) | 1.0 | 2.0 |
| Trades per fold | 20 | 50+ |
| Powered variants | all 3 | all 3 |

Expected Outcome

Arm A should dominate:

- 20× more training data → better model generalization
- CUSUM at inference → same signal quality (only regime-change trades)
- f1_buy improvement: +0.10-0.20 (from 0.45 to 0.55-0.65)

If Arm A does NOT dominate → fundamental signal-to-noise issue (Hypothesis F). Requires alternative approaches (different features, different model architecture, different timeframe).


12. FIX H: Binary Classification (committee reco #8)

Already implemented as FTF factor classification_mode (issue #517):

| Variant | Config | Expected Impact |
|---|---|---|
| 3class (baseline) | SELL/HOLD/BUY | Current performance |
| binary_balanced | BUY/NOT_BUY, logloss_auc w=0.5 | +precision (focused decision boundary) |
| binary_precision | BUY/NOT_BUY, logloss_auc w=0.7 | +precision (calibration-biased) |

Rationale: 3-class wastes model capacity learning SELL patterns we don't act on (LdP pipeline only opens long positions). Binary focuses 100% of capacity on the BUY decision.



13. Appendix B — V2 Stricter Recos (historical)

13. Walk-Forward Leakage Prevention (V2 reco #5)

Every component in the pipeline must be strictly in-sample for each walk-forward fold:

Fold k timeline:
├── [                   Train window                   ]──[Purge]──[  Val  ]──[Emb]──[  Test  ]
│    FE fitted here only                                   10 bars   15%        5 bars   15%
│    CUSUM sigma calibrated here (lagged 500 bars)
│    Feature selection on train only
│    Class weights computed on train only
│    Thresholds optimized on val only
│    Test: NEVER seen during training or tuning
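
The fold layout above can be turned into index ranges as in the following sketch. The fractions and gap sizes are those shown in the timeline; taking the purge and embargo gaps out of the train and validation side, the rounding rule and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldBounds:
    train: range
    val: range
    test: range


def split_fold(
    n_bars: int,
    val_frac: float = 0.15,
    test_frac: float = 0.15,
    purge_bars: int = 10,
    embargo_bars: int = 5,
) -> FoldBounds:
    """Chronological train / purge / val / embargo / test split of one fold."""
    n_val = round(n_bars * val_frac)
    n_test = round(n_bars * test_frac)
    test_start = n_bars - n_test
    val_end = test_start - embargo_bars   # embargo gap before the test window
    val_start = val_end - n_val
    train_end = val_start - purge_bars    # purge gap before the validation window
    if train_end <= 0:
        raise ValueError("fold too short for the requested windows and gaps")
    return FoldBounds(
        train=range(0, train_end),
        val=range(val_start, val_end),
        test=range(test_start, n_bars),
    )
```

Every fitted component (FE pipeline, feature selection, class weights) would then see only the train range, and thresholds only the val range.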

Per-Component Leakage Audit

| Component | Leakage vector | Current status | V3 fix |
|---|---|---|---|
| Labels | Future prices in TP/SL window | SAFE — window T+1..T+H, anti-look-ahead | |
| Enrichment | Indicators use future data | SAFE — all backward-looking (SMA, RSI, etc.) | |
| CUSUM sigma | Fit on full dataset | TO FIX — sigma sees test data volatility | Lagged window: train_start - 500 bars |
| FE pipeline (imputer, scaler, stationarizer) | Fit on val/test data | SAFE — fitted on train split only | |
| Feature selection | Importance on full dataset | TO VERIFY — may see test features | Force: feature_importance(X_train, y_train) only |
| Class weights | Computed on val/test | SAFE — compute_class_weight(y_train) | |
| HPO | Optimizes on test | SAFE — optimizes on val, test is holdout | |
| Walk-forward thresholds | Threshold optimized on test | SAFE — walk-forward uses val for threshold, test for evaluation | |
| Meta-label model | Trained on same fold | SAFE — ADR-47 requires separate fold | |

Verification Protocol

Before each sprint:

  1. Code audit: grep -n "X_test\|y_test" src/training/ — no test data in training
  2. FE pipeline: verify pipeline.fit(X_train) not pipeline.fit(X_full)
  3. Feature selection: verify importance computed on (X_train, y_train) only
  4. CUSUM sigma: verify sigma_window_end < train_start - purge_bars


14. Uncertainty Quantification (V2 reco #8)

All key metrics must be reported with confidence intervals to quantify statistical reliability.

Methods

| Metric | CI Method | Implementation |
|---|---|---|
| f1_buy | Bootstrap (10,000 resamples) | ablation_stats.py:bootstrap_ci() — already implemented |
| Sortino | Bootstrap (10,000 resamples) | Same — already implemented |
| Total return | Bootstrap | Same — already implemented |
| Win rate | Wilson score interval | New — better coverage than the Wald interval for small n and for rates near 0 or 1 |
| Expectancy | Bootstrap on trade PnL distribution | New |
| Trades per fold | Poisson CI | New — for count data |
| Survival rate | Binomial CI (proportion_confint) | New |
| Block rate per filter | Binomial CI | New |
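
For the win-rate row, the Wilson score interval has a closed form that needs only the standard library. The sketch below is a generic implementation under a hypothetical name, not the project's code.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(wins: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion wins / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half
```

With 6 winning trades out of 10, the interval spans roughly 0.31 to 0.83, which illustrates how little a 60% win rate says at that sample size.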

Reporting Standard

Every metric in FTF reports and Grafana dashboards includes:

Sortino: 1.75 [1.43, 2.10] (95% CI, n=51 runs)

If CI includes 0 → metric is not statistically distinguishable from zero. Flag in report.

If CI width > mean → metric is unreliable. Flag as "WIDE CI — insufficient data".
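
A percentile bootstrap and the two flags described above can be sketched as follows. This is not the code of ablation_stats.py:bootstrap_ci(), which is not shown here; the function names, the fixed seed and the use of the absolute point estimate in the width check are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]


def ci_flags(point: float, lo: float, hi: float) -> List[str]:
    """Apply the two reporting rules: CI includes 0, CI wider than the estimate."""
    flags: List[str] = []
    if lo <= 0.0 <= hi:
        flags.append("CI INCLUDES 0")
    if (hi - lo) > abs(point):
        flags.append("WIDE CI — insufficient data")
    return flags
```

Applied to the example line above, ci_flags(1.75, 1.43, 2.10) returns no flags, since the interval excludes 0 and its width (0.67) is below the point estimate.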

BH Correction for Multiple Comparisons

Already implemented: ablation_stats.py:benjamini_hochberg() applied to all pairwise variant comparisons. Controls false discovery rate at 5%.
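
For reference, the Benjamini-Hochberg step-up procedure itself is short. The sketch below is a generic implementation under a hypothetical name, not the code of ablation_stats.py.

```python
from typing import List, Sequence


def benjamini_hochberg_sketch(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Return, per input p-value, whether its hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.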


15. Minimum Trade Count Gates (V2 reco #7)

Problem

With 1-10 trades per fold, ALL metrics are unreliable. A single trade can swing Sortino from 0 to 100.

Trade count thresholds (committed)

| Threshold | Count | Purpose | Action if below |
|---|---|---|---|
| Statistical minimum | 30 | Minimum for bootstrap CI to be meaningful | Flag: "UNDERPOWERED — CI unreliable" |
| Power analysis | 63 | d=0.5, α=5%, power=80% (from compute_min_sample_size()) | Flag: "INSUFFICIENT POWER for pairwise comparison" |
| Production minimum | 100 | Robust strategy evaluation | Gate: do not promote to production |
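
The figure of 63 is consistent with the normal-approximation sample size for a two-sided, two-sample comparison, n = 2 (z_{1-α/2} + z_{power})² / d² per arm. The sketch below reproduces that arithmetic; it is an assumption about how compute_min_sample_size() arrives at its value, since that function is not shown here, and the name used is hypothetical.

```python
import math
from statistics import NormalDist


def min_samples_per_arm(effect_size_d: float = 0.5, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per arm for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)
```

With the defaults this returns 63. An exact calculation based on the t distribution usually gives a slightly larger number, so 63 is best read as a lower bound.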

Implementation

  1. FTF report: Already flags underpowered variants. Strengthen: exclude < 30 trades from variant ranking.
  2. Grafana: n_trades >= 3 filter (outlier protection). Add panel showing trade count per variant.
  3. Promotion gate: Model cannot be promoted to Production in MLflow if any evaluation fold has < 30 trades.

Expected post-Fix-A trade counts

With CUSUM removed from training (Fix A):

- Training: 14,000 samples (20× current)
- Model sees more examples → better action_rate calibration
- Expected: 30-100 trades per fold (vs 1-10 current)


16. Operational Controls & Runbooks (V2 reco #6)

Kill-Switch Hierarchy

| Level | Mechanism | Scope | Activation | Response time |
|---|---|---|---|---|
| L0: Emergency halt | CVN_FTF_ENABLED=0 in Helm | All FTF + trading | Helm deploy | <5 min |
| L1: DAG pause | Airflow UI → pause DAG | Specific DAG | Immediate | <1 min |
| L2: Model rollback | MLflow: promote previous Production | One crypto | Manual | <10 min |
| L3: Filter bypass | CVN_USE_{FILTER}=0 in Helm | One filter | Helm deploy | <5 min |

Configuration Audit Trail

Every config change is traceable:

- Git: Helm values changes → PR → CodeRabbit → merge (audit via git log)
- MLflow: Model promotion → registry (version, timestamp, user)
- Committee: Design decisions → committee/sessions/*.json (ADR-52)
- Airflow: DAG runs → execution logs (who triggered, when, params)

Rollback Procedures

| Scenario | Steps | RTO |
|---|---|---|
| Bad model deployed | 1. Pause DAG 2. MLflow: promote previous version 3. Verify in paper trading | 15 min |
| Bad config deployed | 1. git revert the Helm values PR 2. CI/CD auto-deploy 3. Verify pods restarted | 10 min |
| Pipeline data corruption | 1. Identify bad run_id 2. DELETE FROM finetune_results WHERE run_id = X 3. Re-run | 30 min |
| Full system failure | 1. Emergency halt (L0) 2. Investigate 3. Fix + committee review 4. Staged re-deploy | 2-4h |

Runbooks (to create in documentation/RUNBOOKS/)

| Runbook | Trigger | First 3 checks |
|---|---|---|
| MODEL_DEGRADATION.md | f1/Sortino drops >10% | 1. Data freshness 2. Feature drift (PSI) 3. Label distribution |
| STARVATION.md | Survival rate < 5% | 1. CUSUM threshold 2. Action rate 3. Filter block rates |
| HPO_FAILURE.md | HPO returns score=0 for all trials | 1. Action rate guards 2. Class distribution 3. Feature count |
| DATA_PIPELINE.md | n_train_samples drops >15% | 1. Binance API status 2. ETL logs 3. Date range |
| COST_SPIKE.md | Avg cost > 50bps | 1. Market liquidity 2. Funding rate 3. Slippage model params |


14. Appendix C — Market Hypothesis & Edge (V2 reco #9)

What edge are we exploiting?

The system targets short-term mean-reversion at regime transition points in DeFi altcoin markets.

Thesis: When a DeFi altcoin's volatility regime shifts (detected by CUSUM), the initial price reaction overshoots. The ML model identifies overshooting candles where the probability of a profitable mean-reversion trade (TP hit before SL) exceeds the breakeven probability after costs.

Why this edge should exist

  1. DeFi altcoin microstructure: Lower liquidity than BTC/ETH → larger overreactions to regime shifts → mean-reversion opportunity.
  2. Retail-dominated order flow: DeFi tokens have higher retail participation → predictable behavioral patterns (panic selling on vol spikes, FOMO buying on breakouts).
  3. CUSUM as regime detector: CUSUM identifies structural breaks in volatility — these are exactly the moments when market participants overreact and create temporary mispricings.
  4. Triple barrier as target: The ATR-based SL/TP captures the mean-reversion: TP is set at the "fair" reversion level, SL limits the cost of being wrong.

Why it hasn't worked yet

The edge exists in principle but the pipeline has structural issues that prevent the model from capturing it:

- CUSUM during training eliminates 95% of examples → model can't learn the full distribution of regime transitions
- Class imbalance → model defaults to HOLD instead of predicting BUY at transition points
- HPO objective optimizes classification accuracy, not trading profit → model maximizes F1 but generates few trades
- Feature cap removes market microstructure features (volume patterns, order flow proxies) that capture the edge

How we validate the edge

  1. Statistical test: Sortino of our model > Sortino of random entry at same CUSUM-filtered candles (ADR-29 baseline)
  2. Economic test: Net expectancy per trade > 0 after costs at 30 bps
  3. Robustness test: Edge persists across 5 cryptos, 5 folds, 3 cost scenarios
  4. Decay test: Edge doesn't degrade significantly over recent folds (concept drift monitoring)

Success criteria for edge validation

| Metric | Minimum | Target | Method |
|---|---|---|---|
| Sortino (15bps) | > 1.0 | > 2.5 | FTF ablation, 5 folds |
| Net expectancy/trade | > 0 | > 0.5% | After costs, per trade |
| vs random baseline | > 1.5× | > 3× | ADR-29 comparison |
| Edge stability (fold variance) | CV < 100% | CV < 50% | Cross-fold coefficient of variation |
| Trades per fold | > 30 | > 50 | Statistical power for evaluation |


15. Appendix D — Pre-Harness Implementation Roadmap (historical)

Sprint 1: Critical Fixes — DEADLINE: 1 week

| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| A: Decouple CUSUM | cvntrade_autonomous_fe.py + env var | 20 lines | 20× training data | FTF run confirms >5000 train samples |
| B: Enable class balancing | ablation_matrix.py BASE_ENV | 1 line | +0.10 f1_buy | ADR-46 compliant |
| C: Cost model validation | Verify cost_model.py integrated | 0 (audit) | Cost-aware eval | Expectancy computed in backtest |
| FTF factors | Add cusum_training_mode | 15 lines | A/B test paradigm | Factor appears in DAG dropdown |
| Leakage audit | Code review per §13 checklist | 2h | Experimental validity | All items SAFE or FIXED |

Sprint 2: HPO & Features — DEADLINE: 1 week

| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| D: sortino_net objective | XGBoost_hyperoptimizer.py | 40 lines | Trading-centric HPO | HPO uses backtest Sortino as score |
| D: Relax guards | XGBoost_hyperoptimizer.py | 5 lines | Wider solution space | action_rate [0.03, 0.80] |
| E: Increase feature cap | cvntrade_autonomous_fe.py | 5 lines | More signal | 100-150 features available |
| Uncertainty CIs | ablation_report.py | 30 lines | Statistical rigor | All metrics reported with CI |

Sprint 3: Validation & Operations — DEADLINE: 1 week

| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| G: Paradigm test (3-arm) | FTF run cusum_training_mode | 3h compute | Validate Fix A | Arm A Sortino > Arm C |
| H: Binary classification | FTF run classification_mode | 3h compute | Validate binary | Compare to 3-class baseline |
| F: MLOps controls | Grafana alerts + 5 runbooks | 2 days | Operational safety | Runbooks reviewed |
| Edge validation | Compare to random baseline (ADR-29) | 1h analysis | Confirm economic edge | Sortino > 1.5× random |

Success Gate — Hard Criteria

After Sprint 3, ALL must be met to proceed to production:

| Metric | Current | Gate (minimum) | Target | Method |
|---|---|---|---|---|
| f1_buy | 0.45 | ≥ 0.50 | 0.60 | FTF ablation, bootstrap CI |
| Sortino (15bps) | 1.5 | ≥ 1.5 | 2.5 | FTF, CI excludes 0 |
| Trades per fold | 1-10 | ≥ 30 | 50+ | All folds, all cryptos |
| recall_buy | 0.25 | ≥ 0.30 | 0.45 | FTF ablation |
| Net expectancy/trade | unknown | > 0 | > 0.5% | After 15bps costs |
| vs random baseline | unknown | > 1.5× | > 3× | ADR-29 comparison |
| Powered variants | rare | all | all | Power analysis (63 trades) |
| Edge stability (CV) | unknown | < 100% | < 50% | Cross-fold Sortino variance |

If gate not met: Escalate to architectural review. Options:

  1. Alternative models (LightGBM, transformer, LSTM)
  2. Alternative features (order book, funding rate dynamics, cross-asset)
  3. Alternative timeframes (1h, 4h)
  4. Alternative strategy (momentum instead of mean-reversion)



16. Files Reference

| File | Lines | Purpose |
|---|---|---|
| src/ETL/cvntrade_external_data.py | 20-132 | Binance API, funding rates |
| src/ETL/cvntrade_label.py | 188-290 | Triple barrier labeling |
| src/commun/pipeline/enrichment_api.py | 82-90 | OHLCV → indicators |
| src/commun/pipeline/feature_engineering_api.py | 1-150 | FE pipeline |
| src/commun/cache/components/cvntrade_autonomous_fe.py | 125-200 | CUSUM + split + FE (Fix A target) |
| src/backtest/filters/cvntrade_cusum_filter.py | 279-409 | CUSUM algorithm |
| src/training/harness/__init__.py | | Harness public API (train_one, train_ensemble, run_hpo, train_with_fixed_params) |
| src/training/harness/contracts.py | | Typed payloads (Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion) |
| src/training/harness/registry.py | | Plugin registries + register_model / register_ensemble decorators |
| src/training/harness/dags/models/{xgboost,lightgbm,catboost}_dag.py | | One file = one model. HPO search space + Hamilton DAG nodes |
| src/training/harness/dags/ensembles/{blend_avg,stack_logreg,stack_xgb_meta}_dag.py | | One file = one ensemble |
| src/training/harness/adapters/{xgb,lgb,cb,ensemble}.py | | Per-model predict_proba shims (Protocol implementations) |
| src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py | | Reused Hamilton-pure functions (the 85% shared code) |
| src/training/harness/autonomous_model_trainer.py | | Generic cache-aware wrapper (any registered model) |
| src/training/harness/autonomous_ensemble_trainer.py | | Generic cache-aware wrapper (any registered ensemble) |
| src/training/cvntrade_autonomous_orchestrator.py | 45-160 | Orchestration — dispatches to the two generic wrappers above |
| src/commun/config/cost_model.py | | Non-linear slippage model |
| src/commun/finetune/ablation_matrix.py | 37-97 | BASE_ENV (Fix B target) |
| src/commun/mlflow/cvntrade_mlflow_manager.py | | Model registry |
| documentation/ADR.md | 744-799 | ADR-43/44/45 (funnel) |