Architecture — Training Pipeline¶

Version: 3.3
Date: 2026-05-13 (V3.3, restructured for committee review: flat structure, harness-first, S18 isolated, legacy demoted)
Status: Canonical harness path in production; model promotion blocked pending S18-S19 regression diagnosis

Status decomposition (committee gate)¶

Domain	Status	Anchor
Harness architecture	production canonical (since PR #891 + #898, 2026-05-10)	§4 Harness Path
Hyperparameters externalization	production via ADR-90 (since PR #904, 2026-05-12)	§5 Hyperparameter Resolution (ADR-90)
Harness-trained model promotion	frozen — promotion blocked until S18-S19 close	§2 Current Status & Promotion Freeze
S18 regression diagnostic	in flight (plan_review PASSED OK 4298520f, OP wp#154)	§6 Known Regression — S18

1. Executive Summary¶

The training pipeline is now canonically routed through the Hamilton-based training harness for XGBoost, LightGBM and CatBoost. Since PR #904, all numeric hyperparameter defaults and HPO ranges are resolved at runtime from PG ftf_config.base_env via ADR-90; hard-coded hyperparameter values in source code are forbidden, and CI gate G5 enforces that rule.

However, the post-harness training path is currently under diagnosis. The S17 canary FTF run exposed shallow-training behavior: LightGBM stops after 1-2 iterations across all trials, XGBoost shows a median best_iteration around 17, and model quality is materially below the pre-#891 baseline. Current post-S17 f1_buy results on defi_top5 5m are CatBoost 0.39, LightGBM 0.37 and XGBoost 0.09, versus a pre-#891 baseline around 0.42.

As a result, the harness architecture is production-canonical, but harness-trained model promotion is frozen until CVN-N001-EE-S18 and S19 are closed. The current leaderboard is informational only and must not be used for live trading promotion.

Pipeline flow (canonical, post-#898)¶

OHLCV → Enrichment → Labeling → Feature Engineering
    → Temporal Split → Harness DAG
    → { XGB | LGB | CB } via `harness.train_one(model_type, datasets, hpo_params)`
    → Evaluation → MLflow

CUSUM is applied at inference (per ADR-46 + committee reco) ; it is NOT a training filter. The legacy "Stage 7 HPO + XGBoost Training" name is retained in §8 Legacy Pipeline Analysis for historical context but the current canonical path is described in §4-§5.

Consolidated performance snapshot¶

Single source of truth — referenced by every other section :

Baseline	State	Model	f1_buy	Notes
pre-#891	legacy autonomous trainers	XGB+LGB+CB (3 trainers, ~2400 LOC)	~0.42	reference target — what we want to recover
post-#891	harness migration	(all 3)	~0.22	the 2026-05-11 incident that motivated S17
post-S17	harness + Console-seeded HPs	CatBoost	0.39	within range, best_iter 12-155 mixed
post-S17	harness + Console-seeded HPs	LightGBM	0.37	shallow + θ-sweep at θ=0.2 compensates (best_iter=1 in 53/53 trials)
post-S17	harness + Console-seeded HPs	XGBoost	0.09	shallow + θ=0.5 fixed → near-empty buy classifier ; best_iter p50=17

The promotion-freeze decision is anchored on this table: no harness-trained model goes live until f1_buy on defi_top5 5m ATR0.5_1.5_H4 returns to ≥ 0.40 on ≥ 4/5 cryptos for all 3 model types (mlops_readiness.md §4 canary criterion).
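
The f1_buy criterion above can be written as a small check. This is a minimal sketch only: the function name `f1_gate_passes`, the nested-dict input shape and the per-crypto numbers in the example are assumptions made for illustration, not values or code from the pipeline.

```python
from typing import Dict

F1_BUY_FLOOR = 0.40       # threshold stated in the canary criterion
MIN_CRYPTOS_PASSING = 4   # "at least 4 of 5 cryptos"


def f1_gate_passes(f1_by_model: Dict[str, Dict[str, float]]) -> bool:
    """Return True only if every model type clears the floor on enough cryptos.

    f1_by_model maps model type -> {crypto symbol -> f1_buy} (hypothetical shape).
    """
    if not f1_by_model:
        return False
    for per_crypto in f1_by_model.values():
        passing = sum(1 for f1 in per_crypto.values() if f1 >= F1_BUY_FLOOR)
        if passing < MIN_CRYPTOS_PASSING:
            return False
    return True


# Made-up per-crypto numbers, for illustration only.
example = {
    "xgboost": {"A": 0.09, "B": 0.10, "C": 0.08, "D": 0.11, "E": 0.09},
    "lightgbm": {"A": 0.41, "B": 0.42, "C": 0.40, "D": 0.43, "E": 0.35},
    "catboost": {"A": 0.41, "B": 0.42, "C": 0.40, "D": 0.43, "E": 0.39},
}
gate_open = f1_gate_passes(example)
```

With these illustrative inputs the gate stays closed, because one model type fails on every crypto even though the other two pass on 4 of 5.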

2. Current Status & Promotion Freeze¶

Component	Production state	Promotion gate
Harness binary (`src/training/harness/`)	production canonical (PR #891 + #898)	n/a (architectural change, not a model)
Hyperparameter resolver (`commun.finetune.hyperparams.resolve`)	production canonical (PR #904 / ADR-90)	n/a (config plumbing)
Console seeding (455 keys in `ftf_config.base_env`)	seeded 2026-05-12T18:34 UTC, 100% coverage	n/a
Harness-trained XGB / LGB / CB models	trained, results in `finetune_results`	FROZEN — promotion blocked pending S18-S19
FTF leaderboard	running daily	informational only — not a promotion signal
Live trading on harness-trained models	forbidden	gate = S18 verdict LOCK + S19 fix verified on canary

Until both CVN-N001-EE-S18 (diagnostic) and the follow-up CVN-N001-EE-S19 (remediation) close with a verified canary, the operator MUST NOT promote any harness-trained model to live trading. Backtests and shadow runs are permitted ; auto-promotion is OFF (ADR-2).

3. Current Canonical Architecture¶

The pipeline below is the version in production as of 2026-05-13. See §8 Legacy Pipeline Analysis for the pre-#891 path (kept for historical bottleneck analysis ; do NOT reference for current implementation).

Binance OHLCV
   ↓
Enrichment (technical indicators)
   ↓
Labeling (triple-barrier ATR-dynamic)
   ↓
Feature Engineering (Hamilton DAG, fitted at train, frozen at inference)
   ↓
Temporal split (walk-forward, 5 folds × {train, val, test})
   ↓
Harness DAG (Hamilton, plugin registry)
   ↓
   ┌─ XGB ──┐
   ├─ LGB ──┤ → resolver `resolve(MODEL, TF, PARAM)` reads from PG ftf_config.base_env
   └─ CB ──┘   (455 keys, Console-editable, ADR-90)
   ↓
Evaluation (per-fold f1_buy / sortino / expectancy / drawdown, with bootstrap CI)
   ↓
MLflow registry (versioned model + feature contract pinned, ADR-23 / ADR-24)
   ↓
{Promotion gate} ← currently FROZEN pending S18-S19 — see §2

Each upstream stage (OHLCV / Enrichment / Labeling / FE / Split) is unchanged from the legacy pipeline; only the training step (Stage 7 in legacy numbering) was replaced. The harness is therefore a drop-in replacement at the training boundary.

4. Harness Path¶

The legacy per-model autonomous trainer approach (Stages 7-9 of the legacy pipeline, see §8 Legacy Pipeline Analysis) has been replaced by a Hamilton-based training harness with plugin registries (ADR-89). This is the canonical path for every single-model and ensemble training in production since 2026-05-10 (PR #891 + #898). Promotion of harness-trained models is currently FROZEN — see §2 Current Status & Promotion Freeze.

High-level data flow¶

flowchart LR
    subgraph FTF[FTF runner / orchestrator]
        OR["CVNTrade_AutonomousOrchestrator
._create_autonomous_trainer(model_type)"]
    end

    subgraph Harness[training.harness]
        REG{{"MODEL_REGISTRY
ENSEMBLE_REGISTRY"}}
        AMT[CVNTrade_AutonomousModelTrainer
generic, parameterised by model_type]
        AET[CVNTrade_AutonomousEnsembleTrainer
generic, parameterised by ensemble_name]
        T1["train_one(model_type, datasets, hpo)
Hamilton Driver"]
        TE["train_ensemble(name, datasets, hpo)
Hamilton Driver"]
    end

    subgraph DAGs[harness.dags]
        XGD["xgboost_dag.py"]
        LGD["lightgbm_dag.py"]
        CBD["catboost_dag.py"]
        ENS1["blend_avg_dag.py"]
        ENS2["stack_logreg_dag.py"]
        ENS3["stack_xgb_meta_dag.py"]
    end

    subgraph Adapters[harness.adapters]
        XA[XGBAdapter]
        LA[LGBAdapter]
        CA[CBAdapter]
        EA[EnsembleAdapter]
    end

    OR --> REG
    REG -- single model --> AMT
    REG -- ensemble --> AET
    AMT --> T1
    AET --> TE
    T1 --> XGD & LGD & CBD
    TE --> ENS1 & ENS2 & ENS3
    XGD --> XA
    LGD --> LA
    CBD --> CA
    ENS1 & ENS2 & ENS3 --> EA
    EA -. composes .-> XA & LA & CA

Adding a new model¶

1. Drop a new file src/training/harness/dags/models/<name>_dag.py declaring the Hamilton DAG nodes (<name>_native, <name>_adapter, <name>_metrics_val, <name>_metrics_test, <name>_artifact) and the per-model HPO search space. Each numeric default MUST be read via commun.finetune.hyperparams.resolve(_MODEL, current_timeframe(), "<PARAM>") (per ADR-90 and CI grep gate G5); HPO range MIN/MAX/SCALE values are read via resolve_hpo_range(_MODEL, current_timeframe(), "<PARAM>").
2. Append a register_model(...) call at the bottom of the file (or use the convention scan in harness/registry.py).
3. Seed the Console with the model's HP defaults and HPO ranges for the 5 timeframes (1M / 5M / 15M / 30M / 1H). Add the values to scripts/seed_hyperparams_console.py::LEGACY_DEFAULTS and LEGACY_HPO_RANGES, update commun.finetune.hyperparams.expected_keys() to enumerate the new (model, param) pairs, then run python scripts/seed_hyperparams_console.py --apply against prod PG (via kubectl exec on a scheduler pod; see documentation/runbooks/break-glass-hyperparams.md for the Console-bypass pattern when needed). The parity tests in tests/unit/training_harness/test_hyperparams_seeding_parity.py MUST pass before merge.
4. Done. The orchestrator picks the new model up automatically, with no edits to any caller. Synthetic-pickup test: tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup.

Adding a new ensemble¶

Same pattern under dags/ensembles/. The EnsembleAdapter composes N base TrainedArtifacts ; Hamilton resolves the base model dependencies and caches them within a driver call. Ensembles do NOT own hyperparameters themselves today — they reuse the per-base-model seeded values from ftf_config.base_env. If an aggregator gains tunable HPs (e.g., LogRegShrinkAggregator.l2 floor), externalize via ADR-90 with a CVN_HPO_ENS_<NAME>_<PARAM> namespace extension (currently unused — file an ADR addendum if needed).

File map (current)¶

Concern	File	LOC
Public API	`src/training/harness/__init__.py`	~180
Typed contracts	`src/training/harness/contracts.py`	~150
Plugin registries	`src/training/harness/registry.py`	~120
Per-model DAGs	`src/training/harness/dags/models/*_dag.py`	~200 each
Per-ensemble DAGs	`src/training/harness/dags/ensembles/*_dag.py`	~180 each
Per-model adapters	`src/training/harness/adapters/{xgb,lgb,cb}.py`	~50 each
Ensemble adapter	`src/training/harness/adapters/ensemble.py`	~80
Reused Hamilton nodes	`src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py`	~700 total
Generic single-model autonomous wrapper	`src/training/harness/autonomous_model_trainer.py`	~280
Generic ensemble autonomous wrapper	`src/training/harness/autonomous_ensemble_trainer.py`	~240
Hamilton-native wrappers (PR #896 wiring)	`src/training/harness/dags/wrappers/`	varies
Orchestrator dispatch	`src/training/cvntrade_autonomous_orchestrator.py`	~210
Hyperparameter resolver (ADR-90)	`src/commun/finetune/hyperparams.py`	~410
Console seeding script (ADR-90)	`scripts/seed_hyperparams_console.py`	~570

Backtest contract¶

Every harness DAG returns a TrainedArtifact whose .adapter exposes the canonical predict_proba(X) -> (n, 2) Protocol. The backtest engine's create_model_adapter factory recognises harness adapters via duck-typing and wraps them through _HarnessAdapterShim to satisfy the legacy CVNTrade_ModelAdapter ABC. Consumer code (regime trainer, backtest engine, OOS predictor) therefore performs no type-sniffing.
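
To make the duck-typing idea concrete, here is a minimal sketch using a runtime-checkable `typing.Protocol`. The names `ProbaPredictor`, `ConstantAdapter` and `is_harness_adapter` are hypothetical and chosen for illustration; the real `create_model_adapter` factory and `_HarnessAdapterShim` may detect and wrap adapters differently.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ProbaPredictor(Protocol):
    """Structural contract: anything exposing predict_proba(X) qualifies."""

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        ...


class ConstantAdapter:
    """Toy adapter (hypothetical) returning the same BUY probability per row."""

    def __init__(self, p_buy: float) -> None:
        self.p_buy = p_buy

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        # One [p_not_buy, p_buy] pair per input row, i.e. an (n, 2) result.
        return [[1.0 - self.p_buy, self.p_buy] for _ in X]


def is_harness_adapter(obj: object) -> bool:
    # isinstance on a runtime_checkable Protocol only checks that the
    # method exists; it does not validate signatures or output shapes.
    return isinstance(obj, ProbaPredictor)


adapter = ConstantAdapter(0.25)
probs = adapter.predict_proba([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```

Because the check is structural, the toy adapter needs no shared base class, which is the property that lets consumers avoid inspecting concrete model types.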

5. Hyperparameter Resolution (ADR-90)¶

Since PR #904 (CVN-N001-EE-S17, 2026-05-12) :

Key family	Naming convention	Count today	Owner
Defaults	`CVN_HPO_<MODEL>_<TF>_<PARAM>`	110 HPO-tunable + 15 EARLY_STOPPING_ROUNDS = 125	Console (`@dococeven`)
HPO ranges	`CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_{MIN,MAX,SCALE}`	330 (3 per HPO-tunable default × 110)	Console (`@dococeven`)
Total		455 canonical keys

The resolver contract :

Env key present → parse to typed value (per _PARAM_TYPES in commun/finetune/hyperparams.py)
Env key missing + fallback=None → RuntimeError with canonical message pointing to ADR-90 (operator MUST seed the key)
Env key missing + fallback=<value> → emit WARN event=hpo_fallback_applied then return the (type-coerced) fallback (transitional warning window per ADR-90 Clause 2)
NaN / ±Inf rejected at parse time (_parse fail-fasts)
HPO range MIN >= MAX rejected at resolve_hpo_range exit
Fallback type-coerced through the same _parse path as env-sourced values (string "150" → int 150 for EARLY_STOPPING_ROUNDS)
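
The contract above can be sketched as follows. This is an illustration under stated assumptions, not the code in commun/finetune/hyperparams.py: the real resolver reads from PG, whereas this sketch takes a plain mapping as its first argument, and `PARAM_TYPES`, `key_for`, `parse`, the model token "LGB" and the scale value "log" are all hypothetical.

```python
import logging
import math
from typing import Mapping, Optional, Tuple, Union

log = logging.getLogger("hyperparams_sketch")
Number = Union[int, float]

# Hypothetical stand-in for _PARAM_TYPES.
PARAM_TYPES = {"EARLY_STOPPING_ROUNDS": int, "LEARNING_RATE": float}


def key_for(model: str, timeframe: str, param: str) -> str:
    """Build a default key following CVN_HPO_<MODEL>_<TF>_<PARAM>."""
    return f"CVN_HPO_{model}_{timeframe}_{param}".upper()


def parse(param: str, raw: Union[str, int, float]) -> Number:
    """Coerce a raw value to the declared type, rejecting NaN and +/-Inf."""
    value = float(raw)
    if math.isnan(value) or math.isinf(value):
        raise ValueError(f"non-finite value for {param}: {raw!r}")
    if PARAM_TYPES.get(param, float) is int:
        if value != int(value):
            raise ValueError(f"{param} expects an integer, got {raw!r}")
        return int(value)
    return value


def resolve(env: Mapping[str, str], model: str, timeframe: str, param: str,
            fallback: Optional[Union[str, int, float]] = None) -> Number:
    key = key_for(model, timeframe, param)
    if key in env:
        return parse(param, env[key])
    if fallback is None:
        raise RuntimeError(f"{key} is not seeded; see ADR-90")
    log.warning("event=hpo_fallback_applied key=%s fallback=%s "
                "reason=env_key_missing", key, fallback)
    return parse(param, fallback)  # same coercion path as env-sourced values


def resolve_hpo_range(env: Mapping[str, str], model: str, timeframe: str,
                      param: str) -> Tuple[Number, Number, str]:
    base = key_for(model, timeframe, param)
    lo = parse(param, env[f"{base}_RANGE_MIN"])
    hi = parse(param, env[f"{base}_RANGE_MAX"])
    if lo >= hi:
        raise ValueError(f"{base}: RANGE_MIN must be < RANGE_MAX")
    return lo, hi, env[f"{base}_RANGE_SCALE"]


env = {
    "CVN_HPO_LGB_5M_EARLY_STOPPING_ROUNDS": "150",
    "CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_MIN": "0.01",
    "CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_MAX": "0.3",
    "CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_SCALE": "log",
}
rounds = resolve(env, "LGB", "5M", "EARLY_STOPPING_ROUNDS")
```

Routing the fallback through the same `parse` function is what keeps a string fallback such as "150" from leaking into the trainer as a string.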

Operator workflow:

- Tune a value: Console UI → Config → ftf_config → edit key → save (writes an audit row to ftf_config_history). No deploy needed (rule 1 of ADR-90).
- Inspect coverage: Grafana dashboard cvntrade-hp-coverage (5 panels: 24h fallback count, 6h fallback %, 7d timeseries, top missing keys, training stack health). Target = 0 fallback events.
- Roll back a bad write: the Console UI restore-from-history button is NOT IMPLEMENTED today (Epic CCP wp#149 scope); until then, use the break-glass PG path with the gating ritual in documentation/runbooks/break-glass-hyperparams.md.
- Seed a new namespace: run python scripts/seed_hyperparams_console.py --apply after extending LEGACY_DEFAULTS / LEGACY_HPO_RANGES / expected_keys(). The script is idempotent (re-running is a no-op on already-seeded keys).

6. Known Regression — S18¶

Post-S17 canary FTF ftf_20260512_184337_fdee27_ATR0.5_1.5_H4 (2026-05-12) revealed that harness training stops shallow, i.e. at a very low best_iteration, for all 3 model types:

Model	best_iteration p50	f1_buy mean	Profile
LightGBM	1 (53/53 trials hit iter 1-2)	0.37	Shallow + val-tuned θ-sweep at θ=0.2 grabs random-distribution tail
CatBoost	14 (some trials reach 336)	0.39	Mixed — bimodal
XGBoost	17	0.09	Shallow + fixed θ=0.5 → near-empty buy classifier (1-44 trades vs 100-280 for LGB/CB)

Root cause is upstream of S17. Hyperparameter externalization (PR #904) is verified clean: the event=hpo_fallback_applied count is 0, Optuna picks are within seeded ranges, and the parity tests in test_hyperparams_seeding_parity.py confirm byte-for-byte legacy values. The regression is in the #891 harness migration itself, which is the scope of the diagnostic Story CVN-N001-EE-S18 (plan_review committee session 4298520f PASSED OK with strong consensus, 0 blockers, OP wp#154, OP Meeting #133).

Diagnostic plan (S18) — 6 steps, 5.25-day budget¶

Step	Action	Budget
0	Pre-validate captured fold reproduces FTF metrics in staging (committee reco #4)	0.25 day
1	Capture reference fold `AAVEUSDC fold=3 best_iter=1` from FTF run with verbose logging	1 day
2	Reconstruct pre-#891 legacy `train_with_fixed_params_lgbm` from git (`git show e75418ca^:src/training/LightGBM/grid_utils.py`)	0.5 day
3	Build parity reproducer `scripts/diagnostic_s18_harness_vs_legacy.py` — all 3 models + per-iter AUC + binary_logloss	2 days
4	Bisect harness commits (#891 `dc3d86c6`, #896 `f56fa33f`, #899 `e75418ca`)	1 day
5	Final dossier `documentation/missions/cvn-n001-ee-s18-diagnostic/report.md` + verdict	0.5 day

Verdict tree (per ADR-79, adapted for diagnostic Story) :

LOCK : root cause + fix < 30 LOC → open CVN-N001-EE-S19 1-week plan
KEEP_AVAILABLE : root cause + fix non-trivial → S19 with 2-3 sprint plan + design dossier
ABANDON : 30d no convergence OR fix > 3× revert surface → S19 reverts #891 + #896 + #899, accept ADR-89 loss, redo migration

Until S18 + S19 close, the harness FTF leaderboard is informational only ; no live trading promotion of harness-trained models is permitted (see §2).

7. Observability & Operator Workflow¶

Observability contract — Loki event schema¶

Every harness DAG emits the same event= schema, traceable in Loki via the {namespace="cvntrade"} stream :

Event	When	Key labels
`event=training_started`	DAG entry	`model_type`, `harness_schema_version`, `n_train`, `n_val`, `n_test`, `n_features`, `class_dist_train_buy_pct`, `n_estimators`, `learning_rate` (golden fields — full bundle since CR round 1 PR #904 `lgb_effective_hyperparams`)
`event=class_balance_applied`	After class-weight computation	`model_type`, `binary`, `scale_pos_weight`, `n_pos`, `n_neg`, `imbalance_ratio`
`event=theta_picked`	After `pick_threshold_on_val`	`model_type`, `threshold`, `f1_buy_at_threshold`, `n_candidates`
`event=training_complete`	DAG exit	`model_type`, `best_iteration`, `training_time_sec`, `theta_picked`, `f1_buy_val`, `f1_buy_test`, `auc_buy_val`, `auc_buy_test`
`event=autonomous_trained`	Cache-aware wrapper success	`model_type`, `crypto`, `strategy`, `run_id`
`event=hpo_fallback_applied` (NEW, S17)	Resolver hit fallback path	`model`, `timeframe`, `param`, `key`, `fallback`, `reason=env_key_missing` — Grafana panel `cvntrade-hp-coverage` thresholds : > 1% rate of `event=training_started` → P3, > 5% → P2
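
The fallback-rate thresholds in the last row can be read as the following classification. This is a sketch of the arithmetic only; the function name and the idea of passing raw event counts are assumptions, and the actual alerting is configured in Grafana rather than in Python.

```python
from typing import Optional


def fallback_alert_priority(n_fallback: int, n_training_started: int) -> Optional[str]:
    """Map the ratio of hpo_fallback_applied to training_started events to a priority.

    More than 5% of training_started events -> "P2", more than 1% -> "P3",
    otherwise no alert (None).
    """
    if n_training_started <= 0:
        return None  # no trainings in the window, so no rate to evaluate
    rate = n_fallback / n_training_started
    if rate > 0.05:
        return "P2"
    if rate > 0.01:
        return "P3"
    return None


priority = fallback_alert_priority(n_fallback=2, n_training_started=100)
```

The operational target remains 0 fallback events; the thresholds only decide how loudly a non-zero rate is escalated.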

8. Legacy Pipeline Analysis¶

Historical reference only. The current canonical implementation is described in §3-§7 above. Stages 1-6 below remain technically accurate as inputs to the harness (Stage 7 is the part replaced). Stages 7-9 are the pre-#891 trainers — deprecated since 2026-05-10. This section is retained for the bottleneck analysis it informs (§9) and for the FIX A-H historical action plan (Appendix A, §12) that motivated the harness refactor.

8.1 Pipeline Stages 1-6 (still current as harness inputs)¶

Stage 1: Data Ingestion (OHLCV)¶

Aspect	Detail
Source	Binance Futures API via ccxt
File	`src/ETL/cvntrade_external_data.py:20-26`
Timeframe	30m (default), configurable via `CVN_TIMEFRAME`
History	24 months (configurable via `history_months`)
Data	Open, High, Low, Close, Volume + Funding Rate (hourly, resampled)
Cache	Redis L1 (10 connections, 5s timeout) → Disk L2 (parquet)
Volume	~24,000 candles per crypto for 24m @ 30m

Checkpoint: Raw OHLCV dataframe, no NaN, timezone-aware index.

Stage 2: Enrichment (Technical Indicators)¶

Aspect	Detail
File	`src/commun/pipeline/enrichment_api.py:82-90`
Processor	`CVNTrade_Enrich.process(df, mode="inference")`
Indicators	SMA (multi-period), RSI (7,14,28), MACD (slow+signal), ATR, Stochastic (K,D), Volume MA, MFI, ADX, Bollinger Bands
Gate v5 features	MA spectrum (5,10,20,50), rolling window stats, volatility regime, momentum patterns, support/resistance, volume flow, market structure
Output	~300+ columns (raw indicators + derived features)
Warmup	~500 candles consumed for indicator bootstrap
Parity guarantee	`enrich_batch(ohlcv).iloc[-1]` == `enrich_streaming(window).iloc[-1]` (ADR)

Checkpoint: Enriched dataframe with ~300 columns, ~23,500 rows (after warmup).

Stage 3: Labeling (Triple Barrier)¶

Aspect	Detail
File	`src/ETL/cvntrade_label.py:188-269`
Method	Triple barrier with ATR-dynamic levels
Strategy format	`ATR{sl}_{tp}_H{horizon}` (e.g. ATR1.5_3.0_H5) or legacy `SL{x}_TP{y}_H{z}` (e.g. SL0.5_TP1_H4)
TP level	`open × (1 + atr × tp_mult / 100)`
SL level	`open × (1 - atr × sl_mult / 100)`
Horizon	H hours forward, converted to candles at the active timeframe (e.g. H5 = 5 hours = 10 candles @ 30m)
Labels	+1 (BUY: TP hit first), -1 (SELL: SL hit first), 0 (HOLD: timeout)
Anti-look-ahead	Uses T+1 to T+horizon window, decision at T based on T-1 return
ATR floor	0.6% minimum (prevents degenerate labels on low-vol periods)
NaN handling	Labels that don't resolve (no TP/SL/timeout) → dropped (~10-20%)
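
The barrier logic in the table can be sketched in plain Python. Several points are assumptions for the sake of a runnable example: the function name, ATR being supplied as a percentage, the horizon being passed in candles, resolving a same-candle double hit as SL first, and returning None for rows without a full forward window. The real implementation in cvntrade_label.py may differ on each of these.

```python
from typing import List, Optional, Sequence

ATR_FLOOR_PCT = 0.6  # ATR floor from the table, in percent


def triple_barrier_labels(opens: Sequence[float], highs: Sequence[float],
                          lows: Sequence[float], atr_pct: Sequence[float],
                          tp_mult: float, sl_mult: float,
                          horizon: int) -> List[Optional[int]]:
    """Label each bar +1 (TP first), -1 (SL first) or 0 (timeout)."""
    n = len(opens)
    labels: List[Optional[int]] = [None] * n
    for t in range(n - horizon):
        atr = max(atr_pct[t], ATR_FLOOR_PCT)
        tp = opens[t] * (1 + atr * tp_mult / 100)
        sl = opens[t] * (1 - atr * sl_mult / 100)
        label = 0  # timeout unless a barrier is touched
        # Only bars T+1 .. T+horizon are inspected (no look at bar T itself).
        for j in range(t + 1, t + horizon + 1):
            if lows[j] <= sl:      # assumption: SL wins a same-candle tie
                label = -1
                break
            if highs[j] >= tp:
                label = 1
                break
        labels[t] = label
    return labels


# Toy series: flat opens, one upward spike on bar 2.
labels = triple_barrier_labels(
    opens=[100.0] * 5,
    highs=[100.0, 100.0, 104.0, 100.0, 100.0],
    lows=[100.0, 100.0, 99.0, 100.0, 100.0],
    atr_pct=[1.0] * 5, tp_mult=3.0, sl_mult=1.5, horizon=2,
)
```

In the toy series the first two bars see the spike inside their window and get +1, the third times out, and the last two have no complete window and stay unlabeled.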

Class distribution (typical for ATR1.5_3.0_H5):

BUY (+1):   ~15-25%  ← minority class
HOLD (0):   ~40-50%  ← majority class (timeout)
SELL (-1):  ~25-35%

Checkpoint: Labeled dataframe. ~20,000 rows after NaN drop. Imbalanced toward HOLD.

Known issue: High HOLD ratio means model defaults to HOLD → low BUY recall.

Stage 4: CUSUM Pre-Filter¶

Aspect	Detail
File	`src/backtest/filters/cvntrade_cusum_filter.py:279-409`
Applied	BEFORE train/val/test split (`cvntrade_autonomous_fe.py:125-200`)
Algorithm	Cumulative sum control chart on log returns
Formula	`S+[t] = max(0, S+[t-1] + (r[t-1] - μ - k))`, alert if `S+ > h` or `S- < -h`
Parameters	`k=0.5σ` (allowance), `h=3.0σ` (threshold, from `CVN_CUSUM_THRESHOLD_H`)
Warmup	100 bars (all marked as transition)
Cooldown	10 bars after each detection
Sigma	Calibrated on training data, cached, NEVER re-fit on test (ADR)
Mode	"Stable mode" — keeps only NON-transition bars
Output	~5% of candles survive (regime change events only)

Checkpoint: Filtered dataframe. ~1,000-1,500 rows from 20,000. 95% data loss.
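
A minimal sketch of the detector described in the table follows. The upward recursion is the formula given above; the downward recursion, the reset of both sums after a detection, and the choice not to accumulate during warmup are assumptions (the standard symmetric form), and the function name is hypothetical.

```python
from typing import List, Sequence


def cusum_transition_mask(returns: Sequence[float], mu: float, sigma: float,
                          k_mult: float = 0.5, h_mult: float = 3.0,
                          warmup: int = 100, cooldown: int = 10) -> List[bool]:
    """Return True for bars flagged as transition (warmup, alert or cooldown)."""
    k = k_mult * sigma   # allowance
    h = h_mult * sigma   # alert threshold
    n = len(returns)
    mask = [False] * n
    s_pos = 0.0
    s_neg = 0.0
    cooldown_left = 0
    for t in range(n):
        if t < warmup:
            mask[t] = True          # warmup bars are all marked as transition
            continue
        if t == 0:
            continue                # no lagged return available yet
        if cooldown_left > 0:
            mask[t] = True
            cooldown_left -= 1
            continue
        dev = returns[t - 1] - mu   # lagged return, as in the formula
        s_pos = max(0.0, s_pos + dev - k)
        s_neg = min(0.0, s_neg + dev + k)   # assumed symmetric counterpart
        if s_pos > h or s_neg < -h:
            mask[t] = True
            s_pos = 0.0             # assumption: restart after a detection
            s_neg = 0.0
            cooldown_left = cooldown
    return mask


toy_returns = [0.0] * 5 + [1.0] * 5 + [0.0] * 5
mask = cusum_transition_mask(toy_returns, mu=0.0, sigma=0.1, warmup=2, cooldown=2)
```

"Stable mode" as described in the table would then keep the bars where the mask is False. Sigma is passed in rather than estimated here, which mirrors the rule that it is calibrated on training data and never re-fit on test.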

CRITICAL BOTTLENECK #1

Input:  20,000 labeled candles
CUSUM:  ~1,000 survive (h=3.0σ)
Output: 1,000 rows → split 70/15/15 → Train: 700, Val: 150, Test: 150

With more aggressive h (e.g. h=4.0σ from env):
Output: ~300 rows → Train: 210, Val: 45, Test: 45

Stage 5: Feature Engineering¶

Aspect	Detail
File	`src/commun/cache/components/cvntrade_autonomous_fe.py:125-170`
Steps	Imputation → Stationarization (ADF/KPSS) → Normalization (StandardScaler)
Feature selection	Variance threshold + adaptive cap
Adaptive cap	`min(80, max(10, n_features / 20))`
Example	300 features → cap = 15-80 → typically 30-50 selected
Fit scope	Pipeline fitted on TRAIN set only, applied to val/test
Feature lag	Disabled (`feature_lag=0`)
Label remapping	3-class: {-1,0,1} → {0,1,2}. Binary: {-1,0} → 0, {1} → 1

Checkpoint: Feature matrix X (n_samples × n_features), label vector y.
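
The adaptive cap formula is easy to evaluate directly. The sketch below implements the expression exactly as written in the table; the function name and the truncation to an integer are assumptions.

```python
def adaptive_feature_cap(n_features: int) -> int:
    """Evaluate min(80, max(10, n_features / 20)) and truncate to an int."""
    return int(min(80, max(10, n_features / 20)))


caps = {n: adaptive_feature_cap(n) for n in (100, 300, 600, 1000, 2000)}
```

Evaluated this way, 300 input columns give a cap of 15, the floor of 10 applies below 200 columns, and the ceiling of 80 is reached at 1,600 columns. Under this formula a cap in the 30-50 range corresponds to roughly 600 to 1,000 input columns.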

BOTTLENECK #2 — Aggressive dimensionality reduction

300+ raw indicators → 30-50 selected features
Potential signal loss: MACD variants, stochastic slopes, volume patterns removed
Model capacity limited by feature count

Stage 6: Train/Val/Test Split¶

Aspect	Detail
File	`src/commun/cache/components/cvntrade_autonomous_fe.py:143-146`
Method	Temporal split 70/15/15 (sequential in time)
Purge	10 bars between train/val boundary (`CVN_PURGE_BARS`)
Embargo	5 bars between val/test boundary (`CVN_EMBARGO_BARS`)
Walk-forward	Supported via FTF folds (5 folds, sliding window)

Checkpoint: Three datasets (X_train, y_train), (X_val, y_val), (X_test, y_test).
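
The split with purge and embargo can be sketched with index ranges. The function name is hypothetical, and carving the purge and embargo gaps out of the start of the validation and test windows (rather than the end of the preceding window) is an assumption; integer percentages are used to avoid floating-point rounding at the boundaries.

```python
from typing import Tuple


def temporal_split_indices(n_rows: int, train_pct: int = 70, val_pct: int = 15,
                           purge_bars: int = 10,
                           embargo_bars: int = 5) -> Tuple[range, range, range]:
    """Sequential train/val/test index ranges with gaps at both boundaries."""
    train_end = n_rows * train_pct // 100
    val_start = train_end + purge_bars           # purge gap after train
    val_end = n_rows * (train_pct + val_pct) // 100
    test_start = val_end + embargo_bars          # embargo gap after val
    if not (0 < train_end and val_start < val_end and test_start < n_rows):
        raise ValueError("not enough rows for the requested split and gaps")
    return range(0, train_end), range(val_start, val_end), range(test_start, n_rows)


train_idx, val_idx, test_idx = temporal_split_indices(1000)
```

On 1,000 rows this yields 700 / 140 / 145 usable rows, slightly fewer than the nominal 700 / 150 / 150 because the gaps are discarded. On very small inputs the gaps consume a noticeable share of the validation and test windows.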

BOTTLENECK #3 — Tiny datasets after CUSUM

With 1,000 post-CUSUM rows:
  Train: 700 samples
  Val: 150 samples
  Test: 150 samples

With 300 post-CUSUM rows (aggressive h):
  Train: 210 samples
  Val: 45 samples
  Test: 45 samples

XGBoost on 210 training samples with 30 features = severe overfitting risk

8.2 Pipeline Stages 7-9 — Deprecated, kept for historical analysis¶

Stage 7: HPO + XGBoost Training (legacy — superseded by harness in PR #891 + #898)¶

Aspect	Detail
File	`src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py:52-250`
Optimizer	Optuna TPESampler (seed=42)
Trials	30 (from `CVN_HPO_N_TRIALS`), timeout 3600s
Pruner	MedianPruner
Warm start	5 startup trials (random), then TPE

Hyperparameter search space (from config, timeframe-dependent):

max_depth:        [3, 8]
learning_rate:    [0.01, 0.3]
n_estimators:     [100, 500]
subsample:        [0.6, 1.0]
colsample_bytree: [0.6, 1.0]
min_child_weight: [1, 10]
gamma:            [0, 5]
reg_alpha:        [0.01, 5.0]  (timeframe-dependent)
reg_lambda:       [0.5, 7.0]   (timeframe-dependent)
threshold_buy:    [0.30, 0.50]
threshold_sell:   [0.30, 0.50]

Objective function (CVN_HPO_OBJECTIVE):

Objective	Formula	Default weights
`precision_recall_auc`	`w_prec × precision_buy + w_auc × auc + w_log × (1-logloss)`	0.45 / 0.35 / 0.20
`fbeta_buy`	`F_β(buy)` with β from `CVN_BUY_BETA`	β=1.0 (F1)
`logloss_auc`	`w × (1-logloss) + (1-w) × auc`	w=0.5 (binary mode)

Guard conditions (kill trial if):

if action_rate < 0.08 or action_rate > 0.60:  → score = 0.0
if recall_buy < 0.15:                          → score = 0.0
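
Combining the `precision_recall_auc` objective with the two guards gives a scoring function along these lines. This is a sketch assuming the default weights from the table; the function name and argument layout are hypothetical and do not reproduce the hyperoptimizer's actual code.

```python
def guarded_score(precision_buy: float, auc: float, logloss: float,
                  action_rate: float, recall_buy: float,
                  w_prec: float = 0.45, w_auc: float = 0.35,
                  w_log: float = 0.20) -> float:
    """Weighted objective, forced to 0.0 when a guard condition trips."""
    if action_rate < 0.08 or action_rate > 0.60:
        return 0.0   # model acts too rarely or too often
    if recall_buy < 0.15:
        return 0.0   # model misses too many BUY samples
    return w_prec * precision_buy + w_auc * auc + w_log * (1.0 - logloss)


score = guarded_score(precision_buy=0.6, auc=0.6, logloss=0.6,
                      action_rate=0.20, recall_buy=0.30)
```

Because a tripped guard returns exactly 0.0, Optuna sees those trials as uniformly bad, with no gradient toward the feasible region.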

BOTTLENECK #4 — HPO guards constrain the solution space

action_rate [8%, 60%] forces model to predict BUY on 8-60% of samples
recall_buy > 15% forces minimum BUY detection
Combined: model must predict BUY often enough but not too often
Result: model converges to "safe middle" with f1_buy ~0.45

Class balancing (XGBoost trainer lines 900-911):

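import numpy as np
from sklearn.utils.class_weight import compute_class_weight

if _is_class_balancing_enabled():  # CVN_CLASS_BALANCING=1
    classes = np.unique(y_train)  # labels already remapped to 0..n-1 (Stage 5)
    class_weights = compute_class_weight("balanced", classes=classes, y=y_train)  # keyword-only args
    sample_weights = np.array([class_weights[int(y)] for y in y_train])
else:
    sample_weights = np.ones(len(y_train))  # NO BALANCING (default)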

BOTTLENECK #5 — Class balancing disabled by default

With 70% HOLD, 15% BUY, 15% SELL:
  Model learns to predict HOLD for most samples (safe bet)
  BUY precision high but recall low → f1_buy ~0.45
  Enabling balancing: 0.05-0.10 improvement expected

Early stopping (trainer line 957):

early_stopping_rounds = config.early_stopping_rounds  # default 150

Calibration (trainer lines 961-963):

if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)
# Options: "none" (default), "isotonic", "platt"

Stage 8: Model Evaluation (legacy — superseded by harness, kept for context)¶

Aspect	Detail
File	`src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:1025-1114`
Metrics	f1_macro, f1_buy, precision_buy, recall_buy, auc_buy, logloss, accuracy, action_rate, fbeta_buy
Evaluation sets	Val (for HPO selection) + Test (for reporting)
Overfit gap	`f1_val - f1_test` (added in #521)

Typical metrics (ATR1.5_3.0_H5, defi_top5):

f1_buy:        0.43-0.49     ← weak
precision_buy: 0.50-0.65     ← decent
recall_buy:    0.20-0.35     ← very low (model too conservative)
auc_buy:       0.55-0.65     ← barely above random (0.50)
action_rate:   0.10-0.30     ← low (few BUY predictions)

Stage 9: MLflow Storage (legacy — superseded by harness, kept for context)¶

Aspect	Detail
File	`src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:353-398`
Model name	`CVNTrade_XGBoost_{CRYPTO}_{STRATEGY}_{TIMEFRAME}`
Artifacts	Model pickle, feature names (tag), feature count (tag), validation metrics (metadata)
Registry	MLflow Model Registry with stages (Staging, Production)
Promotion	Manual via `lock_winner()` (ADR-2: no auto-promotion)

9. Historical Bottlenecks¶

Historical reference only. The bottleneck analysis below was the V1 / V2 committee assessment that motivated the harness refactor (now done) and the FIX A-H action plan (Appendix A, §12). It describes the pre-#891 pipeline ; the harness path has different bottlenecks, currently under diagnosis in CVN-N001-EE-S18 (see §6).

Bottleneck Map¶

                     Raw Data (24,000 candles)
                              │
                    ┌─────────┴─────────┐
                    │  ENRICHMENT        │ → 300+ features
                    └─────────┬─────────┘   No data loss
                              │
                    ┌─────────┴─────────┐
                    │  LABELING          │ → ~20,000 labeled
                    └─────────┬─────────┘   10-20% NaN drop
                              │                              ← BOTTLENECK #6: label dropout
                    ┌─────────┴─────────┐
                    │  CUSUM FILTER      │ → ~1,000 rows     ← BOTTLENECK #1: 95% data loss
                    └─────────┬─────────┘   (h=3.0σ)
                              │
                    ┌─────────┴─────────┐
                    │  FEATURE ENGIN.    │ → 30-50 features   ← BOTTLENECK #2: signal loss
                    └─────────┬─────────┘   (adaptive cap)
                              │
                    ┌─────────┴─────────┐
                    │  SPLIT 70/15/15   │ → Train: 700
                    └─────────┬─────────┘   Val: 150          ← BOTTLENECK #3: tiny datasets
                              │             Test: 150
                    ┌─────────┴─────────┐
                    │  HPO (30 trials)  │ → f1_buy ~0.45      ← BOTTLENECK #4: guards limit
                    └─────────┬─────────┘                     ← BOTTLENECK #5: no class balance
                              │
                    ┌─────────┴─────────┐
                    │  EVALUATION        │ → 1-10 trades/fold
                    └─────────┬─────────┘   (tiny test set)
                              │
                    ┌─────────┴─────────┐
                    │  MLFLOW STORAGE    │
                    └───────────────────┘

Bottleneck Severity Ranking¶

#	Bottleneck	Severity	Impact	Root Cause
1	CUSUM filters 95% of data	CRITICAL	Train/val/test too small for reliable ML	h=3.0-4.0σ too conservative for crypto volatility
2	Feature cap (30-50 features)	HIGH	Model can't learn complex patterns	Adaptive cap `n/20` too aggressive
3	Tiny test sets (7-150 samples)	HIGH	Metrics statistically unreliable, few trades	Consequence of #1
4	HPO guards (action_rate, recall)	MEDIUM	Solution space artificially constrained	Guards too tight for imbalanced data
5	Class balancing disabled	MEDIUM	Model biased toward HOLD	`CVN_CLASS_BALANCING=0` default
6	Label dropout (~15%)	LOW	Reduced training set	Window too short for some candles

10. Roadmap / Open Stories¶

Current in-flight + immediate-next Stories tied to this pipeline :

Story	Status	Owner	Outcome gate
CVN-N001-EE-S16 (harness unification)	✅ Closed — PR #891 merged 2026-05-09, full cutover #899 2026-05-10	`@dococeven`	n/a — closed
CVN-N001-EE-S17 (hyperparameters externalization, ADR-90)	✅ Closed — PR #904 merged 2026-05-12	`@dococeven`	n/a — closed, sunset milestone-anchored on Epic CCP wp#149
CVN-N001-EE-S18 (harness shallow-training diagnostic)	🔄 In flight — plan_review PASSED OK 4298520f, implementation starts post-PR-#910-merge	`@dococeven`	Verdict LOCK / KEEP_AVAILABLE / ABANDON per §6 decision tree
CVN-N001-EE-S19 (remediation, scope TBD by S18 verdict)	⏸️ Pending S18	`@dococeven`	Canary FTF f1_buy ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types
Epic CCP wp#149	⏸️ Planned (immediate successor to training stabilization)	`@dococeven`	Typed schemas + Console UI restore + scoped resolution + snapshots + approval flow + OpenFeature integration

Promotion-gate criteria (S19 closure)¶

Recover the pre-#891 baseline AND demonstrate the harness path matches or exceeds it :

Metric	pre-#891 baseline	S19 closure gate	Method
f1_buy (defi_top5 5m ATR0.5_1.5_H4)	~0.42 on all 3 model types	≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types	FTF canary, 5 folds, bootstrap CI
best_iteration (LightGBM)	100-500 typical	> 50 median	Loki `event=training_complete` query
`event=hpo_fallback_applied` rate	n/a (pre-ADR-90)	0 over 7 days	Grafana `cvntrade-hp-coverage` dashboard
Trades per fold (XGB)	30-100	≥ 30 mean across folds	`finetune_results` PG query

Only when ALL four gates pass on the canary FTF does the operator unfreeze harness-trained model promotion (§2). If S18 returns ABANDON, the promotion-freeze persists indefinitely and the operator pivots to revert.

11. ADR Compliance¶

ADR	Topic	V1	V3 (target)
ADR-2	No auto-promotion	COMPLIANT	COMPLIANT + trade count gate
ADR-4	Cache explicit	COMPLIANT	COMPLIANT
ADR-14	Multi-fold evaluation	COMPLIANT	COMPLIANT (5 folds)
ADR-15	Theta calibrated OOS	COMPLIANT	COMPLIANT + all params OOS
ADR-16	Labels coherent	COMPLIANT	COMPLIANT
ADR-23	Features version-pinned	COMPLIANT	COMPLIANT
ADR-24	Feature set = model contract	COMPLIANT	COMPLIANT
ADR-25	No silent fallback	PARTIAL	COMPLIANT (all fallbacks documented + alerted)
ADR-28	Binary classification	COMPLIANT	COMPLIANT + FTF ablation (#517)
ADR-29	Baseline naïve obligatoire	COMPLIANT	COMPLIANT + edge validation (§1b)
ADR-46	Class balancing	NON-COMPLIANT	COMPLIANT (Fix B: enabled by default)
ADR-47	Meta-label validation	COMPLIANT	COMPLIANT

12. Appendix A — Historical Action Plan (FIX A-H, V1+V2 Committee Recos)¶

Historical action plan that motivated the harness refactor. The FIX A-H items below were the V1 + V2 committee deliverables that drove the V3 plan. Some are now closed (e.g. FIX B — class balancing — integrated into the harness class_balance node ; FIX A — CUSUM decoupling — applied at inference per ADR-46). Others (FIX D trading-centric HPO, FIX F MLOps controls) remain partially open. Do NOT use this as the current action plan — the current open work is in §10. This appendix is kept for the rationale and traceability it provides.

12.1 Committee Verdict & Action Plan¶

V1 Committee Result: REJECTED (STARVATION, 4.0/10)¶

8 recommendations received. ALL addressed below with concrete fixes.

Action Plan — 3 Structural Fixes + 5 Improvements¶

PRIORITY 1 — CRITICAL (blocks everything)
┌─────────────────────────────────────────────────────────────────┐
│  FIX A: Decouple CUSUM from training                            │
│  FIX B: Enable class balancing by default                       │
│  FIX C: Integrate realistic cost model into HPO                 │
└─────────────────────────────────────────────────────────────────┘

PRIORITY 2 — HIGH (multiplier on Fix A/B/C)
┌─────────────────────────────────────────────────────────────────┐
│  FIX D: Trading-centric HPO objective (Sortino-based)           │
│  FIX E: Increase feature cap + regularization                   │
│  FIX F: MLOps operational controls                              │
└─────────────────────────────────────────────────────────────────┘

PRIORITY 3 — MEDIUM (validation & exploration)
┌─────────────────────────────────────────────────────────────────┐
│  FIX G: Test CUSUM/ML paradigm conflict                         │
│  FIX H: Binary classification (#517)                            │
└─────────────────────────────────────────────────────────────────┘

5. FIX A: Decouple CUSUM from Training (committee reco #1, #7)¶

Problem¶

CUSUM filters 95% of training data at h=3.0σ. Result: 20,000 candles → 1,000 → split → Train: 700 samples. XGBoost on 700 samples with 30 features = severe overfitting, unreliable metrics.

Solution: Train on ALL data, CUSUM only at inference¶

CURRENT (broken):
OHLCV → Enrichment → Label → CUSUM FILTER → FE → Split → Train
                              ▲
                              └── 95% data lost HERE

TARGET (fixed):
OHLCV → Enrichment → Label → FE → Split → Train  (CUSUM REMOVED from training)
                                                    │
                                                    ▼ at inference only:
                                            CUSUM → Inference → Filters → Trade

Implementation¶

File: src/commun/cache/components/cvntrade_autonomous_fe.py

Current (~line 125):

# Apply CUSUM before split (issue #295)
filtered_df = self._apply_cusum_before_split(enriched_df)
X_train, X_val, X_test = self._temporal_split(filtered_df)

Target:

# Train on ALL data — CUSUM only at inference (committee reco #1)
cusum_training_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "disabled")
if cusum_training_mode in ("enabled", "relaxed_1_5", "legacy_3_0"):
    filtered_df = self._apply_cusum_before_split(enriched_df)
else:
    filtered_df = enriched_df  # NO CUSUM filtering for training
X_train, X_val, X_test = self._temporal_split(filtered_df)

New env var: CVN_CUSUM_TRAINING_MODE — canonical variants: {disabled, relaxed_1_5, legacy_3_0} (default: disabled)

FTF factor: cusum_training_mode with variants {disabled, relaxed_1_5, legacy_3_0} to A/B test the impact.
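
The variant names suggest a mapping from training mode to CUSUM threshold, which the snippet above does not show. The sketch below is one way to express it, not the code in cvntrade_autonomous_fe.py: the helper name `resolve_cusum_training_threshold` and the decision to reject unknown values are assumptions. Only the env var name and the three canonical variants come from the text. Note that the value "enabled", accepted by the snippet above, is not a canonical variant, so this sketch would reject it.

```python
import os
from typing import Mapping, Optional

# Variant -> CUSUM threshold h (in sigmas) applied to TRAINING data.
# None means no CUSUM filtering during training.
CUSUM_TRAINING_THRESHOLDS = {
    "disabled": None,
    "relaxed_1_5": 1.5,
    "legacy_3_0": 3.0,
}


def resolve_cusum_training_threshold(
    env: Mapping[str, str] = os.environ,
) -> Optional[float]:
    """Return the training-time CUSUM threshold, or None when disabled."""
    mode = env.get("CVN_CUSUM_TRAINING_MODE", "disabled")
    if mode not in CUSUM_TRAINING_THRESHOLDS:
        # Hypothetical policy: fail loudly rather than silently disabling.
        raise ValueError(f"Unknown CVN_CUSUM_TRAINING_MODE: {mode!r}")
    return CUSUM_TRAINING_THRESHOLDS[mode]
```

Passing the environment as a mapping keeps the function testable without touching the process environment.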

Expected Impact¶

Metric	Before (h=3.0σ)	After (no CUSUM training)
Training samples	~700	~14,000 (20×)
Val samples	~150	~3,000 (20×)
Test samples	~150	~3,000 (20×)
f1_buy	~0.45	Target: 0.55-0.65
Trades per fold	1-10	Target: 30-100

CUSUM remains at inference¶

CUSUM is still applied at inference time (backtest candle loop, line 869). It gates which candles are processed by the ML model. The model is trained on ALL data but only makes predictions on CUSUM-validated candles.

Rationale: The model learns patterns from the full distribution but only acts during regime transitions. This resolves Hypothesis G (CUSUM/ML paradigm conflict) — the model is no longer trained on a biased subset.

Paradigm Test (committee reco #7)¶

To validate this architectural change, run 3 variants via FTF:

Variant	Training data	Inference CUSUM
`disabled`	ALL candles	YES (h=3.0σ)
`relaxed_1_5`	CUSUM h=1.5σ	YES (h=3.0σ)
`legacy_3_0`	CUSUM h=3.0σ	YES (h=3.0σ)

Compare Sortino, f1_buy and trades/fold across the three variants. If disabled wins → the architectural fix is confirmed.

6. FIX B: Enable Class Balancing by Default (committee reco #2)¶

Problem¶

CVN_CLASS_BALANCING=0 (disabled). With 70% HOLD, 15% BUY, 15% SELL, the model learns to predict HOLD for safety. BUY recall = 0.25 (misses 75% of opportunities). ADR-46 non-compliant.

Solution¶

Change default to CVN_CLASS_BALANCING=1 in BASE_ENV (ablation_matrix.py).

File: src/commun/finetune/ablation_matrix.py

BASE_ENV = {
    ...
    "CVN_CLASS_BALANCING": "1",  # Changed from "0" — ADR-46 compliance
    ...
}

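Effect: compute_class_weight("balanced") assigns each class the weight n_samples / (n_classes × class_count), so minority classes are up-weighted. The weights reported here are BUY ~3.3×, SELL ~1.5×, HOLD ~0.7×. An exact 70/15/15 split would instead give about 2.2× for BUY and SELL and 0.48× for HOLD, so the reported weights presumably reflect the observed, asymmetric label counts rather than the rounded split quoted in the Problem statement.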
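
The balanced-weight formula can be checked by hand. The sketch below reimplements it in plain Python so that it runs without scikit-learn; the function name `balanced_class_weights` is hypothetical, and the labels are the rounded 70/15/15 split from the Problem statement, not measured data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels only: the rounded 70/15/15 split, not measured data.
labels = ["HOLD"] * 70 + ["BUY"] * 15 + ["SELL"] * 15
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})
# -> {'HOLD': 0.48, 'BUY': 2.22, 'SELL': 2.22}
```

Weights that differ between BUY and SELL can only arise when the two classes have different counts.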

Expected Impact¶

Metric	Before (no balance)	After (balanced)
recall_buy	~0.25	Target: 0.40-0.55
precision_buy	~0.55	May decrease to 0.45-0.50
f1_buy	~0.45	Target: 0.50-0.55
action_rate	~0.15	Target: 0.20-0.35

Trade-off: More BUY predictions → more trades → better statistical power BUT potentially more false positives. The filters (trend, meta-label, regime) catch false positives.

7. FIX C: Realistic Cost Model in HPO (committee reco #3)¶

Problem¶

HPO optimizes precision_recall_auc which ignores transaction costs entirely. A model with f1_buy=0.50 and 30 trades at 15bps cost may have negative expectancy. HPO doesn't know.

Solution: Cost-aware HPO objective¶

New objective: sortino_net — backtest Sortino after costs.

# In hyperoptimizer, after model training:
# 1. Quick backtest on validation set with cost model
# 2. Compute net Sortino
# 3. Use as HPO score

def _compute_sortino_net(model, X_val, y_val, ohlcv_val, cost_bps=15):
    """Quick backtest for HPO scoring — net of costs."""
    predictions = model.predict(X_val)
    trades = _simulate_trades(predictions, ohlcv_val, cost_bps)
    return _sortino_ratio(trades)
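
The `_sortino_ratio` helper referenced above is not shown in this document. A minimal sketch of what such a helper might compute is given below, assuming it receives per-trade returns as fractions and uses the standard definition (mean excess return divided by downside deviation, no annualization). The names `sortino_ratio` and `net_returns` and the flat per-trade cost are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def sortino_ratio(returns: Sequence[float], target: float = 0.0) -> Optional[float]:
    """Mean excess return divided by downside deviation (no annualization)."""
    if not returns:
        return None
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    # Downside deviation: RMS of negative excess returns, zero elsewhere.
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return None  # undefined when no trade falls below the target
    return mean_excess / downside


def net_returns(gross_returns: Sequence[float], cost_bps: float) -> List[float]:
    """Subtract a flat per-trade cost (in basis points) from each return."""
    return [r - cost_bps / 10_000 for r in gross_returns]
```

Returning None for the undefined cases lets the caller decide how to score them, in the same way the objective handler below maps None to 0.0.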

Cost components:
- Base fee: CVN_TRADE_FEE_BPS (default 15)
- Slippage: base_bps + impact × √(size/volume) (from cost_model.py)
- Funding rate: CVN_FUNDING_RATE_BPS × expected hold duration (from Binance API)
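
The three components can be combined as in the sketch below. This is an illustration of the formulas as written here, not the contents of cost_model.py: the function name, the parameter names, and the choice to express the hold duration in funding periods are all assumptions.

```python
import math


def trade_cost_bps(
    fee_bps: float,
    base_slippage_bps: float,
    impact: float,
    size: float,
    volume: float,
    funding_rate_bps: float,
    hold_periods: float,
) -> float:
    """Total per-trade cost in basis points: fee + slippage + funding."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    # Square-root impact: cost grows with the share of volume consumed.
    slippage = base_slippage_bps + impact * math.sqrt(size / volume)
    # Assumption: hold duration is measured in funding periods.
    funding = funding_rate_bps * hold_periods
    return fee_bps + slippage + funding


# A 15 bps fee, 2 bps base slippage, an order worth 1% of volume, and
# half a funding period at 1 bps per period.
print(round(trade_cost_bps(15, 2, 10, 1, 100, 1, 0.5), 2))  # -> 18.5
```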

FTF factor: hpo_objective already has variants {fbeta_buy, precision_recall_auc, f1_macro}. Add sortino_net.

Implementation¶

File: src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py

Add new objective handler:

elif objective == "sortino_net":
    # Quick backtest on the validation fold, net of trading costs
    net_sortino = _quick_backtest_sortino(model, datasets, cost_bps=15)
    if net_sortino is None or np.isnan(net_sortino):
        return 0.0
    # Clamp to [0, 1]: a net Sortino of 10 or more maps to 1.0
    return min(1.0, max(0.0, net_sortino / 10.0))

8. FIX D: Trading-Centric HPO & Relaxed Guards (committee reco #4)¶

Problem¶

HPO guards kill promising trials:
- action_rate < 0.08 → score = 0 (too few BUY predictions)
- action_rate > 0.60 → score = 0 (too many BUY predictions)
- recall_buy < 0.15 → score = 0 (not enough BUY detection)

These guards constrain the solution space. With class balancing (Fix B), the model will naturally have higher action_rate.

Solution¶

Relax guards and shift to trading-centric scoring:

Current guards:

if action_rate < 0.08 or action_rate > 0.60: return 0.0
if recall_buy < 0.15: return 0.0

Target guards (relaxed):

if action_rate < 0.03 or action_rate > 0.80: return 0.0  # Much wider
if recall_buy < 0.05: return 0.0  # Minimal floor

Objective shift: From classification-centric to trading-centric:

Objective	Current	Target
Primary metric	precision × 0.45 + AUC × 0.35 + (1-logloss) × 0.20	Sortino net (after costs)
Guard: action_rate	[0.08, 0.60]	[0.03, 0.80]
Guard: recall_buy	> 0.15	> 0.05
Guard: n_trades	none	> 10 (minimum trades for valid backtest)
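
The Target column, including the new minimum-trade guard, could be gathered in one place as sketched below. `GuardConfig` and `passes_guards` are hypothetical names. The defaults simply mirror the table and would need to be resolved from configuration rather than hard-coded wherever the ADR-90 rule applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardConfig:
    """Relaxed guard thresholds, mirroring the Target column."""
    min_action_rate: float = 0.03
    max_action_rate: float = 0.80
    min_recall_buy: float = 0.05
    min_trades: int = 10


def passes_guards(action_rate, recall_buy, n_trades, cfg=GuardConfig()):
    """True when a trial clears every guard; otherwise it scores 0."""
    if action_rate < cfg.min_action_rate or action_rate > cfg.max_action_rate:
        return False
    if recall_buy < cfg.min_recall_buy:
        return False
    # The table requires strictly more than 10 trades.
    return n_trades > cfg.min_trades
```

Keeping the thresholds in a single object makes it straightforward to A/B test the legacy and relaxed guard sets.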

9. FIX E: Feature Selection Strategy (committee reco #5)¶

Problem¶

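Adaptive cap: min(80, max(10, n_features / 20)) → typically 30-50 features kept out of 300+ candidates (the formula yields 30-50 for roughly 600-1,000 candidates, and only 15 at exactly 300). Too aggressive: it removes signal.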

Solution¶

Increase cap: min(150, max(30, n_features / 10)) → typically 80-150 features
Rely on XGBoost regularization instead of pre-selection:
reg_alpha (L1) already in HPO range [0.01, 5.0]
reg_lambda (L2) already in HPO range [0.5, 7.0]
max_depth [3, 8] limits tree complexity
XGBoost handles feature selection internally via feature_importance
OOS feature selection (strict): Feature importance computed ONLY on training fold, applied to val/test

Implementation:
- src/commun/cache/components/cvntrade_autonomous_fe.py: change the cap formula
- BASE_ENV: keep CVN_MAX_FEATURES=0 (0 = auto), but change the auto formula
- FTF factor n_features already tests {top_30, top_50, top_100, full}
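
To see what the two cap formulas yield for different candidate counts, they can be evaluated side by side. The function names are hypothetical, and the use of integer division for rounding is an assumption; the formulas themselves are the ones quoted in this section.

```python
def legacy_feature_cap(n_features: int) -> int:
    """Current adaptive cap: min(80, max(10, n_features / 20))."""
    return min(80, max(10, n_features // 20))


def proposed_feature_cap(n_features: int) -> int:
    """Proposed adaptive cap: min(150, max(30, n_features / 10))."""
    return min(150, max(30, n_features // 10))


for n in (300, 600, 1000, 1500):
    print(n, legacy_feature_cap(n), proposed_feature_cap(n))
# -> 300 15 30
# -> 600 30 60
# -> 1000 50 100
# -> 1500 75 150
```

The proposed cap reaches its 150-feature ceiling only at 1,500 or more candidates.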

Expected Impact¶

More features → more signal → BUT only if regularization prevents overfitting. With Fix A (20× more training data), overfitting risk is dramatically reduced. 700 samples + 150 features = overfit. 14,000 samples + 150 features = healthy ratio (93:1).

10. FIX F: MLOps Operational Controls (committee reco #6)¶

Kill-Switch¶

Mechanism	Scope	Control
`CVN_FTF_ENABLED=0`	All FTF runs	Helm ConfigMap
DAG pause	Stop specific DAG	Airflow UI
Model rollback	Revert to previous model	MLflow: promote previous Production stage
Filter disable	Bypass specific filter	`CVN_USE_{FILTER}=0`

Live Observability¶

What	Where	Alert
Model f1/Sortino per run	Grafana FTF dashboard (§10 drift)	>10% drop → warning
CUSUM pass rate	Grafana infra dashboard	<1% → too aggressive
Funnel survival rate	Grafana FTF funnel panel	<5% → starvation
Training sample count	Grafana FTF stats	>15% drop → data issue
Action rate drift	Grafana FTF ML metrics	>20% change → model shift

Drift Detection¶

Type	Method	Trigger
Data drift	PSI on top features (30d rolling)	PSI > 0.2
Concept drift	F1/Sortino trend across runs	>10% drop from 7d mean
Label drift	BUY/HOLD/SELL ratio shift	>20% change
Behavior drift	Action rate, filter block rates	>20% change
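
The data-drift trigger relies on the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), taken over the bin shares of a reference sample (e) and a recent sample (a). The sketch below is one possible implementation and not the pipeline's own: the function name, the equal-width binning, and the `eps` floor for empty bins are assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a zero division.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Bin edges are derived from the reference sample only, so a shifted recent sample piles up in the edge bins and raises the index.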

Staged Rollout¶

Stage	Environment	Duration	Gate
FTF ablation	Backtest	2-3h/factor	BH p < 0.05
Committee review	Document	1 session	Score ≥ 8
Shadow mode	Paper trading	7 days	No degradation
Canary	Live 10% capital	14 days	Sortino ≥ baseline
Full rollout	Live 100%	—	Monitoring continues

11. FIX G: CUSUM/ML Paradigm Test (committee reco #7)¶

Experimental Design¶

3-arm study via FTF cusum_training_mode factor:

Arm	Training CUSUM	Inference CUSUM	Hypothesis
A: `disabled`	None (all data)	h=3.0σ	ML learns full distribution, CUSUM gates inference
B: `relaxed_1_5`	h=1.5σ	h=3.0σ	Partial filter, more data than current
C: `legacy_3_0`	h=3.0σ	h=3.0σ	Current baseline

Success Criteria¶

Metric	Minimum acceptable	Target
f1_buy	0.50	0.60
Sortino (15bps)	1.0	2.0
Trades per fold	20	50+
Powered variants	all 3	all 3

Expected Outcome¶

Arm A should dominate:
- 20× more training data → better model generalization
- CUSUM at inference → same signal quality (only regime-change trades)
- f1_buy improvement: +0.10-0.20 (from 0.45 to 0.55-0.65)

If Arm A does NOT dominate → fundamental signal-to-noise issue (Hypothesis F). Requires alternative approaches (different features, different model architecture, different timeframe).

12. FIX H: Binary Classification (committee reco #8)¶

Already implemented as FTF factor classification_mode (issue #517):

Variant	Config	Expected Impact
`3class` (baseline)	SELL/HOLD/BUY	Current performance
`binary_balanced`	BUY/NOT_BUY, logloss_auc w=0.5	+precision (focused decision boundary)
`binary_precision`	BUY/NOT_BUY, logloss_auc w=0.7	+precision (calibration-biased)

Rationale: 3-class wastes model capacity learning SELL patterns we don't act on (LdP pipeline only opens long positions). Binary focuses 100% of capacity on the BUY decision.

13. Appendix B — V2 Stricter Recos (historical)¶

13. Walk-Forward Leakage Prevention (V2 reco #5)¶

Every component in the pipeline must be strictly in-sample for each walk-forward fold:

Fold k timeline:
├── [                   Train window                   ]──[Purge]──[  Val  ]──[Emb]──[  Test  ]
│    FE fitted here only                                   10 bars   15%        5 bars   15%
│    CUSUM sigma calibrated here (lagged 500 bars)
│    Feature selection on train only
│    Class weights computed on train only
│    Thresholds optimized on val only
│    Test: NEVER seen during training or tuning
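
The fold layout above can be expressed as index ranges. The sketch below assumes that the 15% shares are fractions of the fold's total bars and that the purge and embargo gaps are carved out of the space before the validation and test windows. The names `FoldBounds` and `purged_split` are hypothetical.

```python
from typing import NamedTuple


class FoldBounds(NamedTuple):
    train: range
    val: range
    test: range


def purged_split(n_bars, val_frac=0.15, test_frac=0.15,
                 purge_bars=10, embargo_bars=5) -> FoldBounds:
    """Chronological train/val/test index ranges with gaps between them."""
    n_test = round(n_bars * test_frac)
    n_val = round(n_bars * val_frac)
    test_start = n_bars - n_test
    val_end = test_start - embargo_bars   # embargo gap before the test window
    val_start = val_end - n_val
    train_end = val_start - purge_bars    # purge gap before validation
    if train_end <= 0:
        raise ValueError("not enough bars for the requested split")
    return FoldBounds(range(0, train_end),
                      range(val_start, val_end),
                      range(test_start, n_bars))


bounds = purged_split(20_000)
print(len(bounds.train), len(bounds.val), len(bounds.test))
# -> 13985 3000 3000
```

Anything fitted per fold (FE pipeline, feature selection, class weights) would then see only `bounds.train`.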

Per-Component Leakage Audit¶

Component	Leakage vector	Current status	V3 fix
Labels	Future prices in TP/SL window	SAFE — window T+1..T+H, anti-look-ahead	—
Enrichment	Indicators use future data	SAFE — all backward-looking (SMA, RSI, etc.)	—
CUSUM sigma	Fit on full dataset	TO FIX — sigma sees test data volatility	Lagged window: `train_start - 500 bars`
FE pipeline (imputer, scaler, stationarizer)	Fit on val/test data	SAFE — fitted on train split only	—
Feature selection	Importance on full dataset	TO VERIFY — may see test features	Force: `feature_importance(X_train, y_train)` only
Class weights	Computed on val/test	SAFE — `compute_class_weight(y_train)`	—
HPO	Optimizes on test	SAFE — optimizes on val, test is holdout	—
Walk-forward thresholds	Threshold optimized on test	SAFE — walk-forward uses val for threshold, test for evaluation	—
Meta-label model	Trained on same fold	SAFE — ADR-47 requires separate fold	—

Verification Protocol¶

Before each sprint:
1. Code audit: grep -rn "X_test\|y_test" src/training/ (expect no test data referenced in training code; -r is needed to search the directory recursively)
2. FE pipeline: verify pipeline.fit(X_train), not pipeline.fit(X_full)
3. Feature selection: verify importance is computed on (X_train, y_train) only
4. CUSUM sigma: verify sigma_window_end < train_start - purge_bars

14. Uncertainty Quantification (V2 reco #8)¶

All key metrics must be reported with confidence intervals to quantify statistical reliability.

Methods¶

Metric	CI Method	Implementation
f1_buy	Bootstrap (10,000 resamples)	`ablation_stats.py:bootstrap_ci()` — already implemented
Sortino	Bootstrap (10,000 resamples)	Same — already implemented
Total return	Bootstrap	Same — already implemented
Win rate	Wilson score interval	New (robust for small n and for win rates near 0 or 1)
Expectancy	Bootstrap on trade PnL distribution	New
Trades per fold	Poisson CI	New — for count data
Survival rate	Binomial CI (`proportion_confint`)	New
Block rate per filter	Binomial CI	New
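
For the win-rate row, the Wilson score interval has a closed form. The sketch below uses only the standard library; the function name is hypothetical and it is not the planned implementation.

```python
import math
from statistics import NormalDist


def wilson_interval(wins: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion wins / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Unlike the plain normal-approximation interval, the Wilson bounds stay inside [0, 1] even when every trade wins or loses, which matters at the 1-10 trades per fold discussed in this document.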

Reporting Standard¶

Every metric in FTF reports and Grafana dashboards includes:

Sortino: 1.75 [1.43, 2.10] (95% CI, n=51 runs)

If CI includes 0 → metric is not statistically distinguishable from zero. Flag in report.

If CI width > mean → metric is unreliable. Flag as "WIDE CI — insufficient data".
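
The two flagging rules can be sketched together with a percentile bootstrap. This is not the `bootstrap_ci()` function in ablation_stats.py, whose signature is not shown here. The names `percentile_bootstrap_ci` and `ci_flags`, the fixed seed, and the flag strings are assumptions.

```python
import random


def percentile_bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lo, hi


def ci_flags(point, lo, hi):
    """Apply the two reporting rules to a point estimate and its CI."""
    flags = []
    if lo <= 0.0 <= hi:
        flags.append("NOT DISTINGUISHABLE FROM ZERO")
    if (hi - lo) > abs(point):
        flags.append("WIDE CI")
    return flags


print(ci_flags(1.75, 1.43, 2.10))  # -> []
```

The example reuses the Sortino figures from the reporting standard: the interval excludes zero and is narrower than the mean, so no flag is raised.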

BH Correction for Multiple Comparisons¶

Already implemented: ablation_stats.py:benjamini_hochberg() applied to all pairwise variant comparisons. Controls false discovery rate at 5%.
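
For reference, the Benjamini-Hochberg step-up procedure itself is short. The sketch below is a generic version, not the code in ablation_stats.py, and the function name `bh_reject` is hypothetical.

```python
def bh_reject(p_values, fdr=0.05):
    """Benjamini-Hochberg: return True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


print(bh_reject([0.01, 0.04, 0.03, 0.20]))
# -> [True, False, False, False]
```

Because the procedure is step-up, one p-value that passes at a high rank also rejects every smaller p-value, even those that miss their own rank threshold.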

15. Minimum Trade Count Gates (V2 reco #7)¶

Problem¶

With 1-10 trades per fold, ALL metrics are unreliable. A single trade can swing Sortino from 0 to 100.

Trade count thresholds (committed)¶

Threshold	Count	Purpose	Action if below
Statistical minimum	30	Minimum for bootstrap CI to be meaningful	Flag: "UNDERPOWERED — CI unreliable"
Power analysis	63	d=0.5, α=5%, power=80% (from `compute_min_sample_size()`)	Flag: "INSUFFICIENT POWER for pairwise comparison"
Production minimum	100	Robust strategy evaluation	Gate: do not promote to production
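
The 63-trade figure matches the usual normal-approximation formula for a two-sided, two-sample comparison: n = 2 · ((z₁₋α/₂ + z_power) / d)² per group. The sketch below reproduces it. Whether `compute_min_sample_size()` uses exactly this formula is an assumption, and the function name here is hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(effect_size=0.5, alpha=0.05, power=0.80) -> int:
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(min_sample_size())  # -> 63
```

Smaller effects need far more trades: required n scales with 1/d², so halving the detectable effect size roughly quadruples the count.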

Implementation¶

FTF report: Already flags underpowered variants. Strengthen: exclude < 30 trades from variant ranking.
Grafana: n_trades >= 3 filter (outlier protection). Add panel showing trade count per variant.
Promotion gate: Model cannot be promoted to Production in MLflow if any evaluation fold has < 30 trades.

Expected post-Fix-A trade counts¶

With CUSUM removed from training (Fix A):
- Training: 14,000 samples (20× current)
- Model sees more examples → better action_rate calibration
- Expected: 30-100 trades per fold (vs 1-10 current)

16. Operational Controls & Runbooks (V2 reco #6)¶

Kill-Switch Hierarchy¶

Level	Mechanism	Scope	Activation	Response time
L0: Emergency halt	`CVN_FTF_ENABLED=0` in Helm	All FTF + trading	Helm deploy	<5 min
L1: DAG pause	Airflow UI → pause DAG	Specific DAG	Immediate	<1 min
L2: Model rollback	MLflow: promote previous Production	One crypto	Manual	<10 min
L3: Filter bypass	`CVN_USE_{FILTER}=0` in Helm	One filter	Helm deploy	<5 min

Configuration Audit Trail¶

Every config change is traceable:
- Git: Helm values changes → PR → CodeRabbit → merge (audit via git log)
- MLflow: Model promotion → registry (version, timestamp, user)
- Committee: Design decisions → committee/sessions/*.json (ADR-52)
- Airflow: DAG runs → execution logs (who triggered, when, params)

Rollback Procedures¶

Scenario	Steps	RTO
Bad model deployed	1. Pause DAG 2. MLflow: promote previous version 3. Verify in paper trading	15 min
Bad config deployed	1. `git revert` the Helm values PR 2. CI/CD auto-deploy 3. Verify pods restarted	10 min
Pipeline data corruption	1. Identify bad run_id 2. `DELETE FROM finetune_results WHERE run_id = X` 3. Re-run	30 min
Full system failure	1. Emergency halt (L0) 2. Investigate 3. Fix + committee review 4. Staged re-deploy	2-4h

Runbooks (to create in `documentation/RUNBOOKS/`)¶

Runbook	Trigger	First 3 checks
`MODEL_DEGRADATION.md`	f1/Sortino drops >10%	1. Data freshness 2. Feature drift (PSI) 3. Label distribution
`STARVATION.md`	Survival rate < 5%	1. CUSUM threshold 2. Action rate 3. Filter block rates
`HPO_FAILURE.md`	HPO returns score=0 for all trials	1. Action rate guards 2. Class distribution 3. Feature count
`DATA_PIPELINE.md`	n_train_samples drops >15%	1. Binance API status 2. ETL logs 3. Date range
`COST_SPIKE.md`	Avg cost > 50bps	1. Market liquidity 2. Funding rate 3. Slippage model params

14. Appendix C — Market Hypothesis & Edge (V2 reco #9)¶

What edge are we exploiting?¶

The system targets short-term mean-reversion at regime transition points in DeFi altcoin markets.

Thesis: When a DeFi altcoin's volatility regime shifts (detected by CUSUM), the initial price reaction overshoots. The ML model identifies overshooting candles where the probability of a profitable mean-reversion trade (TP hit before SL) exceeds the breakeven probability after costs.
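
The breakeven probability in the thesis follows from setting net expectancy to zero: p · TP − (1 − p) · SL − cost = 0, hence p = (SL + cost) / (TP + SL). The sketch below assumes every trade ends at either TP or SL (no time-barrier exits) and that TP, SL and the round-trip cost are expressed as fractions of entry price. The function names are hypothetical.

```python
def net_expectancy(p: float, tp: float, sl: float, cost: float) -> float:
    """Expected net PnL per trade for win probability p."""
    return p * tp - (1 - p) * sl - cost


def breakeven_probability(tp: float, sl: float, cost: float) -> float:
    """Win probability at which net expectancy per trade is zero."""
    if tp <= 0 or sl <= 0:
        raise ValueError("tp and sl must be positive")
    return (sl + cost) / (tp + sl)


# Example: 2% take-profit, 1% stop-loss, 15 bps cost per trade.
p_star = breakeven_probability(0.02, 0.01, 0.0015)
print(round(p_star, 4))  # -> 0.3833
```

Without costs the same TP/SL pair breaks even at p = 1/3, so costs raise the bar by about five percentage points in this example.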

Why this edge should exist¶

DeFi altcoin microstructure: Lower liquidity than BTC/ETH → larger overreactions to regime shifts → mean-reversion opportunity.
Retail-dominated order flow: DeFi tokens have higher retail participation → predictable behavioral patterns (panic selling on vol spikes, FOMO buying on breakouts).
CUSUM as regime detector: CUSUM identifies structural breaks in volatility — these are exactly the moments when market participants overreact and create temporary mispricings.
Triple barrier as target: The ATR-based SL/TP captures the mean-reversion: TP is set at the "fair" reversion level, SL limits the cost of being wrong.

Why it hasn't worked yet¶

The edge exists in principle but the pipeline has structural issues that prevent the model from capturing it:
- CUSUM during training eliminates 95% of examples → model can't learn the full distribution of regime transitions
- Class imbalance → model defaults to HOLD instead of predicting BUY at transition points
- HPO objective optimizes classification accuracy, not trading profit → model maximizes F1 but generates few trades
- Feature cap removes market microstructure features (volume patterns, order flow proxies) that capture the edge

How we validate the edge¶

Statistical test: Sortino of our model > Sortino of random entry at same CUSUM-filtered candles (ADR-29 baseline)
Economic test: Net expectancy per trade > 0 after costs at 30 bps
Robustness test: Edge persists across 5 cryptos, 5 folds, 3 cost scenarios
Decay test: Edge doesn't degrade significantly over recent folds (concept drift monitoring)

Success criteria for edge validation¶

Metric	Minimum	Target	Method
Sortino (15bps)	> 1.0	> 2.5	FTF ablation, 5 folds
Net expectancy/trade	> 0	> 0.5%	After costs, per trade
vs random baseline	> 1.5×	> 3×	ADR-29 comparison
Edge stability (fold variance)	CV < 100%	CV < 50%	Cross-fold coefficient of variation
Trades per fold	> 30	> 50	Statistical power for evaluation

15. Appendix D — Pre-Harness Implementation Roadmap (historical)¶

Sprint 1: Critical Fixes — DEADLINE: 1 week¶

Fix	Change	Effort	Impact	Gate
A: Decouple CUSUM	`cvntrade_autonomous_fe.py` + env var	20 lines	20× training data	FTF run confirms >5000 train samples
B: Enable class balancing	`ablation_matrix.py` BASE_ENV	1 line	+0.10 f1_buy	ADR-46 compliant
C: Cost model validation	Verify cost_model.py integrated	0 (audit)	Cost-aware eval	Expectancy computed in backtest
FTF factors	Add `cusum_training_mode`	15 lines	A/B test paradigm	Factor appears in DAG dropdown
Leakage audit	Code review per §13 checklist	2h	Experimental validity	All items SAFE or FIXED

Sprint 2: HPO & Features — DEADLINE: 1 week¶

Fix	Change	Effort	Impact	Gate
D: `sortino_net` objective	`XGBoost_hyperoptimizer.py`	40 lines	Trading-centric HPO	HPO uses backtest Sortino as score
D: Relax guards	`XGBoost_hyperoptimizer.py`	5 lines	Wider solution space	action_rate [0.03, 0.80]
E: Increase feature cap	`cvntrade_autonomous_fe.py`	5 lines	More signal	100-150 features available
Uncertainty CIs	`ablation_report.py`	30 lines	Statistical rigor	All metrics reported with CI

Sprint 3: Validation & Operations — DEADLINE: 1 week¶

Fix	Change	Effort	Impact	Gate
G: Paradigm test (3-arm)	FTF run `cusum_training_mode`	3h compute	Validate Fix A	Arm A Sortino > Arm C
H: Binary classification	FTF run `classification_mode`	3h compute	Validate binary	Compare to 3-class baseline
F: MLOps controls	Grafana alerts + 5 runbooks	2 days	Operational safety	Runbooks reviewed
Edge validation	Compare to random baseline (ADR-29)	1h analysis	Confirm economic edge	Sortino > 1.5× random

Success Gate — Hard Criteria¶

After Sprint 3, ALL must be met to proceed to production:

Metric	Current	Gate (minimum)	Target	Method
f1_buy	0.45	≥ 0.50	0.60	FTF ablation, bootstrap CI
Sortino (15bps)	1.5	≥ 1.5	2.5	FTF, CI excludes 0
Trades per fold	1-10	≥ 30	50+	All folds, all cryptos
recall_buy	0.25	≥ 0.30	0.45	FTF ablation
Net expectancy/trade	unknown	> 0	> 0.5%	After 15bps costs
vs random baseline	unknown	> 1.5×	> 3×	ADR-29 comparison
Powered variants	rare	all	all	Power analysis (63 trades)
Edge stability (CV)	unknown	< 100%	< 50%	Cross-fold Sortino variance

If gate not met: Escalate to architectural review. Options:
1. Alternative models (LightGBM, transformer, LSTM)
2. Alternative features (order book, funding rate dynamics, cross-asset)
3. Alternative timeframes (1h, 4h)
4. Alternative strategy (momentum instead of mean-reversion)

16. Files Reference¶

File	Lines	Purpose
`src/ETL/cvntrade_external_data.py`	20-132	Binance API, funding rates
`src/ETL/cvntrade_label.py`	188-290	Triple barrier labeling
`src/commun/pipeline/enrichment_api.py`	82-90	OHLCV → indicators
`src/commun/pipeline/feature_engineering_api.py`	1-150	FE pipeline
`src/commun/cache/components/cvntrade_autonomous_fe.py`	125-200	CUSUM + split + FE (Fix A target)
`src/backtest/filters/cvntrade_cusum_filter.py`	279-409	CUSUM algorithm
`src/training/harness/__init__.py`	—	Harness public API (`train_one`, `train_ensemble`, `run_hpo`, `train_with_fixed_params`)
`src/training/harness/contracts.py`	—	Typed payloads (`Datasets`, `HPOParams`, `SplitMetrics`, `TrainedArtifact`, `FeatureVersion`)
`src/training/harness/registry.py`	—	Plugin registries + `register_model` / `register_ensemble` decorators
`src/training/harness/dags/models/{xgboost,lightgbm,catboost}_dag.py`	—	One file = one model. HPO search space + Hamilton DAG nodes
`src/training/harness/dags/ensembles/{blend_avg,stack_logreg,stack_xgb_meta}_dag.py`	—	One file = one ensemble
`src/training/harness/adapters/{xgb,lgb,cb,ensemble}.py`	—	Per-model `predict_proba` shims (Protocol implementations)
`src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py`	—	Reused Hamilton-pure functions (the 85% shared code)
`src/training/harness/autonomous_model_trainer.py`	—	Generic cache-aware wrapper (any registered model)
`src/training/harness/autonomous_ensemble_trainer.py`	—	Generic cache-aware wrapper (any registered ensemble)
`src/training/cvntrade_autonomous_orchestrator.py`	45-160	Orchestration — dispatches to the two generic wrappers above
`src/commun/config/cost_model.py`	—	Non-linear slippage model
`src/commun/finetune/ablation_matrix.py`	37-97	BASE_ENV (Fix B target)
`src/commun/mlflow/cvntrade_mlflow_manager.py`	—	Model registry
`documentation/ADR.md`	744-799	ADR-43/44/45 (funnel)
