Architecture — Training Pipeline¶
Version: 3.3 Date: 2026-05-13 (V3.3 — restructured for committee review : flat structure, harness-first, S18 isolated, legacy demoted) Status: Canonical harness path in production ; model promotion blocked pending S18-S19 regression diagnosis
Status decomposition (committee gate)¶
| Domain | Status | Anchor |
|---|---|---|
| Harness architecture | production canonical (since PR #891 + #898, 2026-05-10) | §4 Harness Path |
| Hyperparameters externalization | production via ADR-90 (since PR #904, 2026-05-12) | §5 Hyperparameter Resolution (ADR-90) |
| Harness-trained model promotion | frozen — promotion blocked until S18-S19 close | §2 Current Status & Promotion Freeze |
| S18 regression diagnostic | in flight (plan_review PASSED OK 4298520f, OP wp#154) | §6 Known Regression — S18 |
1. Executive Summary¶
The training pipeline is now canonically routed through the Hamilton-based training harness for XGBoost, LightGBM and CatBoost. Since PR #904, all numeric hyperparameter defaults and HPO ranges are resolved at runtime from PG ftf_config.base_env via ADR-90 ; source-code hyperparameter values are forbidden and enforced by CI gate G5.
However, the post-harness training path is currently under diagnosis. The S17 canary FTF run exposed shallow-training behavior : LightGBM stops after 1-2 iterations across all trials, XGBoost shows a median best_iteration of about 17, and model quality is materially below the pre-#891 baseline. Current post-S17 f1_buy results on defi_top5 5m are : CatBoost 0.39, LightGBM 0.37 and XGBoost 0.09, versus a pre-#891 baseline of about 0.42.
As a result, the harness architecture is production-canonical, but harness-trained model promotion is frozen until CVN-N001-EE-S18 and S19 are closed. The current leaderboard is informational only and must not be used for live trading promotion.
Pipeline flow (canonical, post-#898)¶
OHLCV → Enrichment → Labeling → Feature Engineering
→ Temporal Split → Harness DAG
→ { XGB | LGB | CB } via `harness.train_one(model_type, datasets, hpo_params)`
→ Evaluation → MLflow
CUSUM is applied at inference (per ADR-46 + committee reco) ; it is NOT a training filter. The legacy "Stage 7 HPO + XGBoost Training" name is retained in §8 Legacy Pipeline Analysis for historical context but the current canonical path is described in §4-§5.
Consolidated performance snapshot¶
Single source of truth — referenced by every other section :
| Baseline | State | Model | f1_buy | Notes |
|---|---|---|---|---|
| pre-#891 | legacy autonomous trainers | XGB+LGB+CB (3 trainers, ~2400 LOC) | ~0.42 | reference target — what we want to recover |
| post-#891 | harness migration | (all 3) | ~0.22 | the 2026-05-11 incident that motivated S17 |
| post-S17 | harness + Console-seeded HPs | CatBoost | 0.39 | within range, best_iter 12-155 mixed |
| post-S17 | harness + Console-seeded HPs | LightGBM | 0.37 | shallow + θ-sweep at θ=0.2 compensates (best_iter=1 in 53/53 trials) |
| post-S17 | harness + Console-seeded HPs | XGBoost | 0.09 | shallow + θ=0.5 fixed → near-empty buy classifier ; best_iter p50=17 |
Promotion-freeze decision is anchored on this table : no harness-trained model goes live until f1_buy on defi_top5 5m ATR0.5_1.5_H4 returns to ≥ 0.40 on ≥ 4/5 cryptos for all 3 model types (mlops_readiness.md §4 canary criterion).
2. Current Status & Promotion Freeze¶
| Component | Production state | Promotion gate |
|---|---|---|
| Harness binary (src/training/harness/) | production canonical (PR #891 + #898) | n/a (architectural change, not a model) |
| Hyperparameter resolver (commun.finetune.hyperparams.resolve) | production canonical (PR #904 / ADR-90) | n/a (config plumbing) |
| Console seeding (455 keys in ftf_config.base_env) | seeded 2026-05-12T18:34 UTC, 100% coverage | n/a |
| Harness-trained XGB / LGB / CB models | trained, results in finetune_results | FROZEN — promotion blocked pending S18-S19 |
| FTF leaderboard | running daily | informational only — not a promotion signal |
| Live trading on harness-trained models | forbidden | gate = S18 verdict LOCK + S19 fix verified on canary |
Until both CVN-N001-EE-S18 (diagnostic) and the follow-up CVN-N001-EE-S19 (remediation) close with a verified canary, the operator MUST NOT promote any harness-trained model to live trading. Backtests and shadow runs are permitted ; auto-promotion is OFF (ADR-2).
3. Current Canonical Architecture¶
The pipeline below is the version in production as of 2026-05-13. See §8 Legacy Pipeline Analysis for the pre-#891 path (kept for historical bottleneck analysis ; do NOT reference for current implementation).
Binance OHLCV
↓
Enrichment (technical indicators)
↓
Labeling (triple-barrier ATR-dynamic)
↓
Feature Engineering (Hamilton DAG, fitted at train, frozen at inference)
↓
Temporal split (walk-forward, 5 folds × {train, val, test})
↓
Harness DAG (Hamilton, plugin registry)
↓
┌─ XGB ──┐
├─ LGB ──┤ → resolver `resolve(MODEL, TF, PARAM)` reads from PG ftf_config.base_env
└─ CB ──┘ (455 keys, Console-editable, ADR-90)
↓
Evaluation (per-fold f1_buy / sortino / expectancy / drawdown, with bootstrap CI)
↓
MLflow registry (versioned model + feature contract pinned, ADR-23 / ADR-24)
↓
{Promotion gate} ← currently FROZEN pending S18-S19 — see §2
Each upstream stage (OHLCV / Enrichment / Labeling / FE / Split) is unchanged from the legacy pipeline ; only the training step (Stage 7 in legacy numbering) was replaced. The harness is therefore a drop-in replacement at the training boundary.
4. Harness Path¶
The legacy per-model autonomous trainer approach (Stages 7-9 of the legacy pipeline, see §8 Legacy Pipeline Analysis) has been replaced by a Hamilton-based training harness with plugin registries (ADR-89). This is the canonical path for every single-model and ensemble training in production since 2026-05-10 (PR #891 + #898). Promotion of harness-trained models is currently FROZEN — see §2 Current Status & Promotion Freeze.
High-level data flow¶
flowchart LR
subgraph FTF[FTF runner / orchestrator]
OR["CVNTrade_AutonomousOrchestrator
._create_autonomous_trainer(model_type)"]
end
subgraph Harness[training.harness]
REG{{"MODEL_REGISTRY
ENSEMBLE_REGISTRY"}}
AMT[CVNTrade_AutonomousModelTrainer
generic, parameterised by model_type]
AET[CVNTrade_AutonomousEnsembleTrainer
generic, parameterised by ensemble_name]
T1["train_one(model_type, datasets, hpo)
Hamilton Driver"]
TE["train_ensemble(name, datasets, hpo)
Hamilton Driver"]
end
subgraph DAGs[harness.dags]
XGD["xgboost_dag.py"]
LGD["lightgbm_dag.py"]
CBD["catboost_dag.py"]
ENS1["blend_avg_dag.py"]
ENS2["stack_logreg_dag.py"]
ENS3["stack_xgb_meta_dag.py"]
end
subgraph Adapters[harness.adapters]
XA[XGBAdapter]
LA[LGBAdapter]
CA[CBAdapter]
EA[EnsembleAdapter]
end
OR --> REG
REG -- single model --> AMT
REG -- ensemble --> AET
AMT --> T1
AET --> TE
T1 --> XGD & LGD & CBD
TE --> ENS1 & ENS2 & ENS3
XGD --> XA
LGD --> LA
CBD --> CA
ENS1 & ENS2 & ENS3 --> EA
EA -. composes .-> XA & LA & CA
Adding a new model¶
- Drop a new file src/training/harness/dags/models/<name>_dag.py declaring the Hamilton DAG nodes (<name>_native, <name>_adapter, <name>_metrics_val, <name>_metrics_test, <name>_artifact) + the per-model HPO search space. Each numeric default MUST be read via commun.finetune.hyperparams.resolve(_MODEL, current_timeframe(), "<PARAM>") (per ADR-90 + CI grep gate G5) ; HPO range MIN/MAX/SCALE via resolve_hpo_range(_MODEL, current_timeframe(), "<PARAM>").
- Append a register_model(...) call at the bottom of the file (or use the convention scan in harness/registry.py).
- Seed the Console with the model's HP defaults + HPO ranges for the 5 timeframes (1M / 5M / 15M / 30M / 1H). Add the values to scripts/seed_hyperparams_console.py::LEGACY_DEFAULTS + LEGACY_HPO_RANGES, update commun.finetune.hyperparams.expected_keys() to enumerate the new (model, param) pairs, then run : python scripts/seed_hyperparams_console.py --apply against prod PG (via kubectl exec on a scheduler pod, see documentation/runbooks/break-glass-hyperparams.md for the Console-bypass pattern when needed). The parity tests tests/unit/training_harness/test_hyperparams_seeding_parity.py MUST pass before merge.
- Done. The orchestrator picks the new model up automatically — no edits to any caller. Synthetic-pickup test : tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup.
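The registration step can be pictured with a minimal sketch of a dictionary-backed plugin registry. The `ModelSpec` fields, the `register_model` signature and `create_trainer_spec` are hypothetical names chosen for illustration ; the real registry lives in harness/registry.py and may look different.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical registry entry; the field names are illustrative only."""
    name: str
    dag_module: str
    build_adapter: Callable[[object], object]


# Hypothetical stand-in for the harness MODEL_REGISTRY.
MODEL_REGISTRY: Dict[str, ModelSpec] = {}


def register_model(name: str, dag_module: str,
                   build_adapter: Callable[[object], object]) -> ModelSpec:
    """Register a model type once; duplicate names are rejected."""
    if name in MODEL_REGISTRY:
        raise ValueError(f"model type already registered: {name}")
    spec = ModelSpec(name=name, dag_module=dag_module, build_adapter=build_adapter)
    MODEL_REGISTRY[name] = spec
    return spec


def create_trainer_spec(model_type: str) -> ModelSpec:
    """Dispatch by name, so callers need no edits when a model is added."""
    try:
        return MODEL_REGISTRY[model_type]
    except KeyError:
        raise KeyError(f"unknown model type: {model_type}") from None


# Bottom-of-file registration, as a new <name>_dag.py might do it.
register_model("toy", "dags.models.toy_dag", build_adapter=lambda native: native)
```

Because dispatch goes through the registry lookup, a fourth model only needs its own file and one registration call, which is the property the synthetic-pickup test checks.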
Adding a new ensemble¶
Same pattern under dags/ensembles/. The EnsembleAdapter composes
N base TrainedArtifacts ; Hamilton resolves the base model
dependencies and caches them within a driver call. Ensembles do NOT
own hyperparameters themselves today — they reuse the per-base-model
seeded values from ftf_config.base_env. If an aggregator gains
tunable HPs (e.g., LogRegShrinkAggregator.l2 floor), externalize via
ADR-90 with a CVN_HPO_ENS_<NAME>_<PARAM> namespace extension
(currently unused — file an ADR addendum if needed).
File map (current)¶
| Concern | File | LOC |
|---|---|---|
| Public API | src/training/harness/__init__.py | ~180 |
| Typed contracts | src/training/harness/contracts.py | ~150 |
| Plugin registries | src/training/harness/registry.py | ~120 |
| Per-model DAGs | src/training/harness/dags/models/*_dag.py | ~200 each |
| Per-ensemble DAGs | src/training/harness/dags/ensembles/*_dag.py | ~180 each |
| Per-model adapters | src/training/harness/adapters/{xgb,lgb,cb}.py | ~50 each |
| Ensemble adapter | src/training/harness/adapters/ensemble.py | ~80 |
| Reused Hamilton nodes | src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py | ~700 total |
| Generic single-model autonomous wrapper | src/training/harness/autonomous_model_trainer.py | ~280 |
| Generic ensemble autonomous wrapper | src/training/harness/autonomous_ensemble_trainer.py | ~240 |
| Hamilton-native wrappers (PR #896 wiring) | src/training/harness/dags/wrappers/ | varies |
| Orchestrator dispatch | src/training/cvntrade_autonomous_orchestrator.py | ~210 |
| Hyperparameter resolver (ADR-90) | src/commun/finetune/hyperparams.py | ~410 |
| Console seeding script (ADR-90) | scripts/seed_hyperparams_console.py | ~570 |
Backtest contract¶
Every harness DAG returns a TrainedArtifact whose .adapter exposes
the canonical predict_proba(X) -> (n, 2) Protocol. The backtest
engine's create_model_adapter factory recognises harness adapters
via duck-typing and wraps them through _HarnessAdapterShim to
satisfy the legacy CVNTrade_ModelAdapter ABC. No type-sniffing in
consumer code (regime trainer, backtest engine, OOS predictor).
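A minimal sketch of the duck-typed contract, assuming a structural `Protocol` check. `PredictsProba`, `LegacyAdapterShim` and `ConstantAdapter` are hypothetical names ; the real shim is _HarnessAdapterShim and the real factory may use a different detection rule.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class PredictsProba(Protocol):
    """Structural contract: anything exposing predict_proba(X) qualifies."""

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        ...


class LegacyAdapterShim:
    """Hypothetical wrapper standing in for _HarnessAdapterShim."""

    def __init__(self, inner: PredictsProba) -> None:
        self._inner = inner

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        proba = self._inner.predict_proba(X)
        # Enforce the (n, 2) shape promised by the backtest contract.
        if len(proba) != len(X) or any(len(row) != 2 for row in proba):
            raise ValueError("adapter must return an (n, 2) probability matrix")
        return proba


def create_model_adapter(model: object) -> LegacyAdapterShim:
    """Recognise harness adapters by behaviour rather than by concrete type."""
    if isinstance(model, PredictsProba):
        return LegacyAdapterShim(model)
    raise TypeError(f"{type(model).__name__} does not expose predict_proba")


class ConstantAdapter:
    """Toy adapter used only to exercise the sketch."""

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        return [[0.7, 0.3] for _ in X]
```

The isinstance check only verifies that the method exists, so the shape check inside the shim is what actually guards consumers against a malformed adapter.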
5. Hyperparameter Resolution (ADR-90)¶
Since PR #904 (CVN-N001-EE-S17, 2026-05-12) :
| Key family | Naming convention | Count today | Owner |
|---|---|---|---|
| Defaults | CVN_HPO_<MODEL>_<TF>_<PARAM> | 110 HPO-tunable + 15 EARLY_STOPPING_ROUNDS = 125 | Console (@dococeven) |
| HPO ranges | CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_{MIN,MAX,SCALE} | 330 (3 per HPO-tunable default × 110) | Console (@dococeven) |
| Total | | 455 canonical keys | |
The resolver contract :
- Env key present → parse to typed value (per _PARAM_TYPES in commun/finetune/hyperparams.py)
- Env key missing + fallback=None → RuntimeError with canonical message pointing to ADR-90 (operator MUST seed the key)
- Env key missing + fallback=<value> → emit WARN event=hpo_fallback_applied then return the (type-coerced) fallback (transitional warning window per ADR-90 Clause 2)
- NaN / ±Inf rejected at parse time (_parse fail-fasts)
- HPO range MIN >= MAX rejected at resolve_hpo_range exit
- Fallback type-coerced through the same _parse path as env-sourced values (string "150" → int 150 for EARLY_STOPPING_ROUNDS)
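The contract above can be sketched as follows. This is an illustration under stated assumptions, not the code in commun/finetune/hyperparams.py : the explicit `env` mapping argument, the `warnings` list, the two-entry `PARAM_TYPES` table and the rejection of fractional values for integer parameters are all hypothetical choices made to keep the sketch self-contained.

```python
import math
from typing import List, Mapping, Optional, Tuple, Union

Number = Union[int, float]

# Hypothetical two-entry subset standing in for _PARAM_TYPES.
PARAM_TYPES = {"EARLY_STOPPING_ROUNDS": int, "LEARNING_RATE": float}


def _parse(param: str, raw: object) -> Number:
    """Parse to the declared type; fail fast on NaN and +/-Inf."""
    value = float(str(raw))  # raises ValueError on non-numeric text
    if not math.isfinite(value):
        raise ValueError(f"non-finite value for {param}: {raw!r}")
    if PARAM_TYPES.get(param, float) is int:
        if value != int(value):  # assumption: fractional ints are rejected
            raise ValueError(f"{param} expects an integer, got {raw!r}")
        return int(value)
    return value


def resolve(env: Mapping[str, str], model: str, timeframe: str, param: str,
            fallback: Optional[object] = None,
            warnings: Optional[List[str]] = None) -> Number:
    key = f"CVN_HPO_{model}_{timeframe}_{param}"
    if key in env:
        return _parse(param, env[key])
    if fallback is None:
        raise RuntimeError(f"{key} is not seeded; see ADR-90")
    if warnings is not None:
        warnings.append(f"event=hpo_fallback_applied key={key} reason=env_key_missing")
    return _parse(param, fallback)  # same parse path as env-sourced values


def resolve_hpo_range(env: Mapping[str, str], model: str, timeframe: str,
                      param: str) -> Tuple[Number, Number, str]:
    base = f"CVN_HPO_{model}_{timeframe}_{param}_RANGE"
    low = _parse(param, env[f"{base}_MIN"])
    high = _parse(param, env[f"{base}_MAX"])
    if low >= high:
        raise ValueError(f"{base}: MIN >= MAX ({low} >= {high})")
    return low, high, env[f"{base}_SCALE"]
```

Routing the fallback through `_parse` is what turns the string "150" into the int 150, so a fallback can never bypass the NaN and type checks applied to seeded values.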
Operator workflow :
- Tune a value : Console UI → Config → ftf_config → edit key → save (writes audit row to ftf_config_history). No deploy needed (rule 1 of ADR-90).
- Inspect coverage : Grafana dashboard cvntrade-hp-coverage (5 panels : 24h fallback count, 6h fallback %, 7d timeseries, top missing keys, training stack health). Target = 0 fallback events.
- Rollback bad write : Console UI restore-from-history button NOT IMPLEMENTED today (Epic CCP wp#149 scope) ; until then, use the break-glass PG path with the gating ritual in documentation/runbooks/break-glass-hyperparams.md.
- Seed a new namespace : python scripts/seed_hyperparams_console.py --apply after extending LEGACY_DEFAULTS / LEGACY_HPO_RANGES / expected_keys(). Idempotent (re-running is a no-op on already-seeded keys).
6. Known Regression — S18¶
Post-S17 canary FTF ftf_20260512_184337_fdee27_ATR0.5_1.5_H4 (2026-05-12) exposed that harness training stops shallow for all 3 model types :
| Model | best_iteration p50 | f1_buy mean | Profile |
|---|---|---|---|
| LightGBM | 1 (53/53 trials hit iter 1-2) | 0.37 | Shallow + val-tuned θ-sweep at θ=0.2 grabs random-distribution tail |
| CatBoost | 14 (some trials reach 336) | 0.39 | Mixed — bimodal |
| XGBoost | 17 | 0.09 | Shallow + fixed θ=0.5 → near-empty buy classifier (1-44 trades vs 100-280 for LGB/CB) |
Root cause is upstream of S17. Hyperparameter externalization (PR #904) is verified clean : event=hpo_fallback_applied count = 0, Optuna picks are within seeded ranges, parity tests test_hyperparams_seeding_parity.py confirm byte-for-byte legacy values. The regression is in the #891 harness migration itself — diagnostic Story scope under CVN-N001-EE-S18 plan dossier (plan_review committee session 4298520f PASSED OK strong consensus, 0 blockers, OP wp#154, OP Meeting #133).
Diagnostic plan (S18) — 6 steps, 5.25-day budget¶
| Step | Action | Budget |
|---|---|---|
| 0 | Pre-validate captured fold reproduces FTF metrics in staging (committee reco #4) | 0.25 day |
| 1 | Capture reference fold AAVEUSDC fold=3 best_iter=1 from FTF run with verbose logging | 1 day |
| 2 | Reconstruct pre-#891 legacy train_with_fixed_params_lgbm from git (git show e75418ca^:src/training/LightGBM/grid_utils.py) | 0.5 day |
| 3 | Build parity reproducer scripts/diagnostic_s18_harness_vs_legacy.py — all 3 models + per-iter AUC + binary_logloss | 2 days |
| 4 | Bisect harness commits (#891 dc3d86c6, #896 f56fa33f, #899 e75418ca) | 1 day |
| 5 | Final dossier documentation/missions/cvn-n001-ee-s18-diagnostic/report.md + verdict | 0.5 day |
Verdict tree (per ADR-79, adapted for diagnostic Story) :
- LOCK : root cause + fix < 30 LOC → open CVN-N001-EE-S19 1-week plan
- KEEP_AVAILABLE : root cause + fix non-trivial → S19 with 2-3 sprint plan + design dossier
- ABANDON : 30d no convergence OR fix > 3× revert surface → S19 reverts #891 + #896 + #899, accept ADR-89 loss, redo migration
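Read as a decision function, the verdict tree might look like the sketch below. It assumes that "non-trivial" means 30 LOC or more and that both the fix and the revert surface are measured in lines of code ; the `IN_PROGRESS` return value and all parameter names are hypothetical additions for the case where none of the three verdicts applies yet.

```python
def s18_verdict(root_cause_found: bool, fix_loc: int,
                revert_surface_loc: int, days_elapsed: int) -> str:
    """Map diagnostic outcomes to a verdict, checking ABANDON first."""
    no_convergence = not root_cause_found and days_elapsed >= 30
    fix_too_large = fix_loc > 3 * revert_surface_loc
    if no_convergence or fix_too_large:
        return "ABANDON"
    if not root_cause_found:
        return "IN_PROGRESS"  # hypothetical state: diagnosis still running
    return "LOCK" if fix_loc < 30 else "KEEP_AVAILABLE"
```

Checking ABANDON before the other branches means an oversized fix is never classified as KEEP_AVAILABLE, which matches the "OR" in the third rule.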
Until S18 + S19 close, the harness FTF leaderboard is informational only ; no live trading promotion of harness-trained models is permitted (see §2).
7. Observability & Operator Workflow¶
Observability contract — Loki event schema¶
Every harness DAG emits the same event= schema, traceable in Loki
via the {namespace="cvntrade"} stream :
| Event | When | Key labels |
|---|---|---|
| event=training_started | DAG entry | model_type, harness_schema_version, n_train, n_val, n_test, n_features, class_dist_train_buy_pct, n_estimators, learning_rate (golden fields — full bundle since CR round 1 PR #904 lgb_effective_hyperparams) |
| event=class_balance_applied | After class-weight computation | model_type, binary, scale_pos_weight, n_pos, n_neg, imbalance_ratio |
| event=theta_picked | After pick_threshold_on_val | model_type, threshold, f1_buy_at_threshold, n_candidates |
| event=training_complete | DAG exit | model_type, best_iteration, training_time_sec, theta_picked, f1_buy_val, f1_buy_test, auc_buy_val, auc_buy_test |
| event=autonomous_trained | Cache-aware wrapper success | model_type, crypto, strategy, run_id |
| event=hpo_fallback_applied (NEW, S17) | Resolver hit fallback path | model, timeframe, param, key, fallback, reason=env_key_missing — Grafana panel cvntrade-hp-coverage thresholds : > 1% rate of event=training_started → P3, > 5% → P2 |
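The alert thresholds on the fallback event can be expressed as a small function. This sketch assumes the rate is the count of event=hpo_fallback_applied divided by the count of event=training_started over the same window ; the function name and the `None` result for "no alert" are hypothetical.

```python
from typing import Optional


def fallback_severity(n_fallback: int, n_training_started: int) -> Optional[str]:
    """Return the alert priority for a fallback rate, or None when quiet."""
    if n_training_started <= 0:
        return None  # no trainings in the window, so no rate to evaluate
    rate = n_fallback / n_training_started
    if rate > 0.05:
        return "P2"
    if rate > 0.01:
        return "P3"
    return None
```

Both thresholds are strict, so a rate of exactly 1% stays silent and a rate of exactly 5% stays at P3.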
See also¶
- ADR-89 — Training harness as plugin registry
- ADR-90 — Training hyperparameters in PG Console only (V3.2 addition)
- ADR-67 — Pluggable feature-selection registry (sister registry pattern)
- Plan dossier (harness original) — documentation/reviews/2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md
- Plan dossier (hyperparameters externalization) — documentation/reviews/2026-05-11-cvn-n001-ee-s17-hyperparams-externalization-plan.md
- Plan dossier (shallow-training regression diagnostic, in flight) — documentation/reviews/2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md
- Break-glass runbook (ADR-90 transition window) — documentation/runbooks/break-glass-hyperparams.md
- Structurizr view — TrainingHarness (Components of Airflow → harness module)
8. Legacy Pipeline Analysis¶
Historical reference only. The current canonical implementation is described in §3-§7 above. Stages 1-6 below remain technically accurate as inputs to the harness (Stage 7 is the part replaced). Stages 7-9 are the pre-#891 trainers — deprecated since 2026-05-10. This section is retained for the bottleneck analysis it informs (§9) and for the FIX A-H historical action plan (Appendix A, §12) that motivated the harness refactor.
8.1 Pipeline Stages 1-6 (still current as harness inputs)¶
Stage 1: Data Ingestion (OHLCV)¶
| Aspect | Detail |
|---|---|
| Source | Binance Futures API via ccxt |
| File | src/ETL/cvntrade_external_data.py:20-26 |
| Timeframe | 30m (default), configurable via CVN_TIMEFRAME |
| History | 24 months (configurable via history_months) |
| Data | Open, High, Low, Close, Volume + Funding Rate (hourly, resampled) |
| Cache | Redis L1 (10 connections, 5s timeout) → Disk L2 (parquet) |
| Volume | ~24,000 candles per crypto for 24m @ 30m |
Checkpoint: Raw OHLCV dataframe, no NaN, timezone-aware index.
Stage 2: Enrichment (Technical Indicators)¶
| Aspect | Detail |
|---|---|
| File | src/commun/pipeline/enrichment_api.py:82-90 |
| Processor | CVNTrade_Enrich.process(df, mode="inference") |
| Indicators | SMA (multi-period), RSI (7,14,28), MACD (slow+signal), ATR, Stochastic (K,D), Volume MA, MFI, ADX, Bollinger Bands |
| Gate v5 features | MA spectrum (5,10,20,50), rolling window stats, volatility regime, momentum patterns, support/resistance, volume flow, market structure |
| Output | ~300+ columns (raw indicators + derived features) |
| Warmup | ~500 candles consumed for indicator bootstrap |
| Parity guarantee | enrich_batch(ohlcv).iloc[-1] == enrich_streaming(window).iloc[-1] (ADR) |
Checkpoint: Enriched dataframe with ~300 columns, ~23,500 rows (after warmup).
Stage 3: Labeling (Triple Barrier)¶
| Aspect | Detail |
|---|---|
| File | src/ETL/cvntrade_label.py:188-269 |
| Method | Triple barrier with ATR-dynamic levels |
| Strategy format | ATR{sl}_{tp}_H{horizon} (e.g. ATR1.5_3.0_H5) or legacy SL{x}_TP{y}_H{z} (e.g. SL0.5_TP1_H4) |
| TP level | open × (1 + atr × tp_mult / 100) |
| SL level | open × (1 - atr × sl_mult / 100) |
| Horizon | H candles forward (e.g. H5 = 5 hours @ 30m = 10 candles) |
| Labels | +1 (BUY: TP hit first), -1 (SELL: SL hit first), 0 (HOLD: timeout) |
| Anti-look-ahead | Uses T+1 to T+horizon window, decision at T based on T-1 return |
| ATR floor | 0.6% minimum (prevents degenerate labels on low-vol periods) |
| NaN handling | Labels that don't resolve (no TP/SL/timeout) → dropped (~10-20%) |
Class distribution (typical for ATR1.5_3.0_H5):
Checkpoint: Labeled dataframe. ~20,000 rows after NaN drop. Imbalanced toward HOLD.
Known issue: High HOLD ratio means model defaults to HOLD → low BUY recall.
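A minimal sketch of the labeling rule for a single decision bar, using the TP / SL formulas and the 0.6% ATR floor from the table. It assumes ATR is expressed in percent of price, that `highs` and `lows` hold the T+1 to T+horizon window, and that the stop-loss wins when both barriers are touched in the same candle ; that tie-break and the function name are assumptions, not taken from cvntrade_label.py.

```python
from typing import Sequence

ATR_FLOOR_PCT = 0.6  # minimum ATR in percent, per the table above


def triple_barrier_label(open_price: float, atr_pct: float,
                         highs: Sequence[float], lows: Sequence[float],
                         tp_mult: float, sl_mult: float) -> int:
    """Return +1 (TP first), -1 (SL first) or 0 (timeout) for one bar."""
    atr = max(atr_pct, ATR_FLOOR_PCT)
    tp_level = open_price * (1 + atr * tp_mult / 100)
    sl_level = open_price * (1 - atr * sl_mult / 100)
    for high, low in zip(highs, lows):
        # Assumption: when one candle touches both barriers, count it as SL.
        if low <= sl_level:
            return -1
        if high >= tp_level:
            return 1
    return 0  # horizon elapsed without touching either barrier
```

For ATR1.5_3.0_H5 with an open of 100 and an ATR of 1%, the barriers sit at 98.5 and 103.0, so the stop is twice as close as the target, which is one reason HOLD and SELL labels tend to outnumber BUY.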
Stage 4: CUSUM Pre-Filter¶
| Aspect | Detail |
|---|---|
| File | src/backtest/filters/cvntrade_cusum_filter.py:279-409 |
| Applied | BEFORE train/val/test split (cvntrade_autonomous_fe.py:125-200) |
| Algorithm | Cumulative sum control chart on log returns |
| Formula | S+[t] = max(0, S+[t-1] + (r[t-1] - μ - k)), alert if S+ > h or S- < -h |
| Parameters | k=0.5σ (allowance), h=3.0σ (threshold, from CVN_CUSUM_THRESHOLD_H) |
| Warmup | 100 bars (all marked as transition) |
| Cooldown | 10 bars after each detection |
| Sigma | Calibrated on training data, cached, NEVER re-fit on test (ADR) |
| Mode | "Stable mode" — keeps only NON-transition bars |
| Output | ~5% of candles survive (regime change events only) |
Checkpoint: Filtered dataframe. ~1,000-1,500 rows from 20,000. 95% data loss.
CRITICAL BOTTLENECK #1
Input: 20,000 labeled candles
CUSUM: ~1,000 survive (h=3.0σ)
Output: 1,000 rows → split 70/15/15 → Train: 700, Val: 150, Test: 150
With more aggressive h (e.g. h=4.0σ from env):
Output: ~300 rows → Train: 210, Val: 45, Test: 45
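The two-sided recursion in the table can be sketched as below. Several details are assumptions because the table does not state them : the lower sum mirrors the upper one with the sign of k flipped, both sums reset to zero after a detection, cooldown bars are marked as transition and do not update the sums, and `mu` and `sigma` are passed in already calibrated. The function name is hypothetical.

```python
from typing import List, Sequence


def cusum_transition_mask(returns: Sequence[float], mu: float, sigma: float,
                          k_mult: float = 0.5, h_mult: float = 3.0,
                          warmup: int = 100, cooldown: int = 10) -> List[bool]:
    """Return True for bars flagged as transition (warmup, alert, cooldown)."""
    k = k_mult * sigma  # allowance
    h = h_mult * sigma  # alert threshold
    s_pos = s_neg = 0.0
    cooldown_left = 0
    mask: List[bool] = []
    for t in range(len(returns)):
        if t < max(warmup, 1):  # bar 0 has no previous return to read
            mask.append(True)
            continue
        if cooldown_left > 0:  # assumption: sums stay frozen during cooldown
            cooldown_left -= 1
            mask.append(True)
            continue
        r_prev = returns[t - 1]  # the formula reads r[t-1], not r[t]
        s_pos = max(0.0, s_pos + (r_prev - mu - k))
        s_neg = min(0.0, s_neg + (r_prev - mu + k))
        if s_pos > h or s_neg < -h:
            mask.append(True)
            s_pos = s_neg = 0.0  # assumption: reset after each detection
            cooldown_left = cooldown
        else:
            mask.append(False)
    return mask
```

Using the previous bar's return keeps the decision at bar t free of information from bar t itself, in line with the anti-look-ahead rule applied at labeling.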
Stage 5: Feature Engineering¶
| Aspect | Detail |
|---|---|
| File | src/commun/cache/components/cvntrade_autonomous_fe.py:125-170 |
| Steps | Imputation → Stationarization (ADF/KPSS) → Normalization (StandardScaler) |
| Feature selection | Variance threshold + adaptive cap |
| Adaptive cap | min(80, max(10, n_features / 20)) |
| Example | 300 features → cap = 15-80 → typically 30-50 selected |
| Fit scope | Pipeline fitted on TRAIN set only, applied to val/test |
| Feature lag | Disabled (feature_lag=0) |
| Label remapping | 3-class: {-1,0,1} → {0,1,2}. Binary: {-1,0} → 0, {1} → 1 |
Checkpoint: Feature matrix X (n_samples × n_features), label vector y.
BOTTLENECK #2 — Aggressive dimensionality reduction
300+ raw indicators → 30-50 selected features
Potential signal loss: MACD variants, stochastic slopes, volume patterns removed
Model capacity limited by feature count
Stage 6: Train/Val/Test Split¶
| Aspect | Detail |
|---|---|
| File | src/commun/cache/components/cvntrade_autonomous_fe.py:143-146 |
| Method | Temporal split 70/15/15 (sequential in time) |
| Purge | 10 bars between train/val boundary (CVN_PURGE_BARS) |
| Embargo | 5 bars between val/test boundary (CVN_EMBARGO_BARS) |
| Walk-forward | Supported via FTF folds (5 folds, sliding window) |
Checkpoint: Three datasets (X_train, y_train), (X_val, y_val), (X_test, y_test).
BOTTLENECK #3 — Tiny datasets after CUSUM
With 1,000 post-CUSUM rows:
Train: 700 samples
Val: 150 samples
Test: 150 samples
With 300 post-CUSUM rows (aggressive h):
Train: 210 samples
Val: 45 samples
Test: 45 samples
XGBoost on 210 training samples with 30 features = severe overfitting risk
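A minimal sketch of the 70/15/15 temporal split with the purge and embargo gaps from the table. It assumes the gap bars are dropped from the start of the later set, which is one common convention ; the actual placement in cvntrade_autonomous_fe.py is not stated here, and the function name is hypothetical.

```python
from typing import Tuple


def temporal_split(n_rows: int, train_pct: int = 70, val_pct: int = 15,
                   purge: int = 10, embargo: int = 5) -> Tuple[range, range, range]:
    """Return row-index ranges for train, val and test, in time order."""
    # Integer arithmetic avoids float rounding on the boundaries.
    train_end = n_rows * train_pct // 100
    val_end = n_rows * (train_pct + val_pct) // 100
    train = range(0, train_end)
    # Assumption: purge bars are removed from the head of the validation set.
    val = range(min(train_end + purge, val_end), val_end)
    # Assumption: embargo bars are removed from the head of the test set.
    test = range(min(val_end + embargo, n_rows), n_rows)
    return train, val, test
```

Under these assumptions 1,000 post-CUSUM rows yield 700 / 140 / 145 usable samples and 300 rows yield 210 / 35 / 40, so the gaps cost proportionally more as the dataset shrinks.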
8.2 Pipeline Stages 7-9 — Deprecated, kept for historical analysis¶
Stage 7: HPO + XGBoost Training (legacy — superseded by harness in PR #891 + #898)¶
| Aspect | Detail |
|---|---|
| File | src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py:52-250 |
| Optimizer | Optuna TPESampler (seed=42) |
| Trials | 30 (from CVN_HPO_N_TRIALS), timeout 3600s |
| Pruner | MedianPruner |
| Warm start | 5 startup trials (random), then TPE |
Hyperparameter search space (from config, timeframe-dependent):
max_depth: [3, 8]
learning_rate: [0.01, 0.3]
n_estimators: [100, 500]
subsample: [0.6, 1.0]
colsample_bytree: [0.6, 1.0]
min_child_weight: [1, 10]
gamma: [0, 5]
reg_alpha: [0.01, 5.0] (timeframe-dependent)
reg_lambda: [0.5, 7.0] (timeframe-dependent)
threshold_buy: [0.30, 0.50]
threshold_sell: [0.30, 0.50]
Objective function (CVN_HPO_OBJECTIVE):
| Objective | Formula | Default weights |
|---|---|---|
| precision_recall_auc | w_prec × precision_buy + w_auc × auc + w_log × (1-logloss) | 0.45 / 0.35 / 0.20 |
| fbeta_buy | F_β(buy) with β from CVN_BUY_BETA | β=1.0 (F1) |
| logloss_auc | w × (1-logloss) + (1-w) × auc | w=0.5 (binary mode) |
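The three objectives reduce to short scalar formulas. The sketch below implements them directly from the table, with the standard F-beta definition for fbeta_buy ; the function signatures are illustrative and the metric inputs are assumed to be precomputed on the validation set.

```python
def precision_recall_auc(precision_buy: float, auc: float, logloss: float,
                         w_prec: float = 0.45, w_auc: float = 0.35,
                         w_log: float = 0.20) -> float:
    """Weighted blend of BUY precision, AUC and inverted logloss."""
    return w_prec * precision_buy + w_auc * auc + w_log * (1 - logloss)


def fbeta_buy(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta on the BUY class; beta=1.0 gives F1."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # no BUY predictions and no BUY hits
    return (1 + beta ** 2) * precision * recall / denominator


def logloss_auc(logloss: float, auc: float, w: float = 0.5) -> float:
    """Blend of inverted logloss and AUC, weighted by w."""
    return w * (1 - logloss) + (1 - w) * auc
```

As a worked value, a BUY precision of 0.60 with a BUY recall of 0.30 gives an F1 of 0.40, which shows how low recall caps f1_buy even when precision is decent.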
Guard conditions (kill trial if):
BOTTLENECK #4 — HPO guards constrain the solution space
action_rate [8%, 60%] forces model to predict BUY on 8-60% of samples
recall_buy > 15% forces minimum BUY detection
Combined: model must predict BUY often enough but not too often
Result: model converges to "safe middle" with f1_buy ~0.45
Class balancing (XGBoost trainer lines 900-911):
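import numpy as np
from sklearn.utils.class_weight import compute_class_weight

if _is_class_balancing_enabled():  # CVN_CLASS_BALANCING=1
    # scikit-learn takes classes and y as keyword-only arguments
    class_weights = compute_class_weight("balanced", classes=classes, y=y_train)
    sample_weights = [class_weights[int(y)] for y in y_train]  # labels remapped to 0..k-1
else:
    sample_weights = np.ones(len(y_train))  # NO BALANCING (default)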
BOTTLENECK #5 — Class balancing disabled by default
With 70% HOLD, 15% BUY, 15% SELL:
Model learns to predict HOLD for most samples (safe bet)
BUY precision high but recall low → f1_buy ~0.45
Enabling balancing: 0.05-0.10 improvement expected
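To make the effect of balancing concrete, the sketch below reproduces the "balanced" heuristic, n_samples / (n_classes × class_count), with the standard library only. The function name is hypothetical and the 70 / 15 / 15 label mix is the illustrative distribution quoted above.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 70% HOLD (0), 15% BUY (1), 15% SELL (2), as in the example above.
labels = [0] * 70 + [1] * 15 + [2] * 15
weights = balanced_class_weights(labels)
sample_weights = [weights[y] for y in labels]
```

With this mix HOLD gets a weight of about 0.48 and BUY / SELL about 2.22 each, so every class contributes the same total weight to the loss.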
Early stopping (trainer line 957):
Calibration (trainer lines 961-963):
if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)
# Options: "none" (default), "isotonic", "platt"
Stage 8: Model Evaluation (legacy — superseded by harness, kept for context)¶
| Aspect | Detail |
|---|---|
| File | src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:1025-1114 |
| Metrics | f1_macro, f1_buy, precision_buy, recall_buy, auc_buy, logloss, accuracy, action_rate, fbeta_buy |
| Evaluation sets | Val (for HPO selection) + Test (for reporting) |
| Overfit gap | f1_val - f1_test (added in #521) |
Typical metrics (ATR1.5_3.0_H5, defi_top5):
f1_buy: 0.43-0.49 ← weak
precision_buy: 0.50-0.65 ← decent
recall_buy: 0.20-0.35 ← very low (model too conservative)
auc_buy: 0.55-0.65 ← barely above random (0.50)
action_rate: 0.10-0.30 ← low (few BUY predictions)
Stage 9: MLflow Storage (legacy — superseded by harness, kept for context)¶
| Aspect | Detail |
|---|---|
| File | src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py:353-398 |
| Model name | CVNTrade_XGBoost_{CRYPTO}_{STRATEGY}_{TIMEFRAME} |
| Artifacts | Model pickle, feature names (tag), feature count (tag), validation metrics (metadata) |
| Registry | MLflow Model Registry with stages (Staging, Production) |
| Promotion | Manual via lock_winner() (ADR-2: no auto-promotion) |
9. Historical Bottlenecks¶
Historical reference only. The bottleneck analysis below was the V1 / V2 committee assessment that motivated the harness refactor (now done) and the FIX A-H action plan (Appendix A, §12). It describes the pre-#891 pipeline ; the harness path has different bottlenecks, currently under diagnosis in CVN-N001-EE-S18 (see §6).
Bottleneck Map¶
Raw Data (24,000 candles)
│
┌─────────┴─────────┐
│ ENRICHMENT │ → 300+ features
└─────────┬─────────┘ No data loss
│
┌─────────┴─────────┐
│ LABELING │ → ~20,000 labeled
└─────────┬─────────┘ 10-20% NaN drop
│ ← BOTTLENECK #6: label dropout
┌─────────┴─────────┐
│ CUSUM FILTER │ → ~1,000 rows ← BOTTLENECK #1: 95% data loss
└─────────┬─────────┘ (h=3.0σ)
│
┌─────────┴─────────┐
│ FEATURE ENGIN. │ → 30-50 features ← BOTTLENECK #2: signal loss
└─────────┬─────────┘ (adaptive cap)
│
┌─────────┴─────────┐
│ SPLIT 70/15/15 │ → Train: 700
└─────────┬─────────┘ Val: 150 ← BOTTLENECK #3: tiny datasets
│ Test: 150
┌─────────┴─────────┐
│ HPO (30 trials) │ → f1_buy ~0.45 ← BOTTLENECK #4: guards limit
└─────────┬─────────┘ ← BOTTLENECK #5: no class balance
│
┌─────────┴─────────┐
│ EVALUATION │ → 1-10 trades/fold
└─────────┬─────────┘ (tiny test set)
│
┌─────────┴─────────┐
│ MLFLOW STORAGE │
└───────────────────┘
Bottleneck Severity Ranking¶
| # | Bottleneck | Severity | Impact | Root Cause |
|---|---|---|---|---|
| 1 | CUSUM filters 95% of data | CRITICAL | Train/val/test too small for reliable ML | h=3.0-4.0σ too conservative for crypto volatility |
| 2 | Feature cap (30-50 features) | HIGH | Model can't learn complex patterns | Adaptive cap n/20 too aggressive |
| 3 | Tiny test sets (7-150 samples) | HIGH | Metrics statistically unreliable, few trades | Consequence of #1 |
| 4 | HPO guards (action_rate, recall) | MEDIUM | Solution space artificially constrained | Guards too tight for imbalanced data |
| 5 | Class balancing disabled | MEDIUM | Model biased toward HOLD | CVN_CLASS_BALANCING=0 default |
| 6 | Label dropout (~15%) | LOW | Reduced training set | Window too short for some candles |
10. Roadmap / Open Stories¶
Current in-flight + immediate-next Stories tied to this pipeline :
| Story | Status | Owner | Outcome gate |
|---|---|---|---|
| CVN-N001-EE-S16 (harness unification) | ✅ Closed — PR #891 merged 2026-05-09, full cutover #899 2026-05-10 | @dococeven | n/a — closed |
| CVN-N001-EE-S17 (hyperparameters externalization, ADR-90) | ✅ Closed — PR #904 merged 2026-05-12 | @dococeven | n/a — closed, sunset milestone-anchored on Epic CCP wp#149 |
| CVN-N001-EE-S18 (harness shallow-training diagnostic) | 🔄 In flight — plan_review PASSED OK 4298520f, implementation starts post-PR-#910-merge | @dococeven | Verdict LOCK / KEEP_AVAILABLE / ABANDON per §6 decision tree |
| CVN-N001-EE-S19 (remediation, scope TBD by S18 verdict) | ⏸️ Pending S18 | @dococeven | Canary FTF f1_buy ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types |
| Epic CCP wp#149 | ⏸️ Planned (immediate successor to training stabilization) | @dococeven | Typed schemas + Console UI restore + scoped resolution + snapshots + approval flow + OpenFeature integration |
Promotion-gate criteria (S19 closure)¶
Recover the pre-#891 baseline AND demonstrate the harness path matches or exceeds it :
| Metric | pre-#891 baseline | S19 closure gate | Method |
|---|---|---|---|
| f1_buy (defi_top5 5m ATR0.5_1.5_H4) | ~0.42 on all 3 model types | ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types | FTF canary, 5 folds, bootstrap CI |
| best_iteration (LightGBM) | 100-500 typical | > 50 median | Loki event=training_complete query |
| event=hpo_fallback_applied rate | n/a (pre-ADR-90) | 0 over 7 days | Grafana cvntrade-hp-coverage dashboard |
| Trades per fold (XGB) | 30-100 | ≥ 30 mean across folds | finetune_results PG query |
Only when ALL four gates pass on the canary FTF does the operator unfreeze harness-trained model promotion (§2). If S18 returns ABANDON, the promotion-freeze persists indefinitely and the operator pivots to revert.
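The four closure gates can be combined into a single check, sketched below. The input shapes are assumptions : `f1_buy` maps each model type to per-crypto scores, the iteration and trade lists hold one value per trial or fold, and the fallback count covers the 7-day window. Function names are hypothetical, and in practice the values would come from the FTF canary, Loki and the finetune_results queries named in the table.

```python
from statistics import mean, median
from typing import Dict, Mapping, Sequence

F1_FLOOR = 0.40
MIN_CRYPTOS_PASSING = 4
REQUIRED_MODEL_TYPES = 3


def s19_gates(f1_buy: Mapping[str, Mapping[str, float]],
              lgb_best_iterations: Sequence[int],
              fallback_events_7d: int,
              xgb_trades_per_fold: Sequence[int]) -> Dict[str, bool]:
    """Evaluate each closure gate independently so failures are visible."""
    f1_ok = len(f1_buy) == REQUIRED_MODEL_TYPES and all(
        sum(score >= F1_FLOOR for score in per_crypto.values()) >= MIN_CRYPTOS_PASSING
        for per_crypto in f1_buy.values()
    )
    return {
        "f1_buy": f1_ok,
        "best_iteration": median(lgb_best_iterations) > 50,
        "hpo_fallback": fallback_events_7d == 0,
        "trades_per_fold": mean(xgb_trades_per_fold) >= 30,
    }


def may_unfreeze(gates: Mapping[str, bool]) -> bool:
    """Promotion is unfrozen only when every gate passes."""
    return all(gates.values())
```

Returning one boolean per gate rather than a single verdict keeps the failing criterion visible in the canary report.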
11. ADR Compliance¶
| ADR | Topic | V1 | V3 (target) |
|---|---|---|---|
| ADR-2 | No auto-promotion | COMPLIANT | COMPLIANT + trade count gate |
| ADR-4 | Cache explicit | COMPLIANT | COMPLIANT |
| ADR-14 | Multi-fold evaluation | COMPLIANT | COMPLIANT (5 folds) |
| ADR-15 | Theta calibrated OOS | COMPLIANT | COMPLIANT + all params OOS |
| ADR-16 | Labels coherent | COMPLIANT | COMPLIANT |
| ADR-23 | Features version-pinned | COMPLIANT | COMPLIANT |
| ADR-24 | Feature set = model contract | COMPLIANT | COMPLIANT |
| ADR-25 | No silent fallback | PARTIAL | COMPLIANT (all fallbacks documented + alerted) |
| ADR-28 | Binary classification | COMPLIANT | COMPLIANT + FTF ablation (#517) |
| ADR-29 | Mandatory naive baseline | COMPLIANT | COMPLIANT + edge validation (§1b) |
| ADR-46 | Class balancing | NON-COMPLIANT | COMPLIANT (Fix B: enabled by default) |
| ADR-47 | Meta-label validation | COMPLIANT | COMPLIANT |
12. Appendix A — Historical Action Plan (FIX A-H, V1+V2 Committee Recos)¶
Historical action plan that motivated the harness refactor. The FIX A-H items below were the V1 + V2 committee deliverables that drove the V3 plan. Some are now closed (e.g. FIX B — class balancing — integrated into the harness class_balance node ; FIX A — CUSUM decoupling — applied at inference per ADR-46). Others (FIX D trading-centric HPO, FIX F MLOps controls) remain partially open. Do NOT use this as the current action plan — the current open work is in §10. This appendix is kept for the rationale and traceability it provides.
4. Committee Verdict & Action Plan¶
V1 Committee Result: REJECTED (STARVATION, 4.0/10)¶
8 recommendations received. ALL addressed below with concrete fixes.
Action Plan — 3 Structural Fixes + 5 Improvements¶
PRIORITY 1 — CRITICAL (blocks everything)
┌─────────────────────────────────────────────────────────────────┐
│ FIX A: Decouple CUSUM from training │
│ FIX B: Enable class balancing by default │
│ FIX C: Integrate realistic cost model into HPO │
└─────────────────────────────────────────────────────────────────┘
PRIORITY 2 — HIGH (multiplier on Fix A/B/C)
┌─────────────────────────────────────────────────────────────────┐
│ FIX D: Trading-centric HPO objective (Sortino-based) │
│ FIX E: Increase feature cap + regularization │
│ FIX F: MLOps operational controls │
└─────────────────────────────────────────────────────────────────┘
PRIORITY 3 — MEDIUM (validation & exploration)
┌─────────────────────────────────────────────────────────────────┐
│ FIX G: Test CUSUM/ML paradigm conflict │
│ FIX H: Binary classification (#517) │
└─────────────────────────────────────────────────────────────────┘
5. FIX A: Decouple CUSUM from Training (committee reco #1, #7)¶
Problem¶
CUSUM filters 95% of training data at h=3.0σ. Result: 20,000 candles → 1,000 → split → Train: 700 samples. XGBoost on 700 samples with 30 features = severe overfitting, unreliable metrics.
Solution: Train on ALL data, CUSUM only at inference¶
CURRENT (broken):
OHLCV → Enrichment → Label → CUSUM FILTER → FE → Split → Train
▲
└── 95% data lost HERE
TARGET (fixed):
OHLCV → Enrichment → Label → FE → Split → Train (CUSUM REMOVED from training)
│
▼ at inference only:
CUSUM → Inference → Filters → Trade
Implementation¶
File: src/commun/cache/components/cvntrade_autonomous_fe.py
Current (~line 125):
# Apply CUSUM before split (issue #295)
filtered_df = self._apply_cusum_before_split(enriched_df)
X_train, X_val, X_test = self._temporal_split(filtered_df)
Target:
# Train on ALL data; CUSUM only at inference (committee reco #1)
cusum_training_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "disabled")
if cusum_training_mode in ("enabled", "relaxed_1_5", "legacy_3_0"):
    filtered_df = self._apply_cusum_before_split(enriched_df)
else:
    filtered_df = enriched_df  # NO CUSUM filtering for training
X_train, X_val, X_test = self._temporal_split(filtered_df)
New env var: CVN_CUSUM_TRAINING_MODE — canonical variants: {disabled, relaxed_1_5, legacy_3_0} (default: disabled)
FTF factor: cusum_training_mode with variants {disabled, relaxed_1_5, legacy_3_0} to A/B test the impact.
Expected Impact¶
| Metric | Before (h=3.0σ) | After (no CUSUM training) |
|---|---|---|
| Training samples | ~700 | ~14,000 (20×) |
| Val samples | ~150 | ~3,000 (20×) |
| Test samples | ~150 | ~3,000 (20×) |
| f1_buy | ~0.45 | Target: 0.55-0.65 |
| Trades per fold | 1-10 | Target: 30-100 |
CUSUM remains at inference¶
CUSUM is still applied at inference time (backtest candle loop, line 869). It gates which candles are processed by the ML model. The model is trained on ALL data but only makes predictions on CUSUM-validated candles.
Rationale: The model learns patterns from the full distribution but only acts during regime transitions. This resolves Hypothesis G (CUSUM/ML paradigm conflict) — the model is no longer trained on a biased subset.
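The gating behaviour described above can be illustrated with a minimal sketch of a symmetric CUSUM event filter. It assumes returns are already expressed in σ units, so the threshold h is a plain number. The function name and the toy series are hypothetical; the production filter in cvntrade_cusum_filter.py may differ.

```python
from typing import List, Sequence


def cusum_events(returns: Sequence[float], threshold: float) -> List[int]:
    """Return the indices at which a symmetric CUSUM filter fires.

    Two running sums track upward and downward drift. When either one
    crosses the threshold, the index is recorded and that sum is reset.
    """
    events: List[int] = []
    s_pos, s_neg = 0.0, 0.0
    for i, r in enumerate(returns):
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_pos > threshold:
            s_pos = 0.0
            events.append(i)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(i)
    return events


# Toy series in sigma units (hypothetical data).
returns_in_sigma = [0.5, 0.5, 2.5, -1.0, -1.0, -1.5, 0.2]
inference_candles = cusum_events(returns_in_sigma, threshold=3.0)

# Training would use every row; inference would only score the event rows.
print("train on", len(returns_in_sigma), "rows; infer on rows", inference_candles)
```

With this layout, changing h only changes how many candles reach the model at inference; it no longer changes the size of the training set.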
Paradigm Test (committee reco #7)¶
To validate this architectural change, run 3 variants via FTF:
| Variant | Training data | Inference CUSUM |
|---|---|---|
| `cusum_disabled` | ALL candles | YES (h=3.0) |
| `cusum_relaxed` | CUSUM h=1.5σ | YES (h=3.0) |
| `cusum_legacy` | CUSUM h=3.0σ | YES (h=3.0) |
Compare Sortino, f1_buy, trades/fold. If cusum_disabled wins → confirm architectural fix.
6. FIX B: Enable Class Balancing by Default (committee reco #2)¶
Problem¶
CVN_CLASS_BALANCING=0 (disabled). With 70% HOLD, 15% BUY, 15% SELL, the model learns to predict HOLD for safety. BUY recall = 0.25 (misses 75% of opportunities). ADR-46 non-compliant.
Solution¶
Change default to CVN_CLASS_BALANCING=1 in BASE_ENV (ablation_matrix.py).
File: src/commun/finetune/ablation_matrix.py
Effect: compute_class_weight("balanced") → BUY weight ~3.3×, SELL weight ~1.5×, HOLD weight ~0.7×.
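For reference, the "balanced" heuristic assigns each class the weight n_samples / (n_classes × class_count). The sketch below reimplements that formula in plain Python on a toy 70/15/15 split; the helper name and the label list are hypothetical. On this exact split the formula gives roughly 0.48 for HOLD and 2.22 for BUY and SELL, so weights measured on real label distributions will differ from the toy values.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy label vector matching the 70% HOLD / 15% BUY / 15% SELL split.
labels = ["HOLD"] * 70 + ["BUY"] * 15 + ["SELL"] * 15
weights = balanced_class_weights(labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.2f}")
```

Per ADR-46 and the leakage audit, the weights should be computed on the training labels only.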
Expected Impact¶
| Metric | Before (no balance) | After (balanced) |
|---|---|---|
| recall_buy | ~0.25 | Target: 0.40-0.55 |
| precision_buy | ~0.55 | May decrease to 0.45-0.50 |
| f1_buy | ~0.45 | Target: 0.50-0.55 |
| action_rate | ~0.15 | Target: 0.20-0.35 |
Trade-off: More BUY predictions → more trades → better statistical power BUT potentially more false positives. The filters (trend, meta-label, regime) catch false positives.
7. FIX C: Realistic Cost Model in HPO (committee reco #3)¶
Problem¶
HPO optimizes precision_recall_auc which ignores transaction costs entirely. A model with f1_buy=0.50 and 30 trades at 15bps cost may have negative expectancy. HPO doesn't know.
Solution: Cost-aware HPO objective¶
New objective: sortino_net — backtest Sortino after costs.
# In hyperoptimizer, after model training:
# 1. Quick backtest on validation set with cost model
# 2. Compute net Sortino
# 3. Use as HPO score
def _compute_sortino_net(model, X_val, y_val, ohlcv_val, cost_bps=15):
    """Quick backtest for HPO scoring — net of costs."""
    predictions = model.predict(X_val)
    trades = _simulate_trades(predictions, ohlcv_val, cost_bps)
    return _sortino_ratio(trades)
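The helpers referenced above are not shown in this document. As a rough illustration, a per-trade Sortino ratio net of a flat cost could look like the sketch below. It assumes one cost charge per round-trip trade, a zero target return and no annualization; both function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def net_returns(gross_returns: Sequence[float], cost_bps: float = 15.0) -> List[float]:
    """Subtract a flat per-trade cost (in basis points) from each gross return."""
    cost = cost_bps / 10_000
    return [g - cost for g in gross_returns]


def sortino_ratio(trade_returns: Sequence[float], target: float = 0.0) -> Optional[float]:
    """Mean excess return divided by downside deviation.

    Returns None when there are no trades or no downside, so the caller
    can decide how to score those cases.
    """
    n = len(trade_returns)
    if n == 0:
        return None
    mean_excess = sum(r - target for r in trade_returns) / n
    downside_sq = [min(0.0, r - target) ** 2 for r in trade_returns]
    downside_dev = math.sqrt(sum(downside_sq) / n)
    if downside_dev == 0.0:
        return None
    return mean_excess / downside_dev


gross = [0.020, -0.010, 0.015, -0.005]  # hypothetical per-trade returns
print("gross Sortino:", sortino_ratio(gross))
print("net Sortino:  ", sortino_ratio(net_returns(gross, cost_bps=15.0)))
```

Returning None rather than 0.0 keeps "no trades" distinguishable from "flat performance" until the HPO objective decides how to penalize it.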
Cost components:
- Base fee: CVN_TRADE_FEE_BPS (default 15)
- Slippage: base_bps + impact × √(size/volume) (from cost_model.py)
- Funding rate: CVN_FUNDING_RATE_BPS × expected hold duration (from Binance API)
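The three components can be combined as in the sketch below. This is an illustration of the formula listed above, not the contents of cost_model.py: the function name, parameter names and all numeric values are assumptions.

```python
import math


def trade_cost_bps(
    fee_bps: float,
    base_slippage_bps: float,
    impact_coeff_bps: float,
    size: float,
    volume: float,
    funding_rate_bps: float,
    hold_periods: float,
) -> float:
    """Total cost of one trade in basis points.

    slippage = base_bps + impact * sqrt(size / volume)
    funding  = funding rate per period * expected number of periods held
    """
    if volume <= 0:
        raise ValueError("volume must be positive")
    slippage = base_slippage_bps + impact_coeff_bps * math.sqrt(size / volume)
    funding = funding_rate_bps * hold_periods
    return fee_bps + slippage + funding


# Hypothetical trade: order equal to 1% of candle volume, held half a funding period.
cost = trade_cost_bps(
    fee_bps=15.0,
    base_slippage_bps=2.0,
    impact_coeff_bps=100.0,
    size=10_000.0,
    volume=1_000_000.0,
    funding_rate_bps=1.0,
    hold_periods=0.5,
)
print(f"total cost: {cost:.2f} bps")
```

The square-root term makes cost grow sub-linearly with order size, which is why a flat 15 bps assumption understates cost for large orders in thin markets.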
FTF factor: hpo_objective already has variants {fbeta_buy, precision_recall_auc, f1_macro}. Add sortino_net.
Implementation¶
File: src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py
Add new objective handler:
elif objective == "sortino_net":
    from commun.config.cost_model import compute_trade_cost
    # Quick backtest on validation fold
    net_sortino = _quick_backtest_sortino(model, datasets, cost_bps=15)
    if net_sortino is None or np.isnan(net_sortino):
        return 0.0
    # Scale by 10 and clamp to [0, 1]: negative Sortino scores 0, Sortino >= 10 scores 1
    return min(1.0, max(0.0, net_sortino / 10.0))
8. FIX D: Trading-Centric HPO & Relaxed Guards (committee reco #4)¶
Problem¶
HPO guards kill promising trials:
- action_rate < 0.08 → score = 0 (too few BUY predictions)
- action_rate > 0.60 → score = 0 (too many BUY predictions)
- recall_buy < 0.15 → score = 0 (not enough BUY detection)
These guards constrain the solution space. With class balancing (Fix B), the model will naturally have higher action_rate.
Solution¶
Relax guards and shift to trading-centric scoring:
Current guards:
if action_rate < 0.08 or action_rate > 0.60: return 0.0
if recall_buy < 0.15: return 0.0
Target guards (relaxed):
if action_rate < 0.03 or action_rate > 0.80: return 0.0 # Much wider
if recall_buy < 0.05: return 0.0 # Minimal floor
Objective shift: From classification-centric to trading-centric:
| Objective | Current | Target |
|---|---|---|
| Primary metric | precision × 0.45 + AUC × 0.35 + (1-logloss) × 0.20 | Sortino net (after costs) |
| Guard: action_rate | [0.08, 0.60] | [0.03, 0.80] |
| Guard: recall_buy | > 0.15 | > 0.05 |
| Guard: n_trades | none | > 10 (minimum trades for valid backtest) |
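The two guard sets in the table can be compared side by side with a small sketch. The dataclass, constant names and function name are hypothetical, and the thresholds are inlined only for illustration; in the current pipeline such values would be resolved from configuration per ADR-90 rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GuardConfig:
    min_action_rate: float
    max_action_rate: float
    min_recall_buy: float
    min_trades: Optional[int]  # None = no trade-count guard


CURRENT_GUARDS = GuardConfig(0.08, 0.60, 0.15, None)
TARGET_GUARDS = GuardConfig(0.03, 0.80, 0.05, 10)


def guarded_score(raw_score: float, action_rate: float, recall_buy: float,
                  n_trades: int, guards: GuardConfig) -> float:
    """Return raw_score, or 0.0 if any guard rejects the trial."""
    if action_rate < guards.min_action_rate or action_rate > guards.max_action_rate:
        return 0.0
    if recall_buy < guards.min_recall_buy:
        return 0.0
    # The table requires strictly more than min_trades trades.
    if guards.min_trades is not None and n_trades <= guards.min_trades:
        return 0.0
    return raw_score


# A conservative trial: few BUY predictions, modest recall, 25 trades.
trial = dict(raw_score=0.7, action_rate=0.05, recall_buy=0.10, n_trades=25)
print("current guards:", guarded_score(guards=CURRENT_GUARDS, **trial))
print("target guards: ", guarded_score(guards=TARGET_GUARDS, **trial))
```

The same trial is zeroed by the current guards and kept by the relaxed ones, which is the widening of the solution space that FIX D aims for.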
9. FIX E: Feature Selection Strategy (committee reco #5)¶
Problem¶
Adaptive cap: min(80, max(10, n_features / 20)) → typically 30-50 features from 300+. Too aggressive — removes signal.
Solution¶
- Increase cap: `min(150, max(30, n_features / 10))` → typically 80-150 features
- Rely on XGBoost regularization instead of pre-selection:
  - `reg_alpha` (L1) already in HPO range [0.01, 5.0]
  - `reg_lambda` (L2) already in HPO range [0.5, 7.0]
  - `max_depth` [3, 8] limits tree complexity
  - XGBoost handles feature selection internally via `feature_importance`
- OOS feature selection (strict): Feature importance computed ONLY on training fold, applied to val/test
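To make the effect of the cap change concrete, the sketch below evaluates both formulas for a few raw feature counts. Integer division and the function names are assumptions; the document does not say how the division is rounded.

```python
def legacy_cap(n_features: int) -> int:
    """Adaptive cap from the Problem statement: min(80, max(10, n / 20))."""
    return min(80, max(10, n_features // 20))


def proposed_cap(n_features: int) -> int:
    """Proposed cap: min(150, max(30, n / 10))."""
    return min(150, max(30, n_features // 10))


for n in (300, 600, 1000, 1500):
    print(f"n_features={n:5d}  legacy={legacy_cap(n):3d}  proposed={proposed_cap(n):3d}")
```

Under these formulas, the quoted 30-50 range for the legacy cap corresponds to roughly 600-1,000 raw features, and the 80-150 range for the proposed cap to roughly 800-1,500.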
Implementation:
- src/commun/cache/components/cvntrade_autonomous_fe.py: Change cap formula
- BASE_ENV: CVN_MAX_FEATURES=0 (0 = auto) → keep, but change auto formula
- FTF factor n_features already tests {top_30, top_50, top_100, full}
Expected Impact¶
More features → more signal → BUT only if regularization prevents overfitting. With Fix A (20× more training data), overfitting risk is dramatically reduced. 700 samples + 150 features = overfit. 14,000 samples + 150 features = healthy ratio (93:1).
10. FIX F: MLOps Operational Controls (committee reco #6)¶
Kill-Switch¶
| Mechanism | Scope | Control |
|---|---|---|
| `CVN_FTF_ENABLED=0` | All FTF runs | Helm ConfigMap |
| DAG pause | Stop specific DAG | Airflow UI |
| Model rollback | Revert to previous model | MLflow: promote previous Production stage |
| Filter disable | Bypass specific filter | CVN_USE_{FILTER}=0 |
Live Observability¶
| What | Where | Alert |
|---|---|---|
| Model f1/Sortino per run | Grafana FTF dashboard (§10 drift) | >10% drop → warning |
| CUSUM pass rate | Grafana infra dashboard | <1% → too aggressive |
| Funnel survival rate | Grafana FTF funnel panel | <5% → starvation |
| Training sample count | Grafana FTF stats | >15% drop → data issue |
| Action rate drift | Grafana FTF ML metrics | >20% change → model shift |
Drift Detection¶
| Type | Method | Trigger |
|---|---|---|
| Data drift | PSI on top features (30d rolling) | PSI > 0.2 |
| Concept drift | F1/Sortino trend across runs | >10% drop from 7d mean |
| Label drift | BUY/HOLD/SELL ratio shift | >20% change |
| Behavior drift | Action rate, filter block rates | >20% change |
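The data-drift row relies on the Population Stability Index. A minimal sketch follows, assuming equal-width bins derived from the baseline window and a small floor to avoid log(0); the production binning (for example quantile bins) may differ, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), on bin shares."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]   # hypothetical 30d feature values
shifted = [v + 0.5 for v in baseline]      # same shape, shifted upward
print("PSI baseline vs itself: ", population_stability_index(baseline, baseline))
print("PSI baseline vs shifted:", population_stability_index(baseline, shifted))
```

A feature would raise the data-drift trigger when its PSI exceeds the 0.2 threshold from the table.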
Staged Rollout¶
| Stage | Environment | Duration | Gate |
|---|---|---|---|
| FTF ablation | Backtest | 2-3h/factor | BH p < 0.05 |
| Committee review | Document | 1 session | Score ≥ 8 |
| Shadow mode | Paper trading | 7 days | No degradation |
| Canary | Live 10% capital | 14 days | Sortino ≥ baseline |
| Full rollout | Live 100% | — | Monitoring continues |
11. FIX G: CUSUM/ML Paradigm Test (committee reco #7)¶
Experimental Design¶
3-arm study via FTF cusum_training_mode factor:
| Arm | Training CUSUM | Inference CUSUM | Hypothesis |
|---|---|---|---|
| A: `disabled` | None (all data) | h=3.0σ | ML learns full distribution, CUSUM gates inference |
| B: `relaxed` | h=1.5σ | h=3.0σ | Partial filter, more data than current |
| C: `legacy` | h=3.0σ | h=3.0σ | Current baseline |
Success Criteria¶
| Metric | Minimum acceptable | Target |
|---|---|---|
| f1_buy | 0.50 | 0.60 |
| Sortino (15bps) | 1.0 | 2.0 |
| Trades per fold | 20 | 50+ |
| Powered variants | all 3 | all 3 |
Expected Outcome¶
Arm A should dominate:
- 20× more training data → better model generalization
- CUSUM at inference → same signal quality (only regime-change trades)
- f1_buy improvement: +0.10-0.20 (from 0.45 to 0.55-0.65)
If Arm A does NOT dominate → fundamental signal-to-noise issue (Hypothesis F). Requires alternative approaches (different features, different model architecture, different timeframe).
12. FIX H: Binary Classification (committee reco #8)¶
Already implemented as FTF factor classification_mode (issue #517):
| Variant | Config | Expected Impact |
|---|---|---|
| `3class` (baseline) | SELL/HOLD/BUY | Current performance |
| `binary_balanced` | BUY/NOT_BUY, logloss_auc w=0.5 | +precision (focused decision boundary) |
| `binary_precision` | BUY/NOT_BUY, logloss_auc w=0.7 | +precision (calibration-biased) |
Rationale: 3-class wastes model capacity learning SELL patterns we don't act on (LdP pipeline only opens long positions). Binary focuses 100% of capacity on the BUY decision.
13. Appendix B — V2 Stricter Recos (historical)¶
13. Walk-Forward Leakage Prevention (V2 reco #5)¶
Every component in the pipeline must be strictly in-sample for each walk-forward fold:
Fold k timeline:
├── [ Train window ]──[Purge]──[ Val ]──[Emb]──[ Test ]
│ FE fitted here only 10 bars 15% 5 bars 15%
│ CUSUM sigma calibrated here (lagged 500 bars)
│ Feature selection on train only
│ Class weights computed on train only
│ Thresholds optimized on val only
│ Test: NEVER seen during training or tuning
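The fold layout above can be expressed as index ranges. The sketch below uses the purge, embargo and split fractions from the timeline; the function and type names are hypothetical and this is not the harness split implementation.

```python
from typing import NamedTuple


class FoldIndices(NamedTuple):
    train: range
    val: range
    test: range


def purged_split(n_rows: int, val_frac: float = 0.15, test_frac: float = 0.15,
                 purge_bars: int = 10, embargo_bars: int = 5) -> FoldIndices:
    """Chronological train/val/test split with a purge gap and an embargo gap."""
    n_val = round(n_rows * val_frac)
    n_test = round(n_rows * test_frac)
    test_start = n_rows - n_test
    val_end = test_start - embargo_bars   # embargo sits between val and test
    val_start = val_end - n_val
    train_end = val_start - purge_bars    # purge sits between train and val
    if train_end <= 0:
        raise ValueError("not enough rows for the requested split")
    return FoldIndices(range(0, train_end), range(val_start, val_end),
                       range(test_start, n_rows))


fold = purged_split(20_000)
print("train rows:", len(fold.train), "val rows:", len(fold.val), "test rows:", len(fold.test))

# Anything that is fitted (imputer, scaler, feature selection, class weights)
# should only ever see rows in fold.train.
```

Dropping the rows in the two gaps prevents labels whose TP/SL window straddles a boundary from leaking information into the next segment.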
Per-Component Leakage Audit¶
| Component | Leakage vector | Current status | V3 fix |
|---|---|---|---|
| Labels | Future prices in TP/SL window | SAFE — window T+1..T+H, anti-look-ahead | — |
| Enrichment | Indicators use future data | SAFE — all backward-looking (SMA, RSI, etc.) | — |
| CUSUM sigma | Fit on full dataset | TO FIX — sigma sees test data volatility | Lagged window: train_start - 500 bars |
| FE pipeline (imputer, scaler, stationarizer) | Fit on val/test data | SAFE — fitted on train split only | — |
| Feature selection | Importance on full dataset | TO VERIFY — may see test features | Force: feature_importance(X_train, y_train) only |
| Class weights | Computed on val/test | SAFE — compute_class_weight(y_train) | — |
| HPO | Optimizes on test | SAFE — optimizes on val, test is holdout | — |
| Walk-forward thresholds | Threshold optimized on test | SAFE — walk-forward uses val for threshold, test for evaluation | — |
| Meta-label model | Trained on same fold | SAFE — ADR-47 requires separate fold | — |
Verification Protocol¶
Before each sprint:
1. Code audit: grep -n "X_test\|y_test" src/training/ — no test data in training
2. FE pipeline: verify pipeline.fit(X_train) not pipeline.fit(X_full)
3. Feature selection: verify importance computed on (X_train, y_train) only
4. CUSUM sigma: verify sigma_window_end < train_start - purge_bars
14. Uncertainty Quantification (V2 reco #8)¶
All key metrics must be reported with confidence intervals to quantify statistical reliability.
Methods¶
| Metric | CI Method | Implementation |
|---|---|---|
| f1_buy | Bootstrap (10,000 resamples) | ablation_stats.py:bootstrap_ci() — already implemented |
| Sortino | Bootstrap (10,000 resamples) | Same — already implemented |
| Total return | Bootstrap | Same — already implemented |
| Win rate | Wilson score interval | New — robust for small n and for rates near 0 or 1 |
| Expectancy | Bootstrap on trade PnL distribution | New |
| Trades per fold | Poisson CI | New — for count data |
| Survival rate | Binomial CI (`proportion_confint`) | New |
| Block rate per filter | Binomial CI | New |
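Two of the methods in the table are sketched below: a percentile bootstrap for a mean and a Wilson score interval for a proportion. These are illustrative stand-ins with hypothetical names and toy data, not the code in ablation_stats.py.

```python
import math
import random
from statistics import NormalDist
from typing import Sequence, Tuple


def percentile_bootstrap_ci(values: Sequence[float], n_resamples: int = 10_000,
                            alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def wilson_ci(wins: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Wilson score interval for a win rate of wins / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


trade_pnl = [0.012, -0.008, 0.021, -0.004, 0.015, -0.011, 0.009, 0.003]  # toy data
print("expectancy CI:", percentile_bootstrap_ci(trade_pnl))
print("win-rate CI:  ", wilson_ci(wins=5, n=8))
```

With so few trades both intervals are wide, which is the situation the trade-count gates in the next section are meant to catch.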
Reporting Standard¶
Every metric in FTF reports and Grafana dashboards includes:
If CI includes 0 → metric is not statistically distinguishable from zero. Flag in report.
If CI width > mean → metric is unreliable. Flag as "WIDE CI — insufficient data".
BH Correction for Multiple Comparisons¶
Already implemented: ablation_stats.py:benjamini_hochberg() applied to all pairwise variant comparisons. Controls false discovery rate at 5%.
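For readers unfamiliar with the procedure, a minimal sketch of the Benjamini-Hochberg step-up rule is shown here. It is an illustration with a hypothetical function name, not the implementation in ablation_stats.py.

```python
from typing import List, Sequence


def bh_reject(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up: flag which hypotheses are rejected.

    Sort p-values ascending, find the largest rank k with
    p_(k) <= (k / m) * fdr, and reject every hypothesis ranked <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise variant comparisons.
pairwise_p = [0.01, 0.03, 0.035, 0.9]
print(bh_reject(pairwise_p))
```

Note the step-up behaviour: the second comparison misses its own threshold but is still rejected because a higher-ranked p-value meets its threshold.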
15. Minimum Trade Count Gates (V2 reco #7)¶
Problem¶
With 1-10 trades per fold, ALL metrics are unreliable. A single trade can swing Sortino from 0 to 100.
Trade count thresholds (committed)¶
| Threshold | Count | Purpose | Action if below |
|---|---|---|---|
| Statistical minimum | 30 | Minimum for bootstrap CI to be meaningful | Flag: "UNDERPOWERED — CI unreliable" |
| Power analysis | 63 | d=0.5, α=5%, power=80% (from `compute_min_sample_size()`) | Flag: "INSUFFICIENT POWER for pairwise comparison" |
| Production minimum | 100 | Robust strategy evaluation | Gate: do not promote to production |
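The 63-trade figure is consistent with the standard normal-approximation formula for a two-sided, two-sample comparison, n = 2 × ((z₁₋α/₂ + z_power) / d)². The sketch below reproduces it and maps a trade count to the three thresholds; it is not the body of compute_min_sample_size(), and the function names and status strings are hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size, normal approximation, two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def trade_count_status(n_trades: int, stat_min: int = 30, power_min: int = 63,
                       prod_min: int = 100) -> str:
    """Map a trade count to the strictest threshold it fails (labels are hypothetical)."""
    if n_trades < stat_min:
        return "UNDERPOWERED"
    if n_trades < power_min:
        return "INSUFFICIENT_POWER"
    if n_trades < prod_min:
        return "NOT_PROMOTABLE"
    return "OK"


print("min trades for d=0.5:", min_sample_size(0.5))
for n in (8, 45, 80, 120):
    print(n, trade_count_status(n))
```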
Implementation¶
- FTF report: Already flags underpowered variants. Strengthen: exclude < 30 trades from variant ranking.
- Grafana: `n_trades >= 3` filter (outlier protection). Add panel showing trade count per variant.
- Promotion gate: Model cannot be promoted to Production in MLflow if any evaluation fold has < 30 trades.
Expected post-Fix-A trade counts¶
With CUSUM removed from training (Fix A):
- Training: 14,000 samples (20× current)
- Model sees more examples → better action_rate calibration
- Expected: 30-100 trades per fold (vs 1-10 current)
16. Operational Controls & Runbooks (V2 reco #6)¶
Kill-Switch Hierarchy¶
| Level | Mechanism | Scope | Activation | Response time |
|---|---|---|---|---|
| L0: Emergency halt | `CVN_FTF_ENABLED=0` in Helm | All FTF + trading | Helm deploy | <5 min |
| L1: DAG pause | Airflow UI → pause DAG | Specific DAG | Immediate | <1 min |
| L2: Model rollback | MLflow: promote previous Production | One crypto | Manual | <10 min |
| L3: Filter bypass | `CVN_USE_{FILTER}=0` in Helm | One filter | Helm deploy | <5 min |
Configuration Audit Trail¶
Every config change is traceable:
- Git: Helm values changes → PR → CodeRabbit → merge (audit via git log)
- MLflow: Model promotion → registry (version, timestamp, user)
- Committee: Design decisions → committee/sessions/*.json (ADR-52)
- Airflow: DAG runs → execution logs (who triggered, when, params)
Rollback Procedures¶
| Scenario | Steps | RTO |
|---|---|---|
| Bad model deployed | 1. Pause DAG 2. MLflow: promote previous version 3. Verify in paper trading | 15 min |
| Bad config deployed | 1. `git revert` the Helm values PR 2. CI/CD auto-deploy 3. Verify pods restarted | 10 min |
| Pipeline data corruption | 1. Identify bad run_id 2. `DELETE FROM finetune_results WHERE run_id = X` 3. Re-run | 30 min |
| Full system failure | 1. Emergency halt (L0) 2. Investigate 3. Fix + committee review 4. Staged re-deploy | 2-4h |
Runbooks (to create in documentation/RUNBOOKS/)¶
| Runbook | Trigger | First 3 checks |
|---|---|---|
| `MODEL_DEGRADATION.md` | f1/Sortino drops >10% | 1. Data freshness 2. Feature drift (PSI) 3. Label distribution |
| `STARVATION.md` | Survival rate < 5% | 1. CUSUM threshold 2. Action rate 3. Filter block rates |
| `HPO_FAILURE.md` | HPO returns score=0 for all trials | 1. Action rate guards 2. Class distribution 3. Feature count |
| `DATA_PIPELINE.md` | n_train_samples drops >15% | 1. Binance API status 2. ETL logs 3. Date range |
| `COST_SPIKE.md` | Avg cost > 50bps | 1. Market liquidity 2. Funding rate 3. Slippage model params |
14. Appendix C — Market Hypothesis & Edge (V2 reco #9)¶
What edge are we exploiting?¶
The system targets short-term mean-reversion at regime transition points in DeFi altcoin markets.
Thesis: When a DeFi altcoin's volatility regime shifts (detected by CUSUM), the initial price reaction overshoots. The ML model identifies overshooting candles where the probability of a profitable mean-reversion trade (TP hit before SL) exceeds the breakeven probability after costs.
Why this edge should exist¶
- DeFi altcoin microstructure: Lower liquidity than BTC/ETH → larger overreactions to regime shifts → mean-reversion opportunity.
- Retail-dominated order flow: DeFi tokens have higher retail participation → predictable behavioral patterns (panic selling on vol spikes, FOMO buying on breakouts).
- CUSUM as regime detector: CUSUM identifies structural breaks in volatility — these are exactly the moments when market participants overreact and create temporary mispricings.
- Triple barrier as target: The ATR-based SL/TP captures the mean-reversion: TP is set at the "fair" reversion level, SL limits the cost of being wrong.
Why it hasn't worked yet¶
The edge exists in principle but the pipeline has structural issues that prevent the model from capturing it:
- CUSUM during training eliminates 95% of examples → model can't learn the full distribution of regime transitions
- Class imbalance → model defaults to HOLD instead of predicting BUY at transition points
- HPO objective optimizes classification accuracy, not trading profit → model maximizes F1 but generates few trades
- Feature cap removes market microstructure features (volume patterns, order flow proxies) that capture the edge
How we validate the edge¶
- Statistical test: Sortino of our model > Sortino of random entry at same CUSUM-filtered candles (ADR-29 baseline)
- Economic test: Net expectancy per trade > 0 after costs at 30 bps
- Robustness test: Edge persists across 5 cryptos, 5 folds, 3 cost scenarios
- Decay test: Edge doesn't degrade significantly over recent folds (concept drift monitoring)
Success criteria for edge validation¶
| Metric | Minimum | Target | Method |
|---|---|---|---|
| Sortino (15bps) | > 1.0 | > 2.5 | FTF ablation, 5 folds |
| Net expectancy/trade | > 0 | > 0.5% | After costs, per trade |
| vs random baseline | > 1.5× | > 3× | ADR-29 comparison |
| Edge stability (fold variance) | CV < 100% | CV < 50% | Cross-fold coefficient of variation |
| Trades per fold | > 30 | > 50 | Statistical power for evaluation |
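The edge-stability row uses the cross-fold coefficient of variation. A minimal sketch, assuming the sample standard deviation and hypothetical per-fold Sortino values:

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(values) / abs(m)


fold_sortinos = [1.2, 1.8, 0.9, 2.1, 1.5]  # hypothetical, one value per fold
cv = coefficient_of_variation(fold_sortinos)
print(f"CV = {cv:.1%}  (minimum gate: < 100%, target: < 50%)")
```

A low CV indicates that the edge does not depend on one lucky fold, which is what the robustness test is meant to establish.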
15. Appendix D — Pre-Harness Implementation Roadmap (historical)¶
Sprint 1: Critical Fixes — DEADLINE: 1 week¶
| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| A: Decouple CUSUM | `cvntrade_autonomous_fe.py` + env var | 20 lines | 20× training data | FTF run confirms >5000 train samples |
| B: Enable class balancing | `ablation_matrix.py` BASE_ENV | 1 line | +0.10 f1_buy | ADR-46 compliant |
| C: Cost model validation | Verify cost_model.py integrated | 0 (audit) | Cost-aware eval | Expectancy computed in backtest |
| FTF factors | Add `cusum_training_mode` | 15 lines | A/B test paradigm | Factor appears in DAG dropdown |
| Leakage audit | Code review per §13 checklist | 2h | Experimental validity | All items SAFE or FIXED |
Sprint 2: HPO & Features — DEADLINE: 1 week¶
| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| D: `sortino_net` objective | `XGBoost_hyperoptimizer.py` | 40 lines | Trading-centric HPO | HPO uses backtest Sortino as score |
| D: Relax guards | `XGBoost_hyperoptimizer.py` | 5 lines | Wider solution space | action_rate [0.03, 0.80] |
| E: Increase feature cap | `cvntrade_autonomous_fe.py` | 5 lines | More signal | 100-150 features available |
| Uncertainty CIs | `ablation_report.py` | 30 lines | Statistical rigor | All metrics reported with CI |
Sprint 3: Validation & Operations — DEADLINE: 1 week¶
| Fix | Change | Effort | Impact | Gate |
|---|---|---|---|---|
| G: Paradigm test (3-arm) | FTF run `cusum_training_mode` | 3h compute | Validate Fix A | Arm A Sortino > Arm C |
| H: Binary classification | FTF run `classification_mode` | 3h compute | Validate binary | Compare to 3-class baseline |
| F: MLOps controls | Grafana alerts + 5 runbooks | 2 days | Operational safety | Runbooks reviewed |
| Edge validation | Compare to random baseline (ADR-29) | 1h analysis | Confirm economic edge | Sortino > 1.5× random |
Success Gate — Hard Criteria¶
After Sprint 3, ALL must be met to proceed to production:
| Metric | Current | Gate (minimum) | Target | Method |
|---|---|---|---|---|
| f1_buy | 0.45 | ≥ 0.50 | 0.60 | FTF ablation, bootstrap CI |
| Sortino (15bps) | 1.5 | ≥ 1.5 | 2.5 | FTF, CI excludes 0 |
| Trades per fold | 1-10 | ≥ 30 | 50+ | All folds, all cryptos |
| recall_buy | 0.25 | ≥ 0.30 | 0.45 | FTF ablation |
| Net expectancy/trade | unknown | > 0 | > 0.5% | After 15bps costs |
| vs random baseline | unknown | > 1.5× | > 3× | ADR-29 comparison |
| Powered variants | rare | all | all | Power analysis (63 trades) |
| Edge stability (CV) | unknown | < 100% | < 50% | Cross-fold Sortino variance |
If gate not met: Escalate to architectural review. Options:
1. Alternative models (LightGBM, transformer, LSTM)
2. Alternative features (order book, funding rate dynamics, cross-asset)
3. Alternative timeframes (1h, 4h)
4. Alternative strategy (momentum instead of mean-reversion)
16. Files Reference¶
| File | Lines | Purpose |
|---|---|---|
| `src/ETL/cvntrade_external_data.py` | 20-132 | Binance API, funding rates |
| `src/ETL/cvntrade_label.py` | 188-290 | Triple barrier labeling |
| `src/commun/pipeline/enrichment_api.py` | 82-90 | OHLCV → indicators |
| `src/commun/pipeline/feature_engineering_api.py` | 1-150 | FE pipeline |
| `src/commun/cache/components/cvntrade_autonomous_fe.py` | 125-200 | CUSUM + split + FE (Fix A target) |
| `src/backtest/filters/cvntrade_cusum_filter.py` | 279-409 | CUSUM algorithm |
| `src/training/harness/__init__.py` | — | Harness public API (train_one, train_ensemble, run_hpo, train_with_fixed_params) |
| `src/training/harness/contracts.py` | — | Typed payloads (Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion) |
| `src/training/harness/registry.py` | — | Plugin registries + register_model / register_ensemble decorators |
| `src/training/harness/dags/models/{xgboost,lightgbm,catboost}_dag.py` | — | One file = one model. HPO search space + Hamilton DAG nodes |
| `src/training/harness/dags/ensembles/{blend_avg,stack_logreg,stack_xgb_meta}_dag.py` | — | One file = one ensemble |
| `src/training/harness/adapters/{xgb,lgb,cb,ensemble}.py` | — | Per-model predict_proba shims (Protocol implementations) |
| `src/training/harness/nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py` | — | Reused Hamilton-pure functions (the 85% shared code) |
| `src/training/harness/autonomous_model_trainer.py` | — | Generic cache-aware wrapper (any registered model) |
| `src/training/harness/autonomous_ensemble_trainer.py` | — | Generic cache-aware wrapper (any registered ensemble) |
| `src/training/cvntrade_autonomous_orchestrator.py` | 45-160 | Orchestration — dispatches to the two generic wrappers above |
| `src/commun/config/cost_model.py` | — | Non-linear slippage model |
| `src/commun/finetune/ablation_matrix.py` | 37-97 | BASE_ENV (Fix B target) |
| `src/commun/mlflow/cvntrade_mlflow_manager.py` | — | Model registry |
| `documentation/ADR.md` | 744-799 | ADR-43/44/45 (funnel) |