
Plan dossier — Track 11 : Ensemble diversity (LightGBM + CatBoost + XGBoost + LogReg stacking)

Date : 2026-05-01
Story : CVN-N001-EE-S06 (OP wp#45)
GH issue : #717
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2. Quick-win bundle Track 5 candidate (after Tracks 5/6 ABANDONED, Track 9 in FTF testing, Track 1 in code testing). Architecture tier (tier 4) : distinct lever from data tier (Track 1) + calibration tier (Track 9) + ABANDONED label/loss tiers.

Scope split (mirroring Track 1 PR #792 → follow-up pattern) : this Story's PR ships the runtime contract surface (FTF factor + guardrail + aggregator module + ensemble inference orchestrator + 68 tests + MLOps readiness + runbook). The autotrainer dispatcher + MLflow registry pattern + InferenceAPI auto-routing + production kill-switch wiring land in the Track 11 follow-up PR (block A, see §11.6 below). Until the follow-up merges, models trained with CVN_ENSEMBLE_VARIANT=stack_* MUST NOT be deployed to inference.

Sequencing : per F1_BUY_BOOST_PLAN.md §5 Track 11, this could land as the 5th quick-win after Track 1 stabilises. Independent of Track 9 verdict (different lever).

1. Context — why now, why this lever

The cross-track lesson recorded in F1 plan §6 Outcomes : training signal manipulation (Tracks 5/6) is not productive. Track 9 (calibration) and Track 1 (data) are pending verdicts. Track 11 is the architecture tier — a different lever entirely : keep the input space + training signal + calibration unchanged, change the function class itself by averaging diverse model architectures.

CVNTrade trades altcoins via a single XGBoost model per crypto. XGBoost has known biases — axis-aligned splits, sensitivity to feature scaling, sequential tree fitting that can amplify early errors. The F1 plan §5 Track 11 hypothesises that diversifying across LightGBM (leaf-wise growth) + CatBoost (ordered boosting) + XGBoost snapshot ensemble (cyclical LR), then stacking via LogReg over [xgb_prob, lgb_prob, cb_prob], lifts f1_buy materially by reducing single-model variance.

Critical scope reduction discovered during infra exploration : the LightGBM + CatBoost adapters AND a StackingHPOAdapter (3 base models + LogReg meta) are already implemented in src/training/patterns/adapters/ (not greenfield as the F1 plan §5 Track 11 suggested). This Story is wiring + integration + sweep, not new trainers from scratch. Cost estimate revised : F1 plan said 4-5 days, realistic v2 estimate is 2-3 days.

2. Hypothesis (falsifiable)

A 3-model stacked ensemble lifts f1_buy materially over the single-XGBoost baseline at the current dataset / labelling regime. Specifically :

  • H0 (null) : mean(f1_buy | stacked) - mean(f1_buy | best_single) is indistinguishable from 0 (CI95 includes 0) → ABANDON.
  • H1 (alternative) : Δf1_buy ≥ +0.020 over the best single model (not the XGBoost baseline — see §5 gate refinement) with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.
  • H2 (architectural diversity gate, F1 plan §5 explicit) : no single base model dominates the stacking weights with > 80 % weight in the final LogReg coefficients. If a single model dominates → simplify back to that single model (lock the dominant variant, abandon the stack).

Falsifiable per the same gate criteria as Tracks 5 / 6 / 9 / 1, plus two Track-11-specific bars : the +0.020 lift is measured against the best single model (LGB / CB / XGB) not the XGBoost-only baseline (more honest benchmark), and the 80 % weight ceiling on any base model.
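The H1 bar above can be sketched as a small check over per-fold deltas. This is a minimal illustration, not the FTF reporter's code : the function names are hypothetical, the deltas are assumed to be paired (stacked minus best single, per fold), the CI is a plain percentile bootstrap, and Cohen's d is taken in its paired form (mean delta over the standard deviation of the deltas). The dossier does not specify which d variant the reporter uses.

```python
import random
import statistics


def bootstrap_ci95(deltas, n_resamples=5000, seed=0):
    """Percentile bootstrap CI95 for the mean of paired f1_buy deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


def cohens_d_paired(deltas):
    """Paired Cohen's d (assumption): mean delta / sample std of the deltas."""
    spread = statistics.stdev(deltas)
    if spread == 0:
        raise ValueError("all deltas identical: effect size undefined")
    return statistics.fmean(deltas) / spread


def h1_holds(deltas, per_crypto_improved, min_lift=0.020, min_d=0.3, min_cryptos=4):
    """True only if all three H1 conditions hold together."""
    lower, _upper = bootstrap_ci95(deltas)
    return (
        statistics.fmean(deltas) >= min_lift      # lift over best single model
        and lower > 0.0                           # CI95 excludes 0
        and sum(per_crypto_improved) >= min_cryptos
        and cohens_d_paired(deltas) >= min_d
    )
```

A near-zero set of deltas fails the first condition and falls back to the H0 outcome ; the 80 % weight ceiling (H2) is a separate check on the aggregator itself.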

3. Variant matrix

5 variants per the F1 plan §4.2 convention. Revised v2 after committee f33c73d4 BLOCKER #1 (LogReg meta overfitting risk at n=25 OOF samples × 15-dim features) — the 15-dim feature path is dropped ; the canonical stacking variant uses simple averaging (no learned weights, robust to small samples) ; LogReg shrinkage variant kept for comparison with strong L2 regularization on a reduced 3-dim feature set.

| Variant | Composition | Stacking | Notes |
|---|---|---|---|
| none (baseline) | XGBoost only (current production) | n/a | Reference |
| lgb_only | LightGBM only | n/a | Tests "is LGB alone better than XGB ?" ; single-model swap, not a stack |
| cb_only | CatBoost only | n/a | Same swap test |
| stack_3model_avg | XGB + LGB + CB | Fixed 1/3 average of [xgb_prob, lgb_prob, cb_prob] (no learned weights) | Canonical Track 11 v2 variant (committee f33c73d4 BLOCKER #1 fix : robust to n=25 OOF, no overfitting surface) |
| stack_3model_logreg_shrink | XGB + LGB + CB | LogReg meta on [xgb_prob, lgb_prob, cb_prob] (reduced 3-dim, NOT 15-dim) + L2 with floor alpha ≥ 1.0 (Optuna-tuned upward only) | Tests if learned weights beat fixed average. The 15-dim variant from v1 (25 samples / 15 features ≈ 1.7 samples per feature, far below the standard 10× rule) is explicitly abandoned as methodologically unsound |

5 variants. Per-fold aggregation : standard FTF protocol — bootstrap CI95 + Cohen's d + BH-corrected p-values per F1 plan §7.

Why simple averaging is the canonical v2 : at the current OOS sample sizes, the bias-variance tradeoff favours the parameter-free fixed-weight aggregator (no fitting, no overfitting). LogReg with L2 shrinkage stays in the matrix as a learned-weights comparison ; if stack_3model_logreg_shrink does NOT beat stack_3model_avg by ≥ +0.005 f1_buy, the simpler variant is locked (Occam's razor).

Why include lgb_only + cb_only standalone : if LightGBM or CatBoost alone outperforms the XGBoost baseline (likely given LGB's known edge on tabular), the simpler "swap XGB for LGB" path beats stacking on cost (1 model vs 3 + meta) AND complexity (no MLflow contract changes for stacking weights). The committee F1 plan v2 falsifiability bar (H2) explicitly demands this comparison.

Snapshot ensemble variant dropped from v2 : the F1 plan §5 mentioned XGBoost snapshot ensembles (cyclical LR + checkpoint averaging) as part of the canonical stack. Pulled out of v1 matrix per committee f33c73d4 (focus on the methodological fixes ; snapshot adds another dimension of complexity that's not the bottleneck). If the locked Track 11 variant still leaves f1_buy on the table, snapshot ensembles can be a follow-up Story under the same Epic.

4. Implementation path

4.1 Wiring the existing adapters (~1 day)

Per the explore findings, the adapters already exist :

  • src/training/patterns/adapters/cvntrade_lightgbm_adapter.py : CVNTrade_HPOInterface + CVNTrade_ModelInterface, returns {"config", "metrics", "trainer", "type": "lightgbm_direct"}
  • src/training/patterns/adapters/cvntrade_catboost_adapter.py : same shape, "type": "catboost_direct"
  • src/training/patterns/adapters/cvntrade_stacking_adapter.py : StackingHPOAdapter loads 3 base models, generates OOF meta-features (15 dim : 3 base models × 3 classes = 9 base preds, + 6 stats), trains LogReg via Optuna ; StackingModelAdapter wraps final training

What's MISSING (the actual scope of this Story) :

  1. FTF factor ensemble_diversity — register the 5 variants in commun/finetune/ablation_matrix.py so the FTF runner can sweep them
  2. Guardrail _validate_ensemble_diversity — env var validation per ADR-58
  3. Autotrainer routing : cvntrade_XGBoost_autonomous_trainer is the current single entry-point ; need a thin dispatcher that picks the right adapter (or adapters + stacking) based on CVN_ENSEMBLE_VARIANT
  4. MLflow registry pattern for stacked models — StackingModelAdapter outputs a bundle but the registry-name + artefact-list needs to be standardised
  5. InferenceAPI loading path for the stacked ensemble — currently single-model loader ; needs to handle the 3-base + meta architecture
  6. Fixed-average aggregator — trivial wrapper around [xgb_prob, lgb_prob, cb_prob] mean ; replaces the LogReg-meta path for the canonical stack_3model_avg variant
  7. 3-dim LogReg shrinkage path — strip StackingHPOAdapter's 15-dim feature builder down to the 3 base probabilities + L2 floor (alpha ≥ 1.0 lower bound passed to Optuna search space)
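Item 3 above (the thin dispatcher) can be pictured as a pure mapping from CVN_ENSEMBLE_VARIANT to a training plan. The sketch below is illustrative only : `resolve_training_plan` and the plan dictionary layout are hypothetical names, and the real dispatcher would hand the plan to the existing adapters rather than return strings.

```python
import os

# Variant -> single base model (no aggregator involved).
SINGLE_MODEL_VARIANTS = {
    "none": "xgboost",       # current production path, must stay bit-identical
    "lgb_only": "lightgbm",
    "cb_only": "catboost",
}

# Variant -> aggregator type applied on top of the 3 base models.
STACK_VARIANTS = {
    "stack_3model_avg": "fixed_average",
    "stack_3model_logreg_shrink": "logreg_shrink",
}

BASE_MODELS = ("xgboost", "lightgbm", "catboost")


def resolve_training_plan(env=None):
    """Map CVN_ENSEMBLE_VARIANT to the base models and aggregator to train."""
    env = os.environ if env is None else env
    variant = env.get("CVN_ENSEMBLE_VARIANT", "none")
    if variant in SINGLE_MODEL_VARIANTS:
        return {
            "variant": variant,
            "base_models": (SINGLE_MODEL_VARIANTS[variant],),
            "aggregator": None,
        }
    if variant in STACK_VARIANTS:
        return {
            "variant": variant,
            "base_models": BASE_MODELS,
            "aggregator": STACK_VARIANTS[variant],
        }
    # No silent fallback to XGBoost on an unknown value.
    raise ValueError(f"unknown CVN_ENSEMBLE_VARIANT: {variant!r}")
```

Keeping the mapping free of side effects makes the routing tests in §4.4 cheap to write : each test passes a dict in place of the process environment.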

4.2 ADR-23 contract pinning for the ensemble (BLOCKING per Track 1 lessons)

Critical lesson from Track 1 v1 REJECT : the ensemble's feature contract MUST be pinned in MLflow artefacts, not derived from runtime env. For the stacking variants :

  • The MLflow registry record for a stacked model carries :
    • 3 base-model run_ids (xgb_run_id, lgb_run_id, cb_run_id) : pointers to the registered base models in MLflow
    • Aggregator artefact :
      • stack_3model_avg : aggregator.json with {"type": "fixed_average", "weights": [1/3, 1/3, 1/3]} (no fitted state)
      • stack_3model_logreg_shrink : stacking_meta.pkl (LogReg over 3-dim) + aggregator.json with {"type": "logreg_shrink", "alpha": <fitted>, "feature_dim": 3}
    • LogReg coefficients pinned + readable for the H2 architectural-diversity gate (no single model > 80 % weight). For the avg variant the "weights" are constants 1/3, so H2 is trivially satisfied ; for shrink they are fitted
    • Per-base-model feature_names pinned (consistency across all 3 : they consume the same enrichment output)
    • ensemble_variant string (stack_3model_avg / stack_3model_logreg_shrink)
  • At inference, InferenceAPI reads the bundle, loads the 3 base models from their MLflow run_ids, runs each on the input, then :
    • stack_3model_avg : returns (xgb_prob + lgb_prob + cb_prob) / 3
    • stack_3model_logreg_shrink : runs LogReg on the 3-dim probability vector, returns the meta probability
  • Single-model variants (lgb_only, cb_only) follow the existing single-model loader path.

ADR-25 fail-fast on : missing run_id, base model feature_names mismatch with current enrichment output, missing aggregator artefact, aggregator feature_dim ≠ 3 (catches the legacy 15-dim path being inadvertently re-enabled), LogReg coefficient drift (e.g. saved as 4 weights when we expect 3).
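The fail-fast conditions listed above can be sketched as a load-time validator. The bundle layout here is an assumption made for illustration (a plain dict with `feature_names` keyed per base model and a `coef` list inside the aggregator entry) ; the actual MLflow artefact schema is whatever the registry pattern PR standardises.

```python
RUN_ID_KEYS = ("xgb_run_id", "lgb_run_id", "cb_run_id")


def validate_stack_bundle(bundle, current_feature_names):
    """Raise RuntimeError on any contract violation; never fall back silently."""
    for key in RUN_ID_KEYS:
        if not bundle.get(key):
            raise RuntimeError(f"missing base-model run_id: {key}")

    aggregator = bundle.get("aggregator")
    if aggregator is None:
        raise RuntimeError("missing aggregator artefact")

    # All 3 base models must agree with the current enrichment output.
    for model, names in bundle.get("feature_names", {}).items():
        if list(names) != list(current_feature_names):
            raise RuntimeError(f"feature_names mismatch for base model {model}")

    agg_type = aggregator.get("type")
    if agg_type == "fixed_average":
        if len(aggregator.get("weights", [])) != 3:
            raise RuntimeError("fixed_average expects exactly 3 weights")
    elif agg_type == "logreg_shrink":
        # Catches the legacy 15-dim path being re-enabled by accident.
        if aggregator.get("feature_dim") != 3:
            raise RuntimeError("aggregator feature_dim must be 3")
        # Catches coefficient drift, e.g. 4 saved weights instead of 3.
        if len(aggregator.get("coef", [])) != 3:
            raise RuntimeError("logreg_shrink expects exactly 3 coefficients")
    else:
        raise RuntimeError(f"unknown aggregator type: {agg_type!r}")
```

RuntimeError is used because §8 names it for the feature_names mismatch ; whether every check shares that exception type is not stated in the dossier.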

4.3 FTF factor + guardrail

Add factor ensemble_diversity under MODEL_FACTORS (architecture tier, not data) :

| Variant | env vars |
|---|---|
| none | CVN_ENSEMBLE_VARIANT=none (default ; XGB only) |
| lgb_only | CVN_ENSEMBLE_VARIANT=lgb_only |
| cb_only | CVN_ENSEMBLE_VARIANT=cb_only |
| stack_3model_avg | CVN_ENSEMBLE_VARIANT=stack_3model_avg |
| stack_3model_logreg_shrink | CVN_ENSEMBLE_VARIANT=stack_3model_logreg_shrink, CVN_STACK_LOGREG_ALPHA_FLOOR=1.0 (default ; raises Optuna search-space lower bound) |

Guardrail in commun/finetune/guardrails.py::_validate_ensemble_diversity :

  • CVN_ENSEMBLE_VARIANT ∈ {none, lgb_only, cb_only, stack_3model_avg, stack_3model_logreg_shrink} ; reject other values
  • CVN_STACK_LOGREG_ALPHA_FLOOR ∈ [1.0, 100.0] (v2 contract minimum, tightened from [0.1, 100.0] per committee f33c73d4 BLOCKER #1 fix ; sub-floor configs are explicitly written off) ; orphaned without _logreg_shrink variant → reject (typo guard) ; default 1.0 = the contract floor itself
  • Stacking variants require CVN_BINARY_CLASSIFICATION=1 (binary aggregator by design ; see §4.6 3-class decision logic)
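A minimal sketch of the three guardrail rules, assuming the guardrail receives the environment as a mapping and signals rejection with ValueError. Both are assumptions : the function name carries a `_sketch` suffix to make clear it is not the shipped `_validate_ensemble_diversity`.

```python
ALLOWED_VARIANTS = frozenset(
    {"none", "lgb_only", "cb_only", "stack_3model_avg", "stack_3model_logreg_shrink"}
)
ALPHA_FLOOR_MIN, ALPHA_FLOOR_MAX = 1.0, 100.0


def validate_ensemble_diversity_sketch(env):
    """Return the validated variant, or raise ValueError on a bad config."""
    variant = env.get("CVN_ENSEMBLE_VARIANT", "none")
    if variant not in ALLOWED_VARIANTS:
        raise ValueError(f"unknown CVN_ENSEMBLE_VARIANT: {variant!r}")

    floor_raw = env.get("CVN_STACK_LOGREG_ALPHA_FLOOR")
    if floor_raw is not None:
        # Typo guard: the floor only makes sense for the shrink variant.
        if variant != "stack_3model_logreg_shrink":
            raise ValueError("CVN_STACK_LOGREG_ALPHA_FLOOR set without the shrink variant")
        floor = float(floor_raw)  # non-numeric input raises ValueError here
        if not ALPHA_FLOOR_MIN <= floor <= ALPHA_FLOOR_MAX:
            raise ValueError(f"alpha floor {floor} outside [1.0, 100.0]")

    # Stacking variants are binary-only by design (see section 4.6).
    if variant.startswith("stack_") and env.get("CVN_BINARY_CLASSIFICATION") != "1":
        raise ValueError("stacking variants require CVN_BINARY_CLASSIFICATION=1")
    return variant
```

For example, `{"CVN_ENSEMBLE_VARIANT": "lgb_only", "CVN_STACK_LOGREG_ALPHA_FLOOR": "2.0"}` is rejected as an orphaned floor, which is the typo case the guardrail is meant to catch.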

4.4 Tests

  • tests/unit/test_ensemble_diversity.py (12+ tests) :
    • guardrail accepts every variant in factor registry
    • guardrail rejects unknown variants + orphaned _LOGREG_ALPHA_FLOOR env without the shrink variant
    • guardrail rejects stacking variants under CVN_BINARY_CLASSIFICATION=0 (3-class meta is OOS, see §4.6)
    • dispatcher routes none to XGBoost (regression bar : pre-Track-11 path bit-identical)
    • dispatcher routes lgb_only to LightGBM adapter
    • dispatcher routes stack_3model_avg to fixed-average aggregator with [xgb, lgb, cb] base list
    • dispatcher routes stack_3model_logreg_shrink to StackingHPOAdapter with 3-dim feature path (NOT 15-dim)
    • LogReg coefficient extraction + 80 % weight ceiling check (H2 gate) ; trivially satisfied for _avg
    • aggregator feature_dim == 3 enforcement (catches inadvertent re-enable of the dropped 15-dim path)
  • tests/integration/test_track11_ensemble.py (5+ tests) :
    • end-to-end on synthetic dataset : 3 base trainers → fixed-average aggregator → final probabilities
    • end-to-end on synthetic dataset : 3 base trainers → 3-dim LogReg shrink → final probabilities, alpha ≥ 1.0 enforced
    • per-base-model feature_names consistency check (all 3 must agree)
    • inference-time loading : load registered stack → 3 base predicts → aggregator → predict
    • reproducer assertion : none variant produces bit-identical output to current XGBoost-only path

4.5 Observability + MLOps readiness

  • New events :
    • event=ensemble_train_complete variant=... base_models=... meta_weights=... f1_buy_oof=... (training-time)
    • event=ensemble_inference variant=... base_probs=... stacked_prob=... source=... (inference-time)
    • event=ensemble_inference_aborted reason=kill_switch_engaged variant=... (kill-switch path, see §4.7)
    • event=h2_gate_violation max_weight=... model=... if any base model > 80 % weight (H2 fail signal)
  • Grafana panel : "Ensemble base-model contributions", a bar chart of LogReg weights per crypto (operator visibility on stack diversity ; for _avg the bars are pinned at 1/3 and serve as a sanity-check)
  • Inference-latency SLO panel (per committee f33c73d4 reco) : p50 / p95 / p99 of ensemble-vs-single-model latency, alert if p95 > 3× single-model
  • MLOps readiness : documentation/stories/CVN-N001-EE-S06/mlops_readiness.md filled per ADR-70 before merge
  • Runbook documentation/runbooks/runbook_ensemble_diversity.md (P2) :
    • H2 gate violation response (single model dominates → simplify back)
    • Base-model feature_names drift (one of 3 retrained under a different feature contract)
    • Stacking meta-model drift (LogReg weights shift > 30 % from training, only relevant for _logreg_shrink)
    • Cache key extension acknowledgement (3× model storage = 3× cache namespace ; flag for Track 12)
    • Kill-switch interaction (§4.7) : operator-only re-enable post halt

4.6 3-class decision logic — binary mode contract

Committee f33c73d4 BLOCKER #2 : the v1 plan was silent on how a binary LogReg meta combines with the 3-class label space (BUY / HOLD / SELL).

Resolution : Track 11 inherits CVN's existing binary-classification mode and does NOT introduce a 3-class meta-aggregator. The contract :

  • CVN_BINARY_CLASSIFICATION=1 is required for any stacking variant (enforced by §4.3 guardrail). Per F1 plan + post-#608, the production path is binary (P(BUY) only).
  • Each base model (XGB / LGB / CB) is trained with binary labels (BUY=1, NOT_BUY=0). They emit P(BUY).
  • The aggregator (fixed-avg or LogReg-shrink) consumes the 3-dim [xgb_P(BUY), lgb_P(BUY), cb_P(BUY)] vector and emits a single P(BUY) ∈ [0, 1].
  • The existing ThresholdCalibrator (or PerRegimeThresholdCalibrator from Track 9, depending on the locked Track 9 variant) converts P(BUY) > θ → BUY signal.
  • SELL / HOLD live downstream in the existing filter chain (CUSUM, trend, meta-label, regime, cost, kelly, confirmation, quality), NOT in the ML layer. Track 11 does not touch the filter chain.
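The binary contract above reduces to a few lines of arithmetic. The sketch below is a hedged illustration in pure Python : the function names are hypothetical, `to_signal` is a stand-in for the existing ThresholdCalibrator (whose API is not shown in this dossier), and the "NO_SIGNAL" label is invented for the example.

```python
import math


def _check_probs(probs):
    """The aggregators consume exactly 3 P(BUY) values in [0, 1]."""
    if len(probs) != 3 or any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError(f"expected 3 probabilities in [0, 1], got {probs!r}")


def aggregate_fixed_average(probs):
    """stack_3model_avg: parameter-free mean of [xgb, lgb, cb] P(BUY)."""
    _check_probs(probs)
    return sum(probs) / 3.0


def aggregate_logreg(probs, coef, intercept):
    """stack_3model_logreg_shrink: sigmoid of a 3-dim linear score."""
    _check_probs(probs)
    score = intercept + sum(c * p for c, p in zip(coef, probs))
    return 1.0 / (1.0 + math.exp(-score))


def to_signal(p_buy, theta):
    """Stand-in for the calibrator step: P(BUY) > theta -> BUY."""
    return "BUY" if p_buy > theta else "NO_SIGNAL"
```

Both aggregators map three values in [0, 1] to one value in [0, 1], so the downstream threshold and filter chain see the same interface as with a single model.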

Out-of-scope : multi-class stacking (e.g., a 3-output softmax meta) — would require redoing the entire calibration + filter chain contract ; deferred to a future Story if the binary-mode F1 plan ABANDONS overall. This Story explicitly fails fast under multi-class mode, mirroring Track 9's binary-mode requirement.

4.7 Kill-switch integration — ADR-71 inheritance

Committee f33c73d4 BLOCKER #3 : the v1 plan did not document how Track 11 interacts with the trading kill-switch.

Resolution : Track 11 inherits the existing kill-switch (ADR-71, design documentation/design/CVN-N001-EF-S02-kill-switch-design.md, implementation Epic CVN-N001-EG). Specific contract :

  • The kill-switch lives upstream of InferenceAPI in the trading kernel (paper + live, ADR-39 + ADR-40). Track 11 changes the model loaded by InferenceAPI ; the kill-switch path is unaffected.
  • No new kill-switch surface is introduced. The 3-base + aggregator architecture is invisible to the kill-switch ; an engaged kill-switch halts all signal emission regardless of the underlying ensemble variant.
  • Inference-layer traceability : the ensemble inference orchestrator (commun.trading.ensemble_inference.run_ensemble_inference) accepts an injectable kill_switch_check callable that is invoked BEFORE any base-model predict(). When it returns True, the orchestrator short-circuits + emits event=ensemble_inference_aborted reason=kill_switch_engaged variant=<variant>. This PR ships the contract surface only — the production binding to ADR-71's single-PG read is OOS (default callable in this PR is _no_kill_switch which always returns False). The production binding lands in the Track 11 follow-up PR (block A) per §11.6, when Epic CVN-N001-EG ships the PG kill-switch read. Until then, the orchestrator's kill-switch path is exercised only by the integration test test_kill_switch_short_circuits_ensemble via an injected engaged check.
  • Re-enable contract : ADR-71 mandates operator-only disengage. Re-loading the ensemble bundle into a fresh InferenceAPI instance is part of standard restart ; no Track 11 special-case.
  • Latency SLO (per committee f33c73d4 reco) : the ADR-71 halt-latency budget is < 1 s end-to-end. The 3-base predict path adds compute upstream of the kill-switch check. The InferenceAPI MUST check kill-switch state BEFORE invoking the 3 base models, NOT after, to keep the halt path independent of ensemble cost. Test : tests/integration/test_track11_ensemble.py::test_kill_switch_short_circuits_ensemble.
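The ordering requirement (check the kill-switch BEFORE any base-model predict) can be sketched as follows. The dossier names `run_ensemble_inference`, the injectable `kill_switch_check` and the `_no_kill_switch` default, but not their signatures ; the parameter list, the `emit` hook and the `_sketch` suffix below are assumptions for illustration.

```python
def _no_kill_switch():
    """Default callable for this PR: the kill-switch is never engaged."""
    return False


def run_ensemble_inference_sketch(
    base_predictors,
    aggregate,
    features,
    variant,
    kill_switch_check=_no_kill_switch,
    emit=print,
):
    """Return the stacked P(BUY), or None if the kill-switch is engaged."""
    # The check runs first so the halt path never pays for 3 base predicts.
    if kill_switch_check():
        emit(
            "event=ensemble_inference_aborted "
            f"reason=kill_switch_engaged variant={variant}"
        )
        return None

    base_probs = [predict(features) for predict in base_predictors]
    stacked_prob = aggregate(base_probs)
    emit(
        f"event=ensemble_inference variant={variant} "
        f"base_probs={base_probs} stacked_prob={stacked_prob}"
    )
    return stacked_prob
```

Injecting an always-True check in a test and asserting that no predictor was called is one way to verify the short-circuit without a real kill-switch backend.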

Out-of-scope : per-variant kill-switches (e.g., "halt only stacking variants") — premature ; the ADR-71 single-PG-source design intentionally rejects per-variant gates.

5. Acceptance gate (per F1 plan §6)

The 6 official gates apply, with two Track-11-specific tightenings :

| Gate | Threshold |
|---|---|
| F1_buy lift | mean Δf1_buy ≥ +0.020 vs best single model (LGB / CB / XGB), CI95 excludes 0 ; NOT vs XGBoost baseline alone (honest benchmark) |
| Joint metric | Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % vs best single |
| Stability | per-fold variance of f1_buy ≤ 0.05 |
| Per-asset | f1_buy improves on ≥ 4/5 cryptos vs best single |
| Sample size | ≥ 50 BUY trades / fold |
| MLOps | documentation/stories/CVN-N001-EE-S06/mlops_readiness.md complete |
| Inference-latency SLO | p95 ensemble-vs-single ≤ 3.0× (committee f33c73d4 reco) ; halt-path latency unchanged from ADR-71 budget (< 1 s) |
| Cost-adjusted Sharpe | Sharpe net of the 3× training compute + 3× inference compute (committee f33c73d4 reco) : Δsharpe ≥ 0 after subtracting the cost in basis-points-equivalent units |
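For the cost-adjusted gate, the v2 triage (§11.4, R7) gives a straight conversion of compute spend into basis points : pod $/hour × wall-clock hours / AUM × 10000, with a $100k AUM proxy as the FTF default. A minimal sketch of that conversion follows ; the function and parameter names are hypothetical, and how the resulting figure is netted against Sharpe is left to the FTF reporter.

```python
def compute_cost_bps(pod_usd_per_hour, wall_clock_hours, aum_usd=100_000.0):
    """Convert compute spend into basis points of the AUM proxy."""
    if aum_usd <= 0:
        raise ValueError("aum_usd must be positive")
    spend_usd = pod_usd_per_hour * wall_clock_hours
    return spend_usd / aum_usd * 10_000
```

As an illustration with made-up inputs, a pod billed at $2/hour running for 50 hours costs $100, which is 10 bps of the $100k proxy.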

Mandatory architectural diversity check (H2) : in the locked variant's aggregator, no single base model has effective weight > 80 %. For stack_3model_avg this is trivially satisfied (1/3 each). For stack_3model_logreg_shrink, if any LogReg coefficient exceeds 0.80 normalised → ensemble is "single-model + noise" → ABANDON the stack, lock the dominant single model instead.

Minimum-contribution floor (committee f33c73d4 reco) : in stack_3model_logreg_shrink, every base model's normalised weight ≥ 5 %. A weight < 5 % means the base model contributes effectively zero ; lock the simpler 2-model variant if applicable, OR lock the surviving single model.
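The H2 ceiling and the 5 % floor both operate on normalised weights. The dossier does not spell out the normalisation, so the sketch below assumes the absolute value of each LogReg coefficient divided by the sum of absolute values ; the function names are hypothetical.

```python
def normalised_weights(coef):
    """Assumed normalisation: |coef_i| / sum_j |coef_j|."""
    total = sum(abs(c) for c in coef)
    if total == 0:
        raise ValueError("all coefficients are zero: weights undefined")
    return [abs(c) / total for c in coef]


def diversity_gates(coef, ceiling=0.80, floor=0.05):
    """Evaluate the H2 weight ceiling and the minimum-contribution floor."""
    weights = normalised_weights(coef)
    return {
        "weights": weights,
        "h2_ok": max(weights) <= ceiling,    # no single model > 80 %
        "floor_ok": min(weights) >= floor,   # every model >= 5 %
    }
```

With equal coefficients every weight is 1/3 and both gates pass, which matches the "trivially satisfied" remark for stack_3model_avg. Coefficients such as [9, 0.5, 0.5] give a 90 % share to the first model and fail H2.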

If every gate clears AND H2 holds AND the 5 % floor holds → operator decision lock (atomic per-crypto promotion of the stacked model bundle per ADR-15 + ADR-42). If F1 gate fails on every variant → abandon. If single-model variant (lgb_only / cb_only) clears the gates AND beats stack_3model_avg AND stack_3model_logreg_shrink, lock the simpler variant — stacking adds complexity for no edge.

6. Out of scope

  • More than 3 base models (e.g., adding TabNet, FT-Transformer) — Track 8 / 11 v2 territory ; v1 covers the canonical 3-model stack.
  • Dynamic base-model selection per crypto (e.g., "use only XGB+LGB on BTCUSDC, full stack on ETHUSDC") — premature ; constant base set across the 5 cryptos for v1.
  • Online ensemble re-weighting at inference (e.g., per-regime stacking weights) — Track 14 territory ; ensemble weights stay frozen at training time.
  • Distillation back to single model (training a small student to mimic the ensemble) — defer ; if Track 11 LOCKs, Track 14 v2 may revisit for inference cost.
  • XGBoost snapshot ensemble (cyclical-LR + checkpoint averaging) — dropped from v2 per committee f33c73d4 (reduce variant-matrix complexity, focus on the methodology fixes) ; if locked Track 11 leaves f1_buy on the table, snapshot is a follow-up Story.
  • Snapshot ensemble for LGB / CB — same rationale ; LGB native bagging + CB ordered boosting already provide intra-model diversity if needed.
  • 15-dim LogReg meta-model : explicitly abandoned (committee f33c73d4 BLOCKER #1). ~25 OOF samples per fold / 15 features ≈ 1.7 samples per feature, far below the standard 10× rule. Replaced by fixed-average + 3-dim shrinkage paths (§3).
  • 3-class meta-aggregator — out-of-scope (§4.6). Stacking variants require CVN_BINARY_CLASSIFICATION=1.
  • Per-variant kill-switch — out-of-scope (§4.7). Track 11 inherits ADR-71's single-PG kill-switch unchanged.
  • Paper/live integration — requires deployment_review committee session (Track 1 precedent + ADR-68). v1 ships backtest + MLflow registry only.
  • Cache key extension to ensemble bundle hash — Track 12 work ; documented as known-debt + pre-LOCK gate (mirrors Track 1 cache deferral).

7. Falsifiability + rollback

  • Falsifiability : pre-registered gates (H1 lift + H2 weight ceiling + 5 % floor). If Δf1 ∈ [-0.01, +0.015] with CI95 including 0 → H0 outcome → ABANDON. If H1 holds but H2 fails (single model > 80 % weight) → lock the dominant single model, abandon the stack. If 5 % floor fails on _logreg_shrink → fall back to _avg if it cleared, else lock surviving simple variant.
  • Rollback (per Track 1 lessons : model-artefact switching, NOT runtime env flag) :
    • Mandatory champion_xgb_only registered before any LOCK ; the previous XGB-only champion stays as a deployable fallback.
    • Operator promotes champion_xgb_only via Console (atomic per-crypto per ADR-15 + ADR-42) on regression detection. Latency < 5 min.
    • The runtime env CVN_ENSEMBLE_VARIANT is training-time only ; flipping at inference would dimension-mismatch the loaded model bundle.
    • Hot-fix path for stacking-layer bugs : retrain via PR + atomic promotion ; no env-flag toggle.
    • Pre-LOCK rollback dry-run (per Track 1 committee 6519ed97 v2 reco v2.4) : before LOCK, an operator demotes the stacked champion to staging and re-promotes champion_xgb_only on a single crypto (typically BTCUSDC), verifies signals match the pre-Track-11 baseline within tolerance, then re-promotes the stacked champion. The dry-run logs the promotion / demotion atomicity end-to-end.

8. Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| 3× training cost blows up sweep budget (5 variants × ~3× per-fold cost ≈ 15× the XGB-only sweep over the same 5 cryptos × 5 folds) | high | medium | Pre-FTF compute budget estimate ; abandon _logreg_shrink first if budget exceeded (most expensive due to Optuna sweep). Existing Optuna single-thread per HPO trial means parallelism is in tree-build, not trial-loop. Cost-adjusted Sharpe gate (§5) makes runaway compute self-falsifying |
| LogReg meta-model overfits OOF features | medium | medium | Mitigated structurally in v2 : 15-dim path dropped, replaced by 3-dim with alpha ≥ 1.0 floor + fixed-average baseline (no fitted state). _logreg_shrink must beat _avg by ≥ +0.005 to lock |
| Single base model dominates (LGB or CB has the edge alone) → stack adds 3× cost for marginal gain | medium | low | H2 gate (80 % weight ceiling) + 5 % floor catches it. Lock simpler lgb_only / cb_only instead |
| feature_names drift between base models (one trained under a different enrichment config) | low | high | ADR-23 fail-fast at load time : all 3 base-model feature_names must match current enrichment output ; mismatch raises RuntimeError. Runtime feature_names validation (committee f33c73d4 reco) : also re-verified at every InferenceAPI call (cached check) |
| MLflow artefact bloat (3 models + meta ≈ 3× storage per crypto ; across 5 cryptos ≈ 15× a single XGB-only model) | medium | low | Quarterly MLflow registry cleanup ; flag if storage > 100 GB per crypto in §4.5 observability |
| LogReg meta-model coefficients drift in production over time | medium | medium | Drift runbook §3 : automated weekly drift check via the Grafana panel + Prometheus alerts (committee f33c73d4 reco : automate, don't rely on manual review) ; if any weight drifts > 30 % from training, page operator and schedule re-fit |
| Ensemble inference latency > existing single-model SLO | medium | medium | Inference-latency SLO gate (§5) : p95 ≤ 3.0× single-model. If it exceeds budget, run base models in parallel via threadpool (3-base predict is embarrassingly parallel) before falling back to a simpler variant |
| 3-class label leakage into binary stacking (e.g., a SELL-leaning OOF fold biases the meta) | low | medium | CVN_BINARY_CLASSIFICATION=1 enforced + base-model labels validated at load time (§4.6). Multi-class load → RuntimeError |
| Kill-switch halt-latency degrades under ensemble path | low | high | §4.7 : kill-switch state checked BEFORE base-model predicts ; integration test test_kill_switch_short_circuits_ensemble measures end-to-end halt-latency under each variant |
| Pre-FTF integration tests catch only synthetic data, miss production data idiosyncrasies | medium | medium | Mandatory pre-FTF integration test on a single crypto's real OHLCV (BTCUSDC last 30 days) before the 5-crypto sweep (committee f33c73d4 reco) |
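The latency mitigation named in the risk table (running the 3 base predicts in a threadpool) could look like the sketch below. `predict_bases_parallel` is a hypothetical helper, and the speed-up is an assumption : threads only help if the underlying predict calls release the GIL or spend their time in native code.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_bases_parallel(base_predictors, features):
    """Run each base-model predict in its own thread; keep input order."""
    with ThreadPoolExecutor(max_workers=len(base_predictors)) as pool:
        futures = [pool.submit(predict, features) for predict in base_predictors]
        # result() re-raises any exception from a base model: fail fast,
        # no partial aggregation over 2 of 3 probabilities.
        return [future.result() for future in futures]
```

Collecting results in submission order keeps the [xgb, lgb, cb] position contract that the aggregator relies on.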

9. Why this is not the next training-signal-manipulation attempt

Per the F1 plan §6 cross-track lesson : Tracks 5 + 6 closed ABANDONED on training signal manipulation. Track 11 is :

  • Architecture-tier (tier 4, not tier 2/3 LABEL/LOSS).
  • Function-class diversification (different model architectures, not different loss / labels / threshold).
  • Pre-training-data agnostic (the model's input + training labels stay identical to baseline ; only the inference function class changes).

If Track 11 ABANDONS, the F1 plan §6 sequencing implication block defers the rest of the architecture tier (Track 8 sequence model). If Track 11 LOCKs, architecture diversity becomes a productive lever and unlocks Track 8 (sequence model as residual on top of the ensemble) + Track 14 (per-regime ensemble weights).

10. Cross-references

  • F1 plan §5 Track 11 + §6 sequencing
  • ADRs : ADR-15 (atomic promotion), ADR-23 (features version-pinned), ADR-25 (no silent fallback), ADR-32 (event=key=value structured logs), ADR-39 (runtime standalone), ADR-40 (paper/live same kernel), ADR-42 (atomic per-crypto promotion), ADR-56 (every change FTF-testable), ADR-58 (every factor → guardrail + integration test), ADR-70 (MLOps readiness mandatory), ADR-71 (trading kill-switch — Track 11 inheritance, see §4.7)
  • Existing infra reused :
    • src/training/patterns/adapters/cvntrade_lightgbm_adapter.py (LightGBM trainer)
    • src/training/patterns/adapters/cvntrade_catboost_adapter.py (CatBoost trainer)
    • src/training/patterns/adapters/cvntrade_stacking_adapter.py (3-model stacking + LogReg meta : StackingHPOAdapter + StackingModelAdapter)
    • src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py (current single-model entry-point, to be extended with the dispatcher)
    • src/commun/mlflow/cvntrade_mlflow_manager.py:log_model_with_registry (MLflow registration)
  • New code :
    • FTF factor ensemble_diversity in commun/finetune/ablation_matrix.py
    • Guardrail _validate_ensemble_diversity in commun/finetune/guardrails.py
    • Aggregator module commun/trading/ensemble_aggregator.py (fixed-average + 3-dim shrink + diversity gates + JSON serialise/deserialise)
    • Inference orchestrator commun/trading/ensemble_inference.py (3-base predict fan-out + injectable kill-switch check + structured events)
    • OOS this PR (block A follow-up, see §11.6) : the InferenceAPI auto-routing extension in commun/pipeline/inference_api.py that wires the loaded model's ensemble_variant tag to run_ensemble_inference. Until block A merges, InferenceAPI does NOT route stacking variants ; models trained with CVN_ENSEMBLE_VARIANT=stack_* MUST NOT be deployed.
  • Sister Tracks : Track 5 results (ABANDON), Track 6 results (ABANDON), Track 9 (FTF testing), Track 1 (in code testing post-merge)
  • Track 1 lessons applied :
    • ADR-23 feature contract pinning via MLflow artefact (committee 62d756a9 v1 BLOCKER #2)
    • Model-artefact rollback via existing MLOps promotion (committee 62d756a9 v1 BLOCKER #1)
    • SHA256 checksum on artefacts (committee 6519ed97 v2 reco v2.1)
    • Pre-LOCK rollback dry run (committee 6519ed97 v2 reco v2.4)
    • Cache key acknowledgement deferred to Track 12

11. v1 committee triage (f33c73d4, REJECTED / METHODOLOGY_FLAW)

Committee verdict on v1 : REJECTED with strong consensus. Reason : "critical methodological flaws in the LogReg meta-model's robustness (severe overfitting risk with insufficient OOF samples) and an undefined strategy for handling 3-class labels with a binary meta-model, compounded by the absence of an explicit kill-switch for operational safety".

Three blockers + ten recommendations. Triage below.

11.1 Three blockers (all addressed in v2)

| Blocker | Severity | v2 resolution |
|---|---|---|
| #1 LogReg meta-model overfitting (~25 OOF samples / 15 features ≈ 1.7 samples per feature, far below the 10× rule) | High | §3 variant matrix rewritten : 15-dim path dropped. Canonical variant is stack_3model_avg (no fitted state, no overfitting surface). Comparator stack_3model_logreg_shrink reduces to 3-dim + L2 floor alpha ≥ 1.0. §5 gate adds : shrink must beat avg by ≥ +0.005 OR avg locks (Occam) |
| #2 Undefined 3-class decision logic (binary LogReg meta vs 3-class label space) | High | New §4.6 : Track 11 inherits CVN_BINARY_CLASSIFICATION=1 (F1 plan + post-#608). Base models emit P(BUY), aggregator emits P(BUY) ; SELL/HOLD live in downstream filter chain. Multi-class meta is OOS. Guardrail rejects stacking variants under multi-class mode |
| #3 Absence of explicit kill-switch | High | New §4.7 : Track 11 inherits ADR-71's existing kill-switch (single PG source, operator-only disengage). Kill-switch state checked before base-model predicts to keep halt-latency independent of ensemble cost. Integration test test_kill_switch_short_circuits_ensemble. New event event=ensemble_inference_aborted reason=kill_switch_engaged for traceability |

11.2 Ten recommendations — triage

| # | Reco | Bucket | Action |
|---|---|---|---|
| R1 | Fix LogReg meta overfitting | pre-impl | §3 + §5 ; addressed in blocker #1 |
| R2 | Define 3-class strategy explicitly | pre-impl | §4.6 ; addressed in blocker #2 |
| R3 | Dedicated kill-switch documentation | pre-impl | §4.7 ; addressed in blocker #3 |
| R4 | Inference latency SLOs | pre-impl | §5 gate (p95 ≤ 3× single) + §4.5 Grafana panel |
| R5 | Drift automation (don't rely on weekly manual review) | impl-time | §8 risks : Prometheus alerts on weight drift, automated check ; runbook §3 documents the alert response |
| R6 | Cost-adjusted metrics | pre-impl | §5 gate : Δsharpe net of training + inference compute (basis-points-equivalent) |
| R7 | Min weight 5 % floor | pre-impl | §5 : minimum-contribution floor on _logreg_shrink ; weight < 5 % → lock simpler variant |
| R8 | Pre-FTF integration tests on real data | impl-time | §8 risks : mandatory single-crypto real-data integration test (BTCUSDC 30 days) before 5-crypto sweep |
| R9 | Runtime feature_names validation | impl-time | §8 risks : re-verified at every InferenceAPI call (cached) on top of load-time check |
| R10 | Cost estimate revision | pre-impl | §1 already revised 4-5d → 2-3d ; §5 cost-adjusted-Sharpe gate makes runaway compute self-falsifying |

All R1-R10 are folded into v2 and either addressed in the dossier or held as impl-time gates for the implementation PR. None deferred.

11.3 v2 committee verdict (ee529c59, PASSED / OK / strong consensus)

Committee v2 PASSED with zero blockers. Strong consensus across 5 experts (scores 7.5–8.5). Two areas of dissent (advisory) :

  • Robustness of _logreg_shrink with small OOF samples — 3 experts say "still risky despite L2", 2 say "pragmatic and sound given the constraints". Mitigated by §5's +0.005 lift requirement (avg locks if shrink doesn't beat by ≥ +0.005 — which inherently filters out high-variance shrink fits) and §11.4 reco R2 below.
  • Execution realism in backtest — 1 expert (crypto-trader) flags absence of slippage/fee/funding drag/stress-case analysis. Other 4 rely on §5's cost-adjusted Sharpe gate. Tracked as R3 below ; partial fix (slippage + fees in cost-adjusted Sharpe input) is impl-time ; full stress-case analysis is deferred (cross-Track concern, not Track-11-specific).
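The coefficient-stability check proposed in §11.4 R2 (flag ABANDON if std(weights) > 0.15 across folds) can be sketched as follows. The assumptions here are that weights are recorded per fold as a name → weight mapping, that the spread is computed per base model, and that the population standard deviation is used. The dossier does not fix any of these choices, and the function name is hypothetical.

```python
from statistics import pstdev

MAX_WEIGHT_STD = 0.15  # threshold quoted in §11.4 R2


def unstable_base_models(fold_weights, max_std=MAX_WEIGHT_STD):
    """Return {base_name: std} for bases whose weight varies too much across folds.

    fold_weights: one dict per fold, e.g. {"xgb": 0.3, "lgb": 0.4, "cb": 0.3}.
    """
    if len(fold_weights) < 2:
        raise ValueError("need weights from at least two folds")
    names = fold_weights[0].keys()
    # Population std of each base model's weight across the folds (assumption)
    spread = {name: pstdev([fw[name] for fw in fold_weights]) for name in names}
    return {name: s for name, s in spread.items() if s > max_std}
```

An empty result means no base model breaches the threshold; any non-empty result would be the trigger for the ABANDON flag.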

15 recommendations triaged in §11.4.

11.4 v2 recommendations — triage

| # | Reco | Bucket | Action |
|---|---|---|---|
| R1 | Accelerate Track 12 (Cache Key Extension) post-Track-11 stabilisation | deferred | Cross-Track ; logged as block-G in Track 1 backlog already (cache deferral pattern) ; surfaces in Track 12 Story planning |
| R2 | Robustness of _logreg_shrink on small samples : explore inner CV + monitor coefficient stability across folds/assets if locked | impl-time | If _logreg_shrink locks : add per-fold coefficient stability check (variance across 5 folds) to §4.4 tests + §4.5 Grafana panel ; flag ABANDON if std(weights) > 0.15 across folds |
| R3 | Execution realism : slippage, fees, funding, stress-case | impl-time (partial) + deferred (full) | Slippage + fees in cost-adjusted Sharpe input (§5) are impl-time additions to the FTF reporter ; full stress-case (liquidation cascades, spread shocks) is cross-Track and deferred to a follow-up Story |
| R4 | Deep-dive on ensemble × 9-filter chain interaction post-lock | post-lock | Post-LOCK observability addition ; documented as part of the runbook §4.5 ; operator monitors filter-pass rates under each variant before promoting prod |
| R5 | Automate MLflow registry cleanup + URI validation | deferred | Cross-Track ; existing quarterly cleanup (§8 risks) is the v1 pattern ; full TTL-based automation is a separate observability Story |
| R6 | Lower Prometheus drift alert threshold to > 10 % + auto-fallback to _avg if critical | impl-time | Drift threshold lowered from 30 % to 10 % page / 30 % auto-fallback in runbook §3 ; only if _logreg_shrink is the locked variant. _avg has no learned weights to drift |
| R7 | Document basis-points-equivalent calc methodology for cost-adjusted Sharpe | impl-time | Add §5bis to the dossier OR equivalent docstring in the FTF reporter ; straight conversion : compute_cost_bps = pod_$/hour × wall_clock_hours / aum × 10000. AUM proxy = $100k as the FTF default (matches existing reporter convention) |
| R8 | feature_names SHA256 + same-data-snapshot enforcement | impl-time | Add SHA256(feature_names list) check at base-model load + a train_data_snapshot_id field (timestamp range) to MLflow record ; mismatch → RuntimeError. Mirrors Track 1's SHA256 pattern from committee 6519ed97 v2 reco v2.1 |
| R9 | Ensemble-specific debugging runbooks beyond general drift | post-lock | Post-LOCK runbook expansion ; the §4.5 runbook stub covers H2 violation + drift ; deeper debugging stories (one base model underperforming systematically across regimes) are post-lock observability |
| R10 | Threadpool parallel base-model predict + auto-fallback on latency breach | impl-time | Wired in §4.7's InferenceAPI extension : 3 base predicts in a concurrent.futures.ThreadPoolExecutor(max_workers=3) ; auto-fallback to single-model on p95 SLO breach is a post-lock runbook step (operator promotes the simpler variant via Console) |
| R11 | Dedicated kill-switch race-condition test with sub-100ms budget | impl-time | Extend test_kill_switch_short_circuits_ensemble (§4.7) : measure latency between kill-switch engage + ensemble inference abort ; assert < 100 ms P95 |
| R12 | Disclose actual BUY/SELL/HOLD trade counts per fold in FTF results | impl-time | The FTF reporter already emits per-fold BUY counts ; add HOLD + SELL columns + a per-fold sample-size warning if BUY < 50 (mirrors §5 sample-size gate) |
| R13 | Justify ATR-dynamic horizon vs regime stationarity (ADF on barrier hit times) | deferred | Cross-Track concern (label / horizon design, not ensemble-specific) ; logged for the dossier of any future label / horizon Story (currently Tracks 5/6 ABANDONED so no active vehicle) |
| R14 | Document why LGB / CB beat XGB in crypto microstructure beyond variance reduction | pre-impl | Add brief paragraph to §1 : LGB's leaf-wise growth captures leptokurtic returns better in low-data regimes, CB's ordered boosting reduces target leakage on autocorrelated features. Variance reduction is the primary claim in the F1 plan §5 ; specific microstructure edge is hypothesised, not pre-registered |
| R15 | Probability of Backtest Overfitting (PBO) analysis | deferred | Cross-Track infrastructure (PBO requires bootstrap resampling of the strategy submission set) ; pre-existing FTF-wide concern ; logged as a future observability story |
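As an illustration of the R8 feature_names check (SHA256 of the feature list, mismatch → RuntimeError), here is a minimal sketch. It assumes the fingerprint is order-sensitive and computed over a compact JSON encoding of the list. Both the encoding and the function names are hypothetical choices for this example, not the pattern actually used in Track 1.

```python
import hashlib
import json


def feature_names_fingerprint(feature_names):
    """Order-sensitive SHA256 hex digest of a list of feature names."""
    # Compact JSON keeps the encoding unambiguous for names containing commas
    payload = json.dumps(list(feature_names), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_feature_contract(feature_names, expected_sha256):
    """Raise RuntimeError when the loaded feature list does not match the record."""
    actual = feature_names_fingerprint(feature_names)
    if actual != expected_sha256:
        raise RuntimeError(
            f"feature_names mismatch: expected {expected_sha256[:12]}, got {actual[:12]}"
        )
```

Because the digest is order-sensitive, two base models trained on the same columns in a different order would fail the check, which is the intended behaviour when the meta-features depend on column position.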

Net : 6 impl-time gates (R2, R6, R7, R8, R11, R12), 1 pre-impl text addition (R14), 5 deferred as cross-Track (R1, R5, R13, R15, plus the full stress-case part of R3), 2 post-lock (R4, R9), 1 post-lock conditional (R10 fallback path).
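The basis-points conversion quoted in R7 is a one-line formula. The sketch below transcribes it directly, with the $100k AUM proxy as the default. The function and parameter names are illustrative only.

```python
def compute_cost_bps(pod_usd_per_hour, wall_clock_hours, aum_usd=100_000.0):
    """Compute cost as basis points of AUM: rate × hours / AUM × 10000."""
    if aum_usd <= 0:
        raise ValueError("aum_usd must be positive")
    return pod_usd_per_hour * wall_clock_hours / aum_usd * 10_000
```

With hypothetical inputs of a $2/hour pod running for 5 hours against the $100k proxy, the compute drag works out to 1 bps.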

11.5 Implementation PR scope (this PR)

This PR ships the runtime contract surface that all downstream wiring depends on. Concretely :

| Component | File | Status |
|---|---|---|
| FTF factor ensemble_diversity | src/commun/finetune/ablation_matrix.py | ✅ shipped (5 variants) |
| Guardrail _validate_ensemble_diversity | src/commun/finetune/guardrails.py | ✅ shipped (binary-mode + alpha-floor + variant whitelist) |
| Aggregator module | src/commun/trading/ensemble_aggregator.py | ✅ shipped (FixedAverageAggregator + LogRegShrinkAggregator + evaluate_diversity_gates) |
| Ensemble inference orchestrator | src/commun/trading/ensemble_inference.py | ✅ shipped (3-base fan-out + kill-switch short-circuit + structured events) |
| Unit tests (49) | tests/unit/test_ensemble_diversity.py | ✅ 49/49 PASS |
| Integration tests (19) | tests/integration/test_track11_ensemble.py | ✅ 19/19 PASS (incl. kill-switch latency p95 < 100 ms) |
| MLOps readiness | documentation/stories/CVN-N001-EE-S06/mlops_readiness.md | ✅ shipped (ADR-70) |
| Runbook P2 | documentation/runbooks/runbook_ensemble_diversity.md | ✅ shipped (5 symptoms : H2 / feature_names / drift / latency / kill-switch) |
| MkDocs nav | mkdocs.yml | ✅ runbook registered ; strict build green |
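For readers unfamiliar with the orchestrator pattern named in the table (3-base fan-out + kill-switch short-circuit + structured events), the sketch below shows the general shape. It is not the code in ensemble_inference.py. The function signature, the dict-based event, and the event name ensemble_inference_ok are assumptions made for the example; only the aborted event name and reason come from the dossier.

```python
from concurrent.futures import ThreadPoolExecutor


def run_ensemble_sketch(features, base_predictors, aggregate, kill_switch_engaged):
    """Check the kill-switch first, then fan out to the base models in parallel.

    base_predictors: dict name -> callable(features) -> probability.
    aggregate: callable(dict name -> probability) -> probability.
    kill_switch_engaged: zero-argument callable returning bool.
    """
    # Halt check happens before any base-model predict, so abort latency
    # does not depend on how expensive the ensemble is.
    if kill_switch_engaged():
        return {
            "event": "ensemble_inference_aborted",
            "reason": "kill_switch_engaged",
            "prob": None,
        }
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, features) for name, fn in base_predictors.items()}
        probs = {name: fut.result() for name, fut in futures.items()}
    return {"event": "ensemble_inference_ok", "reason": None, "prob": aggregate(probs)}
```

The sketch checks the kill-switch once, before fan-out. A halt that arrives while the base predicts are already running is not covered here and would need a separate check.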

11.6 Block A — Track 11 follow-up PR (explicit OOS this PR)

The following items are explicitly out of scope for this PR and are tracked as the Track 11 follow-up (mirrors Track 1 PR #792 → follow-up pattern, committee 6519ed97 v2 reco v2.4) :

  1. Autotrainer dispatcher — extend src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py (or create cvntrade_ensemble_trainer.py) routing CVN_ENSEMBLE_VARIANT to the right base-model adapter list + aggregator fit
  2. MLflow registry pattern for stacked models — aggregator.json artefact + 3 base-model run_id references + per-base feature_names SHA256 + train_data_snapshot_id (committee reco R8)
  3. InferenceAPI extension in src/commun/pipeline/inference_api.py to auto-route stacking variants to commun.trading.ensemble_inference.run_ensemble_inference based on the loaded model's variant tag
  4. Production kill-switch wiring to ADR-71's single-PG read (currently a no-op default in commun.trading.ensemble_inference._no_kill_switch ; production wiring lands when Epic CVN-N001-EG ships, mirroring Track 1's pre-deployment gate)
  5. Pre-LOCK rollback dry run — atomic per-crypto promotion of the champion_xgb_only fallback ; 24h shadow validation (committee reco v2.4 from Track 1)
  6. FTF sweep results dossier — block B-F per the Track 1 backlog convention

Until the follow-up merges, models trained with CVN_ENSEMBLE_VARIANT=stack_* MUST NOT be deployed to inference (the InferenceAPI doesn't yet auto-route). The runtime contract surface in this PR enables the FTF sweep + the follow-up PR with zero new risk : both aggregator paths + the guardrail + the orchestrator are independently testable today.

11.7 Falsifiability summary post-v2

The full pre-registered gate set : H1 (Δf1 ≥ +0.020 vs best single, CI95 excludes 0) ; H2 (max base-model weight ≤ 80 %) ; 5 % floor on _logreg_shrink ; per-asset (≥ 4/5 cryptos) ; stability (per-fold variance ≤ 0.05) ; sample size (≥ 50 BUY trades / fold) ; latency (p95 ≤ 3× single) ; cost-adjusted Sharpe (Δsharpe ≥ 0 net of compute) ; MLOps readiness ; pre-LOCK rollback dry-run.
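The two weight gates in this set (H2 ceiling of 80 % and the 5 % floor on _logreg_shrink) can be checked mechanically once the meta-model coefficients are turned into shares. The sketch below assumes a share is |coef| / Σ|coef|; the dossier does not say how weights are derived from the LogReg coefficients, so treat that normalisation and the function names as hypothetical.

```python
def normalised_weights(coefs):
    """Map raw coefficients to shares that sum to 1 (assumption: absolute values)."""
    total = sum(abs(c) for c in coefs.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    return {name: abs(c) / total for name, c in coefs.items()}


def weight_gate_violations(weights, ceiling=0.80, floor=0.05):
    """List (base_name, gate) pairs for every breached weight gate."""
    violations = []
    for name, w in weights.items():
        if w > ceiling:
            violations.append((name, "H2_ceiling"))
        if w < floor:
            violations.append((name, "min_contribution_floor"))
    return violations
```

An empty list means both gates pass. A ceiling breach means one base model dominates the stack, and a floor breach means one base model contributes almost nothing.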

A locked variant must clear all of the above. If _logreg_shrink doesn't beat _avg by ≥ +0.005 → lock _avg (Occam). If _avg doesn't beat lgb_only / cb_only → lock the simpler swap variant. If no variant clears → ABANDON Track 11 ; the F1 plan §6 sequencing implications apply (defer the rest of the architecture tier).
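The lock cascade described above can be written as a short decision function. This is one possible reading, with two assumptions that go beyond the text: the variant labels are shortened placeholders, and when _avg is not eligible the +0.005 margin is applied against whichever simpler variant is currently selected.

```python
def choose_locked_variant(f1_buy, clears_gates, occam_margin=0.005):
    """Pick the simplest variant that the more complex ones fail to beat.

    f1_buy: dict variant -> mean f1_buy; clears_gates: dict variant -> bool.
    Variant keys used here: "lgb_only", "cb_only", "avg", "logreg_shrink".
    """
    eligible = {v: s for v, s in f1_buy.items() if clears_gates.get(v, False)}
    locked = None
    # Start from the best single-model swap, if any cleared the gates
    swaps = [v for v in ("lgb_only", "cb_only") if v in eligible]
    if swaps:
        locked = max(swaps, key=eligible.get)
    # _avg only replaces a swap variant if it strictly beats it
    if "avg" in eligible and (locked is None or eligible["avg"] > eligible[locked]):
        locked = "avg"
    # _logreg_shrink must beat the simpler choice by the Occam margin
    if "logreg_shrink" in eligible:
        if locked is None or eligible["logreg_shrink"] - eligible[locked] >= occam_margin:
            locked = "logreg_shrink"
    return locked if locked is not None else "ABANDON"
```

The function only encodes the tie-break order. Whether a variant clears the full gate set is passed in through clears_gates and evaluated elsewhere.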


Question for the committee (v2 — blockers addressed)

v1 was REJECTED (f33c73d4, METHODOLOGY_FLAW, strong consensus) on three blockers : (1) LogReg meta overfitting risk : 15 meta-features fitted on ~25 OOF samples gives ≈ 1.7 samples per feature, far below the 10× rule of thumb ; (2) undefined 3-class decision logic for a binary meta-model ; (3) absence of explicit kill-switch documentation. v2 addresses all three structurally : the 15-dim path is dropped in favour of fixed-average + 3-dim L2-shrunk variants (§3) ; binary mode is contracted explicitly (§4.6, multi-class explicitly out of scope, fail-fast guardrail) ; ADR-71 kill-switch inheritance is documented with a pre-base-predict halt-check + integration test (§4.7).

Validate that the fixed-average aggregator (stack_3model_avg) replaces the 15-dim LogReg as the canonical variant. Rationale : at n_oof = ~25 per fold, the parameter-free aggregator avoids the overfitting surface entirely ; _logreg_shrink (3-dim + α ≥ 1.0) stays as a learned-weights comparator with a +0.005 lift requirement to lock instead of _avg. Is this Occam-favoured tiebreaker the right structural fix, or does it risk under-using the ensemble's information content ?
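To show why the fixed-average path has no overfitting surface, the sketch below implements a parameter-free average over the three base probabilities. It is a minimal illustration under the binary-mode contract and is not the shipped FixedAverageAggregator; the function name is hypothetical.

```python
def fixed_average(xgb_prob, lgb_prob, cb_prob):
    """Unweighted mean of three BUY probabilities; nothing is fitted."""
    probs = (xgb_prob, lgb_prob, cb_prob)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return sum(probs) / len(probs)
```

Since no coefficient is estimated from the ~25 OOF samples, the output cannot vary with the fold used for fitting. The trade-off, which is the point of the question, is that the average cannot down-weight a base model that is systematically worse.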

Validate the kill-switch architectural choice : Track 11 inherits ADR-71 unchanged ; the kill-switch state is checked before base-model predicts to keep halt-latency independent of ensemble compute. No new kill-switch surface. Is this inheritance pattern sufficient, or does the 3-base predict path warrant a dedicated halt-budget separate from ADR-71's < 1 s end-to-end SLO ?

Validate the expanded gate set (§5) : H1 lift + H2 weight ceiling + 5 % min-contribution floor on _logreg_shrink + inference-latency SLO (p95 ≤ 3× single-model) + cost-adjusted Sharpe (net of 3× train + 3× inference compute in basis-points-equivalent units). Are these four additional gates sufficient to catch silent failures (information-empty stacking, latency degradation, compute-driven Sharpe inflation) that a pure F1-lift gate would miss ?

Flag any remaining hidden modes (e.g. base models trained under different feature contracts silently producing wrong meta-features, runtime feature_names check missing on a code path, MLflow registry contract drift between the 3 base model URIs and the aggregator artefact, automated drift alerts firing without runbook coverage, kill-switch race condition where a halt arrives between base-model load and predict) where Track 11 would silently produce wrong probabilities or violate ADR-71 latency without visible alerts.