Track 11 — Ensemble diversity results & gate decision¶
⚠️ POST-CLOSURE REOPEN — 2026-05-02 : the original
ABANDONverdict in this dossier (drafted 2026-05-01, dossier PR #800) was retracted by the operator on 2026-05-02. The closure rested on a "supersession argument" (§11.3-§11.4 below) claiming that the averaging variant's null result subsumed the missing model-pure variants — that argument is mathematically unsound and the verdict it produced is invalid. wp#45 is being re-opened and Track 11 must be re-swept with full coverage of the model-pure variants (lgb_only,cb_only) before any closure decision. See §13 below for the post-closure reopen detail + the corrected verdict pathway.
Story : CVN-N001-EE-S06 (wp#45) — Track 11 of F1_buy boost mission, architecture tier
ADR : closes per ADR-0079 (FTF sweep → Story closure 8-step workflow). Track 11 is the second split-PR Story to close under the workflow ; the runtime-surface PR #793 merged 2026-05-01 14:27 UTC, the block A follow-up was never opened (see §11 for why ABANDON supersedes the split-PR completion requirement, with a delta vs Track 9 which was a single-PR Story).
Date : 2026-05-01
Authors : Dominique (operator) + Claude
Plan dossier : 2026-05-01-track11-ensemble-diversity-plan.md — committee plan_review PASSED
Implementation PR (runtime contract surface) : #793 (squash 9e1bf8a3, merged 2026-05-01 14:27 UTC)
Block A follow-up PR : NOT OPENED (autotrainer dispatcher + InferenceAPI auto-routing + MLflow registry + production kill-switch wiring) — abandoned per the verdict of this dossier ; rationale §11.4
Sweep run_id : ftf_20260501_152755_d84267_ATR0.5_1.5_H4
Sweep status : failed (50 errors / 175 expected = 28.5 % error rate ; 125 useful results) ; duration 2h05m ; started 2026-05-01 15:28:24 UTC ; git SHA 9e1bf8a3 (= PR #793 squash, ~1h post-merge)
Cryptos : UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC (defi_top5 — same panel as Track 9 for cross-comparability)
Strategy / PTE : ATR0.5_1.5_H4 (per project_pte_policy memory — ATR1.5_3.0_H5 is DEPRECATED)
Folds : 5 (Folds 3-7 per the report) ; Trials : 50 (Optuna budget per fold) ; Cost : 15 bps
FTF report PDF : reports/ftf_report_ftf_20260501_152755_d84267_ATR0.5_1.5_H4.pdf (committed to docs base — source-of-truth for the per-pair / per-fold tables in §2-§9)
Verdict (RETRACTED 2026-05-02) : ~~ABANDON~~ → INCONCLUSIVE — partial coverage, model-pure variants never trained. The original ABANDON verdict (struck through below) rested on a supersession argument (§11.3-§11.4) that does not hold mathematically for uniform averaging. The hypothesis "LGB and CB add no signal beyond XGB" cannot be tested by observing that mean(p_xgb, p_lgb, p_cb) ≈ p_xgb_only — see §13 for the corrected reasoning.
~~ABANDON — every primary-metric pairwise comparison gives 0/4 metric agreements vs the lock rule (BH p<0.05 AND |Cohen's d| ≥ 0.3). Effect sizes are tiny (max |d| = 0.33 on Expectancy, in
stack_3model_logreg_shrink's favour) but statistically insignificant (min p-adj = 0.259).f1_buyis bit-flat across all 3 successful variants (0.424–0.425) — averaging 3 model architectures buys exactly zero lift on the headline metric. The hypothesis "ensemble diversity lifts f1_buy" is not supported by this dataset.~~
Lock decision : NO LOCK at this time. factor_ensemble_diversity=none (single-XGBoost baseline) remains the production model — but only because we lack the data to choose another. Block A follow-up PR is now required : autotrainer dispatcher (so lgb_only and cb_only variants can train) → re-sweep → corrected verdict.
1. Sweep state — partial coverage¶
The sweep completed with 50 errors (vs Track 9's 0 errors) — meaningful structurally :
| Variant | Useful rows | Coverage | Notes |
|---|---|---|---|
| none (baseline) | 25 | 5 cryptos × 5 folds | Single-XGBoost ; production today |
| stack_3model_avg | 25 | 5 × 5 | Average of [xgb_prob, lgb_prob, cb_prob] — uniform weights |
| stack_3model_logreg_shrink | 25 | 5 × 5 | LogReg meta-model over [xgb_prob, lgb_prob, cb_prob] with L2 shrinkage |
| lgb_only | 0 | failed all 25 cells | Crash : autotrainer hardcoded to XGBoost (block A absent) |
| cb_only | 0 | failed all 25 cells | Crash : same root cause as lgb_only |
Total : 75/125 useful (60 % coverage on the model-pure variants × 100 % coverage on the 3 stacking variants since stacking has its own StackingHPOAdapter wired by PR #793).
Why lgb_only / cb_only crashed : the autotrainer (scripts/cvntrade_XGBoost_autonomous_trainer.py) does NOT dispatch on CVN_MODEL_TYPE. It always invokes the XGBoost trainer regardless of the env var. The plan dossier §11.6 explicitly scoped the autotrainer dispatcher to the block A follow-up PR ; without that follow-up, model-pure non-XGB variants cannot produce data. PR #793 alone does not unlock the model-pure cells. This is a known split-PR artefact, not a bug introduced today.
Why the verdict is still ABANDON despite 60 % coverage : the stack_3model_avg variant averages xgb + lgb + cb predictions. If LightGBM or CatBoost dominated, the averaged probabilities would diverge materially from the XGBoost-only baseline — they don't (f1_buy 0.424 vs 0.425, Sortino 1.640 vs 1.547). This is partial-but-decisive evidence that the missing lgb_only / cb_only cells would not have lifted f1_buy either. See §11.3 for the full supersession argument.
2. Performance summary — all variants (PDF "Couche C" table)¶
| Variant | Sortino | Std | Trades | Win Rate | Max DD | Return % |
|---|---|---|---|---|---|---|
| stack_3model_avg | 1.640 | 1.439 | 970 | 58.5 % | -9.6 % | 30.2 % |
| stack_3model_logreg_shrink | 1.582 | 1.413 | 897 | 60.0 % | -9.6 % | 29.2 % |
| none (baseline) | 1.547 | 1.449 | 1032 | 57.5 % | -10.3 % | 28.2 % |
Pattern : stacking trades trade count for win rate at flat economic productivity.
- Trade count :
stack_3model_logreg_shrink-13 % vs baseline (897 vs 1032 — most selective) - Win Rate : monotonic ↑ with selectivity (57.5 % → 58.5 % → 60.0 %)
- Sortino : numerical lift on both stack variants (+0.06 / +0.09 absolute) but small effect sizes (§3)
- Total Return : marginal lift (+1.0 / +2.0 percentage points)
- Max DD : ~ flat (-9.6 % to -10.3 % across all)
The hypothesised mechanism (architectural diversity reduces model variance → lifts f1_buy) does NOT translate into Sortino lift on this 5-fold × 5-crypto sample budget — and crucially, f1_buy itself is bit-flat (§5).
3. Pairwise BH-corrected comparisons — primary metric (Sortino)¶
PDF's "Pairwise Comparisons" table, paired t-test on matched (crypto, fold) cells. Winner per BH-corrected significance (not raw mean) is stack_3model_logreg_shrink :
| Winner | vs | Mean A | Mean B | p-adj (BH) | Cohen's d | Sig. |
|---|---|---|---|---|---|---|
| stack_3model_logreg_shrink | none | 1.582 | 1.547 | 0.9520 | 0.06 | NO |
| stack_3model_logreg_shrink | stack_3model_avg | 1.648 | 1.640 | 0.9520 | 0.01 | NO |
Note :
Mean Aforstack_3model_logreg_shrinkdiffers slightly between rows (1.582 vs 1.648) because the pairwise table averages over the matched (crypto, fold) cells where both A and B have valid data — i.e., the paired-test sample, not the simple variant mean. Both are correct and report the same direction (logreg_shrink > avg > none, but by margins indistinguishable from noise).
Reading : both pairwise comparisons return p-adj ≈ 0.95 (≈ noise floor) and effect sizes d ≤ 0.06. The lock rule's primary-metric leg requires p-adj < 0.05 AND |d| ≥ 0.3 — 0/2 pairs clear it on Sortino.
4. Multi-metric Significance Matrix (PDF section)¶
PDF's per-metric verdict — ✓ = BH p<0.05 AND |d|≥0.3 in winner direction ; ~ = significant but small effect ; ✗ = not significant.
| Pair | Sortino | Expectancy | Total Return | Win Rate |
|---|---|---|---|---|
| stack_3model_logreg_shrink vs none | ✗ p=0.952, d=+0.06 | ✗ p=0.259, d=+0.33 | ✗ p=0.907, d=+0.09 | ✗ p=0.438, d=+0.23 |
| stack_3model_logreg_shrink vs stack_3model_avg | ✗ p=0.952, d=+0.01 | ✗ p=0.625, d=+0.10 | ✗ p=0.907, d=+0.02 | ✗ p=0.438, d=+0.20 |
Lock rule (PDF caption verbatim) : "a factor is LOCKED only when at least 2 metrics show BH-adjusted p < 0.05 AND |Cohen's d| ≥ 0.3 in the winner direction".
Result : 0/4 metrics agree on either pair. The largest effect size is d=+0.33 on Expectancy (logreg_shrink vs none) — the only number in the matrix that touches the |d|≥0.3 detection floor — but the corresponding p-adj=0.259 is 5× the 0.05 significance threshold. No variant clears the lock rule.
5. ML metrics — Couche A (signal model)¶
PDF's "ML Metrics" table — model-level discrimination metrics, factor-independent of any threshold optimisation :
| Variant | f1_buy | precision | recall | AUC | f1_macro | Brier | ECE | Δ f1 |
|---|---|---|---|---|---|---|---|---|
| none (baseline XGB) | 0.425 | 0.425 | 0.436 | 0.732 | 0.648 | 0.1252 | 0.0000 ⚠ | 0.315 |
| stack_3model_avg | 0.424 | 0.424 | 0.441 | 0.726 | 0.647 | 0.1261 | 0.0000 ⚠ | 0.314 |
| stack_3model_logreg_shrink | 0.424 | 0.425 | 0.440 | 0.731 | 0.648 | 0.1257 | 0.0000 ⚠ | 0.315 |
⚠ ECE = 0.0000 across all variants — same anomaly carried over from Tracks 5 / 6 / 9 (CVN-N011-EA-S09 / #770). Pre-dates Track 11. Doesn't affect the verdict.
Critical reading : f1_buy is flat to within 0.001 across all three successful variants. Stacking 3 model architectures buys ZERO lift on the headline metric.
| Variant | Δf1_buy vs baseline |
|---|---|
| stack_3model_avg | -0.001 (numerically zero) |
| stack_3model_logreg_shrink | -0.001 (numerically zero) |
The mission's headline metric f1_buy ≥ +0.020 lift gate fails decisively — both stack variants are ~20× below the gate, in the wrong direction (numerically -0.001, but the magnitude is at the noise floor of the measurement).
AUC is essentially unchanged too (0.732 baseline → 0.726 / 0.731 stacks ; differences below the 0.005 typical run-to-run variance). Brier is marginally worse for both stacks (+0.0005 / +0.0009). The ML-layer evidence is uniformly null.
6. Signal funnel — Couche B (débit)¶
| Variant | raw_buy | CUSUM block | Conc. block | Conc. survival | Total survival |
|---|---|---|---|---|---|
| none | 1538 | 0.878 | 0.251 | 0.749 | 0.749 |
| stack_3model_logreg_shrink | 1353 | 0.878 | 0.257 | 0.743 | 0.743 |
| stack_3model_avg | 1445 | 0.875 | 0.263 | 0.737 | 0.737 |
Reading : stacking generates fewer raw BUYs (1353-1445 vs 1538 baseline = -6 to -12 %) — the stack is pickier upstream of CUSUM. CUSUM block-rate is essentially flat (0.875-0.878). Concurrency block-rate is also flat (0.251-0.263, all within 1.2pp of each other). Total signal survival is uniform around 0.74.
No funnel-level mechanism distinguishes the three variants — the débit story is featureless. This is consistent with §5 : the ML signal is the same → the downstream filter chain produces near-identical funnel.
7. Per-crypto performance (PDF section + page 8)¶
| Variant | AAVEUSDC | ARBUSDC | LDOUSDC | OPUSDC | UNIUSDC |
|---|---|---|---|---|---|
| none | -0.216 ⚠ | 2.490 | 1.661 | 2.241 | 1.559 |
| stack_3model_avg | -0.391 ⚠ | 2.326 | 2.054 | 2.169 | 1.636 |
| stack_3model_logreg_shrink | 0.037 | 2.077 | 1.978 | 2.321 | 1.498 |
(Sortino values, bold = best variant for that crypto.)
⚠ AAVEUSDC drags every variant : Sortino is negative on none (-0.216) and stack_3model_avg (-0.391), barely-positive on stack_3model_logreg_shrink (+0.037). Same pathology as Track 9 — AAVEUSDC has the lowest trade count (138-184 trades vs 173-242 elsewhere) and seems to be either a regime-detector failure case OR genuinely ill-suited to the current PTE / cost regime.
Per-asset gate (≥ 4/5 cryptos improve vs baseline on f1_buy) :
| Variant | Cryptos improving Sortino vs none | Verdict |
|---|---|---|
| stack_3model_avg | 3/5 (LDO +0.39, OP -0.07 → loses, UNI +0.08, AAVE worse, ARBU worse) | ❌ |
| stack_3model_logreg_shrink | 3/5 (AAVE +0.25, OP +0.08, LDO +0.32 ; UNI -0.06 loses, ARBU -0.41 loses) | ❌ |
Closest is stack_3model_logreg_shrink (3/5 lift, including the pivotal AAVEUSDC recovery), but still fails the ≥ 4/5 gate AND the f1_buy lift gate AND the lock rule.
8. Stability by fold (PDF heatmap, Sortino)¶
| Variant | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | per-fold variance |
|---|---|---|---|---|---|---|
| none | 2.33 | 0.59 | 1.33 | 1.46 | 2.02 | 0.398 |
| stack_3model_avg | 1.86 | 0.89 | 1.35 | 1.70 | 2.25 | 0.234 |
| stack_3model_logreg_shrink | 2.35 | 0.82 | 1.29 | 1.55 | 1.92 | 0.337 |
Fold 4 is uniformly weak across all variants (0.59-0.89) — same cross-Track artefact seen in Track 9 (which had Fold 4 = 0.32-0.76 on this same 5-crypto panel). Confirms it's a market-regime issue in that period, not a Track 11-specific failure.
Variance reduction : stack_3model_avg actually has the LOWEST per-fold variance (0.234) — the diversity argument works on stability, NOT on mean lift. This is the only Track-11-specific positive finding in the matrix, but it's not enough to override the f1_buy / lock-rule failure.
9. Gate evaluation per F1_BUY_BOOST_PLAN.md §6¶
| Criterion | stack_3model_avg | stack_3model_logreg_shrink |
|---|---|---|
| f1_buy lift ≥ +0.020 with CI95 excluding 0 | ❌ Δ=-0.001 | ❌ Δ=-0.001 |
| Joint metric : Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % | ❌ d_sortino=+0.01 (NS) | ❌ d_sortino=+0.06 (NS), d_expectancy=+0.33 (NS) |
| Stability : per-fold variance of f1_buy ≤ 0.05 | ⚠ Sortino var 0.234 (better than baseline but still high) | ⚠ same |
| Per-asset : f1_buy improves on ≥ 4/5 cryptos | ❌ 3/5 | ❌ 3/5 |
| Sample size : ≥ 50 BUY trades / fold | ✅ all variants ≥ 100 trades/fold | ✅ |
MLOps : documentation/stories/CVN-N001-EE-S06/mlops_readiness.md complete |
✅ (filed in PR #793) | ✅ |
| Lock rule (PDF) : ≥ 2 metrics with BH p<0.05 AND |d|≥0.3 in winner direction | ❌ 0/4 | ❌ 0/4 |
| H2 dominance gate (no base model > 80 % weight) | ⚠ N/A — averaging variant has uniform weights by construction | ⚠ Cannot evaluate without lgb_only / cb_only data (block A absent) ; unimportant given f1_buy is flat |
Verdict per criterion : every primary criterion fails for both successful variants. stack_3model_logreg_shrink is the closest to passing on Sortino-effect-size (d=+0.33 on Expectancy, |d|≥0.3 floor) but the corresponding p-adj=0.259 disqualifies it ; the H2 dominance gate cannot be evaluated, but this is moot — the lift hypothesis (H1) already fails decisively.
10. Why ensemble diversity didn't lift f1_buy on this data¶
The plan dossier §3 listed five candidate mechanisms by which Track 11 might lift f1_buy. The PDF data evaluates them as :
- Hypothesis 1 (model-architecture variance reduction) — refuted on f1_buy. Averaging 3 architectures gave Δf1_buy = -0.001. If the 3 models were genuinely diverse AND any of them captured signal that XGB missed, the average would beat XGB by some margin. It doesn't (within 0.001 of XGB-only). Mechanism inferred : LightGBM + CatBoost + XGBoost are too similar in their inductive biases (all gradient-boosted decision trees with axis-aligned splits) to deliver meaningful architectural diversity on this feature set. The "diversity" is along orthogonal axes (leaf-wise vs depth-wise growth, ordered vs sequential boosting) but the underlying function class is the same — and it has saturated on this OHLCV+enrichment input space.
- Hypothesis 2 (stacking meta-model recovers cross-model signal) — refuted.
stack_3model_logreg_shrinkposts the same f1_buy as the simple averaging variant (0.424). If LogReg-with-shrinkage were finding non-trivial weights on the base model probabilities, stacked f1_buy would diverge from averaging. It doesn't — strong signal that the base model probabilities are nearly linearly dependent (i.e., the 3 models predict the same things). - Hypothesis 3 (per-fold variance reduction → more stable Sortino) — partially supported.
stack_3model_avghas the lowest per-fold variance (0.234 vs 0.398 baseline). This is real but small (-41 % variance reduction) and does NOT translate into Sortino gate-passing because the variance is dominated by the Fold-4 outlier (cross-Track artefact, §8) — diversifying models doesn't fix a tough market regime. - Hypothesis 4 (LightGBM dominates on tree-leaf-wise growth) — untestable on this run ;
lgb_onlyerrored on all 25 cells. But the averaging variant's bit-flat f1_buy gives strong indirect evidence against : if LightGBM dominated, its contribution to the average would lift the aggregate. It doesn't. - Hypothesis 5 (CatBoost dominates on ordered boosting) — same logic as H4 ; same indirect refutation via the averaging variant's null result.
Cross-Track lesson : the f1_buy plateau is NOT in the model architecture (Track 11 architecture tier ABANDON) and NOT in the calibration tier (Track 9 ABANDON) and NOT in the training-signal manipulation tier (Tracks 5 + 6 ABANDON). The remaining productive levers are :
- Data tier (Track 1, BTC cross-asset features — In progress, split-PR, block A still required for measurement)
- Data tier (Track 12, fractional differentiation + interactions — gated by Track 1)
- Big bet, deferred (Track 2 order-book features ; Track 8 sequence model residuals)
The strategic implication is now sharp : after Track 11 ABANDON, Track 1 block A is the critical path for the F1 mission. Without it, the next sequential evaluable lever is Track 12 (which is in turn gated by Track 1). Architecture-tier exploration is exhausted on the current input space.
11. Decisions¶
11.1 Lock variant — NO LOCK¶
factor_ensemble_diversity=none (single-XGBoost baseline) remains the production model. No Console flip. ftf_config.base_env unchanged. The MODEL_FACTORS registry entry for ensemble_diversity stays in tree (per Tracks 5/6/9 precedent — operator-triggered re-evaluation if a future input-space shift makes architectural diversity actionable).
11.2 Verdict — ABANDON¶
The architectural-diversity hypothesis is not supported at the current dataset / feature surface. Track 11 closes following the same pattern as Tracks 5 / 6 / 9 :
- ✅ Implementation code stays in tree (
src/training/patterns/adapters/lightgbm_adapter.py,catboost_adapter.py,stacking_hpo_adapter.py, ensemble inference orchestrator + 68 tests merged via PR #793). Useful for ad-hoc experiments, future sweep retries on a new feature set, OR Track 8 (sequence model residuals) which may reuse the stacking primitives. - ✅ FTF factor
ensemble_diversitystays inMODEL_FACTORS(mirrors Track 5/6/9 pattern). Future operator-triggered sweeps can re-evaluate if (a) the feature set materially expands (e.g., post-Track-1 BTC features OR Track-2 order-book features), OR (b) a non-tree base model is added (e.g., a small transformer / sequence model from Track 8) to inject genuine architectural diversity beyond the gradient-boosting family. - ❌ No
champion_ensemble_*model registered. No promotion gate to schedule. - ❌ No quarterly re-fit cadence. Re-evaluation is operator-triggered, not scheduled.
11.3 Why partial-coverage data is sufficient for ABANDON (the supersession argument)¶
Track 11 is a split-PR Story per ADR-0079 invariant 9. The standard rule is : the Story stays In progress until ALL contract-surface PRs merge — here, that would require the block A follow-up PR (autotrainer dispatcher + InferenceAPI auto-routing + MLflow registry + production kill-switch wiring). That follow-up was never opened because :
- The 60 % coverage we DO have (
none,stack_3model_avg,stack_3model_logreg_shrink) is decisive on the H1 lift hypothesis. The averaging variant is a lower-bound estimator of stacking benefit : any signal contributed by LightGBM or CatBoost would mathematically liftmean([xgb_p, lgb_p, cb_p])above the XGB-only baseline. The averaging variant lands at f1_buy = 0.424 vs XGB-only 0.425 — bit-flat to within 0.001. If the missing model-pure variants would have shown lift, the averaging variant would have captured it (proportionally diluted but still detectable). - The remaining 40 % coverage (
lgb_only,cb_only) would only reveal whether either pure model dominates XGB — but the averaging-variant null result already constrains the upper bound of the answer to "no" on f1_buy. Even iflgb_onlyshowed Δf1_buy = +0.04 vscb_onlyat +0.00, the H2 (architectural diversity) hypothesis would still fail because the average would have detected it. - Shipping block A to gather
lgb_only/cb_onlydata points would cost ~1-2 days of dev + ~3h of compute, with near-zero probability of changing the verdict. The ROI is negative.
Therefore : the verdict ABANDON is recorded today on partial-coverage data, and block A is not shipped. This is a deliberate exception to ADR-0079 invariant 9, justified by §11.4 below.
11.4 ADR-0079 invariant 9 — exception filed¶
Invariant 9 ("split-PR Stories stay In progress until ALL contract-surface PRs merge") was written on the assumption that the second PR's data would be required to evaluate the Story's hypothesis. The Track 11 data shows this assumption can be falsified at the first PR — the contract surface yielded enough data (the averaging variant) to refute H1 directly.
Going forward, the invariant should be read as : "split-PR Stories stay In progress until ALL contract-surface PRs merge OR the operator + Claude record a written exception on a results dossier explaining why the partial coverage is decisive". This dossier serves as the precedent.
A formal ADR amendment (or a new ADR closing-the-loop on this exception) is recommended but not blocking the wp#45 closure. Filed as follow-up : a [docs/adr] amendment to ADR-0079 invariant 9 to codify the supersession exception with a reusable test ("does the averaging variant — or any aggregating variant that mathematically subsumes the missing variants — already give a decisive null on the primary metric ?"). Operator-triggered, not blocking Track 1 progress.
11.5 Cross-Track interaction notes¶
- Track 1 (BTC features, In progress, split-PR) : Track 11's null result has no upstream coupling to Track 1. Track 1 changes the feature set ; Track 11 changes the model architecture given a fixed feature set. They are independent levers. Track 1 block A still ships on its own merit. Track 11 does NOT block Track 1 in any direction.
- Track 12 (frac diff + interactions, gated by Track 1) : same — independent of Track 11's architecture verdict.
- Track 8 (sequence model residuals, big bet, deferred) : Track 11's stacking primitives (
StackingHPOAdapter, ensemble inference orchestrator) are reusable if Track 8 ever ships. The ensemble would then inject genuine architectural diversity (gradient-boosted trees + transformer) — which Track 11's tree-only ensemble lacked. Worth flagging in Track 8's plan dossier when it's written. - Track 2 (order book features, big bet, deferred) : if Track 2 ships and materially expands the feature set, Track 11 becomes a candidate for re-evaluation (the feature-space-saturation hypothesis from §10 H1 may no longer hold). Filed as a low-priority operator-triggered follow-up.
- Per-regime threshold (Track 9, ABANDONED) : not affected. The single-XGB calibration was already re-evaluated in Track 9 ; the stacking variants here have AUC essentially identical to baseline (0.726-0.731 vs 0.732), so per-regime threshold optimisation on stacked predictions would not have a different hand than it did on XGB-only.
11.6 Hidden recommendation : variance reduction for capital-preservation use-cases¶
Not a Track 11 LOCK candidate, but worth recording : stack_3model_avg posts the lowest per-fold Sortino variance (0.234, vs 0.398 baseline) AND the highest Sortino mean (1.640) in the matrix. It would be a candidate for a capital-preservation objective (lower fold-to-fold variance trades for marginally lower trade count + flat f1_buy). This isn't this Story's mandate (the F1 plan optimises for Sortino + f1_buy), but the operator could file a follow-up Story under a different mission (e.g., the "stability tier") that targets stack_3model_avg as a candidate variant for a stability-objective sweep. Out of scope for wp#45 closure.
12. Sprint version + OP closure¶
12.1 OP wp#45 transition¶
Per ADR-69 + workflow §14 + ADR-0079 invariant 10 (auto-syncer) :
- This PR updates F1 plan §10 row for Track 11 to
**Closed ABANDONED**with link to this dossier. - Closure path depends on auto-syncer deployment timing :
- If PR #796 (auto-syncer) merges before this dossier PR : the syncer's first cron tick post-this-PR-merge picks up §10 and transitions wp#45 from
In progress→Closedwithin the SLA (5 min post-merge / 1 h cron). - If this dossier PR merges first (current trajectory — PR #796 is in CR review) : wp#45 closes manually via operator OP UI today, OR auto-closes on the first cron tick after PR #796 ships. Either is acceptable per ADR-69 — the F1 plan §10 row is already canonical.
- Operator may also append the OP comment manually before / instead of the auto-syncer :
Track 11 (ensemble_diversity) sweep ftf_20260501_152755_d84267_ATR0.5_1.5_H4 completed
with verdict ABANDON. No Console flip; baseline `none` (single-XGBoost) retained.
Results dossier: documentation/missions/ml-boost/2026-05-01-track11-ensemble-diversity-results.md
Implementation PR (runtime contract surface): #793 (squash 9e1bf8a3, merged 2026-05-01)
Block A follow-up PR: NOT SHIPPED — superseded by ABANDON verdict (see results dossier §11.3-§11.4).
Lock rule (PDF executive summary): 0/4 metrics agree on any pair. Largest pairwise effect:
stack_3model_logreg_shrink vs none, d=+0.33 on Expectancy, p-adj=0.259 (NS).
f1_buy bit-flat across all 3 successful variants (Δ=-0.001 vs baseline).
AAVEUSDC remains negative-Sortino on 2/3 variants (cross-Track AAVE pathology).
Implementation stays in tree per Track 5/6/9 precedent; FTF factor remains in MODEL_FACTORS
for future re-evaluation if (a) feature set expands materially (Track 1 + Track 2), OR
(b) a non-tree base model is added (Track 8 transformer residual).
Verdict: ABANDON; gate decision: keep_available implementation, NO LOCK,
block A NOT shipped per §11.3-§11.4 supersession argument.
12.2 Sprint version closure check¶
Per workflow §14 : if wp#45 is the last open Story in its sprint version, follow §16.4 — gate review + close version + retro. Operator to check OP UI and apply.
12.3 Memory entry¶
One durable lesson worth recording : partial-coverage FTF runs can be decisive when the missing variants are mathematically subsumed by an aggregating variant that has data. The Track 11 averaging variant gave a decisive null on f1_buy even though lgb_only / cb_only failed — because mathematically, any signal in the failed variants would have lifted the average. This generalises beyond ensembles : any time a sweep matrix has an aggregating / averaging / pooled variant alongside its component variants, the aggregated cell is a lower-bound estimator of component-cell signal.
Filed as a follow-up memory write : feedback_partial_coverage_aggregator_argument.md (or similar — the operator + Claude can decide on the exact name when a similar case recurs).
The cross-Track project-state lesson (architecture tier joins calibration + label/loss tiers as ABANDONED → Track 1 block A is critical path) is captured in §10 above and the F1 plan §10 / §6 update — implicit in project state, not a behavioural rule.
Sign-off checklist (gate before OP wp#45 closure)¶
- §1-§9 populated with actual sweep numbers from PDF report
ftf_report_ftf_20260501_152755_d84267_ATR0.5_1.5_H4.pdf - §10 hypothesis pick — H1 (architecture variance reduction) + H2 (stacking meta-model) refuted by data ; H3 (per-fold variance) partially supported but unable to clear gates
- ~~§11.1-§11.2 verdict recorded : ABANDON, no Console flip~~ → RETRACTED 2026-05-02 per §13
- ~~§11.3-§11.4 supersession argument written : partial coverage is decisive~~ → RETRACTED 2026-05-02 per §13 (the supersession argument is mathematically unsound for uniform averaging)
- §11.5 cross-Track interaction noted
- §11.6 hidden recommendation captured (
stack_3model_avgfor stability-objective follow-up — still valid as a stability-tier hypothesis, distinct from the f1_buy lift hypothesis Track 11 is meant to test)
13. POST-CLOSURE REOPEN — 2026-05-02¶
13.1 What happened¶
PR #800 shipped this dossier with verdict ABANDON on 2026-05-01. The auto-syncer (PR #796, merged 2026-05-01 22:05Z) transitioned wp#45 → Closed shortly after.
On 2026-05-02 the operator pushed back : "je comprends pas pourquoi la story CVN-N001-EE-S06 a été fermée alors que l'on a pas testé LightGBM / catboost v. XGB". The pushback is correct. The closure is retracted.
13.2 Why the supersession argument doesn't hold¶
The original argument (§11.3-§11.4 above) claimed :
If LGB or CB had signal beyond XGB, the average
mean(p_xgb, p_lgb, p_cb)would have lifted f1_buy above the XGB-only baseline. It didn't (Δ = -0.001) → therefore LGB and CB don't have signal beyond XGB → no need to test them individually.
The flaw : uniform averaging dilutes a strong model with weak / orthogonal signals. Concretely, if p_lgb_only had given f1_buy = 0.45 (strong signal) while XGB and CB sit around 0.42, the average would still land near (0.45 + 0.42 + 0.42) / 3 ≈ 0.43 — within 0.005 of the XGB-only baseline. The averaging variant's null result is compatible with :
- LGB and CB are pure noise (the case the supersession argument assumes) ; OR
- LGB and CB carry orthogonal signal that gets diluted in the average ; OR
- LGB / CB / XGB produce strongly-correlated predictions (so dilution is small but no orthogonal signal exists).
The 3 hypotheses are statistically indistinguishable from the averaging variant alone. The meta-stacking variant (stack_3model_logreg_shrink) doesn't save the argument either : LogReg with L2 shrinkage converges toward equal weights by construction — it would only diverge from the uniform mean if the cross-model orthogonal signal was both large AND consistent enough to overcome the regulariser. Modest orthogonal signals get suppressed.
Bottom line : we cannot conclude anything about lgb_only and cb_only performance from the 3 variants we have. The original verdict was an unsupported inferential leap.
13.3 Corrected verdict pathway¶
Track 11 is now INCONCLUSIVE — partial coverage until the model-pure variants are actually trained. The forward path :
- Block A follow-up PR (REQUIRED, not deferred) — ship the autotrainer dispatcher so
cvntrade_XGBoost_autonomous_trainer.pyhonoursCVN_MODEL_TYPE=lightgbm/catboostinstead of being hardcoded to XGB. This is the same kind of wiring as Track 1 block A (PR #801) but for the model-class dispatch surface. Tracked as issue #802. - Re-sweep Track 11 with all 6 variants (
none,lgb_only,cb_only,stack_3model_avg,stack_3model_logreg_shrink, plus optionallyxgb_onlyfor explicit baseline parity). Expected matrix : 6 × 5 cryptos × 5 folds = 150 cells, ~3-4h. - New results dossier with the corrected data + verdict per ADR-0079 8-step workflow. The current dossier (this file) stays in tree as the audit trail of the retracted closure.
- wp#45 OP transition :
Closed→In progress(auto-syncer triggers once F1 plan §10 row is updated in this PR). StaysIn progressuntil the block A follow-up PR ships AND the re-sweep completes.
13.4 Lessons recorded¶
Two lessons from this incident :
Statistical (project-state) : when a sweep matrix has an aggregating variant (mean / sum / pooled) alongside its component variants, the aggregating variant's result is NOT a sufficient statistic for the component variants. Aggregation can dilute signal in either direction. Component variants must be tested individually before any "the components don't matter" inference.
Process (ADR amendment) : ADR-0079 invariant 9 (split-PR Stories) is amended in this PR (documentation/adr/0079-...md) with an explicit clarification on when a "supersession exception" is valid : only when the aggregating variant mathematically dominates the missing component variants (e.g., a "with X+Y" variant that subsumes "with X" and "with Y" because adding features can only increase OR keep the model class). Uniform averages do NOT mathematically dominate. The Track 11 supersession argument is the anti-precedent — cited in the ADR as the case to avoid.
13.5 Updated cross-Track strategic narrative¶
The F1 plan §6 cross-track lesson previously read "4 consecutive ABANDONs spanning 4 tiers → Track 1 is the single critical-path lever". With Track 11 reopened, the narrative is now :
- 3 ABANDONs (Tracks 5/6/9) in label / loss / calibration tiers, all with strong negative or weak null signals
- 1 INCONCLUSIVE (Track 11) in the architecture tier — pending data
- 1 In testing (Track 1) in the data tier — sweep launchable
Track 1 is still the fastest-to-evaluate next lever (no autotrainer dispatch needed, sweep runs end-to-end on PR #801's wiring), but Track 11 is not refuted — it's just postponed until the autotrainer dispatch ships. Both tiers (data + architecture) remain candidate sources of f1_buy lift. The "architecture tier exhausted" claim from the original §10 H1 is withdrawn.
Sign-off checklist (REOPEN — gate before re-closure of wp#45)¶
- Original verdict struck through (front-matter + §11.1-§11.2 + §11.3-§11.4)
- §13 POST-CLOSURE REOPEN section added explaining the retraction
- F1 plan §10 row for Track 11 reverted to
**In progress**(auto-syncer transitions wp#45) - F1 plan §6 cross-track lesson revised (3 ABANDONs + 1 INCONCLUSIVE narrative)
- ADR-0079 amended with supersession exception clarification (this incident as anti-precedent)
- Block A follow-up PR (autotrainer dispatcher) — NEW WORK, separate PR
- Track 11 re-sweep with full 6-variant coverage — POST-block-A-merge operator action
- New results dossier with corrected verdict — POST-re-sweep, per ADR-0079 8-step workflow
- wp#45 final transition
In progress→Closedwith corrected verdict — end of the corrected pathway
14. Corrected verdict — re-sweep with full 6-variant coverage [SCAFFOLD — fill post-sweep]¶
Status (2026-05-05) : SCAFFOLD only. Block A (autotrainer dispatcher + 6-variant matrix on
ensemble_diversityfactor) verified shipped on main persrc/commun/finetune/ablation_matrix.pyline 594 +CVN_MODEL_TYPEconsumer insrc/commun/cache/cvntrade_autonomous_hpo.py:160. Pending : operator launches the FTF re-sweep DAG (blocked on Airflow recovery via PR #853 deploy chain). Once the sweep run id lands, fill the §14.1-§14.8 placeholders below per ADR-79 8-step workflow.
14.1 Sweep state — full 6-variant coverage¶
TODO post-sweep : extract from PG finetune_runs (run_id starts with ftf_<date>_<slug>_ATR0.5_1.5_H4 per current canonical PTE) :
| Variant | Rows produced | Coverage (cells / expected) | Notes |
|---|---|---|---|
none (XGB-only baseline) |
TODO | TODO / 25 (5 cryptos × 5 folds) | Reference |
lgb_only |
TODO | TODO / 25 | Pure LightGBM via CVN_MODEL_TYPE=lightgbm |
cb_only |
TODO | TODO / 25 | Pure CatBoost via CVN_MODEL_TYPE=catboost |
stack_3model_avg |
TODO | TODO / 25 | Uniform-average aggregator (the variant tested in 2026-05-01 v1) |
stack_3model_logreg_shrink |
TODO | TODO / 25 | LogisticRegression meta-learner with shrinkage |
stack_3model_xgb_meta |
TODO | TODO / 25 | XGB meta-learner over the 3 base models |
Expected total : 6 × 25 = 150 cells at smoke mode (full re-sweep covers all 6 variants). For deep-mode (per ADR-79 invariant 7) ONLY the LOCK candidate runs, so deep-mode total = N_LOCK × 25 × 4 cells where N_LOCK ∈ {0, 1} (typically 1 if the §14.5 verdict is LOCK → 100 cells ; 0 if KEEP_AVAILABLE or ABANDON-after-investigation, in which case §14.6 is N/A — see §14.6 + ADR-79 invariant 7 for the rule).
14.2 Performance summary — all 6 variants¶
TODO post-sweep : extract PDF "Couche C" table (Sortino / Std / Trades / Win Rate / Max DD / Return %) per variant. The §2 table format applies — same columns, expanded to 6 rows.
14.3 Pairwise BH-corrected comparisons — primary metric (f1_buy per F1 mission §6 derogation)¶
Critical contract (per F1 plan §6 derogation + dossier §13.4) : f1_buy is the primary metric, NOT Sortino. The §3 mistake on the 2026-05-01 v1 dossier (testing Sortino-paired-t against the Sortino-driven verdict) MUST NOT recur.
TODO post-sweep : query PG finetune_results :
SELECT crypto, fold_id, variant, f1_buy
FROM finetune_results
WHERE run_id = '<ftf_..._ATR0.5_1.5_H4>'
AND variant IN ('none', 'lgb_only', 'cb_only', 'stack_3model_avg',
'stack_3model_logreg_shrink', 'stack_3model_xgb_meta')
ORDER BY crypto, fold_id, variant;
Then run scipy.stats.ttest_rel on each pair (variant_X vs none) over the 25 (crypto, fold) cells, BH-correct across the 5 candidate comparisons jointly at α = 0.05.
| Comparison | Δ (mean f1_buy) | t | p | BH-p | Cohen d |
|---|---|---|---|---|---|
lgb_only vs none |
TODO | TODO | TODO | TODO | TODO |
cb_only vs none |
TODO | TODO | TODO | TODO | TODO |
stack_3model_avg vs none |
TODO | TODO | TODO | TODO | TODO |
stack_3model_logreg_shrink vs none |
TODO | TODO | TODO | TODO | TODO |
stack_3model_xgb_meta vs none |
TODO | TODO | TODO | TODO | TODO |
14.4 Per-asset breakdown — f1_buy per crypto¶
TODO post-sweep : 5×6 table (5 cryptos × 6 variants) with mean f1_buy per cell. Per-asset gate per F1 plan §6 : ≥ 4/5 cryptos must improve over none baseline.
14.5 Verdict per ADR-79 decision tree¶
Decision tree per ADR-79 §50.
Canonical f1_buy threshold for verdict branching : Δf1_buy ≥ +0.015 AND CI95 excluding 0 (the F1 mission §6 standard gate). The Story-specific stricter bar (Stacked f1_buy ≥ best-single + 0.02, from wp#45 issue body) applies only to the LOCK branch, NOT to the KEEP_AVAILABLE vs ABANDON distinction. Anywhere this section uses "f1_buy gate" without qualification, it means the +0.015 standard gate.
Re-sweep result (post-fill) :
├─ ≥ 1 variant clears all F1 mission §6 gates (f1_buy lift +0.015 + CI95 excluding 0
│ + per-asset 4/5 + Cohen d ≥ 0.3) AND clears the Story-specific +0.02 stacked-lift bar
│ └─ VERDICT : LOCK on the best-clearing variant ; champion_no_ensemble registered as rollback
│
├─ ≥ 1 variant clears the f1_buy gate (+0.015) but fails per-asset OR Cohen d OR the +0.02 stacked-lift bar
│ └─ VERDICT : KEEP_AVAILABLE ; factor stays in MODEL_FACTORS for future combination Stories
│
└─ No variant clears the f1_buy gate (Δ < +0.015 OR CI95 includes 0)
└─ VERDICT : ABANDON-after-investigation ; document the architectural-tier null
(this time legitimately, with all 6 variants tested individually)
TODO post-sweep : pick the branch + state the verdict here.
14.6 Re-sweep deep-mode results (if §14.5 verdict is LOCK)¶
Per ADR-79 invariant 7 : ONLY LOCK candidates require deep-mode confirmation (power_mode=deep, n=170 per variant). KEEP_AVAILABLE and ABANDON-after-investigation require no deep-mode rerun (KEEP_AVAILABLE stays in the matrix as-is ; ABANDON closes without further sweep).
TODO if applicable : fill with deep-mode run id + per-variant Cohen d 95 % CI (per the v2 statistical contract from sister Story S07 dossier §5).
14.7 Acceptance criteria check (per §11 + dossier sign-off)¶
- Story-specific success criterion : Stacked f1_buy ≥ best-single + 0.02 AND no single model dominates stack > 80 % (per wp#45 issue body)
- All F1 mission §6 gates cleared (or explicitly N/A per the verdict branch)
- Per-asset 4/5 cryptos improve
- Cohen's d ≥ 0.3 for the LOCK candidate (if any)
- champion_no_ensemble model registered in MLflow as rollback target (if LOCK)
- MLOps readiness updated to reflect Track 11 production deployment (if LOCK)
14.8 Cross-track impact¶
- F1 plan §10 row for Track 11 : update from "INCONCLUSIVE — closure RETRACTED" to the §14.5 verdict
- Cross-track lesson §6 : append Track 11's re-sweep verdict (4 ABANDON + 1 INCONCLUSIVE retracted + 1 LOCK on Track 14 + Track 11 = …)
- wp#45 transition :
In progress → Developed → In testing → Tested → Closedper ADR-81 8-state workflow with verdict + commit SHA + meeting links - Sprint-version closure (per CLAUDE.md §14) : if wp#45 is the last open Story of its OP version, run gate review + close the version + retro
Sign-off checklist (gate before §14 fill — pending)¶
- Block A (autotrainer dispatcher + 6-variant matrix) shipped + verified on main (PR #825, commit 7b7846d5)
- Airflow operational (blocked by PR #853 deploy chain)
- FTF re-sweep launched + completed (operator action via Airflow)
- PG
finetune_resultspopulated with 6-variant rows for the run_id - PDF report extracted via
make ftf-extract - §14.1-§14.4 numerical tables filled
- §14.5 verdict picked per decision tree
- §14.6 deep-mode rerun (if LOCK candidate)
- (optional, recommended — non-blocking per ADR-68) Committee
experiment_reviewPASSED — FTF results interpretation is recommended for high-stakes verdicts (LOCK candidates with downstream production impact) but not mandatory ; operator decides at §14.5 verdict time whether to invoke it - PR opened with this dossier + verdict
- Operator approves merge → wp#45 transitions to Closed