Track 1 : BTC cross-asset features results & gate decision

Story : CVN-N001-EE-S04 (wp#43), Track 1 of F1_buy boost mission, data tier
ADR : closes per ADR-0079 (FTF sweep → Story closure 8-step workflow). First Track to follow the workflow with the make ftf-extract-equivalent path codified in ADR-0080 (PDF extraction from PG finetune_runs.pdf_report via kubectl exec into the console pod ; ADR-0080 invariant 3 grace clause used pending the Make target).
Date : 2026-05-02
Authors : Dominique (operator) + Claude
Plan dossier : 2026-04-30-track1-btc-features-plan.md (committee plan_review PASSED 2026-04-30)
Implementation PRs :
- Runtime contract surface : #792 (squash 71bf70ba, merged 2026-04-30)
- Block A wiring : #801 (squash 4a39e665, merged 2026-05-01) ; committee pr_review PASSED (Mistral-only, single-LLM degraded mode per Gemini key outage)
Sweep run_id : ftf_20260501_230526_2483f9_ATR0.5_1.5_H4
Sweep status : completed ; 150 results ; 0 errors (full coverage on all 6 variants × 5 cryptos × 5 folds) ; duration 3h53m ; started 2026-05-01 23:05:56 UTC ; git SHA 4a39e665 (= PR #801 squash, ~9h post-merge)
Cryptos : UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC (defi_top5, same panel as Tracks 9 + 11 for cross-comparability)
Folds : 5 (Folds 3-7)
Trials : 50 per fold (operator chose 50, exceeding both standard n=15 and deep n=30 budgets ; the power_mode=standard reference matrix is 5 cryptos × 5 folds × 15 trials per commun.finetune.power_mode)
Cost : 15 bps
FTF report PDF : reports/ftf_report_ftf_20260501_230526_2483f9_ATR0.5_1.5_H4.pdf (committed to docs base ; source-of-truth for the per-pair / per-fold tables in §2-§9)

Verdict : ABANDON pending leakage root-cause investigation — the mandatory leakage check fails per plan §4.6 spec. The plan dossier specifies a paired t-test on f1_buy(btc_full_purge0) - f1_buy(btc_full) (NOT Sortino), with BH-corrected p ≥ 0.05 required for PASS. The actual paired test on the 25 cells gives Δ=+0.00727, t=2.171, p=0.0401, Cohen d=+0.434 — btc_full_purge0 outperforms btc_full on f1_buy with statistical significance, exactly the leakage-detector's red-flag pattern. Per plan §4.6 verbatim : "If the paired difference is significantly positive (BH-corrected p < 0.05) → leakage suspected → ABANDON Track 1 pending root-cause investigation."

Critical caveat — Sortino diverges from f1_buy on this comparison : on the same 25 cells, the canonical btc_full (purge=20) BEATS btc_full_purge0 on Sortino (1.710 vs 1.390, +0.32, +23%). This is NOT a logical contradiction of the ABANDON verdict — the leakage gate is on f1_buy per plan §4.6, and Sortino measures something different (trade-level economic productivity, not per-prediction accuracy). The divergence is informative : the f1_buy lift on purge=0 likely reflects a small per-prediction signal that does NOT translate into trade-level economic lift (the model marginally improves its calls but the trades it picks are dominated by altcoin-specific noise on the returns side). The investigation must distinguish : (a) genuine but small ADR-14 violation that canonical purge=20 already plugs ; OR (b) production-exploitable signal mistakenly purged (BTC's bar-i close IS available at altcoin's bar-i decision time in production) ; OR (c) statistical noise (p=0.0401 just below the 0.05 threshold).

Lock decision : NO LOCK. factor_btc_features=none (XGB-blind baseline) remains the production model. Console state unchanged. Factor stays in MODEL_FACTORS per ADR-0079 invariant 6 — the leakage investigation may produce a corrected variant set (e.g., purge_bars sensitivity sweep [0, 5, 10, 20, 40] to identify the production-feasible minimum) that re-opens the LOCK candidacy.

Earlier (incorrect) verdict : a previous draft of this dossier (now retracted) called this KEEP_AVAILABLE based on a Sortino-based leakage check ("btc_full Sortino 1.710 > btc_full_purge0 1.390 → no leakage"). CodeRabbit pass 1 caught the spec mismatch — the plan §4.6 explicitly mandates f1_buy paired t-test, not Sortino comparison. Mea culpa. This dossier ships the corrected verdict per the plan's mandatory gate.

1. Sweep state : full coverage

Per the FTF report's executive summary (run completed 0 errors, 150 rows = 5 cryptos × 5 folds × 6 variants, no rejected variants — the first sweep with full coverage since the autotrainer became BTC-aware via PR #801) :

Factor	Variant	Useful rows	Coverage	Notes
btc_features	none (baseline)	25	5 × 5	Single-XGBoost, OHLCV-only enrichment (current production)
btc_features	btc_min	25	5 × 5	3 directional features (returns 1h/4h/24h)
btc_features	btc_full	25	5 × 5	Canonical 6 features (returns + vol + zscore + lag5 corr)
btc_features	btc_full_purge0	25	5 × 5	Leakage-detection sanity — purge=0 (look-ahead allowed) ; should NOT outperform btc_full
btc_features	btc_full_purge10	25	5 × 5	Sensitivity test — purge=10 (vs canonical 20)
btc_features	btc_vol_only	25	5 × 5	3 volatility features (vol/zscore/lag5-corr)

Sweep status : succeeded (every variant produced metrics on every (crypto, fold) cell). Clean run — no fold dropped, no fallback to baseline. PR #801's wiring is validated end-to-end.

2. Performance summary : all variants (PDF "Couche C" table)

Variant	Sortino	Std	Trades	Win Rate	Max DD	Return %
btc_full	1.710	1.548	847	64.5 %	-8.8 %	29.8 %
btc_vol_only	1.606	1.534	876	59.3 %	-10.2 %	30.3 %
none (baseline)	1.547	1.449	1032	57.5 %	-10.3 %	28.2 %
btc_full_purge10	1.425	1.621	1072	57.7 %	-10.5 %	27.4 %
btc_min	1.417	1.365	823	62.1 %	-9.4 %	25.3 %
btc_full_purge0	1.390	1.420	818	62.1 %	-9.9 %	24.3 %

Pattern (descriptive ; gate evaluation per spec is in §5.1 + §9, not inferable from this Sortino table alone) :
- btc_full posts the highest Sortino and the highest Win Rate in the matrix.
- The leakage-detection variant btc_full_purge0 has the WORST Sortino. This is an observed fact only ; no leakage inference can be drawn from this table because the leakage gate per plan §4.6 is on f1_buy, not Sortino. See §5.1 for the spec'd test : on f1_buy the canonical/purge0 ordering inverts (purge0 outperforms canonical with p=0.0401, gate FAILS).
- btc_full_purge10 (sensitivity, smaller purge window) underperforms btc_full on Sortino. This is also an observed fact only ; the purge_bars sensitivity sweep planned in the leakage investigation (#806 Phase B) will determine the right value across multiple metrics, not just Sortino.
- btc_min (3 directional features only) underperforms btc_full on Sortino.
- btc_vol_only (3 volatility features only) sits between baseline and btc_full on Sortino.

The Sortino progression btc_full > btc_vol_only > none > btc_min is observed but not informative on the leakage gate ; the leakage check uses a different metric (f1_buy, §5.1) and the per-asset gate is also on f1_buy (§9). Cross-metric inference must be explicit, not assumed.

3. Pairwise BH-corrected comparisons : primary metric (Sortino)

PDF's "Pairwise Comparisons" table, paired t-test on matched (crypto, fold) cells. btc_full has the higher mean in every pair, but no comparison reaches BH-corrected significance :

Winner	vs	Mean A	Mean B	p-adj (BH)	Cohen's d	Sig.
btc_full	btc_full_purge0	1.739	1.443	0.6194	0.37	NO
btc_full	btc_full_purge10	1.770	1.473	0.6194	0.28	NO
btc_full	btc_min	1.710	1.449	0.6194	0.26	NO
btc_full	btc_vol_only	1.770	1.606	0.6194	0.16	NO
btc_full	none	1.710	1.586	0.6194	0.20	NO

Reading : all comparisons return BH-adjusted p ≈ 0.62 (not significant), with effect sizes that all favour btc_full. The largest Sortino effect is btc_full vs btc_full_purge0 at d=+0.37. This is a Sortino observation only : it is not the leakage test, which plan §4.6 defines on f1_buy (see §5.1).
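The identical adjusted p-values in the table are what the Benjamini-Hochberg monotonicity step produces when the largest raw p-value caps all the others. A minimal sketch of that mechanism, assuming hypothetical raw p-values (the report only publishes the adjusted ones) and a hand-rolled `bh_adjust` helper that is not part of the project code :

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ascending raw p-values
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Step-up monotonicity: each value is capped by the values at larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# HYPOTHETICAL raw p-values, chosen only so that the largest one is 0.6194.
raw_p = [0.20, 0.30, 0.40, 0.50, 0.6194]
adj_p = bh_adjust(raw_p)
print(adj_p)  # all five adjusted values equal the largest raw p-value
```

Under these assumed inputs every pair ends up with the same adjusted p, which is why a flat 0.6194 column says little about the individual comparisons ; the effect sizes carry the per-pair information.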

4. Multi-metric Significance Matrix : `btc_full` vs alternatives

PDF's per-metric verdict — ✓ = BH p<0.05 AND |d|≥0.3 in winner direction ; ~ = significant but small effect ; ✗ = not significant.

Pair	Sortino	Expectancy	Total Return	Win Rate
btc_full vs btc_full_purge0	✗ p=0.619, d=+0.37	✗ p=0.701, d=+0.11	✗ p=0.729, d=+0.36	✗ p=0.854, d=+0.09
btc_full vs btc_full_purge10	✗ p=0.619, d=+0.28	✗ p=0.150, d=+0.45	✗ p=0.832, d=+0.16	✗ p=0.242, d=+0.43
btc_full vs btc_min	✗ p=0.619, d=+0.26	✗ p=0.286, d=+0.32	✗ p=0.729, d=+0.26	✗ p=0.366, d=+0.31
btc_full vs btc_vol_only	✗ p=0.619, d=+0.16	✗ p=0.286, d=+0.34	✗ p=0.949, d=+0.03	✗ p=0.337, d=+0.34
btc_full vs none	✗ p=0.619, d=+0.20	✗ p=0.094, d=+0.61	✗ p=0.889, d=+0.08	✗ p=0.154, d=+0.57

Lock rule (PDF caption verbatim) : "a factor is LOCKED only when at least 2 metrics show BH-adjusted p < 0.05 AND |Cohen's d| ≥ 0.3 in the winner direction".

Result : 0/4 metrics agree on any pair → no LOCK today.

BUT — the btc_full vs none row is the most promising single comparison in the F1 mission to date :

Expectancy d=+0.61 (LARGE effect) with p=0.094 (close to the 0.05 threshold ; would clear at n≈45 per variant ≈ power_mode=deep)
Win Rate d=+0.57 (medium-large) with p=0.154 (would clear at n≈80 per variant)
All 4 metrics positive in the winner direction (no contradicting signal)

Taken alone, this pattern (an effect of meaningful magnitude that the n=25 sample budget is too small to formalise) would argue for KEEP_AVAILABLE. It does not decide the verdict : the mandatory leakage gate on f1_buy (§5.1) fails, and that gate controls.
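The lock rule quoted above can be applied mechanically to the significance matrix. A minimal sketch, with the (BH p, Cohen's d) pairs copied from the table ; the names `MATRIX`, `lock_votes` and `effect_only` are illustrative and do not come from the FTF report code :

```python
# (BH-adjusted p, Cohen's d) per metric, copied from the significance matrix.
# Positive d means btc_full (the listed winner) is ahead.
MATRIX = {
    "btc_full vs btc_full_purge0": {
        "sortino": (0.619, 0.37), "expectancy": (0.701, 0.11),
        "total_return": (0.729, 0.36), "win_rate": (0.854, 0.09)},
    "btc_full vs btc_full_purge10": {
        "sortino": (0.619, 0.28), "expectancy": (0.150, 0.45),
        "total_return": (0.832, 0.16), "win_rate": (0.242, 0.43)},
    "btc_full vs btc_min": {
        "sortino": (0.619, 0.26), "expectancy": (0.286, 0.32),
        "total_return": (0.729, 0.26), "win_rate": (0.366, 0.31)},
    "btc_full vs btc_vol_only": {
        "sortino": (0.619, 0.16), "expectancy": (0.286, 0.34),
        "total_return": (0.949, 0.03), "win_rate": (0.337, 0.34)},
    "btc_full vs none": {
        "sortino": (0.619, 0.20), "expectancy": (0.094, 0.61),
        "total_return": (0.889, 0.08), "win_rate": (0.154, 0.57)},
}

def lock_votes(row, alpha=0.05, min_d=0.3):
    """Metrics with BH p < alpha AND d >= min_d in the winner direction."""
    return sum(1 for p, d in row.values() if p < alpha and d >= min_d)

def effect_only(row, min_d=0.3):
    """Metrics that clear the effect-size bar, ignoring significance."""
    return sum(1 for _, d in row.values() if d >= min_d)

for pair, row in MATRIX.items():
    locked = lock_votes(row) >= 2  # the rule needs at least 2 agreeing metrics
    print(f"{pair}: votes={lock_votes(row)}/4, d>=0.3 on {effect_only(row)}/4, LOCK={locked}")
```

Separating the two counts makes the under-powered reading explicit : the effect-size bar is met on some metrics while the significance bar is met on none.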

5. ML metrics : Couche A (signal model)

PDF's "ML Metrics" table (model-level discrimination metrics, independent of any threshold optimisation) :

Variant	f1_buy	precision	recall	AUC	f1_macro	Brier	Δ f1
btc_full_purge0	0.433 ⚠	0.431	0.449	0.732	0.652	0.1248	0.319
btc_full	0.432	0.424	0.452	0.732	0.650	0.1257	0.317
btc_vol_only	0.426	0.430	0.436	0.728	0.648	0.1268	0.315
none (baseline)	0.425	0.425	0.436	0.732	0.648	0.1252	0.315
btc_min	0.424	0.414	0.454	0.730	0.646	0.1259	0.313
btc_full_purge10	0.423	0.429	0.433	0.729	0.647	0.1258	0.314

⚠ ECE = 0.0000 across all variants — same anomaly carried over from Tracks 5 / 6 / 9 / 11 (CVN-N011-EA-S09 / #770). Pre-dates Track 1.

5.1 Mandatory leakage check (plan §4.6)

The plan dossier §4.6 specifies the leakage check as a paired t-test on f1_buy(btc_full_purge0) - f1_buy(btc_full) across 25 paired (crypto, fold) cells, with BH-corrected p ≥ 0.05 required for PASS. The PDF aggregates only show the per-variant means ; the per-cell paired comparison was extracted from the prod PG finetune_results table by querying WHERE run_id = 'ftf_20260501_230526_2483f9_ATR0.5_1.5_H4' AND variant IN ('btc_full', 'btc_full_purge0') and running scipy.stats.ttest_rel.

Result :

Statistic	Value
Paired observations	25 (5 cryptos × 5 folds)
Mean f1_buy(`btc_full`)	0.42518
Mean f1_buy(`btc_full_purge0`)	0.43245
Δ (purge0 − canonical)	+0.00727
Paired t-statistic	2.171
p-value (two-sided)	0.0401
Cohen's d (paired)	+0.434 (medium)
BH-corrected p (k=1)	0.0401

Verdict : FAIL — btc_full_purge0 outperforms btc_full with p=0.0401 < 0.05. Per plan §4.6 verbatim : "If the paired difference is significantly positive (BH-corrected p < 0.05) → leakage suspected → ABANDON Track 1 pending root-cause investigation."
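The check above can be sketched as follows. This is a minimal illustration, not the script actually run : `leakage_check` and `summary_from_t` are hypothetical helper names, and the per-cell f1_buy arrays are assumed to have been loaded from finetune_results already, matched cell-for-cell. Only `scipy.stats.ttest_rel` is named in the dossier.

```python
import math
import numpy as np
from scipy import stats

def leakage_check(f1_purge0, f1_canonical, alpha=0.05):
    """Paired t-test on f1_buy(purge0) - f1_buy(canonical) over matched cells."""
    a = np.asarray(f1_purge0, dtype=float)
    b = np.asarray(f1_canonical, dtype=float)
    if a.shape != b.shape:
        raise ValueError("cells must be matched one-to-one")
    diff = a - b
    t_stat, p_value = stats.ttest_rel(a, b)   # two-sided by default
    cohen_d = diff.mean() / diff.std(ddof=1)  # paired d = mean(diff) / sd(diff)
    # Single comparison (k=1): the BH-adjusted p equals the raw p.
    leak = bool(p_value < alpha and diff.mean() > 0)
    return {"delta": float(diff.mean()), "t": float(t_stat),
            "p": float(p_value), "d": float(cohen_d), "leak_suspected": leak}

def summary_from_t(t_stat, n):
    """Recover paired Cohen's d and the two-sided p from a reported t and n."""
    d = t_stat / math.sqrt(n)
    p = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)
    return d, p

# Consistency check on the reported figures (t=2.171, n=25 paired cells).
d, p = summary_from_t(2.171, 25)
print(f"d={d:.3f}, p={p:.4f}")
```

The second helper needs no raw data : it only confirms that the reported t, p and d are mutually consistent for 25 pairs, which is a cheap sanity check before trusting a verdict that sits this close to the threshold.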

5.2 Sortino-vs-f1_buy divergence (operator note)

Sortino tells the opposite story : canonical btc_full (Sortino 1.710) beats btc_full_purge0 (Sortino 1.390) by +0.32 (+23 %). The leakage signal lives at the per-prediction level (ML metric f1_buy) but does NOT translate into trade-level economic lift. Three candidate interpretations :

Real but small ADR-14 violation : purge_bars=0 lets one or more BTC features (likely btc_correlation_15m_lag5, btc_z_score_close, or btc_realized_vol_24h) include same-bar BTC info that overlaps with the altcoin's H4 label window. The model exploits this for marginally better per-bar f1, but the trades it picks remain dominated by altcoin-specific noise → no Sortino lift. Fix : the canonical purge=20 already plugs this ; investigation should confirm purge=20 is sufficient.
Production-exploitable signal mistakenly purged : BTC's bar-i close IS available at altcoin's bar-i decision time in production (BTC closes alongside the altcoin). The "leakage" detected by the check might actually be a real signal we're conservatively discarding via the 5h purge window. Fix : sensitivity sweep on purge_bars ∈ {0, 5, 10, 20, 40} to find the minimum purge that doesn't exploit altcoin-label-horizon overlap.
Statistical noise : at p=0.0401 the gate just fails ; with a slightly different fold split or HPO seed it could pass. d=+0.434 is medium effect, Δ=+0.00727 is tiny. Fix : deep re-sweep at n=170 per variant resolves this either way.

The investigation cannot proceed via this dossier — it requires (a) a per-feature ablation of the BTC feature set with purge=0 to localise which feature(s) drive the f1 lift, AND (b) the purge_bars sensitivity sweep. Filed as a follow-up GH issue.

5.3 ML metrics : for reference (winner row uninterpretable until §5.1 resolved)

Δf1_buy vs baseline :

Variant	Δf1_buy
btc_full_purge0	+0.008 (LEAKAGE FLAG — see §5.1)
btc_full	+0.007
btc_vol_only	+0.001
btc_min	-0.001
btc_full_purge10	-0.002

The mission's headline metric f1_buy ≥ +0.015 lift gate (per F1 plan §6 line 311) fails by 0.008 — btc_full lifts by 0.007 instead of 0.015. The lift is in the right direction with the right ML signal alignment (Brier improves marginally, AUC stays flat → the model is making marginally better confident calls). But until the leakage finding (§5.1) is investigated, the f1_buy lift attribution is uncertain — part of the lift may be the same per-prediction signal that the leakage check flagged on the purge=0 variant.

6. Signal funnel : Couche B (throughput)

Variant	raw_buy	CUSUM block	Conc. block	Conc. survival	Total survival	Train Samples
btc_min	1173	0.878	0.217	0.783	0.783	25729
btc_full	1223	0.877	0.222	0.778	0.778	25705
btc_full_purge0	1241	0.874	0.227	0.773	0.773	25683
btc_vol_only	1231	0.873	0.242	0.758	0.758	25679
none (baseline)	1538	0.878	0.251	0.749	0.749	25729
btc_full_purge10	1657	0.875	0.254	0.746	0.746	25705

Reading : four of the five BTC-feature variants generate fewer raw BUYs than baseline (1173-1241 vs 1538 = -19 to -24 %), i.e. the BTC features make the model more selective at the raw-signal stage. The exception is btc_full_purge10, which generates more (1657, about +8 %). CUSUM block-rate is essentially flat (0.873-0.878) → the selectivity is in the model, not the regime filter. Concurrency block-rate is also roughly flat (0.217-0.254) → no overfit-driven trade clustering.

Critical observation : btc_full produces 1223 raw BUYs vs baseline 1538 (-20 %), but higher Win Rate (64.5 % vs 57.5 %, +7pp) AND higher Sortino (1.710 vs 1.547, +0.16). This means the BTC features are filtering the right candidates — the rejected raw BUYs were the lower-quality ones. Exactly the desired mechanism : data-tier features improve the underlying signal-to-noise ratio, not just shift the threshold.
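The percentages quoted in this section follow directly from the funnel table. A short sketch of the arithmetic, with the counts copied from the table ; the helper names are illustrative only :

```python
# raw_buy counts and concurrency block-rates copied from the funnel table.
RAW_BUY = {
    "btc_min": 1173, "btc_full": 1223, "btc_full_purge0": 1241,
    "btc_vol_only": 1231, "none": 1538, "btc_full_purge10": 1657,
}
CONC_BLOCK = {
    "btc_min": 0.217, "btc_full": 0.222, "btc_full_purge0": 0.227,
    "btc_vol_only": 0.242, "none": 0.251, "btc_full_purge10": 0.254,
}

def pct_change_vs_baseline(counts, baseline="none"):
    """Relative change in raw BUY count versus the baseline variant, in %."""
    base = counts[baseline]
    return {k: 100.0 * (v - base) / base for k, v in counts.items() if k != baseline}

def survival(block_rate):
    """Share of signals that pass a filter with the given block-rate."""
    return 1.0 - block_rate

changes = pct_change_vs_baseline(RAW_BUY)
for variant, pct in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{variant:18s} raw_buy {pct:+.1f} %  conc. survival {survival(CONC_BLOCK[variant]):.3f}")
```

Sorting by relative change puts btc_full_purge10 last, which makes its opposite sign relative to the other BTC variants easy to spot.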

7. Per-crypto performance (PDF "Performance by Crypto" tables)

Variant	AAVEUSDC	ARBUSDC	LDOUSDC	OPUSDC	UNIUSDC
btc_full	0.039	3.176	2.012	2.320	1.296
btc_full_purge10	-0.465 ⚠	2.012	1.595	2.275	1.331
btc_full_purge0	-0.222	2.128	1.686	1.969	1.126
btc_min	-0.188	2.008	1.934	1.750	1.583
btc_vol_only	-0.247	2.642	1.925	1.910	1.637
none (baseline)	-0.216	2.490	1.661	2.241	1.559

(Sortino values. Best variant per crypto : btc_full on AAVEUSDC, ARBUSDC, LDOUSDC and OPUSDC ; btc_vol_only on UNIUSDC.)

AAVEUSDC pathology — Sortino-only finding, not corroborated by f1_buy :

Variant	AAVEUSDC Sortino	vs baseline (-0.216)	AAVEUSDC f1_buy	vs baseline (0.4540)
btc_full	+0.039	+0.255 (back to positive)	0.4477	-0.006 (slightly worse)
btc_full_purge0	-0.222	-0.006	0.4658	+0.012
btc_min	-0.188	+0.028	0.4632	+0.009
btc_vol_only	-0.247	-0.031	0.4535	-0.001
btc_full_purge10	-0.465	-0.249 ⚠	0.4671	+0.013

btc_full flips AAVEUSDC's Sortino from negative to positive, but the per-bar f1_buy is essentially unchanged (-0.006 vs baseline). This is the same Sortino-vs-f1_buy divergence pattern as in the leakage check (§5.2) : the BTC features change which trades the model picks, not how often it is right per-bar. On AAVEUSDC the trade-selection effect is large (Sortino -0.216 → +0.039) but the per-prediction signal is flat. F1 plan §6's per-asset gate is on f1_buy ; on Sortino alone this would be a clear win, but the gate spec is not on Sortino.

Per-asset gate per spec (F1 plan §6 line 314 : "f1_buy improves on ≥ 4/5 cryptos") :

The Sortino-based per-crypto table above is the PDF aggregation. The plan's per-asset gate is on f1_buy, not Sortino. Recomputing on the actual per-(crypto, fold) f1_buy from prod PG finetune_results :

Variant	AAVEUSDC	ARBUSDC	LDOUSDC	OPUSDC	UNIUSDC
none (baseline)	0.4540	0.4017	0.4327	0.4007	0.4344
btc_full	0.4477	0.3953	0.4501	0.4054	0.4274
btc_full_purge0	0.4658	0.4010	0.4576	0.4137	0.4241
btc_full_purge10	0.4671	0.3830	0.4333	0.4130	0.4200
btc_min	0.4632	0.3852	0.4386	0.4033	0.4287
btc_vol_only	0.4535	0.3853	0.4313	0.3936	0.4267

(per-crypto f1_buy mean over 5 folds, bold where variant > baseline)

Variant	Cryptos improving f1_buy vs none	Per-spec verdict
btc_full	2/5 (LDO +0.017, OP +0.005 ; AAVE -0.006, ARB -0.006, UNI -0.007 lose)	❌ FAIL (need ≥ 4/5)
btc_vol_only	0/5 (all cryptos slightly worse)	❌ FAIL
btc_min	3/5 (AAVE +0.009, LDO +0.006, OP +0.003 ; ARB -0.017, UNI -0.006 lose)	❌ FAIL

btc_full FAILS the per-asset f1_buy gate (2/5). The Sortino-based 4/5 reading shipped in earlier drafts of this dossier (and PR #800 narrative) was wrong-spec — ABANDON verdict still holds for the leakage gate (§5.1) ; this corrected per-asset reading just removes the qualifying claim that the per-asset gate cleared.
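The per-asset counts can be reproduced from the per-crypto f1_buy means in the table above. A minimal sketch ; the `F1_BUY` dict holds values copied from that table, and `per_asset_gate` is an illustrative helper, not the project's gate implementation :

```python
# Per-crypto f1_buy means (5 folds each), copied from the table above.
F1_BUY = {
    "none":         {"AAVEUSDC": 0.4540, "ARBUSDC": 0.4017, "LDOUSDC": 0.4327, "OPUSDC": 0.4007, "UNIUSDC": 0.4344},
    "btc_full":     {"AAVEUSDC": 0.4477, "ARBUSDC": 0.3953, "LDOUSDC": 0.4501, "OPUSDC": 0.4054, "UNIUSDC": 0.4274},
    "btc_min":      {"AAVEUSDC": 0.4632, "ARBUSDC": 0.3852, "LDOUSDC": 0.4386, "OPUSDC": 0.4033, "UNIUSDC": 0.4287},
    "btc_vol_only": {"AAVEUSDC": 0.4535, "ARBUSDC": 0.3853, "LDOUSDC": 0.4313, "OPUSDC": 0.3936, "UNIUSDC": 0.4267},
}

def per_asset_gate(table, variant, baseline="none", required=4):
    """Return the cryptos where the variant beats baseline, and the gate result."""
    improved = [c for c, v in table[variant].items() if v > table[baseline][c]]
    return improved, len(improved) >= required

for variant in ("btc_full", "btc_vol_only", "btc_min"):
    improved, passed = per_asset_gate(F1_BUY, variant)
    print(f"{variant:13s} {len(improved)}/5 improved {improved} -> {'PASS' if passed else 'FAIL'}")

# Cross-check: the mean of the five btc_full per-crypto means.
btc_full_mean = sum(F1_BUY["btc_full"].values()) / 5
print(f"btc_full mean f1_buy = {btc_full_mean:.5f}")
```

The final cross-check ties this table to §5.1 : with five folds per crypto, the mean of the per-crypto means is the same as the mean over the 25 cells.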

Sortino-vs-f1_buy divergence per crypto : btc_full improves Sortino on 4/5 cryptos but f1_buy on only 2/5. AAVE pathology is "fixed" on Sortino (-0.216 → +0.039) but unchanged on f1_buy (0.4540 → 0.4477). This is the same divergence pattern as the leakage check (§5.2) at a finer granularity — the model picks slightly worse trades per-prediction (lower per-bar f1 on 3/5 cryptos) but the trades it DOES pick are economically more profitable on most cryptos (higher Sortino on 4/5). Useful informally for the leakage investigation (#806) — both phenomena likely share a root cause in how BTC features modulate trade selection vs trade quality.

8. Stability by fold (PDF heatmap, Sortino)

Variant	Fold 3	Fold 4	Fold 5	Fold 6	Fold 7	per-fold variance
btc_full	2.77	0.59	1.22	1.40	2.34	0.677
btc_full_purge0	1.71	0.74	1.06	1.40	1.85	0.190
btc_full_purge10	1.76	0.41	1.43	1.55	1.76	0.330
btc_min	1.91	0.55	1.29	1.37	1.97	0.345
btc_vol_only	2.40	0.57	1.24	1.43	1.98	0.500
none	2.33	0.59	1.33	1.46	2.02	0.398

Fold 4 is uniformly weak across all variants (0.41-0.74) — same cross-Track artefact seen in Tracks 9 + 11. Confirms it's a market-regime issue in that period, not a Track-1-specific failure.

btc_full has the highest per-fold variance (0.677) — the lift is concentrated on Folds 3 + 7 where the BTC signal is strongest, with neutral-to-baseline behaviour on the middle folds. This is NOT a stability concern for the verdict — it's the natural fingerprint of a feature that has stronger predictive power in some market regimes than others. A future Story could investigate which BTC-regime characteristics correlate with Fold 3 + 7 to refine the variant selection (e.g., toggle BTC features on/off based on a regime classifier).

9. Gate evaluation per F1_BUY_BOOST_PLAN.md §6

Criterion	btc_full	btc_vol_only	btc_min
f1_buy lift ≥ +0.015 with CI95 excluding 0 (per F1 plan §6 line 311)	❌ Δ=+0.007	❌ Δ=+0.001	❌ Δ=-0.001
Joint metric : Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 %	⚠ d_exp=+0.61, d_sortino=+0.20, ΔmaxDD=+1.5pp	⚠ d_exp=+0.34, d_sortino=+0.10, ΔmaxDD=-0.1pp	⚠ d_exp=+0.32, d_sortino=-0.13, ΔmaxDD=+0.9pp
Stability : per-fold variance of f1_buy ≤ 0.05 (per F1 plan §6)	✅ var=0.00106 (computed from PG `finetune_results` per-fold f1_buy means)	✅ var=0.00122	✅ var=0.00066
Per-asset : f1_buy improves on ≥ 4/5 cryptos (per F1 plan §6 line 314)	❌ 2/5 on f1_buy (LDO +0.017, OP +0.005 ; AAVE/ARB/UNI lose), see §7	❌ 0/5	❌ 3/5
Sample size : ≥ 50 BUY trades / fold	✅ ~170 trades/fold	✅ ~175 trades/fold	✅ ~165 trades/fold
MLOps : `documentation/stories/CVN-N001-EE-S04/mlops_readiness.md` complete	✅ (filed in PR #792 + #801)	✅	✅
Mandatory leakage check (per Track 1 plan dossier `2026-04-30-track1-btc-features-plan.md` §4.6) : paired t-test on `f1_buy(btc_full_purge0) − f1_buy(btc_full)` BH-corrected p ≥ 0.05	❌ p=0.0401, d=+0.434 (`btc_full_purge0` outperforms — leakage suspected, see §5.1)	N/A	N/A
Lock rule (PDF) : ≥ 2 metrics with BH p<0.05 AND |d|≥0.3 in winner direction	❌ 0/4 (but 2/4 with d ≥ 0.3 ; under-powered)	❌ 0/4	❌ 0/4

Verdict per criterion : the mandatory leakage check fails (p=0.0401), the f1_buy lift gate fails (0.007 vs 0.015 needed), the per-asset f1_buy gate fails (2/5 vs 4/5 needed), and the lock rule fails (0/4 metrics) — ABANDON pending leakage root-cause investigation. The leakage failure is the controlling gate per Track 1 plan dossier §4.6 ; even though btc_full shows promising effect sizes on Sortino + Expectancy + Win Rate, those signals are on different metrics than the gates the plan specifies (which are on f1_buy). The Sortino-vs-f1_buy divergence (§5.2 + §7) suggests a coherent mechanism — the leakage investigation (#806) must localise the per-feature contribution before any LOCK candidacy.
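The way the gates combine for btc_full can be sketched as below. This is an illustration of the reading in this section only : the gate names, the `evaluate` helper and the use of `None` for the ⚠ cell are assumptions, and the outcome when the controlling gate passes is deliberately left unspecified because this dossier does not define it.

```python
# Gate outcomes for btc_full, transcribed from the table above.
# True = pass, False = fail, None = flagged (⚠), neither clean pass nor fail.
GATES_BTC_FULL = {
    "f1_buy_lift": False,
    "joint_metric": None,
    "stability": True,
    "per_asset": False,
    "sample_size": True,
    "mlops": True,
    "leakage_check": False,
    "lock_rule": False,
}

def evaluate(gates, controlling="leakage_check"):
    """List failed gates; a failed controlling gate forces ABANDON."""
    failed = [name for name, ok in gates.items() if ok is False]
    if gates[controlling] is False:
        verdict = "ABANDON pending leakage root-cause investigation"
    else:
        verdict = None  # not defined by this dossier
    return failed, verdict

failed, verdict = evaluate(GATES_BTC_FULL)
print("failed gates:", failed)
print("verdict:", verdict)
```

Keeping the controlling gate as an explicit parameter mirrors the point made above : the promising Sortino and Expectancy effect sizes never enter the decision once the leakage check has failed.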

10. Why BTC features lift Sortino on this data

The plan dossier §3 listed four candidate mechanisms by which Track 1 might lift f1_buy. The PDF data evaluates them as :

Hypothesis 1 (BTC as macro-direction proxy) — partially supported. The directional features (btc_return_1h/4h/24h) alone (btc_min variant) underperform the full set, suggesting raw BTC direction isn't the dominant signal. But removing them (btc_vol_only) also loses some lift, so they contribute marginally.
Hypothesis 2 (BTC volatility as risk regime indicator) — strongly supported. The btc_vol_only variant (3 volatility features : btc_realized_vol_24h, btc_z_score_close, btc_correlation_15m_lag5) posts Sortino 1.606 — second-best in the matrix, +0.06 over baseline. The full set adds modest incremental value (1.710 = +0.10 over btc_vol_only). The volatility features carry the bulk of the predictive signal.
Hypothesis 3 (BTC-altcoin lagged correlation as crowd-following indicator) — supported. The btc_correlation_15m_lag5 feature (included in both btc_vol_only and btc_full) appears to be doing real work — both variants outperform btc_min (which excludes it).
Hypothesis 4 (BTC features mitigate AAVEUSDC pathology) — strongly supported. btc_full is the ONLY variant that flips AAVEUSDC's Sortino from negative (-0.216 baseline) to positive (+0.039). The pathology was diagnosed in Tracks 9 + 11 dossiers as likely related to AAVEUSDC's sensitivity to broad-market regime shifts (DeFi sub-sector correlations) ; the BTC features now provide that broader context to the model.

Mechanism summary : the BTC features work as a regime contextualiser, not a directional signal. The model uses them to assess "is this BUY signal happening during a stable/risk-on or volatile/risk-off market regime?" — and that context lets it pick which raw BUYs to act on with higher precision (Win Rate +7pp) and discard the rest (raw_buy -20%). This is the textbook outcome of adding cross-asset features to a signal-discovery model.

Cross-Track lesson : the f1_buy plateau is NOT purely upstream of model architecture (Track 11's INCONCLUSIVE status notwithstanding) AND NOT addressable by training-signal manipulation (Tracks 5/6) or threshold calibration (Track 9) alone. The data tier delivers a real but under-powered lift — exactly what the F1 plan §6 hypothesis predicted. The lever is correct ; the sample budget is the bottleneck.

11. Decisions

11.1 Lock variant : NO LOCK today

factor_btc_features=none (XGB-blind baseline) remains the production model. No Console flip. ftf_config.base_env unchanged.

The promising results on btc_full do NOT justify a LOCK at the current sample budget : promoting a model that fails the formal lock rule would set a precedent that erodes the discipline the gates exist to enforce. The under-powered nature is a known pathology with a known remedy (deeper sweep) ; we apply that remedy before committing.

11.2 Verdict : ABANDON pending leakage root-cause investigation

Per plan §4.6 mandatory gate (f1_buy(purge0) > f1_buy(canonical) significantly → ABANDON), Track 1 closes ABANDON. The factor stays in MODEL_FACTORS per ADR-0079 invariant 6 — the leakage investigation may produce a corrected variant set that re-opens the LOCK candidacy.

✅ Implementation code stays in tree (src/commun/pipeline/btc_features.py + EnrichmentAPI._enrich_core wiring + AblationRunner._maybe_overlay_btc_features). The BTC features primitives are validated end-to-end (sweep ran 0 errors).
✅ FTF factor btc_features stays in MODEL_FACTORS per ADR-0079 invariant 6 — leakage investigation candidate.
❌ No champion_btc_features model registered. No promotion gate to schedule.
❌ No quarterly re-fit cadence today. Re-evaluation gated on the leakage investigation verdict.
🔍 Required follow-up : root-cause investigation of the f1_buy leakage on btc_full_purge0 (filed as a separate Story — see §11.5).

11.3 Cross-Track interaction notes

Track 11 (ensemble diversity, In progress, reopened 2026-05-02) : Track 1's ABANDON pending leakage investigation does NOT change Track 11's path — #802 (autotrainer dispatcher) still ships and Track 11 re-sweep proceeds with the OHLCV-only feature set (same as the original Track 11 sweep). If/when Track 1 returns from leakage investigation with a corrected variant set, the BTC features can be layered onto Track 11's re-swept variants (data × architecture composition).
Track 12 (frac diff + interactions, gated by Track 1) : gate NOT cleared by an ABANDON verdict. Track 12 stays gated until Track 1's leakage investigation produces either a passing variant set OR a definitive ABANDON-not-investigation-able. The Track 12 plan dossier can still be drafted in parallel (architecture-of-the-features design is independent of Track 1's outcome), but the sweep launch waits.
Track 9 (per-regime threshold, ABANDONED) : Track 1's leakage finding has no direct interaction. Filed for re-evaluation only post a successful Track 1 re-sweep (which may or may not happen).
Per-asset deepening — Sortino-vs-f1_buy divergence pattern : btc_full improves Sortino on 4/5 cryptos but f1_buy on only 2/5 (per §7 spec-corrected analysis). UNIUSDC loses on both metrics ; AAVEUSDC + ARBUSDC win on Sortino but lose on f1_buy. Surfaces in the leakage investigation (#806) as one input — the per-(crypto, feature) breakdown of the f1_buy leakage may help localise which feature(s) drive the trade-quality vs trade-selection split.

11.4 Hidden recommendation : `btc_vol_only` for capital-preservation use-cases (status pending)

btc_vol_only posts the best Total Return (30.3 %) in the matrix and second-best Sortino (1.606). It would be a candidate for a lighter-weight capital-preservation variant — only 3 features (vs 6 for btc_full), reducing model complexity. But the btc_vol_only feature set is a strict subset of btc_full's features (3 of the 6) and includes btc_correlation_15m_lag5 + btc_z_score_close + btc_realized_vol_24h — exactly the features most likely to be the leakage source (rolling-window features that include same-bar BTC info when purge=0). Until the leakage investigation localises which feature(s) drive the f1_buy lift on purge0, btc_vol_only cannot be recommended either — it likely shares the same leakage pathology.

11.5 Forward path : leakage investigation + sensitivity sweep

Filed as GH issue #806 (leakage investigation). Scope :

Per-feature leakage ablation : compute_btc_features supports 6 features. Run a sweep with each feature individually (using purge_bars=0) and measure the f1_buy lift over none. Identify which feature(s) explain the +0.0073 f1_buy lift of btc_full_purge0 over canonical btc_full.
purge_bars sensitivity sweep : with the leakage source(s) localised, run the canonical 6-feature variant at purge_bars ∈ {0, 5, 10, 20, 40} to find the minimum production-feasible purge that doesn't exploit altcoin-label-horizon overlap. The current 20-bar default may be over-conservative OR may need to be increased for the specific leakage source.
Determine if the leakage is production-exploitable : BTC's bar-i close IS available at altcoin's bar-i decision time in production (BTC closes alongside the altcoin). The "leakage" detected by the check might be a real signal we're conservatively discarding via the 5h purge window. Distinguishing between "ADR-14 violation that must be plugged" vs "production-available signal we mistakenly purged" requires a careful analysis of each BTC feature's temporal contract.
Outcome decision : either (a) confirm canonical purge=20 is the right value → re-sweep at deep mode → corrected verdict OR (b) discover purge_bars can be relaxed to e.g. 10 without leakage → adjust the canonical and re-sweep → potentially better lift.
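To make the sweep grid concrete, the purge values can be expressed as wall-clock windows. A small sketch, assuming 15-minute bars : that bar length is inferred from this dossier equating purge=20 with a 5h window, it is not stated explicitly, and `purge_hours` is an illustrative helper.

```python
# ASSUMPTION: 15-minute bars, inferred from "purge=20" being described as a 5h window.
BAR_MINUTES = 15

def purge_hours(purge_bars, bar_minutes=BAR_MINUTES):
    """Convert a purge window expressed in bars into hours."""
    return purge_bars * bar_minutes / 60.0

SWEEP_GRID = [0, 5, 10, 20, 40]  # purge_bars values proposed for the sensitivity sweep
for bars in SWEEP_GRID:
    print(f"purge_bars={bars:>2} -> {purge_hours(bars):.2f} h")
```

Reading the grid in hours helps when comparing each candidate purge with the H4 label horizon mentioned in §5.2.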

12. Sprint version + OP closure

12.1 OP wp#43 transition

Per ADR-69 + workflow §14 + ADR-0079 invariant 10 (auto-syncer) :

This PR updates F1 plan §10 row for Track 1 to **Closed ABANDONED** with link to this dossier.
The auto-syncer (scripts/op_story_sync.py, deployed via .github/workflows/op-story-sync.yml) will read §10 within the SLA (5 min post-merge) and transition wp#43 from In testing → Closed.
OP comment template :

Track 1 (btc_features) sweep ftf_20260501_230526_2483f9_ATR0.5_1.5_H4 completed
cleanly (150 results, 0 errors, full 6-variant coverage thanks to PR #801's wiring).

But the MANDATORY LEAKAGE CHECK per plan §4.6 FAILS: paired t-test on
f1_buy(btc_full_purge0) - f1_buy(btc_full) gives p=0.0401, d=+0.434 — the
leakage-detector variant outperforms canonical. Per plan §4.6 verbatim:
"leakage suspected → ABANDON Track 1 pending root-cause investigation".

Critical caveat: Sortino diverges from f1_buy (canonical 1.710 > purge0 1.390,
+0.32). Effect sizes on btc_full vs none are promising (Expectancy d=+0.61,
Win Rate d=+0.57) but cannot be cleanly attributed until the leakage source
is localised. Plus the f1_buy gate fails (Δ=+0.007 vs +0.015 needed).

Earlier KEEP_AVAILABLE draft retracted in CR pass 1 (CodeRabbit caught the
Sortino-vs-f1_buy spec mismatch — plan §4.6 mandates f1_buy paired t-test,
I had run Sortino comparison instead).

Verdict: ABANDON pending leakage root-cause investigation.
Forward path:
- Per-feature leakage ablation (which feature drives the +0.0073 lift on purge=0?)
- purge_bars sensitivity sweep [0, 5, 10, 20, 40] to find production-feasible minimum
- Decision: confirm purge=20 OR adjust + re-sweep + corrected verdict

Tracked: GH issue #806 (leakage investigation).

Implementation stays in tree; FTF factor remains in MODEL_FACTORS for the
investigation re-sweep.

12.2 Sprint version closure check

Per workflow §14 : if wp#43 is the last open Story in its sprint version, follow §16.4 — gate review + close version + retro. Operator to check OP UI and apply.

12.3 Memory entry

Two durable lessons worth recording :

Process lesson #1 : leakage check spec must be applied verbatim, not paraphrased. The original draft of this dossier compared Sortino (canonical vs purge=0) and called it PASS — but the plan §4.6 explicitly mandates a paired t-test on f1_buy with BH correction. The Sortino comparison was a different test on a different metric and would have shipped a wrong-spec verdict if CR pass 1 hadn't caught it. Filed as a memory : feedback_leakage_check_spec_verbatim.md (or similar — recurring pattern : the canonical PDF report doesn't include the per-spec paired test, so it MUST be computed from PG finetune_results directly when the spec calls for it).

Process lesson #2 : the Sortino-vs-f1_buy divergence is interesting in its own right. The leakage check fails on f1_buy while Sortino nominally favours canonical (+0.32, not significant), which means the "leakage" is a per-prediction-level signal that doesn't translate into trade-level economic lift. This is a known pattern in ML-for-trading where ML metrics and economic metrics can diverge (the two are NOT logically contradictory ; they measure different layers of the pipeline). The leakage gate is on the ML metric per Track 1 plan §4.6 (i.e., the gate is conservative : it triggers on a per-prediction signal even when trade-level returns are unaffected). The investigation will likely conclude either : (a) tiny ADR-14 violation that the canonical purge=20 already plugs (gate is over-strict) ; OR (b) production-exploitable signal we mistakenly purged (purge_bars over-conservative). Filed as a project memory once the investigation produces the answer.