Track 9 — Per-regime threshold results & gate decision

Story : CVN-N001-EE-S03 (wp#42), Track 9 of F1_buy boost mission, calibration tier
ADR : closes per ADR-0079 (FTF sweep → Story closure 8-step workflow ; this dossier was the first to follow the canonical workflow + the precedent dataset for the ADR's verdict decision tree)
Date : 2026-05-01
Authors : Dominique (operator) + Claude
Plan dossier : 2026-04-29-track9-per-regime-threshold-plan.md (committee c560b67a PASSED EXECUTION_RISK)
Implementation PR : #791 (squash 2a997d81, merged 2026-04-30)
Sweep run_id : ftf_20260430_194027_3d0171_ATR0.5_1.5_H4
Sweep status : completed ; 125 results ; 0 errors ; duration 3h44m ; started 2026-04-30 19:40:56 UTC ; git SHA 2a997d81 (= PR #791 squash)
Cryptos : UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC (defi_top5)
Strategy / PTE : ATR0.5_1.5_H4 (per project_pte_policy memory ; ATR1.5_3.0_H5 is DEPRECATED)
Folds : 5 (Folds 3-7 per the report) ; Trials : 50 (Optuna budget per fold) ; Cost : 15 bps
FTF report PDF : reports/ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf (committed to docs base ; source-of-truth for the per-pair / per-fold tables in §2-§9)

Verdict : ABANDON. No variant satisfies the lock rule (BH p<0.05 AND |Cohen's d| ≥ 0.3) on any metric : every pairwise comparison against the baseline scores 0/4 metric agreements. On the primary metric (Sortino), effect sizes are small at best (max |d| = 0.45, in baseline's favor) and statistically insignificant (min p-adj = 0.21). The hypothesis "per-regime calibration lifts f1_buy" is not supported by this dataset.

Lock decision : NO LOCK. factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. Console state unchanged.

1. Sweep state

Per the FTF report's executive summary (run completed 0 errors, 125 rows = 5 cryptos × 5 folds × 5 variants, no rejected variants) :

Factor	Variant	Useful rows	Coverage	Notes
per_regime_threshold	none (baseline)	25	5 cryptos × 5 folds	Global F1-optimal threshold (current production)
per_regime_threshold	per_regime_f1	25	5 × 5	One F1-optimal threshold per regime
per_regime_threshold	per_regime_expectancy	25	5 × 5	Expectancy-optimal per regime ; sentinel-1.0 fallback for unprofitable regimes
per_regime_threshold	per_regime_f1_with_floor	25	5 × 5	Caps deviation at global - 0.05
per_regime_threshold	coarse_3regime	25	5 × 5	6 regimes → trend / range / transition collapse
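To make the variant descriptions above concrete, here is a minimal sketch of how a per-regime F1-optimal threshold and its floored counterpart could be computed. It is not the code in src/commun/trading/per_regime_threshold_calibrator.py : the function names, the threshold grid, and the reading of the floor as `max(regime_threshold, global - 0.05)` are assumptions made for illustration.

```python
from collections import defaultdict


def f1_at(threshold, probs, labels):
    """F1 of the BUY class when predicting BUY for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_optimal_threshold(probs, labels, grid):
    """Grid value maximising F1; ties resolve to the first grid value."""
    return max(grid, key=lambda t: f1_at(t, probs, labels))


def per_regime_thresholds(probs, labels, regimes, grid, floor_margin=None):
    """Return (global_threshold, {regime: threshold}).

    floor_margin (hypothetical parameter) keeps each regime threshold at or
    above global_threshold - floor_margin, an assumed reading of the
    per_regime_f1_with_floor variant.
    """
    global_t = f1_optimal_threshold(probs, labels, grid)
    buckets = defaultdict(lambda: ([], []))
    for p, y, r in zip(probs, labels, regimes):
        buckets[r][0].append(p)
        buckets[r][1].append(y)
    thresholds = {}
    for regime, (ps, ys) in buckets.items():
        t = f1_optimal_threshold(ps, ys, grid)
        if floor_margin is not None:
            t = max(t, global_t - floor_margin)
        thresholds[regime] = t
    return global_t, thresholds
```

Calling `per_regime_thresholds(probs, labels, regimes, grid)` gives the unfloored variant ; passing `floor_margin=0.05` applies the assumed floor. The expectancy and coarse variants would swap the objective function or the regime labels respectively and are not sketched here.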

Sweep status : succeeded (no error column populated, no fold dropped). Clean run — every variant produced metrics on every (crypto, fold) cell.

2. Performance summary — all variants (PDF "Couche C" table)

Variant	Sortino	Std	Trades	Win Rate	Max DD	Return %
none (baseline)	1.767	1.841	1073	55.3 %	-10.7 %	39.1 %
coarse_3regime	1.723	1.566	1026	57.1 %	-9.4 %	38.0 %
per_regime_f1	1.626	1.790	1043	57.7 %	-10.2 %	34.1 %
per_regime_expectancy	1.598	1.592	948	60.7 %	-8.9 %	35.4 %
per_regime_f1_with_floor	1.320	1.617	1037	56.7 %	-10.7 %	28.2 %

Pattern : per-regime variants trade absolute return for selectivity.

Win Rate ↑ on every variant (55.3 % baseline → 56.7-60.7 % ; per_regime_expectancy, the most selective, is highest)
Total Return ↓ on every variant (39.1 % baseline → 28.2 % worst)
Max DD ~ flat (equal to baseline or up to 1.8pp shallower)
Trade count ↓ moderately (948-1043 vs 1073 baseline, i.e. -3 % to -12 %)

The hypothesised mechanism (per-regime threshold optimises BUY selection per-regime → higher signal quality) does NOT translate into Sortino lift on this 5-fold × 5-crypto sample budget.

3. Pairwise BH-corrected comparisons — primary metric (Sortino)

PDF's "Pairwise Comparisons" table, paired t-test on matched (crypto, fold) cells :

vs	Mean A (none)	Mean B (variant)	p-adj (BH)	Cohen's d	Sig.
coarse_3regime	1.865	1.800	0.8729	0.11	NO
per_regime_expectancy	1.767	1.642	0.8729	0.10	NO
per_regime_f1	1.767	1.705	0.8729	0.06	NO
per_regime_f1_with_floor	1.767	1.369	0.2110	0.45	NO

Note : Mean A for coarse_3regime (1.865) differs from the variant-summary value (1.767) in §2 because the pairwise table averages over the matched (crypto, fold) cells where both A and B have valid data, i.e. the paired-test sample rather than the simple variant mean. The Mean B values likewise differ from the §2 summary (e.g. 1.800 vs 1.723 for coarse_3regime). Both are correct and report the same direction.

Only per_regime_f1_with_floor has a non-trivial effect size (d=0.45, "small" per Cohen's conventions), but the comparison is not significant (p-adj=0.21) AND points TOWARDS the baseline (none > variant).
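For readers who want to reproduce this kind of table, the sketch below shows the Benjamini-Hochberg adjustment and a paired effect size in plain Python. It is an illustration only : the report's exact implementation is not shown in this dossier, and computing Cohen's d on the paired differences (mean difference divided by the standard deviation of the differences) is an assumption ; the PDF may use a pooled-SD definition instead.

```python
from statistics import mean, stdev


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def paired_cohens_d(a, b):
    """Effect size on matched cells (assumed definition: mean(diff) / sd(diff))."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)
```

The monotonicity step is one way several comparisons can end up sharing the same adjusted value, as the three 0.8729 entries in the table do.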

4. Multi-metric Significance Matrix (PDF section)

PDF's per-metric verdict — ✓ = BH p<0.05 AND |d|≥0.3 in winner direction ; ~ = significant but small effect ; ✗ = not significant.

Pair	Sortino	Expectancy	Total Return	Win Rate
none vs coarse_3regime	✗ p=0.873, d=+0.11	✗ p=0.793, d=+0.13	✗ p=0.814, d=+0.11	✗ p=0.971, d=-0.03
none vs per_regime_expectancy	✗ p=0.873, d=+0.10	✗ p=0.793, d=-0.13	✗ p=0.814, d=+0.09	✗ p=0.548, d=-0.30
none vs per_regime_f1	✗ p=0.873, d=+0.06	✗ p=0.896, d=+0.03	✗ p=0.814, d=+0.14	✗ p=0.971, d=-0.13
none vs per_regime_f1_with_floor	✗ p=0.211, d=+0.45	✗ p=0.793, d=+0.16	✗ p=0.227, d=+0.48	✗ p=0.977, d=+0.01

Lock rule (PDF caption verbatim) : "a factor is LOCKED only when at least 2 metrics show BH-adjusted p < 0.05 AND |Cohen's d| ≥ 0.3 in the winner direction".

Result : 0/4 metrics agree for the winner (none) on every pair. No variant clears the lock rule. The largest effect sizes (d=+0.45 / +0.48 on Sortino + Total Return for per_regime_f1_with_floor) point TOWARDS the baseline (the variant loses by that margin) ; the only effect at the |d| ≥ 0.3 threshold in a variant's favor (Win Rate for per_regime_expectancy, d=-0.30) is far from significant (p=0.548).
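The lock rule can be checked mechanically against the matrix. The sketch below transcribes the (p, d) cells from the table and counts agreeing metrics ; the function names are hypothetical, and the sign convention (positive d favors the baseline none) is inferred from the text above rather than stated in the PDF caption.

```python
# (BH-adjusted p, Cohen's d) per metric, transcribed from the matrix above.
# Assumed sign convention: d > 0 means the baseline `none` is ahead.
MATRIX = {
    "coarse_3regime": {
        "sortino": (0.873, 0.11), "expectancy": (0.793, 0.13),
        "total_return": (0.814, 0.11), "win_rate": (0.971, -0.03),
    },
    "per_regime_expectancy": {
        "sortino": (0.873, 0.10), "expectancy": (0.793, -0.13),
        "total_return": (0.814, 0.09), "win_rate": (0.548, -0.30),
    },
    "per_regime_f1": {
        "sortino": (0.873, 0.06), "expectancy": (0.896, 0.03),
        "total_return": (0.814, 0.14), "win_rate": (0.971, -0.13),
    },
    "per_regime_f1_with_floor": {
        "sortino": (0.211, 0.45), "expectancy": (0.793, 0.16),
        "total_return": (0.227, 0.48), "win_rate": (0.977, 0.01),
    },
}


def agreeing_metrics(cells, alpha=0.05, min_d=0.3, winner_sign=1):
    """Count metrics with p < alpha AND |d| >= min_d in the winner direction."""
    return sum(
        1 for p, d in cells.values()
        if p < alpha and d * winner_sign >= min_d
    )


def is_locked(cells, required=2, **kwargs):
    """Lock rule: at least `required` agreeing metrics."""
    return agreeing_metrics(cells, **kwargs) >= required


for variant, cells in MATRIX.items():
    n = agreeing_metrics(cells)
    print(f"none vs {variant}: {n}/4 agreeing, locked={is_locked(cells)}")
```

Passing `winner_sign=-1` runs the same check in the variant's favor.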

5. ML metrics — Couche A (signal model)

PDF's "ML Metrics" table — model-level discrimination metrics, factor-independent of the threshold optimisation :

Variant	f1_buy	precision	recall	AUC	f1_macro	Brier	Δ f1
none	0.432	0.438	0.446	0.734	0.651	0.1256	0.318
per_regime_f1	0.428	0.425	0.455	0.727	0.647	0.1263	0.314
coarse_3regime	0.422	0.422	0.446	0.732	0.646	0.1245	0.313
per_regime_expectancy	0.417	0.433	0.452	0.726	0.641	0.1265	0.308
per_regime_f1_with_floor	0.416	0.436	0.424	0.726	0.643	0.1264	0.310

⚠ ECE = 0.0000 across all variants : the same anomaly seen on Track 5 (label_smoothing) per its results dossier (an exact zero is not plausible for a real classifier). The bug is documented in CVN-N011-EA-S09 (#770) and pre-dates Track 9. It does not affect the verdict (the f1_buy + Sortino numbers are decisive).

Critical reading : f1_buy itself regresses on every per-regime variant.

Variant	Δf1_buy vs baseline
per_regime_f1	-0.004
coarse_3regime	-0.010
per_regime_expectancy	-0.015
per_regime_f1_with_floor	-0.016

The mission's headline gate (f1_buy lift ≥ +0.020) fails on every variant : the best delta is -0.004 (negligible in size, but in the wrong direction).
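The deltas above follow directly from the f1_buy column of the ML metrics table. A small sketch of the arithmetic, with a hypothetical helper name, is given below ; it checks only the point-estimate part of the gate, not the CI95 condition, which needs the per-cell data from the PDF.

```python
# f1_buy per variant, transcribed from the ML metrics table above.
F1_BUY = {
    "none": 0.432,
    "per_regime_f1": 0.428,
    "coarse_3regime": 0.422,
    "per_regime_expectancy": 0.417,
    "per_regime_f1_with_floor": 0.416,
}
MIN_LIFT = 0.020  # headline gate: f1_buy lift >= +0.020


def f1_buy_deltas(f1_by_variant, baseline="none"):
    """Delta of each variant's f1_buy against the baseline, rounded to 3 dp."""
    base = f1_by_variant[baseline]
    return {
        variant: round(f1 - base, 3)
        for variant, f1 in f1_by_variant.items()
        if variant != baseline
    }


deltas = f1_buy_deltas(F1_BUY)
for variant, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    status = "pass" if delta >= MIN_LIFT else "fail"
    print(f"{variant:26s} {delta:+.3f} {status}")
```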

6. Signal funnel — Couche B (débit)

Variant	raw_buy	CUSUM block	Conc. block	Conc. survival	Total survival
none	1661	0.867	0.277	0.723	0.723
coarse_3regime	1452	0.864	0.249	0.751	0.751
per_regime_f1	1535	0.867	0.240	0.760	0.760
per_regime_f1_with_floor	1491	0.867	0.230	0.770	0.770
per_regime_expectancy	1324	0.867	0.207	0.793	0.793

Reading : per-regime variants generate fewer raw BUYs (1324-1535 vs 1661 baseline = -8 to -20 %) — they are pickier upstream of CUSUM. CUSUM block-rate stays flat (0.864-0.867). Concurrency block-rate drops with selectivity (0.277 → 0.207) → total signal survival rises (0.723 → 0.793).

But the signal-survival rise does NOT translate into f1_buy or Sortino lift : the per-regime variants reject borderline opportunities that, in aggregate, were positive-EV. This is the key mechanism behind the "fewer trades, higher win rate, lower total return" pattern in §2.
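The funnel figures quoted in the reading can be re-derived from the table. The sketch below uses a hypothetical helper and assumes, as the table's columns suggest, that concurrency survival is simply one minus the concurrency block rate.

```python
# (raw_buy, concurrency block rate) per variant, from the funnel table above.
FUNNEL = {
    "none": (1661, 0.277),
    "coarse_3regime": (1452, 0.249),
    "per_regime_f1": (1535, 0.240),
    "per_regime_f1_with_floor": (1491, 0.230),
    "per_regime_expectancy": (1324, 0.207),
}


def funnel_summary(funnel, baseline="none"):
    """Relative raw_buy change vs baseline (%) and concurrency survival."""
    base_raw, _ = funnel[baseline]
    summary = {}
    for variant, (raw_buy, conc_block) in funnel.items():
        summary[variant] = {
            "raw_buy_change_pct": round((raw_buy / base_raw - 1) * 100, 1),
            "conc_survival": round(1 - conc_block, 3),
        }
    return summary


summary = funnel_summary(FUNNEL)
for variant, row in summary.items():
    print(variant, row)
```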

7. Per-crypto performance (PDF section + page 9)

Variant	AAVEUSDC	ARBUSDC	LDOUSDC	OPUSDC	UNIUSDC
none	0.058	2.730	2.697	2.019	1.567
coarse_3regime	0.178	2.307	1.977	2.656	1.685
per_regime_f1	-0.204 ⚠	2.688	2.479	1.970	1.197
per_regime_expectancy	0.053	2.618	2.253	1.862	1.204
per_regime_f1_with_floor	-0.279 ⚠	2.564	1.844	1.494	0.980

(Sortino values. Best variant per crypto : coarse_3regime on AAVEUSDC, OPUSDC and UNIUSDC ; none on ARBUSDC and LDOUSDC.)

⚠ AAVEUSDC actively harmed : both per_regime_f1 and per_regime_f1_with_floor push AAVEUSDC's Sortino into negative territory (-0.204 / -0.279 vs +0.058 baseline). AAVEUSDC has the lowest baseline trade count (164 trades vs 217-340 elsewhere), suggesting per-regime calibration degenerates on small per-fold-per-regime samples — the very failure mode the §4.1 fallback guardrail (< 30 total samples / < 5 positive samples) was designed to catch.
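The guardrail thresholds quoted above (< 30 total samples / < 5 positive samples) suggest a check of the following shape. This is a hypothetical sketch, not the plan's §4.1 code : the function name, its parameters, and the choice of falling back to the global threshold are all assumptions.

```python
def resolve_threshold(regime_threshold, global_threshold,
                      n_samples, n_positive,
                      min_samples=30, min_positive=5):
    """Use the per-regime threshold only when the regime has enough data.

    Assumed behaviour: regimes with fewer than `min_samples` total samples
    or fewer than `min_positive` positive samples fall back to the global
    threshold.
    """
    if n_samples < min_samples or n_positive < min_positive:
        return global_threshold
    return regime_threshold
```

A regime that narrowly clears both minimums still gets its own threshold, so a guardrail of this shape bounds the worst cases without removing small-sample noise.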

Per-asset gate (≥ 4/5 cryptos improve vs baseline ; the F1 plan defines it on f1_buy, the counts below use the per-crypto Sortino values above) :

Variant	Cryptos improving Sortino vs none	Verdict
coarse_3regime	3/5 (AAVE +0.12, OP +0.64, UNI +0.12)	❌
per_regime_f1	0/5 (ARB -0.04, OP -0.05, LDO -0.22, AAVE -0.26, UNI -0.37)	❌
per_regime_expectancy	0/5 (-0.005 to -0.44 across all)	❌
per_regime_f1_with_floor	0/5 (-0.17 to -0.85 across all ; AAVE goes negative)	❌

coarse_3regime is the closest to passing (3/5) but still fails the ≥4/5 gate.
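The per-asset count can be recomputed directly from the per-crypto Sortino table in this section. The sketch below does so with hypothetical helper names ; it counts on Sortino because per-crypto f1_buy values are not reproduced in this dossier.

```python
# Per-crypto Sortino, transcribed from the per-crypto table in this section.
SORTINO = {
    "none": {"AAVEUSDC": 0.058, "ARBUSDC": 2.730, "LDOUSDC": 2.697,
             "OPUSDC": 2.019, "UNIUSDC": 1.567},
    "coarse_3regime": {"AAVEUSDC": 0.178, "ARBUSDC": 2.307, "LDOUSDC": 1.977,
                       "OPUSDC": 2.656, "UNIUSDC": 1.685},
    "per_regime_f1": {"AAVEUSDC": -0.204, "ARBUSDC": 2.688, "LDOUSDC": 2.479,
                      "OPUSDC": 1.970, "UNIUSDC": 1.197},
    "per_regime_expectancy": {"AAVEUSDC": 0.053, "ARBUSDC": 2.618,
                              "LDOUSDC": 2.253, "OPUSDC": 1.862,
                              "UNIUSDC": 1.204},
    "per_regime_f1_with_floor": {"AAVEUSDC": -0.279, "ARBUSDC": 2.564,
                                 "LDOUSDC": 1.844, "OPUSDC": 1.494,
                                 "UNIUSDC": 0.980},
}
MIN_IMPROVING = 4  # gate: at least 4 of 5 cryptos must improve


def improved_cryptos(variant, table=SORTINO, baseline="none"):
    """Cryptos where the variant's Sortino is strictly above the baseline's."""
    base = table[baseline]
    return [c for c, value in table[variant].items() if value > base[c]]


def passes_per_asset_gate(variant):
    return len(improved_cryptos(variant)) >= MIN_IMPROVING


for variant in SORTINO:
    if variant != "none":
        better = improved_cryptos(variant)
        print(f"{variant}: {len(better)}/5 {better} "
              f"pass={passes_per_asset_gate(variant)}")
```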

8. Stability by fold (PDF heatmap, Sortino)

Variant	Fold 3	Fold 4	Fold 5	Fold 6	Fold 7	per-fold variance
none	2.29	0.32	1.97	1.55	2.52	0.609
coarse_3regime	1.84	0.40	1.75	2.13	2.24	0.466
per_regime_f1	2.38	0.56	1.09	2.01	2.09	0.493
per_regime_expectancy	2.23	0.76	0.88	1.62	2.49	0.539
per_regime_f1_with_floor	2.11	0.35	0.89	1.33	1.93	0.443

Fold 4 is uniformly weak across all variants (0.32-0.76), likely a tough market regime in that period (cross-Track artefact, not a Track 9 specific issue). All variants show high per-fold Sortino variance (0.44-0.61). The F1 plan §6 stability gate is defined on f1_buy (per-fold variance ≤ 0.05) ; the Sortino variance shown here is far above that level and is dominated by data-period effects, not by the per-regime calibration choice.

9. Gate evaluation per F1_BUY_BOOST_PLAN.md §6

Criterion	per_regime_f1	per_regime_expectancy	per_regime_f1_with_floor	coarse_3regime
f1_buy lift ≥ +0.020 with CI95 excluding 0	❌ Δ=-0.004	❌ Δ=-0.015	❌ Δ=-0.016	❌ Δ=-0.010
Joint metric : Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 %	❌ d_sortino=+0.06 (towards none)	❌ d_expectancy=-0.13	❌ d_sortino=+0.45 (towards none)	❌ d_sortino=+0.11 (towards none)
Stability : per-fold variance of f1_buy ≤ 0.05	⚠ all-variants high (Sortino var ~0.5)	⚠ same	⚠ same	⚠ same
Per-asset : f1_buy improves on ≥ 4/5 cryptos (counted on Sortino in §7)	❌ 0/5	❌ 0/5	❌ 0/5	❌ 3/5
Sample size : ≥ 50 BUY trades / fold	✅ all variants ≥ 100 trades/fold	✅	✅	✅
MLOps : `documentation/stories/CVN-N001-EE-S03/mlops_readiness.md` complete	✅ (filed in PR #791)	✅	✅	✅
Lock rule (PDF) : ≥ 2 metrics with BH p<0.05 AND |d|≥0.3 in winner direction	❌ 0/4	❌ 0/4	❌ 0/4	❌ 0/4

Verdict per criterion : every primary criterion fails for every variant. coarse_3regime is the closest to baseline (3/5 cryptos improve) but still fails the ≥4/5 gate AND the f1_buy lift gate AND the lock rule.

10. Why per-regime threshold didn't lift Sortino on this data

§5 of the plan dossier listed four candidate hypotheses. The PDF data supports two of them (H1, H2), refutes one (H3) and partially supports one (H4) :

Hypothesis 1 (regime detector noise) : supported. Baseline none wins on most cryptos despite the per-regime variants having more selectivity (lower raw_buy, higher Win Rate). If regime labels were sharp + correct, per-regime should at least match baseline ; instead, coarse_3regime (which collapses 6 noisy regimes into 3 bigger buckets) is the best per-regime variant. Mechanism : RegimeDetector.heuristic_v1 (ATR + slope rules) produces noisy labels at the per-fold sample budget ; per-regime calibration averages over noise + small-sample sentinels.
Hypothesis 2 (selectivity loses absolute return) : strongly supported. Win Rate rises on every variant (55.3 % → up to 60.7 %) while Total Return falls on every variant (39.1 % → as low as 28.2 %). Per-regime variants reject borderline opportunities that, in aggregate, were positive-EV (the rejected BUYs had average win rates between baseline and the per-regime threshold, and their cumulative contribution to Sortino was positive). The Concurrency-block-rate drop confirms : per-regime variants are NOT generating MORE high-quality BUYs ; they are just filtering out more LOW-quality BUYs, but the filter cuts too deep.
Hypothesis 3 (floor as regulariser) : refuted. per_regime_f1_with_floor is the WORST variant (Sortino 1.320, Total Return 28.2 %). The 0.05 floor on threshold deviation from global is too tight ; it forces the per-regime thresholds back near global on most regimes, and where it doesn't, the deviation is in a worse direction.
Hypothesis 4 (coarse 3-regime sweet spot) : partially supported. coarse_3regime is the closest to baseline (Sortino 1.723 vs 1.767, only -0.044 ; Total Return 38.0 % vs 39.1 %, -1.1pp) and the per-asset gate is closest to passing (3/5). Halving the parameter count helped, but didn't recover the loss → the regime-detector signal-to-noise ratio is the upstream bottleneck, not the parameter count.

Cross-Track lesson : Track 9 (calibration tier) joins Tracks 5 (label smoothing) + 6 (focal loss) (label/loss tier) as ABANDONED : none of these calibration / training-signal manipulations yields positive lift on this dataset, so the F1 plan §6 cross-track lesson stands. The F1 lift will need to come from the data tier (Track 1, In progress, split-PR per ADR-0079 invariant 9) OR the architecture tier (Track 11, In progress, split-PR per ADR-0079 invariant 9). Both Stories stay In progress until their block A follow-up PRs ship ; see F1 plan §10 tracking table for canonical row-level status.

11. Decisions

11.1 Lock variant — NO LOCK

factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. No Console flip. ftf_config.base_env unchanged.

11.2 Verdict — ABANDON

The per-regime hypothesis is not supported with the current dataset and regime detector. Track 9 closes following the same pattern as Tracks 5/6 :

✅ Implementation code stays in tree (src/commun/trading/per_regime_threshold_calibrator.py + tests). Useful for ad-hoc experiments and as the reference implementation if a future regime detector ships materially better labels.
✅ FTF factor per_regime_threshold stays in MODEL_FACTORS (mirrors Track 5/6 pattern). Future operator-triggered sweeps can re-evaluate if (a) the regime detector is upgraded to a denoising or learned classifier, OR (b) the dataset grows enough for per-fold-per-regime samples to reach the size at which the lock rule's effect-size floor becomes detectable.
❌ No champion_per_regime_* model registered. No promotion gate to schedule.
❌ No quarterly re-fit cadence. Re-evaluation is operator-triggered, not scheduled.

11.3 Cross-Track interaction notes

Track 1 (BTC features, In testing) : if Track 1 LOCKs and adds the BTC enrichment features, the model's per-regime probability calibration will shift. The per-regime threshold could be re-evaluated post-Track-1-LOCK on the BTC-enriched feature set ; today's ABANDON does not preclude this.
Track 11 (ensemble diversity, PR #793 review) : same logic. If Track 11 LOCKs a stacked variant, the ensemble's P(BUY) distribution will differ from the single-XGB baseline ; the per-regime threshold optimum can shift. Re-evaluate post-Track-11-LOCK if the operator deems it worth the compute.
Regime detector upgrade : if a future Story upgrades RegimeDetector (e.g., from heuristic_v1 to a learned classifier), the per-regime threshold becomes a candidate for re-evaluation as the noisy-label hypothesis (§10 H1) would no longer apply.

11.4 Hidden recommendation : `per_regime_expectancy` for capital-preservation use-cases

Not a Track 9 LOCK candidate, but worth recording : per_regime_expectancy posts the best Win Rate (60.7 %) AND best Max DD (-8.9 %) of the matrix. It would be a candidate for a capital-preservation objective (lower DD, higher win rate, willingly trading total return for stability). This isn't this Story's mandate (the F1 plan optimises for Sortino + f1_buy), but the operator could file a follow-up Story under a different mission (e.g., the "filter tuning" / capital preservation tier) that targets per_regime_expectancy as a candidate variant. Out of scope for wp#42 closure.

12. Sprint version + OP closure

12.1 OP wp#42 transition

Per ADR-69 + workflow §14 :

Story status In testing → Closed
OP comment to add :

Track 9 (per_regime_threshold) sweep ftf_20260430_194027_3d0171_ATR0.5_1.5_H4 completed
with verdict ABANDON. No Console flip; baseline `none` retained.

Results dossier: documentation/missions/ml-boost/2026-05-01-track9-per-regime-threshold-results.md
Implementation PR: #791 (squash 2a997d81, merged 2026-04-30)

Lock rule (PDF executive summary): 0/4 metrics agree on any variant vs baseline.
Largest pairwise effect: per_regime_f1_with_floor d=+0.45 in baseline's favor (p-adj=0.21, NS).
AAVEUSDC actively harmed by per_regime_f1 + per_regime_f1_with_floor (negative Sortino).

Implementation stays in tree per Track 5/6 precedent; FTF factor remains in MODEL_FACTORS
for future re-evaluation if the regime detector is upgraded OR the dataset grows.

Verdict: ABANDON; gate decision: keep_available implementation, NO LOCK.

12.2 Sprint version closure check

Per workflow §14 : if wp#42 is the last open Story in its sprint version, follow §16.4 — gate review + close version + retro. Operator to check OP UI and apply.

12.3 Memory entry

No new durable lesson from this run : the cross-Track lesson "training-signal manipulation tier doesn't lift on this data" was already captured by Tracks 5 + 6. No feedback_*.md memory write needed.

A memory entry MIGHT be warranted for the project layer : "calibration tier confirmed ABANDONED ; F1 lift must come from data (Track 1) or architecture (Track 11) tiers". This is project-state, not a behavioural lesson — already implicit in the F1 plan §6 outcomes table once it's updated post-run.

Sign-off checklist (gate before OP wp#42 closure)

§1-§9 populated with actual sweep numbers from PDF report ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf
§10 hypothesis pick — H1 (regime detector noise) + H2 (selectivity loses absolute return) supported by data
§11.1-§11.2 verdict recorded : ABANDON, no Console flip
§11.3 cross-Track interaction noted (Track 1 + Track 11 future re-evaluation hooks)
§11.4 hidden recommendation captured (per_regime_expectancy for capital-preservation follow-up)
OP wp#42 status In testing → Closed with comment from §12.1 — operator action (needs OPENPROJECT_API_KEY)
Sprint version closure gate evaluated per workflow §14 — operator action
F1 plan §6 outcomes table updated with Track 9 ABANDON entry — operator action OR follow-up docs PR