Track 9 — Per-regime threshold results & gate decision
Story : CVN-N001-EE-S03 (wp#42) — Track 9 of F1_buy boost mission, calibration tier
ADR : closes per ADR-0079 (FTF sweep → Story closure 8-step workflow ; this dossier was the first to follow the canonical workflow + the precedent dataset for the ADR's verdict decision tree)
Date : 2026-05-01
Authors : Dominique (operator) + Claude
Plan dossier : 2026-04-29-track9-per-regime-threshold-plan.md — committee c560b67a PASSED EXECUTION_RISK
Implementation PR : #791 (squash 2a997d81, merged 2026-04-30)
Sweep run_id : ftf_20260430_194027_3d0171_ATR0.5_1.5_H4
Sweep status : completed ; 125 results ; 0 errors ; duration 3h44m ; started 2026-04-30 19:40:56 UTC ; git SHA 2a997d81 (= PR #791 squash)
Cryptos : UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC (defi_top5)
Strategy / PTE : ATR0.5_1.5_H4 (per project_pte_policy memory — ATR1.5_3.0_H5 is DEPRECATED)
Folds : 5 (Folds 3-7 per the report) ; Trials : 50 (Optuna budget per fold) ; Cost : 15 bps
FTF report PDF : reports/ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf (committed to docs base — source-of-truth for the per-pair / per-fold tables in §2-§9)
Verdict : ABANDON. Every pairwise comparison against the baseline scores 0/4 metric agreements under the lock rule (BH p<0.05 AND |Cohen's d| ≥ 0.3). Effect sizes are small (max |d| = 0.45 on Sortino and 0.48 on Total Return, both in baseline's favor) and statistically insignificant (min p-adj = 0.21). The hypothesis "per-regime calibration lifts f1_buy" is not supported by this dataset.
Lock decision : NO LOCK. factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. Console state unchanged.
1. Sweep state
Per the FTF report's executive summary (run completed 0 errors, 125 rows = 5 cryptos × 5 folds × 5 variants, no rejected variants) :
| Factor | Variant | Useful rows | Coverage | Notes |
|---|---|---|---|---|
| per_regime_threshold | none (baseline) | 25 | 5 cryptos × 5 folds | Global F1-optimal threshold (current production) |
| per_regime_threshold | per_regime_f1 | 25 | 5 × 5 | One F1-optimal threshold per regime |
| per_regime_threshold | per_regime_expectancy | 25 | 5 × 5 | Expectancy-optimal per regime ; sentinel-1.0 fallback for unprofitable regimes |
| per_regime_threshold | per_regime_f1_with_floor | 25 | 5 × 5 | Caps deviation at global - 0.05 |
| per_regime_threshold | coarse_3regime | 25 | 5 × 5 | 6 regimes → trend / range / transition collapse |
Sweep status : succeeded (no error column populated, no fold dropped). Clean run — every variant produced metrics on every (crypto, fold) cell.
2. Performance summary — all variants (PDF "Couche C" table)
| Variant | Sortino | Std | Trades | Win Rate | Max DD | Return % |
|---|---|---|---|---|---|---|
| none (baseline) | 1.767 | 1.841 | 1073 | 55.3 % | -10.7 % | 39.1 % |
| coarse_3regime | 1.723 | 1.566 | 1026 | 57.1 % | -9.4 % | 38.0 % |
| per_regime_f1 | 1.626 | 1.790 | 1043 | 57.7 % | -10.2 % | 34.1 % |
| per_regime_expectancy | 1.598 | 1.592 | 948 | 60.7 % | -8.9 % | 35.4 % |
| per_regime_f1_with_floor | 1.320 | 1.617 | 1037 | 56.7 % | -10.7 % | 28.2 % |
Pattern : per-regime variants trade absolute return for selectivity.
- Win Rate ↑ on every variant (55.3 % baseline → 60.7 % for per_regime_expectancy, the most selective)
- Total Return ↓ on every variant (39.1 % baseline → 28.2 % worst)
- Max DD ~ flat (within 1.8pp of baseline, never worse)
- Trade count down 3-12 % (948-1043 vs 1073 baseline)
The hypothesised mechanism (per-regime threshold optimises BUY selection per-regime → higher signal quality) does NOT translate into Sortino lift on this 5-fold × 5-crypto sample budget.
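The pattern can be re-derived from the summary table with a few lines of arithmetic. The sketch below is only a convenience check: the dictionary layout and the `deltas_vs_baseline` helper are hypothetical names, and the numbers are copied from the "Couche C" table above.

```python
# Values copied from the "Couche C" summary table (not recomputed from raw trades).
SUMMARY = {
    "none":                     {"trades": 1073, "win_rate": 55.3, "max_dd": -10.7, "ret": 39.1},
    "coarse_3regime":           {"trades": 1026, "win_rate": 57.1, "max_dd": -9.4,  "ret": 38.0},
    "per_regime_f1":            {"trades": 1043, "win_rate": 57.7, "max_dd": -10.2, "ret": 34.1},
    "per_regime_expectancy":    {"trades": 948,  "win_rate": 60.7, "max_dd": -8.9,  "ret": 35.4},
    "per_regime_f1_with_floor": {"trades": 1037, "win_rate": 56.7, "max_dd": -10.7, "ret": 28.2},
}


def deltas_vs_baseline(summary, baseline="none"):
    """Return per-variant deltas against the baseline row (hypothetical helper)."""
    base = summary[baseline]
    out = {}
    for name, row in summary.items():
        if name == baseline:
            continue
        out[name] = {
            # Relative change in trade count, in percent.
            "trades_pct": round(100.0 * (row["trades"] - base["trades"]) / base["trades"], 1),
            # Absolute changes, in percentage points.
            "win_rate_pp": round(row["win_rate"] - base["win_rate"], 1),
            "max_dd_pp": round(row["max_dd"] - base["max_dd"], 1),
            "return_pp": round(row["ret"] - base["ret"], 1),
        }
    return out


DELTAS = deltas_vs_baseline(SUMMARY)
for name, d in DELTAS.items():
    print(name, d)
```

Every variant shows a positive `win_rate_pp` and a negative `return_pp`, which is the selectivity-versus-return trade described above.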
3. Pairwise BH-corrected comparisons — primary metric (Sortino)
PDF's "Pairwise Comparisons" table, paired t-test on matched (crypto, fold) cells :
| vs | Mean A (none) | Mean B (variant) | p-adj (BH) | Cohen's d | Sig. |
|---|---|---|---|---|---|
| coarse_3regime | 1.865 | 1.800 | 0.8729 | 0.11 | NO |
| per_regime_expectancy | 1.767 | 1.642 | 0.8729 | 0.10 | NO |
| per_regime_f1 | 1.767 | 1.705 | 0.8729 | 0.06 | NO |
| per_regime_f1_with_floor | 1.767 | 1.369 | 0.2110 | 0.45 | NO |
Note :
Mean A for coarse_3regime (1.865) differs from the variant-summary value (1.767) above because the pairwise table averages over the matched (crypto, fold) cells where both A and B have valid data, i.e. the paired-test sample rather than the simple variant mean. Both are correct and report the same direction.
Only per_regime_f1_with_floor has a non-trivial effect size (d=0.45, "small" per Cohen's conventions), but the comparison is not significant (p-adj=0.21) AND points TOWARDS the baseline (none > variant).
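For readers who want to reproduce the two statistics in this table, here is a minimal sketch of the Benjamini-Hochberg adjustment and a paired Cohen's d. It is an illustration under stated assumptions, not the FTF report's code: the function names are hypothetical, the d convention shown is mean difference over the standard deviation of the paired differences (the PDF may use a pooled-SD variant), and the example numbers are made up rather than taken from the sweep. The raw p-values would come from the paired t-test on matched (crypto, fold) cells.

```python
from statistics import mean, stdev


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cohens_d_paired(a, b):
    """Paired effect size: mean(a - b) / sample std of (a - b). Assumed convention."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


# Made-up numbers for illustration only (NOT sweep data).
RAW_P = [0.01, 0.04, 0.03, 0.50]
ADJ_P = bh_adjust(RAW_P)
D_EXAMPLE = cohens_d_paired([2.0, 1.0, 3.0], [1.0, 1.0, 1.0])
print(ADJ_P, D_EXAMPLE)
```

The monotonicity step is one way several comparisons can end up sharing the same adjusted p-value, as three of the four rows above do.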
4. Multi-metric Significance Matrix (PDF section)
PDF's per-metric verdict — ✓ = BH p<0.05 AND |d|≥0.3 in winner direction ; ~ = significant but small effect ; ✗ = not significant.
| Pair | Sortino | Expectancy | Total Return | Win Rate |
|---|---|---|---|---|
| none vs coarse_3regime | ✗ p=0.873, d=+0.11 | ✗ p=0.793, d=+0.13 | ✗ p=0.814, d=+0.11 | ✗ p=0.971, d=-0.03 |
| none vs per_regime_expectancy | ✗ p=0.873, d=+0.10 | ✗ p=0.793, d=-0.13 | ✗ p=0.814, d=+0.09 | ✗ p=0.548, d=-0.30 |
| none vs per_regime_f1 | ✗ p=0.873, d=+0.06 | ✗ p=0.896, d=+0.03 | ✗ p=0.814, d=+0.14 | ✗ p=0.971, d=-0.13 |
| none vs per_regime_f1_with_floor | ✗ p=0.211, d=+0.45 | ✗ p=0.793, d=+0.16 | ✗ p=0.227, d=+0.48 | ✗ p=0.977, d=+0.01 |
Lock rule (PDF caption verbatim) : "a factor is LOCKED only when at least 2 metrics show BH-adjusted p < 0.05 AND |Cohen's d| ≥ 0.3 in the winner direction".
Result : 0/4 metrics agree for the winner (none) on every pair. No variant clears the lock rule — and the only non-trivial effect sizes (d=0.45 / 0.48 on Sortino + Total Return for per_regime_f1_with_floor) point TOWARDS the baseline (the variant loses by that margin).
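The lock rule is mechanical enough to express as a short check. The sketch below applies it to the (p-adj, d) pairs copied from the matrix above. The names `MATRIX`, `agreeing_metrics` and `is_locked` are hypothetical, and the sign convention (positive d favors the baseline, negative d favors the variant) is an assumption inferred from how this dossier reads the d=+0.45 row.

```python
# (BH-adjusted p, Cohen's d) per metric, copied from the significance matrix.
MATRIX = {
    "coarse_3regime": {
        "sortino": (0.873, 0.11), "expectancy": (0.793, 0.13),
        "total_return": (0.814, 0.11), "win_rate": (0.971, -0.03),
    },
    "per_regime_expectancy": {
        "sortino": (0.873, 0.10), "expectancy": (0.793, -0.13),
        "total_return": (0.814, 0.09), "win_rate": (0.548, -0.30),
    },
    "per_regime_f1": {
        "sortino": (0.873, 0.06), "expectancy": (0.896, 0.03),
        "total_return": (0.814, 0.14), "win_rate": (0.971, -0.13),
    },
    "per_regime_f1_with_floor": {
        "sortino": (0.211, 0.45), "expectancy": (0.793, 0.16),
        "total_return": (0.227, 0.48), "win_rate": (0.977, 0.01),
    },
}


def agreeing_metrics(row, alpha=0.05, min_abs_d=0.3):
    """Count metrics passing both thresholds, split by direction.

    Returns (n_for_baseline, n_for_variant). Assumes d > 0 favors the baseline.
    """
    for_baseline = sum(1 for p, d in row.values() if p < alpha and d >= min_abs_d)
    for_variant = sum(1 for p, d in row.values() if p < alpha and d <= -min_abs_d)
    return for_baseline, for_variant


def is_locked(row, min_metrics=2):
    """A pair locks when one side wins on at least `min_metrics` metrics."""
    return max(agreeing_metrics(row)) >= min_metrics


AGREEMENTS = {name: agreeing_metrics(row) for name, row in MATRIX.items()}
LOCKED = {name: is_locked(row) for name, row in MATRIX.items()}
print(AGREEMENTS, LOCKED)
```

With these inputs no metric passes the p-value filter for any pair, so the effect-size condition never gets a chance to matter.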
5. ML metrics — Couche A (signal model)
PDF's "ML Metrics" table — model-level discrimination metrics, factor-independent of the threshold optimisation :
| Variant | f1_buy | precision | recall | AUC | f1_macro | Brier | ECE | Δ f1 |
|---|---|---|---|---|---|---|---|---|
| none | 0.432 | 0.438 | 0.446 | 0.734 | 0.651 | 0.1256 | 0.0000 ⚠ | 0.318 |
| per_regime_f1 | 0.428 | 0.425 | 0.455 | 0.727 | 0.647 | 0.1263 | 0.0000 ⚠ | 0.314 |
| coarse_3regime | 0.422 | 0.422 | 0.446 | 0.732 | 0.646 | 0.1245 | 0.0000 ⚠ | 0.313 |
| per_regime_expectancy | 0.417 | 0.433 | 0.452 | 0.726 | 0.641 | 0.1265 | 0.0000 ⚠ | 0.308 |
| per_regime_f1_with_floor | 0.416 | 0.436 | 0.424 | 0.726 | 0.643 | 0.1264 | 0.0000 ⚠ | 0.310 |
⚠ ECE = 0.0000 across all variants — same anomaly seen on Track 5 (label_smoothing) per its results dossier (bit-for-bit zero is impossible for a real classifier). The bug is documented in CVN-N011-EA-S09 (#770) and pre-dates Track 9. Doesn't affect the verdict (the f1_buy + Sortino numbers are decisive).
Critical reading : f1_buy itself regresses on every per-regime variant.
| Variant | Δf1_buy vs baseline |
|---|---|
| per_regime_f1 | -0.004 |
| coarse_3regime | -0.010 |
| per_regime_expectancy | -0.015 |
| per_regime_f1_with_floor | -0.016 |
The mission's headline gate (f1_buy lift ≥ +0.020) fails on every variant : the smallest regression is -0.004 (close to zero, but in the wrong direction).
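As a quick check of the delta table, the point-estimate half of the gate can be recomputed from the f1_buy column of the ML metrics table. This is a sketch with hypothetical names (`F1_BUY`, `LIFT_GATE`); the CI95 half of the gate is not reproduced because per-fold f1_buy values are not listed in this dossier.

```python
# f1_buy per variant, copied from the "ML Metrics" table.
F1_BUY = {
    "none": 0.432,
    "per_regime_f1": 0.428,
    "coarse_3regime": 0.422,
    "per_regime_expectancy": 0.417,
    "per_regime_f1_with_floor": 0.416,
}
LIFT_GATE = 0.020  # minimum f1_buy lift required by the F1 plan gate

# Delta of each variant against the `none` baseline, rounded to 3 decimals.
F1_DELTAS = {
    name: round(value - F1_BUY["none"], 3)
    for name, value in F1_BUY.items()
    if name != "none"
}
# Point-estimate check only; the CI95 condition is out of scope here.
PASSES_LIFT = {name: delta >= LIFT_GATE for name, delta in F1_DELTAS.items()}
print(F1_DELTAS, PASSES_LIFT)
```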
6. Signal funnel — Couche B (débit)
| Variant | raw_buy | CUSUM block | Conc. block | Conc. survival | Total survival |
|---|---|---|---|---|---|
| none | 1661 | 0.867 | 0.277 | 0.723 | 0.723 |
| coarse_3regime | 1452 | 0.864 | 0.249 | 0.751 | 0.751 |
| per_regime_f1 | 1535 | 0.867 | 0.240 | 0.760 | 0.760 |
| per_regime_f1_with_floor | 1491 | 0.867 | 0.230 | 0.770 | 0.770 |
| per_regime_expectancy | 1324 | 0.867 | 0.207 | 0.793 | 0.793 |
Reading : per-regime variants generate fewer raw BUYs (1324-1535 vs 1661 baseline = -8 to -20 %) — they are pickier upstream of CUSUM. CUSUM block-rate stays flat (0.864-0.867). Concurrency block-rate drops with selectivity (0.277 → 0.207) → total signal survival rises (0.723 → 0.793).
But the signal-survival rise does NOT translate into f1_buy or Sortino lift — the per-regime variants reject borderline opportunities that, in aggregate, were positive-EV. This is the key mechanism behind the "lower trades, higher win rate, lower total return" pattern in §2.
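The funnel percentages quoted in this section follow directly from the table. The sketch below recomputes the raw-BUY reduction and the concurrency survival (1 minus the concurrency block rate) per variant; `FUNNEL` and `funnel_stats` are hypothetical names and the inputs are copied from the table above.

```python
# (raw_buy, concurrency block rate) per variant, copied from the funnel table.
FUNNEL = {
    "none": (1661, 0.277),
    "coarse_3regime": (1452, 0.249),
    "per_regime_f1": (1535, 0.240),
    "per_regime_f1_with_floor": (1491, 0.230),
    "per_regime_expectancy": (1324, 0.207),
}


def funnel_stats(funnel, baseline="none"):
    """Raw-BUY change vs baseline (percent) and concurrency survival per variant."""
    base_raw, _ = funnel[baseline]
    stats = {}
    for name, (raw_buy, conc_block) in funnel.items():
        stats[name] = {
            "raw_buy_change_pct": round(100.0 * (raw_buy - base_raw) / base_raw, 1),
            "conc_survival": round(1.0 - conc_block, 3),
        }
    return stats


FUNNEL_STATS = funnel_stats(FUNNEL)
for name, s in FUNNEL_STATS.items():
    print(name, s)
```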
7. Per-crypto performance (PDF section + page 9)
| Variant | AAVEUSDC | ARBUSDC | LDOUSDC | OPUSDC | UNIUSDC |
|---|---|---|---|---|---|
| none | 0.058 | **2.730** | **2.697** | 2.019 | 1.567 |
| coarse_3regime | **0.178** | 2.307 | 1.977 | **2.656** | **1.685** |
| per_regime_f1 | -0.204 ⚠ | 2.688 | 2.479 | 1.970 | 1.197 |
| per_regime_expectancy | 0.053 | 2.618 | 2.253 | 1.862 | 1.204 |
| per_regime_f1_with_floor | -0.279 ⚠ | 2.564 | 1.844 | 1.494 | 0.980 |
(Sortino values, bold = best variant for that crypto.)
⚠ AAVEUSDC actively harmed : both per_regime_f1 and per_regime_f1_with_floor push AAVEUSDC's Sortino into negative territory (-0.204 / -0.279 vs +0.058 baseline). AAVEUSDC has the lowest baseline trade count (164 trades vs 217-340 elsewhere), suggesting per-regime calibration degenerates on small per-fold-per-regime samples — the very failure mode the §4.1 fallback guardrail (< 30 total samples / < 5 positive samples) was designed to catch.
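To make the guardrail concrete, here is a minimal sketch of a per-regime F1-optimal threshold search that falls back to the global threshold when a regime has fewer than 30 samples or fewer than 5 positives. It is an illustration of the idea only and does not claim to mirror per_regime_threshold_calibrator.py: the function names, the 0.05-0.95 threshold grid and the `>=` decision rule are all assumptions.

```python
MIN_TOTAL_SAMPLES = 30    # guardrail from the plan's §4.1
MIN_POSITIVE_SAMPLES = 5  # guardrail from the plan's §4.1


def f1_at(threshold, probs, labels):
    """F1 of the BUY class when predicting BUY for P(BUY) >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def regime_threshold(probs, labels, global_threshold, grid=None):
    """F1-optimal threshold for one regime, or the global one if data is too thin."""
    if len(labels) < MIN_TOTAL_SAMPLES or sum(labels) < MIN_POSITIVE_SAMPLES:
        return global_threshold  # fallback: not enough evidence in this regime
    if grid is None:
        grid = [i / 100 for i in range(5, 96)]  # hypothetical search grid
    # max() keeps the first (lowest) threshold among ties.
    return max(grid, key=lambda t: f1_at(t, probs, labels))


# Illustrative inputs only (NOT sweep data).
THIN = regime_threshold([0.9, 0.1], [1, 0], global_threshold=0.5)
RICH = regime_threshold([0.8] * 20 + [0.3] * 20, [1] * 20 + [0] * 20, global_threshold=0.5)
print(THIN, RICH)
```

A regime that clears the guardrail only barely can still produce an unstable optimum, which is consistent with the AAVEUSDC behaviour described above.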
Per-asset gate (≥ 4/5 cryptos improve vs baseline on f1_buy), evaluated here on the per-crypto Sortino values from the table above :
| Variant | Cryptos improving Sortino vs none | Verdict |
|---|---|---|
| coarse_3regime | 3/5 (AAVE +0.12, OP +0.64, UNI +0.12) | ❌ |
| per_regime_f1 | 0/5 (AAVE -0.26, ARB -0.04, LDO -0.22, OP -0.05, UNI -0.37) | ❌ |
| per_regime_expectancy | 0/5 (-0.005 to -0.44 across all) | ❌ |
| per_regime_f1_with_floor | 0/5 (AAVE -0.34 to LDO -0.85, every crypto loses) | ❌ |
coarse_3regime is the closest to baseline (3/5) but still fails the ≥4/5 gate.
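The per-asset count can be recomputed from the per-crypto Sortino table. The sketch below uses Sortino because per-crypto f1_buy values are not listed in this dossier; `SORTINO`, `improving_cryptos` and `PER_ASSET_GATE` are hypothetical names and the values are copied from the table in this section.

```python
# Per-crypto Sortino, copied from the table in this section.
SORTINO = {
    "none":                     {"AAVEUSDC": 0.058,  "ARBUSDC": 2.730, "LDOUSDC": 2.697, "OPUSDC": 2.019, "UNIUSDC": 1.567},
    "coarse_3regime":           {"AAVEUSDC": 0.178,  "ARBUSDC": 2.307, "LDOUSDC": 1.977, "OPUSDC": 2.656, "UNIUSDC": 1.685},
    "per_regime_f1":            {"AAVEUSDC": -0.204, "ARBUSDC": 2.688, "LDOUSDC": 2.479, "OPUSDC": 1.970, "UNIUSDC": 1.197},
    "per_regime_expectancy":    {"AAVEUSDC": 0.053,  "ARBUSDC": 2.618, "LDOUSDC": 2.253, "OPUSDC": 1.862, "UNIUSDC": 1.204},
    "per_regime_f1_with_floor": {"AAVEUSDC": -0.279, "ARBUSDC": 2.564, "LDOUSDC": 1.844, "OPUSDC": 1.494, "UNIUSDC": 0.980},
}
PER_ASSET_GATE = 4  # at least 4 of 5 cryptos must improve


def improving_cryptos(variant, table=SORTINO, baseline="none"):
    """Sorted list of cryptos where the variant strictly beats the baseline."""
    base = table[baseline]
    return sorted(c for c, v in table[variant].items() if v > base[c])


IMPROVING = {v: improving_cryptos(v) for v in SORTINO if v != "none"}
GATE_PASSED = {v: len(cs) >= PER_ASSET_GATE for v, cs in IMPROVING.items()}
for variant, cryptos in IMPROVING.items():
    print(variant, f"{len(cryptos)}/5", cryptos, "PASS" if GATE_PASSED[variant] else "FAIL")
```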
8. Stability by fold (PDF heatmap, Sortino)
| Variant | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | per-fold variance |
|---|---|---|---|---|---|---|
| none | 2.29 | 0.32 | 1.97 | 1.55 | 2.52 | 0.609 |
| coarse_3regime | 1.84 | 0.40 | 1.75 | 2.13 | 2.24 | 0.466 |
| per_regime_f1 | 2.38 | 0.56 | 1.09 | 2.01 | 2.09 | 0.493 |
| per_regime_expectancy | 2.23 | 0.76 | 0.88 | 1.62 | 2.49 | 0.539 |
| per_regime_f1_with_floor | 2.11 | 0.35 | 0.89 | 1.33 | 1.93 | 0.443 |
Fold 4 is uniformly weak across all variants (0.32-0.76) — likely a tough market regime in that period (cross-Track artefact, not a Track 9 specific issue). All variants show high per-fold variance → the F1 plan §6 stability gate (per-fold variance ≤ 0.05 on f1_buy, but Sortino variance is even larger) is dominated by data-period effects, not the per-regime calibration choice.
9. Gate evaluation per F1_BUY_BOOST_PLAN.md §6
| Criterion | per_regime_f1 | per_regime_expectancy | per_regime_f1_with_floor | coarse_3regime |
|---|---|---|---|---|
| f1_buy lift ≥ +0.020 with CI95 excluding 0 | ❌ Δ=-0.004 | ❌ Δ=-0.015 | ❌ Δ=-0.016 | ❌ Δ=-0.010 |
| Joint metric : Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % | ❌ d_sortino=+0.06 (towards none) | ❌ d_expectancy=-0.13 | ❌ d_sortino=+0.45 (towards none) | ❌ d_sortino=+0.11 (towards none) |
| Stability : per-fold variance of f1_buy ≤ 0.05 | ⚠ all-variants high (Sortino var ~0.5) | ⚠ same | ⚠ same | ⚠ same |
| Per-asset : f1_buy improves on ≥ 4/5 cryptos (counted on Sortino, §7) | ❌ 0/5 | ❌ 0/5 | ❌ 0/5 | ❌ 3/5 |
| Sample size : ≥ 50 BUY trades / fold | ✅ all variants ≥ 100 trades/fold | ✅ | ✅ | ✅ |
| MLOps : documentation/stories/CVN-N001-EE-S03/mlops_readiness.md complete | ✅ (filed in PR #791) | ✅ | ✅ | ✅ |
| Lock rule (PDF) : ≥ 2 metrics with BH p<0.05 AND |d|≥0.3 in winner direction | ❌ 0/4 | ❌ 0/4 | ❌ 0/4 | ❌ 0/4 |
Verdict per criterion : every primary criterion fails for every variant. coarse_3regime is the closest to baseline (3/5 cryptos improve) but still fails the ≥4/5 gate AND the f1_buy lift gate AND the lock rule.
10. Why per-regime threshold didn't lift Sortino on this data
§5 of the plan dossier listed four candidate hypotheses. The PDF data supports two of them (H1, H2), partially supports one (H4) and refutes one (H3) :
- Hypothesis 1 (regime detector noise) : supported. Baseline none wins on most cryptos despite the per-regime variants having more selectivity (lower raw_buy, higher Win Rate). If regime labels were sharp + correct, per-regime should at least match baseline ; instead, coarse_3regime (which collapses 6 noisy regimes into 3 bigger buckets) is the best per-regime variant. Mechanism : RegimeDetector.heuristic_v1 (ATR + slope rules) produces noisy labels at the per-fold sample budget ; per-regime calibration averages over noise + small-sample sentinels.
- Hypothesis 2 (selectivity loses absolute return) : strongly supported. Win Rate rises on every variant (55.3 % → up to 60.7 %) while Total Return drops on every variant (39.1 % → down to 28.2 %). Per-regime variants reject borderline opportunities that, in aggregate, were positive-EV (the rejected BUYs had average win rates between baseline and the per-regime threshold, and their cumulative contribution to Sortino was positive). The Concurrency-block-rate drop confirms : per-regime variants are NOT generating MORE high-quality BUYs ; they are just filtering out more LOW-quality BUYs, but the filter cuts too deep.
- Hypothesis 3 (floor as regulariser) : refuted. per_regime_f1_with_floor is the WORST variant (Sortino 1.320, Total Return 28.2 %). The 0.05 floor cap on threshold deviation from global is too tight ; it forces the per-regime thresholds back near global on most regimes, and where it doesn't, the deviation is in a worse direction.
- Hypothesis 4 (coarse 3-regime sweet spot) : partially supported. coarse_3regime is the closest to baseline (Sortino 1.723 vs 1.767, only -0.044 ; Total Return 38.0 % vs 39.1 %, -1.1pp) and the per-asset gate is closest to passing (3/5). Halving the parameter count helped, but didn't recover the loss → the regime-detector signal-to-noise ratio is the upstream bottleneck, not the parameter count.
Cross-Track lesson : Track 9 (calibration tier) joins Tracks 5 (label smoothing) and 6 (focal loss) from the label/loss tier as ABANDONED. None of these training-signal manipulations yields positive lift on this dataset, so the F1 plan §6 cross-track lesson stands. The F1 lift will need to come from the data tier (Track 1) OR the architecture tier (Track 11). Both are In progress as split-PRs per ADR-0079 invariant 9, and both Stories stay In progress until their block A follow-up PRs ship. See the F1 plan §10 tracking table for canonical row-level status.
11. Decisions
11.1 Lock variant — NO LOCK
factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. No Console flip. ftf_config.base_env unchanged.
11.2 Verdict — ABANDON
The per-regime hypothesis is not supported at the current dataset / regime detector. Track 9 closes following the same pattern as Tracks 5/6 :
- ✅ Implementation code stays in tree (src/commun/trading/per_regime_threshold_calibrator.py + tests). Useful for ad-hoc experiments and as the reference implementation if a future regime detector ships materially better labels.
- ✅ FTF factor per_regime_threshold stays in MODEL_FACTORS (mirrors Track 5/6 pattern). Future operator-triggered sweeps can re-evaluate if (a) the regime detector is upgraded to a denoising or learned classifier, OR (b) the dataset grows enough to bring per-fold-per-regime samples into the lock-rule's effect-size detection floor.
- ❌ No champion_per_regime_* model registered. No promotion gate to schedule.
- ❌ No quarterly re-fit cadence. Re-evaluation is operator-triggered, not scheduled.
11.3 Cross-Track interaction notes
- Track 1 (BTC features, In testing) : if Track 1 LOCKs and adds the BTC enrichment features, the model's per-regime probability calibration will shift. The per-regime threshold could be re-evaluated post-Track-1-LOCK on the BTC-enriched feature set ; today's ABANDON does not preclude this.
- Track 11 (ensemble diversity, PR #793 review) : same logic. If Track 11 LOCKs a stacked variant, the ensemble's P(BUY) distribution will differ from the single-XGB baseline ; the per-regime threshold optimum can shift. Re-evaluate post-Track-11-LOCK if the operator deems it worth the compute.
- Regime detector upgrade : if a future Story upgrades RegimeDetector (e.g., from heuristic_v1 to a learned classifier), the per-regime threshold becomes a candidate for re-evaluation as the noisy-label hypothesis (§10 H1) would no longer apply.
11.4 Hidden recommendation : per_regime_expectancy for capital-preservation use-cases
Not a Track 9 LOCK candidate, but worth recording : per_regime_expectancy posts the best Win Rate (60.7 %) AND best Max DD (-8.9 %) of the matrix. It would be a candidate for a capital-preservation objective (lower DD, higher win rate, willingly trading total return for stability). This isn't this Story's mandate (the F1 plan optimises for Sortino + f1_buy), but the operator could file a follow-up Story under a different mission (e.g., the "filter tuning" / capital preservation tier) that targets per_regime_expectancy as a candidate variant. Out of scope for wp#42 closure.
12. Sprint version + OP closure
12.1 OP wp#42 transition
Per ADR-69 + workflow §14 :
- Story status In testing → Closed
- OP comment to add :
Track 9 (per_regime_threshold) sweep ftf_20260430_194027_3d0171_ATR0.5_1.5_H4 completed
with verdict ABANDON. No Console flip; baseline `none` retained.
Results dossier: documentation/missions/ml-boost/2026-05-01-track9-per-regime-threshold-results.md
Implementation PR: #791 (squash 2a997d81, merged 2026-04-30)
Lock rule (PDF executive summary): 0/4 metrics agree on any variant vs baseline.
Largest pairwise effect: per_regime_f1_with_floor d=+0.45 in baseline's favor (p-adj=0.21, NS).
AAVEUSDC actively harmed by per_regime_f1 + per_regime_f1_with_floor (negative Sortino).
Implementation stays in tree per Track 5/6 precedent; FTF factor remains in MODEL_FACTORS
for future re-evaluation if the regime detector is upgraded OR the dataset grows.
Verdict: ABANDON; gate decision: keep_available implementation, NO LOCK.
12.2 Sprint version closure check
Per workflow §14 : if wp#42 is the last open Story in its sprint version, follow §16.4 — gate review + close version + retro. Operator to check OP UI and apply.
12.3 Memory entry
No new durable lesson from this run : the cross-Track lesson "training-signal manipulation tier doesn't lift on this data" was already captured by Tracks 5 + 6. No feedback_*.md memory write needed.
A memory entry MIGHT be warranted for the project layer : "calibration tier confirmed ABANDONED ; F1 lift must come from data (Track 1) or architecture (Track 11) tiers". This is project-state, not a behavioural lesson — already implicit in the F1 plan §6 outcomes table once it's updated post-run.
Sign-off checklist (gate before OP wp#42 closure)
- §1-§9 populated with actual sweep numbers from PDF report ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf
- §10 hypothesis pick : H1 (regime detector noise) + H2 (selectivity loses absolute return) supported by data
- §11.1-§11.2 verdict recorded : ABANDON, no Console flip
- §11.3 cross-Track interaction noted (Track 1 + Track 11 future re-evaluation hooks)
- §11.4 hidden recommendation captured (per_regime_expectancy for capital-preservation follow-up)
- OP wp#42 status In testing → Closed with comment from §12.1 : operator action (needs OPENPROJECT_API_KEY)
- Sprint version closure gate evaluated per workflow §14 : operator action
- F1 plan §6 outcomes table updated with Track 9 ABANDON entry — operator action OR follow-up docs PR