
Track 9 — Per-regime threshold results & gate decision

Story : CVN-N001-EE-S03 (wp#42), Track 9 of F1_buy boost mission, calibration tier
ADR : closes per ADR-0079 (FTF sweep → Story closure 8-step workflow ; this dossier was the first to follow the canonical workflow + the precedent dataset for the ADR's verdict decision tree)
Date : 2026-05-01
Authors : Dominique (operator) + Claude
Plan dossier : 2026-04-29-track9-per-regime-threshold-plan.md ; committee c560b67a PASSED EXECUTION_RISK
Implementation PR : #791 (squash 2a997d81, merged 2026-04-30)
Sweep run_id : ftf_20260430_194027_3d0171_ATR0.5_1.5_H4
Sweep status : completed ; 125 results ; 0 errors ; duration 3h44m ; started 2026-04-30 19:40:56 UTC ; git SHA 2a997d81 (= PR #791 squash)
Cryptos : UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC (defi_top5)
Strategy / PTE : ATR0.5_1.5_H4 (per project_pte_policy memory ; ATR1.5_3.0_H5 is DEPRECATED)
Folds : 5 (Folds 3-7 per the report) ; Trials : 50 (Optuna budget per fold) ; Cost : 15 bps
FTF report PDF : reports/ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf (committed to docs base ; source-of-truth for the per-pair / per-fold tables in §2-§9)

Verdict : ABANDON. No per-regime variant clears the lock rule (at least 2 metrics with BH p<0.05 AND |Cohen's d| ≥ 0.3) : every pairwise comparison against the baseline scores 0/4 metrics. Effect sizes on the primary metric are small (max |d| = 0.45, in the baseline's favor) and statistically insignificant (min p-adj = 0.21). The hypothesis "per-regime calibration lifts f1_buy" is not supported by this dataset. Lock decision : NO LOCK. factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. Console state unchanged.


1. Sweep state

Per the FTF report's executive summary (run completed 0 errors, 125 rows = 5 cryptos × 5 folds × 5 variants, no rejected variants) :

| Factor | Variant | Useful rows | Coverage | Notes |
|---|---|---|---|---|
| per_regime_threshold | none (baseline) | 25 | 5 cryptos × 5 folds | Global F1-optimal threshold (current production) |
| per_regime_threshold | per_regime_f1 | 25 | 5 × 5 | One F1-optimal threshold per regime |
| per_regime_threshold | per_regime_expectancy | 25 | 5 × 5 | Expectancy-optimal per regime ; sentinel-1.0 fallback for unprofitable regimes |
| per_regime_threshold | per_regime_f1_with_floor | 25 | 5 × 5 | Caps deviation at global - 0.05 |
| per_regime_threshold | coarse_3regime | 25 | 5 × 5 | 6 regimes → trend / range / transition collapse |

Sweep status : succeeded (no error column populated, no fold dropped). Clean run — every variant produced metrics on every (crypto, fold) cell.

2. Performance summary — all variants (PDF "Couche C" table)

| Variant | Sortino | Std | Trades | Win Rate | Max DD | Return % |
|---|---|---|---|---|---|---|
| none (baseline) | 1.767 | 1.841 | 1073 | 55.3 % | -10.7 % | 39.1 % |
| coarse_3regime | 1.723 | 1.566 | 1026 | 57.1 % | -9.4 % | 38.0 % |
| per_regime_f1 | 1.626 | 1.790 | 1043 | 57.7 % | -10.2 % | 34.1 % |
| per_regime_expectancy | 1.598 | 1.592 | 948 | 60.7 % | -8.9 % | 35.4 % |
| per_regime_f1_with_floor | 1.320 | 1.617 | 1037 | 56.7 % | -10.7 % | 28.2 % |

Pattern : per-regime variants trade absolute return for selectivity.

  • Win Rate ↑ on every variant (55.3 % baseline → 60.7 % for per_regime_expectancy, the most selective)
  • Total Return ↓ on every variant (39.1 % baseline → 28.2 % worst)
  • Max DD ~ flat (within ±1.8pp of baseline)
  • Trade count ~ flat (948-1073, at most 11.6 % below baseline)

The hypothesised mechanism (per-regime threshold optimises BUY selection per-regime → higher signal quality) does NOT translate into Sortino lift on this 5-fold × 5-crypto sample budget.
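The pattern above can be re-derived from the §2 table with a few lines of arithmetic. The sketch below is illustrative only : the dictionary layout and the function name `deltas_vs_baseline` are assumptions made for this example, not part of the FTF tooling ; the numbers are copied from the table.

```python
# Values copied from the §2 performance summary table.
# The structure and names below are hypothetical, for illustration only.
SUMMARY = {
    "none": {"sortino": 1.767, "trades": 1073, "win_rate": 55.3, "max_dd": -10.7, "ret": 39.1},
    "coarse_3regime": {"sortino": 1.723, "trades": 1026, "win_rate": 57.1, "max_dd": -9.4, "ret": 38.0},
    "per_regime_f1": {"sortino": 1.626, "trades": 1043, "win_rate": 57.7, "max_dd": -10.2, "ret": 34.1},
    "per_regime_expectancy": {"sortino": 1.598, "trades": 948, "win_rate": 60.7, "max_dd": -8.9, "ret": 35.4},
    "per_regime_f1_with_floor": {"sortino": 1.320, "trades": 1037, "win_rate": 56.7, "max_dd": -10.7, "ret": 28.2},
}


def deltas_vs_baseline(summary, baseline="none"):
    """Return per-variant differences against the baseline row."""
    base = summary[baseline]
    out = {}
    for name, row in summary.items():
        if name == baseline:
            continue
        out[name] = {
            "d_sortino": round(row["sortino"] - base["sortino"], 3),
            "d_win_rate_pp": round(row["win_rate"] - base["win_rate"], 1),
            "d_max_dd_pp": round(row["max_dd"] - base["max_dd"], 1),
            "d_return_pp": round(row["ret"] - base["ret"], 1),
            # Relative change in trade count, in percent of baseline.
            "trades_pct": round(100.0 * (row["trades"] / base["trades"] - 1.0), 1),
        }
    return out


DELTAS = deltas_vs_baseline(SUMMARY)
for variant, d in DELTAS.items():
    print(variant, d)
```

Every variant shows a positive Win Rate delta together with negative Sortino and Total Return deltas, which is the selectivity-versus-return trade described in this section.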

3. Pairwise BH-corrected comparisons — primary metric (Sortino)

PDF's "Pairwise Comparisons" table, paired t-test on matched (crypto, fold) cells :

| vs | Mean A (none) | Mean B (variant) | p-adj (BH) | Cohen's d | Sig. |
|---|---|---|---|---|---|
| coarse_3regime | 1.865 | 1.800 | 0.8729 | 0.11 | NO |
| per_regime_expectancy | 1.767 | 1.642 | 0.8729 | 0.10 | NO |
| per_regime_f1 | 1.767 | 1.705 | 0.8729 | 0.06 | NO |
| per_regime_f1_with_floor | 1.767 | 1.369 | 0.2110 | 0.45 | NO |

Note : Mean A for coarse_3regime (1.865) differs from the variant-summary value (1.767) above because the pairwise table averages over the matched (crypto, fold) cells where both A and B have valid data — i.e., the paired-test sample, not the simple variant mean. Both are correct and report the same direction.

Only per_regime_f1_with_floor has a non-trivial effect size (d=0.45, just under Cohen's conventional 0.5 "medium" mark), but the comparison is not significant (p-adj=0.21) AND favours the baseline (none > variant).
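For readers who want to reproduce the two statistics in this table, here is a minimal standard-library sketch of the Benjamini-Hochberg adjustment and of a paired Cohen's d. It is an assumption that the PDF uses the paired form d_z = mean(diff) / sd(diff) ; the report does not state which d variant it computes. The function names and the raw p-values in the usage lines are illustrative, not taken from the sweep.

```python
from statistics import mean, stdev


def cohens_d_paired(a, b):
    """Paired effect size d_z = mean(a - b) / sd(a - b) on matched cells.

    Assumption: the FTF report may use a different d variant (e.g. pooled SD).
    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative raw p-values only (the dossier reports adjusted values, not raw ones).
raw_p = [0.01, 0.04, 0.03, 0.50]
adj_p = bh_adjust(raw_p)
d_example = cohens_d_paired([2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
```

The monotonicity step lets a smaller raw p-value inherit the adjusted value of a larger one, which is one way several comparisons can end up sharing the same p-adj, as three rows of the table do.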

4. Multi-metric Significance Matrix (PDF section)

PDF's per-metric verdict : ✓ = BH p<0.05 AND |d|≥0.3 in winner direction ; ~ = significant but small effect ; ✗ = not significant.

| Pair | Sortino | Expectancy | Total Return | Win Rate |
|---|---|---|---|---|
| none vs coarse_3regime | ✗ p=0.873, d=+0.11 | ✗ p=0.793, d=+0.13 | ✗ p=0.814, d=+0.11 | ✗ p=0.971, d=-0.03 |
| none vs per_regime_expectancy | ✗ p=0.873, d=+0.10 | ✗ p=0.793, d=-0.13 | ✗ p=0.814, d=+0.09 | ✗ p=0.548, d=-0.30 |
| none vs per_regime_f1 | ✗ p=0.873, d=+0.06 | ✗ p=0.896, d=+0.03 | ✗ p=0.814, d=+0.14 | ✗ p=0.971, d=-0.13 |
| none vs per_regime_f1_with_floor | ✗ p=0.211, d=+0.45 | ✗ p=0.793, d=+0.16 | ✗ p=0.227, d=+0.48 | ✗ p=0.977, d=+0.01 |

Lock rule (PDF caption verbatim) : "a factor is LOCKED only when at least 2 metrics show BH-adjusted p < 0.05 AND |Cohen's d| ≥ 0.3 in the winner direction".

Result : 0/4 metrics agree for the winner (none) on every pair. No variant clears the lock rule — and the only non-trivial effect sizes (d=0.45 / 0.48 on Sortino + Total Return for per_regime_f1_with_floor) point TOWARDS the baseline (the variant loses by that margin).
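The lock rule can be checked mechanically against the matrix. The sketch below encodes the (p-adj, d) pairs exactly as printed in this section and counts, per pair, the metrics that satisfy both conditions. The names `MATRIX`, `agreeing_metrics` and `is_locked` are hypothetical, and the sketch assumes that a positive d favours the baseline (inferred from the d=+0.45 reading above, not stated by the PDF).

```python
# (p_adj, cohens_d) per metric, copied from the significance matrix.
# Assumed sign convention: d > 0 favours the baseline `none`.
MATRIX = {
    "coarse_3regime": {
        "sortino": (0.873, 0.11), "expectancy": (0.793, 0.13),
        "total_return": (0.814, 0.11), "win_rate": (0.971, -0.03),
    },
    "per_regime_expectancy": {
        "sortino": (0.873, 0.10), "expectancy": (0.793, -0.13),
        "total_return": (0.814, 0.09), "win_rate": (0.548, -0.30),
    },
    "per_regime_f1": {
        "sortino": (0.873, 0.06), "expectancy": (0.896, 0.03),
        "total_return": (0.814, 0.14), "win_rate": (0.971, -0.13),
    },
    "per_regime_f1_with_floor": {
        "sortino": (0.211, 0.45), "expectancy": (0.793, 0.16),
        "total_return": (0.227, 0.48), "win_rate": (0.977, 0.01),
    },
}


def agreeing_metrics(pair, alpha=0.05, min_d=0.3):
    """Split the metrics that pass both conditions by winner direction."""
    passed = {m: d for m, (p, d) in pair.items() if p < alpha and abs(d) >= min_d}
    return {
        "baseline": sorted(m for m, d in passed.items() if d > 0),
        "variant": sorted(m for m, d in passed.items() if d < 0),
    }


def is_locked(pair, min_metrics=2):
    """True when at least `min_metrics` metrics agree on the same winner."""
    agree = agreeing_metrics(pair)
    return max(len(agree["baseline"]), len(agree["variant"])) >= min_metrics


LOCKS = {variant: is_locked(pair) for variant, pair in MATRIX.items()}
```

With the printed values every entry of `LOCKS` is `False` : large effect sizes such as d=+0.48 are disqualified by the significance condition alone, and the only |d| ≥ 0.3 in the variant's favour (Win Rate, per_regime_expectancy) has p=0.548.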

5. ML metrics — Couche A (signal model)

PDF's "ML Metrics" table — model-level discrimination metrics, factor-independent of the threshold optimisation :

| Variant | f1_buy | precision | recall | AUC | f1_macro | Brier | ECE | Δ f1 |
|---|---|---|---|---|---|---|---|---|
| none | 0.432 | 0.438 | 0.446 | 0.734 | 0.651 | 0.1256 | 0.0000 ⚠ | 0.318 |
| per_regime_f1 | 0.428 | 0.425 | 0.455 | 0.727 | 0.647 | 0.1263 | 0.0000 ⚠ | 0.314 |
| coarse_3regime | 0.422 | 0.422 | 0.446 | 0.732 | 0.646 | 0.1245 | 0.0000 ⚠ | 0.313 |
| per_regime_expectancy | 0.417 | 0.433 | 0.452 | 0.726 | 0.641 | 0.1265 | 0.0000 ⚠ | 0.308 |
| per_regime_f1_with_floor | 0.416 | 0.436 | 0.424 | 0.726 | 0.643 | 0.1264 | 0.0000 ⚠ | 0.310 |

ECE = 0.0000 across all variants — same anomaly seen on Track 5 (label_smoothing) per its results dossier (bit-for-bit zero is impossible for a real classifier). The bug is documented in CVN-N011-EA-S09 (#770) and pre-dates Track 9. Doesn't affect the verdict (the f1_buy + Sortino numbers are decisive).

Critical reading : f1_buy itself regresses on every per-regime variant.

Variant Δf1_buy vs baseline
per_regime_f1 -0.004
coarse_3regime -0.010
per_regime_expectancy -0.015
per_regime_f1_with_floor -0.016

The mission's headline gate (f1_buy lift ≥ +0.020) fails on every variant : the least-negative delta is -0.004 (negligible, but in the wrong direction).
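The delta column and the gate check amount to simple arithmetic on the f1_buy column of the ML metrics table. A minimal sketch follows ; the names are hypothetical and the CI95 condition of the gate is deliberately left out, since the per-cell values needed to compute it are not in this dossier.

```python
# f1_buy per variant, copied from the ML metrics table.
F1_BUY = {
    "none": 0.432,
    "per_regime_f1": 0.428,
    "coarse_3regime": 0.422,
    "per_regime_expectancy": 0.417,
    "per_regime_f1_with_floor": 0.416,
}
MIN_LIFT = 0.020  # point-estimate part of the F1 plan §6 gate


def f1_buy_lift(f1, baseline="none"):
    """Delta of each variant's f1_buy against the baseline."""
    return {v: round(x - f1[baseline], 3) for v, x in f1.items() if v != baseline}


LIFTS = f1_buy_lift(F1_BUY)
# A variant passes the point-estimate part only if its lift reaches MIN_LIFT.
PASSING = [v for v, lift in LIFTS.items() if lift >= MIN_LIFT]
```

`PASSING` is empty : every lift is negative, so the CI95 condition never needs to be evaluated.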

6. Signal funnel — Couche B (débit)

| Variant | raw_buy | CUSUM block | Conc. block | Conc. survival | Total survival |
|---|---|---|---|---|---|
| none | 1661 | 0.867 | 0.277 | 0.723 | 0.723 |
| coarse_3regime | 1452 | 0.864 | 0.249 | 0.751 | 0.751 |
| per_regime_f1 | 1535 | 0.867 | 0.240 | 0.760 | 0.760 |
| per_regime_f1_with_floor | 1491 | 0.867 | 0.230 | 0.770 | 0.770 |
| per_regime_expectancy | 1324 | 0.867 | 0.207 | 0.793 | 0.793 |

Reading : per-regime variants generate fewer raw BUYs (1324-1535 vs 1661 baseline = -8 to -20 %) — they are pickier upstream of CUSUM. CUSUM block-rate stays flat (0.864-0.867). Concurrency block-rate drops with selectivity (0.277 → 0.207) → total signal survival rises (0.723 → 0.793).

But the signal-survival rise does NOT translate into f1_buy or Sortino lift : the per-regime variants reject borderline opportunities that, in aggregate, were positive-EV. This is the key mechanism behind the "fewer trades, higher win rate, lower total return" pattern in §2.

7. Per-crypto performance (PDF section + page 9)

| Variant | AAVEUSDC | ARBUSDC | LDOUSDC | OPUSDC | UNIUSDC |
|---|---|---|---|---|---|
| none | 0.058 | **2.730** | **2.697** | 2.019 | 1.567 |
| coarse_3regime | **0.178** | 2.307 | 1.977 | **2.656** | **1.685** |
| per_regime_f1 | -0.204 | 2.688 | 2.479 | 1.970 | 1.197 |
| per_regime_expectancy | 0.053 | 2.618 | 2.253 | 1.862 | 1.204 |
| per_regime_f1_with_floor | -0.279 | 2.564 | 1.844 | 1.494 | 0.980 |

(Sortino values, bold = best variant for that crypto.)

AAVEUSDC actively harmed : both per_regime_f1 and per_regime_f1_with_floor push AAVEUSDC's Sortino into negative territory (-0.204 / -0.279 vs +0.058 baseline). AAVEUSDC has the lowest baseline trade count (164 trades vs 217-340 elsewhere), suggesting per-regime calibration degenerates on small per-fold-per-regime samples — the very failure mode the §4.1 fallback guardrail (< 30 total samples / < 5 positive samples) was designed to catch.
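To make the guardrail concrete, here is a minimal sketch of per-regime F1-optimal threshold selection with the small-sample fallback (< 30 total samples or < 5 positive samples) and an optional floor. This is not the code in per_regime_threshold_calibrator.py : the function names, the threshold grid, the input layout and the reading of the floor as "never go below global - 0.05" are all assumptions made for illustration.

```python
def f1_at(probs, labels, thr):
    """F1 of the BUY class when predicting BUY for prob >= thr."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_per_regime(samples, global_thr, min_total=30, min_pos=5, floor=None):
    """Pick one threshold per regime, falling back to the global one on thin data.

    samples: dict mapping regime name -> list of (probability, label) pairs.
    floor:   optional maximum downward deviation from global_thr (assumed reading).
    """
    grid = [i / 100 for i in range(5, 96)]  # hypothetical search grid
    thresholds = {}
    for regime, rows in samples.items():
        probs = [p for p, _ in rows]
        labels = [y for _, y in rows]
        # Guardrail: too few samples or positives, keep the global threshold.
        if len(rows) < min_total or sum(labels) < min_pos:
            thresholds[regime] = global_thr
            continue
        thr = max(grid, key=lambda t: f1_at(probs, labels, t))
        if floor is not None:
            thr = max(thr, global_thr - floor)
        thresholds[regime] = thr
    return thresholds


# Synthetic data: one well-populated regime, one thin regime.
demo = {
    "trend": [(0.8, 1)] * 20 + [(0.2, 0)] * 20,
    "range": [(0.9, 1)] * 3 + [(0.1, 0)] * 10,
}
plain = calibrate_per_regime(demo, global_thr=0.5)
floored = calibrate_per_regime(demo, global_thr=0.5, floor=0.05)
```

In the synthetic example the thin "range" regime keeps the global threshold, while "trend" gets its own value. The guardrail only protects regimes below the sample cut-offs ; a regime just above them can still yield an unstable threshold, which is consistent with the AAVEUSDC behaviour described above.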

Per-asset gate (≥ 4/5 cryptos improve vs baseline on f1_buy), assessed here on the per-crypto Sortino values above :

| Variant | Cryptos improving Sortino vs none | Verdict |
|---|---|---|
| coarse_3regime | 3/5 (AAVE +0.12, OP +0.64, UNI +0.12) | ❌ |
| per_regime_f1 | 0/5 (ARB -0.04, OP -0.05, LDO -0.22, AAVE -0.26, UNI -0.37) | ❌ |
| per_regime_expectancy | 0/5 (-0.005 to -0.44 across all) | ❌ |
| per_regime_f1_with_floor | 0/5 (smallest loss ARB -0.17, largest LDO -0.85 ; AAVE worst of any variant) | ❌ |

coarse_3regime is the closest to passing (3/5) but still fails the ≥4/5 gate.
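The per-asset count can be recomputed directly from the per-crypto Sortino table as printed. The sketch below does that ; the names are hypothetical, and Sortino is used as a stand-in because per-crypto f1_buy values are not reproduced in this dossier.

```python
# Per-crypto Sortino, copied from the §7 table.
SORTINO = {
    "none": {"AAVEUSDC": 0.058, "ARBUSDC": 2.730, "LDOUSDC": 2.697, "OPUSDC": 2.019, "UNIUSDC": 1.567},
    "coarse_3regime": {"AAVEUSDC": 0.178, "ARBUSDC": 2.307, "LDOUSDC": 1.977, "OPUSDC": 2.656, "UNIUSDC": 1.685},
    "per_regime_f1": {"AAVEUSDC": -0.204, "ARBUSDC": 2.688, "LDOUSDC": 2.479, "OPUSDC": 1.970, "UNIUSDC": 1.197},
    "per_regime_expectancy": {"AAVEUSDC": 0.053, "ARBUSDC": 2.618, "LDOUSDC": 2.253, "OPUSDC": 1.862, "UNIUSDC": 1.204},
    "per_regime_f1_with_floor": {"AAVEUSDC": -0.279, "ARBUSDC": 2.564, "LDOUSDC": 1.844, "OPUSDC": 1.494, "UNIUSDC": 0.980},
}
REQUIRED = 4  # gate: at least 4 of 5 cryptos must improve


def improving_cryptos(sortino, baseline="none"):
    """List, per variant, the cryptos whose Sortino beats the baseline's."""
    base = sortino[baseline]
    return {
        variant: sorted(c for c, value in row.items() if value > base[c])
        for variant, row in sortino.items()
        if variant != baseline
    }


IMPROVING = improving_cryptos(SORTINO)
GATE = {variant: len(cryptos) >= REQUIRED for variant, cryptos in IMPROVING.items()}
```

No entry of `GATE` is `True`, so the per-asset criterion fails for every variant on this proxy.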

8. Stability by fold (PDF heatmap, Sortino)

| Variant | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | per-fold variance |
|---|---|---|---|---|---|---|
| none | 2.29 | 0.32 | 1.97 | 1.55 | 2.52 | 0.609 |
| coarse_3regime | 1.84 | 0.40 | 1.75 | 2.13 | 2.24 | 0.466 |
| per_regime_f1 | 2.38 | 0.56 | 1.09 | 2.01 | 2.09 | 0.493 |
| per_regime_expectancy | 2.23 | 0.76 | 0.88 | 1.62 | 2.49 | 0.539 |
| per_regime_f1_with_floor | 2.11 | 0.35 | 0.89 | 1.33 | 1.93 | 0.443 |

Fold 4 is uniformly weak across all variants (0.32-0.76), likely a tough market regime in that period (a cross-Track artefact, not a Track 9 specific issue). All variants show high per-fold variance (Sortino variance 0.44-0.61, far above the 0.05 bound that the F1 plan §6 stability gate sets on f1_buy), so the stability reading is dominated by data-period effects, not the per-regime calibration choice.

9. Gate evaluation per F1_BUY_BOOST_PLAN.md §6

| Criterion | per_regime_f1 | per_regime_expectancy | per_regime_f1_with_floor | coarse_3regime |
|---|---|---|---|---|
| f1_buy lift ≥ +0.020 with CI95 excluding 0 | ❌ Δ=-0.004 | ❌ Δ=-0.015 | ❌ Δ=-0.016 | ❌ Δ=-0.010 |
| Joint metric : Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % | ❌ d_sortino=+0.06 (towards none) | ❌ d_sortino=+0.10 (towards none) | ❌ d_sortino=+0.45 (towards none) | ❌ d_sortino=+0.11 (towards none) |
| Stability : per-fold variance of f1_buy ≤ 0.05 | ⚠ all-variants high (Sortino var ~0.5) | ⚠ same | ⚠ same | ⚠ same |
| Per-asset : f1_buy improves on ≥ 4/5 cryptos (Sortino proxy, §7) | ❌ 0/5 | ❌ 0/5 | ❌ 0/5 | ❌ 3/5 |
| Sample size : ≥ 50 BUY trades / fold | ✅ all variants ≥ 100 trades/fold | ✅ same | ✅ same | ✅ same |
| MLOps : documentation/stories/CVN-N001-EE-S03/mlops_readiness.md complete | ✅ (filed in PR #791) | ✅ same | ✅ same | ✅ same |
| Lock rule (PDF) : ≥ 2 metrics with BH p<0.05 AND \|d\|≥0.3 in winner direction | ❌ 0/4 | ❌ 0/4 | ❌ 0/4 | ❌ 0/4 |

Verdict per criterion : every primary criterion fails for every variant. coarse_3regime is the closest to baseline (3/5 cryptos improve) but still fails the ≥4/5 gate AND the f1_buy lift gate AND the lock rule.

10. Why per-regime threshold didn't lift Sortino on this data

§5 of the plan dossier proposed four candidate mechanisms. The PDF data supports two of them (H1, H2), refutes one (H3) and partially supports one (H4) :

  1. Hypothesis 1 (regime detector noise) : supported. Baseline none beats three of the four per-regime variants on all five cryptos (§7), even though those variants are more selective (lower raw_buy, higher Win Rate). If regime labels were sharp + correct, per-regime should at least match baseline ; instead, coarse_3regime (which collapses 6 noisy regimes into 3 bigger buckets) is the best per-regime variant. Mechanism : RegimeDetector.heuristic_v1 (ATR + slope rules) produces noisy labels at the per-fold sample budget ; per-regime calibration averages over noise + small-sample sentinels.
  2. Hypothesis 2 (selectivity loses absolute return) : strongly supported. Win Rate is higher on every variant (55.3 % baseline, up to 60.7 %) while Total Return is lower on every variant (39.1 % baseline, down to 28.2 %). Per-regime variants reject borderline opportunities that, in aggregate, were positive-EV (the rejected BUYs had average win rates between baseline and the per-regime threshold, and their cumulative contribution to Sortino was positive). The Concurrency-block-rate drop is consistent with this : per-regime variants are NOT generating MORE high-quality BUYs ; they are just filtering out more LOW-quality BUYs, and the filter cuts too deep.
  3. Hypothesis 3 (floor as regulariser) : refuted. per_regime_f1_with_floor is the WORST variant (Sortino 1.320, Total Return 28.2 %). The 0.05 floor on threshold deviation from global is too tight ; it forces the per-regime thresholds back near global on most regimes, and where it doesn't, the deviation is in a worse direction.
  4. Hypothesis 4 (coarse 3-regime sweet spot) — partially supported. coarse_3regime is the closest to baseline (Sortino 1.723 vs 1.767, only -0.044 ; Total Return 38.0 % vs 39.1 %, -1.1pp) and the per-asset gate is closest to passing (3/5). Halving the parameter count helped, but didn't recover the loss → the regime-detector signal-to-noise ratio is the upstream bottleneck, not the parameter count.

Cross-Track lesson : Track 9 (calibration tier) joins Tracks 5 (label smoothing) and 6 (focal loss) as ABANDONED : none of the calibration / training-signal manipulation tiers yields positive lift on this dataset, so the F1 plan §6 cross-track lesson stands. The F1 lift will need to come from the data tier (Track 1, In progress, split-PR per ADR-0079 invariant 9) OR the architecture tier (Track 11, In progress, split-PR per ADR-0079 invariant 9). Both Stories stay In progress until their block A follow-up PRs ship ; see F1 plan §10 tracking table for canonical row-level status.

11. Decisions

11.1 Lock variant — NO LOCK

factor_per_regime_threshold=none (current global F1-optimal threshold) remains the production baseline. No Console flip. ftf_config.base_env unchanged.

11.2 Verdict — ABANDON

The per-regime hypothesis is not supported at the current dataset / regime detector. Track 9 closes following the same pattern as Tracks 5/6 :

  • ✅ Implementation code stays in tree (src/commun/trading/per_regime_threshold_calibrator.py + tests). Useful for ad-hoc experiments and as the reference implementation if a future regime detector ships materially better labels.
  • ✅ FTF factor per_regime_threshold stays in MODEL_FACTORS (mirrors Track 5/6 pattern). Future operator-triggered sweeps can re-evaluate if (a) the regime detector is upgraded to a denoising or learned classifier, OR (b) the dataset grows enough to bring per-fold-per-regime samples into the lock-rule's effect-size detection floor.
  • ❌ No champion_per_regime_* model registered. No promotion gate to schedule.
  • ❌ No quarterly re-fit cadence. Re-evaluation is operator-triggered, not scheduled.

11.3 Cross-Track interaction notes

  • Track 1 (BTC features, In testing) : if Track 1 LOCKs and adds the BTC enrichment features, the model's per-regime probability calibration will shift. The per-regime threshold could be re-evaluated post-Track-1-LOCK on the BTC-enriched feature set ; today's ABANDON does not preclude this.
  • Track 11 (ensemble diversity, PR #793 review) : same logic. If Track 11 LOCKs a stacked variant, the ensemble's P(BUY) distribution will differ from the single-XGB baseline ; the per-regime threshold optimum can shift. Re-evaluate post-Track-11-LOCK if the operator deems it worth the compute.
  • Regime detector upgrade : if a future Story upgrades RegimeDetector (e.g., from heuristic_v1 to a learned classifier), the per-regime threshold becomes a candidate for re-evaluation as the noisy-label hypothesis (§10 H1) would no longer apply.

11.4 Hidden recommendation : per_regime_expectancy for capital-preservation use-cases

Not a Track 9 LOCK candidate, but worth recording : per_regime_expectancy posts the best Win Rate (60.7 %) AND best Max DD (-8.9 %) of the matrix. It would be a candidate for a capital-preservation objective (lower DD, higher win rate, willingly trading total return for stability). This isn't this Story's mandate (the F1 plan optimises for Sortino + f1_buy), but the operator could file a follow-up Story under a different mission (e.g., the "filter tuning" / capital preservation tier) that targets per_regime_expectancy as a candidate variant. Out of scope for wp#42 closure.

12. Sprint version + OP closure

12.1 OP wp#42 transition

Per ADR-69 + workflow §14 :

  • Story status In testing → Closed
  • OP comment to add :
Track 9 (per_regime_threshold) sweep ftf_20260430_194027_3d0171_ATR0.5_1.5_H4 completed
with verdict ABANDON. No Console flip; baseline `none` retained.

Results dossier: documentation/missions/ml-boost/2026-05-01-track9-per-regime-threshold-results.md
Implementation PR: #791 (squash 2a997d81, merged 2026-04-30)

Lock rule (PDF executive summary): 0/4 metrics agree on any variant vs baseline.
Largest pairwise effect: per_regime_f1_with_floor d=+0.45 in baseline's favor (p-adj=0.21, NS).
AAVEUSDC actively harmed by per_regime_f1 + per_regime_f1_with_floor (negative Sortino).

Implementation stays in tree per Track 5/6 precedent; FTF factor remains in MODEL_FACTORS
for future re-evaluation if the regime detector is upgraded OR the dataset grows.

Verdict: ABANDON; gate decision: keep_available implementation, NO LOCK.

12.2 Sprint version closure check

Per workflow §14 : if wp#42 is the last open Story in its sprint version, follow §16.4 — gate review + close version + retro. Operator to check OP UI and apply.

12.3 Memory entry

No new durable lesson from this run : the cross-Track lesson "training-signal manipulation tier doesn't lift on this data" was already captured by Tracks 5 + 6. No feedback_*.md memory write needed.

A memory entry MIGHT be warranted for the project layer : "calibration tier confirmed ABANDONED ; F1 lift must come from data (Track 1) or architecture (Track 11) tiers". This is project-state, not a behavioural lesson — already implicit in the F1 plan §6 outcomes table once it's updated post-run.


Sign-off checklist (gate before OP wp#42 closure)

  • §1-§9 populated with actual sweep numbers from PDF report ftf_report_ftf_20260430_194027_3d0171_ATR0.5_1.5_H4.pdf
  • §10 hypothesis pick — H1 (regime detector noise) + H2 (selectivity loses absolute return) supported by data
  • §11.1-§11.2 verdict recorded : ABANDON, no Console flip
  • §11.3 cross-Track interaction noted (Track 1 + Track 11 future re-evaluation hooks)
  • §11.4 hidden recommendation captured (per_regime_expectancy for capital-preservation follow-up)
  • OP wp#42 status In testing → Closed with comment from §12.1 : operator action (needs OPENPROJECT_API_KEY)
  • Sprint version closure gate evaluated per workflow §14 — operator action
  • F1 plan §6 outcomes table updated with Track 9 ABANDON entry — operator action OR follow-up docs PR