Plan dossier, Track 9 : Per-regime threshold optimization
Date : 2026-04-29
Story : CVN-N001-EE-S03 (OP wp#42)
GH issue : #714
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2 — quick-win bundle Track 3, calibration tier (different lever from Track 5/6 ABANDONED loss-function attempts)
Sequencing : per F1_BUY_BOOST_PLAN.md §6 Phase 1 — Track 9 is the next pick after Track 5 + Track 6 closures (both ABANDONED).
Review history
- v1 (committee `c560b67a`) : PASSED / EXECUTION_RISK, strong consensus (5/5 experts). 0 blockers. 11 recommendations triaged in §11 below : 4 applied pre-impl (refined guardrail, realistic costs, position concentration metric, regime detector version pinning), 7 applied at impl time (observability + alerts) or deferred to follow-up Stories.
1. Context : why now, why this lever
Tracks 5 (label smoothing + cleanlab, both branches ABANDONED) and 6 (focal loss, all 4 variants ABANDONED) showed a consistent shape : training signal manipulation does not help at the current dataset / labelling regime. Cohen's d ∈ [-1.8, -1.1] in the wrong direction across both Tracks ; the cross-track lesson recorded in F1_BUY_BOOST_PLAN.md §6 Outcomes explicitly pivots away from loss/label tuning.
Track 9 is the calibration tier (tier 5 of the F1 plan) — a different lever entirely. Instead of changing the model's internal training signal, it changes the post-inference decision threshold conditional on the market state (regime). The model produces p(BUY) ∈ [0, 1] ; the existing global ThresholdCalibrator (committee 7371c57d + 825d2fdf, see src/commun/trading/threshold_calibrator.py) picks one F1-optimal threshold per model version. Track 9 hypothesizes that one threshold cannot be optimal across all market regimes — the model is over-confident in trending markets and under-confident in volatile / transition markets, so the threshold should adapt.
The CVNTrade pipeline already classifies market regimes via src/commun/regime/regime_detector.py into 6 codes :
| Regime code | Macro-state |
|---|---|
| `TREND_BULL` | Sustained directional uptrend, low volatility |
| `TREND_BEAR` | Sustained directional downtrend, low volatility |
| `RANGE_CALM` | Sideways, narrow bands, low volatility |
| `RANGE_VOLATILE` | Sideways, wide bands, high volatility |
| `TRANSITION_UP` | Pivot from down/range to bullish trend |
| `TRANSITION_DOWN` | Pivot from up/range to bearish trend |
The classification already drives a regime filter in the post-inference filter chain (per Filter Funnel architecture step 5). Track 9 pushes regime-awareness one step earlier — into the threshold decision itself.
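To make the lever concrete, here is a minimal sketch of the decision rule Track 9 would change: the same p(BUY) is compared against a threshold looked up by regime code instead of a single global value. The function name `decide_buy`, the `>=` comparison, and the behavior for regimes without a fitted threshold are illustrative assumptions, not the actual `apply_thresholds` implementation.

```python
from typing import Mapping

# The six regime codes listed in the table above.
REGIMES = (
    "TREND_BULL", "TREND_BEAR", "RANGE_CALM",
    "RANGE_VOLATILE", "TRANSITION_UP", "TRANSITION_DOWN",
)


def decide_buy(
    p_buy: float,
    regime: str,
    per_regime: Mapping[str, float],
    global_threshold: float,
) -> bool:
    """Return True when p(BUY) clears the threshold for the detected regime.

    Assumption: a regime without a fitted threshold uses the global one.
    """
    if regime not in REGIMES:
        raise ValueError(f"unknown regime code: {regime!r}")
    threshold = per_regime.get(regime, global_threshold)
    return p_buy >= threshold
```

With a hypothetical 0.60 threshold for `TREND_BULL` and a 0.50 global threshold, a prediction of p(BUY) = 0.55 is rejected in `TREND_BULL` but accepted in a regime that has no fitted threshold.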
2. Hypothesis (falsifiable)
Per-regime thresholds lift f1_buy materially over a global threshold at the current dataset / labelling / model regime. Specifically :
- H0 (null) : `mean(f1_buy | per_regime_threshold) - mean(f1_buy | global_threshold)` is indistinguishable from 0 (CI95 includes 0) → ABANDON.
- H1 (alternative) : Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.
The hypothesis is falsifiable per the same gate criteria as Tracks 5 / 6.
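The H1 clauses can be checked mechanically. The sketch below is one possible reading, under stated assumptions: the inputs are per-fold Δf1_buy values (variant minus baseline), the CI is a percentile bootstrap over folds, and Cohen's d is the paired flavor (mean of the deltas over their sample standard deviation). The dossier does not specify the bootstrap scheme or the d variant, so all function names and defaults here are hypothetical.

```python
import numpy as np


def bootstrap_ci_mean(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-fold delta f1_buy values."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample folds with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def cohens_d_paired(deltas):
    """Paired effect size: mean delta over sample std (assumed flavor)."""
    deltas = np.asarray(deltas, dtype=float)
    sd = deltas.std(ddof=1)
    if sd == 0.0:
        # Degenerate case: identical deltas across folds.
        return float("inf") if deltas.mean() > 0 else 0.0
    return float(deltas.mean() / sd)


def h1_holds(deltas, per_crypto_delta, min_lift=0.015, min_d=0.3, min_assets=4):
    """True only when every H1 clause listed above is satisfied."""
    lo, _hi = bootstrap_ci_mean(deltas)
    n_improved = sum(1 for d in per_crypto_delta.values() if d > 0)
    return bool(
        np.mean(deltas) >= min_lift
        and lo > 0.0  # CI95 excludes 0 on the positive side
        and n_improved >= min_assets
        and cohens_d_paired(deltas) >= min_d
    )
```

Anything short of `h1_holds(...)` returning `True` maps to the ABANDON outcome; a CI that straddles 0 is the H0 case.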
3. Variant matrix
5 variants per the F1 plan §4.2 convention (5 unique configs per FTF factor, including baseline) :
| Variant | What it does | n_thresholds | Aggregation |
|---|---|---|---|
| `none` (baseline) | Existing global F1-optimal threshold from `ThresholdCalibrator` | 1 | n/a |
| `per_regime_f1` | Fit F1-optimal threshold per regime on val set, applied per inference based on detected regime | 6 | per-regime |
| `per_regime_expectancy` | Fit expectancy-net optimal threshold per regime (uses cost formula v3, may reject regimes with negative expectancy) | 6 | per-regime |
| `per_regime_f1_with_floor` | `per_regime_f1` + global floor : `max(per_regime, global - 0.05)` to avoid runaway thresholds in low-sample regimes | 6 | per-regime + safety floor |
| `coarse_3regime` | Collapse regimes into 3 buckets (trend / range / transition) : fewer params, less overfitting | 3 | per-bucket |
5 variants. Per-regime aggregation across folds : median (committee reco 6, same as global threshold aggregation in aggregate_across_folds).
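Three mechanics from the matrix are small enough to sketch: the safety floor, the 6-to-3 regime collapse, and the median aggregation across folds. The bucket mapping below is inferred from the regime names (an assumption; the dossier only names the three buckets), and the helper names are hypothetical rather than the existing `aggregate_across_folds` API.

```python
from statistics import median

# Assumed mapping from the 6 regime codes to the 3 coarse buckets.
COARSE_BUCKETS = {
    "TREND_BULL": "trend",
    "TREND_BEAR": "trend",
    "RANGE_CALM": "range",
    "RANGE_VOLATILE": "range",
    "TRANSITION_UP": "transition",
    "TRANSITION_DOWN": "transition",
}


def apply_floor(per_regime, global_threshold, floor=0.05):
    """per_regime_f1_with_floor: max(per_regime, global - floor) for each regime."""
    return {
        regime: max(threshold, global_threshold - floor)
        for regime, threshold in per_regime.items()
    }


def median_across_folds(fold_thresholds):
    """Median per regime over the folds in which that regime has a threshold."""
    regimes = set().union(*fold_thresholds)
    return {
        regime: median(f[regime] for f in fold_thresholds if regime in f)
        for regime in sorted(regimes)
    }
```

Note that the floor only bounds thresholds from below: a per-regime fit of 0.40 against a global 0.50 is lifted to 0.45, while a fit of 0.62 passes through unchanged.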
4. Implementation path
4.1 Extend ThresholdCalibrator to per-regime mode
In src/commun/trading/threshold_calibrator.py :
- New class `PerRegimeThresholdCalibrator` that wraps `ThresholdCalibrator.fit()` per regime slice of the val set.
- Slice the val set by regime via `regime_detector.classify_regime`, the same regime tagger used in the production filter chain.
- Per-regime sample size guardrail (refined per committee `c560b67a` reco #2) : a regime falls back to the global threshold if either of these conditions holds : `total_samples < 30` OR `positive_samples (BUY=1) < 5`. The positive-sample floor protects against degenerate F1 fits in regimes that are total-volume-OK but BUY-rate-near-zero. Both fallback conditions log loudly per ADR-25 with `event=per_regime_threshold_fallback regime=... reason=insufficient_samples|insufficient_positives total_n=... positive_n=...`.
- Persist as MLflow artefact `per_regime_threshold_calibrator.json` alongside the existing `threshold_calibrator.json` ; same `from_metadata` schema with a new `version=2` (per-regime keys). Pin the regime detector version (per committee `c560b67a` reco #8) : the artefact records `regime_detector_version: "heuristic_v1"` (current `RegimeSnapshot.regime_version`) ; loading the artefact under a different regime detector version raises `RuntimeError` per ADR-25 (no silent breakage if regimes are redefined later).
- Realistic cost integration in the `per_regime_expectancy` variant (per committee `c560b67a` reco #5) : the expectancy fit uses the same cost formula as F1 plan §4 : `gross_pnl - taker_fee_bps - spread_bps - slippage_bps - funding_bps`. Round-trip ≈ 45 bps interim assumption (pending Track 2 dynamic slippage). The expectancy threshold per regime will reject regimes where even the optimal threshold yields negative expectancy ; the rejection is logged as `event=regime_rejected reason=negative_expectancy regime=... best_threshold=... best_expectancy_bps=...`.
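The sample-size guardrail and its fallback logging can be sketched as follows. This is a standalone illustration, not the `PerRegimeThresholdCalibrator` itself: the grid-search F1 fit stands in for `ThresholdCalibrator.fit()` (whose signature is not shown in this dossier), and the function names, threshold grid, and logger name are assumptions.

```python
import logging

import numpy as np

log = logging.getLogger("per_regime_threshold")

MIN_TOTAL = 30      # total_samples floor from the guardrail
MIN_POSITIVES = 5   # positive_samples (BUY=1) floor from the guardrail


def best_f1_threshold(p, y, grid=None):
    """Grid-search stand-in for the real F1-optimal fit (assumption)."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    best_t, best_f1 = float(grid[0]), -1.0
    for t in grid:
        pred = p >= t
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == 0))
        fn = np.sum(~pred & (y == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict: ties keep the lowest threshold
            best_t, best_f1 = float(t), float(f1)
    return best_t


def fit_per_regime(p, y, regimes, global_threshold):
    """Fit one threshold per regime, falling back to global when data is thin."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=int)
    regimes = np.asarray(regimes)
    thresholds = {}
    for regime in np.unique(regimes):
        mask = regimes == regime
        total_n, positive_n = int(mask.sum()), int(y[mask].sum())
        reason = None
        if total_n < MIN_TOTAL:
            reason = "insufficient_samples"
        elif positive_n < MIN_POSITIVES:
            reason = "insufficient_positives"
        if reason is not None:
            log.warning(
                "event=per_regime_threshold_fallback regime=%s reason=%s "
                "total_n=%d positive_n=%d",
                regime, reason, total_n, positive_n,
            )
            thresholds[str(regime)] = global_threshold
        else:
            thresholds[str(regime)] = best_f1_threshold(p[mask], y[mask])
    return thresholds
```

Checking the total-sample floor first means a regime that fails both conditions is reported as `insufficient_samples`; that ordering is a choice made here, not one the dossier prescribes.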
4.2 Inference path
In src/commun/inference/inference_api.py (apply_thresholds) :
- If `CVN_THRESHOLD_PER_REGIME=1`, call `regime_detector.classify_regime` on the inference window and use the per-regime threshold ; otherwise use the global threshold (current behavior).
- Emit structured event `event=threshold_applied regime=<code> threshold=<value> source=per_regime|global` per ADR-32.
- Fail-fast (per ADR-25) if `CVN_THRESHOLD_PER_REGIME=1` is set but the loaded model artefact has no per-regime calibrator.
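A minimal sketch of the routing described above, assuming the per-regime calibrator has already been loaded into a plain dict of regime code to threshold. The function name `resolve_threshold` and its signature are hypothetical; the real `apply_thresholds` may be structured differently. Raising on a regime missing from the calibrator is also an assumption, in the spirit of the ADR-25 fail-fast rule.

```python
import logging
import os

log = logging.getLogger("inference")


def resolve_threshold(regime, global_threshold, per_regime_thresholds=None, env=None):
    """Pick the decision threshold and emit the structured event."""
    env = os.environ if env is None else env
    if env.get("CVN_THRESHOLD_PER_REGIME", "0") != "1":
        source, threshold = "global", global_threshold
    else:
        if per_regime_thresholds is None:
            # Fail fast: the flag is set but no per-regime calibrator was loaded.
            raise RuntimeError(
                "CVN_THRESHOLD_PER_REGIME=1 but the model artefact has no "
                "per-regime calibrator"
            )
        if regime not in per_regime_thresholds:
            # Assumption: an unknown regime is an error, not a silent fallback.
            raise RuntimeError(f"no per-regime threshold for regime={regime}")
        source, threshold = "per_regime", per_regime_thresholds[regime]
    log.info(
        "event=threshold_applied regime=%s threshold=%.4f source=%s",
        regime, threshold, source,
    )
    return threshold
```

Passing `env` explicitly keeps the function testable without mutating the process environment, which also covers the reproducer-style wiring assertion planned for the test suite.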
4.3 FTF factor + guardrail
Add factor=per_regime_threshold to src/commun/finetune/ablation_matrix.py (per ADR-56) with the 5 variants above. Each variant gates a CVN_THRESHOLD_* env var combination :
| Variant | env vars |
|---|---|
| `none` | (defaults : global F1 threshold from existing calibrator) |
| `per_regime_f1` | `CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary` |
| `per_regime_expectancy` | `CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=expectancy` |
| `per_regime_f1_with_floor` | `CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary`, `CVN_THRESHOLD_PER_REGIME_FLOOR=0.05` |
| `coarse_3regime` | `CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_GROUPING=coarse` |
Guardrail in src/commun/finetune/guardrails.py (per ADR-58) :
- CVN_THRESHOLD_PER_REGIME_METHOD ∈ {f1_binary, expectancy} — reject other values
- CVN_THRESHOLD_PER_REGIME_FLOOR ∈ [0.0, 0.2] — reject silly values
- CVN_THRESHOLD_PER_REGIME=1 ⇒ a per-regime calibrator artefact MUST exist (fail-fast at inference if missing, per ADR-25)
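The first two guardrail rules are pure env var validation and could look like the sketch below. The function name `validate_threshold_env` and the choice to return a list of error strings are assumptions; the existing `guardrails.py` conventions are not shown in this dossier. The third rule (artefact presence) needs the loaded model and is left to the inference-time check.

```python
ALLOWED_METHODS = {"f1_binary", "expectancy"}
FLOOR_MIN, FLOOR_MAX = 0.0, 0.2


def validate_threshold_env(env):
    """Return a list of guardrail violations (empty when the config is valid)."""
    errors = []

    method = env.get("CVN_THRESHOLD_PER_REGIME_METHOD")
    if method is not None and method not in ALLOWED_METHODS:
        errors.append(
            f"CVN_THRESHOLD_PER_REGIME_METHOD={method!r} not in {sorted(ALLOWED_METHODS)}"
        )

    floor_raw = env.get("CVN_THRESHOLD_PER_REGIME_FLOOR")
    if floor_raw is not None:
        try:
            floor = float(floor_raw)
        except ValueError:
            errors.append(f"CVN_THRESHOLD_PER_REGIME_FLOOR={floor_raw!r} is not a number")
        else:
            # The chained comparison is False for NaN, so NaN is rejected too.
            if not FLOOR_MIN <= floor <= FLOOR_MAX:
                errors.append(
                    f"CVN_THRESHOLD_PER_REGIME_FLOOR={floor} outside "
                    f"[{FLOOR_MIN}, {FLOOR_MAX}]"
                )
    return errors
```

Collecting all violations rather than raising on the first one lets a single FTF run report every misconfigured variable at once.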
4.4 Tests
- `tests/unit/training/test_per_regime_threshold_calibrator.py` : unit tests for `PerRegimeThresholdCalibrator.fit/predict`, including the < 30 sample fallback path
- `tests/integration/test_track9_per_regime_threshold.py` : end-to-end with the 5 variants on a small synthetic dataset, asserts per-variant determinism + correct env var routing
- `tests/unit/test_ftf_guardrails.py` : extend with the new env var validation
- Reproducer-style assertion : a sample with regime=`TREND_BULL` AND `CVN_THRESHOLD_PER_REGIME=1` MUST use the bull threshold, not the global one (catches a regression of the wiring)
4.5 Observability + MLOps readiness
- New event `event=threshold_applied regime=... threshold=... source=...` indexed in Loki (per ADR-32)
- Grafana panel "Threshold by regime" : shows the per-regime thresholds applied over time + how often each regime fires (validates the regime detector is producing balanced data, not stuck on one regime)
- MLOps readiness file `documentation/stories/CVN-N001-EE-S03/mlops_readiness.md` filled per ADR-70 before merge
- Runbook `documentation/runbooks/runbook_per_regime_threshold_drift.md` (P2) : alert if any regime threshold drifts > 2σ from its training value over 7 d (suggests model behavior in that regime has changed)
5. Acceptance gate (per F1 plan §6)
The 6 official gates apply :
| Gate | Threshold |
|---|---|
| F1_buy lift | mean Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0 |
| Joint metric | Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % |
| Stability | per-fold variance of f1_buy ≤ 0.05 |
| Per-asset | f1_buy improves on ≥ 4/5 cryptos |
| Sample size | ≥ 50 BUY trades / fold |
| MLOps | documentation/stories/CVN-N001-EE-S03/mlops_readiness.md complete |
Supporting metric (per committee c560b67a reco #11) — position concentration per regime : max_position_size_per_regime ≤ 2 × global_average. Captured in the results dossier ; not a hard gate (no block on lock) but flagged for post-deploy monitoring if a regime ends up with concentration > 2× average.
If every gate clears → operator decision lock (Console flip the chosen variant in ftf_config.base_env). If any gate fails on every variant → abandon. If a variant clears AT MOST one gate beyond the F1_buy gate → keep available (no production lock, but available for joint variants in future tracks).
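The six gates can be evaluated as a flat set of boolean checks. The sketch below is one hedged reading of the table: the `VariantResult` fields are hypothetical, `Δmax_drawdown ≤ +1 %` is interpreted as percentage points, and "per-fold variance" is taken as the sample variance; none of these interpretations is confirmed by the F1 plan text quoted here.

```python
from dataclasses import dataclass
from statistics import mean, variance


@dataclass
class VariantResult:
    delta_f1_per_fold: list        # f1_buy(variant) - f1_buy(baseline), per fold
    delta_f1_ci95: tuple           # (low, high) bootstrap bounds on the mean delta
    f1_buy_per_fold: list          # raw f1_buy of the variant, per fold
    delta_expectancy: float
    delta_sortino: float
    delta_max_drawdown_pct: float  # assumed to be in percentage points
    n_assets_improved: int         # out of 5 cryptos
    buy_trades_per_fold: list
    mlops_readiness_complete: bool


def evaluate_gates(r):
    """Map each official gate to True (cleared) or False (failed)."""
    ci_low, _ci_high = r.delta_f1_ci95
    return {
        "f1_buy_lift": mean(r.delta_f1_per_fold) >= 0.015 and ci_low > 0.0,
        "joint_metric": (
            r.delta_expectancy >= 0
            and r.delta_sortino >= 0
            and r.delta_max_drawdown_pct <= 1.0
        ),
        "stability": variance(r.f1_buy_per_fold) <= 0.05,
        "per_asset": r.n_assets_improved >= 4,
        "sample_size": min(r.buy_trades_per_fold) >= 50,
        "mlops": r.mlops_readiness_complete,
    }
```

A variant is a lock candidate only when `all(evaluate_gates(result).values())` is true; the position-concentration metric stays outside this function because it is a supporting metric, not a gate.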
6. Out of scope
- Full regime grid sweep (every regime × every confidence cutoff) — would be ~20 variants, exceeds the 5-variant FTF convention. Stick with the 5 above ; if results are encouraging, a follow-up Story can extend.
- Online regime adaptation (re-fit per-regime thresholds at runtime as regimes shift) — premature ; we don't have evidence the offline-fit thresholds work yet.
- Regime detector improvements : the existing `regime_detector.py` (heuristic v1) is treated as a black box. If this Track ABANDONS, the operator may revisit whether the regimes themselves are well-defined ; if it LOCKS, the regime detector becomes load-bearing and gets its own follow-up Story for hardening.
- Per-regime training (separate model per regime) : different lever, much more invasive ; reserved for Track 11 (ensemble diversity) if Track 9 doesn't deliver.
7. Falsifiability + rollback
- Falsifiability : the gate criteria above (especially F1_buy CI95 + per-asset 4/5) are pre-registered. If the FTF sweep produces Δf1 ∈ [-0.01, +0.01] with CI95 including 0, that's the H0 outcome — ABANDON cleanly.
- Rollback : if Track 9 LOCKs and a production regression appears (per the new runbook drift alert) :
- Console flip `CVN_THRESHOLD_PER_REGIME=0` (per ADR-59) : fully reverts to global threshold within 1 minute, no model retrain needed
- The per-regime calibrator artefact stays in MLflow but is unused
- Issue a hotfix PR if the bug is in the calibrator code itself
- Per-regime calibrator artefact persists per-version, so rolling back the model also rolls back the threshold.
8. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Some regimes have < 30 val samples → unreliable per-regime threshold | High | Medium | Per-regime sample size guardrail + fallback to global (§4.1). Loud log + Loki alert when fallback fires |
| Regime detector misclassifies → wrong threshold applied | Medium | Medium | Existing regime filter chain step (post-inference) provides a second-line check. If misclassification rate > 5 % on val set, abandon and revisit regime detector |
| Per-regime overfitting to val set | Medium | Medium | Bootstrap CI95 over folds + median aggregation per committee reco 6. Floor variant (per_regime_f1_with_floor) caps the deviation from global |
| Track 9 LOCKs but post-deploy production regimes shift (concept drift) | Medium | High | Drift runbook + alert (§4.5). Quarterly re-fit cadence per #709 MLOps readiness template §3 (data drift section) |
| Cross-track confound : Track 6 is In testing (gate decision pending) ; running Track 9 before Track 6 closes risks attribution | Low | Medium | Track 6 verdict is now ABANDONED (closed 2026-04-29). Sequencing is clean |
9. Why this is not the next loss-function attempt
The Track 5 + Track 6 lesson narrows specifically to training signal manipulation (label engineering + loss tuning). Track 9 is :
- Post-training (operates on already-trained model outputs) — the model itself is unchanged
- Calibration-tier (tier 5 of the F1 plan, distinct from tier 2 LABEL ENGINEERING + tier 3 LOSS FUNCTION)
- Decision-rule level (changes when the model's `p(BUY)` triggers a BUY action, not how the model produces `p(BUY)`)
If Track 9 also abandons, the lesson generalizes more strongly and the next pick should pivot to data tracks (Track 1 BTC features, Track 12 fractional diff) per the F1 plan §6 implication block. If Track 9 locks, calibration becomes a productive lever and Track 11 (ensemble + per-regime ensembles) becomes naturally aligned.
10. Cross-references
- F1 plan §5 Track 9 : "Threshold optimization per-régime" + §6 sequencing
- ADRs : ADR-25 (no silent fallback), ADR-32 (event=key=value structured logs), ADR-56 (every change FTF-testable), ADR-58 (every factor → guardrail + integration test), ADR-70 (MLOps readiness mandatory)
- Existing infra : `src/commun/trading/threshold_calibrator.py` (will be extended), `src/commun/regime/regime_detector.py` (regime classifier reused as-is), `src/commun/inference/inference_api.py::apply_thresholds` (inference surface)
- Sister Tracks : Track 5 results (ABANDON), Track 6 results (ABANDON, dossier landing in PR #788 at `documentation/missions/ml-boost/2026-04-29-track6-focal-loss-results.md`)
- Production filter chain : `architecture/FILTER_FUNNEL.md` (Step 5 = regime filter, post-inference)
- Design doc that introduced `ThresholdCalibrator` : `design/CVN-N001-threshold-calibrator.md` (committee `7371c57d` + `825d2fdf`)
11. Committee recommendations triage (post PASSED / EXECUTION_RISK)
| # | Recommendation | Source | Disposition |
|---|---|---|---|
| 1 | Runtime monitoring + confidence gating for regime detector | All experts | Apply at impl time — emit event=regime_classified regime=... confidence=... (the RegimeSnapshot already carries confidence). Add Grafana panel + alert if confidence < 0.6 for > 10 % of inferences over 1 h. Confidence < 0.5 falls back to global threshold (extends §4.1 fallback path) |
| 2 | Refined per-regime sample size guardrail (positive labels) | Data-Scientist + ML-Eng | Applied pre-impl in §4.1 step 3 — fallback condition is total < 30 OR positives < 5, both logged with explicit reasons |
| 3 | Pre-deployment regime stability validation | Architect + Ops | Apply at impl time — pre-merge check : compare regime frequencies on val set vs last 30 d of inference data, fail if any regime drifts by > 20 % proportionally. Extends make qa to call a new check_regime_stability.py script |
| 4 | Observability for expectancy-based rejections | Crypto-Trader | Applied pre-impl in §4.1 step 5 — event=regime_rejected reason=negative_expectancy ... emitted ; alert if any single regime is rejected on > 50 % of folds (suggests systemic issue, not noise) |
| 5 | Realistic costs in expectancy calculation | Crypto-Trader + Ops | Applied pre-impl in §4.1 step 5 — uses the F1 plan §4 cost formula (round-trip ≈ 45 bps interim, replaceable by Track 2 dynamic slippage when it lands) |
| 6 | Regime detector hardening as follow-up | All experts | Defer — already documented in §6 Out of scope as a post-LOCK Story. If Track 9 ABANDONS, this point becomes moot ; if it LOCKS, file a follow-up Story under CVN-N001-EE (next sprint) before live promotion |
| 7 | Monitor per-regime F1 stability + variance during sweep | Data-Scientist + ML-Eng | Apply at impl time — extend the FTF results dossier table to include per-regime f1_buy + variance per fold + per crypto. Operator reads these in the gate decision |
| 8 | Regime detector version pinning | Ops | Applied pre-impl in §4.1 step 4 — calibrator artefact records regime_detector_version ; mismatch raises RuntimeError per ADR-25 |
| 9 | Ablation for regime grouping (6 vs 3) | Data-Scientist + ML-Eng | Already covered — the variant matrix (§3) explicitly includes coarse_3regime as variant 5. The FTF sweep IS the ablation between 6 (per_regime_f1) and 3 (coarse_3regime) |
| 10 | Stress-test thresholds under regime transitions | Ops + ML-Eng | Apply at impl time — integration test test_track9_per_regime_threshold.py includes a stress case where 20 % of val samples are forced into adjacent regimes (simulates regime detector misclassification) ; assert f1_buy degrades gracefully (< 5 % loss vs clean classification) |
| 11 | Position concentration per regime | Crypto-Trader | Applied pre-impl in §5 — added as a supporting metric in the gate criteria block (not gating, flagged for post-deploy monitoring) |
Net effect on §4 implementation path : 4 recos applied directly (refined positive-sample guardrail #2, realistic costs in expectancy fit #5, regime detector version pinning #8, position concentration tracking #11) + 6 to apply at impl time (observability + alerts + stability checks + stress test) + 1 deferred to a follow-up Story (#6 regime detector hardening, contingent on Track 9 LOCKing). The EXECUTION_RISK code is acknowledged ; the production behavior is hardened end-to-end without scope expansion that would warrant a v2.
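Among the impl-time items, the pre-merge regime stability check (reco #3) is the most mechanical. A possible shape for the planned `check_regime_stability.py` logic is sketched below, assuming "drifts by > 20 % proportionally" means relative change in a regime's share of samples between the val set and recent inference data. The function names and the handling of regimes unseen in the val set are assumptions.

```python
from collections import Counter


def regime_frequencies(labels):
    """Share of samples per regime code."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no regime labels provided")
    return {regime: n / total for regime, n in counts.items()}


def drifted_regimes(val_labels, live_labels, max_rel_drift=0.20):
    """Return {regime: relative drift} for regimes beyond the tolerance."""
    val = regime_frequencies(val_labels)
    live = regime_frequencies(live_labels)
    drifted = {}
    for regime, f_val in val.items():
        rel = abs(live.get(regime, 0.0) - f_val) / f_val
        if rel > max_rel_drift:
            drifted[regime] = rel
    # Assumption: a regime seen live but never in val counts as unbounded drift.
    for regime in live.keys() - val.keys():
        drifted[regime] = float("inf")
    return drifted
```

A pre-merge script could exit non-zero whenever `drifted_regimes(...)` is non-empty, which matches the "fail if any regime drifts" wording of the recommendation.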
Question for the committee (v1, see verdict above)
Validate the 5-variant FTF matrix (`none`, `per_regime_f1`, `per_regime_expectancy`, `per_regime_f1_with_floor`, `coarse_3regime`), the < 30-sample fallback to global threshold, and the per-regime calibrator extending the existing global `ThresholdCalibrator` artefact. Are there hidden modes (e.g. regimes that swap stability across training vs inference windows, regime-specific class imbalance that breaks the F1 fit, regime detector confidence below a threshold that should also gate the per-regime threshold) where Track 9 would silently produce wrong thresholds without visible alerts ?