Plan dossier, Track 9 : Per-regime threshold optimization

Date : 2026-04-29
Story : CVN-N001-EE-S03 (OP wp#42)
GH issue : #714
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2, quick-win bundle Track 3, calibration tier (different lever from Track 5/6 ABANDONED loss-function attempts)
Sequencing : per F1_BUY_BOOST_PLAN.md §6 Phase 1, Track 9 is the next pick after Track 5 + Track 6 closures (both ABANDONED).

Review history

v1 (committee c560b67a) : PASSED / EXECUTION_RISK, consensus strong (5/5 experts). 0 blockers. 11 recommendations triaged in §11 below : 5 applied pre-impl (refined guardrail, expectancy-rejection logging, realistic costs, regime detector version pinning, position concentration metric), 1 already covered by the variant matrix, 4 to apply at impl time (observability + alerts + stability checks + stress test), 1 deferred to a follow-up Story.

1. Context : why now, why this lever

Tracks 5 (label smoothing + cleanlab, both branches ABANDONED) and 6 (focal loss, all 4 variants ABANDONED) showed a consistent shape : training signal manipulation does not help at the current dataset / labelling regime. Cohen's d ∈ [-1.8, -1.1] in the wrong direction across both Tracks ; the cross-track lesson recorded in F1_BUY_BOOST_PLAN.md §6 Outcomes explicitly pivots away from loss/label tuning.

Track 9 is the calibration tier (tier 5 of the F1 plan) — a different lever entirely. Instead of changing the model's internal training signal, it changes the post-inference decision threshold conditional on the market state (regime). The model produces p(BUY) ∈ [0, 1] ; the existing global ThresholdCalibrator (committee 7371c57d + 825d2fdf, see src/commun/trading/threshold_calibrator.py) picks one F1-optimal threshold per model version. Track 9 hypothesizes that one threshold cannot be optimal across all market regimes — the model is over-confident in trending markets and under-confident in volatile / transition markets, so the threshold should adapt.

The CVNTrade pipeline already classifies market regimes via src/commun/regime/regime_detector.py into 6 codes :

Regime code	Macro-state
`TREND_BULL`	Sustained directional uptrend, low volatility
`TREND_BEAR`	Sustained directional downtrend, low volatility
`RANGE_CALM`	Sideways, narrow bands, low volatility
`RANGE_VOLATILE`	Sideways, wide bands, high volatility
`TRANSITION_UP`	Pivot from down/range to bullish trend
`TRANSITION_DOWN`	Pivot from up/range to bearish trend

The classification already drives a regime filter in the post-inference filter chain (per Filter Funnel architecture step 5). Track 9 pushes regime-awareness one step earlier — into the threshold decision itself.

2. Hypothesis (falsifiable)

Per-regime thresholds lift f1_buy materially over a global threshold at the current dataset / labelling / model regime. Specifically :

H0 (null) : mean(f1_buy | per_regime_threshold) - mean(f1_buy | global_threshold) is indistinguishable from 0 (CI95 includes 0) → ABANDON.
H1 (alternative) : Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.

The hypothesis is falsifiable per the same gate criteria as Tracks 5 / 6.
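The H1 criteria can be checked mechanically once per-fold Δf1_buy values exist. The following is a minimal sketch, assuming a percentile bootstrap over the pooled fold deltas and a paired Cohen's d (mean delta over its standard deviation) ; the plan specifies neither choice. The names `bootstrap_ci95`, `cohens_d_paired` and `h1_holds` are hypothetical, and `min_improved=4` assumes the 5-crypto universe.

```python
import math
import random
import statistics


def bootstrap_ci95(deltas, n_resamples=2000, seed=0):
    """Percentile bootstrap 95 % CI for the mean of the deltas (assumed method)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]


def cohens_d_paired(deltas):
    """Paired effect size: mean delta divided by the std-dev of the deltas."""
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / sd


def h1_holds(deltas_by_crypto, min_lift=0.015, min_improved=4, min_d=0.3):
    """deltas_by_crypto: {symbol: [per-fold f1_buy(per_regime) - f1_buy(global)]}."""
    pooled = [d for ds in deltas_by_crypto.values() for d in ds]
    ci_low, _ = bootstrap_ci95(pooled)
    improved = sum(statistics.fmean(ds) > 0 for ds in deltas_by_crypto.values())
    return (
        statistics.fmean(pooled) >= min_lift   # lift of at least +0.015
        and ci_low > 0                         # CI95 excludes 0 on the positive side
        and improved >= min_improved           # at least 4 of 5 cryptos improve
        and cohens_d_paired(pooled) >= min_d   # effect size of at least 0.3
    )
```

Any single failed clause returns False, which maps to the H0 / ABANDON branch.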

3. Variant matrix

5 variants per the F1 plan §4.2 convention (5 unique configs per FTF factor, including baseline) :

Variant	What it does	n_thresholds	Aggregation
`none` (baseline)	Existing global F1-optimal threshold from `ThresholdCalibrator`	1	n/a
`per_regime_f1`	Fit F1-optimal threshold per regime on val set, applied per inference based on detected regime	6	per-regime
`per_regime_expectancy`	Fit expectancy-net optimal threshold per regime (uses cost formula v3, may reject regimes with negative expectancy)	6	per-regime
`per_regime_f1_with_floor`	`per_regime_f1` + global floor : `max(per_regime, global - 0.05)`, so a low-sample regime cannot fit a threshold more than 0.05 below the global one	6	per-regime + safety floor
`coarse_3regime`	Collapse regimes into 3 buckets (trend / range / transition) — fewer params, less overfitting	3	per-bucket

Per-regime aggregation across folds : median (committee reco 6, same as global threshold aggregation in aggregate_across_folds).
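The floor, the coarse grouping and the median aggregation are small enough to sketch directly. This is illustrative only : the bucket mapping is inferred from the regime names (the plan just says "trend / range / transition"), and `apply_floor` and `aggregate_per_regime_across_folds` are hypothetical names, not the existing `aggregate_across_folds`.

```python
import statistics

# Assumed mapping for coarse_3regime, inferred from the regime code prefixes.
COARSE_BUCKET = {
    "TREND_BULL": "trend",
    "TREND_BEAR": "trend",
    "RANGE_CALM": "range",
    "RANGE_VOLATILE": "range",
    "TRANSITION_UP": "transition",
    "TRANSITION_DOWN": "transition",
}


def apply_floor(per_regime_threshold, global_threshold, floor=0.05):
    """per_regime_f1_with_floor: never go more than `floor` below global."""
    return max(per_regime_threshold, global_threshold - floor)


def aggregate_per_regime_across_folds(fold_thresholds):
    """Median threshold per regime; fold_thresholds is a list of {regime: thr}."""
    regimes = {regime for fold in fold_thresholds for regime in fold}
    return {
        regime: statistics.median(
            fold[regime] for fold in fold_thresholds if regime in fold
        )
        for regime in sorted(regimes)
    }
```

Note that the floor only bounds the threshold from below ; a per-regime threshold above the global one passes through unchanged.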

4. Implementation path

4.1 Extend `ThresholdCalibrator` to per-regime mode

In src/commun/trading/threshold_calibrator.py :

1. New class PerRegimeThresholdCalibrator that wraps ThresholdCalibrator.fit() per regime slice of the val set.
2. Slice the val set by regime via regime_detector.classify_regime, the same regime tagger used in the production filter chain.
3. Per-regime sample size guardrail (refined per committee c560b67a reco #2) : a regime falls back to the global threshold if either condition holds : total_samples < 30 OR positive_samples (BUY=1) < 5. The positive-sample floor protects against degenerate F1 fits in regimes that have enough total samples but a near-zero BUY rate. Both fallback conditions log loudly per ADR-25 with event=per_regime_threshold_fallback regime=... reason=insufficient_samples|insufficient_positives total_n=... positive_n=....
4. Persist as MLflow artifact per_regime_threshold_calibrator.json alongside the existing threshold_calibrator.json ; same from_metadata schema with a new version=2 (per-regime keys). Pin the regime detector version (per committee c560b67a reco #8) : the artefact records regime_detector_version: "heuristic_v1" (current RegimeSnapshot.regime_version) ; loading the artefact under a different regime detector version raises RuntimeError per ADR-25 (no silent breakage if regimes are redefined later).
5. Realistic cost integration in the per_regime_expectancy variant (per committee c560b67a reco #5) : the expectancy fit uses the same cost formula as F1 plan §4 : gross_pnl - taker_fee_bps - spread_bps - slippage_bps - funding_bps. Round-trip ≈ 45 bps interim assumption (pending Track 2 dynamic slippage). The expectancy fit rejects any regime where even the optimal threshold yields negative expectancy ; the rejection is logged as event=regime_rejected reason=negative_expectancy regime=... best_threshold=... best_expectancy_bps=....
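The guardrail and version-pinning steps can be sketched as follows. This is a standalone illustration, not the real calibrator : `best_f1_threshold` is a simple grid search standing in for `ThresholdCalibrator.fit()`, whose actual signature is not shown in this dossier, and `fit_per_regime` and `check_detector_version` are hypothetical names.

```python
import logging
from collections import defaultdict

log = logging.getLogger("per_regime_threshold")

MIN_TOTAL = 30      # total_samples guardrail from the plan
MIN_POSITIVES = 5   # positive_samples (BUY=1) guardrail from the plan


def f1_at(threshold, probs, labels):
    """Binary F1 on BUY when predicting BUY iff p >= threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(probs, labels):
    """Grid search over 0.05..0.95 (stand-in for the real global fit)."""
    grid = [i / 100 for i in range(5, 96)]
    return max(grid, key=lambda t: f1_at(t, probs, labels))


def fit_per_regime(probs, labels, regimes, global_threshold):
    """Return {regime: (threshold, source)}, falling back to global when thin."""
    by_regime = defaultdict(list)
    for p, y, regime in zip(probs, labels, regimes):
        by_regime[regime].append((p, y))

    fitted = {}
    for regime, rows in by_regime.items():
        total_n = len(rows)
        positive_n = sum(y for _, y in rows)
        if total_n < MIN_TOTAL or positive_n < MIN_POSITIVES:
            reason = ("insufficient_samples" if total_n < MIN_TOTAL
                      else "insufficient_positives")
            log.warning(
                "event=per_regime_threshold_fallback regime=%s reason=%s "
                "total_n=%d positive_n=%d", regime, reason, total_n, positive_n)
            fitted[regime] = (global_threshold, "global")
            continue
        ps, ys = zip(*rows)
        fitted[regime] = (best_f1_threshold(ps, ys), "per_regime")
    return fitted


def check_detector_version(artefact, current_version):
    """Refuse to load an artefact fitted under another regime detector."""
    pinned = artefact["regime_detector_version"]
    if pinned != current_version:
        raise RuntimeError(
            f"regime detector mismatch: artefact={pinned} current={current_version}")
```

Storing the `source` next to each threshold keeps the fallback visible downstream, so the inference event can report `source=global` for a regime that never had enough data.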

4.2 Inference path

In src/commun/inference/inference_api.py (apply_thresholds) :

If CVN_THRESHOLD_PER_REGIME=1, call regime_detector.classify_regime on the inference window and use the per-regime threshold ; otherwise use the global threshold (current behavior).
Emit structured event event=threshold_applied regime=<code> threshold=<value> source=per_regime|global per ADR-32.
Fail-fast (per ADR-25) if CVN_THRESHOLD_PER_REGIME=1 is set but the loaded model artefact has no per-regime calibrator.
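The three inference rules above can be sketched in one function. This is a simplified stand-in for `apply_thresholds`, whose real signature is not shown here : the regime is passed in already classified, the per-regime calibrator is reduced to a plain dict, `p_buy >= threshold` is an assumed comparison, and raising on a regime missing from the dict is an assumption in the spirit of ADR-25.

```python
import logging
import os

log = logging.getLogger("inference")


def apply_threshold(p_buy, regime, global_threshold, per_regime_thresholds=None):
    """Return True for BUY. `per_regime_thresholds` is {regime_code: threshold}."""
    per_regime_on = os.environ.get("CVN_THRESHOLD_PER_REGIME", "0") == "1"
    if per_regime_on:
        if not per_regime_thresholds:
            # Fail-fast: the flag is on but no per-regime calibrator was loaded.
            raise RuntimeError(
                "CVN_THRESHOLD_PER_REGIME=1 but no per-regime calibrator loaded")
        if regime not in per_regime_thresholds:
            # Assumption: an unknown regime is an error, not a silent fallback.
            raise RuntimeError(f"no per-regime threshold for regime={regime}")
        threshold, source = per_regime_thresholds[regime], "per_regime"
    else:
        threshold, source = global_threshold, "global"

    log.info("event=threshold_applied regime=%s threshold=%.4f source=%s",
             regime, threshold, source)
    return p_buy >= threshold
```

With `p_buy=0.55`, a global threshold of 0.50 and a `TREND_BULL` threshold of 0.60, the same sample is a BUY with the flag off and a non-BUY with the flag on, which is the wiring the reproducer-style test in §4.4 is meant to pin down.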

4.3 FTF factor + guardrail

Add factor=per_regime_threshold to src/commun/finetune/ablation_matrix.py (per ADR-56) with the 5 variants above. Each variant gates a CVN_THRESHOLD_* env var combination :

Variant	env vars
`none`	(defaults — global F1 threshold from existing calibrator)
`per_regime_f1`	`CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary`
`per_regime_expectancy`	`CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=expectancy`
`per_regime_f1_with_floor`	`CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary`, `CVN_THRESHOLD_PER_REGIME_FLOOR=0.05`
`coarse_3regime`	`CVN_THRESHOLD_PER_REGIME=1`, `CVN_THRESHOLD_PER_REGIME_GROUPING=coarse`

Guardrail in src/commun/finetune/guardrails.py (per ADR-58) :
- CVN_THRESHOLD_PER_REGIME_METHOD ∈ {f1_binary, expectancy} (reject other values)
- CVN_THRESHOLD_PER_REGIME_FLOOR ∈ [0.0, 0.2] (reject silly values)
- CVN_THRESHOLD_PER_REGIME=1 ⇒ a per-regime calibrator artefact MUST exist (fail-fast at inference if missing, per ADR-25)
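A sketch of the three guardrail checks, assuming the env vars arrive as a plain string mapping and that the existing guardrails module raises on invalid combinations. The function name `validate_per_regime_env` and the `calibrator_present` flag are hypothetical.

```python
ALLOWED_METHODS = {"f1_binary", "expectancy"}
FLOOR_RANGE = (0.0, 0.2)


def validate_per_regime_env(env, calibrator_present):
    """Raise ValueError listing every violated rule; return None when valid."""
    errors = []

    method = env.get("CVN_THRESHOLD_PER_REGIME_METHOD")
    if method is not None and method not in ALLOWED_METHODS:
        errors.append(f"CVN_THRESHOLD_PER_REGIME_METHOD={method!r} not allowed")

    floor_raw = env.get("CVN_THRESHOLD_PER_REGIME_FLOOR")
    if floor_raw is not None:
        try:
            floor = float(floor_raw)
        except ValueError:
            errors.append(f"CVN_THRESHOLD_PER_REGIME_FLOOR={floor_raw!r} not a number")
        else:
            # NaN fails this comparison too, so it is rejected as out of range.
            if not FLOOR_RANGE[0] <= floor <= FLOOR_RANGE[1]:
                errors.append(f"CVN_THRESHOLD_PER_REGIME_FLOOR={floor} outside [0.0, 0.2]")

    if env.get("CVN_THRESHOLD_PER_REGIME") == "1" and not calibrator_present:
        errors.append("CVN_THRESHOLD_PER_REGIME=1 but no per-regime calibrator artefact")

    if errors:
        raise ValueError("; ".join(errors))
```

Collecting all violations before raising lets one failed sweep launch report every misconfigured variable at once.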

4.4 Tests

tests/unit/training/test_per_regime_threshold_calibrator.py — unit tests for PerRegimeThresholdCalibrator.fit/predict, including the < 30 sample fallback path
tests/integration/test_track9_per_regime_threshold.py — end-to-end with the 5 variants on a small synthetic dataset, asserts per-variant determinism + correct env var routing
tests/unit/test_ftf_guardrails.py — extend with the new env var validation
Reproducer-style assertion : a sample with regime=TREND_BULL AND CVN_THRESHOLD_PER_REGIME=1 MUST use the bull threshold, not the global one (catches a regression of the wiring)

4.5 Observability + MLOps readiness

New event event=threshold_applied regime=... threshold=... source=... indexed in Loki (per ADR-32)
Grafana panel "Threshold by regime" — shows the per-regime thresholds applied over time + how often each regime fires (validates the regime detector is producing balanced data, not stuck on one regime)
MLOps readiness file documentation/stories/CVN-N001-EE-S03/mlops_readiness.md filled per ADR-70 before merge
Runbook documentation/runbooks/runbook_per_regime_threshold_drift.md (P2) : alert if any regime threshold drifts > 2σ from its training value over 7 d (suggests model behavior in that regime has changed)
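The "how often each regime fires" view reduces to counting `event=threshold_applied` lines by regime. A minimal offline sketch, assuming the log lines are space-separated `key=value` tokens as in the event format above ; the real panel would be built from a Loki query, and `parse_event` and `regime_shares` are hypothetical helpers.

```python
from collections import Counter


def parse_event(line):
    """Split 'k=v k=v ...' into a dict; tokens without '=' are ignored."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def regime_shares(lines):
    """Share of threshold_applied events per regime, as fractions summing to 1."""
    counts = Counter(
        event["regime"]
        for event in map(parse_event, lines)
        if event.get("event") == "threshold_applied" and "regime" in event
    )
    total = sum(counts.values())
    return {regime: n / total for regime, n in counts.items()} if total else {}
```

A share close to 1.0 for a single regime over a long window is the "stuck on one regime" symptom the panel is meant to expose.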

5. Acceptance gate (per F1 plan §6)

The 6 official gates apply :

Gate	Threshold
F1_buy lift	mean Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0
Joint metric	Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 %
Stability	per-fold variance of f1_buy ≤ 0.05
Per-asset	f1_buy improves on ≥ 4/5 cryptos
Sample size	≥ 50 BUY trades / fold
MLOps	`documentation/stories/CVN-N001-EE-S03/mlops_readiness.md` complete

Supporting metric (per committee c560b67a reco #11) — position concentration per regime : max_position_size_per_regime ≤ 2 × global_average. Captured in the results dossier ; not a hard gate (no block on lock) but flagged for post-deploy monitoring if a regime ends up with concentration > 2× average.
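The gate table and the supporting metric translate into a per-variant checklist. A minimal sketch, assuming the sweep summary exposes the fields below (all field names are hypothetical) and that Δmax_drawdown is expressed in percentage points.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    delta_f1_buy: float
    ci95_low: float
    ci95_high: float
    delta_expectancy: float
    delta_sortino: float
    delta_max_drawdown_pct: float   # assumed unit: percentage points
    f1_buy_fold_variance: float
    cryptos_improved: int           # out of 5
    min_buy_trades_per_fold: int
    mlops_readiness_complete: bool


def evaluate_gates(r):
    """Return {gate_name: passed} for the 6 official gates."""
    ci_excludes_zero = not (r.ci95_low <= 0 <= r.ci95_high)
    return {
        "f1_buy_lift": r.delta_f1_buy >= 0.015 and ci_excludes_zero,
        "joint_metric": (r.delta_expectancy >= 0 and r.delta_sortino >= 0
                         and r.delta_max_drawdown_pct <= 1.0),
        "stability": r.f1_buy_fold_variance <= 0.05,
        "per_asset": r.cryptos_improved >= 4,
        "sample_size": r.min_buy_trades_per_fold >= 50,
        "mlops": r.mlops_readiness_complete,
    }


def concentration_flags(max_position_by_regime, global_average):
    """Supporting metric only: regimes whose max position exceeds 2x average."""
    return sorted(regime for regime, size in max_position_by_regime.items()
                  if size > 2 * global_average)
```

`concentration_flags` is kept apart from `evaluate_gates` on purpose, since the plan treats concentration as a monitoring flag and not as a blocking gate.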

If every gate clears → operator decision lock (Console flip the chosen variant in ftf_config.base_env). If any gate fails on every variant → abandon. If a variant clears AT MOST one gate beyond the F1_buy gate → keep available (no production lock, but available for joint variants in future tracks).

6. Out of scope

Full regime grid sweep (every regime × every confidence cutoff) — would be ~20 variants, exceeds the 5-variant FTF convention. Stick with the 5 above ; if results are encouraging, a follow-up Story can extend.
Online regime adaptation (re-fit per-regime thresholds at runtime as regimes shift) — premature ; we don't have evidence the offline-fit thresholds work yet.
Regime detector improvements — the existing regime_detector.py (heuristic v1) is treated as a black box. If this Track ABANDONS, the operator may revisit whether the regimes themselves are well-defined ; if it LOCKS, the regime detector becomes load-bearing and gets its own follow-up Story for hardening.
Per-regime training (separate model per regime) — different lever, much more invasive ; reserved for Track 11 (ensemble diversity) if Track 9 doesn't deliver.

7. Falsifiability + rollback

Falsifiability : the gate criteria above (especially F1_buy CI95 + per-asset 4/5) are pre-registered. If the FTF sweep produces Δf1 ∈ [-0.01, +0.01] with CI95 including 0, that's the H0 outcome — ABANDON cleanly.
Rollback : if Track 9 LOCKs and a production regression appears (per the new runbook drift alert) :
Console flip CVN_THRESHOLD_PER_REGIME=0 (per ADR-59) — fully reverts to global threshold within 1 minute, no model retrain needed
The per-regime calibrator artefact stays in MLflow but is unused
Issue a hotfix PR if the bug is in the calibrator code itself
Per-regime calibrator artefact persists per-version, so rolling back the model also rolls back the threshold.

8. Risks

Risk	Likelihood	Impact	Mitigation
Some regimes have < 30 val samples → unreliable per-regime threshold	High	Medium	Per-regime sample size guardrail + fallback to global (§4.1). Loud log + Loki alert when fallback fires
Regime detector misclassifies → wrong threshold applied	Medium	Medium	Existing regime filter chain step (post-inference) provides a second-line check. If misclassification rate > 5 % on val set, abandon and revisit regime detector
Per-regime overfitting to val set	Medium	Medium	Bootstrap CI95 over folds + median aggregation per committee reco 6. Floor variant (`per_regime_f1_with_floor`) bounds how far a regime threshold can fall below global
Track 9 LOCKs but post-deploy production regimes shift (concept drift)	Medium	High	Drift runbook + alert (§4.5). Quarterly re-fit cadence per #709 MLOps readiness template §3 (data drift section)
Cross-track confound : Track 6 was `In testing` (gate decision pending) when this risk was logged ; running Track 9 before Track 6 closed would have risked attribution	Low	Medium	Resolved : Track 6 verdict is now ABANDONED (closed 2026-04-29). Sequencing is clean

9. Why this is not the next loss-function attempt

The Track 5 + Track 6 lesson narrows specifically to training signal manipulation (label engineering + loss tuning). Track 9 is :

Post-training (operates on already-trained model outputs) — the model itself is unchanged
Calibration-tier (tier 5 of the F1 plan, distinct from tier 2 LABEL ENGINEERING + tier 3 LOSS FUNCTION)
Decision-rule level (changes when the model's p(BUY) triggers a BUY action, not how the model produces p(BUY))

If Track 9 also abandons, the lesson generalizes more strongly and the next pick should pivot to data tracks (Track 1 BTC features, Track 12 fractional diff) per the F1 plan §6 implication block. If Track 9 locks, calibration becomes a productive lever and Track 11 (ensemble + per-regime ensembles) becomes naturally aligned.

10. Cross-references

F1 plan §5 Track 9 : "Threshold optimization per-régime" + §6 sequencing
ADRs : ADR-25 (no silent fallback), ADR-32 (event=key=value structured logs), ADR-56 (every change FTF-testable), ADR-58 (every factor → guardrail + integration test), ADR-70 (MLOps readiness mandatory)
Existing infra : src/commun/trading/threshold_calibrator.py (will be extended), src/commun/regime/regime_detector.py (regime classifier reused as-is), src/commun/inference/inference_api.py::apply_thresholds (inference surface)
Sister Tracks : Track 5 results (ABANDON), Track 6 results (ABANDON, dossier landing in PR #788 at documentation/missions/ml-boost/2026-04-29-track6-focal-loss-results.md)
Production filter chain : architecture/FILTER_FUNNEL.md (Step 5 = regime filter, post-inference)
Design doc that introduced ThresholdCalibrator : design/CVN-N001-threshold-calibrator.md (committee 7371c57d + 825d2fdf)

11. Committee recommendations triage (post PASSED / EXECUTION_RISK)

#	Recommendation	Source	Disposition
1	Runtime monitoring + confidence gating for regime detector	All experts	Apply at impl time — emit `event=regime_classified regime=... confidence=...` (the `RegimeSnapshot` already carries `confidence`). Add Grafana panel + alert if confidence < 0.6 for > 10 % of inferences over 1 h. Confidence < 0.5 falls back to global threshold (extends §4.1 fallback path)
2	Refined per-regime sample size guardrail (positive labels)	Data-Scientist + ML-Eng	Applied pre-impl in §4.1 step 3 — fallback condition is `total < 30 OR positives < 5`, both logged with explicit reasons
3	Pre-deployment regime stability validation	Architect + Ops	Apply at impl time — pre-merge check : compare regime frequencies on val set vs last 30 d of inference data, fail if any regime drifts by > 20 % proportionally. Extends `make qa` to call a new `check_regime_stability.py` script
4	Observability for expectancy-based rejections	Crypto-Trader	Applied pre-impl in §4.1 step 5 — `event=regime_rejected reason=negative_expectancy ...` emitted ; alert if any single regime is rejected on > 50 % of folds (suggests systemic issue, not noise)
5	Realistic costs in expectancy calculation	Crypto-Trader + Ops	Applied pre-impl in §4.1 step 5 — uses the F1 plan §4 cost formula (round-trip ≈ 45 bps interim, replaceable by Track 2 dynamic slippage when it lands)
6	Regime detector hardening as follow-up	All experts	Defer — already documented in §6 Out of scope as a post-LOCK Story. If Track 9 ABANDONS, this point becomes moot ; if it LOCKS, file a follow-up Story under `CVN-N001-EE` (next sprint) before live promotion
7	Monitor per-regime F1 stability + variance during sweep	Data-Scientist + ML-Eng	Apply at impl time — extend the FTF results dossier table to include per-regime f1_buy + variance per fold + per crypto. Operator reads these in the gate decision
8	Regime detector version pinning	Ops	Applied pre-impl in §4.1 step 4 — calibrator artefact records `regime_detector_version` ; mismatch raises `RuntimeError` per ADR-25
9	Ablation for regime grouping (6 vs 3)	Data-Scientist + ML-Eng	Already covered — the variant matrix (§3) explicitly includes `coarse_3regime` as variant 5. The FTF sweep IS the ablation between 6 (per_regime_f1) and 3 (coarse_3regime)
10	Stress-test thresholds under regime transitions	Ops + ML-Eng	Apply at impl time — integration test `test_track9_per_regime_threshold.py` includes a stress case where 20 % of val samples are forced into adjacent regimes (simulates regime detector misclassification) ; assert f1_buy degrades gracefully (< 5 % loss vs clean classification)
11	Position concentration per regime	Crypto-Trader	Applied pre-impl in §5 — added as a supporting metric in the gate criteria block (not gating, flagged for post-deploy monitoring)

Net effect on the plan (§4 implementation path + §5 gate) : 5 recos applied pre-impl (refined positive-sample guardrail #2, expectancy-rejection logging #4, realistic costs in expectancy fit #5, regime detector version pinning #8, position concentration tracking #11) + 4 to apply at impl time (#1, #3, #7, #10 : observability + alerts + stability checks + stress test) + 1 already covered by the variant matrix (#9) + 1 deferred to a follow-up Story (#6 regime detector hardening, contingent on Track 9 LOCKing). The EXECUTION_RISK code is acknowledged ; the production behavior is hardened end-to-end without scope expansion that would warrant a v2.
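The pre-merge regime stability check from reco #3 can be sketched as below. It assumes "drifts by > 20 % proportionally" means a relative change |live - val| / val above 0.20 per regime, and that regime frequencies are supplied as shares. The function names are hypothetical ; only the script name `check_regime_stability.py` comes from the triage table.

```python
import math


def regime_drift(val_freq, live_freq):
    """Relative change per regime between val-set and live frequency shares."""
    drift = {}
    for regime in sorted(set(val_freq) | set(live_freq)):
        val_share = val_freq.get(regime, 0.0)
        live_share = live_freq.get(regime, 0.0)
        if val_share > 0:
            drift[regime] = abs(live_share - val_share) / val_share
        else:
            # Regime unseen in val: any live occurrence counts as unbounded drift.
            drift[regime] = math.inf if live_share > 0 else 0.0
    return drift


def check_regime_stability(val_freq, live_freq, max_relative_drift=0.20):
    """Raise RuntimeError if any regime drifts beyond the allowed proportion."""
    drift = regime_drift(val_freq, live_freq)
    offenders = {r: d for r, d in drift.items() if d > max_relative_drift}
    if offenders:
        raise RuntimeError(f"regime frequency drift above limit: {offenders}")
    return drift
```

A relative measure is stricter for rare regimes : a regime moving from a 5 % to a 7 % share is a 40 % proportional drift and would fail the check.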

Question for the committee (v1, see verdict above)

Validate the 5-variant FTF matrix (none, per_regime_f1, per_regime_expectancy, per_regime_f1_with_floor, coarse_3regime), the < 30-sample fallback to global threshold, and the per-regime calibrator extending the existing global ThresholdCalibrator artefact. Are there hidden failure modes (e.g. regimes whose stability differs between training and inference windows, regime-specific class imbalance that breaks the F1 fit, regime detector confidence below a threshold that should also gate the per-regime threshold) where Track 9 would silently produce wrong thresholds without visible alerts ?