
Plan dossier — Track 9 : Per-regime threshold optimization

Date : 2026-04-29
Story : CVN-N001-EE-S03 (OP wp#42)
GH issue : #714
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2 (quick-win bundle Track 3, calibration tier ; a different lever from the Track 5/6 ABANDONED loss-function attempts)
Sequencing : per F1_BUY_BOOST_PLAN.md §6 Phase 1. Track 9 is the next pick after the Track 5 + Track 6 closures (both ABANDONED).

Review history

  • v1 (committee c560b67a) — PASSED / EXECUTION_RISK, consensus strong (5/5 experts). 0 blockers. 11 recommendations triaged in §11 below — 4 applied pre-impl (refined guardrail, realistic costs, position concentration gate, regime detector version pinning), 7 applied at impl time (observability + alerts) or deferred to follow-up Stories.

1. Context — why now, why this lever

Tracks 5 (label smoothing + cleanlab, both branches ABANDONED) and 6 (focal loss, all 4 variants ABANDONED) showed a consistent shape : training signal manipulation does not help at the current dataset / labelling regime. Cohen's d ∈ [-1.8, -1.1] in the wrong direction across both Tracks ; the cross-track lesson recorded in F1_BUY_BOOST_PLAN.md §6 Outcomes explicitly pivots away from loss/label tuning.

Track 9 is the calibration tier (tier 5 of the F1 plan) — a different lever entirely. Instead of changing the model's internal training signal, it changes the post-inference decision threshold conditional on the market state (regime). The model produces p(BUY) ∈ [0, 1] ; the existing global ThresholdCalibrator (committee 7371c57d + 825d2fdf, see src/commun/trading/threshold_calibrator.py) picks one F1-optimal threshold per model version. Track 9 hypothesizes that one threshold cannot be optimal across all market regimes — the model is over-confident in trending markets and under-confident in volatile / transition markets, so the threshold should adapt.

The CVNTrade pipeline already classifies market regimes via src/commun/regime/regime_detector.py into 6 codes :

| Regime code | Macro-state |
| --- | --- |
| TREND_BULL | Sustained directional uptrend, low volatility |
| TREND_BEAR | Sustained directional downtrend, low volatility |
| RANGE_CALM | Sideways, narrow bands, low volatility |
| RANGE_VOLATILE | Sideways, wide bands, high volatility |
| TRANSITION_UP | Pivot from down/range to bullish trend |
| TRANSITION_DOWN | Pivot from up/range to bearish trend |

The classification already drives a regime filter in the post-inference filter chain (per Filter Funnel architecture step 5). Track 9 pushes regime-awareness one step earlier — into the threshold decision itself.

2. Hypothesis (falsifiable)

Per-regime thresholds lift f1_buy materially over a global threshold at the current dataset / labelling / model regime. Specifically :

  • H0 (null) : mean(f1_buy | per_regime_threshold) - mean(f1_buy | global_threshold) is indistinguishable from 0 (CI95 includes 0) → ABANDON.
  • H1 (alternative) : Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.

The hypothesis is falsifiable per the same gate criteria as Tracks 5 / 6.
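A minimal sketch of how the H1 criteria could be checked from per-fold Δf1_buy values. Assumptions, none of which come from the F1 plan : a percentile bootstrap over pooled per-fold deltas, a paired Cohen's d (mean of the deltas divided by their standard deviation), and the hypothetical names bootstrap_ci95, cohens_d and h1_holds.

```python
import math
import random
import statistics


def bootstrap_ci95(deltas, n_resamples=2000, seed=0):
    """Percentile bootstrap CI95 for the mean of per-fold deltas."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]


def cohens_d(deltas):
    """Assumption: paired effect size, mean(delta) / stdev(delta)."""
    mean, sd = statistics.fmean(deltas), statistics.stdev(deltas)
    if sd == 0:
        return math.copysign(math.inf, mean) if mean else 0.0
    return mean / sd


def h1_holds(deltas_by_crypto, min_lift=0.015, min_improved=4, min_d=0.3):
    """deltas_by_crypto maps a symbol to its per-fold delta f1_buy values."""
    pooled = [d for ds in deltas_by_crypto.values() for d in ds]
    ci_low, _ = bootstrap_ci95(pooled)
    improved = sum(statistics.fmean(ds) > 0 for ds in deltas_by_crypto.values())
    return (
        statistics.fmean(pooled) >= min_lift
        and ci_low > 0  # CI95 excludes 0 on the positive side
        and improved >= min_improved
        and cohens_d(pooled) >= min_d
    )
```

A variant whose pooled deltas straddle 0 fails on the first two conditions, which is the H0 outcome.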

3. Variant matrix

5 variants per the F1 plan §4.2 convention (5 unique configs per FTF factor, including baseline) :

| Variant | What it does | n_thresholds | Aggregation |
| --- | --- | --- | --- |
| none (baseline) | Existing global F1-optimal threshold from ThresholdCalibrator | 1 | n/a |
| per_regime_f1 | Fit F1-optimal threshold per regime on val set, applied per inference based on detected regime | 6 | per-regime |
| per_regime_expectancy | Fit expectancy-net optimal threshold per regime (uses cost formula v3, may reject regimes with negative expectancy) | 6 | per-regime |
| per_regime_f1_with_floor | per_regime_f1 + global floor : max(per_regime, global - 0.05) to avoid runaway thresholds in low-sample regimes | 6 | per-regime + safety floor |
| coarse_3regime | Collapse regimes into 3 buckets (trend / range / transition) ; fewer params, less overfitting | 3 | per-bucket |

Per-regime aggregation across folds : median (committee reco 6, same as global threshold aggregation in aggregate_across_folds).
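The three mechanical pieces of the matrix (safety floor, coarse grouping, median aggregation) can be sketched as below. The bucket assignment by regime-name prefix is an assumption (the plan only names the three buckets), and apply_floor, median_per_regime and COARSE_BUCKETS are hypothetical names, not the existing calibrator API.

```python
from statistics import median

# Assumption: each regime code maps to the bucket suggested by its name.
COARSE_BUCKETS = {
    "TREND_BULL": "trend",
    "TREND_BEAR": "trend",
    "RANGE_CALM": "range",
    "RANGE_VOLATILE": "range",
    "TRANSITION_UP": "transition",
    "TRANSITION_DOWN": "transition",
}


def apply_floor(per_regime, global_threshold, floor=0.05):
    """Floor variant: max(per_regime, global - floor)."""
    return max(per_regime, global_threshold - floor)


def median_per_regime(fold_thresholds):
    """fold_thresholds: one {regime: threshold} dict per fold."""
    regimes = sorted({r for fold in fold_thresholds for r in fold})
    return {
        r: median(fold[r] for fold in fold_thresholds if r in fold)
        for r in regimes
    }
```

As written in the matrix, the floor only bounds a per-regime threshold from below : a regime fitted at 0.30 against a global 0.50 is lifted to 0.45, while a regime fitted at 0.60 is left untouched.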

4. Implementation path

4.1 Extend ThresholdCalibrator to per-regime mode

In src/commun/trading/threshold_calibrator.py :

  1. New class PerRegimeThresholdCalibrator that wraps ThresholdCalibrator.fit() per regime slice of the val set.
  2. Slice the val set by regime via regime_detector.classify_regime — same regime tagger used in production filter chain.
  3. Per-regime sample size guardrail (refined per committee c560b67a reco #2) : a regime falls back to the global threshold if either of these conditions holds — total_samples < 30 OR positive_samples (BUY=1) < 5. The positive-sample floor protects against degenerate F1 fits in regimes that are total-volume-OK but BUY-rate-near-zero. Both fallback conditions log loudly per ADR-25 with event=per_regime_threshold_fallback regime=... reason=insufficient_samples|insufficient_positives total_n=... positive_n=....
  4. Persist as MLflow artifact per_regime_threshold_calibrator.json alongside the existing threshold_calibrator.json ; same from_metadata schema with a new version=2 (per-regime keys). Pin the regime detector version (per committee c560b67a reco #8) — the artefact records regime_detector_version: "heuristic_v1" (current RegimeSnapshot.regime_version) ; loading the artefact under a different regime detector version raises RuntimeError per ADR-25 (no silent breakage if regimes are redefined later).
  5. Realistic cost integration in the per_regime_expectancy variant (per committee c560b67a reco #5) — the expectancy fit uses the same cost formula as F1 plan §4 : gross_pnl - taker_fee_bps - spread_bps - slippage_bps - funding_bps. Round-trip ≈ 45 bps interim assumption (pending Track 2 dynamic slippage). The expectancy threshold per regime will reject regimes where even the optimal threshold yields negative expectancy ; the rejection is logged as event=regime_rejected reason=negative_expectancy regime=... best_threshold=... best_expectancy_bps=....
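Steps 1 to 3 can be sketched as follows : slice the val set by regime, apply the sample-size guardrail, and fit an F1-optimal threshold where the slice is usable. This is an illustration only. The grid search, the function names and the (regime, p_buy, label) row shape are assumptions ; the real PerRegimeThresholdCalibrator would delegate to ThresholdCalibrator.fit().

```python
import logging
from collections import defaultdict

log = logging.getLogger("per_regime_threshold")
MIN_TOTAL, MIN_POSITIVES = 30, 5  # guardrail values from step 3


def fallback_reason(labels):
    """Return the step-3 fallback reason, or None if the slice is usable."""
    if len(labels) < MIN_TOTAL:
        return "insufficient_samples"
    if sum(labels) < MIN_POSITIVES:
        return "insufficient_positives"
    return None


def f1_at(probs, labels, threshold):
    """Binary F1 for the BUY class when predicting BUY iff p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(probs, labels):
    """Grid search; ties resolve to the lowest threshold (a sketch choice)."""
    grid = [i / 100 for i in range(5, 96)]
    return max(grid, key=lambda t: f1_at(probs, labels, t))


def fit_per_regime(rows, global_threshold):
    """rows: iterable of (regime, p_buy, label) tuples from the val set."""
    slices = defaultdict(lambda: ([], []))
    for regime, p_buy, label in rows:
        slices[regime][0].append(p_buy)
        slices[regime][1].append(label)
    fitted = {}
    for regime, (probs, labels) in slices.items():
        reason = fallback_reason(labels)
        if reason:
            log.warning(
                "event=per_regime_threshold_fallback regime=%s reason=%s "
                "total_n=%d positive_n=%d",
                regime, reason, len(labels), sum(labels),
            )
            fitted[regime] = {"threshold": global_threshold, "source": "global"}
        else:
            fitted[regime] = {
                "threshold": best_f1_threshold(probs, labels),
                "source": "per_regime",
            }
    return fitted
```

Recording the source next to each threshold keeps the fallback visible in the persisted artefact as well as in the logs.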

4.2 Inference path

In src/commun/inference/inference_api.py (apply_thresholds) :

  1. If CVN_THRESHOLD_PER_REGIME=1, call regime_detector.classify_regime on the inference window and use the per-regime threshold ; otherwise use the global threshold (current behavior).
  2. Emit structured event event=threshold_applied regime=<code> threshold=<value> source=per_regime|global per ADR-32.
  3. Fail-fast (per ADR-25) if CVN_THRESHOLD_PER_REGIME=1 is set but the loaded model artefact has no per-regime calibrator.
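The selection logic above, as a hedged sketch. select_threshold is a hypothetical helper, not the actual apply_thresholds signature ; the per-regime calibrator is reduced to a plain {regime: threshold} dict, and raising on an unknown regime code is an assumption in the spirit of ADR-25.

```python
import os


def select_threshold(regime, global_threshold, per_regime=None, env=None):
    """Return (threshold, event_line) for one inference.

    per_regime: {regime_code: threshold} from the model artefact, or None.
    """
    env = os.environ if env is None else env
    if env.get("CVN_THRESHOLD_PER_REGIME") == "1":
        if per_regime is None:
            # Fail-fast rather than silently using the global threshold.
            raise RuntimeError(
                "CVN_THRESHOLD_PER_REGIME=1 but the model artefact has no "
                "per-regime calibrator"
            )
        # Assumption: an unknown regime code is also an error (KeyError).
        threshold, source = per_regime[regime], "per_regime"
    else:
        threshold, source = global_threshold, "global"
    event = (
        f"event=threshold_applied regime={regime} "
        f"threshold={threshold:.4f} source={source}"
    )
    return threshold, event
```

Passing env explicitly keeps the function testable without mutating the process environment.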

4.3 FTF factor + guardrail

Add factor=per_regime_threshold to src/commun/finetune/ablation_matrix.py (per ADR-56) with the 5 variants above. Each variant gates a CVN_THRESHOLD_* env var combination :

| Variant | env vars |
| --- | --- |
| none | (defaults : global F1 threshold from existing calibrator) |
| per_regime_f1 | CVN_THRESHOLD_PER_REGIME=1, CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary |
| per_regime_expectancy | CVN_THRESHOLD_PER_REGIME=1, CVN_THRESHOLD_PER_REGIME_METHOD=expectancy |
| per_regime_f1_with_floor | CVN_THRESHOLD_PER_REGIME=1, CVN_THRESHOLD_PER_REGIME_METHOD=f1_binary, CVN_THRESHOLD_PER_REGIME_FLOOR=0.05 |
| coarse_3regime | CVN_THRESHOLD_PER_REGIME=1, CVN_THRESHOLD_PER_REGIME_GROUPING=coarse |

Guardrail in src/commun/finetune/guardrails.py (per ADR-58) :

  • CVN_THRESHOLD_PER_REGIME_METHOD ∈ {f1_binary, expectancy} : reject other values
  • CVN_THRESHOLD_PER_REGIME_FLOOR ∈ [0.0, 0.2] : reject out-of-range values
  • CVN_THRESHOLD_PER_REGIME=1 ⇒ a per-regime calibrator artefact MUST exist (fail-fast at inference if missing, per ADR-25)
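A possible shape for the first two guardrail rules, assuming the guardrail receives the variant's env vars as a plain dict. validate_threshold_env is a hypothetical name ; the existing guardrails.py interface is not shown in this dossier. The third rule (artefact presence) is checked at inference and is left out here.

```python
ALLOWED_METHODS = {"f1_binary", "expectancy"}
FLOOR_RANGE = (0.0, 0.2)


def validate_threshold_env(env):
    """Return a list of violations; an empty list means the env passes."""
    errors = []
    method = env.get("CVN_THRESHOLD_PER_REGIME_METHOD")
    if method is not None and method not in ALLOWED_METHODS:
        errors.append(
            f"CVN_THRESHOLD_PER_REGIME_METHOD={method!r} "
            f"not in {sorted(ALLOWED_METHODS)}"
        )
    raw_floor = env.get("CVN_THRESHOLD_PER_REGIME_FLOOR")
    if raw_floor is not None:
        try:
            floor = float(raw_floor)
        except ValueError:
            errors.append(f"CVN_THRESHOLD_PER_REGIME_FLOOR={raw_floor!r} is not a number")
        else:
            # NaN fails both comparisons, so it is rejected as well.
            if not FLOOR_RANGE[0] <= floor <= FLOOR_RANGE[1]:
                errors.append(
                    f"CVN_THRESHOLD_PER_REGIME_FLOOR={floor} outside {list(FLOOR_RANGE)}"
                )
    return errors
```

Returning every violation at once, rather than raising on the first, lets the FTF sweep report a misconfigured variant in a single pass.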

4.4 Tests

  • tests/unit/training/test_per_regime_threshold_calibrator.py — unit tests for PerRegimeThresholdCalibrator.fit/predict, including the < 30 sample fallback path
  • tests/integration/test_track9_per_regime_threshold.py — end-to-end with the 5 variants on a small synthetic dataset, asserts per-variant determinism + correct env var routing
  • tests/unit/test_ftf_guardrails.py — extend with the new env var validation
  • Reproducer-style assertion : a sample with regime=TREND_BULL AND CVN_THRESHOLD_PER_REGIME=1 MUST use the bull threshold, not the global one (catches a regression of the wiring)

4.5 Observability + MLOps readiness

  • New event event=threshold_applied regime=... threshold=... source=... indexed in Loki (per ADR-32)
  • Grafana panel "Threshold by regime" — shows the per-regime thresholds applied over time + how often each regime fires (validates the regime detector is producing balanced data, not stuck on one regime)
  • MLOps readiness file documentation/stories/CVN-N001-EE-S03/mlops_readiness.md filled per ADR-70 before merge
  • Runbook documentation/runbooks/runbook_per_regime_threshold_drift.md (P2) : alert if any regime threshold drifts > 2σ from its training value over 7 d (suggests model behavior in that regime has changed)
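The drift rule in the runbook bullet could be evaluated along these lines. Two assumptions are made loudly here, since the dossier does not define them : σ is taken as the spread of each regime's threshold across training folds, and the "current" value is a threshold re-estimated on the trailing 7 d of labelled data. threshold_drift_alerts is a hypothetical name.

```python
def threshold_drift_alerts(training, recent, n_sigma=2.0):
    """Return the regimes whose threshold moved more than n_sigma * sigma.

    training: {regime: (threshold, sigma)} recorded at fit time (assumed shape).
    recent:   {regime: threshold} re-estimated on the trailing 7 d (assumed).
    """
    return sorted(
        regime
        for regime, (threshold, sigma) in training.items()
        if regime in recent and abs(recent[regime] - threshold) > n_sigma * sigma
    )
```

With a training value of 0.50 and σ = 0.02, a re-estimated 0.56 is flagged (0.06 > 0.04) ; with 0.60 and σ = 0.05, a re-estimated 0.65 is not.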

5. Acceptance gate (per F1 plan §6)

The 6 official gates apply :

| Gate | Threshold |
| --- | --- |
| F1_buy lift | mean Δf1_buy ≥ +0.015 with 95 % bootstrap CI excluding 0 |
| Joint metric | Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 % |
| Stability | per-fold variance of f1_buy ≤ 0.05 |
| Per-asset | f1_buy improves on ≥ 4/5 cryptos |
| Sample size | ≥ 50 BUY trades / fold |
| MLOps | documentation/stories/CVN-N001-EE-S03/mlops_readiness.md complete |

Supporting metric (per committee c560b67a reco #11) — position concentration per regime : max_position_size_per_regime ≤ 2 × global_average. Captured in the results dossier ; not a hard gate (no block on lock) but flagged for post-deploy monitoring if a regime ends up with concentration > 2× average.

If every gate clears → operator decision lock (Console flip the chosen variant in ftf_config.base_env). If any gate fails on every variant → abandon. If a variant clears AT MOST one gate beyond the F1_buy gate → keep available (no production lock, but available for joint variants in future tracks).
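A sketch of the six gate checks and the non-gating concentration metric, applied to one variant's aggregated results. All dictionary keys are hypothetical, and Δmax_drawdown is assumed to be expressed in percentage points. The lock / keep-available / abandon decision stays with the operator and is not encoded here.

```python
def evaluate_gates(summary):
    """summary: aggregated metrics for one variant (hypothetical key names)."""
    return {
        "f1_buy_lift": (
            summary["delta_f1_buy_mean"] >= 0.015
            and summary["delta_f1_buy_ci95_low"] > 0
        ),
        "joint_metric": (
            summary["delta_expectancy"] >= 0
            and summary["delta_sortino"] >= 0
            and summary["delta_max_drawdown_pct"] <= 1.0
        ),
        "stability": summary["f1_buy_fold_variance"] <= 0.05,
        "per_asset": summary["n_cryptos_improved"] >= 4,
        "sample_size": min(summary["buy_trades_per_fold"]) >= 50,
        "mlops": bool(summary["mlops_readiness_complete"]),
    }


def concentration_flags(max_position_by_regime, global_average):
    """Supporting metric (not a gate): regimes above 2x the global average."""
    return sorted(
        regime
        for regime, size in max_position_by_regime.items()
        if size > 2 * global_average
    )
```

all(evaluate_gates(summary).values()) is then the "every gate clears" condition for a single variant.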

6. Out of scope

  • Full regime grid sweep (every regime × every confidence cutoff) — would be ~20 variants, exceeds the 5-variant FTF convention. Stick with the 5 above ; if results are encouraging, a follow-up Story can extend.
  • Online regime adaptation (re-fit per-regime thresholds at runtime as regimes shift) — premature ; we don't have evidence the offline-fit thresholds work yet.
  • Regime detector improvements — the existing regime_detector.py (heuristic v1) is treated as a black box. If this Track ABANDONS, the operator may revisit whether the regimes themselves are well-defined ; if it LOCKS, the regime detector becomes load-bearing and gets its own follow-up Story for hardening.
  • Per-regime training (separate model per regime) — different lever, much more invasive ; reserved for Track 11 (ensemble diversity) if Track 9 doesn't deliver.

7. Falsifiability + rollback

  • Falsifiability : the gate criteria above (especially F1_buy CI95 + per-asset 4/5) are pre-registered. If the FTF sweep produces Δf1 ∈ [-0.01, +0.01] with CI95 including 0, that's the H0 outcome — ABANDON cleanly.
  • Rollback : if Track 9 LOCKs and a production regression appears (per the new runbook drift alert) :
      • Console flip CVN_THRESHOLD_PER_REGIME=0 (per ADR-59) : fully reverts to global threshold within 1 minute, no model retrain needed
      • The per-regime calibrator artefact stays in MLflow but is unused
      • Issue a hotfix PR if the bug is in the calibrator code itself
  • Per-regime calibrator artefact persists per-version, so rolling back the model also rolls back the threshold.

8. Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Some regimes have < 30 val samples → unreliable per-regime threshold | High | Medium | Per-regime sample size guardrail + fallback to global (§4.1). Loud log + Loki alert when fallback fires |
| Regime detector misclassifies → wrong threshold applied | Medium | Medium | Existing regime filter chain step (post-inference) provides a second-line check. If misclassification rate > 5 % on val set, abandon and revisit regime detector |
| Per-regime overfitting to val set | Medium | Medium | Bootstrap CI95 over folds + median aggregation per committee reco 6. Floor variant (per_regime_f1_with_floor) caps the deviation from global |
| Track 9 LOCKs but post-deploy production regimes shift (concept drift) | Medium | High | Drift runbook + alert (§4.5). Quarterly re-fit cadence per #709 MLOps readiness template §3 (data drift section) |
| Cross-track confound : Track 6 is In testing (gate decision pending) ; running Track 9 before Track 6 closes risks attribution | Low | Medium | Track 6 verdict is now ABANDONED (closed 2026-04-29). Sequencing is clean |

9. Why this is not the next loss-function attempt

The Track 5 + Track 6 lesson narrows specifically to training signal manipulation (label engineering + loss tuning). Track 9 is :

  • Post-training (operates on already-trained model outputs) — the model itself is unchanged
  • Calibration-tier (tier 5 of the F1 plan, distinct from tier 2 LABEL ENGINEERING + tier 3 LOSS FUNCTION)
  • Decision-rule level (changes when the model's p(BUY) triggers a BUY action, not how the model produces p(BUY))

If Track 9 also abandons, the lesson generalizes more strongly and the next pick should pivot to data tracks (Track 1 BTC features, Track 12 fractional diff) per the F1 plan §6 implication block. If Track 9 locks, calibration becomes a productive lever and Track 11 (ensemble + per-regime ensembles) becomes naturally aligned.

10. Cross-references

  • F1 plan §5 Track 9 : "Threshold optimization per-régime" + §6 sequencing
  • ADRs : ADR-25 (no silent fallback), ADR-32 (event=key=value structured logs), ADR-56 (every change FTF-testable), ADR-58 (every factor → guardrail + integration test), ADR-70 (MLOps readiness mandatory)
  • Existing infra : src/commun/trading/threshold_calibrator.py (will be extended), src/commun/regime/regime_detector.py (regime classifier reused as-is), src/commun/inference/inference_api.py::apply_thresholds (inference surface)
  • Sister Tracks : Track 5 results (ABANDON), Track 6 results (ABANDON, dossier landing in PR #788 at documentation/missions/ml-boost/2026-04-29-track6-focal-loss-results.md)
  • Production filter chain : architecture/FILTER_FUNNEL.md (Step 5 = regime filter, post-inference)
  • Design doc that introduced ThresholdCalibrator : design/CVN-N001-threshold-calibrator.md (committee 7371c57d + 825d2fdf)

11. Committee recommendations triage (post PASSED / EXECUTION_RISK)

| # | Recommendation | Source | Disposition |
| --- | --- | --- | --- |
| 1 | Runtime monitoring + confidence gating for regime detector | All experts | Apply at impl time : emit event=regime_classified regime=... confidence=... (the RegimeSnapshot already carries confidence). Add Grafana panel + alert if confidence < 0.6 for > 10 % of inferences over 1 h. Confidence < 0.5 falls back to global threshold (extends §4.1 fallback path) |
| 2 | Refined per-regime sample size guardrail (positive labels) | Data-Scientist + ML-Eng | Applied pre-impl in §4.1 step 3 : fallback condition is total < 30 OR positives < 5, both logged with explicit reasons |
| 3 | Pre-deployment regime stability validation | Architect + Ops | Apply at impl time : pre-merge check comparing regime frequencies on val set vs last 30 d of inference data, fail if any regime drifts by > 20 % proportionally. Extends make qa to call a new check_regime_stability.py script |
| 4 | Observability for expectancy-based rejections | Crypto-Trader | Applied pre-impl in §4.1 step 5 : event=regime_rejected reason=negative_expectancy ... emitted ; alert if any single regime is rejected on > 50 % of folds (suggests systemic issue, not noise) |
| 5 | Realistic costs in expectancy calculation | Crypto-Trader + Ops | Applied pre-impl in §4.1 step 5 : uses the F1 plan §4 cost formula (round-trip ≈ 45 bps interim, replaceable by Track 2 dynamic slippage when it lands) |
| 6 | Regime detector hardening as follow-up | All experts | Defer : already documented in §6 Out of scope as a post-LOCK Story. If Track 9 ABANDONS, this point becomes moot ; if it LOCKS, file a follow-up Story under CVN-N001-EE (next sprint) before live promotion |
| 7 | Monitor per-regime F1 stability + variance during sweep | Data-Scientist + ML-Eng | Apply at impl time : extend the FTF results dossier table to include per-regime f1_buy + variance per fold + per crypto. Operator reads these in the gate decision |
| 8 | Regime detector version pinning | Ops | Applied pre-impl in §4.1 step 4 : calibrator artefact records regime_detector_version ; mismatch raises RuntimeError per ADR-25 |
| 9 | Ablation for regime grouping (6 vs 3) | Data-Scientist + ML-Eng | Already covered : the variant matrix (§3) explicitly includes coarse_3regime as variant 5. The FTF sweep IS the ablation between 6 (per_regime_f1) and 3 (coarse_3regime) |
| 10 | Stress-test thresholds under regime transitions | Ops + ML-Eng | Apply at impl time : integration test test_track9_per_regime_threshold.py includes a stress case where 20 % of val samples are forced into adjacent regimes (simulates regime detector misclassification) ; assert f1_buy degrades gracefully (< 5 % loss vs clean classification) |
| 11 | Position concentration per regime | Crypto-Trader | Applied pre-impl in §5 : added as a supporting metric in the gate criteria block (not gating, flagged for post-deploy monitoring) |
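Recos #4 and #5 both concern the cost-aware expectancy fit of §4.1 step 5. A minimal sketch, assuming expectancy means the average net P&L per trade in bps, a single aggregated round-trip cost (the 45 bps interim figure standing in for fees + spread + slippage + funding), and a simple grid search. fit_expectancy_threshold is a hypothetical name.

```python
ROUND_TRIP_COST_BPS = 45.0  # interim assumption quoted in the plan


def fit_expectancy_threshold(probs, gross_pnl_bps, cost_bps=ROUND_TRIP_COST_BPS):
    """Return (threshold, expectancy_bps), or None if the regime is rejected.

    probs:         p(BUY) per val sample in one regime slice.
    gross_pnl_bps: gross P&L in bps if the sample had been traded (assumed input).
    """
    best = None
    for i in range(5, 96):
        threshold = i / 100
        net = [g - cost_bps for p, g in zip(probs, gross_pnl_bps) if p >= threshold]
        if not net:
            continue  # no trade taken at this threshold
        expectancy = sum(net) / len(net)
        if best is None or expectancy > best[1]:
            best = (threshold, expectancy)
    if best is None or best[1] < 0:
        # Even the best threshold loses money after costs: reject the regime.
        return None
    return best
```

A None return is the point at which event=regime_rejected reason=negative_expectancy would be emitted.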

Net effect on §4 implementation path : 4 recos applied directly (refined positive-sample guardrail #2, realistic costs in expectancy fit #5, regime detector version pinning #8, position concentration tracking #11) + 6 to apply at impl time (observability + alerts + stability checks + stress test) + 1 deferred to a follow-up Story (#6 regime detector hardening, contingent on Track 9 LOCKing). The EXECUTION_RISK code is acknowledged ; the production behavior is hardened end-to-end without scope expansion that would warrant a v2.
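The pre-merge stability check from reco #3 could look like the sketch below. It reads "drifts by > 20 % proportionally" as a relative change in each regime's frequency between the val set and the last 30 d of inference data ; that reading, and the name regime_frequency_drift, are assumptions. check_regime_stability.py does not exist yet.

```python
import math
from collections import Counter


def regime_frequencies(regimes):
    """Share of each regime code in a sequence of labels."""
    counts = Counter(regimes)
    total = sum(counts.values())
    return {regime: count / total for regime, count in counts.items()}


def regime_frequency_drift(val_regimes, live_regimes, max_relative_drift=0.20):
    """Return {regime: relative_drift} for regimes beyond the tolerance."""
    val = regime_frequencies(val_regimes)
    live = regime_frequencies(live_regimes)
    drifted = {}
    for regime in sorted(set(val) | set(live)):
        base = val.get(regime, 0.0)
        # A regime absent from the val set counts as infinite drift.
        relative = abs(live.get(regime, 0.0) - base) / base if base else math.inf
        if relative > max_relative_drift:
            drifted[regime] = relative
    return drifted
```

A non-empty result would fail the check ; a val set split 50/50 between two regimes against a live split of 70/30 gives a relative drift of about 0.4 on both.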


Question for the committee (v1 — see verdict above)

Validate the 5-variant FTF matrix (none, per_regime_f1, per_regime_expectancy, per_regime_f1_with_floor, coarse_3regime), the < 30-sample fallback to global threshold, and the per-regime calibrator extending the existing global ThresholdCalibrator artefact. Are there hidden modes (e.g. regimes that swap stability across training vs inference windows, regime-specific class imbalance that breaks the F1 fit, regime detector confidence below a threshold that should also gate the per-regime threshold) where Track 9 would silently produce wrong thresholds without visible alerts ?