Runbook : Per-regime threshold drift (Track 9, P2)
Severity : P2 (production drift, model behaviour change but no immediate trade halt)
Owner : @dococeven
Story : CVN-N001-EE-S03 (wp#42) · plan dossier 2026-04-29-track9-per-regime-threshold-plan.md
Linked code : src/commun/trading/per_regime_threshold_calibrator.py · src/commun/pipeline/inference_api.py::_resolve_thresholds
This runbook covers five symptoms specific to the per-regime threshold layer (Track 9). The kill-switch (Console flip to CVN_THRESHOLD_PER_REGIME=0) is the rollback for all of them ; no model retrain is required.
1. Symptom : per-regime threshold drifts > 2σ from training
Detection : Grafana panel cvntrade-track9-per-regime-threshold shows a 7-day rolling per-regime threshold > 2σ off the artefact baseline.
Likely causes :
- Regime taxonomy drift : the regime detector version was bumped silently. Check regime_detector.RegimeSnapshot.regime_version against the artefact's regime_detector_version field. ADR-25 fail-fast in from_metadata should have caught this ; if it didn't, file a P1 incident.
- Concept drift in trading regime : the model was trained on a market regime that's no longer representative. This is the expected drift mode.
- Bug in the threshold routing : a recent change to _resolve_thresholds introduced an off-by-one or a wrong key lookup.
Action :
- Pull the last 7 days of event=threshold_applied from Loki. Group by regime ; take the per-regime median.
- Compare each median against the per-regime calibrator artefact's fitted_default.
- If drift is on a single regime → likely concept drift. Schedule a re-fit (re-train + re-calibrate) for the affected crypto.
- If drift is on all regimes → likely a code or env regression. Console-flip CVN_THRESHOLD_PER_REGIME=0 to revert ; open a hot-fix Story.
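The comparison in the steps above can be sketched offline once the events are exported from Loki. This is a minimal sketch, not the project's tooling : the event keys "regime" and "threshold", the per-regime sigma, and both function names are assumptions made for illustration. The runbook only covers the single-regime and all-regimes cases ; the sketch returns a separate label for anything in between so it can be reviewed by hand.

```python
from statistics import median


def regime_drift_report(events, baselines, n_sigma=2.0):
    """Summarise per-regime drift as {regime: (median, z_score, drifted)}.

    events    : iterable of dicts with ASSUMED keys "regime" and "threshold"
    baselines : {regime: (fitted_default, sigma)} where sigma is an ASSUMED
                training-time spread of the threshold (must be > 0)
    """
    by_regime = {}
    for ev in events:
        by_regime.setdefault(ev["regime"], []).append(float(ev["threshold"]))

    report = {}
    for regime, values in by_regime.items():
        fitted_default, sigma = baselines[regime]  # KeyError on an unknown regime
        if sigma <= 0:
            raise ValueError(f"sigma must be positive for regime {regime!r}")
        med = median(values)
        z = (med - fitted_default) / sigma
        report[regime] = (med, z, abs(z) > n_sigma)
    return report


def triage(report):
    """Map the drift report onto the two cases the runbook distinguishes."""
    drifted = [r for r, (_, _, is_drifted) in report.items() if is_drifted]
    if not drifted:
        return "no_drift"
    if len(drifted) == 1:
        return "single_regime"    # likely concept drift : schedule a re-fit
    if len(drifted) == len(report):
        return "all_regimes"      # likely code or env regression : flip the flag
    return "several_regimes"      # not covered by the runbook : review manually


if __name__ == "__main__":
    sample_events = [
        {"regime": "bull", "threshold": 0.60},
        {"regime": "bull", "threshold": 0.62},
        {"regime": "bear", "threshold": 0.50},
    ]
    sample_baselines = {"bull": (0.55, 0.02), "bear": (0.50, 0.02)}
    print(triage(regime_drift_report(sample_events, sample_baselines)))
```

Using the median rather than the mean keeps a handful of outlier inferences from triggering the alert on their own.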
2. Symptom : low regime classifier confidence (committee reco #1)
Detection : Grafana cvntrade-track9-per-regime-threshold panel "Regime confidence distribution" shows confidence < 0.6 on > 10 % of inferences over 1 h. Loki query : {event="regime_classified"} | confidence < 0.6.
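What the system does automatically : _resolve_thresholds falls back to the global threshold when regime_confidence < self._regime_confidence_floor (default 0.5). Source attribution shows global_low_regime_confidence. Trading behaviour is therefore already protected for inferences below the floor, so this alert is informational, not critical. Note that the 0.6 alert level sits above the default floor : inferences with confidence between 0.5 and 0.6 still use the per-regime threshold.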
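The fallback rule can be illustrated with a small stand-alone function. This is a sketch of the behaviour described above, not the body of _resolve_thresholds : the function signature, the "per_regime" source label and the KeyError on an unknown regime are assumptions ; only the comparison against the floor and the global_low_regime_confidence label come from this runbook.

```python
def resolve_threshold(regime, regime_confidence, per_regime_thresholds,
                      global_threshold, confidence_floor=0.5):
    """Return (threshold, source) following the confidence-floor fallback.

    Hypothetical helper for illustration only.
    """
    if regime_confidence < confidence_floor:
        # Low-confidence regime call : ignore the regime, use the global value.
        return global_threshold, "global_low_regime_confidence"
    if regime not in per_regime_thresholds:
        # ASSUMPTION : fail loudly rather than fall back silently.
        raise KeyError(f"no fitted threshold for regime {regime!r}")
    return per_regime_thresholds[regime], "per_regime"  # label is an assumption


if __name__ == "__main__":
    fitted = {"bull": 0.62, "bear": 0.48}
    print(resolve_threshold("bull", 0.45, fitted, global_threshold=0.55))
    print(resolve_threshold("bull", 0.55, fitted, global_threshold=0.55))
```

The second call uses a confidence of 0.55, which is under the 0.6 alert level but above the default floor, so the per-regime value is still applied.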
When to act :
- If sustained > 24 h → the regime detector itself is unreliable in current market conditions. File a Story to investigate (likely regime_detector v2 work, contingent on Track 9 LOCK).
- If correlated with a specific crypto → regime feature pipeline issue for that asset (check aggregate_regime_features logs for that symbol).
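To tell a market-wide confidence drop from a single-asset pipeline issue, the share of low-confidence inferences can be computed per symbol from exported event=regime_classified records. A minimal sketch, assuming each record carries "symbol" and "confidence" keys (both key names and the function name are hypothetical) :

```python
from collections import defaultdict


def low_confidence_share(events, alert_level=0.6):
    """Return {symbol: share of inferences with confidence < alert_level}.

    events : iterable of dicts with ASSUMED keys "symbol" and "confidence".
    """
    totals = defaultdict(int)
    low = defaultdict(int)
    for ev in events:
        totals[ev["symbol"]] += 1
        if float(ev["confidence"]) < alert_level:
            low[ev["symbol"]] += 1
    return {symbol: low[symbol] / totals[symbol] for symbol in totals}


def symbols_over_budget(shares, max_share=0.10):
    """Symbols whose low-confidence share exceeds the 10 % alert budget."""
    return sorted(s for s, share in shares.items() if share > max_share)


if __name__ == "__main__":
    sample = [
        {"symbol": "BTC", "confidence": 0.9},
        {"symbol": "BTC", "confidence": 0.8},
        {"symbol": "ETH", "confidence": 0.4},
        {"symbol": "ETH", "confidence": 0.7},
    ]
    print(symbols_over_budget(low_confidence_share(sample)))
```

If only one or two symbols come back, the per-asset branch above applies ; if most symbols do, the detector itself is the more likely suspect.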
3. Symptom : missing per-regime artefact at inference (RuntimeError)
Detection : RuntimeError from InferenceAPI.__init__'s _enforce_per_regime_artefact_consistency OR from PerRegimeThresholdCalibrator.from_metadata.
Cause : the env flag CVN_THRESHOLD_PER_REGIME=1 is on but the model artefact doesn't carry per_regime_threshold_calibrator.json. Most common reason : the artefact was missed during MLflow upload, or a model trained pre-Track-9 was deployed alongside the new env flag.
Action :
- Check the model's MLflow artefacts list : does per_regime_threshold_calibrator.json exist alongside threshold_calibrator.json ?
- If yes → check the MLflow client load path ; the artefact may be present but not loaded into InferenceAPI. Inspect the model registry → Console mapping.
- If no → re-train the model with the current code so the artefact is generated, OR Console-flip CVN_THRESHOLD_PER_REGIME=0 to revert to global threshold (ADR-25 fail-fast contract is doing its job).
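The consistency rule behind this error can be sketched as a pure function over the artefact paths copied from the MLflow UI. This is an illustration of the fail-fast contract, not the code of _enforce_per_regime_artefact_consistency : the function names, the returned labels and the treatment of an unset flag as OFF are assumptions.

```python
from pathlib import PurePosixPath

PER_REGIME_ARTEFACT = "per_regime_threshold_calibrator.json"


def per_regime_flag_enabled(env):
    """Read the kill-switch from a mapping such as os.environ.

    ASSUMPTION : an unset flag is treated as OFF.
    """
    return env.get("CVN_THRESHOLD_PER_REGIME", "0") == "1"


def check_artefact_consistency(artifact_paths, env):
    """Raise if the flag is ON but the per-regime artefact is absent."""
    if not per_regime_flag_enabled(env):
        return "global_threshold_path"
    names = {PurePosixPath(p).name for p in artifact_paths}
    if PER_REGIME_ARTEFACT not in names:
        raise RuntimeError(
            f"CVN_THRESHOLD_PER_REGIME=1 but {PER_REGIME_ARTEFACT} is missing"
        )
    # Artefact exists : if inference still fails, inspect the load path.
    return "artefact_present"


if __name__ == "__main__":
    paths = ["model/threshold_calibrator.json",
             "model/per_regime_threshold_calibrator.json"]
    print(check_artefact_consistency(paths, {"CVN_THRESHOLD_PER_REGIME": "1"}))
```

Passing the environment as a mapping keeps the check testable without touching the real process environment.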
4. Symptom : event=regime_rejected reason=negative_expectancy on > 50 % of folds (committee reco #4)
Detection : Loki query {event="regime_rejected", reason="negative_expectancy"} aggregated over the last 5 folds for a single regime.
What it means : a regime in the per_regime_expectancy variant fits no profitable threshold given the F1 plan §4 cost formula (round-trip ≈ 45 bps). This is systemic, not noise — the model has no edge for that regime under realistic costs.
Action :
- Don't silently drop the regime ; it tells us something. Compare against the same regime's per-regime f1 fit : if per_regime_f1 finds a usable threshold but per_regime_expectancy rejects, the issue is cost-margin, not classification.
- If per_regime_f1 also fails for the same regime → the regime itself has no signal. Either (a) regime taxonomy needs revisiting (committee reco #6, regime detector hardening), or (b) the model needs cost-aware features for that regime. File a Story.
- If only the expectancy variant fails → operator has the choice : keep per_regime_f1 as the locked variant (skip expectancy), or accept a few rejected regimes and rely on the fallback path (per-regime calibrator returns global for rejected regimes).
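The detection rule and the comparison between the two variants can be written down as a short triage helper. A sketch under stated assumptions : the per-fold rejection flags are taken as a list of booleans, and the function names and returned labels are hypothetical ; the > 50 % trigger and the three outcomes follow the text above.

```python
def rejection_share(fold_rejected):
    """Share of folds in which the regime was rejected (list of booleans)."""
    if not fold_rejected:
        raise ValueError("need at least one fold")
    return sum(bool(x) for x in fold_rejected) / len(fold_rejected)


def triage_rejected_regime(fold_rejected, f1_variant_has_threshold,
                           trigger_share=0.5):
    """Classify a regime rejected by the expectancy variant.

    f1_variant_has_threshold : whether per_regime_f1 found a usable
    threshold for the same regime.
    """
    if rejection_share(fold_rejected) <= trigger_share:
        return "below_trigger"        # not systemic yet : keep watching
    if f1_variant_has_threshold:
        return "cost_margin_issue"    # classification works, costs eat the edge
    return "no_signal"                # neither variant fits : file a Story


if __name__ == "__main__":
    folds = [True, True, True, False, False]  # rejected in 3 of 5 folds
    print(triage_rejected_regime(folds, f1_variant_has_threshold=True))
```

For a "cost_margin_issue" outcome, the choice between locking per_regime_f1 and relying on the global fallback remains an operator decision ; the helper only separates the two diagnoses.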
5. Symptom : per-regime f1 stability degraded (gate 3 violation)
Detection : Per-fold f1_buy variance > 0.05 across the 5 folds for any regime ; surfaced in the FTF results dossier table (committee reco #7).
Action :
- This is a gate failure, not a runtime alert ; it blocks the lock decision in §6 of F1 plan.
- If the failing regime has < 100 samples per fold → bump the FTF run to more cryptos / longer windows ; re-evaluate.
- If the failing regime has plenty of samples but high variance → the regime is too fine-grained for the model. Switch to coarse_3regime variant.
- If coarse_3regime also has high variance → ABANDON the Track. The signal isn't there.
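The gate and its follow-up branches can be expressed as one decision function. This is a sketch, not the FTF tooling : it assumes the population variance of f1_buy across folds is the quantity compared against 0.05 (the runbook does not say whether population or sample variance is meant), and the function name, the variant argument and the returned labels are hypothetical.

```python
from statistics import pvariance


def gate3_decision(f1_buy_by_fold, samples_by_fold, variant="per_regime",
                   max_variance=0.05, min_samples=100):
    """Decide the next step for one regime after a 5-fold FTF run.

    ASSUMPTION : population variance across folds is the gate metric.
    """
    if pvariance(f1_buy_by_fold) <= max_variance:
        return "pass"
    if min(samples_by_fold) < min_samples:
        return "extend_ftf_run"            # too few samples to judge
    if variant == "coarse_3regime":
        return "abandon_track"             # coarse taxonomy is unstable too
    return "switch_to_coarse_3regime"      # regime is too fine-grained


if __name__ == "__main__":
    unstable_f1 = [0.1, 0.9, 0.2, 0.8, 0.5]
    print(gate3_decision(unstable_f1, [500] * 5))
    print(gate3_decision(unstable_f1, [500] * 5, variant="coarse_3regime"))
```

The sample-count check runs before the variant check, so a thin regime is always re-evaluated with more data before any taxonomy change or abandonment is considered.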
Rollback decision tree
Track 9 production behaviour wrong ?
│
├── Was the change in env config ? → Console flip CVN_THRESHOLD_PER_REGIME=0 (instant revert)
│
├── Is it a code bug (routing, parsing) ? → Hot-fix PR ; artefact stays valid
│
└── Is it a model / regime drift ? → Re-train + re-calibrate (1 sprint)
                                     OR Console flip per_regime_floor up (less aggressive)
The kill-switch is symmetric : every variant lives under one env flag. Flipping it OFF reverts cleanly to the pre-Track-9 global threshold path.
Linked context
- Plan dossier : 2026-04-29-track9-per-regime-threshold-plan.md
- MLOps readiness : stories/CVN-N001-EE-S03/mlops_readiness.md
- Calibrator source : src/commun/trading/per_regime_threshold_calibrator.py
- ADR-25 (no silent fallback) : fail-fast contract for missing artefact + version drift
- ADR-32 (event=key=value structured logs) : every threshold decision emits an event
- ADR-59 (Console-only param flips) : lock / unlock the variant