Skip to content

Runbook — Per-regime threshold drift (Track 9, P2)

Severity : P2 (production drift, model behaviour change but no immediate trade halt) Owner : @dococeven Story : CVN-N001-EE-S03 (wp#42) · plan dossier 2026-04-29-track9-per-regime-threshold-plan.md Linked code : src/commun/trading/per_regime_threshold_calibrator.py · src/commun/pipeline/inference_api.py::_resolve_thresholds

This runbook covers four drift symptoms specific to the per-regime threshold layer (Track 9). The kill-switch (CVN_THRESHOLD_PER_REGIME=0 Console flip) is the rollback for all of them — no model retrain is required.


1. Symptom : per-regime threshold drifts > 2σ from training

Detection : Grafana panel cvntrade-track9-per-regime-threshold shows a 7-day rolling per-regime threshold > 2σ off the artefact baseline.

Likely causes :

  1. Regime taxonomy drift — the regime detector version was bumped silently. Check regime_detector.RegimeSnapshot.regime_version against the artefact's regime_detector_version field. ADR-25 fail-fast in from_metadata should have caught this — if it didn't, file a P1 incident.
  2. Concept drift in trading regime — the model was trained on a market regime that's no longer representative. This is the expected drift mode.
  3. Bug in the threshold routing — a recent change to _resolve_thresholds introduced an off-by-one or a wrong key lookup.

Action :

  1. Pull last 7 days of event=threshold_applied from Loki. Group by regime ; take per-regime median.
  2. Compare each median against the per-regime calibrator artefact's fitted_default.
  3. If drift is on a single regime → likely concept drift. Schedule a re-fit (re-train + re-calibrate) for the affected crypto.
  4. If drift is on all regimes → likely a code or env regression. Console-flip CVN_THRESHOLD_PER_REGIME=0 to revert ; open a hot-fix Story.

2. Symptom : low regime classifier confidence (committee reco #1)

Detection : Grafana cvntrade-track9-per-regime-threshold panel "Regime confidence distribution" shows confidence < 0.6 on > 10 % of inferences over 1 h. Loki query : {event="regime_classified"} | confidence < 0.6.

What the system does automatically : _resolve_thresholds falls back to the global threshold when regime_confidence < self._regime_confidence_floor (default 0.5). Source attribution shows global_low_regime_confidence. So the trading behaviour is already protected — this alert is informational, not critical.

When to act :

  1. If sustained > 24 h → the regime detector itself is unreliable in current market conditions. File a Story to investigate (likely regime_detector v2 work, contingent on Track 9 LOCK).
  2. If correlated with a specific crypto → regime feature pipeline issue for that asset (check aggregate_regime_features logs for that symbol).

3. Symptom : missing per-regime artefact at inference (RuntimeError)

Detection : RuntimeError from InferenceAPI.__init__'s _enforce_per_regime_artefact_consistency OR from PerRegimeThresholdCalibrator.from_metadata.

Cause : the env flag CVN_THRESHOLD_PER_REGIME=1 is on but the model artefact doesn't carry per_regime_threshold_calibrator.json. Most common reason : the artefact was missed during MLflow upload, or a model trained pre-Track-9 was deployed alongside the new env flag.

Action :

  1. Check the model's MLflow artefacts list — does per_regime_threshold_calibrator.json exist alongside threshold_calibrator.json ?
  2. If yes → check the MLflow client load path ; the artefact may be present but not loaded into InferenceAPI. Inspect the model registry → Console mapping.
  3. If no → re-train the model with the current code so the artefact is generated, OR Console-flip CVN_THRESHOLD_PER_REGIME=0 to revert to global threshold (ADR-25 fail-fast contract is doing its job).

4. Symptom : event=regime_rejected reason=negative_expectancy on > 50 % of folds (committee reco #4)

Detection : Loki query {event="regime_rejected", reason="negative_expectancy"} aggregated over the last 5 folds for a single regime.

What it means : a regime in the per_regime_expectancy variant fits no profitable threshold given the F1 plan §4 cost formula (round-trip ≈ 45 bps). This is systemic, not noise — the model has no edge for that regime under realistic costs.

Action :

  1. Don't silently drop the regime — it tells us something. Compare against the same regime's per-regime f1 fit : if per_regime_f1 finds a usable threshold but per_regime_expectancy rejects, the issue is cost-margin, not classification.
  2. If per_regime_f1 also fails for the same regime → the regime itself has no signal. Either (a) regime taxonomy needs revisiting (committee reco #6 — regime detector hardening), or (b) the model needs cost-aware features for that regime. File a Story.
  3. If only the expectancy variant fails → operator has the choice : keep per_regime_f1 as the locked variant (skip expectancy), or accept a few rejected regimes and rely on the fallback path (per-regime calibrator returns global for rejected regimes).

5. Symptom : per-regime f1 stability degraded (gate 3 violation)

Detection : Per-fold f1_buy variance > 0.05 across the 5 folds for any regime ; surfaced in the FTF results dossier table (committee reco #7).

Action :

  1. This is a gate failure, not a runtime alert — it blocks the lock decision in §6 of F1 plan.
  2. If the failing regime has < 100 samples per fold → bump the FTF run to more cryptos / longer windows ; re-evaluate.
  3. If the failing regime has plenty of samples but high variance → the regime is too fine-grained for the model. Switch to coarse_3regime variant.
  4. If coarse_3regime also has high variance → ABANDON the Track. The signal isn't there.

Rollback decision tree

Track 9 production behaviour wrong ?
├── Was the change in env config ? → Console flip CVN_THRESHOLD_PER_REGIME=0 (instant revert)
├── Is it a code bug (routing, parsing) ? → Hot-fix PR ; artefact stays valid
└── Is it a model / regime drift ? → Re-train + re-calibrate (1 sprint)
                                     OR Console flip per_regime_floor up (less aggressive)

The kill-switch is symmetric : every variant lives under one env flag. Flipping it OFF reverts cleanly to the pre-Track-9 global threshold path.


Linked context