MLOps readiness · CVN-N001-EE-S03 · Per-regime threshold (Track 9, F1_buy boost)
Story: CVN-N001-EE-S03 (wp#42) · GH issue #714
Owner: @dococeven (DRI for production behaviour of this change)
Filled on: 2026-04-30
Reviewed by committee: session c560b67a (plan_review PASSED EXECUTION_RISK ; verdict OK with 11 recos triaged in plan dossier §11)
1. Production monitoring (MUST)
| Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|
| `event=threshold_applied source=per_regime\|global_no_regime\|global_unknown_regime\|global_floor\|global_low_regime_confidence` | counter | `InferenceAPI._resolve_thresholds` (Loki) | Grafana `cvntrade-track9-per-regime-threshold` panel "Threshold source distribution" | `per_regime` < 50 % over 1 h → warn ; < 20 % → crit (means most inferences fall back to global, defeating the Track) | @dococeven |
| `event=per_regime_threshold_fallback regime=... reason=insufficient_samples\|insufficient_positives\|negative_expectancy` | counter (training-time) | `PerRegimeThresholdCalibrator.fit` (Loki) | Grafana panel "Fallback reasons" | any single reason on > 3 of 5 folds → warn (suggests systemic issue, not noise ; committee reco #4) | @dococeven |
| `event=regime_classified regime=... confidence=...` (committee reco #1) | gauge / counter | existing `regime_detector.classify_regime` Loki line + new emission in inference path | Grafana panel "Regime confidence distribution" | confidence < 0.6 over > 10 % of inferences over 1 h → warn | @dococeven |
| `f1_buy` per fold per regime (committee reco #7) | gauge | FTF results dossier table (extended) | offline analysis ; not a Grafana panel | per-fold variance > 0.05 → ABANDON variant (gate criterion 3) | operator |
| `inference_latency_p99` (existing) | histogram | OpenTelemetry span on `predict_single` | existing Grafana `cvntrade-inference-latency` | p99 +50 % vs pre-Track-9 baseline → warn (regime classification adds work) | @dococeven |
Required minima covered :
- ✅ prediction-rate metric : `signals.buy_proba` distribution (existing) + new `event=threshold_applied` for source attribution
- ✅ outcome metric : `f1_buy` per fold per regime + `expectancy_net_realized` (existing)
- ✅ health metric : `event=per_regime_threshold_fallback` + existing `inference_latency_p99`
All metrics tagged with FTF factor / variant id (per_regime_f1 / per_regime_expectancy / per_regime_f1_with_floor / coarse_3regime / none) per ADR-30.
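The warn / crit rule on the threshold source distribution can be sketched as below. This is a minimal illustration only : it assumes the `source=` values of `event=threshold_applied` have already been collected for a 1 h window, and the function name `per_regime_alert_level` is hypothetical (the real alert presumably lives in Grafana, not in Python).

```python
from collections import Counter

WARN_SHARE = 0.50  # per_regime share below this over 1 h -> warn
CRIT_SHARE = 0.20  # per_regime share below this over 1 h -> crit


def per_regime_alert_level(sources):
    """Return 'ok', 'warn' or 'crit' from a 1 h window of `source=` values."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        # Assumption: an empty window raises no alert in this sketch.
        return "ok"
    share = counts["per_regime"] / total
    if share < CRIT_SHARE:
        return "crit"
    if share < WARN_SHARE:
        return "warn"
    return "ok"


# 4 of 10 inferences used a per-regime threshold: 40 % share, below the warn line.
window = ["per_regime"] * 4 + ["global_floor"] * 3 + ["global_unknown_regime"] * 3
print(per_regime_alert_level(window))  # warn
```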
2. Alerting & runbooks (MUST)
- Runbook P2 : `runbook_per_regime_threshold_drift.md` handles drift > 2σ from training threshold over 7 d (committee reco #6 backstop) + low regime confidence + missing artefact at inference + negative expectancy on > 50 % of folds.
- Alerts :
  - `event=per_regime_threshold_fallback reason=insufficient_positives` fires on > 50 % of folds for any regime → P2 alert routed to `@dococeven`
  - `event=regime_classified confidence < 0.6` fires on > 10 % of inferences over 1 h → P2 alert
  - `event=regime_rejected reason=negative_expectancy` fires on > 50 % of folds for any regime → P2 alert (committee reco #4)
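A possible reading of the "> 2σ from training threshold" rule, as a hedged sketch. This document does not define σ ; the code assumes it is the sample standard deviation of one regime's calibrated threshold across the training folds. The helper name `threshold_drifted` and the fold values are illustrative, not project data.

```python
from statistics import mean, stdev


def threshold_drifted(training_thresholds, rolling_threshold, n_sigma=2.0):
    """True when the rolling 7 d threshold sits more than n_sigma from training.

    Assumption: sigma is the sample std dev of the per-fold training
    thresholds for one regime (needs at least 2 folds).
    """
    mu = mean(training_thresholds)
    sigma = stdev(training_thresholds)
    return abs(rolling_threshold - mu) > n_sigma * sigma


# Illustrative thresholds for one regime across 5 folds (mean 0.53).
folds = [0.52, 0.55, 0.50, 0.53, 0.55]
print(threshold_drifted(folds, 0.54))  # False: within 2 sigma of the mean
print(threshold_drifted(folds, 0.60))  # True: well outside 2 sigma
```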
3. Drift detection (MUST)
| Drift type | Detection method | Threshold | Action |
|---|---|---|---|
| Per-regime threshold drift | rolling 7d comparison vs `regime_detector_version` artefact baseline (Track 9 dedicated) | per-regime threshold drifts > 2σ from training | runbook §3 : investigate or rollback |
| Regime frequency drift | proportion of inferences per regime, 7d window vs val-set baseline | proportional drift > 20 % on any regime | runbook §4 : pre-deploy stability validation (committee reco #3) |
| Class distribution drift (existing) | PSI on `y_true` per fold | PSI > 0.2 | existing playbook |
| `regime_detector_version` mismatch | pin enforcement at MLflow load time (`PerRegimeThresholdCalibrator.from_metadata`) | strict equality | `RuntimeError` per ADR-25 (committee reco #8) |
| Cross-regime f1_buy variance | per-regime f1 in FTF results dossier | per-fold variance > 0.05 | gate 3 of F1 plan §6 : block lock |
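Two of the checks in the table lend themselves to a short sketch : the PSI on class distribution and the regime frequency drift. The code is illustrative only. It reads "proportional drift > 20 %" as a relative change against the val-set baseline share, which is an assumption (an absolute change in percentage points is another possible reading). The regime names and function names are hypothetical.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two proportion vectors."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Clamp to eps so empty bins do not produce log(0) or division by zero.
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def regime_frequency_drift(baseline, current, limit=0.20):
    """Return regimes whose share moved by more than `limit`, relative to baseline."""
    drifted = {}
    for regime, base_share in baseline.items():
        if base_share <= 0:
            continue  # assumption: regimes absent from the baseline are skipped
        rel = abs(current.get(regime, 0.0) - base_share) / base_share
        if rel > limit:
            drifted[regime] = rel
    return drifted


# Hypothetical regime shares: val-set baseline vs a 7 d inference window.
baseline = {"bull": 0.50, "bear": 0.30, "range": 0.20}
current = {"bull": 0.45, "bear": 0.30, "range": 0.25}
print(regime_frequency_drift(baseline, current))  # only "range" moved by > 20 %

print(psi([0.5, 0.5], [0.8, 0.2]) > 0.2)  # True: above the PSI > 0.2 line
```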
4. Staged rollout (MUST)
| Stage | Surface | Duration | Gate |
|---|---|---|---|
| 1 | FTF sweep on defi_top5 (5 cryptos × 5 folds × 5 variants = 125 rows) | run-completion | every gate of F1 plan §6 must clear → operator decision lock / keep available / abandon |
| 2 | Shadow inference on BTCUSDC paper trading, 7 days | 7 d | ≤ 5 % delta on cumulative P&L vs baseline ; no `RuntimeError` from artefact loading |
| 3 | Operator Console flip `CVN_THRESHOLD_PER_REGIME=1` for 1 crypto live (BTCUSDC) | 7 d | f1_buy ≥ baseline + 0.015 ; max_drawdown ≤ baseline + 1 % |
| 4 | Rollout to all 5 defi-top5 cryptos | continuous | quarterly drift review per §3 |
Per ADR-59, the lock decision is a Console flip (no code change). The artefact stays per-version in MLflow ; rolling back the model also rolls back the per-regime calibrator.
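The Stage 3 gate can be expressed as a small check, sketched here under assumptions : `max_drawdown` is taken as a positive fraction and "+ 1 %" as one percentage point (0.01). The `StageMetrics` type and `stage3_gate_passes` function are hypothetical names, and the numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMetrics:
    f1_buy: float
    max_drawdown: float  # assumed positive fraction, e.g. 0.08 = 8 %


def stage3_gate_passes(candidate, baseline, min_f1_gain=0.015, max_dd_increase=0.01):
    """Stage 3 gate: f1_buy >= baseline + 0.015 and max_drawdown <= baseline + 1 %."""
    f1_ok = candidate.f1_buy >= baseline.f1_buy + min_f1_gain
    dd_ok = candidate.max_drawdown <= baseline.max_drawdown + max_dd_increase
    return f1_ok and dd_ok


baseline = StageMetrics(f1_buy=0.40, max_drawdown=0.080)
print(stage3_gate_passes(StageMetrics(0.42, 0.085), baseline))  # True
print(stage3_gate_passes(StageMetrics(0.41, 0.085), baseline))  # False: f1 gain too small
```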
5. Rollback plan (MUST)
| Symptom | Action | Reversal latency |
|---|---|---|
| `RuntimeError` from `PerRegimeThresholdCalibrator.from_metadata` (schema or regime_version drift) | Console flip `CVN_THRESHOLD_PER_REGIME=0` (ADR-59) ; fully reverts to global threshold, no model retrain | < 1 minute |
| Production f1_buy regression > 0.02 over 7 d | same Console flip | < 1 minute |
| Per-regime calibrator drift alert (P2) over 14 d | flip the variant via Console UI (`per_regime_f1_with_floor` instead of `per_regime_f1`) to dampen the deviation | < 1 minute |
| Bug in calibrator code itself (parser drift, off-by-one in routing) | hot-fix PR + redeploy console-next (no model retrain needed since artefact stays valid) | < 1 hour |
| Regime detector hardening required (committee reco #6, contingent on Track 9 LOCK) | follow-up Story under CVN-N001-EE next sprint ; old behaviour available as fallback | ~1 sprint |
The rollback is symmetric : every variant lives under one env flag (CVN_THRESHOLD_PER_REGIME) plus optional method/floor/grouping overrides ; flipping the master flag to 0 reverts cleanly.
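A minimal sketch of how the master flag could gate threshold resolution. It is not the actual `InferenceAPI._resolve_thresholds` code : the function signature, the default of "off" when the variable is unset, and the `global_flag_off` label are all assumptions made for illustration. The other source labels are the ones listed in §1.

```python
import os


def resolve_threshold(regime, global_threshold, per_regime_thresholds, env=None):
    """Pick a decision threshold; with the master flag off, always the global one."""
    env = os.environ if env is None else env
    # Assumption: an unset flag behaves like CVN_THRESHOLD_PER_REGIME=0.
    if env.get("CVN_THRESHOLD_PER_REGIME", "0") != "1":
        return global_threshold, "global_flag_off"  # hypothetical label
    if regime is None:
        return global_threshold, "global_no_regime"
    if regime not in per_regime_thresholds:
        return global_threshold, "global_unknown_regime"
    return per_regime_thresholds[regime], "per_regime"


# Hypothetical calibrated thresholds keyed by regime name.
thresholds = {"bull": 0.62, "bear": 0.48}
print(resolve_threshold("bull", 0.55, thresholds, env={"CVN_THRESHOLD_PER_REGIME": "1"}))
print(resolve_threshold("bull", 0.55, thresholds, env={"CVN_THRESHOLD_PER_REGIME": "0"}))
```

Passing `env` explicitly keeps the sketch testable without touching the process environment.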
6. Owner & DRI (MUST)
- DRI : `@dococeven`
- Backup : `@cvntrade-ml`
- Escalation : `@cvntrade-architect` (architectural drift) ; `@cvntrade-ops` (production incident impacting kill-switch / SL/TP behaviour)
Sign-off checklist (gate before PR merge)
- §1-§6 all complete
- Plan dossier `2026-04-29-track9-per-regime-threshold-plan.md` PASSED committee `plan_review` (session `c560b67a`)
- Runbook `runbook_per_regime_threshold_drift.md` lands in this PR
- All 6 official gates of F1 plan §6 met OR explicit `keep available` / `abandon` verdict in the results dossier ; checked at FTF sweep completion, post-merge (Story closes only after the operator decision per workflow §2.5)
- Expert Committee `pr_review` PASSED ; runs against this PR before merge (mandatory per ADR-68 for substantial ML changes)