MLOps readiness — `CVN-N001-EE-S03 — Per-regime threshold (Track 9, F1_buy boost)`

Story: CVN-N001-EE-S03 (wp#42) · GH issue #714
Owner: @dococeven (DRI for production behaviour of this change)
Filled on: 2026-04-30
Reviewed by committee: session c560b67a (plan_review PASSED EXECUTION_RISK ; verdict OK with 11 recos triaged in plan dossier §11)

1. Production monitoring (MUST)

Metric	Type	Source	Dashboard	Threshold (warn / crit)	Owner
`event=threshold_applied source=per_regime|global_no_regime|global_unknown_regime|global_floor|global_low_regime_confidence`	counter	`InferenceAPI._resolve_thresholds` (Loki)	Grafana `cvntrade-track9-per-regime-threshold` panel "Threshold source distribution"	per_regime < 50 % over 1 h → warn ; < 20 % → crit (means most inferences fall back to global, defeating the Track)	`@dococeven`
`event=per_regime_threshold_fallback regime=... reason=insufficient_samples|insufficient_positives|negative_expectancy`	counter (training-time)	`PerRegimeThresholdCalibrator.fit` (Loki)	Grafana panel "Fallback reasons"	any single reason on > 3 of 5 folds → warn (suggests systemic issue, not noise ; committee reco #4)	`@dococeven`
`event=regime_classified regime=... confidence=...` (committee reco #1)	gauge / counter	existing `regime_detector.classify_regime` Loki line + new emission in inference path	Grafana panel "Regime confidence distribution"	confidence < 0.6 over > 10 % of inferences over 1 h → warn	`@dococeven`
`f1_buy` per fold per regime (committee reco #7)	gauge	FTF results dossier table (extended)	offline analysis ; not a Grafana panel	per-fold variance > 0.05 → ABANDON variant (gate criterion 3)	operator
`inference_latency_p99` (existing)	histogram	OpenTelemetry span on `predict_single`	existing Grafana `cvntrade-inference-latency`	p99 +50 % vs pre-Track-9 baseline → warn (regime classification adds work)	`@dococeven`

Required minima covered :

✅ prediction-rate metric — `signals.buy_proba` distribution (existing) + new `event=threshold_applied` for source attribution
✅ outcome metric — `f1_buy` per fold per regime + `expectancy_net_realized` (existing)
✅ health metric — `event=per_regime_threshold_fallback` + existing `inference_latency_p99`

All metrics are tagged with the FTF factor / variant id (`per_regime_f1` / `per_regime_expectancy` / `per_regime_f1_with_floor` / `coarse_3regime` / `none`) per ADR-30.
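The warn / crit rule on the `event=threshold_applied` counter can be illustrated with a minimal sketch. It assumes the 1 h window has already been aggregated into per-source counts; the helper name `per_regime_share_status` and the `no_data` status are hypothetical and not part of the codebase described here.

```python
from collections import Counter

# Thresholds from the monitoring table: a per_regime share below 50 % warns,
# below 20 % is critical.
WARN_SHARE = 0.50
CRIT_SHARE = 0.20


def per_regime_share_status(source_counts: Counter) -> str:
    """Classify one 1 h window of `event=threshold_applied` counts (hypothetical helper)."""
    total = sum(source_counts.values())
    if total == 0:
        # Assumption: an empty window is reported separately instead of alerting.
        return "no_data"
    share = source_counts.get("per_regime", 0) / total
    if share < CRIT_SHARE:
        return "crit"
    if share < WARN_SHARE:
        return "warn"
    return "ok"


# Illustrative counts only, not measured data.
window = Counter(
    {"per_regime": 30, "global_no_regime": 50, "global_low_regime_confidence": 20}
)
print(per_regime_share_status(window))
```

In production the same ratio would more likely be computed by the dashboard query itself; the sketch only makes the comparison order explicit (crit is checked before warn).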

2. Alerting & runbooks (MUST)

Runbook P2 : runbook_per_regime_threshold_drift.md — handles drift > 2σ from training threshold over 7 d (committee reco #6 backstop) + low regime confidence + missing artefact at inference + negative expectancy on > 50 % of folds.
Alerts :
`event=per_regime_threshold_fallback reason=insufficient_positives` on > 50 % of folds for any regime → P2 alert routed to @dococeven
`event=regime_classified` with confidence < 0.6 on > 10 % of inferences over 1 h → P2 alert
`event=regime_rejected reason=negative_expectancy` on > 50 % of folds for any regime → P2 alert (committee reco #4)
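The two fold-based alert conditions share one shape: count the distinct folds on which a given reason occurred for a regime, then compare that fraction to 50 %. A minimal sketch follows, assuming the training-time log lines have been parsed into `(fold, regime, reason)` tuples; the function name and the regime labels `bull` / `bear` are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def regimes_over_fold_limit(
    events: Iterable[tuple[int, str, str]],
    n_folds: int,
    reason: str,
    max_fraction: float = 0.5,
) -> list[str]:
    """Return regimes where `reason` occurred on more than `max_fraction` of folds."""
    folds_by_regime: dict[str, set[int]] = defaultdict(set)
    for fold, regime, event_reason in events:
        if event_reason == reason:
            # A set de-duplicates repeated log lines from the same fold.
            folds_by_regime[regime].add(fold)
    return sorted(
        regime
        for regime, folds in folds_by_regime.items()
        if len(folds) / n_folds > max_fraction
    )


# Illustrative events only: (fold index, regime, reason).
events = [
    (0, "bull", "insufficient_positives"),
    (1, "bull", "insufficient_positives"),
    (2, "bull", "insufficient_positives"),
    (0, "bear", "insufficient_positives"),
    (3, "bear", "negative_expectancy"),
]
print(regimes_over_fold_limit(events, n_folds=5, reason="insufficient_positives"))
```

With 5 folds, the strict `>` comparison means a reason must appear on at least 3 folds before the alert condition holds.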

3. Drift detection (MUST)

Drift type	Detection method	Threshold	Action
Per-regime threshold drift	rolling 7d comparison vs `regime_detector_version` artefact baseline (Track 9 dedicated)	per-regime threshold drifts > 2σ from training	runbook §3 — investigate or rollback
Regime frequency drift	proportion of inferences per regime, 7d window vs val-set baseline	proportional drift > 20 % on any regime	runbook §4 — pre-deploy stability validation (committee reco #3)
Class distribution drift (existing)	PSI on `y_true` per fold	PSI > 0.2	existing playbook
`regime_detector_version` mismatch	pin enforcement at MLflow load time (`PerRegimeThresholdCalibrator.from_metadata`)	strict equality	RuntimeError per ADR-25 (committee reco #8)
Cross-regime f1_buy variance	per-regime f1 in FTF results dossier	per-fold variance > 0.05	gate 3 of F1 plan §6 — block lock
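The first two rows of the table leave some definitions open, so the sketch below fixes them as explicit assumptions: σ is taken as the sample standard deviation of the per-fold training thresholds for a regime, and "proportional drift" is read as the relative change in a regime's share of inferences. Both function names are hypothetical.

```python
from statistics import mean, stdev


def threshold_drifted(
    training_thresholds: list[float], live_threshold: float, n_sigma: float = 2.0
) -> bool:
    """True if the live threshold lies more than n_sigma from the training mean.

    Assumption: sigma is the sample standard deviation across training folds,
    so at least two fold thresholds are required.
    """
    mu = mean(training_thresholds)
    sigma = stdev(training_thresholds)
    return abs(live_threshold - mu) > n_sigma * sigma


def regime_frequency_drift(
    baseline_share: dict[str, float],
    live_share: dict[str, float],
    max_relative: float = 0.20,
) -> dict[str, float]:
    """Return regimes whose share moved by more than max_relative vs the baseline."""
    drifted = {}
    for regime, base in baseline_share.items():
        if base == 0:
            # Relative drift is undefined for a regime absent from the baseline.
            continue
        relative = abs(live_share.get(regime, 0.0) - base) / base
        if relative > max_relative:
            drifted[regime] = relative
    return drifted


# Illustrative numbers only.
fold_thresholds = [0.50, 0.52, 0.48, 0.51, 0.49]
print(threshold_drifted(fold_thresholds, live_threshold=0.55))
print(
    regime_frequency_drift(
        {"bull": 0.5, "bear": 0.3, "range": 0.2},
        {"bull": 0.45, "bear": 0.30, "range": 0.25},
    )
)
```

If the runbook intends an absolute (percentage-point) drift instead of a relative one, only the `relative` expression changes.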

4. Staged rollout (MUST)

Stage	Surface	Duration	Gate
1	FTF sweep on `defi_top5` (5 cryptos × 5 folds × 5 variants = 125 rows)	run-completion	every gate of F1 plan §6 must clear → operator decision `lock` / `keep available` / `abandon`
2	Shadow inference on `BTCUSDC` paper trading, 7 days	7 d	≤ 5 % delta on cumulative P&L vs baseline ; no `RuntimeError` from artefact loading
3	Operator Console flip `CVN_THRESHOLD_PER_REGIME=1` for 1 crypto live (`BTCUSDC`)	7 d	f1_buy ≥ baseline + 0.015 ; max_drawdown ≤ baseline + 1 %
4	Rollout to all 5 defi-top5 cryptos	continuous	quarterly drift review per §3

Per ADR-59, the lock decision is a Console flip (no code change). The artefact stays per-version in MLflow ; rolling back the model also rolls back the per-regime calibrator.
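The numeric gates for stages 2 and 3 can be written as plain predicates. This is a sketch under stated assumptions: the P&L delta is read as a relative absolute difference, `max_drawdown` is a positive fraction so that "+ 1 %" means one percentage point, and `RunStats`, `stage2_gate` and `stage3_gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    f1_buy: float
    max_drawdown: float  # positive fraction, e.g. 0.08 for 8 % (assumption)
    cumulative_pnl: float


def stage2_gate(baseline: RunStats, shadow: RunStats, artefact_load_errors: int) -> bool:
    """Shadow stage: P&L within 5 % of baseline and no artefact loading errors."""
    if baseline.cumulative_pnl == 0:
        # A relative delta is undefined against a zero baseline; fail closed.
        return False
    delta = abs(shadow.cumulative_pnl - baseline.cumulative_pnl) / abs(
        baseline.cumulative_pnl
    )
    return delta <= 0.05 and artefact_load_errors == 0


def stage3_gate(baseline: RunStats, live: RunStats) -> bool:
    """Live stage: f1_buy gain of at least 0.015, drawdown at most 1 point worse."""
    return (
        live.f1_buy >= baseline.f1_buy + 0.015
        and live.max_drawdown <= baseline.max_drawdown + 0.01
    )


# Illustrative numbers only.
baseline = RunStats(f1_buy=0.40, max_drawdown=0.08, cumulative_pnl=1000.0)
candidate = RunStats(f1_buy=0.42, max_drawdown=0.085, cumulative_pnl=970.0)
print(stage2_gate(baseline, candidate, artefact_load_errors=0))
print(stage3_gate(baseline, candidate))
```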

5. Rollback plan (MUST)

Symptom	Action	Reversal latency
`RuntimeError` from `PerRegimeThresholdCalibrator.from_metadata` (schema or regime_version drift)	Console flip `CVN_THRESHOLD_PER_REGIME=0` (ADR-59) — fully reverts to global threshold, no model retrain	< 1 minute
Production f1_buy regression > 0.02 over 7 d	same Console flip	< 1 minute
Per-regime calibrator drift alert (P2) over 14 d	flip the variant via Console UI (`per_regime_f1_with_floor` instead of `per_regime_f1`) to dampen the deviation	< 1 minute
Bug in calibrator code itself (parser drift, off-by-one in routing)	hot-fix PR + redeploy `console-next` (no model retrain needed since artefact stays valid)	< 1 hour
Regime detector hardening required (committee reco #6 — contingent on Track 9 LOCK)	follow-up Story under CVN-N001-EE next sprint ; old behaviour available as fallback	~1 sprint

The rollback is symmetric : every variant lives under one env flag (CVN_THRESHOLD_PER_REGIME) plus optional method/floor/grouping overrides ; flipping the master flag to 0 reverts cleanly.
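To show why flipping the master flag reverts cleanly, here is a minimal sketch of flag-gated threshold resolution. It is not the code of `InferenceAPI._resolve_thresholds`. The following are assumptions made for illustration: an unset flag is treated as `0`, the confidence cutoff reuses the 0.6 alert level, the order of the fallback checks, and the `global_flag_off` label. The `global_floor` source is omitted because its semantics are not stated here.

```python
import os
from typing import Mapping, Optional


def per_regime_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the master flag; an unset flag is treated as disabled (assumption)."""
    env = os.environ if env is None else env
    return env.get("CVN_THRESHOLD_PER_REGIME", "0") == "1"


def resolve_threshold(
    regime: Optional[str],
    confidence: float,
    per_regime: Mapping[str, float],
    global_threshold: float,
    env: Optional[Mapping[str, str]] = None,
    min_confidence: float = 0.6,  # assumption: same level as the confidence alert
) -> tuple[float, str]:
    """Return (threshold, source); every fallback path yields the global threshold."""
    if not per_regime_enabled(env):
        # Hypothetical source label for the rolled-back state.
        return global_threshold, "global_flag_off"
    if regime is None:
        return global_threshold, "global_no_regime"
    if confidence < min_confidence:
        return global_threshold, "global_low_regime_confidence"
    if regime not in per_regime:
        return global_threshold, "global_unknown_regime"
    return per_regime[regime], "per_regime"


# Illustrative values only; "bull" is a hypothetical regime label.
calibrated = {"bull": 0.42}
flag_on = {"CVN_THRESHOLD_PER_REGIME": "1"}
flag_off = {"CVN_THRESHOLD_PER_REGIME": "0"}
print(resolve_threshold("bull", 0.9, calibrated, 0.5, env=flag_on))
print(resolve_threshold("bull", 0.9, calibrated, 0.5, env=flag_off))
```

Because the flag is checked first and the calibrator artefact is never consulted when it is off, the rollback path does not depend on the artefact being loadable.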

6. Owner & DRI (MUST)

DRI : @dococeven
Backup : @cvntrade-ml
Escalation : @cvntrade-architect (architectural drift) ; @cvntrade-ops (production incident impacting kill-switch / SL/TP behaviour)

Sign-off checklist (gate before PR merge)

§1-§6 all complete
Plan dossier 2026-04-29-track9-per-regime-threshold-plan.md PASSED committee plan_review (session c560b67a)
Runbook runbook_per_regime_threshold_drift.md lands in this PR
All 6 official gates of F1 plan §6 met OR explicit keep available / abandon verdict in the results dossier — checked at FTF sweep completion, post-merge (Story closes only after the operator decision per workflow §2.5)
Expert Committee pr_review PASSED — runs against this PR before merge (mandatory per ADR-68 for substantial ML changes)
