MLOps readiness — `<CVN-N...-S0X title>`

Story: <cvn_id> — link to GH issue + OP wp
Owner: <github handle> (DRI for production behaviour of this change)
Filled on: YYYY-MM-DD
Reviewed by committee: session <id> (verdict OK / OK-WITH-CHANGES)

1. Production monitoring (MUST)

What metrics, where they live, who watches them.

| Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
| --- | --- | --- | --- | --- | --- |
| `<event_name.field>` | counter / gauge / histogram | Loki query / Prometheus / OpenTelemetry span | Grafana panel URL | `< / >` | `<handle>` |

Required minima for any new ML track:

At least 1 prediction-rate metric (e.g. signals.buy_proba quantile distribution per crypto)
At least 1 outcome metric (e.g. pnl_per_trade, expectancy_net_realized)
At least 1 health metric (e.g. inference_latency_p99, feature_pipeline_failures)
All metrics MUST be tagged with the FTF factor / variant id when applicable (ADR-30)

Strongly encouraged (especially for components known to fail silently per past incidents):

Internal pipeline metrics: feature distributions per fold, filter pass rates per stage, feature-selection variance dispersion (would have caught #706), prediction-capture rate per fold (would have caught #700/#701)
Cost monitoring as a recommended health metric: API calls per trade, compute $ per training run, MLflow storage growth per Story

If your Story doesn't add ANY new metric, either (a) it shouldn't be a Story, or (b) document explicitly which existing metric covers the change.
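
A minimal sketch of how the required minima above could be checked mechanically before review. The `MetricSpec` record and its `category` field are hypothetical names introduced for illustration; they are not part of this template or of any existing tooling.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three categories the template requires for any new ML track.
REQUIRED_CATEGORIES = ("prediction", "outcome", "health")


@dataclass(frozen=True)
class MetricSpec:
    """One row of the monitoring table (hypothetical structure)."""
    name: str
    category: str              # "prediction", "outcome", "health", or anything else
    variant_id: Optional[str]  # FTF factor / variant id tag; None when not applicable


def missing_categories(metrics: List[MetricSpec]) -> List[str]:
    """Return the required categories that no declared metric covers."""
    present = {m.category for m in metrics}
    return [c for c in REQUIRED_CATEGORIES if c not in present]


declared = [
    MetricSpec("signals.buy_proba", "prediction", "factor_label_smoothing"),
    MetricSpec("pnl_per_trade", "outcome", "factor_label_smoothing"),
]
print(missing_categories(declared))  # ['health']
```

An empty result means the table meets the minima; it says nothing about whether the thresholds or dashboards are sensible, which remains a review question.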

2. Alerting & runbooks (MUST)

Threshold metrics from §1 that trigger pages, and what the on-call does.

| Alert | Trigger | Runbook | Severity | Notification channel |
| --- | --- | --- | --- | --- |
| `<alert_name>` | `metric > threshold for N min` | `documentation/runbooks/<slug>.md` | P1 / P2 / P3 | PagerDuty / Slack / email |

Required minima:

At least 1 P1 alert (page on-call) covering the failure mode that would cause silent revenue loss (e.g. expectancy_net < 0 over 24h on >2 cryptos)
Each alert has a runbook in documentation/runbooks/ with: symptom → diagnosis steps → remediation → escalation
Runbook MUST link the Grafana dashboard and the relevant log queries (Loki {job=...})

If you skip this section, write **SKIP — JUSTIFICATION**: <why> and accept that the committee may reject it.
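
The trigger form `metric > threshold for N min` means the breach has to be sustained, not a single spike. A minimal sketch of that semantics, assuming one sample per minute with the oldest first (the function name and the sampling assumption are illustrative, not taken from the alerting stack):

```python
from typing import Sequence


def sustained_breach(values: Sequence[float], threshold: float, n_minutes: int) -> bool:
    """True when each of the last n_minutes samples exceeds the threshold.

    Assumes one sample per minute, oldest first (illustrative simplification).
    """
    if n_minutes <= 0:
        raise ValueError("n_minutes must be positive")
    if len(values) < n_minutes:
        return False  # not enough history to claim a sustained breach
    return all(v > threshold for v in values[-n_minutes:])


latency_ms = [120, 180, 260, 270, 255]
print(sustained_breach(latency_ms, threshold=250, n_minutes=3))  # True
print(sustained_breach(latency_ms, threshold=250, n_minutes=4))  # False
```

In practice this evaluation usually lives in the alerting backend rather than in application code; the sketch only pins down what the Trigger column is expected to express.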

3. Drift detection (MUST)

How you detect that the model degrades or that input distribution shifts.

| Drift type | Method | Window | Threshold | Action on trigger |
| --- | --- | --- | --- | --- |
| Data drift | PSI (population stability index) on top-K features by FI | rolling 7d vs training | PSI > 0.2 (warn), PSI > 0.5 (crit) | shadow re-train / page |
| Concept drift | rolling perf gap (live `f1_buy` vs training OOS `f1_buy`) | 14d window, per crypto | gap > 0.05 (warn), gap > 0.10 (crit) | retrain candidate, hold live |
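
For reference, PSI over binned fractions is the sum over bins of `(actual - expected) * ln(actual / expected)`. The sketch below is one common way to compute it for a single feature, assuming quantile bins derived from the training sample and a small floor `eps` for empty bins; the bin count, the floor, and the function names are illustrative choices, not values prescribed by this template.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index of `actual` against the `expected` reference."""
    ref = sorted(expected)
    # Interior bin edges at the training-sample quantiles.
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def drift_severity(value: float, warn: float, crit: float) -> str:
    """Map a drift statistic to ok / warn / crit using the table's thresholds."""
    if value > crit:
        return "crit"
    if value > warn:
        return "warn"
    return "ok"


training = [i / 100 for i in range(1000)]   # synthetic reference sample
live = [v + 5 for v in training]            # same shape, shifted upward
print(psi(training, training))                               # 0.0
print(drift_severity(psi(training, live), warn=0.2, crit=0.5))  # crit
```

The same `drift_severity` helper applies to the concept-drift row by passing the `f1_buy` gap with `warn=0.05, crit=0.10`.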

Required minima:

Both data-drift AND concept-drift implemented for any track changing model architecture, features, or labels (i.e., all F1_buy boost tracks)
Drift metrics emitted as OpenTelemetry spans / log_event (ADR-30, ADR-32)
Triggered drift produces a drift_alert runbook entry, NOT silent retraining (ADR-25 no silent fallback)

For tracks that only change calibration / threshold (no feature / label change), the data-drift section can be marked "N/A — calibration-only change", followed by a one-line justification.

4. Staged rollout (MUST)

How the change reaches production traffic. Three stages with explicit pass criteria.

| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
| --- | --- | --- | --- | --- |
| Shadow | 0% (predictions logged, not acted on) | ≥ 7d | prediction parity vs champion within ±1% on `f1_buy`, no exceptions in logs | any P1 alert |
| Canary | 1 crypto out of 5 (or 10% portfolio weight) | ≥ 7d | live `expectancy_net` ≥ baseline OR within CI95; n_trades ≥ 50; no drift alert | live `expectancy_net` < 0 over 48h |
| Full | 100% (all 5 cryptos) | — | all canary criteria + Sortino ≥ baseline | per ADR-26 oncall procedure |

Required minima:

Shadow stage MUST be ≥ 7 days even for "obvious" wins (ADR-67 lesson). Calibration-only Stories (no feature/label/architecture change) MAY shorten it via an explicit **SKIP — JUSTIFICATION**: calibration-only, shadow=Nd because <reason>; the committee may still reject.
Canary stage MUST run on a single crypto first, never all-at-once
Full rollout requires explicit operator sign-off in the OP Story, not auto-promotion
The operator filling this template names the specific crypto that hosts the canary (e.g. BTCUSDC chosen because it has the deepest order book) — not "we'll pick later"
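
The canary pass criteria can be written down as an explicit check so that promotion is never a judgment made from memory. This is a sketch under stated assumptions: `CanaryWindow` and its fields are hypothetical, and "within CI95" is read here as "not below the lower bound of the baseline's 95% confidence interval", which the Story should confirm or replace with its own definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryWindow:
    """Observed canary results for the named crypto (hypothetical structure)."""
    days_elapsed: int
    n_trades: int
    live_expectancy_net: float
    baseline_expectancy_net: float
    baseline_ci95_low: float   # assumption: lower bound of the baseline CI95
    drift_alert_fired: bool


def canary_failures(w: CanaryWindow) -> List[str]:
    """Return the unmet canary criteria; an empty list means the stage passes."""
    failures = []
    if w.days_elapsed < 7:
        failures.append("duration < 7d")
    if w.n_trades < 50:
        failures.append("n_trades < 50")
    meets_baseline = w.live_expectancy_net >= w.baseline_expectancy_net
    within_ci95 = w.live_expectancy_net >= w.baseline_ci95_low
    if not (meets_baseline or within_ci95):
        failures.append("expectancy_net below baseline and outside CI95")
    if w.drift_alert_fired:
        failures.append("drift alert fired")
    return failures


window = CanaryWindow(days_elapsed=8, n_trades=42,
                      live_expectancy_net=0.011, baseline_expectancy_net=0.012,
                      baseline_ci95_low=0.009, drift_alert_fired=False)
print(canary_failures(window))  # ['n_trades < 50']
```

A passing check is an input to the operator sign-off, not a substitute for it, since full rollout must not be auto-promoted.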

5. Rollback plan (MUST)

How to revert the change if production behaves badly. Targets a 5-minute revert window.

| Mechanism | Description | Revert SLA |
| --- | --- | --- |
| FTF factor flag | `CVN_FACTOR_<name>=off` set in `ftf_config` Console (ADR-59) | < 1 min |
| Model version pinning | `cvntrade_<MODEL>_<SYMBOL>_<STRATEGY>` MLflow alias points to previous champion | < 5 min |
| Feature artefact pinning | `FeatureEngineeringAPI.from_mlflow_run(run_id=<previous>)` | < 5 min |

Required minima:

The change MUST be revertable WITHOUT a code deploy (config / alias flip only). If it is not, that is a structural problem that needs ADR-56 review. The config-only path is the PRIMARY rollback; the hotfix flow (cf. OPERATIONS.md §16.7) is allowed only as a secondary, emergency-only path when the primary path itself is broken (e.g., FTF runner unavailable).
The revert mechanism MUST be tested in shadow stage at least once (manual flip + verify champion behavior restored)
Document the specific config value that flips off the change (e.g. "set factor_label_smoothing=0.0 in ftf_config.base_env")
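
A minimal sketch of how code might read a factor flag so that the flip is unambiguous. It assumes flags arrive as a string mapping keyed `CVN_FACTOR_<name>` with the values `on` and `off`; only `off` appears in the table above, so `on` and the fail-loudly behaviour on missing or unknown values are assumptions made for illustration.

```python
from typing import Mapping


def factor_enabled(env: Mapping[str, str], name: str) -> bool:
    """Return True when CVN_FACTOR_<name> is 'on', False when it is 'off'.

    Anything else raises, so a typo in the config cannot silently pick a side.
    """
    key = f"CVN_FACTOR_{name}"
    raw = env.get(key)
    if raw is None:
        raise KeyError(f"{key} is not set")
    value = raw.strip().lower()
    if value == "on":
        return True
    if value == "off":
        return False
    raise ValueError(f"{key} must be 'on' or 'off', got {raw!r}")


config = {"CVN_FACTOR_label_smoothing": "off"}
print(factor_enabled(config, "label_smoothing"))  # False
```

During the shadow-stage revert test, the flip can be verified by asserting that the function returns `False` after the change and that champion behaviour is restored.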

6. Owner & DRI (MUST)

Who is accountable for production behaviour of this change for the next 90 days.

DRI: <handle> (single human, not a team)
Backup DRI: <handle> (covers when DRI is OOO). MUST be actively engaged — quarterly handoff drill (15 min walkthrough of dashboards + rollback flag flip) OR shared on-call rotation. A backup who has never read this file is not a backup.
Decision authority: who can flip the rollback flag without committee — <handle>
Sunset date: YYYY-MM-DD (90 days post full rollout) — by this date the change is either accepted as part of the champion stack OR revisited

If ownership changes before the sunset date, the new DRI MUST update this file and post a comment in the OP Story.
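
The sunset date is plain date arithmetic on the full-rollout date. A small sketch using only the standard library (the function names are illustrative):

```python
from datetime import date, timedelta


def sunset_date(full_rollout: date, days: int = 90) -> date:
    """Date by which the change is accepted into the champion stack or revisited."""
    return full_rollout + timedelta(days=days)


def sunset_overdue(full_rollout: date, today: date) -> bool:
    """True once the sunset date has passed without a recorded decision."""
    return today > sunset_date(full_rollout)


print(sunset_date(date(2025, 1, 1)))  # 2025-04-01
```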

Sign-off checklist (gate before PR merge)

§1 monitoring: ≥ 1 prediction + ≥ 1 outcome + ≥ 1 health metric defined and discoverable in Grafana
§2 alerting: ≥ 1 P1 alert exists with a runbook
§3 drift: data + concept (or N/A justified) wired and emitting
§4 rollout: shadow + canary + full stages with explicit pass criteria + named canary crypto
§5 rollback: revert path documented, tested in shadow
§6 DRI: single human named, sunset date in 90 days
Dependency declarations (CVN-N011-EA-S11 wp#92, F15): every NEW Python import introduced by this Story is declared in BOTH requirements.txt (root) AND airflow_docker/requirements.txt, OR the H1 deptry gate (.github/workflows/pr-deps-guard.yml) is explicitly green on the latest PR commit. Lazy / conditional imports inside try/except are also covered by H2 (.github/workflows/build-push.yml post-build smoke). This closes the missing-dep regression class (sympy / cleanlab / etc.). Skipping this checkbox means the same incident class can recur silently.
Committee pr_review session reviewed this filled template (verdict OK or OK-WITH-CHANGES)
Story OP comment links the committee session id

The PR description MUST link this filled file. PRs that merge ML code without a linked, signed-off MLOps readiness file violate ADR-70 and ADR-25.
