MLOps readiness — <CVN-N...-S0X title>
Story: <cvn_id> — link to GH issue + OP wp
Owner: <github handle> (DRI for production behaviour of this change)
Filled on: YYYY-MM-DD
Reviewed by committee: session <id> (verdict OK / OK-WITH-CHANGES)
1. Production monitoring (MUST)
What metrics, where they live, who watches them.
| Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|
| `<event_name.field>` | counter / gauge / histogram | Loki query / Prometheus / OpenTelemetry span | Grafana panel URL | < / > | `<handle>` |
Required minima for any new ML track:
- At least 1 prediction-rate metric (e.g. `signals.buy_proba` quantile distribution per crypto)
- At least 1 outcome metric (e.g. `pnl_per_trade`, `expectancy_net_realized`)
- At least 1 health metric (e.g. `inference_latency_p99`, `feature_pipeline_failures`)
- All metrics MUST be tagged with the FTF factor / variant id when applicable (ADR-30)
Strongly encouraged (especially for components known to fail silently per past incidents):
- Internal pipeline metrics: feature distributions per fold, filter pass rates per stage, feature-selection variance dispersion (would have caught #706), prediction-capture rate per fold (would have caught #700/#701)
- Cost monitoring as a recommended health metric: API calls per trade, compute $ per training run, MLflow storage growth per Story
If your Story doesn't add ANY new metric, either (a) it shouldn't be a Story, or (b) document explicitly which existing metric covers the change.
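The required minima above can be checked mechanically before review. The following is a minimal sketch, not an existing project tool: the `MetricSpec` structure, the category labels, and the `ftf_variant` tag key are all hypothetical names chosen for illustration, and the sketch treats the variant tag as always applicable.

```python
from dataclasses import dataclass

# Hypothetical category labels; the template only names the three kinds of metric.
REQUIRED_CATEGORIES = ("prediction", "outcome", "health")


@dataclass(frozen=True)
class MetricSpec:
    name: str          # e.g. "signals.buy_proba"
    category: str      # one of REQUIRED_CATEGORIES, or e.g. "internal" / "cost"
    tags: tuple = ()   # tag keys attached to every emitted sample


def check_minima(metrics, variant_tag="ftf_variant"):
    """Return a list of problems; an empty list means the minima are met."""
    problems = []
    for category in REQUIRED_CATEGORIES:
        if not any(m.category == category for m in metrics):
            problems.append(f"missing at least one {category} metric")
    for m in metrics:
        # Assumption: every metric in this Story is variant-scoped.
        if variant_tag not in m.tags:
            problems.append(f"{m.name} is not tagged with {variant_tag}")
    return problems


example = [
    MetricSpec("signals.buy_proba", "prediction", ("crypto", "ftf_variant")),
    MetricSpec("pnl_per_trade", "outcome", ("crypto", "ftf_variant")),
    MetricSpec("inference_latency_p99", "health", ("ftf_variant",)),
]
```

Calling `check_minima(example)` on a complete declaration returns an empty list; dropping the health metric or a tag yields one message per gap, which can be pasted into the review thread.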
2. Alerting & runbooks (MUST)
Threshold metrics from §1 that trigger pages, and what the on-call does.
| Alert | Trigger | Runbook | Severity | Notification channel |
|---|---|---|---|---|
| `<alert_name>` | `metric > threshold for N min` | `documentation/runbooks/<slug>.md` | P1 / P2 / P3 | PagerDuty / Slack / email |
Required minima:
- At least 1 P1 alert (page on-call) covering the failure mode that would cause silent revenue loss (e.g. `expectancy_net < 0 over 24h on >2 cryptos`)
- Each alert has a runbook in `documentation/runbooks/` with: symptom → diagnosis steps → remediation → escalation
- Runbook MUST link the Grafana dashboard and the relevant log queries (Loki `{job=...}`)
If you skip this section, write **SKIP — JUSTIFICATION**: <why> and accept that the committee may reject.
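The trigger shape `metric > threshold for N min` means the condition must hold continuously for the whole window, not just at one sample. A minimal sketch of that semantics follows, assuming samples arrive as time-ordered `(timestamp, value)` pairs; the function names are illustrative and not part of any alerting backend.

```python
from datetime import datetime, timedelta


def breach_duration(samples, threshold, now):
    """Length of the trailing run of samples above `threshold`, measured up to `now`.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    breach_start = None
    for ts, value in reversed(samples):
        if value > threshold:
            breach_start = ts   # keep walking back while the breach persists
        else:
            break               # first healthy sample ends the run
    return timedelta(0) if breach_start is None else now - breach_start


def should_fire(samples, threshold, for_minutes, now):
    """True when the metric has stayed above `threshold` for at least `for_minutes`."""
    return breach_duration(samples, threshold, now) >= timedelta(minutes=for_minutes)


t0 = datetime(2024, 1, 1, 12, 0)
values = [0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]
now = t0 + timedelta(minutes=6)
```

With these samples the breach started at minute 1, so `should_fire(samples, 0.5, 5, now)` is true while a 6-minute window is not yet satisfied. A single healthy sample resets the clock, which is the behaviour a runbook author should expect when reading the alert history.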
3. Drift detection (MUST)
How you detect that the model degrades or that the input distribution shifts.
| Drift type | Method | Window | Threshold | Action on trigger |
|---|---|---|---|---|
| Data drift | PSI (population stability index) on top-K features by FI | rolling 7d vs training | PSI > 0.2 (warn), PSI > 0.5 (crit) | shadow re-train / page |
| Concept drift | rolling perf gap (live f1_buy vs training OOS f1_buy) | 14d window, per crypto | gap > 0.05 (warn), gap > 0.10 (crit) | retrain candidate, hold live |
Required minima:
- Both data-drift AND concept-drift implemented for any track changing model architecture, features, or labels (i.e., all F1_buy boost tracks)
- Drift metrics emitted as OpenTelemetry spans / log_event (ADR-30, ADR-32)
- Triggered drift produces a `drift_alert` runbook entry, NOT silent retraining (ADR-25 no silent fallback)
For tracks that only change calibration / threshold (no feature / label change), data-drift section can be marked N/A — calibration-only change with a one-line justification.
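For reference, PSI compares the binned proportions of a training sample and a live sample as the sum over bins of (live − train) × ln(live / train). The sketch below is one possible pure-Python implementation under stated assumptions: equal-width bins derived from the training range (quantile bins are an equally common choice), a small `eps` floor for empty bins, and a generic `severity` helper that applies the warn / crit thresholds from the table. None of these names come from the project code.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def severity(value, warn, crit):
    """Map a drift statistic onto the warn / crit thresholds from the table."""
    if value > crit:
        return "crit"
    if value > warn:
        return "warn"
    return "ok"


train = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]   # live sample squeezed into the upper half
```

Here `severity(psi(train, live), warn=0.2, crit=0.5)` evaluates to `"crit"`, and the same helper covers the concept-drift row, e.g. `severity(0.07, warn=0.05, crit=0.10)` gives `"warn"`. Whatever the result, the action on trigger is a `drift_alert` entry, not an automatic retrain.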
4. Staged rollout (MUST)
How the change reaches production traffic. Three stages with explicit pass criteria.
| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | 0% (predictions logged, not acted on) | ≥ 7d | prediction parity vs champion within ±1% on f1_buy, no exceptions in logs | any P1 alert |
| Canary | 1 crypto out of 5 (or 10% portfolio weight) | ≥ 7d | live expectancy_net ≥ baseline OR within CI95; n_trades ≥ 50; no drift alert | live expectancy_net < 0 over 48h |
| Full | 100% (all 5 cryptos) | — | all canary criteria + Sortino ≥ baseline | per ADR-26 oncall procedure |
Required minima:
- Shadow stage MUST be ≥ 7 days even for "obvious" wins (ADR-67 lesson). Calibration-only Stories (no feature/label/architecture change) MAY shorten via explicit `**SKIP — JUSTIFICATION**: calibration-only, shadow=Nd because <reason>`; committee may still reject.
- Canary stage MUST run on a single crypto first, never all-at-once
- Full rollout requires explicit operator sign-off in the OP Story, not auto-promotion
- The operator filling this template names the specific crypto that hosts the canary (e.g. `BTCUSDC` chosen because it has the deepest order book), not "we'll pick later"
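The canary pass criteria from the table can be written down as a single predicate so that the promotion decision is reproducible. This is a sketch only: `CanaryStats` and its fields are hypothetical, and "within CI95" is interpreted here as the baseline lying inside the 95% confidence interval of the live expectancy, which is an assumption the Story owner should confirm. The function only reports readiness; promotion to full still needs the operator sign-off described above.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    days_elapsed: float
    n_trades: int
    live_expectancy_net: float
    ci95_half_width: float    # half-width of the 95% CI on live expectancy (assumed available)
    drift_alert_fired: bool


def canary_passes(stats, baseline_expectancy_net, min_days=7, min_trades=50):
    """Evaluate the canary row: duration, sample size, expectancy, and drift."""
    meets_baseline = stats.live_expectancy_net >= baseline_expectancy_net
    # Assumed reading of "within CI95": the baseline is not statistically distinguishable.
    within_ci = (
        abs(stats.live_expectancy_net - baseline_expectancy_net)
        <= stats.ci95_half_width
    )
    return (
        stats.days_elapsed >= min_days
        and stats.n_trades >= min_trades
        and (meets_baseline or within_ci)
        and not stats.drift_alert_fired
    )


ready = CanaryStats(days_elapsed=8, n_trades=64, live_expectancy_net=0.012,
                    ci95_half_width=0.004, drift_alert_fired=False)
```

For example, `canary_passes(ready, baseline_expectancy_net=0.010)` is true, while the same stats with 40 trades or a fired drift alert fail the gate.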
5. Rollback plan (MUST)
How to revert the change if production behaves badly. Targets a 5-minute revert window.
| Mechanism | Description | Revert SLA |
|---|---|---|
| FTF factor flag | `CVN_FACTOR_<name>=off` set in `ftf_config` Console (ADR-59) | < 1 min |
| Model version pinning | `cvntrade_<MODEL>_<SYMBOL>_<STRATEGY>` MLflow alias points to previous champion | < 5 min |
| Feature artefact pinning | `FeatureEngineeringAPI.from_mlflow_run(run_id=<previous>)` | < 5 min |
Required minima:
- The change MUST be revertable WITHOUT a code deploy (config / alias flip only); if not, that is a structural problem that needs ADR-56 review. The config-only path is the PRIMARY rollback; the hotfix flow (cf. `OPERATIONS.md` §16.7) is allowed only as a secondary, emergency-only path when the primary path itself is broken (e.g., FTF runner unavailable).
- The revert mechanism MUST be tested in shadow stage at least once (manual flip + verify champion behavior restored)
- Document the specific config value that flips off the change (e.g. "set `factor_label_smoothing=0.0` in `ftf_config.base_env`")
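The model-version pinning row amounts to an alias flip followed by a read-back check. The sketch below is written against a small protocol whose two methods mirror `MlflowClient.get_model_version_by_alias` and `MlflowClient.set_registered_model_alias` (available in recent MLflow releases), so it can be exercised in shadow with a stub. How the `cvntrade_<MODEL>_<SYMBOL>_<STRATEGY>` string maps onto a registered model name versus an alias is not specified here, so `model_name` and `alias` are left as caller-supplied assumptions.

```python
from typing import Protocol


class AliasClient(Protocol):
    """The two registry calls the rollback needs; MlflowClient provides both."""

    def get_model_version_by_alias(self, name: str, alias: str): ...

    def set_registered_model_alias(self, name: str, alias: str, version: str) -> None: ...


def rollback_alias(client: AliasClient, model_name: str, alias: str, previous_version: str):
    """Point `alias` back at `previous_version` and verify the flip took effect.

    Returns (version_before, version_after) so the operator can paste both
    into the OP Story comment.
    """
    before = str(client.get_model_version_by_alias(model_name, alias).version)
    client.set_registered_model_alias(model_name, alias, str(previous_version))
    after = str(client.get_model_version_by_alias(model_name, alias).version)
    if after != str(previous_version):
        # No silent fallback: a failed revert must be loud.
        raise RuntimeError(
            f"alias {alias!r} on {model_name!r} points at {after}, expected {previous_version}"
        )
    return before, after
```

Running this once during the shadow stage, then flipping back, satisfies the "tested at least once" requirement and gives a measured number to compare against the < 5 min revert SLA.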
6. Owner & DRI (MUST)
Who is accountable for production behaviour of this change for the next 90 days.
- DRI: `<handle>` (single human, not a team)
- Backup DRI: `<handle>` (covers when DRI is OOO). MUST be actively engaged: quarterly handoff drill (15 min walkthrough of dashboards + rollback flag flip) OR shared on-call rotation. A backup who has never read this file is not a backup.
- Decision authority: who can flip the rollback flag without committee: `<handle>`
- Sunset date: `YYYY-MM-DD` (90 days post full rollout); by this date the change is either accepted as part of the champion stack OR revisited
If DRI ownership changes before the sunset date, the new DRI MUST update this file and post a comment in the OP Story.
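The sunset date is plain date arithmetic, shown here so the value written into this file is computed rather than estimated. The helper names are illustrative only.

```python
from datetime import date, timedelta


def sunset_date(full_rollout: date, days: int = 90) -> date:
    """Date by which the change is accepted into the champion stack or revisited."""
    return full_rollout + timedelta(days=days)


def is_past_sunset(full_rollout: date, today: date, days: int = 90) -> bool:
    """True once the review deadline has passed."""
    return today > sunset_date(full_rollout, days)
```

For a full rollout on 2025-01-15, `sunset_date(date(2025, 1, 15))` is 2025-04-15.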
Sign-off checklist (gate before PR merge)
- §1 monitoring: ≥ 1 prediction + ≥ 1 outcome + ≥ 1 health metric defined and discoverable in Grafana
- §2 alerting: ≥ 1 P1 alert exists with a runbook
- §3 drift: data + concept (or N/A justified) wired and emitting
- §4 rollout: shadow + canary + full stages with explicit pass criteria + named canary crypto
- §5 rollback: revert path documented, tested in shadow
- §6 DRI: single human named, sunset date in 90 days
- Dependency declarations (CVN-N011-EA-S11 wp#92, F15): every NEW Python `import` introduced by this Story is declared in BOTH `requirements.txt` (root) AND `airflow_docker/requirements.txt`, OR the H1 `deptry` gate (`.github/workflows/pr-deps-guard.yml`) is explicitly green on the latest PR commit. Lazy / conditional imports inside `try/except` are also covered by H2 (`.github/workflows/build-push.yml` post-build smoke). Closes the missing-dep regression class (sympy / cleanlab / etc.). Skipping this checkbox = the same incident class can recur silently.
- Committee `pr_review` session reviewed this filled template (verdict OK or OK-WITH-CHANGES)
- Story OP comment links the committee session id
The PR description MUST link this filled file. PRs that merge ML code without a linked, signed-off MLOps readiness file violate ADR-70 and ADR-25.