ADR-0070 — MLOps readiness plan template is mandatory before any ML production merge

Status: active
Date: 2026-04-28
Introduced by: CVN-N001-EF-S01 (#709), F1_buy boost committee session 8db2529d
Supersedes: none


Context

The F1_buy boost plan committee review (session 8db2529d, Ops finding) flagged a recurring failure mode in the project: ML changes have been merging to production with monitoring, drift detection, and rollback paths defined informally — sometimes by chat, sometimes not at all. When a change degrades silently, the operator discovers it days later via aggregate dashboards rather than via a paged alert.

Concrete observations from the last quarter:

  • The variance feature-selection method shipped to production for ~2 weeks emitting 1.000291 for every feature post-StandardScaler (issue #706, root cause structural). No alert fired because no metric measured "feature selection variance dispersion" — there was no template forcing the question.
  • The predictions hook dropped 100% of OOS predictions for 1 fold (#700/#701) because a reset_index() exception was caught and never surfaced. No runbook existed for "predictions_captured=False > X% of folds"; we noticed only through a per-row spot-check.
  • Phase 2 rerun was launched without a documented canary stage. When a flaw surfaces mid-run, rollback is "kill all variants and restart" — a 4-hour cost.
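
The first observation can be reproduced in a few lines. The sketch below is illustrative only: it assumes the scaler divides by the population standard deviation (ddof=0, which is what scikit-learn's StandardScaler does) and that the selector ranks features by sample variance (ddof=1). Under those assumptions every scaled feature scores exactly n / (n - 1), so the ranking carries no information. The helper names, the tolerance, and the row count are hypothetical; the row count was picked only because it makes n / (n - 1) round to the reported 1.000291, and the real row count is not stated here.

```python
import numpy as np


def standardise(X: np.ndarray) -> np.ndarray:
    """Column-wise z-score with the population std (ddof=0)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)


def variance_dispersion(X_scaled: np.ndarray) -> float:
    """Spread (max minus min) of the per-feature sample variances (ddof=1)."""
    variances = X_scaled.var(axis=0, ddof=1)
    return float(variances.max() - variances.min())


def selection_is_degenerate(X_scaled: np.ndarray, tol: float = 1e-6) -> bool:
    """True when variance cannot rank features because they all score the same."""
    return variance_dispersion(X_scaled) < tol


rng = np.random.default_rng(0)
n_rows = 3437  # hypothetical; chosen so that n / (n - 1) rounds to 1.000291
raw = rng.normal(size=(n_rows, 4)) * np.array([0.1, 1.0, 10.0, 100.0])
scaled = standardise(raw)

# Every scaled feature gets the same score, whatever its original scale.
print(np.round(scaled.var(axis=0, ddof=1), 6))
print(selection_is_degenerate(raw), selection_is_degenerate(scaled))  # False True
```

A dispersion value like this is the kind of metric a §1 entry could name, with an alert when it falls below the tolerance.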

The committee finding was: "Mandate a Comprehensive MLOps Readiness Plan for Each Track: Before any track proceeds to implementation, require a detailed plan covering: (1) Production Monitoring, (2) Alerting & Runbooks, (3) Drift Detection, (4) Staged Rollout Strategy."

We need to convert this from a one-off recommendation into a binding gate.

Decision

Every ML Story whose implementation touches model training, label generation, feature engineering, inference path, or trading filters MUST publish a filled TEMPLATE_mlops_readiness.md BEFORE its implementation PR can be merged.

The template lives at documentation/templates/TEMPLATE_mlops_readiness.md and has 6 mandatory sections: production monitoring, alerting & runbooks, drift detection, staged rollout, rollback plan, owner & DRI.

The filled version is committed to one of:

  • documentation/stories/<cvn_id>/mlops_readiness.md (preferred for substantial Stories), or
  • a ## MLOps readiness section pasted into the GitHub issue body, with the OP Story arch_notes cf pointing to the issue (lighter for small Stories).

The PR description MUST link the filled file (or the issue section) and the committee pr_review session that approved it.
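
A completeness check on the filled file could be sketched as follows. This is not the project's tooling: the heading convention (`## <number>. <title>`) and the function name are assumptions made for illustration, and the real template may lay out its sections differently. A section counts as filled when its body is non-empty, so a section containing only a written skip justification would pass this check and still be left to the committee to accept or reject.

```python
import re

# Assumed heading convention: "## 3. Drift detection" (hypothetical).
HEADING_RE = re.compile(r"^##\s+(\d+)\.\s+(.*)$", re.MULTILINE)


def missing_sections(text, required=(1, 2, 3, 4, 5, 6)):
    """Return the required section numbers that are absent or have an empty body."""
    matches = list(HEADING_RE.finditer(text))
    filled = set()
    for i, match in enumerate(matches):
        # A section body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if text[match.end():end].strip():
            filled.add(int(match.group(1)))
    return [number for number in required if number not in filled]


example = """## 1. Production monitoring
P1 alert on the predictions_captured ratio.

## 2. Alerting & runbooks
Runbook linked from the alert.

## 3. Drift detection

## 5. Rollback plan
Flip the ftf_config flag.
"""

print(missing_sections(example))  # [3, 4, 6]
```

Section 3 is reported because its heading is present but its body is empty; sections 4 and 6 are reported because they are absent.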

| Story scope | Template required | Sections enforced |
|---|---|---|
| Touches src/training/, src/commun/{pipeline,inference,filters,labels}/, model artefacts, FTF factor matrix | YES | all 6 |
| Touches src/backtest/ only (no live trading impact) | YES | §1, §2, §5 (no rollout/drift since not live) |
| Touches src/commun/audit/ or pure observability code | YES (lite) | §1 + §2 + §5 + §6 (audit failures are silent; need rollback path + DRI) |
| Pure docs / dashboards / FTF-only experiment with no model promotion path | NO | |

The committee pr_review (per ADR-68) verifies that the filled template is present, complete per the table above, and passes the §Sign-off checklist. PRs without a sign-off are blocked at the merge gate.
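
The scope table can be approximated as a lookup from changed paths to enforced sections. The sketch below is a hedged illustration, not an existing tool: the function name is hypothetical, and taking the union of sections when a PR matches several rows is an assumption this ADR does not spell out. Path prefixes cannot detect changes to model artefacts or the FTF factor matrix, so those cases would still need a human call.

```python
FULL = {1, 2, 3, 4, 5, 6}

# (path prefixes, sections enforced), transcribed from the scope table.
RULES = [
    (
        (
            "src/training/",
            "src/commun/pipeline/",
            "src/commun/inference/",
            "src/commun/filters/",
            "src/commun/labels/",
        ),
        FULL,
    ),
    (("src/commun/audit/",), {1, 2, 5, 6}),
    (("src/backtest/",), {1, 2, 5}),
]


def required_sections(changed_paths):
    """Return the sorted section numbers a PR must fill, given its changed paths."""
    required = set()
    for path in changed_paths:
        for prefixes, sections in RULES:
            # str.startswith accepts a tuple of prefixes.
            if path.startswith(prefixes):
                required |= sections  # assumption: multiple matches are unioned
    return sorted(required)


print(required_sections(["src/backtest/run.py"]))                     # [1, 2, 5]
print(required_sections(["src/backtest/run.py", "src/training/fit.py"]))  # [1, 2, 3, 4, 5, 6]
print(required_sections(["documentation/adr/readme.md"]))             # []
```

An empty result corresponds to the "NO" row: no template is required.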

Invariants

  • Template lives at documentation/templates/TEMPLATE_mlops_readiness.md — single source of truth, all Stories copy from there. Updates to the template are themselves ADR-eligible if they remove sections.
  • Filled file path is predictable — either documentation/stories/<cvn_id>/mlops_readiness.md or the GH issue with a ## MLOps readiness H2 anchor. PR template (when introduced) auto-links one of the two.
  • Skipping a MUST section requires written justification — entry of the form **SKIP — JUSTIFICATION**: <why> directly in the section. Not in chat, not in a separate doc. Committee may reject.
  • DRI is a single human — never a team alias, never "the operator". The handle named in §6 is accountable for the next 90 days. Re-assignment requires editing the file and commenting on the OP Story.
  • Rollback must be config-only — §5 must document a revert path that does NOT require a code deploy (FTF factor flag, MLflow alias flip, or ftf_config edit per ADR-59). If a code deploy is the only revert path, the change has structural debt that triggers ADR-56 review.
  • Canary stage names a specific crypto — §4 must name the crypto hosting the canary (e.g. "BTCUSDC because deepest order book"), not "we'll pick later". Vague rollout plans defeat the gate.
  • Sunset date is 90 days post full rollout — every change is reviewed at +90d to either become permanent or be revisited. Forces accountability for "we shipped it and forgot".
  • Drift thresholds are calibration-eligible — the template proposes defaults (PSI 0.2/0.5, perf-gap 0.05/0.10) but Stories MAY justify per-crypto-volatility values in §3 with a one-line rationale. The defaults are not sacred; the discipline of naming a threshold is.
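
For the drift-threshold invariant, a Population Stability Index check might look like the sketch below. It uses the standard PSI formula, the sum over bins of (actual - expected) * ln(actual / expected). Reading the 0.2/0.5 pair as warn/critical levels is an assumption, as are the quantile binning, the epsilon floor, and the function names; a Story would substitute its own justified values.

```python
import numpy as np


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges are the reference quantiles; outer bins are open-ended.
    inner = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1)[1:-1])
    e_cnt = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=bins)
    a_cnt = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=bins)
    # Floor the fractions so empty bins do not produce log(0).
    e_frac = np.clip(e_cnt / expected.size, eps, None)
    a_frac = np.clip(a_cnt / actual.size, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


def drift_level(value, warn=0.2, critical=0.5):
    """Map a PSI value to a level; thresholds assumed to be warn/critical."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(2.0, 1.0, 5000)  # mean moved by two standard deviations

print(drift_level(psi(reference, same)))     # ok
print(drift_level(psi(reference, shifted)))  # critical
```

Per-crypto overrides fit naturally as the `warn` and `critical` arguments, which keeps the one-line rationale in §3 next to the numbers it justifies.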

Alternatives rejected

  • Per-Epic MLOps section instead of per-Story — too coarse: Stories within the same Epic (e.g., F1_buy boost S01-S05) ship at different times and need separate canary strategies. The template fits per-Story to match merge granularity.
  • Wiki page or Confluence-style central document — centralised pages drift from the actual change set; the template lives in the Story dossier so it travels with the diff.
  • Soft recommendation, no merge gate — exact failure mode of the past quarter. Without the gate, the template becomes a "we should fill this someday" artefact.
  • Auto-generated from code (e.g., pytest plugin scanning for @mlops_ready decorators) — over-engineered for our current cadence (~1 ML PR / week). Revisit if cadence reaches > 5 / week.

Consequences

  • Positive: every ML production change carries a written rollback plan and a named DRI. "Who do I page when this breaks at 3am?" has an answer in the PR description.
  • Positive: drift detection becomes a default, not an afterthought. New tracks ship with PSI + concept-drift wiring from day one.
  • Positive: synergistic with ADR-69 (OpenProject orchestrator) — the filled template lives in the Story dossier, which is the artifact the OP Story points to.
  • Positive: synergistic with ADR-68 (committee = default review channel) — pr_review session has a checklist to verify, not a blank slate.
  • Negative: ~30-60 min overhead per ML Story to fill the template. Acceptable given the cost of one silent regression (~4-8h to diagnose).
  • Negative: the gate creates one more committee dependency on the merge path — mitigated by keeping the template short (1 page when filled).
  • Neutral: the template is opinionated about minima (≥ 1 P1 alert, ≥ 7d shadow, etc.). These can be adjusted via ADR amendment if proven wrong.

Rollback

This ADR changes process only. Rollback = remove the file and revert references to ADR-70 in the template + CLAUDE.md. The filled-template files in Story dossiers stay (they are useful documentation regardless of policy).

If the gate proves systematically wasteful (e.g., > 20% of ML Stories blocked at merge purely for template completeness over a sample of ≥ 20 Stories, with zero correlated production incidents prevented), revisit the section list before retiring the policy.
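
The example criterion above reduces to a small predicate. The sketch below simply encodes the numbers quoted in that sentence; the function and parameter names are hypothetical, and how "blocked purely for template completeness" is counted is left to whoever reviews the sample.

```python
def gate_looks_wasteful(
    total_stories,
    blocked_for_completeness_only,
    incidents_prevented,
    min_sample=20,
    max_blocked_ratio=0.20,
):
    """True when the example revisit criterion is met for the given sample."""
    if total_stories < min_sample:
        return False  # sample too small to conclude anything
    blocked_ratio = blocked_for_completeness_only / total_stories
    return blocked_ratio > max_blocked_ratio and incidents_prevented == 0


print(gate_looks_wasteful(25, 6, 0))  # True: 24% blocked, nothing prevented
print(gate_looks_wasteful(25, 6, 1))  # False: the gate prevented an incident
print(gate_looks_wasteful(10, 9, 0))  # False: fewer than 20 Stories sampled
```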

References

  • ADR-25 — No silent fallback in ML pipelines (the policy this template operationalises)
  • ADR-26 — Grafana as single entry point (where §1 metrics surface)
  • ADR-30 — Structured logs as stable interface (how §1 metrics propagate)
  • ADR-32, ADR-33 — log_event format + closed event catalogue (the contract §1 builds on)
  • ADR-56 — Every pipeline change must be FTF-testable / A/B testable (the rollback flag in §5)
  • ADR-59 — All pipeline params in PostgreSQL ftf_config (the config-only revert in §5)
  • ADR-68 — Expert Committee = default review channel (the pr_review that enforces this gate)
  • ADR-69 — OpenProject is the project orchestrator (where the filled template lives in the Story dossier)
  • documentation/templates/TEMPLATE_mlops_readiness.md — the template itself
  • Issues: #707 (F1_buy boost plan), #709 (this Story), #729 (Epic CVN-N001-EF)
  • Committee sessions: 8db2529d (original Ops finding source), 3e0a3008 (this ADR's plan_review, PASSED / OK avg 8.6, 9 recommendations applied to template + scope-table)