
MLOps readiness — CVN-N001-EI-S03 — Split / regime reconstruction (Block 3b)

  • Story: CVN-N001-EI-S03 · wp#226 · GH #1058 · Epic CVN-N001-EI (#1055)
  • Template: documentation/templates/TEMPLATE_mlops_readiness.md (ADR-70)
  • Plan dossier: documentation/reviews/2026-05-28-cvn-n001-ei-s03-split-regime-reconstruction-plan.md (committee plan_review 0854dc31 PASSED)
  • Decision: documentation/reviews/2026-05-28-cvn-n001-ei-s03-regime-signal-source-decision.md (Option B unconditional)

Nature of this change (read first): S03 ships two things.

(a) The S41 split-ablation diagnostic: a two-layer (Airflow + Hamilton) instrument that re-splits and re-fits a model under six split families to measure whether best_iter and AUC move materially (the C-d test), and resolves the S02 temporal-autocorrelation reserve.

(b) The CVN_SPLIT_MODE gated production split: an off-by-default, A/B-testable capability (ADR-56) that inserts a purged walk-forward embargo gap. It is NOT promoted to the production default (ADR-2); promotion awaits the S03 + S04 verdicts.

The diagnostic audits models; it ships none. The gated split changes nothing in production while CVN_SPLIT_MODE=off (the default). The behavioural sections (drift / staged rollout / expectancy / canary) are therefore N/A, with justification, for this PR. Promotion of the gated split is a separate, future decision with its own readiness.
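To make the "purged walk-forward embargo gap" concrete, here is a minimal sketch of one way such a split could work. It is not the project's regime_split code, which this document does not show. The function names, the expanding training window, and the equal-sized test blocks are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def purged_walk_forward(
    n_samples: int, n_folds: int, embargo: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) over time-ordered rows 0..n_samples-1.

    Hypothetical helper: each test block follows an expanding training
    window, and the last `embargo` rows before the test block are dropped
    from training (the embargo gap).
    """
    if n_folds < 1 or embargo < 0:
        raise ValueError("n_folds must be >= 1 and embargo >= 0")
    block = n_samples // (n_folds + 1)
    if block <= embargo:
        raise ValueError("not enough samples for this fold count and embargo")
    for k in range(1, n_folds + 1):
        test_start = k * block
        # The last fold absorbs any remainder rows.
        test_end = n_samples if k == n_folds else test_start + block
        train_end = test_start - embargo  # exclusive upper bound of training rows
        yield list(range(train_end)), list(range(test_start, test_end))


def embargo_violated(
    train_idx: Sequence[int], test_idx: Sequence[int], embargo: int
) -> bool:
    """True if train and test overlap, or the gap before test is shorter than `embargo`."""
    if not train_idx or not test_idx:
        return True
    if set(train_idx) & set(test_idx):
        return True
    gap = min(test_idx) - max(train_idx) - 1
    return gap < embargo
```

With 100 rows, 3 folds and an embargo of 5, the first fold trains on rows 0 to 19 and tests on rows 25 to 49, leaving rows 20 to 24 unused. A check like `embargo_violated` is the kind of guard that the s41_embargo_violation event in §1 would report on.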


1. Production monitoring (MUST)

| Metric | Type | Source | Threshold (warn / crit) | Owner |
|---|---|---|---|---|
| s41_verdict.status | counter (per status) | Loki {namespace="cvntrade"} \|= "event=s41_verdict" | warn on severity=error (INCONCLUSIVE_TOOLING); review on BASELINE_LEAK_INFLATED (HALT-level: leak signal) | dococeven |
| s41_cell_outcome / s41_group_outcome | counter | Loki \|= "event=s41_cell_outcome" / "event=s41_group_outcome" | n/a (diagnostic verdict) | dococeven |
| s41_embargo_violation | counter | Loki \|= "event=s41_embargo_violation" | crit on any: a guarded split produced train/test overlap or short embargo (a code bug) | dococeven |
| s41_power_estimate.powered | gauge (bool) | Loki \|= "event=s41_power_estimate" | n/a (false → family-6 underpowered → INCONCLUSIVE_UNDERPOWERED) | dococeven |
| split_mode_applied.mode | counter | Loki \|= "event=split_mode_applied" | alert if mode != off unexpectedly (the gated split must stay off until promoted) | dococeven |
  • Prediction-rate metric: N/A — adds no prediction. The diagnostic re-fits to measure AUC, not to serve.
  • Outcome metric: N/A — no trading-outcome change (the gated split is off by default).
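The Loki line filters in the table match on the substring `event=<name>`, so the emitting code needs to write that token into each log line. The sketch below shows one way to produce such lines. The logger name, the helper names, and the key=value layout beyond the `event=` token are assumptions; the document does not specify the log format.

```python
import logging

logger = logging.getLogger("cvn.s41")  # hypothetical logger name


def format_event(event: str, **fields: object) -> str:
    """Render 'event=<name> key=value ...' with keys sorted so lines are stable."""
    parts = [f"event={event}"]
    for key in sorted(fields):
        value = fields[key]
        if isinstance(value, bool):
            # Lowercase booleans so 'powered=false' is easy to filter on.
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return " ".join(parts)


def log_event(event: str, level: int = logging.INFO, **fields: object) -> None:
    """Emit one structured line; a Loki filter on 'event=<name>' would match it."""
    logger.log(level, format_event(event, **fields))
```

For example, `log_event("split_mode_applied", mode="off")` writes a line containing `event=split_mode_applied mode=off`, which the `|= "event=split_mode_applied"` filter would select.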

2. Alerting & runbooks (MUST)

  • s41_embargo_violation (crit) → the leak guard fired in a guarded family: a regime_split bug. Runbook: inspect the cell's family/kind, fix regime_split, re-run. (Should be impossible — unit-tested — hence crit if seen in prod.)
  • split_mode_applied mode != off while no promotion decision exists → someone set CVN_SPLIT_MODE in ftf_config; confirm it was an intentional A/B run, else reset to off (Console, ADR-59).
  • BASELINE_LEAK_INFLATED group verdict → the program is under a leak HALT signal (the S02 reserve direction); escalate per the plan §1 routing before any modeling conclusion.
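The three rules above can be read as a small routing function over parsed log events. The sketch below is illustrative only: the function name, the dict-shaped event, the `promotion_approved` flag, and the returned level strings ("crit", "alert", "review", "warn") are assumptions layered on the thresholds stated in §1.

```python
from typing import Mapping, Optional


def classify_alert(
    event: Mapping[str, str], promotion_approved: bool = False
) -> Optional[str]:
    """Map one parsed event to an alert level, or None if no alert applies."""
    name = event.get("event")
    if name == "s41_embargo_violation":
        # Any occurrence is critical: the leak guard fired in a guarded family.
        return "crit"
    if name == "split_mode_applied":
        # A missing mode field is treated like a non-off mode, to fail loudly.
        if event.get("mode") != "off" and not promotion_approved:
            return "alert"
        return None
    if name == "s41_verdict":
        if event.get("status") == "BASELINE_LEAK_INFLATED":
            return "review"  # HALT-level leak signal, escalate
        if event.get("severity") == "error":
            return "warn"  # e.g. INCONCLUSIVE_TOOLING
    return None
```

An intentional A/B run would still raise "alert" here unless `promotion_approved` is set; in practice the runbook step of confirming intent with the operator covers that case.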

3. Drift detection (MUST)

N/A — no feature / label / architecture / calibration change. S03 adds a diagnostic + a gated-off split capability. While CVN_SPLIT_MODE=off (default) the production train/test boundary is byte-identical to today. Production drift detection is owned by the N010 Drift Store epics, unaffected here.

4. Staged rollout (MUST)

N/A — diagnostic + gated-off capability. The S41 diagnostic is operator-triggered (no traffic, no predictions). The CVN_SPLIT_MODE split is off by default and not promoted (ADR-2); there is nothing to shadow / canary / roll out in this PR. If the S03 + S04 verdicts later justify promoting the purged-WF split to the production default, that promotion is a separate Story with its own staged rollout + canary + this template re-filled — explicitly out of scope here.

5. Rollback plan (MUST)

| Mechanism | Description | Revert SLA |
|---|---|---|
| Config flag | CVN_SPLIT_MODE=off (the default): disables the purged-WF split; no deploy needed (Console, ADR-59) | < 1 min |
| Git revert | Revert the S03 commits on main: removes the diagnostic + the gated split wiring. No model / feature / production-config artefact involved (the gated split was never promoted). | < 5 min (next DAG-sync + build) |
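The config-flag rollback relies on "off" being both the default and the safe value. A minimal sketch of a resolver with that property follows. It assumes the flag is readable from an environment-style mapping, whereas the document says it is set in ftf_config via the Console. The mode name "purged_wf" is a hypothetical placeholder, since the document does not name the non-off value.

```python
import os
from typing import Mapping, Optional

# "purged_wf" is a hypothetical name for the gated mode; only "off" is
# confirmed by the readiness document.
KNOWN_MODES = frozenset({"off", "purged_wf"})


def resolve_split_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the active split mode, falling back to 'off' when unset or unrecognised."""
    source = os.environ if env is None else env
    raw = source.get("CVN_SPLIT_MODE", "off").strip().lower()
    return raw if raw in KNOWN_MODES else "off"
```

Falling back to "off" on an unrecognised value is a design choice in this sketch: a typo in the flag then leaves the production train/test boundary unchanged instead of enabling an unintended split.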

6. Owner & DRI (MUST)

  • Owner / DRI: dococeven
  • Diagnostic verdict authority: committee (the S41 group verdict + the §12 routing feed a committee experiment_review, not an automatic production action — and a control-group result triggers a generalisation Story, not a program-wide ship, per the decision dossier §reconciliation).

Sign-off checklist (gate before PR merge)

  • §1 monitoring: s41_verdict / s41_embargo_violation / s41_power_estimate / split_mode_applied defined + Loki-discoverable; prediction/outcome N/A-justified (no behavioural change while gated off)
  • §2 alerting: embargo-violation (crit) + split_mode != off + BASELINE_LEAK_INFLATED runbooks stated
  • §3 drift: N/A-justified (no feature/label/architecture/calibration change)
  • §4 rollout: N/A-justified (diagnostic + gated-off split; promotion = separate Story)
  • §5 rollback: CVN_SPLIT_MODE=off flag (< 1 min) + git revert (< 5 min)
  • §6 owner: dococeven; verdict authority = committee
  • Story OP comment links the committee plan_review session 0854dc31 / OP Meeting #234