
ADR-0098 — A diagnostic/experiment MUST map the deployment regime it informs (metric AND regime), verified at the PLAN stage

Status: active (operator-mandated 2026-06-04; lesson-of-record from CVN-N001-EI-S04)

Context-of-record: the S04 / s42 capacity diagnostic measured validation AUC on fixed hyperparameters and ran a full multi-fold stability validation. Only at the application stage did it emerge that production searches hyperparameters under a trading objective (f1_buy), so the AUC-on-fixed-HP finding does not map to deployment. See reports/2026-06-03-multi-fold-hp-stability-filtering.md and S10. Second instance (regime-as-distribution, → Invariant 5): the S05 / s43 framing §0bis found that the deployed config is not a snapshot but a stochastic process (HPO re-selects materially different best_params each cycle); see S05 §0bis + r3.

Context

A diagnostic exists to inform a deployment decision. Its result only carries that information if its measurement frame matches the deployment frame on two axes:

  • Metric. What the diagnostic optimises/measures vs what production actually optimises. A proxy metric (e.g. AUC) is only legitimate if it transfers to the deployment objective (e.g. the trading metric the HPO selects on) — and that transfer must be established, not assumed.
  • Regime. How the diagnostic produces the artefact vs how production produces it. Fixed-hyperparameter, single-seed, no-threshold-calibration, no-search measurements do not map a deployment that searches hyperparameters, calibrates a threshold, and selects jointly.
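The two axes can be made concrete as a small comparison of frames. The following is a minimal sketch, not the project's tooling: the `Frame` class, its field names, and the string labels are hypothetical, chosen only to mirror the metric and regime dimensions named above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Frame:
    """Hypothetical description of a measurement frame (illustrative field names)."""
    metric: str      # what is optimised or measured, e.g. "auc" or "f1_buy"
    hp_mode: str     # "fixed" or "search"
    threshold: str   # "raw" or "calibrated"
    selection: str   # "marginal" or "joint"


def frame_gaps(diagnostic: Frame, deployment: Frame) -> dict[str, tuple[str, str]]:
    """Return every axis on which the diagnostic frame differs from deployment.

    Each value is a (diagnostic_value, deployment_value) pair.
    """
    gaps = {}
    for f in fields(Frame):
        d_val = getattr(diagnostic, f.name)
        p_val = getattr(deployment, f.name)
        if d_val != p_val:
            gaps[f.name] = (d_val, p_val)
    return gaps


# An s42-like diagnostic frame versus a deployment that searches hyperparameters,
# calibrates a threshold, and selects jointly (labels are illustrative).
diagnostic_frame = Frame(metric="auc", hp_mode="fixed", threshold="raw", selection="marginal")
deployment_frame = Frame(metric="f1_buy", hp_mode="search", threshold="calibrated", selection="joint")

for axis, (diag, prod) in frame_gaps(diagnostic_frame, deployment_frame).items():
    print(f"{axis}: diagnostic={diag} deployment={prod}")
```

An empty result means the frames match on every listed axis; any non-empty result is a gap that the plan has to name and bound.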

When either axis is mis-specified, the diagnostic can be internally valid yet decision-irrelevant: it answers a question disconnected from production. The expensive failure mode (s42) is to discover this after building and validating the diagnostic, instead of at plan time. The cost is not just wasted compute: a mis-framed verdict (B_SYSTEMATIC_OVERFIT) can drive wrong action if its frame gap is not surfaced.

Decision

Every diagnostic/experiment plan MUST contain, before the experiment is built or run, an explicit framing §0bis that answers, in writing:

  1. What does production actually optimise for this decision (the deployment objective), and is the diagnostic measuring that — or a proxy whose transfer is established, not assumed?
  2. What regime does production use to produce the artefact (search vs fixed HP, calibrated vs raw threshold, joint vs marginal, single vs multi-seed, which data window), and does the diagnostic reproduce that regime, or is the gap named and bounded? And is that regime a single artefact or a stochastic process, i.e. does the selection vary cycle-to-cycle? (Read MLflow for selection stability across retrains.) A deployment that re-searches lands a materially different artefact each cycle; mapping it means mapping its distribution, not one draw (see Invariant 5).
  3. If a proxy or a simplified regime is used deliberately (cheaper, isolating a mechanism), the plan states it is a proxy/mechanistic result and that deployment-transfer is a separate, required step — the diagnostic does not claim a deployment conclusion it has not earned.

This is a plan-gate: a diagnostic plan without the framing §0bis is not review-ready. It generalises the §0bis cross-check of ADR-0093 (cluster dry-run gate: verify that the load-bearing path consumes what you think it does) up one level, from "does the code path consume the key I think?" to "does the experiment's frame map the deployment I think?".
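One way to make the plan-gate mechanical is a check that refuses a plan whose framing answers are missing or whose proxy use is undeclared. This is a minimal sketch under stated assumptions: the dictionary keys (`deployment_objective`, `declared_as_proxy`, and so on) are hypothetical and do not come from the ADR-0095 or ADR-0097 templates.

```python
# Hypothetical keys for the written answers a plan's framing section must carry.
REQUIRED_ANSWERS = (
    "deployment_objective",   # question 1: what production optimises
    "diagnostic_metric",      # question 1: what the diagnostic measures
    "deployment_regime",      # question 2: how production produces the artefact
    "diagnostic_regime",      # question 2: how the diagnostic produces it
    "regime_is_stochastic",   # question 2: does the selection vary cycle-to-cycle?
)


def framing_problems(framing: dict) -> list[str]:
    """List the reasons a framing section fails the gate (empty list = passes)."""
    problems = [
        f"missing answer: {key}"
        for key in REQUIRED_ANSWERS
        if framing.get(key) in (None, "")
    ]
    if problems:
        return problems

    # Question 3: a proxy metric or simplified regime must be declared as such.
    uses_proxy = (
        framing["diagnostic_metric"] != framing["deployment_objective"]
        or framing["diagnostic_regime"] != framing["deployment_regime"]
    )
    if uses_proxy and not framing.get("declared_as_proxy", False):
        problems.append("proxy or simplified regime used but not declared as such")
    return problems


def is_review_ready(framing: dict) -> bool:
    """A plan is review-ready only if its framing section has no problems."""
    return not framing_problems(framing)


# Illustrative s42-like framing: AUC on fixed HP against an f1_buy search deployment.
s42_like = {
    "deployment_objective": "f1_buy",
    "diagnostic_metric": "auc",
    "deployment_regime": "hp_search",
    "diagnostic_regime": "fixed_hp",
    "regime_is_stochastic": True,
}
print(framing_problems(s42_like))
```

The check only verifies that the answers exist and that a proxy is declared; whether the answers were read from the live config rather than inferred remains a matter for the reviewer.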

Invariants

  • Invariant 1 — framing §0bis present at plan. Every diagnostic/experiment plan (and the canonical diagnostic-story / experiment-report templates) carries the metric-and-regime mapping check, answered against the live deployment config (read it — ftf_config.base_env, the HPO objective, the search ranges — do not infer from memory), before building the diagnostic.
  • Invariant 2 — proxy claims are bounded. A result on a proxy metric or a simplified regime is labelled as such and MUST NOT be stated as a deployment conclusion until transfer is demonstrated. "Better on AUC (fixed HP)" is not "better in production".
  • Invariant 3 — read live, per-artefact, not inferred. The deployment objective and regime are confirmed from current config and, where it matters, per-run (e.g. the objective a given run actually optimised), not assumed from the current default — the S03 / S10 lesson.
  • Invariant 4 — the gap is surfaced, not buried. If a frame gap is discovered mid-flight, it is raised to the verdict's headline (and an issue), not left as a footnote — a mis-framed verdict left standing drives wrong action.
  • Invariant 5 — map the distribution of a stochastic regime, not a snapshot. The framing §0bis includes a gate item with two steps: (i) detect whether the deployment re-selects the artefact each cycle (read MLflow for HPO selection stability run-to-run); (ii) if it does, the diagnostic MUST characterise its metric over a sample of that selection distribution (K draws), not a single draw. Even the most recent selection is just one draw of the process. The S05 / s43 finding: per-crypto HPO best_params vary materially cycle-to-cycle (lr 0.098–0.146, num_leaves 27–59), so there is no single "deployed config", only a re-drawing process. Scope: this is a regime discipline (which artefact the diagnostic feeds, and over what spread); concluding from a snapshot at the verdict level is ADR-0099's territory (existence≠selection). "Map the regime" = "map its distribution".
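The two steps of Invariant 5 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the 10% relative-spread tolerance, and the toy metric are hypothetical, and loading best_params from MLflow is left out (the functions take plain dicts). The range endpoints for lr and num_leaves are the ones cited above; the pairing of values into cycles and the middle cycle are invented for the example.

```python
import statistics
from typing import Callable


def selection_spread(best_params_by_cycle: list[dict]) -> dict[str, tuple[float, float]]:
    """Step (i): per-hyperparameter (min, max) across retrain cycles."""
    keys = best_params_by_cycle[0].keys()
    return {
        k: (min(c[k] for c in best_params_by_cycle), max(c[k] for c in best_params_by_cycle))
        for k in keys
    }


def is_stochastic(spread: dict[str, tuple[float, float]], rel_tol: float = 0.10) -> bool:
    """Flag the regime as a re-drawing process if any parameter's range exceeds
    rel_tol of its midpoint. The tolerance is an assumed, illustrative threshold."""
    for lo, hi in spread.values():
        mid = (lo + hi) / 2
        if mid != 0 and (hi - lo) / abs(mid) > rel_tol:
            return True
    return False


def characterise(metric_fn: Callable[[dict], float], draws: list[dict]) -> dict[str, float]:
    """Step (ii): evaluate the diagnostic metric over K draws, never a single one."""
    if len(draws) < 2:
        raise ValueError("need K >= 2 draws of the selection distribution")
    values = [metric_fn(params) for params in draws]
    return {
        "k": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Endpoints from the S05 / s43 finding; pairing and middle cycle are hypothetical.
cycles = [
    {"lr": 0.098, "num_leaves": 59},
    {"lr": 0.146, "num_leaves": 27},
    {"lr": 0.120, "num_leaves": 40},
]

spread = selection_spread(cycles)
if is_stochastic(spread):
    # Toy stand-in for the real diagnostic metric, used only to exercise the code.
    summary = characterise(lambda p: float(p["num_leaves"]), cycles)
    print(spread, summary)
```

Reporting the min, median, and max over the draws keeps the spread visible in the result instead of collapsing it to a single number.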

Alternatives rejected

  • "Catch it in review." Rejected: the s42 plan was reviewed (committee PASSED) and the frame gap still passed — because nobody was required to read the live deployment objective/regime. A named plan-gate with a live-config read is the control; review alone is not.
  • "Proxy metrics are fine, validate later." Rejected as the default: the whole multi-fold validation of an AUC finding was spent before its deployment-relevance was checked. The transfer question is cheap (read the config) and comes first, not after the expensive validation.

Consequences

  • Diagnostic/experiment templates gain a mandatory "Deployment-frame mapping (§0bis)" section; a plan without it is not review-ready.
  • Cheap up-front reads (deployment objective, search ranges, calibration) replace expensive late discoveries.
  • Some diagnostics will be re-scoped or not run once the frame gap is seen at plan time — that is the point.
  • The transferable methodological output of a mis-framed-but-rigorous diagnostic (e.g. the multi-fold stability method) survives; the deployment verdict does not.

References

  • Lesson-of-record: reports/2026-06-03-multi-fold-hp-stability-filtering.md (§ deployment-regime mismatch) · S10
  • Related ADRs: ADR-0093 (§0bis cross-check — this is its plan-level generalisation), ADR-0095 (diagnostic-story template — gains the framing section), ADR-0097 (experiment-report template — pre-registration + §3 now includes deployment-frame mapping), ADR-90 (HP defaults/ranges in PG — the live config to read), ADR-15 (theta calibrated OOS — part of the deployment regime).