
Plan dossier (r4 — plan_review PASSED)

Review note. This is the plan for Story S09 (diagnostic template, ADR-0095). It fixes the what & why and pre-registers the decision rule before any code or run. Submitted to the committee for plan_review. Canonical structural reference: CVN-N001-EI-S05.

Story: CVN-N001-EI-S09 · OpenProject: wp#237 · GitHub issue: #1099 · plan dossier: this file (...-plan.md).

Revision journal

- r1 (2026-06-09): initial draft.
- r2 (2026-06-09): referee round 1 incorporated: full English pass; deterministic execution config specified (ADR-0096 + LightGBM determinism); H1 made cross-process (reproducibility, not just in-process repeatability); canonical reproducibility envelope ε_prod added and gated; decision rule re-ordered so a missing canonical no longer masks a determinism verdict; epsilon frozen (value + provenance); cell scope reconciled (5 cells for cause classification); gate generalization bounded; truth table + routing/sequence figures added.
- r3 (2026-06-09): referee round 2, the replay↔prod environment delta closed: prod-config made binary (prod runs un-capped → ε_prod measured under prod's real config, not §3.C; "ideally" removed); env-parity precondition P0 added (image SHA via ADR-92 + instance type; without it FIDELITY_OK is not posable and a divergence is classified env/version gap, not fidelity gap); ε_prod estimator made conservative (one-sided tolerance bound, N≥10, not max-of-5); H1-FAIL cause sub-branch pre-registered (numeric residue vs real instability; pure FAIL preserved); determinism event always emitted (incl. status=ERROR); the prior stated (A6 gaps 0.1–0.3 ≫ any epsilon → live hypothesis is gross divergence); ε_num justified as a fidelity threshold.
- r4 (2026-06-09): committee plan_review PASSED·OK (session f5fdbf8d, OP Meeting 268, 5 experts, strong consensus, 0 blockers, 0 dissent). Three cheap, co-located recommendations folded: P0 extended to feature/artefact version parity (ADR-23) and OS/lib versions (§3.F); H1-FAIL cause emitted as a warn-level Loki event (§1). Remaining recommendations tracked as follow-ups (see "Committee plan_review").


Part A — Narrative (value & accessibility)

A1. Problem statement (non-technical)

The question in one sentence. When our diagnostics "replay" a production training run to inspect it (the s18 replay), do they replay the same thing production did — and do they replay it the same way twice?

Why now. At the close of S04 (LightGBM capacity ablation), the diagnostic__s42 runs came out with s18_status=FAIL on all 5 cells: the replayed f1_buy (0.36–0.47) did not match the stored canonical f1_buy from production. This is action item A6 of the s42 RCA. The problem is not S04's capacity verdict (orthogonal, intra-run) — it is that until this divergence is explained, we cannot claim "our diagnostic is faithful to the production baseline." Several diagnostics rest on this replay (s41 / s42 / s43 — running right now for S05).

The traps / illusions. Today the divergence surfaces only incidentally, via a run's Phase A anchor. Two illusions to avoid:
1. "The replay is wrong": maybe, but maybe it is unstable (non-deterministic); it would then yield a different f1_buy each time, and any comparison to the canonical would be meaningless.
2. "FAIL = a bug in the diagnostic": on the contrary, an honestly detected FAIL is the information; the real danger is a replay that diverges silently and is taken as faithful.

What we actually measure. Two separate things, in this order:
- Determinism: replayed twice in two fresh processes with the same pinned config and seed, does the replay return a bit-identical f1_buy?
- Canonical divergence: does the replay (once determinism holds) depart from the stored canonical f1_buy beyond a tolerance epsilon?

Determinism must be established first: without it, talking about "divergence vs production" is meaningless (we would be comparing noise).

Honesty of the verdict. If we cannot decide (machinery failure, incomplete cell), the verdict is INCONCLUSIVE_TOOLING — an owned "we don't know," never a false PASS.

Why it matters (the decisions it feeds). Three concrete decisions:
- Immediate downstream: this check bounds the asterisk on the in-flight S05/s43 verdict ("faithful to the prod baseline" vs "fidelity not established").
- Programme downstream: it gates the right to make skip_phase_a=true a default (skipping the expensive capture), which is forbidden today because the Phase A anchor is what reveals A6.
- Feeds D2 of the tradability decision protocol (Epic CVN-N001-EK), which requires a substrate of validated fidelity.

A2. User stories

| # | As a… | I want… | so that… | Section |
|---|---|---|---|---|
| US1 | quant / reviewer | to know whether the s18 replay is deterministic (cross-process) | I do not compare noise to a canonical | A1, §3.A, H1 |
| US2 | quant | to know whether the replay diverges from the prod baseline, independently of a run | the fidelity claim of any s18 diagnostic is bounded | A1, §3.B, H2 |
| US3 | operator | a job (nightly / manual) emitting a readable verdict + Loki events | I can monitor fidelity without launching a full diagnostic | §3.D, runbook |
| US4 | programme DRI | the cause of the divergence (non-determinism vs fidelity gap) | I can decide whether skip_phase_a=true may become the default (gate) | A1, §6, decision routing |
| US5 | dev2 (Epic EK) | a substrate whose fidelity is validated or bounded | I can feed the D2 pre-study (dependency model) | A1, §6 |

A3. Hypotheses (pre-registered)

Each hypothesis = statement + acceptance criterion tested + method. We distinguish tested hypotheses from surfaced working assumptions (confirmed in-pod, not statistically tested).

Terminology note (referee #12). These are acceptance invariants, not NHST nulls: the held criterion is PASS (the inverse of the usual "reject the null" convention), and there is no α / power — determinism is an exact-equality invariant, divergence a fixed-threshold criterion.

H1: Replay determinism (cross-process). Under the pinned deterministic config (§3.C), the s18_step0_replay path is reproducible: two executions in two fresh processes (same seed, same cell, same window) return a bit-identical (IEEE-754 exact-equal) f1_buy.
- Acceptance invariant (H1): f1_buy(proc_a) == f1_buy(proc_b) exactly.
- Why cross-process (referee #2): H2 compares the replay to a canonical produced by production, in a different process/pod. An in-process double-call only proves repeatability (Gauge R&R: the first "R"); the fidelity comparison of H2 needs reproducibility (the second "R"), i.e. determinism that survives a fresh process. So H1 is cross-process. An in-process double-call is kept only as a cheap pre-check (R1) that does not, by itself, license H2.
- Method: run _replay_cell in two fresh subprocesses under the pinned config; compare f1_buy. Emitted event: event=s18_replay_determinism status=PASS|FAIL|ERROR abs_delta_runs=<> (ERROR when a replay breaks or returns NaN, per §1). No statistics: this is a reproducibility assertion (golden master).
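The exact-equality invariant of H1 can be illustrated with a minimal sketch. The function name `determinism_status` and the sample value are hypothetical, chosen for illustration only; the point is that a one-ulp difference fails H1 even though any tolerance-based comparison would accept it.

```python
import math
from typing import Tuple


def determinism_status(f1_a: float, f1_b: float) -> Tuple[str, float]:
    """Apply the H1 invariant: IEEE-754 exact equality, no tolerance."""
    if math.isnan(f1_a) or math.isnan(f1_b):
        # NaN never compares equal to anything; route it as an error, not a FAIL.
        return "ERROR", math.nan
    status = "PASS" if f1_a == f1_b else "FAIL"
    return status, abs(f1_a - f1_b)


base = 0.4123                            # hypothetical replayed f1_buy
neighbour = math.nextafter(base, 1.0)    # next representable float above base

same_run = determinism_status(base, base)            # identical values
one_ulp_apart = determinism_status(base, neighbour)  # differs by one ulp
tolerant_view = math.isclose(base, neighbour, abs_tol=0.005)  # what a tolerance would say
```

Here `one_ulp_apart` is a FAIL while `tolerant_view` is true, which is why the invariant only makes sense under the pinned config of §3.C.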

H2: Fidelity to the canonical. The replay's f1_buy matches the canonical f1_buy stored in production (finetune_results) within epsilon.
- Acceptance invariant (H2): abs(observed_f1_buy − canonical_f1_buy) ≤ epsilon.
- Method: read the canonical out-of-run (query_canonical_from_pg), compare to the replay after H1 has established determinism. Emitted event: event=s18_replay_divergence severity=warn|info abs_delta=<>. Independent of skip_phase_a (the check does not depend on a diagnostic run).
- epsilon is frozen here (see §3.E). Its floor is the production-path reproducibility envelope (referee #3): a perfectly faithful, deterministic replay still differs from a single canonical draw by production's own run-to-run spread, so epsilon must dominate that spread or H2 confounds infidelity with the canonical generator's noise.
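A minimal sketch of the H2 comparison follows. The function name `divergence_event` and the dict-shaped event are assumptions made for illustration; only the event name, the severity levels, and the `abs_delta ≤ epsilon` criterion come from the plan. The numeric inputs in the usage lines are hypothetical values in the ranges quoted for A6.

```python
from typing import Optional


def divergence_event(observed_f1_buy: float,
                     canonical_f1_buy: Optional[float],
                     epsilon: float) -> Optional[dict]:
    """Build the s18_replay_divergence payload, or None when no canonical exists."""
    if canonical_f1_buy is None:
        # A missing canonical blocks H2 only; the caller decides the verdict.
        return None
    abs_delta = abs(observed_f1_buy - canonical_f1_buy)
    severity = "info" if abs_delta <= epsilon else "warn"
    return {
        "event": "s18_replay_divergence",
        "severity": severity,
        "abs_delta": abs_delta,
    }


# Hypothetical values: an A6-sized gap, then a replay inside the tolerance band.
gross_gap = divergence_event(0.42, 0.65, epsilon=0.005)
faithful = divergence_event(0.652, 0.65, epsilon=0.005)
```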

Working assumption WA1 (surfaced, not statistically tested) — if H2 is rejected while H1 PASSes, the cause is a real fidelity gap in the replay path (applied env, LdP loop order, data window, artefact version), not noise. WA1 steers the cause diagnosis (§6 deliverable); it is not a decisional test. WA1's candidate causes are mapped to what _replay_cell actually controls in §6 (referee #7).

The disambiguation is the diagnostic — it is the Cartesian product (H1 × H2) that names the cause (Fig 1), not H1 or H2 alone.

A4. State of the art

  • Reproducibility as a first-class property. Non-determinism in ML training (data order, unseeded RNG, parallel reduction order) is a known source of irreproducible metrics [Pineau2021]. The discipline is to test bit-reproducibility, not assume it — here via a same-seed double-run in fresh processes (characterization / golden-master testing [Feathers2004]).
  • Repeatability vs reproducibility (Gauge R&R). Measurement-system analysis separates repeatability (same operator/instrument, repeated) from reproducibility (across operators/conditions) [AIAG-MSA, Montgomery2013]. "Replay-divergence" is a reproducibility question; an in-process repeat is the wrong "R" (referee #2).
  • Capture-replay / training-serving skew. A diagnostic that replays a production training path is exposed to the same failure mode as training-serving skew: the replayed environment silently differs from the canonical one [GoogleMLRules]. The mitigation is an independent fidelity check keyed on a stored canonical, not an incidental anchor.
  • Order of operations (determinism before divergence). Comparing a stochastic quantity to a reference is meaningless until the quantity is shown stable; establishing determinism precedes any divergence claim — the same logic as fixing measurement repeatability before assessing bias.
  • Bit-reproducibility requires a pinned numeric environment. GBM/BLAS reductions are order-sensitive; bit-identical results require capped thread pools and library-level determinism flags [LightGBM-Determinism], which is exactly the contract of ADR-0096 (compute pods cap thread pools to cgroup) and the s43 thread-capped golden fixture (referee #1).
  • Pre-registration of the decision rule. The verdict mapping (thresholds, tie-break) is frozen before the run to avoid post-hoc rationalization [Nosek2018], per ADR-0095.
Reference Grounds
[Pineau2021] Pineau et al., Improving Reproducibility in ML Research (JMLR 2021) H1, determinism
[Feathers2004] Feathers, Working Effectively with Legacy Code (golden master) H1, method
[AIAG-MSA] AIAG, Measurement Systems Analysis (MSA) — Gauge R&R repeatability vs reproducibility (referee #2/#13)
[Montgomery2013] Montgomery, Introduction to Statistical Quality Control Gauge R&R
[GoogleMLRules] Zinkevich, Rules of Machine Learning (training-serving skew) H2, fidelity
[LightGBM-Determinism] LightGBM docs — deterministic, force_row_wise §3.C deterministic config
[Nosek2018] Nosek et al., The preregistration revolution (PNAS 2018) pre-registration invariant

ADRs cited: ADR-0095 (diagnostic template), ADR-0096 (thread-pool cap / bit-identity), ADR-25 (no silent fallback / fail-loud), ADR-31 (no print), ADR-92 (build SHA surfaced).

A5. Consolidation & traceability

Problem → hypothesis → US → section → literature (no dangling thread):

| Problem | Hypothesis | US | Section | Literature |
|---|---|---|---|---|
| Replay unstable (cross-process)? | H1 | US1 | §3.A, §3.C | Pineau2021, Feathers2004, AIAG-MSA, LightGBM-Determinism |
| Replay ≠ prod? | H2 | US2 | §3.B, §3.E | GoogleMLRules |
| Cause (unstable vs gap)? | H1×H2 + WA1 | US4 | A1, §6 | AIAG-MSA |
| Operable monitoring | | US3 | §3.D, runbook | |
| Substrate for D2 (EK) | | US5 | A1, §6 | |

Decision routing (verdict → action) — pre-registered; pseudo-code in §1, truth table Fig 1, flow Fig 2:

| Verdict | Meaning | Action |
|---|---|---|
| FIDELITY_OK | H1 PASS + H2 invariant held | lift the A6 caveat for the covered surface; authorize skip_phase_a=true default only within that surface (referee #6) |
| NON_DETERMINISTIC | H1 FAIL | fix determinism (seed / order / window / threads) before any fidelity claim; A6 stays open |
| CANONICAL_DIVERGENCE | H1 PASS + H2 rejected | real fidelity gap → diagnose the cause (WA1); bounds the "faithful to prod" claim of s18 verdicts (incl. S05) |
| INCONCLUSIVE_TOOLING | machinery / incomplete cell / NaN | fix + re-run (no-crash, ADR-25) |

Part B — Technical specification

§0 Provenance

  • Discovery: close of CVN-N001-EI-S04 (capacity verdict B_DEFAULTS_OVERFIT) — s18_status=FAIL on 5/5 cells in s42 Phase A, observed_f1 0.36–0.47.
  • Sources: s42 RCA (A6) · s42 fix-plan §2.E · s42 experiment §4.
  • Orthogonality: independent of the S04 intra-run capacity verdict; this is a substrate-validity guard (cross-cutting s41 / s42 / s43).
  • Existing code: src/commun/finetune/diagnostic/s18_step0_replay.py, which provides Verdict(status PASS|FAIL|ERROR, expected_f1_buy, observed_f1_buy, abs_delta, epsilon, …), query_canonical_from_pg, and _replay_cell. The current check is single-replay vs canonical and coupled to the Phase A anchor.

§1 Objective + pre-registered decision rule

Objective: deliver an independent detection (outside a diagnostic run) of (A) replay non-determinism (cross-process) and (B) canonical divergence, with Loki events and a verdict that names the cause — the precondition to the skip_phase_a=true-by-default gate.

Stated prior (referee #6 — allocates the diagnostic effort). The A6 deltas are gross: replayed f1_buy 0.36–0.47 vs a canonical plausibly ~0.6–0.7, i.e. an order of 0.1–0.3 ≫ any plausible epsilon (ε_num = 0.005). The live hypothesis is therefore CANONICAL_DIVERGENCE (a coarse fidelity/env gap); H1 is a fast prerequisite to rule out "it's just noise," not the expected failure. Consequently the bulk of the cause-diagnosis effort goes to the _replay_cell control-surface map (§6.5: env / order / window / artefact version) and to the env-parity precondition P0 below — not to chasing 1e-9 determinism residues.

Decision rule (frozen, pseudo-code). Order of evaluation: determinism H1 → env-parity precondition P0 → canonical fidelity H2. Determinism is evaluated and emitted independently of the canonical and of P0 (referee #4); a missing canonical only blocks H2. FIDELITY_OK is not posable unless P0 holds (referee #2):

# Inputs: cell (ReferenceCell), epsilon (frozen, §3.E = max(ε_num, ε_prod)), ε_num (frozen, §3.E),
#         canonical row (PG, out-of-run; carries f1_buy, build_sha, instance_type via ADR-92; may be None),
#         this replay's build_sha + instance_type.
# Pinned deterministic config (§3.C) applied in every replay subprocess.

# Stage A — determinism (H1), cross-process; needs neither the canonical nor P0
r_a = replay_cell_fresh_process(cell)      # observed f1_buy + extended (diagnostic-only) metrics
r_b = replay_cell_fresh_process(cell)      # identical pinned config / seed / window, FRESH process

if r_a is ERROR or r_b is ERROR or isnan(r_a.f1_buy) or isnan(r_b.f1_buy):
    emit s18_replay_determinism status=ERROR                 # always emitted (referee #5, no obs. hole)
    emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=replay_error
    return INCONCLUSIVE_TOOLING                              # NaN guarded here (referee #11)

det_delta = abs(r_a.f1_buy - r_b.f1_buy)
det_status = PASS if (r_a.f1_buy == r_b.f1_buy) else FAIL    # IEEE-754 exact equality (referee #14)
emit s18_replay_determinism status=det_status abs_delta_runs=det_delta

if det_status == FAIL:
    # Pure FAIL preserved; cause sub-branch routed (referee #3 — anti dead-end), NOT a tolerance on H1
    det_cause = "numeric_residue" if det_delta <= ε_num else "real_instability"
    emit s18_replay_determinism_cause severity=warn cause=det_cause det_delta=det_delta  # rec #9: operator-visible, no artefact parse
    emit s18_replay_verdict verdict=NON_DETERMINISTIC det_delta=det_delta cause=det_cause
    return NON_DETERMINISTIC                                 # H1 FAILS → H2 not assessed (tie-break #1)

# H1 holds → H2 is now meaningful
if canonical is None:
    emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=canonical_absent
    return INCONCLUSIVE_TOOLING                              # canonical blocks ONLY H2 (referee #4)
if ε_prod is UNMEASURED:                                     # §3.E gate (referee #3/#4bis)
    emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured
    return INCONCLUSIVE_TOOLING

# Precondition P0 — env parity replay↔canonical-generator (referee #2). Without it, FIDELITY is not posable.
parity = (r_a.build_sha == canonical.build_sha) and (r_a.instance_type == canonical.instance_type)
canon_delta = abs(r_a.f1_buy - canonical.f1_buy)
emit s18_replay_divergence severity=(info if canon_delta <= epsilon else warn) abs_delta=canon_delta parity=parity

if not parity:
    if canon_delta <= epsilon:
        emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=env_parity_unverified
        return INCONCLUSIVE_TOOLING          # matches DESPITE env delta → cannot claim clean FIDELITY_OK
    else:
        emit s18_replay_verdict verdict=CANONICAL_DIVERGENCE cause=env_parity_gap canon_delta=canon_delta
        return CANONICAL_DIVERGENCE          # divergence attributed to env/version, NOT a logic fidelity gap

# P0 holds → a within-ε result is a clean fidelity pass; a beyond-ε result is a logic fidelity gap
verdict = FIDELITY_OK if canon_delta <= epsilon else CANONICAL_DIVERGENCE
cause   = None         if canon_delta <= epsilon else "logic_fidelity_gap"
emit s18_replay_verdict verdict=verdict cell=<crypto/fold> det_delta=det_delta canon_delta=canon_delta cause=cause
return verdict

Tie-break / priorities (frozen):
1. Determinism first: a NON_DETERMINISTIC masks any divergence reading (we do NOT name CANONICAL_DIVERGENCE if H1 failed, because the divergence would be uninterpretable).
2. Tooling wins: any error path (replay ERROR, NaN, canonical absent for H2) → INCONCLUSIVE_TOOLING, never a false PASS, never a UI crash. Determinism is emitted before the canonical is consulted, so a missing canonical no longer hides a determinism failure (referee #4).
3. f1_buy(a) == f1_buy(b) exact: determinism is bit-identity (IEEE-754 exact equality), not "close." Any inequality = FAIL. This is what separates instability from a fidelity gap (and it is only meaningful under the pinned config §3.C).
4. epsilon is frozen in §3.E (value + provenance); it is not a free parameter. This closes the pre-registration hole the referee flagged (#4bis).
5. Env parity P0 gates FIDELITY (referee #2): FIDELITY_OK is posable only when the replay's build_sha + instance_type equal those of the run that produced the canonical (recorded via ADR-92). Under non-parity: a within-ε match → INCONCLUSIVE_TOOLING reason=env_parity_unverified (matches despite an env delta, so no clean fidelity claim); a beyond-ε divergence → CANONICAL_DIVERGENCE cause=env_parity_gap (attributed to env/version, not a logic gap). So a divergence is never silently over-read as a fidelity-logic failure when the environments differ.
6. ε_prod unmeasured gates H2: until D0 measures the prod envelope (§3.E), the decisional fidelity verdict is withheld (INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured); H1 + smoke may run.
7. Determinism event always emitted (incl. status=ERROR): no observability hole (referee #5).
8. H1-FAIL keeps a pure FAIL but routes a cause sub-branch (numeric_residue if det_delta ≤ ε_num else real_instability): this is not a tolerance on H1; it prevents the diagnostic from bricking if perfect bit-identity proves unreachable in the stack (referee #3).
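As an illustration only, the frozen rule and its tie-breaks can be transcribed into a pure Python function. This is a sketch under assumptions, not the planned `_decide_verdict`: the `RunRecord` dataclass, the use of `None` for a replay ERROR and for an unmeasured ε_prod, and the `(verdict, cause_or_reason)` return shape are all hypothetical, and event emission is left out. Parity is reduced to build_sha + instance_type, as in the pseudo-code.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

EPS_NUM = 0.005  # frozen numerical tolerance (§3.E)


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical container for a replay result or a canonical row."""
    f1_buy: float
    build_sha: str
    instance_type: str


def decide_verdict(r_a: Optional[RunRecord],
                   r_b: Optional[RunRecord],
                   canonical: Optional[RunRecord],
                   eps_prod: Optional[float]) -> Tuple[str, Optional[str]]:
    """Return (verdict, cause_or_reason); None stands for ERROR / absent / unmeasured."""
    # Stage A: determinism (H1). Needs neither the canonical nor P0.
    if r_a is None or r_b is None or math.isnan(r_a.f1_buy) or math.isnan(r_b.f1_buy):
        return "INCONCLUSIVE_TOOLING", "replay_error"
    if r_a.f1_buy != r_b.f1_buy:  # exact equality, no tolerance
        det_delta = abs(r_a.f1_buy - r_b.f1_buy)
        cause = "numeric_residue" if det_delta <= EPS_NUM else "real_instability"
        return "NON_DETERMINISTIC", cause  # H2 is not assessed (tie-break 1)

    # H1 holds: H2 is meaningful, provided its inputs exist.
    if canonical is None:
        return "INCONCLUSIVE_TOOLING", "canonical_absent"
    if eps_prod is None:
        return "INCONCLUSIVE_TOOLING", "epsilon_prod_unmeasured"

    epsilon = max(EPS_NUM, eps_prod)
    within = abs(r_a.f1_buy - canonical.f1_buy) <= epsilon
    parity = (r_a.build_sha == canonical.build_sha
              and r_a.instance_type == canonical.instance_type)

    if not parity:  # P0 fails: never a clean FIDELITY_OK
        if within:
            return "INCONCLUSIVE_TOOLING", "env_parity_unverified"
        return "CANONICAL_DIVERGENCE", "env_parity_gap"
    if within:
        return "FIDELITY_OK", None
    return "CANONICAL_DIVERGENCE", "logic_fidelity_gap"
```

Keeping the function free of I/O means each row of the truth table can be asserted directly in a unit test, without a pod or a database.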

Fig 1 — truth table (exactly one verdict per cell). P0 (env parity) is the gating precondition on the PASS row only (it is irrelevant when H1 FAIL/ERROR already decides). FAIL/ERROR rows are P0-independent.

H1 PASS row, split by P0 (env parity replay↔canonical-generator):

| P0 \ H2 (canonical) | ≤ ε | > ε | canonical absent | ε_prod unmeasured |
|---|---|---|---|---|
| parity holds | FIDELITY_OK | CANONICAL_DIVERGENCE (cause=logic_fidelity_gap) | INCONCLUSIVE_TOOLING (canonical_absent) | INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured) |
| parity fails | INCONCLUSIVE_TOOLING (env_parity_unverified) | CANONICAL_DIVERGENCE (cause=env_parity_gap) | INCONCLUSIVE_TOOLING (canonical_absent) | INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured) |

H1 FAIL / ERROR rows (P0- and H2-independent):

| H1 | any H2 / P0 |
|---|---|
| FAIL (diverges across processes) | NON_DETERMINISTIC (cause = numeric_residue or real_instability) |
| ERROR / NaN (replay broke) | INCONCLUSIVE_TOOLING (replay_error) |

Reading: the FAIL row collapses to NON_DETERMINISTIC regardless of H2/P0 (tie-break #1); the ERROR row collapses to INCONCLUSIVE_TOOLING; H2 and P0 are consulted only on the PASS row. The test-strategy exhaustiveness test asserts this full mapping is total.
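The totality claim can be checked mechanically by enumerating every combination of inputs. The sketch below is one possible shape for such a check, not the planned test: the state labels (`within_eps`, `beyond_eps`, and so on) and the function name `fig1_verdict` are hypothetical encodings of the Fig 1 rows and columns.

```python
from itertools import product

VERDICTS = {"FIDELITY_OK", "NON_DETERMINISTIC",
            "CANONICAL_DIVERGENCE", "INCONCLUSIVE_TOOLING"}
H1_STATES = ("PASS", "FAIL", "ERROR")
H2_STATES = ("within_eps", "beyond_eps", "canonical_absent", "eps_prod_unmeasured")


def fig1_verdict(h1: str, parity: bool, h2: str) -> str:
    """Map one Fig 1 cell to its verdict; unknown labels are rejected loudly."""
    if h1 not in H1_STATES or h2 not in H2_STATES:
        raise ValueError(f"unknown state: h1={h1!r}, h2={h2!r}")
    if h1 == "ERROR":
        return "INCONCLUSIVE_TOOLING"   # replay broke: P0 and H2 are irrelevant
    if h1 == "FAIL":
        return "NON_DETERMINISTIC"      # tie-break 1: H2 is not assessed
    if h2 in ("canonical_absent", "eps_prod_unmeasured"):
        return "INCONCLUSIVE_TOOLING"   # H2 inputs missing
    if h2 == "within_eps":
        return "FIDELITY_OK" if parity else "INCONCLUSIVE_TOOLING"
    return "CANONICAL_DIVERGENCE"       # beyond epsilon; the cause depends on parity


# Enumerate the full grid: 3 H1 states x 2 parity states x 4 H2 states.
grid = {
    (h1, parity, h2): fig1_verdict(h1, parity, h2)
    for h1, parity, h2 in product(H1_STATES, (True, False), H2_STATES)
}
fidelity_ok_cells = [key for key, verdict in grid.items() if verdict == "FIDELITY_OK"]
```

In this encoding exactly one cell of the grid yields FIDELITY_OK (H1 PASS, parity holds, within ε), which matches the PASS row of Fig 1.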

§2 Scope

In scope:
- Determinism check (cross-process double-replay, bit-identity assertion) under a pinned config.
- Canonical divergence check (replay vs finetune_results, independent of skip_phase_a).
- D0, the production reproducibility envelope (ε_prod): characterize prod's own f1_buy run-to-run spread to set epsilon's floor (§3.E). Gated input to the decisional H2 verdict.
- A job (manual DAG or nightly CI, see §7) emitting the 3 events + a verdict artefact.
- Cause diagnosis (non-determinism vs fidelity gap), mapped to what _replay_cell controls (§6).
- Loki catalogue entries (3 events).

Out of scope:
- Fixing the divergence (an env/order/window fix is a follow-up, contingent on the cause found).
- Flipping skip_phase_a=true to default (that is the downstream gated decision, not this deliverable).
- Any economic / capacity verdict (orthogonal, see S04).

§3 Approach (Hamilton-native, 2 layers)

A. Determinism layer (_probe_determinism) — pure function over two fresh-process _replay_cell executions; returns (det_delta, status). Reuses the existing _replay_cell (no re-implementation of the training path); the fresh-process isolation is the orchestration's job (subprocess / task).
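The cross-process shape of the determinism layer can be sketched as follows. This is a toy under stated assumptions: the child script is a hypothetical stand-in for `_replay_cell` (a seeded sum instead of a training run), and passing the result as a hex float on stdout is just one way to move an exact IEEE-754 value between processes without decimal rounding.

```python
import subprocess
import sys
from typing import Callable, Tuple

# Hypothetical stand-in for the replay: a seeded computation run in a child interpreter.
CHILD_SCRIPT = """
import random
random.seed(42)
values = [random.random() for _ in range(1000)]
print(sum(values).hex())  # hex form round-trips the float exactly
"""


def run_in_fresh_process() -> float:
    """Execute the stand-in replay in a brand-new Python process."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True, text=True, check=True,
    )
    return float.fromhex(completed.stdout.strip())


def probe_determinism(run: Callable[[], float] = run_in_fresh_process) -> Tuple[float, str]:
    """Run twice in fresh processes and return (det_delta, status)."""
    first, second = run(), run()
    status = "PASS" if first == second else "FAIL"
    return abs(first - second), status


det_delta, det_status = probe_determinism()
```

Injecting `run` as a parameter keeps the comparison itself a pure function, so unit tests can substitute a fake replay while the orchestration owns the process isolation.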

B. Divergence layer (_probe_divergence) — pure function; reads the canonical via query_canonical_from_pg (out-of-run), compares to the determined replay. Decoupled from the Phase A anchor.

C. Deterministic execution config (frozen; referee #1). Every replay subprocess runs under:
- thread caps: OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1 (per ADR-0096; order-sensitive BLAS/GBM reductions otherwise break bit-identity, even within the same process);
- LightGBM determinism: deterministic=true, force_row_wise=true, fixed seed / bagging_seed / feature_fraction_seed, num_threads=1 ([LightGBM-Determinism]);
- PYTHONHASHSEED fixed; single, pinned data window (calendar fold-3).

Binary settlement on the production config (referee #1 — no "ideally"). §3.C governs the replay. The production training path that produced the stored canonical runs un-capped (multi-threaded for throughput — it is not under §3.C). We therefore do not assume prod determinism: the canonical carries production's own run-to-run spread, which is captured by ε_prod measured under prod's real config (§3.E), not under §3.C. (Were prod ever run under §3.C, ε_prod → 0 and epsilon = ε_num — but that is not the operating assumption.) Without §3.C on the replay side, H1 would emit spurious NON_DETERMINISTIC for a reason foreign to the replay (the referee's core Tier-1 point).
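The replay-side pinned configuration of §3.C could be materialized along these lines. The helper names, the seed value 42, and the PYTHONHASHSEED value 0 are assumptions for illustration; the plan only requires that these values be fixed. The LightGBM keys are the library's documented parameter names, shown here as a plain dict so the sketch runs without LightGBM installed.

```python
import os
from typing import Dict, Mapping, Optional

# Thread caps required by §3.C (per ADR-0096).
THREAD_CAPS = {
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
}


def pinned_env(base: Optional[Mapping[str, str]] = None,
               hash_seed: int = 0) -> Dict[str, str]:
    """Environment for a replay subprocess: thread caps plus a fixed hash seed."""
    env = dict(os.environ if base is None else base)
    env.update(THREAD_CAPS)
    env["PYTHONHASHSEED"] = str(hash_seed)  # value is an assumption; it only has to be fixed
    return env


def pinned_lgbm_params(seed: int = 42) -> Dict[str, object]:
    """LightGBM parameters requested by §3.C; the seed value is hypothetical."""
    return {
        "deterministic": True,
        "force_row_wise": True,
        "seed": seed,
        "bagging_seed": seed,
        "feature_fraction_seed": seed,
        "num_threads": 1,
    }


replay_env = pinned_env(base={"PATH": "/usr/bin"})
replay_params = pinned_lgbm_params()
```

The environment dict is meant to be passed as `env=` when the replay subprocess is spawned, because thread-pool variables are typically read when the numeric libraries load, not afterwards.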

D. Decision + orchestration (_decide_verdict) — pure function implementing the §1 pseudo-code (Cartesian product H1×H2 → exactly one verdict; Fig 1). A job loops over the reference cells, emits the events, writes the verdict artefact. Hamilton-native style (named node graph, isolated I/O, decision in pure functions — no adapter wrappers). Extended metrics are diagnostic-only, not decisional: the rule uses det_delta/canon_delta on f1_buy only (referee #8).
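For the event side of the orchestration, a minimal sketch of key=value event lines routed through `logging` (no print, per ADR-31) is shown below. The logger name, the helper names, and the exact wire format are assumptions; the plan fixes only the event names and their fields.

```python
import logging
from typing import Any

logger = logging.getLogger("s18_replay_fidelity")  # hypothetical logger name


def format_event(event: str, **fields: Any) -> str:
    """Render one event as a single key=value line; None-valued fields are dropped."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        if value is None:
            continue
        parts.append(f"{key}={value}")
    return " ".join(parts)


def emit(event: str, level: int = logging.INFO, **fields: Any) -> str:
    """Log the event line at the given level and return it for inspection."""
    line = format_event(event, **fields)
    logger.log(level, line)
    return line


determinism_line = emit("s18_replay_determinism", status="PASS", abs_delta_runs=0.0)
verdict_line = emit(
    "s18_replay_verdict",
    level=logging.WARNING,
    verdict="CANONICAL_DIVERGENCE",
    cause="logic_fidelity_gap",
)
```

Returning the rendered line makes the emitter easy to assert on in unit tests without attaching a log handler.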

E. epsilon: frozen value + provenance (referee #3, #4, #4bis, #8). epsilon = max(ε_num, ε_prod), where:
- ε_num = 0.005 is the numerical replay tolerance inherited from the s43 DAG / s18_step0_replay. Justification as a fidelity threshold (referee #8), not merely an inherited replay knob: 0.005 in absolute f1_buy is ~1 % of a ~0.6 operating f1_buy and is two orders of magnitude below the A6 gap (0.1–0.3) the check must catch, i.e. small enough that a genuine fidelity gap cannot hide under it, large enough to absorb float/repr noise of an otherwise-faithful replay. (It is not a tradability threshold; it is the "indistinguishable-from-faithful" band on the metric.)
- ε_prod is the production-path reproducibility envelope, measured under production's real (un-capped) config (referee #1), not §3.C: it must bound how much the stored canonical itself could have varied. Conservative estimator (referee #4, the cost-sensitivity under-power lesson): a one-sided upper tolerance bound (≈95 %/95 %) on |Δf1_buy| over N ≥ 10 fresh-process re-runs of the prod path on the reference cell, not the max-of-5 (a biased-low estimator of the true max). Measured in D0 and gated: until measured, the decisional H2 verdict is withheld (INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured); H1 + smoke may run (mirrors the s43 cost gate).
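One way the ε_prod estimator and the epsilon resolution could look is sketched below. Several choices here are assumptions, not part of the plan: the bound is computed as mean + k·stdev with the normal-theory one-sided tolerance factor k (using the closed-form approximation given in the NIST/SEMATECH e-Handbook), the sample is taken to be a list of absolute f1_buy deviations, and `None` encodes "unmeasured". A normal model is only a rough fit for non-negative deviations, so the estimator actually used in D0 may differ.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Optional, Sequence

EPS_NUM = 0.005   # frozen numerical tolerance (§3.E)
MIN_RUNS = 10     # N >= 10 fresh-process re-runs of the prod path


def one_sided_k(n: int, coverage: float = 0.95, confidence: float = 0.95) -> float:
    """Approximate one-sided normal tolerance factor (closed-form approximation)."""
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + math.sqrt(z_p ** 2 - a * b)) / a


def estimate_eps_prod(abs_deltas: Sequence[float]) -> Optional[float]:
    """Upper tolerance bound on |delta f1_buy|; None means 'unmeasured'."""
    if len(abs_deltas) < MIN_RUNS:
        return None  # too few runs: keep the H2 gate closed
    k = one_sided_k(len(abs_deltas))
    return mean(abs_deltas) + k * stdev(abs_deltas)


def resolve_epsilon(eps_prod: Optional[float]) -> Optional[float]:
    """epsilon = max(eps_num, eps_prod), withheld while eps_prod is unmeasured."""
    if eps_prod is None:
        return None
    return max(EPS_NUM, eps_prod)
```

With this approximation the factor for N = 10 at 95 %/95 % comes out a little under 3, so the bound sits roughly three sample standard deviations above the mean deviation, noticeably more conservative than the maximum of a handful of runs.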

F. Env-parity precondition P0 (referee #2; committee recs #1/#7 folded). FIDELITY_OK is not posable unless the replay's execution environment matches the one that produced the canonical, on the parity vector:
- build_sha(replay) == build_sha(canonical-run) (champollion image, ADR-92);
- instance_type(replay) == instance_type(canonical-run) (CPU microarch);
- feature/artefact version parity (ADR-23): the OHLCV / Enrichment / FeatureEngineering output versions consumed by _replay_cell identical to the canonical's (committee rec #1; this is also WA1's "artefact version" cause made a precondition);
- OS/lib versions (e.g. glibc, numpy / LightGBM / BLAS build) recorded and compared where available (committee rec #7; SIMD/math-lib reductions are environment-bound).

Where a dimension cannot be recorded from the canonical run, P0 is treated as not established (conservative → no FIDELITY_OK).

All recorded/surfaced via ADR-92 (build SHA) + ADR-23 (feature pinning) — the canonical row carries them; the replay records its own. Rationale: bit-identity to a canonical produced on a different image / lib set / CPU microarch is unachievable in principle (SIMD widths, math-lib reductions) even under §3.C flags — so a divergence under non-parity is an environment/version artefact, not a logic fidelity gap. P0 therefore gates the fidelity claim (§1): non-parity routes a beyond-ε divergence to CANONICAL_DIVERGENCE cause=env_parity_gap and a within-ε match to INCONCLUSIVE_TOOLING reason=env_parity_unverified — never a clean FIDELITY_OK. This is what makes "faithful to prod" a defensible claim rather than "faithful in the diagnostic environment."
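The parity vector and its conservative reading can be sketched as a small comparison. The dataclass, its field names beyond build_sha and instance_type, and the sample values are hypothetical; the rule illustrated is the one stated above, namely that an unrecorded dimension means P0 is not established.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ParityVector:
    """Hypothetical record of the P0 dimensions; None means 'not recorded'."""
    build_sha: Optional[str]
    instance_type: Optional[str]
    feature_versions: Optional[str]   # pinned feature/artefact versions (ADR-23)
    lib_versions: Optional[str]       # OS / numeric library fingerprint


def parity_holds(replay: ParityVector, canonical: ParityVector) -> bool:
    """P0 holds only if every dimension is recorded on both sides and equal."""
    for field in fields(ParityVector):
        replay_value = getattr(replay, field.name)
        canonical_value = getattr(canonical, field.name)
        if replay_value is None or canonical_value is None:
            return False  # conservative: unknown is treated as a mismatch
        if replay_value != canonical_value:
            return False
    return True


# Hypothetical values for illustration.
replay_env = ParityVector("abc123", "m5.2xlarge", "feat-v7", "numpy-x/lgbm-y")
same_env = ParityVector("abc123", "m5.2xlarge", "feat-v7", "numpy-x/lgbm-y")
unknown_libs = ParityVector("abc123", "m5.2xlarge", "feat-v7", None)
```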

§4 Files

| File | Action |
|---|---|
| src/commun/finetune/diagnostic/s18_replay_fidelity.py | new: _probe_determinism (cross-process), _probe_divergence, _decide_verdict (pure), _measure_prod_envelope (D0). Imports _replay_cell from the existing module; does not mutate it (ADR-25) |
| src/commun/finetune/diagnostic/s18_step0_replay.py | touch only if needed: expose _replay_cell for fresh-process invocation; keep Verdict stable (add, never mutate) |
| dags/dag_diagnostic__s18_replay_fidelity.py | new if §7 = manual DAG (capture+analyse in the same pod) |
| .github/workflows/… | new if §7 = nightly CI |
| tests/unit/finetune/diagnostic/test_s18_replay_fidelity.py | new: the Fig 1 exhaustiveness test (3×3 → verdict total), no-crash, NaN/None handling, determinism-before-canonical ordering |
| documentation/stories/CVN-N001-EI-S09/{index.md,test_strategy.md} · design/… · runbooks/… | hub initialized; architecture / runbook / test-strategy at merge (ADR-0095) |
| Loki observability catalogue (doc) | extend: s18_replay_determinism, s18_replay_divergence, s18_replay_verdict |

§5 Risks

| Risk | Impact | Mitigation |
|---|---|---|
| Float equality too strict under multi-thread reductions | spurious NON_DETERMINISTIC | the §3.C pinned config (thread caps + LightGBM determinism) is the mitigation; bit-identity is asserted only under it |
| Canonical is one prod draw with its own spread | CANONICAL_DIVERGENCE on prod noise | epsilon ≥ ε_prod, measured under prod's real config + conservative tolerance bound, D0-gated (§3.E) |
| Env delta replay↔prod (image / lib / CPU microarch) | a divergence over-read as a logic fidelity gap; bit-identity unachievable cross-microarch | precondition P0 (§3.F): parity gates FIDELITY_OK; non-parity → env_parity_gap / env_parity_unverified |
| Cross-process cost (replay run ≥ 2× + D0 ×N) | slow job | minimal cell for H1 wiring; D0 / classification batched nightly or off-hours; cap via max_cells |
| Canonical absent in PG | no H2 | INCONCLUSIVE_TOOLING reason=canonical_absent after the determinism event (referee #4) |
| Cause not found (real divergence, opaque source) | A6 stays open | deliverable is detection + characterization, not the fix; fix scoped as a follow-up on the cause |
| Plan↔code drift (deciding value invented at test time) | invalidates pre-registration | every deciding value (epsilon=§3.E, thresholds, ordering) is in this plan; the test verifies the rule, it does not define it |

§6 Success criteria + ops

Success criteria:
1. Events emitted per the rule (referee #5): s18_replay_determinism (status PASS/FAIL/ERROR) on every cell, including error paths, so the operator never sees a hole; s18_replay_divergence and s18_replay_verdict per §1. All three in the Loki catalogue.
2. The job returns exactly one verdict ∈ {FIDELITY_OK, NON_DETERMINISTIC, CANONICAL_DIVERGENCE, INCONCLUSIVE_TOOLING} with a cause/reason code (Fig 1 exhaustiveness test green, incl. the P0 parity split and the H1-FAIL cause sub-branch).
2b. Env parity P0 recorded (referee #2): the replay's build_sha + instance_type and the canonical run's (via ADR-92) are captured and compared; FIDELITY_OK is emitted only under parity.
3. Scope = all 5 defi_top5/fold-3 cells for cause classification (referee #5): H1 may be proven on a subset, but classification must cover the 5 to distinguish a uniform gap (systematic) from a cell-dependent gap (itself diagnostic, e.g. a data-window issue). On the historically-FAIL s42 cells, the job independently reproduces the divergence (outside the Phase A anchor) and classifies it (NON_DETERMINISTIC vs CANONICAL_DIVERGENCE), which is the answer to A6.
4. ε_prod (D0) measured and epsilon resolved per §3.E before any decisional H2 verdict.
5. Cause diagnosis documented, with WA1's candidate causes mapped to what _replay_cell controls (referee #7): applied env (_set_env_for_cell / BASE_ENV overlay), LdP loop order, data window (calendar fold-3), artefact/model version; for each, "pinned" vs "free."
6. No-crash: every error path → structured INCONCLUSIVE_TOOLING (never a UI raise, ADR-25/31).

Ops: readable verdict (JSON artefact + Loki); operator runbook (trigger, read, partial-pod-failure troubleshooting); ADR-92 (build SHA surfaced) if a DAG.

Fig 2 — downstream decision routing (the "why").

flowchart TD
  V{s18_replay_verdict}
  V -->|FIDELITY_OK| OK["Lift A6 for the covered surface<br/>Authorize skip_phase_a=true default<br/>ONLY within covered (family,fold,config)"]
  V -->|NON_DETERMINISTIC| ND["Fix seed / order / window / threads<br/>A6 stays open · no fidelity claim"]
  V -->|"CANONICAL_DIVERGENCE (cause=logic_fidelity_gap)"| CD["Real fidelity gap (env controlled by P0)<br/>Diagnose via WA1 → _replay_cell map<br/>Bound s18 verdicts incl. S05/s43 asterisk · feed D2 (EK)"]
  V -->|"CANONICAL_DIVERGENCE (cause=env_parity_gap)"| CE["Attributed to env/version, NOT logic<br/>Re-generate canonical under a parity-matched image, or match it"]
  V -->|INCONCLUSIVE_TOOLING| IT["Fix machinery / measure ε_prod / supply canonical<br/>or establish env parity P0 · re-run"]

Gate generalization is bounded (referee #6): a FIDELITY_OK on the covered cells authorizes skip_phase_a=true only for that (family, fold, config) surface; a programme-wide default flip requires coverage of the diagnostic surface (s41 / s42 / s43), tracked separately.
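The bounded gate can be expressed as a lookup keyed on the covered surface. This is an illustrative sketch only: the `Surface` tuple, the function name `may_skip_phase_a`, and the sample surfaces are hypothetical, and the real gate decision remains a downstream, separately tracked step.

```python
from typing import Dict, Tuple

Surface = Tuple[str, str, str]  # (family, fold, config)


def may_skip_phase_a(surface: Surface, verdicts: Dict[Surface, str]) -> bool:
    """Allow skip_phase_a=true only where a FIDELITY_OK verdict covers this exact surface."""
    return verdicts.get(surface) == "FIDELITY_OK"


# Hypothetical verdicts recorded per covered surface.
recorded = {
    ("defi_top5", "fold-3", "default"): "FIDELITY_OK",
    ("defi_top5", "fold-2", "default"): "CANONICAL_DIVERGENCE",
}

covered = may_skip_phase_a(("defi_top5", "fold-3", "default"), recorded)
diverged = may_skip_phase_a(("defi_top5", "fold-2", "default"), recorded)
uncovered = may_skip_phase_a(("defi_top5", "fold-1", "default"), recorded)
```

An uncovered surface returns false by construction, so a FIDELITY_OK on one cell cannot leak into a programme-wide default.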

Fig 3 — repeatability (H1 in-process pre-check) vs reproducibility (H1 cross-process, load-bearing).

sequenceDiagram
  participant P as Prod (other process/pod)
  participant A as Replay proc A
  participant B as Replay proc B
  Note over P: canonical f1_buy stored (one draw, spread = ε_prod)
  Note over A,B: pinned deterministic config (§3.C)
  A->>A: in-process repeat (R1 — repeatability, cheap pre-check)
  A-->>B: fresh process, same seed/config (R2 — reproducibility = H1)
  Note over A,B: H1 PASS ⇔ f1_buy(A)==f1_buy(B) exact
  A->>P: H2 only if H1 PASS: |replay − canonical| ≤ epsilon (epsilon ≥ ε_prod)
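The distinction drawn in Fig 3 can be sketched in code: H1 compares results from two fresh interpreter processes with exact equality, and H2 is evaluated only afterwards, with a tolerance. This is a hedged sketch. The child script below is a hypothetical stand-in that draws a seeded pseudo-random number in place of the real replay (the actual job would call `_replay_cell` under the §3.C config), and `replay_in_fresh_process`, `h1_pass`, and `h2_pass` are illustrative names.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the real replay: a seeded draw, printed as JSON.
CHILD = """
import json, random, sys
rng = random.Random(int(sys.argv[1]))
print(json.dumps({"f1_buy": rng.random()}))
"""


def replay_in_fresh_process(seed: int) -> float:
    """Run the stand-in replay in a new interpreter and return its f1_buy."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD, str(seed)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["f1_buy"]


def h1_pass(seed: int) -> bool:
    """Reproducibility: two fresh processes, same seed, exact equality."""
    return replay_in_fresh_process(seed) == replay_in_fresh_process(seed)


def h2_pass(replay: float, canonical: float, epsilon: float) -> bool:
    """Fidelity: evaluated only after H1 passes; tolerance epsilon >= eps_prod."""
    return abs(replay - canonical) <= epsilon


if __name__ == "__main__":
    print("H1 PASS" if h1_pass(42) else "H1 FAIL")
```

Spawning a new interpreter per run, rather than calling the function twice in one process, is what separates reproducibility from the cheaper in-process repeatability pre-check: no module-level state, cache, or thread pool is shared between the two runs.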

§7 Settled decisions (committee may revise)

  1. Job = manual DAG OR nightly CI? — proposed: manual DAG diagnostic__s18_replay_fidelity (consistent with the in-pod diagnostic pattern: capture + analyse in the same pod, ephemeral /tmp), + an optional nightly wrapper later. To be ratified.
  2. Dedicated module vs extending s18_step0_replay.py — proposed: dedicated module s18_replay_fidelity.py (pure probes + decision) importing _replay_cell; keeps the existing file stable. To be ratified.
  3. Cell scope — proposed: all 5 defi_top5/fold-3 cells for cause classification (referee #5); a single cell is acceptable only for determinism wiring / smoke. To be ratified.
  4. Determinism = exact equality (no tolerance), under the §3.C pinned config — settled (§1 tie-break 3).
  5. epsilon = max(0.005, ε_prod), ε_prod measured in D0 and gated — settled (§3.E).
  6. H1 = cross-process (reproducibility), in-process repeat = cheap pre-check only — settled (§3.A).
  7. Production runs un-capped (not under §3.C); ε_prod measured under prod's real config — settled (§3.C/§3.E, referee #1). Confirm the prod threading regime in-pod at impl.
  8. Env-parity precondition P0 (build_sha + instance_type via ADR-92) gates FIDELITY_OK; non-parity → env_parity_gap / env_parity_unverified — settled (§3.F, §1 tie-break 5, referee #2).
  9. ε_prod = one-sided ~95/95 upper tolerance bound over N ≥ 10 (not max-of-5) — settled (§3.E, referee #4).
  10. H1-FAIL cause sub-branch (numeric_residue vs real_instability at ε_num), pure FAIL preserved — settled (§1, referee #3).
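Decisions 5, 9, and 10 can be sketched numerically. The sketch rests on assumptions the plan does not spell out here: the D0 sample is taken to be N spread observations that are approximately normal, and the tolerance factor uses the closed-form approximation for one-sided normal tolerance limits given in the NIST/SEMATECH e-Handbook rather than an exact noncentral-t computation. The function names and the `eps_num` parameter value are hypothetical. The floor 0.005, N ≥ 10, and the two cause labels come from the plan.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Optional, Sequence

EPSILON_FLOOR = 0.005  # from the plan: epsilon = max(0.005, eps_prod)
MIN_N = 10             # from the plan: N >= 10, not max-of-5


def one_sided_k(n: int, coverage: float = 0.95, confidence: float = 0.95) -> float:
    """Approximate one-sided normal tolerance factor (closed-form approximation)."""
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + math.sqrt(z_p ** 2 - a * b)) / a


def eps_prod_upper_bound(samples: Sequence[float]) -> Optional[float]:
    """Upper tolerance bound on the spread; None means 'not measured', so gate."""
    n = len(samples)
    if n < MIN_N:
        return None
    return mean(samples) + one_sided_k(n) * stdev(samples)


def resolve_epsilon(eps_prod: float) -> float:
    """Decision 5: epsilon never drops below the frozen floor."""
    return max(EPSILON_FLOOR, eps_prod)


def h1_fail_cause(a: float, b: float, eps_num: float) -> str:
    """Decision 10: label an H1 FAIL; the verdict itself stays FAIL."""
    return "numeric_residue" if abs(a - b) <= eps_num else "real_instability"
```

Returning `None` for an under-sized sample keeps the gate explicit: the caller cannot resolve epsilon without a measured ε_prod, which matches the requirement that ε_prod be measured before any decisional H2 verdict.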

Committee plan_review (ADR-68)

Verdict: PASSED · OK — session f5fdbf8d · OP Meeting 268 · 5 experts (crypto-trader, data-scientist, ops, ml-engineer, architect), strong consensus (all 9.0/9), 0 blockers, 0 dissent. Agreement: the determinism-first cross-process H1, the env-delta handling (P0 + ε_prod under prod's real config), the exhaustive verdict map, the pre-registration discipline, and the observability are sound and ready to implement.

Recommendation disposition (9 non-blocking):

| # | Recommendation | Disposition |
|---|---|---|
| 1 | Feature version pinning (ADR-23) in the parity check | folded into P0 (§3.F) |
| 7 | Extend P0 to OS/lib versions (glibc, BLAS, …) | folded into P0 (§3.F) |
| 9 | H1-FAIL cause as a warn-level Loki event | folded into §1 (s18_replay_determinism_cause severity=warn) |
| 4 | Ratify job mode (manual DAG + nightly wrapper) | already proposed §7.1 — to ratify at impl |
| 5 | Validate §3.C across hardware/OS + confirm prod threading in-pod | already §7.7 — impl-time validation |
| 2 | Continuous ε_prod drift monitoring | follow-up (ops; post-impl) |
| 3 | Resource/scheduling isolation for D0 | follow-up (runbook artefact) |
| 6 | Canonical baseline lifecycle / staleness process | follow-up (runbook artefact) |
| 8 | Stress-test scenarios (injected fidelity gap) + ε_num sensitivity | follow-up (test-strategy artefact, at merge) |

Methodological invariants — honest application to S09

S09 is a validity / tooling diagnostic, not an economic parameter-sweep diagnostic. Some template invariants apply verbatim; others are N/A with reason (not copied blindly).

| Invariant (template) | S09 application |
|---|---|
| Pre-registration of the rule | full — rule frozen in §1 (pseudo-code + tie-break + Fig 1), epsilon frozen in §3.E |
| No-crash / fail-loud (ADR-25/31) | full — structured INCONCLUSIVE_TOOLING, no UI raise, no print, NaN guarded |
| Inconclusives first-class | full — INCONCLUSIVE_TOOLING is a verdict |
| Decision keyed on significance | ⚙️ adapted — H1 = exact equality (not a CI); H2 = deterministic threshold epsilon |
| Bootstrapped envelope statistic | N/A — no swept parameter / argmax; nothing to bootstrap (a reproducibility measure, not an effect) |
| Multiple-comparison correction (FWER) | N/A — not a family of decisional tests; two ordered checks, not a grid |
| Gate on unmeasured deciding inputs | ⚙️ adapted — ε_prod (D0) gates the decisional H2 verdict (§3.E); no economic cost here |

Appendix — plan_review checklist

  • Decision rule frozen (§1) — verdict, tie-break, thresholds, H1→H2 ordering, Fig 1 total.
  • Document fully in English (referee Tier-0) — no residual French, no "(EN)" tag.
  • Hypotheses with acceptance invariant + method (A3); WA1 distinguished; terminology note (no NHST null).
  • Deterministic execution config specified (§3.C, ADR-0096 + LightGBM) — H1 cannot fire on thread noise.
  • H1 cross-process (reproducibility, not in-process repeatability) (§3.A, Fig 3).
  • Canonical determinism handled: epsilon ≥ ε_prod, measured + gated (§3.E).
  • Determinism evaluated/emitted independently of the canonical, event always emitted incl. ERROR (§1, referee #4/#5).
  • Production config settled binary (un-capped) → ε_prod measured under prod's real config, conservative tolerance bound N≥10 (§3.C/§3.E, referee #1/#4).
  • Env-parity precondition P0 gates FIDELITY_OK (build_sha + instance_type via ADR-92); non-parity → env-gap classification (§3.F, referee #2).
  • H1-FAIL cause sub-branch (numeric residue vs real instability) — pure FAIL preserved (§1, referee #3).
  • Prior stated (A6 gaps ≫ epsilon → live hypothesis = gross divergence; effort → control-surface map) (§1, referee #6).
  • epsilon value + provenance frozen in the plan, ε_num justified as a fidelity threshold (§3.E, referee #8).
  • Exhaustive truth table (Fig 1) reflected in the test strategy.
  • No-crash on every error path incl. NaN (ADR-25).
  • Downstream routing explicit and gate generalization bounded (Fig 2, §6).
  • 5 artefacts planned (hub initialized; architecture / runbook / test-strategy at merge, ADR-0095).