Plan dossier (r4 — plan_review PASSED)
Review note. This is the plan for Story S09 (diagnostic template, ADR-0095). It fixes the what & why and pre-registers the decision rule before any code or run. Submitted to the committee for plan_review. Canonical structural reference: CVN-N001-EI-S05.
Story: CVN-N001-EI-S09 · OpenProject: wp#237 · GitHub issue: #1099 · plan dossier: this file (...-plan.md).
Revision journal:
- r1 (2026-06-09): initial draft.
- r2 (2026-06-09): referee round 1 incorporated: full English pass; deterministic execution config specified (ADR-0096 + LightGBM determinism); H1 made cross-process (reproducibility, not just in-process repeatability); canonical reproducibility envelope ε_prod added and gated; decision rule re-ordered so a missing canonical no longer masks a determinism verdict; epsilon frozen (value + provenance); cell scope reconciled (5 cells for cause classification); gate generalization bounded; truth table + routing/sequence figures added.
- r3 (2026-06-09): referee round 2; the replay↔prod environment delta closed: prod-config made binary (prod runs un-capped → ε_prod measured under prod's real config, not §3.C; "ideally" removed); env-parity precondition P0 added (image SHA via ADR-92 + instance type; without it FIDELITY_OK is not posable and a divergence is classified env/version gap, not fidelity gap); ε_prod estimator made conservative (one-sided tolerance bound, N≥10, not max-of-5); H1-FAIL cause sub-branch pre-registered (numeric residue vs real instability; pure FAIL preserved); determinism event always emitted (incl. status=ERROR); the prior stated (A6 gaps 0.1–0.3 ≫ any epsilon → live hypothesis is gross divergence); ε_num justified as a fidelity threshold.
- r4 (2026-06-09): committee plan_review PASSED·OK (session f5fdbf8d, OP Meeting 268, 5 experts, strong consensus, 0 blockers, 0 dissent). Three cheap, co-located recommendations folded: P0 extended to feature/artefact version parity (ADR-23) and OS/lib versions (§3.F); H1-FAIL cause emitted as a warn-level Loki event (§1). Remaining recommendations tracked as follow-ups (see "Committee plan_review").
Part A: Narrative (value & accessibility)
A1. Problem statement (non-technical)
The question in one sentence. When our diagnostics "replay" a production training run to inspect it (the s18 replay), do they replay the same thing production did — and do they replay it the same way twice?
Why now. At the close of S04 (LightGBM capacity ablation), the diagnostic__s42 runs came out with
s18_status=FAIL on all 5 cells: the replayed f1_buy (0.36–0.47) did not match the stored canonical
f1_buy from production. This is action item A6 of the s42 RCA. The problem is not S04's capacity
verdict (orthogonal, intra-run) — it is that until this divergence is explained, we cannot claim "our
diagnostic is faithful to the production baseline." Several diagnostics rest on this replay
(s41 / s42 / s43 — running right now for S05).
The traps / illusions. Today the divergence surfaces only incidentally, via a run's Phase A anchor. Two illusions to avoid:
1. "The replay is wrong": maybe, but maybe it is unstable (non-deterministic). It would then yield a different f1_buy each time, and any comparison to the canonical would be meaningless.
2. "FAIL = a bug in the diagnostic": on the contrary, an honestly detected FAIL is the information; the real danger is a replay that diverges silently and is taken as faithful.
What we actually measure. Two separate things, in this order:
- Determinism — replayed twice in two fresh processes with the same pinned config and seed, does
the replay return a bit-identical f1_buy?
- Canonical divergence — does the replay (once determinism holds) depart from the stored canonical
f1_buy beyond a tolerance epsilon?
Determinism must be established first: without it, talking about "divergence vs production" is meaningless (we would be comparing noise).
Honesty of the verdict. If we cannot decide (machinery failure, incomplete cell), the verdict is
INCONCLUSIVE_TOOLING — an owned "we don't know," never a false PASS.
Why it matters (the decisions it feeds). Three concrete decisions:
- Immediate downstream — this check bounds the asterisk on the in-flight S05/s43 verdict
("faithful to the prod baseline" vs "fidelity not established").
- Programme downstream — it gates the right to make skip_phase_a=true a default (skipping
the expensive capture) — today forbidden because the Phase A anchor is what reveals A6.
- Feeds D2 of the tradability decision protocol (Epic CVN-N001-EK), which requires a substrate of
validated fidelity.
A2. User stories
| # | As a… | I want… | so that… | Section |
|---|---|---|---|---|
| US1 | quant / reviewer | to know whether the s18 replay is deterministic (cross-process) | I do not compare noise to a canonical | A1, §3.A, H1 |
| US2 | quant | to know whether the replay diverges from the prod baseline, independently of a run | the fidelity claim of any s18 diagnostic is bounded | A1, §3.B, H2 |
| US3 | operator | a job (nightly / manual) emitting a readable verdict + Loki events | I can monitor fidelity without launching a full diagnostic | §3.D, runbook |
| US4 | programme DRI | the cause of the divergence (non-determinism vs fidelity gap) | I can decide whether skip_phase_a=true may become the default (gate) | A1, §6, decision routing |
| US5 | dev2 (Epic EK) | a substrate whose fidelity is validated or bounded | I can feed the D2 pre-study (dependency model) | A1, §6 |
A3. Hypotheses (pre-registered)
Each hypothesis = statement + acceptance criterion tested + method. We distinguish tested hypotheses from surfaced working assumptions (confirmed in-pod, not statistically tested).
Terminology note (referee #12). These are acceptance invariants, not NHST nulls: the held criterion is PASS (the inverse of the usual "reject the null" convention), and there is no α / power — determinism is an exact-equality invariant, divergence a fixed-threshold criterion.
H1 — Replay determinism (cross-process). Under the pinned deterministic config (§3.C), the
s18_step0_replay path is reproducible: two executions in two fresh processes (same seed, same cell,
same window) return a bit-identical (IEEE-754 exact-equal) f1_buy.
- Acceptance invariant (H1): f1_buy(proc_a) == f1_buy(proc_b) exactly.
- Why cross-process (referee #2): H2 compares the replay to a canonical produced by production, in a
different process/pod. An in-process double-call only proves repeatability (Gauge R&R: the first
"R"); the fidelity comparison of H2 needs reproducibility (the second "R") — i.e. determinism that
survives a fresh process. So H1 is cross-process. An in-process double-call is kept only as a cheap
pre-check (R1) that does not, by itself, license H2.
- Method: run _replay_cell in two fresh subprocesses under the pinned config; compare f1_buy.
event=s18_replay_determinism status=PASS|FAIL abs_delta_runs=<>. No statistics — a reproducibility
assertion (golden master).
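The H1 method above can be illustrated with a minimal sketch of a cross-process double-run. It assumes a hypothetical child script standing in for _replay_cell (the real replay is not reproduced here), and the helper names run_in_fresh_process and probe_determinism are illustrative, not the module's API. The child prints its result with float.hex() so the value crosses the process boundary without rounding.

```python
import math
import subprocess
import sys
from typing import Tuple

# Hypothetical stand-in for the replay: a child script that prints one float.
# float.hex() keeps the value bit-exact across the process boundary.
SEEDED_CHILD = (
    "import random; random.seed(42); "
    "print(sum(random.random() for _ in range(1000)).hex())"
)
UNSEEDED_CHILD = "import random; print(random.random().hex())"


def run_in_fresh_process(code: str) -> float:
    """Run `code` in a new interpreter and parse the hex float it prints."""
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return float.fromhex(out.stdout.strip())


def probe_determinism(code: str) -> Tuple[float, str]:
    """Return (abs_delta_runs, status) for two fresh-process executions."""
    a = run_in_fresh_process(code)
    b = run_in_fresh_process(code)
    if math.isnan(a) or math.isnan(b):
        return math.nan, "ERROR"
    # Exact IEEE-754 equality: no tolerance is applied to H1.
    return abs(a - b), ("PASS" if a == b else "FAIL")


if __name__ == "__main__":
    print(probe_determinism(SEEDED_CHILD))
    print(probe_determinism(UNSEEDED_CHILD))
```

The seeded child is expected to agree with itself across processes, while the unseeded one illustrates the FAIL path. An in-process double call would not exercise interpreter start-up state, which is why the sketch spawns a new interpreter each time.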
H2 — Fidelity to the canonical. The replay's f1_buy matches the canonical f1_buy stored in
production (finetune_results) within epsilon.
- Acceptance invariant (H2): abs(observed_f1_buy − canonical_f1_buy) ≤ epsilon.
- Method: read the canonical out-of-run (query_canonical_from_pg), compare to the replay after
H1 has established determinism. event=s18_replay_divergence severity=warn|info abs_delta=<>.
Independent of skip_phase_a (the check does not depend on a diagnostic run).
- epsilon is frozen here — see §3.E. Its floor is the production-path reproducibility envelope
(referee #3): a perfectly faithful, deterministic replay still differs from a single canonical draw
by production's own run-to-run spread, so epsilon must dominate that spread or H2 confounds infidelity
with the canonical generator's noise.
Working assumption WA1 (surfaced, not statistically tested) — if H2 is rejected while H1 PASSes,
the cause is a real fidelity gap in the replay path (applied env, LdP loop order, data window,
artefact version), not noise. WA1 steers the cause diagnosis (§6 deliverable); it is not a decisional
test. WA1's candidate causes are mapped to what _replay_cell actually controls in §6 (referee #7).
The disambiguation is the diagnostic — it is the Cartesian product (H1 × H2) that names the cause (Fig 1), not H1 or H2 alone.
A4. State of the art
- Reproducibility as a first-class property. Non-determinism in ML training (data order, unseeded RNG, parallel reduction order) is a known source of irreproducible metrics [Pineau2021]. The discipline is to test bit-reproducibility, not assume it — here via a same-seed double-run in fresh processes (characterization / golden-master testing [Feathers2004]).
- Repeatability vs reproducibility (Gauge R&R). Measurement-system analysis separates repeatability (same operator/instrument, repeated) from reproducibility (across operators/conditions) [AIAG-MSA, Montgomery2013]. "Replay-divergence" is a reproducibility question; an in-process repeat is the wrong "R" (referee #2).
- Capture-replay / training-serving skew. A diagnostic that replays a production training path is exposed to the same failure mode as training-serving skew: the replayed environment silently differs from the canonical one [GoogleMLRules]. The mitigation is an independent fidelity check keyed on a stored canonical, not an incidental anchor.
- Order of operations (determinism before divergence). Comparing a stochastic quantity to a reference is meaningless until the quantity is shown stable; establishing determinism precedes any divergence claim — the same logic as fixing measurement repeatability before assessing bias.
- Bit-reproducibility requires a pinned numeric environment. GBM/BLAS reductions are order-sensitive; bit-identical results require capped thread pools and library-level determinism flags [LightGBM-Determinism], which is exactly the contract of ADR-0096 (compute pods cap thread pools to cgroup) and the s43 thread-capped golden fixture (referee #1).
- Pre-registration of the decision rule. The verdict mapping (thresholds, tie-break) is frozen before the run to avoid post-hoc rationalization [Nosek2018], per ADR-0095.
| Reference | Grounds |
|---|---|
| [Pineau2021] Pineau et al., Improving Reproducibility in ML Research (JMLR 2021) | H1, determinism |
| [Feathers2004] Feathers, Working Effectively with Legacy Code (golden master) | H1, method |
| [AIAG-MSA] AIAG, Measurement Systems Analysis (MSA) — Gauge R&R | repeatability vs reproducibility (referee #2/#13) |
| [Montgomery2013] Montgomery, Introduction to Statistical Quality Control | Gauge R&R |
| [GoogleMLRules] Zinkevich, Rules of Machine Learning (training-serving skew) | H2, fidelity |
| [LightGBM-Determinism] LightGBM docs: deterministic, force_row_wise | §3.C deterministic config |
| [Nosek2018] Nosek et al., The preregistration revolution (PNAS 2018) | pre-registration invariant |
ADRs cited: ADR-0095 (diagnostic template), ADR-0096 (thread-pool cap / bit-identity), ADR-25
(no silent fallback / fail-loud), ADR-31 (no print), ADR-92 (build SHA surfaced).
A5. Consolidation & traceability
Problem → hypothesis → US → section → literature (no dangling thread):
| Problem | Hypothesis | US | Section | Literature |
|---|---|---|---|---|
| Replay unstable (cross-process)? | H1 | US1 | §3.A, §3.C | Pineau2021, Feathers2004, AIAG-MSA, LightGBM-Determinism |
| Replay ≠ prod? | H2 | US2 | §3.B, §3.E | GoogleMLRules |
| Cause (unstable vs gap)? | H1×H2 + WA1 | US4 | A1, §6 | AIAG-MSA |
| Operable monitoring | — | US3 | §3.D, runbook | — |
| Substrate for D2 (EK) | — | US5 | A1, §6 | — |
Decision routing (verdict → action) — pre-registered; pseudo-code in §1, truth table Fig 1, flow Fig 2:
| Verdict | Meaning | Action |
|---|---|---|
| FIDELITY_OK | H1 PASS + H2 invariant held | lift the A6 caveat for the covered surface; authorize skip_phase_a=true default only within that surface (referee #6) |
| NON_DETERMINISTIC | H1 FAIL | fix determinism (seed / order / window / threads) before any fidelity claim; A6 stays open |
| CANONICAL_DIVERGENCE | H1 PASS + H2 rejected | real fidelity gap → diagnose the cause (WA1); bounds the "faithful to prod" claim of s18 verdicts (incl. S05) |
| INCONCLUSIVE_TOOLING | machinery / incomplete cell / NaN | fix + re-run (no-crash, ADR-25) |
Part B: Technical specification
§0 Provenance
- Discovery: close of CVN-N001-EI-S04 (capacity verdict B_DEFAULTS_OVERFIT): s18_status=FAIL on 5/5 cells in s42 Phase A, observed_f1 0.36–0.47.
- Sources: s42 RCA (A6) · s42 fix-plan §2.E · s42 experiment §4.
- Orthogonality: independent of the S04 intra-run capacity verdict; this is a substrate-validity guard (cross-cutting s41 / s42 / s43).
- Existing code: src/commun/finetune/diagnostic/s18_step0_replay.py, which provides Verdict(status PASS|FAIL|ERROR, expected_f1_buy, observed_f1_buy, abs_delta, epsilon, …), query_canonical_from_pg, _replay_cell. The current check is single-replay vs canonical and coupled to the Phase A anchor.
§1 Objective + pre-registered decision rule
Objective: deliver an independent detection (outside a diagnostic run) of (A) replay
non-determinism (cross-process) and (B) canonical divergence, with Loki events and a verdict that
names the cause — the precondition to the skip_phase_a=true-by-default gate.
Stated prior (referee #6 — allocates the diagnostic effort). The A6 deltas are gross: replayed
f1_buy 0.36–0.47 vs a canonical plausibly ~0.6–0.7, i.e. an order of 0.1–0.3 ≫ any plausible epsilon
(ε_num = 0.005). The live hypothesis is therefore CANONICAL_DIVERGENCE (a coarse fidelity/env gap);
H1 is a fast prerequisite to rule out "it's just noise," not the expected failure. Consequently the
bulk of the cause-diagnosis effort goes to the _replay_cell control-surface map (§6.5: env / order /
window / artefact version) and to the env-parity precondition P0 below — not to chasing 1e-9
determinism residues.
Decision rule (frozen — pseudo-code). Order: env-parity precondition P0 → determinism H1 →
canonical fidelity H2. Determinism is evaluated and emitted independently of the canonical
(referee #4); a missing canonical only blocks H2. FIDELITY_OK is not posable unless P0 holds
(referee #2):

    # Inputs: cell (ReferenceCell), epsilon (frozen, §3.E = max(ε_num, ε_prod)), ε_num (frozen, §3.E),
    #   canonical row (PG, out-of-run; carries f1_buy, build_sha, instance_type via ADR-92; may be None),
    #   this replay's build_sha + instance_type.
    # Pinned deterministic config (§3.C) applied in every replay subprocess.
    # Stage A: determinism (H1), cross-process; needs neither the canonical nor P0
    r_a = replay_cell_fresh_process(cell)   # observed f1_buy + extended (diagnostic-only) metrics
    r_b = replay_cell_fresh_process(cell)   # identical pinned config / seed / window, FRESH process
    if r_a is ERROR or r_b is ERROR or isnan(r_a.f1_buy) or isnan(r_b.f1_buy):
        emit s18_replay_determinism status=ERROR   # always emitted (referee #5, no obs. hole)
        emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=replay_error
        return INCONCLUSIVE_TOOLING                # NaN guarded here (referee #11)
    det_delta = abs(r_a.f1_buy - r_b.f1_buy)
    det_status = PASS if (r_a.f1_buy == r_b.f1_buy) else FAIL   # IEEE-754 exact equality (referee #14)
    emit s18_replay_determinism status=det_status abs_delta_runs=det_delta
    if det_status == FAIL:
        # Pure FAIL preserved; cause sub-branch routed (referee #3, anti dead-end), NOT a tolerance on H1
        det_cause = "numeric_residue" if det_delta <= ε_num else "real_instability"
        emit s18_replay_determinism_cause severity=warn cause=det_cause det_delta=det_delta   # rec #9: operator-visible, no artefact parse
        emit s18_replay_verdict verdict=NON_DETERMINISTIC det_delta=det_delta cause=det_cause
        return NON_DETERMINISTIC                   # H1 FAILS → H2 not assessed (tie-break #1)
    # H1 holds → H2 is now meaningful
    if canonical is None:
        emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=canonical_absent
        return INCONCLUSIVE_TOOLING                # canonical blocks ONLY H2 (referee #4)
    if ε_prod is UNMEASURED:                       # §3.E gate (referee #3/#4bis)
        emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured
        return INCONCLUSIVE_TOOLING
    # Precondition P0: env parity replay↔canonical-generator (referee #2). Without it, FIDELITY is not posable.
    parity = (r_a.build_sha == canonical.build_sha) and (r_a.instance_type == canonical.instance_type)
    canon_delta = abs(r_a.f1_buy - canonical.f1_buy)
    emit s18_replay_divergence severity=(info if canon_delta <= epsilon else warn) abs_delta=canon_delta parity=parity
    if not parity:
        if canon_delta <= epsilon:
            emit s18_replay_verdict verdict=INCONCLUSIVE_TOOLING reason=env_parity_unverified
            return INCONCLUSIVE_TOOLING            # matches DESPITE env delta → cannot claim clean FIDELITY_OK
        else:
            emit s18_replay_verdict verdict=CANONICAL_DIVERGENCE cause=env_parity_gap canon_delta=canon_delta
            return CANONICAL_DIVERGENCE            # divergence attributed to env/version, NOT a logic fidelity gap
    # P0 holds → a within-ε result is a clean fidelity pass; a beyond-ε result is a logic fidelity gap
    verdict = FIDELITY_OK if canon_delta <= epsilon else CANONICAL_DIVERGENCE
    cause = None if canon_delta <= epsilon else "logic_fidelity_gap"
    emit s18_replay_verdict verdict=verdict cell=<crypto/fold> det_delta=det_delta canon_delta=canon_delta cause=cause
    return verdict

Tie-break / priorities (frozen):
1. Determinism first — a NON_DETERMINISTIC masks any divergence reading (we do NOT name
CANONICAL_DIVERGENCE if H1 failed — the divergence would be uninterpretable).
2. Tooling wins — any error path (replay ERROR, NaN, canonical absent for H2) → INCONCLUSIVE_TOOLING,
never a false PASS, never a UI crash. Determinism is emitted before the canonical is consulted, so a
missing canonical no longer hides a determinism failure (referee #4).
3. f1_buy(a) == f1_buy(b) exact — determinism is bit-identity (IEEE-754 exact equality), not
"close." Any inequality = FAIL. This is what separates instability from a fidelity gap (and it is only
meaningful under the pinned config §3.C).
4. epsilon is frozen in §3.E (value + provenance) — not a free parameter; this closes the
pre-registration hole the referee flagged (#4bis).
5. Env parity P0 gates FIDELITY (referee #2) — FIDELITY_OK is posable only when the replay's
build_sha + instance_type equal those of the run that produced the canonical (recorded via ADR-92).
Under non-parity: a within-ε match → INCONCLUSIVE_TOOLING reason=env_parity_unverified (matches
despite an env delta — no clean fidelity claim); a beyond-ε divergence → CANONICAL_DIVERGENCE
cause=env_parity_gap (attributed to env/version, not a logic gap). So a divergence is never
silently over-read as a fidelity-logic failure when the environments differ.
6. ε_prod unmeasured gates H2 — until D0 measures the prod envelope (§3.E), the decisional fidelity
verdict is withheld (INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured); H1 + smoke may run.
7. Determinism event always emitted (incl. status=ERROR) — no observability hole (referee #5).
8. H1-FAIL keeps a pure FAIL but routes a cause sub-branch (numeric_residue if det_delta ≤ ε_num
else real_instability) — this is not a tolerance on H1, it prevents the diagnostic from bricking
if perfect bit-identity proves unreachable in the stack (referee #3).
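The frozen rule and its tie-breaks can be read as one pure function. The following is a minimal sketch under stated assumptions: the Replay and Canonical record types and the decide_verdict signature are hypothetical (the plan only names _decide_verdict), a replay ERROR is modelled as f1_buy=None, an unmeasured ε_prod as None, and event emission is left out so that only the ordering of the checks is shown.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Replay:                      # hypothetical record type
    f1_buy: Optional[float]        # None models a replay ERROR
    build_sha: str
    instance_type: str


@dataclass(frozen=True)
class Canonical:                   # hypothetical record type
    f1_buy: float
    build_sha: str
    instance_type: str


def decide_verdict(r_a: Replay, r_b: Replay, canonical: Optional[Canonical],
                   eps_num: float, eps_prod: Optional[float]) -> Tuple[str, Optional[str]]:
    """Return (verdict, cause_or_reason) following the order P0-free H1, then H2 under P0."""
    # Tooling wins: any replay error or NaN is inconclusive.
    for r in (r_a, r_b):
        if r.f1_buy is None or math.isnan(r.f1_buy):
            return "INCONCLUSIVE_TOOLING", "replay_error"
    # H1: exact equality; eps_num only labels the cause, it is not a tolerance.
    if r_a.f1_buy != r_b.f1_buy:
        det_delta = abs(r_a.f1_buy - r_b.f1_buy)
        cause = "numeric_residue" if det_delta <= eps_num else "real_instability"
        return "NON_DETERMINISTIC", cause
    # From here on H1 holds, so H2 is meaningful.
    if canonical is None:
        return "INCONCLUSIVE_TOOLING", "canonical_absent"
    if eps_prod is None:
        return "INCONCLUSIVE_TOOLING", "epsilon_prod_unmeasured"
    epsilon = max(eps_num, eps_prod)
    parity = (r_a.build_sha == canonical.build_sha
              and r_a.instance_type == canonical.instance_type)
    within = abs(r_a.f1_buy - canonical.f1_buy) <= epsilon
    if not parity:
        return (("INCONCLUSIVE_TOOLING", "env_parity_unverified") if within
                else ("CANONICAL_DIVERGENCE", "env_parity_gap"))
    return (("FIDELITY_OK", None) if within
            else ("CANONICAL_DIVERGENCE", "logic_fidelity_gap"))
```

With a replayed f1_buy of 0.42 and an illustrative canonical of 0.65 on matching environments, the sketch returns CANONICAL_DIVERGENCE with cause logic_fidelity_gap, which is the outcome the stated prior expects for A6.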
Fig 1 — truth table (exactly one verdict per cell). P0 (env parity) is the gating precondition on the PASS row only (it is irrelevant when H1 FAIL/ERROR already decides). FAIL/ERROR rows are P0-independent.
H1 PASS row, split by P0 (env parity replay↔canonical-generator):
| P0 \ H2 (canonical) | ≤ ε | > ε | canonical absent | ε_prod unmeasured |
|---|---|---|---|---|
| parity holds | FIDELITY_OK | CANONICAL_DIVERGENCE (cause=logic_fidelity_gap) | INCONCLUSIVE_TOOLING (canonical_absent) | INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured) |
| parity fails | INCONCLUSIVE_TOOLING (env_parity_unverified) | CANONICAL_DIVERGENCE (cause=env_parity_gap) | INCONCLUSIVE_TOOLING (canonical_absent) | INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured) |
H1 FAIL / ERROR rows (P0- and H2-independent):
| H1 | any H2 / P0 |
|---|---|
| FAIL (diverges across processes) | NON_DETERMINISTIC (cause = numeric_residue or real_instability) |
| ERROR / NaN (replay broke) | INCONCLUSIVE_TOOLING (replay_error) |
Reading: the FAIL row collapses to NON_DETERMINISTIC regardless of H2/P0 (tie-break #1); the ERROR row collapses to INCONCLUSIVE_TOOLING; H2 and P0 are consulted only on the PASS row. The test-strategy exhaustiveness test asserts this full mapping is total.
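One way to check that the mapping of Fig 1 is total is to encode it as a lookup table and enumerate every combination. The sketch below is an illustration, not the story's test file: the state labels ("within", "beyond", and so on) and the function name lookup_verdict are hypothetical encodings of the table's rows and columns.

```python
from collections import Counter
from itertools import product

H1_STATES = ("PASS", "FAIL", "ERROR")
P0_STATES = (True, False)                      # env parity holds / fails
H2_STATES = ("within", "beyond", "canonical_absent", "epsilon_prod_unmeasured")
VERDICTS = {"FIDELITY_OK", "NON_DETERMINISTIC",
            "CANONICAL_DIVERGENCE", "INCONCLUSIVE_TOOLING"}

# The H1 PASS row of Fig 1, for the two H2 states where a delta exists.
PASS_ROW = {
    (True, "within"): "FIDELITY_OK",
    (True, "beyond"): "CANONICAL_DIVERGENCE",
    (False, "within"): "INCONCLUSIVE_TOOLING",
    (False, "beyond"): "CANONICAL_DIVERGENCE",
}


def lookup_verdict(h1: str, parity: bool, h2: str) -> str:
    """Table-driven reading of Fig 1; raises KeyError on an unmapped cell."""
    if h1 == "ERROR":
        return "INCONCLUSIVE_TOOLING"          # P0- and H2-independent
    if h1 == "FAIL":
        return "NON_DETERMINISTIC"             # tie-break #1
    if h2 in ("canonical_absent", "epsilon_prod_unmeasured"):
        return "INCONCLUSIVE_TOOLING"          # H2 cannot be assessed
    return PASS_ROW[(parity, h2)]


def enumerate_table() -> Counter:
    """Walk all H1 x P0 x H2 combinations and count verdicts."""
    counts: Counter = Counter()
    for h1, parity, h2 in product(H1_STATES, P0_STATES, H2_STATES):
        verdict = lookup_verdict(h1, parity, h2)
        assert verdict in VERDICTS
        counts[verdict] += 1
    return counts


if __name__ == "__main__":
    print(enumerate_table())
```

Under this encoding the enumeration covers 3 × 2 × 4 = 24 combinations, and FIDELITY_OK is reachable from exactly one of them (H1 PASS, parity holds, within ε).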
§2 Scope
In scope:
- Determinism check (cross-process double-replay, bit-identity assertion) under a pinned config.
- Canonical divergence check (replay vs finetune_results, independent of skip_phase_a).
- D0 — production reproducibility envelope (ε_prod): characterize prod's own f1_buy run-to-run
spread to set epsilon's floor (§3.E). Gated input to the decisional H2 verdict.
- A job (manual DAG or nightly CI — §7) emitting the 3 events + a verdict artefact.
- Cause diagnosis (non-determinism vs fidelity gap), mapped to what _replay_cell controls (§6).
- Loki catalogue entries (3 events).
Out of scope:
- Fixing the divergence (an env/order/window fix is a follow-up, contingent on the cause found).
- Flipping skip_phase_a=true to default (that is the downstream gated decision, not this deliverable).
- Any economic / capacity verdict (orthogonal — S04).
§3 Approach (Hamilton-native, 2 layers)
A. Determinism layer (_probe_determinism) — pure function over two fresh-process _replay_cell
executions; returns (det_delta, status). Reuses the existing _replay_cell (no re-implementation of the
training path); the fresh-process isolation is the orchestration's job (subprocess / task).
B. Divergence layer (_probe_divergence) — pure function; reads the canonical via
query_canonical_from_pg (out-of-run), compares to the determined replay. Decoupled from the Phase A
anchor.
C. Deterministic execution config (frozen — referee #1). Every replay subprocess runs under:
- thread caps: OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1 (per ADR-0096;
order-sensitive BLAS/GBM reductions otherwise break bit-identity within the same process);
- LightGBM determinism: deterministic=true, force_row_wise=true, fixed seed / bagging_seed /
feature_fraction_seed, num_threads=1 ([LightGBM-Determinism]);
- PYTHONHASHSEED fixed; single, pinned data window (calendar fold-3).
Binary settlement on the production config (referee #1 — no "ideally"). §3.C governs the replay.
The production training path that produced the stored canonical runs un-capped (multi-threaded for
throughput — it is not under §3.C). We therefore do not assume prod determinism: the canonical
carries production's own run-to-run spread, which is captured by ε_prod measured under prod's real
config (§3.E), not under §3.C. (Were prod ever run under §3.C, ε_prod → 0 and epsilon = ε_num — but
that is not the operating assumption.) Without §3.C on the replay side, H1 would emit spurious
NON_DETERMINISTIC for a reason foreign to the replay (the referee's core Tier-1 point).
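The pinned replay configuration of §3.C could be assembled along the following lines. This is a sketch under assumptions: the seed value, the PYTHONHASHSEED value and the helper names are hypothetical, and only the environment variables and LightGBM parameter names listed in §3.C are used. PYTHONHASHSEED only takes effect at interpreter start, which is one more reason the replay runs in a fresh subprocess that receives this environment.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional

SEED = 42          # hypothetical value; the plan only requires that it is fixed
HASH_SEED = "0"    # hypothetical value for PYTHONHASHSEED


def pinned_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Environment for a replay subprocess: thread caps plus a fixed hash seed."""
    env = dict(os.environ if base is None else base)
    env.update({
        "OMP_NUM_THREADS": "1",
        "MKL_NUM_THREADS": "1",
        "OPENBLAS_NUM_THREADS": "1",
        "PYTHONHASHSEED": HASH_SEED,
    })
    return env


def pinned_lgbm_params(seed: int = SEED) -> Dict[str, object]:
    """LightGBM parameters from §3.C, to be merged into the training params."""
    return {
        "deterministic": True,
        "force_row_wise": True,
        "seed": seed,
        "bagging_seed": seed,
        "feature_fraction_seed": seed,
        "num_threads": 1,
    }


def str_hash_in_fresh_process(text: str) -> str:
    """Show the effect of PYTHONHASHSEED: hash(text) as seen by a new interpreter."""
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=pinned_env(), capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

Calling str_hash_in_fresh_process twice with the same string should give the same value, which is not guaranteed when hash randomization is left on.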
D. Decision + orchestration (_decide_verdict) — pure function implementing the §1 pseudo-code
(Cartesian product H1×H2 → exactly one verdict; Fig 1). A job loops over the reference cells, emits the
events, writes the verdict artefact. Hamilton-native style (named node graph, isolated I/O, decision
in pure functions — no adapter wrappers). Extended metrics are diagnostic-only, not decisional: the
rule uses det_delta/canon_delta on f1_buy only (referee #8).
E. epsilon — frozen value + provenance (referee #3, #4, #4bis, #8).
epsilon = max(ε_num, ε_prod) where
- ε_num = 0.005 — the numerical replay tolerance inherited from the s43 DAG / s18_step0_replay.
Justification as a fidelity threshold (referee #8), not merely an inherited replay knob: 0.005 in
absolute f1_buy is ~1 % of a ~0.6 operating f1_buy and is a factor of 20–60 below the A6 gap
(0.1–0.3) the check must catch, i.e. small enough that a genuine fidelity gap cannot hide under it, large
enough to absorb float/repr noise of an otherwise-faithful replay. (It is not a tradability threshold;
it is the "indistinguishable-from-faithful" band on the metric.)
- ε_prod — the production-path reproducibility envelope, measured under production's real
(un-capped) config (referee #1), not §3.C: it must bound how much the stored canonical itself could
have varied. Conservative estimator (referee #4 — the cost-sensitivity under-power lesson): a
one-sided upper tolerance bound (≈95 %/95 %) on |Δf1_buy| over N ≥ 10 fresh-process re-runs of
the prod path on the reference cell — not the max-of-5 (a biased-low estimator of the true max).
Measured in D0 and gated: until measured, the decisional H2 verdict is withheld
(INCONCLUSIVE_TOOLING reason=epsilon_prod_unmeasured); H1 + smoke may run (mirrors the s43 cost gate).
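The ε_prod estimator and the epsilon = max(ε_num, ε_prod) resolution can be sketched as follows. This is an illustration under assumptions the plan does not fix: it treats the |Δf1_buy| sample as approximately normal and uses the standard closed-form approximation of the one-sided normal tolerance factor, whereas the plan only asks for a one-sided ≈95 %/95 % upper bound over N ≥ 10. Absolute deltas are non-negative and may be skewed, so the distributional choice would need checking on the D0 data. Function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Optional, Sequence

EPS_NUM = 0.005   # frozen numerical tolerance from §3.E
MIN_RUNS = 10     # N >= 10 per §3.E


def k_one_sided(n: int, coverage: float = 0.95, confidence: float = 0.95) -> float:
    """Approximate one-sided normal tolerance factor (closed-form approximation)."""
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + math.sqrt(z_p ** 2 - a * b)) / a


def eps_prod_upper_bound(abs_deltas: Sequence[float]) -> float:
    """Upper tolerance bound mean + k*s on the prod run-to-run |delta f1_buy|."""
    n = len(abs_deltas)
    if n < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} re-runs, got {n}")
    return mean(abs_deltas) + k_one_sided(n) * stdev(abs_deltas)


def resolve_epsilon(eps_prod: Optional[float]) -> Optional[float]:
    """epsilon = max(eps_num, eps_prod); None while eps_prod is unmeasured (gate)."""
    if eps_prod is None:
        return None
    return max(EPS_NUM, eps_prod)
```

For N = 10 the factor is a little under 2.9, so the bound sits above the largest observed delta in the sample, which is the conservative direction the plan asks for relative to a max-of-5.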
F. Env-parity precondition P0 (referee #2; committee recs #1/#7 folded). FIDELITY_OK is not
posable unless the replay's execution environment matches the one that produced the canonical, on the
parity vector:
- build_sha(replay) == build_sha(canonical-run) (champollion image, ADR-92);
- instance_type(replay) == instance_type(canonical-run) (CPU microarch);
- feature/artefact version parity (ADR-23) — the OHLCV / Enrichment / FeatureEngineering output
versions consumed by _replay_cell identical to the canonical's (committee rec #1; this is also WA1's
"artefact version" cause made a precondition);
- OS/lib versions (e.g. glibc, numpy / LightGBM / BLAS build) recorded and compared where available
(committee rec #7; SIMD/math-lib reductions are environment-bound). Where a dimension cannot be recorded
from the canonical run, P0 is treated as not established (conservative → no FIDELITY_OK).
All recorded/surfaced via ADR-92 (build SHA) + ADR-23 (feature pinning) — the canonical row carries
them; the replay records its own.
Rationale: bit-identity to a canonical produced on a different image / lib set / CPU microarch is
unachievable in principle (SIMD widths, math-lib reductions) even under §3.C flags — so a divergence
under non-parity is an environment/version artefact, not a logic fidelity gap. P0 therefore gates
the fidelity claim (§1): non-parity routes a beyond-ε divergence to CANONICAL_DIVERGENCE cause=env_parity_gap
and a within-ε match to INCONCLUSIVE_TOOLING reason=env_parity_unverified — never a clean FIDELITY_OK.
This is what makes "faithful to prod" a defensible claim rather than "faithful in the diagnostic
environment."
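The parity vector of P0 can be pictured as a fingerprint comparison in which a missing dimension counts as a gap. The sketch below is hypothetical in its type and field names (EnvFingerprint, feature_versions, lib_versions) and in its sample values; it only encodes the conservative rule stated above, namely that an unrecorded dimension leaves P0 not established.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class EnvFingerprint:                     # hypothetical record of the parity vector
    build_sha: Optional[str]              # image SHA (ADR-92)
    instance_type: Optional[str]          # proxy for CPU microarchitecture
    feature_versions: Optional[str]       # feature/artefact versions (ADR-23)
    lib_versions: Optional[str]           # OS / numeric library versions


def parity_gaps(replay: EnvFingerprint, canonical: EnvFingerprint) -> List[str]:
    """Names of the dimensions that are unrecorded on either side or differ."""
    gaps = []
    for f in fields(EnvFingerprint):
        mine, theirs = getattr(replay, f.name), getattr(canonical, f.name)
        if mine is None or theirs is None or mine != theirs:
            gaps.append(f.name)
    return gaps


def parity_established(replay: EnvFingerprint, canonical: EnvFingerprint) -> bool:
    """P0 holds only when every dimension is recorded on both sides and equal."""
    return not parity_gaps(replay, canonical)
```

Returning the list of gaps rather than a bare boolean keeps the information an operator would need to tell a version mismatch from a missing record.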
§4 Files
| File | Action |
|---|---|
| src/commun/finetune/diagnostic/s18_replay_fidelity.py | new: _probe_determinism (cross-process), _probe_divergence, _decide_verdict (pure), _measure_prod_envelope (D0). Imports _replay_cell from the existing module; does not mutate it (ADR-25) |
| src/commun/finetune/diagnostic/s18_step0_replay.py | touch only if needed: expose _replay_cell for fresh-process invocation; keep Verdict stable (add, never mutate) |
| dags/dag_diagnostic__s18_replay_fidelity.py | new if §7 = manual DAG (capture+analyse in the same pod) |
| .github/workflows/… | new if §7 = nightly CI |
| tests/unit/finetune/diagnostic/test_s18_replay_fidelity.py | new: the Fig 1 exhaustiveness test (every H1 × P0 × H2 combination → exactly one verdict), no-crash, NaN/None handling, determinism-before-canonical ordering |
| documentation/stories/CVN-N001-EI-S09/{index.md,test_strategy.md} · design/… · runbooks/… | hub initialized; architecture / runbook / test-strategy at merge (ADR-0095) |
| Loki observability catalogue (doc) | extend: s18_replay_determinism, s18_replay_divergence, s18_replay_verdict |
§5 Risks
| Risk | Impact | Mitigation |
|---|---|---|
| Float equality too strict under multi-thread reductions | spurious NON_DETERMINISTIC | the §3.C pinned config (thread caps + LightGBM determinism) is the mitigation; bit-identity is asserted only under it |
| Canonical is one prod draw with its own spread | CANONICAL_DIVERGENCE on prod noise | epsilon ≥ ε_prod, measured under prod's real config + conservative tolerance bound, D0-gated (§3.E) |
| Env delta replay↔prod (image / lib / CPU microarch) | a divergence over-read as a logic fidelity gap; bit-identity unachievable cross-microarch | precondition P0 (§3.F): parity gates FIDELITY_OK; non-parity → env_parity_gap / env_parity_unverified |
| Cross-process cost (replay run ≥ 2× + D0 ×N) | slow job | minimal cell for H1 wiring; D0 / classification batched nightly or off-hours; cap via max_cells |
| Canonical absent in PG | no H2 | INCONCLUSIVE_TOOLING reason=canonical_absent — after the determinism event (referee #4) |
| Cause not found (real divergence, opaque source) | A6 stays open | deliverable is detection + characterization, not the fix; fix scoped as a follow-up on the cause |
| Plan↔code drift (deciding value invented at test time) | invalidates pre-registration | every deciding value (epsilon=§3.E, thresholds, ordering) is in this plan; the test verifies the rule, it does not define it |
§6 Success criteria + ops
Success criteria:
1. Events emitted per the rule (referee #5): s18_replay_determinism (status PASS/FAIL/ERROR) on
every cell — including error paths, so the operator never sees a hole; s18_replay_divergence and
s18_replay_verdict per §1. All three in the Loki catalogue.
2. The job returns exactly one verdict ∈ {FIDELITY_OK, NON_DETERMINISTIC, CANONICAL_DIVERGENCE,
INCONCLUSIVE_TOOLING} with a cause/reason code (Fig 1 exhaustiveness test green, incl. the P0
parity split and the H1-FAIL cause sub-branch).
2b. Env parity P0 recorded (referee #2): the replay's build_sha + instance_type and the canonical
run's (via ADR-92) are captured and compared; FIDELITY_OK is emitted only under parity.
3. Scope = all 5 defi_top5/fold-3 cells for cause classification (referee #5): H1 may be proven on a
subset, but classification must cover the 5 to distinguish a uniform gap (systematic) from a
cell-dependent gap (itself diagnostic — e.g. a data-window issue). On the historically-FAIL s42
cells, the job independently reproduces the divergence (outside the Phase A anchor) and classifies
it (NON_DETERMINISTIC vs CANONICAL_DIVERGENCE) — the answer to A6.
4. ε_prod (D0) measured and epsilon resolved per §3.E before any decisional H2 verdict.
5. Cause diagnosis documented, with WA1's candidate causes mapped to what _replay_cell controls
(referee #7): applied env (_set_env_for_cell / BASE_ENV overlay), LdP loop order, data window
(calendar fold-3), artefact/model version — for each, "pinned" vs "free."
6. No-crash: every error path → structured INCONCLUSIVE_TOOLING (never a UI raise, ADR-25/31).
Ops: readable verdict (JSON artefact + Loki); operator runbook (trigger, read, partial-pod-failure troubleshooting); ADR-92 (build SHA surfaced) if a DAG.
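A readable verdict could take the shape sketched below: key=value event lines sent through the standard logging module (in line with ADR-31, no print) and a JSON artefact per cell. The logger name, field layout and helper names are assumptions, as is the idea that log lines reach Loki through whatever collector the pods already use; none of this is taken from the existing codebase.

```python
import json
import logging
from typing import Any, Optional

logger = logging.getLogger("s18_replay_fidelity")   # hypothetical logger name


def format_event(event: str, **fields: Any) -> str:
    """Render an event as key=value pairs, skipping fields that are None."""
    parts = [f"event={event}"]
    parts += [f"{key}={value}" for key, value in fields.items() if value is not None]
    return " ".join(parts)


def emit(event: str, level: int = logging.INFO, **fields: Any) -> str:
    """Log the event line and return it (useful for tests)."""
    line = format_event(event, **fields)
    logger.log(level, line)
    return line


def verdict_artefact(cell: str, verdict: str, det_delta: Optional[float],
                     canon_delta: Optional[float], cause: Optional[str]) -> str:
    """Serialize one cell's verdict as a JSON document with stable key order."""
    return json.dumps(
        {"cell": cell, "verdict": verdict, "det_delta": det_delta,
         "canon_delta": canon_delta, "cause": cause},
        sort_keys=True,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    emit("s18_replay_determinism", status="PASS", abs_delta_runs=0.0)
    emit("s18_replay_determinism_cause", level=logging.WARNING,
         cause="real_instability", det_delta=0.01)
```

Keeping the formatter separate from the logging call makes the event shape testable without a log backend.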
Fig 2 — downstream decision routing (the "why").
flowchart TD
V{s18_replay_verdict}
V -->|FIDELITY_OK| OK["Lift A6 for the covered surface
Authorize skip_phase_a=true default
ONLY within covered (family,fold,config)"]
V -->|NON_DETERMINISTIC| ND["Fix seed / order / window / threads
A6 stays open · no fidelity claim"]
V -->|"CANONICAL_DIVERGENCE (cause=logic_fidelity_gap)"| CD["Real fidelity gap (env controlled by P0)
Diagnose via WA1 → _replay_cell map
Bound s18 verdicts incl. S05/s43 asterisk · feed D2 (EK)"]
V -->|"CANONICAL_DIVERGENCE (cause=env_parity_gap)"| CE["Attributed to env/version, NOT logic
Re-generate canonical under a parity-matched image, or match it"]
V -->|INCONCLUSIVE_TOOLING| IT["Fix machinery / measure ε_prod / supply canonical
or establish env parity P0 · re-run"]
Gate generalization is bounded (referee #6): a FIDELITY_OK on the covered cells authorizes
skip_phase_a=true only for that (family, fold, config) surface; a programme-wide default flip
requires coverage of the diagnostic surface (s41 / s42 / s43), tracked separately.
Fig 3 — repeatability (H1 in-process pre-check) vs reproducibility (H1 cross-process, load-bearing).
sequenceDiagram
participant P as Prod (other process/pod)
participant A as Replay proc A
participant B as Replay proc B
Note over P: canonical f1_buy stored (one draw, spread = ε_prod)
Note over A,B: pinned deterministic config (§3.C)
A->>A: in-process repeat (R1 — repeatability, cheap pre-check)
A-->>B: fresh process, same seed/config (R2 — reproducibility = H1)
Note over A,B: H1 PASS ⇔ f1_buy(A)==f1_buy(B) exact
A->>P: H2 only if H1 PASS: |replay − canonical| ≤ epsilon (epsilon ≥ ε_prod)
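The distinction in Fig 3 can be sketched as follows. The child script is a stand-in for the real replay: it runs a seeded computation in a fresh interpreter, whereas the plan would call `_replay_cell` under the §3.C pinned config. All function names here are hypothetical, and the value is passed back as a hex float so that the exact-equality comparison is not affected by decimal formatting.

```python
import subprocess
import sys

# Illustrative stand-in for a replay run: a seeded sum, printed as an exact hex float.
CHILD = (
    "import random, sys; random.seed(int(sys.argv[1])); "
    "print(sum(random.random() for _ in range(1000)).hex())"
)


def replay_in_fresh_process(seed: int) -> float:
    """Run the stand-in replay in a new interpreter process and return its metric."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD, str(seed)],
        capture_output=True, text=True, check=True,
    ).stdout
    return float.fromhex(out.strip())


def h1_reproducible(seed: int) -> bool:
    """H1: two independent processes, same seed/config, exact equality (no tolerance)."""
    return replay_in_fresh_process(seed) == replay_in_fresh_process(seed)


def h2_within_envelope(replay: float, canonical: float,
                       epsilon: float, eps_prod: float) -> bool:
    """H2: only meaningful after H1 PASS; epsilon must not undercut eps_prod."""
    if epsilon < eps_prod:
        raise ValueError("epsilon must be >= eps_prod")
    return abs(replay - canonical) <= epsilon
```

An in-process repeat (R1) would call the computation twice inside one interpreter. It is cheaper but cannot expose state that differs between processes, which is why only the cross-process comparison counts as H1.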
§7 Settled decisions (committee may revise)
- Job = manual DAG OR nightly CI? — proposed: manual DAG diagnostic__s18_replay_fidelity (consistent with the in-pod diagnostic pattern: capture + analyse in the same pod, ephemeral/tmp), + an optional nightly wrapper later. To be ratified.
- Dedicated module vs extending s18_step0_replay.py — proposed: dedicated module s18_replay_fidelity.py (pure probes + decision) importing _replay_cell; keeps the existing file stable. To be ratified.
- Cell scope — proposed: all 5 defi_top5/fold-3 cells for cause classification (referee #5); a single cell is acceptable only for determinism wiring / smoke. To be ratified.
- Determinism = exact equality (no tolerance), under the §3.C pinned config — settled (§1 tie-break 3).
- epsilon = max(0.005, ε_prod), ε_prod measured in D0 and gated — settled (§3.E).
- H1 = cross-process (reproducibility), in-process repeat = cheap pre-check only — settled (§3.A).
- Production runs un-capped (not under §3.C); ε_prod measured under prod's real config — settled (§3.C/§3.E, referee #1). Confirm the prod threading regime in-pod at impl.
- Env-parity precondition P0 (build_sha + instance_type via ADR-92) gates FIDELITY_OK; non-parity → env_parity_gap / env_parity_unverified — settled (§3.F, §1 tie-break 5, referee #2).
- ε_prod = one-sided ~95/95 upper tolerance bound over N ≥ 10 (not max-of-5) — settled (§3.E, referee #4).
- H1-FAIL cause sub-branch (numeric_residue vs real_instability at ε_num), pure FAIL preserved — settled (§1, referee #3).
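To make the epsilon rule concrete, here is a minimal sketch of a one-sided 95/95 upper tolerance bound and of `epsilon = max(0.005, ε_prod)`. It rests on assumptions not stated in this section: the D0 measurements are treated as approximately normal, and the tolerance factor uses the Natrella closed-form approximation rather than the exact noncentral-t value. Which statistic D0 actually samples is fixed in §3.E, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def one_sided_k(n: int, coverage: float = 0.95, confidence: float = 0.95) -> float:
    """Approximate one-sided normal tolerance factor (Natrella approximation)."""
    if n < 10:
        raise ValueError("the plan requires N >= 10 measurements")
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g ** 2 / (2.0 * (n - 1))
    b = z_p ** 2 - z_g ** 2 / n
    return (z_p + math.sqrt(z_p ** 2 - a * b)) / a


def eps_prod_upper_bound(samples: Sequence[float]) -> float:
    """Upper tolerance bound mean + k * s over the D0 measurements (normality assumed)."""
    return mean(samples) + one_sided_k(len(samples)) * stdev(samples)


def resolve_epsilon(eps_prod: float, floor: float = 0.005) -> float:
    """epsilon = max(0.005, eps_prod), as settled in the list above."""
    return max(floor, eps_prod)
```

With N = 10 the approximate factor is about 2.87, so the bound sits well above the sample maximum for tightly clustered measurements. That conservatism is the stated reason for preferring a tolerance bound over max-of-5.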
Committee plan_review (ADR-68)
Verdict: PASSED · OK — session f5fdbf8d · OP Meeting 268 · 5 experts (crypto-trader, data-scientist,
ops, ml-engineer, architect), strong consensus (all 9.0/9), 0 blockers, 0 dissent. Agreement: the
determinism-first cross-process H1, the env-delta handling (P0 + ε_prod under prod's real config), the
exhaustive verdict map, the pre-registration discipline, and the observability are sound and ready to
implement.
Recommendation disposition (9 non-blocking):
| # | Recommendation | Disposition |
|---|---|---|
| 1 | Feature version pinning (ADR-23) in the parity check | ✅ folded into P0 (§3.F) |
| 7 | Extend P0 to OS/lib versions (glibc, BLAS, …) | ✅ folded into P0 (§3.F) |
| 9 | H1-FAIL cause as a warn-level Loki event | ✅ folded into §1 (s18_replay_determinism_cause severity=warn) |
| 4 | Ratify job mode (manual DAG + nightly wrapper) | already proposed §7.1 — to ratify at impl |
| 5 | Validate §3.C across hardware/OS + confirm prod threading in-pod | already §7.7 — impl-time validation |
| 2 | Continuous ε_prod drift monitoring | follow-up (ops; post-impl) |
| 3 | Resource/scheduling isolation for D0 | follow-up (runbook artefact) |
| 6 | Canonical baseline lifecycle / staleness process | follow-up (runbook artefact) |
| 8 | Stress-test scenarios (injected fidelity gap) + ε_num sensitivity | follow-up (test-strategy artefact, at merge) |
Methodological invariants — honest application to S09
S09 is a validity / tooling diagnostic, not an economic parameter-sweep diagnostic. Some template invariants apply verbatim; others are N/A with reason (not copied blindly).
| Invariant (template) | S09 application |
|---|---|
| Pre-registration of the rule | ✅ full — rule frozen in §1 (pseudo-code + tie-break + Fig 1), epsilon frozen in §3.E |
| No-crash / fail-loud (ADR-25/31) | ✅ full — structured INCONCLUSIVE_TOOLING, no UI raise, no print, NaN guarded |
| Inconclusives first-class | ✅ full — INCONCLUSIVE_TOOLING is a verdict |
| Decision keyed on significance | ⚙️ adapted — H1 = exact equality (not a CI); H2 = deterministic threshold epsilon |
| Bootstrapped envelope statistic | ❌ N/A — no swept parameter / argmax; nothing to bootstrap (a reproducibility measure, not an effect) |
| Multiple-comparison correction (FWER) | ❌ N/A — not a family of decisional tests; two ordered checks, not a grid |
| Gate on unmeasured deciding inputs | ⚙️ adapted — ε_prod (D0) gates the decisional H2 verdict (§3.E); no economic cost here |
Appendix — plan_review checklist
- Decision rule frozen (§1) — verdict, tie-break, thresholds, H1→H2 ordering, Fig 1 total.
- Document fully in English (referee Tier-0) — no residual French, no "(EN)" tag.
- Hypotheses with acceptance invariant + method (A3); WA1 distinguished; terminology note (no NHST null).
- Deterministic execution config specified (§3.C, ADR-0096 + LightGBM) — H1 cannot fire on thread noise.
- H1 cross-process (reproducibility, not in-process repeatability) (§3.A, Fig 3).
- Canonical determinism handled: epsilon ≥ ε_prod, measured + gated (§3.E).
- Determinism evaluated/emitted independently of the canonical, event always emitted incl. ERROR (§1, referee #4/#5).
- Production config settled binary (un-capped) → ε_prod measured under prod's real config, conservative tolerance bound N≥10 (§3.C/§3.E, referee #1/#4).
- Env-parity precondition P0 gates FIDELITY_OK (build_sha + instance_type via ADR-92); non-parity → env-gap classification (§3.F, referee #2).
- H1-FAIL cause sub-branch (numeric residue vs real instability) — pure FAIL preserved (§1, referee #3).
- Prior stated (A6 gaps ≫ epsilon → live hypothesis = gross divergence; effort → control-surface map) (§1, referee #6).
- epsilon value + provenance frozen in the plan, ε_num justified as a fidelity threshold (§3.E, referee #8).
- Exhaustive truth table (Fig 1) reflected in the test strategy.
- No-crash on every error path incl. NaN (ADR-25).
- Downstream routing explicit and gate generalization bounded (Fig 2, §6).
- 5 artefacts planned (hub initialized; architecture / runbook / test-strategy at merge, ADR-0095).