Architecture (r2 — for review)

How it's built. This is the architecture artefact (ADR-0095) for S09. It realizes the pre-registered decision rule of the plan dossier r4 (committee plan_review PASSED·OK, Meeting 268) — it does not re-decide the rule. Hamilton-native, two layers (Airflow orchestration / pure compute), I/O isolated, decision in pure functions.

Changelog — r1 (2026-06-09): initial architecture, traces plan r4 (cross-process H1, env-parity P0, ε_prod D0, frozen verdict map). r2 (2026-06-09): architecture referee round 1 — #1 §3 component view fixed (removed the false canonical → determinism edge; H1 is canonical-independent, §5 is truth); #2 P0 made a precondition with an INFEASIBLE-until-schema gate (canonical-side provenance may not exist in finetune_results); #3 replay_env fingerprint captured in the subprocess + §3.C split into spawn-time (PYTHONHASHSEED + thread caps) vs worker-time (LightGBM flags); #4 D0/ε_prod separated to its own cadence (computed once, keyed by prod image SHA, stored + read — not per-verdict); #5 exhaustiveness re-axed to the true cartesian product H1×P0×H2×canonical_present×eps_prod_measured (order H1→P0→H2); #6 Verdict schema completed (both-sides provenance + ε values); #7 parity made tri-state {equal, differs, unverifiable}; #8 deployment + sequence views added; #9/#10/#11 wording/ADR-scope/justification fixes.

§0 — Provenance & scope

Realizes plan r4: detect (A) replay non-determinism (cross-process) and (B) canonical divergence, independently of any diagnostic run, and emit a verdict that names the cause — the precondition to the skip_phase_a=true-by-default gate. Reuses the existing _replay_cell (training-path replay) and query_canonical_from_pg from src/commun/finetune/diagnostic/s18_step0_replay.py; it does not re-implement the training path. Orthogonal to the S04 capacity verdict.

1. Context & scope

In: a determinism probe (two fresh-process replays under the pinned config §3.C of the plan), a divergence probe (replay vs stored canonical), the env-parity precondition P0, the production reproducibility envelope ε_prod (D0), a pure decision, and the Loki events.
Out: fixing a divergence; flipping skip_phase_a; any economic/capacity verdict.
Units: the 5 defi_top5 / fold-3 reference cells (cause classification needs the 5; H1 wiring may use 1).

2. Architectural style — Hamilton-native, two layers

Layer	Responsibility	Purity
Airflow orchestration (`dag_diagnostic__s18_replay_fidelity.py`)	resolve cells → spawn the fresh-process replays (H1) → read the stored ε_prod (D0 runs in its own gated DAG, §7) → call the pure decision → emit events / write artefact	side-effecting (I/O, subprocess, PG, Loki)
Pure compute (`commun.finetune.diagnostic.s18_replay_fidelity`)	`_probe_determinism`, `_probe_divergence`, `_probe_parity`, `_measure_prod_envelope` (D0), `_decide_verdict`	pure — no I/O, no `os.environ` writes, no PG

Rationale: the decision is the auditable core (the frozen rule + truth table Fig 1 of the plan); keeping it a pure function over already-collected results makes the exhaustiveness test trivially checkable and immune to I/O flakiness (Hamilton-native — no adapter wrappers).

2bis. Pinned-config application — spawn-time vs worker-time (#3)

The plan's §3.C config cannot all be set inside the worker — some knobs are read before any worker code runs. The architecture splits them by when they take effect:

Knob	When read	How applied
`PYTHONHASHSEED`	interpreter start	spawn-time — injected into the child's `env` at process creation (cannot be set after the interpreter is up)
`OMP_/MKL_/OPENBLAS_NUM_THREADS`	first BLAS/OMP import	spawn-time — injected into the child `env` before numpy/LightGBM import (a worker that imports numpy then sets them is too late)
LightGBM `deterministic` / `force_row_wise` / seeds / `num_threads`	at `train()`	worker-time — passed as training params inside `_replay_cell`

Consequence: the determinism subprocess is launched with start-method spawn and an explicit child env carrying the spawn-time knobs (see §7). A fork-ed or env-after-import worker would believe it is pinned without being so — H1 would silently test an un-pinned path.
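
The spawn-time half can be illustrated with a minimal sketch, assuming a plain `subprocess` launch of a fresh interpreter. `spawn_pinned` and the inline child script are hypothetical stand-ins for the real `_replay_cell` entry point, and the seed and thread values are illustrative, not the plan's §3.C values. The child reports what it observed, so the parent never has to assume the pinning took effect.

```python
import json
import os
import subprocess
import sys

# Hypothetical child script: a stand-in for the real replay entry point.
# It reports the knobs as seen from INSIDE the fresh interpreter.
CHILD = r"""
import json, os
print(json.dumps({
    "hash_abc": hash("abc"),  # depends on PYTHONHASHSEED, fixed at interpreter start
    "hashseed": os.environ.get("PYTHONHASHSEED"),
    "omp_threads": os.environ.get("OMP_NUM_THREADS"),
}))
"""


def spawn_pinned(hashseed: str = "0", threads: str = "1") -> dict:
    """Run CHILD in a fresh interpreter with the spawn-time knobs in its env."""
    child_env = dict(os.environ)
    child_env.update({
        "PYTHONHASHSEED": hashseed,       # read once, at interpreter start
        "OMP_NUM_THREADS": threads,       # read at first OMP/BLAS import
        "MKL_NUM_THREADS": threads,
        "OPENBLAS_NUM_THREADS": threads,
    })
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(out.stdout)
```

Calling `spawn_pinned()` twice and comparing `hash_abc` gives a quick check that the hash seed was applied at interpreter start in both children.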

3. Component view

flowchart TD
  subgraph D0DAG["D0 — ε_prod characterization (separate gated DAG, RARE cadence — #4)"]
    M0["measure_prod_envelope
N≥10 prod-config (UN-capped) re-runs"]
    M0 --> EPS["ε_prod = one-sided 95/95 upper tol. bound
stored, KEY = prod image SHA"]
  end
  subgraph ORCH["Verdict job — Airflow orchestration (DAG, side-effecting)"]
    RC[resolve_cells: defi_top5/fold-3]
    SPA["spawn replay proc A (fresh, §3.C-spawn pinned)
returns f1_buy + env_fingerprint"]
    SPB["spawn replay proc B (fresh, §3.C-spawn pinned)"]
    CAN["load canonical row from PG
f1_buy + provenance (build_sha,instance_type,feature/lib vers)"]
    RDE["read ε_prod from store (by prod image SHA)"]
    EMIT[emit Loki events + write verdict artefact]
  end
  subgraph PURE["Pure compute (Hamilton nodes, no I/O)"]
    PD[_probe_determinism]
    PP[_probe_parity P0 — tri-state]
    PV[_probe_divergence]
    DV[_decide_verdict]
  end
  EPS -. "consumed by reference (NOT re-measured per verdict)" .-> RDE
  RC --> SPA & SPB & CAN & RDE
  SPA --> PD
  SPB --> PD
  SPA --> PV
  SPA --> PP
  CAN --> PP
  CAN --> PV
  RDE --> PV
  PD --> DV
  PP --> DV
  PV --> DV
  RDE --> DV
  DV --> EMIT

#1 fix: there is no canonical → _probe_determinism edge — H1 is canonical-independent (plan r4 tie-break: a missing canonical never masks NON_DETERMINISTIC). §5 is the graph of truth. #4 fix: ε_prod is not measured per verdict; the D0 DAG computes it once, keyed by the prod image SHA, and the verdict job reads it.

4. Modules & responsibilities

Module	Responsibility
`dag_diagnostic__s18_replay_fidelity.py`	orchestration: resolve cells, spawn fresh-process replays, read ε_prod from the store (D0 itself runs in the separate DAG, §7), call `_decide_verdict`, emit events, write artefact. ADR-92 build SHA in `doc_md` + first log line.
`commun/finetune/diagnostic/s18_replay_fidelity.py`	the pure probes + decision (below). Imports `_replay_cell`, `query_canonical_from_pg` from `s18_step0_replay` — does not mutate them (ADR-25).
`s18_step0_replay.py` (existing)	reused: `_replay_cell` (training-path replay), `query_canonical_from_pg`. Touched only to expose `_replay_cell` for fresh-process invocation; `Verdict` kept stable (add, never mutate).

4bis. Split decision (pure, independently testable)

_probe_determinism(replay_a, replay_b) -> DeterminismResult{status: PASS|FAIL|ERROR, det_delta, det_cause}. PASS requires IEEE-754 exact equality on f1_buy. On FAIL, det_cause is numeric_residue when det_delta ≤ ε_num and real_instability when det_delta > ε_num. NaN or a replay error → status=ERROR.
_probe_parity(replay_fingerprint, canonical_provenance) -> ParityResult{state, dims} — P0 over the vector build_sha (ADR-92) + instance_type + feature/artefact versions (ADR-23) + OS/lib versions. Tri-state (#7): state ∈ {equal, differs, unverifiable} — equal (all dims present & match), differs (a dim present & differs → env_parity_gap), unverifiable (a dim not recordable on either side → env_parity_unverified). The boolean is gone: _decide_verdict must distinguish "verifiably different" from "cannot tell" (they route to different verdicts). replay_fingerprint is the env captured inside the replay subprocess and returned with the result (#3), not sampled in the orchestrator pod.
_probe_divergence(replay_a, canonical_f1_buy, epsilon) -> DivergenceResult{canon_delta, within: bool}.
_decide_verdict(determinism, parity, divergence, canonical_present, eps_prod_measured) -> Verdict — implements the frozen rule, decision order H1 → P0 → H2 (#5). No I/O.
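
A minimal sketch of the two numeric probes described above, under stated assumptions: the function and field names mirror the signatures in this section but are illustrative, and `eps_num` is passed explicitly (the §5 graph feeds ε_num into the determinism node, while the signature above does not list it).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeterminismResult:
    status: str                 # "PASS" | "FAIL" | "ERROR"
    det_delta: Optional[float]  # |f1_buy(a) - f1_buy(b)|, None on ERROR
    det_cause: Optional[str]    # set only on FAIL


@dataclass(frozen=True)
class DivergenceResult:
    canon_delta: float
    within: bool


def probe_determinism(f1_a: float, f1_b: float, eps_num: float) -> DeterminismResult:
    """H1: exact equality passes; any difference fails and is classified."""
    if math.isnan(f1_a) or math.isnan(f1_b):
        return DeterminismResult("ERROR", None, None)
    if f1_a == f1_b:  # IEEE-754 exact equality, no tolerance
        return DeterminismResult("PASS", 0.0, None)
    delta = abs(f1_a - f1_b)
    cause = "numeric_residue" if delta <= eps_num else "real_instability"
    return DeterminismResult("FAIL", delta, cause)


def probe_divergence(f1_a: float, canonical_f1_buy: float, epsilon: float) -> DivergenceResult:
    """H2: replay vs stored canonical, within the resolved epsilon."""
    delta = abs(f1_a - canonical_f1_buy)
    return DivergenceResult(delta, delta <= epsilon)
```

Note that a difference below ε_num is still a FAIL here: ε_num only names the cause, it does not relax the exact-equality test.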

P0 canonical-provenance precondition + INFEASIBLE-until-schema gate (#2). P0 can only be evaluated if finetune_results (or a side table) actually persists the canonical run's build_sha, instance_type, feature/artefact versions and lib versions. Given the S12/S13 persistence saga (it took two stories to land feature_hash/model_hyperparams), these dimensions likely do not exist yet — so this is a hard precondition, not a runtime detail: until the canonical persistence carries the P0 vector, P0 is non-evaluable and the work is gated INFEASIBLE-until-schema (mirror of the ε_prod gate) — a schema/persistence change ships first. Absent it, _probe_parity returns unverifiable on every cell → INCONCLUSIVE_TOOLING everywhere (the same dead-end class the exact-equality guard avoids). The plan-side follow-up is a persistence story for the canonical provenance columns.
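
The tri-state and the "unverifiable everywhere" consequence can be sketched as follows. This is an illustration only: the dimension names in `P0_DIMS` are hypothetical, and the precedence when one dimension differs while another is missing is an assumption (here a verifiable difference wins), since the text above does not settle it.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Hypothetical dimension names for the P0 vector.
P0_DIMS = ("build_sha", "instance_type", "feature_version", "lib_versions")


@dataclass(frozen=True)
class ParityResult:
    state: str            # "equal" | "differs" | "unverifiable"
    dims: Dict[str, str]  # per dimension: "equal" | "differ" | "missing"


def probe_parity(
    replay_fingerprint: Mapping[str, Optional[str]],
    canonical_provenance: Mapping[str, Optional[str]],
) -> ParityResult:
    dims: Dict[str, str] = {}
    for dim in P0_DIMS:
        replay_value = replay_fingerprint.get(dim)
        canon_value = canonical_provenance.get(dim)
        if replay_value is None or canon_value is None:
            dims[dim] = "missing"  # not recorded on at least one side
        elif replay_value == canon_value:
            dims[dim] = "equal"
        else:
            dims[dim] = "differ"
    # Assumed precedence: a verifiable difference outranks a missing dimension.
    if "differ" in dims.values():
        state = "differs"
    elif "missing" in dims.values():
        state = "unverifiable"
    else:
        state = "equal"
    return ParityResult(state, dims)
```

With an empty canonical provenance (the pre-schema situation), every dimension is `missing` and the state is `unverifiable` for every cell, which is the dead end the INFEASIBLE-until-schema gate is meant to prevent.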

Each is unit-tested in isolation; _decide_verdict carries the exhaustiveness test (§5bis).

5. Named Hamilton graph (the real architecture)

flowchart LR
  cell[cell] --> ra[replay_a]
  cell --> rb[replay_b]
  ra --> det["determinism (H1)
NO canonical input (#1)"]
  rb --> det
  enum[ε_num] --> det
  epstore["ε_prod (read from store,
key=prod image SHA) (#4)"] --> eps["epsilon = max ε_num, ε_prod"]
  enum --> eps
  ra --> div[divergence]
  canon[canonical_row + provenance] --> div
  eps --> div
  ra -->|env_fingerprint| par["parity_P0 (tri-state: equal/differs/unverifiable) (#7)"]
  canon --> par
  det --> verdict["decide_verdict (order H1→P0→H2)"]
  par --> verdict
  div --> verdict
  cpresent[canonical_present] --> verdict
  epsm[eps_prod_measured] --> verdict
  verdict --> events[loki_events + artefact]

det (H1) has no canonical edge (#1). ε_prod enters as a stored, SHA-keyed read (#4), never a per-verdict measurement. replay_a carries its own env_fingerprint into P0 (#3). parity_P0 is tri-state (#7).

replay_a / replay_b are produced out-of-graph by the orchestration (fresh subprocesses) and fed in as node inputs — the compute graph stays pure (no subprocess inside a node).

5bis. Decision order + exhaustiveness (#5)

_decide_verdict depends on five inputs, not two: the exhaustiveness test covers the true cartesian product H1{PASS,FAIL,ERROR} × P0{equal,differs,unverifiable} × H2{≤ε,>ε} × canonical_present{T,F} × eps_prod_measured{T,F} (3 × 3 × 2 × 2 × 2 = 72 combinations), not a 3×3. The frozen decision order is H1 → P0 → H2 (a divergence under a non-equal P0 is an env artefact, never a logic gap):

if H1 == ERROR:                 -> INCONCLUSIVE_TOOLING (replay_error)
elif H1 == FAIL:                -> NON_DETERMINISTIC (cause = numeric_residue | real_instability)
elif not canonical_present:     -> INCONCLUSIVE_TOOLING (canonical_absent)
elif not eps_prod_measured:     -> INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured)
elif P0 == unverifiable:        -> INCONCLUSIVE_TOOLING (env_parity_unverified)   # P0 BEFORE H2
elif P0 == differs:             -> CANONICAL_DIVERGENCE (cause = env_parity_gap)  # env, not logic
else:  # P0 == equal
    -> FIDELITY_OK              if H2 ≤ ε
    -> CANONICAL_DIVERGENCE     (cause = logic_fidelity_gap) if H2 > ε

This re-axes the plan r4 truth table (Fig 1) with the explicit P0 axis; the test asserts the mapping is total over the 5-way product (each combination → exactly one verdict). H1=FAIL/ERROR short-circuit (P0/H2-independent); P0 is consulted only on the H1-PASS, canonical-present, eps-measured path, and before H2.
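
The rule above translates almost line for line into a pure function, and the exhaustiveness check becomes a loop over the product. The sketch below is illustrative: `decide_verdict`, `Verdict` and `enumerate_truth_table` are hypothetical names, and the H1-FAIL cause is passed in rather than derived.

```python
from itertools import product
from typing import Dict, NamedTuple, Optional, Tuple


class Verdict(NamedTuple):
    verdict: str
    cause: Optional[str] = None
    reason: Optional[str] = None


def decide_verdict(h1: str, det_cause: Optional[str], p0: str, h2_within: bool,
                   canonical_present: bool, eps_prod_measured: bool) -> Verdict:
    """Frozen order H1 -> P0 -> H2; pure, no I/O."""
    if h1 == "ERROR":
        return Verdict("INCONCLUSIVE_TOOLING", reason="replay_error")
    if h1 == "FAIL":
        return Verdict("NON_DETERMINISTIC", cause=det_cause)
    if not canonical_present:
        return Verdict("INCONCLUSIVE_TOOLING", reason="canonical_absent")
    if not eps_prod_measured:
        return Verdict("INCONCLUSIVE_TOOLING", reason="epsilon_prod_unmeasured")
    if p0 == "unverifiable":  # P0 is consulted before H2
        return Verdict("INCONCLUSIVE_TOOLING", reason="env_parity_unverified")
    if p0 == "differs":       # env artefact, not a logic gap
        return Verdict("CANONICAL_DIVERGENCE", cause="env_parity_gap")
    if h2_within:
        return Verdict("FIDELITY_OK")
    return Verdict("CANONICAL_DIVERGENCE", cause="logic_fidelity_gap")


Key = Tuple[str, str, bool, bool, bool]


def enumerate_truth_table() -> Dict[Key, Verdict]:
    """Map every point of the 5-way product to its single verdict."""
    table: Dict[Key, Verdict] = {}
    for h1, p0, within, present, measured in product(
        ("PASS", "FAIL", "ERROR"),
        ("equal", "differs", "unverifiable"),
        (True, False), (True, False), (True, False),
    ):
        cause = "real_instability" if h1 == "FAIL" else None
        table[(h1, p0, within, present, measured)] = decide_verdict(
            h1, cause, p0, within, present, measured)
    return table
```

Because the function is an if/elif chain ending in an unconditional return, totality over the product holds by construction; the enumeration mainly guards against a future edit that breaks the ordering.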

6. Interface contracts (output schemas)

Verdict (artefact JSON + event=s18_replay_verdict):

field	type	notes
`verdict`	enum	`FIDELITY_OK | NON_DETERMINISTIC | CANONICAL_DIVERGENCE | INCONCLUSIVE_TOOLING`
`cause`	enum/null	`numeric_residue | real_instability | logic_fidelity_gap | env_parity_gap | null`
`reason`	enum/null	`replay_error | canonical_absent | epsilon_prod_unmeasured | env_parity_unverified | null`
`cell`	str	`crypto/fold`
`det_delta`	float	`|f1_buy(a) − f1_buy(b)|`
`canon_delta`	float/null	`|f1_buy(a) − canonical|` (null if H2 not reached)
`epsilon`	float	`max(ε_num, ε_prod)` resolved
`ε_num`	float	the frozen numerical threshold (audit)
`ε_prod`	float/null	value used + …
`eps_prod_image_sha`	str/null	the prod image SHA `ε_prod` was measured against (#6 — staleness check)
`parity_state`	enum	`equal | differs | unverifiable` (#7, not a bool)
`parity_dims`	obj	per-dimension equal/differ/missing (audit)
`replay_build_sha`,`replay_instance_type`	str	replay env (ADR-92, from subprocess fingerprint)
`canonical_run_id`	str	the canonical row identity
`canonical_build_sha`,`canonical_instance_type`	str/null	canonical env — the other half of P0 (#6)

Audit-complete (#6): both sides of P0 (replay and canonical provenance), ε_prod and the image SHA it was measured against, and ε_num are all in the artefact — so why P0/H2 decided is reconstructable from the verdict alone.

Loki events (catalogue): s18_replay_determinism status=PASS|FAIL|ERROR abs_delta_runs=… (always emitted) · s18_replay_determinism_cause severity=warn cause=… (on FAIL) · s18_replay_divergence severity=info|warn abs_delta=… parity_state=equal|differs|unverifiable (on the H2 path) · s18_replay_verdict verdict=… cause=… reason=….
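
A small sketch of how the catalogue could be rendered as `event=key=value` lines. Several details are assumptions made for illustration: the helper names, rendering `None` as the literal `null`, and the info/warn choice for the divergence event (the catalogue lists both severities without stating the rule).

```python
from typing import Any, List, Optional


def format_event(event: str, **fields: Any) -> str:
    """Render one `event=<name> key=value ...` line."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        # Assumption: a missing value is rendered as the literal "null".
        parts.append(f"{key}={'null' if value is None else value}")
    return " ".join(parts)


def events_for_cell(det_status: str, det_delta: Optional[float],
                    det_cause: Optional[str], verdict: str,
                    cause: Optional[str], reason: Optional[str],
                    canon_delta: Optional[float] = None,
                    parity_state: Optional[str] = None) -> List[str]:
    """Build the event lines for one cell, in emission order."""
    # The determinism event is always emitted, including on ERROR.
    lines = [format_event("s18_replay_determinism",
                          status=det_status, abs_delta_runs=det_delta)]
    if det_status == "FAIL":
        lines.append(format_event("s18_replay_determinism_cause",
                                  severity="warn", cause=det_cause))
    if canon_delta is not None:  # only when the H2 path was reached
        # Assumed severity rule: info on FIDELITY_OK, warn otherwise.
        severity = "info" if verdict == "FIDELITY_OK" else "warn"
        lines.append(format_event("s18_replay_divergence", severity=severity,
                                  abs_delta=canon_delta, parity_state=parity_state))
    lines.append(format_event("s18_replay_verdict",
                              verdict=verdict, cause=cause, reason=reason))
    return lines
```

Keeping the formatting in a pure helper means the orchestration layer only has to hand the resulting strings to its logger.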

7. Integration points

_replay_cell (training-path replay) — invoked in a fresh subprocess launched with start-method spawn (not fork), so the child gets a fresh interpreter and RNG state, and the spawn-time knobs (PYTHONHASHSEED, thread caps) take effect before any import (§2bis). The worker re-initializes sys.path and dotenv. The target is a Linux KPO pod: spawn is chosen for the determinism guarantee, not as a macOS artefact (#9). The child returns its env_fingerprint alongside f1_buy (the P0 vector observed inside the replay, #3).
query_canonical_from_pg — reads finetune_results out-of-run; must also surface the canonical run's build_sha / instance_type / feature & lib versions + run_id for P0 (ADR-92 / ADR-23). #2 precondition: if those columns are absent (likely, given S12/S13 history), the work is INFEASIBLE-until-schema — the canonical-provenance persistence ships first (§4bis).
PG finetune_results (+ a P0-provenance side table if needed) — canonical source (read-only).
D0 / ε_prod store (#4) — _measure_prod_envelope (N≥10 prod-config, un-capped, one-sided ~95/95 upper tolerance bound) runs in a separate gated DAG at rare cadence, writing ε_prod keyed by the prod image SHA to a store (PG/artefact). The verdict job reads that value (and records the SHA it was measured against, §6); it re-measures only when the prod image SHA changes. This also resolves the resource conflict (un-capped D0 ≠ capped replays in the same pod).
ADR-92 — DAG surfaces its champollion build SHA (doc_md + event=dag_loaded); ADR-92 is also the source of the P0 build_sha dimension.
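
The read side of the ε_prod store can be sketched as below, assuming an in-memory mapping as a stand-in for the real PG/artefact store; `read_eps_prod`, `resolve_epsilon` and `EpsProdRead` are hypothetical names. A missing key is what drives `eps_prod_measured=False` in the decision.

```python
from typing import Mapping, NamedTuple, Optional


class EpsProdRead(NamedTuple):
    eps_prod: Optional[float]
    eps_prod_image_sha: Optional[str]  # recorded in the verdict for staleness checks
    eps_prod_measured: bool


def read_eps_prod(store: Mapping[str, float], prod_image_sha: str) -> EpsProdRead:
    """Look up the stored envelope for the current prod image; never re-measure."""
    value = store.get(prod_image_sha)
    if value is None:
        # No measurement for this image: the verdict job reports the gap
        # (epsilon_prod_unmeasured) instead of falling back to another SHA.
        return EpsProdRead(None, None, False)
    return EpsProdRead(value, prod_image_sha, True)


def resolve_epsilon(eps_num: float, read: EpsProdRead) -> Optional[float]:
    """epsilon = max(eps_num, eps_prod), defined only when eps_prod was measured."""
    if not read.eps_prod_measured or read.eps_prod is None:
        return None
    return max(eps_num, read.eps_prod)
```

Refusing to fall back to a value measured against a different image SHA keeps the lookup consistent with the no-silent-fallback rule.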

8. Observability & failure architecture

No-crash / fail-loud (ADR-25/31): every error path → structured INCONCLUSIVE_TOOLING with a reason; never a UI raise, never print, NaN guarded (→ status=ERROR determinism event + verdict INCONCLUSIVE_TOOLING reason=replay_error).
No observability hole: s18_replay_determinism is emitted on every cell incl. status=ERROR.
Operator-visible cause: the H1-FAIL cause is a severity=warn event (no artefact parsing needed).
Partial-pod failure (D0 / multi-cell): a missing cell → that cell INCONCLUSIVE_TOOLING, the others proceed (no silent pool collapse).
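
The per-cell isolation rule can be illustrated with a short sketch. The names are hypothetical, and mapping any unexpected exception to `reason=replay_error` (plus the extra `error_type` field) is an assumption made for illustration, not something the reason enum in §6 prescribes.

```python
from typing import Callable, Dict, Iterable


def run_cells(cells: Iterable[str],
              run_one: Callable[[str], dict]) -> Dict[str, dict]:
    """Evaluate each cell independently; one failure never collapses the pool."""
    results: Dict[str, dict] = {}
    for cell in cells:
        try:
            results[cell] = run_one(cell)
        except Exception as exc:  # deliberate catch-all at the orchestration boundary
            results[cell] = {
                "verdict": "INCONCLUSIVE_TOOLING",
                "cause": None,
                "reason": "replay_error",            # assumed mapping
                "cell": cell,
                "error_type": type(exc).__name__,    # hypothetical audit field
            }
    return results
```

Every input cell appears in the output, so a failed cell is visible as a structured verdict rather than as a missing row.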

8bis. Deployment & resource model (#8)

Workload	Cadence	Threads	Resource model (KPO pod)
Verdict job (per cell)	on-demand / nightly	capped 1/replay (§3.C)	2 replay subprocesses + pure decide. Each replay ≈ 1 CPU (single-thread, §3.C); peak ≈ 2 CPU if A∥B, ≈ 1 CPU if sequential. Memory ≈ 1× training working-set per concurrent replay.
D0 (ε_prod)	rare (on prod image-SHA change)	un-capped (prod-faithful)	N≥10 prod-config trainings; full prod CPU/mem profile. Separate pod/DAG — must not co-schedule with capped replays (resource-model conflict).

A∥B vs sequential has no determinism impact (both capped under §3.C) — it is purely a sizing/latency choice: parallel halves wall-clock at ≈2× CPU; sequential caps at 1 CPU. Settled at impl per pod budget.

8ter. Sequence view (#8)

sequenceDiagram
  participant DAG as Verdict DAG (orchestration)
  participant PA as replay subprocess A (spawn, capped)
  participant PB as replay subprocess B (spawn, capped)
  participant PG as PG finetune_results (+ provenance)
  participant ST as ε_prod store (key=prod image SHA)
  participant DEC as _decide_verdict (pure)
  DAG->>PA: spawn(env: hashseed+thread-caps)
  DAG->>PB: spawn(env: hashseed+thread-caps)
  PA-->>DAG: f1_buy_a + env_fingerprint
  PB-->>DAG: f1_buy_b
  DAG->>PG: read canonical f1_buy + provenance + run_id
  DAG->>ST: read ε_prod (by prod image SHA)
  DAG->>DEC: determinism, parity(P0), divergence, gates
  DEC-->>DAG: verdict + cause/reason
  DAG->>DAG: emit Loki events + write artefact

9. ADR conformance

ADR	How
ADR-0095	diagnostic-story template (this is artefact 2/5)
ADR-0096	thread-pool cap (the thread-cap part of §3.C's spawn-time half). NB (#10): ADR-0096 covers the cap only; the LightGBM `deterministic`/`force_row_wise` flags and `PYTHONHASHSEED` are required by §3.C in addition (not by ADR-0096)
ADR-23	feature/artefact version pinning enforced inside P0
ADR-25	no silent fallback; all error paths → `INCONCLUSIVE_TOOLING` (shape-checked)
ADR-31	`event=key=value` logs, no `print`
ADR-92	DAG build SHA surfaced (`doc_md` + `event=dag_loaded`); also the P0 parity dimension

Files to create / modify

File	Action
`commun/finetune/diagnostic/s18_replay_fidelity.py`	new — pure probes + `_decide_verdict` + `_measure_prod_envelope`
`s18_step0_replay.py`	touch — expose `_replay_cell` for fresh-process call; `Verdict` stable
`dags/dag_diagnostic__s18_replay_fidelity.py`	new — verdict-job orchestration (or nightly CI per plan §7.1)
`dags/dag_diagnostic__s18_eps_prod.py` (D0)	new — separate gated ε_prod characterization DAG (rare cadence, SHA-keyed store) (#4)
canonical-provenance persistence (migration + writer)	new (precondition, #2) — persist `build_sha`/`instance_type`/feature+lib versions for the canonical run; `INFEASIBLE-until-schema` until shipped
`tests/unit/finetune/diagnostic/test_s18_replay_fidelity.py`	new — 5-way cartesian exhaustiveness (§5bis) + no-crash + H1→P0→H2 ordering
Loki catalogue (doc)	extend — the 4 events

Open items (architecture)

Fresh process (same pod) is the settled choice for H1, and it is justified, not a compromise (#11). The canonical is produced cross-pod, so cross-pod determinism is in principle what _decide_verdict needs; that residual is already covered, because P0 pins instance_type (a microarch delta is caught, not silently tolerated) and ε_prod bounds the prod cross-pod run-to-run spread. Same-pod H1 therefore tests the determinism that is not otherwise bounded (in-image reproducibility), while P0 + ε_prod cover the cross-pod/image axis; a cross-pod H1 would be redundant with them. Confirm at impl.
Job mode (manual DAG vs nightly CI) — plan §7.1 proposes manual DAG; ratify at impl.
D0 placement = separate gated DAG (resolved, #4) — own DAG, rare cadence, ε_prod stored keyed by prod image SHA; the verdict job reads it. (Was "same-pod sub-step vs separate DAG"; settled to separate for the resource-model conflict — committee rec #3.)
Canonical-provenance persistence (#2) — the P0 columns (build_sha, instance_type, feature/lib versions) must land in the canonical persistence first (INFEASIBLE-until-schema); follow-up persistence story to scope.