Architecture (r2 — for review)
How it's built. This is the architecture artefact (ADR-0095) for S09. It realizes the pre-registered decision rule of the plan dossier r4 (committee `plan_review` PASSED·OK, Meeting 268); it does not re-decide the rule. Hamilton-native, two layers (Airflow orchestration / pure compute), I/O isolated, decision in pure functions.
Changelog:
- r1 (2026-06-09): initial architecture, traces plan r4 (cross-process H1, env-parity P0, `ε_prod` D0, frozen verdict map).
- r2 (2026-06-09): architecture referee round 1.
  - #1: §3 component view fixed (removed the false `canonical → determinism` edge; H1 is canonical-independent, §5 is truth).
  - #2: P0 made a precondition with an INFEASIBLE-until-schema gate (canonical-side provenance may not exist in `finetune_results`).
  - #3: `replay_env` fingerprint captured in the subprocess + §3.C split into spawn-time (PYTHONHASHSEED + thread caps) vs worker-time (LightGBM flags).
  - #4: D0/`ε_prod` separated to its own cadence (computed once, keyed by prod image SHA, stored + read, not per-verdict).
  - #5: exhaustiveness re-axed to the true cartesian product H1×P0×H2×canonical_present×eps_prod_measured (order H1→P0→H2).
  - #6: `Verdict` schema completed (both-sides provenance + ε values).
  - #7: parity made tri-state `{equal, differs, unverifiable}`.
  - #8: deployment + sequence views added.
  - #9/#10/#11: wording/ADR-scope/justification fixes.
§0. Provenance & scope
Realizes plan r4: detect (A) replay non-determinism (cross-process) and (B) canonical divergence,
independently of any diagnostic run, and emit a verdict that names the cause — the precondition to the
skip_phase_a=true-by-default gate. Reuses the existing _replay_cell (training-path replay) and
query_canonical_from_pg from src/commun/finetune/diagnostic/s18_step0_replay.py; it does not
re-implement the training path. Orthogonal to the S04 capacity verdict.
1. Context & scope
- In: a determinism probe (two fresh-process replays under the pinned config §3.C of the plan), a divergence probe (replay vs stored canonical), the env-parity precondition P0, the production reproducibility envelope `ε_prod` (D0), a pure decision, and the Loki events.
- Out: fixing a divergence; flipping `skip_phase_a`; any economic/capacity verdict.
- Units: the 5 `defi_top5/` fold-3 reference cells (cause classification needs the 5; H1 wiring may use 1).
2. Architectural style: Hamilton-native, two layers
| Layer | Responsibility | Purity |
|---|---|---|
| Airflow orchestration (`dag_diagnostic__s18_replay_fidelity.py`) | resolve cells → spawn the fresh-process replays (H1) and the D0 envelope runs → call the pure decision → emit events / write artefact | side-effecting (I/O, subprocess, PG, Loki) |
| Pure compute (`commun.finetune.diagnostic.s18_replay_fidelity`) | `_probe_determinism`, `_probe_divergence`, `_probe_parity`, `_measure_prod_envelope` (D0), `_decide_verdict` | pure: no I/O, no `os.environ` writes, no PG |
Rationale: the decision is the auditable core (the frozen rule + truth table Fig 1 of the plan); keeping it a pure function over already-collected results makes the exhaustiveness test trivially checkable and immune to I/O flakiness (Hamilton-native — no adapter wrappers).
2bis. Pinned-config application: spawn-time vs worker-time (#3)
The plan's §3.C config cannot all be set inside the worker — some knobs are read before any worker code runs. The architecture splits them by when they take effect:
| Knob | When read | How applied |
|---|---|---|
| `PYTHONHASHSEED` | interpreter start | spawn-time: injected into the child's env at process creation (cannot be set after the interpreter is up) |
| `OMP_`/`MKL_`/`OPENBLAS_NUM_THREADS` | first BLAS/OMP import | spawn-time: injected into the child env before numpy/LightGBM import (a worker that imports numpy then sets them is too late) |
| LightGBM `deterministic` / `force_row_wise` / seeds / `num_threads` | at `train()` | worker-time: passed as training params inside `_replay_cell` |
Consequence: the determinism subprocess is launched with start-method spawn and an explicit child
env carrying the spawn-time knobs (see §7). A fork-ed or env-after-import worker would believe it is
pinned without being so — H1 would silently test an un-pinned path.
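The spawn-time half can be illustrated with a minimal sketch, assuming the orchestrator starts the child through the standard-library `subprocess` module with an explicit `env`. The helper names `build_child_env` and `spawn_probe`, the seed value `0` and the thread cap `1` are illustrative assumptions, and the child here only echoes what it observes; it stands in for the real `_replay_cell` worker.

```python
import json
import os
import subprocess
import sys

# Spawn-time knobs named in the table above; they must already be in the
# child's environment when its interpreter starts.
THREAD_CAP_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def build_child_env(hash_seed: int = 0, threads: int = 1) -> dict:
    """Copy the parent env and pin the spawn-time knobs (values are assumptions)."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(hash_seed)
    for name in THREAD_CAP_VARS:
        env[name] = str(threads)
    return env


# Stand-in for the real worker: it only reports what the fresh interpreter sees.
_CHILD_CODE = (
    "import json, os\n"
    "keys = ['PYTHONHASHSEED', 'OMP_NUM_THREADS', "
    "'MKL_NUM_THREADS', 'OPENBLAS_NUM_THREADS']\n"
    "print(json.dumps({k: os.environ.get(k) for k in keys}))\n"
)


def spawn_probe(env: dict) -> dict:
    """Start a brand-new interpreter with an explicit env; return what it observed."""
    done = subprocess.run(
        [sys.executable, "-c", _CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(done.stdout)
```

Because the values are read back from inside the child, the same round trip doubles as a cheap check that the pinning actually reached the fresh interpreter rather than only the orchestrator.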
3. Component view
flowchart TD
subgraph D0DAG["D0 — ε_prod characterization (separate gated DAG, RARE cadence — #4)"]
M0["measure_prod_envelope
N≥10 prod-config (UN-capped) re-runs"]
M0 --> EPS["ε_prod = one-sided 95/95 upper tol. bound
stored, KEY = prod image SHA"]
end
subgraph ORCH["Verdict job — Airflow orchestration (DAG, side-effecting)"]
RC[resolve_cells: defi_top5/fold-3]
SPA["spawn replay proc A (fresh, §3.C-spawn pinned)
returns f1_buy + env_fingerprint"]
SPB["spawn replay proc B (fresh, §3.C-spawn pinned)"]
CAN[load canonical row from PG
f1_buy + provenance (build_sha,instance_type,feature/lib vers)]
RDE["read ε_prod from store (by prod image SHA)"]
EMIT[emit Loki events + write verdict artefact]
end
subgraph PURE["Pure compute (Hamilton nodes, no I/O)"]
PD[_probe_determinism]
PP[_probe_parity P0 — tri-state]
PV[_probe_divergence]
DV[_decide_verdict]
end
EPS -. "consumed by reference (NOT re-measured per verdict)" .-> RDE
RC --> SPA & SPB & CAN & RDE
SPA --> PD
SPB --> PD
SPA --> PV
SPA --> PP
CAN --> PP
CAN --> PV
RDE --> PV
PD --> DV
PP --> DV
PV --> DV
RDE --> DV
DV --> EMIT
#1 fix: there is no `canonical → _probe_determinism` edge; H1 is canonical-independent (plan r4 tie-break: a missing canonical never masks `NON_DETERMINISTIC`). §5 is the graph of truth.
#4 fix: `ε_prod` is not measured per verdict; the D0 DAG computes it once, keyed by the prod image SHA, and the verdict job reads it.
4. Modules & responsibilities
| Module | Responsibility |
|---|---|
| `dag_diagnostic__s18_replay_fidelity.py` | orchestration: resolve cells, spawn fresh-process replays, read `ε_prod` from the D0 store, call `_decide_verdict`, emit events, write artefact. ADR-92 build SHA in `doc_md` + first log line. |
| `commun/finetune/diagnostic/s18_replay_fidelity.py` | the pure probes + decision (below). Imports `_replay_cell`, `query_canonical_from_pg` from `s18_step0_replay`; does not mutate them (ADR-25). |
| `s18_step0_replay.py` (existing) | reused: `_replay_cell` (training-path replay), `query_canonical_from_pg`. Touched only to expose `_replay_cell` for fresh-process invocation; `Verdict` kept stable (add, never mutate). |
4bis. Split decision (pure, independently testable)
- `_probe_determinism(replay_a, replay_b) -> DeterminismResult{status: PASS|FAIL|ERROR, det_delta, det_cause}`: `det_cause ∈ {numeric_residue, real_instability}` on FAIL (`det_delta ≤ ε_num` vs `>`). IEEE-754 exact equality on `f1_buy`. NaN/ERROR → `status=ERROR`.
- `_probe_parity(replay_fingerprint, canonical_provenance) -> ParityResult{state, dims}`: P0 over the vector `build_sha` (ADR-92) + `instance_type` + feature/artefact versions (ADR-23) + OS/lib versions. Tri-state (#7), `state ∈ {equal, differs, unverifiable}`:
  - `equal`: all dims present & match;
  - `differs`: a dim present & differs → `env_parity_gap`;
  - `unverifiable`: a dim not recordable on either side → `env_parity_unverified`.

  The boolean is gone: `_decide_verdict` must distinguish "verifiably different" from "cannot tell" (they route to different verdicts). `replay_fingerprint` is the env captured inside the replay subprocess and returned with the result (#3), not sampled in the orchestrator pod.
- `_probe_divergence(replay_a, canonical_f1_buy, epsilon) -> DivergenceResult{canon_delta, within: bool}`.
- `_decide_verdict(determinism, parity, divergence, canonical_present, eps_prod_measured) -> Verdict`: implements the frozen rule, decision order H1 → P0 → H2 (#5). No I/O.
P0 canonical-provenance precondition + INFEASIBLE-until-schema gate (#2). P0 can only be evaluated if
finetune_results (or a side table) actually persists the canonical run's build_sha, instance_type,
feature/artefact versions and lib versions. Given the S12/S13 persistence saga (it took two stories to land
feature_hash/model_hyperparams), these dimensions likely do not exist yet — so this is a hard
precondition, not a runtime detail: until the canonical persistence carries the P0 vector, P0 is
non-evaluable and the work is gated INFEASIBLE-until-schema (mirror of the ε_prod gate) — a
schema/persistence change ships first. Absent it, _probe_parity returns unverifiable on every cell
→ INCONCLUSIVE_TOOLING everywhere (the same dead-end class the exact-equality guard avoids). The plan-side
follow-up is a persistence story for the canonical provenance columns.
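The tri-state behaviour, including the all-`unverifiable` outcome when the canonical side carries no provenance, can be sketched as follows. This is a hedged illustration, not the module's code: the dimension names in `P0_DIMS` are placeholders, a missing value is modelled as `None`, and the precedence "a verified difference wins over a missing dimension" is an assumption the text above does not settle.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Dimension names are placeholders for the P0 vector, not the real column names.
P0_DIMS = ("build_sha", "instance_type", "feature_version", "lib_versions")


@dataclass(frozen=True)
class ParityResult:
    state: str             # "equal" | "differs" | "unverifiable"
    dims: Dict[str, str]   # per dimension: "equal" | "differ" | "missing"


def probe_parity(
    replay_fingerprint: Mapping[str, Optional[str]],
    canonical_provenance: Mapping[str, Optional[str]],
) -> ParityResult:
    """Compare the two provenance vectors dimension by dimension (pure, no I/O)."""
    dims: Dict[str, str] = {}
    for dim in P0_DIMS:
        replay_value = replay_fingerprint.get(dim)
        canonical_value = canonical_provenance.get(dim)
        if replay_value is None or canonical_value is None:
            dims[dim] = "missing"      # not recorded on at least one side
        elif replay_value == canonical_value:
            dims[dim] = "equal"
        else:
            dims[dim] = "differ"

    statuses = set(dims.values())
    if "differ" in statuses:
        state = "differs"              # assumption: a verified difference wins
    elif "missing" in statuses:
        state = "unverifiable"
    else:
        state = "equal"
    return ParityResult(state=state, dims=dims)
```

With an empty canonical provenance every dimension comes back `missing`, which is the situation the INFEASIBLE-until-schema gate is meant to prevent.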
Each is unit-tested in isolation; _decide_verdict carries the exhaustiveness test (§5bis).
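As a rough illustration of how small the two numeric probes are, here is a sketch that takes the `f1_buy` values directly rather than full replay objects. The flattened signatures, the use of `None` for a failed replay, and passing `eps_num` as an argument are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeterminismResult:
    status: str                 # "PASS" | "FAIL" | "ERROR"
    det_delta: Optional[float]
    det_cause: Optional[str]    # set only on FAIL


@dataclass(frozen=True)
class DivergenceResult:
    canon_delta: float
    within: bool


def probe_determinism(f1_buy_a: Optional[float], f1_buy_b: Optional[float],
                      eps_num: float) -> DeterminismResult:
    """H1: exact IEEE-754 equality between two fresh-process replays."""
    if f1_buy_a is None or f1_buy_b is None:
        return DeterminismResult("ERROR", None, None)   # a replay failed
    if math.isnan(f1_buy_a) or math.isnan(f1_buy_b):
        return DeterminismResult("ERROR", None, None)   # NaN guard
    if f1_buy_a == f1_buy_b:
        return DeterminismResult("PASS", 0.0, None)
    delta = abs(f1_buy_a - f1_buy_b)
    cause = "numeric_residue" if delta <= eps_num else "real_instability"
    return DeterminismResult("FAIL", delta, cause)


def probe_divergence(f1_buy_a: float, canonical_f1_buy: float,
                     epsilon: float) -> DivergenceResult:
    """H2: distance between the replay and the stored canonical value."""
    canon_delta = abs(f1_buy_a - canonical_f1_buy)
    return DivergenceResult(canon_delta, canon_delta <= epsilon)
```

Note that a sub-threshold difference is still a FAIL here; `eps_num` only selects the cause, it does not relax the equality test.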
5. Named Hamilton graph (the real architecture)
flowchart LR
cell[cell] --> ra[replay_a]
cell --> rb[replay_b]
ra --> det["determinism (H1)
NO canonical input (#1)"]
rb --> det
enum[ε_num] --> det
epstore["ε_prod (read from store,
key=prod image SHA) (#4)"] --> eps["epsilon = max ε_num, ε_prod"]
enum --> eps
ra --> div[divergence]
canon[canonical_row + provenance] --> div
eps --> div
ra -->|env_fingerprint| par["parity_P0 (tri-state: equal/differs/unverifiable) (#7)"]
canon --> par
det --> verdict["decide_verdict (order H1→P0→H2)"]
par --> verdict
div --> verdict
cpresent[canonical_present] --> verdict
epsm[eps_prod_measured] --> verdict
verdict --> events[loki_events + artefact]
`det` (H1) has no `canonical` edge (#1). `ε_prod` enters as a stored, SHA-keyed read (#4), never a per-verdict measurement. `replay_a` carries its own `env_fingerprint` into P0 (#3). `parity_P0` is tri-state (#7).
replay_a / replay_b are produced out-of-graph by the orchestration (fresh subprocesses) and fed in
as node inputs — the compute graph stays pure (no subprocess inside a node).
5bis. Decision order + exhaustiveness (#5)
_decide_verdict depends on five inputs, not two — the exhaustiveness test covers the true cartesian
product H1{PASS,FAIL,ERROR} × P0{equal,differs,unverifiable} × H2{≤ε,>ε} × canonical_present{T,F} ×
eps_prod_measured{T,F}, not a 3×3. The frozen decision order is H1 → P0 → H2 (a divergence under a
non-equal P0 is an env artefact, never a logic gap):
if H1 == ERROR: -> INCONCLUSIVE_TOOLING (replay_error)
elif H1 == FAIL: -> NON_DETERMINISTIC (cause = numeric_residue | real_instability)
elif not canonical_present: -> INCONCLUSIVE_TOOLING (canonical_absent)
elif not eps_prod_measured: -> INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured)
elif P0 == unverifiable: -> INCONCLUSIVE_TOOLING (env_parity_unverified) # P0 BEFORE H2
elif P0 == differs: -> CANONICAL_DIVERGENCE (cause = env_parity_gap) # env, not logic
else: # P0 == equal
-> FIDELITY_OK if H2 ≤ ε
-> CANONICAL_DIVERGENCE (cause = logic_fidelity_gap) if H2 > ε
This re-axes the plan r4 truth table (Fig 1) with the explicit P0 axis; the test asserts the mapping is total over the 5-way product (each combination → exactly one verdict). H1=FAIL/ERROR short-circuit (P0/H2-independent); P0 is consulted only on the H1-PASS, canonical-present, eps-measured path, and before H2.
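The decision order and the totality check translate almost line for line into Python. The sketch below is an illustration under assumptions: it takes flattened primitives (`h1`, `p0`, `h2_within`) instead of the result objects in the `_decide_verdict` signature, and the `Verdict` tuple is a reduced stand-in for the full schema.

```python
from itertools import product
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    verdict: str
    cause: Optional[str] = None
    reason: Optional[str] = None


def decide_verdict(h1: str, p0: str, h2_within: bool,
                   canonical_present: bool, eps_prod_measured: bool,
                   h1_cause: Optional[str] = None) -> Verdict:
    """Frozen rule, evaluated strictly in the order H1 -> P0 -> H2. No I/O."""
    if h1 == "ERROR":
        return Verdict("INCONCLUSIVE_TOOLING", reason="replay_error")
    if h1 == "FAIL":
        return Verdict("NON_DETERMINISTIC", cause=h1_cause)
    if not canonical_present:
        return Verdict("INCONCLUSIVE_TOOLING", reason="canonical_absent")
    if not eps_prod_measured:
        return Verdict("INCONCLUSIVE_TOOLING", reason="epsilon_prod_unmeasured")
    if p0 == "unverifiable":                      # P0 is consulted before H2
        return Verdict("INCONCLUSIVE_TOOLING", reason="env_parity_unverified")
    if p0 == "differs":                           # env artefact, not a logic gap
        return Verdict("CANONICAL_DIVERGENCE", cause="env_parity_gap")
    if h2_within:                                 # from here on p0 == "equal"
        return Verdict("FIDELITY_OK")
    return Verdict("CANONICAL_DIVERGENCE", cause="logic_fidelity_gap")


# The five axes of the cartesian product, in decide_verdict argument order.
AXES = (
    ("PASS", "FAIL", "ERROR"),              # H1
    ("equal", "differs", "unverifiable"),   # P0
    (True, False),                          # H2 within epsilon
    (True, False),                          # canonical_present
    (True, False),                          # eps_prod_measured
)

# Every combination maps to exactly one verdict (3 * 3 * 2 * 2 * 2 = 72 rows).
TRUTH_TABLE = {combo: decide_verdict(*combo) for combo in product(*AXES)}
```

Of the 72 combinations exactly one yields `FIDELITY_OK` (H1 PASS, P0 equal, H2 within ε, canonical present, ε_prod measured), which makes the table a useful regression fixture if the order of the branches is ever touched.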
6. Interface contracts (output schemas)
`Verdict` (artefact JSON + `event=s18_replay_verdict`):
| field | type | notes |
|---|---|---|
| `verdict` | enum | FIDELITY_OK \| NON_DETERMINISTIC \| CANONICAL_DIVERGENCE \| INCONCLUSIVE_TOOLING |
| `cause` | enum/null | numeric_residue \| real_instability \| logic_fidelity_gap \| env_parity_gap \| null |
| `reason` | enum/null | replay_error \| canonical_absent \| epsilon_prod_unmeasured \| env_parity_unverified \| null |
| `cell` | str | crypto/fold |
| `det_delta` | float | \|f1_buy(a) − f1_buy(b)\| |
| `canon_delta` | float/null | \|f1_buy(a) − canonical\| (null if H2 not reached) |
| `epsilon` | float | max(ε_num, ε_prod) resolved |
| `ε_num` | float | the frozen numerical threshold (audit) |
| `ε_prod` | float/null | value used + … |
| `eps_prod_image_sha` | str/null | the prod image SHA `ε_prod` was measured against (#6, staleness check) |
| `parity_state` | enum | equal \| differs \| unverifiable (#7, not a bool) |
| `parity_dims` | obj | per-dimension equal/differ/missing (audit) |
| `replay_build_sha`, `replay_instance_type` | str | replay env (ADR-92, from subprocess fingerprint) |
| `canonical_run_id` | str | the canonical row identity |
| `canonical_build_sha`, `canonical_instance_type` | str/null | canonical env, the other half of P0 (#6) |
Audit-complete (#6): both sides of P0 (replay and canonical provenance), `ε_prod` and the image SHA it was measured against, and `ε_num` are all in the artefact, so the reason P0 and H2 decided as they did can be reconstructed from the verdict alone.
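One way to keep the artefact shape honest is a frozen dataclass that serializes to the field names of the table. This is a sketch under assumptions: the class name `VerdictRecord` is hypothetical, the attributes `eps_num` / `eps_prod` are ASCII stand-ins that are renamed to `ε_num` / `ε_prod` on output, and `ε_prod` is modelled as a plain value because the table leaves the rest of that field unspecified.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class VerdictRecord:
    # Always-present fields.
    verdict: str
    cell: str
    det_delta: float
    epsilon: float
    eps_num: float
    parity_state: str
    canonical_run_id: str
    replay_build_sha: str
    replay_instance_type: str
    # Nullable fields, mirroring the "/null" types in the table.
    cause: Optional[str] = None
    reason: Optional[str] = None
    canon_delta: Optional[float] = None
    eps_prod: Optional[float] = None
    eps_prod_image_sha: Optional[str] = None
    canonical_build_sha: Optional[str] = None
    canonical_instance_type: Optional[str] = None
    parity_dims: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize with the schema's field names (ASCII attrs -> ε_ keys)."""
        record = asdict(self)
        record["ε_num"] = record.pop("eps_num")
        record["ε_prod"] = record.pop("eps_prod")
        return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Keeping nullable fields explicit (serialized as `null` rather than dropped) means a consumer can tell "H2 not reached" apart from "field forgotten".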
Loki events (catalogue):
- `s18_replay_determinism status=PASS|FAIL|ERROR abs_delta_runs=…` (always emitted)
- `s18_replay_determinism_cause severity=warn cause=…` (on FAIL)
- `s18_replay_divergence severity=info|warn abs_delta=… parity_state=equal|differs|unverifiable` (on the H2 path)
- `s18_replay_verdict verdict=… cause=… reason=…`
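A small formatter is enough to produce these `key=value` lines through the standard `logging` module. The sketch is illustrative: the `event=` prefix follows the ADR-31 convention mentioned in this document, while the logger name, the helper names, and rendering `None` as `null` are assumptions.

```python
import logging

log = logging.getLogger("s18_replay_fidelity")  # logger name is an assumption


def format_event(event: str, **fields: object) -> str:
    """Render one structured log line: event=<name> key=value key=value ..."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        # Assumption: absent values are rendered as "null" rather than omitted.
        parts.append(f"{key}={'null' if value is None else value}")
    return " ".join(parts)


def emit_determinism(status: str, abs_delta_runs: float) -> str:
    """Emit the always-present H1 event and return the line for inspection."""
    line = format_event(
        "s18_replay_determinism", status=status, abs_delta_runs=abs_delta_runs
    )
    log.info(line)
    return line
```

Returning the rendered line keeps the helper easy to unit-test without capturing log output.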
7. Integration points
- `_replay_cell` (training-path replay): invoked in a fresh subprocess launched with start-method `spawn` (not `fork`) so the child gets a fresh interpreter + RNG state and the spawn-time knobs (`PYTHONHASHSEED`, thread caps) take effect before any import (§2bis); the worker sets up `sys.path` + dotenv again (this is a Linux KPO pod; `spawn` is chosen for the determinism guarantee, #9, not as a macOS artefact). The child returns its `env_fingerprint` alongside `f1_buy` (the P0 vector observed inside the replay, #3).
- `query_canonical_from_pg`: reads `finetune_results` out-of-run; must also surface the canonical run's `build_sha` / `instance_type` / feature & lib versions + `run_id` for P0 (ADR-92 / ADR-23). #2 precondition: if those columns are absent (likely, given S12/S13 history), the work is `INFEASIBLE-until-schema`; the canonical-provenance persistence ships first (§4bis).
- PG `finetune_results` (+ a P0-provenance side table if needed): canonical source (read-only).
- D0 / `ε_prod` store (#4): `_measure_prod_envelope` (N≥10 prod-config, un-capped, one-sided ~95/95 upper tolerance bound) runs in a separate gated DAG at rare cadence, writing `ε_prod` keyed by the prod image SHA to a store (PG/artefact). The verdict job reads that value (and records the SHA it was measured against, §6); D0 is re-run only when the prod image SHA changes. This also resolves the resource conflict (un-capped D0 ≠ capped replays in the same pod).
- ADR-92: DAG surfaces its champollion build SHA (`doc_md` + `event=dag_loaded`); ADR-92 is also the source of the P0 `build_sha` dimension.
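The SHA-keyed read of the `ε_prod` store can be sketched as a pure lookup that also produces the `eps_prod_measured` gate. This is an illustration under assumptions: the store is modelled as an in-memory mapping, `resolve_epsilon` and `EpsilonResolution` are hypothetical names, and leaving `epsilon` unset when no measurement exists for the current image is a choice made for the sketch (it avoids silently falling back to `ε_num`).

```python
from typing import Mapping, NamedTuple, Optional


class EpsilonResolution(NamedTuple):
    eps_prod: Optional[float]
    eps_prod_image_sha: Optional[str]
    eps_prod_measured: bool
    epsilon: Optional[float]


def resolve_epsilon(store: Mapping[str, float], prod_image_sha: str,
                    eps_num: float) -> EpsilonResolution:
    """Look up the stored envelope for this prod image; never measure here."""
    eps_prod = store.get(prod_image_sha)
    if eps_prod is None:
        # No envelope for this image yet (e.g. the SHA changed since the last
        # D0 run): report it as unmeasured and leave epsilon unresolved.
        return EpsilonResolution(None, None, False, None)
    return EpsilonResolution(
        eps_prod=eps_prod,
        eps_prod_image_sha=prod_image_sha,
        eps_prod_measured=True,
        epsilon=max(eps_num, eps_prod),
    )
```

A stale entry measured against an older image is simply a miss under this keying, which routes the verdict to `epsilon_prod_unmeasured` instead of reusing an envelope that no longer describes production.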
8. Observability & failure architecture
- No-crash / fail-loud (ADR-25/31): every error path → structured `INCONCLUSIVE_TOOLING` with a `reason`; never a UI raise, never `print`, NaN guarded (→ `status=ERROR` determinism event + verdict `INCONCLUSIVE_TOOLING reason=replay_error`).
- No observability hole: `s18_replay_determinism` is emitted on every cell incl. `status=ERROR`.
- Operator-visible cause: the H1-FAIL cause is a `severity=warn` event (no artefact parsing needed).
- Partial-pod failure (D0 / multi-cell): a missing cell → that cell `INCONCLUSIVE_TOOLING`, the others proceed (no silent pool collapse).
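The per-cell isolation described in the list above can be sketched as a loop that converts any failure into a structured verdict for that cell only. The helper names, the dict-shaped verdict, the cell identifiers, and the mapping of every exception to `reason=replay_error` are assumptions for illustration; the real routing between the documented reasons may be finer-grained.

```python
import logging
from typing import Callable, Dict, Iterable

log = logging.getLogger("s18_replay_fidelity")  # logger name is an assumption


def inconclusive(cell: str, reason: str) -> dict:
    """Structured stand-in verdict for a cell that could not be evaluated."""
    return {
        "verdict": "INCONCLUSIVE_TOOLING",
        "cause": None,
        "reason": reason,
        "cell": cell,
    }


def run_cells(cells: Iterable[str],
              run_one: Callable[[str], dict]) -> Dict[str, dict]:
    """Evaluate each cell independently; one failure never stops the others."""
    verdicts: Dict[str, dict] = {}
    for cell in cells:
        try:
            verdicts[cell] = run_one(cell)
        except Exception as exc:  # deliberate catch-all: fail loud, not crash
            log.warning(
                "event=s18_replay_verdict verdict=INCONCLUSIVE_TOOLING "
                "reason=replay_error cell=%s error=%r", cell, exc,
            )
            verdicts[cell] = inconclusive(cell, "replay_error")
    return verdicts
```

Every input cell ends up with exactly one entry in the result, so a downstream consumer can detect a shrinking pool by comparing key sets.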
8bis. Deployment & resource model (#8)
| Workload | Cadence | Threads | Resource model (KPO pod) |
|---|---|---|---|
| Verdict job (per cell) | on-demand / nightly | capped 1/replay (§3.C) | 2 replay subprocesses + pure decide. Each replay ≈ 1 CPU (single-thread, §3.C); peak ≈ 2 CPU if A∥B, ≈ 1 CPU if sequential. Memory ≈ 1× training working-set per concurrent replay. |
| D0 (ε_prod) | rare (on prod image-SHA change) | un-capped (prod-faithful) | N≥10 prod-config trainings; full prod CPU/mem profile. Separate pod/DAG — must not co-schedule with capped replays (resource-model conflict). |
A∥B vs sequential has no determinism impact (both capped under §3.C) — it is purely a sizing/latency choice: parallel halves wall-clock at ≈2× CPU; sequential caps at 1 CPU. Settled at impl per pod budget.
8ter. Sequence view (#8)
sequenceDiagram
participant DAG as Verdict DAG (orchestration)
participant PA as replay subprocess A (spawn, capped)
participant PB as replay subprocess B (spawn, capped)
participant PG as PG finetune_results (+ provenance)
participant ST as ε_prod store (key=prod image SHA)
participant DEC as _decide_verdict (pure)
DAG->>PA: spawn(env: hashseed+thread-caps)
DAG->>PB: spawn(env: hashseed+thread-caps)
PA-->>DAG: f1_buy_a + env_fingerprint
PB-->>DAG: f1_buy_b
DAG->>PG: read canonical f1_buy + provenance + run_id
DAG->>ST: read ε_prod (by prod image SHA)
DAG->>DEC: determinism, parity(P0), divergence, gates
DEC-->>DAG: verdict + cause/reason
DAG->>DAG: emit Loki events + write artefact
9. ADR conformance
| ADR | How |
|---|---|
| ADR-0095 | diagnostic-story template (this is artefact 2/5) |
| ADR-0096 | thread-pool cap (the spawn-time half of §3.C). NB (#10): ADR-0096 covers the cap only — the LightGBM deterministic/force_row_wise flags + PYTHONHASHSEED are §3.C in addition (not ADR-0096) |
| ADR-23 | feature/artefact version pinning enforced inside P0 |
| ADR-25 | no silent fallback; all error paths → INCONCLUSIVE_TOOLING (shape-checked) |
| ADR-31 | event=key=value logs, no print |
| ADR-92 | DAG build SHA surfaced (doc_md + event=dag_loaded); also the P0 parity dimension |
Files to create / modify
| File | Action |
|---|---|
| `commun/finetune/diagnostic/s18_replay_fidelity.py` | new: pure probes + `_decide_verdict` + `_measure_prod_envelope` |
| `s18_step0_replay.py` | touch: expose `_replay_cell` for fresh-process call; `Verdict` stable |
| `dags/dag_diagnostic__s18_replay_fidelity.py` | new: verdict-job orchestration (or nightly CI per plan §7.1) |
| `dags/dag_diagnostic__s18_eps_prod.py` (D0) | new: separate gated `ε_prod` characterization DAG (rare cadence, SHA-keyed store) (#4) |
| canonical-provenance persistence (migration + writer) | new (precondition, #2): persist `build_sha`/`instance_type`/feature+lib versions for the canonical run; INFEASIBLE-until-schema until shipped |
| `tests/unit/finetune/diagnostic/test_s18_replay_fidelity.py` | new: 5-way cartesian exhaustiveness (§5bis) + no-crash + H1→P0→H2 ordering |
| Loki catalogue (doc) | extend: the 4 events |
Open items (architecture)
- Fresh process (same pod) is the settled choice for H1, and it is justified, not a compromise (#11): the canonical itself is produced cross-pod, so cross-pod determinism is what `_decide_verdict` needs, but that residual is already covered: P0 pins `instance_type` (so a microarch delta is caught, not silently tolerated) and `ε_prod` bounds the prod cross-pod run-to-run spread. Same-pod H1 thus tests the determinism that is not otherwise bounded (in-image reproducibility), while P0 + `ε_prod` cover the cross-pod/image axis. Cross-pod H1 would be redundant with P0 + `ε_prod`. Confirm at impl.
- Job mode (manual DAG vs nightly CI): plan §7.1 proposes manual DAG; ratify at impl.
- D0 placement = separate gated DAG (resolved, #4): own DAG, rare cadence, `ε_prod` stored keyed by prod image SHA; the verdict job reads it. (Was "same-pod sub-step vs separate DAG"; settled to separate for the resource-model conflict, committee rec #3.)
- Canonical-provenance persistence (#2): the P0 columns (`build_sha`, `instance_type`, feature/lib versions) must land in the canonical persistence first (`INFEASIBLE-until-schema`); follow-up persistence story to scope.