Architecture (r2 — for review)

How it's built. This is the architecture artefact (ADR-0095) for S09. It realizes the pre-registered decision rule of the plan dossier r4 (committee plan_review PASSED·OK, Meeting 268) — it does not re-decide the rule. Hamilton-native, two layers (Airflow orchestration / pure compute), I/O isolated, decision in pure functions.

Changelog

r1 (2026-06-09): initial architecture, traces plan r4 (cross-process H1, env-parity P0, ε_prod D0, frozen verdict map).

r2 (2026-06-09): architecture referee round 1:

  • #1 §3 component view fixed (removed the false canonical → determinism edge; H1 is canonical-independent, §5 is truth).
  • #2 P0 made a precondition with an INFEASIBLE-until-schema gate (canonical-side provenance may not exist in finetune_results).
  • #3 replay_env fingerprint captured in the subprocess + §3.C split into spawn-time (PYTHONHASHSEED + thread caps) vs worker-time (LightGBM flags).
  • #4 D0/ε_prod separated to its own cadence (computed once, keyed by prod image SHA, stored + read, not per-verdict).
  • #5 exhaustiveness re-axed to the true cartesian product H1×P0×H2×canonical_present×eps_prod_measured (order H1→P0→H2).
  • #6 Verdict schema completed (both-sides provenance + ε values).
  • #7 parity made tri-state {equal, differs, unverifiable}.
  • #8 deployment + sequence views added.
  • #9/#10/#11 wording/ADR-scope/justification fixes.

§0 — Provenance & scope

Realizes plan r4: detect (A) replay non-determinism (cross-process) and (B) canonical divergence, independently of any diagnostic run, and emit a verdict that names the cause — the precondition to the skip_phase_a=true-by-default gate. Reuses the existing _replay_cell (training-path replay) and query_canonical_from_pg from src/commun/finetune/diagnostic/s18_step0_replay.py; it does not re-implement the training path. Orthogonal to the S04 capacity verdict.

1. Context & scope

  • In: a determinism probe (two fresh-process replays under the pinned config §3.C of the plan), a divergence probe (replay vs stored canonical), the env-parity precondition P0, the production reproducibility envelope ε_prod (D0), a pure decision, and the Loki events.
  • Out: fixing a divergence; flipping skip_phase_a; any economic/capacity verdict.
  • Units: the 5 defi_top5 / fold-3 reference cells (cause classification needs the 5; H1 wiring may use 1).

2. Architectural style — Hamilton-native, two layers

| Layer | Responsibility | Purity |
| --- | --- | --- |
| Airflow orchestration (dag_diagnostic__s18_replay_fidelity.py) | resolve cells → spawn the fresh-process replays (H1) and the D0 envelope runs → call the pure decision → emit events / write artefact | side-effecting (I/O, subprocess, PG, Loki) |
| Pure compute (commun.finetune.diagnostic.s18_replay_fidelity) | _probe_determinism, _probe_divergence, _probe_parity, _measure_prod_envelope (D0), _decide_verdict | pure: no I/O, no os.environ writes, no PG |

Rationale: the decision is the auditable core (the frozen rule + truth table Fig 1 of the plan); keeping it a pure function over already-collected results makes the exhaustiveness test trivially checkable and immune to I/O flakiness (Hamilton-native — no adapter wrappers).

2bis. Pinned-config application — spawn-time vs worker-time (#3)

The plan's §3.C config cannot all be set inside the worker — some knobs are read before any worker code runs. The architecture splits them by when they take effect:

| Knob | When read | How applied |
| --- | --- | --- |
| PYTHONHASHSEED | interpreter start | spawn-time: injected into the child's env at process creation (cannot be set after the interpreter is up) |
| OMP_/MKL_/OPENBLAS_NUM_THREADS | first BLAS/OMP import | spawn-time: injected into the child env before numpy/LightGBM import (a worker that imports numpy then sets them is too late) |
| LightGBM deterministic / force_row_wise / seeds / num_threads | at train() | worker-time: passed as training params inside _replay_cell |

Consequence: the determinism subprocess is launched with start-method spawn and an explicit child env carrying the spawn-time knobs (see §7). A fork-ed or env-after-import worker would believe it is pinned without being so — H1 would silently test an un-pinned path.
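The spawn-time half of the split can be illustrated with a minimal sketch. It assumes a plain `subprocess` launch of `sys.executable` rather than the multiprocessing spawn context the architecture names (both give a fresh interpreter); `build_child_env` and `run_in_fresh_interpreter` are hypothetical helper names, not part of the existing code.

```python
import os
import subprocess
import sys


def build_child_env(base_env, hashseed=0, threads=1):
    """Return a copy of base_env carrying the spawn-time knobs (hypothetical helper)."""
    env = dict(base_env)
    # Read once at interpreter start: must already be in the child's env.
    env["PYTHONHASHSEED"] = str(hashseed)
    # Read at the first BLAS/OpenMP import: must be set before numpy/LightGBM load.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        env[var] = str(threads)
    return env


def run_in_fresh_interpreter(code, env):
    """Run `code` in a brand-new interpreter and return its stripped stdout."""
    done = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout.strip()
```

With the same `PYTHONHASHSEED` injected at process creation, two separate children report the same `hash()` of a string. Setting the variable from inside an already-running worker would not change the hashes that worker computes.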

3. Component view

flowchart TD
  subgraph D0DAG["D0 — ε_prod characterization (separate gated DAG, RARE cadence — #4)"]
    M0["measure_prod_envelope<br/>N≥10 prod-config (UN-capped) re-runs"]
    M0 --> EPS["ε_prod = one-sided 95/95 upper tol. bound<br/>stored, KEY = prod image SHA"]
  end
  subgraph ORCH["Verdict job: Airflow orchestration (DAG, side-effecting)"]
    RC[resolve_cells: defi_top5/fold-3]
    SPA["spawn replay proc A (fresh, §3.C-spawn pinned)<br/>returns f1_buy + env_fingerprint"]
    SPB["spawn replay proc B (fresh, §3.C-spawn pinned)"]
    CAN["load canonical row from PG<br/>f1_buy + provenance (build_sha,instance_type,feature/lib vers)"]
    RDE["read ε_prod from store (by prod image SHA)"]
    EMIT[emit Loki events + write verdict artefact]
  end
  subgraph PURE["Pure compute (Hamilton nodes, no I/O)"]
    PD[_probe_determinism]
    PP["_probe_parity P0 (tri-state)"]
    PV[_probe_divergence]
    DV[_decide_verdict]
  end
  EPS -. "consumed by reference (NOT re-measured per verdict)" .-> RDE
  RC --> SPA & SPB & CAN & RDE
  SPA --> PD
  SPB --> PD
  SPA --> PV
  SPA --> PP
  CAN --> PP
  CAN --> PV
  RDE --> PV
  PD --> DV
  PP --> DV
  PV --> DV
  RDE --> DV
  DV --> EMIT

#1 fix: there is no canonical → _probe_determinism edge — H1 is canonical-independent (plan r4 tie-break: a missing canonical never masks NON_DETERMINISTIC). §5 is the graph of truth. #4 fix: ε_prod is not measured per verdict; the D0 DAG computes it once, keyed by the prod image SHA, and the verdict job reads it.

4. Modules & responsibilities

| Module | Responsibility |
| --- | --- |
| dag_diagnostic__s18_replay_fidelity.py | orchestration: resolve cells, spawn fresh-process replays, read the stored ε_prod (D0 runs in its own DAG, #4), call _decide_verdict, emit events, write artefact. ADR-92 build SHA in doc_md + first log line. |
| commun/finetune/diagnostic/s18_replay_fidelity.py | the pure probes + decision (below). Imports _replay_cell, query_canonical_from_pg from s18_step0_replay and does not mutate them (ADR-25). |
| s18_step0_replay.py (existing) | reused: _replay_cell (training-path replay), query_canonical_from_pg. Touched only to expose _replay_cell for fresh-process invocation; Verdict kept stable (add, never mutate). |

4bis. Split decision (pure, independently testable)

  • _probe_determinism(replay_a, replay_b) -> DeterminismResult{status: PASS|FAIL|ERROR, det_delta, det_cause}. PASS requires IEEE-754 exact equality on f1_buy. On FAIL, det_cause ∈ {numeric_residue, real_instability}: numeric_residue if det_delta ≤ ε_num, real_instability if det_delta > ε_num. NaN/ERROR → status=ERROR.
  • _probe_parity(replay_fingerprint, canonical_provenance) -> ParityResult{state, dims}: P0 over the vector build_sha (ADR-92) + instance_type + feature/artefact versions (ADR-23) + OS/lib versions. Tri-state (#7), state ∈ {equal, differs, unverifiable}: equal (all dims present & match), differs (a dim present & differs → env_parity_gap), unverifiable (a dim not recordable on either side → env_parity_unverified). The boolean is gone: _decide_verdict must distinguish "verifiably different" from "cannot tell" (they route to different verdicts). replay_fingerprint is the env captured inside the replay subprocess and returned with the result (#3), not sampled in the orchestrator pod.
  • _probe_divergence(replay_a, canonical_f1_buy, epsilon) -> DivergenceResult{canon_delta, within: bool}.
  • _decide_verdict(determinism, parity, divergence, canonical_present, eps_prod_measured) -> Verdict — implements the frozen rule, decision order H1 → P0 → H2 (#5). No I/O.
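To make the two numeric probes concrete, here is a minimal sketch under stated assumptions: the replay results are reduced to bare f1_buy floats, ε_num is passed in explicitly (the signatures above take replay objects instead), and the dataclass layouts are illustrative rather than the module's actual types.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeterminismResult:
    status: str                  # "PASS" | "FAIL" | "ERROR"
    det_delta: Optional[float]   # |f1_buy(a) - f1_buy(b)|, None on ERROR
    det_cause: Optional[str]     # set only on FAIL


@dataclass(frozen=True)
class DivergenceResult:
    canon_delta: float
    within: bool


def probe_determinism(f1_a, f1_b, eps_num):
    """H1: exact IEEE-754 equality between two fresh-process replays."""
    if f1_a is None or f1_b is None or math.isnan(f1_a) or math.isnan(f1_b):
        return DeterminismResult("ERROR", None, None)
    if f1_a == f1_b:
        return DeterminismResult("PASS", 0.0, None)
    delta = abs(f1_a - f1_b)
    # A FAIL is split by magnitude: tiny residue vs real instability.
    cause = "numeric_residue" if delta <= eps_num else "real_instability"
    return DeterminismResult("FAIL", delta, cause)


def probe_divergence(f1_a, canonical_f1, epsilon):
    """H2: distance between replay A and the stored canonical value."""
    delta = abs(f1_a - canonical_f1)
    return DivergenceResult(delta, delta <= epsilon)
```

Neither function touches the environment, PG, or a subprocess, so each can be unit-tested with literal floats.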

P0 canonical-provenance precondition + INFEASIBLE-until-schema gate (#2). P0 can only be evaluated if finetune_results (or a side table) actually persists the canonical run's build_sha, instance_type, feature/artefact versions and lib versions. Given the S12/S13 persistence saga (it took two stories to land feature_hash/model_hyperparams), these dimensions likely do not exist yet — so this is a hard precondition, not a runtime detail: until the canonical persistence carries the P0 vector, P0 is non-evaluable and the work is gated INFEASIBLE-until-schema (mirror of the ε_prod gate) — a schema/persistence change ships first. Absent it, _probe_parity returns unverifiable on every cell → INCONCLUSIVE_TOOLING everywhere (the same dead-end class the exact-equality guard avoids). The plan-side follow-up is a persistence story for the canonical provenance columns.
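The tri-state behaviour, including the all-unverifiable outcome when the canonical provenance is not persisted, can be sketched as below. The dimension names in `P0_DIMS` are hypothetical placeholders for the P0 vector. The precedence of differs over unverifiable when both occur is an assumption of this sketch, since the text does not settle that case.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical dimension names standing in for the P0 vector.
P0_DIMS = ("build_sha", "instance_type", "feature_version", "lib_versions")


@dataclass(frozen=True)
class ParityResult:
    state: str             # "equal" | "differs" | "unverifiable"
    dims: Dict[str, str]   # per-dimension "equal" | "differ" | "missing"


def probe_parity(replay_fingerprint, canonical_provenance):
    """P0: compare the replay env with the canonical run's recorded env."""
    dims = {}
    for dim in P0_DIMS:
        a = replay_fingerprint.get(dim)
        b = canonical_provenance.get(dim)
        if a is None or b is None:
            dims[dim] = "missing"
        elif a == b:
            dims[dim] = "equal"
        else:
            dims[dim] = "differ"
    # Assumption: a verified difference outranks a missing dimension.
    if "differ" in dims.values():
        state = "differs"
    elif "missing" in dims.values():
        state = "unverifiable"
    else:
        state = "equal"
    return ParityResult(state, dims)
```

Calling it with an empty canonical provenance returns unverifiable for any replay fingerprint, which is the dead end the INFEASIBLE-until-schema gate is meant to prevent.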

Each is unit-tested in isolation; _decide_verdict carries the exhaustiveness test (§5bis).

5. Named Hamilton graph (the real architecture)

flowchart LR
  cell[cell] --> ra[replay_a]
  cell --> rb[replay_b]
  ra --> det["determinism (H1)<br/>NO canonical input (#1)"]
  rb --> det
  enum[ε_num] --> det
  epstore["ε_prod (read from store,<br/>key=prod image SHA) (#4)"] --> eps["epsilon = max ε_num, ε_prod"]
  enum --> eps
  ra --> div[divergence]
  canon[canonical_row + provenance] --> div
  eps --> div
  ra -->|env_fingerprint| par["parity_P0 (tri-state: equal/differs/unverifiable) (#7)"]
  canon --> par
  det --> verdict["decide_verdict (order H1→P0→H2)"]
  par --> verdict
  div --> verdict
  cpresent[canonical_present] --> verdict
  epsm[eps_prod_measured] --> verdict
  verdict --> events[loki_events + artefact]

det (H1) has no canonical edge (#1). ε_prod enters as a stored, SHA-keyed read (#4), never a per-verdict measurement. replay_a carries its own env_fingerprint into P0 (#3). parity_P0 is tri-state (#7).

replay_a / replay_b are produced out-of-graph by the orchestration (fresh subprocesses) and fed in as node inputs — the compute graph stays pure (no subprocess inside a node).

5bis. Decision order + exhaustiveness (#5)

_decide_verdict depends on five inputs, not two — the exhaustiveness test covers the true cartesian product H1{PASS,FAIL,ERROR} × P0{equal,differs,unverifiable} × H2{≤ε,>ε} × canonical_present{T,F} × eps_prod_measured{T,F}, not a 3×3. The frozen decision order is H1 → P0 → H2 (a divergence under a non-equal P0 is an env artefact, never a logic gap):

if H1 == ERROR:                 -> INCONCLUSIVE_TOOLING (replay_error)
elif H1 == FAIL:                -> NON_DETERMINISTIC (cause = numeric_residue | real_instability)
elif not canonical_present:     -> INCONCLUSIVE_TOOLING (canonical_absent)
elif not eps_prod_measured:     -> INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured)
elif P0 == unverifiable:        -> INCONCLUSIVE_TOOLING (env_parity_unverified)   # P0 BEFORE H2
elif P0 == differs:             -> CANONICAL_DIVERGENCE (cause = env_parity_gap)  # env, not logic
else:  # P0 == equal
    -> FIDELITY_OK              if H2 ≤ ε
    -> CANONICAL_DIVERGENCE     (cause = logic_fidelity_gap) if H2 > ε

This re-axes the plan r4 truth table (Fig 1) with the explicit P0 axis; the test asserts the mapping is total over the 5-way product (each combination → exactly one verdict). H1=FAIL/ERROR short-circuit (P0/H2-independent); P0 is consulted only on the H1-PASS, canonical-present, eps-measured path, and before H2.
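The frozen order and the totality claim can be checked mechanically. The sketch below is a direct transcription of the rule above into Python, with the inputs flattened to plain strings and booleans (an assumption; the real function takes the probe result objects) and a hypothetical `enumerate_truth_table` helper that walks the 3×3×2×2×2 = 72 combinations.

```python
from itertools import product

VERDICTS = {
    "FIDELITY_OK",
    "NON_DETERMINISTIC",
    "CANONICAL_DIVERGENCE",
    "INCONCLUSIVE_TOOLING",
}


def decide_verdict(h1, det_cause, p0, h2_within, canonical_present, eps_prod_measured):
    """Return (verdict, cause, reason), following the order H1 -> P0 -> H2."""
    if h1 == "ERROR":
        return ("INCONCLUSIVE_TOOLING", None, "replay_error")
    if h1 == "FAIL":
        return ("NON_DETERMINISTIC", det_cause, None)
    if not canonical_present:
        return ("INCONCLUSIVE_TOOLING", None, "canonical_absent")
    if not eps_prod_measured:
        return ("INCONCLUSIVE_TOOLING", None, "epsilon_prod_unmeasured")
    if p0 == "unverifiable":  # P0 is consulted before H2
        return ("INCONCLUSIVE_TOOLING", None, "env_parity_unverified")
    if p0 == "differs":       # environment artefact, not a logic gap
        return ("CANONICAL_DIVERGENCE", "env_parity_gap", None)
    # Remaining case: P0 == equal, so H2 decides.
    if h2_within:
        return ("FIDELITY_OK", None, None)
    return ("CANONICAL_DIVERGENCE", "logic_fidelity_gap", None)


def enumerate_truth_table():
    """Map every point of the 5-way cartesian product to its verdict."""
    axes = (
        ("PASS", "FAIL", "ERROR"),
        ("equal", "differs", "unverifiable"),
        (True, False),   # H2 within epsilon
        (True, False),   # canonical_present
        (True, False),   # eps_prod_measured
    )
    table = {}
    for h1, p0, h2_within, canon, eps in product(*axes):
        table[(h1, p0, h2_within, canon, eps)] = decide_verdict(
            h1, "real_instability", p0, h2_within, canon, eps
        )
    return table
```

Because the function is a chain of early returns with a final fallthrough, every combination lands on exactly one verdict. In this sketch only one of the 72 combinations yields FIDELITY_OK.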

6. Interface contracts (output schemas)

Verdict (artefact JSON + event=s18_replay_verdict):

| field | type | notes |
| --- | --- | --- |
| verdict | enum | FIDELITY_OK \| NON_DETERMINISTIC \| CANONICAL_DIVERGENCE \| INCONCLUSIVE_TOOLING |
| cause | enum/null | numeric_residue \| real_instability \| logic_fidelity_gap \| env_parity_gap \| null |
| reason | enum/null | replay_error \| canonical_absent \| epsilon_prod_unmeasured \| env_parity_unverified \| null |
| cell | str | crypto/fold |
| det_delta | float | \|f1_buy(a) − f1_buy(b)\| |
| canon_delta | float/null | \|f1_buy(a) − canonical\| (null if H2 not reached) |
| epsilon | float | max(ε_num, ε_prod) resolved |
| ε_num | float | the frozen numerical threshold (audit) |
| ε_prod | float/null | value used + … |
| eps_prod_image_sha | str/null | the prod image SHA ε_prod was measured against (#6, staleness check) |
| parity_state | enum | equal \| differs \| unverifiable (#7, not a bool) |
| parity_dims | obj | per-dimension equal/differ/missing (audit) |
| replay_build_sha, replay_instance_type | str | replay env (ADR-92, from subprocess fingerprint) |
| canonical_run_id | str | the canonical row identity |
| canonical_build_sha, canonical_instance_type | str/null | canonical env, the other half of P0 (#6) |

Audit-complete (#6): both sides of P0 (replay and canonical provenance), ε_prod and the image SHA it was measured against, and ε_num are all in the artefact — so why P0/H2 decided is reconstructable from the verdict alone.

Loki events (catalogue): s18_replay_determinism status=PASS\|FAIL\|ERROR abs_delta_runs=… (always emitted) · s18_replay_determinism_cause severity=warn cause=… (on FAIL) · s18_replay_divergence severity=info\|warn abs_delta=… parity_state=equal\|differs\|unverifiable (on the H2 path) · s18_replay_verdict verdict=… cause=… reason=….
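As an illustration of the catalogue, the sketch below renders the two determinism events as key=value lines. The exact line layout (space-separated pairs, `event=` first, null fields omitted) is an assumption of this sketch, and `format_event` / `determinism_events` are hypothetical helpers; the real emitter would hand these lines to a logger, never to print.

```python
def format_event(event, **fields):
    """Render one structured log line; fields set to None are omitted."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        if value is not None:
            parts.append(f"{key}={value}")
    return " ".join(parts)


def determinism_events(status, det_delta=None, det_cause=None):
    """Lines for one cell: the status event always, the cause event on FAIL only."""
    lines = [
        format_event("s18_replay_determinism", status=status, abs_delta_runs=det_delta)
    ]
    if status == "FAIL":
        lines.append(
            format_event("s18_replay_determinism_cause", severity="warn", cause=det_cause)
        )
    return lines
```

An ERROR status still produces the s18_replay_determinism line, which matches the requirement that the event is emitted on every cell.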

7. Integration points

  • _replay_cell (training-path replay): invoked in a fresh subprocess launched with start-method spawn (not fork), so the child gets a fresh interpreter + RNG state and the spawn-time knobs (PYTHONHASHSEED, thread caps) take effect before any import (§2bis). The worker sets up sys.path + dotenv again on entry. This is a Linux KPO pod: spawn is chosen for the determinism guarantee (#9), not as a macOS artefact. The child returns its env_fingerprint alongside f1_buy (the P0 vector observed inside the replay, #3).
  • query_canonical_from_pg — reads finetune_results out-of-run; must also surface the canonical run's build_sha / instance_type / feature & lib versions + run_id for P0 (ADR-92 / ADR-23). #2 precondition: if those columns are absent (likely, given S12/S13 history), the work is INFEASIBLE-until-schema — the canonical-provenance persistence ships first (§4bis).
  • PG finetune_results (+ a P0-provenance side table if needed) — canonical source (read-only).
  • D0 / ε_prod store (#4): _measure_prod_envelope (N≥10 prod-config, un-capped, one-sided ~95/95 upper tolerance bound) runs in a separate gated DAG at rare cadence, writing ε_prod keyed by the prod image SHA to a store (PG/artefact). The verdict job reads that value (and records the SHA it was measured against, §6); it re-measures only when the prod image SHA changes. This also resolves the resource conflict (un-capped D0 ≠ capped replays in the same pod).
  • ADR-92 — DAG surfaces its champollion build SHA (doc_md + event=dag_loaded); ADR-92 is also the source of the P0 build_sha dimension.
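The SHA-keyed read on the verdict side can be sketched as follows. The store is modelled as a plain mapping from prod image SHA to ε_prod, which is an assumption (the text only says PG/artefact). `read_eps_prod` and `resolve_epsilon` are hypothetical names; the max() resolution is the one shown in the §5 graph.

```python
import math


def read_eps_prod(store, prod_image_sha):
    """Return (eps_prod, eps_prod_measured) for the given prod image SHA.

    A missing or NaN entry means the envelope has not been measured for this
    image, so the verdict job must gate on eps_prod_measured=False.
    """
    value = store.get(prod_image_sha)
    if value is None or math.isnan(value):
        return None, False
    return value, True


def resolve_epsilon(eps_num, eps_prod):
    """epsilon = max(eps_num, eps_prod), as in the Hamilton graph."""
    return max(eps_num, eps_prod)
```

A lookup with a SHA that has no entry returns `(None, False)`, which routes to INCONCLUSIVE_TOOLING (epsilon_prod_unmeasured) rather than silently reusing a value measured against another image.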

8. Observability & failure architecture

  • No-crash / fail-loud (ADR-25/31): every error path → structured INCONCLUSIVE_TOOLING with a reason; never a UI raise, never print, NaN guarded (→ status=ERROR determinism event + verdict INCONCLUSIVE_TOOLING reason=replay_error).
  • No observability hole: s18_replay_determinism is emitted on every cell incl. status=ERROR.
  • Operator-visible cause: the H1-FAIL cause is a severity=warn event (no artefact parsing needed).
  • Partial-pod failure (D0 / multi-cell): a missing cell → that cell INCONCLUSIVE_TOOLING, the others proceed (no silent pool collapse).
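The no-crash and partial-failure rules can be sketched as a per-cell wrapper. `verdicts_for_cells` is a hypothetical helper. Mapping an unexpected exception to reason=replay_error is an assumption of this sketch, because the text does not name a dedicated reason for a missing cell.

```python
import logging

logger = logging.getLogger(__name__)


def verdicts_for_cells(cells, compute_verdict):
    """Compute one verdict per cell; a failing cell never takes the others down."""
    results = {}
    for cell in cells:
        try:
            results[cell] = compute_verdict(cell)
        except Exception as exc:  # fail loud in the logs, never crash the run
            logger.warning(
                "event=s18_replay_verdict verdict=INCONCLUSIVE_TOOLING cell=%s error=%r",
                cell,
                exc,
            )
            results[cell] = {
                "verdict": "INCONCLUSIVE_TOOLING",
                "cause": None,
                "reason": "replay_error",  # assumed reason for this sketch
                "cell": cell,
            }
    return results
```

Every requested cell appears in the result, so a missing replay shows up as an explicit INCONCLUSIVE_TOOLING entry instead of a smaller pool.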

8bis. Deployment & resource model (#8)

| Workload | Cadence | Threads | Resource model (KPO pod) |
| --- | --- | --- | --- |
| Verdict job (per cell) | on-demand / nightly | capped 1/replay (§3.C) | 2 replay subprocesses + pure decide. Each replay ≈ 1 CPU (single-thread, §3.C); peak ≈ 2 CPU if A∥B, ≈ 1 CPU if sequential. Memory ≈ 1× training working-set per concurrent replay. |
| D0 (ε_prod) | rare (on prod image-SHA change) | un-capped (prod-faithful) | N≥10 prod-config trainings; full prod CPU/mem profile. Separate pod/DAG: must not co-schedule with capped replays (resource-model conflict). |

A∥B vs sequential has no determinism impact (both capped under §3.C) — it is purely a sizing/latency choice: parallel halves wall-clock at ≈2× CPU; sequential caps at 1 CPU. Settled at impl per pod budget.

8ter. Sequence view (#8)

sequenceDiagram
  participant DAG as Verdict DAG (orchestration)
  participant PA as replay subprocess A (spawn, capped)
  participant PB as replay subprocess B (spawn, capped)
  participant PG as PG finetune_results (+ provenance)
  participant ST as ε_prod store (key=prod image SHA)
  participant DEC as _decide_verdict (pure)
  DAG->>PA: spawn(env: hashseed+thread-caps)
  DAG->>PB: spawn(env: hashseed+thread-caps)
  PA-->>DAG: f1_buy_a + env_fingerprint
  PB-->>DAG: f1_buy_b
  DAG->>PG: read canonical f1_buy + provenance + run_id
  DAG->>ST: read ε_prod (by prod image SHA)
  DAG->>DEC: determinism, parity(P0), divergence, gates
  DEC-->>DAG: verdict + cause/reason
  DAG->>DAG: emit Loki events + write artefact

9. ADR conformance

| ADR | How |
| --- | --- |
| ADR-0095 | diagnostic-story template (this is artefact 2/5) |
| ADR-0096 | thread-pool cap (the spawn-time half of §3.C). NB (#10): ADR-0096 covers the cap only; the LightGBM deterministic/force_row_wise flags + PYTHONHASHSEED are §3.C in addition (not ADR-0096) |
| ADR-23 | feature/artefact version pinning enforced inside P0 |
| ADR-25 | no silent fallback; all error paths → INCONCLUSIVE_TOOLING (shape-checked) |
| ADR-31 | event=key=value logs, no print |
| ADR-92 | DAG build SHA surfaced (doc_md + event=dag_loaded); also the P0 parity dimension |

Files to create / modify

| File | Action |
| --- | --- |
| commun/finetune/diagnostic/s18_replay_fidelity.py | new: pure probes + _decide_verdict + _measure_prod_envelope |
| s18_step0_replay.py | touch: expose _replay_cell for fresh-process call; Verdict stable |
| dags/dag_diagnostic__s18_replay_fidelity.py | new: verdict-job orchestration (or nightly CI per plan §7.1) |
| dags/dag_diagnostic__s18_eps_prod.py (D0) | new: separate gated ε_prod characterization DAG (rare cadence, SHA-keyed store) (#4) |
| canonical-provenance persistence (migration + writer) | new (precondition, #2): persist build_sha/instance_type/feature+lib versions for the canonical run; INFEASIBLE-until-schema until shipped |
| tests/unit/finetune/diagnostic/test_s18_replay_fidelity.py | new: 5-way cartesian exhaustiveness (§5bis) + no-crash + H1→P0→H2 ordering |
| Loki catalogue (doc) | extend: the 4 events |

Open items (architecture)

  1. Fresh process (same pod) is the settled choice for H1, and it is justified, not a compromise (#11). The canonical is produced cross-pod, so a cross-pod residual does exist, but it is already covered: P0 pins instance_type (a microarch delta is caught, not silently tolerated) and ε_prod bounds the prod cross-pod run-to-run spread. Same-pod H1 therefore tests the one thing not otherwise bounded (in-image reproducibility), while P0 + ε_prod cover the cross-pod/image axis; a cross-pod H1 would be redundant with them. Confirm at impl.
  2. Job mode (manual DAG vs nightly CI) — plan §7.1 proposes manual DAG; ratify at impl.
  3. D0 placement = separate gated DAG (resolved, #4) — own DAG, rare cadence, ε_prod stored keyed by prod image SHA; the verdict job reads it. (Was "same-pod sub-step vs separate DAG"; settled to separate for the resource-model conflict — committee rec #3.)
  4. Canonical-provenance persistence (#2) — the P0 columns (build_sha, instance_type, feature/lib versions) must land in the canonical persistence first (INFEASIBLE-until-schema); follow-up persistence story to scope.