
Design — CVN-N001-EI-S07 Lever #1 : the captured fold as a reproducibility artifact (managed via the cache infra)

Story: CVN-N001-EI-S07 (wp#232 · GH #1071 · Epic CVN-N001-EI (#1055))
Status: Part A (Lever #1): committee-validated (ca64c4d8) + operator-waived drift delta; Parts B/C (Levers #2/#3): DRAFT v0.1 (design only)
Date: 2026-05-26
Scope: the 3-lever diagnostic-harness-perf plan. Part A (§1–§17) = Lever #1 (captured-fold reproducibility artifact, committee-validated). Part B = Lever #2 (optional run_s22a1 anchor). Part C = Lever #3 (dataset-only capture; diagnostic-phase only). Parts B/C are appended as separate chapters (added 2026-05-26; design-only, lighter maturity than Part A).
Relates to: plan dossier documentation/reviews/2026-05-26-cvn-n001-ei-s07-diagnostic-harness-perf-plan.md (plan_review 9e590907 PASSED). Materially extends lever #1 → plan-dossier addendum + committee re-confirmation (§16).

Review history:
- 2026-05-27: Gate-3 knee raised 25 → 30 MB (5 MB/fold safety buffer). The validation runs (documentation/design/CVN-N001-EI-S07-validation-runs-2026-05-27.md) measured ~21–24 MB/fold (4–8× the §9 parametric estimate; AAVEUSDC at 24.37 MB hugged the old 25 MB knee). Per-fold knee bumped to 30 MB so Gate 3 doesn't start at the limit; audit budget unchanged (≤ 1 GB ⇒ ~30 folds × 30 MB ≈ 900 MB). It's a judgment knee, not a hard cliff. §9 compression/dedup still flagged for revisit.
- v1: initial.
- v2: reframe as reproducibility artifact; concurrency; storage economics; 3-layer version; semantic-not-byte parity.
- v3: folds in the read-only finding that resolves the central open question (Q2): the frozen cell does NOT pin the input data, it can drift (§2bis). This corrects the cache key (§4), the parity claim (§6), and the retention model (§9/§10), and closes all v2 open questions into fixed specs (§13). Adds S3 retry-before-cold (§8) and a concrete integration runbook (§14).
- v3.1: adds a consolidated Risks & Mitigations table (§15bis), version-conflict + backward-compat tests (§14), operator-facing impact of the DAG param rename (§11), and post-deploy validation steps (§17). (Most other round-4 review items were already addressed in v2/v3: open questions closed in §13, error handling in §8, concurrency in §7.)
- v3.2: folds in the round-5 approval-with-required-changes: a versioned input_data_sha canonicalization contract + INPUT_DATA_HASH_VERSION (§4a, D7); realistic perf split (component artifact-load < 10 s vs full warm path < 2 min, §14); get_feature_store lookup cost bounded + metered (§4a, §8); concrete renewable lease w/ heartbeat (§7); force_recapture = regenerate-and-compare not blind overwrite + a repro-divergence protocol (§6); storage-size approval gate (§9); softened cross-version hash claim (§14).
- 2026-05-26: added Part B (Lever #2, optional run_s22a1 anchor) + Part C (Lever #3) as separate chapters; doc now covers all three S07 levers (Part A = Lever #1).
- 2026-05-26 (round-11): Part B: B7 reframed from "secondary finding" to the value gate (quantify run_s22a1 p50/p95 before wiring; 7 s vs 865 s decides if the skip is worth the certification footgun); SKIPPED must propagate to all verdict consumers (B6bis); skip's drift-safety depends on Part-A integrity being deployed (B5). Part C: Gate B doubled with a code-level structural argument + periodic prod invariance assertion (empirical ≠ structural: the catastrophic-silent risk); new Gate C prep/train-seam (does dataset_only target even exist?); registry given a runtime enforcement guard (§C0); #1↔#3 reframed as inversely correlated (priority branches on Gate-1 reuse). Part A: D11 waiver gets an explicit reopen-on-Gate-2 condition (a correctness call must not stand on a cost waiver).
- 2026-05-26 (round-10): Part B: certification-level downgrade made explicit (s22a1_crossref_status=SKIPPED, §B6bis), audit-type skip rule (§B4), INCONCLUSIVE_TOOLING as sole fallback + 5 tests (§B6). Part C: reframed to dataset-only (single-fit demoted to fallback), two blocking diagnostic gates added: consumer-classification registry (§C0, shared with Lever #2) + trial-invariance proof (§C4); implementation NOT approved, diagnostic phase only (§C9).
- v4.4: round-9 corrections to the gates: (1) drift-rate promoted to a first-class entry gate (Gate 2). Model B made silent staleness more critical, not less; it reopens the --check-drift opt-in/default-on/auto-age decision (D11, blocking for committee re-confirmation), §6 wording withdrawn. (2) Gate 1 gains an induced-demand clause (low historical reuse may reflect the 2h48 cost barrier the lever removes → < 5 % is a discussion trigger, not auto-no-go). (3) Gate 3 (size) piggybacks the next cold audit (zero marginal compute) instead of dedicated captures. (4) Gate 4 merges reproducibility + warm-perf + a single-flight-under-contention assertion into one cold→warm cycle (no double cold cost; the risky §7 PG lock now has a release gate). Hard cliffs (<2min/<10s) vs judgment knees (20%/25MB) made explicit; Gate1→TTL→Gate3 dependency noted.
- v4.3: round-8: reworked §9bis into 4 pragmatic binary gates: 2 entry (Gate 1 value/reuse, the real entry question, replacing the confusing A/B framing; Gate 2 artifact size) + 2 release (Gate 3 reproducibility, Gate 4 warm-perf < 2 min in-cluster), each with concrete thresholds (≥20%/5% reuse, ≤25 MB/≤1 GB/≤10 s, content+verdict parity, <2 min). Live-drift demoted to a secondary TTL input. §9 size/TTL now defer to §9bis.
- v4.2: added §9bis: a self-contained specification of the two pre-implementation entry gates (superseded by v4.3's 4-gate model).
- v4.1: round-7 blocking fixes: (§7) split the PG advisory lock (brief, session-level, committed immediately) from the long generation guarded by claim-row+heartbeat, because pg_advisory_xact_lock + a 5-min heartbeat would starve the connection pool; (§4a) clarified the Model-B pin = the Phase-A output captured_data decreed immutable for cell_ref, NOT a second copy of the raw Feast input (input_data_sha is kept only as a drift reference).
- v4: folds in the round-6 value-risk findings. (1) Hit-rate is now the existential gate: §2bis (live re-derivation) is confirmed in code (folds re-slice the live store, ETL re-pulls on a Redis TTL, no pinned snapshot), so a live-derived input_data_sha would miss on every regen. → Model B (pinned content-addressed snapshot) is now the recommended core (§4a), keyed on cell_ref, reusing the pin instead of re-calling get_feature_store on the warm path; drift becomes an explicit check, not a silent bust. Precedent: feature-selection _train_window_fingerprint. (2) New §2ter: what the audit certifies (inter-re-audit consistency from the pinned first capture, NOT fidelity to the original FTF data; promoted from D6). (3) §7 → PostgreSQL advisory lock + claim-row (not Redis): pg_advisory_xact_lock is already in-use in the cache index; no new infra. (4) capture_meta bulky JSON → S3 artifact, not the PG index row (the manager already routes large payloads to S3). (5) Hit-rate measurement gate added at §9, equal billing to the size gate.


1. Context & problem

The s40 audit (S02) took 2h48 wall on defi_top5 (5 cells). Loki attributes the cost entirely to the inherited Phase A capture (~10.5 CPU-h, 255 trials): serialize_capture() (src/commun/finetune/diagnostic/s18_step1_capture.py:669) writes the fold parquet to /tmp/s18-diagnostic/, which is ephemeral per pod, so every run re-replays the full original FTF training. The skip_phase_a=True path (dags/dag_diagnostic__s18_step1_4_chain.py:262) expects a pre-existing local parquet (RuntimeError if missing), so it is unusable across runs.

Goal: a re-audit of an already-captured (crypto, fold) runs in < 2 min (vs ~2h48 cold), same verdict (perf-only).

2. Central reframe — reproducibility artifact, not a plain cache entity

A plain cache entity is evictable because cheaply recomputable. The captured fold is the opposite: ~10 CPU-h to regenerate, audit-sensitive (evidentiary basis of a verdict), large, and — per §2bis — not reliably reproducible later. So it is an immutable reproducibility snapshot implemented through the cache infrastructure: mechanism is cache-like (PG index, MLflow/S3 store, content-addressing, observability, lookup-before-work); lifecycle is artifact-grade (immutable, policy-gated retention, audit-pin). The cache is the mechanism; S3 is only the L3 backend.

2bis. Decisive finding (resolves v2-Q2) — the cell does NOT pin the input data

Read-only investigation (step (a)) established that a frozen finetune_results cell does not pin the exact input X/y:

  • the cell is keyed by run_id + crypto + fold_id + variant only (src/commun/finetune/diagnostic/s18_step0_replay.py:182);
  • replay reloads data via cache.get_feature_store(crypto, timeframe) (src/commun/finetune/diagnostic/s18_step0_replay.py:275), and get_feature_store keys on data_date = datetime.now() (src/commun/cache/cvntrade_cache_interface.py:210) — a live, mutable source (vendor backfill, ETL/Feast regen);
  • finetune_results stores no input-data hash: feature_hash exists but defaults to "unknown" (src/commun/finetune/persistence.py:113) and is never set.

Consequence: a cold re-capture of the same cell_ref later can yield a different fold. Therefore the lookup key must bind to the input-data content (§4) — otherwise the cache silently serves a stale fold on data drift (ADR-25 violation). This single fact drives the rest of v3.

Goals / Non-goals

Goals: persist immutably across runs, transparent warm reuse, verdict parity bounded by the data-content hash, fail-safe + no silent loss of the warm path. Non-goals: lever #2/#3; a general experiment-tracking system.

2ter. What this audit actually certifies (promoted from D6 — equal billing to §2bis)

A consequence of §2bis that must be explicit, not buried: the original FTF ran on data D_ftf which is unrecoverable (feature_hash="unknown"). The first diagnostic capture loads D_capture1 = get_feature_store(now₁), which — by §2bis — may already differ from D_ftf. The pin (§4) freezes D_capture1, not D_ftf.

Therefore the audit certifies inter-re-audit consistency from the first pinned capture — NOT fidelity to the original FTF run. Even a cold capture audits against D_capture1, which may already diverge from D_ftf. The "perf-only, same verdict" guarantee is relative to the first diagnostic capture's verdict, not the original FTF verdict the audit is nominally about.

This is a pre-existing reality (it predates this design — feature_hash was never populated), but the "reproducibility primitive" framing would mask it. Committee decision required: accept that this audit certifies re-audit consistency, and that fidelity to the original FTF requires the upstream D6 fix (finetune_results.feature_hash populated at FTF time). Until D6, no one should read a warm verdict as validating the original FTF run.

3. The artifact entity — DIAGNOSTIC_CAPTURE

src/commun/cache/cvntrade_cache_entities.py:15 — enum + dataclass + factory branch:

class EntityType(Enum):
    ...
    DIAGNOSTIC_CAPTURE = "diagnostic_capture"   # immutable captured-fold snapshot (S18 Phase A)

@dataclass
class DiagnosticCapture(BaseEntity):
    crypto_symbol: str
    fold_id: int
    cell_ref: str                  # f"{finetune_run_id}_{config_hash}" (§4)
    input_data_sha: str            # canonical hash of the fold's input slice (§4) — correctness-bearing
    content_hash: str              # canonical hash of captured X/y (§6) — integrity + physical dedup id
    captured_data: pd.DataFrame    # X + _label + _weight + _split  (the ONLY DataFrame persisted)
    capture_meta: Dict[str, Any]   # frozen params + lib versions + snapshot pointer (scalars); bulky trials JSON → S3, not PG (§3)

Secondary-artifact scope (closes v2-Q5, corrected v4): only the main DataFrame (X + _label + _weight + _split) is the artifact body. The CVNTrade_CacheManager already routes large entity_data to the S3/MLflow artifact and only scalar metadata to the PG index row (src/commun/cache/cvntrade_cache_manager.py:470, src/commun/cache/cvntrade_cache_index.py:219). So the bulky forensic JSON (255 trials / manifest) is written as a sidecar JSON inside the same S3 artifact run, NOT into a PG row (big JSON in the index = anti-pattern). capture_meta in PG holds only scalars (n_trials, lib versions, the snapshot pointer).

Autonomy: the entity requires no prior execution of the active feature-engineering or labelling modules — it is self-contained (it does not register live-chain dependencies in EntityRegistry).

4. Cache key (the crux, corrected by §2bis)

Two-level keying — logical lookup vs physical content (the latter bounds duplication):

4a. Two keying models — the choice IS the hit-rate gate (§9)

§2bis is confirmed in code: historical windows are re-derived live on every fold (the WF engine re-slices the live feature store; the ETL re-pulls Binance on a Redis TTL — src/backtest/cvntrade_wf_backtest_engine.py:220, src/ETL/cvntrade_etl_pipeline.py:412). No pinned historical snapshot exists. That forks the key:

Model A — live-derived key. input_data_sha recomputed at lookup from get_feature_store(...) over [train_start … test_end], included in the key. Correct (drift → miss → recapture, never stale) but only hits if that live slice is stable day-to-day for a closed window. Given TTL regen + ETL re-pull, it likely changes on every regen → the cache rarely hits, and the warm path still pays the (metered) get_feature_store load every time. This model makes the lever's value contingent on an unverified stability assumption.

Model B — pinned content-addressed snapshot (RECOMMENDED). What is pinned (round-7 clarification): the Phase-A output captured fold (captured_data), NOT a second copy of the raw Feast input. Storing the raw input slice too would double the S3 footprint for no gain. So the first successful capture of a cell becomes the immutable truth for that cell_ref: we persist captured_data (per §3), record input_data_sha (the canonical hash of the input slice that produced it — kept as a drift reference, not as a stored blob) + the artifact pointer against cell_ref (precedent: feature-selection _train_window_fingerprint, src/commun/finetune/feature_selection/dispatch.py:159). Re-audits look up by cell_ref (stable → reliable hits) and reuse the pinned output — they do NOT call get_feature_store on the warm path. This fixes the hit-rate, the warm latency, and the get_feature_store hot-dependency in one move. Drift is no longer a silent cache-bust: it is an explicit, optional op (--check-drift recomputes the live slice hash vs the recorded input_data_sha → emits capture_input_drift; operator decides whether to re-pin).

# Model B logical lookup key (stable, no live data load):
lookup_key = { "crypto_symbol": crypto, "fold_id": fold_id,
               "cell_ref": f"{run_id}_{config_hash}",
               "capture_schema_version": CAPTURE_SCHEMA_VERSION,
               "input_data_hash_version": INPUT_DATA_HASH_VERSION }
# input_data_sha + snapshot pointer = recorded ATTRIBUTES of the artifact (the pin), not live lookup inputs.

The decision between A and B is exactly the §9 hit-rate gate: measure whether the live slice is stable for historical folds (A viable) or drifts (B required). The findings above make B the expected outcome — but the measurement is the formal decider, with equal billing to the size gate.

Either way, config_hash and the snapshot/input_data_sha MUST use the canonical serialization below; the §8 metrics bound any data load that does occur (Model B: only at first capture; Model A: every lookup).
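
To make the Model B warm path concrete, here is a minimal sketch in which an in-memory dict stands in for the PG index. The names Pin, warm_lookup and check_drift are hypothetical, the two version components of the key are omitted for brevity, and the live-slice loader is passed in as a callable so the example can show that the warm path never invokes it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Pin:
    """Attributes recorded at first capture (hypothetical shape)."""
    artifact_pointer: str   # where captured_data lives
    input_data_sha: str     # drift reference only, not a lookup input


# In-memory stand-in for the PG index: (crypto, fold_id, cell_ref) -> Pin
PinIndex = Dict[Tuple[str, int, str], Pin]


def warm_lookup(index: PinIndex, crypto: str, fold_id: int, cell_ref: str) -> Optional[Pin]:
    """Warm path: resolve by the stable key only, with no live data load."""
    return index.get((crypto, fold_id, cell_ref))


def check_drift(pin: Pin, live_slice_sha: Callable[[], str]) -> Optional[dict]:
    """Explicit drift check: hash the live slice and compare it to the pin."""
    live = live_slice_sha()   # the only place the live store is touched
    if live == pin.input_data_sha:
        return None
    return {"event": "capture_input_drift",
            "recorded": pin.input_data_sha, "live": live}
```

Keeping the loader out of warm_lookup is the point of the sketch: a re-audit pays for the live load only when the operator (or the policy set by Gate 2) asks for the drift check.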

input_data_sha — canonicalization contract (correctness-bearing, versioned)

Because input_data_sha is now correctness-bearing, it is a contract, not an intention. INPUT_DATA_HASH_VERSION versions this protocol; any change to it bumps the version (which is itself a key component → old entries naturally miss):

INPUT_DATA_SHA_V1 = sha256( canonical_table(
    rows      : ordered by [timestamp, asset, row_id] (deterministic, stable)
    columns   : sorted lexicographically; runtime/metadata columns excluded
    included  : feature columns + _label + _weight + _split  (explicitly)
    dtypes    : normalized (float64 little-endian or fixed-precision decimal; ints fixed-width)
    NaN/inf/-0.0 : explicit sentinels (not platform-dependent bit patterns)
    timezone  : UTC-normalized
    index     : excluded unless semantically meaningful (then promoted to a named column)
    boundaries: train/val/test slice boundaries included as part of the hashed payload
) )
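
The contract above can be sketched in Python as follows. This is an illustration under assumptions, not the project's implementation: the function name input_data_sha, the token format, and the separator bytes are hypothetical, column names are assumed to be strings, and cells are hashed one by one for clarity rather than speed.

```python
import hashlib
import json
import math

import numpy as np
import pandas as pd


def _token(value) -> str:
    """Platform-independent text token for one cell."""
    if isinstance(value, (float, np.floating)):
        f = float(value)
        if math.isnan(f):
            return "NaN"                      # explicit sentinel
        if math.isinf(f):
            return "+Inf" if f > 0 else "-Inf"
        if f == 0.0:
            return "0"                        # folds -0.0 into +0.0
        return f.hex()                        # exact float text
    if isinstance(value, pd.Timestamp):
        ts = value.tz_localize("UTC") if value.tzinfo is None else value.tz_convert("UTC")
        return str(ts.value)                  # integer nanoseconds since the epoch
    if isinstance(value, np.generic):
        return str(value.item())
    return str(value)


def input_data_sha(df: pd.DataFrame, row_order, boundaries: dict, excluded=()) -> str:
    """Hash a slice after fixing row order, column order and value encoding."""
    skip = set(excluded)
    cols = sorted(c for c in df.columns if c not in skip)
    table = df.sort_values(list(row_order), kind="mergesort")[cols]
    h = hashlib.sha256()
    header = {"columns": cols, "boundaries": boundaries}
    h.update(json.dumps(header, sort_keys=True).encode("utf-8"))
    for col in cols:
        for value in table[col]:
            h.update(_token(value).encode("utf-8"))
            h.update(b"\x1f")                 # separator between cells
        h.update(b"\x1e")                     # separator between columns
    return h.hexdigest()
```

Two frames holding the same rows in a different row order, column order, or timezone representation hash identically under this sketch, while any changed value or slice boundary changes the digest.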

The same canonical-serialization discipline applies to config_hash (sorted keys, fixed float repr, numpy dtypes normalized) — otherwise two identical prod runs hash differently and the cache never hits. The §14 key-parity test runs on real prod params (numpy types), not a toy dict.
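
As an illustration of why numpy types matter for config_hash, the sketch below normalizes numpy scalars and arrays to plain Python values before a sorted-key JSON dump. The helper names are hypothetical and the exact float policy is an assumption (here, Python's shortest round-trip repr as emitted by json.dumps).

```python
import hashlib
import json

import numpy as np


def _normalize(value):
    """Reduce numpy containers and scalars to plain JSON-friendly Python types."""
    if isinstance(value, dict):
        return {str(k): _normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_normalize(v) for v in value]
    if isinstance(value, np.ndarray):
        return [_normalize(v) for v in value.tolist()]
    if isinstance(value, np.generic):
        return _normalize(value.item())       # np.int64 -> int, np.float64 -> float
    return value


def config_hash(params: dict) -> str:
    """sha256 of a canonical JSON dump: sorted keys, fixed separators."""
    payload = json.dumps(_normalize(params), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this normalization, a params dict built from numpy values and the same dict built from plain Python values produce one hash regardless of key order.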

4b. Physical content addressing + de-duplication

The logical key resolves (PG index) to a physical blob addressed by content_hash (§6). Distinct logical keys with identical canonical content → one physical blob. N logical rows → 1 artifact.
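
A toy model of the two-level addressing, assuming in-memory dicts in place of the PG index and the S3 store (the class and method names are hypothetical):

```python
from typing import Dict


class CaptureIndex:
    """Logical keys point at content hashes; blobs are stored once per hash."""

    def __init__(self) -> None:
        self.pointers: Dict[str, str] = {}    # logical key -> content_hash
        self.blobs: Dict[str, bytes] = {}     # content_hash -> stored artifact

    def put(self, logical_key: str, content_hash: str, payload: bytes) -> bool:
        """Register a capture; return True only if a new blob was written."""
        existing = self.pointers.get(logical_key)
        if existing is not None and existing != content_hash:
            # Immutable pointer: a different hash under the same key is a
            # reproducibility divergence, never a silent repoint.
            raise ValueError("same logical key, different content_hash")
        wrote = content_hash not in self.blobs
        if wrote:
            self.blobs[content_hash] = payload
        self.pointers[logical_key] = content_hash
        return wrote
```

Registering two logical keys with the same content_hash leaves two index rows and a single blob, which is the N-rows-to-1-artifact property described above.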

4c. Remaining upstream dependency (honest limitation)

The original FTF data is unrecoverable (feature_hash="unknown"). So input_data_sha is recorded at the first diagnostic capture and re-audits compare to that — it guarantees the cache never serves across drift, but it cannot certify fidelity to the original FTF data. True end-to-end reproducibility requires populating finetune_results.feature_hash at FTF time (upstream change, §13 D6).

5. Access path — get_diagnostic_capture()

Mirrors get_labels (src/commun/cache/cvntrade_cache_interface.py:418) with claim-based concurrency (§7), immutability, and the --force-recapture override (§6). The L0 session cache is not part of this lever's value (the cost is inter-run); kept only as inherited behaviour, and it stores a lazy ref + metadata, not the full DataFrame, to avoid intra-worker RAM bloat across multiple folds.

def get_diagnostic_capture(self, crypto, fold_id, *, cell_ref, input_data_sha,
                           capture_fn, force_recapture=False) -> DiagnosticCapture:
    logical_key = {... "input_data_sha": input_data_sha,
                       "capture_schema_version": CAPTURE_SCHEMA_VERSION}
    if not force_recapture:
        entity = self._try_cache_read(logical_key, crypto, fold_id)   # §8 (retry + integrity)
        if entity is not None:
            return entity
    with self._claim_generation(logical_key) as claim:                 # §7 distributed lease
        if claim.lost_race:
            return self._await_generation(logical_key)                 # no double replay
        return self._execute_cold_capture(capture_fn, logical_key, crypto, fold_id)

6. Integrity & parity (corrected)

  • Integrity = a canonical content hash of captured_data (columns sorted, dtypes normalized, row order fixed, file metadata excluded), recomputed on read; mismatch → corrupt → regenerate (loud). Not raw parquet bytes (writer/codec/Arrow versions perturb bytes without changing meaning).
  • Parity = verdict parity against the pinned snapshot (Model B). A re-audit reuses the pin → its fold is identical to the first pinned capture → identical verdict, deterministically (the §14 test asserts canonical_hash equality). Parity is to D_capture1 (the pin), per §2ter — not to the original FTF D_ftf. Live data drift no longer silently busts the cache (which would defeat the lever); it is surfaced by the --check-drift op (§4a) as capture_input_drift. But that optionality is itself a risk (round-9): if drift occurs and --check-drift is never run, the pin goes silently stale and §2ter consistency degrades to "consistency with an obsolete snapshot". So whether --check-drift is opt-in, default-on, or auto-above-a-pin-age (plus the max pin age) is NOT assumed optional — it is set by §9bis Gate 2 (drift-rate). (Under the rejected Model A, drift was self-correcting via miss→recapture; Model B trades that for reliable hits at the cost of needing this explicit guard.)

Reproducibility divergence (same logical key → different content_hash)

A regenerated artifact under the same logical key MUST produce the same content_hash. If it does not, that is evidence of non-determinism (canonicalization bug, an uncaptured runtime/library dependency, row-order instability) — it is NOT something to paper over by silently repointing the key. Behaviour:

| Situation | Behaviour |
|---|---|
| same logical key, same content_hash | idempotent (no-op / metadata refresh) |
| same logical key, different content_hash (normal mode) | emit capture_repro_divergence severity=error; do not silently repoint; surface for investigation |
| same logical key, different content_hash (force mode, explicitly allowed) | write a new generation record attaching both hashes + the divergence event (audit trail), never overwrite |

A recurring divergence means the logical key is incomplete (a real dependency is missing from INPUT_DATA_HASH_VERSION / CAPTURE_SCHEMA_VERSION) — it is a design signal, not a transient.
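
The three rows of the divergence table reduce to a small decision function. The sketch below uses hypothetical names (Action, on_regeneration) and only returns the action to take; emitting the event and writing the generation record are left to the caller.

```python
from enum import Enum


class Action(Enum):
    IDEMPOTENT = "idempotent"
    DIVERGENCE_ERROR = "capture_repro_divergence"
    NEW_GENERATION = "new_generation_record"


def on_regeneration(stored_hash: str, new_hash: str, force: bool = False) -> Action:
    """Decide what a regenerated artifact under an existing logical key triggers."""
    if stored_hash == new_hash:
        return Action.IDEMPOTENT              # no-op / metadata refresh
    if force:
        return Action.NEW_GENERATION          # keep both hashes, never overwrite
    return Action.DIVERGENCE_ERROR            # severity=error, no silent repoint
```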

7. Concurrency — single-flight generation

Two pods missing the same (crypto, fold) must not both pay ~10 CPU-h. Use PostgreSQL, not Redis: pg_advisory_xact_lock is already used by the cache index to serialize cold init (src/commun/cache/cvntrade_cache_index.py:209), and the manager already holds the PG connection. Reusing it keeps single-flight inside the existing index, with no new infra dependency (consistent with §12's rejection of the PVC; Redis exists in-stack for Feast/ETL but coupling the diagnostic path to it would re-introduce surface we deliberately avoid).

Critical correction (round-7): the advisory lock guards only the brief claim, never the 10-CPU-h capture. A transaction-scoped pg_advisory_xact_lock cannot coexist with a 5-min heartbeat: committing to bump updated_at releases the xact lock; not committing holds a transaction open for the whole capture → connection-pool starvation. So we split the two concerns:

  • Atomic claim (short, lock-protected) — a session-level pg_advisory_lock(key) around a single upsert of the claim-row to status=generating, committed immediately, then pg_advisory_unlock(key). The lock is held for milliseconds; the connection returns to the pool at once.
  • Long generation (no held lock) — guarded purely by the claim-row state + heartbeat (updated_at), NOT by any held DB lock.
conn.execute("SELECT pg_advisory_lock(%s)", (cell_ref_key,))   # session lock, brief
try:
    won = self._upsert_claim_generating(cell_ref)              # INSERT ... ON CONFLICT; sets status+updated_at
    conn.commit()                                              # free the pooled connection immediately
finally:
    conn.execute("SELECT pg_advisory_unlock(%s)", (cell_ref_key,))
# no lock held during the ~10 CPU-h capture below:
Claim       : pg_advisory_lock → upsert status=generating → COMMIT → pg_advisory_unlock  (winner)
Heartbeat   : winner bumps claim_row.updated_at every ~5 min in its OWN short txns (capture_lease_heartbeat)
Stale lease : updated_at older than 30 min ⇒ reclaimable (dead generator)
Max wall    : 4 h (configurable) → abort + clear claim + capture_lease_timeout
Loser wait  : bounded poll of the claim-row/index; on stale lease → re-claim

Winner generates+stores+clears the claim; loser _await_generation polls (bounded) then reads. Late double-write idempotent on the same content_hash (divergent → §6). Prevents warm-start storms without holding a DB transaction (or a Redis lock) for the capture duration.
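
The lease parameters listed above (5 min heartbeat, 30 min stale threshold, 4 h max wall) can be expressed as a pure function over the claim-row timestamps. This is a sketch; lease_state and the constant names are hypothetical, and started_at is assumed to be stored on the claim row alongside updated_at.

```python
from datetime import datetime, timedelta

HEARTBEAT_EVERY = timedelta(minutes=5)    # winner bumps updated_at at this cadence
STALE_AFTER = timedelta(minutes=30)       # no heartbeat for this long => reclaimable
MAX_WALL = timedelta(hours=4)             # configurable hard cap on one generation


def lease_state(started_at: datetime, updated_at: datetime, now: datetime) -> str:
    """Classify a claim row as 'active', 'stale' (reclaimable) or 'timeout'."""
    if now - started_at > MAX_WALL:
        return "timeout"                  # abort + clear claim + capture_lease_timeout
    if now - updated_at > STALE_AFTER:
        return "stale"                    # dead generator: a loser may re-claim
    return "active"
```

A loser polling the claim row would call this on each poll and re-claim only on "stale"; six missed heartbeats fit inside the 30 min window, so a single slow heartbeat does not hand the lease away.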

8. Fail-safe + degraded-warm-path observability (ADR-25, ADR-26, [[feedback_no_python_crash_visible]])

A transient S3 read failure on a HIT must not trigger a 2h48 cold recapture: retry with backoff first, only then fall back — loudly.

def _try_cache_read(self, logical_key, crypto, fold_id):
    last = None                                              # last backend error, reported if retries run out
    for attempt in _backoff(retries=3):                      # transient S3/MLflow blip ≠ cold cost
        try:
            result = self.manager.get_cached_entity(EntityType.DIAGNOSTIC_CAPTURE, logical_key)
            if result.found:
                if self._verify_canonical_hash(result.entity):
                    log_event(self.logger, "capture_cache_hit", crypto=crypto, fold=fold_id)
                    return result.entity
                log_event(self.logger, "capture_cache_corrupt", severity="warn", crypto=crypto)
                return None                                  # corrupt → regenerate (caller)
            return None                                      # genuine MISS → caller generates
        except (Boto3Error, MlflowException) as e:
            last = e; continue
    # exhausted retries on a backend fault: fall back to cold, but ALERTABLE (never silent)
    log_event(self.logger, "cache_backend_failure_fallback_to_cold",
              severity="error", error=str(last), crypto=crypto)
    return None

| Situation | Behaviour |
|---|---|
| MISS | claim + generate, loud capture_cache_miss |
| Integrity mismatch | regenerate, capture_cache_corrupt severity=warn |
| cell_ref/cell absent from PG | STOP + clear error (no orphan) |
| Transient S3 error | retry/backoff → then cold fallback, cache_backend_failure_fallback_to_cold severity=error |

Degraded-mode is observable: counter capture_cache_fallback_total + a warm-hit-rate SLO; alert (Grafana, ADR-26) when fallbacks exceed a window threshold — a broken warm path surfaces instead of silently costing 10 CPU-h.

Metrics (the warm path is only as fast as its slowest step, so meter each): capture_input_hash_duration_sec, capture_input_hash_rows, capture_input_hash_bytes, capture_lookup_duration_sec, capture_artifact_read_duration_sec, capture_canonical_verify_duration_sec, capture_cache_fallback_total. Guardrail: if capture_input_hash_duration_sec exceeds a threshold (the get_feature_store load dominating the lookup), emit a warning and fold it into the warm-path SLO — so "the hash computation silently ate the warm budget" is visible, not assumed away.

9. Storage economics + retention

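Parametric (anchor: ~10k×320 fold): raw float32 ≈ 13 MB → parquet/snappy ~3–6 MB/fold; one full defi_top5 audit (~30 folds) ≈ ~100–200 MB; annual growth = audits × distinct cells × ~5 MB minus dedup (§4b). Trade-off: ~10.5 CPU-h avoided/audit vs ~100–200 MB stored, which is favourable if warm-hits materialize. Note that the 2026-05-27 validation runs measured ~21–24 MB/fold (4–8× this estimate), so the parametric figures are known to be optimistic; compression/dedup remains flagged for revisit.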
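
The arithmetic behind the parametric estimate, written out as a short script. All inputs are the figures quoted in this section; the function name is hypothetical and MB is taken as 10^6 bytes.

```python
ROWS, COLS, BYTES_PER_FLOAT32 = 10_000, 320, 4
FOLDS_PER_AUDIT = 30


def audit_range_mb(per_fold_low_mb: float, per_fold_high_mb: float,
                   folds: int = FOLDS_PER_AUDIT) -> tuple:
    """Storage range for one audit given a per-fold size range."""
    return per_fold_low_mb * folds, per_fold_high_mb * folds


raw_fold_mb = ROWS * COLS * BYTES_PER_FLOAT32 / 1e6   # uncompressed float32 fold
parametric = audit_range_mb(3, 6)                     # parquet/snappy estimate
measured = audit_range_mb(21, 24)                     # 2026-05-27 validation range

print(f"raw fold: {raw_fold_mb:.1f} MB")
print(f"parametric audit: {parametric[0]}-{parametric[1]} MB")
print(f"measured audit:   {measured[0]}-{measured[1]} MB")
```

The raw fold comes out at 12.8 MB (the "≈ 13 MB" above), the parametric audit at 90–180 MB, and the measured per-fold range at 630–720 MB per audit, still inside the 1 GB audit budget.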

The size budget X and the retention TTL are set by the gates in §9bis (Gate 3 size, Gate 1 value/reuse), not assumed here.

Retention (closes v2-Q4) — an S3 lifecycle policy on the dedicated prefix:

{ "ID": "DeleteOldDiagnosticCaptures", "Prefix": "mlflow/diagnostic-captures/",
  "Status": "Enabled", "Expiration": { "Days": 14 } }
TTL must be aligned to the observed re-audit inter-arrival, not the sprint length (round-6): if a cell is re-audited at > TTL intervals (e.g. a quarterly revisit), the pin has expired → cold again → hit-rate ≈ 0. So the Expiration.Days is set from the measured median gap between re-audits of the same cell (a number to capture, not "14 = one sprint"); 14 d is a placeholder until that gap is measured. Reconciliation with §2/§2ter: the snapshot is not a forever-reproducible anchor anyway, so default-ephemeral is consistent — but expiring the pin means the next re-audit re-derives a (possibly different, §2bis) slice, so expiry is a re-pin event, logged. A snapshot cited by a published diagnostic verdict is copied to an audit prefix (mlflow/diagnostic-audit/, exempt from expiry) at verdict-publication time — preserving the evidentiary chain.

9bis. Gates — specification (self-contained)

Four pragmatic, binary, fast-to-run gates — measurement experiments, not runtime components (each yields a number or a go/no-go, then is discarded). Sequenced cheap-first so a cheap result spares the expensive runs. Three before implementation (value, drift-rate, size), one before rollout / legacy retirement (a single cold→warm experiment covering reproducibility + perf + single-flight).

Threshold nature (round-9): < 2 min / < 10 s are product-goal-derived → hard cliffs; ≥ 20 % reuse and ≤ 30 MB (raised from 25 MB on 2026-05-27) are judgment knees → discussion triggers, not auto-cliffs. Inter-gate dependency: Gate 1 produces the justified TTL, consumed by §9 and by Gate 3's annual-cost re-validation. Otherwise independent.

Gate 1 — Value / reuse · entry · blocking

| Field | Specification |
|---|---|
| Role | The headline gate: is the warm path used often enough to justify the machinery? |
| Question | Over the last 30–60 days, are enough diagnostics re-run on the same cell_ref to justify persisting captures? |
| Measure (no workload, read history) | From PG finetune_results / Airflow / Loki: # diagnostics, # distinct cell_ref, re-runs per cell_ref, median + p75/p90 gap between re-runs, # captures reusable within the proposed retention window. |
| Decision | ≥ 20 % reuse → Go · 5–20 % → Go-limited (short retention, controlled rollout) · < 5 % → discussion trigger, NOT auto-no-go (see induced-demand). Median gap > planned TTL → raise TTL or reconsider. |
| ⚠ Induced-demand clause (round-9) | Historical reuse was produced in a world where re-audit cost 2h48: people avoided re-running precisely because of the cost this lever removes. So low historical reuse may reflect the cost barrier, not low value (counting swimmers to decide whether to build the bridge). A < 5 % result therefore triggers a qualitative question (is the rarity from cost or from no need?), cross-checked against declared operator intent, not Loki alone. Historical reuse is a signal, not a ceiling. |
| Execution | Pure read of existing history, no capture/replay; can run now. |
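
A sketch of the Gate 1 measurement over a list of (cell_ref, timestamp) diagnostic runs. The definition of "reuse" used here (share of runs that repeat an already-seen cell_ref) is an assumption, since the gate does not fix a formula, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def reuse_stats(runs):
    """Return (reuse_rate, median_gap_days) for (cell_ref, timestamp) runs."""
    by_cell = defaultdict(list)
    for cell_ref, ts in runs:
        by_cell[cell_ref].append(ts)
    gaps_days = []
    for stamps in by_cell.values():
        stamps.sort()
        gaps_days += [(b - a).total_seconds() / 86400
                      for a, b in zip(stamps, stamps[1:])]
    # Each gap is one re-run of an already-seen cell_ref.
    reuse_rate = len(gaps_days) / len(runs) if runs else 0.0
    return reuse_rate, (median(gaps_days) if gaps_days else None)


def gate1_decision(reuse_rate: float) -> str:
    """Map the reuse rate onto the knees stated in the Decision row."""
    if reuse_rate >= 0.20:
        return "go"
    if reuse_rate >= 0.05:
        return "go-limited"
    return "discussion-trigger"           # not an automatic no-go
```

The median gap returned here is the number the retention TTL in §9 is meant to be compared against.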

Gate 2 — Drift-rate · entry · blocking · touches audit correctness

| Field | Specification |
|---|---|
| Role (round-9: promoted, not demoted) | Model B made drift more critical, not less. Under Model A drift was self-correcting (miss→recapture). Under Model B the pin is reused by cell_ref and drift is caught only by the explicit --check-drift; if the operator never runs it, the pin silently goes stale and §2ter's "inter-re-audit consistency" degrades to "consistency with an increasingly obsolete snapshot", a silent failure mode. So drift frequency is first-class. |
| Question | For a fully-historical window [train_start … test_end], how often does the canonical INPUT_DATA_SHA_V1 of the live slice change? |
| Measure | K historical folds, recompute the slice hash over successive observations (light workload: load + hash, no replay; can piggyback successive cold audits). |
| Decision it gates (REOPENED) | the runtime policy --check-drift = opt-in vs default-on vs auto-above-pin-age, plus the max pin age and the §9 TTL: rare drift → opt-in + long TTL suffices; frequent drift → default-on (or auto above an age) + short TTL, else silent-stale risk. |
| Status | This is the only gate touching audit correctness (not just efficiency) → blocking for committee re-confirmation (it reopens the --check-drift default decision the v3 design left as "optional"). |
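
One way to turn the Gate 2 observations into a number: for each historical fold, count how often the slice hash changed between consecutive observations. The input shape and the function name are assumptions made for this sketch; the gate itself only specifies that hashes are recomputed over successive observations.

```python
from typing import Dict, List, Optional


def drift_rate(observations: Dict[str, List[str]]) -> Optional[float]:
    """Share of consecutive observation pairs whose slice hash changed.

    observations maps a fold key to its hashes in observation order.
    Returns None when no fold has at least two observations.
    """
    changes = 0
    pairs = 0
    for hashes in observations.values():
        for prev, cur in zip(hashes, hashes[1:]):
            pairs += 1
            if prev != cur:
                changes += 1
    return changes / pairs if pairs else None
```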

Gate 3 — Artifact-size · entry · blocking · ~zero marginal compute

| Field | Specification |
|---|---|
| Role | Keep storage sane before persisting every fold. |
| Measure (piggyback, round-9) | No dedicated captures. serialize_capture() already writes the parquet to /tmp on every normal cold run, so instrument the next scheduled cold audit to record captured_data parquet + sidecar-JSON size + read/verify time at zero marginal cost. Only S3 upload/read latency needs one real artifact (one, not ten). |
| Decision (knees) | p95 ≤ 30 MB/fold · audit ≤ 1 GB · read+verify ≤ 10 s · 0 blocking S3 issue. > 30 MB → re-validate retention/compression/dedup/annual cost (not auto-no-go). (Knee raised 25 → 30 MB on 2026-05-27: the validation runs measured ~21–24 MB/fold, 4–8× the §9 estimate, with AAVEUSDC hugging the old 25 MB knee, so a 5 MB/fold safety buffer keeps the gate from starting at the limit. Audit budget unchanged: 30 MB × ~30 folds ≈ 900 MB ≤ 1 GB.) |
| Outputs | p95 size + budget X (§9). |
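
A sketch of how the Gate 3 knees could be evaluated from per-fold measurements. The nearest-rank percentile, the 1 GB = 1000 MB convention, and the function names are assumptions; the sizes in the usage example are illustrative, not recorded results.

```python
import math
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def gate3_checks(sizes_mb: List[float], read_verify_s: List[float]) -> Dict[str, bool]:
    """Evaluate the per-fold, per-audit and read+verify knees."""
    return {
        "p95_fold_le_30mb": p95(sizes_mb) <= 30.0,
        "audit_le_1gb": sum(sizes_mb) <= 1000.0,
        "read_verify_le_10s": max(read_verify_s) <= 10.0,
    }


# Illustrative audit: 30 folds at 24 MB each, read+verify around 2 s.
checks = gate3_checks([24.0] * 30, [2.0] * 30)
```

Because these are knees rather than cliffs, a False entry is meant to trigger the re-validation described in the Decision row, not an automatic no-go.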

Gate 4 — Release: one cold→warm in-cluster experiment · release · blocking

Merges the former reproducibility + warm-perf gates (same cold→warm cycle on 3–5 folds → both) and adds the single-flight assertion — the §7 PG advisory-lock + claim-row + heartbeat is the most novel/risky code, and pool starvation (round-7) only shows under real load (the §14 concurrency test mocks it). One experiment, three distinct pass criteria:

| Criterion | Pass |
| --- | --- |
| Reproducibility | cold content_hash == warm · verdict cold == warm · no capture_repro_divergence · canonical verify stable. Divergence under the same logical key → STOP (incomplete key / unstable canonicalization, §6). |
| Performance | warm path < 2 min · artifact-load < 10 s · verdict parity · 0 fallback-to-cold nominal · 0 operator-visible crash. |
| Single-flight under contention | two simultaneous in-cluster MISSes → exactly one capture + one upload · connection pool not saturated · heartbeat (capture_lease_heartbeat) observed. |

Must pass before retiring skip_phase_a (§17). In-cluster, operator-triggered.

Gate positioning

| Gate | Moment | Blocking | Role |
| --- | --- | --- | --- |
| 1 — value / reuse | before impl | yes | worth building? (+ induced-demand caveat) |
| 2 — drift-rate | before impl | yes | --check-drift default + max pin age (audit correctness) |
| 3 — artifact size | before impl | yes | storage acceptable? (piggyback, ~free) |
| 4 — cold→warm release | before retiring legacy | yes | repro + < 2 min + single-flight under load |

[1 value]+[2 drift]+[3 size]  ──►  IMPLEMENTATION  ──►  prod rollout  ──►  [4 cold→warm: repro + <2min + single-flight]  ──►  retire skip_phase_a
 (read hist)(light)(piggyback)                                                 (the only heavy dedicated in-cluster run)
  • Not runtime artifacts: gates only inform constants (X, Expiration.Days, --check-drift default, max pin age) and authorize the next phase.
  • Cheap-first: Gate 1 = pure history read (run now); Gate 3 = piggyback on the next cold audit (~free); Gate 2 = light load+hash; Gate 4 is the only heavy dedicated run, and it now produces three results from one cold→warm cycle.
  • The agent builds the harness; the operator launches in-cluster runs (no-autonomous-launch policy).

10. Lifecycle summary

Immutable (no in-place overwrite; new logical key ⇒ new pointer). Invalidation is implicit via the key (§4a): a data-drift or schema bump yields a new key and (likely) a new blob; old snapshots age out under the 14-day rule. Eviction is policy-gated (§9), not "free because recomputable" — but the policy is ephemeral-by-default + pin-on-cite, calibrated to §2bis (limited reproducibility) rather than to a false "forever anchor".
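The 14-day rule could be expressed as an S3 lifecycle rule on the diagnostic-captures/ prefix named in D4. The sketch below uses the dictionary shape that boto3's S3 client accepts for `put_bucket_lifecycle_configuration`; the rule ID is hypothetical and no bucket name is assumed.

```python
# Shape accepted by boto3's S3 client:
#   client.put_bucket_lifecycle_configuration(
#       Bucket=<bucket>, LifecycleConfiguration=LIFECYCLE)
LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-diagnostic-captures",            # hypothetical rule name
            "Filter": {"Prefix": "diagnostic-captures/"},  # ephemeral captures only
            "Status": "Enabled",
            "Expiration": {"Days": 14},                    # the 14-day rule
        }
    ]
}
```

Because the rule is scoped by prefix, objects copied to diagnostic-audit/ by pin-on-cite are not matched and therefore do not expire under it.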

11. Wiring into the diagnostic flow

| Component | Today | After |
| --- | --- | --- |
| serialize_capture() | writes /tmp parquet | the MISS generator (capture_fn); /tmp is intra-pod scratch, store_entity persists to L3 |
| skip_phase_a bool + /tmp check | manual precondition, RuntimeError if missing | removed (closes v2-Q3). Replaced by --force-recapture (default False), mirroring FORCE_FEATURE_STORE (src/commun/cache/cvntrade_cache_interface.py:233): True → bypass the read path, regenerate, then compare content_hash (§6 divergence) — not a blind overwrite of the immutable artifact |
| s40_io.py (src/commun/finetune/diagnostic/hamilton/s40_io.py:75) | loads from /tmp parquet | loads from the artifact; integrity = canonical hash |

Operator-facing impact: replacing the skip_phase_a Airflow Param with --force-recapture is an operator-visible DAG change — it MUST update the DAG doc_md banner (ADR-92) and documentation/OPERATIONS.md (the diagnostic-run runbook), and be called out in the PR. Operators triggering a re-audit no longer pre-stage a /tmp parquet; they get a transparent warm hit (or a --force-recapture toggle).
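The regenerate-and-compare semantics of --force-recapture can be sketched as follows. This is an illustration only: `force_recapture`, `CaptureReproDivergence`, and the callable-based signature are hypothetical, and the real path would also emit the capture_repro_divergence event rather than only raising.

```python
from typing import Callable


class CaptureReproDivergence(RuntimeError):
    """Same logical key produced a different content_hash."""


def force_recapture(pinned_hash: str, regenerate: Callable[[], str]) -> str:
    """Bypass the read path, regenerate, then compare content hashes.

    `regenerate` stands in for the capture generator and returns the
    content_hash of the freshly generated fold.
    """
    new_hash = regenerate()
    if new_hash != pinned_hash:
        # Divergence is an error event: the immutable artifact is never
        # overwritten and the pointer is never silently moved.
        raise CaptureReproDivergence(
            f"capture_repro_divergence: pinned={pinned_hash} regenerated={new_hash}"
        )
    return pinned_hash
```

Both hashes are carried in the error message so that a divergence can be diagnosed without re-running the capture.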

12. Alternatives considered

| Alternative | Verdict |
| --- | --- |
| Bespoke "upload parquet to S3" helper keyed by SHA | Rejected — raw S3 storage forgoes the platform's native PostgreSQL indexing, TTL/lifecycle management, and observability that CVNTradeCache provides. |
| PVC mounted into diagnostic pods | Rejected — no PVC mount exists; infra surface; not cleanly content-addressed/shared. |
| Output-parquet SHA as the lookup key | Rejected — not computable before capture → can't decide "skip". |
| Single version + cell identity = capture identity | Rejected — fragile invalidation + duplication explosion. |
| cell_ref-only key (CAPTURE_CODE_VERSION only) | Rejected on evidence (§2bis) — data drifts under a fixed cell → silent stale serve. |
| Model A — live-derived input_data_sha in the key (recomputed each lookup) | Rejected (conditional) — correct but, per the code findings (live re-derivation, TTL/ETL regen), it misses on every regen → near-zero hit-rate, and reloads the slice every warm call. Viable only if the §9 hit-rate gate shows day-to-day stability. |
| Model B (CHOSEN) — artifact via cache infra; pinned content-addressed snapshot keyed by cell_ref; pin reused on the warm path; drift = explicit check; PG advisory-lock single-flight; logical/physical dedup | Chosen (pending the §9 hit-rate measurement confirming Model A is not viable, the expected outcome). |

13. Closed decisions (was "open questions")

| # | Decision |
| --- | --- |
| D1 | Key = cell_ref + input_data_sha + CAPTURE_SCHEMA_VERSION. No live-FE env vars (§4a). |
| D2 | cell_ref = f"{finetune_run_id}_{config_hash}"; canonical serialization for both hashes (§4a). |
| D3 | skip_phase_a removed → --force-recapture (§11). |
| D4 | Retention: S3 lifecycle 14-day expiry on diagnostic-captures/; pin-on-cite copy to diagnostic-audit/ for published verdicts (§9). |
| D5 | Secondary artifacts: only the main DataFrame is the body; JSON inlined in capture_meta (§3). |
| D6 | (residual dependency) True original-FTF fidelity needs finetune_results.feature_hash populated at FTF time (currently "unknown"). Until then, input_data_sha is recorded at first capture; re-audits compare to that (§4c). Tracked as a follow-up on the FTF runner. |
| D7 | input_data_sha is a versioned canonicalization contract (INPUT_DATA_HASH_VERSION), spec'd in §4a; force-recapture is regenerate-and-compare, divergence is an error event, never a silent repoint (§6). |
| D8 | Keying = Model B (pinned content-addressed snapshot), keyed on cell_ref; pin reused on the warm path, drift = explicit --check-drift. Model A dropped (committee-approved B); the entry gate is now value/reuse (§9bis Gate 1), not an A/B decision. |
| D9 | Single-flight = PostgreSQL advisory-lock + claim-row (already in-pattern, §7), not Redis — no new infra in the diagnostic path. |
| D10 | (committee decision) §2ter: this audit certifies inter-re-audit consistency from the pinned first capture, not fidelity to the original FTF run (which needs D6). Must be explicitly accepted, not implied by the "reproducibility" framing. |
| D11 | (round-9 — operator-waived 2026-05-26) Under Model B a reused pin can go silently stale if drift occurs and --check-drift is never run. So --check-drift = opt-in vs default-on vs auto-above-pin-age (+ max pin age) is resolved by §9bis Gate 2 (drift-rate) measurement. The v3 "drift is optional" wording is withdrawn. Committee re-confirmation of this delta was waived by the operator on 2026-05-26 (waiver logged on OP wp#232; rationale: scoped drift-policy delta + committee consolidator degraded/Gemini spend-capped). The v3-era design retains its committee PASS (ca64c4d8). Reopen condition (round-11): a correctness decision must not stand on a cost/tooling waiver. This waiver covers only the absence of a re-confirmation now — it does not pre-approve the outcome. Once §9bis Gate 2 produces the drift-rate, if it implies --check-drift default-on / short TTL (a material change to the operator behaviour nominally validated under ca64c4d8), the waiver is explicitly reopened and the delta returns to committee. |
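D1 and D2 can be illustrated with a small key-derivation sketch. It is not the §4a contract: the helper names are hypothetical, JSON with sorted keys is one possible canonical serialization, and the schema version value is borrowed from the §14 version-conflict test. Numpy-like scalars are reduced through their `item()` method so that `np.int64(3)` and `3` would produce the same key.

```python
import hashlib
import json
from typing import Any, Mapping

CAPTURE_SCHEMA_VERSION = "1.0.0"  # illustrative value only


def _plain(value: Any) -> Any:
    """Reduce numpy-like scalars and nested containers to plain Python types."""
    if isinstance(value, Mapping):
        return {str(k): _plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_plain(v) for v in value]
    item = getattr(value, "item", None)
    if callable(item):  # duck-typed numpy scalar (assumption)
        return item()
    return value


def canonical_sha(obj: Any) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON rendering."""
    payload = json.dumps(_plain(obj), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def logical_key(finetune_run_id: str, config: Mapping[str, Any],
                input_data_sha: str,
                schema_version: str = CAPTURE_SCHEMA_VERSION) -> str:
    """Key = cell_ref + input_data_sha + schema version (D1, D2)."""
    cell_ref = f"{finetune_run_id}_{canonical_sha(config)}"
    return canonical_sha({
        "cell_ref": cell_ref,
        "input_data_sha": input_data_sha,
        "schema_version": schema_version,
    })
```

Each component changes the key independently, which is the property the unit tests in §14 are meant to pin down.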

14. Test plan (ADR-58 — guardrail + integration test)

  • Unit: logical-key parity on real prod params (numpy dtypes) — same inputs → same key; each key component bumps independently; canonical-content-hash independent of parquet byte-encoding + metadata (tested via two serialization paths) — not a literal "stable across all pandas versions" guarantee; corrupt-blob → regenerate; same logical key + regenerate → same content_hash (the §6 non-divergence invariant).
  • Concurrency: two simultaneous MISSes → exactly one replay + one upload; both readers get the same artifact.
  • Integration (tests/integration/test_diagnostic_cache.py):
      • Cold run — clear cache for a test cell_ref → run diagnostic → assert capture_cache_miss emitted → assert artifact present on S3.
      • Warm run — two tiers: a component test asserts the artifact load (lookup + read + canonical-verify) < 10 s; the full warm diagnostic path asserts < 2 min (the product goal — includes get_feature_store + input_data_sha + the rest of the diagnostic). p50/p95 recorded as a non-blocking perf benchmark (avoid a flaky hard < 10 s on the full path). Assert capture_cache_hit emitted.
      • Parity assertion — assert canonical_hash(cold_df) == canonical_hash(warm_df) (blocking).
      • Drift — mutate the input slice → assert input_data_sha changes → assert a miss (re-capture), not a stale hit.
      • Version conflict — store under CAPTURE_SCHEMA_VERSION="1.0.0", bump code to "1.1.0" → assert a miss (new logical key), the old-layout artifact is never served under the new schema.
      • Backward-compat / migration — with the cache backend unavailable (or force_recapture), assert the legacy generator path (serialize_capture → /tmp) still produces a valid fold and a correct verdict (no regression while both paths coexist).
  • MLOps readiness (ADR-70): touches src/commun/cache/ + capture → fill TEMPLATE_mlops_readiness.md.
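The concurrency test in the list above ("two simultaneous MISSes → exactly one replay") can be sketched with an in-process stand-in. This swaps the PostgreSQL advisory-lock + claim-row of §7 for a `threading.Lock` per key, so it only illustrates the single-flight invariant, not the lease or heartbeat; the class and method names are hypothetical.

```python
import threading
from typing import Callable, Dict


class SingleFlight:
    """In-process stand-in for the PG advisory-lock + claim-row (assumption)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}
        self._results: Dict[str, str] = {}

    def get_or_capture(self, key: str, capture_fn: Callable[[], str]) -> str:
        # One lock per logical key; the guard only protects the lock table.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # The first caller generates; later callers wait, then read.
            if key not in self._results:
                self._results[key] = capture_fn()
            return self._results[key]


def run_two_misses() -> tuple:
    """Fire two simultaneous misses and report (capture_count, results)."""
    calls, results = [], []
    flight = SingleFlight()

    def capture() -> str:
        calls.append(1)
        return "artifact"

    threads = [
        threading.Thread(
            target=lambda: results.append(flight.get_or_capture("cell", capture)))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

A test against the real backend would additionally need to assert one upload and an unsaturated connection pool, which an in-process lock cannot show.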

15. Files touched

cvntrade_cache_entities.py (+entity), cache_versioning.py (+CAPTURE_SCHEMA_VERSION), cvntrade_cache_interface.py (+get_diagnostic_capture, retry/claim), cvntrade_cache_manager.py (claim/lease + logical→physical dedup), cvntrade_entity_registry.py (register), s18_step1_capture.py (MISS generator + emit input_data_sha), hamilton/s40_io.py (load from artifact), dag_diagnostic__s18_step1_4_chain.py (--force-recapture, degraded events), infra (Grafana SLO + S3 lifecycle), tests/unit|integration/.... Follow-up (separate): finetune/persistence.py + FTF runner to populate feature_hash (D6).

15bis. Risks & mitigations (consolidated)

Risk Impact Mitigation (§) Residual
Data drift serves a stale fold (the core §2bis risk) wrong verdict, silently input_data_sha in the lookup key → drift = miss, not stale hit (§4a, §6) original-FTF fidelity unprovable until feature_hash is populated upstream (D6)
Thundering herd — N pods miss the same cell → N×10 CPU-h cluster cost blow-up single-flight lease; losers await, don't replay (§7) lease TTL mis-sizing → rare benign double-capture (idempotent on content_hash)
Transient S3 blip on a HIT → cold recapture ~2h48 wasted retry/backoff before any cold fallback (§8) persistent S3 outage → cold path, but alerted (cache_backend_failure_fallback_to_cold)
Storage growth S3 cost content-dedup (§4b) + 14-day lifecycle (§9) audit-pinned set grows slowly, bounded by cited verdicts
Key non-determinism (numpy dtypes / float repr) → cache never hits perf win lost silently canonical serialization of config_hash/input_data_sha (§4a) + key-parity test on real prod params (§14)
L0 session RAM bloat (many folds/worker) OOM risk session stores a lazy ref + metadata, not the full DataFrame (§5)
Schema/version conflict serves wrong layout corrupt read downstream CAPTURE_SCHEMA_VERSION in key; version-conflict test (§14) relies on manual bump discipline
cell_ref absent (cell deleted from PG) crash / orphan STOP + clear error, never guess (§8)
Corrupt/partial blob wrong data canonical-hash verify on read → regenerate loud (§6, §8)
Repro divergence (same logical key → different content_hash) audit integrity; hidden non-determinism capture_repro_divergence event, no silent repoint; both hashes attached in force mode (§6) signals an incomplete key → needs a *_HASH_VERSION bump
Lookup cost eats the warm budget (get_feature_store load) warm path slower than the < 2 min goal Model B removes the warm-path load entirely (pin reuse); else bounded + metered + guardrail (§4a, §8) Model A only — confirmed by the §9 measurement
Lever delivers no value (cache rarely reused, or pins expire before reuse) — the existential risk machinery built for almost nothing Model B (pin keyed on cell_ref) + the §9bis Gate 1 value/reuse go/no-go before coding (≥20% reuse) low reuse → minimal-only build or no-go
Audit certifies the wrong thing (warm verdict read as validating the original FTF) misleading diagnostics §2ter explicit committee decision; D6 upstream fix for true FTF fidelity bounded to "inter-re-audit consistency" until D6
Silent stale pin (Model B: data drifts, --check-drift never run → reused pin obsolete) — correctness, round-9 re-audits consistent with an obsolete snapshot, unnoticed §9bis Gate 2 (drift-rate) sets --check-drift default-on / auto-above-age + max pin age (D11) residual until the drift-rate is measured

Priorities: the stale-serve (data drift) and thundering-herd risks are the two that change the design's correctness/cost profile — both are first-class (§4a, §7). The rest are standard cache hygiene.

16. ADRs + governance

ADR-25 (no silent fallback), ADR-26 (Grafana/SLO), ADR-58 (guardrail+integration test), ADR-70 (MLOps readiness), ADR-88 (versioned test data + provenance), ADR-89 (harness). Because v3 materially extends lever #1 beyond the plan_review (9e590907) — new entity, key model driven by §2bis, removed skip_phase_a — it was taken to committee: plan_review ca64c4d8 PASSED (Mistral 8.0, OP Meeting #231). The later v4.4 drift-policy delta (D11) reopened one decision; its committee re-confirmation was waived by the operator on 2026-05-26 (logged on OP wp#232) — so the design stands as: committee-PASSED (ca64c4d8) + operator-waived drift-policy delta, resolved at implementation by §9bis Gate 2.

17. Rollout / migration

Backward-compatible: /tmp + the generator path keep working. Ship behind the existing flow; once a real re-audit is observed < 2 min with verdict parity and the warm-hit SLO holds, the old skip_phase_a is fully retired (D3).

Post-deploy validation (in-cluster, before retiring the legacy path):

  1. Cold→warm in prod — trigger a real defi_top5 re-audit; assert the 2nd run hits the cache < 2 min with verdict parity vs the cold run (not just the local test).
  2. Warm-hit SLO — confirm the capture_cache_* metrics + warm-hit-rate SLO are green on Grafana (ADR-26) over ≥ one review-sprint.
  3. Backward-compat gate — keep both paths live for one sprint; retire skip_phase_a (D3) only after (1)+(2) hold and no cache_backend_failure_fallback_to_cold spikes are observed.
  4. Operator docs — OPERATIONS.md + DAG doc_md updated and verified live before the param rename ships (§11).


One-line essence

Not "caching a parquet". A reproducibility primitive — immutable, content-addressed, expensively-generated, drift-aware (§2bis), audit-pinnable — using the cache as its plumbing. The key binds to the data content (not the cell alone, not the live FE env), single-flight on generation, retries before falling back to cold, and never serves a stale fold silently.

---

Part B — Lever #2 design: optional run_s22a1 cross-ref anchor

Maturity: DRAFT v0.1 · design-only · resolver already merged (aa3140d0), wiring pending. Lighter than Part A.

B1. Problem

In the s40 validation-integrity path, run_s22a1 — the 300-round LightGBM re-proof that the captured fold still reproduces the original shallow-convergence symptom — runs unconditionally (src/commun/finetune/diagnostic/hamilton/s40_io.py:142), costing 7–865 s per cell (the variance is the §B7 secondary finding). But the four s40 probes only consume the captured X/y (loaded at src/commun/finetune/diagnostic/hamilton/s40_io.py:178) — they do not need the re-proof. A statistically light leakage/label audit therefore pays a symptom-reproduction-heavy preamble it doesn't use.

B2. Current state (on the branch)

  • Done (aa3140d0): resolver _resolve_s40_skip_s22a1 (src/commun/finetune/diagnostic/hamilton/s40_io.py:82, reads CVN_S40_SKIP_S22A1_CROSSREF, fail-safe parse) + 14 unit tests.
  • Missing: the resolver is never consulted at the call site: run_s22a1 still runs every time.
  • Out-of-scope sibling: s27_io.py:90 also calls run_s22a1 (the CVN_S40_* flag is s40-scoped — see B8).

B3. Design — gate the anchor, keep the data load

The s40 flow is: (1) run_s22a1 → v1c (status + parquet_sha256); (2) recompute parquet SHA, compare to v1c.parquet_sha256 (drift anchor); (3) _load_captured_parquet → X_tr,y_tr,X_va,y_va (what the probes need). Lever #2 makes (1) optional:

if _resolve_s40_skip_s22a1():
    # SKIP the 300-round anchor (1) + the s22a1-derived drift check (2)
    require parquet exists  → else INCONCLUSIVE_TOOLING "captured parquet absent; run with anchor on / capture first" (ADR-25, no silent skip)
    parquet_sha256 = _sha256_file(parquet)        # informational only (no s22a1 anchor to compare against)
    s22a1_status  = "SKIPPED"
    emit event=s40_s22a1_skipped severity=info
    → _load_captured_parquet → probes → verdict   # identical to the non-skip path from (3) on
else:
    current behaviour (run_s22a1 → drift anchor → load)

The probes run on the same X/y either way ⇒ the validation-integrity probe verdict is unchanged by the skip. What does change is the certification level: the cross-reference is downgraded to SKIPPED (see B6bis) — skip=ON does not provide the same level of proof as skip=OFF. Verdict-parity on the X/y probes is the acceptance test (B6).
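A runnable sketch of the gate described above, under stated assumptions: `resolve_skip` and `s40_preamble` are hypothetical names (the merged resolver is `_resolve_s40_skip_s22a1`, whose accepted tokens are not shown here), the set of "true" tokens is a guess, and the anchor path is passed in as a callable.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable, Mapping

_TRUE_TOKENS = {"1", "true", "yes", "on"}  # assumed token set


def resolve_skip(env: Mapping[str, str] = os.environ) -> bool:
    """Fail-safe parse: anything not clearly true keeps the anchor ON."""
    raw = env.get("CVN_S40_SKIP_S22A1_CROSSREF", "")
    return raw.strip().lower() in _TRUE_TOKENS


def s40_preamble(parquet: Path, skip: bool,
                 run_anchor: Callable[[], dict]) -> dict:
    """Return the preamble result; the data load and probes follow either way."""
    if not skip:
        return run_anchor()  # current behaviour: run_s22a1 -> drift anchor
    try:
        if not parquet.is_file():
            raise FileNotFoundError(parquet)
        sha = hashlib.sha256(parquet.read_bytes()).hexdigest()
    except OSError as exc:
        # ADR-25: no silent skip, no automatic cold recapture or anchor run.
        return {"verdict": "INCONCLUSIVE_TOOLING",
                "reason": f"captured parquet absent or unreadable: {exc}"}
    return {
        "s22a1_crossref_status": "SKIPPED",
        "s22a1_anchor_available": False,
        "parquet_sha256": sha,  # informational only, nothing to compare against
    }
```

Unrecognized values such as "maybe" resolve to False, so a typo in the environment variable leaves the stronger certification path in place.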

B4. Default policy (ADR-59 — PG ftf_config.base_env, Console UI only)

| Mission type | CVN_S40_SKIP_S22A1_CROSSREF default | Rationale |
| --- | --- | --- |
| Reproduction diagnostics (S22→S28) | OFF (anchor runs) | they want the re-proof that the symptom reproduces |
| Light validation-integrity audit (s40) | ON (anchor skipped) | only needs X/y → the preamble is pure cost |

The default lives in PG (not git, ADR-59); a run-level Airflow param can override per-run.

The real criterion is the audit type, not "s40" per se (round-10) — to stop a future dev widening the skip "because it's faster":

| Audit type | May skip the anchor? |
| --- | --- |
| X/y-only audit (leakage / label integrity) | yes |
| reproduction / convergence audit | no |
| any diagnostic publishing a causal claim about LightGBM | no |
| smoke validation-integrity | yes |

Rule: skip is allowed only for X/y-only consumers. Any diagnostic whose conclusion references the S22A1 symptom MUST keep the anchor ON. (Operationalized by the Part-C consumer registry §C0, shared between levers.)

B5. What is lost when skipped (and why it's acceptable here)

Skipping drops the cross-reference that the captured fold still reproduces the original symptom and the s22a1-anchored parquet-drift check (step 2). Acceptable for an X/y-only audit (leakage/label integrity); not acceptable for the reproduction diagnostics → hence default-OFF there. In skip mode we still recompute + log the parquet SHA (informational), so the artifact is self-described even without the anchor.

⚠ Dependency (round-11): skipping removes the s22a1-anchored parquet-drift check. This is safe only if Lever #1's canonical-hash integrity (§6) is deployed to cover it — but Part A is parked. So in standalone skip mode (no Lever #1), skip removes the only parquet-drift detection. Until Lever #1 ships, the B7 value-gate weighs that loss too: a skip that also blinds drift detection raises the bar for "worth it".

B6. Guardrail + tests (ADR-58)

INCONCLUSIVE_TOOLING is the only fallback in skip mode (acceptance condition): parquet absent · unreadable · hash-uncomputable · invalid schema ⇒ all → INCONCLUSIVE_TOOLING. Never an auto cold-recapture, an auto anchor-run, a silent pass, or a partial verdict.

Tests:

  1. Parity (functional) — same parquet, skip ON vs OFF → identical s40 probe outputs.
  2. Metadata downgrade — skip ON ⇒ s22a1_crossref_status=SKIPPED + s22a1_anchor_available=false emitted (B6bis).
  3. No accidental anchor — skip ON ⇒ run_s22a1 not called, no s22a1 event.
  4. Absent parquet — skip ON + parquet absent ⇒ INCONCLUSIVE_TOOLING (no silent pass, ADR-25).
  5. Default policy — s40 light audit ⇒ default skip ON; reproduction diagnostic ⇒ default skip OFF.

B6bis. Output semantics (round-10 — make the downgrade explicit)

When run_s22a1 is skipped, the s40 X/y-only probe verdict stays comparable, but the run no longer certifies that the captured fold reproduces the S22A1 shallow-convergence symptom. The output MUST therefore carry, so no reader mistakes skip=ON for the same proof level as skip=OFF:

s22a1_crossref_status = SKIPPED
s22a1_anchor_available = false
parquet_sha256        = <informational only — no anchor to compare against>

Skip mode is valid only for diagnostics classified X/y-only (§C0 registry).

Propagation (round-11): emitting s22a1_crossref_status=SKIPPED on the s40 output is not enough — it must propagate to every verdict consumer (published report, dashboard, aggregator) and be displayed, else an aggregator shows a downgraded audit as if complete (the exact "silent" failure class this design fights). A verdict cited in a publication with crossref=SKIPPED MUST be visually marked as reduced-certification.

B7. Value gate (round-11 — this is THE entry gate, not a "secondary finding")

run_s22a1 elapsed 7 s → 865 s for the same n_rounds=300 on ~10k×320 data — a 123× unexplained variance. The entire value of Lever #2 is "skip a costly preamble", so the benefit cannot be sized without the cost distribution:

  • p50 ≈ 7 s ⇒ we'd add a certification-downgrade code path (B6bis) to save ~7 s — a bad trade (footgun > gain);
  • p50 ≈ minutes ⇒ worth it.

Gate (blocking, before wiring): measure p50/p95 of run_s22a1 wall-time on the real defi_top5 cells (read from Loki's existing elapsed_s, ~free). Wire the skip only if the cost is materially above the certification risk. (Also diagnose the 123× variance: early-stopping path? cached fit? data-size skew? — it tells us which cells the skip actually helps.) This mirrors Part-A Gate 1: don't build the lever before confirming it pays.
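The percentile step of this gate can be sketched with the standard library alone. The function names are hypothetical, and the 60 s default is an assumed stand-in for "p50 ≈ minutes"; the design does not fix a numeric threshold.

```python
import statistics
from typing import Sequence


def wall_time_percentiles(elapsed_s: Sequence[float]) -> dict:
    """p50/p95 of run_s22a1 wall-time samples (e.g. elapsed_s read from logs)."""
    if len(elapsed_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(elapsed_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def skip_worth_wiring(p50_s: float, min_saving_s: float = 60.0) -> bool:
    """True if the typical saving exceeds a hypothetical 'minutes' threshold."""
    return p50_s >= min_saving_s
```

Deciding on p50 rather than p95 matches the framing above: a long tail alone says which cells benefit, not whether the skip pays in the typical case.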

B8. Scope decision

s27_io.py also runs run_s22a1. The CVN_S40_* flag is s40-scoped; extending the skip to s27 (and other s2x consumers) is a separate follow-up, not bundled here — unless the operator widens scope.

B9. Files + ADRs

hamilton/s40_io.py (wire the gate at the call site) · PG ftf_config.base_env (per-mission default, Console) · tests/unit/test_s40_perf.py (extend) + an integration parity test · the verdict-consumer surfaces that must propagate SKIPPED (B6bis) · the registry-enforcement test (§C0). ADR-59 (config), ADR-58 (guardrail+test), ADR-25 (no silent fallback). Small, but gated: wire only after the B7 value-gate confirms the skip pays.


Part C — Lever #3 design: dataset-only capture (single-fit as fallback)

Maturity: PARKED (operator verdict 2026-05-26 — tight park). Diagnostic gates ran: G-C0 → only s40 is xy_only (s23/s24/s26/s27 are trial_dynamics); G-C4b → NO CLEAN SEAM (dataset_only infeasible without harness surgery, single_fit suspect); with G-A1 high reuse, Lever #1 already covers the re-capture. The three legs of a useful implementation are simultaneously absent ⇒ not implemented. Re-open only if all three hold (reuse drops AND a seam appears AND scope > s40); single live tripwire = reuse on Part A's full Gate 1. Decision record: documentation/reviews/2026-05-26-cvn-n001-ei-s07-lever3-decision.md. The analysis below is retained for revival.

C1. Problem

serialize_capture() reconstitutes a fold by replaying the full original FTF training — ~51 trials/cell (255 trials for defi_top5's 5 cells), the bulk of the ~10 CPU-h. But the captured-fold parquet (X_tr,y_tr,X_va,y_va) is the data fed to training: if it is invariant across HPO trials (trials vary hyperparameters, not the X/y), replaying 51 trials to obtain it is, for X/y-only consumers, pure waste.

C2. Hypothesis — to be proven, not asserted

The captured X/y is identical across all trials; the multi-trial replay is required only by diagnostics consuming trial dynamics (s22a2 multi-seed, s22a3 learning curve), not by X/y-only audits.

The doc must prove the code respects this separation, because X/y can in fact vary across trials via: sampling, seed, feature selection, conditional preprocessing, class weighting, filtering, train/val materialization timing, row ordering, lazy transforms, dataset-enriching callbacks, trial-specific pruning, trial-specific target prep, accidental in-place mutation. "Trials vary hyperparams not data" is the claim; C4 (invariance gate) is the proof.

C3. dataset_only over single_fit (target architecture)

Prefer dataset_only: run data preparation until the captured X/y exists, then stop — do not train. A single fit is suspect when the goal is only to materialize the dataset (it can carry hyperparam-dependent side effects and give a false impression of compatibility with dynamics diagnostics). Capture modes:

full_replay   → executes the HPO / trial dynamics (status quo; default)
dataset_only  → executes data prep until captured X/y exists; does NOT train   ← target
single_fit    → fallback ONLY if the code can't yet emit X/y before the fit

C0. Gate A — consumer-classification gate (blocking, before any code)

Every diagnostic consumer is classified; unknown ⇒ full_replay. Shared with Lever #2 (§B4 rule).

DIAGNOSTIC_CAPTURE_REQUIREMENTS = {
    "s40":   "xy_only",
    "s22a1": "trial_dynamics",
    "s22a2": "trial_dynamics",
    "s22a3": "trial_dynamics",
    # s23 / s24 / s26 / s27 : TBD by the consumer map — until classified, they are `unknown` → full_replay
}

Safety rule (blocking): dataset_only is opt-in per diagnostic and forbidden unless the diagnostic is explicitly registered xy_only. trial_dynamics and unknown ⇒ full_replay. This makes "speed up an audit that secretly needs trial dynamics" impossible by default, not just discouraged.

Enforcement (round-11) — a static manual registry is the weak link. A diagnostic classified xy_only today may grow a probe tomorrow that consumes trial dynamics; the registry won't self-update → "correct at classification time, silently wrong if the consumer evolves". So the classification needs a runtime guard, not just a table: an assertion/test that FAILS if an xy_only-registered diagnostic accesses anchor / trial-dynamics outputs (e.g. the anchor object is withheld in dataset_only mode → any access raises a clear tooling error), plus a periodic registry re-review. The same registry + guard is shared by Lever #2's skip rule (§B4).
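The safety rule and the runtime guard could look like the sketch below. `capture_mode_for` and `WithheldAnchor` are hypothetical names; the registry entries repeat the classified rows above, and the sketch chooses "force full_replay" over "raise" for non-eligible diagnostics.

```python
DIAGNOSTIC_CAPTURE_REQUIREMENTS = {
    "s40": "xy_only",
    "s22a1": "trial_dynamics",
    "s22a2": "trial_dynamics",
    "s22a3": "trial_dynamics",
}


def capture_mode_for(diagnostic: str, requested: str = "full_replay") -> str:
    """dataset_only only for explicitly registered xy_only consumers."""
    requirement = DIAGNOSTIC_CAPTURE_REQUIREMENTS.get(diagnostic, "unknown")
    if requested == "dataset_only" and requirement == "xy_only":
        return "dataset_only"
    # trial_dynamics and unknown both fall back to the status quo.
    return "full_replay"


class WithheldAnchor:
    """Placeholder handed to consumers in dataset_only mode.

    Any attribute access fails loudly, so an xy_only diagnostic that starts
    consuming anchor or trial-dynamics outputs is caught at runtime.
    """

    def __getattr__(self, name: str):
        raise RuntimeError(
            f"tooling error: anchor output {name!r} accessed in dataset_only mode"
        )
```

Passing a `WithheldAnchor` in place of the real anchor object turns a stale registry entry into an immediate, attributable failure instead of a silently wrong verdict.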

C4. Gate B — trial-invariance gate (blocking, before any code)

On K representative folds, from an existing full-replay capture, compare per-trial snapshots:

canonical_hash(trial_i.{X_train,y_train,X_val,y_val,weights,split,schema,dtypes,row_order})
    == canonical_hash(trial_0.<same>)   for all trials i

Pass ⇒ X/y is trial-invariant → dataset_only is sound. Any single unexplained divergence → STOP and diagnose before implementation (the hypothesis C2 is false or the code violates the separation) → lever #3 abandoned or re-scoped.
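The per-trial comparison can be sketched as follows. This is not the project's canonical_hash: it assumes each trial snapshot is already reduced to JSON-serializable columns (lists of numbers or strings), and hashes a key-sorted JSON rendering so that row order stays significant while dictionary key order does not.

```python
import hashlib
import json
from typing import List, Mapping, Sequence


def canonical_hash(snapshot: Mapping[str, Sequence]) -> str:
    """Hash a snapshot; column order is irrelevant, row order is not."""
    payload = json.dumps(
        {name: list(values) for name, values in snapshot.items()},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def divergent_trials(snapshots: Sequence[Mapping[str, Sequence]]) -> List[int]:
    """Indices of trials whose snapshot differs from trial 0."""
    if not snapshots:
        raise ValueError("no trial snapshots")
    reference = canonical_hash(snapshots[0])
    return [i for i, snap in enumerate(snapshots)
            if canonical_hash(snap) != reference]
```

An empty result corresponds to "Pass"; any returned index is a divergence to diagnose before implementation.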

⚠ Empirical ≠ structural (round-11 — the central risk). C4 proves invariance on the K folds / trials that actually ran. The C2 failure mechanisms (seed, sampling, conditional preprocessing, lazy transform…) are structural — an untested seed/crypto/config could trigger one on a fold outside K, producing silently wrong X/y with no error: the audit runs on the wrong data and returns a false-confident verdict. So C4 must be doubled with a code-level argument: prove the data-prep path is upstream of, and independent from, the trial loop (a structural read of the code, not just hash coincidence on a sample) — and dataset_only in production must keep a periodic invariance assertion (occasionally capture one trial in full_replay and compare to the dataset_only output), not rely on a one-shot gate. The gate proves "invariant on what we tested"; the risk lives outside the sample.

C4bis. Gate C — prep/train seam (does the target even exist? — round-11)

dataset_only ("run data prep until X/y exists, then stop — don't train") presupposes a clean seam between data preparation and training. But if X/y is only materialized during the first fit (a lazy transform or a fit-time callback — again C2's list), then "stop before training" is impossible without surgery, and the target collapses to the single_fit the doc itself calls suspect. So the diagnostic phase needs a third finding beside C0/C4: does a pre-training seam exist (is X/y fully materialized before any lgb.train)? This is what decides whether the clean dataset_only target is reachable or whether Lever #3 is only the degraded fallback — i.e. whether the lever is worth doing at all.

C5. Conditional implementation (only if C0 + C4 + C4bis pass)

Add CVN_DIAGNOSTIC_CAPTURE_MODE (full_replay default; dataset_only; single_fit fallback) to s18_step1_capture / s18_step0_replay, default in PG ftf_config (ADR-59), A/B-testable behind the env var (ADR-56). Override to dataset_only only for registered xy_only consumers (§C0).

C6. Interaction with Levers #1 and #2

#1 caches the output (re-runs free); #3 makes the cold capture cheap; #2 drops the s22a1 audit-side preamble.

⚠ #1 and #3 are inversely correlated, not purely additive (round-11). #1 makes re-runs free; #3 makes the cold run cheaper. If Part-A Gate 1 shows high reuse, you rarely pay cold → #3's value is marginal (deprioritize). If reuse is low, #1 is worth little and #3 becomes the primary lever. They are substitutes at the margin. So #3's priority is branched on the Gate-1 number — not a fixed "they compose". Sequencing: #2 first (ready, after its B7 value-gate) → #3 diagnostic phase (C0 + C4 + C4bis) → #3 impl iff gates pass and Gate-1 reuse is low enough to justify it → #1 (parked) remains the foundation for repeated audits.

C7. Release parity gate (before default-ON)

On K folds: canonical_hash(dataset_only_parquet) == canonical_hash(full_replay_parquet) and s40_verdict(dataset_only) == s40_verdict(full_replay). Mandatory before flipping any default to dataset_only.

C8. Tests

  1. Trial-invariance diagnostic — per-trial X/y hashes from a full replay → assert invariance (Gate B).
  2. Consumer registry — unknown ⇒ full_replay · trial_dynamics ⇒ full_replay · xy_only ⇒ may use dataset_only.
  3. Dataset-only parity — dataset_only parquet == full_replay parquet (canonical hash).
  4. Verdict parity — s40 full_replay verdict == dataset_only verdict.
  5. Safety — attempting dataset_only on s22a1/s22a2/s22a3 raises a clear tooling error or forces full_replay.
  6. Performance (benchmark, non-blocking) — dataset_only cold wall-time ≪ full_replay; record p50/p95, no hard threshold initially.
  7. Periodic prod invariance (round-11) — in dataset_only prod, occasionally capture one trial in full_replay and assert canonical_hash parity (catches structural drift outside the one-shot gate, §C4).

(Fixture note: C4, C7 and Part-A Gate 4 all re-run full_replay on K representative folds — share one fixture set across the three to avoid re-paying costly captures gate-by-gate.)

C9. Verdict & status

Do NOT approve for implementation. Approve the diagnostic phase only. Next action is not "implement single_fit" — it is prove X/y invariance (Gate B) + map consumers (Gate A). If both pass, the implementation target is dataset_only (single_fit only as a fallback). Tentative files (post-gate): s18_step1_capture.py, s18_step0_replay.py, CVN_DIAGNOSTIC_CAPTURE_MODE (ftf_config), the consumer registry, tests. ADR-56 (env-var-gated, A/B), ADR-58 (guardrail+test), ADR-25.