Implementation plan — CVN-N001-EI-S07 diagnostic-harness perf (3 levers)

Story: CVN-N001-EI-S07 · wp#232 · GH #1071 · Epic CVN-N001-EI (#1055)
Design: CVN-N001-EI-S07-lever1-captured-fold-cache-design.md (Part A Lever #1 / Part B Lever #2 / Part C Lever #3), committee-validated (ca64c4d8) + 11 review rounds.
Status: plan; design frozen, no feature code started (resolver aa3140d0 aside).
Date: 2026-05-26
Branch: feat/CVN-N001-EI-S07-diagnostic-harness-perf
Revisions (round-12):
  • G-B7 decision uses p95/tail, not just p50.
  • Lever #3 gated on C0/C4/C4b only (G-A1 = priority, not correctness).
  • Lever #1 deferred, not parked, and next-priority if reuse is material.
  • G-C4 gains a "snapshots-absent → emit per-trial hashes via one full_replay" branch.
  • G-C4b: "no seam → design review, not auto single_fit".
  • Lever #2 Done requires visible certification downgrade.

Guiding principles (from the reviews)

  1. Gates before code. No feature code ships until its blocking gate passes. Building the measurement harness is allowed; running heavy workloads is operator-triggered (no-autonomous-launch).
  2. Cheap-first. Read-only/Loki measurements run before any in-cluster capture.
  3. Lever #2 first (smallest, impl-ready after its value-gate). After that, the next lever (#1 or #3) is selected by G-A1 + the C-gates, not fixed: high reuse → #1 (it delivers the re-audit <2 min promise); low reuse + C-gates pass → #3 (it cuts the cold path). #1 is deferred pending its gates, not parked indefinitely. G-A1 sets priority; it does not gate #3's correctness (#3 is gated by C0/C4/C4b).
  4. No silent failure (ADR-25): every skip/degradation surfaced; every gate produces a recorded number/decision.

Phase map

PHASE 0  measurement & diagnostic gates (read-only / light / one in-cluster) ── go/no-go numbers
   ├─ G-B7   run_s22a1 cost p50/p95            (Loki read)            → is Lever #2 worth wiring?
   ├─ G-A1   re-audit reuse per cell_ref       (PG/Airflow/Loki read) → Lever #1 value + Lever #3 priority
   ├─ G-C0   consumer classification map       (code read)            → which diagnostics may skip / dataset_only
   ├─ G-C4   trial-invariance proof            (code read + 1 capture)→ is dataset_only sound?
   └─ G-C4b  prep/train-seam check             (code read)            → is dataset_only reachable (vs single_fit)?
PHASE 1  Lever #2 impl   (iff G-B7 shows material p50 OR p95/tail cost)   → small PR
PHASE 2  Next lever, SELECTED by G-A1 + C-gates:
            high reuse              → Lever #1 (directly delivers re-audit <2 min)
            low reuse + C0/C4/C4b   → Lever #3 (cuts the cold path)
            both                    → choose by ROI
         (G-A1 informs PRIORITY; it does NOT gate Lever #3 correctness)
PHASE 3  the remaining validated lever, once its gates pass + the prior lever is stable
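
The Phase 2 selection rule in the map above can be written as a small decision function. This is only a sketch of the rule as stated; the function name and the returned labels are hypothetical, and the "neither condition holds" case is not defined by the plan, so it is surfaced as its own label instead of being guessed.

```python
def select_next_lever(high_reuse: bool, c_gates_pass: bool) -> str:
    """Pick the Phase 2 lever from G-A1 (reuse) and the C-gates (C0/C4/C4b).

    G-A1 only sets priority; Lever #3 correctness depends on c_gates_pass alone.
    Return labels are illustrative, not identifiers from the codebase.
    """
    if high_reuse and c_gates_pass:
        # Both levers are viable: the plan says to choose by ROI (a human call).
        return "choose_by_roi"
    if high_reuse:
        # Re-runs dominate: cache the captured fold (Lever #1).
        return "lever1"
    if c_gates_pass:
        # Low reuse but dataset_only is sound and reachable: cut the cold path.
        return "lever3"
    # Not covered by the phase map; recorded explicitly, never silently dropped.
    return "none_ready"
```

With high reuse and a failed C-gate, the function returns "lever1", which matches the sequence recorded in the Phase 0 results.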

PHASE 0 — Measurement & diagnostic gates

No feature code. Deliverable = a short gate report per item (numbers + go/no-go) appended to wp#232 / the design §9bis. The agent builds the read/harness; in-cluster runs are operator-triggered.

G-B7 — Lever #2 value gate (Loki read, ~free)

  • Task: query Loki for run_s22a1 elapsed_s across the defi_top5 cells (the data already exists from past runs) → p50/p95 + the 123× variance breakdown.
  • Tool: /loki-query (read-only) — no workload.
  • Decision (round-12 — p95/tail matters more than p50): if either p50 or p95 materially affects the s40 wall time → GO. Low p50 but p95 > ~60–120 s (or 100×+ unexplained variance causing unpredictable latency) → GO / limited rollout (the skip stabilizes the tail). Only if both p50 and p95 are negligible → defer. A low p50 alone does not defer the lever.
  • Owner: agent (read).
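
A minimal sketch of the G-B7 decision rule, assuming the elapsed_s samples have already been pulled from Loki into a list. The nearest-rank percentile, the function names, and both threshold defaults are assumptions chosen for illustration (the plan only gives "~60–120 s" for the tail); the returned labels mirror the three outcomes described above.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def g_b7_decision(
    elapsed_s: Sequence[float],
    p50_material_s: float = 60.0,  # assumed threshold, not from the plan
    p95_material_s: float = 60.0,  # lower bound of the plan's "~60-120 s"
) -> str:
    """Apply the round-12 rule: a low p50 alone never defers the lever."""
    p50 = percentile(elapsed_s, 50)
    p95 = percentile(elapsed_s, 95)
    if p50 >= p50_material_s:
        return "GO"
    if p95 >= p95_material_s:
        # Low median but heavy tail: the skip stabilizes the tail.
        return "GO_LIMITED_ROLLOUT"
    return "DEFER"
```

The samples passed in should be the raw per-run values, since the tail branch depends on p95 and cannot be recovered from an average.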

G-A1 — value/reuse gate (PG/Airflow/Loki read, ~free)

  • Task: over 30–60 d, count diagnostics, distinct cell_ref, re-runs/cell_ref, median+p75/p90 gap → reuse %.
  • Decision: feeds Lever #1's go/no-go and Lever #3 priority (high reuse → deprioritize #3; low → #3 primary, §C6). Apply the induced-demand clause: low reuse may itself be a symptom of the 2h48 cost barrier, so treat it as a qualitative cross-check, not an automatic no-go.
  • Owner: agent (read).
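
The G-A1 counts can be sketched as follows, assuming the run history has been exported as (cell_ref, timestamp) pairs. The function name, the input shape, and the definition of reuse % (share of runs that repeat an already-seen cell_ref) are assumptions; the design may define reuse differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, Iterable, List, Tuple


def reuse_stats(runs: Iterable[Tuple[str, datetime]]) -> dict:
    """Summarize re-audit reuse per cell_ref over an observation window."""
    by_cell: Dict[str, List[datetime]] = defaultdict(list)
    for cell_ref, ts in runs:
        by_cell[cell_ref].append(ts)

    total = sum(len(stamps) for stamps in by_cell.values())
    # Every run after the first one on a cell_ref counts as a re-run.
    reruns = sum(len(stamps) - 1 for stamps in by_cell.values())

    gaps_h: List[float] = []
    for stamps in by_cell.values():
        stamps.sort()
        gaps_h += [
            (later - earlier).total_seconds() / 3600
            for earlier, later in zip(stamps, stamps[1:])
        ]

    return {
        "diagnostics": total,
        "distinct_cell_ref": len(by_cell),
        "reruns": reruns,
        "reuse_pct": 100 * reruns / total if total else 0.0,
        "median_gap_h": median(gaps_h) if gaps_h else None,
    }
```

The p75/p90 gap asked for in the task could be added on the same gaps_h list; it is omitted here to keep the sketch short.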

G-C0 — consumer-classification map (code read)

  • Task: classify every diagnostic consumer xy_only | trial_dynamics | unknown (start the DIAGNOSTIC_CAPTURE_REQUIREMENTS registry). s40=xy_only, s22a1/a2/a3=trial_dynamics, s23/s24/s26/s27=TBD→unknown until read.
  • Decision: defines where Lever #2 skip / Lever #3 dataset_only are permitted. Unknown ⇒ full_replay.
  • Owner: agent (read).
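
One possible shape for the registry, assuming it is a plain mapping from diagnostic name to a capture-need class. Only the name DIAGNOSTIC_CAPTURE_REQUIREMENTS and the three classes come from this plan; the enum, the helper, and the string keys are hypothetical. The key property is the fail-safe default: anything unknown or unregistered resolves to full_replay.

```python
from enum import Enum


class CaptureNeed(str, Enum):
    XY_ONLY = "xy_only"
    TRIAL_DYNAMICS = "trial_dynamics"
    UNKNOWN = "unknown"


# Initial classification from G-C0; s23/s24/s26/s27 stay UNKNOWN until read.
DIAGNOSTIC_CAPTURE_REQUIREMENTS = {
    "s40": CaptureNeed.XY_ONLY,
    "s22a1": CaptureNeed.TRIAL_DYNAMICS,
    "s22a2": CaptureNeed.TRIAL_DYNAMICS,
    "s22a3": CaptureNeed.TRIAL_DYNAMICS,
    "s23": CaptureNeed.UNKNOWN,
    "s24": CaptureNeed.UNKNOWN,
    "s26": CaptureNeed.UNKNOWN,
    "s27": CaptureNeed.UNKNOWN,
}


def allowed_capture_mode(diagnostic: str) -> str:
    """Return the cheapest capture mode a diagnostic may use.

    Unregistered names are treated as UNKNOWN, so they also get full_replay.
    """
    need = DIAGNOSTIC_CAPTURE_REQUIREMENTS.get(diagnostic, CaptureNeed.UNKNOWN)
    return "dataset_only" if need is CaptureNeed.XY_ONLY else "full_replay"
```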

G-C4 — trial-invariance proof (code read + 1 capture)

  • Task: (a) code-level: prove data-prep is upstream of, and independent from, the trial loop (structural read of s18_step0_replay/s18_step1_capture); (b) empirical: on K folds, assert canonical_hash(trial_i.Xy) == canonical_hash(trial_0.Xy) for all i. If per-trial X/y snapshots are not already persisted (the capture may only write the final parquet), instrument one controlled full_replay to emit per-trial canonical hashes only; do not persist every trial dataframe, since the hashes suffice for the proof. (This avoids G-C4 being blocked by an observability gap.)
  • Decision: pass ⇒ X/y trial-invariant → dataset_only sound. Any unexplained divergence → STOP (lever #3 abandoned/re-scoped).
  • Owner: agent builds the hash check; the capture (if none exists) is operator-triggered in-cluster.
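
The empirical half of G-C4 can be sketched with the standard library alone. The real canonical_hash is defined in the design and is not shown in this plan; the version below is a stand-in that assumes X/y can be represented as a column-name to list-of-values mapping, and the invariance check simply compares every trial's hash with trial 0.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def canonical_hash(xy: Dict[str, List]) -> str:
    """Stand-in canonical hash: column order must not affect the digest."""
    payload = json.dumps(
        {name: xy[name] for name in sorted(xy)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_trial_invariant(trial_hashes: Sequence[str]) -> None:
    """Raise if any trial's X/y hash differs from trial 0."""
    if not trial_hashes:
        raise ValueError("no trial hashes captured")
    reference = trial_hashes[0]
    diverged = [i for i, h in enumerate(trial_hashes) if h != reference]
    if diverged:
        # Any unexplained divergence is a STOP for Lever #3.
        raise AssertionError(f"X/y diverges from trial_0 at trials {diverged}")
```

Emitting only the hex digests per trial keeps the instrumented full_replay light, which is the point of the hashes-only branch.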

G-C4b — prep/train-seam check (code read)

  • Task: determine whether X/y is fully materialized before any lgb.train (clean seam) or only during the fit (lazy/callback).
  • Decision (round-12): seam exists → dataset_only target reachable. No clean seam → dataset_only not reachable → return to design review before considering single_fit (not an automatic fallback — a single fit can still trigger training side effects, §C3).
  • Owner: agent (read).

Phase 0 exit: a consolidated gate report → decides which of Phases 1/2/3 proceed and in what order.

Phase 0 — results (2026-05-26, read-only via Loki)

| Gate | Measurement | Decision |
| --- | --- | --- |
| G-B7 | s22a1_verdict.elapsed_s (n=8, 7d; 30d timed out): p50 234 s (~4 min), p95 860 s (~14 min), range 5–865 s, all best_iter=1 (123× variance at fixed n_rounds=300 → the B7 secondary investigation, likely data-size skew) | GO: Lever #2 worth wiring (both p50 & p95 material) |
| G-A1 | s18_step0_verdict (crypto+fold+elapsed): the 5 defi_top5 fold-3 cells (UNIUSDC/OPUSDC/LDOUSDC/AAVEUSDC/ARBUSDC) re-captured ≥2× in a single day (2026-05-25); cold replay ~2.2–2.75 h/cell; same cells re-audited across ~8 missions (s22a1→s40) | HIGH reuse |

Resulting sequence (round-12 selection logic): high reuse ⇒ after Lever #2, Lever #1 is the next priority (it caches the ~2.75 h cold capture re-paid on every re-audit), with Lever #3 secondary. Measured order: Lever #2 → Lever #1 → Lever #3. (G-C0/G-C4/G-C4b still gate Lever #3's correctness when it comes; G-A1 here only sets priority.)


PHASE 1 — Lever #2: optional run_s22a1 anchor (proceed if G-B7 shows material p50 OR p95/tail; defer only if both negligible)

Smallest, impl-ready. Resolver _resolve_s40_skip_s22a1 already merged (aa3140d0).

Tasks

  • T1.1: Wire the gate at the call site (src/commun/finetune/diagnostic/hamilton/s40_io.py:142): if _resolve_s40_skip_s22a1(): <skip path>. Skip path = require parquet exists → else INCONCLUSIVE_TOOLING (B6); recompute parquet SHA (informational); _load_captured_parquet directly; emit event=s40_s22a1_skipped.
  • T1.2: Output semantics (B6bis): set s22a1_crossref_status=SKIPPED, s22a1_anchor_available=false on S40Verdict; thread these fields through to every verdict consumer (report/dashboard/aggregator) so a downgraded audit is never shown as complete.
  • T1.3: Consumer registry + runtime guard (C0): DIAGNOSTIC_CAPTURE_REQUIREMENTS; an assertion/test that FAILS if an xy_only diagnostic accesses anchor outputs. Skip allowed only for xy_only.
  • T1.4: Default policy in PG ftf_config.base_env (ADR-59, Console UI only): OFF for reproduction missions (S22→S28), ON for light s40 audits. Run-level Airflow param override.
  • T1.5: Tests (B6): parity (skip ON==OFF probe outputs) · metadata downgrade · no-accidental-anchor · absent-parquet→INCONCLUSIVE_TOOLING · default-policy · registry-enforcement.
  • T1.6: Docs: OPERATIONS.md note on the flag + the certification-downgrade meaning.

Files: hamilton/s40_io.py, the verdict-consumer surfaces, the registry module, tests/unit/test_s40_perf.py (+integration parity), PG ftf_config (Console).
Review: touches src/commun/finetune/ → CodeRabbit + committee pr_review (ADR-68). MLOps readiness if it touches capture (ADR-70).
Done: PR merged + CI green + deploy + a real s40 light audit shows crossref=SKIPPED and an unchanged probe verdict, and the downgraded certification is visibly displayed on every consuming dashboard/report (no consumer renders a skipped audit as complete; B6bis propagation).
Scope note: s27_io.py also calls run_s22a1; that is a separate follow-up, not in this PR.
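
The skip path from T1.1/T1.2 might look roughly like the sketch below. It is deliberately decoupled from the real module: the merged resolver's signature and the real _load_captured_parquet are not shown in this plan, so the flag value, the loader, and the event emitter are passed in as arguments, and SkipOutcome is a hypothetical stand-in for the relevant S40Verdict fields. Only the field names, the status values, and the event name come from the tasks above.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Optional


@dataclass
class SkipOutcome:
    s22a1_crossref_status: str            # "SKIPPED" or "INCONCLUSIVE_TOOLING"
    s22a1_anchor_available: bool
    parquet_sha256: Optional[str] = None  # informational only
    data: Any = None


def s40_skip_path(
    skip_enabled: bool,
    parquet_path: Path,
    load_parquet: Callable[[Path], Any],
    emit: Callable[[dict], None],
) -> Optional[SkipOutcome]:
    """Return None when the anchor must run; otherwise the skip outcome."""
    if not skip_enabled:
        return None  # caller proceeds with the normal run_s22a1 anchor
    if not parquet_path.exists():
        # No captured parquet to load: surface it, never fall through silently.
        return SkipOutcome("INCONCLUSIVE_TOOLING", s22a1_anchor_available=False)
    sha = hashlib.sha256(parquet_path.read_bytes()).hexdigest()
    data = load_parquet(parquet_path)
    emit({"event": "s40_s22a1_skipped", "parquet_sha256": sha})
    return SkipOutcome("SKIPPED", False, parquet_sha256=sha, data=data)
```

Returning the status fields alongside the data makes it harder for a downstream consumer to render the result without also seeing that the anchor was skipped.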


Lever #3: dataset-only capture — PARKED (operator verdict 2026-05-26)

PARKED, not implemented. Gates ran: G-C0 → only s40 is xy_only; G-C4b → no clean prep/train seam (dataset_only infeasible); G-A1 → high reuse ⇒ Lever #1 already covers the re-capture. Three legs of a useful impl absent at once. Re-open only if all three hold; single live tripwire = reuse on Part A's full Gate 1. Decision record: documentation/reviews/2026-05-26-cvn-n001-ei-s07-lever3-decision.md. The task breakdown below is retained for revival.

Tasks

  • T2.1: CVN_DIAGNOSTIC_CAPTURE_MODE (full_replay default | dataset_only | single_fit fallback), default in PG ftf_config (ADR-59), A/B behind the env var (ADR-56).
  • T2.2: dataset_only path in s18_step1_capture/s18_step0_replay: run data-prep until X/y exists, do not train (only if G-C4b confirms the seam; if no seam → back to design review, not an automatic single_fit).
  • T2.3: Wire the consumer registry (from T1.3): dataset_only forbidden unless registered xy_only; unknown/trial_dynamics ⇒ full_replay.
  • T2.4: Periodic prod invariance assertion (C8.7): occasionally capture one trial in full_replay and assert canonical_hash parity (guards the empirical-vs-structural gap).
  • T2.5: Tests (C8): invariance · registry · dataset-only parity · verdict parity · safety (dataset_only on trial_dynamics → error/force full_replay) · perf benchmark.
  • T2.6: Release parity gate (C7): canonical_hash(dataset_only)==full_replay + s40_verdict parity on K folds before any default flip.

Files: s18_step1_capture.py, s18_step0_replay.py, registry, ftf_config (Console), tests.
Review: CodeRabbit + committee pr_review + MLOps readiness (changes capture behaviour). Likely a committee touch on the design delta since it changes capture semantics.
Done: release parity gate passes; dataset_only default-ON only for registered xy_only consumers.
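
For revival purposes, the T2.1/T2.3/T2.5 safety rule can be sketched as a mode resolver. Only the env var name and the three mode values come from the task list; the function name, the consumer_need strings, and the (mode, forced) return shape are assumptions. This sketch takes the "force full_replay" option from T2.5 and reports that it did so, so the downgrade is not silent.

```python
import os
from typing import Mapping, Optional, Tuple

VALID_MODES = ("full_replay", "dataset_only", "single_fit")


def resolve_capture_mode(
    consumer_need: str,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, bool]:
    """Return (effective_mode, forced).

    forced is True when dataset_only was requested for a consumer that is
    not registered xy_only and the mode was overridden to full_replay.
    """
    env = os.environ if env is None else env
    requested = env.get("CVN_DIAGNOSTIC_CAPTURE_MODE", "full_replay")
    if requested not in VALID_MODES:
        raise ValueError(f"unknown capture mode: {requested!r}")
    if requested == "dataset_only" and consumer_need != "xy_only":
        # unknown / trial_dynamics consumers always get the full replay.
        return "full_replay", True
    return requested, False
```

A caller would log or emit an event whenever forced is True, in line with the no-silent-failure principle.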


Lever #1: captured-fold reproducibility artifact (deferred pending its gates — NOT parked indefinitely; it is the lever that directly delivers the "re-audit < 2 min" promise)

Phase position: if G-A1 shows material reuse, Lever #1 becomes the next priority right after Lever #2 (re-runs are where the win is).

Resume on its own entry gates (§9bis): Gate 1 value/reuse (= G-A1) + Gate 3 size (piggyback the next cold audit). Then implement per Part A §15 Files (new DIAGNOSTIC_CAPTURE entity, get_diagnostic_capture, PG advisory-lock single-flight, canonical hashes, S3 lifecycle), with Gate 4 (cold→warm in-cluster: reproducibility + <2 min + single-flight under contention) as the release gate before retiring skip_phase_a. D11 (--check-drift policy) resolved by Gate 2 drift-rate, and its committee waiver reopens if Gate 2 implies default-on/short-TTL.
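
For the PG advisory-lock single-flight item, one common approach is to derive a stable 64-bit key per capture and pass it to PostgreSQL's pg_try_advisory_lock, which takes a bigint. The sketch below shows only the key derivation and the query text; the function name and the "cell_ref" key format are assumptions, and Part A may specify a different key scheme.

```python
import hashlib
import struct

# Parameterized query text; execution is left to whatever PG driver is in use.
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"


def advisory_lock_key(cell_ref: str) -> int:
    """Map a cell_ref to a signed 64-bit integer usable as a bigint lock key.

    The first 8 bytes of SHA-256 are read as a big-endian signed integer,
    so the same cell_ref always yields the same key across processes.
    """
    digest = hashlib.sha256(cell_ref.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]
```

A worker that fails to obtain the lock would wait for, or read, the capture produced by the holder instead of starting a second cold replay; that behaviour is what Gate 4 checks under contention.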

Note (§C6): if G-A1 shows high reuse, Lever #1 is the high-value lever and #3 is marginal; if low, #3 leads and #1 is lower-value. Priority is set by the G-A1 number, not fixed.


Cross-cutting

  • Observability (ADR-26/30/32, design §8): emit the capture_*/s40_s22a1_skipped/capture_lease_* events + the warm-path metrics; Grafana SLO for Lever #1.
  • Shared fixtures: G-C4, Lever-3 C7, and Lever-1 Gate 4 all re-run full_replay on K folds — one fixture set, captured once, reused.
  • Process (CLAUDE.md): work under wp#232 (In progress); one PR per lever-phase; squash-merge on explicit operator go; CodeRabbit + committee for src/ changes; OP/issue updated.
  • Who triggers what: agent runs read-only gates (G-B7, G-A1, G-C0, G-C4b) + builds harnesses; operator launches in-cluster captures (G-C4 capture, Lever-1 Gate 4) and every merge.

Immediate next action

Phase 0, read-only first: run G-B7 (Loki run_s22a1 p50/p95) and G-A1 (re-audit reuse per cell_ref) — both pure reads, within the no-autonomous-launch policy — to decide whether Lever #2 is worth wiring and how to prioritize #1 vs #3. These two numbers gate everything downstream.