Implementation plan — CVN-N001-EI-S07 diagnostic-harness perf (3 levers)
Story: CVN-N001-EI-S07 — wp#232 · GH #1071 · Epic CVN-N001-EI (#1055)
Design: CVN-N001-EI-S07-lever1-captured-fold-cache-design.md (Part A Lever #1 / Part B Lever #2 / Part C Lever #3) — committee-validated (ca64c4d8) + 11 review rounds.
Status: plan — design frozen, no feature code started (resolver aa3140d0 aside).
Date: 2026-05-26
Branch: feat/CVN-N001-EI-S07-diagnostic-harness-perf
Revisions: round-12 — G-B7 decision uses p95/tail not just p50; Lever #3 gated on C0/C4/C4b only (G-A1 = priority, not correctness); Lever #1 deferred, not parked, and next-priority if reuse is material; G-C4 gains a "snapshots-absent → emit per-trial hashes via one full_replay" branch; G-C4b "no seam → design review, not auto single_fit"; Lever #2 Done requires visible certification downgrade.
Guiding principles (from the reviews)
- Gates before code. No feature code ships until its blocking gate passes. Building the measurement harness is allowed; running heavy workloads is operator-triggered (no-autonomous-launch).
- Cheap-first. Read-only/Loki measurements run before any in-cluster capture.
- Lever #2 first (smallest, impl-ready after its value-gate). After that, the next lever (#1 or #3) is selected by G-A1 + the C-gates, not fixed: high reuse → #1 (it delivers the re-audit <2 min promise); low reuse + C-gates pass → #3 (it cuts the cold path). #1 is deferred pending its gates, not parked indefinitely. G-A1 sets priority; it does not gate #3's correctness (#3 is gated by C0/C4/C4b).
- No silent failure (ADR-25): every skip/degradation surfaced; every gate produces a recorded number/decision.
Phase map
PHASE 0  measurement & diagnostic gates (read-only / light / one in-cluster) ── go/no-go numbers
  ├─ G-B7   run_s22a1 cost p50/p95 (Loki read)                 → is Lever #2 worth wiring?
  ├─ G-A1   re-audit reuse per cell_ref (PG/Airflow/Loki read) → Lever #1 value + Lever #3 priority
  ├─ G-C0   consumer classification map (code read)            → which diagnostics may skip / dataset_only
  ├─ G-C4   trial-invariance proof (code read + 1 capture)     → is dataset_only sound?
  └─ G-C4b  prep/train-seam check (code read)                  → is dataset_only reachable (vs single_fit)?
        │
        ▼
PHASE 1  Lever #2 impl (iff G-B7 shows material p50 OR p95/tail cost) → small PR
PHASE 2  Next lever, SELECTED by G-A1 + C-gates:
           high reuse             → Lever #1 (directly delivers re-audit <2 min)
           low reuse + C0/C4/C4b  → Lever #3 (cuts the cold path)
           both                   → choose by ROI
         (G-A1 informs PRIORITY; it does NOT gate Lever #3 correctness)
PHASE 3  the remaining validated lever, once its gates pass + the prior lever is stable
PHASE 0 — Measurement & diagnostic gates
No feature code. Deliverable = a short gate report per item (numbers + go/no-go) appended to wp#232 / the design §9bis. The agent builds the read/harness; in-cluster runs are operator-triggered.
G-B7 — Lever #2 value gate (Loki read, ~free)
- Task: query Loki for `run_s22a1` `elapsed_s` across the defi_top5 cells (the data already exists from past runs) → p50/p95 + the 123× variance breakdown.
- Tool: `/loki-query` (read-only) — no workload.
- Decision (round-12 — p95/tail matters more than p50): if either p50 or p95 materially affects the s40 wall time → GO. Low p50 but p95 > ~60–120 s (or 100×+ unexplained variance causing unpredictable latency) → GO / limited rollout (the skip stabilizes the tail). Only if both p50 and p95 are negligible → defer. A low p50 alone does not defer the lever.
- Owner: agent (read).
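The decision rule above can be sketched as a small function. This is a minimal illustration rather than the harness code: `percentile`, `gb7_decision`, the `GO_LIMITED` label, and both thresholds (`material_s`, `spread_limit`) are assumptions. The plan gives only "~60–120 s" for the tail and "100×+" for the spread, and no numeric threshold for a material p50.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (q in (0, 100]) of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def gb7_decision(
    elapsed_s: Sequence[float],
    material_s: float = 60.0,     # assumed materiality threshold (seconds)
    spread_limit: float = 100.0,  # assumed max/min ratio treated as "100x+ variance"
) -> str:
    """Return GO, GO_LIMITED or DEFER from run_s22a1 elapsed_s samples."""
    p50 = percentile(elapsed_s, 50)
    p95 = percentile(elapsed_s, 95)
    spread = max(elapsed_s) / max(min(elapsed_s), 1e-9)

    if p50 >= material_s:
        return "GO"
    # Low p50 but a heavy tail or an unexplained spread still justifies the lever.
    if p95 >= material_s or spread >= spread_limit:
        return "GO_LIMITED"
    return "DEFER"
```

The ordering of the checks encodes the round-12 rule: a low p50 never short-circuits to `DEFER`; only a negligible p50 together with a negligible tail does.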
G-A1 — value/reuse gate (PG/Airflow/Loki read, ~free)
- Task: over 30–60 d, count diagnostics, distinct `cell_ref`, re-runs/cell_ref, median+p75/p90 gap → reuse %.
- Decision: feeds Lever #1's go/no-go and Lever #3 priority (high reuse → deprioritize #3; low → #3 primary, §C6). Apply the induced-demand clause (low reuse may itself be an effect of the 2h48 cost barrier — qualitative cross-check, not auto-no-go).
- Owner: agent (read).
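A minimal sketch of the G-A1 aggregation, assuming the reads have already been flattened into `(cell_ref, started_at)` pairs. The function name, the output keys, and in particular the definition of reuse % (share of runs that repeat an already-seen `cell_ref`) are assumptions; the plan names the inputs but does not define the formula.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def _nearest_rank(values, q):
    """Nearest-rank percentile of a non-empty list (q in (0, 100])."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q / 100 * len(ordered))) - 1]


def reuse_report(runs):
    """Summarise re-audit reuse from (cell_ref, started_at) pairs."""
    by_cell = defaultdict(list)
    for cell_ref, started_at in runs:
        by_cell[cell_ref].append(started_at)
    if not by_cell:
        raise ValueError("no diagnostic runs in the window")

    total = sum(len(stamps) for stamps in by_cell.values())
    reruns = total - len(by_cell)  # every run after the first one per cell_ref

    gaps_h = []  # hours between consecutive runs of the same cell_ref
    for stamps in by_cell.values():
        stamps.sort()
        gaps_h.extend((b - a).total_seconds() / 3600 for a, b in zip(stamps, stamps[1:]))

    return {
        "diagnostics": total,
        "distinct_cell_ref": len(by_cell),
        "reruns_per_cell_ref": reruns / len(by_cell),
        "reuse_pct": 100.0 * reruns / total,  # assumed definition of reuse %
        "gap_h_median": median(gaps_h) if gaps_h else None,
        "gap_h_p75": _nearest_rank(gaps_h, 75) if gaps_h else None,
        "gap_h_p90": _nearest_rank(gaps_h, 90) if gaps_h else None,
    }


# Illustrative data only (not measured values).
_t0 = datetime(2026, 1, 1)
example = reuse_report([
    ("A", _t0),
    ("A", _t0 + timedelta(hours=2)),
    ("A", _t0 + timedelta(hours=6)),
    ("B", _t0),
])
```

Because of the induced-demand clause, the resulting `reuse_pct` is an input to the decision, not the decision itself.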
G-C0 — consumer-classification map (code read)
- Task: classify every diagnostic consumer `xy_only | trial_dynamics | unknown` (start the `DIAGNOSTIC_CAPTURE_REQUIREMENTS` registry). s40=xy_only, s22a1/a2/a3=trial_dynamics, s23/s24/s26/s27=TBD→unknown until read.
- Decision: defines where Lever #2 skip / Lever #3 dataset_only are permitted. Unknown ⇒ full_replay.
- Owner: agent (read).
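One possible starting shape for the registry, shown as a sketch. Only the name `DIAGNOSTIC_CAPTURE_REQUIREMENTS`, the three class labels, and the initial classifications come from the plan; the enum, the dict layout, and the helpers `classify` and `may_skip_anchor` are hypothetical.

```python
from enum import Enum


class ConsumerClass(Enum):
    XY_ONLY = "xy_only"
    TRIAL_DYNAMICS = "trial_dynamics"
    UNKNOWN = "unknown"


# Initial classification from G-C0; s23/s24/s26/s27 stay UNKNOWN until read.
DIAGNOSTIC_CAPTURE_REQUIREMENTS = {
    "s40": ConsumerClass.XY_ONLY,
    "s22a1": ConsumerClass.TRIAL_DYNAMICS,
    "s22a2": ConsumerClass.TRIAL_DYNAMICS,
    "s22a3": ConsumerClass.TRIAL_DYNAMICS,
    "s23": ConsumerClass.UNKNOWN,
    "s24": ConsumerClass.UNKNOWN,
    "s26": ConsumerClass.UNKNOWN,
    "s27": ConsumerClass.UNKNOWN,
}


def classify(diagnostic: str) -> ConsumerClass:
    """Unregistered diagnostics are treated as UNKNOWN (the conservative default)."""
    return DIAGNOSTIC_CAPTURE_REQUIREMENTS.get(diagnostic, ConsumerClass.UNKNOWN)


def may_skip_anchor(diagnostic: str) -> bool:
    """The Lever #2 skip is permitted only for consumers registered as xy_only."""
    return classify(diagnostic) is ConsumerClass.XY_ONLY
```

Defaulting missing entries to `UNKNOWN` keeps the "Unknown ⇒ full_replay" rule true even for diagnostics added after the registry was written.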
G-C4 — trial-invariance proof (code read + 1 capture)
- Task: (a) code-level: prove data-prep is upstream of + independent from the trial loop (structural read of `s18_step0_replay`/`s18_step1_capture`); (b) empirical: on K folds, assert `canonical_hash(trial_i.Xy) == canonical_hash(trial_0.Xy)` for all i. If per-trial X/y snapshots are not already persisted (the capture may only write the final parquet), instrument one controlled `full_replay` to emit per-trial canonical hashes only — do not persist every trial dataframe; the hashes suffice for the proof. (Avoids G-C4 being blocked by an observability gap.)
- Decision: pass ⇒ X/y trial-invariant → dataset_only sound. Any unexplained divergence → STOP (lever #3 abandoned/re-scoped).
- Owner: agent builds the hash check; the capture (if none exists) is operator-triggered in-cluster.
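The empirical half of G-C4 can be illustrated with a stdlib-only sketch. The plan does not define `canonical_hash`; the version below (SHA-256 over rows serialised as JSON with sorted keys, row order preserved) is an assumption made for illustration, as are the row-of-dicts representation and the helper `assert_trial_invariant`.

```python
import hashlib
import json


def canonical_hash(rows, labels) -> str:
    """Hash X/y so that column order does not matter but row order does.

    rows: list of dicts (column -> value); labels: list of targets.
    """
    if len(rows) != len(labels):
        raise ValueError("X and y have different lengths")
    digest = hashlib.sha256()
    for row, label in zip(rows, labels):
        line = json.dumps([row, label], sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assert_trial_invariant(trial_hashes) -> None:
    """trial_hashes: {trial_index: hash}. Raise if any trial differs from trial 0."""
    reference = trial_hashes[0]
    diverged = sorted(i for i, h in trial_hashes.items() if h != reference)
    if diverged:
        raise AssertionError(f"X/y differs from trial 0 in trials {diverged}")
```

Emitting only the hex digest per trial keeps the instrumented `full_replay` cheap: nothing but one short string per trial has to leave the capture.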
G-C4b — prep/train-seam check (code read)
- Task: determine whether X/y is fully materialized before any `lgb.train` (clean seam) or only during the fit (lazy/callback).
- Decision (round-12): seam exists → `dataset_only` target reachable. No clean seam → `dataset_only` not reachable → return to design review before considering `single_fit` (not an automatic fallback — a single fit can still trigger training side effects, §C3).
- Owner: agent (read).
Phase 0 exit: a consolidated gate report → decides which of Phases 1/2/3 proceed and in what order.
Phase 0 — results (2026-05-26, read-only via Loki)
| Gate | Measurement | Decision |
|---|---|---|
| G-B7 | s22a1_verdict.elapsed_s (n=8, 7d; 30d timed out): p50 234 s (~4 min), p95 860 s (~14 min), range 5–865 s, all best_iter=1 (123× variance at fixed n_rounds=300 → the B7 secondary investigation, likely data-size skew) | GO — Lever #2 worth wiring (both p50 & p95 material) |
| G-A1 | s18_step0_verdict (crypto+fold+elapsed): the 5 defi_top5 fold-3 cells (UNIUSDC/OPUSDC/LDOUSDC/AAVEUSDC/ARBUSDC) re-captured ≥2× in a single day (2026-05-25); cold replay ~2.2–2.75 h/cell; same cells re-audited across ~8 missions (s22a1→s40) | HIGH reuse |
Resulting sequence (round-12 selection logic): high reuse ⇒ after Lever #2, Lever #1 is the next priority (it caches the ~2.75 h cold capture re-paid on every re-audit), with Lever #3 secondary. Measured order: Lever #2 → Lever #1 → Lever #3. (G-C0/G-C4/G-C4b still gate Lever #3's correctness when it comes; G-A1 here only sets priority.)
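The round-12 selection logic can be written down as a small pure function. This is a sketch under assumptions: `select_next_lever`, the `lever_1`/`lever_3` labels, and the shape of the `roi` argument are hypothetical, and the plan does not say what happens when reuse is low and the C-gates fail, so that case returns `None`.

```python
from typing import Mapping, Optional


def select_next_lever(
    high_reuse: bool,
    c_gates_pass: bool,
    roi: Optional[Mapping[str, float]] = None,
) -> Optional[str]:
    """Pick the lever that follows Lever #2.

    high_reuse:   outcome of G-A1 (sets priority only).
    c_gates_pass: G-C0, G-C4 and G-C4b all passed (gates Lever #3 correctness).
    roi:          estimated return per lever, needed only when both are eligible.
    """
    if high_reuse and c_gates_pass:
        if roi is None:
            raise ValueError("both levers eligible: ROI estimates required")
        return max(("lever_1", "lever_3"), key=lambda name: roi[name])
    if high_reuse:
        return "lever_1"
    if c_gates_pass:
        return "lever_3"
    return None  # not specified by the plan; left to review
```

G-A1 never appears as a precondition of `lever_3` on its own; it only decides the tie-break path, which matches the "priority, not correctness" rule.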
PHASE 1 — Lever #2: optional run_s22a1 anchor (proceed if G-B7 shows material p50 OR p95/tail; defer only if both negligible)
Smallest, impl-ready. Resolver _resolve_s40_skip_s22a1 already merged (aa3140d0).
Tasks
- T1.1 — Wire the gate at the call site (src/commun/finetune/diagnostic/hamilton/s40_io.py:142): if _resolve_s40_skip_s22a1(): <skip path>. Skip path = require parquet exists → else INCONCLUSIVE_TOOLING (B6); recompute parquet SHA (informational); _load_captured_parquet directly; emit event=s40_s22a1_skipped.
- T1.2 — Output semantics (B6bis): set s22a1_crossref_status=SKIPPED, s22a1_anchor_available=false on S40Verdict; thread these fields through to every verdict consumer (report/dashboard/aggregator) so a downgraded audit is never shown as complete.
- T1.3 — Consumer registry + runtime guard (C0): DIAGNOSTIC_CAPTURE_REQUIREMENTS; an assertion/test that FAILS if an xy_only diagnostic accesses anchor outputs. Skip allowed only for xy_only.
- T1.4 — Default policy in PG ftf_config.base_env (ADR-59, Console UI only): OFF for reproduction missions (S22→S28), ON for light s40 audits. Run-level Airflow param override.
- T1.5 — Tests (B6): parity (skip ON==OFF probe outputs) · metadata downgrade · no-accidental-anchor · absent-parquet→INCONCLUSIVE_TOOLING · default-policy · registry-enforcement.
- T1.6 — Docs: OPERATIONS.md note on the flag + the certification-downgrade meaning.
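The skip path described in T1.1 could look roughly like the sketch below. It is not the code at the call site: `s40_skip_path`, the returned dict shape, and the injected `load_parquet` callable (standing in for `_load_captured_parquet`) are assumptions, and treating `INCONCLUSIVE_TOOLING` as a returned status rather than an exception is a guess.

```python
import hashlib
import logging
from pathlib import Path
from typing import Any, Callable, Dict

log = logging.getLogger("s40")


def s40_skip_path(parquet_path: Path, load_parquet: Callable[[Path], Any]) -> Dict[str, Any]:
    """Load the captured parquet directly, without running the s22a1 anchor."""
    if not parquet_path.is_file():
        # No capture to fall back on: surface it, never degrade silently (ADR-25).
        log.error("event=s40_s22a1_skip_failed reason=parquet_missing path=%s", parquet_path)
        return {"status": "INCONCLUSIVE_TOOLING"}

    # Informational only: the SHA is recorded, not compared against anything here.
    sha256 = hashlib.sha256(parquet_path.read_bytes()).hexdigest()
    data = load_parquet(parquet_path)
    log.warning("event=s40_s22a1_skipped path=%s sha256=%s", parquet_path, sha256)

    return {
        "status": "OK",
        "data": data,
        "parquet_sha256": sha256,
        "s22a1_crossref_status": "SKIPPED",
        "s22a1_anchor_available": False,
    }
```

Injecting the loader keeps the sketch testable without a parquet reader; at the real call site the branch would be guarded by `_resolve_s40_skip_s22a1()`.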
Files: hamilton/s40_io.py, the verdict-consumer surfaces, the registry module, tests/unit/test_s40_perf.py (+integration parity), PG ftf_config (Console).
Review: touches src/commun/finetune/ → CodeRabbit + committee pr_review (ADR-68). MLOps readiness if it touches capture (ADR-70).
Done: PR merged + CI green + deploy + a real s40 light audit shows crossref=SKIPPED and an unchanged probe verdict, and the downgraded certification is visibly displayed on every consuming dashboard/report (no consumer renders a skipped audit as complete — B6bis propagation).
Scope note: s27_io.py also calls run_s22a1 — separate follow-up, not in this PR.
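To make the B6bis propagation requirement concrete, here is a minimal sketch of a consumer-side label that cannot render a skipped audit as complete. The two `s22a1_*` field names come from T1.2; the reduced `S40Verdict` shape, the `"RAN"` value for the non-skipped case, the `"PASS"` verdict, and `certification_label` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S40Verdict:
    """Reduced, hypothetical view of the verdict: only the fields used here."""
    probe_verdict: str
    s22a1_crossref_status: str = "RAN"  # hypothetical value for the non-skipped case
    s22a1_anchor_available: bool = True


def certification_label(verdict: S40Verdict) -> str:
    """Text a report or dashboard would show next to the probe verdict."""
    downgraded = (
        verdict.s22a1_crossref_status == "SKIPPED"
        or not verdict.s22a1_anchor_available
    )
    if downgraded:
        return f"{verdict.probe_verdict} [DOWNGRADED: s22a1 anchor skipped]"
    return f"{verdict.probe_verdict} [complete]"


full_audit = S40Verdict("PASS")
light_audit = S40Verdict("PASS", "SKIPPED", False)
```

The pair `full_audit` / `light_audit` mirrors the parity test in T1.5: the probe verdict is identical, and only the certification label differs.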
Lever #3: dataset-only capture — PARKED (operator verdict 2026-05-26)
PARKED, not implemented. Gates ran: G-C0 → only `s40` is `xy_only`; G-C4b → no clean prep/train seam (`dataset_only` infeasible); G-A1 → high reuse ⇒ Lever #1 already covers the re-capture. Three legs of a useful impl absent at once. Re-open only if all three hold; single live tripwire = reuse on Part A's full Gate 1. Decision record: `documentation/reviews/2026-05-26-cvn-n001-ei-s07-lever3-decision.md`. The task breakdown below is retained for revival.
Tasks
- T2.1 — CVN_DIAGNOSTIC_CAPTURE_MODE (full_replay default | dataset_only | single_fit fallback), default in PG ftf_config (ADR-59), A/B behind the env var (ADR-56).
- T2.2 — dataset_only path in s18_step1_capture/s18_step0_replay: run data-prep until X/y exists, do not train (only if G-C4b confirms the seam; if no seam → back to design review, not an automatic single_fit).
- T2.3 — Wire the consumer registry (from T1.3): dataset_only forbidden unless registered xy_only; unknown/trial_dynamics→full_replay.
- T2.4 — Periodic prod invariance assertion (C8.7): occasionally capture one trial in full_replay and assert canonical_hash parity (guards the empirical-vs-structural gap).
- T2.5 — Tests (C8): invariance · registry · dataset-only parity · verdict parity · safety (dataset_only on trial_dynamics → error/force full_replay) · perf benchmark.
- T2.6 — Release parity gate (C7): canonical_hash(dataset_only)==full_replay + s40_verdict parity on K folds before any default flip.
Files: s18_step1_capture.py, s18_step0_replay.py, registry, ftf_config (Console), tests.
Review: CodeRabbit + committee pr_review + MLOps readiness (changes capture behaviour). Likely a committee touch on the design delta since it changes capture semantics.
Done: release parity gate passes; dataset_only default-ON only for registered xy_only consumers.
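Should the lever be revived, the mode resolution in T2.1/T2.3 might be sketched as follows. Only the variable name `CVN_DIAGNOSTIC_CAPTURE_MODE` and the three mode names come from the plan. The function, its `(mode, forced)` return value, and the choice to force `full_replay` (T2.5 allows "error/force") are assumptions, as is applying the same restriction to `single_fit`.

```python
import os
from typing import Mapping, Optional, Tuple

VALID_MODES = ("full_replay", "dataset_only", "single_fit")


def resolve_capture_mode(
    consumer_class: str,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, bool]:
    """Return (effective_mode, forced) for one diagnostic consumer.

    consumer_class: "xy_only", "trial_dynamics" or "unknown" (from the registry).
    forced is True when the requested mode was overridden, so the caller can
    log the degradation instead of hiding it.
    """
    source = os.environ if env is None else env
    requested = source.get("CVN_DIAGNOSTIC_CAPTURE_MODE", "full_replay")
    if requested not in VALID_MODES:
        raise ValueError(f"unknown capture mode: {requested!r}")

    # Anything other than a registered xy_only consumer gets a full replay.
    if requested != "full_replay" and consumer_class != "xy_only":
        return "full_replay", True
    return requested, False
```

Returning the `forced` flag rather than silently rewriting the mode keeps the override visible, in line with the no-silent-failure principle.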
Lever #1: captured-fold reproducibility artifact (deferred pending its gates — NOT parked indefinitely; it is the lever that directly delivers the "re-audit < 2 min" promise)
Phase position: if G-A1 shows material reuse, Lever #1 becomes the next priority right after Lever #2 (re-runs are where the win is).
Resume on its own entry gates (§9bis): Gate 1 value/reuse (= G-A1) + Gate 3 size (piggyback the next cold audit). Then implement per Part A §15 Files (new DIAGNOSTIC_CAPTURE entity, get_diagnostic_capture, PG advisory-lock single-flight, canonical hashes, S3 lifecycle), with Gate 4 (cold→warm in-cluster: reproducibility + <2 min + single-flight under contention) as the release gate before retiring skip_phase_a. D11 (--check-drift policy) resolved by Gate 2 drift-rate, and its committee waiver reopens if Gate 2 implies default-on/short-TTL.
Note (§C6): if G-A1 shows high reuse, Lever #1 is the high-value lever and #3 is marginal; if low, #3 leads and #1 is lower-value. Priority is set by the G-A1 number, not fixed.
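The single-flight idea behind `get_diagnostic_capture` can be illustrated in-process. Note the substitution: the design calls for a PostgreSQL advisory lock, whereas the sketch below swaps in a per-key `threading.Lock` so it runs without a database. `advisory_lock_key` shows one way to derive the signed 64-bit integer that `pg_advisory_lock(bigint)` expects; both names and the class are hypothetical.

```python
import hashlib
import threading
from typing import Any, Callable, Dict


def advisory_lock_key(cell_ref: str) -> int:
    """Map a cell_ref to a stable signed 64-bit key (the bigint range)."""
    digest = hashlib.sha256(cell_ref.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class SingleFlightCache:
    """In-process stand-in: at most one cold capture per cell_ref at a time."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}
        self._values: Dict[str, Any] = {}

    def get(self, cell_ref: str, build: Callable[[str], Any]) -> Any:
        # Short critical section: only fetch or create the per-key lock.
        with self._guard:
            lock = self._locks.setdefault(cell_ref, threading.Lock())
        # Contenders for the same cell_ref wait here; other cells proceed.
        with lock:
            if cell_ref not in self._values:
                self._values[cell_ref] = build(cell_ref)  # the cold path
            return self._values[cell_ref]                 # the warm path
```

Under contention, every caller after the first blocks on the per-key lock and then takes the warm path, which is the behaviour Gate 4 is meant to verify in-cluster.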
Cross-cutting
- Observability (ADR-26/30/32, design §8): emit the `capture_*` / `s40_s22a1_skipped` / `capture_lease_*` events + the warm-path metrics; Grafana SLO for Lever #1.
- Shared fixtures: G-C4, Lever-3 C7, and Lever-1 Gate 4 all re-run full_replay on K folds — one fixture set, captured once, reused.
- Process (CLAUDE.md): work under wp#232 (In progress); one PR per lever-phase; squash-merge on explicit operator go; CodeRabbit + committee for `src/` changes; OP/issue updated.
- Who triggers what: agent runs read-only gates (G-B7, G-A1, G-C0, G-C4b) + builds harnesses; operator launches in-cluster captures (G-C4 capture, Lever-1 Gate 4) and every merge.
Immediate next action
Phase 0, read-only first: run G-B7 (Loki run_s22a1 p50/p95) and G-A1 (re-audit reuse per cell_ref) — both pure reads, within the no-autonomous-launch policy — to decide whether Lever #2 is worth wiring and how to prioritize #1 vs #3. These two numbers gate everything downstream.