Plan dossier : CVN-N001-EE-S22A6 leakage + walk-forward + data sanity (Axis-3, M3-M6)

Date : 2026-05-20
Story : CVN-N001-EE-S22A6 (wp#186)
Issue : #968
Parent : CVN-N001-EE-S22 (wp#168)
Epic MAYDAY CVN-N001-EF
Session type : plan_review
Scientific basis (committee-approved) : parent S22 plan dossier v3 §S22-I (sub-tests I.1 / I.2 / I.3) + §3.2 (rows M3 / M5 / M6) + §S22-G.1 (label-noise injection) ; committee 31b95c00 PASSED 9.0/10, 0 dissent. This micro-story inherits that pre-specified design + decision tree ; no post-hoc.
Entry gate : S22A5 INCONCLUSIVE_UNCOVERED (wp#185 Closed) + S22A5b CONSISTENT_UNCOVERED (wp#199 Closed, multi-cell defi_top5).
Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER (Axis-1 OUT) → S22A5+A5b (Axis-2 OUT) → S22A6 (Axis-3 = M3-M6, this Story).

1. Decisional question (pre-specified)

Axis-1 (M1+M2 hyperparameters) and Axis-2 (signal existence) are both ruled out. The convergence bug best_iter = 1 is real, seed-independent, full-curve-visible, cross-crypto on defi_top5 at fold=3, and not a class-balance / learning-rate / signal-feasibility artefact. By process of elimination, the bug lives in Axis-3 : train/val construction, leakage, or harness row-mismatch.

S22A6 isolates which mechanism by applying the S22 plan v3 §3.2 rows M3 / M5 / M6 verbatim, with the §S22-I.1 / I.2 / I.3 sub-tests + the §S22-G.1 label-noise injection on the canonical cell (AAVEUSDC fold=3, S18-captured parquet, single seed 1337).

Probe	Spec	Verdict-load-bearing
I.1 — temporal correlation	for each feature `f_i`, compute `corr(f_i[t], label[t])`, `corr(f_i[t], label[t+1])`, `corr(f_i[t], label[t-1])` ; flag if `|corr(f_i[t], label[t+1])| > 0.05` (forward-leakage suspect)	M5
I.2 — per-split MI	`MI(f_i, label)` on train AND on val separately ; flag if `MI_train > 2 × MI_val` (train-only signal suspect)	M5
I.3 — walk-forward strict-chrono	rebuild train/val by date (no `shuffle=True`, no purge cheat), strict chronological order ; re-run `_train_full_no_earlystop` on this split with the canonical hp ; record `best_iter`, `proba_std_val`, AUC	M3
G.1 — label-noise injection	inject random label flips at 0 % / 10 % / 20 % / 30 % ; re-run training each ; record `best_iter[noise_level]` (per master plan §S22-G method 1)	M6
KS — train/val feature shift	Kolmogorov-Smirnov 2-sample test on each feature, train vs val ; flag features with `p < 0.01` (descriptive only)	descriptive → context for M3
χ² — label distribution shift	χ² test on label class proportions train vs val ; report `p` (descriptive only)	descriptive → context for M3
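
The I.1 row of the table above can be sketched in plain numpy. This is a minimal illustration under stated assumptions, not the module's code : the function names are hypothetical, `X` is assumed to be a dense `(n_rows, n_features)` array in chronological row order, and a vectorised Pearson formula stands in for a per-feature `np.corrcoef` call.

```python
import numpy as np

CORR_THRESHOLD = 0.05  # pre-registered I.1 cut, applied with a strict ">"


def _pearson_per_column(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pearson correlation of every column of X with y (NaN for constant inputs)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        return (Xc * yc[:, None]).sum(axis=0) / denom


def lagged_label_corr(X, y) -> dict:
    """Return {k: corr(f_i[t], label[t+k]) per feature} for k in (-1, 0, +1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    out = {}
    for k in (-1, 0, 1):
        if k > 0:
            xs, ys = X[: n - k], y[k:]    # f[t] paired with label[t+1]
        elif k < 0:
            xs, ys = X[-k:], y[: n + k]   # f[t] paired with label[t-1]
        else:
            xs, ys = X, y                 # f[t] paired with label[t]
        out[k] = _pearson_per_column(xs, ys)
    return out


def forward_leak_flags(X, y) -> np.ndarray:
    """Boolean mask of features with |corr(f_i[t], label[t+1])| > 0.05."""
    forward = lagged_label_corr(X, y)[1]
    with np.errstate(invalid="ignore"):
        return np.abs(forward) > CORR_THRESHOLD  # NaN compares False: not flagged
```

In this sketch a feature equal to the next row's label is flagged, while a feature that only tracks the current label is not, because only the `k = +1` correlation is verdict-load-bearing.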

Aggregate verdict (frozen, no post-hoc) :

Pattern	Verdict	Next
I.3 walk-forward recovers `best_iter > 30` ∧ `proba_std_val > 0.10`	M3_CONFIRMED — train/val construction bias	remediation Story : fold layout audit + window selection rule
≥ 1 feature flagged by I.1 OR I.2 (and I.3 does not recover)	M5_CONFIRMED — leak feature(s) listed in the verdict	remediation Story : feature audit + extend F4 to lower correlation thresholds
`best_iter[noise_level]` identical within ±1 across {0, 10, 20, 30}% (and I.3 does not recover, and I.1/I.2 are clean)	M6_CONFIRMED — harness row-mismatch (labels not actually consumed by the training loop)	remediation Story : deep harness audit (dtrain construction, feature ordering, label binding)
Two-or-more of M3 / M5 / M6 fire	M3_AND_M5 / M3_AND_M6 / M5_AND_M6 / M3_AND_M5_AND_M6 as applicable	open the corresponding remediation Stories in parallel
None of M3 / M5 / M6 fires	M7_ESCALATE — none of M3-M6 fits ; full evidence dossier to committee (per master plan §3.2 last row, M1-M11 exhausted in spirit ; do NOT autonomously propose remediation)	committee adjudication
Any probe tooling-fail / non-finite / S22A1 precondition non-scientific	INCONCLUSIVE_TOOLING	verdict dict returned (green) with `severity=error` per §1.1 (corrected 2026-05-21 ; previously a clean `AirflowFailException`), never a Python crash

best_iter > 30 and proba_std_val > 0.10 are the §3.1 healthy-reference thresholds, identical to S22A3 / S22A4 / S22A5 conventions (no new constant introduced).
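
The aggregate-verdict table can be read as a small pure function. The sketch below is illustrative only : `ProbeEvidence` and `aggregate_verdict` are hypothetical names, each mechanism is assumed to fire on its own probe evidence (so a single `*_CONFIRMED` verdict automatically satisfies the "and I.3 does not recover" clauses), and "identical within ±1" is assumed to mean a max-minus-min spread of at most 1.

```python
from dataclasses import dataclass
from typing import Dict

BEST_ITER_MIN = 30     # §3.1 healthy-reference threshold, strict ">"
PROBA_STD_MIN = 0.10   # §3.1 healthy-reference threshold, strict ">"
G1_TOLERANCE = 1       # assumed reading of "identical within ±1"


@dataclass(frozen=True)
class ProbeEvidence:
    """Hypothetical container for the load-bearing probe outputs."""
    i3_best_iter: int
    i3_proba_std_val: float
    flagged_i1: int                 # number of features flagged by I.1
    flagged_i2: int                 # number of features flagged by I.2
    g1_best_iter: Dict[int, int]    # noise level in % -> best_iter
    tooling_ok: bool = True


def aggregate_verdict(ev: ProbeEvidence) -> str:
    """Map probe evidence to one verdict label of the frozen decision tree."""
    iters = list(ev.g1_best_iter.values())
    if not ev.tooling_ok or not iters:
        return "INCONCLUSIVE_TOOLING"
    fires = {
        "M3": ev.i3_best_iter > BEST_ITER_MIN
        and ev.i3_proba_std_val > PROBA_STD_MIN,
        "M5": ev.flagged_i1 + ev.flagged_i2 >= 1,
        "M6": max(iters) - min(iters) <= G1_TOLERANCE,
    }
    fired = [name for name in ("M3", "M5", "M6") if fires[name]]
    if not fired:
        return "M7_ESCALATE"
    if len(fired) == 1:
        return f"{fired[0]}_CONFIRMED"
    return "_AND_".join(fired)  # e.g. "M5_AND_M6", "M3_AND_M5_AND_M6"
```

Keeping the tree a pure function of the evidence makes the strict-`>` boundaries directly unit-testable without any training run.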

1.1 DAG verdict-severity contract : NO crash, ever (operator rule)

Superseded 2026-05-21. The first execution of diagnostic__s22_a6 produced verdict M5_AND_M6, which under the previous contract (below) raised an AirflowFailException, i.e. a "Task failed with exception" traceback in the Airflow log. The operator ruled this a crash : "there must be no crash in the execution of the DAG, that is a rule I have set". An AirflowFailException IS a visible crash, even for a scientific verdict that needs attention. This operator rule overrides the committee pr_review c5617cb1 recommendation, which wanted RED-via-exception (the committee is advisory, ADR-68). The operator additionally noted that a raise can hide complementary points : an exception aborts execution before the remaining evidence is computed/logged.

Corrected contract : the DAG task ALWAYS returns the verdict dict (green) and NEVER raises to signal an outcome. Severity is expressed only via structured event=... severity=... logs (Loki → Grafana = the operator's real alert channel, ADR-26/30). The Airflow task colour is no longer the alert surface.

Verdict	Task outcome	Structured log
`M3_CONFIRMED` / `M5_CONFIRMED` / `M6_CONFIRMED`	returns dict (green)	`event=s22a6_outcome severity=info status=<...>` (+ flagged-feature idx / G.1 `by_noise` in the verdict JSON)
`M3_AND_M5` / `M3_AND_M6` / `M5_AND_M6` / `M3_AND_M5_AND_M6`	returns dict (green)	`event=s22a6_escalation_required severity=error reason=multiple_mechanisms_fire`
`M7_ESCALATE`	returns dict (green)	`event=s22a6_escalation_required severity=error reason=no_axis3_mechanism_fires` ; full evidence dossier mandatory before any remediation Story
`INCONCLUSIVE_TOOLING`	returns dict (green)	`event=s22a6_inconclusive_tooling severity=error`
contract violation (unknown status)	returns dict (green)	`event=s22a6_contract_violation severity=error`

The verdict JSON is persisted best-effort ; a persistence I/O error is logged severity=error and swallowed (the verdict still reaches XCom via the task return). Every precondition guard (artifact_dir out of bounds, stale-parquet unremovable, captured-parquet missing) likewise logs severity=error + returns an INCONCLUSIVE_TOOLING dict — no raise. Only a genuine infra fault (OOM / import error / network) may surface as a task failure.
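
The verdict → severity table can be expressed as a lookup that cannot raise. This is a hedged sketch, not the DAG's code : `dag_severity_for` and `format_event` are hypothetical helpers, and the real task would hand these fields to the canonical `log_event` helper, whose signature is not shown in this dossier.

```python
from typing import Dict

SINGLE = frozenset({"M3_CONFIRMED", "M5_CONFIRMED", "M6_CONFIRMED"})
COMBINED = frozenset(
    {"M3_AND_M5", "M3_AND_M6", "M5_AND_M6", "M3_AND_M5_AND_M6"}
)


def dag_severity_for(status: object) -> Dict[str, str]:
    """Return the structured-log fields for a verdict status; never raises."""
    if not isinstance(status, str):
        # Garbage input (None, list, ...) is a contract violation, not a crash.
        return {"event": "s22a6_contract_violation", "severity": "error"}
    if status in SINGLE:
        return {"event": "s22a6_outcome", "severity": "info", "status": status}
    if status in COMBINED:
        return {
            "event": "s22a6_escalation_required",
            "severity": "error",
            "reason": "multiple_mechanisms_fire",
        }
    if status == "M7_ESCALATE":
        return {
            "event": "s22a6_escalation_required",
            "severity": "error",
            "reason": "no_axis3_mechanism_fires",
        }
    if status == "INCONCLUSIVE_TOOLING":
        return {"event": "s22a6_inconclusive_tooling", "severity": "error"}
    return {"event": "s22a6_contract_violation", "severity": "error"}


def format_event(fields: Dict[str, str]) -> str:
    """Render fields in the event=key=value style used by the structured logs."""
    return " ".join(f"{key}={value}" for key, value in fields.items())
```

For example, `format_event(dag_severity_for("M5_AND_M6"))` yields the escalation line from the table, and the task would still return the verdict dict afterwards.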

Probe isolation (companion correction) : the six probes (I.1/I.2/I.3/G.1/KS/χ²) each run in their own try/except inside run_s22a6. A failure logs severity=error (load-bearing : I.1/I.2/I.3/G.1) or severity=warn (descriptive : KS/χ²), records the failed probe, and continues — a single probe crash must never hide the complementary evidence. If any load-bearing probe failed the verdict is INCONCLUSIVE_TOOLING, but built with the partial evidence from the probes that did compute.
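
The isolation rule above might look like the following loop. It is a sketch under assumptions : `run_probes_isolated` and `load_bearing_failed` are hypothetical names, the standard `logging` module stands in for the project's `log_event` helper, and the `probe=` / `err=` fields are illustrative.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

log = logging.getLogger("s22a6")

LOAD_BEARING = ("I.1", "I.2", "I.3", "G.1")  # KS and chi2 are descriptive


def run_probes_isolated(
    probes: Dict[str, Callable[[], Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Run every probe in its own try/except; return (results, failed names)."""
    results: Dict[str, Any] = {}
    failed: List[str] = []
    for name, probe in probes.items():
        try:
            results[name] = probe()
        except Exception as exc:  # one probe crash must not hide the others
            is_load_bearing = name in LOAD_BEARING
            log.log(
                logging.ERROR if is_load_bearing else logging.WARNING,
                "event=s22a6_probe_failed severity=%s probe=%s err=%r",
                "error" if is_load_bearing else "warn",
                name,
                exc,
            )
            results[name] = None
            failed.append(name)
    return results, failed


def load_bearing_failed(failed: List[str]) -> bool:
    """True when the verdict must be INCONCLUSIVE_TOOLING."""
    return any(name in LOAD_BEARING for name in failed)
```

The partial `results` dict is kept even when a load-bearing probe fails, so the INCONCLUSIVE_TOOLING verdict can still carry the evidence that did compute.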

Previous contract (2026-05-20, superseded — RED via AirflowFailException)

The earlier version (committee `pr_review` `c5617cb1`, OP Meeting #166) elevated combined verdicts + M7 to RED via a clean `AirflowFailException`. This was reverted on 2026-05-21 per the operator no-crash rule above.

1.2 Pre-probe data-integrity guards

Added 2026-05-20 following committee pr_review c5617cb1 recommendation 6 :

Parquet row-order monotonicity : I.3 walk-forward is meaningful only if the captured parquet rows are in chronological order. The S18 Step 1 capture (verbose_capture=True) preserves the original time-ordered fold rows, but the run-time guard makes this explicit. If the parquet carries a recognisable timestamp column (e.g. open_time, _timestamp, timestamp, ts), _walkforward_split_from_parquet validates strictly-non-NaT parsing + np.diff(timestamps).min() >= 0 and, on violation, raises a RuntimeError that the isolated I.3 probe catches → INCONCLUSIVE_TOOLING (returned, never a DAG crash). When no timestamp column is present (current S18 capture schema), the implicit chronological row-order is documented + logged (event=s22a6_walkforward_chronology_implicit) and a follow-up Story will add an explicit timestamp column to S18 capture for stronger validation.
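
A minimal sketch of the row-order guard described above, assuming the captured parquet has already been loaded into a pandas DataFrame. `validate_chronology` is a hypothetical name, and `Series.is_monotonic_increasing` is used as an equivalent of the `np.diff(timestamps).min() >= 0` check.

```python
from typing import Optional

import pandas as pd

TIMESTAMP_CANDIDATES = ("open_time", "_timestamp", "timestamp", "ts")


def validate_chronology(df: pd.DataFrame) -> Optional[str]:
    """Return the timestamp column that was validated, or None if absent.

    Raises RuntimeError on NaT or on out-of-order rows; the isolated I.3
    probe is expected to catch it and report INCONCLUSIVE_TOOLING.
    """
    col = next((c for c in TIMESTAMP_CANDIDATES if c in df.columns), None)
    if col is None:
        # No timestamp column: chronology stays implicit (row order).
        return None
    ts = pd.to_datetime(df[col], errors="coerce")
    if ts.isna().any():
        raise RuntimeError(f"{col}: unparseable or missing timestamps (NaT)")
    if not ts.is_monotonic_increasing:  # non-strict: equal timestamps allowed
        raise RuntimeError(f"{col}: rows are not in chronological order")
    return col
```

A `None` return corresponds to the implicit-chronology case, where the caller would emit `event=s22a6_walkforward_chronology_implicit` instead of failing.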

1.3 ADR alignment (explicit cross-reference)

Per committee pr_review c5617cb1 recommendation 7 :

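ADR-25 : no silent fallback. Degenerate AUC → chance + WARN ; every error path is surfaced explicitly as a structured severity=error event with a returned verdict dict (§1.1, corrected 2026-05-21 ; previously a raised AirflowFailException), never a raw Python crash.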
ADR-30 : structured logging is durable — every verdict, every escalation, every WARN reaches Loki via log_event (the canonical helper).
ADR-31 / 32 / 33 : no print() ; event=key=value format ; events listed in the closed catalogue (s22a6_started, s22a6_probes_started, s22a6_walkforward_split, s22a6_verdict, s22a6_verdict_persisted, s22a6_phase_a_started, s22a6_phase_a_done, s22a6_phase_a_evidence, s22a6_phase_a_skipped, s22a6_base_env_injected, s22a6_stale_parquet_removed, s22a6_precondition_failed, s22a6_protocol_nonconformance, s22a6_degenerate_auc, s22a6_hp_resolve_failed, s22a6_probe_failed, s22a6_verdict_persist_failed, s22a6_inconclusive_tooling, s22a6_escalation_required, s22a6_contract_violation, dag_loaded).
ADR-59 / 90 : every diagnostic probe parameter (I.1 / I.2 thresholds, walk-forward val_fraction, label-noise levels) is a run-level diagnostic constant, NEVER written to PG ftf_config. The canonical LGB HP load goes through the ADR-90 fail-fast resolver commun.finetune.hyperparams.resolve ; any miss → INCONCLUSIVE_TOOLING.
ADR-92 : dag_diagnostic__s22_a6 is ADR-92-versioned from day-one (dag_doc_md banner + dag_loaded_event first-task event + make_tags build tag).

2. Design : new probes + walk-forward re-train ; ADR-92 versioned DAG ; reuse the proven trainer

Single-pod chain, exact mirror of merged dag_diagnostic__s22_a5 post-PR-1010 (ADR-92 versioned : dag_doc_md banner + dag_loaded_event first-task event + make_tags build tag). New module commun/finetune/diagnostic/s22_a6_leakage_walkforward.py exposes run_s22a6(...) -> S22A6Verdict and write_s22a6_verdict(...). New DAG dags/dag_diagnostic__s22_a6.py mirrors s22_a5 : BASE_ENV inject #991, sentinel-sha #981, persist-before-fail #983, no-raise / always-green task per §1.1 (originally AirflowFailException-not-crash), single-pod Phase A+B, path-bound + protocol-conformance WARN, atomic verdict write, i.e. all S22A3/A4/A5 R2-R6 invariants.

Reuse run_s22a1 wholesale for preconditions + S22A1 anchor cross-ref (S22A3/A4/A5 precedent — zero duplication).
Reuse _train_full_no_earlystop for I.3 walk-forward re-train (same canonical hp, return_val_proba=True to compute AUC like S22A5) and for G.1 label-noise re-trains (4 trainings at 0 / 10 / 20 / 30 % noise, seeded permutation).
Walk-forward split : new helper _walkforward_split(captured_parquet, val_fraction=0.20) that orders by the parquet's timestamp column (or the row index if the parquet was captured chronologically — S22-I.3 specifies "no shuffle"). Builds train = first 80 % by time, val = last 20 % by time. Important : per ADR-14, a single 80/20 walk-forward is the minimum viable test for M3 ; a full multi-fold walk-forward is intrinsic to any subsequent remediation Story but out of scope here (≤1-day budget, S22 plan §S22-I.3 verbatim says "Re-run S22-B on this split", singular).
Temporal correlation (I.1) : pure numpy — np.corrcoef(X[:, i], y_shifted_by_k) for k ∈ {-1, 0, +1} per feature.
Per-split MI (I.2) : sklearn.feature_selection.mutual_info_classif separately on train and val ; flag where MI_train > 2 × MI_val.
Label-noise injection (G.1) : for each noise level, a seeded draw (e.g. np.random.default_rng(1337).choice(n, size=int(ratio * n), replace=False)) selects the rows whose labels are flipped ; re-run training, record best_iter.
KS / χ² : sklearn-free implementations from scipy.stats.ks_2samp and scipy.stats.chi2_contingency.
Single seed 1337 (canonical, pre-registered ; seed-independence CLOSED by S22A2 ; a multi-seed re-run busts the ≤1-day budget).
Artifact : /tmp/s18-diagnostic/s22a6-leakage-AAVEUSDC-fold3.json + Loki event=s22a6_verdict — identical persistence model to S22A4/A5 (pod-ephemeral JSON, durable Loki).
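
The walk-forward split and the G.1 label flip from the list above can be sketched with numpy alone. Assumptions, labelled as such : the rows are already in chronological order, labels are binary 0/1, flipped rows are drawn without replacement, and both function names are hypothetical (the real helpers take the captured parquet).

```python
import numpy as np


def walkforward_split(n_rows: int, val_fraction: float = 0.20):
    """Index arrays for train = first (1 - val_fraction), val = last rows."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be in (0, 1)")
    n_val = int(round(n_rows * val_fraction))
    n_train = n_rows - n_val
    if n_train < 1 or n_val < 1:
        raise ValueError("split would leave an empty train or val set")
    # No shuffle: contiguous blocks keep strict chronological order.
    return np.arange(n_train), np.arange(n_train, n_rows)


def flip_labels(y: np.ndarray, ratio: float, seed: int = 1337) -> np.ndarray:
    """Return a copy of binary labels with round(ratio * n) rows flipped."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    k = int(round(ratio * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)  # distinct rows only
    y_noisy = y.copy()
    y_noisy[idx] = 1 - y_noisy[idx]                  # assumes labels in {0, 1}
    return y_noisy
```

Because every validation index is larger than every training index, the split cannot leak future rows into training ; the fixed seed makes each noise level reproducible across re-runs.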

3. Scope / non-goals

NOT a fix : characterisation only (which Axis-3 mechanism, M3 or M5 or M6 or escalate).
Zero proven-module change (the _train_full_no_earlystop(return_val_proba=True) opt-in shipped in S22A5 covers the AUC needs ; no further helper extension expected).
NO new PG hyperparameters (G.1 noise levels, I.1 / I.2 thresholds, walk-forward val_fraction are all diagnostic probes, run-level, never written to ftf_config).
Single cell (AAVEUSDC fold=3) ; single seed 1337 ; the multi-cell generalisation question is not re-opened here : S22A5b already established that the convergence bug holds cross-crypto on defi_top5 at fold=3, and the M3-M6 mechanism is expected to be the same on each cell.
M4 is not in the master plan §3.2 main rows ; the OP description's "M3-M6" wording is treated as "M3 + M5 + M6". A follow-up Story would handle any M4 (label-horizon misalignment, S22-L Variant L3) or M8-M11 if S22A6 returns the verdict M7_ESCALATE.
Multi-fold walk-forward (5-fold per ADR-14) is intrinsic to the remediation Story for whichever Mx is confirmed, not to this characterisation.
The ADR-92 retrofit of dag_diagnostic__s22_a6 is built-in from day one (Phase 1 of CVN-N014-EB-S01 already shipped the helpers).

4. Risks

Walk-forward split definition : the captured parquet must carry a column that allows chronological ordering (timestamp index, row order, or a separately recorded date). The plan reuses the canonical parquet — if the row order is not chronological, the walk-forward split degenerates to a random split and I.3 cannot test M3. Mitigation : the captured parquet is produced by S18 Step 1 with verbose_capture=True which preserves the original time-ordered fold rows ; verify with a np.diff(timestamps).min() >= 0 assertion at the start of I.3 and emit INCONCLUSIVE_TOOLING + WARN if not (per feedback_no_python_crash_visible).
G.1 budget : 4 re-trainings × ~10 s each = ~40 s additional. Easily within the ≤ 1 day budget.
I.1 / I.2 thresholds : |corr| > 0.05 and MI_train > 2 × MI_val are the master-plan-verbatim cuts. Pre-registered here BEFORE the run. A flagged feature is suspicious, not guilty — the M5 verdict requires at least one flagged feature, AND I.3 not recovering ; otherwise M5 is masked by M3.
Multiple mechanisms firing simultaneously : if M3 and M5 both fire, the verdict is M3_AND_M5 with both remediation paths queued. The order of remediation is operator-driven, not part of this Story.
statistically_non_defensible cells : not applicable — AAVEUSDC fold=3 was already established as defensible (n_buy_train=1725 / n_buy_val=353) by S22A5.
Phase A divergence : S18 FAIL = bug reproducing, recorded as evidence (#982, proven non-blocking S22A2/A3/A4/A5).
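
The I.2 rule restated in the risks above (MI_train > 2 × MI_val, strict) could be sketched as follows. This is an illustration, not the module's code : `per_split_mi_flags` is a hypothetical name, features are assumed continuous, and `random_state` is pinned to the canonical seed because `mutual_info_classif` adds a small random jitter to continuous features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

MI_RATIO = 2.0  # pre-registered I.2 cut, applied with a strict ">"


def per_split_mi_flags(X_train, y_train, X_val, y_val, seed: int = 1337):
    """Return (mi_train, mi_val, flags) with flags = mi_train > 2 * mi_val."""
    mi_train = mutual_info_classif(X_train, y_train, random_state=seed)
    mi_val = mutual_info_classif(X_val, y_val, random_state=seed)
    # Strict inequality: MI_train exactly at 2 x MI_val is not flagged,
    # and a feature with zero MI on both splits is not flagged either.
    flags = mi_train > MI_RATIO * mi_val
    return mi_train, mi_val, flags
```

One caveat of this reading of the rule : when MI_val is estimated at exactly zero, any positive MI_train satisfies the inequality, which is one more reason a flagged feature is treated as suspicious rather than guilty.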

5. Test plan

Unit tests/unit/test_s22_a6_leakage_walkforward.py : decision tree (every verdict row : M3_CONFIRMED, M5_CONFIRMED, M6_CONFIRMED, M3_AND_M5, M3_AND_M6, M5_AND_M6, M3_AND_M5_AND_M6, M7_ESCALATE, INCONCLUSIVE_TOOLING) ; I.1 threshold boundary (|corr| exactly at 0.05 → not flagged, strict >) ; I.2 threshold boundary (MI_train exactly at 2 × MI_val → not flagged, strict >) ; I.3 recovery threshold (best_iter exactly at 30 → not recovered, strict >) ; G.1 invariance test (best_iter identical within ±1 across noise levels → M6 candidate) ; walk-forward split determinism + chronological-order assertion ; KS / χ² degenerate-input guards ; verdict JSON round-trip ; S22A1-precondition-non-scientific → INCONCLUSIVE_TOOLING. No real LGB train / no real sklearn fit on big data — stub the trainer + tiny synthetic arrays (mirror test_s22_a5_signal_existence.py discipline).
§1.1 DAG verdict-severity (no-crash) conformance (added 2026-05-20, corrected 2026-05-21) : a dedicated block (_dag_severity_for) mirrors the DAG task body's terminal status block and locks the verdict → severity mapping, asserting (i) the helper never raises for any status (valid or garbage), (ii) single-mechanism M3/M5/M6_CONFIRMED → severity=info, (iii) combined M3_AND_* / M5_AND_* / M3_AND_M5_AND_M6, M7_ESCALATE, INCONCLUSIVE_TOOLING, contract-violation → severity=error, (iv) a combined (error-severity) verdict is a normally-returned dataclass that is JSON-persistable with no exception anywhere.
Probe isolation : (i) a descriptive-probe (KS) failure does not change the scientific verdict and is recorded as None + a failed-probe trail ; (ii) a load-bearing-probe (I.3) failure → INCONCLUSIVE_TOOLING but the I.1/I.2/G.1 evidence that did compute is retained on the verdict (no hidden complementary points).
DAG control-flow validated by the system run (no Airflow-task harness — consistent with all S22 PRs).
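
The "KS / χ² degenerate-input guards" item in the unit-test list could be exercised against helpers shaped like the sketch below. The function names, the `None` return for an undefined test, and the non-finite filtering are assumptions ; only `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency` come from the design section.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

KS_ALPHA = 0.01  # descriptive flag threshold from the probe table


def ks_feature_shift(X_train: np.ndarray, X_val: np.ndarray):
    """Per-feature KS p-values (None if undefined) and flagged feature idx."""
    pvalues, flagged = [], []
    for i in range(X_train.shape[1]):
        a = X_train[:, i][np.isfinite(X_train[:, i])]
        b = X_val[:, i][np.isfinite(X_val[:, i])]
        if a.size == 0 or b.size == 0:   # degenerate: nothing to compare
            pvalues.append(None)
            continue
        p = float(ks_2samp(a, b).pvalue)
        pvalues.append(p)
        if p < KS_ALPHA:
            flagged.append(i)
    return pvalues, flagged


def label_shift_chi2(y_train: np.ndarray, y_val: np.ndarray):
    """Chi-square p-value on class counts train vs val, or None if undefined."""
    classes = np.union1d(y_train, y_val)
    if classes.size < 2:                 # degenerate: a single class overall
        return None
    table = np.array(
        [[np.sum(y_train == c) for c in classes],
         [np.sum(y_val == c) for c in classes]]
    )
    if (table.sum(axis=1) == 0).any():   # degenerate: an empty split
        return None
    return float(chi2_contingency(table)[1])  # index 1 is the p-value
```

Returning `None` instead of raising keeps both descriptive probes from ever turning a degenerate input into a failure of the scientific verdict.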

6. Success criteria (binary)

Module + DAG (ADR-92 versioned) + tests merged (committee pr_review + CR converged) ; run executed ; verdict ∈ {M3_CONFIRMED, M5_CONFIRMED, M6_CONFIRMED, M3_AND_M5, M3_AND_M6, M5_AND_M6, M3_AND_M5_AND_M6, M7_ESCALATE, INCONCLUSIVE_TOOLING, INCONCLUSIVE_UNCOVERED} persisted (Loki + dossier) ; cold-eyes sign-off (S22 §6.4).

7. Review question

(a) Is the scope inclusion of S22-G.1 label-noise injection (4 trainings at 0/10/20/30 % noise) inside S22A6 the correct minimal way to make a confident M6 verdict emittable, given that the OP description names M6 in scope and the master plan §3.2 row 5 explicitly requires the G.1 evidence, versus deferring M6 to a tactical follow-up Story if only M3 / M5 fire ?

(b) Is the single 80/20 walk-forward split (operator-tunable val_fraction Param, default 0.20) the right minimal viable I.3 test for M3, given that ADR-14 mandates multi-fold for the remediation Story but not for this characterisation Story, and is the chronological-order assertion at the start of I.3 (emit INCONCLUSIVE_TOOLING if the captured parquet row order is not monotone-in-time) the right tooling-failsafe ?

(c) Is the strict-> interpretation of the master-plan thresholds (|corr| > 0.05 strict, MI_train > 2 × MI_val strict, best_iter > 30 strict, identical to S22A3/A4/A5 conventions and committee-endorsed in those Stories) the right pre-registered translation, refusing post-hoc relaxation at the knife-edge ?

(d) Does the ADR-92 versioned DAG (dag_diagnostic__s22_a6 retrofitted from day-one with dag_doc_md / dag_loaded_event / make_tags build tag, per the Phase-1 helpers shipped in PR #1010) correctly inherit the operator-visible build-provenance invariant, and is the same R2-R6 robustness boilerplate (BASE_ENV inject, sentinel-sha, persist-before-fail, no-raise / always-green task per §1.1, atomic verdict write) the right zero-deviation reuse from the merged S22A5 DAG ?

(e) Re-submit question 2026-05-21 (no-crash correction) : the first diagnostic__s22_a6 run produced M5_AND_M6, which under the prior contract raised an AirflowFailException (a "Task failed with exception" traceback). The operator ruled that a crash and set an absolute rule : the DAG task must never raise to signal an outcome ; it always returns the verdict dict (green), severity goes only through event=... severity=... logs (Loki/Grafana), and each probe is isolated so one failure can't hide the others' evidence. This overrides the committee c5617cb1 RED-via-exception recommendation (committee advisory, ADR-68). Does the corrected §1.1 contract (all verdicts → green task + structured severity ; probe isolation ; precondition guards return INCONCLUSIVE_TOOLING instead of raising ; best-effort persistence) correctly and completely honour the operator no-crash rule while preserving the pre-registered §3.2 scientific decision tree and not losing any complementary diagnostic evidence ?