Cold-eyes verdict review : CVN-N001-EE-S22A6 Axis-3 leakage + walk-forward + data sanity

Date : 2026-05-21
Story : CVN-N001-EE-S22A6 (wp#186) · Issue : #968
Parent : CVN-N001-EE-S22 (wp#168) · Epic MAYDAY CVN-N001-EF
Session type : experiment_review (cold-eyes sign-off, S22 plan v3 §6.4) ; committee 9b4ab4b2 PASSED/OK strong, 0 dissent (ADR-82 Meeting #170)
Plan : committee plan_review 77b56f51 PASSED strong (Meeting #165), corrected re-submit 95b38638 PASSED strong (Meeting #167)
PR-review : 1e8e81f9 PASSED strong (Meeting #168), PR #1014 merged de04a650. No-crash hotfix PR #1015 merged c9489c8e (committee pr_review 1557d1fa PASSED, Meeting #169).
Entry gate : S22A4 = NEITHER (Axis-1 OUT) + S22A5+A5b = CONSISTENT_UNCOVERED (Axis-2 OUT, wp#185 + wp#199 Closed).
Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER → S22A5+A5b CONSISTENT_UNCOVERED → S22A6 = M5_AND_M6.

1. What is being signed off

The S22A6 Axis-3 verdict from the operator-triggered, post-no-crash-fix diagnostic__s22_a6 run (2026-05-21 13:03–13:04 UTC, canonical AAVEUSDC fold=3, seed 1337, n_rounds=300) :

Which Axis-3 mechanism causes the reproducible best_iter = 1 convergence collapse : M3 (train/val construction), M5 (leakage / covariate), or M6 (harness row-mismatch) ?

Verdict : M5_AND_M6. M3 is ruled out (walk-forward does not recover). M5 (severe train/val covariate shift) is the PRIMARY mechanism. M6 (G.1 label-noise invariance) fires but is CONFOUNDED by the pre-existing best_iter=1 floor ; the train-loss-vs-noise disambiguator is required to confirm or refute it as an independent issue.

Characterisation sign-off — not a fix decision. The §3.2 M3/M5/M6 decision tree and all cuts were pre-registered in this Story's plan dossier §1 (committee 77b56f51 + corrected 95b38638 PASSED) inheriting the S22 master plan v3 §S22-I / §S22-G verbatim, BEFORE the run.

Of equal importance for this sign-off : the operator-visible GREEN trail. The first run (2026-05-21 08:04 UTC, pre-fix DAG) produced the same M5_AND_M6 verdict but raised an AirflowFailException, which surfaced as a "Task failed with exception" traceback. The operator ruled this a crash under an absolute rule : the DAG must never crash AND never silent-fail ; both must hold. PR #1015 removed every raise and isolated the probes. The clean run signed off here (2026-05-21 13:04 UTC, on the fixed DAG) returned the verdict dict end-to-end with event=s22a6_escalation_required severity=error and ZERO traceback, which verifies the no-crash + no-silent-fail contract in production.

2. Pre-specified contract (plan dossier §1, frozen pre-run, committee `95b38638` PASSED)

Strict cuts (no post-hoc relaxation) :

| Probe | Cut | Mechanism |
|---|---|---|
| I.3 walk-forward (strict-chrono 80/20) | recover iff `best_iter > 30` ∧ `proba_std > 0.10` | M3 |
| I.1 temporal corr | `\|corr(f_i[t], y[t+1])\| > 0.05` | M5 (forward leakage) |
| I.2 per-split MI | `MI_train > 2 × MI_val` | M5 (train-only signal) |
| G.1 label-noise | `best_iter` invariant within ±1 across {0,10,20,30}% | M6 (labels not consumed) |
| KS / χ² | descriptive only | context for M3/M5 |
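As an illustration of the I.1 cut, the lagged correlation `|corr(f_i[t], y[t+1])|` can be sketched in plain Python. This is a minimal sketch, assuming each feature is a sequence aligned row-for-row with the label sequence. The function names `pearson` and `forward_leak_features` are hypothetical and are not taken from the diagnostic module.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def forward_leak_features(features, y, cut=0.05):
    """Return {feature_idx: |corr(f_i[t], y[t+1])|} for features above the cut."""
    flagged = {}
    for idx, f in enumerate(features):
        # Pair the feature at time t with the label at time t+1.
        c = abs(pearson(f[:-1], y[1:]))
        if c > cut:
            flagged[idx] = c
    return flagged

# Synthetic data only: one feature that copies the next label, one constant feature.
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
leaky = y[1:] + [0]
flat = [1.0] * len(y)
flagged = forward_leak_features([leaky, flat], y)
```

On this synthetic input only feature 0 is flagged, since it encodes the next label exactly. A value sitting just above 0.05, as in the real run, would pass the same test but with far less margin.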

Verdict severity contract (§1.1, corrected 2026-05-21 — operator no-crash rule) : the DAG task ALWAYS returns the verdict dict (green) and NEVER raises ; severity via event=… severity=… (Loki → Grafana). M5_AND_M6 → event=s22a6_escalation_required severity=error.
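The contract above (always return the verdict dict, never raise, carry severity in the log line) can be sketched as follows. This is a minimal sketch with no Airflow dependency. The helper names `run_probes_isolated` and `emit_verdict`, the `s22a6_probe_failed` event, and the fallback event for other verdicts are all hypothetical ; only `event=s22a6_escalation_required severity=error` for M5_AND_M6 comes from this review.

```python
import logging

logger = logging.getLogger("s22a6_sketch")

def run_probes_isolated(probes):
    """Run each probe in isolation; a failing probe is recorded, never re-raised."""
    results, failed = {}, []
    for name, probe in probes.items():
        try:
            results[name] = probe()
        except Exception as exc:  # isolation boundary: one probe cannot crash the task
            failed.append(name)
            # Hypothetical event name, used here only to avoid a silent failure.
            logger.error("event=s22a6_probe_failed severity=error probe=%s error=%r", name, exc)
    return {"results": results, "failed_probes": failed}

def emit_verdict(verdict, payload):
    """Log the severity line and return the verdict dict (the task stays green)."""
    if verdict == "M5_AND_M6":
        logger.error("event=s22a6_escalation_required severity=error verdict=%s", verdict)
    else:
        # Event names for other verdicts are not given in this review: placeholder.
        logger.info("event=s22a6_verdict severity=info verdict=%s", verdict)
    return {"verdict": verdict, **payload}

def broken_probe():
    raise ValueError("simulated probe failure")

out = run_probes_isolated({"i1": lambda: {"max_abs": 0.0510}, "g1": broken_probe})
verdict = emit_verdict("M5_AND_M6", out)
```

The simulated failure ends up in `failed_probes` and in the log, while the caller still receives a complete dict. That gives both halves of the rule : no crash and no silent failure.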

3. Observed evidence (clean run, 2026-05-21 13:04 UTC)

| Probe | Result | Fires? |
|---|---|---|
| canonical `run_s22a1` | `best_iter=1` (REPRODUCED), `proba_std_val=0.199` | bug reproduced |
| I.3 walk-forward | `best_iter=1`, `proba_std=0.2026`, `AUC=0.5826` | M3 NOT recover |
| I.1 | `max_abs=0.0510`, 2 features (idx 59, 20) | M5 (marginal, knife-edge) |
| I.2 | `max_ratio=30.48`, 26 features (idx 292,109,98,150,167,275,182,84,211,256…) | M5 (strong) |
| G.1 | `{0%:1, 10%:1, 20%:1, 30%:1}`, range=0 | M6 fires (confounded) |
| KS (descriptive) | 87/320 features p<0.01 | severe covariate shift |
| χ² (descriptive) | p=0.50 | labels balanced train↔val |

statistically_non_defensible=False (n_buy_train=1725, n_buy_val=353). All probes succeeded (failed_probes=[]).
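The §2 cuts applied to the values reported above can be sketched as plain predicates. This is a sketch under stated assumptions, not the pre-registered §3.2 tree : the function names are hypothetical, "within ±1" is read as `max - min <= 1`, M5 is taken to fire if either I.1 or I.2 fires, and only the M5_AND_M6 branch is reproduced because the other branches are not given in this review.

```python
def walk_forward_recovers(best_iter, proba_std):
    """I.3: recovery requires best_iter > 30 AND proba_std > 0.10."""
    return best_iter > 30 and proba_std > 0.10

def i1_fires(max_abs_corr, cut=0.05):
    """I.1: at least one feature has |corr(f_i[t], y[t+1])| above the cut."""
    return max_abs_corr > cut

def i2_fires(max_mi_ratio, factor=2.0):
    """I.2: at least one feature has MI_train > factor * MI_val."""
    return max_mi_ratio > factor

def g1_fires(best_iter_by_noise, tol=1):
    """G.1: best_iter invariant across noise levels (assumed: max - min <= tol)."""
    vals = list(best_iter_by_noise.values())
    return max(vals) - min(vals) <= tol

def axis3_reading(m3_recovers, m5, m6):
    """Only the branch observed in this run; other branches are not reproduced."""
    if not m3_recovers and m5 and m6:
        return "M5_AND_M6"
    return None

# Values copied from the observed-evidence table of the clean run.
m3_recovers = walk_forward_recovers(best_iter=1, proba_std=0.2026)
m5 = i1_fires(0.0510) or i2_fires(30.48)
m6 = g1_fires({0: 1, 10: 1, 20: 1, 30: 1})
reading = axis3_reading(m3_recovers, m5, m6)
```

Walk-forward fails the recovery cut on `best_iter` alone, even though `proba_std` clears 0.10. I.1 passes by a margin of 0.001, so the M5 reading rests on I.2.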

4. Adjudicated reading (cold-eyes `9b4ab4b2`, 5/5 strong)

best_iter=1 is real and robust — canonical AND walk-forward both = 1 ⇒ M3 ruled out (not the fold's train/val boundary). Weak genuine signal (AUC 0.58), consistent with S22A5+A5b.
M5 is the PRIMARY mechanism : 87/320 KS-flagged features + 26 MI-divergent features (ratio up to 30), with balanced labels (χ² p=0.50). This is severe train/val covariate shift : the model fits the train distribution, and because val is distributionally different, best_iter is pinned at 1 (every round after the first worsens val-loss).
M6 (G.1 range=0) FIRES but is CONFOUNDED — the committee endorsed verbatim : "It is impossible to observe a shift in best_iter due to label noise when the model already [collapses at round 1]." G.1 invariance here is more likely a consequence of the best_iter=1 collapse than clean proof of label non-consumption. The §3.2 tree fired M5_AND_M6 mechanically (both conditions met) ; scientifically M5 is fundamental, M6 needs disambiguation.
I.1 (2 features at the 0.05 knife-edge) is marginal, not load-bearing.
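The descriptive KS count behind the M5 reading (features whose train and val distributions differ at p<0.01) can be sketched as below. This is a minimal pure-Python sketch, assuming per-feature columns for each split. The helper names are hypothetical, and the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, so it may differ slightly from what a library routine such as `scipy.stats.ks_2samp` reports. The probe's actual implementation is not shown in this review.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def ks_pvalue(d, na, nb, terms=100):
    """Asymptotic two-sided p-value (approximation, assumed adequate for a sketch)."""
    en = math.sqrt(na * nb / (na + nb))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.05:
        return 1.0  # the series converges slowly here; the p-value is effectively 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

def shifted_features(train_cols, val_cols, alpha=0.01):
    """Indices of features whose train/val KS test has p < alpha."""
    flagged = []
    for idx, (tr, va) in enumerate(zip(train_cols, val_cols)):
        p = ks_pvalue(ks_statistic(tr, va), len(tr), len(va))
        if p < alpha:
            flagged.append(idx)
    return flagged

# Synthetic columns only: feature 0 is shifted between splits, feature 1 is not.
train_cols = [list(range(200)), list(range(200))]
val_cols = [list(range(150, 350)), list(range(200))]
flagged = shifted_features(train_cols, val_cols)
```

With 320 features tested at p<0.01, roughly three flags would be expected by chance if train and val were identically distributed, so 87 is far outside that range.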

5. Complementary point surfaced by the clean run

G.1 records only best_iter (val-loss argmin), not the train-loss response to label noise — the decisive M6 test. If labels ARE consumed, train-loss at 30% noise ≫ train-loss at 0% ; if train-loss is ALSO invariant → labels genuinely not consumed (M6 real). Cold-eyes endorsed enriching G.1 with train-loss-vs-noise as "critical and correct" before any M6 harness audit. (This is exactly the kind of complementary evidence a crash-on-verdict would have suppressed — the no-crash + probe-isolation fix is what let the full evidence surface.)
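The proposed disambiguator can be sketched as a label-flipping helper plus a comparison of train loss at the highest noise level against the 0% baseline. This is a minimal sketch under assumptions : binary labels, hypothetical function names, and a hypothetical `min_rel_increase` threshold standing in for "≫", which the review does not quantify. The loss values in the usage lines are invented for illustration and are not results from any run.

```python
import random

def flip_labels(y, noise_frac, seed=1337):
    """Flip a fixed fraction of binary labels, reproducibly for a given seed."""
    rng = random.Random(seed)
    n_flip = round(noise_frac * len(y))
    noisy = list(y)
    for i in rng.sample(range(len(y)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

def m6_reading(train_loss_by_noise, min_rel_increase=0.05):
    """Compare train loss at the highest noise level with the 0% baseline.

    A clear increase means the labels reach the learner; an invariant
    train loss supports M6 (labels not consumed).
    """
    base = train_loss_by_noise[0]
    top = train_loss_by_noise[max(train_loss_by_noise)]
    rel = (top - base) / base
    return "labels_consumed" if rel > min_rel_increase else "labels_not_consumed"

# Illustrative inputs only (noise level in percent -> final train loss).
y = [0, 1] * 50
noisy = flip_labels(y, 0.30)
responsive = m6_reading({0: 0.40, 10: 0.48, 20: 0.55, 30: 0.61})
invariant = m6_reading({0: 0.40, 10: 0.40, 20: 0.40, 30: 0.40})
```

Unlike `best_iter`, train loss does not depend on where the val-loss argmin lands, which is why it can separate the two readings even while `best_iter` is stuck at 1.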

6. Disposition

S22A6 closes with M5_AND_M6 — a defensible Axis-3 characterisation (cold-eyes 9b4ab4b2 PASSED 5/5). Verdict code ML_USELESS : the ML system is NOT production-usable (no further eval / backtest / deploy) until M5/M6 are resolved and best_iter=1 is eliminated.
Open M5 remediation Story (PRIORITY) : audit why train/val feature distributions diverge so severely at fold 3 — candidates : normalisation fitted-on-train leakage, non-stationary raw features drifting between train and val windows, or fold-3 val = regime-shifted period. Inspect the flagged feature indices. Validate across folds/assets (ADR-14). Ablation on the 26 MI-divergent features.
Open M6 remediation Story : add the G.1 train-loss-vs-noise disambiguator first ; then (if M6 confirmed independent) audit dtrain construction / label binding.
Strengthen upstream drift detection (post-Enrichment, pre-FeatureEngineering, Feast feature store) + document purging/embargo for future folds.
No remediation implemented in S22A6 (characterisation only, per scope §3 + feedback_stop_hacking).
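For the purging/embargo item above, a strict-chronological 80/20 split with an optional embargo gap can be sketched as follows. This is an illustration only, not a remediation implemented in this Story. The function name `chrono_split` and the `embargo` parameter are hypothetical, and an appropriate embargo length would depend on the label horizon, which this review does not state.

```python
def chrono_split(n_rows, train_frac=0.8, embargo=0):
    """Strict-chronological split over row positions 0..n_rows-1.

    Train rows always precede val rows; `embargo` rows immediately after
    the train window are dropped so that labels overlapping the boundary
    cannot appear on both sides.
    """
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be in (0, 1)")
    if embargo < 0:
        raise ValueError("embargo must be >= 0")
    cut = int(n_rows * train_frac)
    train_idx = range(0, cut)
    val_idx = range(min(cut + embargo, n_rows), n_rows)
    return train_idx, val_idx

# 1000 rows, 80/20 split, 24 rows dropped at the boundary (arbitrary example).
train_idx, val_idx = chrono_split(1000, train_frac=0.8, embargo=24)
```

The embargo shortens the val window rather than the train window, so the train set stays comparable across runs with different embargo lengths.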

7. Artefacts

Verdict JSON : /tmp/s18-diagnostic/s22a6-leakage-AAVEUSDC-fold3.json (pod-ephemeral) ; full payload in the run log + §3 above.
Cold-eyes session : committee/sessions/9b4ab4b2_committee.json (ADR-82 Meeting #170).
Plan dossier : 2026-05-20-cvn-n001-ee-s22a6-leakage-walkforward-plan.md.
Code : src/commun/finetune/diagnostic/s22_a6_leakage_walkforward.py, dags/dag_diagnostic__s22_a6.py (PR #1014 + no-crash hotfix #1015).