Cold-eyes verdict review : CVN-N001-EE-S22A6 Axis-3 leakage + walk-forward + data sanity
Date : 2026-05-21 · Story : CVN-N001-EE-S22A6 (wp#186) · Issue : #968
Parent : CVN-N001-EE-S22 (wp#168) · Epic MAYDAY CVN-N001-EF
Session type : experiment_review (cold-eyes sign-off — S22 plan v3 §6.4) — committee 9b4ab4b2 PASSED/OK strong, 0 dissent (ADR-82 Meeting #170)
Plan : committee plan_review 77b56f51 PASSED strong (Meeting #165), corrected re-submit 95b38638 PASSED strong (Meeting #167) · PR-review : 1e8e81f9 PASSED strong (Meeting #168), PR #1014 merged de04a650. No-crash hotfix PR #1015 merged c9489c8e (committee pr_review 1557d1fa PASSED, Meeting #169).
Entry gate : S22A4 = NEITHER (Axis-1 OUT) + S22A5+A5b = CONSISTENT_UNCOVERED (Axis-2 OUT, wp#185 + wp#199 Closed). Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER → S22A5+A5b CONSISTENT_UNCOVERED → S22A6 = M5_AND_M6.
1. What is being signed off
The S22A6 Axis-3 verdict from the operator-triggered, post-no-crash-fix diagnostic__s22_a6 run (2026-05-21 13:03–13:04 UTC, canonical AAVEUSDC fold=3, seed 1337, n_rounds=300) :
Which Axis-3 mechanism causes the reproducible `best_iter = 1` convergence collapse : M3 (train/val construction), M5 (leakage / covariate), or M6 (harness row-mismatch) ? Verdict : `M5_AND_M6`. M3 is ruled out (walk-forward does not recover) ; M5 (severe train/val covariate shift) is the PRIMARY mechanism ; M6 (G.1 label-noise invariance) fires but is CONFOUNDED by the pre-existing `best_iter=1` floor and requires the train-loss-vs-noise disambiguator to confirm/refute as an independent issue.
Characterisation sign-off — not a fix decision. The §3.2 M3/M5/M6 decision tree and all cuts were pre-registered in this Story's plan dossier §1 (committee 77b56f51 + corrected 95b38638 PASSED) inheriting the S22 master plan v3 §S22-I / §S22-G verbatim, BEFORE the run.
Of equal importance for this sign-off : the operator-visible GREEN trail. The first run (2026-05-21 08:04 UTC, pre-fix DAG) produced the same M5_AND_M6 verdict but raised an AirflowFailException, so the task failed with an exception traceback. The operator ruled this a crash under the absolute rule that the DAG must never crash AND never silent-fail (both must hold). PR #1015 removed every raise and isolated the probes. The clean run signed off here (2026-05-21 13:04 UTC, on the fixed DAG) returned the verdict dict end-to-end with event=s22a6_escalation_required severity=error and ZERO traceback : the no-crash + no-silent-fail contract is verified in production.
2. Pre-specified contract (plan dossier §1, frozen pre-run, committee 95b38638 PASSED)
Strict cuts (no post-hoc relaxation) :
| Probe | Cut | Mechanism |
|---|---|---|
| I.3 walk-forward (strict-chrono 80/20) | recover iff `best_iter > 30 ∧ proba_std > 0.10` | M3 |
| I.1 temporal corr | `\|corr(f_i[t], y[t+1])\| > 0.05` | M5 (forward leakage) |
| I.2 per-split MI | `MI_train > 2 × MI_val` | M5 (train-only signal) |
| G.1 label-noise | `best_iter` invariant within ±1 across {0,10,20,30}% | M6 (labels not consumed) |
| KS / χ² | descriptive only | context for M3/M5 |
Verdict severity contract (§1.1, corrected 2026-05-21 — operator no-crash rule) : the DAG task ALWAYS returns the verdict dict (green) and NEVER raises ; severity via event=… severity=… (Loki → Grafana). M5_AND_M6 → event=s22a6_escalation_required severity=error.
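The contract above (always return the verdict dict, never raise, signal severity through a log line) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in PR #1015 : `run_probes_no_crash`, `report_verdict`, the probe names and the `s22a6_probe_failed` event are hypothetical ; only `failed_probes`, `event=s22a6_escalation_required` and `severity=error` come from this review.

```python
import logging

logger = logging.getLogger("s22a6_sketch")  # hypothetical logger name


def run_probes_no_crash(probes):
    """Run every probe in isolation and always return a dict (never raise)."""
    results, failed_probes = {}, []
    for name, probe in probes.items():
        try:
            results[name] = probe()
        except Exception as exc:
            # Isolation: a failing probe is recorded and logged, not re-raised,
            # so the remaining probes still run and the failure is not silent.
            failed_probes.append(name)
            logger.error(
                "event=s22a6_probe_failed severity=error probe=%s error=%r",
                name, exc,
            )
    return {"results": results, "failed_probes": failed_probes}


def report_verdict(verdict):
    """Signal severity through a log line instead of an exception."""
    if verdict == "M5_AND_M6":
        logger.error("event=s22a6_escalation_required severity=error")
    return {"verdict": verdict}


def _broken_probe():
    raise ValueError("simulated probe failure")


outcome = run_probes_no_crash({"ok_probe": lambda: 1, "broken_probe": _broken_probe})
payload = report_verdict("M5_AND_M6")
```

In this sketch the caller always gets `outcome` and `payload` back ; the error-level log lines are what an alerting pipeline would key on.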
3. Observed evidence (clean run, 2026-05-21 13:04 UTC)
| Probe | Result | Fires? |
|---|---|---|
| canonical `run_s22a1` | `best_iter=1` (REPRODUCED), `proba_std_val=0.199` | bug reproduced |
| I.3 walk-forward | `best_iter=1`, `proba_std=0.2026`, AUC=0.5826 | M3 does NOT recover |
| I.1 | max_abs=0.0510, 2 features (idx 59, 20) | M5 (marginal, knife-edge) |
| I.2 | max_ratio=30.48, 26 features (idx 292,109,98,150,167,275,182,84,211,256…) | M5 (strong) |
| G.1 | {0%:1, 10%:1, 20%:1, 30%:1}, range=0 | M6 fires (confounded) |
| KS (descriptive) | 87/320 features p<0.01 | severe covariate shift |
| χ² (descriptive) | p=0.50 | labels balanced train↔val |
statistically_non_defensible=False (n_buy_train=1725, n_buy_val=353). All probes succeeded (failed_probes=[]).
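As a cross-check, the frozen §2 cuts can be applied mechanically to the §3 values. The sketch below is illustrative only : the dictionary keys and `evaluate_cuts` are hypothetical names, `max_ratio` is assumed to be the per-feature maximum of MI_train / MI_val, and "invariant within ±1" is read here as a spread (max minus min) of at most 1.

```python
def evaluate_cuts(obs):
    """Apply the pre-registered cuts to one run's observed values."""
    g1 = list(obs["g1_best_iter_by_noise"].values())
    return {
        # I.3: walk-forward recovers only if BOTH conditions hold.
        "M3_recovers": obs["wf_best_iter"] > 30 and obs["wf_proba_std"] > 0.10,
        # I.1: any feature with |corr(f_i[t], y[t+1])| above 0.05.
        "I1_fires": obs["i1_max_abs_corr"] > 0.05,
        # I.2: MI_train > 2 x MI_val, expressed as a ratio (assumption).
        "I2_fires": obs["i2_max_mi_ratio"] > 2.0,
        # G.1: best_iter does not move across noise levels (assumed spread <= 1).
        "G1_fires": max(g1) - min(g1) <= 1,
    }


# Values copied from the observed-evidence table of the clean run.
observed = {
    "wf_best_iter": 1,
    "wf_proba_std": 0.2026,
    "i1_max_abs_corr": 0.0510,
    "i2_max_mi_ratio": 30.48,
    "g1_best_iter_by_noise": {0: 1, 10: 1, 20: 1, 30: 1},
}
flags = evaluate_cuts(observed)
```

With these inputs the walk-forward cut fails on `best_iter` alone (`proba_std` passes), I.1 clears its threshold by only 0.001, and I.2 clears it by more than an order of magnitude.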
4. Adjudicated reading (cold-eyes 9b4ab4b2, 5/5 strong)
- `best_iter=1` is real and robust : canonical AND walk-forward both = 1 ⇒ M3 ruled out (not the fold's train/val boundary). Weak genuine signal (AUC 0.58), consistent with S22A5+A5b.
- M5 is the PRIMARY mechanism : 87/320 KS-flagged features + 26 MI-divergent features (ratio up to 30), balanced labels (χ² p=0.50). Severe train/val covariate shift : the model fits the train distribution ; val being distributionally different pins `best_iter` at 1 (every round after the first worsens val-loss).
- M6 (G.1 range=0) FIRES but is CONFOUNDED. The committee endorsed verbatim : "It is impossible to observe a shift in best_iter due to label noise when the model already [collapses at round 1]." G.1 invariance here is more likely a consequence of the `best_iter=1` collapse than clean proof of label non-consumption. The §3.2 tree fired `M5_AND_M6` mechanically (both conditions met) ; scientifically M5 is fundamental, M6 needs disambiguation.
- I.1 (2 features at the 0.05 knife-edge) is marginal, not load-bearing.
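For readers who want to reproduce the kind of per-feature KS count cited above, here is a minimal pure-Python sketch. It is not the project's implementation : `ks_statistic` and `ks_flagged` are hypothetical names, the data is synthetic, and the rejection rule uses the standard large-sample critical value rather than an exact p-value.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties move together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_flagged(train_cols, val_cols, alpha=0.01):
    """Indices of features whose train/val KS statistic exceeds the critical value."""
    # Large-sample approximation: c(alpha) = sqrt(-ln(alpha / 2) / 2).
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    flagged = []
    for idx, (tr, va) in enumerate(zip(train_cols, val_cols)):
        crit = c_alpha * math.sqrt((len(tr) + len(va)) / (len(tr) * len(va)))
        if ks_statistic(tr, va) > crit:
            flagged.append(idx)
    return flagged


# Synthetic columns: feature 0 is identical in both splits, feature 1 is shifted.
stable = [k / 100 for k in range(100)]
shifted = [1.0 + k / 100 for k in range(100)]
flagged = ks_flagged([stable, stable], [stable, shifted])
```

A per-feature list of flagged indices, rather than a single count, is what makes the "inspect the flagged feature indices" follow-up possible.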
5. Complementary point surfaced by the clean run
G.1 records only best_iter (val-loss argmin), not the train-loss response to label noise — the decisive M6 test. If labels ARE consumed, train-loss at 30% noise ≫ train-loss at 0% ; if train-loss is ALSO invariant → labels genuinely not consumed (M6 real). Cold-eyes endorsed enriching G.1 with train-loss-vs-noise as "critical and correct" before any M6 harness audit. (This is exactly the kind of complementary evidence a crash-on-verdict would have suppressed — the no-crash + probe-isolation fix is what let the full evidence surface.)
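The logic of the train-loss-vs-noise disambiguator can be illustrated on toy data. The sketch below is an assumption-laden stand-in : a tiny one-feature logistic regression replaces the real booster, the data is synthetic, and every name (`flip_labels`, `train_loss_after_fit`, `loss_by_noise`) is hypothetical. It only shows the expected signature when labels ARE consumed : train-loss rises with the noise rate.

```python
import math
import random


def flip_labels(y, rate, rng):
    """Flip each binary label independently with probability `rate`."""
    return [1 - yi if rng.random() < rate else yi for yi in y]


def train_loss_after_fit(x, y, steps=300, lr=1.0):
    """Fit a 1-feature logistic regression by gradient descent; return train log-loss."""
    w = b = 0.0
    n = len(x)
    for _ in range(steps):
        gw = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            gw += (p - yi) * xi
            gb += p - yi
        w -= lr * gw / n
        b -= lr * gb / n
    loss = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return loss / n


rng = random.Random(1337)
x = [rng.uniform(-1.0, 1.0) for _ in range(400)]
y_clean = [1 if xi > 0 else 0 for xi in x]  # labels carry real signal

loss_by_noise = {}
for pct in (0, 10, 20, 30):
    noisy = flip_labels(y_clean, pct / 100, random.Random(1337 + pct))
    loss_by_noise[pct] = train_loss_after_fit(x, noisy)
```

If the same experiment on the real harness yields a flat `loss_by_noise`, that would point to labels not reaching the trainer ; a rising curve would indicate that G.1's invariance is a side effect of the `best_iter=1` floor.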
6. Disposition
- S22A6 closes with `M5_AND_M6`, a defensible Axis-3 characterisation (cold-eyes `9b4ab4b2` PASSED 5/5). Verdict code `ML_USELESS` : the ML system is NOT production-usable (no further eval / backtest / deploy) until M5/M6 are resolved and `best_iter=1` is eliminated.
- Open M5 remediation Story (PRIORITY) : audit why train/val feature distributions diverge so severely at fold 3. Candidates : normalisation fitted-on-train leakage, non-stationary raw features drifting between train and val windows, or fold-3 val = regime-shifted period. Inspect the flagged feature indices. Validate across folds/assets (ADR-14). Ablation on the 26 MI-divergent features.
- Open M6 remediation Story : add the G.1 `train-loss-vs-noise` disambiguator first ; then (if M6 confirmed independent) audit dtrain construction / label binding.
- Strengthen upstream drift detection (post-Enrichment, pre-FeatureEngineering, Feast feature store) + document purging/embargo for future folds.
- No remediation implemented in S22A6 (characterisation only, per scope §3 + `feedback_stop_hacking`).
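As a starting point for the purging/embargo documentation mentioned above, a strict-chronological 80/20 split with an embargo gap might look like the sketch below. It assumes rows are already sorted by time ; `chrono_split`, the embargo length, and the choice to drop the last `embargo` training rows are illustrative assumptions, not the project's convention.

```python
def chrono_split(n_rows, train_pct=80, embargo=0):
    """Strict-chronological split with an optional embargo gap before validation."""
    cut = (n_rows * train_pct) // 100      # integer arithmetic, no float rounding
    train_end = max(cut - embargo, 0)      # drop the last `embargo` training rows
    train_idx = list(range(train_end))
    val_idx = list(range(cut, n_rows))
    return train_idx, val_idx


# Example: 100 time-ordered rows, 80/20 split, 5-row embargo (placeholder value).
train_idx, val_idx = chrono_split(n_rows=100, train_pct=80, embargo=5)
```

The embargo should be at least as long as the label horizon and any rolling-feature window, so that no training row overlaps the information used by a validation row.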
7. Artefacts
- Verdict JSON : `/tmp/s18-diagnostic/s22a6-leakage-AAVEUSDC-fold3.json` (pod-ephemeral) ; full payload in the run log + §3 above.
- Cold-eyes session : `committee/sessions/9b4ab4b2_committee.json` (ADR-82 Meeting #170).
- Plan dossier : 2026-05-20-cvn-n001-ee-s22a6-leakage-walkforward-plan.md.
- Code : `src/commun/finetune/diagnostic/s22_a6_leakage_walkforward.py`, `dags/dag_diagnostic__s22_a6.py` (PR #1014 + no-crash hotfix #1015).