Plan dossier : CVN-N001-EE-S22A6 leakage + walk-forward + data sanity (Axis-3, M3-M6)
Date : 2026-05-20 · Story : CVN-N001-EE-S22A6 (wp#186) · Issue : #968
Parent : CVN-N001-EE-S22 (wp#168) · Epic MAYDAY CVN-N001-EF
Session type : plan_review
Scientific basis (committee-approved) : parent S22 plan dossier v3 §S22-I (sub-tests I.1 / I.2 / I.3) + §3.2 (rows M3 / M5 / M6) + §S22-G.1 (label-noise injection) — committee 31b95c00 PASSED 9.0/10, 0 dissent. This micro-story inherits that pre-specified design + decision tree ; no post-hoc.
Entry gate : S22A5 INCONCLUSIVE_UNCOVERED (wp#185 Closed) + S22A5b CONSISTENT_UNCOVERED (wp#199 Closed, multi-cell defi_top5). Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER (Axis-1 OUT) → S22A5+A5b (Axis-2 OUT) → S22A6 (Axis-3 = M3-M6, this Story).
1. Decisional question (pre-specified)
Axis-1 (M1+M2 hyperparameters) and Axis-2 (signal existence) are both ruled out. The convergence bug best_iter = 1 is real, seed-independent, full-curve-visible, cross-crypto on defi_top5 at fold=3, and not a class-balance / learning-rate / signal-feasibility artefact. By process of elimination, the bug lives in Axis-3 : train/val construction, leakage, or harness row-mismatch.
S22A6 isolates which mechanism by applying the S22 plan v3 §3.2 rows M3 / M5 / M6 verbatim, with the §S22-I.1 / I.2 / I.3 sub-tests + the §S22-G.1 label-noise injection on the canonical cell (AAVEUSDC fold=3, S18-captured parquet, single seed 1337).
| Probe | Spec | Verdict-load-bearing |
|---|---|---|
| I.1 : temporal correlation | for each feature f_i, compute corr(f_i[t], label[t]), corr(f_i[t], label[t+1]), corr(f_i[t], label[t-1]) ; flag if \|corr(f_i[t], label[t+1])\| > 0.05 (forward-leakage suspect) | M5 |
| I.2 : per-split MI | MI(f_i, label) on train AND on val separately ; flag if MI_train > 2 × MI_val (train-only signal suspect) | M5 |
| I.3 : walk-forward strict-chrono | rebuild train/val by date (no shuffle=True, no purge cheat), strict chronological order ; re-run _train_full_no_earlystop on this split with the canonical hp ; record best_iter, proba_std_val, AUC | M3 |
| G.1 : label-noise injection | inject random label flips at 0 % / 10 % / 20 % / 30 % ; re-run training each ; record best_iter[noise_level] (per master plan §S22-G method 1) | M6 |
| KS : train/val feature shift | Kolmogorov-Smirnov 2-sample test on each feature, train vs val ; flag features with p < 0.01 (descriptive only) | descriptive → context for M3 |
| χ² : label distribution shift | χ² test on label class proportions train vs val ; report p (descriptive only) | descriptive → context for M3 |
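A minimal sketch of the I.1 probe, assuming a dense `(n_rows, n_features)` array and a numeric label vector in row order. The function names `lagged_label_corr` and `_safe_corr` are hypothetical and are not taken from the S22A6 module ; a constant feature is reported as NaN and never flagged, which is one possible degenerate-input convention.

```python
import numpy as np


def _safe_corr(a, b):
    """Pearson correlation, NaN when either side is constant (undefined corr)."""
    if a.std() == 0 or b.std() == 0:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1])


def lagged_label_corr(X, y, threshold=0.05):
    """Correlate each feature f[t] with label[t-1], label[t] and label[t+1].

    A feature is flagged only on the forward lag, with a strict '>' cut.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rows = []
    for i in range(X.shape[1]):
        f = X[:, i]
        backward = _safe_corr(f[1:], y[:-1])   # f[t] vs label[t-1]
        same = _safe_corr(f, y)                # f[t] vs label[t]
        forward = _safe_corr(f[:-1], y[1:])    # f[t] vs label[t+1]
        flagged = bool(np.isfinite(forward) and abs(forward) > threshold)
        rows.append({"feature": i, "lag_minus_1": backward, "lag_0": same,
                     "lag_plus_1": forward, "flagged": flagged})
    return rows
```

A feature built as a copy of the next row's label gives a forward correlation of 1 and is flagged, which makes a convenient synthetic fixture for the unit tests.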
Aggregate verdict (frozen, no post-hoc) :
| Pattern | Verdict | Next |
|---|---|---|
| I.3 walk-forward recovers best_iter > 30 ∧ proba_std_val > 0.10 | M3_CONFIRMED : train/val construction bias | remediation Story : fold layout audit + window selection rule |
| ≥ 1 feature flagged by I.1 OR I.2 (and I.3 does not recover) | M5_CONFIRMED : leak feature(s) listed in the verdict | remediation Story : feature audit + extend F4 to lower correlation thresholds |
| best_iter[noise_level] identical within ±1 across {0, 10, 20, 30}% (and I.3 does not recover, and I.1/I.2 are clean) | M6_CONFIRMED : harness row-mismatch (labels not actually consumed by the training loop) | remediation Story : deep harness audit (dtrain construction, feature ordering, label binding) |
| Two-or-more of M3 / M5 / M6 fire | M3_AND_OTHER / M5_AND_OTHER / M3_AND_M5_AND_M6 as applicable | open the corresponding remediation Stories in parallel |
| None of M3 / M5 / M6 fires exactly | M7_ESCALATE : none of M3-M6 fits ; full evidence dossier to committee (per master plan §3.2 last row, M1-M11 exhausted in spirit ; do NOT autonomously propose remediation) | committee adjudication |
| Any probe tooling-fail / non-finite / S22A1 precondition non-scientific | INCONCLUSIVE_TOOLING | clean AirflowFailException (red), never a Python crash |
best_iter > 30 and proba_std > 0.10 are the §3.1 healthy-reference thresholds, identical to S22A3 / S22A4 / S22A5 conventions (no new constant introduced).
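The decision tree above can be sketched as a pure function. This is an illustration under stated assumptions, not the module's code : `aggregate_verdict` and its parameters are hypothetical names, each mechanism is evaluated as an independent "fire" flag before combining (the reading under which a combined verdict such as M5_AND_M6 is reachable), and "identical within ±1" is read as max minus min ≤ 1.

```python
HEALTHY_BEST_ITER = 30     # §3.1 healthy-reference threshold, strict '>'
HEALTHY_PROBA_STD = 0.10   # §3.1 healthy-reference threshold, strict '>'


def aggregate_verdict(i3_best_iter, i3_proba_std_val, n_flagged,
                      g1_best_iter_by_noise, tooling_ok=True):
    """Map probe evidence to a verdict string (hypothetical helper)."""
    iters = list(g1_best_iter_by_noise.values())
    if not tooling_ok or not iters:
        return "INCONCLUSIVE_TOOLING"
    # M3 : the strict-chronological split recovers a healthy training run.
    m3 = i3_best_iter > HEALTHY_BEST_ITER and i3_proba_std_val > HEALTHY_PROBA_STD
    # M5 : at least one feature flagged by I.1 or I.2.
    m5 = n_flagged >= 1
    # M6 : best_iter does not react to label noise (assumed : spread <= 1).
    m6 = (max(iters) - min(iters)) <= 1
    fired = [name for name, hit in (("M3", m3), ("M5", m5), ("M6", m6)) if hit]
    if not fired:
        return "M7_ESCALATE"
    if len(fired) == 1:
        return f"{fired[0]}_CONFIRMED"
    return "_AND_".join(fired)   # e.g. M5_AND_M6, M3_AND_M5_AND_M6
```

Keeping the function free of I/O makes every verdict row and every strict-`>` boundary testable with plain integers and floats.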
1.1 DAG verdict-severity contract : NO crash, ever (operator rule)
Superseded 2026-05-21. The first execution of diagnostic__s22_a6 produced verdict M5_AND_M6, which under the previous contract (below) raised an AirflowFailException → a Task failed with exception traceback in the Airflow log. The operator ruled this a crash : "il ne doit pas y avoir de crash dans l'exécution du DAG, c'est une règle que j'ai posée". An AirflowFailException IS a visible crash, even for a scientific verdict that needs attention. This operator rule overrides the committee pr_review c5617cb1 recommendation that wanted RED-via-exception (committee is advisory, ADR-68) — and the operator additionally noted that a raise can hide complementary points (an exception aborts execution before the remaining evidence is computed/logged).
Corrected contract : the DAG task ALWAYS returns the verdict dict (green) and NEVER raises to signal an outcome. Severity is expressed only via structured event=... severity=... logs (Loki → Grafana = the operator's real alert channel, ADR-26/30). The Airflow task colour is no longer the alert surface.
| Verdict | Task outcome | Structured log |
|---|---|---|
| M3_CONFIRMED / M5_CONFIRMED / M6_CONFIRMED | returns dict (green) | event=s22a6_outcome severity=info status=<...> (+ flagged-feature idx / G.1 by_noise in the verdict JSON) |
| M3_AND_M5 / M3_AND_M6 / M5_AND_M6 / M3_AND_M5_AND_M6 | returns dict (green) | event=s22a6_escalation_required severity=error reason=multiple_mechanisms_fire |
| M7_ESCALATE | returns dict (green) | event=s22a6_escalation_required severity=error reason=no_axis3_mechanism_fires ; full evidence dossier mandatory before any remediation Story |
| INCONCLUSIVE_TOOLING | returns dict (green) | event=s22a6_inconclusive_tooling severity=error |
| contract violation (unknown status) | returns dict (green) | event=s22a6_contract_violation severity=error |
The verdict JSON is persisted best-effort ; a persistence I/O error is logged severity=error and swallowed (the verdict still reaches XCom via the task return). Every precondition guard (artifact_dir out of bounds, stale-parquet unremovable, captured-parquet missing) likewise logs severity=error + returns an INCONCLUSIVE_TOOLING dict — no raise. Only a genuine infra fault (OOM / import error / network) may surface as a task failure.
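The verdict → severity table can be expressed as a total function that never raises, whatever it is handed. The sketch below is illustrative only : `severity_for_status` and `format_event` are hypothetical names, and the real DAG emits its events through the project's own logging helper rather than through string formatting.

```python
SINGLE = {"M3_CONFIRMED", "M5_CONFIRMED", "M6_CONFIRMED"}
COMBINED = {"M3_AND_M5", "M3_AND_M6", "M5_AND_M6", "M3_AND_M5_AND_M6"}


def severity_for_status(status):
    """Return the structured-log fields for a verdict status; never raises."""
    if isinstance(status, str):
        if status in SINGLE:
            return {"event": "s22a6_outcome", "severity": "info", "status": status}
        if status in COMBINED:
            return {"event": "s22a6_escalation_required", "severity": "error",
                    "reason": "multiple_mechanisms_fire"}
        if status == "M7_ESCALATE":
            return {"event": "s22a6_escalation_required", "severity": "error",
                    "reason": "no_axis3_mechanism_fires"}
        if status == "INCONCLUSIVE_TOOLING":
            return {"event": "s22a6_inconclusive_tooling", "severity": "error"}
    # Unknown or non-string status : contract violation, still no exception.
    return {"event": "s22a6_contract_violation", "severity": "error"}


def format_event(fields):
    """Render fields as space-separated key=value pairs, in insertion order."""
    return " ".join(f"{key}={value}" for key, value in fields.items())
```

The `isinstance` check guards against unhashable garbage (a list, a dict) that would otherwise raise `TypeError` on set membership.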
Probe isolation (companion correction) : the six probes (I.1/I.2/I.3/G.1/KS/χ²) each run in their own try/except inside `run_s22a6`. A failure logs `severity=error` (load-bearing : I.1/I.2/I.3/G.1) or `severity=warn` (descriptive : KS/χ²), records the failed probe, and continues ; a single probe crash must never hide the complementary evidence. If any load-bearing probe failed, the verdict is `INCONCLUSIVE_TOOLING`, but built with the partial evidence from the probes that did compute.
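The isolation pattern described above can be sketched as a loop with one try/except per probe. This is a hedged illustration : `run_probes_isolated`, the probe-name keys and the use of the standard `logging` module are assumptions made for the example, not the actual `run_s22a6` internals.

```python
import logging

log = logging.getLogger("s22a6_sketch")

LOAD_BEARING = {"I.1", "I.2", "I.3", "G.1"}   # KS and chi2 are descriptive only


def run_probes_isolated(probes):
    """Run each probe callable on its own; one failure never stops the others.

    Returns (evidence, failed, inconclusive): evidence maps probe name to its
    result (None on failure), failed lists the probes that raised, and
    inconclusive is True when at least one load-bearing probe failed.
    """
    evidence, failed = {}, []
    for name, probe in probes.items():
        try:
            evidence[name] = probe()
        except Exception as exc:   # deliberate catch-all: record and continue
            load_bearing = name in LOAD_BEARING
            log.log(
                logging.ERROR if load_bearing else logging.WARNING,
                "event=s22a6_probe_failed severity=%s probe=%s error=%r",
                "error" if load_bearing else "warn", name, exc,
            )
            evidence[name] = None
            failed.append(name)
    inconclusive = any(name in LOAD_BEARING for name in failed)
    return evidence, failed, inconclusive
```

Because the partial `evidence` dict is returned even when `inconclusive` is true, an `INCONCLUSIVE_TOOLING` verdict can still carry the results of the probes that did compute.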
Previous contract (2026-05-20, superseded : RED via AirflowFailException)
The earlier version (committee `pr_review` `c5617cb1`, OP Meeting #166) elevated combined verdicts + M7 to RED via a clean `AirflowFailException`. This was reverted on 2026-05-21 per the operator no-crash rule above.
1.2 Pre-probe data-integrity guards
Added 2026-05-20 following committee pr_review c5617cb1 recommendation 6 :
- Parquet row-order monotonicity : I.3 walk-forward is meaningful only if the captured parquet rows are in chronological order. The S18 Step 1 capture (`verbose_capture=True`) preserves the original time-ordered fold rows, but the run-time guard makes this explicit. If the parquet carries a recognisable timestamp column (e.g. `open_time`, `_timestamp`, `timestamp`, `ts`), `_walkforward_split_from_parquet` validates strictly-non-NaT parsing + `np.diff(timestamps).min() >= 0` and, on violation, raises a `RuntimeError` that the isolated I.3 probe catches → `INCONCLUSIVE_TOOLING` (returned, never a DAG crash). When no timestamp column is present (current S18 capture schema), the implicit chronological row-order is documented + logged (`event=s22a6_walkforward_chronology_implicit`) and a follow-up Story will add an explicit timestamp column to S18 capture for stronger validation.
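A minimal sketch of the monotonicity guard, assuming the timestamp column has already been read into a sequence that numpy can parse as `datetime64`. The name `check_chronological` is hypothetical ; rejecting an empty column is an added assumption, and the raised `RuntimeError` is meant to be caught by the isolated I.3 probe, not to reach the DAG.

```python
import numpy as np


def check_chronological(timestamps):
    """Raise RuntimeError unless timestamps are all valid and non-decreasing."""
    ts = np.asarray(timestamps, dtype="datetime64[ns]")
    if ts.size == 0:
        raise RuntimeError("timestamp column is empty")
    if np.isnat(ts).any():
        raise RuntimeError("timestamp column contains NaT values")
    # np.diff on datetime64 yields timedelta64; a negative step means a row
    # that is older than its predecessor. Equal timestamps are accepted.
    if ts.size > 1 and np.diff(ts).min() < np.timedelta64(0, "ns"):
        raise RuntimeError("captured rows are not in chronological order")
    return int(ts.size)
```

The NaT check runs before `np.diff` because a NaT anywhere in the column would otherwise propagate into the differences and make the minimum comparison meaningless.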
1.3 ADR alignment (explicit cross-reference)
Per committee pr_review c5617cb1 recommendation 7 :
- ADR-25 : no silent fallback. Degenerate AUC → chance + WARN ; every RED path raises `AirflowFailException`, never a raw Python crash.
- ADR-30 : structured logging is durable ; every verdict, every escalation, every WARN reaches Loki via `log_event` (the canonical helper).
- ADR-31 / 32 / 33 : no `print()` ; `event=key=value` format ; events listed in the closed catalogue (`s22a6_started`, `s22a6_probes_started`, `s22a6_walkforward_split`, `s22a6_verdict`, `s22a6_verdict_persisted`, `s22a6_phase_a_started`, `s22a6_phase_a_done`, `s22a6_phase_a_evidence`, `s22a6_phase_a_skipped`, `s22a6_base_env_injected`, `s22a6_stale_parquet_removed`, `s22a6_precondition_failed`, `s22a6_protocol_nonconformance`, `s22a6_degenerate_auc`, `s22a6_hp_resolve_failed`, `s22a6_probe_failed`, `s22a6_verdict_persist_failed`, `s22a6_inconclusive_tooling`, `s22a6_escalation_required`, `s22a6_contract_violation`, `dag_loaded`).
- ADR-59 / 90 : every diagnostic probe parameter (I.1 / I.2 thresholds, walk-forward `val_fraction`, label-noise levels) is a run-level diagnostic constant, NEVER written to PG `ftf_config`. The canonical LGB HP load goes through the ADR-90 fail-fast resolver `commun.finetune.hyperparams.resolve` ; any miss → `INCONCLUSIVE_TOOLING`.
- ADR-92 : `dag_diagnostic__s22_a6` is ADR-92-versioned from day-one (`dag_doc_md` banner + `dag_loaded_event` first-task event + `make_tags` build tag).
2. Design — new probes + walk-forward re-train ; ADR-92 versioned DAG ; reuse the proven trainer¶
Single-pod chain, exact mirror of merged dag_diagnostic__s22_a5 post-PR-1010 (ADR-92 versioned : dag_doc_md banner + dag_loaded_event first-task event + make_tags build tag). New module commun/finetune/diagnostic/s22_a6_leakage_walkforward.py : run_s22a6(...) -> S22A6Verdict + write_s22a6_verdict(...) ; new DAG dags/dag_diagnostic__s22_a6.py (mirror s22_a5 : BASE_ENV inject #991, sentinel-sha #981, persist-before-fail #983, AirflowFailException-not-crash, single-pod Phase A+B, path-bound + protocol-conformance WARN, atomic verdict write — all S22A3/A4/A5 R2-R6 invariants).
- Reuse `run_s22a1` wholesale for preconditions + S22A1 anchor cross-ref (S22A3/A4/A5 precedent, zero duplication).
- Reuse `_train_full_no_earlystop` for I.3 walk-forward re-train (same canonical hp, `return_val_proba=True` to compute AUC like S22A5) and for G.1 label-noise re-trains (4 trainings at 0 / 10 / 20 / 30 % noise, seeded permutation).
- Walk-forward split : new helper `_walkforward_split(captured_parquet, val_fraction=0.20)` that orders by the parquet's timestamp column (or the row index if the parquet was captured chronologically ; S22-I.3 specifies "no shuffle"). Builds train = first 80 % by time, val = last 20 % by time. Important : per ADR-14, a single 80/20 walk-forward is the minimum viable test for M3 ; a full multi-fold walk-forward is intrinsic to any subsequent remediation Story but out of scope here (≤1-day budget, S22 plan §S22-I.3 verbatim says "Re-run S22-B on this split", singular).
- Temporal correlation (I.1) : pure numpy, `np.corrcoef(X[:, i], y_shifted_by_k)` for k ∈ {-1, 0, +1} per feature.
- Per-split MI (I.2) : `sklearn.feature_selection.mutual_info_classif` separately on train and val ; flag where `MI_train > 2 × MI_val`.
- Label-noise injection (G.1) : seeded `rng(1337).choice(n, k=ratio*n)` rows flipped per noise level ; re-run training, record `best_iter`.
- KS / χ² : sklearn-free implementations from `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency`.
- Single seed 1337 (canonical, pre-registered ; seed-independence CLOSED by S22A2 ; a multi-seed re-run busts the ≤1-day budget).
- Artifact : `/tmp/s18-diagnostic/s22a6-leakage-AAVEUSDC-fold3.json` + Loki `event=s22a6_verdict` ; identical persistence model to S22A4/A5 (pod-ephemeral JSON, durable Loki).
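The walk-forward split and the G.1 label flips listed above can be sketched with numpy alone. The helper names `walkforward_indices` and `flip_labels` are hypothetical, labels are assumed to be binary {0, 1}, and rows are assumed to be already in chronological order. The design shorthand `rng(1337).choice(n, k=ratio*n)` is rendered here with numpy's `Generator.choice`, whose argument is `size` and which needs `replace=False` so that no row is flipped twice.

```python
import numpy as np


def walkforward_indices(n_rows, val_fraction=0.20):
    """Return (train_idx, val_idx): first rows for train, last rows for val."""
    n_val = int(round(n_rows * val_fraction))
    cut = n_rows - n_val
    return np.arange(cut), np.arange(cut, n_rows)


def flip_labels(y, ratio, seed=1337):
    """Return a copy of binary labels with a seeded fraction of rows flipped."""
    y = np.asarray(y).copy()
    k = int(round(ratio * y.size))
    rng = np.random.default_rng(seed)
    idx = rng.choice(y.size, size=k, replace=False)   # distinct rows only
    y[idx] = 1 - y[idx]                               # assumes labels in {0, 1}
    return y


NOISE_LEVELS = (0.0, 0.10, 0.20, 0.30)   # run-level constants, never persisted
```

A G.1 run would call the trainer once per entry of `NOISE_LEVELS` on `flip_labels(y_train, level)` and record `best_iter` for each level ; the trainer call itself is left out because its signature is not shown in this dossier.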
3. Scope / non-goals
NOT a fix — characterisation only (which Axis-3 mechanism, M3 or M5 or M6 or escalate). Zero proven-module change (the _train_full_no_earlystop(return_val_proba=True) opt-in shipped in S22A5 covers the AUC needs ; no further helper extension expected). NO new PG hyperparameters (G.1 noise levels, I.1 / I.2 thresholds, walk-forward val_fraction are all diagnostic probes, run-level, never written to ftf_config). Single cell (AAVEUSDC fold=3) ; single seed 1337 ; the multi-cell generalisation question is not re-opened here — S22A5b already established that the convergence bug holds cross-crypto on defi_top5 at fold=3, and the M3-M6 mechanism is expected to be the same on each cell. M4 is not in the master plan §3.2 main rows ; the OP description's "M3-M6" wording is treated as "M3 + M5 + M6" — a follow-up Story would handle any M4 (label-horizon misalignment, S22-L Variant L3) or M8-M11 if S22A6 verdicts M7_ESCALATE. Multi-fold walk-forward (5-fold per ADR-14) is intrinsic to the remediation Story for whichever Mx is confirmed, not to this characterisation. The ADR-92 retrofit of dag_diagnostic__s22_a6 is built-in from day one (Phase 1 of CVN-N014-EB-S01 already shipped the helpers).
4. Risks
- Walk-forward split definition : the captured parquet must carry a column that allows chronological ordering (timestamp index, row order, or a separately recorded date). The plan reuses the canonical parquet ; if the row order is not chronological, the walk-forward split degenerates to a random split and I.3 cannot test M3. Mitigation : the captured parquet is produced by S18 Step 1 with `verbose_capture=True`, which preserves the original time-ordered fold rows ; verify with a `np.diff(timestamps).min() >= 0` assertion at the start of I.3 and emit `INCONCLUSIVE_TOOLING` + WARN if not (per `feedback_no_python_crash_visible`).
- G.1 budget : 4 re-trainings × ~10 s each = ~40 s additional. Easily within the ≤ 1 day budget.
- I.1 / I.2 thresholds : `|corr| > 0.05` and `MI_train > 2 × MI_val` are the master-plan-verbatim cuts. Pre-registered here BEFORE the run. A flagged feature is suspicious, not guilty ; the M5 verdict requires at least one flagged feature, AND I.3 not recovering ; otherwise M5 is masked by M3.
- Multiple mechanisms firing simultaneously : if M3 and M5 both fire, the verdict is `M3_AND_M5` with both remediation paths queued. The order of remediation is operator-driven, not part of this Story.
- `statistically_non_defensible` cells : not applicable ; AAVEUSDC fold=3 was already established as defensible (n_buy_train=1725 / n_buy_val=353) by S22A5.
- Phase A divergence : S18 FAIL = bug reproducing, recorded as evidence (#982, proven non-blocking S22A2/A3/A4/A5).
5. Test plan
- Unit `tests/unit/test_s22_a6_leakage_walkforward.py` : decision tree (every verdict row : M3_CONFIRMED, M5_CONFIRMED, M6_CONFIRMED, M3_AND_M5, M3_AND_M6, M5_AND_M6, M3_AND_M5_AND_M6, M7_ESCALATE, INCONCLUSIVE_TOOLING) ; I.1 threshold boundary (`|corr|` exactly at 0.05 → not flagged, strict `>`) ; I.2 threshold boundary (`MI_train` exactly at `2 × MI_val` → not flagged, strict `>`) ; I.3 recovery threshold (`best_iter` exactly at 30 → not recovered, strict `>`) ; G.1 invariance test (best_iter identical within ±1 across noise levels → M6 candidate) ; walk-forward split determinism + chronological-order assertion ; KS / χ² degenerate-input guards ; verdict JSON round-trip ; S22A1-precondition-non-scientific → INCONCLUSIVE_TOOLING. No real LGB train / no real sklearn fit on big data : stub the trainer + tiny synthetic arrays (mirror `test_s22_a5_signal_existence.py` discipline).
- §1.1 DAG verdict-severity (no-crash) conformance (added 2026-05-20, corrected 2026-05-21) : a dedicated block (`_dag_severity_for`) mirrors the DAG task body's terminal status block and locks the verdict → severity mapping, asserting (i) the helper never raises for any status (valid or garbage), (ii) single-mechanism `M3/M5/M6_CONFIRMED` → `severity=info`, (iii) combined `M3_AND_*` / `M5_AND_*` / `M3_AND_M5_AND_M6`, `M7_ESCALATE`, `INCONCLUSIVE_TOOLING`, contract-violation → `severity=error`, (iv) a combined (error-severity) verdict is a normally-returned dataclass that is JSON-persistable with no exception anywhere.
- Probe isolation : (i) a descriptive-probe (KS) failure does not change the scientific verdict and is recorded as `None` + a failed-probe trail ; (ii) a load-bearing-probe (I.3) failure → `INCONCLUSIVE_TOOLING` but the I.1/I.2/G.1 evidence that did compute is retained on the verdict (no hidden complementary points).
- DAG control-flow validated by the system run (no Airflow-task harness, consistent with all S22 PRs).
6. Success criteria (binary)
Module + DAG (ADR-92 versioned) + tests merged (committee pr_review + CR converged) ; run executed ; verdict ∈ {M3_CONFIRMED, M5_CONFIRMED, M6_CONFIRMED, M3_AND_M5, M3_AND_M6, M5_AND_M6, M3_AND_M5_AND_M6, M7_ESCALATE, INCONCLUSIVE_TOOLING, INCONCLUSIVE_UNCOVERED} persisted (Loki + dossier) ; cold-eyes sign-off (S22 §6.4).
7. Review question
(a) Is the scope inclusion of S22-G.1 label-noise injection (4 trainings at 0/10/20/30 % noise) inside S22A6 the correct minimal way to make a confident M6 verdict emittable, given the OP description names M6 in scope and the master plan §3.2 row 5 explicitly requires the G.1 evidence, versus deferring M6 to a tactical follow-up Story if only M3 / M5 fire ?
(b) Is the single 80/20 walk-forward split (operator-tunable val_fraction Param, default 0.20) the right minimal viable I.3 test for M3, given ADR-14 mandates multi-fold for the remediation Story but not for this characterisation Story, and is the chronological-order assertion at the start of I.3 (emit INCONCLUSIVE_TOOLING if the captured parquet row order is not monotone-in-time) the right tooling-failsafe ?
(c) Is the strict-> interpretation of the master-plan thresholds (|corr| > 0.05 strict, MI_train > 2 × MI_val strict, best_iter > 30 strict, identical to S22A3/A4/A5 conventions and committee-endorsed in those Stories) the right pre-registered translation, refusing post-hoc relaxation at the knife-edge ?
(d) Does the ADR-92 versioned DAG (dag_diagnostic__s22_a6 retrofitted from day-one with dag_doc_md / dag_loaded_event / make_tags build tag, per the Phase-1 helpers shipped in PR #1010) correctly inherit the operator-visible build-provenance invariant, and is the same R2-R6 robustness boilerplate (BASE_ENV inject, sentinel-sha, persist-before-fail, no-raise / always-green task per §1.1, atomic verdict write) the right zero-deviation reuse from the merged S22A5 DAG ?
(e) Re-submit question 2026-05-21 (no-crash correction) : the first diagnostic__s22_a6 run produced M5_AND_M6, which under the prior contract raised an AirflowFailException (a Task failed with exception traceback). The operator ruled that a crash and set an absolute rule : the DAG task must never raise to signal an outcome ; it always returns the verdict dict (green), severity goes only through event=... severity=... logs (Loki/Grafana), and each probe is isolated so one failure can't hide the others' evidence. This overrides the committee c5617cb1 RED-via-exception recommendation (committee advisory, ADR-68). Does the corrected §1.1 contract (all verdicts → green task + structured severity ; probe isolation ; precondition guards return INCONCLUSIVE_TOOLING instead of raising ; best-effort persistence) correctly and completely honour the operator no-crash rule while preserving the pre-registered §3.2 scientific decision tree and not losing any complementary diagnostic evidence ?