Plan dossier : CVN-N001-EE-S22A5 negative controls + signal existence (Axis-2)

Date : 2026-05-19
Story : CVN-N001-EE-S22A5 (wp#185) · Issue : #967
Parent : CVN-N001-EE-S22 (wp#168) · Epic MAYDAY CVN-N001-EF
Session type : plan_review
Scientific basis (committee-approved) : parent S22 plan dossier v3 §S22-C (variants 3-7) + §S22-F + §S22-K + §3.3 Axis-2 decision tree ; committee 31b95c00 PASSED 9.0/10, 0 dissent. This micro-story inherits that pre-specified design + decision tree ; no post-hoc.
Entry gate : S22A4 = NEITHER (M1+M2 ruled out, cold-eyes 6fb3f0b8 PASSED, wp#184 Closed).
Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER.
The bug is real, seed-independent, full-curve-visible, NOT a class-balance / learning-rate artefact ⇒ Axis-2 isolation is the pre-specified next step.

1. Decisional question (pre-specified)

S22A4 ruled out Axis-1 hyperparameters (M1+M2). Before chasing Axis-3 (M3-M6, S22A6), does the dataset contain exploitable signal at all (Axis-2), and is the diagnostic harness itself trustworthy? Run the S22-C negative controls + S22-F baselines + S22-K signal-existence probes on the canonical cell (AAVEUSDC fold=3, S18-captured parquet, single seed 1337 = S22A2 median) and apply S22 plan v3 §3.3 verbatim.

Probes (all on the SAME captured parquet) :

| Probe | Spec | Pre-registered expectation |
| --- | --- | --- |
| N1 : shuffled labels (S22-C v3) | seeded `rng(1337)` permute `y_train` | AUC < 0.55 (chance) ; else harness leaks |
| N2 : shuffled features (S22-C v4) | seeded `rng(1337)` row-permute `X_train` | AUC < 0.55 (chance) ; else harness leaks |
| N3 : train/val swap (S22-C v5) | swap (train,val) roles | descriptive only ; feeds S22A6 §3.2 M3 |
| B1 : LogReg (S22-C v6 / S22-F) | `LogisticRegression(class_weight='balanced')`, scaler fit on train only | AUC vs 0.55 |
| B2 : shallow tree (S22-C v7 / S22-F) | LGB `max_depth=3, n_estimators=1` via `resolved_hp` override | AUC vs 0.55 |
| K1 : per-feature MI (S22-K m1) | `mutual_info_classif`, train + val, ranked | ≥1 feature MI > 0.05 ? |
| K2 : PCA→KNN (S22-K m2, reco #3) | `PCA(0.95)` + `StandardScaler` fit train-only → `KNeighborsClassifier(5)` | AUC vs 0.55 (canonical KNN for §3.3) |
| K2b : raw KNN (S22-K m2-bis) | `KNeighborsClassifier(5)` on raw features | sanity / curse-of-dim cross-check, NOT verdict-load-bearing |
| K3 : trivial rules (S22-K m3) | majority-class, prior-prob, top-MI single-feature threshold | AUC lower bounds, descriptive |

Operationalisation of §3.3's qualitative terms, pre-registered here BEFORE the run (no post-hoc) : "AUC ≈ 0.50" ⇒ strict cut AUC < 0.55 (the plan's own signal bar, symmetric) ; "MI ≈ 0" ⇒ max per-feature MI < 0.05 (verbatim §3.3 informative-feature threshold). §3.3 "K-NN" = the PCA-KNN (K2, per reco #3 — raw-KNN biased by curse-of-dim on ~50-150-feature FTF spaces).

Aggregate verdict (frozen, no post-hoc) :

| Pattern | Verdict | Next |
| --- | --- | --- |
| N1 or N2 AUC ≥ 0.55 | `NEG_CONTROL_LEAK` : harness/data leaks ; Axis-2 signal verdict NOT trusted | hand to S22A6 (leakage + walk-forward) ; strong M5/M6 signal |
| N1∧N2 clean and B1 > 0.55 ∧ B2 > 0.55 ∧ max-MI > 0.05 | `SIGNAL_EXISTS` : Axis-2 ruled out ; S22A4 NEITHER stands | proceed S22A6 (M3-M6) |
| N1∧N2 clean and B1 < 0.55 ∧ B2 < 0.55 ∧ PCA-KNN < 0.55 ∧ max-MI < 0.05 | `NO_SIGNAL` : no exploitable signal in feature space | escalate : data/label/split design review (separate Story, NOT CVN-N001-EE) + CVN-N001 feasibility re-evaluation (§3.3 row 2) |
| N1∧N2 clean and B1 > 0.55 ∧ B2 < 0.55 | `SIGNAL_MONOTONIC_ONLY` | diagnostic flag ; informs future FE Stories ; proceed S22A6 |
| N1∧N2 clean and B1 < 0.55 ∧ B2 > 0.55 | `SIGNAL_NON_MONOTONIC` (expected on crypto) | diagnostic flag, no action ; proceed S22A6 |
| N1∧N2 clean but no row above fires exactly | `INCONCLUSIVE_UNCOVERED` (scientific finding) | escalate per §3.2 M7 discipline, never a post-hoc reclassification ; DAG colour : GREEN + WARN escalation |
| any probe tooling-fail / non-finite / incomplete / S22A1 precondition non-scientific | `INCONCLUSIVE_TOOLING` | clean `AirflowFailException` (RED), never a Python crash ; distinct from `INCONCLUSIVE_UNCOVERED` above |
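The frozen table above can be read as an ordered rule list. A minimal sketch follows, assuming the six probe results arrive as plain floats. The function name `classify_verdict`, the argument order and the constant names are illustrative, not the module's actual API, and the tooling branch here covers only the missing / non-finite case.

```python
import math

AUC_CUT = 0.55  # pre-registered signal bar from section 1
MI_CUT = 0.05   # pre-registered informative-feature threshold


def classify_verdict(n1_auc, n2_auc, b1_auc, b2_auc, pca_knn_auc, max_mi):
    """Apply the section 1 verdict table top to bottom (hypothetical helper)."""
    values = (n1_auc, n2_auc, b1_auc, b2_auc, pca_knn_auc, max_mi)
    # Missing or non-finite probe output is an operational fault, not a finding.
    if any(v is None or not math.isfinite(v) for v in values):
        return "INCONCLUSIVE_TOOLING"
    # Harness-validity gate: a shuffled control at or above the cut means a leak.
    if n1_auc >= AUC_CUT or n2_auc >= AUC_CUT:
        return "NEG_CONTROL_LEAK"
    b1_hi, b1_lo = b1_auc > AUC_CUT, b1_auc < AUC_CUT
    b2_hi, b2_lo = b2_auc > AUC_CUT, b2_auc < AUC_CUT
    if b1_hi and b2_hi and max_mi > MI_CUT:
        return "SIGNAL_EXISTS"
    if b1_lo and b2_lo and pca_knn_auc < AUC_CUT and max_mi < MI_CUT:
        return "NO_SIGNAL"
    if b1_hi and b2_lo:
        return "SIGNAL_MONOTONIC_ONLY"
    if b1_lo and b2_hi:
        return "SIGNAL_NON_MONOTONIC"
    # No row fired exactly: a scientific finding, escalated per section 3.2 M7.
    return "INCONCLUSIVE_UNCOVERED"
```

With B1 = 0.638, B2 = 0.618 and max MI = 0.019, and placeholder values below the cut for N1, N2 and PCA-KNN, this sketch returns `INCONCLUSIVE_UNCOVERED` : both baselines clear the AUC cut but max MI does not clear 0.05, so no row matches.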

best_iter for LGB probes is 1-indexed : the 0-indexed argmin of val_loss over the full curve, plus 1, with early stopping DISABLED (identical convention to S22A3/A4). AUC = roc_auc_score ; degenerate constant predictions ⇒ AUC pinned to 0.50 + WARN event=s22a5_degenerate_auc (conservative, never a crash).
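Both conventions in the paragraph above are small enough to sketch. The helper names are hypothetical. The dossier's AUC is `roc_auc_score` ; the sketch swaps in a pure-Python pairwise (Mann-Whitney) AUC so that it runs without sklearn, and it returns a flag alongside the AUC so the caller can see when the pin fired.

```python
import logging
import math

logger = logging.getLogger("s22a5_sketch")


def best_iter_1_indexed(val_loss):
    """0-indexed argmin of the full validation curve, plus 1 (first minimum wins)."""
    if not val_loss or not all(math.isfinite(v) for v in val_loss):
        raise ValueError("val_loss curve must be non-empty and finite")
    argmin0 = min(range(len(val_loss)), key=val_loss.__getitem__)
    return argmin0 + 1


def guarded_auc(y_true, proba):
    """Return (auc, degenerate_flag); constant predictions pin the AUC to 0.50."""
    if len(y_true) != len(proba) or len(proba) == 0:
        raise ValueError("y_true and proba must be non-empty and equally long")
    if len(set(proba)) == 1:
        # Constant output carries no ranking information: pin and warn, never crash.
        logger.warning("event=s22a5_degenerate_auc n=%d", len(proba))
        return 0.5, True
    pos = [p for y, p in zip(y_true, proba) if y == 1]
    neg = [p for y, p in zip(y_true, proba) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    # Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties = 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg)), False
```

For example, `best_iter_1_indexed([0.9, 0.7, 0.8])` gives 2, and a constant prediction vector yields `(0.5, True)` together with the WARN log line.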

2. Design : reuse the proven trainer ; ONE narrow additive opt-in (S22A4 §7(a) precedent)

Correction (post-plan_review, disclosed to pr_review) : the plan_review dossier claimed "zero proven-module change". Implementation review found that §3.3 needs val-prediction AUC for the only LGB baseline (B2 shallow-tree) + the LGB negative controls (N1/N2). _train_full_no_earlystop computes booster.predict(X_val) internally but returns only (val_loss, train_loss, proba_std) ; the predictions are discarded. Getting AUC therefore requires one of :
(i) a narrow additive opt-in return ;
(ii) a local copy-paste of the lgb.train block (the anti-pattern CR hammered in S22A3/A4) ;
(iii) swapping B2 to a non-LGB model (a post-hoc protocol deviation from S22 v3 §S22-C v7).
Resolution : option (i), a narrow additive opt-in return_val_proba: bool = False kwarg on _train_full_no_earlystop. Default False ⇒ the existing 3-tuple is byte-identical for S22A1/A2/A3/A4 (3 prod callers + 2 test refs, unit-locked) ; True ⇒ additionally returns val_proba. This is the exact pattern committee ddb05ad9 already endorsed for the S22A4 scale_pos_weight_override kwarg (additive, opt-in, default byte-identical, unit-locked). The literal S22 v3 protocol (LGB shallow-tree + LGB negative controls measured by AUC) is preserved, with no scientific protocol deviation. Operator-approved 2026-05-19, flagged here for pr_review.
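The additive opt-in pattern can be illustrated with a stub. This is a sketch under stated assumptions, not the real `_train_full_no_earlystop` : the stub name, its single `x_val` argument, the fake loss curves and the choice to append `val_proba` as a fourth tuple element are all hypothetical, since the dossier only says the flag "additionally returns val_proba".

```python
from typing import Sequence


def train_full_no_earlystop_stub(x_val: Sequence[float], *, return_val_proba: bool = False):
    """Stand-in trainer showing an additive, default-off return extension."""
    # Fake curves and predictions; a real trainer would fit a booster here.
    val_loss = [0.70, 0.65, 0.66]
    train_loss = [0.69, 0.60, 0.55]
    val_proba = [min(max(x, 0.0), 1.0) for x in x_val]
    mean = sum(val_proba) / len(val_proba)
    proba_std = (sum((p - mean) ** 2 for p in val_proba) / len(val_proba)) ** 0.5

    if return_val_proba:
        # Opt-in path (assumed shape): the same three values plus the predictions.
        return val_loss, train_loss, proba_std, val_proba
    # Default path: the legacy 3-tuple, untouched for existing callers.
    return val_loss, train_loss, proba_std
```

Existing callers keep unpacking three values ; only the new probes pass `return_val_proba=True` and unpack a fourth. A unit lock on the default path amounts to asserting that the first three elements are identical in both modes.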

Post-first-run clarification (2026-05-19, follow-up fix PR) : the first operator-triggered run produced a valid scientific verdict (B1=0.638, B2=0.618 — both > cut ; max_MI=0.019 < cut — uncovered §3.3 pattern → §3.2 M7 escalation) but the DAG colour layer crashed RED because the implementation used a single string "INCONCLUSIVE" for both the scientific uncovered case (§1 last "verdict" row, scientific finding) and the tooling case (§1 last "tooling" row, operational fault) ; only the latter was meant to be RED. This contradicted §1 of this dossier (which already conceptually distinguished the two — see verdict table above) and feedback_no_python_crash_visible. Resolution : explicit split into INCONCLUSIVE_UNCOVERED (GREEN + WARN escalation, scientific finding → §3.2 M7) and INCONCLUSIVE_TOOLING (RED AirflowFailException, tooling). No scientific contract change — the same two cases were always pre-registered ; the implementation now matches the dossier distinction explicitly. Locked by test_run_uncovered_is_inconclusive_uncovered_not_tooling.
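The colour split described above reduces to a small mapping. In the sketch below, `FailTaskError` stands in for Airflow's `AirflowFailException` so the block runs without Airflow, and `finish_task` is a hypothetical name. Two assumptions go beyond the dossier : the four covered scientific verdicts end plain GREEN, and an unknown label is treated as a tooling fault.

```python
import logging

logger = logging.getLogger("s22a5_sketch")


class FailTaskError(Exception):
    """Stand-in for airflow.exceptions.AirflowFailException (assumption)."""


# Assumption: covered scientific verdicts finish GREEN without a warning.
GREEN = {"SIGNAL_EXISTS", "NO_SIGNAL", "SIGNAL_MONOTONIC_ONLY", "SIGNAL_NON_MONOTONIC"}
# Stated in the dossier: both of these are green + WARN.
GREEN_WARN = {"NEG_CONTROL_LEAK", "INCONCLUSIVE_UNCOVERED"}


def finish_task(verdict: str) -> str:
    """Map a persisted verdict label to the task outcome."""
    if verdict == "INCONCLUSIVE_TOOLING":
        # Operational fault: fail cleanly (RED) rather than crash.
        raise FailTaskError("s22a5 tooling fault, verdict=INCONCLUSIVE_TOOLING")
    if verdict in GREEN_WARN:
        logger.warning("event=s22a5_verdict verdict=%s escalation=required", verdict)
        return "GREEN+WARN"
    if verdict in GREEN:
        return "GREEN"
    # Assumption: an unrecognised label is itself a tooling fault.
    raise FailTaskError(f"unknown verdict label: {verdict!r}")
```

Keeping the two INCONCLUSIVE labels as distinct strings is what lets this layer stay a lookup ; with a single shared string the mapping cannot tell a scientific finding from a fault.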

Single-pod chain, exact mirror of merged dag_diagnostic__s22_a4 : Phase A S18 capture ONCE (evidence, #982), then Phase B runs all probes on the SAME captured parquet. New module commun/finetune/diagnostic/s22_a5_signal_existence.py : run_s22a5(...) -> S22A5Verdict + write_s22a5_verdict(...) ; new DAG dags/dag_diagnostic__s22_a5.py (mirror s22_a4 : BASE_ENV inject #991, sentinel-sha #981, persist-before-fail #983, AirflowFailException-not-crash, single-pod, path-bound + protocol-conformance WARN, atomic verdict write — all hard-won S22A3/A4 R2-R6 invariants).

Exactly one narrow additive opt-in on _train_full_no_earlystop (return_val_proba: bool = False, default ⇒ byte-identical, unit-locked — S22A4 §7(a) precedent). N1/N2 shuffles operate on the y/X arrays before the helper call (seeded np.random.default_rng(1337) permutation) ; N3 swaps the (train,val) tuples passed in ; B2 shallow-tree reuses the helper via a resolved_hp max_depth=3 / n_estimators=1 override — exactly the S22A4 V2 parameterisation pattern. All four LGB probes consume the opt-in val_proba for AUC. B1/K1/K2/K2b/K3 are pure sklearn on the loaded arrays, outside the LGB path.
Reuse run_s22a1 wholesale for preconditions / INCONCLUSIVE flows / canonical best_iter cross-ref / parquet+SHA / ADR-90 hp / n_buy floor (S22A3/A4 precedent — zero duplication, NO behaviour change). Reuse _load_captured_parquet / _resolve_canonical_lgb_hp / _sha256_file (parquet-SHA revalidation, non-finite + completeness guards = R2-R6 carried verbatim).
New deps already in .venv_airflow : sklearn.{linear_model.LogisticRegression, neighbors.KNeighborsClassifier, decomposition.PCA, preprocessing.StandardScaler, metrics.roc_auc_score, feature_selection.mutual_info_classif}. Scaler + PCA fit on train only, applied to val (S22-K m2 verbatim) — leakage-guarded by a unit assertion.
Single seed 1337 (canonical, pre-registered ; seed-independence CLOSED by S22A2 ; a multi-seed re-run busts the ≤1-day budget).
Artifact : /tmp/s18-diagnostic/s22a5-signal-AAVEUSDC-fold3.json (pod-ephemeral) + durable Loki event=s22a5_verdict (ADR-30) — identical persistence model to S22A4 (see §5 limitation).
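The N1/N2/N3 array manipulations described above can be sketched with NumPy. The function names are hypothetical ; the seed 1337 and `np.random.default_rng` come from the dossier. Each shuffle builds its own generator, so calling it twice gives the same permutation.

```python
import numpy as np

SEED = 1337  # canonical pre-registered seed


def shuffle_labels(y_train: np.ndarray, seed: int = SEED) -> np.ndarray:
    """N1: permute labels, breaking the feature/label link."""
    rng = np.random.default_rng(seed)
    return y_train[rng.permutation(len(y_train))]


def shuffle_feature_rows(x_train: np.ndarray, seed: int = SEED) -> np.ndarray:
    """N2: permute whole feature rows while the labels stay in place."""
    rng = np.random.default_rng(seed)
    return x_train[rng.permutation(x_train.shape[0])]


def swap_train_val(train, val):
    """N3: exchange the roles of the (X, y) train and validation tuples."""
    return val, train
```

Both shuffles return new arrays and leave the inputs untouched, so the unshuffled baselines can reuse the same loaded arrays afterwards.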

3. Scope / non-goals

NOT a fix — Axis-2 characterisation only (does the data have signal ; is the harness trustworthy). Exactly one narrow additive opt-in on _train_full_no_earlystop (return_val_proba, default ⇒ byte-identical ; §2 correction box — S22A4 §7(a) precedent, disclosed to pr_review) ; no other proven-module touch ; no behaviour change for S22A1/A2/A3/A4. NO new PG hyperparameters (ADR-90 — shuffles / PCA / KNN / MI are diagnostic probes, run-level, never written to ftf_config). Single cell, single seed 1337, pre-registered. §3.6 Axis-2 relaxed tier (100 ≤ n_buy < 200, axis2_relaxed) applies in principle but the canonical cell clears the strict ≥200 bar (S22A4 observed n_buy_train=1725 / n_buy_val=353) — no relaxed-tier flag needed here. N3 train/val-swap is run + reported but NOT verdict-emitting in S22A5 : its interpretation is §3.2 M3 (train/val construction bias = Axis-3), which is S22A6's axis — keeping S22A5 axis-pure per master plan §2 ("the tests themselves are axis-pure"). Multi-cell / defi_top5 generalisation = explicit S22A6+ scope, not claimed. Deferred (tracked, inherited) : #995 per-iter proba_dispersion ; #992 _load_base_env ; connect_timeout (ADR-0091 INV-5).

4. Risks

Shuffle determinism : N1/N2 permutations MUST be seeded (np.random.default_rng(1337)) so the verdict is reproducible — unit-locked (same seed ⇒ identical permutation).
§3.3 cut pre-registration : < 0.55 = "≈ 0.50 / no signal", < 0.05 = "MI ≈ 0" — both the plan's own constants, pre-registered BEFORE the run. A baseline at 0.54 is "no signal" (strict, no soft zone, no post-hoc nudging).
PCA/scaler fit leakage : fit on train indices only, transform val — a fit-on-all leak would manufacture fake signal. Unit-locked assertion.
KNN curse-of-dim : raw-KNN (K2b) may read ≈0.50 even with signal ; §3.3 KNN is therefore the PCA-KNN (K2). Raw-KNN is sanity cross-check only, pre-registered as non-verdict-load-bearing.
Degenerate predictions : constant-output baseline ⇒ roc_auc_score undefined ⇒ AUC pinned 0.50 + WARN, never a crash (ADR-25 / no-Python-crash-visible).
Phase A divergence : S18 FAIL = bug reproducing, recorded as evidence (#982 ; proven non-blocking in S22A2/A3/A4).
Runtime : Phase A ~23-26 min (observed) + B2 (max_depth=3, n_est=1 — instant) + N1/N2 full-curve LGB (≈ a few min each) + LogReg/KNN/PCA/MI (seconds) ≪ ≤1-day budget.
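The fit-leakage risk listed above is easiest to see on a single feature column. The sketch below is a pure-Python stand-in for `StandardScaler` (population standard deviation, zero variance mapped to a scale of 1) with hypothetical helper names. It shows that the fitted parameters depend only on the rows passed to the fit.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_col):
    """Fit mean/scale on TRAIN rows only."""
    mean = fmean(train_col)
    scale = pstdev(train_col) or 1.0  # constant column: avoid dividing by zero
    return mean, scale


def transform(col, params):
    """Apply previously fitted parameters to any split (train or val)."""
    mean, scale = params
    return [(v - mean) / scale for v in col]


train = [1.0, 2.0, 3.0]
val = [10.0, 20.0]

params = fit_standardizer(train)   # correct: validation rows never seen
val_scaled = transform(val, params)

# A fit-on-all call would see the validation rows and shift the parameters.
leaky_params = fit_standardizer(train + val)
```

A leakage guard in this spirit compares the fitted parameters with ones recomputed from the train rows alone and fails when they differ, as `leaky_params` does here.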

5. Honest limitations (pre-stated, not a weakening)

Single cell (AAVEUSDC fold=3, seed 1337). Generalisation is explicit S22A6+ scope, not claimed here.
Verdict JSON is pod-ephemeral (/tmp) ; durable system-of-record is the Loki event=s22a5_verdict (ADR-30) + this dossier + the OP wp#185 log — identical to the S22A4 model the cold-eyes committee already signed off (6fb3f0b8 §5).
§3.3 4-label tree is extended by exactly three operationally-necessary states pre-registered here : NEG_CONTROL_LEAK (harness-validity gate, green+WARN), INCONCLUSIVE_UNCOVERED (uncovered §3.3 pattern = scientific finding → §3.2 M7, green+WARN), and INCONCLUSIVE_TOOLING (precondition/probe-failure, red AirflowFailException) — none weakens the inherited contract ; all three prevent a silent-fallback or a Python crash (ADR-25 + feedback_no_python_crash_visible). The TOOLING/UNCOVERED split is explicit since the post-first-run fix (see §2 second correction box) ; before the fix both shared a single INCONCLUSIVE string and the DAG conflated them.

6. Test plan

Unit : tests/unit/test_s22_a5_signal_existence.py covers the §1 decision tree, every row (NEG_CONTROL_LEAK ; SIGNAL_EXISTS ; NO_SIGNAL ; SIGNAL_MONOTONIC_ONLY ; SIGNAL_NON_MONOTONIC ; INCONCLUSIVE_UNCOVERED ; INCONCLUSIVE_TOOLING) ; AUC-band boundary (0.55 both sides) ; MI threshold boundary (0.05) ; verdict JSON round-trip ; shuffle determinism (seed ⇒ identical permutation) ; PCA/scaler fit-on-train-only assertion ; degenerate-prediction AUC guard ; S22A1-precondition-non-scientific → INCONCLUSIVE_TOOLING. No LGB train / no sklearn fit on real data : stub the trainer + tiny synthetic arrays (mirror test_s22_a4_m1m2_ablation.py discipline).
DAG control-flow validated by the system run (no Airflow-task harness — consistent with all S22 PRs).
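The verdict JSON round-trip check can be sketched together with an atomic write. This is an illustration only : `write_verdict_atomic`, `read_verdict` and the payload fields are hypothetical and do not reproduce the signature of `write_s22a5_verdict`. The write goes to a temporary file in the target directory and is then moved into place with `os.replace`, so a reader never sees a half-written file.

```python
import json
import os
import tempfile


def write_verdict_atomic(path: str, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic swap on the same filesystem
    except BaseException:
        # Leave no stray temp file behind if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_verdict(path: str) -> dict:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A round-trip test then writes a small synthetic payload into a temporary directory, reads it back, and asserts equality plus the absence of leftover `.tmp` files.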

7. Success criteria (binary)

Module + DAG + tests merged (committee pr_review + CR converged) ; run executed ; verdict ∈ {NEG_CONTROL_LEAK, SIGNAL_EXISTS, NO_SIGNAL, SIGNAL_MONOTONIC_ONLY, SIGNAL_NON_MONOTONIC, INCONCLUSIVE_UNCOVERED, INCONCLUSIVE_TOOLING} persisted (Loki + dossier) ; cold-eyes sign-off (S22 §6.4).

8. Review question

(a) Is operationalising §3.3's qualitative "AUC ≈ 0.50" as the strict cut < 0.55 (the plan's own signal bar, symmetric) and "MI ≈ 0" as < 0.05 (verbatim §3.3 informative-feature threshold), both pre-registered here BEFORE the run, the correct non-post-hoc translation of the inherited decision tree, or should a distinct chance-band constant be pre-registered instead?
(b) Is adding NEG_CONTROL_LEAK as an explicit harness-validity gate (any shuffled-label/feature AUC ≥ 0.55 ⇒ the Axis-2 signal verdict is NOT emitted, hand to S22A6 M5/M6) the right pre-specified guard, versus silently trusting the baselines on a possibly-leaking harness?
(c) Is deferring the train/val-swap (N3) interpretation to S22A6 the correct scope boundary ? N3 is run + reported here but NOT verdict-emitting in S22A5, to keep the test axis-pure (master plan §2), since swap feeds §3.2 M3 = Axis-3 = S22A6's axis.
(d) [Updated post-plan_review, see §2 correction box] Is the one narrow additive opt-in return_val_proba: bool = False on _train_full_no_earlystop (default ⇒ byte-identical 3-tuple, unit-locked ; the exact committee-ddb05ad9-endorsed S22A4 scale_pos_weight_override pattern) the correct minimal way to obtain the §3.3 LGB-baseline AUC, preserving the literal S22 v3 protocol with no scientific deviation, versus the copy-paste anti-pattern or a non-LGB B2 substitution ? And does the rest of the design (shuffles/PCA/KNN/MI on arrays+sklearn outside the helper ; B2 via resolved_hp override like S22A4 V2 ; run_s22a1 reused wholesale) correctly inherit the S22A3/A4 R2-R6 invariants for a defensible Axis-2 verdict?