Plan dossier: CVN-N001-EE-S22A5 negative controls + signal existence (Axis-2)
Date : 2026-05-19 · Story : CVN-N001-EE-S22A5 (wp#185) · Issue : #967
Parent : CVN-N001-EE-S22 (wp#168) · Epic MAYDAY CVN-N001-EF
Session type : plan_review
Scientific basis (committee-approved) : parent S22 plan dossier v3 §S22-C (variants 3-7) + §S22-F + §S22-K + §3.3 Axis-2 decision tree — committee 31b95c00 PASSED 9.0/10, 0 dissent. This micro-story inherits that pre-specified design + decision tree ; no post-hoc.
Entry gate : S22A4 = NEITHER (M1+M2 ruled out, cold-eyes 6fb3f0b8 PASSED, wp#184 Closed). Chain : S22A1 REPRODUCED → S22A2 SEED_INDEPENDENT → S22A3 CURVE_DEGENERATE → S22A4 NEITHER. The bug is real, seed-independent, full-curve-visible, NOT a class-balance / learning-rate artefact ⇒ Axis-2 isolation is the pre-specified next step.
1. Decisional question (pre-specified)
S22A4 ruled out Axis-1 hyperparameters (M1+M2). Before chasing Axis-3 (M3-M6, S22A6), does the dataset contain exploitable signal at all (Axis-2), and is the diagnostic harness itself trustworthy? Run the S22-C negative controls + S22-F baselines + S22-K signal-existence probes on the canonical cell (AAVEUSDC fold=3, S18-captured parquet, single seed 1337 = S22A2 median) and apply S22 plan v3 §3.3 verbatim.
Probes (all on the SAME captured parquet) :
| Probe | Spec | Pre-registered expectation |
|---|---|---|
| N1: shuffled labels (S22-C v3) | seeded rng(1337) permute y_train | AUC < 0.55 (chance) ; else harness leaks |
| N2: shuffled features (S22-C v4) | seeded rng(1337) row-permute X_train | AUC < 0.55 (chance) ; else harness leaks |
| N3: train/val swap (S22-C v5) | swap (train,val) roles | descriptive only ; feeds S22A6 §3.2 M3 |
| B1: LogReg (S22-C v6 / S22-F) | LogisticRegression(class_weight='balanced'), scaler fit on train only | AUC vs 0.55 |
| B2: shallow tree (S22-C v7 / S22-F) | LGB max_depth=3, n_estimators=1 via resolved_hp override | AUC vs 0.55 |
| K1: per-feature MI (S22-K m1) | mutual_info_classif, train + val, ranked | ≥1 feature MI > 0.05 ? |
| K2: PCA→KNN (S22-K m2, reco #3) | PCA(0.95) + StandardScaler fit train-only → KNeighborsClassifier(5) | AUC vs 0.55 (canonical KNN for §3.3) |
| K2b: raw KNN (S22-K m2-bis) | KNeighborsClassifier(5) on raw features | sanity / curse-of-dim cross-check, NOT verdict-load-bearing |
| K3: trivial rules (S22-K m3) | majority-class, prior-prob, top-MI single-feature threshold | AUC lower bounds, descriptive |
Operationalisation of §3.3's qualitative terms, pre-registered here BEFORE the run (no post-hoc) : "AUC ≈ 0.50" ⇒ strict cut AUC < 0.55 (the plan's own signal bar, symmetric) ; "MI ≈ 0" ⇒ max per-feature MI < 0.05 (verbatim §3.3 informative-feature threshold). §3.3 "K-NN" = the PCA-KNN (K2, per reco #3 — raw-KNN biased by curse-of-dim on ~50-150-feature FTF spaces).
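A minimal sketch of how the K1 probe and its "MI ≈ 0" cut could be computed with scikit-learn. The helper names `rank_features_by_mi` and `mi_is_null`, and the use of seed 1337 as `random_state`, are assumptions for illustration ; the dossier does not say whether train and val are scored separately or pooled, so the sketch scores a single array.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

MI_CUT = 0.05  # pre-registered informative-feature threshold from this section


def rank_features_by_mi(X, y, seed=1337):
    """Return (feature indices sorted by descending MI, raw MI per feature).

    Hypothetical helper: the name and the seed-as-random_state choice are
    illustrative, not taken from the real module.
    """
    mi = mutual_info_classif(X, y, random_state=seed)
    order = np.argsort(mi)[::-1]
    return order, mi


def mi_is_null(mi) -> bool:
    """True when the max per-feature MI falls strictly below the cut."""
    return float(np.max(mi)) < MI_CUT
```

scikit-learn documents the returned estimates as being in nat units ; the dossier does not state the unit behind the 0.05 constant, so that correspondence is left as an open assumption here.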
Aggregate verdict (frozen, no post-hoc) :
| Pattern | Verdict | Next |
|---|---|---|
| N1 or N2 AUC ≥ 0.55 | NEG_CONTROL_LEAK — harness/data leaks ; Axis-2 signal verdict NOT trusted | hand to S22A6 (leakage + walk-forward) — strong M5/M6 signal |
| N1∧N2 clean and B1 > 0.55 ∧ B2 > 0.55 ∧ max-MI > 0.05 | SIGNAL_EXISTS — Axis-2 ruled out ; S22A4 NEITHER stands | proceed S22A6 (M3-M6) |
| N1∧N2 clean and B1 < 0.55 ∧ B2 < 0.55 ∧ PCA-KNN < 0.55 ∧ max-MI < 0.05 | NO_SIGNAL — no exploitable signal in feature space | escalate : data/label/split design review (separate Story, NOT CVN-N001-EE) + CVN-N001 feasibility re-evaluation (§3.3 row 2) |
| N1∧N2 clean and B1 > 0.55 ∧ B2 < 0.55 | SIGNAL_MONOTONIC_ONLY | diagnostic flag ; informs future FE Stories ; proceed S22A6 |
| N1∧N2 clean and B1 < 0.55 ∧ B2 > 0.55 | SIGNAL_NON_MONOTONIC (expected on crypto) | diagnostic flag, no action ; proceed S22A6 |
| N1∧N2 clean but no row above fires exactly | INCONCLUSIVE_UNCOVERED (scientific finding) | escalate per §3.2 M7 discipline ; never a post-hoc reclassification ; DAG colour: GREEN + WARN escalation |
| any probe tooling-fail / non-finite / incomplete / S22A1 precondition non-scientific | INCONCLUSIVE_TOOLING | clean AirflowFailException (RED), never a Python crash ; distinct from INCONCLUSIVE_UNCOVERED above |
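
The frozen table above can be read as an ordered sequence of checks. The following is a sketch only: `ProbeResults` and `classify` are hypothetical names, and the tooling branch here covers just the non-finite case (probe failures and precondition faults would be raised upstream in a real implementation).

```python
import math
from dataclasses import dataclass

AUC_CUT = 0.55  # pre-registered signal bar
MI_CUT = 0.05   # pre-registered informative-feature threshold


@dataclass(frozen=True)
class ProbeResults:
    """Hypothetical container; field names are illustrative only."""
    n1_auc: float  # N1 shuffled labels
    n2_auc: float  # N2 shuffled features
    b1_auc: float  # B1 LogReg
    b2_auc: float  # B2 shallow tree
    k2_auc: float  # K2 PCA-KNN
    max_mi: float  # K1 max per-feature MI


def classify(r: ProbeResults) -> str:
    values = (r.n1_auc, r.n2_auc, r.b1_auc, r.b2_auc, r.k2_auc, r.max_mi)
    # Non-finite inputs are a tooling fault, never a scientific verdict.
    if not all(math.isfinite(v) for v in values):
        return "INCONCLUSIVE_TOOLING"
    # Harness-validity gate comes first: a leaking control voids the rest.
    if r.n1_auc >= AUC_CUT or r.n2_auc >= AUC_CUT:
        return "NEG_CONTROL_LEAK"
    b1_hi, b2_hi = r.b1_auc > AUC_CUT, r.b2_auc > AUC_CUT
    b1_lo, b2_lo = r.b1_auc < AUC_CUT, r.b2_auc < AUC_CUT
    if b1_hi and b2_hi and r.max_mi > MI_CUT:
        return "SIGNAL_EXISTS"
    if b1_lo and b2_lo and r.k2_auc < AUC_CUT and r.max_mi < MI_CUT:
        return "NO_SIGNAL"
    if b1_hi and b2_lo:
        return "SIGNAL_MONOTONIC_ONLY"
    if b1_lo and b2_hi:
        return "SIGNAL_NON_MONOTONIC"
    # No row fired exactly (includes values sitting exactly on a cut).
    return "INCONCLUSIVE_UNCOVERED"
```

With the first-run values reported in §2 (B1=0.638, B2=0.618, max_MI=0.019) and clean negative controls, this sketch returns `INCONCLUSIVE_UNCOVERED` whatever the PCA-KNN value is, which matches the outcome described there.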
best_iter for LGB probes is 1-indexed : the 0-indexed argmin of val_loss over the full curve, plus 1, with early stop DISABLED ; identical convention to S22A3/A4. AUC = roc_auc_score ; degenerate constant predictions ⇒ AUC pinned to 0.50 + WARN event=s22a5_degenerate_auc (conservative, never a crash).
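A short sketch of both conventions, assuming NumPy and scikit-learn. The function names `best_iter_1_indexed` and `safe_auc` are hypothetical, and the logging call only stands in for however the real WARN event is emitted.

```python
import logging

import numpy as np
from sklearn.metrics import roc_auc_score

log = logging.getLogger(__name__)


def best_iter_1_indexed(val_loss) -> int:
    """0-indexed argmin over the full validation curve, shifted to 1-indexed.

    np.argmin returns the first minimum when several iterations tie.
    """
    return int(np.argmin(np.asarray(val_loss, dtype=float))) + 1


def safe_auc(y_true, proba) -> float:
    """roc_auc_score with the degenerate cases pinned to 0.50.

    Degenerate here means constant predictions or a single-class y_true.
    Non-finite predictions are deliberately not handled: in the dossier
    they belong to the tooling path, not to this guard.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    if np.unique(y_true).size < 2 or proba.max() == proba.min():
        log.warning("event=s22a5_degenerate_auc")
        return 0.5
    return float(roc_auc_score(y_true, proba))
```

For reference, scikit-learn already returns 0.5 for fully tied scores when both classes are present, and does not define the score when `y_true` holds a single class ; the guard makes both cases explicit and logged instead of relying on library behaviour.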
2. Design: reuse the proven trainer ; ONE narrow additive opt-in (S22A4 §7(a) precedent)
Correction (post-plan_review, disclosed to `pr_review`) : the plan_review dossier claimed "zero proven-module change". Implementation review found §3.3 needs val-prediction AUC for the only LGB baseline (B2 shallow-tree) + the LGB negative controls (N1/N2) ; `_train_full_no_earlystop` computes `booster.predict(X_val)` internally but returns only `(val_loss, train_loss, proba_std)`, so the predictions are discarded. Getting AUC requires EITHER a narrow additive opt-in return, OR a local copy-paste of the `lgb.train` block (the anti-pattern CR hammered in S22A3/A4), OR swapping B2 to a non-LGB model (a post-hoc protocol deviation from S22 v3 §S22-C v7). Resolution : a narrow additive opt-in `return_val_proba: bool = False` kwarg on `_train_full_no_earlystop` ; default `False` ⇒ the existing 3-tuple is byte-identical for S22A1/A2/A3/A4 (3 prod callers + 2 test refs, unit-locked) ; `True` ⇒ additionally returns `val_proba`. This is the exact pattern committee `ddb05ad9` already endorsed for the S22A4 `scale_pos_weight_override` kwarg (additive, opt-in, default byte-identical, unit-locked). The literal S22 v3 protocol (LGB shallow-tree + LGB negative controls measured by AUC) is preserved ; no scientific protocol deviation. Operator-approved 2026-05-19, flagged here for `pr_review`.

Post-first-run clarification (2026-05-19, follow-up fix PR) : the first operator-triggered run produced a valid scientific verdict (B1=0.638, B2=0.618, both > cut ; max_MI=0.019 < cut ; uncovered §3.3 pattern → §3.2 M7 escalation) but the DAG colour layer crashed RED because the implementation used a single string `"INCONCLUSIVE"` for both the scientific uncovered case (§1 verdict table, `INCONCLUSIVE_UNCOVERED` row, a scientific finding) and the tooling case (§1 verdict table, `INCONCLUSIVE_TOOLING` row, an operational fault) ; only the latter was meant to be RED. This contradicted §1 of this dossier (which already conceptually distinguished the two, see the verdict table above) and `feedback_no_python_crash_visible`. Resolution : explicit split into `INCONCLUSIVE_UNCOVERED` (GREEN + WARN escalation, scientific finding → §3.2 M7) and `INCONCLUSIVE_TOOLING` (RED `AirflowFailException`, tooling). No scientific contract change : the same two cases were always pre-registered ; the implementation now matches the dossier distinction explicitly. Locked by `test_run_uncovered_is_inconclusive_uncovered_not_tooling`.
Single-pod chain, exact mirror of merged dag_diagnostic__s22_a4 : Phase A S18 capture ONCE (evidence, #982), then Phase B runs all probes on the SAME captured parquet. New module commun/finetune/diagnostic/s22_a5_signal_existence.py : run_s22a5(...) -> S22A5Verdict + write_s22a5_verdict(...) ; new DAG dags/dag_diagnostic__s22_a5.py (mirror s22_a4 : BASE_ENV inject #991, sentinel-sha #981, persist-before-fail #983, AirflowFailException-not-crash, single-pod, path-bound + protocol-conformance WARN, atomic verdict write — all hard-won S22A3/A4 R2-R6 invariants).
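As an illustration of the persist-before-fail and AirflowFailException-not-crash invariants combined with the UNCOVERED/TOOLING colour split, here is a sketch that runs without Airflow. `DagFail` is a local stand-in for `airflow.exceptions.AirflowFailException` ; `write_verdict_atomic`, `finalize`, and the log line format are hypothetical. The dossier states the colour only for `NEG_CONTROL_LEAK`, `INCONCLUSIVE_UNCOVERED`, and `INCONCLUSIVE_TOOLING` ; treating every other verdict as GREEN is an assumption of this sketch.

```python
import json
import logging
import os
import tempfile

log = logging.getLogger(__name__)


class DagFail(Exception):
    """Local stand-in for airflow.exceptions.AirflowFailException."""


WARN_VERDICTS = {"NEG_CONTROL_LEAK", "INCONCLUSIVE_UNCOVERED"}  # GREEN + WARN
RED_VERDICTS = {"INCONCLUSIVE_TOOLING"}                         # clean RED


def write_verdict_atomic(path: str, payload: dict) -> None:
    """Write to a temp file in the target directory, then rename over path."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    os.replace(tmp, path)  # atomic when tmp and path share a filesystem


def finalize(path: str, verdict: str) -> str:
    # Persist first, so a RED outcome still leaves the verdict on disk.
    write_verdict_atomic(path, {"verdict": verdict})
    if verdict in RED_VERDICTS:
        raise DagFail(f"s22a5 tooling failure: {verdict}")
    if verdict in WARN_VERDICTS:
        log.warning("event=s22a5_verdict verdict=%s escalation=true", verdict)
    return "GREEN"  # assumption for verdicts whose colour the dossier omits
```

Keeping the two inconclusive states as distinct strings is what lets the colour layer stay a plain set lookup, with no inspection of why the verdict was reached.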
- Exactly one narrow additive opt-in on `_train_full_no_earlystop` (`return_val_proba: bool = False`, default ⇒ byte-identical, unit-locked ; S22A4 §7(a) precedent). N1/N2 shuffles operate on the `y`/`X` arrays before the helper call (seeded `np.random.default_rng(1337)` permutation) ; N3 swaps the (train,val) tuples passed in ; B2 shallow-tree reuses the helper via a `resolved_hp` `max_depth=3 / n_estimators=1` override, exactly the S22A4 V2 parameterisation pattern. All four LGB probes consume the opt-in `val_proba` for AUC. B1/K1/K2/K2b/K3 are pure sklearn on the loaded arrays, outside the LGB path.
- Reuse `run_s22a1` wholesale for preconditions / INCONCLUSIVE flows / canonical best_iter cross-ref / parquet+SHA / ADR-90 hp / n_buy floor (S22A3/A4 precedent ; zero duplication, NO behaviour change). Reuse `_load_captured_parquet` / `_resolve_canonical_lgb_hp` / `_sha256_file` (parquet-SHA revalidation, non-finite + completeness guards = R2-R6 carried verbatim).
- New deps already in `.venv_airflow` : `sklearn.{linear_model.LogisticRegression, neighbors.KNeighborsClassifier, decomposition.PCA, preprocessing.StandardScaler, metrics.roc_auc_score, feature_selection.mutual_info_classif}`. Scaler + PCA fit on train only, applied to val (S22-K m2 verbatim) ; leakage-guarded by a unit assertion.
- Single seed 1337 (canonical, pre-registered ; seed-independence CLOSED by S22A2 ; a multi-seed re-run busts the ≤1-day budget).
- Artifact : `/tmp/s18-diagnostic/s22a5-signal-AAVEUSDC-fold3.json` (pod-ephemeral) + durable Loki `event=s22a5_verdict` (ADR-30) ; identical persistence model to S22A4 (see §5 limitation).
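
The first design point (additive opt-in plus array-level shuffles before the helper call) might look roughly like the sketch below. `train_stub` is a hypothetical stand-in for `_train_full_no_earlystop` that involves no LightGBM at all ; the 4-tuple return shape under `return_val_proba=True`, the use of a fresh generator per probe, and the helper names are assumptions.

```python
import numpy as np

SEED = 1337  # canonical pre-registered seed


def train_stub(X_train, y_train, X_val, y_val, *, return_val_proba=False):
    """Hypothetical stand-in for the real trainer (no LightGBM involved)."""
    # Constant base-rate predictor, standing in for booster.predict(X_val).
    val_proba = np.full(len(X_val), float(np.mean(y_train)))
    val_loss, train_loss, proba_std = [0.0], [0.0], float(val_proba.std())
    if return_val_proba:  # additive opt-in: extra element only on request
        return val_loss, train_loss, proba_std, val_proba
    return val_loss, train_loss, proba_std  # default 3-tuple is unchanged


def shuffled_labels(y_train, seed=SEED):
    """N1: permute labels with a freshly seeded generator (returns a copy)."""
    return np.random.default_rng(seed).permutation(y_train)


def shuffled_rows(X_train, seed=SEED):
    """N2: permute whole rows, breaking the X/y alignment (returns a copy)."""
    return np.random.default_rng(seed).permutation(X_train, axis=0)


def lgb_probe_inputs(X_tr, y_tr, X_va, y_va):
    """Argument tuples for the array-level LGB probes; B2 (hp override) omitted."""
    return {
        "N1": (X_tr, shuffled_labels(y_tr), X_va, y_va),
        "N2": (shuffled_rows(X_tr), y_tr, X_va, y_va),
        "N3": (X_va, y_va, X_tr, y_tr),  # train/val roles swapped
    }
```

Because every shuffle builds its own generator from the seed, calling a helper twice yields the same permutation, which is the property the shuffle-determinism unit test is meant to lock.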
3. Scope / non-goals
NOT a fix — Axis-2 characterisation only (does the data have signal ; is the harness trustworthy). Exactly one narrow additive opt-in on _train_full_no_earlystop (return_val_proba, default ⇒ byte-identical ; §2 correction box — S22A4 §7(a) precedent, disclosed to pr_review) ; no other proven-module touch ; no behaviour change for S22A1/A2/A3/A4. NO new PG hyperparameters (ADR-90 — shuffles / PCA / KNN / MI are diagnostic probes, run-level, never written to ftf_config). Single cell, single seed 1337, pre-registered. §3.6 Axis-2 relaxed tier (100 ≤ n_buy < 200, axis2_relaxed) applies in principle but the canonical cell clears the strict ≥200 bar (S22A4 observed n_buy_train=1725 / n_buy_val=353) — no relaxed-tier flag needed here. N3 train/val-swap is run + reported but NOT verdict-emitting in S22A5 : its interpretation is §3.2 M3 (train/val construction bias = Axis-3), which is S22A6's axis — keeping S22A5 axis-pure per master plan §2 ("the tests themselves are axis-pure"). Multi-cell / defi_top5 generalisation = explicit S22A6+ scope, not claimed. Deferred (tracked, inherited) : #995 per-iter proba_dispersion ; #992 _load_base_env ; connect_timeout (ADR-0091 INV-5).
4. Risks
- Shuffle determinism : N1/N2 permutations MUST be seeded (`np.random.default_rng(1337)`) so the verdict is reproducible ; unit-locked (same seed ⇒ identical permutation).
- §3.3 cut pre-registration : `< 0.55` = "≈ 0.50 / no signal", `< 0.05` = "MI ≈ 0" ; both are the plan's own constants, pre-registered BEFORE the run. A baseline at 0.54 is "no signal" (strict, no soft zone, no post-hoc nudging).
- PCA/scaler fit leakage : fit on train indices only, transform val ; a fit-on-all leak would manufacture fake signal. Unit-locked assertion.
- KNN curse-of-dim : raw-KNN (K2b) may read ≈0.50 even with signal ; §3.3 KNN is therefore the PCA-KNN (K2). Raw-KNN is sanity cross-check only, pre-registered as non-verdict-load-bearing.
- Degenerate predictions : constant-output baseline ⇒ `roc_auc_score` undefined ⇒ AUC pinned 0.50 + WARN, never a crash (ADR-25 / no-Python-crash-visible).
- Phase A divergence : S18 FAIL = bug reproducing, recorded as evidence (#982 ; proven non-blocking in S22A2/A3/A4).
- Runtime : Phase A ~23-26 min (observed) + B2 (max_depth=3, n_est=1, instant) + N1/N2 full-curve LGB (≈ a few min each) + LogReg/KNN/PCA/MI (seconds) ≪ ≤1-day budget.
5. Honest limitations (pre-stated, not a weakening)
- Single cell (AAVEUSDC fold=3, seed 1337). Generalisation is explicit S22A6+ scope, not claimed here.
- Verdict JSON is pod-ephemeral (`/tmp`) ; durable system-of-record is the Loki `event=s22a5_verdict` (ADR-30) + this dossier + the OP wp#185 log ; identical to the S22A4 model the cold-eyes committee already signed off (`6fb3f0b8` §5).
- §3.3 4-label tree is extended by exactly three operationally-necessary states pre-registered here : `NEG_CONTROL_LEAK` (harness-validity gate, green+WARN), `INCONCLUSIVE_UNCOVERED` (uncovered §3.3 pattern = scientific finding → §3.2 M7, green+WARN), and `INCONCLUSIVE_TOOLING` (precondition/probe-failure, red `AirflowFailException`) ; none weakens the inherited contract ; all three prevent a silent-fallback or a Python crash (ADR-25 + `feedback_no_python_crash_visible`). The TOOLING/UNCOVERED split is explicit since the post-first-run fix (see §2 second correction box) ; before the fix both shared a single `INCONCLUSIVE` string and the DAG conflated them.
6. Test plan
- Unit `tests/unit/test_s22_a5_signal_existence.py` : §1 decision tree, every row (NEG_CONTROL_LEAK ; SIGNAL_EXISTS ; NO_SIGNAL ; SIGNAL_MONOTONIC_ONLY ; SIGNAL_NON_MONOTONIC ; INCONCLUSIVE_UNCOVERED ; INCONCLUSIVE_TOOLING) ; AUC-band boundary (0.55 both sides) ; MI threshold boundary (0.05) ; verdict JSON round-trip ; shuffle determinism (seed ⇒ identical permutation) ; PCA/scaler fit-on-train-only assertion ; degenerate-prediction AUC guard ; S22A1-precondition-non-scientific → INCONCLUSIVE_TOOLING. No LGB train / no sklearn fit on real data ; stub the trainer + tiny synthetic arrays (mirror `test_s22_a4_m1m2_ablation.py` discipline).
- DAG control-flow validated by the system run (no Airflow-task harness ; consistent with all S22 PRs).
7. Success criteria (binary)
Module + DAG + tests merged (committee pr_review + CR converged) ; run executed ; verdict ∈ {NEG_CONTROL_LEAK, SIGNAL_EXISTS, NO_SIGNAL, SIGNAL_MONOTONIC_ONLY, SIGNAL_NON_MONOTONIC, INCONCLUSIVE_UNCOVERED, INCONCLUSIVE_TOOLING} persisted (Loki + dossier) ; cold-eyes sign-off (S22 §6.4).
8. Review question
(a) Is operationalising §3.3's qualitative "AUC ≈ 0.50" as the strict cut < 0.55 (the plan's own signal bar, symmetric) and "MI ≈ 0" as < 0.05 (verbatim §3.3 informative-feature threshold), both pre-registered here BEFORE the run, the correct non-post-hoc translation of the inherited decision tree — or should a distinct chance-band constant be pre-registered instead? (b) Is adding NEG_CONTROL_LEAK as an explicit harness-validity gate (any shuffled-label/feature AUC ≥ 0.55 ⇒ the Axis-2 signal verdict is NOT emitted, hand to S22A6 M5/M6) the right pre-specified guard, versus silently trusting the baselines on a possibly-leaking harness? (c) Is deferring the train/val-swap (N3) interpretation to S22A6 — running + reporting it here but NOT verdict-emitting in S22A5 to keep the test axis-pure (master plan §2), since swap feeds §3.2 M3 = Axis-3 = S22A6's axis — the correct scope boundary? (d) [Updated post-plan_review — see §2 correction box] Is the one narrow additive opt-in return_val_proba: bool = False on _train_full_no_earlystop (default ⇒ byte-identical 3-tuple, unit-locked ; the exact committee-ddb05ad9-endorsed S22A4 scale_pos_weight_override pattern) the correct minimal way to obtain the §3.3 LGB-baseline AUC — preserving the literal S22 v3 protocol with no scientific deviation — versus the copy-paste anti-pattern or a non-LGB B2 substitution ? And does the rest of the design (shuffles/PCA/KNN/MI on arrays+sklearn outside the helper ; B2 via resolved_hp override like S22A4 V2 ; run_s22a1 reused wholesale) correctly inherit the S22A3/A4 R2-R6 invariants for a defensible Axis-2 verdict?