
CVN-N001-EI-S04 — LightGBM capacity ablation (Block 2) — Plan dossier

  • Story: CVN-N001-EI-S04 · wp#227 (New) · GH issue #1059 (OPEN, 2026-05-24)
  • Epic: CVN-N001-EI (#1055) — Signal root-cause & tradability diagnostic program · Block 2 · decides status B
  • Diagnostic instance: s42 (scaffold generated 2026-05-31 via /diagnostic-scaffold 227 : dag_diagnostic__s42.py + s42_lgb_capacity_ablation.py + Hamilton compute + invariant tests)
  • Dependencies: S01 learning curves (Block 1) — STATUS: CLOSED (GH #1056 closed 2026-05-25, PR #1062 — closure note PR-review filed at documentation/reviews/2026-05-25-cvn-n001-ei-s01-pr1062-pr-review.md). S01 instrumentation provides the best_iter baseline trajectory needed for the conjunction criterion in §1. Not blocking.
  • Pattern reference: S03 split-regime reconstruction — same Epic, two-layer (Airflow + Hamilton), warm S07 pin infrastructure, ADR-0093 r2 dry-run cluster gate
  • Hardened invariants from S03 Q1.g: ADR-0093 r3 Invariant 6 (§0bis cross-check), ADR-0094 Invariant 7 (call_count == 0 spy) — applies to this Story by inheritance from the Epic

0. Why this Story exists (provenance)

The diagnostic program (Epic CVN-N001-EI) decomposes the weak-signal investigation into 5 blocks with 6 Stories. Block 2 = LightGBM capacity over-fit : the hypothesis that current production HPO defaults (num_leaves, learning_rate, min_child_samples, lambda_l2, min_gain_to_split) over-fit per round, producing a misleadingly high training-time AUC that doesn't generalise. Under this hypothesis, a gentler LGB configuration (lower capacity, slower learning, stronger regularisation) would actually achieve higher OOS AUC.

Status B is binary, pre-registered, and load-bearing for the Epic's overall verdict. Confirming B (the null H₀(B) of §1 holds) means the AUC ceiling is fundamental (not a hyperparameter issue → look elsewhere : C-blocks, regime, data, etc.). Refuting B means defaults are over-fitting and the production HPO grid needs replacement (concrete deliverable : recommended HP swap).

S04 follows S03 (Block 3b) which closed with SPLIT_INCONCLUSIVE_DEFI_TOP5 + 1× BASELINE_LEAK_INFLATED on OPUSDC (methodological lateral). S04 is independent of S03's verdict (different hypothesis, different block) but shares the warm S07 pin infrastructure for fast iteration on the same defi_top5 / fold_id=3 cells.


1. Objective & pre-committed decisions

Hypothesis B — formal statement as the NULL TESTED (operator r1 fix — inversion bug r0)

H₀(B) tested : Under current production HPO defaults on defi_top5 / fold_id=3, there exists NO gentler configuration along the pre-registered PRIMARY axis (or along any of the 4 exploratory axes after correction) that achieves a marginal ΔAUC ≥ +0.02 with bootstrap CI excluding 0.

Reading the verdicts unambiguously :

  • H₀ confirmed (= primary axis does NOT refute) → B_CAPACITY_OK : defaults correctly sized, AUC ceiling fundamental on this cohort. Status B = confirmed.
  • H₀ refuted (= primary axis crosses the materiality bar with valid CI, OR an exploratory axis crosses after multiple-testing correction) → B_DEFAULTS_OVERFIT : defaults over-fit, HP swap recommended. Status B = refuted.

The verdict label now literally matches the null's outcome (r0 had the inverse pairing : §1 formal statement said "defaults overfit" but B_CAPACITY_OK was labelled "B confirmed").

Multiple-testing strategy — primary axis + intra-axis correction (operator r1 fix #2 + r2 fix #1)

Inter-axis : pre-register a PRIMARY axis + 4 exploratory axes to keep decision power on the primary while remaining honest on the others :

  • Primary axis = num_leaves — most direct capacity proxy. Tested at α = 0.05 (95% bootstrap CI). Decision-relevant.
  • Exploratory axes = learning_rate, min_child_samples, lambda_l2, min_gain_to_split — tested at α = 0.05 with Bonferroni correction (effective α = 0.0125 each, 99% CI per axis).

Intra-axis (operator r2 fix #1) : SINGLE pre-registered test point per axis (Option A, the extreme gentler endpoint). Each axis sweep produces 5 AUC values (default + 4 non-default), but only ONE comparison is decision-relevant per axis :

| Axis | Default | Decision test point (extreme gentler) | The 3 other sweep points |
|---|---|---|---|
| num_leaves (PRIMARY) | 31 | 8 (= ÷4, log-2 endpoint) | Reported in notes, trajectory only, not decisional |
| learning_rate (exploratory) | 0.1 | 0.025 (÷4) | same |
| min_child_samples (exploratory) | 20 | 80 (×4) | same |
| lambda_l2 (exploratory) | 0.0 | 100.0 (log-10 endpoint) | same |
| min_gain_to_split (exploratory) | 0.0 | 1.0 (log-10 endpoint) | same |

5 decisional tests total (1 primary @ 5% + 4 exploratory @ 1.25% Bonferroni). Group-FWER ≤ 5% on H₀. The 4 non-decision points per axis are reported for trajectory + optimum-localization (e.g. "edge-win signals optimum beyond grid → follow-up Story widens") but do NOT enter _decide_s42. Locked.
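
The per-test significance levels above can be written down mechanically. The sketch below is illustrative only: the constant and function names are hypothetical and not taken from the scaffold; it simply splits the 5% exploratory budget evenly across the four exploratory axes and leaves the primary axis at 5%.

```python
# Hypothetical names; values are the pre-registered ones from the text above.
PRIMARY_AXIS = "num_leaves"
EXPLORATORY_AXES = ("learning_rate", "min_child_samples", "lambda_l2", "min_gain_to_split")
FAMILY_ALPHA = 0.05


def alpha_per_axis() -> dict:
    """Return the per-test alpha: full alpha on the primary axis,
    Bonferroni split (alpha / 4) on each exploratory axis."""
    alphas = {PRIMARY_AXIS: FAMILY_ALPHA}
    for axis in EXPLORATORY_AXES:
        alphas[axis] = FAMILY_ALPHA / len(EXPLORATORY_AXES)
    return alphas


ALPHAS = alpha_per_axis()
# Union (Bonferroni) bound on false positives within the exploratory family.
EXPLORATORY_FAMILY_BOUND = sum(ALPHAS[a] for a in EXPLORATORY_AXES)
```

Keeping the allocation in one table-like structure makes it easy to assert in a unit test that the decider never reads a non-pre-registered alpha.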

Rationale for Option A vs Option B (trend test) : num_leaves is monotonically related to capacity (smaller = simpler). The "extreme gentler" hypothesis IS the strongest prior. Trend tests (Option B) add power when optimum is intermediate, but we have no prior reason to expect intermediate optima for these hyperparams. Option A is simpler to code and review, and we lose nothing decision-relevant under the prior.

Non-monotonic optima trajectory reporting + follow-up trigger (operator r3 fix B — exploratory non-decisional)

Single-point Option A risks missing intermediate optima (e.g., num_leaves=16 could outperform both 8 and 31). Rather than expand the grid (partial fix — only addresses edge-wins), pre-register a mechanical exploratory statistic + follow-up trigger :

  • Decisional (S04 verdict) : Option A single decision test point per axis (locked in _decide_s42).
  • Exploratory (reported in verdict notes, NOT decisional) : compute max_delta_auc_across_sweep[axis] = max(ΔAUC at each non-default point of axis) AND argmax_sweep_point[axis] = the sweep_point achieving the max.
  • Pre-registered follow-up trigger : if max_delta_auc > +0.03 with CI excluding 0 at α=0.01 (Bonferroni correction for selection across the 4 non-default points per axis) on ANY axis → flag a follow-up Story in the S04 closure note : "widen grid around argmax_sweep_point OR test joint optimum around (argmax, neighbors)". Does NOT change S04 verdict — only triggers S04b or a new Story.

Why this avoids selection-effect FWER inflation : the threshold (+0.03, NOT +0.02) is tighter than the decisional bar, AND the α=0.01 Bonferroni correction over 4 non-default points keeps FWER ≤ 5% per axis for the trigger decision. The trajectory is published regardless ; the trigger only fires when the signal is strong enough to justify a follow-up Story (NOT to claim S04 missed a verdict).
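
A minimal sketch of the exploratory statistic and the follow-up trigger, assuming each non-default sweep point already carries its ΔAUC and the lower bound of its CI at α=0.01. The function name and the input layout are hypothetical; only the +0.03 threshold and the two reported quantities come from the text above.

```python
def followup_trigger(sweep: dict, threshold: float = 0.03) -> dict:
    """Compute the exploratory (non-decisional) trajectory statistic for one axis.

    `sweep` maps each non-default sweep point to a pair
    (delta_auc, ci_lower_at_alpha_0_01). Layout is an assumption.
    """
    # Sweep point with the largest delta-AUC versus the default.
    argmax_point = max(sweep, key=lambda point: sweep[point][0])
    max_delta, ci_lower = sweep[argmax_point]
    return {
        "max_delta_auc_across_sweep": max_delta,
        "argmax_sweep_point": argmax_point,
        # Fires only above +0.03 AND with the CI excluding 0; never changes the S04 verdict.
        "flag_followup": max_delta > threshold and ci_lower > 0.0,
    }


example = followup_trigger({8: (0.010, -0.010), 16: (0.035, 0.004),
                            62: (-0.020, -0.050), 124: (-0.030, -0.060)})
```

In this made-up example the intermediate point 16 would be flagged for a follow-up Story while the decisional test point (8) stays below the bar.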

Pre-registered verdict rule (locked in _decide_s42)

| Cell verdict | Trigger |
|---|---|
| B_CAPACITY_OK (B confirmed ; TOST equivalence test, r2 fix #2) | (i) PRIMARY axis decision test point (num_leaves=8 vs 31) : 95%-CI on ΔAUC entirely contained in [−0.02, +0.02] (true equivalence test, NOT a non-significance pass). With multi-seed CI half-width target ≈ 0.02, this requires actual half-width < 0.02 (i.e. tight enough to bound effect) ; (ii) AND best_iter at default within ±20% of S01 baseline AND ≤ 50 (conjunction, r1 minor) ; (iii) AND NO exploratory axis (learning_rate=0.025, min_child_samples=80, lambda_l2=100.0, min_gain_to_split=1.0) crosses the Bonferroni-corrected materiality bar (ΔAUC ≥ +0.02 with 99%-CI excluding 0). Defaults correctly sized AND we have the precision to bound the effect. |
| B_DEFAULTS_OVERFIT (B refuted) | EITHER : (a) PRIMARY axis decision test point crosses the 95% materiality bar (ΔAUC ≥ +0.02, lower bound of 95%-CI > 0, gentler direction) ; OR (b) PRIMARY OK at TOST BUT ≥ 1 exploratory axis crosses the Bonferroni-corrected bar (ΔAUC ≥ +0.02, lower bound of 99%-CI > 0). The triggering axis is logged in the verdict notes. |
| INCONCLUSIVE_UNDERPOWERED | Power gate fails on PRIMARY (n_eff < 25 obs OR observed CI half-width > 0.04, see §5.1) AND no exploratory axis triggers refute. Most likely outcome under H₀ given n=5 cohort even with multi-seed : the TOST equivalence test is strict, so a wide CI (even centered on 0) fails B_CAPACITY_OK. Pre-engaged as the most likely outcome in §5.1. |
| INCONCLUSIVE_TOOLING | Capture/training/Hamilton compute failure (preserved from scaffold default). |
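
The table above can be read as a short decision function. The sketch below is one possible reading for a single CI method, not the scaffolded _decide_s42 : the function name, argument layout, and check order are assumptions. Two points the table leaves open are resolved by assumption here and flagged in comments: any exploratory axis crossing its bar refutes (regardless of the primary's TOST status), and cases matching no row fall back to INCONCLUSIVE_UNDERPOWERED.

```python
MATERIALITY = 0.02  # pre-registered materiality bar on delta-AUC


def crosses_bar(delta_auc: float, ci_lower: float) -> bool:
    """Materiality bar: effect at least +0.02 and CI lower bound above 0."""
    return delta_auc >= MATERIALITY and ci_lower > 0.0


def decide_cell(primary, exploratory, best_iter, s01_best_iter, n_eff, tooling_ok=True):
    """primary = (delta_auc, ci95_lo, ci95_hi); exploratory = {axis: (delta_auc, ci99_lo)}.
    Returns (verdict, triggering_axis_or_None)."""
    if not tooling_ok:
        return "INCONCLUSIVE_TOOLING", None
    delta, lo, hi = primary
    if crosses_bar(delta, lo):                      # refute, branch (a)
        return "B_DEFAULTS_OVERFIT", "num_leaves"
    for axis, (d, lo99) in exploratory.items():     # refute, branch (b); assumption: any crossing refutes
        if crosses_bar(d, lo99):
            return "B_DEFAULTS_OVERFIT", axis
    half_width = (hi - lo) / 2
    if n_eff < 25 or half_width > 0.04:             # power gate
        return "INCONCLUSIVE_UNDERPOWERED", None
    tost_ok = -MATERIALITY <= lo and hi <= MATERIALITY
    best_iter_ok = abs(best_iter - s01_best_iter) <= 0.2 * s01_best_iter and best_iter <= 50
    if tost_ok and best_iter_ok:
        return "B_CAPACITY_OK", None
    # Assumption: rows not covered by the table (e.g. half-width between 0.02 and 0.04).
    return "INCONCLUSIVE_UNDERPOWERED", None
```

For example, `decide_cell((0.001, -0.012, 0.014), {}, 40, 42, 25)` passes the TOST and the best_iter conjunction, while a primary CI of (−0.04, +0.06) fails the power gate.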

Group synthesis (locked in synthesize_group_s42) — B_SYSTEMATIC_OVERFIT added (operator r1 fix #4)

| Group verdict | Trigger |
|---|---|
| B_SYSTEMATIC (defaults OK system-wide) | ≥ 80% (SYSTEMATIC_FRAC_CUT) cells = B_CAPACITY_OK AND directional consistency on primary axis ΔAUC (no cell shows a gentler direction even below materiality). |
| B_SYSTEMATIC_OVERFIT (defaults overfit system-wide, added r1) | ≥ 80% cells = B_DEFAULTS_OVERFIT on the SAME triggering axis (primary or same exploratory) with directional consistency. → Recommendation : global HP swap on the triggering axis (concrete delta values in the closure note). |
| B_PER_ASSET (heterogeneous) | Mixed B_CAPACITY_OK / B_DEFAULTS_OVERFIT (no clear majority) OR different axes triggering refute across cells. → Recommendation : per-asset HP, not global swap. |
| INCONCLUSIVE_GROUP_COVERAGE (renamed from cell-level INCONCLUSIVE_TOOLING, r1 minor) | < 80% of cells reached a scientific verdict (i.e. cell-level INCONCLUSIVE_* ≥ 20%). Distinct label avoids the r0 confusion between cell- and group-level INCONCLUSIVE. |
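
A minimal sketch of the group matrix, assuming each cell is summarised as a dict with hypothetical keys `verdict`, `axis` and `primary_delta`. It is not the scaffolded synthesize_group_s42 ; in particular, "directional consistency" is read here as "no scientific cell has a positive primary ΔAUC", which is an interpretation of the table rather than a confirmed rule.

```python
from collections import Counter

SYSTEMATIC_FRAC_CUT = 0.8
SCIENTIFIC = {"B_CAPACITY_OK", "B_DEFAULTS_OVERFIT"}


def synthesize_group(cells: list) -> str:
    """cells: [{"verdict": str, "axis": str | None, "primary_delta": float}, ...]."""
    n = len(cells)
    scientific = [c for c in cells if c["verdict"] in SCIENTIFIC]
    # Coverage gate: fewer than 80% of cells reached a scientific verdict.
    if len(scientific) / n < SYSTEMATIC_FRAC_CUT:
        return "INCONCLUSIVE_GROUP_COVERAGE"
    ok = [c for c in scientific if c["verdict"] == "B_CAPACITY_OK"]
    no_gentler_direction = all(c["primary_delta"] <= 0.0 for c in scientific)
    if len(ok) / n >= SYSTEMATIC_FRAC_CUT and no_gentler_direction:
        return "B_SYSTEMATIC"
    # Overfit must concentrate on ONE triggering axis to count as systematic.
    axes = Counter(c["axis"] for c in scientific if c["verdict"] == "B_DEFAULTS_OVERFIT")
    if axes and axes.most_common(1)[0][1] / n >= SYSTEMATIC_FRAC_CUT:
        return "B_SYSTEMATIC_OVERFIT"
    return "B_PER_ASSET"
```

With five cells, the 80% cut means four cells must agree; a 3/2 split between the two scientific verdicts, or refutes spread over different axes, lands in B_PER_ASSET.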

Cohort & folds

  • Cohort : defi_top5 (= LDOUSDC, OPUSDC, ARBUSDC, AAVEUSDC, UNIUSDC) — same as S01 / S03 for direct comparability.
  • Fold : fold_id=3 — pre-captured by S07 Gate-4 pins (warm path available, ~3-5 min/cell vs ~3 h/cell cold).
  • No expansion to other cohorts in this Story — out of scope, deferred to a follow-up Story if S04 verdict warrants.

2. Scope — the 5 axes (one-at-a-time ablation)

Per the Epic's pre-registration : one axis at a time (NOT joint), to keep n_cells × n_axes × n_points interpretable + the CI computation tractable. Each axis sweeps relative to its production default :

| Axis | Default (prod HPO) | Sweep grid r1 (uniform multiplicative log-2, operator r1 fix minor) | Direction "gentler" | Role |
|---|---|---|---|---|
| num_leaves (PRIMARY) | 31 | {8, 16, 31, 62, 124} : log-2 around default (÷4, ÷2, ×1, ×2, ×4, rounded to integers) | lower | decisional at α=0.05 |
| learning_rate | 0.1 | {0.025, 0.05, 0.1, 0.2, 0.4} : log-2 around default | lower | exploratory at α=0.0125 (Bonferroni) |
| min_child_samples | 20 | {5, 10, 20, 40, 80} : log-2 around default | higher | exploratory at α=0.0125 |
| lambda_l2 | 0.0 | {0.0, 0.1, 1.0, 10.0, 100.0} : log-10 (default is 0, can only INCREASE ; log-10 for wide span) | higher | exploratory at α=0.0125 |
| min_gain_to_split | 0.0 | {0.0, 0.001, 0.01, 0.1, 1.0} : log-10 (same reason as lambda_l2) | higher | exploratory at α=0.0125 |

Grid spacing rationale : multiplicative axes (num_leaves, learning_rate, min_child_samples) use log-2 spacing centered on prod default to cover ÷4..×4 symmetric range. Additive-from-zero axes (lambda_l2, min_gain_to_split) use log-10 spacing from 0 because the prod default is 0 (you can't decrease below) AND the meaningful range spans 3 orders of magnitude. Both choices are pre-registered and justified per axis (vs r0's mixed/unjustified spacing).
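
The two grid schemas can be generated rather than hand-typed, which gives a cheap invariant test against the pre-registered values. This is a sketch under the assumption that integer axes are rounded to the nearest integer; the helper names are hypothetical.

```python
def log2_grid(default, as_int=False):
    """Five points at /4, /2, x1, x2, x4 around the production default."""
    points = [default * factor for factor in (0.25, 0.5, 1, 2, 4)]
    return [round(p) for p in points] if as_int else points


def log10_from_zero(lo_exp, hi_exp):
    """Zero (the default) followed by powers of ten from 10**lo_exp to 10**hi_exp."""
    # Parsing "1e-3" yields the exact nearest float, avoiding pow() rounding drift.
    return [0.0] + [float(f"1e{k}") for k in range(lo_exp, hi_exp + 1)]


GRIDS = {
    "num_leaves": log2_grid(31, as_int=True),
    "learning_rate": log2_grid(0.1),
    "min_child_samples": log2_grid(20, as_int=True),
    "lambda_l2": log10_from_zero(-1, 2),
    "min_gain_to_split": log10_from_zero(-3, 0),
}
```

Note that 31 ÷ 4 = 7.75 and 31 ÷ 2 = 15.5, so the num_leaves endpoints 8 and 16 are rounded values rather than exact log-2 steps.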

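5 axes × 5 sweep points × 5 cells = 125 training runs minimum (r0 single-seed baseline ; §5.1 raises this to 625 with 5 seeds). Per-axis CI requires a clustered bootstrap over (crypto, seed) pairs (see §3 ; the legacy "fold × crypto residuals" wording is superseded because fold_id=3 is fixed).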

One-at-a-time rationale (NOT joint ablation): the Epic pre-registration commits to marginal effects on each axis to keep the search space interpretable. A joint 5×5×5×5×5 grid = 3,125 training runs/cell → 15,625 across cohort, infeasible. Marginal sweep gives directional gradient per axis, which is what the verdict rule requires (we don't need joint optimum to falsify B : refute on ANY axis is enough).
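
The run counts quoted above follow from simple arithmetic, reproduced here as a sanity check. The last line is an observation rather than something the plan states: every axis sweep contains the same default configuration, so if the default fit were cached across axes (an assumption), only 21 distinct configurations per cell and seed would need training.

```python
N_AXES = 5
N_POINTS = 5   # sweep points per axis, default included
N_CELLS = 5    # defi_top5 cohort at fold_id=3

# One-at-a-time ablation: each axis is swept independently.
one_at_a_time_runs = N_AXES * N_POINTS * N_CELLS

# Joint ablation: full Cartesian product of the five grids.
joint_runs_per_cell = N_POINTS ** N_AXES
joint_runs_cohort = joint_runs_per_cell * N_CELLS

# Assumption: the shared default point is fitted once and reused by all axes.
distinct_configs_per_cell = N_AXES * (N_POINTS - 1) + 1
```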


3. Technical approach

Two-layer (Airflow + Hamilton), Hamilton-native (S03 lesson)

  • Airflow orchestration (dags/dag_diagnostic__s42.py, scaffolded) : resolve cells → discriminate_cell.expand (5 parallel pods) → synthesize. Reuses S07 warm pin (use_pin=true) + ADR-92 provenance (dag_loaded_event / dag_doc_md / make_tags).
  • Hamilton compute (src/commun/finetune/diagnostic/hamilton/s42_nodes.py, scaffolded) : combined → 5 _probe_* nodes (one per axis) → aggregated cell_verdict via _decide_s42. Pure functions, no side effects, all ValidationResult-style returns (no raise per [feedback_no_python_crash_visible]).

Probe shape (per axis, to implement)

def _probe_num_leaves(X_tr, y_tr, X_va, y_va, resolved_hp, sweep_points, seed, n_rounds) -> dict:
    """Sweep num_leaves across pre-registered grid. Return per-point AUC + clustered bootstrap CI."""
    # for each value in sweep_points:
    #   override resolved_hp["num_leaves"] = value
    #   fit LGB with feature_name=feature_names (Q1.g pattern, harness lightgbm_dag.py:163 inheritance)
    #   compute val AUC + best_iter
    # bootstrap CI clustered over (crypto, seed) pairs (fold_id=3 is fixed, so no fold variance)
    # return {"axis": "num_leaves", "points": [...], "auc_per_point": [...], "ci_per_point": [...],
    #         "best_iter_per_point": [...], "delta_auc_vs_default": [...]}

Same shape for the 5 axes. The verdict rule (_decide_s42) consumes all 5 dicts + decides per the pre-registered table §1. Only the decision test point (extreme gentler endpoint, §1 r2 fix #1) enters the verdict ; the 3 other non-default points are for trajectory reporting.
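
To make the return shape concrete without depending on LightGBM, the sketch below factors the training step out into an injected callable. Everything here is hypothetical: `probe_axis`, the `fit_and_score(hp) -> (val_auc, best_iter)` contract and the fake trainer are illustrative stand-ins, and the CI computation is deliberately left out. Unlike the pseudocode above, the sketch copies resolved_hp instead of overriding it in place, so one sweep point cannot leak into the next.

```python
from typing import Callable, Dict, Sequence, Tuple


def probe_axis(axis: str, sweep_points: Sequence, default_value,
               resolved_hp: Dict, fit_and_score: Callable[[Dict], Tuple[float, int]]) -> dict:
    """Sweep one hyperparameter axis and report per-point AUC and delta vs the default."""
    aucs, best_iters = [], []
    for value in sweep_points:
        hp = {**resolved_hp, axis: value}   # fresh copy per point, never mutate the input
        auc, best_iter = fit_and_score(hp)
        aucs.append(auc)
        best_iters.append(best_iter)
    default_auc = aucs[list(sweep_points).index(default_value)]
    return {
        "axis": axis,
        "points": list(sweep_points),
        "auc_per_point": aucs,
        "best_iter_per_point": best_iters,
        "delta_auc_vs_default": [a - default_auc for a in aucs],
    }


# Fake trainer with made-up AUC values, for illustration only.
_FAKE_AUC = {8: 0.58, 16: 0.57, 31: 0.56, 62: 0.55, 124: 0.54}


def fake_fit_and_score(hp: Dict) -> Tuple[float, int]:
    return _FAKE_AUC[hp["num_leaves"]], 40


base_hp = {"num_leaves": 31, "learning_rate": 0.1}
result = probe_axis("num_leaves", [8, 16, 31, 62, 124], 31, base_hp, fake_fit_and_score)
```

Injecting the trainer also makes the probe testable with a synthetic oracle, which is the direction test_full_parity is expected to take once the probes land.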

Warm-pin reuse from S07 / S41

The captured parquet (with semantic feature_names post-Q1.g) is pinned per cell. S42's training fits use the same X_tr / y_tr / feature_names from load_cell_inputs. No re-capture needed if S03's recapture is still warm (cluster pins from 2026-05-30 21:40Z, content-addressed by input_data_sha — should persist through the data window).

Clustering bootstrap method — harmonized + n_cluster=5 acknowledged (operator r2 fix #3)

Clustering structure : (crypto, seed) pairs. fold_id=3 is fixed → no fold-variance to exploit. r0/r1 §3 mentioned "fold × crypto residuals" (legacy r0 wording) — superseded by (crypto, seed) exclusively. Harmonized.

Bootstrap CI method under n_cluster=5, consensus n-of-m (r3, committee + operator review) : classical cluster bootstrap with n_cluster=5 is discretized (resampling 5 cryptos with replacement gives 5⁵ = 3,125 ordered draws but only C(9,5) = 126 distinct resamples), known-unreliable for small-cluster inference per Cameron et al. (2008). At this boundary, no single CI method is definitively more credible than the others : wild cluster bootstrap relies on Rademacher weight assumptions, a Bayesian hierarchical model is prior-sensitive, and a permutation test is exact but limited to sign-based inference. Forcing one as "tiebreaker" would smuggle in an arbitrary methodological preference.

Pre-engaged choice (r3) : compute 3 independent CIs in parallel, declare verdict by 2/3 consensus, treat method-disagreement as itself a signal of insufficient cluster count.

| Method | Role | Properties |
|---|---|---|
| Wild cluster bootstrap (Cameron-Gelbach-Miller) | Primary | Rademacher weights, calibrated for few-clusters, 2000 bootstrap iterations. Resamples seed-level residuals within cluster. |
| Permutation test on signs of paired-ΔAUC | Primary | Exact under H₀ for n=5 paired observations. Sign-based : robust to scale/shape but limited inferential power. |
| Bayesian hierarchical bootstrap (weak priors) | Primary | Posterior on ΔAUC under cluster-exchangeable model. Weak priors (Normal(0, 1) on cluster effects) acknowledged as own assumption. |
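
Of the three methods, the sign-based permutation test is small enough to sketch exactly. The version below is one common form (a sign-flip test on the sum of paired ΔAUC, one-sided in the gentler direction) and is an assumption about what the plan intends, not the committed implementation; the function name is hypothetical.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact one-sided p-value under H0 that paired delta-AUCs are symmetric around 0.

    Enumerates all 2**n sign patterns and counts those whose sum is at least
    as large as the observed sum.
    """
    observed = sum(deltas)
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(deltas)):
        n_total += 1
        flipped = sum(s * d for s, d in zip(signs, deltas))
        if flipped >= observed - 1e-12:   # tolerance guards float round-off on ties
            n_extreme += 1
    return n_extreme / n_total


# Made-up example: all five cryptos improve under the gentler setting.
p_all_positive = sign_flip_pvalue([0.03, 0.02, 0.04, 0.01, 0.05])
```

With n = 5 pairs there are only 2⁵ = 32 sign patterns, so the smallest attainable one-sided p-value is 1/32 ≈ 0.031. If the test is run on the five crypto-level pairs, it can therefore clear the 0.05 level but not the 0.0125 exploratory level, which is worth keeping in mind when reading the consensus outcome on exploratory axes.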

Consensus rule (locked in _decide_s42) :

  • B_DEFAULTS_OVERFIT triggered iff ≥ 2/3 methods agree : ΔAUC ≥ +0.02 AND CI lower bound > 0 at the corrected α (95% for primary axis, 99% Bonferroni for exploratory). All 3 methods + their CIs are logged in the verdict notes.
  • B_CAPACITY_OK (TOST equivalence) triggered iff ≥ 2/3 methods agree : 95%-CI entirely contained in [−0.02, +0.02] (with CI half-width < 0.02 enforced separately per §5.1).
  • INCONCLUSIVE_UNDERPOWERED triggered iff fewer than 2/3 methods agree on any conclusion. Method disagreement at n_cluster=5 IS the signal — the closure note logs "CI methods disagree at n_cluster=5 — verdict cannot be issued at this cluster count" explicitly. Honest underpower acknowledgment, not arbitrary tiebreaker.
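
The 2-of-3 rule above reduces to a vote count. A minimal sketch, assuming each method has already been reduced to a single call (one of the two scientific labels, or None when the method reaches neither); the function name and input layout are hypothetical.

```python
def consensus(method_calls: dict, quorum: int = 2) -> str:
    """method_calls maps a CI method name to "B_DEFAULTS_OVERFIT", "B_CAPACITY_OK" or None."""
    for label in ("B_DEFAULTS_OVERFIT", "B_CAPACITY_OK"):
        votes = sum(call == label for call in method_calls.values())
        if votes >= quorum:
            return label
    # Fewer than `quorum` methods agree on any conclusion: disagreement is the signal.
    return "INCONCLUSIVE_UNDERPOWERED"


split_vote = consensus({
    "wild_cluster_bootstrap": "B_DEFAULTS_OVERFIT",
    "sign_permutation": None,
    "bayes_hierarchical": "B_CAPACITY_OK",
})
```

With three methods and a quorum of two, the two labels can never both reach quorum, so the check order does not affect the result.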

Compute cost : marginal — all 3 methods reuse the same (crypto, seed) resampling base. Implementation effort : ~1 day across the 3 methods + the consensus aggregator in _decide_s42.


4. Files to create / modify

| File | State | Action |
|---|---|---|
| dags/dag_diagnostic__s42.py | scaffolded (354 LoC) | Commit as-is, refine if probe signatures dictate |
| src/commun/finetune/diagnostic/s42_lgb_capacity_ablation.py | scaffolded (204 LoC, 5 _probe_* + _decide_s42 as NotImplementedError stubs) | Implement 5 probes + _decide_s42 |
| src/commun/finetune/diagnostic/hamilton/s42_nodes.py | scaffolded (206 LoC) | Wire probes to Hamilton nodes per scaffold pattern |
| src/commun/finetune/diagnostic/hamilton/s42_io.py | scaffolded (134 LoC) | Reuse load_cell_inputs verbatim (S41 pattern post-Q1.g) |
| tests/unit/test_s42_parity.py | scaffolded (109 LoC, 5 passing + 1 skipped oracle) | Fill test_full_parity with synthetic LGB oracle once probes land |
| documentation/reviews/2026-05-31-cvn-n001-ei-s04-lgb-capacity-ablation-plan.md | THIS FILE | Commit + link from PR |

5. Risks & threats to validity

5.1 Statistical power under the n=5 cohort

Expected CI half-width under raw n=5 cohort design (r0) : per-crypto AUC SD on same fold typically > 0.05 (S03 saw OPUSDC differ by ~0.08 from cohort mean). Bootstrap of n=5 with replacement → 95%-CI half-width ≈ 1.96 × σ/√n_eff ≈ 0.04-0.05 under H₀. Materiality bar = 0.02 → CI excluding 0 nearly unreachable even with a real ΔAUC of +0.05.

Without remediation, INCONCLUSIVE_UNDERPOWERED is THE most likely outcome (probability > 0.5 under H₀ AND under most realistic alternatives).

Five-point remediation (locked, r1) :

  1. Intra-cell multi-seed : refit each (cell, sweep_point) with 5 independent seeds (seed ∈ {1337, 1338, 1339, 1340, 1341}). Yields n_eff = 5 cells × 5 seeds = 25 obs per (axis, sweep_point), expected CI half-width ≈ 0.02, i.e. materiality-bar-detectable for the primary axis.
  2. Compute cost : 5 axes × 5 points × 5 cells × 5 seeds = 625 training runs total (vs r0's 125). At ~3-5 min/run warm this is ~30-50 h of sequential compute, or ~6-10 h wall-clock with the 5 parallel cell-pods (runs stay sequential within each pod). Acceptable per Epic budget.
  3. Clustered bootstrap : resample over (crypto, seed) pairs (NOT (crypto, sweep_point) — sweep points are fixed by design).
  4. Materiality bar held at +0.02 — Epic consistency (S03 + S05+ use same bar). Lowering it to +0.01 would break cross-Block comparability.
  5. INCONCLUSIVE_UNDERPOWERED pre-engaged as most likely outcome in the closure-note template. Verdict notes MUST state the observed n_eff + observed CI half-width per axis, not just the verdict label. A CI half-width of 0.04 is an honest "we can't tell" (it cannot pass the TOST containment of §1) and is logged distinct from a tight B_CAPACITY_OK with CI half-width 0.015.

Locked power gate : per-axis INCONCLUSIVE_UNDERPOWERED triggers when n_eff < 25 OR observed CI_half_width > 0.04. Per-cell INCONCLUSIVE_UNDERPOWERED triggers when primary axis is underpowered AND no exploratory axis triggers refute.
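
The half-width and budget figures in this section can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch under the normal approximation quoted above (1.96 × σ/√n_eff, with σ = 0.05 as the illustrative per-crypto SD) ; the function names are hypothetical and the approximation treats the 25 (crypto, seed) observations as independent.

```python
import math


def ci_half_width(sigma: float, n_eff: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a 95% CI on a mean."""
    return z * sigma / math.sqrt(n_eff)


def underpowered(n_eff: int, half_width: float) -> bool:
    """Locked power gate: too few observations OR CI too wide."""
    return n_eff < 25 or half_width > 0.04


hw_r0 = ci_half_width(0.05, 5)     # raw cohort, one seed
hw_r1 = ci_half_width(0.05, 25)    # 5 cells x 5 seeds

# Compute budget: 625 runs at 3 to 5 minutes each, then split over 5 parallel pods.
TOTAL_RUNS = 5 * 5 * 5 * 5
sequential_hours = tuple(TOTAL_RUNS * minutes / 60 for minutes in (3, 5))
parallel_hours = tuple(hours / 5 for hours in sequential_hours)
```

Under these assumptions the half-width drops from about 0.044 to about 0.020, which sits right at the materiality bar rather than comfortably below it, consistent with INCONCLUSIVE_UNDERPOWERED being pre-engaged as the most likely outcome.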

5.2 Inter-axis confounding

  • One-at-a-time ablation cannot detect interaction effects (e.g. num_leaves=15 may overfit only if learning_rate=0.2). Marginal verdicts on each axis can mask joint behaviour.
  • Mitigation : explicit in the plan_review and in the verdict notes — S04 is marginal B, not joint B. A future Story may do joint sweep if marginal verdict surfaces interaction signals.

5.3 Pin invalidation

  • Q1.g warm pins from 2026-05-30 may be invalidated if upstream data refreshes (cache invalidation on data_date change) → first-run cold-capture cost returns.
  • Mitigation : check Loki for event=s41_pin_hit on first dry-run trigger. If pins still warm, proceed. If not, accept the 3 h/cell cold cost as one-off.

5.4 Feature-name drift between S03 and S04

  • S03's Q1.g fix means feature_names live in the parquet (post-Q1.g). S04 inherits this → harness must use same lgb.Dataset(feature_name=...) path. The S42 scaffold doesn't override this, but the probes must explicitly use datasets.feature_names in their fit calls.
  • Mitigation : explicit invariant in _probe_* : assert isinstance(feature_names, (list, tuple)) and len(feature_names) > 0 before lgb.Dataset(...). Test in test_full_parity.

5.5 BASELINE_LEAK_INFLATED on OPUSDC (S03 finding) — pre-engaged sensitivity verdict (operator r1 fix minor)

  • S03 surfaced that OPUSDC's naive baseline is leak-inflated vs purged-WF. This is orthogonal to LGB capacity but may amplify or mask the AUC signal we measure here.
  • Pre-engaged sensitivity verdict (locked) : the closure note MUST report TWO group verdicts side-by-side :
      • Full cohort verdict (5/5 cells) : primary verdict, what the verdict rule §1 emits.
      • Excl-OPUSDC verdict (4/4 cells with OPUSDC removed) : sensitivity check.
  • Disagree definition pre-engaged (r2) : the two verdicts disagree iff ANY of :
      • (a) Group label differs (e.g. Full=B_SYSTEMATIC, Excl-OPUSDC=B_SYSTEMATIC_OVERFIT).
      • (b) The triggering axis differs (e.g. Full triggers on num_leaves, Excl-OPUSDC triggers on lambda_l2).
      • (c) Both label + triggering axis match, but |ΔAUC_full − ΔAUC_excl_OPUSDC| > 0.01 on the triggering axis (effect-size delta exceeds half the materiality bar).
  • Decision rule : if AGREE → primary verdict (Full cohort) stands. If DISAGREE → the closure note flags it explicitly + Epic-level reviewer arbitrates. Not discretionary at the threshold definition layer : the disagree-vs-agree call is mechanical per the 3 above conditions. Only the arbitration (what to do under disagree) is human-judged.
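
Because the disagree call is meant to be mechanical, it fits in a few lines. A minimal sketch, assuming each group verdict is summarised by a dict with hypothetical keys `label`, `axis` and `delta_auc` (the ΔAUC on the triggering axis) :

```python
def verdicts_disagree(full: dict, excl_opusdc: dict, delta_tol: float = 0.01) -> bool:
    """Apply conditions (a), (b), (c) in order; any single one is enough to disagree."""
    if full["label"] != excl_opusdc["label"]:          # (a) group label differs
        return True
    if full["axis"] != excl_opusdc["axis"]:            # (b) triggering axis differs
        return True
    # (c) same label and axis, but effect size moves by more than half the materiality bar
    return abs(full["delta_auc"] - excl_opusdc["delta_auc"]) > delta_tol


full_cohort = {"label": "B_SYSTEMATIC_OVERFIT", "axis": "num_leaves", "delta_auc": 0.030}
without_opusdc = {"label": "B_SYSTEMATIC_OVERFIT", "axis": "num_leaves", "delta_auc": 0.015}
needs_arbitration = verdicts_disagree(full_cohort, without_opusdc)
```

In this made-up example the labels and axes match, but the effect size shifts by 0.015, so condition (c) sends the case to the Epic-level reviewer.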

5.6 ADR-0093 r2 dry-run cluster gate — smoke vs full split (operator r2 fix #4)

S04 is a pin-READ PR (consumes pre-existing S07/S41 pins; does NOT write new pins via the s42 path). ADR-0093 r2 was authored for pin-WRITE PRs, but ADR-0093 r3 Invariant 6 (§0bis cross-check) applies regardless : the probes must consume what the harness actually produces post-Q1.g (semantic feature_names, not f0..fN).

Two distinct cluster runs (r2) :

| Run | When | Scope | Wall-clock | Purpose |
|---|---|---|---|---|
| Smoke dry-run | pre-merge (ADR-0093 r2 Invariant 2c) | 2 cells × 2 axes × 2 seeds = 8 runs (r3 fix C, pre-registered) : crypto ∈ {LDOUSDC, OPUSDC} × axis ∈ {num_leaves, lambda_l2} × seed ∈ {1337, 1338}. Justification : LDOUSDC = highest N (=319), OPUSDC = leak-known baseline (covers cohort extremes per S03 closure §3) ; num_leaves = primary axis decisional test point, lambda_l2 = atypical grid (log-10 from 0 vs log-2 around default, which exercises both grid-schema branches) ; 2 seeds = exercises multi-seed clustering aggregation. | ~30-60 min (8 runs sequential on a single pod, ~5 min/run warm) | Validates the machinery : (i) Hamilton graph compiles ; (ii) probe consumes datasets.feature_names post-Q1.g ; (iii) _decide_s42 returns a status in the valid pre-registered set ; (iv) multi-cell parallel pod orchestration (the 2-cell expansion exercises the cross-pod cell-fanout that the 1-cell smoke missed) ; (v) multi-axis aggregation in _decide_s42 (the 2-axis expansion exercises the inter-axis Bonferroni branch that the 1-axis smoke missed) ; (vi) multi-seed bootstrap clustering (the 2-seed expansion exercises the consensus-n-of-m aggregator at minimal scale). No pre-judgment on WHICH verdict : only that the decider ran end-to-end without crash and produced a structurally-valid decision at non-trivial scale. Satisfies ADR-0093 r2 Invariant 2c at sufficient integration coverage. |
| Full scientific run | post-merge | 5 axes × 5 sweep points × 5 cells × 5 seeds = 625 training runs | ~6-10 h with 5 parallel pods | Produces the decisional verdict entering the closure note. Triggered by operator post-merge per the standard /diagnostic-scaffold → impl → cluster trigger flow. |

Smoke validates the code path, full validates the hypothesis. The two ARE distinct ADR-0093 r2 satisfaction items + Story-completion deliverables ; conflating them would either force a 6-10h pre-merge block (impractical) or weaken the gate (smoke alone doesn't justify the PR's correctness if the full run later fails).

Helm-temp bypass mechanism as in S03 (operator-acknowledged anti-pattern, follow-up tracked in #1092). Until #1092 closes, smoke uses helm-temp pre-merge ; full uses standard deploy CI post-merge on main.

5.7 Ops SLO + pod failure semantics (operator r3 fix D — load-bearing for verdict quality)

At 625-run scale on 5 parallel pods, pod failures (OOM, Scaleway eviction, transient timeout) are quasi-certain. Without pre-spec, a failure can be silently retried, drop runs, or worst-case obscure the verdict by dropping cells whose signal mattered. Pre-engaged ops contract (locked in _decide_s42 + pod resource spec) :

  • Resource quotas per pod (sized for ~125 runs/cell × 2× margin) :
      • Memory : 8 GiB request / 16 GiB limit per pod (vs current S41 2 GiB/8 GiB ; request quadrupled and limit doubled for headroom on multi-seed multi-axis).
      • CPU : 2 request / 4 limit (unchanged).
      • Pod node selector : compute pool (unchanged from S41).
  • Pod failure classification (load-bearing distinction) :
      • Pod failure (OOM / eviction / timeout > 12 h) → cell-level INCONCLUSIVE_TOOLING explicit, NEVER INCONCLUSIVE_UNDERPOWERED. The two are different : tooling = "we couldn't measure", underpower = "we measured but precision too low". Verdict notes MUST log the failure mode + retry attempts.
      • Auto-retry budget bounded : max 1 retry per cell. If retry also fails → INCONCLUSIVE_TOOLING recorded, the cell is excluded from group synthesis (NOT counted toward the 80% scientific-verdict gate for INCONCLUSIVE_GROUP_COVERAGE).
      • SLO : 95% of pods MUST complete within 8 h. Beyond 12 h → pod killed via Airflow timeout + cell INCONCLUSIVE_TOOLING (timeout class).
      • Resource starvation cross-check : if ≥ 2 cells fail with the same root cause (e.g., 2× OOM) → group-level INCONCLUSIVE_GROUP_COVERAGE with note "systematic ops failure : re-run after capacity bump". Avoids partial verdict on a degraded substrate.
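
The retry budget and the starvation cross-check can be expressed as two small pure functions. This is a sketch with hypothetical names and a hypothetical input format (one entry per attempt, None for success, otherwise a cause string such as "OOM") ; taking the last failed attempt as the root cause is an assumption.

```python
from collections import Counter
from typing import List, Optional

MAX_RETRIES = 1  # one retry per cell, so at most two attempts are considered


def cell_ops_status(attempt_failures: List[Optional[str]]) -> dict:
    """Classify a cell from its attempt log; failures are tooling, never underpower."""
    tried = attempt_failures[: MAX_RETRIES + 1]
    failed = all(cause is not None for cause in tried)
    return {
        "verdict": "INCONCLUSIVE_TOOLING" if failed else None,  # None: go on to the scientific decider
        "failure_modes": [cause for cause in tried if cause is not None],
        "retries": max(len(tried) - 1, 0),
    }


def systematic_ops_failure(statuses: List[dict]) -> bool:
    """True when at least two failed cells share the same (last-attempt) root cause."""
    root_causes = Counter(
        s["failure_modes"][-1]
        for s in statuses
        if s["verdict"] == "INCONCLUSIVE_TOOLING" and s["failure_modes"]
    )
    return any(count >= 2 for count in root_causes.values())


recovered = cell_ops_status(["OOM", None])        # failed once, retry succeeded
lost_a = cell_ops_status(["OOM", "OOM"])          # retry budget exhausted
lost_b = cell_ops_status(["eviction", "OOM"])
```

Logging `failure_modes` and `retries` alongside the verdict keeps the "we couldn't measure" case auditable in the verdict notes.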

6. Success criteria

A Story is "done" when all of the following hold:

  • Pre-registered grid traversed : 5 axes × 5 sweep points × 5 cells × 5 seeds (multi-seed r1) = 625 training runs completed (or honest report of which subset succeeded).
  • Per-axis ΔAUC + clustered bootstrap CI (clustered on (crypto, seed)) computed at the locked confidence level (95% for primary num_leaves, 99% Bonferroni-corrected for the 4 exploratory axes) and compared against the +0.02 materiality bar. Persisted in s42_per_cell_*.json + s42_group_*.json artefacts.
  • Cell verdict emitted : one of B_CAPACITY_OK / B_DEFAULTS_OVERFIT / INCONCLUSIVE_UNDERPOWERED / INCONCLUSIVE_TOOLING. The notes MUST include the observed n_eff + CI_half_width per axis (not just the verdict label — honest underpower reporting per §5.1 r1).
  • Group verdict pair emitted (operator r1 fix #5 sensitivity) : Full cohort verdict (5/5 cells) + Excl-OPUSDC verdict (4/4 cells). If they DISAGREE, both surfaced explicitly in the closure note.
  • Group verdict ∈ B_SYSTEMATIC / B_SYSTEMATIC_OVERFIT / B_PER_ASSET / INCONCLUSIVE_GROUP_COVERAGE (per the r1 group matrix in §1).
  • If B_SYSTEMATIC_OVERFIT : concrete HP swap recommendation in the closure note (new defaults for the triggering axis) backed by the observed ΔAUC × CI on the primary OR Bonferroni-corrected exploratory.
  • If B_SYSTEMATIC : note explicitly that "LGB capacity is not the issue, AUC ceiling is fundamental on this cohort" → next-block (C-blocks) work is unblocked, no HP change.
  • Cluster dry-run on PR head SHA produced verifiable success events (per ADR-0093 r2 Invariant 2c) : event=s42_cell_outcome status=<verdict> for ≥ 3/5 cells. Helm-temp bypass mechanism documented (#1092 still open).
  • Closure note (~1 page) per the S03 pattern on main before /story-advance 227 → Closed.

Out-of-scope for S04 (pre-engaged to prevent re-litigation in future reviews)

  • Tradability validation (slippage / fees / funding model) : S04 measures AUC (ML diagnostic), not P&L. AUC is invariant to execution costs. Handled by a post-S04 Story if B_SYSTEMATIC_OVERFIT triggers an HP swap (the HP-swap candidate is then validated for tradability under new defaults). Out-of-scope for this Story per the Epic Block 2 pre-registration.
  • Joint multi-axis ablation (e.g., num_leaves × learning_rate joint sweep) : marginal one-axis-at-a-time is locked per Epic pre-registration. Triggered as follow-up Story only if the non-monotonic-optima follow-up trigger fires (§1, exploratory and non-decisional, r3 fix B).
  • Cohort extension beyond defi_top5 : same-cohort comparability with S01/S03 is the priority. Cohort extension follow-up only if group verdict = B_PER_ASSET (heterogeneous response justifies broader sample).

7. Open questions for plan_review (r1 : major issues now settled in §1/§2/§5, only true unknowns remain)

Q1 — Primary axis choice (num_leaves) — prod default confirmed r2

§1 r1 pre-registers num_leaves as the PRIMARY (decisional) axis. Rationale : most direct capacity proxy ; LightGBM's primary capacity knob.

Prod default verified (operator r2 30s check) : scripts/seed_hyperparams_console.py:115 seeds NUM_LEAVES: 31 to ftf_config. HPO range (15, 127, linear) per line 216, but S07/S41 captured parquets were trained with the canonical resolved HP (via _resolve_canonical_lgb_hp(DEFAULT_TIMEFRAME) in s41_io.py) → num_leaves=31. Ablation vs 31 IS the correct baseline. ✅

Alternative candidates considered + rejected : learning_rate (most-tuned axis in production HPO grids) — secondary capacity proxy ; min_child_samples — regularizer not capacity proxy.

Reco r2 : keep num_leaves PRIMARY. Locked.

Q2 — Multi-seed budget vs cohort-extension

§5.1 r1 commits to 5 seeds × 5 cells = 25 obs/axis (625 total runs). Alternative : 1 seed × 25 cells (extend cohort to defi_top25 or similar) — bigger n_eff via heterogeneity rather than seed noise. Trade-off : seed noise is i.i.d. (tightens CI), cohort heterogeneity is structured (may inflate CI if assets behave differently — exactly what B_PER_ASSET is for).

Reco : keep 5-seed × defi_top5 per Epic pre-registration (same cohort as S01/S03 → cross-comparability). Cohort extension is a follow-up Story if verdict is B_PER_ASSET.

Q3 — lambda_l2 / min_gain_to_split half-grids

Both axes pre-registered at log-10 spacing from 0 (no "less regularised" half because defaults are at 0). Edge-wins at the grid endpoint (e.g. lambda_l2 = 100.0 beats default 0.0 by ΔAUC ≥ 0.02 with the CI excluding 0) would still trigger B_DEFAULTS_OVERFIT, but the optimum may be off-grid (at 1000 or higher).

Reco : accept. An edge-win IS a refute signal even if the optimum is beyond the grid — the closure note flags "optimum beyond grid, follow-up Story to widen". Pre-registered, not opportunistic.

Q4 — Pin invalidation contingency

If S07 / S41 warm pins are invalidated by data refresh before S04's dry-run, accept 3 h × 5 cells = 15 h cold-capture, or wait? Per S03 closure : same risk → accepted. Budget allows.

Reco : accept.

Q5 — Post-Q1.g schema check

S03's Q1.g made parquets self-describing. S04 probes MUST consume datasets.feature_names (NOT positional f0..fN) when fitting LGB inside the probes — same invariant as S03's harness. ADR-0094 Invariant 7 (call_count == 0 on from_training_cache) MUST hold.

Reco : enforce via explicit invariant in _probe_* :

assert isinstance(feature_names, (list, tuple)) and len(feature_names) > 0
train_set = lgb.Dataset(X_tr, label=y_tr, feature_name=feature_names)
and a call_count == 0 spy test in test_s42_parity.py mirroring test_s41_pin.py.
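
As an illustration of what such a spy test could look like, the sketch below uses only unittest.mock. `TrainingCache`, `prepare_fit_inputs` and `spy_call_count` are hypothetical stand-ins (the real owner of from_training_cache and the real probe entry point are not shown in this plan), and the feature names are invented.

```python
from unittest import mock


class TrainingCache:
    """Hypothetical stand-in for the object that owns from_training_cache."""

    @staticmethod
    def from_training_cache(cell_id):
        raise RuntimeError("cold path must not be hit when the pin is warm")


def prepare_fit_inputs(feature_names):
    """Hypothetical probe helper: validates semantic names, never touches the cache."""
    assert isinstance(feature_names, (list, tuple)) and len(feature_names) > 0
    return list(feature_names)


def spy_call_count(feature_names) -> int:
    """Run the helper under a spy and report how often the cold path was called."""
    with mock.patch.object(TrainingCache, "from_training_cache") as spy:
        prepare_fit_inputs(feature_names)
        return spy.call_count


cold_path_calls = spy_call_count(["feat_a", "feat_b"])  # invented feature names
```

The same pattern extends to the real probes by patching the actual cache entry point and asserting the count stays at zero across a full sweep.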


Appendix — Evidence anchors

  • Scaffold artefacts (committed with this plan) :
      • dags/dag_diagnostic__s42.py (354 LoC)
      • src/commun/finetune/diagnostic/s42_lgb_capacity_ablation.py (204 LoC, 5 stubs + _decide_s42 stub)
      • src/commun/finetune/diagnostic/hamilton/s42_nodes.py (206 LoC)
      • src/commun/finetune/diagnostic/hamilton/s42_io.py (134 LoC)
      • tests/unit/test_s42_parity.py (109 LoC, 5 pass + 1 skipped oracle)
  • Inherited Epic infrastructure :
      • S07 Lever #1 warm pin path (s41_io._pin_load / _pin_store) : reused via scaffold inheritance.
      • Q1.g harness fix (lightgbm_dag.py:lgb_booster_and_time feature_name=datasets.feature_names) : load-bearing for S04's probes to use semantic feature_names.
      • ADR-0093 r3 + ADR-0094 r2 : apply by inheritance.
  • Reference dossiers :
      • S01 plan (dependency)
      • S03 plan (pattern)
      • S03 closure (verdict pattern)
      • S03 post-merge RCA §12 Q1.g resolution (PR #1090 pr_review dossier : root-cause B′ semantic regime-axis names + harness feature_name fix)