CVN-N001-EI-S04 — LightGBM capacity ablation (Block 2) — Plan dossier
- Story: CVN-N001-EI-S04 · wp#227 (New) · GH issue #1059 (OPEN, 2026-05-24)
- Epic: CVN-N001-EI (#1055) — Signal root-cause & tradability diagnostic program · Block 2 · decides status B
- Diagnostic instance: s42 (scaffold generated 2026-05-31 via /diagnostic-scaffold 227 → dag_diagnostic__s42.py + s42_lgb_capacity_ablation.py + Hamilton compute + invariant tests)
- Dependencies: S01 learning curves (Block 1) — STATUS: CLOSED (GH #1056 closed 2026-05-25, PR #1062 — closure note PR-review filed at documentation/reviews/2026-05-25-cvn-n001-ei-s01-pr1062-pr-review.md). S01 instrumentation provides the best_iter baseline trajectory needed for the conjunction criterion in §1. Not blocking.
- Pattern reference: S03 split-regime reconstruction — same Epic, two-layer (Airflow + Hamilton), warm S07 pin infrastructure, ADR-0093 r2 dry-run cluster gate
- Hardened invariants from S03 Q1.g: ADR-0093 r3 Invariant 6 (§0bis cross-check), ADR-0094 Invariant 7 (call_count == 0 spy) — applies to this Story by inheritance from the Epic
0. Why this Story exists (provenance)
The diagnostic program (Epic CVN-N001-EI) decomposes the weak-signal investigation into 5 blocks with 6 Stories. Block 2 = LightGBM capacity over-fit : the hypothesis that current production HPO defaults (num_leaves, learning_rate, min_child_samples, lambda_l2, min_gain_to_split) over-fit per round, producing a misleadingly high training-time AUC that doesn't generalise. Under this hypothesis, a gentler LGB configuration (lower capacity, slower learning, stronger regularisation) would actually achieve higher OOS AUC.
Status B is binary, pre-registered, and load-bearing for the Epic's overall verdict. Confirming B means the AUC ceiling is fundamental (not a hyperparam issue → look elsewhere : C-blocks, regime, data, etc.). Refuting B means defaults are over-fitting and the production HPO grid needs replacement (concrete deliverable : recommended HP swap).
S04 follows S03 (Block 3b) which closed with SPLIT_INCONCLUSIVE_DEFI_TOP5 + 1× BASELINE_LEAK_INFLATED on OPUSDC (methodological lateral). S04 is independent of S03's verdict (different hypothesis, different block) but shares the warm S07 pin infrastructure for fast iteration on the same defi_top5 / fold_id=3 cells.
1. Objective & pre-committed decisions
Hypothesis B — formal statement as the NULL TESTED (operator r1 fix — inversion bug r0)
H₀(B) tested : Under current production HPO defaults on defi_top5 / fold_id=3, there exists NO gentler configuration along the pre-registered PRIMARY axis (or along any of the 4 exploratory axes after correction) that achieves a marginal ΔAUC ≥ +0.02 with bootstrap CI excluding 0.
Reading the verdicts unambiguously :
- H₀ confirmed (= primary axis does NOT refute) → B_CAPACITY_OK : defaults correctly sized, AUC ceiling fundamental on this cohort. Status B = confirmed.
- H₀ refuted (= primary axis crosses the materiality bar with valid CI, OR an exploratory axis crosses after multiple-testing correction) → B_DEFAULTS_OVERFIT : defaults over-fit, HP swap recommended. Status B = refuted.
The verdict label now literally matches the null's outcome (r0 had the inverse pairing : §1 formal statement said "defaults overfit" but B_CAPACITY_OK was labelled "B confirmed").
Multiple-testing strategy — primary axis + intra-axis correction (operator r1 fix #2 + r2 fix #1)
Inter-axis : pre-register a PRIMARY axis + 4 exploratory axes to keep decision power on the primary while remaining honest on the others :
- Primary axis = num_leaves — most direct capacity proxy. Tested at α = 0.05 (95% bootstrap CI). Decision-relevant.
- Exploratory axes = learning_rate, min_child_samples, lambda_l2, min_gain_to_split — tested at α = 0.05 with Bonferroni correction (effective α = 0.05 / 4 = 0.0125 each ; reported as a 99% CI per axis, which is slightly conservative relative to the exact 98.75%).
Intra-axis (operator r2 fix #1) — SINGLE pre-registered test point per axis (Option A — extreme gentler endpoint). Each axis sweep produces 5 AUC values (default + 4 non-default), but only ONE comparison is decision-relevant per axis :
| Axis | Default | Decision test point (extreme gentler) | The 3 other sweep points |
|---|---|---|---|
| num_leaves (PRIMARY) | 31 | 8 (= ÷4, log-2 endpoint) | Reported in notes, trajectory only, not decisional |
| learning_rate (exploratory) | 0.1 | 0.025 (÷4) | idem |
| min_child_samples (exploratory) | 20 | 80 (×4) | idem |
| lambda_l2 (exploratory) | 0.0 | 100.0 (log-10 endpoint) | idem |
| min_gain_to_split (exploratory) | 0.0 | 1.0 (log-10 endpoint) | idem |
→ 5 decisional tests total (1 primary @ 5% + 4 exploratory @ 1.25% Bonferroni). Under H₀, FWER ≤ 5% on the primary test and ≤ 5% across the exploratory family ; the union bound over all 5 decisional tests is therefore ≤ 10%, not 5%. The 3 non-decision, non-default points per axis are reported for trajectory + optimum-localization (e.g. "edge-win signals optimum beyond grid → follow-up Story widens") but do NOT enter _decide_s42. Locked.
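The α allocation can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names are illustrative and are not part of the S42 scaffold, and the bound used is Boole's inequality, which makes no independence assumption.

```python
# Illustrative helpers (hypothetical names, not from the S42 scaffold).

def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test level that keeps a family of n_tests at family_alpha."""
    return family_alpha / n_tests


def union_bound_fwer(alphas: list) -> float:
    """Boole's inequality: the FWER is at most the sum of the per-test levels."""
    return min(1.0, sum(alphas))


exploratory_alpha = bonferroni_alpha(0.05, 4)                  # per exploratory axis
exploratory_fwer = union_bound_fwer([exploratory_alpha] * 4)   # exploratory family only
all_tests_fwer = union_bound_fwer([0.05] + [exploratory_alpha] * 4)  # primary + exploratory

print(exploratory_alpha, exploratory_fwer, all_tests_fwer)
```

The union bound is conservative when the tests are positively correlated, which is plausible here because every axis is evaluated on the same cells and seeds.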
Rationale for Option A vs Option B (trend test) : num_leaves is monotonically related to capacity (smaller = simpler). The "extreme gentler" hypothesis IS the strongest prior. Trend tests (Option B) add power when optimum is intermediate, but we have no prior reason to expect intermediate optima for these hyperparams. Option A is simpler to code and review, and we lose nothing decision-relevant under the prior.
Non-monotonic optima trajectory reporting + follow-up trigger (operator r3 fix B — exploratory non-decisional)
Single-point Option A risks missing intermediate optima (e.g., num_leaves=16 could outperform both 8 and 31). Rather than expand the grid (partial fix — only addresses edge-wins), pre-register a mechanical exploratory statistic + follow-up trigger :
- Decisional (S04 verdict) : Option A single decision test point per axis (locked in _decide_s42).
- Exploratory (reported in verdict notes, NOT decisional) : compute max_delta_auc_across_sweep[axis] = max(ΔAUC at each non-default point of axis) AND argmax_sweep_point[axis] = the sweep_point achieving the max.
- Pre-registered follow-up trigger : if max_delta_auc > +0.03 with CI excluding 0 at α=0.01 (Bonferroni correction for selection across the 4 non-default points per axis) on ANY axis → flag a follow-up Story in the S04 closure note : "widen grid around argmax_sweep_point OR test joint optimum around (argmax, neighbors)". Does NOT change S04 verdict — only triggers S04b or a new Story.
Why this avoids selection-effect FWER inflation : the threshold (+0.03, NOT +0.02) is tighter than the decisional bar, AND the α=0.01 Bonferroni correction over 4 non-default points keeps FWER ≤ 5% per axis for the trigger decision. The trajectory is published regardless ; the trigger only fires when the signal is strong enough to justify a follow-up Story (NOT to claim S04 missed a verdict).
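The exploratory statistic and its trigger are mechanical enough to sketch. The function name, the input layout, and the ΔAUC numbers below are all hypothetical; the sketch assumes the CI passed in was already computed at α=0.01 and that ΔAUC is AUC(sweep point) minus AUC(default).

```python
def trajectory_trigger(sweep, default_point, threshold=0.03):
    """Exploratory follow-up trigger for ONE axis.

    sweep: {sweep_point: (delta_auc, ci_low, ci_high)}, CI assumed at alpha=0.01.
    Returns (max_delta_auc, argmax_sweep_point, trigger_fires).
    """
    non_default = {p: v for p, v in sweep.items() if p != default_point}
    argmax_point = max(non_default, key=lambda p: non_default[p][0])
    max_delta, ci_low, _ci_high = non_default[argmax_point]
    # A positive max above the threshold excludes 0 only if the lower bound is > 0.
    fires = max_delta > threshold and ci_low > 0
    return max_delta, argmax_point, fires


# Made-up numbers, for illustration only: an intermediate optimum at num_leaves=16.
num_leaves_sweep = {
    8: (0.010, -0.020, 0.040),
    16: (0.035, 0.004, 0.066),
    31: (0.000, 0.000, 0.000),
    62: (-0.010, -0.040, 0.020),
    124: (-0.020, -0.050, 0.010),
}
print(trajectory_trigger(num_leaves_sweep, default_point=31))
```

In this made-up case the trigger fires on the intermediate point 16 while the decision test point 8 does not cross the +0.02 bar, which is exactly the situation the follow-up Story is meant to capture without altering the S04 verdict.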
Pre-registered verdict rule (locked in _decide_s42)
| Cell verdict | Trigger |
|---|---|
| B_CAPACITY_OK (B confirmed — TOST equivalence test, r2 fix #2) | (i) PRIMARY axis decision test point (num_leaves=8 vs 31) : 95%-CI on ΔAUC entirely contained in [−0.02, +0.02] (true equivalence test, NOT a non-significance pass). With multi-seed CI half-width target ≈ 0.02, this requires actual half-width < 0.02 (i.e. tight enough to bound effect) ; (ii) AND best_iter at default within ±20% of S01 baseline AND ≤ 50 (conjunction, r1 minor) ; (iii) AND NO exploratory axis (learning_rate=0.025, min_child_samples=80, lambda_l2=100.0, min_gain_to_split=1.0) crosses the Bonferroni-corrected materiality bar (ΔAUC ≥ +0.02 with 99%-CI excluding 0). Defaults correctly sized AND we have the precision to bound the effect. |
| B_DEFAULTS_OVERFIT (B refuted) | EITHER : (a) PRIMARY axis decision test point crosses the 95% materiality bar (ΔAUC ≥ +0.02, lower bound of 95%-CI > 0, gentler direction) ; OR (b) PRIMARY OK at TOST BUT ≥ 1 exploratory axis crosses the Bonferroni-corrected bar (ΔAUC ≥ +0.02, lower bound of 99%-CI > 0). The triggering axis is logged in the verdict notes. |
| INCONCLUSIVE_UNDERPOWERED | Power gate fails on PRIMARY (n_eff < 25 obs OR observed CI half-width > 0.04 — see §5.1) AND no exploratory axis triggers refute. Most likely outcome under H₀ given n=5 cohort even with multi-seed : the TOST equivalence test is strict — a wide CI (even centered on 0) fails B_CAPACITY_OK. Pre-engaged as the most likely outcome in §5.1. |
| INCONCLUSIVE_TOOLING | Capture/training/Hamilton compute failure (preserved from scaffold default). |
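One possible reading of the cell-level rule, as a minimal sketch rather than the locked _decide_s42. The names AxisResult and decide_cell_sketch are hypothetical. Two assumptions are made where the table leaves room: an exploratory axis crossing its bar refutes regardless of the primary's TOST status, and every case not covered by a scientific verdict is reported as INCONCLUSIVE_UNDERPOWERED. Tooling failures are assumed to be handled upstream.

```python
from dataclasses import dataclass

MATERIALITY = 0.02  # pre-registered ΔAUC bar


@dataclass
class AxisResult:
    delta_auc: float  # AUC(decision test point) - AUC(default); positive = gentler wins
    ci_low: float     # CI bounds at the axis's own level (95% primary, 99% exploratory)
    ci_high: float


def crosses_bar(r: AxisResult) -> bool:
    """Materiality bar: effect at least +0.02 and CI lower bound above 0."""
    return r.delta_auc >= MATERIALITY and r.ci_low > 0


def tost_equivalent(r: AxisResult) -> bool:
    """Equivalence: the whole CI sits inside [-0.02, +0.02]."""
    return r.ci_low >= -MATERIALITY and r.ci_high <= MATERIALITY


def decide_cell_sketch(primary, exploratory, best_iter, s01_best_iter):
    """Return (verdict, triggering_axis)."""
    if crosses_bar(primary):
        return "B_DEFAULTS_OVERFIT", "num_leaves"
    for axis, result in exploratory.items():
        if crosses_bar(result):  # ASSUMPTION: refutes whatever the primary's status
            return "B_DEFAULTS_OVERFIT", axis
    best_iter_ok = (
        abs(best_iter - s01_best_iter) <= 0.2 * s01_best_iter and best_iter <= 50
    )
    if tost_equivalent(primary) and best_iter_ok:
        return "B_CAPACITY_OK", None
    # ASSUMPTION: power-gate failures and uncovered cases share one label here.
    return "INCONCLUSIVE_UNDERPOWERED", None
```

A CI of [-0.05, +0.05] centered on 0 fails tost_equivalent, so the sketch returns INCONCLUSIVE_UNDERPOWERED for it, which matches the strictness the table describes.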
Group synthesis (locked in synthesize_group_s42) — B_SYSTEMATIC_OVERFIT added (operator r1 fix #4)
| Group verdict | Trigger |
|---|---|
| B_SYSTEMATIC (defaults OK system-wide) | ≥ 80% (SYSTEMATIC_FRAC_CUT) cells = B_CAPACITY_OK AND directional consistency on primary axis ΔAUC (no cell shows a gentler direction even below materiality). |
| B_SYSTEMATIC_OVERFIT (defaults overfit system-wide, added r1) | ≥ 80% cells = B_DEFAULTS_OVERFIT on the SAME triggering axis (primary or same exploratory) with directional consistency. → Recommendation : global HP swap on the triggering axis (concrete delta values in the closure note). |
| B_PER_ASSET (heterogeneous) | Mixed B_CAPACITY_OK / B_DEFAULTS_OVERFIT (no clear majority) OR different axes triggering refute across cells. → Recommendation : per-asset HP, not global swap. |
| INCONCLUSIVE_GROUP_COVERAGE (renamed from cell-level INCONCLUSIVE_TOOLING, r1 minor) | < 80% of cells reached a scientific verdict (i.e. cell-level INCONCLUSIVE_* ≥ 20%). Distinct label avoids the r0 confusion between cell- and group-level INCONCLUSIVE. |
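The group matrix can be sketched in the same spirit. The function name and the tuple layout are hypothetical, and two readings are assumed: "no cell shows a gentler direction" means primary ΔAUC ≤ 0 for every scientifically-decided cell, and any combination not matching a systematic row falls through to B_PER_ASSET.

```python
SYSTEMATIC_FRAC_CUT = 0.80


def synthesize_group_sketch(cells):
    """cells: list of (verdict, triggering_axis, primary_delta_auc), one per cell."""
    n = len(cells)
    scientific = [c for c in cells if not c[0].startswith("INCONCLUSIVE")]
    if n == 0 or len(scientific) / n < SYSTEMATIC_FRAC_CUT:
        return "INCONCLUSIVE_GROUP_COVERAGE"

    ok = [c for c in cells if c[0] == "B_CAPACITY_OK"]
    # ASSUMPTION: directional consistency = no positive primary ΔAUC anywhere.
    no_gentler_lean = all(c[2] <= 0 for c in scientific)
    if len(ok) / n >= SYSTEMATIC_FRAC_CUT and no_gentler_lean:
        return "B_SYSTEMATIC"

    overfit_axes = [c[1] for c in cells if c[0] == "B_DEFAULTS_OVERFIT"]
    for axis in set(overfit_axes):
        # Crossing the bar already implies a positive (gentler) ΔAUC on that axis.
        if overfit_axes.count(axis) / n >= SYSTEMATIC_FRAC_CUT:
            return "B_SYSTEMATIC_OVERFIT"
    return "B_PER_ASSET"
```

With 5 cells the 80% cut means at least 4 cells must agree, so a single dissenting cell is tolerated but two are not.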
Cohort & folds
- Cohort : defi_top5 (= LDOUSDC, OPUSDC, ARBUSDC, AAVEUSDC, UNIUSDC) — same as S01 / S03 for direct comparability.
- Fold : fold_id=3 — pre-captured by S07 Gate-4 pins (warm path available, ~3-5 min/cell vs ~3 h/cell cold).
- No expansion to other cohorts in this Story — out of scope, deferred to a follow-up Story if S04 verdict warrants.
2. Scope — the 5 axes (one-at-a-time ablation)
Per the Epic's pre-registration : one axis at a time (NOT joint), to keep n_cells × n_axes × n_points interpretable + the CI computation tractable. Each axis sweeps relative to its production default :
| Axis | Default (prod HPO) | Sweep grid r1 (uniform multiplicative log-2 spacing, operator r1 fix minor) | Direction "gentler" | Role |
|---|---|---|---|---|
| num_leaves (PRIMARY) | 31 | {8, 16, 31, 62, 124} — log-2 around default (÷4, ÷2, ×1, ×2, ×4, rounded to integers) | lower | decisional at α=0.05 |
| learning_rate | 0.1 | {0.025, 0.05, 0.1, 0.2, 0.4} — log-2 around default | lower | exploratory at α=0.0125 (Bonferroni) |
| min_child_samples | 20 | {5, 10, 20, 40, 80} — log-2 around default | higher | exploratory at α=0.0125 |
| lambda_l2 | 0.0 | {0.0, 0.1, 1.0, 10.0, 100.0} — log-10 (defaults at 0, can only INCREASE; log-10 for wide span) | higher | exploratory at α=0.0125 |
| min_gain_to_split | 0.0 | {0.0, 0.001, 0.01, 0.1, 1.0} — log-10 (same reason as lambda_l2) | higher | exploratory at α=0.0125 |
Grid spacing rationale : multiplicative axes (num_leaves, learning_rate, min_child_samples) use log-2 spacing centered on prod default to cover ÷4..×4 symmetric range. Additive-from-zero axes (lambda_l2, min_gain_to_split) use log-10 spacing from 0 because the prod default is 0 (you can't decrease below) AND the meaningful range spans 3 orders of magnitude. Both choices are pre-registered and justified per axis (vs r0's mixed/unjustified spacing).
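Both spacing schemes can be reproduced from the production defaults. The two helper names are illustrative, not taken from the scaffold; the sketch only regenerates the pre-registered grids listed in the table above.

```python
def log2_grid(default, as_int=False):
    """Five points at /4, /2, x1, x2, x4 around a non-zero default."""
    grid = [default * 2.0 ** k for k in (-2, -1, 0, 1, 2)]
    return [round(v) for v in grid] if as_int else grid


def log10_grid_from_zero(top, decades=3):
    """Zero (the default) followed by one point per decade up to `top`."""
    return [0.0] + [top / 10 ** k for k in range(decades, -1, -1)]


grids = {
    "num_leaves": log2_grid(31, as_int=True),
    "learning_rate": log2_grid(0.1),
    "min_child_samples": log2_grid(20, as_int=True),
    "lambda_l2": log10_grid_from_zero(100.0),
    "min_gain_to_split": log10_grid_from_zero(1.0),
}
for axis, grid in grids.items():
    print(axis, grid)
```

For num_leaves the exact ÷4 and ÷2 values are 7.75 and 15.5, so the integer grid {8, 16} is a rounding of the log-2 points rather than an exact division.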
5 axes × 5 sweep points × 5 cells = 125 training runs minimum at a single seed (r0 baseline ; the multi-seed remediation in §5.1 raises this to 625). Per-axis CI requires a clustered bootstrap over (crypto, seed) pairs (§3).
One-at-a-time rationale (NOT joint ablation): the Epic pre-registration commits to marginal effects on each axis to keep the search space interpretable. A joint 5×5×5×5×5 grid = 3 125 training runs/cell → 15 625 across cohort, infeasible. Marginal sweep gives directional gradient per axis, which is what the verdict rule requires (we don't need joint optimum to falsify B — refute on ANY axis is enough).
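The run counts behind the marginal-versus-joint argument, as plain arithmetic. The last figure is an observation rather than something stated in the plan: because every axis grid contains the shared default, the number of distinct configurations per cell is smaller than 5 × 5 if the default fit is reused across axes.

```python
n_axes = n_points = n_cells = 5

marginal_runs = n_axes * n_points * n_cells          # one-at-a-time, single seed
joint_runs_per_cell = n_points ** n_axes             # full 5x5x5x5x5 grid
joint_runs_cohort = joint_runs_per_cell * n_cells

# Each axis contributes 4 non-default points; the default is shared by all axes.
distinct_configs_per_cell = n_axes * (n_points - 1) + 1

print(marginal_runs, joint_runs_per_cell, joint_runs_cohort, distinct_configs_per_cell)
```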
3. Technical approach
Two-layer (Airflow + Hamilton), Hamilton-native (S03 lesson)
- Airflow orchestration (dags/dag_diagnostic__s42.py, scaffolded) : resolve cells → discriminate_cell.expand (5 parallel pods) → synthesize. Reuses S07 warm pin (use_pin=true) + ADR-92 provenance (dag_loaded_event / dag_doc_md / make_tags).
- Hamilton compute (src/commun/finetune/diagnostic/hamilton/s42_nodes.py, scaffolded) : combined → 5 _probe_* nodes (one per axis) → aggregated cell_verdict via _decide_s42. Pure functions, no side effects, all ValidationResult-style returns (no raise per [feedback_no_python_crash_visible]).
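A "result object instead of raise" convention can look like the following. ProbeOutcome and run_probe_safely are hypothetical stand-ins: the project's actual ValidationResult type is not shown in this dossier, so its fields are assumed here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ProbeOutcome:
    """Hypothetical stand-in for the project's ValidationResult-style return."""
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None


def run_probe_safely(probe: Callable[..., dict], *args: Any, **kwargs: Any) -> ProbeOutcome:
    """Call a probe and convert any exception into a failed outcome."""
    try:
        return ProbeOutcome(ok=True, value=probe(*args, **kwargs))
    except Exception as exc:  # deliberately broad: nothing may escape the node
        return ProbeOutcome(ok=False, error=f"{type(exc).__name__}: {exc}")
```

A failed outcome carries the exception class in its error string, which gives the decider enough information to emit INCONCLUSIVE_TOOLING with a cause instead of crashing the Hamilton graph.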
Probe shape (per axis, to implement)
```python
def _probe_num_leaves(X_tr, y_tr, X_va, y_va, resolved_hp, sweep_points, seed, n_rounds) -> dict:
    """Sweep num_leaves across pre-registered grid. Return per-point AUC + clustered bootstrap CI."""
    # for each value in sweep_points:
    #   override resolved_hp["num_leaves"] = value
    #   fit LGB with feature_name=feature_names (Q1.g pattern, harness lightgbm_dag.py:163 inheritance)
    #   compute val AUC + best_iter
    #   bootstrap CI over (crypto, seed) pairs (supersedes the r0 "fold × crypto residuals" wording)
    # return {"axis": "num_leaves", "points": [...], "auc_per_point": [...], "ci_per_point": [...],
    #         "best_iter_per_point": [...], "delta_auc_vs_default": [...]}
```
Same shape for the 5 axes. The verdict rule (_decide_s42) consumes all 5 dicts + decides per the pre-registered table §1. Only the decision test point (extreme gentler endpoint, §1 r2 fix #1) enters the verdict ; the 3 other non-default points are for trajectory reporting.
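The shared result shape can be exercised without LightGBM by injecting the fit step. This is a sketch under stated assumptions: probe_axis_sketch and the fit_and_score callable are hypothetical, the CI field is left out because the bootstrap is specified separately, and the fake scorer in the usage lines returns made-up numbers.

```python
def probe_axis_sketch(axis, sweep_points, default_point, resolved_hp, fit_and_score):
    """Sweep one axis and assemble the per-point result dict.

    fit_and_score(hp) -> (val_auc, best_iter) is a caller-supplied stand-in
    for the real LightGBM fit.
    """
    aucs, best_iters = [], []
    for value in sweep_points:
        hp = {**resolved_hp, axis: value}  # copy: the resolved defaults stay untouched
        auc, best_iter = fit_and_score(hp)
        aucs.append(auc)
        best_iters.append(best_iter)
    default_auc = aucs[list(sweep_points).index(default_point)]
    return {
        "axis": axis,
        "points": list(sweep_points),
        "auc_per_point": aucs,
        "best_iter_per_point": best_iters,
        "delta_auc_vs_default": [a - default_auc for a in aucs],
    }


# Usage with a fake scorer (numbers are invented for the example).
defaults = {"num_leaves": 31, "learning_rate": 0.1}
fake = lambda hp: (0.5 + 0.001 * hp["num_leaves"], 10)
result = probe_axis_sketch("num_leaves", [8, 16, 31, 62, 124], 31, defaults, fake)
print(result["delta_auc_vs_default"])
```

Building a fresh hp dict per sweep point keeps the probe a pure function, in line with the no-side-effects requirement on the Hamilton nodes.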
Warm-pin reuse from S07 / S41
The captured parquet (with semantic feature_names post-Q1.g) is pinned per cell. S42's training fits use the same X_tr / y_tr / feature_names from load_cell_inputs. No re-capture needed if S03's recapture is still warm (cluster pins from 2026-05-30 21:40Z, content-addressed by input_data_sha — should persist through the data window).
Clustering bootstrap method — harmonized + n_cluster=5 acknowledged (operator r2 fix #3)
Clustering structure : (crypto, seed) pairs. fold_id=3 is fixed → no fold-variance to exploit. r0/r1 §3 mentioned "fold × crypto residuals" (legacy r0 wording) — superseded by (crypto, seed) exclusively. Harmonized.
Bootstrap CI method under n_cluster=5 — consensus n-of-m (r3, committee + operator review) : classical cluster bootstrap with n_cluster=5 is discretized (5 cryptos drawn with replacement → 5^5 = 3125 ordered draws but only C(9,5) = 126 distinct resamples, many duplicates), known-unreliable for small-cluster inference per Cameron et al. (2008). At this boundary, no single CI method is definitively more credible than the others — wild cluster bootstrap has Rademacher weight assumptions, Bayesian hierarchical has prior sensitivity, permutation is exact but limited to sign-based inference. Forcing one as "tiebreaker" would smuggle in arbitrary methodology preference.
Pre-engaged choice (r3) : compute 3 independent CIs in parallel, declare verdict by 2/3 consensus, treat method-disagreement as itself a signal of insufficient cluster count.
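The discreteness of a 5-cluster resampling scheme can be counted directly. The sketch enumerates the distinct cluster resamples (multisets of size 5 drawn from 5 cryptos) and, for comparison, the number of Rademacher sign patterns available to a wild cluster bootstrap on the same 5 clusters.

```python
from itertools import combinations_with_replacement
from math import comb

clusters = ["LDOUSDC", "OPUSDC", "ARBUSDC", "AAVEUSDC", "UNIUSDC"]
k = len(clusters)

ordered_draws = k ** k                                            # sequences of 5 draws
distinct_resamples = len(set(combinations_with_replacement(clusters, k)))
rademacher_patterns = 2 ** k                                      # +1/-1 per cluster

print(ordered_draws, distinct_resamples, comb(2 * k - 1, k), rademacher_patterns)
```

Running 2000 bootstrap iterations over so few distinct configurations mostly re-draws the same resamples, which is one way to see why the three methods may disagree at this cluster count.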
| Method | Role | Properties |
|---|---|---|
| Wild cluster bootstrap (Cameron-Gelbach-Miller) | Primary | Rademacher weights, calibrated for few-clusters, 2000 bootstrap iterations. Resamples seed-level residuals within cluster. |
| Permutation test on signs of paired-ΔAUC | Primary | Exact under H₀ for n=5 paired observations. Sign-based — robust to scale/shape but limited inferential power. |
| Bayesian hierarchical bootstrap (weak priors) | Primary | Posterior on ΔAUC under cluster-exchangeable model. Weak priors (Normal(0, 1) on cluster effects) acknowledged as own assumption. |
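For the sign-based method, an exact one-sided sign-flip test is small enough to enumerate. This is a generic sketch of that technique, not the project's implementation; the function name and the ΔAUC values are invented for the example.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact one-sided sign-flip test of H0: paired ΔAUC symmetric around 0.

    Enumerates all 2^n sign assignments and returns the share whose sum is
    at least as large as the observed sum.
    """
    observed = sum(deltas)
    hits = total = 0
    for signs in product((1, -1), repeat=len(deltas)):
        total += 1
        if sum(s * d for s, d in zip(signs, deltas)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Five made-up per-crypto deltas, all favouring the gentler configuration.
print(sign_flip_pvalue([0.03, 0.02, 0.04, 0.01, 0.05]))
```

With 5 paired observations there are only 2^5 = 32 sign assignments, so the smallest attainable one-sided p-value is 1/32 ≈ 0.031. If the test is run on the 5 per-crypto deltas, that floor is below 0.05 but above the exploratory 0.0125 level; running it on the 25 (crypto, seed) pairs removes the floor at the cost of enumerating or sampling 2^25 assignments.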
Consensus rule (locked in _decide_s42) :
- B_DEFAULTS_OVERFIT triggered iff ≥ 2/3 methods agree : ΔAUC ≥ +0.02 AND CI lower bound > 0 at the corrected α (95% for primary axis, 99% Bonferroni for exploratory). All 3 methods + their CIs are logged in the verdict notes.
- B_CAPACITY_OK (TOST equivalence) triggered iff ≥ 2/3 methods agree : 95%-CI entirely contained in [−0.02, +0.02] (with CI half-width < 0.02 enforced separately per §5.1).
- INCONCLUSIVE_UNDERPOWERED triggered iff fewer than 2/3 methods agree on any conclusion. Method disagreement at n_cluster=5 IS the signal — the closure note logs "CI methods disagree at n_cluster=5 — verdict cannot be issued at this cluster count" explicitly. Honest underpower acknowledgment, not arbitrary tiebreaker.
Compute cost : marginal — all 3 methods reuse the same (crypto, seed) resampling base. Implementation effort : ~1 day across the 3 methods + the consensus aggregator in _decide_s42.
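The 2-of-3 aggregation itself is a small counting rule. In this sketch the function name and input layout are hypothetical, and one simplification is assumed: the three methods share a single ΔAUC point estimate and differ only in their CI bounds.

```python
def consensus_verdict(method_cis, delta_auc, bar=0.02, quorum=2):
    """Aggregate per-method CIs for one decision test point.

    method_cis: {method_name: (ci_low, ci_high)} at the axis's corrected level.
    """
    refute_votes = sum(
        1 for ci_low, _ in method_cis.values() if delta_auc >= bar and ci_low > 0
    )
    equivalence_votes = sum(
        1 for ci_low, ci_high in method_cis.values() if ci_low >= -bar and ci_high <= bar
    )
    if refute_votes >= quorum:
        return "B_DEFAULTS_OVERFIT"
    if equivalence_votes >= quorum:
        return "B_CAPACITY_OK"
    return "INCONCLUSIVE_UNDERPOWERED"


# Made-up CIs: two methods exclude 0, one does not.
cis = {"wild": (0.004, 0.056), "sign_flip": (-0.002, 0.062), "bayes": (0.001, 0.059)}
print(consensus_verdict(cis, delta_auc=0.03))
```

Logging the per-method votes alongside the label keeps the disagreement visible in the verdict notes even when a quorum is reached.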
4. Files to create / modify
| File | State | Action |
|---|---|---|
| dags/dag_diagnostic__s42.py | scaffolded (354 LoC) | Commit as-is, refine if probe signatures dictate |
| src/commun/finetune/diagnostic/s42_lgb_capacity_ablation.py | scaffolded (204 LoC, 5 _probe_* + _decide_s42 as NotImplementedError stubs) | Implement 5 probes + _decide_s42 |
| src/commun/finetune/diagnostic/hamilton/s42_nodes.py | scaffolded (206 LoC) | Wire probes to Hamilton nodes per scaffold pattern |
| src/commun/finetune/diagnostic/hamilton/s42_io.py | scaffolded (134 LoC) | Reuse load_cell_inputs verbatim (S41 pattern post-Q1.g) |
| tests/unit/test_s42_parity.py | scaffolded (109 LoC, 5 passing + 1 skipped oracle) | Fill test_full_parity with synthetic LGB oracle once probes land |
| documentation/reviews/2026-05-31-cvn-n001-ei-s04-lgb-capacity-ablation-plan.md | THIS FILE | Commit + link from PR |
5. Risks & threats to validity
5.1 Power-related — quantified honestly (operator r1 fix #3)
Expected CI half-width under raw n=5 cohort design (r0) : per-crypto AUC SD on same fold typically > 0.05 (S03 saw OPUSDC differ by ~0.08 from cohort mean). Bootstrap of n=5 with replacement → 95%-CI half-width ≈ 1.96 × σ/√n_eff ≈ 1.96 × 0.05/√5 ≈ 0.04-0.05 under H₀. Materiality bar = 0.02 → a real ΔAUC at the bar cannot exclude 0, and even a real ΔAUC of +0.05 clears 0 only marginally.
→ Without remediation, INCONCLUSIVE_UNDERPOWERED is THE most likely outcome (probability > 0.5 under H₀ AND under most realistic alternatives).
Three-pronged remediation (locked, r1) :
- Multi-seed within each cell — refit each (cell, sweep_point) with 5 independent seeds (seed ∈ {1337, 1338, 1339, 1340, 1341}). Yields n_eff = 5 cells × 5 seeds = 25 obs per (axis, sweep_point), expected CI half-width ≈ 0.02 — i.e. materiality-bar-detectable for the primary axis.
    - Compute cost : 5 axes × 5 points × 5 cells × 5 seeds = 625 training runs total (vs r0's 125). At ~3-5 min/run warm = ~30-50 h wall-clock, sequential per cell within a pod, ~6-10 h with the 5 parallel cell-pods. Acceptable per Epic budget.
    - Clustered bootstrap : resample over (crypto, seed) pairs (NOT (crypto, sweep_point) — sweep points are fixed by design).
- Materiality bar held at +0.02 — Epic consistency (S03 + S05+ use same bar). Lowering it to +0.01 would break cross-Block comparability.
- INCONCLUSIVE_UNDERPOWERED pre-engaged as most likely outcome in the closure-note template. Verdict notes MUST state the observed n_eff + observed CI half-width per axis, not just the verdict label. A B_CAPACITY_OK with CI half-width 0.04 = honest "we can't tell"; logged distinct from a tight B_CAPACITY_OK with CI half-width 0.015.
→ Locked power gate : per-axis INCONCLUSIVE_UNDERPOWERED triggers when n_eff < 25 OR observed CI_half_width > 0.04. Per-cell INCONCLUSIVE_UNDERPOWERED triggers when primary axis is underpowered AND no exploratory axis triggers refute.
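The half-width figures and the power gate reduce to two small functions. The sketch uses the normal-approximation formula quoted in this section with σ = 0.05 and treats the 25 (crypto, seed) observations as independent, following the dossier's n_eff convention; the function names are illustrative.

```python
import math


def ci_half_width(sigma, n_eff, z=1.96):
    """Normal-approximation 95% half-width: z * sigma / sqrt(n_eff)."""
    return z * sigma / math.sqrt(n_eff)


def power_gate_fails(n_eff, observed_half_width):
    """Locked gate: underpowered if n_eff < 25 OR half-width > 0.04."""
    return n_eff < 25 or observed_half_width > 0.04


single_seed = ci_half_width(0.05, 5)    # r0 design: 5 cells, 1 seed
multi_seed = ci_half_width(0.05, 25)    # r1 design: 5 cells x 5 seeds

print(round(single_seed, 4), power_gate_fails(5, single_seed))
print(round(multi_seed, 4), power_gate_fails(25, multi_seed))
```

If seeds within a crypto are correlated, the effective sample size is below 25 and the realised half-width will be wider than this estimate, which is why the gate is evaluated on the observed half-width and not on the expected one.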
5.2 Inter-axis confounding
- One-at-a-time ablation cannot detect interaction effects (e.g. num_leaves=15 may overfit only if learning_rate=0.2). Marginal verdicts on each axis can mask joint behaviour.
- Mitigation : explicit in the plan_review and in the verdict notes — S04 is marginal B, not joint B. A future Story may do joint sweep if marginal verdict surfaces interaction signals.
5.3 Pin invalidation
- Q1.g warm pins from 2026-05-30 may be invalidated if upstream data refreshes (cache invalidation on data_date change) → first-run cold-capture cost returns.
- Mitigation : check Loki for event=s41_pin_hit on first dry-run trigger. If pins still warm, proceed. If not, accept the 3 h/cell cold cost as one-off.
5.4 Feature-name drift between S03 and S04
- S03's Q1.g fix means feature_names live in the parquet (post-Q1.g). S04 inherits this → harness must use same lgb.Dataset(feature_name=...) path. The S42 scaffold doesn't override this, but the probes must explicitly use datasets.feature_names in their fit calls.
- Mitigation : explicit invariant in _probe_* : assert isinstance(feature_names, (list, tuple)) and len(feature_names) > 0 before lgb.Dataset(...). Test in test_full_parity.
5.5 BASELINE_LEAK_INFLATED on OPUSDC (S03 finding) — pre-engaged sensitivity verdict (operator r1 fix minor)
- S03 surfaced that OPUSDC's naive baseline is leak-inflated vs purged-WF. This is orthogonal to LGB capacity but may amplify or mask the AUC signal we measure here.
- Pre-engaged sensitivity verdict (locked) — the closure note MUST report TWO group verdicts side-by-side :
    - Full cohort verdict (5/5 cells) — primary verdict, what the verdict rule §1 emits.
    - Excl-OPUSDC verdict (4/4 cells with OPUSDC removed) — sensitivity check.
- Disagree definition pre-engaged (r2) — the two verdicts disagree iff ANY of :
    - (a) Group label differs (e.g. Full = B_SYSTEMATIC, Excl-OPUSDC = B_SYSTEMATIC_OVERFIT).
    - (b) The triggering axis differs (e.g. Full triggers on num_leaves, Excl-OPUSDC triggers on lambda_l2).
    - (c) Both label + triggering axis match, but |ΔAUC_full − ΔAUC_excl_OPUSDC| > 0.01 on the triggering axis (effect-size delta exceeds half the materiality bar).
- Decision rule : if AGREE → primary verdict (Full cohort) stands. If DISAGREE → the closure note flags it explicitly + Epic-level reviewer arbitrates. Not discretionary at the threshold definition layer — the disagree-vs-agree call is mechanical per the 3 above conditions. Only the arbitration (what to do under disagree) is human-judged.
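Because the agree/disagree call is meant to be mechanical, it fits in one function. The name verdicts_disagree and the dict keys are hypothetical; the three checks mirror conditions (a), (b) and (c) above and return which condition fired.

```python
def verdicts_disagree(full, excl_opusdc, tol=0.01):
    """Compare the Full-cohort and Excl-OPUSDC group verdicts.

    Each argument is a dict with keys 'label', 'axis', 'delta_auc'
    (ΔAUC on the triggering axis). Returns (disagree, condition).
    """
    if full["label"] != excl_opusdc["label"]:
        return True, "a"  # group label differs
    if full["axis"] != excl_opusdc["axis"]:
        return True, "b"  # triggering axis differs
    if abs(full["delta_auc"] - excl_opusdc["delta_auc"]) > tol:
        return True, "c"  # effect-size gap above half the materiality bar
    return False, None


# Made-up verdicts: same label and axis, but the effect shrinks without OPUSDC.
full = {"label": "B_SYSTEMATIC_OVERFIT", "axis": "num_leaves", "delta_auc": 0.035}
excl = {"label": "B_SYSTEMATIC_OVERFIT", "axis": "num_leaves", "delta_auc": 0.020}
print(verdicts_disagree(full, excl))
```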
5.6 ADR-0093 r2 dry-run cluster gate — smoke vs full split (operator r2 fix #4)
S04 is a pin-READ PR (consumes pre-existing S07/S41 pins; does NOT write new pins via the s42 path). ADR-0093 r2 was authored for pin-WRITE PRs, but ADR-0093 r3 Invariant 6 (§0bis cross-check) applies regardless : the probes must consume what the harness actually produces post-Q1.g (semantic feature_names, not f0..fN).
Two distinct cluster runs (r2) :
| Run | When | Scope | Wall-clock | Purpose |
|---|---|---|---|---|
| Smoke dry-run | pre-merge (ADR-0093 r2 Invariant 2c) | 2 cells × 2 axes × 2 seeds = 8 runs (r3 fix C, pre-registered) : crypto ∈ {LDOUSDC, OPUSDC} × axis ∈ {num_leaves, lambda_l2} × seed ∈ {1337, 1338}. Justification : LDOUSDC = highest N (=319), OPUSDC = leak-known baseline (covers cohort extremes per S03 closure §3) ; num_leaves = primary axis decisional test point, lambda_l2 = atypical grid (log-10 from 0 vs log-2 around default — exercises both grid-schema branches) ; 2 seeds = exercises multi-seed clustering aggregation. | ~30-60 min (8 runs sequential on a single pod, ~5 min/run warm) | Validates the machinery : (i) Hamilton graph compiles ; (ii) probe consumes datasets.feature_names post-Q1.g ; (iii) _decide_s42 returns a status in the valid pre-registered set ; (iv) multi-cell parallel pod orchestration (the 2-cell expansion exercises the cross-pod cell-fanout that the 1-cell smoke missed) ; (v) multi-axis aggregation in _decide_s42 (the 2-axis expansion exercises the inter-axis Bonferroni branch that the 1-axis smoke missed) ; (vi) multi-seed bootstrap clustering (the 2-seed expansion exercises the consensus-n-of-m aggregator at minimal scale). No pre-judgment on WHICH verdict — just that the decider ran end-to-end without crash and produced a structurally-valid decision at non-trivial scale. Satisfies ADR-0093 r2 Invariant 2c at sufficient integration coverage. |
| Full scientific run | post-merge | 5 axes × 5 sweep points × 5 cells × 5 seeds = 625 training runs | ~6-10 h with 5 parallel pods | Produces the decisional verdict entering the closure note. Triggered by operator post-merge per the standard /diagnostic-scaffold → impl → cluster trigger flow. |
Smoke validates the code path, full validates the hypothesis. The two ARE distinct ADR-0093 r2 satisfaction items + Story-completion deliverables ; conflating them would either force a 6-10h pre-merge block (impractical) or weaken the gate (smoke alone doesn't justify the PR's correctness if the full run later fails).
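The two run matrices can be enumerated to confirm the counts and the wall-clock orders of magnitude. The variable names are illustrative; the smoke tuples omit the sweep point because the table does not say which point each smoke run uses.

```python
from itertools import product

GRIDS = {
    "num_leaves": [8, 16, 31, 62, 124],
    "learning_rate": [0.025, 0.05, 0.1, 0.2, 0.4],
    "min_child_samples": [5, 10, 20, 40, 80],
    "lambda_l2": [0.0, 0.1, 1.0, 10.0, 100.0],
    "min_gain_to_split": [0.0, 0.001, 0.01, 0.1, 1.0],
}
COHORT = ["LDOUSDC", "OPUSDC", "ARBUSDC", "AAVEUSDC", "UNIUSDC"]
SEEDS = [1337, 1338, 1339, 1340, 1341]

smoke_runs = list(product(["LDOUSDC", "OPUSDC"], ["num_leaves", "lambda_l2"], SEEDS[:2]))
full_runs = [
    (crypto, axis, point, seed)
    for crypto in COHORT
    for axis, grid in GRIDS.items()
    for point in grid
    for seed in SEEDS
]

smoke_minutes = len(smoke_runs) * 5                 # ~5 min/run warm, one pod
runs_per_pod = len(full_runs) // len(COHORT)        # one pod per cell
full_hours = (runs_per_pod * 3 / 60, runs_per_pod * 5 / 60)  # 3-5 min/run warm

print(len(smoke_runs), smoke_minutes, len(full_runs), runs_per_pod, full_hours)
```

At 125 runs per pod and 3-5 minutes per warm run, each pod lands at roughly 6 to 10 hours, consistent with the wall-clock column.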
Helm-temp bypass mechanism as in S03 (operator-acknowledged anti-pattern, follow-up tracked in #1092). Until #1092 closes, smoke uses helm-temp pre-merge ; full uses standard deploy CI post-merge on main.
5.7 Ops SLO + pod failure semantics (operator r3 fix D — load-bearing for verdict quality)
At 625-run scale on 5 parallel pods, pod failures (OOM, Scaleway eviction, transient timeout) are quasi-certain. Without pre-spec, a failure can be silently retried, drop runs, or worst-case obscure the verdict by dropping cells whose signal mattered. Pre-engaged ops contract (locked in _decide_s42 + pod resource spec) :
- Resource quotas per pod (sized for ~125 runs/cell × 2× margin) :
    - Memory : 8 GiB request / 16 GiB limit per pod (vs current S41 2 GiB / 8 GiB — limit doubled, request raised from 2 to 8 GiB, for headroom on multi-seed multi-axis).
    - CPU : 2 request / 4 limit (unchanged).
    - Pod node selector : compute pool (unchanged from S41).
- Pod failure classification (load-bearing distinction) :
    - Pod failure (OOM / eviction / timeout > 12 h) → cell-level INCONCLUSIVE_TOOLING explicit, NEVER INCONCLUSIVE_UNDERPOWERED. The two are different : tooling = "we couldn't measure", underpower = "we measured but precision too low". Verdict notes MUST log the failure mode + retry attempts.
    - Auto-retry budget bounded : max 1 retry per cell. If retry also fails → INCONCLUSIVE_TOOLING recorded, the cell is excluded from group synthesis (NOT counted toward the 80% scientific-verdict gate for INCONCLUSIVE_GROUP_COVERAGE).
- SLO : 95% of pods MUST complete within 8 h. Beyond 12 h → pod killed via Airflow timeout + cell INCONCLUSIVE_TOOLING (timeout class).
- Resource starvation cross-check : if ≥ 2 cells fail with the same root cause (e.g., 2× OOM) → group-level INCONCLUSIVE_GROUP_COVERAGE with note "systematic ops failure — re-run after capacity bump". Avoids partial verdict on a degraded substrate.
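The retry budget and the starvation cross-check can be expressed as two pure functions. All names and the outcome encoding ('ok', 'oom', 'eviction', 'timeout') are assumptions made for this sketch; it also assumes every cell has at least one recorded attempt.

```python
from collections import Counter

MAX_RETRIES = 1  # locked budget: one retry per cell


def classify_cell(attempts):
    """attempts: outcomes in order, each 'ok' or a failure cause string.

    Only the first attempt plus MAX_RETRIES retries are considered.
    """
    tries = attempts[: 1 + MAX_RETRIES]
    if "ok" in tries:
        return {"status": "MEASURED", "failure": None, "retries": tries.index("ok")}
    return {
        "status": "INCONCLUSIVE_TOOLING",
        "failure": tries[-1],          # last observed failure mode, for the notes
        "retries": len(tries) - 1,
    }


def systematic_ops_failure(cell_outcomes):
    """True if at least 2 cells failed with the same root cause."""
    causes = Counter(
        o["failure"] for o in cell_outcomes if o["status"] == "INCONCLUSIVE_TOOLING"
    )
    return any(count >= 2 for count in causes.values())
```

A cell that would only succeed on a third attempt is still recorded as INCONCLUSIVE_TOOLING, which keeps silent multi-retry recoveries out of the verdict.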
6. Success criteria
A Story is "done" when all of the following hold:
- Pre-registered grid traversed : 5 axes × 5 sweep points × 5 cells × 5 seeds (multi-seed r1) = 625 training runs completed (or honest report of which subset succeeded).
- Per-axis ΔAUC + clustered bootstrap CI (clustered on (crypto, seed)) computed at the locked materiality bar α (95% for primary num_leaves, 99% Bonferroni-corrected for the 4 exploratory axes). Persisted in s42_per_cell_*.json + s42_group_*.json artefacts.
- Cell verdict emitted : one of B_CAPACITY_OK / B_DEFAULTS_OVERFIT / INCONCLUSIVE_UNDERPOWERED / INCONCLUSIVE_TOOLING. The notes MUST include the observed n_eff + CI_half_width per axis (not just the verdict label — honest underpower reporting per §5.1 r1).
- Group verdict pair emitted (operator r1 fix #5 sensitivity) : Full cohort verdict (5/5 cells) + Excl-OPUSDC verdict (4/4 cells). If they DISAGREE, both surfaced explicitly in the closure note.
- Group verdict ∈ B_SYSTEMATIC / B_SYSTEMATIC_OVERFIT / B_PER_ASSET / INCONCLUSIVE_GROUP_COVERAGE (per the r1 group matrix in §1).
- If B_SYSTEMATIC_OVERFIT : concrete HP swap recommendation in the closure note (new defaults for the triggering axis) backed by the observed ΔAUC × CI on the primary OR Bonferroni-corrected exploratory.
- If B_SYSTEMATIC : note explicitly that "LGB capacity is not the issue, AUC ceiling is fundamental on this cohort" → next-block (C-blocks) work is unblocked, no HP change.
- Cluster dry-run on PR head SHA produced verifiable success events (per ADR-0093 r2 Invariant 2c) : event=s42_cell_outcome status=<verdict> for ≥ 3/5 cells. Helm-temp bypass mechanism documented (#1092 still open).
- Closure note (~1 page) per the S03 pattern on main before /story-advance 227 → Closed.
Out-of-scope for S04 (pre-engaged to prevent re-litigation in future reviews)
- Tradability validation (slippage / fees / funding model) : S04 measures AUC (ML diagnostic), not P&L. AUC is invariant to execution costs. Handled by a post-S04 Story if B_SYSTEMATIC_OVERFIT triggers an HP swap (the HP-swap candidate is then validated for tradability under new defaults). Out-of-scope for this Story per the Epic Block 2 pre-registration.
- Joint multi-axis ablation (e.g., num_leaves × learning_rate joint sweep) : marginal one-axis-at-a-time is locked per Epic pre-registration. Triggered as follow-up Story only if the non-monotonic-optima trajectory trigger fires (§1, exploratory, r3 fix B).
- Cohort extension beyond defi_top5 : same-cohort comparability with S01/S03 is the priority. Cohort extension follow-up only if group verdict = B_PER_ASSET (heterogeneous response justifies broader sample).
7. Open questions for plan_review (r1 — major issues now settled in §1/§2/§5, only true unknowns remain)
Q1 — Primary axis choice (num_leaves) — prod default confirmed r2
§1 r1 pre-registers num_leaves as the PRIMARY (decisional) axis. Rationale : most direct capacity proxy ; LightGBM's primary capacity knob.
Prod default verified (operator r2 30s check) : scripts/seed_hyperparams_console.py:115 seeds NUM_LEAVES: 31 to ftf_config. HPO range (15, 127, linear) per line 216, but S07/S41 captured parquets were trained with the canonical resolved HP (via _resolve_canonical_lgb_hp(DEFAULT_TIMEFRAME) in s41_io.py) → num_leaves=31. Ablation vs 31 IS the correct baseline. ✅
Alternative candidates considered + rejected : learning_rate (most-tuned axis in production HPO grids) — secondary capacity proxy ; min_child_samples — regularizer not capacity proxy.
Reco r2 : keep num_leaves PRIMARY. Locked.
Q2 — Multi-seed budget vs cohort-extension
§5.1 r1 commits to 5 seeds × 5 cells = 25 obs/axis (625 total runs). Alternative : 1 seed × 25 cells (extend cohort to defi_top25 or similar) — bigger n_eff via heterogeneity rather than seed noise. Trade-off : seed noise is i.i.d. (tightens CI), cohort heterogeneity is structured (may inflate CI if assets behave differently — exactly what B_PER_ASSET is for).
Reco : keep 5-seed × defi_top5 per Epic pre-registration (same cohort as S01/S03 → cross-comparability). Cohort extension is a follow-up Story if verdict is B_PER_ASSET.
Q3 — lambda_l2 / min_gain_to_split half-grids
Both axes pre-registered at log-10 spacing from 0 (no "less regularised" half because defaults are at 0). Edge-wins at grid endpoint (e.g. lambda_l2 = 100.0 beats default 0.0 with ΔAUC ≥ +0.02 and CI lower bound > 0) would still trigger B_DEFAULTS_OVERFIT but the optimum may be off-grid (at 1000 or higher).
Reco : accept. An edge-win IS a refute signal even if the optimum is beyond the grid — the closure note flags "optimum beyond grid, follow-up Story to widen". Pre-registered, not opportunistic.
Q4 — Pin invalidation contingency
If S07 / S41 warm pins are invalidated by data refresh before S04's dry-run, accept 3 h × 5 cells = 15 h cold-capture, or wait? Per S03 closure : same risk → accepted. Budget allows.
Reco : accept.
Q5 — Post-Q1.g schema check
S03's Q1.g made parquets self-describing. S04 probes MUST consume datasets.feature_names (NOT positional f0..fN) when fitting LGB inside the probes — same invariant as S03's harness. ADR-0094 Invariant 7 (call_count == 0 on from_training_cache) MUST hold.
Reco : enforce via explicit invariant in _probe_* :
```python
import lightgbm as lgb  # module-level import in the probe module

assert isinstance(feature_names, (list, tuple)) and len(feature_names) > 0
train_set = lgb.Dataset(X_tr, label=y_tr, feature_name=feature_names)
```
Add a call_count == 0 spy test in test_s42_parity.py mirroring test_s41_pin.py.
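The spy pattern itself can be illustrated with the standard library alone. FakeIO is a hypothetical stand-in for the real s41_io / s42_io modules, whose interfaces are not shown in this dossier, and the feature names are invented; only the call_count == 0 assertion on from_training_cache reflects the invariant named above.

```python
from unittest.mock import MagicMock


class FakeIO:
    """Hypothetical stand-in for the pin-aware loader (not the real s42_io)."""

    def __init__(self):
        # The spy: any call on the cold path is recorded here.
        self.from_training_cache = MagicMock(name="from_training_cache")
        self._pins = {"LDOUSDC": {"feature_names": ["regime_vol", "regime_trend"]}}

    def load_cell_inputs(self, crypto, use_pin=True):
        if use_pin and crypto in self._pins:
            return self._pins[crypto]          # warm path: served from the pin
        return self.from_training_cache(crypto)  # cold path


def test_warm_path_never_touches_training_cache():
    io = FakeIO()
    inputs = io.load_cell_inputs("LDOUSDC", use_pin=True)
    names = inputs["feature_names"]
    assert isinstance(names, (list, tuple)) and len(names) > 0
    assert io.from_training_cache.call_count == 0


test_warm_path_never_touches_training_cache()
```

Asserting on call_count instead of patching the loader to raise keeps the test passive: it observes the warm path without changing its behaviour.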
Appendix — Evidence anchors
- Scaffold artefacts (committed with this plan) :
    - dags/dag_diagnostic__s42.py (354 LoC)
    - src/commun/finetune/diagnostic/s42_lgb_capacity_ablation.py (204 LoC, 5 stubs + _decide_s42 stub)
    - src/commun/finetune/diagnostic/hamilton/s42_nodes.py (206 LoC)
    - src/commun/finetune/diagnostic/hamilton/s42_io.py (134 LoC)
    - tests/unit/test_s42_parity.py (109 LoC, 5 pass + 1 skipped oracle)
- Inherited Epic infrastructure :
    - S07 Lever #1 warm pin path (s41_io._pin_load / _pin_store) — reused via scaffold inheritance.
    - Q1.g harness fix (lightgbm_dag.py:lgb_booster_and_time feature_name=datasets.feature_names) — load-bearing for S04's probes to use semantic feature_names.
    - ADR-0093 r3 + ADR-0094 r2 — apply by inheritance.
- Reference dossiers :
    - S01 plan (dependency)
    - S03 plan (pattern)
    - S03 closure (verdict pattern)
    - S03 post-merge RCA §12 Q1.g resolution (PR #1090 pr_review dossier — root-cause B′ semantic regime-axis names + harness feature_name fix)