Multi-fold stability filtering of HP recommendations (2026-06-03, draft)

About this document. This is both (a) a self-contained experiment report and (b) a reusable template. Blocks labelled "Template note" explain what each section is for and may be deleted in a concrete report. The worked example throughout is a single diagnostic (multi-fold stability filtering of a hyperparameter recommendation); replace its content section-by-section for subsequent experiments while keeping the structure.

Publication status (this draft). §6.1 / §6.2 / §6.3 / §10 are final — the OOS stability stage is complete (all three in-sample-significant axes gated; core = {learning_rate = 0.025}). Figures (§6.4) and Appendices are pending. Asset names are anonymised (A1–A5); internal readers: see the project review dossier for the real-name mapping.

Framing update + resolution (2026-06-04) — read before §9. The out-of-sample stability finding stands; the methods (§5) and results (§6) are unaffected. What was re-opened — and is now resolved — is the bridge from this fixed-hyperparameter diagnostic to a production recommendation. The deployment regime selects hyperparameters by search under a trading objective, not at the fixed default this diagnostic compared against. An observational check on objective-confirmed production search records shows that, over its reachable range, the production search does not favour the gentler learning-rate the AUC analysis preferred. The AUC stability finding therefore does not transfer: the hyperparameter-swap recommendation is not supported. (We do not claim a lower value would harm the trading objective — the production floor prevents testing it — only that the swap carries no demonstrated benefit; the burden was on the swap.) The transferable contribution — the asymmetric multi-fold stability method (§5.4) — is independent of this and intact; it correctly filtered candidates, and a separate stage showed even the survivor does not transfer to the deployment objective. See §9.


Abstract

Background. Default hyperparameters carried over between retraining cycles can overfit the training fold in walk-forward financial machine-learning pipelines, inflating in-sample model-selection metrics without improving out-of-sample (OOS) trading performance. A prior diagnostic established that, on the asset group studied, the production hyperparameter-optimisation (HPO) defaults overfit systematically on the validation fold.

Objective. To derive a gentler (lower-capacity / more-regularised) hyperparameter recommendation that generalises beyond the single fold on which it was derived, and to filter out candidate levers whose apparent benefit is an artefact of in-sample selection before committing expensive downstream validation.

Methods. We swept five capacity/regularisation axes of a gradient-boosted-tree model over a 5-point grid × 5 seeds for each of five assets, on a single derivation fold, and aggregated paired (per asset × seed) deltas versus the default with bootstrap confidence intervals. Axes significant in-sample were carried into a pre-registered multi-fold stability test on two OOS folds (the maximum disjoint from the derivation fold under a three-fold walk-forward); axes not significant in-sample were not tested OOS, to avoid post-hoc multiplicity. The decision rule (stable optimum across folds → retain; drifting optimum → reject as fragile) was fixed before OOS data were observed.

Results. Of the two in-sample-significant levers carried forward, the learning-rate reduction (0.1 → 0.025) was stable across all three folds (paired delta significant on each; optimum invariant; cross-asset consistency strengthening OOS). The tree-capacity reduction (num_leaves 31 → 8) failed: its in-sample optimum did not reproduce OOS (no significant effect on one OOS fold; unstable argmax; scattered per-asset optima), consistent with a winner's-curse selection artefact. A third in-sample-significant lever (min_child_samples), re-tested symmetrically, also failed (its optimum drifts to the opposite end on one OOS fold, where the recommended setting yields a negative paired delta). The two null-in-sample axes were excluded a priori by the registered no-fishing rule.

Conclusion. The recommendation collapses to a single robust lever (learning_rate 0.1 → 0.025); all three in-sample-significant candidates passed through the same gate and only the learning-rate reduction survived. The result illustrates that in-sample selection of the "most consistent" lever can be actively misleading under walk-forward evaluation, and that a cheap multi-fold stability gate — applied symmetrically to all in-sample-significant candidates and to none of the null ones — is an effective and discipline-preserving pre-filter before decision-metric validation.


1. Introduction

In machine-learning pipelines that are periodically retrained and evaluated by walk-forward cross-validation — common in quantitative finance — the hyperparameters used in production are often fixed defaults rather than values re-optimised per period. If those defaults grant the model more capacity than the signal supports, the model can overfit the training window: validation metrics measured within a fold improve while genuine out-of-sample performance does not. Detecting and correcting this requires care, because the same data used to select a "better" hyperparameter also biases any estimate of its benefit (selection bias; the winner's curse).

This report documents one stage of a programme addressing that problem. A preceding diagnostic concluded that, for the studied asset group and fold, the production HPO defaults overfit systematically across boosting rounds. The mandated deliverable is a concrete, gentler hyperparameter recommendation that demonstrably generalises — or, as an acceptable outcome, a documented negative ("no single-axis swap generalises").

Question addressed here. Given a candidate recommendation derived on one fold, which of its constituent levers survive a multi-fold stability test, and what is the disciplined treatment of the levers that were excluded before that test?

Contribution.

  1. A worked, pre-registered protocol for filtering hyperparameter recommendations by out-of-sample optimum stability before spending downstream (trading-metric) validation.
  2. An empirical instance in which the most in-sample-consistent lever is the one that fails OOS, while a less-consistent lever survives: a cautionary, mechanistically plausible case of selection bias.
  3. A statement of the asymmetric multi-fold principle: re-test OOS exactly those candidates that cleared the in-sample significance bar, and do not fish across folds on levers that were null in-sample.


2. Background and Related Work

Walk-forward evaluation. Standard k-fold cross-validation is inappropriate for serially dependent financial data; time-ordered ("walk-forward") schemes with purging and embargoing are the accepted alternative [Bergmeir2012; LopezDePrado2018]. Our folds are contiguous, time-ordered test windows with an optional embargo between train and test.

Backtest overfitting and selection bias. Searching over models or hyperparameters and reporting the best in-sample result yields optimistically biased performance estimates; the more configurations are tried, the worse the bias [Cawley2010; Bailey2014a; Bailey2014b]. The corrective is to estimate any selected configuration's value on data not used in its selection — the core of the present design.

Pre-registration and the multiplicity of analyses. Researcher degrees of freedom and data-dependent analysis choices ("the garden of forking paths") inflate false-positive rates even without intentional p-hacking [Simmons2011; Gelman2014]. Fixing hypotheses, the axis set, and decision rules before observing OOS data is the mitigation we adopt; it also governs which excluded axes may legitimately be re-tested (Section 5.4).

Inference. Effect sizes are reported with bootstrap confidence intervals [Efron1993]; we emphasise interval estimates over dichotomous significance [Wasserstein2016]. Where a null of no difference must be supported rather than merely not rejected, two-one-sided-tests (TOST) equivalence testing is used [Schuirmann1987; Lakens2017]. Family-wise error across axes is controlled by a Bonferroni correction; a false-discovery-rate alternative [Benjamini1995] is noted where exploratory axes are involved.

Model family. The estimator is a histogram-based gradient-boosted decision tree [Ke2017]; the swept axes (leaf count, learning rate, leaf-size floor, L2 penalty, minimum split gain) are its principal capacity/regularisation controls. Downstream trading quality is summarised by a downside-risk-adjusted return measure [Sortino1994] alongside the buy-class F1.


3. Hypotheses and Pre-Registration

Pre-registration. Hypotheses, the axis set, the gentler direction per axis, the statistical procedure, and the decision rules below were registered in hp_swap_deliverable_plan §9 (champollion commit 3f860f52, 2026-06-03) before any OOS fold was computed (the first OOS run started at 12:58 UTC, after the registration commit).

H1 (per axis). Moving an axis in its gentler direction increases mean OOS validation AUC relative to the production default.

Decision rule: stability gate (this report).

  • Retain an axis iff its gentler-direction optimum (i) reproduces across folds (stable argmax location) and (ii) yields a paired delta whose 95% CI excludes zero on the derivation fold and at least one OOS fold, with consistent per-asset direction.
  • Reject as fragile an axis whose optimum drifts fold-to-fold or whose OOS effect is not distinguishable from zero.

Pre-registered scope of re-testing (Section 5.4). Only axes significant in-sample are eligible for the multi-fold stability test. Axes not significant on the derivation fold are not re-tested OOS.

Downstream gate (subsequent report). A retained axis becomes an exploitable recommendation only if it transfers to the decision metric (trading F1 / downside-adjusted return) out-of-sample; AUC is a proxy, not the objective.


4. Data and Experimental Setup

| Item | Value |
| --- | --- |
| Asset universe | 5 assets (anonymised here as A1–A5; see Glossary) |
| Timeframe | 15-minute bars |
| Estimator | Gradient-boosted decision trees [Ke2017] |
| Swept axes | learning_rate, num_leaves, min_child_samples, L2 penalty, min split gain |
| Grid per axis | 5 points spanning default ± gentler/harsher directions |
| Seeds | 5 (fixed list); n_effective = 5 per cell |
| Boosting rounds | 300 (argmax over the validation-AUC curve; see §6.3) |
| Folds (walk-forward) | n_folds = 3; test window = 2 months; train window = 9 months |
| Derivation fold (in-sample) | test window [2025-12-01, 2026-02-01) |
| OOS fold #1 | test window [2026-02-01, 2026-04-01) |
| OOS fold #2 | test window [2025-10-01, 2025-12-01) |
| Software | LightGBM 4.5.0; NumPy 2.3.2; Python 3.12.8 |
| Container image | champollion 7cea82d |

Fold construction. Folds are contiguous, time-ordered 2-month test windows stepping back from a monthly anchor, with the training window immediately preceding each test window (optional embargo, off by default). The derivation and OOS folds above are all the folds the three-fold scheme yields; two are disjoint from the derivation fold, which caps the OOS sample at two folds (see Limitations, Section 8).

Reproducibility caveat (time-anchoring). The fold generator anchors windows to the current month; calendar windows therefore shift if the experiment is re-run in a different month. We pin the calendar windows above so the artefact is auditable regardless of re-run date. Reproducing the indices alone in a later month yields different windows. A parameterised anchor is recommended as a follow-up (out of scope).
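A parameterised anchor of the kind recommended above might look like the following minimal sketch. It is not the project's `_generate_folds`: the function names, the dictionary layout, and the anchor date 2026-04-01 are assumptions, chosen only so that the generated test windows coincide with the calendar windows pinned in this section.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, snapping to the first of the month."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def generate_folds(anchor: date, n_folds: int = 3,
                   test_months: int = 2, train_months: int = 9):
    """Walk-forward folds stepping back from an explicit anchor (hypothetical helper).

    Each fold is a half-open test window [start, end) preceded immediately
    by its training window (no embargo, matching the default described above).
    """
    folds = []
    for k in range(n_folds):
        test_end = add_months(anchor, -k * test_months)
        test_start = add_months(test_end, -test_months)
        train_start = add_months(test_start, -train_months)
        folds.append({"train": (train_start, test_start),
                      "test": (test_start, test_end)})
    return folds


# Assumed anchor: the end of the latest pinned test window.
folds = generate_folds(date(2026, 4, 1))
```

Because the anchor is an argument rather than the current month, re-running the sketch at any later date yields the same three windows.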


5. Methods

5.1 Per-axis aggregation

For each axis we collect validation AUC at each grid point for every (asset × seed) cell. The default point and the gentler-direction extreme are identified a priori. Per-point mean AUC is pooled across all 5 assets × 5 seeds (n = 25 observations per point).
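The pooling step can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: the array layout (assets × seeds × grid points), the function name, the grid labels, and the choice to define a per-asset optimum as the argmax of that asset's seed-averaged AUC are all hypothetical.

```python
import numpy as np


def aggregate_axis(auc, grid):
    """Pool per-cell AUCs for one axis on one fold (hypothetical helper).

    auc  : array of shape (n_assets, n_seeds, n_points)
    grid : sequence of n_points grid labels or values
    """
    auc = np.asarray(auc, dtype=float)
    pooled_mean = auc.mean(axis=(0, 1))              # one mean per grid point
    opt_idx = int(pooled_mean.argmax())              # pooled optimum
    per_asset_opt = auc.mean(axis=1).argmax(axis=1)  # optimum per asset
    n_agree = int((per_asset_opt == opt_idx).sum())  # assets matching the pooled optimum
    return grid[opt_idx], pooled_mean, n_agree


# Synthetic illustration only (not the report's data): 5 assets x 5 seeds x 5 points.
auc = np.full((5, 5, 5), 0.60)
auc[:, :, 1] += 0.02   # every asset gains at grid point p1
auc[4, :, 3] += 0.05   # one asset prefers p3 instead
optimum, pooled_mean, n_agree = aggregate_axis(auc, ["p0", "p1", "p2", "p3", "p4"])
```

In this synthetic case the pooled optimum is p1 and four of the five assets agree with it, the same kind of "4/5" summary reported per fold in Section 6.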

5.2 Paired effect size and interval

The quantity of interest is the paired delta: for each (asset × seed) cell, AUC(optimum point) − AUC(default point), where the optimum is the pooled argmax. Because the two points share the same cell (asset, seed, fold, data), the paired delta has lower variance than a difference of two independent point means; we report its mean and a 95% bootstrap/normal interval over the n = 25 paired differences. (The paired interval is the correct statistic and is materially tighter than a naive point-wise interval.)
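A minimal sketch of the paired interval, assuming a percentile bootstrap over the 25 paired differences. The function name, the number of resamples, and the synthetic inputs are assumptions; the sketch also treats the (asset × seed) cells as exchangeable, which ignores any within-asset clustering.

```python
import numpy as np


def paired_delta_ci(auc_opt, auc_default, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired delta and a percentile-bootstrap CI over cells."""
    deltas = np.asarray(auc_opt, dtype=float) - np.asarray(auc_default, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the paired differences themselves, with replacement,
    # rather than resampling the two arms independently.
    idx = rng.integers(0, deltas.size, size=(n_boot, deltas.size))
    boot_means = deltas[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(low), float(high)


# Synthetic illustration only (not the report's data): 25 cells, true shift +0.02.
rng = np.random.default_rng(1)
auc_default = rng.normal(0.60, 0.02, size=25)
auc_opt = auc_default + 0.02 + rng.normal(0.0, 0.002, size=25)
mean_delta, ci_low, ci_high = paired_delta_ci(auc_opt, auc_default)
```

The cell-to-cell spread of 0.02 in the default AUC cancels in the subtraction, so the interval reflects only the much smaller spread of the differences. That cancellation is why the paired interval is tighter than one built from two independent point means.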

5.3 Stability decision (operationalised)

An axis is robust if its pooled optimum point is the same across folds, its paired 95% CI excludes zero on the derivation fold and ≥1 OOS fold, and per-asset optima concentrate (majority at the same point). It is fragile if the optimum location moves between folds or the OOS effect's CI includes zero.
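The rule can be written as a small predicate. The sketch below is one possible reading, not the registered implementation: the `FoldResult` record, the function name, reading "CI excludes zero" as "lower bound above zero" (H1 is directional), and reading "concentrate" as "more than half the assets at the pooled optimum" are assumptions. The inputs are the per-fold summaries reported in Section 6, with `None` standing for "scattered".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FoldResult:
    optimum: float                     # pooled argmax on this fold
    ci_low: float                      # lower bound of the paired 95% CI
    ci_high: float                     # upper bound of the paired 95% CI
    assets_at_optimum: Optional[int]   # None when per-asset optima are scattered
    is_oos: bool


def stability_verdict(folds: Sequence[FoldResult], n_assets: int = 5) -> str:
    same_optimum = len({f.optimum for f in folds}) == 1
    derivation_sig = all(f.ci_low > 0 for f in folds if not f.is_oos)
    oos_sig = any(f.ci_low > 0 for f in folds if f.is_oos)
    concentrated = all(
        f.assets_at_optimum is not None and 2 * f.assets_at_optimum > n_assets
        for f in folds
    )
    ok = same_optimum and derivation_sig and oos_sig and concentrated
    return "robust" if ok else "fragile"


# Order: derivation fold, OOS #1, OOS #2 (values from Sections 6.1 and 6.2).
learning_rate = [
    FoldResult(0.025, 0.015, 0.023, 4, False),
    FoldResult(0.025, 0.010, 0.022, 4, True),
    FoldResult(0.025, 0.009, 0.018, 5, True),
]
num_leaves = [
    FoldResult(8, 0.008, 0.018, 5, False),
    FoldResult(124, -0.004, 0.009, None, True),
    FoldResult(8, 0.001, 0.013, 4, True),
]
min_child_samples = [
    FoldResult(80, 0.002, 0.011, 3, False),
    FoldResult(5, -0.004, 0.006, None, True),
    FoldResult(80, 0.0001, 0.009, 3, True),
]
verdicts = {
    "learning_rate": stability_verdict(learning_rate),
    "num_leaves": stability_verdict(num_leaves),
    "min_child_samples": stability_verdict(min_child_samples),
}
```

Under these assumptions the predicate reproduces the report's verdicts: learning_rate is robust, while num_leaves and min_child_samples are fragile because their optima move between folds.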

5.4 Multiplicity policy and the asymmetric multi-fold principle

We carry into the multi-fold stability test only axes whose paired delta is significant on the derivation fold. Axes not significant in-sample are not re-tested OOS. The rationale is twofold: (i) there is no optimum to "stabilise" for a null axis, so an OOS test would assess the stability of noise; (ii) searching multiple folds for an effect that the most favourable (in-sample) fold reports as absent is precisely the multiple-comparisons / forking-paths trap that pre-registration exists to prevent [Simmons2011; Gelman2014]. This yields a symmetric, defensible treatment: every in-sample-significant candidate faces the same OOS gate; every in-sample-null candidate is excluded by the registered no-fishing rule.
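Rationale (ii) can be made concrete with a short calculation of how often fold-fishing on a truly null axis would produce at least one spurious hit. The sketch assumes the per-fold tests are independent at level α. Overlapping walk-forward training windows do not guarantee that, so the figure is indicative only, and the function name is hypothetical.

```python
def any_fold_false_positive_rate(alpha: float, n_folds: int) -> float:
    """P(at least one of n_folds independent level-alpha tests rejects a true null)."""
    return 1.0 - (1.0 - alpha) ** n_folds


single_fold = any_fold_false_positive_rate(0.05, 1)
three_folds = any_fold_false_positive_rate(0.05, 3)
```

At α = 0.05, scanning three folds instead of one raises the chance of at least one false hit from 5% to about 14% under the independence assumption.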

Template note — This subsection is the methodological core worth generalising. State the principle explicitly; it is what makes the exclusion of some axes principled rather than convenient.


6. Results

6.1 Carried-forward axes — stability across folds

Cell format: pooled optimum · paired Δ AUC · 95% CI · significance · cross-asset consistency.

| Axis | Derivation fold | OOS #1 | OOS #2 | Verdict |
| --- | --- | --- | --- | --- |
| learning_rate | opt 0.025 · Δ +0.0193 · 95% CI [+0.015, +0.023] · sig · 4/5 assets | opt 0.025 · +0.0162 · [+0.010, +0.022] · sig · 4/5 | opt 0.025 · +0.0139 · [+0.009, +0.018] · sig · 5/5 | Robust: optimum invariant on all 3 folds; significant throughout; cross-asset consistency strengthens OOS |
| num_leaves | opt 8 · +0.0128 · [+0.008, +0.018] · sig · 5/5 | opt 124 · +0.0027 · [−0.004, +0.009] · n.s. · scattered | opt 8 · +0.0065 · [+0.001, +0.013] · borderline · 4/5 | Fragile: in-sample effect does not reproduce; OOS #1 shows no significant effect and an unstable argmax (2/3 folds only) |

Reading. learning_rate 0.1 → 0.025 satisfies the registered stability rule on all available folds. num_leaves 31 → 8 does not: on OOS #1 the effect is statistically indistinguishable from zero and the argmax is unstable, with scattered per-asset optima. The in-sample 5/5 consistency of num_leaves was thus an optimistic selection artefact (winner's curse), not evidence of robustness — exactly the failure mode the gate is designed to catch. The core recommendation collapses to {learning_rate = 0.025}, pending Section 6.2.

6.2 Re-tested excluded axis — min_child_samples

min_child_samples was significant in-sample (Δ +0.0065, paired 95% CI [+0.002, +0.011]). It was excluded at the derivation stage for weak cross-asset consistency (3/5), not for non-significance, and is therefore eligible for the same OOS test under the principle of Section 5.4.

| Axis | Derivation fold | OOS #1 | OOS #2 | Verdict |
| --- | --- | --- | --- | --- |
| min_child_samples | opt 80 · +0.0065 · [+0.002, +0.011] · sig · 3/5 | opt 5 · +0.0012 · [−0.004, +0.006] · n.s. · scattered (and −0.0065 at the recommended point 80) | opt 80 · +0.0043 · [+0.0001, +0.009] · sig (borderline) · 3/5 | Fragile: optimum drifts to the opposite (least-regularised) end on OOS #1, where the recommended setting has a negative paired Δ; significant on only one OOS fold; per-asset optima scattered on all folds |

Reading. min_child_samples = 80 fails the same stability gate as num_leaves: its optimum is not invariant (80 → 5 → 80 across folds), it is significant on only one OOS fold, and on OOS #1 the recommended setting is actively harmful (paired Δ −0.0065 at point 80). The weak in-sample cross-asset consistency (3/5) turned out to be a fragility signature that the OOS folds confirmed: as with num_leaves, this is a candidate that genuinely failed the gate, not noise to be re-tested. Excluding it here is therefore symmetric with num_leaves, not a weaker fold-3 judgment.

Reproducibility note. An initial single-axis run was void — the diagnostic requires the pre-registered primary axis (num_leaves) present in the sweep for its decision scaffold ("missing primary axis probe"); the re-test therefore sweeps {num_leaves, min_child_samples} jointly and reads the min_child_samples trajectory (runs …18:19:19, …18:19:43).

Outcome. All three in-sample-significant axes (learning_rate, num_leaves, min_child_samples) have now passed through the same multi-fold stability gate; only learning_rate = 0.025 survives. The two null-in-sample axes (L2 penalty, min split gain) were excluded a priori by the registered no-fishing rule (Section 5.4). The core recommendation is therefore the single robust lever {learning_rate = 0.025}.

6.3 Round budget / under-fitting check

Best-iteration counts at the gentler optima are well below the 300-round cap on the derivation fold (learning_rate = 0.025: mean 42, max 95; num_leaves = 8: mean 15, max 40), so the gentle-point AUCs are not round-limited. With the recommendation now reduced to a single learning-rate change (other axes at default), the compound under-fitting risk of a joint low-capacity / slow-learning configuration does not arise; the registered "best-iter < cap" assertion still guards any joint configuration tested downstream.
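A "best-iter < cap" guard of the kind referred to above could be as simple as the sketch below. The function name and the example iteration counts are hypothetical; only the 300-round cap comes from Section 4.

```python
def assert_not_round_limited(best_iterations, cap=300):
    """Fail if any cell's best iteration reached the round cap (hypothetical guard)."""
    worst = max(best_iterations)
    if worst >= cap:
        raise AssertionError(f"best iteration {worst} reached the {cap}-round cap")
    return worst


# Hypothetical per-cell best iterations, for illustration only.
example_best_iters = [18, 42, 95, 37, 51]
headroom = 300 - assert_not_round_limited(example_best_iters)
```

A cell whose best iteration equals the cap may still be improving when training stops, so its AUC would understate the configuration and the comparison against the default would be confounded by the round budget.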

6.4 Figures (to attach)

Template note — A publishable version includes: (Fig. 1) per-axis AUC trajectories across the 5 grid points with 95% CIs, one panel per fold, optimum marked; (Fig. 2) a fold timeline showing derivation vs OOS windows on a calendar axis. Generate from the per-point trajectory arrays; do not hand-draw.

  • Figure 1. AUC-vs-grid-point trajectories per axis and fold, with paired-delta CIs. (pending — trajectory arrays available from run XComs)
  • Figure 2. Walk-forward fold timeline (calendar windows; derivation vs OOS). (pending)

7. Discussion

The headline finding is instructive precisely because it is counter-intuitive: the lever with the strongest in-sample cross-asset consistency (num_leaves, 5/5) is the one that fails out-of-sample, while a lever with weaker in-sample consistency (learning_rate, 4/5) survives. This is a clean, mechanistically plausible instance of selection bias in model selection [Cawley2010]: an optimum chosen on one fold's metric carries that fold's noise with it, and the apparent benefit regresses — here, all the way to non-significance — on unseen folds [Bailey2014a]. A plausible mechanism is that the capacity optimum is regime-dependent (period-specific signal richness) whereas slower learning is a more universal regulariser; a within-fold contrast supports this, since on the OOS fold where num_leaves collapses, learning_rate remains significant on the same data — evidence that the fold is not simply noisy but cleanly separates a robust lever from a fragile one.
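The regression described here can be reproduced in a toy simulation. The sketch below is illustrative only and uses no data from this report: it assumes five grid points with no true effect, independent Gaussian noise per cell, and hypothetical parameter values. It selects the in-sample argmax and then measures the same point on an independent fold.

```python
import numpy as np


def winners_curse_sim(n_points=5, n_cells=25, noise_sd=0.01,
                      n_trials=2000, seed=0):
    """Mean in-sample vs OOS delta of the selected point when all true deltas are zero."""
    rng = np.random.default_rng(seed)
    shape = (n_trials, n_points, n_cells)
    # Per-point mean delta on the selection fold and on an independent fold.
    in_sample = rng.normal(0.0, noise_sd, size=shape).mean(axis=2)
    oos = rng.normal(0.0, noise_sd, size=shape).mean(axis=2)
    chosen = in_sample.argmax(axis=1)   # pick the best-looking point in-sample
    rows = np.arange(n_trials)
    return float(in_sample[rows, chosen].mean()), float(oos[rows, chosen].mean())


selected_in_sample, selected_oos = winners_curse_sim()
```

Even with no real effect anywhere on the grid, the selected point shows a positive average in-sample delta while its average on the independent fold is close to zero. This is the pattern the gate is meant to expose.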

The cheap AUC-based stability gate did its job: it rejected a fragile lever before any expensive decision-metric validation was spent on it. Equally, the asymmetric treatment of excluded axes (Section 5.4) keeps the exercise honest — every in-sample-significant axis is held to the same OOS standard, while null axes are not mined across folds. We regard the explicit statement of this principle as the transferable methodological output.

What the result does not license: AUC stability is not trading value. A surviving lever is a candidate, not a recommendation, until it transfers to the decision metric out-of-sample (subsequent report). With only two OOS folds available, even that transfer test is power-limited (Section 8), so a positive downstream result should be read as a lead warranting wider-history confirmation, not as validation.


8. Threats to Validity / Limitations

  • Limited OOS sample (statistical-conclusion validity). The three-fold walk-forward yields at most two folds disjoint from the derivation fold. A robust pass is "all available folds," but two OOS windows are regime-specific; positive downstream results are leads, not validation. Widening the scheme would re-define folds and invalidate the derivation, so it is out of scope here.
  • Time-anchored fold windows (reproducibility). Windows are anchored to the current month; re-running later shifts them. Mitigated by pinning calendar windows (Section 4); a parameterised anchor is a recommended follow-up.
  • Replay fidelity of the captured fold (construct validity). The derivation fold's in-pipeline replay diverged from the production baseline on an upstream check. This bounds any claim about "exact production-baseline fidelity" but does not invalidate the relative within-run gentler-vs-default contrast on the captured fold. The downstream validation uses an independent real backtest path and is not subject to this divergence.
  • Single asset group (external validity). Findings are specific to the studied universe and timeframe; generalisation to other universes is untested.
  • AUC as a proxy, and a deployment-regime mismatch (criterion + external validity). The stability test optimises validation AUC; the objective is a trading metric, and the AUC → trading-metric transfer is deferred and not assumed. Sharpened (2026-06-04): this gap is wider than a deferred transfer test, because the deployment regime differs from the diagnostic regime. This diagnostic fixes hyperparameters and compares against the default value; the production pipeline instead searches each hyperparameter over a range and selects by a trading objective, not AUC. Two consequences: (i) the deployed value comes from the search, so a recommendation framed as "change the default" does not map to how the value is set; (ii) the quantity this diagnostic optimises (AUC) is not the quantity the production search optimises, so an AUC-optimal setting may be irrelevant to the selected one. The recommendation framing is therefore re-opened (internal follow-up); the stability finding and the method are unaffected.
  • One-at-a-time sweep (interaction effects). Each axis is swept with the others held at default; joint optima may differ from combined marginal optima. The downstream test validates the combined configuration but a single combined run cannot disentangle interaction effects from a null swap effect.
  • In-sample optimum selection (selection bias). The optima are argmaxes on shared data and are optimistically biased; the multi-fold gate is the explicit control, and regression toward the mean on OOS folds is expected.

9. Conclusion and Next Steps

Under a pre-registered multi-fold stability gate, the learning-rate reduction (0.1 → 0.025) is the sole robust hyperparameter-swap candidate; both the tree-capacity reduction (num_leaves 31 → 8) and the leaf-size-floor increase (min_child_samples 20 → 80) are rejected as winner's-curse artefacts, each failing the same gate (drifting optimum; OOS effect either not distinguishable from zero or, for min_child_samples at its recommended setting on OOS #1, opposite in sign to the in-sample effect). The episode demonstrates that in-sample consistency can mislead and that a cheap stability pre-filter, applied symmetrically across in-sample-significant candidates and to none of the null ones, prevents fragile levers from consuming downstream validation budget.

Resolution (the deployment-transfer question). The single AUC-robust candidate (learning_rate, gentler direction) was checked against the production deployment regime, which searches the hyperparameter under a trading objective rather than fixing it at the default. On objective-confirmed production search records, the search does not favour the gentler value over its reachable range (its sampling concentrates away from the gentle end, none below the range floor). The recommendation is therefore not supported: the validation-AUC finding does not transfer to the production trading objective. This is the pre-registered acceptable negative, reached observationally — from production records already generated, without spending a new controlled run. We claim only the absence of support, not a harm (the production floor prevents testing the gentler value directly; the burden of demonstrating a benefit was on the swap, and none is shown).

What this leaves. (i) The method (§5.4) is the transferable contribution and stands. (ii) The natural next question, left as an internal follow-up, is broader: does any hyperparameter's fixed-axis diagnostic finding transfer to the production search-under-trading-objective regime? (iii) The diagnostic-vs-deployment regime gap (fixed-HP measurement vs HPO search on a trading objective) is the methodological lesson worth carrying: a fixed-HP, proxy-metric diagnostic can be internally sound yet fail to map onto the deployment regime it is meant to inform.


10. Reproducibility Statement

  • Runs. Derivation (5-axis): manual__2026-06-03T08:02:10. OOS stability (learning_rate, num_leaves): manual__2026-06-03T12:58:43 (OOS #1), manual__2026-06-03T13:11:13 (OOS #2). min_child_samples re-test (joint with num_leaves): manual__2026-06-03T18:19:19, manual__2026-06-03T18:19:43. (Void: manual__2026-06-03T14:17:04 / …14:17:23 — single-axis, missing primary probe.)
  • Code. Image / commit champollion 7cea82d; fold generator src/commun/finetune/ablation_runner.py::_generate_folds; analysis = read-only aggregation of stored run outputs (per-cell verdict XComs).
  • Data windows. As pinned in Section 4 (calendar, not indices).
  • Statistics. Paired bootstrap/normal CIs over n = 25 (asset × seed) differences; Bonferroni across axes; TOST for any equivalence claim.
  • Environment. LightGBM 4.5.0, NumPy 2.3.2, Python 3.12.8.

Author contributions. (TBD). Funding. (internal). Conflicts of interest. (none). Data/code availability. (internal — CVNTrade repository).


Glossary

  • Walk-forward fold — a time-ordered (train, test) split; test windows step through time rather than being randomly sampled.
  • Derivation fold — the fold on which the candidate recommendation was selected (in-sample for the selection).
  • OOS fold — a fold disjoint from the derivation fold (out-of-sample for the selection).
  • Gentler direction — the per-axis direction reducing model capacity or increasing regularisation (e.g. fewer leaves, lower learning rate, larger leaf-size floor).
  • Winner's curse — the optimistic bias incurred when the same data select a configuration and estimate its benefit.
  • Decision metric — the production-relevant objective (trading F1 / downside-adjusted return), as opposed to the model-selection proxy (validation AUC).
  • Paired delta — per-cell difference between the optimum and default points, exploiting the shared (asset, seed, fold) to reduce variance.
  • (Project-specific terms — internal: asset group = defi_top5 (A1–A5 = UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC); diagnostic = s42; capture/replay anchor caveat = "A6"; downstream A/B vehicle = the FTF framework; operator gate = ADR-79.)

References

[Bailey2014a] Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94–107.

[Bailey2014b] Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the American Mathematical Society, 61(5), 458–471.

[Benjamini1995] Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57(1), 289–300.

[Bergmeir2012] Bergmeir, C., & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences, 191, 192–213.

[Bergstra2012] Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281–305.

[Cawley2010] Cawley, G. C., & Talbot, N. L. C. (2010). On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079–2107.

[Efron1993] Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.

[Gelman2014] Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science (the "garden of forking paths"). American Scientist, 102(6), 460–465.

[Ke2017] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems (NeurIPS), 30.

[Lakens2017] Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362.

[LopezDePrado2018] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

[Schuirmann1987] Schuirmann, D. J. (1987). A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.

[Simmons2011] Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366.

[Sortino1994] Sortino, F. A., & Price, L. N. (1994). Performance Measurement in a Downside Risk Framework. Journal of Investing, 3(3), 59–64.

[Wasserstein2016] Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133.


Appendix A — Full numerical results

Template note — Per-axis, per-fold, per-point mean AUC with paired CIs and per-asset optima; one table per fold. Generated from the per-cell trajectory XComs.

(Pending — to be generated from the run trajectory arrays alongside Figure 1.)

Appendix B — Pre-registration snapshot

Template note — Verbatim copy (or immutable link + hash) of the registered hypotheses, axis set, and decision rules, with timestamp, so confirmatory status is auditable.

Registered design: hp_swap_deliverable_plan §9 (champollion 3f860f52, 2026-06-03, prior to OOS runs).