Is the signal tradable — and, secondarily, why does LightGBM stop at `best_iter=1`? — a multi-model diagnostic and experimental plan

Mission: CVN-N001-EE (F1_buy boost)
Artefact type: diagnostic study for committee experiment_review (ADR-68, ADR-82)
Date: 2026-05-24
Author: dococeven
Status: committee experiment_review (session 24745ff4) PASSED / OK: strong consensus, 0 blocker, 5 experts, 2026-05-24. Spun into autonomous Epic CVN-N001-EI (#1055, wp#223), which concludes the current sprint. The 5-block plan below is now tracked as Stories S01–S06; this document is the Epic's base reference.

How to read this: the data establishes the headline economic result (§8.2) and a secondary symptom (best_iter=1), rules out several simple causes, and does not prove the leaf-wise mechanism (T6). The output is an experiment plan to isolate the root cause (§12); no production code change ships before Blocks 2 and 3 pass (ADR-2 / ADR-25).

Executive summary (conclusion first)

Headline (the result that matters): on the defi_top5 control group, the current BUY signal is not tradable as-is. BUY precision plateaus at ~0.20, below the gross breakeven of 0.25 (sl/(tp+sl) for ATR0.5_1.5_H4) — and the net breakeven, once transaction costs are folded in, is strictly higher than 0.25, so the gap is under-stated. Net expectancy is negative at every decision threshold θ on the one model whose full θ-curve was captured (CatBoost, §8.2). This is the conclusion; everything else is diagnosis of why.

best_iter=1 is a secondary symptom, not the problem. best_iteration is by definition the iteration of minimum validation loss; that it equals 1 for an aggressive leaf-wise learner (num_leaves=31, lr=0.1) on a weak signal (AUC≈0.64) with a confirmed train↔val regime shift (C-d) is expected, not an anomaly, and has no causal link to tradability. It does not deserve to be the headline.

Three statuses replace the ambiguous word "benign" (each verifiable independently — see §5):

(A) No implementation defect in boosting / class-balancing / early-stopping / data corruption — ESTABLISHED (eliminations T1/T2/T3/T5).
(B) Mechanism = LightGBM leaf-wise excess capacity per round — UNPROVEN HYPOTHESIS (only the Block-2 capacity ablation can rule it in or out; cross-model correlation cannot).
(C) No economic cost / signal is tradable — REFUTED in the current setup (§8.2).

Most probable primary driver (provisional posterior, §5.0): validation / regime instability (train↔val non-stationarity, C-d, confirmed S27/S28) — not any single library's growth policy. Leaf-wise capacity is at most an amplifier of the model-specific symptom.

Recommendation: do not tune θ or force more boosting rounds (both shown to be wrong first moves, §9). Instead: (1) instrument learning curves [Block 1]; (2) run the controlled LightGBM capacity ablation [Block 2]; (3) test the split — including forward-leakage and label-stability audits — before blaming the model [Block 3]; (4) make economic metrics primary [Block 4]; (5) keep discrimination / calibration / decision-policy decoupled [Block 5].

The committee questions are Q1–Q5 (§12). The detailed dossier (§§1–13) follows as the traceability record.

0. Reading guide

Over roughly one week several explanations for best_iter=1 were proposed in conversation and none was backed by the data already in Loki; two were the author's and are retracted here (T3 and the "AUC saturates but logloss would keep climbing" framing — §4.3). This dossier judges each theory against the numbers.

What this study can and cannot conclude (the bright line):

Supported by the data: (1) the signal is not tradable as-is — net expectancy negative at every θ, precision ~0.20 < breakeven (§8.2); (2) LightGBM has best_iteration=1 in ~89 % of runs; (3) AUC is weak-to-moderate (~0.63–0.67) and similar across the three models; (4) f1_buy is threshold-, calibration- and trigger-rate-dependent and is misleading as stated; (5) CatBoost over-trades and is poorly calibrated; (6) full per-iteration learning curves are not logged in production.

NOT sufficiently demonstrated (open hypotheses, not findings): (1) the leaf-wise per-round-aggressiveness mechanism (status B); (2) that the AUC≈0.64 plateau is data-intrinsic; (3) that LightGBM hyperparameters cannot improve anything.

The "benign" framing is abandoned as a single concept (it conflated three claims of opposite epistemic status). It is replaced throughout by the three explicit statuses of the executive summary: (A) no implementation defect — ESTABLISHED; (B) leaf-wise mechanism — UNPROVEN; (C) tradable / no economic cost — REFUTED. Every conclusion below refers to A, B or C, never to "benign" alone.

Every number below is pulled from Loki (cvntrade-observability, prod), from the canonical harness event=training_complete stream and the event=s22a1_verdict diagnostic stream, over a 22–25 day window ending 2026-05-24. Sample sizes: XGBoost n=1868, LightGBM n=2000, CatBoost n=2000 training_complete events; 5 S22A1 controlled-reproduction verdicts; CatBoost θ-curve envelopes from the S32 probe. Collection method is documented in Appendix A so it is reproducible.

Page map (executive summary above; detailed dossier ≈15 pp):
1. The phenomenon: best_iter=1 as an expected symptom (§1)
2. Data provenance & method (§2)
3. The master cross-model table (§3)
4. Theory ledger: 7 theories, what is eliminated vs what is merely plausible (§4)
5. Status A/B/C + the mechanism hypothesis (plausible, NOT proven) (§5)
6. Cross-model deep dive: best_iter ⊥ signal, a correlation, not a proof (§6)
7. The S22A1 control experiment: decisive against T3, single-fold for T6 (§7)
8. Three distinct problems: discrimination vs calibration vs decision policy; the economic conclusion (§8)
9. Why threshold-tuning and forcing more rounds are the wrong first moves (§9)
10. The candidate levers: to be tested, not assumed (§10)
11. Threats to validity, incl. no-UQ (§11.6) and under-weighted alternatives (§11.7) (§11)
12. Recommendation: a 5-block experimental diagnostic plan (§12)
13. Appendices A–D (raw data, configs, queries)

1. The phenomenon

In the unified training harness (ADR-89), LightGBM's best_iteration is reported as 1 on essentially every fold of the defi_top5 control group, 5-minute timeframe, PTE ATR0.5_1.5_H4. The operator first flagged this 2026-05-15 and it has recurred on every run since. Concretely, of 2000 LightGBM training_complete events:

best_iteration == 1 in 1790/2000 = 89.5% of runs; 2–3 in 8.5%; 4–9 in 2%; and never above 9.

This is a calm starting point, not an alarm. best_iteration is by definition the iteration of minimum validation loss. That it equals 1 means the regularised optimum is a single tree — which, for an aggressive leaf-wise learner (num_leaves=31, lr=0.1) on a weak signal (AUC≈0.64) with a confirmed train↔val regime shift (C-d), is the expected behaviour, not a sign that "the model is broken" or "the training is broken". The eliminations in §4 (T1/T2/T3/T5) make the defect readings unlikely; status (A) holds. The only legitimate open question about best_iter is mechanistic — excess per-round capacity vs genuinely exhausted signal (status B) — and that is what the Block-2 ablation (§12) exists to settle.

For context: on the same data, same folds, same harness, XGBoost trains to a median of 17 rounds (up to 151) and CatBoost to a median of 17 (up to 593). So "the signal is exhausted after one tree" is not literally true: two of the three models keep boosting for a median of 17 rounds, and occasionally for hundreds, at the same AUC ceiling. Three ascent speeds, one shared plateau.

The document below treats best_iter=1 as a secondary symptom and keeps the headline where it belongs (tradability, §8.2). It eliminates several candidate causes and names the most plausible remaining mechanism, but does not claim to have proven it; it ends with the controlled experiments that would.

2. Data provenance & method

Source: Grafana Loki, namespace cvntrade-observability, cvn.env=prod, service cvntrade-backtest. Accessed read-only via kubectl port-forward svc/loki 3100. No training was launched for this study; all data is from runs the operator had already executed (the model_type ablation + the S32 widened-θ experiment + the S22 diagnostic chain).
Canonical stream: event=training_complete, emitted by training.harness.nodes.log_emit.emit_training_complete — the cross-model logging contract locked in CVN-N001-EE-S16 (PR #891). Fields per event: model_type, best_iteration, training_time_sec, theta_picked, f1_buy_val/test, auc_buy_val/test, brier_val/test, ece_val/test, rate_buy_val/test.
Control stream: event=s22a1_verdict, from commun.finetune.diagnostic.s22_a1_reproduction — a deliberately controlled LightGBM fit with early-stopping disabled and 300 rounds forced, measuring argmin(val binary_logloss).
Probe stream: event=theta_curve, from the CVN-N001-EE-S32 operating-point probe (training.harness.nodes.theta_curve).
De-duplication: identical message bodies collapsed; summary statistics computed over the raw event multiset (Appendix A scripts).
Window: 22 days (training_complete), 25 days (s22a1_verdict).

Limitations of the data are stated honestly in §11 (notably: the harness path does not record the per-iteration learning curve — see §11.1 — so the within-run trajectory for production runs is inferred, not directly observed; the S22A1 control is what makes the trajectory observable).

3. The master cross-model table

All three models, same control group / timeframe / PTE, summary over all runs in window. Median unless noted.

Metric	XGBoost (n=1868)	LightGBM (n=2000)	CatBoost (n=2000)
`best_iteration` (median)	17	1	17
`best_iteration` (range)	3 – 151	1 – 9	0 – 593
`best_iteration == 1`	0 %	89.5 %	1.9 %
`auc_buy_val` (median)	0.656	0.636	0.669
`auc_buy_test` (median)	0.649	0.628	0.654
`brier_val` (median)	0.146	0.125	0.225
`ece_val` (median)	0.032	0.060	0.289
calibration verdict	well-cal.	well-cal.	mis-cal. (overconfident)
`theta_picked` (median)	0.500 (fixed)	0.300	0.500
`rate_buy_val` (median)	0.004	0.000	0.411
`f1_buy_val` (median)	0.029	0.000	0.366
`f1_buy_val` (max)	0.210	0.433	0.441
early-stop metric	`logloss`	`auc`	`AUC`
early-stop rounds	150	50	150
tree growth	level-wise, depth≤6	leaf-wise, num_leaves=31	oblivious/symmetric

Indexing note (read before comparing the best_iteration rows). The three libraries do not report best_iteration on the same convention: LightGBM/XGBoost are effectively 1-indexed (the harness reports argmin(val_metric)+1 = number of trees retained), whereas CatBoost is 0-indexed (hence the best_iteration=0 runs in §11.5 = immediate stop = one tree retained). Since the entire study turns on a small integer, a ±1 convention skew matters most exactly here. To compare like with like, read every value in this table as "number of trees retained" (LGB median 1 = one tree; CB 0 = one tree). The qualitative gap (1 vs ~17) survives the normalisation; the fine-grained ==1 / 2–3 buckets in §6 are computed on this normalised basis.
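The normalisation described in the indexing note can be sketched as a small helper. This is a minimal illustration that assumes the conventions stated above (LightGBM/XGBoost values are already "trees retained", CatBoost values are 0-indexed). The function name `trees_retained` and the lowercase model labels are hypothetical and would need to match the harness's actual `model_type` values.

```python
# Assumed conventions, taken from the indexing note:
#   LightGBM / XGBoost: the reported value already equals "trees retained".
#   CatBoost: the reported value is 0-indexed, so trees retained = value + 1.
ZERO_INDEXED_MODELS = {"catboost"}  # hypothetical labels, not the harness's own


def trees_retained(model_type: str, best_iteration: int) -> int:
    """Map a reported best_iteration to the number of trees retained."""
    if best_iteration < 0:
        raise ValueError("best_iteration must be non-negative")
    offset = 1 if model_type.lower() in ZERO_INDEXED_MODELS else 0
    return best_iteration + offset


print(trees_retained("lightgbm", 1))   # one tree
print(trees_retained("catboost", 0))   # also one tree
print(trees_retained("xgboost", 17))   # unchanged
```

Applying the offset before bucketing is what keeps a CatBoost `0` and a LightGBM `1` in the same bucket.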

Three facts jump out of this table, and they drive the entire analysis:

AUC is model-invariant: 0.636 / 0.656 / 0.669. A ±0.02 band across three architectures, three early-stopping metrics, and best_iter spanning 1→17→17. This is a current observed performance plateau, stable across the tested model family and validation setup — not a demonstrated data-intrinsic ceiling. Sequence/temporal models, longer horizons, cross-asset context, volatility conditioning, representation learning, meta-labeling and different target engineering are untested and could move it (§11.6).
best_iter does not track AUC: LightGBM reaches AUC 0.636 in one tree; CatBoost reaches 0.669 in ~17; XGBoost reaches 0.656 in ~17. More trees ≠ more signal.
f1_buy is dominated by calibration + operating point, not by AUC or best_iter: CatBoost's "best" f1 (0.366) is produced by a mis-calibrated model (ece 0.289) firing 41 % of all candles — the textbook over-trade trap, not skill.

4. Theory ledger

Each theory is stated, given a falsifiable prediction, and confronted with the data. Verdict = REJECTED / SUPPORTED / PARTIAL / PLAUSIBLE, NOT PROVEN (the last is used only for T6).

4.1 — T1: "The data is corrupted / there is a pipeline bug / leakage collapsing training"

Prediction if true: AUC near 0.5 (no signal) or wildly unstable; non-reproducible; other models also degenerate.
Data: AUC is 0.64±0.02, stable across val AND test, across 5868 runs and 3 models, and reproduced deterministically (seed 1337) across 5 distinct cryptos in the S22A1 control (§7). XGBoost and CatBoost train normally (median 17 rounds) on the identical data. A corruption that left AUC=0.64 stable on three models and on a held-out test split is not a corruption.
Verdict: REJECTED.

4.2 — T2: "`scale_pos_weight` is not applied / the LGB class-imbalance bug (Track 11 / Bug #6) is back"

Prediction if true: class_balance_applied absent or scale_pos_weight≈1; LGB predicts the majority class only.
Data: event=class_balance_applied fires every run; the S22A1 logs show scale_pos_weight ∈ {1.72, 1.78, 2.29, 3.18, 4.82} = n_neg/n_pos per fold, correctly applied. The harness CB path also applies it (and CB does not collapse to best_iter=1). Class balancing is working.
Verdict: REJECTED.

4.3 — T3 (author's, RETRACTED): "`best_iter=1` is an artefact of early-stopping on AUC: AUC is a rank metric that saturates at round 1, but the logloss would keep improving, so it's a metric-choice cosmetic issue"

Prediction if true: with early-stopping disabled and the metric switched to logloss, the val-logloss argmin would land deep (tens of rounds), proving the model was still learning and only the AUC early-stop cut it short.
Data (the killer): the S22A1 control runs exactly this experiment (early-stopping DISABLED, metric = binary_logloss, 300 rounds forced) and reports argmin(val_logloss) = 1 on all 5 cryptos (AAVE/OP/LDO/ARB/UNI), status=REPRODUCED. The val-logloss itself bottoms at tree 1 and rises for the next 299 rounds.
Verdict: REJECTED, and this was the author's own theory. best_iter=1 is not a metric-choice artefact: even on logloss with no early stopping, one tree is the minimum.

4.4 — T4: "It is purely the train↔val regime shift (diagnostic finding C-d)"

Prediction if true: if the only cause were a regime mismatch in the data, it would hit all models equally, so XGBoost and CatBoost would also collapse to best_iter=1.
Data: C-d (genuine regime heterogeneity) was confirmed in S27/S28 and it does explain why more trees raise val-loss (trees ≥2 fit train-regime structure that anti-transfers). BUT XGBoost (median 17) and CatBoost (median 17) do not collapse to 1 on the same regime-shifted data. So regime shift is necessary for the direction (val-loss rises as trees are added) but not sufficient to explain the LightGBM-specific "1". The model-architecture term is missing.
Verdict: PARTIAL. Regime shift is the reason boosting past the optimum hurts; it is not the reason that optimum is at iter 1 for LightGBM specifically. (It is also a plausible reason the AUC plateau sits at ~0.64; see §10.)

4.5 — T5: "The model is ML-useless / there is no signal"

Prediction if true: AUC = 0.5.
Data: AUC = 0.64±0.02, stable on test, model-invariant. There is a real, weak, generalising signal. It is just too weak to trade after costs (§8, §9).
Verdict: REJECTED (as stated: "no signal"). The honest statement is "weak signal", not "no signal".

4.6 — T6 (MECHANISM HYPOTHESIS — status B, plausible, NOT proven): "`best_iter=1` is LightGBM's leaf-wise per-round aggressiveness reaching the weak-signal plateau in a single 31-leaf tree; subsequent trees overfit regime-specific structure and raise val-loss. Model-specific, orthogonal to signal strength."

Predictions tested (consistency checks, not proof; every one is correlational or single-fold):
- (a) AUC at best_iter=1 ≈ AUC of models that take many rounds → consistent (0.636 vs 0.656/0.669). But this is cross-model correlation; it does not establish the leaf-wise mechanism.
- (b) Within LightGBM, runs that reach best_iter>1 do not have materially higher AUC → consistent: best_iter=1 → AUC 0.631 (n=1790); best_iter>1 → AUC 0.655 (n=210), a 0.024 gap. But these >1 runs are self-selected (HPO/seed variation), not a controlled num_leaves/lr sweep, so the comparison is confounded.
- (c) The most aggressive learner (leaf-wise, 31 leaves) has the smallest best_iter → consistent: LGB=1 < XGB(depth≤6)=17 ≈ CB(oblivious)=17. But "leaf-wise vs level-wise" is confounded with every other config difference between the three libraries (regularisation defaults, early-stop metric, binning). This is suggestive, not isolating.
- (d) With early stopping removed, LightGBM's val-loss is still minimised at iter 1 → holds, but on fold-3 / 5 cryptos only (S22A1, §7). This is the strongest single piece of evidence and the one that kills T3, but it is one fold.

Verdict: PLAUSIBLE, NOT PROVEN (status B). The four checks are consistent with T6 and none contradicts it, but "consistent with" is not "demonstrated", and checks (a) and (c) must be struck from the evidence for the mechanism: a cross-library comparison cannot isolate leaf-wise growth from the dozen other config differences (early-stop metric, patience, regularisation defaults, binning, θ handling) that vary simultaneously between LGB/XGB/CB (M5). What remains is (b), confounded and untested (M4), and (d), decisive against T3 but single-fold for T6 (M11). The load-bearing causal claim has therefore never been isolated. Only the controlled intra-LightGBM capacity ablation (§12.2: vary num_leaves/lr/min_child_samples and watch best_iter move while AUC does not) can rule status B in or out.
The cross-model table (§3, §6) is retained as a descriptive observation, not as support for the mechanism.
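The capacity ablation that the verdict above defers to (§12.2) amounts to refitting LightGBM over a small grid of capacity settings while everything else is held fixed. A minimal sketch of enumerating such a grid follows. The `num_leaves` and `learning_rate` values are the ones floated in §9 (31→7–15, 0.1→0.02); the `min_child_samples` values are purely illustrative, since the source only says to raise it; `GRID` and `ablation_configs` are hypothetical names. In production these settings are ADR-59 PG keys, so this only illustrates the shape of the sweep, not how it would be launched.

```python
from itertools import product

# Capacity grid for an intra-LightGBM ablation (illustrative values).
GRID = {
    "num_leaves": [31, 15, 7],        # 31 = current setting; 7 and 15 from the softened range in §9
    "learning_rate": [0.1, 0.02],     # current vs softened, per §9
    "min_child_samples": [20, 100],   # ASSUMPTION: placeholder values, not from the source
}


def ablation_configs(grid):
    """Return one parameter dict per point of the full Cartesian grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


configs = ablation_configs(GRID)
print(len(configs))   # number of fits per fold/seed
print(configs[0])     # the baseline (current) configuration comes first
```

For each configuration the quantities to record would be `best_iter` and validation AUC on the same folds and seeds, so that a shift in `best_iter` can be read against an unchanged (or changed) AUC.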

4.7 — T7 (S32 thesis): "The f1_buy plateau is a decision-threshold / operating-point problem, not a best_iter problem"

Prediction if true: f1_buy should be explainable by where the decision threshold θ sits relative to the calibrated-probability mass, independent of best_iter.
Data: confirmed and refined. XGBoost (calibrated, evaluated at fixed θ=0.5) fires rate_buy≈0.004 → f1_buy≈0.03: almost nothing crosses 0.5 because calibrated probabilities mass at the ~16 % base rate. CatBoost (mis-calibrated, θ=0.5) fires rate_buy≈0.41 → f1_buy≈0.37: it crosses 0.5 only because it is overconfident. Same AUC. f1_buy is an operating-point + calibration artefact. But (the committee's own economic cut, S32 plan_review bc904b13): firing buys by lowering θ produces precision ≈ base rate → negative net expectancy (§8). So T7 explains the metric but does not rescue tradability.
Verdict: SUPPORTED (explains f1_buy), but it is an explanation of the symptom, not a profit lever.

5. The leading hypothesis (plausible, NOT proven), stated precisely

5.0 — Provisional ranking of explanations (best current guess, not a verdict)

The earlier draft over-weighted a single library's growth policy. Weighing all the signals — all three models plateau at the same AUC; more capacity hurts validation; train→val transfer is unstable; calibration diverges wildly across models; the one-step learner (LightGBM) suffers most — the evidence points more strongly to validation/regime instability than to leaf-wise aggressiveness. The provisional ordering of primary drivers is therefore:

1. Validation / regime instability: train↔val non-stationarity, confirmed C-d (S27/S28). Most probable primary driver.
2. Weak intrinsic signal under the current features/target.
3. Label noise / target instability: untested (§11.7).
4. High-capacity-learner amplification (LightGBM leaf-wise): a modifier that explains the model-specific best_iter=1, not the root cause.

The mechanism in steps 1–7 below is how (1) and (4) combine to produce the LightGBM symptom; it is deliberately not a claim that leaf-wise aggressiveness is the leading cause. (This reorders the earlier draft, which inverted 1 and 4.) "Most plausible" throughout means "best current guess pending the §12 tests", not "demonstrated".

Putting T6 + T4(partial) + T7 together, the mechanism consistent with the data is the following. It is stated as a hypothesis to be tested by §12, not as an established causal chain — each step is flagged for the evidence it still needs.

1. There is a weak but real signal. The triple-barrier BUY label is separable from non-BUY at AUC ≈ 0.64, and this is a performance plateau stable across the tested model family and validation setup: the same (±0.02) for XGBoost, LightGBM and CatBoost (§3, fact 1). It is not yet shown to be data-intrinsic (§11.6): it is bounded by the current features, target, split and model family, any of which could move it. It is weak at least partly because of genuine train↔val regime heterogeneity (C-d, S27/S28) and the feature set's limited forward-predictive content.
2. LightGBM extracts essentially all of that separable signal in its first tree. LightGBM grows leaf-wise with num_leaves=31 and learning_rate=0.1 (Appendix B). Under this hypothesis a single 31-leaf leaf-wise tree is a strong learner, stronger per round than XGBoost's level-wise depth-≤6 tree or CatBoost's oblivious tree (a claim about effective per-round capacity that also depends on learning rate and regularisation, since a full depth-6 tree can itself hold up to 64 leaves). On a signal whose total separable content tops out at AUC 0.64, that first tree reaches the ceiling and, because of regime heterogeneity, slightly overshoots into train-specific structure.
3. Therefore every subsequent tree makes validation worse, not better. Trees ≥2 add capacity that fits the train regime, which does not transfer to the val regime (C-d). Validation logloss is minimised at tree 1 and rises monotonically thereafter. Early stopping (or, in the control, plain argmin) returns 1.
4. XGBoost and CatBoost reach the same ceiling more slowly, because each of their trees is a weaker learner. They climb toward AUC 0.64 over ~17 rounds before the regime-overfit term overtakes the marginal gain. Same destination, gentler ascent → larger best_iter.
5. If the above holds, best_iter measures per-round learning speed, not signal or model quality. This is suggested (not established) by the weak observed association between best_iter and AUC across models (§6) and within LightGBM (§4.6b, confounded and untested). Needs: the controlled capacity ablation of §12.2 to show best_iter moves with num_leaves/lr while AUC does not.
6. In the A/B/C frame: status (A) (no implementation defect) is established by the eliminations (T1/T2/T3/T5); best_iter=1 is the correct early-stopping behaviour of an aggressive leaf-wise learner, not a bug. Status (B) (the leaf-wise-capacity mechanism) remains an open hypothesis: it is ruled in or out only once the capacity ablation (§12.2) and the split ablation (§12.3) show whether cheaper capacity or a better split recovers a materially different best_iter / AUC. The word "benign" is not used because it silently bundled (A), (B) and the false claim (C).
7. The data does not yet show it can be "fixed" into profit, but it also does not prove it cannot. What is shown: forcing more rounds is harmful (val-loss rises past iter 1 on fold-3, §7); switching the early-stop metric to logloss changes nothing (§7); and CatBoost's higher f1_buy is an over-trade/mis-calibration artefact, negative-expectancy on its own θ-curve (§8). What is not shown: that a gentler LightGBM (lower capacity + regularisation) or a better split leaves AUC unchanged. That is an assumption (§9) that §12.2/§12.3 must test before any "untradable" conclusion is drawn.

6. Cross-model deep dive — `best_iter` appears orthogonal to signal (correlational)

The single most important descriptive result of this study (a correlation across models, not a controlled manipulation):

AUC_val: XGB 0.656 (best_iter 17) ≈ LGB 0.636 (best_iter 1) ≈ CB 0.669 (best_iter 17).

best_iter ranges over a 17× spread while AUC moves ±0.02. If best_iter measured signal, this could not happen.

The within-LightGBM comparison removes the "different architecture" confound, though not the self-selection confound discussed below:

LightGBM subset	n	mean `auc_buy_val`
`best_iter == 1`	1790	0.6312
`best_iter > 1`	210	0.6552

Even holding the model fixed, the runs that boosted past one tree differ by only +0.024 AUC. This is reported strictly descriptively — not as evidence of "no effect". It is a difference of two group means with no significance test, no bootstrap CI, and no per-fold/per-asset variance decomposition (§11.6); with n=1790 vs n=210 a 0.024 AUC gap may well be significant. Worse, the subsets are not comparable: the >1 runs are self-selected (they boosted further because of HPO/seed variation), and the observations are clustered by fold/crypto/seed, not i.i.d. — so a naïve two-sample test is invalid anyway. The honest statement is: point gap +0.024, dispersion unknown, subsets confounded → not interpretable as present or absent until the clustered bootstrap of §12.2 is run. The earlier "this is noise / boosting buys nothing" wording is withdrawn (M4).
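Because the observations are clustered by fold/crypto/seed, any interval on the AUC gap has to resample whole clusters rather than individual runs. The sketch below shows one common way to do that, a percentile cluster bootstrap. It is an illustration under stated assumptions: the rows are toy data, not Loki values; `cluster_bootstrap_gap` and `mean_gap` are hypothetical names; and whether §12.2 uses this exact scheme is not specified in the source. It also does nothing about the self-selection confound, which only the controlled ablation can address.

```python
import random
from statistics import mean


def mean_gap(rows):
    """Mean AUC of best_iter > 1 runs minus mean AUC of best_iter == 1 runs."""
    one = [auc for _, best_iter, auc in rows if best_iter == 1]
    more = [auc for _, best_iter, auc in rows if best_iter > 1]
    if not one or not more:
        return None  # gap undefined for this resample
    return mean(more) - mean(one)


def cluster_bootstrap_gap(rows, n_boot=2000, seed=1337):
    """rows: (cluster_id, best_iter, auc). Returns (point_gap, ci_low, ci_high)."""
    clusters = {}
    for row in rows:
        clusters.setdefault(row[0], []).append(row)
    ids = sorted(clusters)
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        drawn = [rng.choice(ids) for _ in ids]  # resample clusters with replacement
        gap = mean_gap([r for cid in drawn for r in clusters[cid]])
        if gap is not None:
            gaps.append(gap)
    gaps.sort()
    low = gaps[int(0.025 * (len(gaps) - 1))]
    high = gaps[int(0.975 * (len(gaps) - 1))]
    return mean_gap(rows), low, high


# TOY DATA ONLY: three made-up clusters, far too few for a meaningful interval.
TOY_ROWS = [
    ("AAVE", 1, 0.62), ("AAVE", 1, 0.64), ("AAVE", 2, 0.66),
    ("OP", 1, 0.63), ("OP", 3, 0.65),
    ("LDO", 1, 0.61), ("LDO", 1, 0.63), ("LDO", 4, 0.66),
]
point, low, high = cluster_bootstrap_gap(TOY_ROWS)
print(f"gap={point:.4f}  95% CI=({low:.4f}, {high:.4f})")
```

With real data the cluster key would be whatever unit the runs are actually correlated within (for example a fold and crypto pair), which is a modelling choice rather than something this sketch settles.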

best_iteration distribution (the shape of the phenomenon):

bucket	XGB	LGB	CB
`=1`	0 %	89 %	2 %
`2–3`	0 %	8 %	6 %
`4–9`	16 %	2 %	18 %
`10–30`	67 %	0 %	40 %
`31–100`	15 %	0 %	21 %
`>100`	1 %	0 %	11 %

LightGBM is crushed against 1; XGBoost centres on 10–30; CatBoost has a long tail to 593. Three different ascent speeds, one shared ceiling: this is an observation. The causal reading ("speed is set by leaf-wise capacity") is the hypothesis §12.2 must isolate; the cross-model comparison alone cannot separate leaf-wise growth from the dozen or so other config differences between the libraries.
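The bucket shares in the table above can be recomputed from a list of normalised "trees retained" values. The following is a minimal sketch: the bucket edges are copied from the table, while `bucket_shares` and the toy input are hypothetical and stand in for whatever the Appendix A scripts actually do.

```python
from collections import Counter

# (label, inclusive lower bound, inclusive upper bound or None for open-ended)
BUCKETS = [
    ("=1", 1, 1),
    ("2–3", 2, 3),
    ("4–9", 4, 9),
    ("10–30", 10, 30),
    ("31–100", 31, 100),
    (">100", 101, None),
]


def bucket_shares(trees):
    """Percentage of runs per bucket; expects values already normalised to >= 1."""
    if not trees:
        raise ValueError("no runs supplied")
    counts = Counter()
    for value in trees:
        for label, low, high in BUCKETS:
            if value >= low and (high is None or value <= high):
                counts[label] += 1
                break
    return {label: 100.0 * counts[label] / len(trees) for label, _, _ in BUCKETS}


# Toy input, not production data: eight single-tree runs, one at 2, one at 5.
print(bucket_shares([1] * 8 + [2, 5]))
```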

7. The S22A1 control experiment — decisive against T3, but single-fold

commun.finetune.diagnostic.s22_a1_reproduction was built precisely to answer "is best_iter=1 real or an early-stopping artefact?". Its design (Appendix C):

LightGBM, early-stopping callback removed (# NO early_stopping callback — force the full curve),
metric = binary_logloss (not AUC),
300 boosting rounds forced,
best_iter := 1-indexed argmin(val_logloss) over the full 300-round curve,
run on the captured fold-3 parquet for 5 cryptos, seed 1337.

Result (event=s22a1_verdict, 5 cryptos):

crypto	`best_iter = argmin(val_logloss)`	`val_loss_at_best`	`proba_std_val`	status
LDO	1	0.398	0.210	REPRODUCED
ARB	1	0.410	0.205	REPRODUCED
UNI	1	0.417	0.202	REPRODUCED
AAVE	1	0.440	0.209	REPRODUCED
OP	1	0.453	0.199	REPRODUCED

matches_observed_symptom=true (best_iter ∈ {1,2,3}) on all five; statistically_non_defensible=false (≥200 BUY/fold).
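The `best_iter` definition used by the control (1-indexed argmin of the validation logloss over the full forced curve) is simple enough to state in code. This is a sketch of that definition only, not the S22A1 module itself: the function names are hypothetical and the curves are made-up numbers. With LightGBM the per-round curve could be collected through the `record_evaluation` callback, but how S22A1 captures it is not stated here.

```python
def best_iter_from_curve(val_logloss):
    """1-indexed round at which the validation curve is lowest (earliest on ties)."""
    if not val_logloss:
        raise ValueError("empty curve")
    best_index = min(range(len(val_logloss)), key=val_logloss.__getitem__)
    return best_index + 1


def rises_after_best(val_logloss):
    """True if the curve never decreases once past its minimum."""
    tail = val_logloss[best_iter_from_curve(val_logloss) - 1:]
    return all(later >= earlier for earlier, later in zip(tail, tail[1:]))


# Made-up curves for illustration (not S22A1 output).
symptom_curve = [0.40, 0.41, 0.43, 0.46]   # minimum at the first tree
healthy_curve = [0.50, 0.45, 0.44, 0.46]   # minimum at the third tree

print(best_iter_from_curve(symptom_curve), rises_after_best(symptom_curve))
print(best_iter_from_curve(healthy_curve), rises_after_best(healthy_curve))
```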

The two uses of S22A1 have opposite evidential burdens — and only one of them is met here. Falsifying an artefact theory (T3: "it's just AUC early-stopping") needs only one clean counter-example, so a single controlled fold is amply sufficient to kill T3. Establishing a generality (T6: "the first tree is the optimum everywhere") needs breadth — many folds, seeds, regimes — which a single fold cannot provide. The document must not let the strength of the first use leak into the second.

What this isolates (and what it does not): with the AUC early-stop gone and logloss as the criterion and 300 rounds available, the validation loss is still minimised at tree 1. This:
- kills T3 (the author's metric-choice theory). This is decisive, and one fold is enough by the falsification asymmetry above: it is not an AUC-vs-logloss artefact, a clean elimination;
- is consistent with T6 prediction (d): LightGBM's first tree is the validation optimum here, but at the default num_leaves=31. The experiment does not vary capacity, so it cannot distinguish "one tree is intrinsically optimal" from "31 leaves is too much capacity per round". That is precisely what §12.2 must add.
- shows the model is not dead: proba_std_val ≈ 0.20 means it produces a genuine spread of probabilities; that spread simply does not generalise past one tree at this capacity, on this fold.

Scope limit: fold-3 only, 5 cryptos, one seed. A single-fold result; the production aggregate (§3) is consistent with it but the trajectory was only directly observed here. Generalising "always minimised at tree 1" beyond fold-3 is not yet warranted. The strongest mechanistic evidence in this study therefore comes from a narrow but controlled reproduction environment (one fold, one seed, five assets), not from the full production distribution. That is not weak, but it is nowhere near enough to support a global causal claim; it is presented as a controlled probe, not as population evidence.

8. Three distinct problems — discrimination vs calibration vs decision policy

Much of the week-long confusion (including the author's) comes from conflating three independent properties of a model. They must be measured and reasoned about separately; a model can be good at one and bad at another, and f1_buy silently mixes all three:

Level	Question	Metric	What the data shows
1. Discrimination	Can the model rank BUY above non-BUY?	AUC, PR-AUC, ranking	Weak-to-moderate, ~0.64, model-invariant (§3, §6)
2. Calibration	Do its probabilities mean what they say?	Brier, ECE, reliability by decile	XGB/LGB good (ece 0.03–0.06); CB bad (ece 0.29) (§3)
3. Decision policy	Given probabilities, when do we trade?	θ, trade frequency, expected value net of cost	θ vs proba-mass governs `rate_buy`; net expectancy is the only trading-truth (§8.2)

best_iter lives below all three (it is a fitting-process artefact). The error this study originally made was letting a Level-3 metric (f1_buy) and a fitting artefact (best_iter) masquerade as Level-1 signal claims. The rest of this section keeps them apart.

8.1 — Why `f1_buy` is misleading (it mixes Levels 1–3)

f1_buy is the metric the operator has been chasing, and it is the most misleading number in the table.

8.2 — Economic truth: net expectancy, not F1

F1 is a classification metric, not a trading metric. The trading-truth metrics, to be made primary going forward (§12.4), are expected value net of costs, precision above breakeven, drawdown, turnover, hit-rate conditional on confidence bucket, calibration by probability decile, and the full PnL-by-threshold curve. The θ-curve probe (S32) is the first step toward that and makes the point concrete. For CatBoost (the only model whose θ-curve was captured in-window), the operating-point envelope across θ∈[0.05, 0.40] is, abridged (columns: θ, f1, precision, rate_buy, net expectancy):

θ=0.05  f1=0.289  precision=0.169  rate=1.000  net_exp=-0.163
θ=0.25  f1=0.298  precision=0.175  rate=0.960  net_exp=-0.150
θ=0.40  f1=0.319  precision=0.196  rate=0.725  net_exp=-0.107

breakeven_precision = 0.25 is the gross floor (PTE ATR0.5_1.5_H4: sl/(tp+sl)=0.5/2.0, zero costs assumed). The net breakeven — once transaction fees + slippage are folded in — is (sl+costs)/(tp+sl), i.e. strictly above 0.25. So the real precision-to-profit gap is under-stated, not over-stated: CatBoost's BUY-precision never exceeds ~0.20 (pinned at the ~0.17 base rate), already below the gross floor and further still below the net one. This is also why a classification metric like F1 — which ignores costs by construction — is the wrong instrument for a trading decision.
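The two breakeven formulas follow from setting the expected value per trade to zero: a win pays tp minus costs, a loss pays sl plus costs, so p·(tp − c) = (1 − p)·(sl + c) gives p = (sl + c)/(tp + sl). A minimal sketch of that arithmetic is below. The tp/sl values are the ATR multiples quoted above; the cost figure of 0.05 is a placeholder chosen for illustration, and expressing costs in the same units as tp and sl and charging them once per trade are both assumptions.

```python
def breakeven_precision(tp: float, sl: float, cost: float = 0.0) -> float:
    """Precision at which expected value per trade is exactly zero."""
    return (sl + cost) / (tp + sl)


def expectancy(precision: float, tp: float, sl: float, cost: float = 0.0) -> float:
    """Expected value per trade: wins pay tp - cost, losses pay -(sl + cost)."""
    return precision * (tp - cost) - (1.0 - precision) * (sl + cost)


TP, SL = 1.5, 0.5          # ATR0.5_1.5_H4, as quoted in the text
ASSUMED_COST = 0.05        # ASSUMPTION: illustrative only, not a measured cost

print(round(breakeven_precision(TP, SL), 3))                 # gross floor
print(round(breakeven_precision(TP, SL, ASSUMED_COST), 3))   # net floor, strictly higher
print(round(expectancy(0.20, TP, SL), 3))                    # precision 0.20, zero costs
print(round(expectancy(0.20, TP, SL, ASSUMED_COST), 3))      # same, with the assumed cost
```

Any positive cost raises the breakeven and lowers the expectancy by the same amount per trade, which is why a gap measured against the gross floor understates the real one.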

So for CatBoost: net expectancy is negative at every θ (positive_expectancy=no, tradeable=no, best_theta=none); the "f1_buy = 0.37" headline is firing 41–100 % of candles at base-rate precision — the over-trade trap, quantified.
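The over-trade reading can be checked with the F1 identity alone. At rate_buy = 1.000 every candle is predicted BUY, so recall is 1 and precision equals the base rate; F1 then collapses to 2p/(1 + p). The sketch below plugs in the precision from the θ=0.05 row above; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# θ=0.05 row: rate_buy = 1.000, so recall = 1 and precision = base rate = 0.169.
always_buy_f1 = f1_from(0.169, 1.0)
print(round(always_buy_f1, 3))  # 0.289
```

That reproduces the f1 reported in the θ=0.05 row: an F1 of about 0.29 is what a classifier with no selectivity at all scores on this label, so values in that neighbourhood carry no evidence of skill.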

Scope of the economic conclusion (M7 — important). The directly demonstrated negative-expectancy result is CatBoost-only — it is the single model whose full θ-curve was captured in-window (§11.2). For LightGBM and XGBoost the conclusion is inferred, not demonstrated: it rests on their rate_buy/theta_picked/calibration rows, not on a captured net-expectancy envelope. Given their good calibration (ece 0.03–0.06) the inference is strong — well-calibrated probabilities clustered at the ~16 % base rate cannot manufacture positive precision by lowering θ — but it remains an inference. A confirmatory LightGBM (and θ-swept XGBoost) θ-curve run is required to generalise "no model is tradable" beyond CatBoost (Block 4).

The calibration split explains the whole f1_buy ranking:

XGBoost / LightGBM are well-calibrated (ece 0.03 / 0.06): probabilities honestly cluster at the 16 % base rate, so at any sane θ very little fires → low f1_buy. Honest, low, correct.
CatBoost is mis-calibrated / overconfident (ece 0.289, brier 0.225, well above the ≈0.14 that a constant base-rate forecast would score, i.e. p(1−p) at p≈0.17): probabilities smeared above 0.5, so lots fires at θ=0.5 → high f1_buy that is an artefact of bad calibration, not skill.

So the apparent ranking CB > LGB > XGB on f1_buy is, economically, an inverse ranking: it ranks the models by how much they over-trade. On the captured evidence (CatBoost) and the strong inference (XGB/LGB), none has positive expectancy at this AUC.
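The calibration split can be illustrated with a toy computation. The sketch below is not the harness's metric code: the function names, the 10 equal-width bins, the `>=` firing rule and the two constant forecasts are all assumptions chosen for illustration. It compares a forecast that sits at the base rate with one that carries the same (zero) ranking skill at an inflated level.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared gap between the predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted |mean confidence - observed BUY rate|."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            confidence = sum(p for _, p in members) / len(members)
            ece += len(members) / total * abs(confidence - observed)
    return ece


def rate_buy(y_prob: Sequence[float], theta: float) -> float:
    """Fraction of candles whose BUY probability reaches the threshold."""
    return sum(p >= theta for p in y_prob) / len(y_prob)


# Toy data (assumption): 17 BUY labels in 100 candles, two constant forecasts.
labels = [1] * 17 + [0] * 83
honest = [0.17] * 100          # sits at the base rate
overconfident = [0.55] * 100   # same (zero) ranking skill, inflated level

for name, probs in (("honest", honest), ("overconfident", overconfident)):
    print(
        name,
        round(brier_score(labels, probs), 4),
        round(expected_calibration_error(labels, probs), 4),
        rate_buy(probs, 0.5),
    )
```

Under these toy assumptions the base-rate forecast never fires at θ=0.5, while the inflated one fires on every candle at base-rate precision. That is the over-trade pattern described above, in miniature.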

9. Why threshold-tuning and forcing more rounds are the wrong first moves¶

These are ordering claims (what to test first), backed where the data is strong and flagged where it is not:

"Raise best_iter / force more rounds." The data argues against it: S22A1 shows val-loss rises monotonically after tree 1 (fold-3); the 210 LightGBM runs that reached best_iter>1 differ by only +0.024 AUC (a confounded, untested point gap — §6/§11.6, not a demonstrated gain). Forcing rounds without first fixing capacity or split would manufacture a worse model (ADR-25). Not a first move.

"Switch LightGBM's early-stop metric to logloss." Eliminated by S22A1: logloss also bottoms at 1. No change. Not a lever.

"Lower θ so buys fire (close the f1_buy plateau)." Raises f1_buy cosmetically but, per §8, precision sits at base rate → negative net expectancy (CatBoost's own θ-curve confirms this directly). A prettier metric, a losing policy. This is the committee's S32 cut (precision ≥ 0.25 gross breakeven necessary; firing more is not sufficient). Not a profit lever.

"Soften LightGBM (num_leaves 31→7–15, lr 0.1→0.02, raise min_child_samples)." This is the change most likely to move best_iter into the tens and make it interpretable — and it is exactly the controlled experiment §12.2 calls for. The author's prior draft asserted it has "~zero effect on tradability"; that assertion is withdrawn — it is the hypothesis to test, not a known result. Plausibly it only changes the ascent speed (not the AUC ceiling), but that must be measured. All such changes are ADR-59 PG keys (CVN_HPO_LGB_5m_*), Console-only.

10. The candidate levers — to be tested, not assumed¶

If AUC≈0.64 turns out to be the binding constraint (a hypothesis the §12 ablations must confirm, since the data so far only shows AUC is stable across models, not that it is irreducible), then the levers that could move it are on the data/split side, not best_iter and not θ:

Split / regime reconstruction (CVN-N001-EI-S03, supersedes CVN-N001-EE-S31) — the confirmed C-d regime heterogeneity (S27/S28) means train and val are partly different distributions. A regime-aware / purged-walk-forward split (already scaffolded in src/training/cv/) is a candidate fix for both the early-stop behaviour and the depressed ceiling — and it is testable directly: if best_iter and AUC move materially under a better split, the problem was never LightGBM (§12.3). (Story currently On hold pending this study.)
Features / target (CVN-N001-EI-S06, supersedes CVN-N001-EE-S25; M8–M11) — if AUC≈0.64 survives both the capacity ablation and the split ablation, then the conclusion "the information content of the current feature set is the ceiling" becomes defensible, and the lever is new features / a different target.

The provisional reading is that best_iter=1 behaves like a thermometer ("weak, regime-heterogeneous signal, extracted in one aggressive tree") rather than a disease — but a thermometer reading is only trustworthy once §12.2 and §12.3 confirm it does not move when we change capacity yet does move when we change the split. That ordering is the whole point of the plan below.

11. Threats to validity¶

11.1 — The production harness discards the per-iteration learning curve. lightgbm_dag.py:174 uses lgb.log_evaluation(period=0) and no record_evaluation callback, so the val-metric-vs-round trajectory is not emitted for production runs. The within-run trajectory in §5 is therefore inferred from the S22A1 control (which does record it), not directly observed on production runs. This is itself a finding: the observability gap that hid this for a week. Recommendation: add lgb.record_evaluation to the harness and emit the curve (small, non-behavioural change; would make every future best_iter self-explaining). This is the single most actionable engineering item in this study.

11.2 — θ-curve coverage. Only CatBoost θ-curves were in the query window; the LightGBM θ-curve envelope was not captured in-window (the S32 widened-θ run emitted training_complete with low theta_picked≈0.30 but the per-θ theta_curve event for LGB was not retrieved). The LightGBM operating-point claim in §8 rests on its training_complete (rate_buy, theta_picked) rows, not a full envelope. A confirmatory LightGBM θ-curve run would close this.

11.3 — XGBoost evaluated at fixed θ=0.5. Per the harness contract, XGBoost does not run the val-tuned θ-sweep (xgboost_dag.py:182,190 — fixed threshold=0.5, legacy calibrator path, committee verdict 8.3). So XGBoost's f1_buy≈0.03 / rate_buy≈0.004 partly reflect that fixed operating point, not only the proba mass. This strengthens T7 (operating-point dependence) but means XGB f1_buy is not directly comparable to LGB/CB f1_buy. AUC (threshold-free) remains the fair cross-model comparison — and it is the invariant one.

11.4 — Aggregation across folds/cryptos. The training_complete aggregates pool all folds and all defi_top5 cryptos in window. The S22A1 control is fold-3 only. Medians are reported to resist the long CB tail; per-fold breakdowns are available (Appendix D) and do not change the conclusions.

11.5 — best_iteration=0 for CatBoost (min). A handful of CB runs report best_iteration 0 (CatBoost's 0-indexed convention / immediate stop); these are <1 % and do not affect medians.

11.6 — No uncertainty quantification (the main remaining scientific weakness). Every numeric comparison in this study is descriptive, not inferential: there are no confidence intervals, no bootstrap CIs, no per-fold or per-asset variance decomposition, no effect sizes, and no significance tests. The headline gaps — the ±0.02 cross-model AUC band and the +0.024 within-LightGBM gap (§6) — are point differences of medians/means with unknown dispersion; they may be within-fold noise, crypto-specific, or window-specific, and the study cannot currently tell which. Likewise "AUC ≈ 0.64 is stable" means stable across the tested model family and window, not demonstrated irreducible. Closing this requires stddev/IQR, bootstrap CIs, per-fold and per-asset variance, and effect sizes on every reported metric — folded into Blocks 1–2 of the plan (§12). Until then the numeric arguments are descriptive, not inferential, and are presented as such.

11.7 — Under-weighted alternative causal hypotheses. Beyond T1–T7 the theory space is still narrower than it should be. The following are not yet tested and each could (partly) explain the symptom; the plan (§12) must cover them explicitly rather than orienting only toward T6:

Alternative hypothesis	Missing test
Label instability across regimes	label-consistency audit over market periods
Forward / asymmetric feature leakage	forward-only leakage audit
Validation window too short	rolling-horizon sensitivity
Probability compression	calibration curve by fold
Target entropy too high (high Bayes error)	Bayes-error estimation
Insufficient sample size	learning curves vs dataset size
Temporal autocorrelation leakage	embargo sensitivity
Threshold overfitting	nested-CV threshold selection

This table is the substance behind committee question Q5 and should be promoted into Blocks 2–4 of §12.

12. Recommendation — a 5-block experimental diagnostic plan (not a closure)¶

The correct output of this study is not "close best_iter=1 as benign" (the word is retired — see §0). It is: move from narrative diagnosis to controlled ablation. Status (A) is established; status (B) (the leaf-wise mechanism) and status (C) (the full cross-model tradability claim) are not, and must not be asserted. The plan below is ordered so each block can falsify — not merely decorate — the mechanism hypothesis, and it leads with the most probable primary driver (§5.0: validation/regime instability), not with LightGBM.

Block 1 — Instrument the learning curves (immediate quick win)¶

The real prerequisite for any causal claim. For every run, log per-iteration: train AUC, val AUC, train logloss, val logloss, best_iteration, Δ(iter1 → best), the calibration/ECE curve, and the predicted-probability distribution. Without this, "the first tree extracts all the signal" stays fragile. Engineering: add lgb.record_evaluation (and the XGB/CB equivalents) to the harness DAGs — lightgbm_dag.py:174 currently discards it via log_evaluation(period=0). Small, ADR-92-clean, non-behavioural.
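As a rough sketch of what the instrumented harness could emit per run, the helper below summarises a recorded validation-loss curve. It assumes the nested `{valid_name: {metric_name: [values]}}` dict shape that `lgb.record_evaluation` fills; the helper name, the summary keys and the toy curves are hypothetical, and the `"val"` / `"binary_logloss"` keys follow Appendix C.

```python
from typing import Dict, List


def summarise_curve(
    eval_records: Dict[str, Dict[str, List[float]]],
    valid_name: str = "val",
    metric: str = "binary_logloss",
) -> dict:
    """Summarise a recorded per-iteration validation loss (lower is better)."""
    curve = eval_records[valid_name][metric]
    # min() returns the first minimal element, so ties resolve to the earliest round.
    best_index = min(range(len(curve)), key=curve.__getitem__)
    tail = curve[best_index:]
    return {
        "best_iteration": best_index + 1,            # 1-indexed, as in Appendix C
        "n_rounds": len(curve),
        "delta_iter1_to_best": curve[best_index] - curve[0],
        # True when the loss never decreases again after its minimum.
        "monotone_rise_after_best": all(b >= a for a, b in zip(tail, tail[1:])),
    }


# Toy curves (assumption): one that bottoms at round 1, one that keeps improving.
symptom = {"val": {"binary_logloss": [0.4400, 0.4410, 0.4425, 0.4450]}}
healthy = {"val": {"binary_logloss": [0.50, 0.47, 0.46, 0.48]}}

print(summarise_curve(symptom))
print(summarise_curve(healthy))
```

In a LightGBM run, passing `lgb.record_evaluation(eval_records)` in the `callbacks` list would populate a dict of this shape; the validation-set key depends on the names given to the validation sets, so `"val"` here is an assumption about the harness.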

Block 2 — A clean LightGBM ablation (isolate "excess capacity" vs "exhausted signal")¶

Vary, one axis at a time, on fixed folds: num_leaves ∈ {3,7,15,31}; learning_rate ∈ {0.01,0.03,0.05,0.1}; min_child_samples ∈ {20,100,300,1000}; lambda_l1/lambda_l2; min_gain_to_split; feature_fraction; bagging_fraction; max_bin; boosting_type gbdt vs dart. Report every metric with a clustered (by fold/crypto) bootstrap CI — this is also where the §6 +0.024 question gets a real answer. Goal: determine whether best_iter=1 comes from excess per-round capacity, an unstable split, or a genuinely exhausted signal.

Pre-registered falsifiable predictions for status B (write both outcomes now, so the experiment can lose):
- Confirms B: a gentler LightGBM (lower num_leaves + stronger regularisation + lower lr) moves best_iter into the tens while AUC stays within CI (±~0.02). Then best_iter was a per-round-speed/capacity artefact, consistent with the "thermometer" reading.
- Refutes B: a gentler LightGBM raises AUC beyond CI. Then best_iter=1 was masking under-exploration, the behaviour was not performance-cost-free, and the "extracted all the signal in one tree" story is wrong. This outcome must be reported as a refutation, not explained away.

This is the experiment that rules status B in or out — and the one this study cannot substitute with cross-model correlation.
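The one-axis-at-a-time design can be enumerated mechanically. The sketch below is illustrative only: the baseline values for num_leaves, learning_rate and min_child_samples come from Appendix B, while the `reg_lambda` and `min_gain_to_split` baselines are LightGBM's library defaults, assumed here because this study does not list the production values. The function and constant names are hypothetical.

```python
from typing import Dict, List, Tuple

# Baseline config. The last two entries are assumed library defaults.
BASELINE: Dict[str, float] = {
    "num_leaves": 31,
    "learning_rate": 0.1,
    "min_child_samples": 20,
    "reg_lambda": 0.0,
    "min_gain_to_split": 0.0,
}

# Axis values taken from the ablation grid spec of this block.
SWEEP_AXES: Dict[str, List[float]] = {
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "min_child_samples": [20, 100, 300, 1000],
    "reg_lambda": [0.0, 1.0, 10.0],
    "min_gain_to_split": [0.0, 0.01, 0.1],
}


def one_axis_at_a_time(
    baseline: Dict[str, float], axes: Dict[str, List[float]]
) -> List[Tuple[str, Dict[str, float]]]:
    """Return the baseline plus every config that differs from it on one axis."""
    runs = [("baseline", dict(baseline))]
    for axis, values in axes.items():
        for value in values:
            if value == baseline[axis]:
                continue  # this point is the baseline itself
            params = dict(baseline)
            params[axis] = value
            runs.append((f"{axis}={value}", params))
    return runs


RUNS = one_axis_at_a_time(BASELINE, SWEEP_AXES)
for label, params in RUNS:
    print(label, params)
```

Each non-baseline run differs from the baseline in exactly one key, so any movement in best_iter or AUC can be attributed to that axis alone, which a full Cartesian grid would not allow at this budget.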

Ablation grid (copy-paste spec — all keys land in PG ftf_config via CVN_HPO_LGB_5m_*, Console-only per ADR-59/90; seeded through scripts/seed_hyperparams_console.py):

block: B2_lgb_capacity_ablation
sweep_axes:                 # one axis at a time, others held at production default
  num_leaves:        [3, 7, 15, 31]
  learning_rate:     [0.01, 0.03, 0.05, 0.1]
  min_child_samples: [20, 100, 300, 1000]
  reg_lambda:        [0.0, 1.0, 10.0]       # lambda_l2
  min_gain_to_split: [0.0, 0.01, 0.1]
held_fixed:
  early_stopping_rounds: 50
  boosting_type: gbdt
  folds: [fold_3]            # + sibling folds once Block 1 logs the curve
  cryptos: defi_top5         # control group (operator policy)
  seed: 1337
eval:
  metrics: [best_iteration, auc_buy_val, auc_buy_test, brier_val, ece_val]
  ci: bootstrap
  bootstrap_samples: 1000
  cluster_keys: [fold, crypto]   # clustered resampling — observations are NOT i.i.d.
decision_rule:                   # see Decision routing runbook below
  confirms_B: "best_iter → tens AND Δauc within ±0.02 CI"
  refutes_B:  "Δauc beyond +0.02 CI for a gentler config"
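A minimal sketch of the clustered bootstrap the spec asks for, assuming each run is a dict carrying its `fold`, `crypto` and a metric value. It resamples whole (fold, crypto) clusters with replacement rather than individual runs, then reads a percentile interval. The function name, the row layout and the percentile method are assumptions, not the project's implementation.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def clustered_bootstrap_ci(
    rows: Sequence[Dict],
    value_key: str,
    cluster_keys: Sequence[str] = ("fold", "crypto"),
    n_samples: int = 1000,
    alpha: float = 0.05,
    seed: int = 1337,
    stat: Callable[[List[float]], float] = mean,
) -> Tuple[float, float]:
    """Percentile CI of `stat`, resampling whole clusters with replacement."""
    clusters: Dict[tuple, List[float]] = defaultdict(list)
    for row in rows:
        clusters[tuple(row[k] for k in cluster_keys)].append(row[value_key])
    groups = list(clusters.values())

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Draw as many clusters as exist, keeping every run inside a drawn cluster.
        drawn = rng.choices(groups, k=len(groups))
        estimates.append(stat([v for group in drawn for v in group]))
    estimates.sort()

    lo = estimates[int(alpha / 2 * n_samples)]
    hi = estimates[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return lo, hi


# Toy rows (assumption): two runs per (fold, crypto) cluster.
toy_rows = [
    {"fold": f, "crypto": c, "auc_buy_val": 0.62 + 0.01 * f + 0.005 * i}
    for f in (1, 2, 3)
    for c in ("A", "B")
    for i in (0, 1)
]
print(clustered_bootstrap_ci(toy_rows, "auc_buy_val"))
```

For the Δauc question, the same function could be fed per-cluster paired differences (gentler config minus baseline) so that the interval is on the gap itself rather than on each level.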

Block 3 — Test the split before the models (and the two invalidating hypotheses first)¶

The study confirms regime shift (C-d) and §5.0 ranks it the most probable primary driver — so this block, not Block 2, carries the highest prior. Run: purged walk-forward CV; split by market period; split by volatility; split by crypto; train-regime-A / test-regime-B; strict temporal validation. If best_iter changes strongly with the split, the subject is not LightGBM — it is validation design. (CVN-N001-EI-S03 is the home for this.)

Run these two first, because either one, if it fails, invalidates the rest of the analysis (this promotes the top rows of §11.7 into the critical path, per Q5):
- Forward-only leakage audit: verify no feature uses information from ≥ t at decision time t; check label/feature alignment and the embargo. Pass = no future-information path found; Fail = any leak → AUC≈0.64 is partly spurious and every downstream conclusion is suspect.
- Label-stability audit across regimes: recompute triple-barrier label consistency per market period / volatility bucket; measure label-flip rate and per-regime base rate. Pass = labels stable across regimes; Fail = label definition is regime-dependent → "weak signal" may be "unstable target", a different problem with a different fix.

Only after both pass does it make sense to attribute the plateau to the model or the features.
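For orientation, a purged walk-forward split can be sketched in a few lines. This is not the scaffold in src/training/cv/ (which is not shown here); the function name, the equal-block layout and the `gap` parameter are assumptions. The gap drops the samples whose label window would overlap the start of the validation block, which is the role purge and embargo play when training data always precedes validation data.

```python
from typing import Iterator, Tuple


def purged_walk_forward(
    n_samples: int, n_folds: int, gap: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, val_idx) with an expanding train window and a purge gap.

    Samples are assumed to be in time order. `gap` should be at least the label
    horizon in samples, so no training label looks into the validation block.
    """
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        val_start = k * block
        val_end = n_samples if k == n_folds else (k + 1) * block
        train_end = val_start - gap
        if train_end <= 0:
            continue  # gap swallows the whole training window for this fold
        yield range(0, train_end), range(val_start, val_end)


# Toy usage (assumption): 100 candles, 4 folds, 5-candle label horizon.
SPLITS = list(purged_walk_forward(n_samples=100, n_folds=4, gap=5))
for train_idx, val_idx in SPLITS:
    print(f"train [0, {train_idx.stop}) -> val [{val_idx.start}, {val_idx.stop})")
```

Sweeping `gap` over a few values is one simple way to run the embargo-sensitivity check listed in §11.7: if AUC drops as the gap grows, part of the signal was temporal-autocorrelation leakage.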

Block 4 — Replace F1 with economic metrics (and close the cross-model θ-curve gap)¶

f1_buy is a classification metric, not a trading metric. Make primary: expected value net of costs; precision above breakeven; drawdown; turnover; hit-rate conditional on confidence bucket; calibration by probability decile; PnL-by-threshold curve.

Canonical net_expectancy to log per θ (the number the backtest must emit to Loki, in ATR units of the PTE):

net_expectancy(θ) = P(θ)·net_gain − (1 − P(θ))·net_loss
  net_gain = tp − slippage − fees        # ATR0.5_1.5_H4 → tp = 1.5
  net_loss = sl + slippage + fees        #                sl = 0.5
  P(θ)     = observed BUY-precision at threshold θ
break-even precision (net) = (sl + costs) / (tp + sl) = (0.5 + costs) / 2.0   # > 0.25 once costs > 0
tradeable(θ) ⇔ net_expectancy(θ) > 0 ⇔ P(θ) > break-even precision (net)

slippage and fees are sourced from the PTE cost model, not assumed zero (the gross 0.25 floor in §8.2 is the costs→0 limit). First task: capture the full θ-curve on a uniform grid θ∈[0.05, 0.95], step 0.05, for LightGBM and a θ-swept XGBoost, not just CatBoost: the "no model is tradable" claim is currently demonstrated for CatBoost only and inferred for the other two (§8.2 M7 / §11.2). (Extends the S32 θ-curve / net-expectancy work.)
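The canonical formula translates directly into code. The sketch below is a hedged illustration, not the backtest's implementation: the `Barrier` class and function names are hypothetical, `costs` stands for slippage + fees in ATR units, and the 0.02 cost figure is an arbitrary placeholder rather than a value from the PTE cost model. The precision values are the three CatBoost operating points quoted in §8.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Take-profit / stop-loss distances in ATR units (ATR0.5_1.5_H4)."""
    tp: float = 1.5
    sl: float = 0.5


def net_expectancy(precision: float, costs: float, barrier: Barrier = Barrier()) -> float:
    """P * net_gain - (1 - P) * net_loss, with costs = slippage + fees."""
    net_gain = barrier.tp - costs
    net_loss = barrier.sl + costs
    return precision * net_gain - (1.0 - precision) * net_loss


def breakeven_precision(costs: float, barrier: Barrier = Barrier()) -> float:
    """Precision at which net expectancy is exactly zero."""
    return (barrier.sl + costs) / (barrier.tp + barrier.sl)


def tradeable(precision: float, costs: float, barrier: Barrier = Barrier()) -> bool:
    """True only when the operating point has strictly positive net expectancy."""
    return net_expectancy(precision, costs, barrier) > 0.0


# CatBoost operating points from §8.2: theta -> observed BUY precision.
CATBOOST_PRECISION = {0.05: 0.169, 0.25: 0.175, 0.40: 0.196}
PLACEHOLDER_COSTS = 0.02  # assumption for illustration, NOT the PTE cost model

for theta, precision in CATBOOST_PRECISION.items():
    print(
        theta,
        round(net_expectancy(precision, 0.0), 4),
        round(net_expectancy(precision, PLACEHOLDER_COSTS), 4),
        tradeable(precision, PLACEHOLDER_COSTS),
    )
print("net breakeven:", breakeven_precision(PLACEHOLDER_COSTS))
```

Because expectancy is linear in precision, `tradeable` and `precision > breakeven_precision(costs)` are the same test; logging both per θ is a cheap consistency check on the emitted curve.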

Block 5 — Keep the three levels decoupled (§8)¶

Report and reason about discrimination (AUC/PR-AUC), calibration (Brier/ECE/decile), and decision policy (θ/turnover/EV) separately, never folded into a single number. Today the analysis still risks mixing them; the framework in §8 is the standing guard.

Recommended conclusion (replaces any "verdict")¶

The headline result is economic, not algorithmic: on the captured evidence the current signal is not tradable (CatBoost net expectancy negative at every θ, precision ~0.20 below the gross-0.25 / higher-net breakeven; strongly inferred for the well-calibrated XGB/LGB pending their θ-curves). best_iter=1 is a secondary, expected symptom. In the A/B/C frame: (A) no implementation defect — established; (B) the leaf-wise-capacity mechanism — unproven, decided only by the Block-2 ablation; (C) tradability — refuted in the current setup. The most probable primary driver is the validation/regime instability that exposes weak out-of-sample transfer (C-d), with LightGBM's high per-round capacity at most an amplifier. The next step is not to tune thresholds or force boosting rounds, but to instrument learning curves, run the forward-leakage + label-stability audits and the split/capacity ablations (split first), and evaluate with economic metrics rather than F1.

Decision routing (runbook — what to do once the data is in)¶

The conclusion above is the current state; this table is the pre-committed action for each experimental outcome, so no re-litigation is needed when results land:

If the result is…	…then the conclusion is…	…and the immediate action is:
Block 3 leakage audit FAILS (a future-information path is found)	AUC≈0.64 is partly spurious — every downstream conclusion is void	HALT CVN-N001-EE. Fix `src/features/`; re-baseline the dataset before any further modeling. (highest-priority gate — run first)
Block 3 label-stability FAILS (label flips across regimes)	"weak signal" is really "unstable target"	Redefine the triple-barrier target (CVN-N001-EI-S06 scope), not the model.
Block 3 split materially moves best_iter + AUC	The subject was never LightGBM — it is validation design (C-d / C-c)	Ship the regime-aware / purged-walk-forward split (CVN-N001-EI-S03); re-run the model comparison on it.
Block 2: gentler LGB raises AUC beyond CI (Δauc beyond +0.02, per the Block 2 decision_rule)	Status B REFUTED (defaults were over-fitting per round)	Replace the production HPO defaults with the validated keys (`CVN_HPO_LGB_5m_*`, Console).
Block 2: best_iter→tens, AUC flat within CI	Status B CONFIRMED — best_iter was a speed artefact	Close the algorithmic line. Move resources to features/target (CVN-N001-EI-S06).
Block 4 θ-curves: LGB/XGB also negative-EV at every θ	Status C generalised — no model tradable on current data	No promotion; the lever is data/features/split, never θ or rounds.

No production code change ships before Blocks 2 and 3 pass (ADR-2 / ADR-25).

The committee is asked to rule on¶

Q1 — Are the eliminations T1 (corruption), T2 (scale_pos_weight), T3 (AUC-early-stop artefact), T5 (no-signal) sound, and is the PARTIAL on T4 (regime) appropriate?
Q2 — Is the A/B/C partition correct — (A) defect-free established, (B) leaf-wise mechanism unproven, (C) tradability refuted — and is dropping the ambiguous word "benign" the right call? Is any load-bearing claim still over-asserted (esp. after striking cross-model checks (a)/(c) from T6's evidence)?
Q3 — Is the ordering right: split ablation (Block 3) and capacity ablation (Block 2) before any threshold-tuning or round-forcing?
Q4 — Approve Block 1 (learning-curve instrumentation) as an immediate, low-risk harness change?
Q5 — Blind spots: of the under-weighted alternative hypotheses now tabled in §11.7 (label instability, forward leakage, short validation window, probability compression, high target entropy, insufficient sample size, temporal-autocorrelation leakage, threshold overfitting), which must be promoted into Blocks 2–4 before T6 can be ruled in or out? And is the §5.0 reordering — regime/validation instability as the most probable primary driver, leaf-wise capacity as a modifier — correct?

Per memory feedback_diagnostic_committee_score_cap, this is a diagnostic artefact — judge it on the soundness of the eliminations, the honesty of the not-yet-proven partition, and the experimental plan, not on a tradability score.

Appendix A — Data collection (reproducible)¶

Loki port-forward + query helper (/tmp/best_iter_study/loki_q.py): query_range over {job=~".+"} |= "event=<name>", 22–25 day window, direction=backward, limit 2000, dropping Loki self-echo lines (caller=, querier). Per-model summary via stats.py; histograms via hist.py; within-LGB AUC split via corr.py. Event counts retrieved: XGB 1868, LGB 2000, CB 2000 training_complete; 5 s22a1_verdict; 4 theta_curve (CB).
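The helper script itself is not reproduced in this document. As a rough sketch of the query it describes, the code below builds the parameters for Loki's `query_range` HTTP endpoint and filters the self-echo lines. The function names, and the use of nanosecond Unix timestamps for the window, are assumptions; no request is actually sent.

```python
import time
from typing import Dict, Iterable, List, Optional

NANOS_PER_DAY = 86_400 * 1_000_000_000
SELF_ECHO_MARKERS = ("caller=", "querier")  # Loki's own log lines, per the text above


def build_query_params(
    event: str, days: int, limit: int = 2000, now_ns: Optional[int] = None
) -> Dict[str, str]:
    """Parameters for a backward query_range over the last `days` days."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    return {
        "query": f'{{job=~".+"}} |= "event={event}"',
        "start": str(end_ns - days * NANOS_PER_DAY),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",
    }


def drop_self_echo(lines: Iterable[str]) -> List[str]:
    """Remove lines that Loki emitted about the query itself."""
    return [ln for ln in lines if not any(m in ln for m in SELF_ECHO_MARKERS)]


params = build_query_params("training_complete", days=25)
print(params["query"])
```

These parameters would be sent as a GET request to `/loki/api/v1/query_range` on whatever local address the port-forward exposes.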

Appendix B — Training configs (source of truth: PG `ftf_config`, ADR-59/90; seed `scripts/seed_hyperparams_console.py`)¶

LightGBM (5m): objective=binary, metric=["auc","binary_logloss"], first_metric_only=True (early-stop on AUC), boosting_type=gbdt, num_leaves=31, max_depth=6, learning_rate=0.1, min_child_samples=20, n_estimators=200, early_stopping_rounds=50. Growth: leaf-wise. Source: lightgbm_dag.py:74–96,164–175.
XGBoost (5m): eval_metric="logloss" (early-stop on logloss), max_depth (PG, ≤6 band), early_stopping_rounds=150, evaluated at fixed θ=0.5 (no val θ-sweep). Growth: level-wise. Source: xgboost_dag.py:66–70,145,182,190.
CatBoost (5m): loss_function=Logloss, eval_metric="AUC" (early-stop on AUC), depth (PG), early_stopping_rounds=150, canonical val-tuned θ-sweep. Growth: oblivious/symmetric. Source: catboost_dag.py:79–82,125–132.

Appendix C — S22A1 control design¶

src/commun/finetune/diagnostic/s22_a1_reproduction.py: _train_full_no_earlystop builds lgb.Dataset(train), lgb.Dataset(val, reference=train), callbacks [log_evaluation(period=0), record_evaluation(eval_records)] — no early_stopping — valid_sets=[train,val], num_boost_round=300. best_iter = argmin(eval_records["val"]["binary_logloss"]) + 1. REPRODUCED iff best_iter ≤ 5 (SHALLOW_ARGMIN_THRESHOLD); matches_observed_symptom iff best_iter ∈ {1,2,3}. Diagnostic-tier floor: ≥200 BUY/fold or verdict marked statistically_non_defensible.

Appendix D — Raw event samples (representative)¶

LightGBM training_complete (note best_iteration=1, calibrated brier≈0.14/ece≈0.05):

model_type=lightgbm best_iteration=1 theta_picked=0.2056 f1_buy_val=0.3591 auc_buy_val=0.6434 brier_val=0.1390 ece_val=0.0563 rate_buy_val=0.3636
model_type=lightgbm best_iteration=1 theta_picked=0.2250 f1_buy_val=0.3504 auc_buy_val=0.6425 brier_val=0.1401 ece_val=0.0508 rate_buy_val=0.4477

XGBoost (best_iter 9–40, θ=0.5, rate≈0 → f1≈0, well-calibrated):

model_type=xgboost best_iteration=31 theta_picked=0.5000 f1_buy_val=0.0591 auc_buy_val=0.6473 brier_val=0.1350 ece_val=0.0144 rate_buy_val=0.0091
model_type=xgboost best_iteration=11 theta_picked=0.5000 f1_buy_val=0.0000 auc_buy_val=0.6495 brier_val=0.1349 ece_val=0.0137 rate_buy_val=0.0000

CatBoost (best_iter 9–30, θ=0.5, rate≈0.45 → f1≈0.36, MIS-calibrated brier≈0.23/ece≈0.31):

model_type=catboost best_iteration=14 theta_picked=0.5000 f1_buy_val=0.3509 auc_buy_val=0.6614 brier_val=0.2306 ece_val=0.3103 rate_buy_val=0.4740
model_type=catboost best_iteration=23 theta_picked=0.5000 f1_buy_val=0.3611 auc_buy_val=0.6650 brier_val=0.2266 ece_val=0.3038 rate_buy_val=0.4453
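Events in this key=value shape are straightforward to aggregate. The sketch below parses the six sample lines above and reports per-model medians; it is an illustration of the kind of summary stats.py produces, not that script, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Union

SAMPLE_EVENTS = [
    "model_type=lightgbm best_iteration=1 theta_picked=0.2056 f1_buy_val=0.3591 auc_buy_val=0.6434 brier_val=0.1390 ece_val=0.0563 rate_buy_val=0.3636",
    "model_type=lightgbm best_iteration=1 theta_picked=0.2250 f1_buy_val=0.3504 auc_buy_val=0.6425 brier_val=0.1401 ece_val=0.0508 rate_buy_val=0.4477",
    "model_type=xgboost best_iteration=31 theta_picked=0.5000 f1_buy_val=0.0591 auc_buy_val=0.6473 brier_val=0.1350 ece_val=0.0144 rate_buy_val=0.0091",
    "model_type=xgboost best_iteration=11 theta_picked=0.5000 f1_buy_val=0.0000 auc_buy_val=0.6495 brier_val=0.1349 ece_val=0.0137 rate_buy_val=0.0000",
    "model_type=catboost best_iteration=14 theta_picked=0.5000 f1_buy_val=0.3509 auc_buy_val=0.6614 brier_val=0.2306 ece_val=0.3103 rate_buy_val=0.4740",
    "model_type=catboost best_iteration=23 theta_picked=0.5000 f1_buy_val=0.3611 auc_buy_val=0.6650 brier_val=0.2266 ece_val=0.3038 rate_buy_val=0.4453",
]


def parse_event(line: str) -> Dict[str, Union[float, str]]:
    """Split a key=value log line; numeric values become floats."""
    fields: Dict[str, Union[float, str]] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a key=value token
        try:
            fields[key] = float(value)
        except ValueError:
            fields[key] = value
    return fields


def median_by_model(lines: Iterable[str], metric: str) -> Dict[str, float]:
    """Median of one metric per model_type."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for line in lines:
        event = parse_event(line)
        if "model_type" in event and metric in event:
            grouped[str(event["model_type"])].append(float(event[metric]))
    return {model: median(values) for model, values in grouped.items()}


for metric in ("best_iteration", "auc_buy_val", "ece_val", "rate_buy_val"):
    print(metric, median_by_model(SAMPLE_EVENTS, metric))
```

On six representative lines the medians are only a smoke test of the parsing; the study's reported medians come from the full event counts listed in Appendix A.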
