S18 Step 2: Legacy vs Harness LGB code comparison

Status — 2026-05-14 : Step 2 of the S18 diagnostic plan §5.2. Inputs : Step 0 PASS verdict (2026-05-13 23:42 UTC, abs_delta=8e-09 vs canary), Step 1 capture artifacts (step1-trials-AAVEUSDC-fold3.json, captured-fold-AAVEUSDC-3.parquet). Output : this dossier identifies the most likely regression locus so Step 3 (parity reproducer) can verify it deterministically.

TL;DR: smoking gun candidate

Hypothesis H2 (eval_set incorrect) ranks first. The pre-#891 LGB trainer passed valid_sets=[train_data, val_data] to lgb.train; the current harness passes valid_sets=[val_set] only. Combined with first_metric_only=True and lgb.early_stopping(stopping_rounds=50), the hypothesized dynamics are:

- Pre-#891: train AUC always improves (overfitting) → early stop never fires → booster trains until num_boost_round → best_iter ≈ 100-300
- Post-#891 harness: only val AUC is watched → val AUC saturates at iter 1-2 (scale_pos_weight=4.71 + small val sample of 839 rows) → early stop fires after 50 idle rounds → best_iter ≈ 1, which matches the canary (53/53 trials)

The rest of the LGB call surface (params, Dataset construction, callbacks, scale_pos_weight, metric, first_metric_only) is behaviorally identical between the two paths; num_boost_round differs in scale only (see §2). The single load-bearing difference is the valid_sets composition.

1. Methodology

Per dossier §5.2 Step 2, we extract the pre-#891 production LGB training code from git and compare it side-by-side with the current harness lgb_booster_and_time (src/training/harness/dags/models/lightgbm_dag.py:130-176).

Git refs:

- Pre-#891 (legacy autonomous trainers): dc3d86c6^, the commit just before the PR #891 merge (Hamilton harness migration). Files: src/training/LightGBM/cvntrade_LightGBM_trainer.py.
- Post-#891 / pre-#899: e75418ca^, same harness as today, BEFORE the autonomous-trainer collapse PR #899. The train_with_fixed_params_lgbm shim in grid_utils.py is preserved as the "canonical legacy reference" (Bug #6 of FTF 7-bug hotfix spec).
- Current main: e884fcd3 (post-Step 1).

Extraction commands (reproducible) :

mkdir -p /tmp/s18_pre891 /tmp/s18_legacy
git show dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py > /tmp/s18_pre891/lgbm_trainer.py
git show e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py > /tmp/s18_legacy/lgbm_grid_utils.py

2. Side-by-side : the load-bearing diff

The axes below are the parity invariants from dossier §5.3 that Step 3 will assert against. Each is marked MATCH (no behavioral difference between pre-#891 and the current harness) or DIFF (behavioral divergence that must be tested in Step 3).

| # | Axis | Pre-#891 (`dc3d86c6^` trainer.py:78-89) | Current harness (`lightgbm_dag.py:158-174`) | Verdict |
|---|---|---|---|---|
| 1 | `objective` | `binary` (or `multiclass`) | `binary` (or `multiclass`) | ✅ MATCH |
| 2 | `metric` | `["auc", "binary_logloss"]` (binary) | `["auc", "binary_logloss"]` (binary) | ✅ MATCH |
| 3 | `first_metric_only` | `True` (binary) | `True` (binary) | ✅ MATCH |
| 4 | `scale_pos_weight` | `n_neg / n_pos` when binary | `n_neg / n_pos` via `ClassBalance` node | ✅ MATCH |
| 5 | `lgb.Dataset(train)` | `lgb.Dataset(X_train, label=y_train)` (no weight when binary: `sample_weights=None`) | `lgb.Dataset(x_train, label=y_train)` (no weight when binary: `lgb_class_balance.sample_weights is None`) | ✅ MATCH |
| 6 | `lgb.Dataset(val)` | `lgb.Dataset(X_val, label=y_val, reference=train_data)` | `lgb.Dataset(x_val, label=y_val, reference=train_set)` | ✅ MATCH |
| 7 | `lgb.train(valid_sets)` | `[train_data, val_data]` (BOTH) | `[val_set]` (val ONLY) | 🚨 DIFF |
| 7b | `lgb.train(valid_names)` | `["train", "val"]` | `["val"]` | 🚨 DIFF |
| 8 | `early_stopping(stopping_rounds)` | `50` (hardcoded) | `50` default (resolved from PG `CVN_HPO_LGB_5M_EARLY_STOPPING_ROUNDS`) | ✅ MATCH (same value) |
| 9 | `log_evaluation(period)` | `0` (no verbose) | `0` (no verbose) | ✅ MATCH |
| 10 | `num_boost_round` (fallback) | `200` (legacy hardcoded default) | from PG `CVN_HPO_LGB_5M_N_ESTIMATORS` (range 100-1000 per S17 seeding) | ⚠️ DIFF in scale, not in mechanism |

Single load-bearing axis : #7 (valid_sets composition). This is the only behavioral difference that explains the best_iter=1 smoking gun observed in 53/53 LGB trials of the canary.
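As a minimal sketch of the parameter surface that rows 1-4 report as shared, the snippet below builds the binary params and derives scale_pos_weight as n_neg / n_pos. The function name and the label counts are hypothetical (the counts are chosen only so that the ratio lands on the 4.71 reported for the canary fold); in the harness the value comes from the `ClassBalance` node.

```python
def binary_lgb_params(n_pos: int, n_neg: int) -> dict:
    """Binary-objective params that both paths are reported to share (rows 1-4)."""
    if n_pos <= 0:
        raise ValueError("need at least one positive label")
    return {
        "objective": "binary",
        # auc is listed first, so with first_metric_only=True it is the
        # only metric considered for early stopping.
        "metric": ["auc", "binary_logloss"],
        "first_metric_only": True,
        "scale_pos_weight": n_neg / n_pos,
    }


# Hypothetical label counts, not taken from the captured fold.
params = binary_lgb_params(n_pos=1000, n_neg=4710)
print(params["scale_pos_weight"])  # 4.71
```

Because the dict is the same on both sides, any behavioral difference has to come from the arguments passed to lgb.train itself, which is where rows 7 and 7b sit.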

2.1 Other code-level differences (lower priority: present, but they do NOT explain `best_iter=1`)

| # | Axis | Pre-#891 | Current harness | Verdict |
|---|---|---|---|---|
| 11 | HPO search range `n_estimators` | `(150, 400)` (`hyperoptimizer.py:106`) | `(100, 1000)` (PG `CVN_HPO_LGB_5M_N_ESTIMATORS_RANGE_*`) | ⚠️ wider, but irrelevant when training stops at iter 1 |
| 12 | HPO search range `learning_rate` | `(0.08, 0.15)` | observed `(0.05, 0.15)` (close but slightly wider lower bound) | ⚠️ minor: Step 1 capture shows trials at 0.06-0.15 still produce best_iter=1 |
| 13 | HPO search range `num_leaves` | `(25, 80)` | wider (PG-driven) | ⚠️ irrelevant when stopping at iter 1 |
| 14 | HPO search range `min_child_weight` | `(1e-3, 1e-1)` | not in current harness HPO surface (legacy-only param) | ⚠️ minor surface drift, no impact |
| 15 | Calibration code path | `if config.calibration != "none"`: `lgb.LGBMClassifier(...)` + `CalibratedClassifierCV(method="isotonic", cv=3)` (lines 58-74) | post-training θ-sweep at the harness layer (no sklearn wrapper) | DIFF, but sklearn's `LGBMClassifier.fit(eval_set=[(X_train, y_train), (X_val, y_val)], eval_names=["train", "val"])` ALSO included train in eval_set, so the early-stop semantics matched the non-calibration path. Both pre-#891 paths protected against early-stop-on-val-only. |
| 16 | Trainer wrapper | `cvntrade_LightGBM_autonomous_trainer.py` (313 lines, custom orchestration) | Hamilton DAG with `lgb_booster_and_time` node | DIFF in surface, MATCH in mathematical effect (both produce a `Booster`) |
| 17 | Min split gain | `min_split_gain` in HPO range `(...)` | not in harness HPO surface | minor surface drift |
| 18 | `n_estimators` default fallback | `100` (config dataclass default) → `200` (overridden in `_screen_grid_point` for grid search) | from PG (no default fallback in harness, fail-fast per ADR-90) | DIFF in mechanism, identical in PRACTICE (Optuna picks a value within the seeded range) |

None of these axes explain best_iter=1 consistently across 53/53 trials with widely different params. The HPO ranges (num_leaves, learning_rate, etc.) all influence WHICH params Optuna picks — but if the training mechanism (early stopping) shuts down at iter 1 regardless of those params, the search-space differences are inert.

The single load-bearing diff remains #7 (valid_sets composition).

3. Why `valid_sets=[train, val]` vs `valid_sets=[val]` matters

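LightGBM's early_stopping callback keeps one best-score record per (valid_set, metric) pair. The working model behind H2, with first_metric_only=True and multiple valid_sets, is:

- Per-pair tracking: with first_metric_only=True, only the FIRST metric (auc) is considered, so the tracked pairs are [(train, auc), (val, auc)] (legacy) vs [(val, auc)] (current).
- Stop condition, as assumed by H2: early_stop fires only when NONE of the tracked pairs has improved for stopping_rounds consecutive iterations.

Caveat to settle first in Step 3: the LightGBM documentation for lgb.early_stopping states that the callback checks every validation dataset and metric it is given but ignores the training data. If the pinned LightGBM version behaves as documented, the (train, auc) pair cannot keep training alive, and the valid_sets diff would be inert for early stopping. The dynamics below therefore remain a hypothesis until the parity reproducer confirms them.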

Pre-#891 dynamics (legacy)

- (train, auc) improves at every iteration (boosting overfits the train set monotonically)
- (val, auc) may saturate or fluctuate
- The OR-condition (train improving) OR (val improving) evaluates True for every iteration → early_stop never fires
- Booster trains until num_boost_round=200 → best_iter ≈ 100-200
- Final best_iter is selected on val AUC's peak (which may be at iter 5, 10, 50…), independently of when training stops

This is a band-aid: early stopping is effectively disabled, the model trains to the configured ceiling. f1_buy ≈ 0.42 (the pre-#891 baseline) reflects this deeper training.

Post-#891 dynamics (harness, current)

- Only (val, auc) is tracked
- With imbalanced labels (18.3% positive) + scale_pos_weight=4.71 + small val sample (839 rows per canary log):
  - val AUC peaks at iter 1 or 2 (the trivial ranking from a single weighted split)
  - subsequent iterations refine but val AUC doesn't strictly improve
  - After 50 idle rounds → early_stop fires → best_iter=1, current_iter=51
- f1_buy ≈ 0.35 (the post-#891 regression) reflects this shallow training
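The val-only dynamics above can be reproduced with a few lines of plain Python. This is a simplified model of a patience rule on a single higher-is-better metric, not LightGBM's implementation; the function name and the AUC curve are hypothetical, chosen to mimic a peak at iteration 1 followed by a plateau.

```python
from typing import Sequence, Tuple


def simulate_early_stop(scores: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (best_iter, stop_iter), both 1-based, for a higher-is-better metric.

    A score counts as an improvement only if it is strictly greater than the
    best seen so far; training stops once `patience` iterations have passed
    since the best one.
    """
    best_score = float("-inf")
    best_iter = 0
    for it, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            return best_iter, it
    # Patience never ran out: training used every available round.
    return best_iter, len(scores)


# Hypothetical val AUC curve: best value at iteration 1, slightly lower afterwards.
val_auc = [0.640] + [0.638] * 199
best_iter, stop_iter = simulate_early_stop(val_auc, patience=50)
print(best_iter, stop_iter)  # 1 51
```

With a peak at iteration 1 and stopping_rounds=50, the rule stops at iteration 51, which is the best_iter=1, current_iter=51 pattern reported for the canary, independent of how large num_boost_round is.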

The harness is "more correct" in principle (val AUC is the canonical early-stop signal — train AUC monotonically improving is precisely WHY early stopping was invented). But it exposes a pre-existing val saturation issue that the legacy hack masked.

Impact on the 7 hypotheses (§5.1 dossier ranking)

| H | Hypothesis | Updated likelihood | Rationale |
|---|---|---|---|
| H2 | eval_set incorrect (train used as val) | VERY HIGH ↑↑ (was `high`) | Confirmed by code inspection. The bug is structural: harness drops `train_set` from `valid_sets` and inadvertently activates early-stop-on-val-only, which fires on iter 1 due to val AUC saturation. |
| H1 | Early stopping misconfigured | low ↓ (was `highest`) | The early_stopping config (`stopping_rounds=50`) is identical. The issue is the COMPOSITION of valid_sets fed to that config, not the config itself. |
| H6 | Objective / metric mismatch | low ↓ (was `medium`) | Verified identical: `metric=["auc", "binary_logloss"]` + `first_metric_only=True` in both. |
| H4 | Sample weights broken | low ↓ (was `medium`) | Verified identical: binary path sets `scale_pos_weight = n_neg/n_pos` in both, no `sample_weight` array. |
| H3 | Labels misaligned | low (unchanged) | Out of scope of this dossier (data prep upstream of `train_weighted_variant`). Step 3 will assert via dataset hash. |
| H5 | Feature matrix corrupted | low (unchanged) | Same as H3. |
| H7 | HPO override of critical params | low (unchanged) | Step 1 capture shows Optuna picks `learning_rate ∈ [0.05, 0.15]`, `n_estimators ∈ [100, 800]`: wide but not pathological. The shallow-training pattern is observed across 53/53 trials with WIDELY DIFFERENT params, ruling out a single "bad" param combo. |

4. Concrete fix candidates (Step 4 + Step 5 will pick one)

Option A: restore the legacy hack (two-line change, fastest)

   booster = lgb.train(
       params=lgb_native_params,
       train_set=train_set,
       num_boost_round=n_estimators,
-      valid_sets=[val_set],
-      valid_names=["val"],
+      valid_sets=[train_set, val_set],
+      valid_names=["train", "val"],
       callbacks=[lgb.early_stopping(stopping_rounds=es_rounds), lgb.log_evaluation(period=0)],
   )

Pros: minimal change (two lines); if H2 holds, it should restore the pre-#891 baseline (f1_buy ≈ 0.42), which Step 3 and Step 4 are meant to verify.

Cons: re-introduces the "early stopping is effectively disabled" anti-pattern. The booster trains to num_boost_round regardless of val behavior, with an overfitting risk if train and val distributions diverge (guarding against that is the whole point of early stopping).

Option B: fix the val saturation root cause (deeper, slower)

The val AUC saturates at iter 1-2 because:

- the val sample is small (839 rows)
- 18.3% positive class → ~154 positives in val
- with scale_pos_weight=4.71, the trivial first-split classifier already nails most positives
- subsequent iterations improve precision/recall but don't always shift AUC

Sub-options:

- B1: switch the early-stop metric from auc to binary_logloss (not purely rank-based, so it keeps moving when the ordering of predictions is unchanged)
- B2: enlarge val (raise val_size_pct from 15% to 20-25%)
- B3: use a stratified val split (already the case? verify)
- B4: add a min_data_in_leaf floor to prevent an over-confident first split
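To illustrate the reasoning behind B1, here is a small self-contained sketch on toy data (the labels and probabilities are invented, not taken from the captured fold). It shows two prediction vectors with the same ordering and the same ties: AUC is identical, while log loss still moves. It also reproduces the ~154 positives figure from the row count and positive rate quoted above.

```python
import math
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logloss(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean binary cross-entropy."""
    total = sum(-math.log(p) if y == 1 else -math.log(1.0 - p) for y, p in zip(labels, probs))
    return total / len(labels)


# Toy data (hypothetical): same ranking and ties, different confidence.
labels = [1, 1, 1, 0, 0, 0]
early = [0.6, 0.6, 0.4, 0.4, 0.4, 0.6]  # coarse scores, as after one split
later = [0.7, 0.7, 0.3, 0.3, 0.3, 0.7]  # sharper probabilities, same order

print(auc(labels, early) == auc(labels, later))          # True
print(logloss(labels, later) < logloss(labels, early))   # True

# Class counts implied by the figures quoted above for the val split.
n_pos = round(839 * 0.183)
n_neg = 839 - n_pos
print(n_pos, n_neg)  # 154 685
```

An early-stop rule driven by AUC sees no improvement between the two vectors, whereas one driven by binary_logloss does. Whether that helps on the real fold is an empirical question for S19.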

Option C: adaptive early stopping (correct + portable)

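- Watch BOTH (val, auc) and (val, binary_logloss); stop only when both have saturated
- This is not what first_metric_only=False gives by itself: the built-in callback then stops as soon as any tracked metric has stalled for stopping_rounds. Stopping only when all metrics have stalled needs a custom callback, plus a careful choice of the secondary metric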
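The difference between "stop when any metric stalls" and "stop when all metrics stall" can be sketched in plain Python. This is a simplified model of the two rules, not a LightGBM callback; the function name, the metric curves and the small patience value are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def stop_iteration(
    history: Dict[str, List[float]],
    higher_is_better: Dict[str, bool],
    patience: int,
    combine: Callable[[Iterable[bool]], bool] = all,
) -> int:
    """Return the 1-based iteration at which the combined stall rule fires.

    combine=all stops only when every metric has stalled for `patience` rounds;
    combine=any stops as soon as one metric has. Falls back to the last
    iteration if the rule never fires.
    """
    n_iters = len(next(iter(history.values())))
    best: Dict[str, float] = {}
    best_iter = {name: 0 for name in history}
    for it in range(1, n_iters + 1):
        for name, values in history.items():
            value = values[it - 1]
            if name not in best:
                improved = True
            elif higher_is_better[name]:
                improved = value > best[name]
            else:
                improved = value < best[name]
            if improved:
                best[name], best_iter[name] = value, it
        if combine(it - best_iter[name] >= patience for name in history):
            return it
    return n_iters


# Hypothetical curves: AUC peaks at iteration 1, log loss improves until iteration 30.
history = {
    "auc": [0.640] + [0.638] * 59,
    "binary_logloss": [(700 - 5 * i) / 1000 for i in range(30)] + [0.600] * 30,
}
direction = {"auc": True, "binary_logloss": False}

print(stop_iteration(history, direction, patience=5, combine=all))  # 35
print(stop_iteration(history, direction, patience=5, combine=any))  # 6
```

Under the "all" rule the stalled AUC no longer ends training while log loss is still improving. Which iteration is then reported as best is a separate design choice that Option C would have to specify.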

Recommendation : start with Option A in Step 4 to confirm the diagnosis (restores baseline → confirms the smoking gun is correct), then assess Options B/C in S19 (the remediation Story).

5. Step 3 deliverables (next)

Per dossier §5.2 Step 3, the parity reproducer must :

- Load the SAME train/val/test splits from captured-fold-AAVEUSDC-3.parquet (Step 1 output)
- Train TWICE:
  - Path A (legacy): lgb.train(valid_sets=[train_set, val_set], valid_names=["train", "val"]) with the params recorded in Step 1 trials
  - Path B (harness): lgb.train(valid_sets=[val_set], valid_names=["val"]), same params
- Log per-iter (val, auc) + (val, binary_logloss) for both paths
- Assert:
  - Path A best_iter > 50 (deep training)
  - Path B best_iter ≤ 5 (shallow training, matching canary)
  - Path A f1_buy_val ≈ 0.40-0.45 (matching pre-#891 baseline)
  - Path B f1_buy_val ≈ 0.30-0.40 (matching canary)

If all four assertions hold → smoking gun confirmed deterministically → Step 5 final dossier closes with the H2 verdict + the Option A two-line fix recommendation for S19.

If Path A also produces best_iter=1 → the diagnosis is wrong, escalate to H3-H5 (data path) or H7 (HPO). Code inspection makes this look unlikely, but only the reproducer can rule it out.
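The four assertions and the two outcomes can be captured in a small harness that is independent of how training is wired. This is a sketch under stated assumptions: `RunResult`, `TrainFn`, `parity_checks` and `verdict` are hypothetical names, and `fake_train` is a stub with illustrative numbers standing in for the real lgb.train call on the captured fold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class RunResult:
    best_iter: int      # best iteration reported by the trained booster
    f1_buy_val: float   # F1 of the buy class on the validation split


# Hypothetical adapter: trains once with the given valid_names, returns a RunResult.
TrainFn = Callable[[Sequence[str]], RunResult]


def parity_checks(train_fn: TrainFn) -> Dict[str, bool]:
    """Run both paths and evaluate the four Step 3 assertions."""
    path_a = train_fn(["train", "val"])  # legacy valid_sets composition
    path_b = train_fn(["val"])           # current harness composition
    return {
        "path_a_deep": path_a.best_iter > 50,
        "path_b_shallow": path_b.best_iter <= 5,
        "path_a_f1_baseline": 0.40 <= path_a.f1_buy_val <= 0.45,
        "path_b_f1_canary": 0.30 <= path_b.f1_buy_val <= 0.40,
    }


def verdict(checks: Dict[str, bool]) -> str:
    if all(checks.values()):
        return "H2 confirmed"
    if not checks["path_a_deep"]:
        return "H2 not supported: escalate to H3-H5 or H7"
    return "inconclusive: inspect failed checks"


# Stub with made-up results, for illustration only.
def fake_train(valid_names: Sequence[str]) -> RunResult:
    return RunResult(150, 0.42) if "train" in valid_names else RunResult(1, 0.35)


print(verdict(parity_checks(fake_train)))  # H2 confirmed
```

Keeping the checks separate from the training adapter lets the same harness be rerun unchanged if the LightGBM version or the captured fold changes.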

6. Source archive

Pre-#891 + post-#891-pre-#899 source files are extracted to /tmp/s18_pre891/ and /tmp/s18_legacy/ on the operator's machine for the duration of S18. They will be uploaded to s3://cvntrade-artifacts/s18/ if Step 3 needs longer-term reference.

Key files:

- dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py, the pre-#891 LGB training code (the load-bearing diff is in the train method, lines 78-89)
- e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:26-275, train_with_fixed_params_lgbm (the Bug #6 fix, which already had valid_sets=[val] only; this confirms the regression is documented)
- e884fcd3:src/training/harness/dags/models/lightgbm_dag.py:130-176, the current harness lgb_booster_and_time

7. Sign-off

Pre-#891 LGB code extracted via git show dc3d86c6^:...
Side-by-side comparison on 10 axes (params + Dataset + train) — single DIFF on valid_sets
Hypothesis ranking updated : H2 promoted from high to VERY HIGH
3 fix-candidate options documented (A: two-line restore, B: root cause, C: adaptive)
Step 3 specification (parity reproducer) drafted with concrete assertions
Operator review of this dossier
Step 3 implementation