S18 Step 2 : Legacy vs Harness LGB code comparison
Status (2026-05-14) : Step 2 of the S18 diagnostic plan §5.2. Inputs : Step 0 PASS verdict (2026-05-13 23:42 UTC, abs_delta=8e-09 vs canary), Step 1 capture artifacts (step1-trials-AAVEUSDC-fold3.json, captured-fold-AAVEUSDC-3.parquet). Output : this dossier identifies the most likely regression locus so Step 3 (parity reproducer) can verify it deterministically.
TL;DR : smoking gun candidate
Hypothesis H2 (eval_set incorrect) ranks first. The pre-#891 LGB trainer passed valid_sets=[train_data, val_data] to lgb.train ; the current harness passes valid_sets=[val_set] only. Combined with first_metric_only=True + lgb.early_stopping(stopping_rounds=50) :
- Pre-#891 : train AUC always improves (overfitting) → early stop never fires → booster trains until num_boost_round → best_iter ≈ 100-300
- Post-#891 harness : only val AUC is watched → val AUC saturates at iter 1-2 (scale_pos_weight=4.71 + small val sample, 839 rows) → early stop fires after 50 idle rounds → best_iter ≈ 1 ← matches canary 53/53 trials
The rest of the LGB call surface (params, Dataset construction, callbacks, scale_pos_weight, metric, first_metric_only) is behaviorally identical between the two paths; only identifier names and the source of num_boost_round differ (see §2). The single load-bearing difference is the valid_sets composition.
1. Methodology
Per dossier §5.2 Step 2, we extract the pre-#891 production LGB training code from git and compare it side-by-side with the current harness lgb_booster_and_time (src/training/harness/dags/models/lightgbm_dag.py:130-176).
Git refs :
- Pre-#891 (legacy autonomous trainers) : dc3d86c6^ — the commit just before PR #891 merge (Hamilton harness migration). Files : src/training/LightGBM/cvntrade_LightGBM_trainer.py.
- Post-#891 / pre-#899 : e75418ca^ — same harness as today, BEFORE the autonomous-trainer collapse PR #899. The train_with_fixed_params_lgbm shim in grid_utils.py is preserved as the "canonical legacy reference" (Bug #6 of FTF 7-bug hotfix spec).
- Current main : e884fcd3 (post-Step 1).
Extraction commands (reproducible) :
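```shell
# Create the target directories first, otherwise the redirects fail.
mkdir -p /tmp/s18_pre891 /tmp/s18_legacy
git show dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py > /tmp/s18_pre891/lgbm_trainer.py
git show e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py > /tmp/s18_legacy/lgbm_grid_utils.py
```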
2. Side-by-side : the load-bearing diff
The 7 axes from dossier §5.3 (the parity invariants Step 3 will assert against). For each, we mark MATCH (no behavioral difference between pre-#891 and current harness) or DIFF (behavioral divergence — must be tested in Step 3).
| # | Axis | Pre-#891 (dc3d86c6^ trainer.py:78-89) | Current harness (lightgbm_dag.py:158-174) | Verdict |
|---|---|---|---|---|
| 1 | objective | binary (or multiclass) | binary (or multiclass) | ✅ MATCH |
| 2 | metric | ["auc", "binary_logloss"] (binary) | ["auc", "binary_logloss"] (binary) | ✅ MATCH |
| 3 | first_metric_only | True (binary) | True (binary) | ✅ MATCH |
| 4 | scale_pos_weight | n_neg / n_pos when binary | n_neg / n_pos via ClassBalance node | ✅ MATCH |
| 5 | lgb.Dataset(train) | lgb.Dataset(X_train, label=y_train) (no weight when binary : sample_weights=None) | lgb.Dataset(x_train, label=y_train) (no weight when binary : lgb_class_balance.sample_weights is None) | ✅ MATCH |
| 6 | lgb.Dataset(val) | lgb.Dataset(X_val, label=y_val, reference=train_data) | lgb.Dataset(x_val, label=y_val, reference=train_set) | ✅ MATCH |
| 7 | lgb.train(valid_sets) | [train_data, val_data] (BOTH) | [val_set] (val ONLY) | 🚨 DIFF |
| 7b | lgb.train(valid_names) | ["train", "val"] | ["val"] | 🚨 DIFF |
| 8 | early_stopping(stopping_rounds) | 50 (hardcoded) | 50 default (resolved from PG CVN_HPO_LGB_5M_EARLY_STOPPING_ROUNDS) | ✅ MATCH (same value) |
| 9 | log_evaluation(period) | 0 (no verbose) | 0 (no verbose) | ✅ MATCH |
| 10 | num_boost_round (fallback) | 200 (legacy hardcoded default) | from PG CVN_HPO_LGB_5M_N_ESTIMATORS (range 100-1000 per S17 seeding) | ⚠️ DIFF in scale, not in mechanism |
Single load-bearing axis : #7 (valid_sets composition). This is the only behavioral difference that explains the best_iter=1 smoking gun observed in 53/53 LGB trials of the canary.
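The matching axes 1-4 can be condensed into a small params builder, which may help when reading the comparison. This is a minimal sketch, assuming binary labels encoded as 0/1. The helper name `build_binary_params` is hypothetical and appears in neither code path, and the label counts in the usage line are illustrative values chosen only to reproduce the 4.71 ratio quoted in this dossier.

```python
from typing import Any, Dict, Sequence


def build_binary_params(y_train: Sequence[int]) -> Dict[str, Any]:
    """Assemble the binary-path params listed on axes 1-4 of the table.

    Hypothetical helper for illustration only.
    """
    n_pos = sum(1 for label in y_train if label == 1)
    n_neg = sum(1 for label in y_train if label == 0)
    if n_pos == 0:
        raise ValueError("scale_pos_weight is undefined without positive labels")
    return {
        "objective": "binary",
        # auc is listed first, so it is the metric that first_metric_only refers to
        "metric": ["auc", "binary_logloss"],
        "first_metric_only": True,
        "scale_pos_weight": n_neg / n_pos,  # axis 4: n_neg / n_pos
    }


# Illustrative counts (not the real fold): 100 positives, 471 negatives.
params = build_binary_params([1] * 100 + [0] * 471)
print(round(params["scale_pos_weight"], 2))  # 4.71
```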
2.1 Other code-level differences (lower priority : present but DO NOT explain best_iter=1)
| # | Axis | Pre-#891 | Current harness | Verdict |
|---|---|---|---|---|
| 11 | HPO search range n_estimators | (150, 400) (hyperoptimizer.py:106) | (100, 1000) (PG CVN_HPO_LGB_5M_N_ESTIMATORS_RANGE_*) | ⚠️ wider, but irrelevant when training stops at iter 1 |
| 12 | HPO search range learning_rate | (0.08, 0.15) | observed (0.05, 0.15) (close but slightly wider lower bound) | ⚠️ minor : Step 1 capture shows trials at 0.06-0.15 still produce best_iter=1 |
| 13 | HPO search range num_leaves | (25, 80) | wider (PG-driven) | ⚠️ irrelevant when stopping at iter 1 |
| 14 | HPO search range min_child_weight | (1e-3, 1e-1) | not in current harness HPO surface (legacy-only param) | ⚠️ minor surface drift, no impact |
| 15 | Calibration code path | if config.calibration != "none": lgb.LGBMClassifier(...) + CalibratedClassifierCV(method="isotonic", cv=3) (lines 58-74) | post-training θ-sweep at the harness layer (no sklearn wrapper) | DIFF, but the sklearn-wrapper call LGBMClassifier.fit(eval_set=[(X_train, y_train), (X_val, y_val)], eval_names=["train", "val"]) ALSO included train in eval_set, so the early-stop semantics matched the non-calibration path. Both pre-#891 paths protected against early-stop-on-val-only. |
| 16 | Trainer wrapper | cvntrade_LightGBM_autonomous_trainer.py (313 lines, custom orchestration) | Hamilton DAG with lgb_booster_and_time node | DIFF in surface, MATCH in mathematical effect (both produce a Booster) |
| 17 | Min split gain | min_split_gain in HPO range (...) | not in harness HPO surface | minor surface drift |
| 18 | n_estimators default fallback | 100 (config dataclass default) → 200 (overridden in _screen_grid_point for grid search) | from PG (no default fallback in harness, fail-fast per ADR-90) | DIFF in mechanism, identical in PRACTICE (Optuna picks a value within the seeded range) |
None of these axes explain best_iter=1 consistently across 53/53 trials with widely different params. The HPO ranges (num_leaves, learning_rate, etc.) all influence WHICH params Optuna picks — but if the training mechanism (early stopping) shuts down at iter 1 regardless of those params, the search-space differences are inert.
The single load-bearing diff remains #7 (valid_sets composition).
3. Why valid_sets=[train, val] vs valid_sets=[val] matters
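LightGBM's early_stopping callback semantics with first_metric_only=True + multiple valid_sets, as modelled in this dossier :
- Per-tuple tracking : LGB tracks (valid_set, metric) improvement each iteration. With first_metric_only=True, only the FIRST metric (auc) is tracked → tuples are [(train, auc), (val, auc)] (legacy) vs [(val, auc)] (current).
- Stop condition : early_stop fires when NONE of the tracked tuples improves for stopping_rounds consecutive iterations.

Caveat to verify in Step 3 : the lgb.early_stopping documentation states that when several validation sets or metrics are present it checks all of them, and that the training data is ignored. Whether the train entry in valid_sets really kept the legacy run alive is therefore an assumption of this model, not an established fact, until the parity reproducer confirms it.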
Pre-#891 dynamics (legacy)
- (train, auc) improves at every iteration (boosting overfits the train set monotonically)
- (val, auc) may saturate or fluctuate
- The OR-condition (train improving) OR (val improving) evaluates True for every iteration → early_stop never fires
- Booster trains until num_boost_round=200 → best_iter ≈ 100-200
- Final best_iter is selected on val AUC's peak (which may be at iter 5, 10, 50…), independently of when training stops
This is a band-aid: early stopping is effectively disabled, the model trains to the configured ceiling. f1_buy ≈ 0.42 (the pre-#891 baseline) reflects this deeper training.
Post-#891 dynamics (harness, current)
- Only (val, auc) is tracked
- With imbalanced labels (18.3% positive) + scale_pos_weight=4.71 + small val sample (839 rows per canary log) :
  - val AUC peaks at iter 1 or 2 (the trivial ranking from a single weighted split)
  - subsequent iterations refine but val AUC doesn't strictly improve
- After 50 idle rounds → early_stop fires → best_iter=1, current_iter=51
- f1_buy ≈ 0.35 (the post-#891 regression) reflects this shallow training
The harness is "more correct" in principle (val AUC is the canonical early-stop signal — train AUC monotonically improving is precisely WHY early stopping was invented). But it exposes a pre-existing val saturation issue that the legacy hack masked.
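The best_iter=1, current_iter=51 pattern follows directly from a patience rule applied to a metric that peaks on the first iteration. The sketch below is a simplified pure-Python model of such a rule on a single (val, auc) series. It is not LightGBM code, the function name is hypothetical, and the AUC values are illustrative rather than taken from the canary.

```python
from typing import List, Tuple


def simulate_early_stopping(scores: List[float], stopping_rounds: int) -> Tuple[int, int]:
    """Return (best_iter, stop_iter), both 1-indexed, for a higher-is-better metric.

    Training stops once `stopping_rounds` iterations have passed
    without a strict improvement over the best score seen so far.
    """
    best_iter, best_score = 1, scores[0]
    for current_iter, score in enumerate(scores, start=1):
        if score > best_score:
            best_iter, best_score = current_iter, score
        if current_iter - best_iter >= stopping_rounds:
            return best_iter, current_iter
    # Ran to the configured ceiling without triggering the rule.
    return best_iter, len(scores)


# Illustrative series: val AUC peaks at iteration 1, then never strictly improves.
val_auc = [0.620] + [0.615] * 199
print(simulate_early_stopping(val_auc, stopping_rounds=50))  # (1, 51)
```

A series that keeps improving never triggers the rule and runs to the end of the list, which is the behavior the legacy section attributes to the pre-#891 path.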
Impact on the 7 hypotheses (§5.1 dossier ranking)
| H | Hypothesis | Updated likelihood | Rationale |
|---|---|---|---|
| H2 | eval_set incorrect (train used as val) | VERY HIGH ↑↑ (was high) | Confirmed by code inspection. The bug is structural : harness drops train_set from valid_sets and inadvertently activates early-stop-on-val-only, which fires on iter 1 due to val AUC saturation. |
| H1 | Early stopping misconfigured | low ↓ (was highest) | The early_stopping config (stopping_rounds=50) is identical. The issue is the COMPOSITION of valid_sets fed to that config, not the config itself. |
| H6 | Objective / metric mismatch | low ↓ (was medium) | Verified identical : metric=["auc", "binary_logloss"] + first_metric_only=True in both. |
| H4 | Sample weights broken | low ↓ (was medium) | Verified identical : binary path sets scale_pos_weight = n_neg/n_pos in both, no sample_weight array. |
| H3 | Labels misaligned | low (unchanged) | Out of scope of this dossier (data prep upstream of train_weighted_variant). Step 3 will assert via dataset hash. |
| H5 | Feature matrix corrupted | low (unchanged) | Same as H3. |
| H7 | HPO override of critical params | low (unchanged) | Step 1 capture shows Optuna picks learning_rate ∈ [0.05, 0.15], n_estimators ∈ [100, 800] : wide but not pathological. The shallow-training pattern is observed across 53/53 trials with WIDELY DIFFERENT params, ruling out a single "bad" param combo. |
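H3 and H5 are deferred to a dataset hash in Step 3. One possible shape for that check is sketched below, assuming the captured splits have already been loaded into NumPy arrays. The function name `dataset_fingerprint` and the choice of SHA-256 over shape, dtype and raw bytes are assumptions made for illustration; the dossier does not specify the hashing scheme.

```python
import hashlib

import numpy as np


def dataset_fingerprint(features: np.ndarray, labels: np.ndarray) -> str:
    """SHA-256 over the shape, dtype and raw bytes of features and labels.

    Hypothetical helper: any change in values, row order, dtype or shape
    produces a different hex digest.
    """
    digest = hashlib.sha256()
    for array in (features, labels):
        contiguous = np.ascontiguousarray(array)
        digest.update(str(contiguous.shape).encode("utf-8"))
        digest.update(str(contiguous.dtype).encode("utf-8"))
        digest.update(contiguous.tobytes())
    return digest.hexdigest()


# Toy arrays standing in for one split.
x = np.arange(12, dtype=np.float64).reshape(4, 3)
y = np.array([0, 1, 0, 1])
print(dataset_fingerprint(x, y) == dataset_fingerprint(x.copy(), y.copy()))  # True
```

If the fingerprint computed on the inputs of the legacy path equals the one computed on the inputs of the harness path, a label shift or a feature-matrix corruption between the two can be ruled out for that split.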
4. Concrete fix candidates (Step 4 + Step 5 will pick one)
Option A : restore the legacy hack (1-line change, fastest)
```diff
 booster = lgb.train(
     params=lgb_native_params,
     train_set=train_set,
     num_boost_round=n_estimators,
-    valid_sets=[val_set],
-    valid_names=["val"],
+    valid_sets=[train_set, val_set],
+    valid_names=["train", "val"],
     callbacks=[lgb.early_stopping(stopping_rounds=es_rounds), lgb.log_evaluation(period=0)],
 )
```
Pros : 1-line, and expected to restore the pre-#891 baseline (f1_buy ≈ 0.42) if H2 is the correct diagnosis.
Cons : Re-introduces the "early stopping is effectively disabled" anti-pattern. Booster trains to num_boost_round regardless of val behavior, so it can overfit when train and val distributions diverge, which is exactly the failure early stopping exists to prevent.
Option B : fix the val saturation root cause (deeper, slower)
The val AUC saturates at iter 1-2 because :
- val sample is small (839 rows)
- 18.3% positive class → ~154 positives in val
- with scale_pos_weight=4.71, the trivial first-split classifier already nails most positives
- subsequent iterations improve precision/recall but don't always shift AUC
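The figures in this list can be cross-checked with a few lines of arithmetic. The sketch assumes scale_pos_weight was computed as n_neg / n_pos on the training split, which is why the positive rate it implies need not equal the 18.3% observed on val.

```python
val_rows = 839
val_pos_rate = 0.183
scale_pos_weight = 4.71

val_pos = round(val_rows * val_pos_rate)
val_neg = val_rows - val_pos
# scale_pos_weight = n_neg / n_pos implies a positive rate of 1 / (1 + scale_pos_weight)
implied_train_pos_rate = 1.0 / (1.0 + scale_pos_weight)
# AUC is a fraction of correctly ordered (positive, negative) pairs
pair_count = val_pos * val_neg

print(val_pos, val_neg)                 # 154 685
print(f"{implied_train_pos_rate:.3f}")  # 0.175
print(pair_count)                       # 105490
```

With about 105,490 positive-negative pairs, val AUC can only move in steps of roughly 1e-5, and only when an iteration actually reorders at least one pair.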
Sub-options :
- B1 : switch early-stop metric from auc to binary_logloss (not rank-based, so it keeps moving when predicted probabilities sharpen without changing the ranking)
- B2 : enlarge val (raise val_size_pct from 15% to 20-25%)
- B3 : use stratified val split (already the case ?) — verify
- B4 : add min_data_in_leaf floor to prevent over-confident first-split
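The reasoning behind B1 can be illustrated on a toy example: AUC depends only on how predictions are ordered, while binary_logloss also reacts to how confident they are. The sketch below uses hand-written pure-Python metrics and made-up probabilities, so it shows the principle rather than anything measured on the captured fold.

```python
import math
from typing import List


def auc(y_true: List[int], scores: List[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def binary_logloss(y_true: List[int], probs: List[float]) -> float:
    """Mean negative log-likelihood of the true labels."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, probs))
    return -total / len(y_true)


y = [1, 1, 0, 0, 0]
early = [0.55, 0.45, 0.50, 0.40, 0.35]  # illustrative probabilities, early iteration
later = [0.80, 0.45, 0.50, 0.15, 0.10]  # sharper probabilities, same ordering

print(auc(y, early) == auc(y, later))                        # True
print(binary_logloss(y, early) > binary_logloss(y, later))   # True
```

An early-stop rule watching auc would count the second state as "no improvement", whereas one watching binary_logloss would see progress.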
Option C : adaptive early stopping (correct + portable)
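- Watch BOTH (val, auc) and (val, binary_logloss); stop only when both saturate
- Not equivalent to first_metric_only=False on its own : with several tracked metrics the built-in lgb.early_stopping callback checks all of them and stops as soon as any one has gone stopping_rounds without improving, so a "stop only when all saturate" rule needs a custom callback plus a careful selection of the secondary metric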
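The gap between stopping when any watched metric stalls and stopping only when all of them stall can be seen in a small simulation. This is a pure-Python sketch with a hypothetical function name and illustrative metric series. It does not call LightGBM and is not a drop-in callback.

```python
from typing import Dict, List


def stop_iteration(
    histories: Dict[str, List[float]],
    higher_is_better: Dict[str, bool],
    stopping_rounds: int,
    require_all: bool,
) -> int:
    """Return the 1-indexed iteration at which training would stop.

    require_all=False: stop when ANY metric has stalled for `stopping_rounds`.
    require_all=True : stop only when ALL metrics have stalled (the Option C rule).
    """
    n_iters = len(next(iter(histories.values())))
    best_iter = {name: 1 for name in histories}
    best_score = {name: series[0] for name, series in histories.items()}
    for current in range(1, n_iters + 1):
        stalled = []
        for name, series in histories.items():
            score = series[current - 1]
            if higher_is_better[name]:
                improved = score > best_score[name]
            else:
                improved = score < best_score[name]
            if improved:
                best_iter[name], best_score[name] = current, score
            stalled.append(current - best_iter[name] >= stopping_rounds)
        should_stop = all(stalled) if require_all else any(stalled)
        if should_stop:
            return current
    return n_iters


# Illustrative series: AUC flat from iteration 1, logloss improving until iteration 120.
histories = {
    "auc": [0.62] * 200,
    "binary_logloss": [0.60 - 0.001 * min(i, 119) for i in range(200)],
}
direction = {"auc": True, "binary_logloss": False}
print(stop_iteration(histories, direction, stopping_rounds=50, require_all=False))  # 51
print(stop_iteration(histories, direction, stopping_rounds=50, require_all=True))   # 170
```

Under the "all stalled" rule the run continues for as long as the secondary metric keeps improving, which is the behavior Option C is after.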
Recommendation : start with Option A in Step 4 to confirm the diagnosis (restores baseline → confirms the smoking gun is correct), then assess Options B/C in S19 (the remediation Story).
5. Step 3 deliverables (next)
Per dossier §5.2 Step 3, the parity reproducer must :
- Load the SAME train/val/test splits from captured-fold-AAVEUSDC-3.parquet (Step 1 output)
- Train TWICE :
  - Path A (legacy) : lgb.train(valid_sets=[train_set, val_set], valid_names=["train", "val"]) with the params recorded in Step 1 trials
  - Path B (harness) : lgb.train(valid_sets=[val_set], valid_names=["val"]), same params
- Log per-iter (val, auc) + (val, binary_logloss) for both paths
- Assert :
  - Path A best_iter > 50 (deep training)
  - Path B best_iter ≤ 5 (shallow training, matching canary)
  - Path A f1_buy_val ≈ 0.40-0.45 (matching pre-#891 baseline)
  - Path B f1_buy_val ≈ 0.30-0.40 (matching canary)
If all four assertions hold → smoking gun confirmed deterministically → Step 5 final dossier closes with H2 verdict + Option A 1-line fix recommendation for S19.
If Path A also produces best_iter=1 → the diagnosis is wrong, escalate to H3-H5 (data path) or H7 (HPO). Code inspection makes this outcome look unlikely, but only the reproducer can rule it out.
6. Source archive
Pre-#891 + post-#891-pre-#899 source files are extracted to /tmp/s18_pre891/ and /tmp/s18_legacy/ on the operator's machine for the duration of S18. They will be uploaded to s3://cvntrade-artifacts/s18/ if Step 3 needs longer-term reference.
Key files :
- dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py — pre-#891 LGB training (the load-bearing diff is in train method lines 78-89)
- e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:26-275 — train_with_fixed_params_lgbm (the Bug #6 fix that already had valid_sets=[val] only — confirms the regression is documented)
- e884fcd3:src/training/harness/dags/models/lightgbm_dag.py:130-176 — current harness lgb_booster_and_time
7. Sign-off
- Pre-#891 LGB code extracted via git show dc3d86c6^:...
- Side-by-side comparison on 10 axes (params + Dataset + train) : single load-bearing DIFF on valid_sets
- Hypothesis ranking updated : H2 promoted from high to VERY HIGH
- 3 fix-candidate options documented (A : 1-line restore, B : root cause, C : adaptive)
- Step 3 specification (parity reproducer) drafted with concrete assertions
- Operator review of this dossier
- Step 3 implementation