
S18 Step 2 — Legacy vs Harness LGB code comparison

Status — 2026-05-14 : Step 2 of the S18 diagnostic plan §5.2. Inputs : Step 0 PASS verdict (2026-05-13 23:42 UTC, abs_delta=8e-09 vs canary), Step 1 capture artifacts (step1-trials-AAVEUSDC-fold3.json, captured-fold-AAVEUSDC-3.parquet). Output : this dossier identifies the most likely regression locus so Step 3 (parity reproducer) can verify it deterministically.


TL;DR — smoking gun candidate

Hypothesis H2 (eval_set incorrect) ranks first. The pre-#891 LGB trainer passed valid_sets=[train_data, val_data] to lgb.train ; the current harness passes valid_sets=[val_set] only. Combined with first_metric_only=True + lgb.early_stopping(stopping_rounds=50) :

  • Pre-#891 : train AUC always improves (overfitting) → early stop never fires → booster trains until num_boost_round → best_iter ≈ 100-300
  • Post-#891 harness : only val AUC is watched → val AUC saturates at iter 1-2 (scale_pos_weight=4.71 + small val sample 839 rows) → early stop fires after 50 idle rounds → best_iter ≈ 1 ← matches canary 53/53 trials

The rest of the LGB call surface (params, Dataset construction, callbacks, scale_pos_weight, metric, first_metric_only) is behaviorally identical between the two paths. The single load-bearing difference is the valid_sets composition.


1. Methodology

Per dossier §5.2 Step 2, we extract the pre-#891 production LGB training code from git and compare it side-by-side with the current harness lgb_booster_and_time (src/training/harness/dags/models/lightgbm_dag.py:130-176).

Git refs :

  • Pre-#891 (legacy autonomous trainers) : dc3d86c6^, the commit just before PR #891 merge (Hamilton harness migration). Files : src/training/LightGBM/cvntrade_LightGBM_trainer.py.
  • Post-#891 / pre-#899 : e75418ca^, same harness as today, BEFORE the autonomous-trainer collapse PR #899. The train_with_fixed_params_lgbm shim in grid_utils.py is preserved as the "canonical legacy reference" (Bug #6 of FTF 7-bug hotfix spec).
  • Current main : e884fcd3 (post-Step 1).

Extraction commands (reproducible) :

mkdir -p /tmp/s18_pre891 /tmp/s18_legacy
git show dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py > /tmp/s18_pre891/lgbm_trainer.py
git show e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py > /tmp/s18_legacy/lgbm_grid_utils.py


2. Side-by-side : the load-bearing diff

The 7 axes from dossier §5.3 (the parity invariants Step 3 will assert against). For each, we mark MATCH (no behavioral difference between pre-#891 and current harness) or DIFF (behavioral divergence — must be tested in Step 3).

| # | Axis | Pre-#891 (dc3d86c6^ trainer.py:78-89) | Current harness (lightgbm_dag.py:158-174) | Verdict |
|---|---|---|---|---|
| 1 | objective | binary (or multiclass) | binary (or multiclass) | ✅ MATCH |
| 2 | metric | ["auc", "binary_logloss"] (binary) | ["auc", "binary_logloss"] (binary) | ✅ MATCH |
| 3 | first_metric_only | True (binary) | True (binary) | ✅ MATCH |
| 4 | scale_pos_weight | n_neg / n_pos when binary | n_neg / n_pos via ClassBalance node | ✅ MATCH |
| 5 | lgb.Dataset(train) | lgb.Dataset(X_train, label=y_train) (no weight when binary : sample_weights=None) | lgb.Dataset(x_train, label=y_train) (no weight when binary : lgb_class_balance.sample_weights is None) | ✅ MATCH |
| 6 | lgb.Dataset(val) | lgb.Dataset(X_val, label=y_val, reference=train_data) | lgb.Dataset(x_val, label=y_val, reference=train_set) | ✅ MATCH |
| 7 | lgb.train(valid_sets) | [train_data, val_data] (BOTH) | [val_set] (val ONLY) | 🚨 DIFF |
| 7b | lgb.train(valid_names) | ["train", "val"] | ["val"] | 🚨 DIFF |
| 8 | early_stopping(stopping_rounds) | 50 (hardcoded) | 50 default (resolved from PG CVN_HPO_LGB_5M_EARLY_STOPPING_ROUNDS) | ✅ MATCH (same value) |
| 9 | log_evaluation(period) | 0 (no verbose) | 0 (no verbose) | ✅ MATCH |
| 10 | num_boost_round (fallback) | 200 (legacy hardcoded default) | from PG CVN_HPO_LGB_5M_N_ESTIMATORS (range 100-1000 per S17 seeding) | ⚠️ DIFF in scale, not in mechanism |

Single load-bearing axis : #7 (valid_sets composition). This is the only behavioral difference that explains the best_iter=1 smoking gun observed in 53/53 LGB trials of the canary.
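One way to keep this comparison checkable is to encode each call surface as a plain dict and diff the two. The sketch below is illustrative only: the dict keys and the `diff_axes` helper are hypothetical names, the values are transcribed from the axes above, and the Dataset axes (5 and 6) are left out because they are not simple scalars.

```python
# Call surfaces transcribed from the comparison above (keys are illustrative labels).
LEGACY = {
    "objective": "binary",
    "metric": ("auc", "binary_logloss"),
    "first_metric_only": True,
    "scale_pos_weight": "n_neg / n_pos",
    "valid_names": ("train", "val"),
    "early_stopping_rounds": 50,
    "log_evaluation_period": 0,
    "num_boost_round": "200 (hardcoded)",
}
HARNESS = {
    "objective": "binary",
    "metric": ("auc", "binary_logloss"),
    "first_metric_only": True,
    "scale_pos_weight": "n_neg / n_pos",
    "valid_names": ("val",),
    "early_stopping_rounds": 50,
    "log_evaluation_period": 0,
    "num_boost_round": "PG CVN_HPO_LGB_5M_N_ESTIMATORS",
}


def diff_axes(legacy: dict, harness: dict) -> dict:
    """Return {axis: (legacy_value, harness_value)} for every axis that differs."""
    keys = sorted(legacy.keys() | harness.keys())
    return {k: (legacy.get(k), harness.get(k)) for k in keys if legacy.get(k) != harness.get(k)}


for axis, (old, new) in diff_axes(LEGACY, HARNESS).items():
    print(f"DIFF {axis}: {old!r} -> {new!r}")
```

Only valid_names (standing in for the valid_sets composition) and num_boost_round are reported, which matches the DIFF rows above.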

2.1 Other code-level differences (lower priority — present but DO NOT explain best_iter=1)

| # | Axis | Pre-#891 | Current harness | Verdict |
|---|---|---|---|---|
| 11 | HPO search range n_estimators | (150, 400) (hyperoptimizer.py:106) | (100, 1000) (PG CVN_HPO_LGB_5M_N_ESTIMATORS_RANGE_*) | ⚠️ wider, but irrelevant when training stops at iter 1 |
| 12 | HPO search range learning_rate | (0.08, 0.15) observed | (0.05, 0.15) (close but slightly wider lower bound) | ⚠️ minor : Step 1 capture shows trials at 0.06-0.15 still produce best_iter=1 |
| 13 | HPO search range num_leaves | (25, 80) | wider (PG-driven) | ⚠️ irrelevant when stopping at iter 1 |
| 14 | HPO search range min_child_weight | (1e-3, 1e-1) | not in current harness HPO surface (legacy-only param) | ⚠️ minor surface drift, no impact |
| 15 | Calibration code path | if config.calibration != "none": lgb.LGBMClassifier(...) + CalibratedClassifierCV(method="isotonic", cv=3) (lines 58-74) | post-training θ-sweep at the harness layer (no sklearn wrapper) | DIFF, but sklearn's LGBMClassifier.fit(eval_set=[(X_train, y_train), (X_val, y_val)], eval_names=["train", "val"]) ALSO included train in eval_set, so the early-stop semantics matched the non-calibration path. Both pre-#891 paths protected against early-stop-on-val-only. |
| 16 | Trainer wrapper | cvntrade_LightGBM_autonomous_trainer.py (313 lines, custom orchestration) | Hamilton DAG with lgb_booster_and_time node | DIFF in surface, MATCH in mathematical effect (both produce a Booster) |
| 17 | Min split gain | min_split_gain in HPO range (...) | not in harness HPO surface | minor surface drift |
| 18 | n_estimators default fallback | 100 (config dataclass default) → 200 (overridden in _screen_grid_point for grid search) | from PG (no default fallback in harness, fail-fast per ADR-90) | DIFF in mechanism, identical in PRACTICE (Optuna picks a value within the seeded range) |

None of these axes explain best_iter=1 consistently across 53/53 trials with widely different params. The HPO ranges (num_leaves, learning_rate, etc.) all influence WHICH params Optuna picks — but if the training mechanism (early stopping) shuts down at iter 1 regardless of those params, the search-space differences are inert.

The single load-bearing diff remains #7 (valid_sets composition).


3. Why valid_sets=[train, val] vs valid_sets=[val] matters

LightGBM's early_stopping callback semantics with first_metric_only=True + multiple valid_sets :

  • Per-tuple tracking : LGB tracks (valid_set, metric) improvement each iteration. With first_metric_only=True, only the FIRST metric (auc) is tracked → tuples are [(train, auc), (val, auc)] (legacy) vs [(val, auc)] (current).
  • Stop condition : early_stop fires when NONE of the tracked tuples improves for stopping_rounds consecutive iterations.

Pre-#891 dynamics (legacy)

  • (train, auc) improves at every iteration (boosting overfits the train set monotonically)
  • (val, auc) may saturate or fluctuate
  • The OR-condition (train improving) OR (val improving) evaluates True for every iteration → early_stop never fires
  • Booster trains until num_boost_round=200 → best_iter ≈ 100-200
  • Final best_iter is selected on val AUC's peak (which may be at iter 5, 10, 50…) — independently of when training stops

This is a band-aid: early stopping is effectively disabled, so the model trains to the configured ceiling. f1_buy ≈ 0.42 (the pre-#891 baseline) reflects this deeper training.

Post-#891 dynamics (harness, current)

  • Only (val, auc) is tracked
  • With imbalanced labels (18.3% positive) + scale_pos_weight=4.71 + small val sample (839 rows per canary log) :
      • val AUC peaks at iter 1 or 2 (the trivial ranking from a single weighted split)
      • subsequent iterations refine but val AUC doesn't strictly improve
  • After 50 idle rounds → early_stop fires → best_iter=1, current_iter=51
  • f1_buy ≈ 0.35 (the post-#891 regression) reflects this shallow training
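The arithmetic behind best_iter=1, current_iter=51 can be checked with a toy model of a patience rule on a single higher-is-better metric. This is a simplified sketch, not LightGBM's callback code: `simulate_patience` is a hypothetical helper and the AUC values are made up to mimic a curve that peaks on the first iteration.

```python
def simulate_patience(scores, stopping_rounds):
    """Return (best_iter, stop_iter), both 1-indexed, for a higher-is-better metric.

    Training stops at the first iteration that is `stopping_rounds` past the
    last strict improvement, or at the end of `scores` if that never happens.
    """
    best_score = float("-inf")
    best_iter = 0
    for it, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, it
        if it - best_iter >= stopping_rounds:
            return best_iter, it
    return best_iter, len(scores)


# Val AUC peaks at iteration 1, then never strictly improves.
saturated = [0.70] + [0.69] * 199
print(simulate_patience(saturated, stopping_rounds=50))   # (1, 51)

# A metric that keeps improving never triggers the rule.
improving = [0.50 + 0.001 * i for i in range(200)]
print(simulate_patience(improving, stopping_rounds=50))   # (200, 200)
```

With stopping_rounds=50, a curve that peaks at iteration 1 stops at iteration 51, which is the pattern reported for the canary trials.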

The harness is "more correct" in principle (val AUC is the canonical early-stop signal — train AUC monotonically improving is precisely WHY early stopping was invented). But it exposes a pre-existing val saturation issue that the legacy hack masked.

Impact on the 7 hypotheses (§5.1 dossier ranking)

| H | Hypothesis | Updated likelihood | Rationale |
|---|---|---|---|
| H2 | eval_set incorrect (train used as val) | VERY HIGH ↑↑ (was high) | Confirmed by code inspection. The bug is structural : harness drops train_set from valid_sets and inadvertently activates early-stop-on-val-only, which fires on iter 1 due to val AUC saturation. |
| H1 | Early stopping misconfigured | low ↓ (was highest) | The early_stopping config (stopping_rounds=50) is identical. The issue is the COMPOSITION of valid_sets fed to that config, not the config itself. |
| H6 | Objective / metric mismatch | low ↓ (was medium) | Verified identical : metric=["auc", "binary_logloss"] + first_metric_only=True in both. |
| H4 | Sample weights broken | low ↓ (was medium) | Verified identical : binary path sets scale_pos_weight = n_neg/n_pos in both, no sample_weight array. |
| H3 | Labels misaligned | low (unchanged) | Out of scope of this dossier (data prep upstream of train_weighted_variant). Step 3 will assert via dataset hash. |
| H5 | Feature matrix corrupted | low (unchanged) | Same as H3. |
| H7 | HPO override of critical params | low (unchanged) | Step 1 capture shows Optuna picks learning_rate ∈ [0.05, 0.15], n_estimators ∈ [100, 800] : wide but not pathological. The shallow-training pattern is observed across 53/53 trials with WIDELY DIFFERENT params, ruling out a single "bad" param combo. |

4. Concrete fix candidates (Step 4 + Step 5 will pick one)

Option A — restore the legacy hack (1-line change, fastest)

   booster = lgb.train(
       params=lgb_native_params,
       train_set=train_set,
       num_boost_round=n_estimators,
-      valid_sets=[val_set],
-      valid_names=["val"],
+      valid_sets=[train_set, val_set],
+      valid_names=["train", "val"],
       callbacks=[lgb.early_stopping(stopping_rounds=es_rounds), lgb.log_evaluation(period=0)],
   )

Pros : 1-line, expected to restore the pre-#891 baseline (f1_buy ≈ 0.42) ; Step 4 is what confirms it.

Cons : Re-introduces the "early stopping is effectively disabled" anti-pattern. Booster trains to num_boost_round regardless of val behavior, so there is overfitting risk if train and val distributions diverge (which is the WHOLE POINT of early stopping).

Option B — fix the val saturation root cause (deeper, slower)

The val AUC saturates at iter 1-2 because :

  • val sample is small (839 rows)
  • 18.3% positive class → ~154 positives in val
  • with scale_pos_weight=4.71, the trivial first-split classifier already nails most positives
  • subsequent iterations improve precision/recall but don't always shift AUC

Sub-options :

  • B1 : switch early-stop metric from auc to binary_logloss (ranking-insensitive, more granular)
  • B2 : enlarge val (raise val_size_pct from 15% to 20-25%)
  • B3 : use stratified val split (already the case ?) : verify
  • B4 : add min_data_in_leaf floor to prevent over-confident first-split
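The reasoning behind B1 is that AUC depends only on how the scores rank positives against negatives, whereas log loss also moves when the probabilities sharpen without reordering. A minimal pure-Python sketch of that difference follows; the labels and scores are made up for illustration, and `auc` and `logloss` are small hand-written helpers, not library calls.

```python
import math


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logloss(labels, probs):
    """Mean binary cross-entropy of predicted probabilities."""
    return -sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, probs)) / len(labels)


y = [0, 0, 1, 0, 1, 1]
early = [0.40, 0.45, 0.48, 0.50, 0.55, 0.60]   # timid probabilities
later = [0.10, 0.20, 0.35, 0.40, 0.80, 0.90]   # same ordering, sharper probabilities

print(auc(y, early), auc(y, later))            # identical: the ranking did not change
print(logloss(y, early), logloss(y, later))    # log loss drops as probabilities sharpen
```

An early-stop rule watching AUC sees no improvement between these two score vectors, while one watching binary_logloss does.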

Option C — adaptive early stopping (correct + portable)

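  • Watch BOTH (val, auc) and (val, binary_logloss) ; stop only when both saturate
  • Related to first_metric_only=False, but not equivalent on its own : with several metrics the built-in lgb.early_stopping callback checks each of them and stops once any one stalls for stopping_rounds, so "both saturate" needs a custom callback or a careful selection of secondary metric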
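The "stop only when both saturate" rule can be sketched as a small tracker that is independent of LightGBM. This is an illustration under assumptions: `AllMetricsStalled` is a hypothetical class, the metric curves are made up, and wiring it into lgb.train would need a custom callback that is not shown here. As far as the LightGBM callback documentation describes it, the built-in callback checks every metric it is given, so the all-must-stall rule is modelled separately.

```python
class AllMetricsStalled:
    """Signal a stop only once every watched metric has gone
    `stopping_rounds` iterations without setting a new best."""

    def __init__(self, stopping_rounds: int) -> None:
        self.stopping_rounds = stopping_rounds
        self.best: dict[str, float] = {}
        self.best_iter: dict[str, int] = {}

    def update(self, iteration: int, scores: dict[str, tuple[float, bool]]) -> bool:
        """scores maps metric name -> (value, higher_is_better). Returns True to stop."""
        for name, (value, higher_better) in scores.items():
            prev = self.best.get(name)
            improved = prev is None or (value > prev if higher_better else value < prev)
            if improved:
                self.best[name] = value
                self.best_iter[name] = iteration
        return all(iteration - self.best_iter[n] >= self.stopping_rounds for n in scores)


tracker = AllMetricsStalled(stopping_rounds=5)
stop_at = None
for i in range(50):
    scores = {
        "auc": (0.70 if i == 0 else 0.69, True),                 # peaks at iteration 0
        "binary_logloss": (0.60 - 0.01 * min(i, 9), False),      # improves until iteration 9
    }
    if tracker.update(i, scores):
        stop_at = i
        break

print(stop_at, tracker.best_iter)
```

In this toy run AUC stalls from iteration 0, but training continues until log loss has also been flat for 5 rounds, so the stop lands at iteration 14 instead of 5.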

Recommendation : start with Option A in Step 4 to confirm the diagnosis (restores baseline → confirms the smoking gun is correct), then assess Options B/C in S19 (the remediation Story).


5. Step 3 deliverables (next)

Per dossier §5.2 Step 3, the parity reproducer must :

  1. Load the SAME train/val/test splits from captured-fold-AAVEUSDC-3.parquet (Step 1 output)
  2. Train TWICE :
      • Path A (legacy) : lgb.train(valid_sets=[train_set, val_set], valid_names=["train", "val"]) with the params recorded in Step 1 trials
      • Path B (harness) : lgb.train(valid_sets=[val_set], valid_names=["val"]), same params
  3. Log per-iter (val, auc) + (val, binary_logloss) for both paths
  4. Assert :
      • Path A best_iter > 50 (deep training)
      • Path B best_iter ≤ 5 (shallow training, matching canary)
      • Path A f1_buy_val ≈ 0.40-0.45 (matching pre-#891 baseline)
      • Path B f1_buy_val ≈ 0.30-0.40 (matching canary)

If all assertions hold → smoking gun confirmed deterministically → Step 5 final dossier closes with H2 verdict + Option A 1-line fix recommendation for S19.

If Path A also produces best_iter=1 → the diagnosis is wrong, escalate to H3-H5 (data path) or H7 (HPO) — but this is unlikely given the code-inspection certainty.
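The two-path run described in this section could look roughly like the sketch below. It is not the project's reproducer: `path_kwargs` and `run_parity` are hypothetical names, loading captured-fold-AAVEUSDC-3.parquet is left to the caller because its column layout is not given here, and `params` is assumed to be one parameter dict from the Step 1 trials. The sketch records what each path does rather than presuming which one stops earlier. That is deliberate: LightGBM's documentation for lgb.early_stopping states that the training data is ignored when the stop condition is checked, so the Path A expectation is worth confirming on the installed version.

```python
from typing import Any


def path_kwargs(train_set: Any, val_set: Any) -> dict[str, dict[str, list]]:
    """Return the valid_sets / valid_names pair for each Step 3 path."""
    return {
        "A_legacy": {"valid_sets": [train_set, val_set], "valid_names": ["train", "val"]},
        "B_harness": {"valid_sets": [val_set], "valid_names": ["val"]},
    }


def run_parity(x_train, y_train, x_val, y_val, params: dict,
               num_boost_round: int, es_rounds: int = 50) -> dict[str, dict[str, Any]]:
    """Train once per path on identical data and params; record what each path did."""
    import lightgbm as lgb  # imported lazily so path_kwargs works without LightGBM

    results: dict[str, dict[str, Any]] = {}
    for name in ("A_legacy", "B_harness"):
        # Fresh Dataset objects per path so nothing is shared between runs.
        train_set = lgb.Dataset(x_train, label=y_train)
        val_set = lgb.Dataset(x_val, label=y_val, reference=train_set)
        history: dict = {}
        booster = lgb.train(
            params=dict(params),
            train_set=train_set,
            num_boost_round=num_boost_round,
            callbacks=[
                lgb.early_stopping(stopping_rounds=es_rounds),
                lgb.log_evaluation(period=0),
                lgb.record_evaluation(history),  # fills history[name][metric]
            ],
            **path_kwargs(train_set, val_set)[name],
        )
        results[name] = {
            "best_iter": booster.best_iteration,
            "trained_iters": booster.current_iteration(),
            "val_curves": history["val"],  # metric name -> per-iteration values
        }
    return results
```

Comparing results["A_legacy"]["best_iter"] with results["B_harness"]["best_iter"] for each Step 1 trial, together with the recorded val curves, gives the evidence the assertions above ask for.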


6. Source archive

Pre-#891 + post-#891-pre-#899 source files are extracted to /tmp/s18_pre891/ and /tmp/s18_legacy/ on the operator's machine for the duration of S18. They will be uploaded to s3://cvntrade-artifacts/s18/ if Step 3 needs longer-term reference.

Key files :

  • dc3d86c6^:src/training/LightGBM/cvntrade_LightGBM_trainer.py : pre-#891 LGB training (the load-bearing diff is in train method lines 78-89)
  • e75418ca^:src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:26-275 : train_with_fixed_params_lgbm (the Bug #6 fix that already had valid_sets=[val] only, which confirms the regression is documented)
  • e884fcd3:src/training/harness/dags/models/lightgbm_dag.py:130-176 : current harness lgb_booster_and_time


7. Sign-off

  • Pre-#891 LGB code extracted via git show dc3d86c6^:...
  • Side-by-side comparison on 10 axes (params + Dataset + train) — single DIFF on valid_sets
  • Hypothesis ranking updated : H2 promoted from high to VERY HIGH
  • 3 fix-candidate options documented (A : 1-line restore, B : root cause, C : adaptive)
  • Step 3 specification (parity reproducer) drafted with concrete assertions
  • Operator review of this dossier
  • Step 3 implementation