S18 Step 5 — Re-scope dossier (post-NO_DIVERGENCE)
Status : committee experiment_review PASSED_WITH_REVISIONS 2026-05-14 — see §9 below.
Story : CVN-N001-EE-S18 (OP wp#154, Plan dossier)
Trigger : Step 4 (PR #937) verdict = NO_DIVERGENCE on AAVEUSDC fold=3, plus Phase A log evidence revealing the actual regression mechanism
Author : Claude Opus 4.7 (under operator review)
Date : 2026-05-14
1. Executive summary (1 paragraph)
The S18 diagnostic chain (Steps 0 → 4) ruled out all 7 hypotheses H1-H7 of the parent dossier §5.1 : training-loop config (H1, H6), valid_sets composition (H2), label misalignment (H3), sample weights (H4), feature corruption (H5), HPO param overrides (H7) ; all PASS. The captured fold's data is clean, so the regression lies outside the captured parquet, in the harness's post-training decision logic. Phase A logs from the chained-DAG run of 2026-05-14 14:44 reveal the mechanism : scale_pos_weight=4.71 (auto-injected by the harness ; the pre-#891 prod path defaulted to 1.0 and never applied it) inflates positive-class probabilities → the harness θ-sweep over [0.05, 0.95] (19 candidates ; pre-#891 Optuna range [0.30, 0.40]) picks θ=0.2 → 46 % buy rate on val → 1210 raw signals → 251 final trades after the concurrency filter → backtest sortino -9.5, return -91.35 %. The "best_iter=1 shallow training" symptom is a side effect of scale_pos_weight saturating val AUC at iteration 1, NOT the bug locus. The re-scoped hypothesis is H8 = scale_pos_weight auto-injection × wide θ-sweep range coupling. This dossier asks the committee to validate H8 and recommend the S19 fix posture (6 options, A-F, in §6).
2. The chain of evidence (Steps 0-4 + Phase A)
2.1 Step 0 (PR #931 merged) — staging replay
Confirmed reproducibility on staging. AAVEUSDC fold=3 Phase A produces f1_buy_val=0.3520485, numerically identical to the canonical post-S17 canary within tolerance (abs_delta=8e-09 << epsilon=0.005). Verdict : PASS.
2.2 Step 1 (PR #932 merged) — verbose capture
Monkey-patched lgb.train + lgb.Dataset in the running harness ; serialized to parquet. Captured 51 Optuna trials × per-iter eval traces. n_eval_series=2 per trial = 1 valid_set × 2 metrics (auc + binary_logloss).
2.3 Step 2 (PR #933 merged) — git archaeology + side-by-side
Identified valid_sets=[train_set, val_set] (pre-#891) vs valid_sets=[val_set] (harness post-#891) as smoking-gun candidate. Status : analyzed but not the regression locus per Step 3 verdict.
2.4 Step 3 (PR #934 merged) — parity reproducer
Ran both valid_sets configurations on the captured fold with GRID_DEFAULT_HP_LGB. Verdict : REFUTED — legacy_best_iter=1, harness_best_iter=1. Both paths produce the same shallow training. H2 ruled out.
2.5 Loki forensics (2026-05-14 14:44 chained DAG run)
Histogram of 51 Optuna trials in Phase A :
51 distinct hyperparameter combinations + 2 valid_sets configs (Step 3) = 53 distinct configurations, all early-stopping at iter 1-2. H1 (early stopping config), H6 (metric mismatch), H7 (HPO param override) ruled out — none would affect 53 trainings uniformly with the same outcome.
2.6 Step 4 (PR #937 merged) — data forensics
Six checks on the captured fold (run 2026-05-14 17:47-18:14) :
| Check | Status | Key metric |
|---|---|---|
| F1 label-index alignment | PASS | train_index_match=true, val_index_match=true |
| F2 label distribution | PASS | train_pos_rate=0.175, val_pos_rate=0.169 (Δ < 5 %) |
| F3 NaN ratio | PASS | max_train_nan_ratio < 0.5, max_val < 0.5 |
| F4 feature-label correlation | PASS | max_abs_corr < 0.95 (no leak feature) |
| F5 train vs val drift | PASS | max_sigma_shift < 3.0 (no scaling drift) |
| F6 iter-1 single-tree probe | PASS | iter1_train_auc≈0.78, iter1_val_auc≈0.49, no leak / no misalignment / proba distribution healthy |
Verdict : NO_DIVERGENCE → bug is upstream of the captured parquet → escalate to committee, re-scope §5.1 hypothesis space.
2.7 Phase A log evidence (the smoking gun re-scoped)
event=training_complete model_type=lightgbm best_iteration=1 training_time_sec=2.465
theta_picked=0.2 f1_buy_val=0.352 auc_buy_val=0.6461
brier_val=0.1395 ece_val=0.0591 rate_buy_val=0.4611
event=signal_funnel raw_buy_signals=1210 final_trades=251
primary_killer=concurrency
event=weighted_variant_evaluated sortino=-9.512 n_trades=251 return=-91.35%
The model is NOT broken at the booster level :
- auc_buy_val=0.6461 — the booster CAN rank positives (above-random)
- f1_buy_val=0.352 — at the harness-chosen θ=0.2, f1 is reasonable
- brier_val=0.1395, ece_val=0.0591 — calibration is acceptable
The model IS broken at the trading level :
- rate_buy_val=0.4611 — the model emits BUY 46 % of the time on val
- 1210 raw signals → 251 final trades (only the concurrency filter saves us from disaster)
- Backtest sortino -9.5, return -91.35 %
The actual regression is the post-training θ selection, NOT the training loop.
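The buy rate and f1 above also pin down the confusion counts at θ=0.2. The following is a minimal sketch, assuming both metrics were computed on the same 2093-row validation split reported in §6.bis (353 positives + 1740 negatives) ; the helper name is illustrative and not part of the harness. It uses the identity F1 = 2·TP / (n_pos + n_pred_pos).

```python
def implied_confusion(n_pos, n_neg, rate_buy, f1_buy):
    """Recover (predicted positives, TP, precision, recall) from rate and F1.

    Uses F1 = 2*TP / (n_pos + n_pred_pos), which follows from
    F1 = 2*TP / (2*TP + FP + FN) with FP = n_pred_pos - TP and FN = n_pos - TP.
    """
    n_val = n_pos + n_neg
    n_pred_pos = round(rate_buy * n_val)
    tp = round(f1_buy * (n_pos + n_pred_pos) / 2)
    precision = tp / n_pred_pos
    recall = tp / n_pos
    return n_pred_pos, tp, precision, recall


# Inputs: Phase A log values plus the val composition (assumed to be the same split).
n_pred_pos, tp, precision, recall = implied_confusion(
    n_pos=353, n_neg=1740, rate_buy=0.4611, f1_buy=0.3520485
)
print(n_pred_pos, tp, round(precision, 3), round(recall, 3))

# Share of raw BUY signals removed by the concurrency filter (1210 -> 251).
print(round(1 - 251 / 1210, 3))
```

Under that assumption, θ=0.2 flags about 965 validation candles of which about 232 are true positives : precision ≈ 0.24 against a positive base rate of ≈ 0.17, recall ≈ 0.66. The concurrency filter then removes roughly 79 % of the raw signals.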
3. Diff legacy vs harness — six axes
Sources :
- Pre-#891 prod path : /tmp/s18_pre891/lgbm_*.py (vendored read-only copy of the LGB trainer at dc3d86c6^)
- Post-#891 harness path : src/training/harness/{nodes,dags/models}/
- Post-#891 legacy-style path : src/training/LightGBM/cvntrade_LightGBM_grid_utils.py
| # | Axis | Pre-#891 prod | Post-#891 harness | Effect |
|---|---|---|---|---|
| D1 | scale_pos_weight injection | LightGBMConfig.scale_pos_weight: float = 1.0 (default ; never set by the autonomous trainer ; only added to lgb.train params if != 1.0) → NOT applied in production | class_balance.compute_class_balance(binary=True) → scale_pos_weight = n_neg / n_pos → ALWAYS applied (= 4.7119 on AAVEUSDC fold=3) | Inflates positive-class probas downstream |
| D2 | θ selection mechanism | Optuna suggests threshold_buy as ONE of the trial params ; jointly optimised with model HPs | θ-sweep is a post-hoc separate pass : Optuna picks model HPs only, then theta_sweep.pick_threshold_on_val() runs after the final retrain | Decouples θ from model-HP joint optimum |
| D3 | θ search range | threshold_buy_range: Tuple[float, float] = (0.30, 0.40) (lgbm_config.py:117) | 19 candidates np.linspace(0.05, 0.95, 19) + [0.5] ; argmax over f1_buy on val (theta_sweep.py:53) | Lets θ go arbitrarily low (picks 0.2 in our run, outside the legacy bounds) |
| D4 | valid_sets composition | [train_data, val_data] + valid_names=["train", "val"] | [val_set] + valid_names=["val"] (lightgbm_dag.py:171) | Step 3 ruled this out as a dominant cause (both produce best_iter=1 on the captured fold under GRID_DEFAULT_HP_LGB) — but still a real diff |
| D5 | Eval metric / early stopping | lgb_params does NOT explicitly set metric → LGB default for objective=binary is binary_logloss | metric=["auc", "binary_logloss"] + first_metric_only=True → AUC drives early stop (PR #872 fix per lgbm_grid_utils.py:53-65 Bug #6 comment) | Switch to AUC was a Bug #6 fix ; ironically, with scale_pos_weight=4.71, AUC saturates at iter 1 anyway |
| D6 | Calibration | Optuna suggests calibration ∈ {none, sigmoid, isotonic} jointly with HPs | theta_sweep.py docstring claims "always-on calibration node (committee enhancement #2)" — but calibration is NOT applied in our Phase A run (no event=calibration_* in Loki ; ECE 0.059 is the raw model's) | Pre-#891 explored a 3-way calibration choice ; post-#891 effectively skips it |
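As background for the Effect column of D1 : for a binary log-loss in which positives carry weight w = scale_pos_weight, the loss-minimising output q at a point whose true positive probability is p is shifted upward. This is a standard property of weighted cross-entropy, stated here as context rather than as something measured in this run. It describes the converged optimum, so a booster stopped at best_iteration=1 will only be partway there.

$$
\mathcal{L}(q) = -\,w\,p\,\log q \;-\; (1-p)\,\log(1-q),
\qquad
\frac{\partial \mathcal{L}}{\partial q} = 0
\;\Longleftrightarrow\;
q^{*} = \frac{w\,p}{w\,p + (1-p)}
$$

With w = n_neg / n_pos and p equal to the base rate n_pos / (n_pos + n_neg), both terms in the denominator equal n_neg / (n_pos + n_neg), so q* = 1/2 : at convergence the auto-injected weight moves the average prediction from the ≈ 0.175 base rate toward 0.5.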
4. Re-scoped hypothesis — H8
H8 = scale_pos_weight auto-injection (D1) × wide θ-sweep range (D2 + D3) coupling produces a 46 % buy rate → catastrophic backtest, even though the booster's f1_val and AUC scores look acceptable.
H8 is the post-Step 4 escape hatch the parent dossier §7.1 explicitly anticipated : "if §5.3 invariants 5.3.1–5.3.3 all pass without divergence, STOP at end of Step 2 and re-evaluate hypothesis space §5.1 before continuing. Don't chase a phantom into Step 4." — Step 4 confirmed the data is clean ; the bug is in the harness's post-training decision logic, not in any of H1-H7.
4.1 Mechanism (chain of causation)
1. class_balance.py:62 → scale_pos_weight = n_neg/n_pos = 4.7119
2. lgb.train trains the booster with scale_pos_weight active
   → positive-class probas inflated (most probas in [0.10, 0.30])
3. AUC saturates at iter 1 (booster ranks positives well immediately)
   → early_stopping(50) fires after 50 idle iters → best_iter = 1
4. theta_sweep.pick_threshold_on_val() runs on inflated probas
   → finds best f1_buy at θ = 0.20 (because that's where the inflated
     probas concentrate around the decision boundary)
5. At θ=0.20 : 46% of val candles flagged BUY (rate_buy_val=0.4611)
6. Inference path emits 1210 raw BUY signals on test
7. concurrency filter rejects 959 (the only saving grace)
8. 251 trades execute → catastrophic backtest
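Steps 4-5 of the chain can be illustrated with a toy argmax-F1 sweep. This is a sketch on made-up scores, not the harness's theta_sweep.pick_threshold_on_val ; the function names, the `>=` comparison and the first-wins tie-breaking are all assumptions.

```python
def f1_buy(y_true, proba, theta):
    """F1 of the BUY class when BUY is predicted for proba >= theta."""
    pred = [p >= theta for p in proba]
    tp = sum(1 for y, b in zip(y_true, pred) if b and y == 1)
    denom = sum(y_true) + sum(pred)  # n_pos + n_pred_pos
    return 2 * tp / denom if denom else 0.0


def pick_threshold(y_true, proba, candidates):
    """Return (theta, f1, buy_rate) maximising F1 over the candidate grid."""
    best = max(candidates, key=lambda t: f1_buy(y_true, proba, t))
    rate = sum(p >= best for p in proba) / len(proba)
    return best, f1_buy(y_true, proba, best), rate


# Synthetic validation scores (illustrative only): 5 positives, 15 negatives,
# with every probability bunched below 0.30.
pos = [0.28, 0.26, 0.24, 0.22, 0.18]
neg = [0.27, 0.23, 0.21, 0.19, 0.17, 0.16, 0.14, 0.13,
       0.12, 0.12, 0.11, 0.11, 0.09, 0.09, 0.09]
y_val = [1] * len(pos) + [0] * len(neg)
p_val = pos + neg

wide_grid = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95, 19 values
legacy_grid = [0.30, 0.35, 0.40]                        # inside the legacy range

print(pick_threshold(y_val, p_val, wide_grid))
print(pick_threshold(y_val, p_val, legacy_grid))
```

On these toy scores the wide grid settles on θ=0.2 with a 35 % buy rate, because that is where F1 peaks when the scores are compressed into a narrow low band. A grid confined to [0.30, 0.40] finds no score above threshold at all, which is the f1 cost that a hard range clip can carry.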
4.2 Why pre-#891 didn't show this
1. LightGBMConfig.scale_pos_weight = 1.0 (default, never overridden)
   → no class rebalancing → balanced probas
2. lgb.train with valid_sets=[train, val] + binary_logloss default
   → train AUC always-improving keeps early stop from firing
   → trains 50-200 iters → final AUC 0.7-0.8
3. Optuna jointly suggests threshold_buy ∈ [0.30, 0.40]
   → θ stays in the conservative range
4. At θ=0.40 : far fewer val candles cross threshold → moderate buy rate
5. Reasonable trade count → backtest survives
6. f1_buy_val ≈ 0.42 (the canonical pre-#891 reference)
5. Critical files involved (for operator audit)
Pre-#891 (vendored at /tmp/s18_pre891/, no longer on main)
| File | Lines | What to verify |
|---|---|---|
| lgbm_autonomous_trainer.py | 19 K | Confirm it never sets scale_pos_weight ; calls CVNTrade_LightGBMTrainer.train() |
| lgbm_trainer.py | 13 K | Lines 75-89 : lgb.train(params=lgb_params, valid_sets=[train_data, val_data]) ; params = config.to_lgb_params() |
| lgbm_config.py | 5 K | L31 scale_pos_weight: float = 1.0 ; L83-84 if self.scale_pos_weight != 1.0: params["scale_pos_weight"] = ... ; L117 threshold_buy_range = (0.30, 0.40) |
| lgbm_hyperoptimizer.py | 16 K | L217 + L239 confirms Optuna suggests threshold_buy as a trial param within threshold_buy_range |
Post-#891 harness (current main, src/training/harness/)
| File | Lines | What to verify |
|---|---|---|
| nodes/class_balance.py | 79 | L62 always returns scale_pos_weight = n_neg / n_pos for binary ; never opt-in |
| nodes/theta_sweep.py | 64 | L53 candidates = np.linspace(0.05, 0.95, 19) + [0.5] (NOT bounded to legacy [0.30, 0.40]) |
| nodes/hpo_optuna.py | — | Confirms Optuna doesn't suggest threshold_buy (delegated to theta_sweep node) |
| dags/models/lightgbm_dag.py | 312 | L171 valid_sets=[val_set] ; L260-296 _hpo_space lists 9 model HPs, NO threshold_buy |
| adapters/lgb.py | — | Inference-side wrapper for lgb.Booster ; predict_proba. NOT the scale_pos_weight injection site (that lives in class_balance.py + lightgbm_dag.py). The adapter is the locus of Bug 3 (DataFrame column-order erasure) — a separate concern from H8. (Corrected 2026-05-14 post-committee.) |
| contracts.py | — | ClassBalance dataclass ; TrainedArtifact carries the picked threshold |
Post-#891 legacy-style on main (src/training/LightGBM/)
| File | What to verify |
|---|---|
| cvntrade_LightGBM_grid_utils.py | L86-89 also auto-sets scale_pos_weight = n_neg / n_pos ; L53-65 comment explains Bug #6 (binary_logloss + scale_pos_weight saturating early stop) |
| cvntrade_LightGBM_autonomous_trainer.py | Routes between legacy + harness path via CVN_USE_HARNESS |
Diagnostic infrastructure (already in place from Steps 0-4)
src/commun/finetune/diagnostic/s18_step0_replay.py # PR #931
src/commun/finetune/diagnostic/s18_step1_capture.py # PR #932
src/commun/finetune/diagnostic/s18_step3_parity.py # PR #934
src/commun/finetune/diagnostic/s18_step4_invariants.py # PR #937
dags/dag_diagnostic__s18_step0.py
dags/dag_diagnostic__s18_step1_3_chain.py # PR #935
dags/dag_diagnostic__s18_step1_4_chain.py # PR #937
documentation/reviews/2026-05-13-...-s18-...-plan.md # parent dossier
documentation/missions/cvn-n001-ee-s18-diagnostic/ # Step 2/4/5 dossiers
Reference / canary
- PG finetune_results : (run_id=ftf_*, crypto=AAVEUSDC, fold_id=3, variant=lightgbm) → f1_buy=0.3520485 (post-#891 canonical)
- The pre-#891 reference f1_buy ≈ 0.42 cited in the parent dossier §3 — its source needs to be re-confirmed (likely from a pre-S17 FTF run or from an earlier sweep ; the DB row may have been overwritten by post-S17 runs)
6. Decision matrix — S19 fix options
| Option | Change | Files touched | Pros | Cons | Risk |
|---|---|---|---|---|---|
| A — Disable scale_pos_weight (revert to legacy 1.0) | class_balance.py:62 → return scale_pos_weight=1.0 (or skip the field entirely) for the binary branch | 1 file, ~3 lines | Restores pre-#891 proba distribution → θ-sweep picks higher θ → fewer trades, healthier backtest | May lose the f1 lift that scale_pos_weight provides on rare-positive distributions ; XGB and CB are also affected (their adapters consume the same ClassBalance) | Medium — XGB / CB regression possible |
| B — Restrict θ-sweep range to legacy [0.30, 0.40] | theta_sweep.py:53 → narrow np.linspace(0.30, 0.40, ...) | 1 file, 1 line | Forces conservative θ even if probas are inflated → reasonable trade rate | With scale_pos_weight still active, the bound clip will produce a θ that wastes f1 (sub-optimal for the inflated proba distribution) ; XGB doesn't use theta_sweep so unaffected | Low — narrow surface |
| C — A + B both | class_balance.py + theta_sweep.py | 2 files, ~4 lines | Best mimics legacy ; safest restore | Two simultaneous changes → harder to attribute the recovery ; doubles the diff to verify | Medium |
| D — Add scale_pos_weight to Optuna search | lightgbm_dag.py:_hpo_space() add scale_pos_weight: trial.suggest_float(...) over [1.0, 5.0] ; class_balance.py reads from params instead of computing | 2 files, ~10 lines | Principled — let Optuna jointly optimize scale_pos_weight × model HPs | Multiplies HPO search space ; longer per-trial ; Optuna will likely converge to the same n_neg/n_pos value because that's the f1_val maximizer (defeats the purpose) | High — does NOT address the root cause (f1_val maximization on val drives the bad outcome regardless) |
| E — Decouple θ-sweep optimization from f1_val | Replace theta_sweep.pick_threshold_on_val(... f1_buy) with sortino-on-walkforward-backtest or Brier-score or a regularized f1 (penalty on rate_buy > 0.20) | theta_sweep.py + new node | Addresses the actual symptom (θ that wins f1 wins the wrong objective) | Major design change ; backtest in θ-sweep is expensive ; needs S20+ scope | Out-of-scope for S19 |
| F — A+B+C + 3 latent bug fixes (BUNDLED) | Pick A / B / C as the H8 fix posture + Bug 1 (class_balance n_neg guard) + Bug 2 (labels=[0,1] in both files) + Bug 3 (lgb.py preserve feature_names) | class_balance.py + theta_sweep.py + eval_metrics.py + adapters/lgb.py + (per A/B/C pick) | Same PR surface for related defensive fixes ; zero behavioural change on healthy paths ; closes 3 latent prod risks for free | Mixes H8 fix (behavioural) with bug fixes (defensive) — committee may prefer separate PRs for clean revert envelope | Low — each fix is independently small + tested |
6.bis Latent bugs found during code audit (operator review 2026-05-14)
A targeted re-read of class_balance.py + theta_sweep.py + eval_metrics.py + adapters/lgb.py surfaced 3 latent bugs that do NOT manifest on AAVEUSDC fold=3 (so they don't explain our specific regression) but WILL fire on degenerate-split folds or drifting feature schemas in the FTF sweep, and may contribute to the broader regression cited in the parent dossier §3 (e.g. XGB canary f1_buy=0.089).
Bug 1 — class_balance.py:55-62, missing guard on n_neg == 0
n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
if n_pos == 0:
    raise ValueError(...)  # only this branch fail-fasts
return ClassBalance(scale_pos_weight=n_neg / n_pos, ...)
When n_neg == 0 (train split is 100 % BUY), the function silently returns scale_pos_weight = 0 / n_pos = 0.0 → LGB / XGB train with the positive class weighted at zero, on a split that contains only positives → barely-trained booster, no exception raised. Per ADR-25 (no silent fallback), this MUST fail-fast.
Patch :
if n_pos == 0 or n_neg == 0:
    raise ValueError(
        f"compute_class_balance: degenerate training labels — "
        f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split."
    )
Does it fire on AAVEUSDC fold=3 ? No — n_pos=1725, n_neg=8128, both strictly positive.
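A self-contained sketch of the patched guard, runnable without the harness. The ClassBalance fields shown here are a hypothetical stand-in for the real dataclass in contracts.py, and the function takes a plain sequence rather than the array type the harness uses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClassBalance:
    """Hypothetical stand-in for the harness dataclass in contracts.py."""
    n_pos: int
    n_neg: int
    scale_pos_weight: float


def compute_class_balance(y: Sequence[int]) -> ClassBalance:
    """Binary class balance that fails fast on single-class (or empty) labels."""
    n_pos = sum(1 for v in y if v == 1)
    n_neg = sum(1 for v in y if v == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError(
            "compute_class_balance: degenerate training labels "
            f"(n_pos={n_pos}, n_neg={n_neg})"
        )
    return ClassBalance(n_pos, n_neg, n_neg / n_pos)


# AAVEUSDC fold=3 train counts quoted above.
cb = compute_class_balance([1] * 1725 + [0] * 8128)
print(round(cb.scale_pos_weight, 4))
```

With the fold=3 counts this reproduces the 4.7119 weight quoted in §3, while an all-BUY or all-non-BUY split raises instead of returning 0.0.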
Bug 2 — theta_sweep.py:59 + eval_metrics.py:69, missing labels=[0, 1]
_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
f1_buy = float(f1[1]) if len(f1) > 1 else 0.0
sklearn's precision_recall_fscore_support(..., average=None) without labels= returns metrics indexed by set(y_true) ∪ set(y_pred) sorted ascending. When that union has only label 1 (e.g. y_true = y_pred = [1, 1, 1]), it returns f1 = [1.0] of length 1 → the len(f1) > 1 else 0.0 guard returns 0.0 instead of the true 1.0. Optuna then optimises on broken signal.
Patch (both files) :
_, _, f1, _ = precision_recall_fscore_support(
    y_true_or_y, y_pred, labels=[0, 1], average=None, zero_division=0,
)
f1_buy = float(f1[1])  # always safe with explicit labels
Does it fire on AAVEUSDC fold=3 ? No — y_val has both classes (353 pos + 1740 neg), so the union always contains {0, 1} regardless of y_pred's content → len(f1) == 2 → f1[1] correctly indexes BUY.
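The indexing pitfall can be reproduced without sklearn. The helper below is a pure-Python stand-in that mimics the label-union behaviour described above (it is not the sklearn function itself), so the positional and explicit-label patterns can be compared side by side ; all names are illustrative.

```python
def per_label_f1(y_true, y_pred, labels=None):
    """Per-label F1, returning 0.0 when a label has no true or predicted rows.

    When labels is None, the label set is the sorted union of the labels seen
    in y_true and y_pred, so the result length depends on the data.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        n_true = sum(1 for t in y_true if t == lab)
        n_pred = sum(1 for p in y_pred if p == lab)
        denom = n_true + n_pred
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def f1_buy_positional(y_true, y_pred):
    """Current pattern: index 1 of an array whose length depends on the data."""
    f1 = per_label_f1(y_true, y_pred)
    return f1[1] if len(f1) > 1 else 0.0


def f1_buy_explicit(y_true, y_pred):
    """Patched pattern: labels pinned to [0, 1], so index 1 is always BUY."""
    return per_label_f1(y_true, y_pred, labels=[0, 1])[1]


all_buy = [1, 1, 1]
print(f1_buy_positional(all_buy, all_buy), f1_buy_explicit(all_buy, all_buy))
```

On the all-BUY input the positional pattern reports 0.0 and the explicit-label pattern reports 1.0 ; on any split where both classes appear the two agree.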
Bug 3 — adapters/lgb.py:42-43, DataFrame → ndarray strips column order
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    # PR #900 — DataFrame-native at inference, ndarray at training.
    if isinstance(x, pd.DataFrame):
        x = x.to_numpy()  # ← strips column names
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    ...
The docstring claims "DataFrame-native" but the implementation is the opposite — .to_numpy() reduces the input to a positional ndarray. lgb.Booster.predict(ndarray) does NOT validate feature names ; if the DataFrame's column order at inference differs from the training matrix, LightGBM silently predicts on the WRONG features. No exception, no log, no metric anomaly — just silently corrupted probabilities downstream.
Patch (minimal — preserve columns + reorder if needed) :
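def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        # Assumes the booster was trained with real feature names. A booster
        # trained on a bare ndarray reports Column_0, Column_1, ... here, and
        # the reorder below would then raise KeyError ; verify before merging.
        expected = list(self._native.feature_name())
        if list(x.columns) != expected:
            x = x[expected]  # reorder to training-time schema
        # keep as DataFrame ; lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        # binary objective returns P(class 1) only ; expand to two columns
        return np.column_stack([1.0 - proba, proba])
    return proba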
Does it fire on AAVEUSDC fold=3 ? Probably not at training time (the harness trains on the same ndarray it later predicts on for θ-sweep + eval — order is consistent by construction). Possibly at backtest time : CVNTrade_BacktestEngine calls adapter.predict_proba candle-by-candle on the live FE-pipeline DataFrame. If column ordering of that DataFrame ever diverges from the training cache's order (e.g. after a FE pipeline cache rebuild, or after a feature-selection set update), all backtest predictions are silently corrupted. Our observed auc_buy_val=0.65 rules out total feature scrambling on THIS run (random predictions would give AUC ~0.5), but Bug 3 plausibly drives any feature-order-sensitive cross-crypto / live regression cited in the parent dossier §3.
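The failure mode is easy to reproduce in miniature. The sketch below uses a toy positional scorer in place of a booster and hypothetical feature names ; it only illustrates why reordering by name before a positional predict matters, and is not the adapter code.

```python
def score_positional(weights, row):
    """Toy stand-in for a model that only sees feature positions."""
    return sum(w * v for w, v in zip(weights, row))


def reorder_to_schema(columns, row, expected):
    """Return the row values in training-time column order.

    Raises KeyError if an expected column is missing (fail-fast).
    """
    by_name = dict(zip(columns, row))
    return [by_name[name] for name in expected]


expected = ["rsi", "volume"]   # hypothetical training-time schema
weights = [1.0, 0.0]           # toy model: only "rsi" matters

columns = ["volume", "rsi"]    # same features, drifted order
row = [500.0, 0.3]

# Positional scoring on the drifted row: no error, wrong feature used.
print(score_positional(weights, row))
# Reordered by name first: the intended feature is scored.
print(score_positional(weights, reorder_to_schema(columns, row, expected)))
```

The first call returns the volume value (500.0) as if it were the rsi score and raises nothing ; the second returns 0.3. A missing column surfaces as a KeyError rather than a silent shift.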
Operator severity ranking (2026-05-14 audit)
1. class_balance.py — scale_pos_weight=0 on single-class train : silently destructive, can pass through CI undetected on rare-pos cryptos.
2. adapters/lgb.py — DataFrame column-order erasure : silently wrong predictions in any path where the DataFrame schema can drift (backtest, live inference, walk-forward predictor).
3. theta_sweep.py / eval_metrics.py — missing labels=[0, 1] : silently wrong metrics on mono-class splits → Optuna optimises on garbage.
Why all 3 matter even though our specific run is unaffected
- Cross-fold risk : the FTF sweep cited in the parent dossier §3 covered defi_top5 × 5m × multiple folds × 3 models. SOME of those folds may have produced a degenerate train (all-BUY) or a degenerate val (all-one-class). Bug 1 silently broke any such training ; Bug 2 returned 0.0 f1_buy that contaminated Optuna's score → cascading bad hyperparam choices ; Bug 3 may have corrupted cross-crypto inference when feature schemas diverged.
- Plausibility on the XGB canary : XGB stays at fixed θ=0.5 (no θ-sweep) — but XGB's autonomous trainer also calls eval_metrics.evaluate_split_binary which has Bug 2. If XGB's prediction at θ=0.5 produced a degenerate y_pred, the reported f1_buy=0.089 mean across folds could be a Bug 2 measurement artefact superimposed on the H8 mechanism.
- Defensive cost is zero-to-small : Bug 1 patch is 1 line ; Bug 2 patch is 1 line per file ; Bug 3 patch is ~5 lines + a column-reorder. All fail-fast / defensive, no behaviour change on currently-healthy paths, fully consistent with ADR-25.
- Other adapters audited (2026-05-14) :
  - adapters/xgb.py:57 — passes feature names via xgb.DMatrix(..., feature_names=list(x.columns)) → DMatrix validates column order against training feature_names → Bug 3 does NOT apply to XGB.
  - adapters/cb.py:37 — calls self._native.predict_proba(x) directly on the CatBoost classifier ; CatBoost's sklearn-compatible API handles DataFrames natively + validates column order → Bug 3 does NOT apply to CB.
  - Bug 3 is LGB-only. Patch surface : single file adapters/lgb.py.
7. Questions for the committee (experiment_review)
- Validate H8 : do the experts agree the re-scoped hypothesis (D1 × D3 coupling driving over-trading) is the dominant locus on AAVEUSDC fold=3, or are there hidden axes we missed (D5, D6, calibration node behaviour, walk-forward predict path) ?
- Validate the 3 latent bugs found by operator audit (Bug 1 / Bug 2 / Bug 3 in §6.bis) : do the experts agree each is a real defect ? Should S19 close all 3 in the same PR as the H8 fix (Option F bundle), or split them into a separate defensive-hardening PR with its own CR cycle ? Operator severity ranking : Bug 1 > Bug 3 > Bug 2.
- Source of pre-#891 reference f1≈0.42 : the parent dossier cites this number ; can we recover the exact PG row / Loki window that produced it, or is it extrapolated ? If extrapolated, should we re-establish a pre-#891 canary by running train_with_fixed_params_lgbm with scale_pos_weight=1.0 + θ=0.4 on the captured fold to get an empirical baseline ?
- S19 fix posture for H8 : which of A / B / C is least risky for production ? Option D (add to HPO) is rejected by us as theoretically self-defeating ; Option E is out-of-scope ; Option F bundles A/B/C with the 3 bug fixes.
- XGB / CB regression risk under Option A : disabling scale_pos_weight in class_balance.py affects all 3 adapters. Should S19 be LGB-only (introduce a per-model class_balance override) ?
- Test plan for S19 : is a single-fold re-run on the captured parquet sufficient, or should S19 be validated cross-fold (OPUSDC fold=3, LDOUSDC fold=4 per parent dossier reco #6) before merge ?
- Verdict : PASS / PASS_WITH_REVISIONS / REJECTED — with explicit concrete recommendation for the next concrete step (S19 design draft, additional diagnostic, or re-scope further).
8. What S19 would NOT do (out of scope)
- Re-design the harness θ-sweep mechanism end-to-end (committee enhancement #5 was about always-on calibration ; that's a separate Story)
- Rewrite the autonomous trainer entry point
- Touch FTF data prep / labeling pipeline (Step 4 confirmed data is clean)
- Touch CUSUM / FE / inference filter chain
- Modify the harness's training-loop config (ruled out by Steps 1-4)
9. Committee verdict (experiment_review) — 2026-05-14
Status : PASSED_WITH_REVISIONS.
The committee accepted the H8 re-scope as the dominant explanation for the AAVEUSDC fold=3 symptom and validated all 3 latent bugs (Bug 1 / Bug 2 / Bug 3 in §6.bis). The recommendation diverges from this dossier's Option F (bundle) toward a 2-PR split with a more conservative H8 posture :
9.1 S19 recommendation — Option B targeted (LGB-only) + separate hardening PR
S19 main PR — H8 fix, LGB-scoped only :
1. Restrict the θ-sweep candidates for the LGB path to [0.30, 0.40] (matching the legacy threshold_buy_range from lgbm_config.py:117) — OR, preferred, make the range configurable per model (resolves cleanly via the ADR-90 CVN_HPO_LGB_* keys).
2. Keep scale_pos_weight active — Option A (disabling it globally) is REJECTED : insufficient evidence that turning it off is safe for XGB and CB, and the cross-model regression risk is too high without targeted experimentation.
3. Add a guard : if rate_buy_val > 0.20-0.25, emit event=theta_overtrade_warning (or fail per a configurable mode flag).
Separate hardening PR — 3 latent bugs :
- class_balance.py — fail-fast when n_pos == 0 OR n_neg == 0 (extend the existing guard).
- theta_sweep.py:59 + eval_metrics.py:69 — add labels=[0, 1] to precision_recall_fscore_support(...) so the f1 array is always length-2 indexed by class.
- adapters/lgb.py:42-43 — preserve column order or reorder to self._native.feature_name() when the input is a DataFrame ; do NOT strip names via .to_numpy().
Option D rejected (HPO over scale_pos_weight) : same f1-on-val optimisation target → same over-trading attractor.
Option F rejected : bundling H8 fix with defensive hardening in a single PR muddies the revert envelope ; the committee prefers a clean split so each PR has its own CR cycle + audit trail.
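Item 3 of the S19 main PR (the over-trade guard) could look roughly like the sketch below. The function name, the 0.25 default, the `mode` values and the exception type are assumptions for illustration ; only the event name and the 0.20-0.25 threshold band come from the recommendation above.

```python
import logging

logger = logging.getLogger("theta_sweep")


def check_overtrade(rate_buy_val: float, max_rate: float = 0.25,
                    mode: str = "warn") -> bool:
    """Return True when the validation buy rate exceeds max_rate.

    mode="warn" logs event=theta_overtrade_warning and returns True ;
    mode="fail" raises RuntimeError instead.
    """
    if mode not in ("warn", "fail"):
        raise ValueError(f"unknown mode: {mode!r}")
    if rate_buy_val <= max_rate:
        return False
    msg = (f"event=theta_overtrade_warning "
           f"rate_buy_val={rate_buy_val:.4f} max_rate={max_rate:.2f}")
    if mode == "fail":
        raise RuntimeError(msg)
    logger.warning(msg)
    return True


# The Phase A run from §2.7 would have tripped the guard.
print(check_overtrade(0.4611))
```

Keeping the threshold and the mode as parameters leaves room for the per-model configuration the committee prefers, without hard-coding either value in the sweep node.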
9.2 Validation before merge
Run the fix candidate on 3 captured folds :
- AAVEUSDC fold=3 (the canary)
- OPUSDC fold=3
- LDOUSDC fold=4
For each cell, compare BEFORE / AFTER on : theta_picked, rate_buy_val, raw_buy_signals, final_trades, sortino, return.
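A small helper for that BEFORE / AFTER comparison might look as follows. It is a sketch only : the dictionary keys (including `return_pct`) and the function name are assumptions, and the only numbers shown are the BEFORE values already logged for AAVEUSDC fold=3 in §2.7.

```python
METRICS = ("theta_picked", "rate_buy_val", "raw_buy_signals",
           "final_trades", "sortino", "return_pct")


def before_after_deltas(before: dict, after: dict) -> dict:
    """Map each metric to (before, after, after - before).

    Raises KeyError if either run is missing one of the six metrics.
    """
    return {m: (before[m], after[m], after[m] - before[m]) for m in METRICS}


# BEFORE row for AAVEUSDC fold=3, taken from the Phase A log in §2.7.
AAVE_FOLD3_BEFORE = {
    "theta_picked": 0.2, "rate_buy_val": 0.4611, "raw_buy_signals": 1210,
    "final_trades": 251, "sortino": -9.512, "return_pct": -91.35,
}
```

The AFTER dictionary would be filled from the re-run's own log lines for each of the three folds ; a missing key fails loudly rather than producing a partial table.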
9.3 Dossier correction adopted
§5 row for adapters/lgb.py was wrong (called it the "scale_pos_weight injection point"). The actual injection is in class_balance.py + lightgbm_dag.py. The adapter is the locus of Bug 3 (DataFrame column-order erasure) — a separate concern from H8. Corrected in this commit.
9.4 Next concrete steps (action plan)
- Open PR for this dossier (Step 5 = analytical artefact, no code).
- Open S19 main PR : θ-sweep range restriction + over-trade guard, LGB-scoped.
- Open S19-hardening PR : 3 latent bug fixes (independent CR cycle).
- Cross-fold validation : capture OPUSDC fold=3 + LDOUSDC fold=4 via diagnostic__s18_step1_4_chain runs (~40 min wall time total) ; verify pre/post deltas before either S19 PR merges.
- Transition OP wp#154 : Developed → In testing after both S19 PRs merge ; Closed after cross-fold validation passes.
10. References
- Parent plan dossier : 2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md
- Step 2 dossier : step2-legacy-vs-harness.md
- Step 4 design : step4-design.md
- ADR-89 (training harness as plugin registry) : documentation/adr/0089-training-harness-as-plugin-registry.md
- ADR-90 (training hyperparams in PG / Console only) : documentation/adr/0090-training-hyperparameters-in-pg-console-only.md
- PR #891 (the harness migration that introduced the regression) : 87 changed files
- PR #872 (the FTF 7-bug hotfix where Bug #6 first surfaced the scale_pos_weight × binary_logloss early-stop issue and switched to AUC)
- PR #934 (Step 3 parity) — verdict REFUTED on H2
- PR #937 (Step 4 forensics) — verdict NO_DIVERGENCE on H3-H5