S18 Step 5 — Re-scope dossier (post-NO_DIVERGENCE)

Status : committee experiment_review PASSED_WITH_REVISIONS 2026-05-14 (see §9 below)
Story : CVN-N001-EE-S18 (OP wp#154, Plan dossier)
Trigger : Step 4 (PR #937) verdict = NO_DIVERGENCE on AAVEUSDC fold=3, plus Phase A log evidence revealing the actual regression mechanism
Author : Claude Opus 4.7 (under operator review)
Date : 2026-05-14

1. Executive summary (1 paragraph)

The S18 diagnostic chain (Steps 0 → 4) ruled out all 7 hypotheses H1-H7 of the parent dossier §5.1 : training-loop config (H1, H6), valid_sets composition (H2), label misalignment (H3), sample weights (H4), feature corruption (H5), HPO param overrides (H7) — all PASS. The captured fold's data is clean. The regression is upstream of the captured parquet in the harness's post-training decision logic. Phase A logs from the chained-DAG run 2026-05-14 14:44 reveal the mechanism : scale_pos_weight=4.71 (auto-injected by the harness ; pre-#891 prod path defaulted to 1.0 and never applied it) inflates positive-class probabilities → the harness θ-sweep over [0.05, 0.95] (19 candidates ; pre-#891 Optuna range [0.30, 0.40]) picks θ=0.2 → 46 % buy rate on val → 1210 raw signals → 251 final trades after concurrency filter → backtest sortino -9.5, return -91.35 %. The "best_iter=1 shallow training" symptom is a side effect of scale_pos_weight saturating val AUC at iteration 1, NOT the bug locus. H8 = scale_pos_weight auto-injection × wide θ-sweep range coupling is the re-scoped hypothesis. This dossier asks the committee to validate H8 and recommend the S19 fix posture (6 options, A to F, in §6).

2. The chain of evidence (Steps 0-4 + Phase A)

2.1 Step 0 (PR #931 merged) — staging replay

Confirmed reproducibility on staging. AAVEUSDC fold=3 Phase A produces f1_buy_val=0.3520485, matching the canonical post-S17 canary well within tolerance (abs_delta=8e-09 << epsilon=0.005). Verdict : PASS.

2.2 Step 1 (PR #932 merged) — verbose capture

Monkey-patched lgb.train + lgb.Dataset in the running harness ; serialized to parquet. Captured 51 Optuna trials × per-iter eval traces. n_eval_series=2 per trial = 1 valid_set × 2 metrics (auc + binary_logloss).

2.3 Step 2 (PR #933 merged) — git archaeology + side-by-side

Identified valid_sets=[train_set, val_set] (pre-#891) vs valid_sets=[val_set] (harness post-#891) as smoking-gun candidate. Status : analyzed but not the regression locus per Step 3 verdict.

2.4 Step 3 (PR #934 merged) — parity reproducer

Ran both valid_sets configurations on the captured fold with GRID_DEFAULT_HP_LGB. Verdict : REFUTED — legacy_best_iter=1, harness_best_iter=1. Both paths produce the same shallow training. H2 ruled out.

2.5 Loki forensics (2026-05-14 14:44 chained DAG run)

Histogram of 51 Optuna trials in Phase A :

best_iter=1 : 49 trials
best_iter=2 :  2 trials

51 distinct hyperparameter combinations + 2 valid_sets configs (Step 3) = 53 distinct configurations, all early-stopping at iter 1-2. H1 (early stopping config), H6 (metric mismatch), H7 (HPO param override) ruled out — none would affect 53 trainings uniformly with the same outcome.
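The histogram above can be rebuilt from raw log lines. A minimal sketch, assuming each trial emits an `event=training_complete` line made of `key=value` pairs like the Phase A excerpt in §2.7 ; the function name and the sample lines are illustrative, not the harness's own tooling.

```python
import re
from collections import Counter

# key=value pairs such as "best_iteration=1" (format assumed from the §2.7 log excerpt)
_KV = re.compile(r"(\w+)=(\S+)")


def best_iter_histogram(lines):
    """Count best_iteration values over event=training_complete lines."""
    counts = Counter()
    for line in lines:
        fields = dict(_KV.findall(line))
        if fields.get("event") == "training_complete" and "best_iteration" in fields:
            counts[int(fields["best_iteration"])] += 1
    return counts


# Illustrative input only: 49 trials stopping at iteration 1, 2 at iteration 2,
# plus one unrelated event that must be ignored.
sample = (
    ["event=training_complete model_type=lightgbm best_iteration=1"] * 49
    + ["event=training_complete model_type=lightgbm best_iteration=2"] * 2
    + ["event=signal_funnel raw_buy_signals=1210 final_trades=251"]
)
hist = best_iter_histogram(sample)
for best_iter, n_trials in sorted(hist.items()):
    print(f"best_iter={best_iter} : {n_trials} trials")
```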

2.6 Step 4 (PR #937 merged) — data forensics

Six checks on the captured fold (run 2026-05-14 17:47-18:14) :

Check	Status	Key metric
F1 label-index alignment	PASS	`train_index_match=true, val_index_match=true`
F2 label distribution	PASS	`train_pos_rate=0.175, val_pos_rate=0.169` (Δ < 5 %)
F3 NaN ratio	PASS	`max_train_nan_ratio < 0.5, max_val < 0.5`
F4 feature-label correlation	PASS	`max_abs_corr < 0.95` (no leak feature)
F5 train vs val drift	PASS	`max_sigma_shift < 3.0` (no scaling drift)
F6 iter-1 single-tree probe	PASS	`iter1_train_auc≈0.78, iter1_val_auc≈0.49`, no leak / no misalignment / proba distribution healthy
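For readers who want to re-derive the F2 and F5 rows, here is a hedged sketch in plain Python. The label counts are the ones quoted in §6.bis for this fold. The exact definitions used by `s18_step4_invariants.py` are not reproduced in this dossier, so the absolute-delta reading of "Δ < 5 %" and the "val mean shift in train standard deviations" reading of `max_sigma_shift` are assumptions.

```python
from statistics import mean, pstdev


def pos_rate(n_pos, n_neg):
    """Share of positive (BUY) labels in a split."""
    return n_pos / (n_pos + n_neg)


def label_distribution_ok(train_rate, val_rate, max_delta=0.05):
    """F2-style check (assumed definition): absolute gap between positive rates."""
    return abs(train_rate - val_rate) < max_delta


def sigma_shift(train_values, val_values):
    """F5-style metric (assumed definition): val mean shift in train standard deviations."""
    sd = pstdev(train_values)
    if sd == 0:
        raise ValueError("constant train feature: sigma shift undefined")
    return abs(mean(val_values) - mean(train_values)) / sd


train_rate = pos_rate(1725, 8128)   # train label counts quoted in §6.bis
val_rate = pos_rate(353, 1740)      # val label counts quoted in §6.bis
print(round(train_rate, 3), round(val_rate, 3), label_distribution_ok(train_rate, val_rate))
# 0.175 0.169 True
```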

Verdict : NO_DIVERGENCE → bug is upstream of the captured parquet → escalate to committee, re-scope §5.1 hypothesis space.

2.7 Phase A log evidence (the smoking gun re-scoped)

event=training_complete model_type=lightgbm best_iteration=1 training_time_sec=2.465
  theta_picked=0.2 f1_buy_val=0.352 auc_buy_val=0.6461
  brier_val=0.1395 ece_val=0.0591 rate_buy_val=0.4611

event=signal_funnel raw_buy_signals=1210 final_trades=251
  primary_killer=concurrency

event=weighted_variant_evaluated sortino=-9.512 n_trades=251 return=-91.35%

The model is NOT broken at the booster level :
- auc_buy_val=0.6461 : the booster CAN rank positives (above-random)
- f1_buy_val=0.352 : at the harness-chosen θ=0.2, f1 is reasonable
- brier_val=0.1395, ece_val=0.0591 : calibration is acceptable

The model IS broken at the trading level :
- rate_buy_val=0.4611 : the model emits BUY 46 % of the time on val
- 1210 raw signals → 251 final trades (only the concurrency filter saves us from disaster)
- Backtest sortino -9.5, return -91.35 %

The actual regression is the post-training θ selection, NOT the training loop.
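The gap between "reasonable f1" and "broken trading" can be made concrete with arithmetic on the logged numbers. This is a back-of-envelope sketch, assuming the val split behind these log lines is the 353-positive / 1740-negative split quoted in §6.bis ; the precision and recall below are derived estimates, not logged values. It uses the identity F1 = 2·TP / (n_pos + n_pred_buy).

```python
n_pos_val, n_neg_val = 353, 1740      # val label counts quoted in §6.bis
rate_buy_val = 0.4611                 # event=training_complete
f1_buy_val = 0.3520485                # canary value from §2.1
raw_buy_signals, final_trades = 1210, 251   # event=signal_funnel

n_val = n_pos_val + n_neg_val
n_pred_buy = round(rate_buy_val * n_val)    # val candles flagged BUY

# F1 = 2*TP / (n_pos + n_pred_buy)  =>  TP = F1 * (n_pos + n_pred_buy) / 2
tp = round(f1_buy_val * (n_pos_val + n_pred_buy) / 2)

precision = tp / n_pred_buy
recall = tp / n_pos_val
rejected = raw_buy_signals - final_trades

print(n_pred_buy, tp, f"precision={precision:.3f}", f"recall={recall:.3f}")
# 965 232 precision=0.240 recall=0.657
print(f"rejected={rejected}", f"survival={final_trades / raw_buy_signals:.1%}")
# rejected=959 survival=20.7%
```

Under these assumptions roughly three out of four BUY flags on val are false positives, which is compatible with an f1 of 0.352 and still ruinous once every flag is a candidate trade.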

3. Diff legacy vs harness — six axes

Sources :
- Pre-#891 prod path : /tmp/s18_pre891/lgbm_*.py (vendored read-only copy of the LGB trainer at dc3d86c6^)
- Post-#891 harness path : src/training/harness/{nodes,dags/models}/
- Post-#891 legacy-style path : src/training/LightGBM/cvntrade_LightGBM_grid_utils.py

#	Axis	Pre-#891 prod	Post-#891 harness	Effect
D1	`scale_pos_weight` injection	`LightGBMConfig.scale_pos_weight: float = 1.0` (default ; never set by the autonomous trainer ; only added to `lgb.train` params if `!= 1.0`) → NOT applied in production	`class_balance.compute_class_balance(binary=True)` → `scale_pos_weight = n_neg / n_pos` → ALWAYS applied (= 4.7119 on AAVEUSDC fold=3)	Inflates positive-class probas downstream
D2	θ selection mechanism	Optuna suggests `threshold_buy` as ONE of the trial params ; jointly optimised with model HPs	θ-sweep is a post-hoc separate pass : Optuna picks model HPs only, then `theta_sweep.pick_threshold_on_val()` runs after the final retrain	Decouples θ from model-HP joint optimum
D3	θ search range	`threshold_buy_range: Tuple[float, float] = (0.30, 0.40)` (`lgbm_config.py:117`)	19 candidates `np.linspace(0.05, 0.95, 19) + [0.5]` ; argmax over f1_buy on val (`theta_sweep.py:53`)	Lets θ go arbitrarily low (picks 0.2 in our run, outside the legacy bounds)
D4	`valid_sets` composition	`[train_data, val_data]` + `valid_names=["train", "val"]`	`[val_set]` + `valid_names=["val"]` (`lightgbm_dag.py:171`)	Step 3 ruled this out as a dominant cause (both produce best_iter=1 on the captured fold under GRID_DEFAULT_HP_LGB) — but still a real diff
D5	Eval metric / early stopping	`lgb_params` does NOT explicitly set `metric` → LGB default for `objective=binary` is `binary_logloss`	`metric=["auc", "binary_logloss"]` + `first_metric_only=True` → AUC drives early stop (PR #872 fix per `lgbm_grid_utils.py:53-65` Bug #6 comment)	Switch to AUC was a Bug #6 fix ; ironically, with `scale_pos_weight=4.71`, AUC saturates at iter 1 anyway
D6	Calibration	Optuna suggests `calibration ∈ {none, sigmoid, isotonic}` jointly with HPs	`theta_sweep.py` docstring claims "always-on calibration node (committee enhancement #2)" — but calibration is NOT applied in our Phase A run (no `event=calibration_*` in Loki ; ECE 0.059 is the raw model's)	Pre-#891 explored a 3-way calibration choice ; post-#891 effectively skips it
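Rows D1 and D3 reduce to two small computations. The sketch below reproduces them in plain Python as a stand-in for the `np.linspace` call ; the helper names are hypothetical, and the "+ [0.5]" in the table is read as "append 0.5 and de-duplicate", which is the only reading that yields 19 candidates.

```python
def scale_pos_weight(n_pos, n_neg):
    """D1: the post-#891 class-balance ratio n_neg / n_pos."""
    return n_neg / n_pos


def theta_candidates(lo=0.05, hi=0.95, n=19, extra=(0.5,)):
    """D3: evenly spaced grid plus extra values, de-duplicated and sorted."""
    step = (hi - lo) / (n - 1)
    grid = {round(lo + i * step, 6) for i in range(n)}
    grid.update(round(x, 6) for x in extra)
    return sorted(grid)


LEGACY_RANGE = (0.30, 0.40)   # threshold_buy_range in the pre-#891 config

spw = scale_pos_weight(1725, 8128)      # train label counts quoted in §6.bis
candidates = theta_candidates()
outside_legacy = [t for t in candidates if not LEGACY_RANGE[0] <= t <= LEGACY_RANGE[1]]

print(round(spw, 4))                          # 4.7119
print(len(candidates), len(outside_legacy))   # 19 16
print(0.2 in outside_legacy)                  # True
```

Only 3 of the 19 candidates (0.30, 0.35, 0.40) fall inside the legacy range ; the picked θ=0.2 is one of the 16 that do not.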

4. Re-scoped hypothesis — H8

H8 = scale_pos_weight auto-injection (D1) × wide θ-sweep range (D2 + D3) coupling produces a 46 % buy rate → catastrophic backtest, even though the booster's f1_val and AUC scores look acceptable.

H8 is the post-Step 4 escape hatch the parent dossier §7.1 explicitly anticipated : "if §5.3 invariants 5.3.1–5.3.3 all pass without divergence, STOP at end of Step 2 and re-evaluate hypothesis space §5.1 before continuing. Don't chase a phantom into Step 4." — Step 4 confirmed the data is clean ; the bug is in the harness's post-training decision logic, not in any of H1-H7.

4.1 Mechanism (chain of causation)

1. class_balance.py:62 → scale_pos_weight = n_neg/n_pos = 4.7119
2. lgb.train trains the booster with scale_pos_weight active
   → positive-class probas inflated (most probas in [0.10, 0.30])
3. AUC saturates at iter 1 (booster ranks positives well immediately)
   → early_stopping(50) fires after 50 idle iters → best_iter = 1
4. theta_sweep.pick_threshold_on_val() runs on inflated probas
   → finds best f1_buy at θ = 0.20 (because that's where the inflated
     probas concentrate around the decision boundary)
5. At θ=0.20 : 46% of val candles flagged BUY (rate_buy_val=0.4611)
6. Inference path emits 1210 raw BUY signals on test
7. concurrency filter rejects 959 (the only saving grace)
8. 251 trades execute → catastrophic backtest
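The direction of step 2 follows from the loss itself. For a log-loss in which positives carry weight w, the output q that minimises -w·p·log(q) - (1-p)·log(1-q) for a true probability p satisfies q/(1-q) = w·p/(1-p), i.e. q = w·p / (w·p + 1 - p). The sketch below evaluates that idealised, fully converged limit ; it is not what a booster stopped at iteration 1 emits, so it illustrates why probabilities shift upward rather than reproducing the [0.10, 0.30] range seen in this run.

```python
def weighted_logloss_minimiser(p, w):
    """Output q minimising -w*p*log(q) - (1-p)*log(1-q) for true probability p."""
    return w * p / (w * p + (1.0 - p))


n_pos, n_neg = 1725, 8128                 # train label counts quoted in §6.bis
w = n_neg / n_pos                         # scale_pos_weight, about 4.7119
base_rate = n_pos / (n_pos + n_neg)       # about 0.175

for p in (0.05, base_rate, 0.30):
    q = weighted_logloss_minimiser(p, w)
    print(f"true p={p:.3f} -> weighted optimum q={q:.3f}")
```

With w = n_neg / n_pos, a candle whose true BUY probability equals the base rate is pushed to q = 0.5, so any threshold tuned for unweighted probabilities sits far too low.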

4.2 Why pre-#891 didn't show this

1. LightGBMConfig.scale_pos_weight = 1.0 (default, never overridden)
   → no class rebalancing → balanced probas
2. lgb.train with valid_sets=[train, val] + binary_logloss default
   → train binary_logloss always-improving keeps early stop from firing
   → trains 50-200 iters → final AUC 0.7-0.8
3. Optuna jointly suggests threshold_buy ∈ [0.30, 0.40]
   → θ stays in the conservative range
4. At θ=0.40 : far fewer val candles cross threshold → moderate buy rate
5. Reasonable trade count → backtest survives
6. f1_buy_val ≈ 0.42 (the canonical pre-#891 reference)

5. Critical files involved (for operator audit)

Pre-#891 (vendored at `/tmp/s18_pre891/`, no longer on main)

File	Size	What to verify
`lgbm_autonomous_trainer.py`	19 K	Confirm it never sets `scale_pos_weight` ; calls `CVNTrade_LightGBMTrainer.train()`
`lgbm_trainer.py`	13 K	Lines 75-89 : `lgb.train(params=lgb_params, valid_sets=[train_data, val_data])` ; `params = config.to_lgb_params()`
`lgbm_config.py`	5 K	L31 `scale_pos_weight: float = 1.0` ; L83-84 `if self.scale_pos_weight != 1.0: params["scale_pos_weight"] = ...` ; L117 `threshold_buy_range = (0.30, 0.40)`
`lgbm_hyperoptimizer.py`	16 K	L217 + L239 confirms Optuna suggests `threshold_buy` as a trial param within `threshold_buy_range`

Post-#891 harness (current main, `src/training/harness/`)

File	Lines	What to verify
`nodes/class_balance.py`	79	L62 always returns `scale_pos_weight = n_neg / n_pos` for binary ; never opt-in
`nodes/theta_sweep.py`	64	L53 candidates = `np.linspace(0.05, 0.95, 19) + [0.5]` (NOT bounded to legacy [0.30, 0.40])
`nodes/hpo_optuna.py`	—	Confirms Optuna doesn't suggest `threshold_buy` (delegated to theta_sweep node)
`dags/models/lightgbm_dag.py`	312	L171 `valid_sets=[val_set]` ; L260-296 `_hpo_space` lists 9 model HPs, NO `threshold_buy`
`adapters/lgb.py`	—	Inference-side wrapper for `lgb.Booster` ; `predict_proba`. NOT the `scale_pos_weight` injection site (that lives in `class_balance.py` + `lightgbm_dag.py`). The adapter is the locus of Bug 3 (DataFrame column-order erasure) — a separate concern from H8. (Corrected 2026-05-14 post-committee.)
`contracts.py`	—	`ClassBalance` dataclass ; `TrainedArtifact` carries the picked threshold

Post-#891 legacy-style on main (`src/training/LightGBM/`)

File	What to verify
`cvntrade_LightGBM_grid_utils.py`	L86-89 also auto-sets `scale_pos_weight = n_neg / n_pos` ; L53-65 comment explains Bug #6 (binary_logloss + scale_pos_weight saturating early stop)
`cvntrade_LightGBM_autonomous_trainer.py`	Routes between legacy + harness path via `CVN_USE_HARNESS`

Diagnostic infrastructure (already in place from Steps 0-4)

src/commun/finetune/diagnostic/s18_step0_replay.py      # PR #931
src/commun/finetune/diagnostic/s18_step1_capture.py     # PR #932
src/commun/finetune/diagnostic/s18_step3_parity.py      # PR #934
src/commun/finetune/diagnostic/s18_step4_invariants.py  # PR #937
dags/dag_diagnostic__s18_step0.py
dags/dag_diagnostic__s18_step1_3_chain.py               # PR #935
dags/dag_diagnostic__s18_step1_4_chain.py               # PR #937
documentation/reviews/2026-05-13-...-s18-...-plan.md    # parent dossier
documentation/missions/cvn-n001-ee-s18-diagnostic/      # Step 2/4/5 dossiers

Reference / canary

PG finetune_results : (run_id=ftf_*, crypto=AAVEUSDC, fold_id=3, variant=lightgbm) → f1_buy=0.3520485 (post-#891 canonical)
The pre-#891 reference f1_buy ≈ 0.42 cited in the parent dossier §3 — its source needs to be re-confirmed (likely from a pre-S17 FTF run or from an earlier sweep ; the DB row may have been overwritten by post-S17 runs)

6. Decision matrix — S19 fix options

Option	Change	Files touched	Pros	Cons	Risk
A — Disable scale_pos_weight (revert to legacy 1.0)	`class_balance.py:62` → return `scale_pos_weight=1.0` (or skip the field entirely) for the binary branch	1 file, ~3 lines	Restores pre-#891 proba distribution → θ-sweep picks higher θ → fewer trades, healthier backtest	May lose the f1 lift that scale_pos_weight provides on rare-positive distributions ; XGB and CB are also affected (their adapters consume the same `ClassBalance`)	Medium — XGB / CB regression possible
B — Restrict θ-sweep range to legacy [0.30, 0.40]	`theta_sweep.py:53` → narrow `np.linspace(0.30, 0.40, ...)`	1 file, 1 line	Forces conservative θ even if probas are inflated → reasonable trade rate	With scale_pos_weight still active, the bound clip will produce a θ that wastes f1 (sub-optimal for the inflated proba distribution) ; XGB doesn't use theta_sweep so unaffected	Low — narrow surface
C — A + B both	`class_balance.py` + `theta_sweep.py`	2 files, ~4 lines	Best mimics legacy ; safest restore	Two simultaneous changes → harder to attribute the recovery ; doubles the diff to verify	Medium
D — Add scale_pos_weight to Optuna search	`lightgbm_dag.py:_hpo_space()` add `scale_pos_weight: trial.suggest_float(...)` over [1.0, 5.0] ; `class_balance.py` reads from params instead of computing	2 files, ~10 lines	Principled — let Optuna jointly optimize `scale_pos_weight` × model HPs	Multiplies HPO search space ; longer per-trial ; Optuna will likely converge to the same `n_neg/n_pos` value because that's the f1_val maximizer (defeats the purpose)	High — does NOT address the root cause (f1_val maximization on val drives the bad outcome regardless)
E — Decouple θ-sweep optimization from f1_val	Replace `theta_sweep.pick_threshold_on_val(... f1_buy)` with sortino-on-walkforward-backtest or Brier-score or a regularized f1 (penalty on rate_buy > 0.20)	`theta_sweep.py` + new node	Addresses the actual symptom (θ that wins f1 wins the wrong objective)	Major design change ; backtest in θ-sweep is expensive ; needs S20+ scope	Out-of-scope for S19
F — one of A / B / C + 3 latent bug fixes (BUNDLED)	Pick A / B / C as the H8 fix posture + Bug 1 (class_balance n_neg guard) + Bug 2 (labels=[0,1] in both files) + Bug 3 (lgb.py preserve feature_names)	`class_balance.py` + `theta_sweep.py` + `eval_metrics.py` + `adapters/lgb.py` + (per A/B/C pick)	Same PR surface for related defensive fixes ; zero behavioural change on healthy paths ; closes 3 latent prod risks for free	Mixes H8 fix (behavioural) with bug fixes (defensive) — committee may prefer separate PRs for clean revert envelope	Low — each fix is independently small + tested

6.bis Latent bugs found during code audit (operator review 2026-05-14)

A targeted re-read of class_balance.py + theta_sweep.py + eval_metrics.py + adapters/lgb.py surfaced 3 latent bugs that do NOT manifest on AAVEUSDC fold=3 (so they don't explain our specific regression) but can fire on degenerate-split folds (Bugs 1 and 2) or on feature-order drift (Bug 3) in the FTF sweep, and may contribute to the broader regression cited in the parent dossier §3 (e.g. XGB canary f1_buy=0.089).

Bug 1 — `class_balance.py:55-62`, missing guard on `n_neg == 0`

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
if n_pos == 0:
    raise ValueError(...)        # only this branch fail-fasts
return ClassBalance(scale_pos_weight=n_neg / n_pos, ...)

When n_neg == 0 (train split is 100 % BUY) → returns scale_pos_weight = 0 / n_pos = 0.0 silently → LGB / XGB train with the positive class weighted at zero, on a split that contains no negatives → barely-trained booster, no exception raised. Per ADR-25 (no silent fallback), this MUST fail-fast.

Patch :

if n_pos == 0 or n_neg == 0:
    raise ValueError(
        f"compute_class_balance: degenerate training labels — "
        f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split."
    )

Does it fire on AAVEUSDC fold=3 ? No — n_pos=1725, n_neg=8128, both strictly positive.
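A self-contained version of the patched guard, usable as a regression test. This is a sketch : the `ClassBalance` fields shown here are a hypothetical minimal shape (the real dataclass lives in `contracts.py` and may carry more), and labels are plain Python lists rather than the harness's arrays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassBalance:
    # Hypothetical minimal shape, for illustration only.
    scale_pos_weight: float
    n_pos: int
    n_neg: int


def compute_class_balance(y):
    """Binary class balance that fails fast on single-class training labels."""
    n_pos = sum(1 for v in y if v == 1)
    n_neg = sum(1 for v in y if v == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError(
            "compute_class_balance: degenerate training labels, "
            f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split."
        )
    return ClassBalance(scale_pos_weight=n_neg / n_pos, n_pos=n_pos, n_neg=n_neg)


# Healthy split: same counts as AAVEUSDC fold=3 train.
healthy = compute_class_balance([1] * 1725 + [0] * 8128)
print(round(healthy.scale_pos_weight, 4))   # 4.7119

# Degenerate split: 100 % BUY must raise instead of returning 0.0.
try:
    compute_class_balance([1, 1, 1])
except ValueError as exc:
    print("rejected:", exc)
```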

Bug 2 — `theta_sweep.py:59` + `eval_metrics.py:69`, missing `labels=[0, 1]`

_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
f1_buy = float(f1[1]) if len(f1) > 1 else 0.0

sklearn's precision_recall_fscore_support(..., average=None) without labels= returns metrics indexed by set(y_true) ∪ set(y_pred) sorted ascending. When that union has only label 1 (e.g. y_true = y_pred = [1, 1, 1]), it returns f1 = [1.0] of length 1 → the len(f1) > 1 else 0.0 guard returns 0.0 instead of the true 1.0. Optuna then optimises on broken signal.

Patch (both files) :

_, _, f1, _ = precision_recall_fscore_support(
    y_true_or_y, y_pred, labels=[0, 1], average=None, zero_division=0,
)
f1_buy = float(f1[1])    # always safe with explicit labels

Does it fire on AAVEUSDC fold=3 ? No — y_val has both classes (353 pos + 1740 neg), so the union always contains {0, 1} regardless of y_pred's content → len(f1) == 2 → f1[1] correctly indexes BUY.
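To pin the expected value in a unit test without depending on the metric library, one option is a hand-rolled reference for the BUY-class F1. The sketch below is not sklearn ; it mimics the "one entry per observed label" indexing described above with a plain list, then shows what the unpatched guard returns on a mono-class split versus the value the patched call should produce. Function names are illustrative.

```python
def f1_for_label(y_true, y_pred, label=1):
    """F1 of one class from raw counts; 0.0 when the class is neither present nor predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_buy_unpatched(per_class_f1):
    """The guard quoted above: assumes index 1 is BUY, falls back to 0.0 on a length-1 array."""
    return float(per_class_f1[1]) if len(per_class_f1) > 1 else 0.0


# Mono-class split: only label 1 occurs, so an array indexed by observed labels
# has a single entry, and that entry is BUY's F1.
y_true = [1, 1, 1]
y_pred = [1, 1, 1]
observed_labels = sorted(set(y_true) | set(y_pred))
per_class = [f1_for_label(y_true, y_pred, lab) for lab in observed_labels]

print(f1_buy_unpatched(per_class))        # 0.0, the silent under-report
print(f1_for_label(y_true, y_pred, 1))    # 1.0, what the patched call should return
```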

Bug 3 — `adapters/lgb.py:42-43`, DataFrame → ndarray strips column order

def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    # PR #900 — DataFrame-native at inference, ndarray at training.
    if isinstance(x, pd.DataFrame):
        x = x.to_numpy()                     # ← strips column names
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    ...

The inline comment claims "DataFrame-native" but the implementation does the opposite : .to_numpy() reduces the input to a positional ndarray. lgb.Booster.predict(ndarray) does NOT validate feature names ; if the DataFrame's column order at inference differs from the training matrix, LightGBM silently predicts on the WRONG features. No exception, no log, no metric anomaly, just silently corrupted probabilities downstream.

Patch (minimal — preserve columns + reorder if needed) :

def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        expected = list(self._native.feature_name())
        if list(x.columns) != expected:
            x = x[expected]                  # reorder to training-time schema
        # keep as DataFrame — lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        return np.column_stack([1.0 - proba, proba])
    return proba

Does it fire on AAVEUSDC fold=3 ? Probably not at training time (the harness trains on the same ndarray it later predicts on for θ-sweep + eval — order is consistent by construction). Possibly at backtest time : CVNTrade_BacktestEngine calls adapter.predict_proba candle-by-candle on the live FE-pipeline DataFrame. If column ordering of that DataFrame ever diverges from the training cache's order (e.g. after a FE pipeline cache rebuild, or after a feature-selection set update), all backtest predictions are silently corrupted. Our observed auc_buy_val=0.65 rules out total feature scrambling on THIS run (random predictions would give AUC ~0.5), but Bug 3 plausibly drives any feature-order-sensitive cross-crypto / live regression cited in the parent dossier §3.
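The failure mode is easy to reproduce without LightGBM. The sketch below uses a toy positional scorer as a stand-in for the booster ; feature names, weights and values are hypothetical. It also raises on a name mismatch instead of reordering blindly, which is a stricter posture than the minimal patch above. One caveat worth confirming before relying on a name-based reorder : LightGBM assigns auto-generated names such as Column_0 when a Dataset is built from a bare ndarray without feature_name, in which case the training schema has to come from somewhere other than `feature_name()`.

```python
def align_to_schema(columns, row, expected):
    """Reorder one row of feature values to the training-time column order, by name."""
    if set(columns) != set(expected):
        missing = sorted(set(expected) - set(columns))
        extra = sorted(set(columns) - set(expected))
        raise ValueError(f"feature schema mismatch: missing={missing}, extra={extra}")
    by_name = dict(zip(columns, row))
    return [by_name[name] for name in expected]


def toy_score(values, weights):
    """Positional stand-in for a model: weight i multiplies whatever sits in position i."""
    return sum(v * w for v, w in zip(values, weights))


train_schema = ["rsi", "atr", "volume_z"]   # hypothetical feature names
weights = [0.5, -1.0, 2.0]                  # hypothetical, tied to train_schema order

# Same candle, but the inference frame lists its columns in a different order.
infer_columns = ["volume_z", "rsi", "atr"]
infer_row = [3.0, 10.0, 1.0]

positional = toy_score(infer_row, weights)   # wrong features, no error raised
aligned = toy_score(align_to_schema(infer_columns, infer_row, train_schema), weights)
print(positional, aligned)                   # -6.5 10.0
```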

Operator severity ranking (2026-05-14 audit)

1. Bug 1, class_balance.py : scale_pos_weight=0 on single-class train. Silently destructive, can pass through CI undetected on rare-pos cryptos.
2. Bug 3, adapters/lgb.py : DataFrame column-order erasure. Silently wrong predictions in any path where the DataFrame schema can drift (backtest, live inference, walk-forward predictor).
3. Bug 2, theta_sweep.py / eval_metrics.py : missing labels=[0, 1]. Silently wrong metrics on mono-class splits → Optuna optimises on garbage.

Why all 3 matter even though our specific run is unaffected

Cross-fold risk : the FTF sweep cited in the parent dossier §3 covered defi_top5 × 5m × multiple folds × 3 models. SOME of those folds may have produced a degenerate train (all-BUY) or a degenerate val (all-one-class). Bug 1 silently broke any such training ; Bug 2 returned 0.0 f1_buy that contaminated Optuna's score → cascading bad hyperparam choices ; Bug 3 may have corrupted cross-crypto inference when feature schemas diverged.
Plausibility on the XGB canary : XGB stays at fixed θ=0.5 (no θ-sweep) — but XGB's autonomous trainer also calls eval_metrics.evaluate_split_binary which has Bug 2. If XGB's prediction at θ=0.5 produced a degenerate y_pred, the reported f1_buy=0.089 mean across folds could be a Bug 2 measurement artefact superimposed on the H8 mechanism.
Defensive cost is zero-to-small : Bug 1 patch is 1 line ; Bug 2 patch is 1 line per file ; Bug 3 patch is ~5 lines + a column-reorder. All fail-fast / defensive, no behaviour change on currently-healthy paths, fully consistent with ADR-25.
Other adapters audited (2026-05-14) :
adapters/xgb.py:57 — passes feature names via xgb.DMatrix(..., feature_names=list(x.columns)) → DMatrix validates column order against training feature_names → Bug 3 does NOT apply to XGB.
adapters/cb.py:37 — calls self._native.predict_proba(x) directly on the CatBoost classifier ; CatBoost's sklearn-compatible API handles DataFrames natively + validates column order → Bug 3 does NOT apply to CB.
Bug 3 is LGB-only. Patch surface : single file adapters/lgb.py.

7. Questions for the committee (`experiment_review`)

Validate H8 : do the experts agree the re-scoped hypothesis (D1 × D3 coupling driving over-trading) is the dominant locus on AAVEUSDC fold=3, or are there hidden axes we missed (D5, D6, calibration node behaviour, walk-forward predict path) ?
Validate the 3 latent bugs found by operator audit (Bug 1 / Bug 2 / Bug 3 in §6.bis) : do the experts agree each is a real defect ? Should S19 close all 3 in the same PR as the H8 fix (Option F bundle), or split them into a separate defensive-hardening PR with its own CR cycle ? Operator severity ranking : Bug 1 > Bug 3 > Bug 2.
Source of pre-#891 reference f1≈0.42 : the parent dossier cites this number ; can we recover the exact PG row / Loki window that produced it, or is it extrapolated ? If extrapolated, should we re-establish a pre-#891 canary by running train_with_fixed_params_lgbm with scale_pos_weight=1.0 + θ=0.4 on the captured fold to get an empirical baseline ?
S19 fix posture for H8 : which of A / B / C is least risky for production ? Option D (add to HPO) is rejected by us as theoretically self-defeating ; Option E is out-of-scope ; Option F bundles A/B/C with the 3 bug fixes.
XGB / CB regression risk under Option A : disabling scale_pos_weight in class_balance.py affects all 3 adapters. Should S19 be LGB-only (introduce a per-model class_balance override) ?
Test plan for S19 : is a single-fold re-run on the captured parquet sufficient, or should S19 be validated cross-fold (OPUSDC fold=3, LDOUSDC fold=4 per parent dossier reco #6) before merge ?
Verdict : PASS / PASS_WITH_REVISIONS / REJECTED, with an explicit recommendation for the next concrete step (S19 design draft, additional diagnostic, or re-scope further).

8. What S19 would NOT do (out of scope)

Re-design the harness θ-sweep mechanism end-to-end (committee enhancement #2 was about always-on calibration ; that's a separate Story)
Rewrite the autonomous trainer entry point
Touch FTF data prep / labeling pipeline (Step 4 confirmed data is clean)
Touch CUSUM / FE / inference filter chain
Modify the harness's training-loop config (ruled out by Steps 1-4)

9. Committee verdict (`experiment_review`) — 2026-05-14

Status : PASSED_WITH_REVISIONS.

The committee accepted the H8 re-scope as the dominant explanation for the AAVEUSDC fold=3 symptom and validated all 3 latent bugs (Bug 1 / Bug 2 / Bug 3 in §6.bis). The recommendation diverges from this dossier's Option F (bundle) toward a 2-PR split with a more conservative H8 posture :

9.1 S19 recommendation — Option B targeted (LGB-only) + separate hardening PR

S19 main PR (H8 fix, LGB-scoped only) :
1. Restrict the θ-sweep candidates for the LGB path to [0.30, 0.40] (matching the legacy threshold_buy_range from lgbm_config.py:117) or, preferred, make the range configurable per model (resolves cleanly via the ADR-90 CVN_HPO_LGB_* keys).
2. Keep scale_pos_weight active. Option A (disabling it globally) is REJECTED : insufficient evidence that turning it off is safe for XGB and CB, and the cross-model regression risk is too high without targeted experimentation.
3. Add a guard : if rate_buy_val exceeds a configured limit in the 0.20-0.25 range, emit event=theta_overtrade_warning (or fail per a configurable mode flag).
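The S19 main PR items can be sketched as a bounded sweep followed by an over-trade check. This is a minimal illustration under stated assumptions, not the harness code : the function names, the 11-point grid, the `>=` threshold rule, the 0.25 default limit and the `mode` flag values are all hypothetical, and the validation data is a toy example.

```python
def f1_and_rate(y_true, proba, theta):
    """F1 of the BUY class and the share of rows flagged BUY at threshold theta."""
    pred = [1 if p >= theta else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return (2 * tp / denom if denom else 0.0), sum(pred) / len(pred)


def pick_threshold(y_true, proba, theta_range=(0.30, 0.40), n_candidates=11,
                   max_rate_buy=0.25, mode="warn"):
    """Argmax of f1_buy over a bounded grid, then an over-trade check on the winner."""
    lo, hi = theta_range
    step = (hi - lo) / (n_candidates - 1)
    grid = [round(lo + i * step, 6) for i in range(n_candidates)]
    best_theta, best_f1, best_rate = None, -1.0, 0.0
    for theta in grid:
        f1, rate = f1_and_rate(y_true, proba, theta)
        if f1 > best_f1:                      # ties keep the lowest theta
            best_theta, best_f1, best_rate = theta, f1, rate
    if best_rate > max_rate_buy:
        if mode == "fail":
            raise RuntimeError(
                f"theta_overtrade: rate_buy_val={best_rate:.4f} > {max_rate_buy}"
            )
        print(f"event=theta_overtrade_warning theta={best_theta} rate_buy_val={best_rate:.4f}")
    return best_theta, best_f1, best_rate


# Toy validation split (illustrative values only).
y_val = [1, 0, 0, 0, 1, 0, 0, 0]
p_val = [0.90, 0.35, 0.20, 0.10, 0.38, 0.05, 0.32, 0.31]
theta, f1, rate = pick_threshold(y_val, p_val)
print(theta, f1, rate)   # 0.36 1.0 0.25
```

Passing `theta_range` per model keeps the XGB and CB paths untouched, which matches the LGB-only scope of the recommendation.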

Separate hardening PR (3 latent bugs) :
- class_balance.py : fail-fast when n_pos == 0 OR n_neg == 0 (extend the existing guard).
- theta_sweep.py:59 + eval_metrics.py:69 : add labels=[0, 1] to precision_recall_fscore_support(...) so the f1 array is always length-2 indexed by class.
- adapters/lgb.py:42-43 : preserve column order or reorder to self._native.feature_name() when the input is a DataFrame ; do NOT strip names via .to_numpy().

Option D rejected (HPO over scale_pos_weight) : same f1-on-val optimisation target → same over-trading attractor.

Option F rejected : bundling H8 fix with defensive hardening in a single PR muddies the revert envelope ; the committee prefers a clean split so each PR has its own CR cycle + audit trail.

9.2 Validation before merge

Run the fix candidate on 3 captured folds :
- AAVEUSDC fold=3 (the canary)
- OPUSDC fold=3
- LDOUSDC fold=4

For each cell, compare BEFORE / AFTER on : theta_picked, rate_buy_val, raw_buy_signals, final_trades, sortino, return.
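One way to keep the BEFORE / AFTER comparison mechanical is a small diff helper. The sketch below seeds the BEFORE side with the AAVEUSDC fold=3 values from §2.7 ; the AFTER side is deliberately left as an input because no post-fix run exists yet. Function and key names are hypothetical.

```python
BEFORE_AAVEUSDC_FOLD3 = {
    "theta_picked": 0.2,
    "rate_buy_val": 0.4611,
    "raw_buy_signals": 1210,
    "final_trades": 251,
    "sortino": -9.512,
    "return_pct": -91.35,
}


def compare_runs(before, after):
    """Return {metric: (before, after, delta)}; both runs must report the same metrics."""
    mismatched = set(before) ^ set(after)
    if mismatched:
        raise KeyError(f"metrics not present in both runs: {sorted(mismatched)}")
    return {name: (before[name], after[name], after[name] - before[name]) for name in before}


def render(report):
    """One line per metric: before, after and signed delta."""
    return "\n".join(
        f"{name} : {b} -> {a} (delta {d:+.4g})" for name, (b, a, d) in report.items()
    )
```

Calling `render(compare_runs(BEFORE_AAVEUSDC_FOLD3, after))` with the post-fix metrics of each cell gives one comparable block per fold.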

9.3 Dossier correction adopted

§5 row for adapters/lgb.py was wrong (called it the "scale_pos_weight injection point"). The actual injection is in class_balance.py + lightgbm_dag.py. The adapter is the locus of Bug 3 (DataFrame column-order erasure) — a separate concern from H8. Corrected in this commit.

9.4 Next concrete steps (action plan)

Open PR for this dossier (Step 5 = analytical artefact, no code).
Open S19 main PR : θ-sweep range restriction + over-trade guard, LGB-scoped.
Open S19-hardening PR : 3 latent bug fixes (independent CR cycle).
Cross-fold validation : capture OPUSDC fold=3 + LDOUSDC fold=4 via diagnostic__s18_step1_4_chain runs (~40 min wall time total) ; verify pre/post deltas before either S19 PR merges.
Transition OP wp#154 : Developed → In testing after both S19 PRs merge ; Closed after cross-fold validation passes.

10. References

Parent plan dossier : 2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md
Step 2 dossier : step2-legacy-vs-harness.md
Step 4 design : step4-design.md
ADR-89 (training harness as plugin registry) : documentation/adr/0089-training-harness-as-plugin-registry.md
ADR-90 (training hyperparams in PG / Console only) : documentation/adr/0090-training-hyperparameters-in-pg-console-only.md
PR #891 (the harness migration that introduced the regression) : 87 changed files
PR #872 (the FTF 7-bug hotfix where Bug #6 first surfaced the scale_pos_weight × binary_logloss early-stop issue and switched to AUC)
PR #934 (Step 3 parity) — verdict REFUTED on H2
PR #937 (Step 4 forensics) — verdict NO_DIVERGENCE on H3-H5