S18 Step 5 — Re-scope dossier (post-NO_DIVERGENCE)

Status : committee experiment_review PASSED_WITH_REVISIONS 2026-05-14 (see §9 below)
Story : CVN-N001-EE-S18 (OP wp#154, Plan dossier)
Trigger : Step 4 (PR #937) verdict = NO_DIVERGENCE on AAVEUSDC fold=3, plus Phase A log evidence revealing the actual regression mechanism
Author : Claude Opus 4.7 (under operator review)
Date : 2026-05-14


1. Executive summary (1 paragraph)

The S18 diagnostic chain (Steps 0 → 4) ruled out all 7 hypotheses H1-H7 of the parent dossier §5.1 : training-loop config (H1, H6), valid_sets composition (H2), label misalignment (H3), sample weights (H4), feature corruption (H5), HPO param overrides (H7). All PASS. The captured fold's data is clean ; the regression lies outside the captured parquet, in the harness's post-training decision logic. Phase A logs from the chained-DAG run 2026-05-14 14:44 reveal the mechanism : scale_pos_weight=4.71 (auto-injected by the harness ; the pre-#891 prod path defaulted to 1.0 and never applied it) inflates positive-class probabilities → the harness θ-sweep over [0.05, 0.95] (19 candidates ; pre-#891 Optuna range [0.30, 0.40]) picks θ=0.2 → 46 % buy rate on val → 1210 raw signals → 251 final trades after concurrency filter → backtest sortino -9.5, return -91.35 %. The "best_iter=1 shallow training" symptom is a side effect of scale_pos_weight saturating val AUC at iteration 1, NOT the bug locus. The re-scoped hypothesis is H8 = scale_pos_weight auto-injection × wide θ-sweep range coupling. This dossier asks the committee to validate H8 and recommend the S19 fix posture (options A-F in §6).

2. The chain of evidence (Steps 0-4 + Phase A)

2.1 Step 0 (PR #931 merged) — staging replay

Confirmed reproducibility on staging. AAVEUSDC fold=3 Phase A produces f1_buy_val=0.3520485, which matches the canonical post-S17 canary well within tolerance (abs_delta=8e-09 << epsilon=0.005). Verdict : PASS.

2.2 Step 1 (PR #932 merged) — verbose capture

Monkey-patched lgb.train + lgb.Dataset in the running harness ; serialized to parquet. Captured 51 Optuna trials × per-iter eval traces. n_eval_series=2 per trial = 1 valid_set × 2 metrics (auc + binary_logloss).

2.3 Step 2 (PR #933 merged) — git archaeology + side-by-side

Identified valid_sets=[train_set, val_set] (pre-#891) vs valid_sets=[val_set] (harness post-#891) as smoking-gun candidate. Status : analyzed but not the regression locus per Step 3 verdict.

2.4 Step 3 (PR #934 merged) — parity reproducer

Ran both valid_sets configurations on the captured fold with GRID_DEFAULT_HP_LGB. Verdict : REFUTED (legacy_best_iter=1, harness_best_iter=1). Both paths produce the same shallow training. H2 ruled out.

2.5 Loki forensics (2026-05-14 14:44 chained DAG run)

Histogram of 51 Optuna trials in Phase A :

best_iter=1 : 49 trials
best_iter=2 :  2 trials

51 distinct hyperparameter combinations + 2 valid_sets configs (Step 3) = 53 distinct configurations, all early-stopping at iter 1-2. H1 (early stopping config), H6 (metric mismatch), H7 (HPO param override) ruled out — none would affect 53 trainings uniformly with the same outcome.

2.6 Step 4 (PR #937 merged) — data forensics

Six checks on the captured fold (run 2026-05-14 17:47-18:14) :

| Check | Status | Key metric |
| --- | --- | --- |
| F1 label-index alignment | PASS | train_index_match=true, val_index_match=true |
| F2 label distribution | PASS | train_pos_rate=0.175, val_pos_rate=0.169 (Δ < 5 %) |
| F3 NaN ratio | PASS | max_train_nan_ratio < 0.5, max_val < 0.5 |
| F4 feature-label correlation | PASS | max_abs_corr < 0.95 (no leak feature) |
| F5 train vs val drift | PASS | max_sigma_shift < 3.0 (no scaling drift) |
| F6 iter-1 single-tree probe | PASS | iter1_train_auc≈0.78, iter1_val_auc≈0.49, no leak / no misalignment / proba distribution healthy |

Verdict : NO_DIVERGENCE → bug is upstream of the captured parquet → escalate to committee, re-scope §5.1 hypothesis space.
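As an illustration of what checks F2, F3 and F5 assert, here is a minimal sketch. The function names are hypothetical and the exact definitions used by s18_step4_invariants.py are not shown in this dossier ; in particular, treating "Δ < 5 %" as an absolute difference in positive rate and "sigma shift" as a mean shift in units of train standard deviation are assumptions.

```python
import numpy as np
import pandas as pd


def label_distribution_ok(y_train, y_val, max_delta=0.05):
    """F2-style check: train and val positive rates differ by less than max_delta."""
    train_pos_rate = float(np.mean(np.asarray(y_train) == 1))
    val_pos_rate = float(np.mean(np.asarray(y_val) == 1))
    return abs(train_pos_rate - val_pos_rate) < max_delta


def nan_ratio_ok(features: pd.DataFrame, max_ratio=0.5):
    """F3-style check: every column has a NaN ratio below max_ratio."""
    return float(features.isna().mean().max()) < max_ratio


def drift_ok(x_train: pd.DataFrame, x_val: pd.DataFrame, max_sigma=3.0):
    """F5-style check: val column means stay within max_sigma train std of train means."""
    std = x_train.std(ddof=0).replace(0.0, np.nan)  # constant columns are skipped
    shift = ((x_val.mean() - x_train.mean()).abs() / std).dropna()
    return bool(shift.empty or shift.max() < max_sigma)
```

With the label counts quoted in §6.bis (1725 pos / 8128 neg on train, 353 pos / 1740 neg on val), the F2-style check passes under this definition.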

2.7 Phase A log evidence (the smoking gun re-scoped)

event=training_complete model_type=lightgbm best_iteration=1 training_time_sec=2.465
  theta_picked=0.2 f1_buy_val=0.352 auc_buy_val=0.6461
  brier_val=0.1395 ece_val=0.0591 rate_buy_val=0.4611

event=signal_funnel raw_buy_signals=1210 final_trades=251
  primary_killer=concurrency

event=weighted_variant_evaluated sortino=-9.512 n_trades=251 return=-91.35%

The model is NOT broken at the booster level :

  • auc_buy_val=0.6461 : the booster CAN rank positives (above-random)
  • f1_buy_val=0.352 : at the harness-chosen θ=0.2, f1 is reasonable
  • brier_val=0.1395, ece_val=0.0591 : calibration is acceptable

The model IS broken at the trading level :

  • rate_buy_val=0.4611 : the model emits BUY 46 % of the time on val
  • 1210 raw signals → 251 final trades (only the concurrency filter saves us from disaster)
  • Backtest sortino -9.5, return -91.35 %

The actual regression is the post-training θ selection, NOT the training loop.
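The funnel arithmetic can be reproduced directly from the signal_funnel log line quoted in this section. This is a small sketch ; parse_logfmt is a hypothetical helper, not part of the harness.

```python
def parse_logfmt(line: str) -> dict:
    """Split a 'key=value key=value' log line into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


funnel = parse_logfmt(
    "event=signal_funnel raw_buy_signals=1210 final_trades=251 primary_killer=concurrency"
)
raw = int(funnel["raw_buy_signals"])
final = int(funnel["final_trades"])
rejected = raw - final            # signals dropped between raw BUY and executed trade
rejected_share = rejected / raw   # fraction of raw signals that never became trades

print(rejected, round(rejected_share, 3))  # 959 0.793
```

Roughly four out of five raw BUY signals are discarded before execution, which is why the 46 % buy rate translates into 251 trades rather than 1210.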

3. Diff legacy vs harness — six axes

Sources :

  • Pre-#891 prod path : /tmp/s18_pre891/lgbm_*.py (vendored read-only copy of the LGB trainer at dc3d86c6^)
  • Post-#891 harness path : src/training/harness/{nodes,dags/models}/
  • Post-#891 legacy-style path : src/training/LightGBM/cvntrade_LightGBM_grid_utils.py

| # | Axis | Pre-#891 prod | Post-#891 harness | Effect |
| --- | --- | --- | --- | --- |
| D1 | scale_pos_weight injection | LightGBMConfig.scale_pos_weight: float = 1.0 (default ; never set by the autonomous trainer ; only added to lgb.train params if != 1.0) → NOT applied in production | class_balance.compute_class_balance(binary=True) → scale_pos_weight = n_neg / n_pos → ALWAYS applied (= 4.7119 on AAVEUSDC fold=3) | Inflates positive-class probas downstream |
| D2 | θ selection mechanism | Optuna suggests threshold_buy as ONE of the trial params ; jointly optimised with model HPs | θ-sweep is a post-hoc separate pass : Optuna picks model HPs only, then theta_sweep.pick_threshold_on_val() runs after the final retrain | Decouples θ from model-HP joint optimum |
| D3 | θ search range | threshold_buy_range: Tuple[float, float] = (0.30, 0.40) (lgbm_config.py:117) | 19 candidates np.linspace(0.05, 0.95, 19) + [0.5] ; argmax over f1_buy on val (theta_sweep.py:53) | Lets θ go as low as 0.05 (picks 0.2 in our run, outside the legacy bounds) |
| D4 | valid_sets composition | [train_data, val_data] + valid_names=["train", "val"] | [val_set] + valid_names=["val"] (lightgbm_dag.py:171) | Step 3 ruled this out as a dominant cause (both produce best_iter=1 on the captured fold under GRID_DEFAULT_HP_LGB), but it is still a real diff |
| D5 | Eval metric / early stopping | lgb_params does NOT explicitly set metric → LGB default for objective=binary is binary_logloss | metric=["auc", "binary_logloss"] + first_metric_only=True → AUC drives early stop (PR #872 fix per lgbm_grid_utils.py:53-65 Bug #6 comment) | Switch to AUC was a Bug #6 fix ; ironically, with scale_pos_weight=4.71, AUC saturates at iter 1 anyway |
| D6 | Calibration | Optuna suggests calibration ∈ {none, sigmoid, isotonic} jointly with HPs | theta_sweep.py docstring claims "always-on calibration node (committee enhancement #2)", but calibration is NOT applied in our Phase A run (no event=calibration_* in Loki ; ECE 0.059 is the raw model's) | Pre-#891 explored a 3-way calibration choice ; post-#891 effectively skips it |

4. Re-scoped hypothesis — H8

H8 = scale_pos_weight auto-injection (D1) × wide θ-sweep range (D2 + D3) coupling produces a 46 % buy rate → catastrophic backtest, even though the booster's f1_val and AUC scores look acceptable.

H8 is the post-Step 4 escape hatch the parent dossier §7.1 explicitly anticipated : "if §5.3 invariants 5.3.1–5.3.3 all pass without divergence, STOP at end of Step 2 and re-evaluate hypothesis space §5.1 before continuing. Don't chase a phantom into Step 4." — Step 4 confirmed the data is clean ; the bug is in the harness's post-training decision logic, not in any of H1-H7.

4.1 Mechanism (chain of causation)

1. class_balance.py:62 → scale_pos_weight = n_neg/n_pos = 4.7119
2. lgb.train trains the booster with scale_pos_weight active
   → positive-class probas inflated (most probas in [0.10, 0.30])
3. AUC saturates at iter 1 (booster ranks positives well immediately)
   → early_stopping(50) fires after 50 idle iters → best_iter = 1
4. theta_sweep.pick_threshold_on_val() runs on inflated probas
   → finds best f1_buy at θ = 0.20 (because that's where the inflated
     probas concentrate around the decision boundary)
5. At θ=0.20 : 46% of val candles flagged BUY (rate_buy_val=0.4611)
6. Inference path emits 1210 raw BUY signals on test
7. concurrency filter rejects 959 (the only saving grace)
8. 251 trades execute → catastrophic backtest
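Step 4 of the chain can be sketched on synthetic data. The code below is an illustration only : pick_threshold is a hypothetical stand-in for theta_sweep.pick_threshold_on_val (whose real signature is not shown in this dossier), and the score distribution is invented to mimic weakly informative probabilities concentrated at low values. It is not the captured fold.

```python
import numpy as np


def f1_buy(y_true, y_pred):
    """F1 of the positive (BUY) class, 0.0 when there is nothing to score."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(y_true, proba, candidates):
    """Return (theta, f1, buy_rate) for the first candidate with the highest BUY f1."""
    best = None
    for theta in candidates:
        y_pred = (proba >= theta).astype(int)
        score = f1_buy(y_true, y_pred)
        if best is None or score > best[1]:
            best = (float(theta), score, float(y_pred.mean()))
    return best


rng = np.random.default_rng(0)
y_val = (rng.random(2093) < 0.169).astype(int)
# Synthetic scores: positives sit only slightly above negatives, both at low values.
proba = np.clip(0.15 + 0.08 * y_val + rng.normal(0.0, 0.08, y_val.size), 0.0, 1.0)

wide = pick_threshold(y_val, proba, np.linspace(0.05, 0.95, 19))    # harness-style range
narrow = pick_threshold(y_val, proba, np.linspace(0.30, 0.40, 5))   # legacy-style bounds
print("wide  :", wide)
print("narrow:", narrow)
```

On scores shaped like this, the wide grid is free to settle on a θ below the legacy floor of 0.30, and the buy rate at that θ is at least as high as at any θ inside the legacy bounds. That is the qualitative behaviour H8 describes ; the actual magnitudes on AAVEUSDC fold=3 are the ones in §2.7.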

4.2 Why pre-#891 didn't show this

1. LightGBMConfig.scale_pos_weight = 1.0 (default, never overridden)
   → no class rebalancing → balanced probas
2. lgb.train with valid_sets=[train, val] + binary_logloss default
   → train AUC always-improving keeps early stop from firing
   → trains 50-200 iters → final AUC 0.7-0.8
3. Optuna jointly suggests threshold_buy ∈ [0.30, 0.40]
   → θ stays in the conservative range
4. At θ=0.40 : far fewer val candles cross threshold → moderate buy rate
5. Reasonable trade count → backtest survives
6. f1_buy_val ≈ 0.42 (the canonical pre-#891 reference)

5. Critical files involved (for operator audit)

Pre-#891 (vendored at /tmp/s18_pre891/, no longer on main)

| File | Lines | What to verify |
| --- | --- | --- |
| lgbm_autonomous_trainer.py | 19 K | Confirm it never sets scale_pos_weight ; calls CVNTrade_LightGBMTrainer.train() |
| lgbm_trainer.py | 13 K | Lines 75-89 : lgb.train(params=lgb_params, valid_sets=[train_data, val_data]) ; params = config.to_lgb_params() |
| lgbm_config.py | 5 K | L31 scale_pos_weight: float = 1.0 ; L83-84 if self.scale_pos_weight != 1.0: params["scale_pos_weight"] = ... ; L117 threshold_buy_range = (0.30, 0.40) |
| lgbm_hyperoptimizer.py | 16 K | L217 + L239 confirms Optuna suggests threshold_buy as a trial param within threshold_buy_range |

Post-#891 harness (current main, src/training/harness/)

| File | Lines | What to verify |
| --- | --- | --- |
| nodes/class_balance.py | 79 | L62 always returns scale_pos_weight = n_neg / n_pos for binary ; never opt-in |
| nodes/theta_sweep.py | 64 | L53 candidates = np.linspace(0.05, 0.95, 19) + [0.5] (NOT bounded to legacy [0.30, 0.40]) |
| nodes/hpo_optuna.py | | Confirms Optuna doesn't suggest threshold_buy (delegated to theta_sweep node) |
| dags/models/lightgbm_dag.py | 312 | L171 valid_sets=[val_set] ; L260-296 _hpo_space lists 9 model HPs, NO threshold_buy |
| adapters/lgb.py | | Inference-side wrapper for lgb.Booster ; predict_proba. NOT the scale_pos_weight injection site (that lives in class_balance.py + lightgbm_dag.py). The adapter is the locus of Bug 3 (DataFrame column-order erasure), a separate concern from H8. (Corrected 2026-05-14 post-committee.) |
| contracts.py | | ClassBalance dataclass ; TrainedArtifact carries the picked threshold |

Post-#891 legacy-style on main (src/training/LightGBM/)

| File | What to verify |
| --- | --- |
| cvntrade_LightGBM_grid_utils.py | L86-89 also auto-sets scale_pos_weight = n_neg / n_pos ; L53-65 comment explains Bug #6 (binary_logloss + scale_pos_weight saturating early stop) |
| cvntrade_LightGBM_autonomous_trainer.py | Routes between legacy + harness path via CVN_USE_HARNESS |

Diagnostic infrastructure (already in place from Steps 0-4)

src/commun/finetune/diagnostic/s18_step0_replay.py      # PR #931
src/commun/finetune/diagnostic/s18_step1_capture.py     # PR #932
src/commun/finetune/diagnostic/s18_step3_parity.py      # PR #934
src/commun/finetune/diagnostic/s18_step4_invariants.py  # PR #937
dags/dag_diagnostic__s18_step0.py
dags/dag_diagnostic__s18_step1_3_chain.py               # PR #935
dags/dag_diagnostic__s18_step1_4_chain.py               # PR #937
documentation/reviews/2026-05-13-...-s18-...-plan.md    # parent dossier
documentation/missions/cvn-n001-ee-s18-diagnostic/      # Step 2/4/5 dossiers

Reference / canary

  • PG finetune_results : (run_id=ftf_*, crypto=AAVEUSDC, fold_id=3, variant=lightgbm) → f1_buy=0.3520485 (post-#891 canonical)
  • The pre-#891 reference f1_buy ≈ 0.42 cited in the parent dossier §3 — its source needs to be re-confirmed (likely from a pre-S17 FTF run or from an earlier sweep ; the DB row may have been overwritten by post-S17 runs)

6. Decision matrix — S19 fix options

| Option | Change | Files touched | Pros | Cons | Risk |
| --- | --- | --- | --- | --- | --- |
| A : Disable scale_pos_weight (revert to legacy 1.0) | class_balance.py:62 → return scale_pos_weight=1.0 (or skip the field entirely) for the binary branch | 1 file, ~3 lines | Restores pre-#891 proba distribution → θ-sweep picks higher θ → fewer trades, healthier backtest | May lose the f1 lift that scale_pos_weight provides on rare-positive distributions ; XGB and CB are also affected (their adapters consume the same ClassBalance) | Medium : XGB / CB regression possible |
| B : Restrict θ-sweep range to legacy [0.30, 0.40] | theta_sweep.py:53 → narrow np.linspace(0.30, 0.40, ...) | 1 file, 1 line | Forces conservative θ even if probas are inflated → reasonable trade rate | With scale_pos_weight still active, the bound clip will produce a θ that wastes f1 (sub-optimal for the inflated proba distribution) ; XGB doesn't use theta_sweep so unaffected | Low : narrow surface |
| C : A + B both | class_balance.py + theta_sweep.py | 2 files, ~4 lines | Best mimics legacy ; safest restore | Two simultaneous changes → harder to attribute the recovery ; doubles the diff to verify | Medium |
| D : Add scale_pos_weight to Optuna search | lightgbm_dag.py:_hpo_space() add scale_pos_weight: trial.suggest_float(...) over [1.0, 5.0] ; class_balance.py reads from params instead of computing | 2 files, ~10 lines | Principled : let Optuna jointly optimize scale_pos_weight × model HPs | Multiplies HPO search space ; longer per-trial ; Optuna will likely converge to the same n_neg/n_pos value because that's the f1_val maximizer (defeats the purpose) | High : does NOT address the root cause (f1_val maximization on val drives the bad outcome regardless) |
| E : Decouple θ-sweep optimization from f1_val | Replace theta_sweep.pick_threshold_on_val(... f1_buy) with sortino-on-walkforward-backtest or Brier-score or a regularized f1 (penalty on rate_buy > 0.20) | theta_sweep.py + new node | Addresses the actual symptom (θ that wins f1 wins the wrong objective) | Major design change ; backtest in θ-sweep is expensive ; needs S20+ scope | Out-of-scope for S19 |
| F : A+B+C + 3 latent bug fixes (BUNDLED) | Pick A / B / C as the H8 fix posture + Bug 1 (class_balance n_neg guard) + Bug 2 (labels=[0,1] in both files) + Bug 3 (lgb.py preserve feature_names) | class_balance.py + theta_sweep.py + eval_metrics.py + adapters/lgb.py + (per A/B/C pick) | Same PR surface for related defensive fixes ; zero behavioural change on healthy paths ; closes 3 latent prod risks for free | Mixes H8 fix (behavioural) with bug fixes (defensive) ; committee may prefer separate PRs for clean revert envelope | Low : each fix is independently small + tested |
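To make the "regularized f1 (penalty on rate_buy > 0.20)" idea of Option E concrete, here is one possible form. It is a sketch under assumptions : the linear penalty shape, the penalty weight and the function name are all hypothetical, and nothing of the kind exists in theta_sweep.py today.

```python
import numpy as np


def regularized_f1(y_true, proba, theta, max_rate=0.20, penalty=1.0):
    """BUY f1 at threshold theta, minus a linear penalty on buy rate above max_rate."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(proba) >= theta).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    rate_buy = float(y_pred.mean())
    # Only the excess over max_rate is penalised; rates at or below it score plain f1.
    return f1 - penalty * max(0.0, rate_buy - max_rate)
```

A θ that flags every candle keeps whatever f1 it earns but loses penalty × (rate_buy - 0.20), so a moderately selective θ can outrank it even when the raw f1 values are close.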

6.bis Latent bugs found during code audit (operator review 2026-05-14)

A targeted re-read of class_balance.py + theta_sweep.py + eval_metrics.py + adapters/lgb.py surfaced 3 latent bugs. Bugs 1 and 2 do NOT fire on AAVEUSDC fold=3, and Bug 3 probably does not (so they don't explain our specific regression), but they WILL fire on degenerate-split folds or drifting feature schemas in the FTF sweep and may contribute to the broader regression cited in the parent dossier §3 (e.g. XGB canary f1_buy=0.089).

Bug 1 — class_balance.py:55-62, missing guard on n_neg == 0

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
if n_pos == 0:
    raise ValueError(...)        # only this branch fail-fasts
return ClassBalance(scale_pos_weight=n_neg / n_pos, ...)

When n_neg == 0 (train split is 100 % BUY) → returns scale_pos_weight = 0 / n_pos = 0.0 silently → LGB / XGB train with the positive class weighted at zero (on an all-BUY split, that is every sample) → barely-trained booster, no exception raised. Per ADR-25 (no silent fallback), this MUST fail-fast.

Patch :

if n_pos == 0 or n_neg == 0:
    raise ValueError(
        f"compute_class_balance: degenerate training labels — "
        f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split."
    )

Does it fire on AAVEUSDC fold=3 ? No — n_pos=1725, n_neg=8128, both strictly positive.
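A self-contained sketch of the guarded computation, using the AAVEUSDC fold=3 label counts quoted above. compute_scale_pos_weight is a simplified, hypothetical stand-in for compute_class_balance : it returns a bare float rather than the ClassBalance dataclass.

```python
import numpy as np


def compute_scale_pos_weight(y) -> float:
    """Return n_neg / n_pos, failing fast when either class is absent."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError(f"degenerate training labels: n_pos={n_pos}, n_neg={n_neg}")
    return n_neg / n_pos


y_train = np.r_[np.ones(1725, dtype=int), np.zeros(8128, dtype=int)]
weight = compute_scale_pos_weight(y_train)
print(round(weight, 4))  # 4.7119
```

The result agrees with the 4.7119 reported for D1, and an all-BUY or all-negative split now raises instead of returning 0.0 or dividing by zero.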

Bug 2 — theta_sweep.py:59 + eval_metrics.py:69, missing labels=[0, 1]

_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
f1_buy = float(f1[1]) if len(f1) > 1 else 0.0

sklearn's precision_recall_fscore_support(..., average=None) without labels= returns metrics indexed by set(y_true) ∪ set(y_pred) sorted ascending. When that union has only label 1 (e.g. y_true = y_pred = [1, 1, 1]), it returns f1 = [1.0] of length 1 → the len(f1) > 1 else 0.0 guard returns 0.0 instead of the true 1.0. Optuna then optimises on broken signal.

Patch (both files) :

_, _, f1, _ = precision_recall_fscore_support(
    y_true_or_y, y_pred, labels=[0, 1], average=None, zero_division=0,
)
f1_buy = float(f1[1])    # always safe with explicit labels

Does it fire on AAVEUSDC fold=3 ? No : y_val has both classes (353 pos + 1740 neg), so the union always contains {0, 1} regardless of y_pred's content → len(f1) == 2 → f1[1] correctly indexes BUY.
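The degenerate case is easy to reproduce in isolation with scikit-learn. The snippet below uses the mono-class example from the Bug 2 description and compares the call with and without labels=[0, 1].

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1]
y_pred = [1, 1, 1]

# Without labels=, only classes present in y_true or y_pred are reported.
_, _, f1_implicit, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# With labels=[0, 1], the result has one entry per listed class, in that order.
_, _, f1_explicit, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], average=None, zero_division=0
)

print(len(f1_implicit), len(f1_explicit))  # 1 2
print(float(f1_explicit[1]))               # 1.0
```

With the implicit call, the `len(f1) > 1 else 0.0` guard reports 0.0 for a perfect BUY prediction ; with explicit labels, f1[1] is the BUY score by construction.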

Bug 3 — adapters/lgb.py:42-43, DataFrame → ndarray strips column order

def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    # PR #900 — DataFrame-native at inference, ndarray at training.
    if isinstance(x, pd.DataFrame):
        x = x.to_numpy()                     # ← strips column names
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    ...

The docstring claims "DataFrame-native" but the implementation is the opposite — .to_numpy() reduces the input to a positional ndarray. lgb.Booster.predict(ndarray) does NOT validate feature names ; if the DataFrame's column order at inference differs from the training matrix, LightGBM silently predicts on the WRONG features. No exception, no log, no metric anomaly — just silently corrupted probabilities downstream.

Patch (minimal — preserve columns + reorder if needed) :

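def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        # Assumes the booster was trained with real feature names ; a booster
        # trained on a bare ndarray reports auto-generated "Column_<i>" names,
        # in which case the lookup below raises KeyError instead of reordering.
        expected = list(self._native.feature_name())
        if list(x.columns) != expected:
            x = x[expected]                  # reorder to training-time schema
        # keep as DataFrame : lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        # binary objective returns P(class 1) only ; expand to [P(0), P(1)]
        return np.column_stack([1.0 - proba, proba])
    return proba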

Does it fire on AAVEUSDC fold=3 ? Probably not at training time (the harness trains on the same ndarray it later predicts on for θ-sweep + eval — order is consistent by construction). Possibly at backtest time : CVNTrade_BacktestEngine calls adapter.predict_proba candle-by-candle on the live FE-pipeline DataFrame. If column ordering of that DataFrame ever diverges from the training cache's order (e.g. after a FE pipeline cache rebuild, or after a feature-selection set update), all backtest predictions are silently corrupted. Our observed auc_buy_val=0.65 rules out total feature scrambling on THIS run (random predictions would give AUC ~0.5), but Bug 3 plausibly drives any feature-order-sensitive cross-crypto / live regression cited in the parent dossier §3.
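The failure mode does not need LightGBM to demonstrate. In the sketch below, a fixed weight vector stands in for a trained model, and the feature names rsi and atr are hypothetical ; the only point is what happens when a DataFrame is consumed by position after its columns have been reordered.

```python
import numpy as np
import pandas as pd

train_columns = ["rsi", "atr"]    # hypothetical training-time feature order
weights = np.array([1.0, -2.0])   # stand-in for a trained model


def score_positional(x: pd.DataFrame) -> np.ndarray:
    """Mimics the .to_numpy() path: columns are consumed by position only."""
    return x.to_numpy() @ weights


def score_by_name(x: pd.DataFrame) -> np.ndarray:
    """Reorders to the training-time schema before dropping to an ndarray."""
    return x[train_columns].to_numpy() @ weights


row = pd.DataFrame({"rsi": [0.7], "atr": [0.1]})
shuffled = row[["atr", "rsi"]]    # same data, different column order

print(score_positional(row), score_positional(shuffled), score_by_name(shuffled))
```

The positional path returns a different score for the shuffled frame without raising, while the by-name path returns the same score as the original frame. That silent divergence is what the patch above is meant to close.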

Operator severity ranking (2026-05-14 audit)

  1. class_balance.py (scale_pos_weight=0 on single-class train) : silently destructive, can pass through CI undetected on rare-pos cryptos.
  2. adapters/lgb.py — DataFrame column-order erasure : silently wrong predictions in any path where the DataFrame schema can drift (backtest, live inference, walk-forward predictor).
  3. theta_sweep.py / eval_metrics.py — missing labels=[0, 1] : silently wrong metrics on mono-class splits → Optuna optimises on garbage.

Why all 3 matter even though our specific run is unaffected

  • Cross-fold risk : the FTF sweep cited in the parent dossier §3 covered defi_top5 × 5m × multiple folds × 3 models. SOME of those folds may have produced a degenerate train (all-BUY) or a degenerate val (all-one-class). Bug 1 silently broke any such training ; Bug 2 returned 0.0 f1_buy that contaminated Optuna's score → cascading bad hyperparam choices ; Bug 3 may have corrupted cross-crypto inference when feature schemas diverged.
  • Plausibility on the XGB canary : XGB stays at fixed θ=0.5 (no θ-sweep) — but XGB's autonomous trainer also calls eval_metrics.evaluate_split_binary which has Bug 2. If XGB's prediction at θ=0.5 produced a degenerate y_pred, the reported f1_buy=0.089 mean across folds could be a Bug 2 measurement artefact superimposed on the H8 mechanism.
  • Defensive cost is zero-to-small : Bug 1 patch is 1 line ; Bug 2 patch is 1 line per file ; Bug 3 patch is ~5 lines + a column-reorder. All fail-fast / defensive, no behaviour change on currently-healthy paths, fully consistent with ADR-25.
  • Other adapters audited (2026-05-14) :
      • adapters/xgb.py:57 : passes feature names via xgb.DMatrix(..., feature_names=list(x.columns)) → DMatrix validates column order against training feature_names → Bug 3 does NOT apply to XGB.
      • adapters/cb.py:37 : calls self._native.predict_proba(x) directly on the CatBoost classifier ; CatBoost's sklearn-compatible API handles DataFrames natively + validates column order → Bug 3 does NOT apply to CB.
      • Bug 3 is LGB-only. Patch surface : single file adapters/lgb.py.

7. Questions for the committee (experiment_review)

  1. Validate H8 : do the experts agree the re-scoped hypothesis (D1 × D3 coupling driving over-trading) is the dominant locus on AAVEUSDC fold=3, or are there hidden axes we missed (D5, D6, calibration node behaviour, walk-forward predict path) ?
  2. Validate the 3 latent bugs found by operator audit (Bug 1 / Bug 2 / Bug 3 in §6.bis) : do the experts agree each is a real defect ? Should S19 close all 3 in the same PR as the H8 fix (Option F bundle), or split them into a separate defensive-hardening PR with its own CR cycle ? Operator severity ranking : Bug 1 > Bug 3 > Bug 2.
  3. Source of pre-#891 reference f1≈0.42 : the parent dossier cites this number ; can we recover the exact PG row / Loki window that produced it, or is it extrapolated ? If extrapolated, should we re-establish a pre-#891 canary by running train_with_fixed_params_lgbm with scale_pos_weight=1.0 + θ=0.4 on the captured fold to get an empirical baseline ?
  4. S19 fix posture for H8 : which of A / B / C is least risky for production ? Option D (add to HPO) is rejected by us as theoretically self-defeating ; Option E is out-of-scope ; Option F bundles A/B/C with the 3 bug fixes.
  5. XGB / CB regression risk under Option A : disabling scale_pos_weight in class_balance.py affects all 3 adapters. Should S19 be LGB-only (introduce a per-model class_balance override) ?
  6. Test plan for S19 : is a single-fold re-run on the captured parquet sufficient, or should S19 be validated cross-fold (OPUSDC fold=3, LDOUSDC fold=4 per parent dossier reco #6) before merge ?
  7. Verdict : PASS / PASS_WITH_REVISIONS / REJECTED, with an explicit recommendation for the next concrete step (S19 design draft, additional diagnostic, or further re-scoping).

8. What S19 would NOT do (out of scope)

  • Re-design the harness θ-sweep mechanism end-to-end (committee enhancement #5 was about always-on calibration ; that's a separate Story)
  • Rewrite the autonomous trainer entry point
  • Touch FTF data prep / labeling pipeline (Step 4 confirmed data is clean)
  • Touch CUSUM / FE / inference filter chain
  • Modify the harness's training-loop config (ruled out by Steps 1-4)

9. Committee verdict (experiment_review) — 2026-05-14

Status : PASSED_WITH_REVISIONS.

The committee accepted the H8 re-scope as the dominant explanation for the AAVEUSDC fold=3 symptom and validated all 3 latent bugs (Bug 1 / Bug 2 / Bug 3 in §6.bis). The recommendation diverges from this dossier's Option F (bundle) toward a 2-PR split with a more conservative H8 posture :

9.1 S19 recommendation — Option B targeted (LGB-only) + separate hardening PR

S19 main PR (H8 fix, LGB-scoped only) :

  1. Restrict the θ-sweep candidates for the LGB path to [0.30, 0.40] (matching the legacy threshold_buy_range from lgbm_config.py:117) or, preferred, make the range configurable per model (resolves cleanly via the ADR-90 CVN_HPO_LGB_* keys).
  2. Keep scale_pos_weight active. Option A (disabling it globally) is REJECTED : insufficient evidence that turning it off is safe for XGB and CB, and the cross-model regression risk is too high without targeted experimentation.
  3. Add a guard : if rate_buy_val exceeds a limit in the 0.20-0.25 range, emit event=theta_overtrade_warning (or fail per a configurable mode flag).
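The over-trade guard from the S19 main PR recommendation could look like the following. This is a sketch, not the agreed design : the function name, the default limit of 0.25 (the upper end of the committee's 0.20-0.25 range) and the "warn" / "fail" mode values are assumptions.

```python
import logging

logger = logging.getLogger("theta_sweep")


def check_overtrade(rate_buy_val: float, max_rate: float = 0.25, mode: str = "warn") -> bool:
    """Return True when the buy rate exceeds max_rate; raise instead if mode == "fail"."""
    if rate_buy_val <= max_rate:
        return False
    message = (
        f"event=theta_overtrade_warning rate_buy_val={rate_buy_val:.4f} "
        f"max_rate={max_rate:.2f}"
    )
    if mode == "fail":
        raise ValueError(message)
    logger.warning(message)
    return True
```

Called with the Phase A value rate_buy_val=0.4611, the guard trips at either end of the proposed range.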

Separate hardening PR (3 latent bugs) :

  • class_balance.py : fail-fast when n_pos == 0 OR n_neg == 0 (extend the existing guard).
  • theta_sweep.py:59 + eval_metrics.py:69 : add labels=[0, 1] to precision_recall_fscore_support(...) so the f1 array is always length-2 indexed by class.
  • adapters/lgb.py:42-43 : preserve column order or reorder to self._native.feature_name() when the input is a DataFrame ; do NOT strip names via .to_numpy().

Option D rejected (HPO over scale_pos_weight) : same f1-on-val optimisation target → same over-trading attractor.

Option F rejected : bundling H8 fix with defensive hardening in a single PR muddies the revert envelope ; the committee prefers a clean split so each PR has its own CR cycle + audit trail.

9.2 Validation before merge

Run the fix candidate on 3 captured folds :

  • AAVEUSDC fold=3 (the canary)
  • OPUSDC fold=3
  • LDOUSDC fold=4

For each cell, compare BEFORE / AFTER on : theta_picked, rate_buy_val, raw_buy_signals, final_trades, sortino, return.
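A minimal sketch of the BEFORE / AFTER comparison for one cell. compare_runs is a hypothetical helper ; the BEFORE values for AAVEUSDC fold=3 are the ones logged in §2.7, and the AFTER values have to come from the re-run, so none are shown here.

```python
METRICS = (
    "theta_picked", "rate_buy_val", "raw_buy_signals",
    "final_trades", "sortino", "return",
)

# BEFORE values for the canary cell, taken from the Phase A logs in §2.7.
AAVE_BEFORE = {
    "theta_picked": 0.2,
    "rate_buy_val": 0.4611,
    "raw_buy_signals": 1210,
    "final_trades": 251,
    "sortino": -9.512,
    "return": -91.35,   # percent
}


def compare_runs(before: dict, after: dict) -> dict:
    """Return {metric: (before, after, after - before)} for the six tracked metrics."""
    missing = [m for m in METRICS if m not in before or m not in after]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return {m: (before[m], after[m], after[m] - before[m]) for m in METRICS}
```

Raising on a missing metric keeps a partially logged re-run from being read as a clean comparison.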

9.3 Dossier correction adopted

§5 row for adapters/lgb.py was wrong (called it the "scale_pos_weight injection point"). The actual injection is in class_balance.py + lightgbm_dag.py. The adapter is the locus of Bug 3 (DataFrame column-order erasure) — a separate concern from H8. Corrected in this commit.

9.4 Next concrete steps (action plan)

  1. Open PR for this dossier (Step 5 = analytical artefact, no code).
  2. Open S19 main PR : θ-sweep range restriction + over-trade guard, LGB-scoped.
  3. Open S19-hardening PR : 3 latent bug fixes (independent CR cycle).
  4. Cross-fold validation : capture OPUSDC fold=3 + LDOUSDC fold=4 via diagnostic__s18_step1_4_chain runs (~40 min wall time total) ; verify pre/post deltas before either S19 PR merges.
  5. Transition OP wp#154 : Developed → In testing after both S19 PRs merge ; Closed after cross-fold validation passes.

10. References