
Plan dossier — CVN-N001-EE-S19 : Harness over-trade fix

Story : CVN-N001-EE-S19 (OP wp#165, GH #940)
Parent epic : CVN-N001-EE — F1_buy boost (10-track plan)
Predecessor : CVN-N001-EE-S18 (closed 2026-05-14, OP wp#154)
Author : Operator + Claude
Date : 2026-05-14
Status : v2 — committee plan_review PASSED_WITH_REVISIONS 2026-05-14 (session 1f4335a2, OP Meeting #135, strong consensus, 0 blockers, 8 recommendations integrated below — see §13)


1. Context

S18 (closed 2026-05-14) ran a diagnostic chain (Steps 0-4) on the AAVEUSDC fold=3 cell to localise the post-#891 harness regression. Together, these steps ruled out hypotheses H1-H7 of the parent dossier §5.1 :

| Step | Verdict | What it ruled out |
|---|---|---|
| Step 0 | PASS — replay reproduces canonical f1=0.3520 | reproducibility OK |
| Step 1 | (capture, no verdict) | full Optuna trial trace serialized to parquet |
| Step 2 | (analysis dossier) | side-by-side legacy vs harness code diff |
| Step 3 | REFUTED on H2 (valid_sets composition) | both [train,val] and [val] produce best_iter=1 |
| Step 4 | NO_DIVERGENCE on F1-F6 | data is clean (label, features, drift, iter-1 probe all PASS) |

Phase A logs (chained DAG run 2026-05-14 14:44) revealed the actual mechanism :

event=training_complete model_type=lightgbm best_iteration=1 training_time_sec=2.465
  theta_picked=0.2 f1_buy_val=0.352 auc_buy_val=0.6461 rate_buy_val=0.4611
event=signal_funnel raw_buy_signals=1210 final_trades=251 primary_killer=concurrency
event=weighted_variant_evaluated sortino=-9.512 n_trades=251 return=-91.35%

The model is NOT broken at the booster level (AUC 0.65, calibration acceptable) — it's broken at the post-training θ selection : scale_pos_weight=4.71 (auto-injected by class_balance.py) inflates positive-class probas, the harness θ-sweep over [0.05, 0.95] finds θ=0.2 because that's where f1_val maxes on the inflated distribution, and the model emits BUY 46 % of the time on val → catastrophic backtest.
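
A back-of-envelope check of this mechanism can be sketched as follows. It assumes that class_balance.py sets scale_pos_weight = n_neg / n_pos (the dossier only states the value 4.71) and uses the standard prior-shift approximation that up-weighting positives by w multiplies the predicted odds by w. The function names are illustrative, not part of the harness.

```python
def implied_base_rate(scale_pos_weight: float) -> float:
    """Positive-class prior implied by scale_pos_weight = n_neg / n_pos (assumed)."""
    return 1.0 / (1.0 + scale_pos_weight)


def inflate(p: float, w: float) -> float:
    """Probability after multiplying the odds p / (1 - p) by the weight w."""
    return (w * p) / (w * p + (1.0 - p))


def deflate(p_weighted: float, w: float) -> float:
    """Inverse of inflate(): recover the unweighted probability."""
    return p_weighted / (p_weighted + w * (1.0 - p_weighted))


w = 4.71
base = implied_base_rate(w)
print(round(base, 4))               # implied prior BUY rate
print(round(inflate(base, w), 4))   # where a sample at the prior lands after weighting
print(round(deflate(0.2, w), 4))    # unweighted probability behind theta=0.2
```

Under these assumptions, θ=0.2 on the weighted scale corresponds to an unweighted probability of roughly 0.05, well below the implied base rate of about 0.175, which is consistent with the high BUY rate observed on val.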

The S18 Step 5 dossier (committee experiment_review PASSED_WITH_REVISIONS) re-scoped this as H8 = scale_pos_weight auto-injection × wide θ-sweep range coupling. H8 was not in the parent dossier's H1-H7 enumeration — gap in the original analysis.

The committee also validated 3 latent bugs found by operator code audit (§6.bis of Step 5 dossier) :

  • Bug 1 — class_balance.py:55-62 fails fast on n_pos == 0 only, NOT n_neg == 0 → silent scale_pos_weight=0 on degenerate splits
  • Bug 2 — theta_sweep.py:59 + eval_metrics.py:69 missing labels=[0, 1] → wrong f1 on mono-class splits
  • Bug 3 — adapters/lgb.py:42-43 strips DataFrame column names via .to_numpy() → silently wrong predictions if column order drifts. LGB-only (XGB + CB safe).

S19 is the remediation Story.

2. Goals

  1. Restore reasonable trading rate on the LGB harness path : rate_buy_val ≤ 0.20 (from current 0.46) on the AAVEUSDC fold=3 cell, with no degradation > 5 % on f1_buy_val vs the post-S17 canonical reference (0.3520485).
  2. Backtest sortino > 0 on AAVEUSDC fold=3 + cross-fold validation cells.
  3. Fail-fast on degenerate training splits (Bug 1) — no more silent scale_pos_weight=0.
  4. Correct f1 measurement on mono-class splits (Bug 2) — Optuna no longer optimises on broken signal.
  5. Preserve DataFrame column order at LGB inference (Bug 3) — eliminate the silent feature-scrambling risk in backtest / live inference.
  6. Per-model θ-sweep range configurable via the existing ADR-90 PG-keys mechanism — no hardcoded constants.

3. Non-goals

  • Disabling scale_pos_weight globally (S18 committee Option A REJECTED — insufficient evidence on XGB/CB regression risk).
  • Adding scale_pos_weight to Optuna search (S18 committee Option D REJECTED — same f1-on-val attractor, would re-converge to the same over-trading optimum).
  • Re-designing the θ-sweep to optimise sortino instead of f1_val (Option E — major scope, future Story).
  • Touching XGB or CB adapters / θ-sweep (the regression is LGB-specific per the S18 evidence ; XGB has its own f1=0.089 issue covered in a separate Story scope).
  • Modifying the FTF data prep, labeling, or feature pipeline (Step 4 confirmed the data is clean).
  • Touching the autonomous trainer entry point (the regression is in the harness post-training nodes).

4. Architecture

4.1 Two-PR split (committee directive)

The committee verdict (Step 5 dossier §9) explicitly rejected Option F (bundling H8 + bug fixes in one PR) for clean revert envelope + independent CR cycles. S19 ships as 2 PRs :

| PR | Scope | Surface | Behavioural change |
|---|---|---|---|
| S19 main | H8 fix (LGB θ range + over-trade guard) + Bug 2 (BOTH files) | 5 files : theta_sweep.py, eval_metrics.py, lightgbm_dag.py, hyperparams.py registration, PG seed | Yes — alters the θ pick on LGB |
| S19-hardening | 3 defensive fixes : Bug 1 (class_balance.py) + Bug 3 (adapters/lgb.py) + bool extension to hyperparams.py | 3 files : class_balance.py, adapters/lgb.py, hyperparams.py (bool branch + _PARAM_TYPES entry) | No on healthy paths ; defensive fail-fast on degenerate paths |

CR PR #941 r3 BLOCKER #1 resolved : the v3 split had Bug 2 (theta_sweep.py:59 + eval_metrics.py:69) in the hardening PR, but theta_sweep.py is also modified by the main PR (θ range + guard). That contradicted the "independently mergeable" claim and would have produced a merge conflict regardless of merge order. Resolution : Bug 2 (both files) MOVED to S19 main, where theta_sweep.py is touched anyway and eval_metrics.py uses the same precision_recall_fscore_support call. S19-hardening no longer shares theta_sweep.py or eval_metrics.py with S19 main.

The 2 PRs now overlap on a single file, hyperparams.py, which they touch on different lines (main registers new keys in _PARAM_TYPES ; hardening adds the bool branch + FEATURE_ORDER_STRICT entry). Recommended merge order remains : hardening first (defensive, low risk), then main (behavioural, higher review cost). The hardening PR also lands the bool extension to hyperparams.py which the main PR depends on for FEATURE_ORDER_STRICT (per CR r3 reco #5 — must NOT use the temporary os.environ.get bridge in main).

| Concern | Pre-r3 (v3) | Post-r3 (v4) |
|---|---|---|
| theta_sweep.py | Touched by BOTH (Bug 2 + θ range) — merge conflict | Touched ONLY by main |
| eval_metrics.py | Touched by hardening (Bug 2) | Touched ONLY by main (bundled with theta_sweep.py Bug 2) |
| class_balance.py | Touched ONLY by hardening (Bug 1) | Unchanged — only by hardening |
| adapters/lgb.py | Touched ONLY by hardening (Bug 3) | Unchanged — only by hardening |
| hyperparams.py | Touched by main (param registration) | Touched by BOTH but on DIFFERENT lines — main adds new keys to _PARAM_TYPES, hardening adds bool branch + FEATURE_ORDER_STRICT entry. Merge order : hardening first to avoid conflict in _PARAM_TYPES dict. |
| lightgbm_dag.py | Touched by main only | Touched by main only |

4.2 S19 main PR — H8 fix design

4.2.1 Per-model θ-sweep range via ADR-90 keys

The current theta_sweep.pick_threshold_on_val() hard-codes the candidates np.linspace(0.05, 0.95, 19) + [0.5]. S19 main adds two kwargs theta_min + theta_max :

# src/training/harness/nodes/theta_sweep.py
def pick_threshold_on_val(
    y_val: np.ndarray,
    p_buy_val: np.ndarray,
    *,
    theta_min: float = 0.05,
    theta_max: float = 0.95,
) -> ThetaPick:
    """..."""
    # CR PR #941 r2 reco #1 — when (theta_min, theta_max) is restricted (e.g.
    # [0.30, 0.40] for the LGB tightening), the canonical [0.5] anchor MUST
    # NOT be appended unconditionally — otherwise the picker can return θ=0.5
    # outside the configured bounds, contradicting the AC.
    if not (0.0 < theta_min < theta_max < 1.0):
        raise ValueError(
            f"pick_threshold_on_val: invalid bounds — "
            f"requires 0 < theta_min ({theta_min}) < theta_max ({theta_max}) < 1"
        )
    candidates = np.linspace(theta_min, theta_max, 19)
    if theta_min <= 0.5 <= theta_max:
        candidates = np.concatenate([candidates, [0.5]])
    candidates = np.unique(candidates)
    # rest unchanged
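
The candidate-grid logic above can be isolated into a small helper for unit testing. This is a minimal sketch; build_theta_candidates is a hypothetical name and the grid size n=19 simply mirrors the existing linspace call.

```python
import numpy as np


def build_theta_candidates(theta_min: float = 0.05, theta_max: float = 0.95, n: int = 19) -> np.ndarray:
    """Sorted, de-duplicated theta grid; the 0.5 anchor is added only when in bounds."""
    if not (0.0 < theta_min < theta_max < 1.0):
        raise ValueError(
            f"requires 0 < theta_min ({theta_min}) < theta_max ({theta_max}) < 1"
        )
    candidates = np.linspace(theta_min, theta_max, n)
    if theta_min <= 0.5 <= theta_max:
        candidates = np.concatenate([candidates, [0.5]])
    return np.unique(candidates)


lgb_grid = build_theta_candidates(0.30, 0.40)   # tightened LGB range
wide_grid = build_theta_candidates()            # legacy wide default
print(lgb_grid.min(), lgb_grid.max(), len(lgb_grid))
print(bool(np.isclose(wide_grid, 0.5).any()))
```

Note that np.unique compares values exactly, so a linspace point that differs from 0.5 only by floating-point rounding can survive as a near-duplicate of the anchor. That should be harmless for an arg-max over f1, but tests should not assert an exact grid length on the wide range.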

The caller (lightgbm_dag.py) resolves the LGB bounds via the ADR-90 resolver :

# src/training/harness/dags/models/lightgbm_dag.py
from commun.finetune.hyperparams import resolve

theta_min = resolve("LGB", tf, "THETA_MIN", fallback=0.30)  # legacy lgbm_config.py:117
theta_max = resolve("LGB", tf, "THETA_MAX", fallback=0.40)
lgb_theta_pick = pick_threshold_on_val(y_val, p_buy_val, theta_min=theta_min, theta_max=theta_max)

PG seeding (Console UI, NOT git per ADR-59) :

  • CVN_HPO_LGB_5M_THETA_MIN = 0.30
  • CVN_HPO_LGB_5M_THETA_MAX = 0.40
  • (and per-timeframe entries if needed)

theta_sweep.py stays generic ; only LGB callers tighten the range. CB and future models keep the wide default by not passing the kwargs.

4.2.2 Over-trade guard (committee enhancement)

Inside theta_sweep.pick_threshold_on_val(), after the best θ is computed, calculate the resulting rate_buy_val and emit a structured warning if it exceeds a threshold :

# src/training/harness/nodes/theta_sweep.py — additional logic
def pick_threshold_on_val(
    y_val,
    p_buy_val,
    *,
    theta_min=0.05,
    theta_max=0.95,
    rate_buy_warn_threshold: float = 0.20,
    rate_buy_fail_threshold: float = 0.25,
) -> ThetaPick:
    # ... pick best_t as before ...
    rate_buy_val = float((p_buy_val >= best_t).mean())  # share of val rows flagged BUY at best_t
    if rate_buy_val > rate_buy_warn_threshold:
        log_event(
            logger,
            "theta_overtrade_warning",
            theta_picked=best_t,
            rate_buy_val=rate_buy_val,
            rate_buy_warn_threshold=rate_buy_warn_threshold,
            rate_buy_fail_threshold=rate_buy_fail_threshold,
            f1_buy=best_f1,
        )
    if rate_buy_val > rate_buy_fail_threshold:
        raise RuntimeError(
            f"theta_sweep over-trade guard fired : rate_buy_val={rate_buy_val:.3f} > "
            f"fail_threshold={rate_buy_fail_threshold:.3f} (theta={best_t})"
        )
    return ThetaPick(threshold=best_t, f1_buy=best_f1, n_candidates=len(candidates))

Both thresholds are PG-mandatory (no optional / fallback=None semantics : the existing hyperparams.resolve API treats fallback=None as "raise if unset", which is incompatible with an opt-in unset state ; CR PR #941 r2 reco #3) :

  • CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD = 0.20 (committee reco #2 — initial heuristic, see also §4.2.2 justification)
  • CVN_HPO_LGB_5M_RATE_BUY_FAIL_THRESHOLD = 0.25 (committee reco #5 — 0.05 above warn, enables fail-fast on extreme over-trade)

The behaviour band is :

  • rate_buy_val ≤ 0.20 → run proceeds, no log
  • 0.20 < rate_buy_val ≤ 0.25 → run proceeds, event=theta_overtrade_warning emitted
  • rate_buy_val > 0.25 → event=theta_overtrade_warning emitted, then the run fails with RuntimeError

To restore warn-only behaviour temporarily (e.g. for a debugging window), the operator sets CVN_HPO_LGB_5M_OVERTRADE_GUARD_MODE = warn_only via Console UI (CR r3 reco #7 — replaces the previous FAIL=1.0 magic-number escape with an explicit semantic flag, aligning with ADR-90 patterns). The two accepted values :

  • warn_only — only the event=theta_overtrade_warning is emitted ; the RATE_BUY_FAIL_THRESHOLD is ignored, no RuntimeError is ever raised
  • fail (default) — full behaviour as described above ; warn at 0.20, fail at 0.25

The OVERTRADE_GUARD_MODE is a string-typed PG key, parsed by hyperparams.resolve with an enum guard :

# in pick_threshold_on_val(...)
mode = resolve("LGB", tf, "OVERTRADE_GUARD_MODE", fallback="fail")
if mode not in {"warn_only", "fail"}:
    raise ValueError(f"OVERTRADE_GUARD_MODE must be 'warn_only' or 'fail', got {mode!r}")
if rate_buy_val > rate_buy_warn_threshold:
    log_event(logger, "theta_overtrade_warning", ..., guard_mode=mode)
if mode == "fail" and rate_buy_val > rate_buy_fail_threshold:
    raise RuntimeError(...)
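
The band and mode logic above can be condensed into a pure decision function, which is easy to cover exhaustively in unit tests. This is a minimal sketch; overtrade_verdict is a hypothetical helper and returns only the final outcome (in fail mode the warning event is still emitted before the raise, as in the snippet above).

```python
def overtrade_verdict(
    rate_buy_val: float,
    warn_threshold: float = 0.20,
    fail_threshold: float = 0.25,
    mode: str = "fail",
) -> str:
    """Return 'ok', 'warn' or 'fail' for a given BUY rate on val."""
    if mode not in {"warn_only", "fail"}:
        raise ValueError(f"OVERTRADE_GUARD_MODE must be 'warn_only' or 'fail', got {mode!r}")
    if mode == "fail" and rate_buy_val > fail_threshold:
        return "fail"   # caller raises RuntimeError
    if rate_buy_val > warn_threshold:
        return "warn"   # caller emits theta_overtrade_warning and proceeds
    return "ok"         # no log, run proceeds


for rate in (0.18, 0.22, 0.4611):
    print(rate, overtrade_verdict(rate), overtrade_verdict(rate, mode="warn_only"))
```

Both comparisons are strict, so a rate exactly at 0.20 stays silent and a rate exactly at 0.25 only warns, matching the band described above.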

_PARAM_TYPES extension (in S19-hardening alongside the bool branch) :

"OVERTRADE_GUARD_MODE": str,  # enum-validated at the call site

4.2.3 Files touched (S19 main)

src/training/harness/nodes/theta_sweep.py          # add 4 kwargs + over-trade guard logic + Bug 2 (labels=[0, 1], line 59)
src/training/harness/nodes/eval_metrics.py         # Bug 2 — add labels=[0, 1] (line 69), moved from hardening per §4.1
src/training/harness/dags/models/lightgbm_dag.py   # resolve LGB θ bounds + warn threshold ; pass to pick_threshold_on_val
src/commun/finetune/hyperparams.py                 # register THETA_MIN, THETA_MAX, RATE_BUY_WARN_THRESHOLD, RATE_BUY_FAIL_THRESHOLD in _PARAM_TYPES
documentation/adr/0090-...md                       # (optional) add a note about the new param family
tests/unit/training_harness/nodes/test_theta_sweep.py  # tests for the bounds + over-trade guard + Bug 2
tests/unit/training_harness/nodes/test_eval_metrics.py # Bug 2 — mono-class split returns correct f1

PG migration : seed 4 PG keys via Console UI (per ADR-59 — Console-only, no git PR).

4.3 S19-hardening PR — 3 bug fixes

Bug 1 — class_balance.py:55-62 :

# BEFORE
if n_pos == 0:
    raise ValueError("compute_class_balance: degenerate training labels — n_pos=0. ...")

# AFTER
if n_pos == 0 or n_neg == 0:
    raise ValueError(
        f"compute_class_balance: degenerate training labels — "
        f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split (ADR-25 fail-fast)."
    )
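
A self-contained sketch of the guarded computation follows. It assumes the weight is the plain ratio n_neg / n_pos, which matches the "silent scale_pos_weight=0" symptom described for Bug 1 but is not confirmed by the dossier ; compute_scale_pos_weight is a hypothetical stand-in for the real function.

```python
from typing import Iterable


def compute_scale_pos_weight(labels: Iterable[int]) -> float:
    """Return n_neg / n_pos, failing fast when either class is absent."""
    labels = list(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError(
            f"degenerate training labels: n_pos={n_pos}, n_neg={n_neg}. "
            "Cannot train binary on a single-class split."
        )
    return n_neg / n_pos


# 100 positives and 471 negatives give the ratio 4.71.
print(compute_scale_pos_weight([1] * 100 + [0] * 471))
```

Without the n_neg check, an all-positive split would return 0 / n_pos = 0.0 and training would continue with positives weighted to zero.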

Bug 2 — theta_sweep.py:59 + eval_metrics.py:69 (ships in S19 main per §4.1 ; documented here next to the other two bugs) :

Both call sites :

# BEFORE
_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
f1_buy = float(f1[1]) if len(f1) > 1 else 0.0

# AFTER
_, _, f1, _ = precision_recall_fscore_support(
    y, y_pred, labels=[0, 1], average=None, zero_division=0,
)
f1_buy = float(f1[1])  # always safe with explicit labels

eval_metrics.evaluate_split_binary line 69 gets the same treatment for the precision/recall/f1 unpacking.
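
With average=None and no labels argument, scikit-learn derives the label set from the values present in y and y_pred, so a mono-class split yields a length-1 array and the BEFORE code falls back to 0.0 even when every BUY prediction is correct. A dependency-free reference implementation can serve as an oracle in the new unit tests. This is a sketch ; f1_for_label is a hypothetical helper that mirrors zero_division=0.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int], label: int = 1) -> float:
    """Per-class F1 computed from raw counts; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Mono-class split where every row is BUY and every prediction is BUY:
# the correct f1_buy is 1.0, whereas the BEFORE code path reports 0.0.
print(f1_for_label([1, 1, 1], [1, 1, 1]))
```

The AFTER code with labels=[0, 1] should agree with this oracle on every input, including the mono-class cases listed in the test plan.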

Bug 3 — adapters/lgb.py:42-43 :

# BEFORE
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        x = x.to_numpy()                       # ← strips column names
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    ...

# AFTER
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        expected = list(self._native.feature_name())
        actual = list(x.columns)
        if actual != expected:
            missing = set(expected) - set(actual)
            extra = set(actual) - set(expected)
            if missing or extra:
                raise ValueError(
                    f"LGBAdapter.predict_proba: feature mismatch — "
                    f"missing={sorted(missing)}, extra={sorted(extra)}"
                )
            x = x[expected]                    # reorder to training-time schema
        # leave as DataFrame — lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        return np.column_stack([1.0 - proba, proba])
    return proba

Defensive : raises if columns don't match the training schema (rather than silently .to_numpy() and predict on whatever is passed).

Committee reco #3 (expert-crypto-trader, expert-ops) — add a strict mode flag to allow a transition window. The default is strict=True (raise as above) but operators can flip to strict=False via PG (CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT=false) for an emergency graceful fallback.

Implementation note (CR PR #941 r2 reco #4) : the existing commun.finetune.hyperparams.resolve() API only handles int / float per _PARAM_TYPES ; passing "false" would crash on the float-fallback parse. S19-hardening MUST extend _PARAM_TYPES with a bool entry for FEATURE_ORDER_STRICT :

# src/commun/finetune/hyperparams.py — additional case in the parser
_PARAM_TYPES: Final = {
    # ... existing entries ...
    "FEATURE_ORDER_STRICT": bool,
}

# In the parse function — explicit bool branch
def _parse_value(raw: str, expected_type: type, key: str):
    ...
    if expected_type is bool:
        s = raw.strip().lower()
        if s in {"true", "1", "yes", "on"}:
            return True
        if s in {"false", "0", "no", "off"}:
            return False
        raise RuntimeError(f"hyperparams.resolve: cannot parse {raw!r} as bool for {key!r}")
    ...

CR r3 reco #5 + #6 — bool extension is now a HARD prerequisite. The bool entry + parse branch land in S19-hardening (alongside Bug 1 + Bug 3) — NOT delayed. The temporary os.environ.get bridge in the LGB adapter is REMOVED before S19-hardening merges ; the adapter calls hyperparams.resolve("LGB", "5M", "FEATURE_ORDER_STRICT", fallback=True) from the start. Documented in the S19-hardening PR body as a hard merge-order dependency for S19 main.

The S19-hardening PR body explicitly lists, in its description :

  • "S19 main PR (#NNN) MUST NOT merge before this PR — lightgbm_dag.py depends on the bool extension landing here"
  • "If the deadline (30 days from this PR's merge) elapses without S19 main landing, this PR is reverted to roll back the unused bool branch — scheduled in the team agenda"

The patch :

# AFTER (final v2 with strict mode)
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        expected = list(self._native.feature_name())
        actual = list(x.columns)
        if actual != expected:
            missing = set(expected) - set(actual)
            extra = set(actual) - set(expected)
            if missing or extra:
                # Hard mismatch — schema is wrong, not just reordered.
                # Raise regardless of strict (no safe reorder possible).
                raise ValueError(
                    f"LGBAdapter.predict_proba: feature mismatch — "
                    f"missing={sorted(missing)}, extra={sorted(extra)}"
                )
            # Same columns, different order — strict vs warn+reorder
            if _is_strict_mode():  # reads CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT, defaults True
                raise ValueError(
                    f"LGBAdapter.predict_proba: column order mismatch (strict mode). "
                    f"expected={expected[:5]}... actual={actual[:5]}..."
                )
            log_event(
                logger,
                "lgb_adapter_column_reorder",
                expected_first5=expected[:5],
                actual_first5=actual[:5],
            )
            x = x[expected]  # reorder to training-time schema
        # leave as DataFrame — lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        return np.column_stack([1.0 - proba, proba])
    return proba

Default strict=True keeps the ADR-25 fail-fast posture as the contract ; the strict=False PG escape hatch is a transition-window concession (mirrors ADR-90 fallback pattern). Hard mismatches (missing / extra features) ALWAYS raise — only column-order divergence on identical sets falls back to WARN+reorder.
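
The decision logic of the patch can be separated from LightGBM and pandas by working on column-name lists only, which makes the strict / non-strict matrix cheap to test. This is a minimal sketch ; plan_column_alignment is a hypothetical helper, and the duplicate-name check is an addition that the dossier's patch does not contain.

```python
from typing import List, Optional


def plan_column_alignment(
    expected: List[str], actual: List[str], strict: bool = True
) -> Optional[List[str]]:
    """Return None if already aligned, else the column order to select.

    Raises ValueError on a hard mismatch, on duplicate names, or (in strict
    mode) on a pure ordering difference.
    """
    if actual == expected:
        return None
    if len(set(actual)) != len(actual):
        # Extra check (assumption): set arithmetic below would hide duplicates.
        raise ValueError("duplicate column names in input")
    missing = set(expected) - set(actual)
    extra = set(actual) - set(expected)
    if missing or extra:
        raise ValueError(
            f"feature mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    if strict:
        raise ValueError("column order mismatch (strict mode)")
    return list(expected)  # caller logs the reorder event, then selects x[expected]


print(plan_column_alignment(["rsi", "atr"], ["atr", "rsi"], strict=False))
```

In the adapter, a non-None return value would be used as `x = x[order]` after logging lgb_adapter_column_reorder.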

Files touched (S19-hardening ; the Bug 2 files moved to S19 main per §4.1) :

src/training/harness/nodes/class_balance.py        # Bug 1 — extend n_pos guard with n_neg
src/training/harness/adapters/lgb.py               # Bug 3 — preserve / reorder columns + raise on mismatch
src/commun/finetune/hyperparams.py                 # bool branch + FEATURE_ORDER_STRICT entry in _PARAM_TYPES
tests/unit/training_harness/nodes/test_class_balance.py    # n_neg=0 raises
tests/unit/training_harness/adapters/test_lgb_adapter.py   # column-order preservation + mismatch raise

5. Implementation plan

Phase 1 — Plan dossier + committee review (this PR)

  • Plan dossier finalised, committee plan_review PASSED.
  • Story transitions New → Specified.

Phase 2 — S19-hardening PR (defensive first, lower risk)

  1. Branch fix/CVN-N001-EE-S19-hardening
  2. 3 defensive fixes (Bug 1, Bug 3, bool extension to hyperparams.py ; 1 per file) + unit tests for each
  3. CR cycle (4-5 rounds per memory rule)
  4. Committee pr_review per ADR-68 (touches src/training/harness/)
  5. Merge

Phase 3 — S19 main PR (H8 behavioural fix)

  1. Branch feat/CVN-N001-EE-S19-theta-range-and-overtrade-guard
  2. ADR-90 param registration (THETA_MIN, THETA_MAX, RATE_BUY_WARN_THRESHOLD, RATE_BUY_FAIL_THRESHOLD)
  3. theta_sweep.py API extension (4 new kwargs)
  4. lightgbm_dag.py wiring (resolve bounds + warn threshold + pass through)
  5. Unit tests for bounds + guard
  6. CR cycle
  7. Committee pr_review
  8. Merge — but gate by cross-fold validation success before pushing to prod

Phase 4 — Cross-fold validation (operator-driven)

Run diagnostic__s18_step1_4_chain with the SAME parameters on 4 cells :

  • AAVEUSDC fold=3 (the canary)
  • OPUSDC fold=3
  • LDOUSDC fold=4
  • ETHUSDC fold=3 (added per committee reco #1 — control cell with healthier baseline, ADR-14 alignment)

For each cell, capture BEFORE/AFTER metrics :

| Metric | Source event | Acceptance threshold |
|---|---|---|
| theta_picked | event=training_complete | 0.30 ≤ θ ≤ 0.40 (the new bound) |
| rate_buy_val | event=training_complete | ≤ 0.20 (down from 0.46) |
| raw_buy_signals | event=signal_funnel | down at least 50 % vs BEFORE |
| final_trades | event=signal_funnel | down — exact target depends on cell |
| sortino | event=weighted_variant_evaluated | > 0 (up from -9.5) |
| return | event=weighted_variant_evaluated | > -50% (loss-bounded) |
| f1_buy_val | event=training_complete | within 5 % of post-S17 reference (no model-quality regression) |
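
The acceptance thresholds above can be encoded as a small checker so that each cell's verdict is computed the same way. This is a sketch under two assumptions : "within 5 %" is read as a relative lower bound (f1 ≥ 0.95 × reference, per Goal 1), and final_trades is left out because it has no numeric target. CellMetrics and failed_checks are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

F1_REFERENCE = 0.3520485  # post-S17 canonical reference (Goal 1)


@dataclass(frozen=True)
class CellMetrics:
    theta_picked: float
    rate_buy_val: float
    raw_buy_signals: int
    sortino: float
    return_pct: float
    f1_buy_val: float


def failed_checks(before: CellMetrics, after: CellMetrics) -> List[str]:
    """Names of the acceptance checks that the AFTER run does not meet."""
    checks = {
        "theta_picked": 0.30 <= after.theta_picked <= 0.40,
        "rate_buy_val": after.rate_buy_val <= 0.20,
        "raw_buy_signals": after.raw_buy_signals <= 0.5 * before.raw_buy_signals,
        "sortino": after.sortino > 0,
        "return": after.return_pct > -50.0,
        "f1_buy_val": after.f1_buy_val >= 0.95 * F1_REFERENCE,
    }
    return [name for name, ok in checks.items() if not ok]


# BEFORE values taken from the Phase A logs for AAVEUSDC fold=3.
before = CellMetrics(0.2, 0.4611, 1210, -9.512, -91.35, 0.352)
print(failed_checks(before, before))  # the unfixed run fails every check except f1
```

A cell passes when failed_checks returns an empty list for its AFTER metrics.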

PG seeding for the new ADR-90 keys (Console UI) BEFORE the validation runs.

Committee reco #5 (expert-ops) : RATE_BUY_FAIL_THRESHOLD is set to a default of 0.25 in PG seeding (0.05 above the warn threshold), enabling fail-fast on extreme over-trade cases instead of leaving it unset (silent-degradation risk). Runs in the (0.20, 0.25] band only emit the warning ; above 0.25 the run fails.

Committee reco #6 (expert-architect, expert-ops) : a pre-deploy CI check verifies that the new ADR-90 keys, including OVERTRADE_GUARD_MODE and FEATURE_ORDER_STRICT (6 keys total, listed below ; CR r3 reco #8), are seeded in PG before any deploy that includes the Phase 3 PR. Implementation : a new gate in the deploy CI that calls hyperparams.resolve("LGB", "5M", "THETA_MIN") etc. without the fallback ; failure to resolve = build fails with an actionable message. Keys checked :

  • CVN_HPO_LGB_5M_THETA_MIN
  • CVN_HPO_LGB_5M_THETA_MAX
  • CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD
  • CVN_HPO_LGB_5M_RATE_BUY_FAIL_THRESHOLD
  • CVN_HPO_LGB_5M_OVERTRADE_GUARD_MODE
  • CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT
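
The gate could look roughly like the sketch below. It checks a plain dict snapshot of seeded keys instead of calling the real hyperparams.resolve, whose exact behaviour is not shown in this dossier ; pg_key and missing_pg_keys are hypothetical names, and the key pattern follows the CVN_HPO_<MODEL>_<TF>_<PARAM> scheme.

```python
from typing import Dict, List

REQUIRED_PARAMS = (
    "THETA_MIN",
    "THETA_MAX",
    "RATE_BUY_WARN_THRESHOLD",
    "RATE_BUY_FAIL_THRESHOLD",
    "OVERTRADE_GUARD_MODE",
    "FEATURE_ORDER_STRICT",
)


def pg_key(model: str, timeframe: str, param: str) -> str:
    """Build a key following the CVN_HPO_<MODEL>_<TF>_<PARAM> scheme."""
    return f"CVN_HPO_{model}_{timeframe}_{param}"


def missing_pg_keys(seeded: Dict[str, str], model: str = "LGB", timeframe: str = "5M") -> List[str]:
    """Required keys that are absent or blank in the seeded snapshot."""
    required = (pg_key(model, timeframe, p) for p in REQUIRED_PARAMS)
    return [key for key in required if not seeded.get(key, "").strip()]


snapshot = {
    "CVN_HPO_LGB_5M_THETA_MIN": "0.30",
    "CVN_HPO_LGB_5M_THETA_MAX": "0.40",
}
print(missing_pg_keys(snapshot))  # the four keys still to be seeded
```

A CI step would exit non-zero and print the returned list whenever it is non-empty.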

Loki query {namespace="cvntrade"} |~ "event=hpo_fallback_applied" is monitored post-deploy ; any hit on these 6 keys triggers an oncall page.

Committee reco #11 (expert-ops) — pre-deploy LGB strict-mode compatibility check : before flipping FEATURE_ORDER_STRICT=true in prod, run a dry-run backtest with strict=false for 1 hour against live FE-pipeline output ; verify that event=lgb_adapter_column_reorder does NOT fire (= no upstream caller silently reorders columns). If any reorder event is logged, fix the upstream caller before flipping strict=true. Documented as a Phase 4 prerequisite step.

Committee reco #7 (expert-architect) : Phase 4 cross-fold validation includes a rollback test — manually trigger an over-trade fail (set RATE_BUY_FAIL_THRESHOLD=0.10 so the guard fires), verify the run errors as expected, then re-set to 0.25 via Console UI in seconds. Documents the rollback latency.

Committee reco #8 (expert-architect, expert-ml-engineer) : the Bug 3 raise behaviour (LGB column-mismatch) is documented in a new ADR (e.g. ADR-91 — to be drafted in a follow-up PR after S19-hardening merges) formalising the column-order contract between the harness and the backtest engine. Until the ADR lands, the contract is documented in the S19-hardening PR body + in the LGBAdapter.predict_proba docstring.

Phase 5 — Post-validation deploy + Story closure

  • If all 4 cells PASS the acceptance thresholds → push to prod via the standard deploy CI path
  • Trigger an FTF mini-sweep (1 crypto, 1 fold) post-deploy ; verify event=theta_overtrade_warning does NOT fire spuriously on a fresh prod run
  • OP wp#165 transitions In testing → Tested → Closed per ADR-81
  • Closure note on wp#165 with PR SHA + Loki snapshot of the 4 cell verdicts

6. Risk analysis

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [0.30, 0.40] range too narrow → LGB f1 degrades > 5 % on some folds | Medium | Medium (model-quality regression) | Cross-fold validation pre-merge ; if any cell fails the f1 threshold, widen the range to [0.25, 0.45] and re-validate |
| Over-trade guard fires spuriously on healthy folds (false positive) | Low | Low (warning only in the (0.20, 0.25] band ; the run fails only above 0.25) | Default mode is fail (§4.2.2) ; operator can flip to warn_only via OVERTRADE_GUARD_MODE in PG ; post-deploy mini-sweep verifies no spurious fires |
| Bug 3 fix raises on a healthy backtest call where columns happen to be reordered (false alarm) | Medium | Medium (backtest fails instead of silently mispredicting) | Defensive raise IS the contract per ADR-25 ; the alternative (silent mispredict) is the actual bug. If the backtest engine reorders columns, the engine MUST be fixed to pass them in training order. Document this in PR body. |
| The 2 PRs land in the wrong order (main before hardening) → main is harder to revert | Low | Low | Merge strategy doc'd in PR body of S19 main : "DO NOT MERGE before S19-hardening (#NNN)" |
| PG seeding forgotten before deploy → fallback to legacy 0.05-0.95 range fires (the bug we're fixing) | Medium | High (regression re-emerges silently) | lightgbm_dag.py uses resolve(..., fallback=0.30) and fallback=0.40 so the fix IS the default ; event=hpo_fallback_applied Loki-queryable to detect any unintended fallback path |
| Cross-fold validation reveals H8 isn't the dominant cause on OPUSDC/LDOUSDC | Medium | Medium | Treat as new finding ; halt merge ; reopen S18 (or open S20) to investigate per-crypto variance |
| Bug 3 fix exposes a hidden upstream bug (FE pipeline reorders columns sometimes) | Medium | High | The exposure IS the goal — fail-fast surfaces it. Fix the upstream caller separately if it manifests. |

7. Rollback procedure

Per phase :

  • Phase 2 (hardening) PR not merged : close the PR without merging.
  • Phase 2 merged but Bug 3 starts raising in backtest : revert PR via git revert <SHA> + emergency PR ; investigate the upstream caller that passes mis-ordered columns.
  • Phase 3 (main) PR not merged : close the PR without merging.
  • Phase 3 merged but cross-fold validation reveals problem : revert PR + rollback PG seeds via Console (set THETA_MIN/MAX back to 0.05/0.95 or unset).
  • Cutover after Phase 4 validation passes but issue surfaces in prod : git revert + helm rollback if needed ; PG seeds are Console-rollback in seconds.

RTO : git revert + CI build + helm upgrade ≈ 15 min.

8. Definition of done

Hard criteria for closing the Story (all must pass) :

  • S19-hardening PR merged on main (Bug 1 + Bug 3 + bool extension to hyperparams.py, with unit tests)
  • S19 main PR merged on main (θ-sweep range + over-trade guard + ADR-90 registrations + unit tests)
  • PG seeds for the 6 keys checked in §5 Phase 4 (CVN_HPO_LGB_5M_THETA_MIN/MAX, RATE_BUY_WARN_THRESHOLD, RATE_BUY_FAIL_THRESHOLD, OVERTRADE_GUARD_MODE, FEATURE_ORDER_STRICT) set via Console UI
  • Cross-fold validation passed on 4 cells (AAVEUSDC fold=3 + OPUSDC fold=3 + LDOUSDC fold=4 + ETHUSDC fold=3 control) per §5 Phase 4 thresholds
  • Post-deploy FTF mini-sweep emits no spurious event=theta_overtrade_warning
  • Loki query {namespace="cvntrade"} |~ "event=theta_overtrade_warning" returns the validation runs only (audit trail)
  • OP wp#165 transitions In testing → Tested → Closed per ADR-81 with closure note + PR SHAs

9. ADR alignment

  • ADR-25 (no silent fallback) — Bug 1 fix (fail-fast on degenerate splits) + over-trade guard fail-mode + Bug 3 raise on column mismatch
  • ADR-31/32/33 (structured logging) — event=theta_overtrade_warning follows key=value format
  • ADR-59 (Console-only PG params) — new THETA_MIN/MAX + warn/fail thresholds via Console, NOT in git
  • ADR-68 (Expert Committee) — pr_review mandatory per PR (touches src/training/harness/)
  • ADR-77 (MkDocs SSoT) — this dossier under documentation/reviews/
  • ADR-89 (training harness as plugin registry) — preserved ; per-model θ range fits the existing plugin model
  • ADR-90 (training hyperparams in PG) — extends the existing CVN_HPO_<MODEL>_<TF>_<PARAM> scheme to a new param family (THETA_*)

10. Out-of-scope follow-ups (filed separately)

  • CVN-N001-EE-S20 : XGB-specific f1=0.089 fix (parent dossier §3.1 cited the canary number ; the XGB regression is structurally different — fixed θ=0.5, no θ-sweep — and likely a different mechanism than H8) — to be opened post-S19 close
  • CVN-N001-EE-S21 (committee r2 reco #9 — confirmed) : data-driven calibration spike for the over-trade thresholds + θ-sweep bounds per crypto, using the 4 cross-fold validation cells as the starting dataset. Replaces the current "initial heuristic" basis for 0.20 / 0.25 with empirical per-crypto bounds. Out-of-scope for S19 (heuristic is good enough for the H8 unblock).
  • Pre-#891 baseline empirical re-establishment : run train_with_fixed_params_lgbm with scale_pos_weight=1.0 + θ=0.4 on the captured fold to anchor the f1≈0.42 reference cited in S18 parent dossier §3 (S18 Step 5 §7 question #3) — could be a 1-day spike Story.
  • Live inference / execution kill switch (CR r3 reco #3) — already in scope under ADR-71 + Epic CVN-N001-EG (kill-switch implementation Story tracked separately ; design dossier documentation/design/CVN-N001-EF-S02-kill-switch-design.md). S19 does NOT implement the kill switch ; it relies on the pre-existing kill-switch contract for emergency halt independent of code rollback.
  • CVN-N001-EE-S22 (NEW, CR r3 reco #4) : continuous data / label / concept drift detection for production LGB models — metrics (KS test on feature distribution, label-prior drift, prediction-distribution drift), thresholds, alerting, runbooks. Currently no automated detection ; relies on FTF sweep cadence + operator manual inspection. Major scope ; out-of-scope for S19.
  • ADR-91 : formalise the LGB column-order contract between harness adapter and backtest engine (S18 Step 5 reco #8) — already filed as a follow-up after S19-hardening merges.

11. Test plan

Unit (S19-hardening)

tests/unit/training_harness/nodes/test_class_balance.py
  - test_compute_class_balance_raises_on_n_neg_zero  # NEW
  - test_compute_class_balance_raises_on_n_pos_zero  # existing — keep as regression guard

tests/unit/training_harness/nodes/test_theta_sweep.py
  - test_pick_threshold_returns_correct_f1_when_y_all_positive  # NEW (Bug 2 regression)
  - test_pick_threshold_returns_correct_f1_when_y_pred_all_positive  # NEW
  - test_pick_threshold_default_bounds_unchanged  # regression guard

tests/unit/training_harness/nodes/test_eval_metrics.py
  - test_evaluate_split_binary_returns_correct_f1_mono_class  # NEW

tests/unit/training_harness/adapters/test_lgb_adapter.py
  - test_predict_proba_strict_true_raises_on_column_order_mismatch    # NEW (CR PR #941 r2 reco split)
  - test_predict_proba_strict_false_reorders_and_logs                 # NEW (CR PR #941 r2 reco split)
  - test_predict_proba_raises_on_missing_feature_regardless_of_strict # NEW (hard mismatch always raises)
  - test_predict_proba_raises_on_extra_feature_regardless_of_strict   # NEW (hard mismatch always raises)
  - test_predict_proba_accepts_ndarray_unchanged                      # regression guard

Unit (S19 main)

tests/unit/training_harness/nodes/test_theta_sweep.py
  - test_pick_threshold_respects_theta_min_kwarg                       # NEW
  - test_pick_threshold_respects_theta_max_kwarg                       # NEW
  - test_pick_threshold_excludes_anchor_05_when_outside_bounds         # NEW (CR PR #941 r2 reco #1 — bounds violation regression)
  - test_pick_threshold_includes_anchor_05_when_inside_bounds          # NEW (positive case for the same logic)
  - test_pick_threshold_raises_on_invalid_bounds                       # NEW (theta_min >= theta_max OR bounds outside (0, 1))
  - test_pick_threshold_emits_overtrade_warning_above_warn_threshold   # NEW
  - test_pick_threshold_raises_on_overtrade_above_fail_threshold       # NEW (mandatory fail threshold per CR r2 reco #2/3)
  - test_pick_threshold_no_warning_below_warn_threshold                # NEW
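
The bounds and anchor cases above can be pictured with the sketch below. It is an illustration under stated assumptions, not the theta_sweep.py implementation : the 0.05 grid step, the helper names f1_binary and candidate_thetas, and the (theta, f1) return shape are all hypothetical. The default bounds [0.05, 0.95] mirror the sweep range quoted in §1.

```python
from typing import List, Sequence, Tuple


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class; 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def candidate_thetas(theta_min: float, theta_max: float, step: float = 0.05) -> List[float]:
    """Grid inside [theta_min, theta_max]; 0.5 is added only if in bounds."""
    if not (0.0 < theta_min < theta_max < 1.0):
        raise ValueError(f"invalid theta bounds: ({theta_min}, {theta_max})")
    n_steps = int(round((theta_max - theta_min) / step))
    grid = {round(theta_min + i * step, 4) for i in range(n_steps + 1)}
    grid = {t for t in grid if theta_min <= t <= theta_max}
    if theta_min <= 0.5 <= theta_max:
        grid.add(0.5)  # anchor kept only when it respects the bounds
    return sorted(grid)


def pick_threshold(
    y_true: Sequence[int],
    proba: Sequence[float],
    theta_min: float = 0.05,
    theta_max: float = 0.95,
) -> Tuple[float, float]:
    """Return (theta, f1) maximising F1 over the bounded grid."""
    best_theta, best_f1 = theta_min, -1.0
    for theta in candidate_thetas(theta_min, theta_max):
        y_pred = [1 if p >= theta else 0 for p in proba]
        score = f1_binary(y_true, y_pred)
        if score > best_f1:
            best_theta, best_f1 = theta, score
    return best_theta, best_f1
```

With theta_min=0.30 and theta_max=0.40 the grid is [0.3, 0.35, 0.4], so the 0.5 anchor can no longer leak a pick outside the configured range.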

tests/unit/finetune/test_hyperparams_bool_extension.py
  - test_resolve_parses_true_variants_when_param_type_bool             # NEW (CR PR #941 r2 reco #4 — bool extension)
  - test_resolve_parses_false_variants_when_param_type_bool            # NEW
  - test_resolve_raises_on_unparseable_bool_literal                    # NEW
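
For the bool extension, the three tests above suggest a parser along these lines. The function name parse_bool_literal and the exact sets of accepted literals are assumptions ; the dossier only states that true variants, false variants and unparseable literals are each covered.

```python
# Accepted spellings are an assumption made for this sketch.
_TRUE_LITERALS = frozenset({"true", "1", "yes", "on"})
_FALSE_LITERALS = frozenset({"false", "0", "no", "off"})


def parse_bool_literal(raw: str) -> bool:
    """Parse a stored string value into a bool, raising on anything else."""
    token = raw.strip().lower()
    if token in _TRUE_LITERALS:
        return True
    if token in _FALSE_LITERALS:
        return False
    raise ValueError(f"unparseable bool literal: {raw!r}")
```

Raising on unknown literals, rather than defaulting to False, avoids a typo in a seeded value silently disabling a flag such as FEATURE_ORDER_STRICT.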

Unit (S19 main — CR r3 additions)

tests/unit/training_harness/nodes/test_theta_sweep.py (additions to v3 plan)
  - test_pick_threshold_warn_only_mode_does_not_raise_above_fail_threshold  # NEW (CR r3 reco #10 — explicit mode flag coverage)
  - test_pick_threshold_fail_mode_raises_above_fail_threshold               # NEW (positive complement)
  - test_pick_threshold_invalid_overtrade_guard_mode_raises                 # NEW (enum guard regression)
  - test_pick_threshold_includes_eval_metrics_labels_zero_one               # NEW (CR r2 Bug 2 — moved from hardening to main per r3 reco #1)

tests/unit/training_harness/nodes/test_eval_metrics.py (moved from hardening)
  - test_evaluate_split_binary_returns_correct_f1_mono_class                # NEW (Bug 2 — moved here from hardening)

Downstream integration (CR r3 reco #2 — explicit test cases)

tests/integration/training_harness/test_lgb_downstream_integration.py
  - test_autonomous_trainer_consumes_lgb_artifact_with_new_theta_range
    # Run autonomous_orchestrator.train_one_crypto for AAVEUSDC fold=3 ;
    # assert TrainedArtifact.threshold_buy ∈ [0.30, 0.40] ; assert
    # autonomous_trained log event shows the new theta_picked field
  - test_regime_trainer_propagates_overtrade_guard_event
    # Run weighted_variant_trained on a synthetic over-trade case ;
    # assert event=theta_overtrade_warning bubbles up the regime trainer
    # log chain ; assert Loki-queryable
  - test_walk_forward_predictor_uses_picked_theta_from_artifact
    # Load a TrainedArtifact with threshold_buy=0.35 ; run
    # WalkForwardPredictor.predict on a synthetic feature window ;
    # assert the predictor gates buy signals at θ=0.35 (NOT the legacy 0.40)

These 3 tests gate the S19 main PR merge. They run as pytest -m integration and are added to the medium tier (per CLAUDE.md Pytest markers).
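
As an illustration of the third case, the gating assertion could reduce to something like the sketch below. The dataclass is a hypothetical stand-in exposing only threshold_buy, and gate_buy_signals with its ">=" comparison is assumed ; the real TrainedArtifact and WalkForwardPredictor.predict are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TrainedArtifact:
    """Hypothetical minimal stand-in: only the picked threshold."""
    threshold_buy: float


def gate_buy_signals(artifact: TrainedArtifact, probas: Sequence[float]) -> List[bool]:
    """Emit a buy signal wherever the probability reaches the picked theta."""
    return [p >= artifact.threshold_buy for p in probas]


# A probability of 0.36 passes the artifact's 0.35 gate but would be
# rejected by a hardcoded legacy gate at 0.40.
picked = gate_buy_signals(TrainedArtifact(threshold_buy=0.35), [0.34, 0.36, 0.41])
legacy = gate_buy_signals(TrainedArtifact(threshold_buy=0.40), [0.34, 0.36, 0.41])
```

A probability placed between the two thresholds is what lets the test distinguish "uses the artifact" from "falls back to the legacy constant".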

Integration (cross-fold validation, manual via DAG)

3 runs of diagnostic__s18_step1_4_chain :

  • crypto=AAVEUSDC, fold_id=3
  • crypto=OPUSDC, fold_id=3
  • crypto=LDOUSDC, fold_id=4

Per cell, compare BEFORE/AFTER on the 7 metrics in §5 Phase 4.

Smoke (post-deploy)

  • 1-crypto FTF mini-sweep ; verify Loki shows the new theta_overtrade_warning event registry but no actual warnings on a healthy run
  • Grafana FTF dashboard panels still render correctly (no LogQL parse errors with the new events)

Regression (post-merge for both PRs)

  • Existing parity tests in tests/unit/training_harness/parity/test_lgb_harness_vs_legacy.py still PASS (the bug fixes are defensive, no behavioural change on currently-tested healthy paths)
  • Existing tests/unit/training_harness/test_phase4_lgb_cutover.py still PASS

12. Committee questions (for plan_review)

  1. Per-model θ range via ADR-90 keys vs hardcoded constants : the dossier proposes the ADR-90 path (CVN_HPO_LGB_5M_THETA_MIN/MAX) — is this the right level of indirection, or should the bounds live in code with the hardcoded LGB legacy values for clarity ? ADR-90 path adds Console UI complexity but unifies the pattern.
  2. Over-trade guard threshold : 0.20 is the operator's recommendation from S18 Step 5 §9. Should the default warn threshold come from the legacy LightGBMConfig.threshold_buy=0.4 × n_pos/n_train ≈ 0.175 ≈ 0.20, or be data-driven per crypto ?
  3. PR merge order : hardening-first (defensive) then main (behavioural). Any objection ?
  4. Bug 3 raise behaviour : the proposed fix RAISES on column mismatch. Should it be a WARN+reorder instead (graceful fix) ? The committee opinion matters because raising could break currently-running backtests if they happen to reorder columns silently.
  5. Cross-fold validation gating : 3 cells before merge. Is this enough or should we add e.g. ETHUSDC fold=3 as a "control" with a healthier baseline ?
  6. Verdict : PASS / PASS_WITH_REVISIONS / REJECTED with explicit AC for the next concrete step (S19 implementation kickoff).
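
For question 2, the 0.175 figure can be cross-checked against the scale_pos_weight=4.71 reported in §1. The sketch assumes that scale_pos_weight is computed as n_neg / n_pos on the same training split, which this dossier does not state explicitly.

```python
# Values quoted in the dossier (Phase A logs, §1 and question 2).
scale_pos_weight = 4.71   # assumed to equal n_neg / n_pos
rate_buy_val = 0.4611     # observed BUY rate on val at theta=0.2
warn_threshold = 0.20     # proposed warn threshold

# If scale_pos_weight = n_neg / n_pos, then
# n_pos / n_train = n_pos / (n_pos + n_neg) = 1 / (1 + scale_pos_weight).
pos_fraction = 1.0 / (1.0 + scale_pos_weight)

# How far the observed BUY rate sits above the base rate and the threshold.
overshoot_vs_base = rate_buy_val / pos_fraction
overshoot_vs_warn = rate_buy_val / warn_threshold

print(f"n_pos/n_train ~= {pos_fraction:.3f}")
print(f"rate_buy_val is {overshoot_vs_base:.2f}x the base rate")
print(f"rate_buy_val is {overshoot_vs_warn:.2f}x the warn threshold")
```

Under that assumption the positive base rate comes out at about 0.175, so a 0.20 warn threshold sits just above it, while the observed 0.4611 BUY rate is more than twice either value.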

13. Committee verdict (plan_review) — 2026-05-14

Status : PASSED_WITH_REVISIONS / OK / strong consensus / 0 blockers (session 1f4335a2, OP Meeting #135, 5 experts).

13.1 Areas of agreement (5)

  • 2-PR split (hardening first, main second) excellent for risk management + clean revert envelope
  • Strong ADR-25 alignment via Bug 1 fail-fast + Bug 3 raise on column mismatch
  • ADR-90 extension for THETA_MIN/MAX + RATE_BUY_*_THRESHOLD consistent with existing pattern
  • Over-trade guard with structured logging is a crucial operational control plane
  • Comprehensive risk analysis + rollback procedure

13.2 Areas of dissent (4)

| Topic | Pro / Against | Resolution |
|---|---|---|
| rate_buy_val=0.20 warn threshold lacks data-driven justification | 3 / 2 (data-scientist + ops) | reco #2 : document as initial heuristic, schedule follow-up calibration spike |
| Bug 3 raise vs graceful reorder | 4 / 1 (ops) | reco #3 : strict=True default + opt-in strict=False via PG |
| Cross-fold size of 3 cells (ADR-14 robustness) | 3 / 2 (data-scientist + ops) | reco #1 : add ETHUSDC fold=3 as control (4 cells total) |
| Downstream system integration not fully validated | 3 / 1 (data-scientist) | reco #4 : add explicit downstream integration tests in Phase 4 |

13.3 8 recommendations integrated

| # | Reco | Section updated |
|---|---|---|
| 1 | Add ETHUSDC fold=3 to cross-fold validation set | §5 Phase 4 |
| 2 | Document rate_buy_val=0.20 as initial heuristic (justification : n_pos / n_train ≈ 0.175 ≈ 0.20, follow-up data-driven calibration tracked separately) | §4.2.2 + §10 |
| 3 | Bug 3 : add strict=False PG flag for graceful WARN+reorder transition mode (default strict=True) | §4.3 (Bug 3 patch updated) |
| 4 | Validate downstream systems (autonomous trainer, regime trainer, walk-forward predictor) : add integration tests in Phase 4 | §5 Phase 4 + §11 |
| 5 | Set RATE_BUY_FAIL_THRESHOLD=0.25 default in PG seeding (5 percentage points above warn) | §4.2.2 / §5 Phase 4 |
| 6 | Pre-deploy CI check for PG seeding + monitor event=hpo_fallback_applied in Loki | §5 Phase 4 + §6 risk row |
| 7 | Add rollback test in Phase 4 (synthetic fail trigger) | §5 Phase 4 |
| 8 | Document Bug 3 raise behaviour in a new ADR-91 (follow-up after S19-hardening merges) | §10 (out-of-scope follow-up) |

13.4 Decision

The committee unanimously approves the 2-PR split and the H8 fix posture. No blockers. The 8 revisions integrated above tighten validation scope (1 control cell + downstream integration tests + rollback test) and operational ergonomics (graceful reorder mode + default fail threshold + pre-deploy gate).

Story transition : OP wp#165 → New → Specified after this dossier merges to main.

13.5 Round 2 verdict (2026-05-14, session aa76ed46, OP Meeting #136)

Status : PASSED_WITH_REVISIONS / EXECUTION_RISK / strong consensus / 1 BLOCKER + 10 recommendations.

Round 2 was triggered by 5 user-surfaced corrections to v3 that have all been addressed in v4 (the current revision of this dossier). The committee then ran a fresh review against v3 and surfaced :

13.5.1 BLOCKER (resolved in v4)

theta_sweep.py modified by both PRs (Bug 2 in hardening + θ range in main) → contradicts "independently mergeable" claim, would produce a merge conflict regardless of merge order. Resolution : Bug 2 (both theta_sweep.py:59 AND eval_metrics.py:69) MOVED to S19 main. S19-hardening becomes strictly orthogonal on the file surface (3 files : class_balance.py for Bug 1, adapters/lgb.py for Bug 3, hyperparams.py for the bool extension). See §4.1 updated table for the post-r3 file split.

13.5.2 11 recommendations integrated in v4

| # | Recommendation | Section updated |
|---|---|---|
| 1 | Resolve theta_sweep.py PR conflict (BLOCKER) : Bug 2 moved to main | §4.1 (table) + §11 (test files moved) |
| 2 | Detail downstream integration tests (autonomous trainer + regime trainer + walk-forward predictor) : explicit test cases | §11 new "Downstream integration" subsection (3 explicit tests) |
| 3 | Live inference / execution kill switch : REFERENCE existing ADR-71 + Epic CVN-N001-EG, NOT in S19 scope | §10 (out-of-scope reference) |
| 4 | Continuous data / label / concept drift detection : NEW Story CVN-N001-EE-S22 filed | §10 (out-of-scope) |
| 5 | bool extension MUST land in S19-hardening : os.environ.get bridge REMOVED before merge | §4.3 (hard prerequisite) |
| 6 | Document the bool-parsing bridge as temporary with hard 30-day deadline + revert plan | §4.3 (PR body lock) |
| 7 | Replace RATE_BUY_FAIL_THRESHOLD=1.0 magic-number escape with explicit OVERTRADE_GUARD_MODE = warn_only \| fail flag | §4.2.2 (new explicit flag + enum guard) |
| 8 | Pre-deploy CI check extended to verify FEATURE_ORDER_STRICT is seeded (5 keys → 6 keys total) | §5 Phase 4 (key list updated) |
| 9 | Schedule CVN-N001-EE-S21 calibration spike (data-driven thresholds) : confirmed | §10 (out-of-scope) |
| 10 | Unit test for warn-only mode (verifies FAIL path is unreachable in warn_only) | §11 (3 tests added : warn_only / fail / invalid mode) |
| 11 | Pre-deploy LGB strict-mode dry-run (strict=False 1h-backtest, verify no lgb_adapter_column_reorder event) | §5 Phase 4 (added prerequisite step) |

13.5.3 Areas of dissent (4)

| Topic | Resolution |
|---|---|
| theta_sweep.py cross-PR modification | reco #1 : file moved (BLOCKER resolved) |
| RATE_BUY_FAIL_THRESHOLD=1.0 magic-number escape ergonomics | reco #7 : explicit OVERTRADE_GUARD_MODE flag |
| Downstream integration test specificity | reco #2 : 3 explicit test cases added in §11 |
| Live inference kill switch + drift detection absence | recos #3 + #4 : referenced ADR-71 / filed S22 |

13.5.4 Round 2 decision

The committee unanimously approves the v4 plan once the 11 recommendations are integrated (now done). The EXECUTION_RISK code reflected the v3 BLOCKER + the operational gaps ; both addressed in v4. The dossier is READY FOR IMPLEMENTATION KICKOFF post-merge of PR #941.

Story transition (post-merge) : OP wp#165 → New → Specified.

14. References