Plan dossier - CVN-N001-EE-S19 : Harness over-trade fix
Story : CVN-N001-EE-S19 (OP wp#165, GH #940)
Parent epic : CVN-N001-EE — F1_buy boost (10-track plan)
Predecessor : CVN-N001-EE-S18 (closed 2026-05-14, OP wp#154)
Author : Operator + Claude
Date : 2026-05-14
Status : v2 — committee plan_review PASSED_WITH_REVISIONS 2026-05-14 (session 1f4335a2, OP Meeting #135, strong consensus, 0 blockers, 8 recommendations integrated below — see §13)
1. Context
S18 (closed 2026-05-14) ran a 4-step diagnostic chain on the AAVEUSDC fold=3 cell to localise the post-#891 harness regression. Steps 0-4 ruled out hypotheses H1-H7 of the parent dossier §5.1 :
| Step | Verdict | What it ruled out |
|---|---|---|
| Step 0 | PASS — replay reproduces canonical f1=0.3520 | reproducibility OK |
| Step 1 | (capture, no verdict) | full Optuna trial trace serialized to parquet |
| Step 2 | (analysis dossier) | side-by-side legacy vs harness code diff |
| Step 3 | REFUTED on H2 (valid_sets composition) | both [train,val] and [val] produce best_iter=1 |
| Step 4 | NO_DIVERGENCE on F1-F6 | data is clean (label, features, drift, iter-1 probe all PASS) |
Phase A logs (chained DAG run 2026-05-14 14:44) revealed the actual mechanism :
```
event=training_complete model_type=lightgbm best_iteration=1 training_time_sec=2.465
theta_picked=0.2 f1_buy_val=0.352 auc_buy_val=0.6461 rate_buy_val=0.4611
event=signal_funnel raw_buy_signals=1210 final_trades=251 primary_killer=concurrency
event=weighted_variant_evaluated sortino=-9.512 n_trades=251 return=-91.35%
```
The model is NOT broken at the booster level (AUC 0.65, calibration acceptable) ; it is broken at the post-training θ selection. scale_pos_weight=4.71 (auto-injected by class_balance.py) inflates positive-class probabilities, the harness θ-sweep over [0.05, 0.95] picks θ=0.2 because that is where f1_val peaks on the inflated distribution, and the model emits BUY 46 % of the time on val, which produces a catastrophic backtest.
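To put the 46 % BUY rate in perspective, the positive base rate implied by the injected weight can be worked out directly. The sketch below assumes that class_balance.py derives scale_pos_weight as n_neg / n_pos (the usual convention for binary boosting, not confirmed by this dossier) ; the helper name is hypothetical.

```python
def implied_positive_rate(scale_pos_weight: float) -> float:
    """Positive-class share implied by scale_pos_weight.

    ASSUMPTION: scale_pos_weight = n_neg / n_pos (not taken from class_balance.py).
    """
    if scale_pos_weight <= 0:
        raise ValueError("scale_pos_weight must be > 0")
    # n_pos / (n_pos + n_neg) = 1 / (1 + n_neg / n_pos)
    return 1.0 / (1.0 + scale_pos_weight)


base_rate = implied_positive_rate(4.71)  # weight reported in the Phase A context
rate_buy_val = 0.4611                    # from event=training_complete
overtrade_ratio = rate_buy_val / base_rate

print(f"implied base rate    = {base_rate:.3f}")
print(f"BUY rate / base rate = {overtrade_ratio:.2f}x")
```

Under that assumption the model fires BUY roughly 2.6 times more often than positives occur in training, which is consistent with the over-trading reading of the signal funnel.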
The S18 Step 5 dossier (committee experiment_review PASSED_WITH_REVISIONS) re-scoped this as H8 = scale_pos_weight auto-injection × wide θ-sweep range coupling. H8 was not in the parent dossier's H1-H7 enumeration — gap in the original analysis.
The committee also validated 3 latent bugs found by operator code audit (§6.bis of Step 5 dossier) :
- Bug 1 — class_balance.py:55-62 fails-fast on n_pos == 0 only, NOT n_neg == 0 → silent scale_pos_weight=0 on degenerate splits
- Bug 2 — theta_sweep.py:59 + eval_metrics.py:69 missing labels=[0, 1] → wrong f1 on mono-class splits
- Bug 3 — adapters/lgb.py:42-43 strips DataFrame column names via .to_numpy() → silently wrong predictions if column order drifts. LGB-only (XGB + CB safe).
S19 is the remediation Story.
2. Goals
- Restore reasonable trading rate on the LGB harness path : rate_buy_val ≤ 0.20 (from current 0.46) on the AAVEUSDC fold=3 cell, with no degradation > 5 % on f1_buy_val vs the post-S17 canonical reference (0.3520485).
- Backtest sortino > 0 on AAVEUSDC fold=3 + cross-fold validation cells.
- Fail-fast on degenerate training splits (Bug 1) : no more silent scale_pos_weight=0.
- Correct f1 measurement on mono-class splits (Bug 2) : Optuna no longer optimises on broken signal.
- Preserve DataFrame column order at LGB inference (Bug 3) — eliminate the silent feature-scrambling risk in backtest / live inference.
- Per-model θ-sweep range configurable via the existing ADR-90 PG-keys mechanism — no hardcoded constants.
3. Non-goals
- Disabling scale_pos_weight globally (S18 committee Option A REJECTED : insufficient evidence on XGB/CB regression risk).
- Adding scale_pos_weight to Optuna search (S18 committee Option D REJECTED : same f1-on-val attractor, would re-converge to the same over-trading optimum).
- Re-designing the θ-sweep to optimise sortino instead of f1_val (Option E : major scope, future Story).
- Touching XGB or CB adapters / θ-sweep (the regression is LGB-specific per the S18 evidence ; XGB has its own f1=0.089 issue covered in a separate Story scope).
- Modifying the FTF data prep, labeling, or feature pipeline (Step 4 confirmed the data is clean).
- Touching the autonomous trainer entry point (the regression is in the harness post-training nodes).
4. Architecture
4.1 Two-PR split (committee directive)
The committee verdict (Step 5 dossier §9) explicitly rejected Option F (bundling H8 + bug fixes in one PR) in favour of a clean revert envelope and independent CR cycles. S19 ships as 2 PRs :
| PR | Scope | Surface | Behavioural change |
|---|---|---|---|
| S19 main | H8 fix (LGB θ range + over-trade guard) + Bug 2 (BOTH files) | 5 files : theta_sweep.py, eval_metrics.py, lightgbm_dag.py, hyperparams.py registration, PG seed | Yes : alters the θ pick on LGB |
| S19-hardening | 3 defensive fixes : Bug 1 (class_balance.py) + Bug 3 (adapters/lgb.py) + bool extension to hyperparams.py | 3 files : class_balance.py, adapters/lgb.py, hyperparams.py (bool branch + _PARAM_TYPES entry) | No on healthy paths ; defensive fail-fast on degenerate paths |
CR PR #941 r3 BLOCKER #1 resolved : the v3 split had Bug 2 (theta_sweep.py:59 + eval_metrics.py:69) in the hardening PR, but theta_sweep.py is also modified by the main PR (θ range + guard). That contradicted the "independently mergeable" claim and would have produced a merge conflict regardless of merge order. Resolution : Bug 2 (both files) MOVED to S19 main, where theta_sweep.py is touched anyway and eval_metrics.py shares the same precision_recall_fscore_support signature. S19-hardening becomes strictly orthogonal to S19 main on the file surface.
The 2 PRs no longer make conflicting edits : theta_sweep.py and eval_metrics.py are touched by main only, and the single shared file (hyperparams.py) is edited on different lines. Recommended merge order remains : hardening first (defensive, low risk), then main (behavioural, higher review cost). The hardening PR also lands the bool extension to hyperparams.py which the main PR depends on for FEATURE_ORDER_STRICT (per CR r3 reco #5 : main must NOT use the temporary os.environ.get bridge).
| Concern | Pre-r3 (v3) | Post-r3 (v4) |
|---|---|---|
| theta_sweep.py | Touched by BOTH (Bug 2 + θ range) : merge conflict | Touched ONLY by main |
| eval_metrics.py | Touched by hardening (Bug 2) | Touched ONLY by main (bundled with theta_sweep.py Bug 2) |
| class_balance.py | Touched ONLY by hardening (Bug 1) | Unchanged : only by hardening |
| adapters/lgb.py | Touched ONLY by hardening (Bug 3) | Unchanged : only by hardening |
| hyperparams.py | Touched by main (param registration) | Touched by BOTH but on DIFFERENT lines : main adds new keys to _PARAM_TYPES, hardening adds bool branch + FEATURE_ORDER_STRICT entry. Merge order : hardening first to avoid conflict in _PARAM_TYPES dict. |
| lightgbm_dag.py | Touched by main only | Touched by main only |
4.2 S19 main PR - H8 fix design
4.2.1 Per-model θ-sweep range via ADR-90 keys
The current theta_sweep.pick_threshold_on_val() hard-codes the candidates np.linspace(0.05, 0.95, 19) + [0.5]. S19 main adds two kwargs theta_min + theta_max :
```python
# src/training/harness/nodes/theta_sweep.py
import numpy as np


def pick_threshold_on_val(
    y_val: np.ndarray,
    p_buy_val: np.ndarray,
    *,
    theta_min: float = 0.05,
    theta_max: float = 0.95,
) -> ThetaPick:
    """..."""
    # CR PR #941 r2 reco #1: when (theta_min, theta_max) is restricted (e.g.
    # [0.30, 0.40] for the LGB tightening), the canonical [0.5] anchor MUST
    # NOT be appended unconditionally, otherwise the picker can return θ=0.5
    # outside the configured bounds, contradicting the AC.
    if not (0.0 < theta_min < theta_max < 1.0):
        raise ValueError(
            f"pick_threshold_on_val: invalid bounds — "
            f"requires 0 < theta_min ({theta_min}) < theta_max ({theta_max}) < 1"
        )
    candidates = np.linspace(theta_min, theta_max, 19)
    if theta_min <= 0.5 <= theta_max:
        candidates = np.concatenate([candidates, [0.5]])
    candidates = np.unique(candidates)
    # rest unchanged
```
The caller (lightgbm_dag.py) resolves the LGB bounds via the ADR-90 resolver :
```python
# src/training/harness/dags/models/lightgbm_dag.py
from commun.finetune.hyperparams import resolve

theta_min = resolve("LGB", tf, "THETA_MIN", fallback=0.30)  # legacy lgbm_config.py:117
theta_max = resolve("LGB", tf, "THETA_MAX", fallback=0.40)
lgb_theta_pick = pick_threshold_on_val(y_val, p_buy_val, theta_min=theta_min, theta_max=theta_max)
```
PG seeding (Console UI, NOT git per ADR-59) :
- CVN_HPO_LGB_5M_THETA_MIN = 0.30
- CVN_HPO_LGB_5M_THETA_MAX = 0.40
- (and per-timeframe entries if needed)
theta_sweep.py stays generic ; only LGB callers tighten the range. CB and future models keep the wide default by not passing the kwargs.
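The effect of the bounded grid and the conditional 0.5 anchor can be checked without the harness. The sketch below is a dependency-free stand-in for the candidate construction ; the function name is hypothetical, and rounding to 10 decimals is a choice of this sketch that plays the role of np.unique.

```python
def theta_candidates(theta_min: float = 0.05, theta_max: float = 0.95, n: int = 19) -> list[float]:
    """Evenly spaced thresholds in [theta_min, theta_max].

    The 0.5 anchor is added only when it lies inside the bounds.
    """
    if not (0.0 < theta_min < theta_max < 1.0):
        raise ValueError(f"invalid bounds: {theta_min=}, {theta_max=}")
    step = (theta_max - theta_min) / (n - 1)
    # Rounding removes float noise so that duplicates collapse in the set.
    grid = {round(theta_min + i * step, 10) for i in range(n)}
    if theta_min <= 0.5 <= theta_max:
        grid.add(0.5)
    return sorted(grid)


lgb_grid = theta_candidates(0.30, 0.40)  # tightened LGB range
wide_grid = theta_candidates()           # default range for other callers

print(lgb_grid[0], lgb_grid[-1], 0.5 in lgb_grid)
print(len(wide_grid), 0.5 in wide_grid)
```

With the LGB bounds the anchor never enters the grid, so the picker cannot return a θ outside [0.30, 0.40] ; with the default bounds the anchor coincides with an existing grid point.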
4.2.2 Over-trade guard (committee enhancement)
Inside theta_sweep.pick_threshold_on_val(), after the best θ is computed, calculate the resulting rate_buy_val and emit a structured warning if it exceeds a threshold :
```python
# src/training/harness/nodes/theta_sweep.py (additional logic)
def pick_threshold_on_val(
    y_val,
    p_buy_val,
    *,
    theta_min=0.05,
    theta_max=0.95,
    rate_buy_warn_threshold: float = 0.20,
    rate_buy_fail_threshold: float = 0.25,
) -> ThetaPick:
    # ... pick best_t as before ...
    # Share of validation rows that would be emitted as BUY at the picked θ.
    rate_buy_val = float((p_buy_val >= best_t).mean())
    if rate_buy_val > rate_buy_warn_threshold:
        log_event(
            logger,
            "theta_overtrade_warning",
            theta_picked=best_t,
            rate_buy_val=rate_buy_val,
            rate_buy_warn_threshold=rate_buy_warn_threshold,
            rate_buy_fail_threshold=rate_buy_fail_threshold,
            f1_buy=best_f1,
        )
    if rate_buy_val > rate_buy_fail_threshold:
        raise RuntimeError(
            f"theta_sweep over-trade guard fired : rate_buy_val={rate_buy_val:.3f} > "
            f"fail_threshold={rate_buy_fail_threshold:.3f} (theta={best_t})"
        )
    return ThetaPick(threshold=best_t, f1_buy=best_f1, n_candidates=len(candidates))
```
Both thresholds are PG-mandatory (no optional / fallback=None semantics — the existing hyperparams.resolve API treats fallback=None as "raise if unset", which is incompatible with an opt-in unset state ; CR PR #941 r2 reco #3) :
- CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD = 0.20 (committee reco #2 — initial heuristic, see also §4.2.2 justification)
- CVN_HPO_LGB_5M_RATE_BUY_FAIL_THRESHOLD = 0.25 (committee reco #5 — 5 % above warn, enables fail-fast on extreme over-trade)
The behaviour band is :
- rate_buy_val ≤ 0.20 → run proceeds, no log
- 0.20 < rate_buy_val ≤ 0.25 → run proceeds, event=theta_overtrade_warning emitted
- rate_buy_val > 0.25 → run fails with RuntimeError
To restore warn-only behaviour temporarily (e.g. for a debugging window), the operator sets CVN_HPO_LGB_5M_OVERTRADE_GUARD_MODE = warn_only via Console UI (CR r3 reco #7 — replaces the previous FAIL=1.0 magic-number escape with an explicit semantic flag, aligning with ADR-90 patterns). The two accepted values :
- warn_only : only the event=theta_overtrade_warning is emitted ; the RATE_BUY_FAIL_THRESHOLD is ignored, no RuntimeError is ever raised
- fail (default) : full behaviour as described above ; warn at 0.20, fail at 0.25
The OVERTRADE_GUARD_MODE is a string-typed PG key, parsed by hyperparams.resolve with an enum guard :
```python
# in pick_threshold_on_val(...)
mode = resolve("LGB", tf, "OVERTRADE_GUARD_MODE", fallback="fail")
if mode not in {"warn_only", "fail"}:
    raise ValueError(f"OVERTRADE_GUARD_MODE must be 'warn_only' or 'fail', got {mode!r}")
if rate_buy_val > rate_buy_warn_threshold:
    log_event(logger, "theta_overtrade_warning", ..., guard_mode=mode)
if mode == "fail" and rate_buy_val > rate_buy_fail_threshold:
    raise RuntimeError(...)
```
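The behaviour band and the two guard modes reduce to a small decision function. The sketch below is illustrative only : the function name and the string verdicts are hypothetical, and the real guard logs and raises rather than returning a label.

```python
def overtrade_verdict(
    rate_buy_val: float,
    *,
    warn_threshold: float = 0.20,
    fail_threshold: float = 0.25,
    mode: str = "fail",
) -> str:
    """Return 'proceed', 'warn' or 'fail' for a given validation BUY rate."""
    if mode not in {"warn_only", "fail"}:
        raise ValueError(f"mode must be 'warn_only' or 'fail', got {mode!r}")
    if rate_buy_val <= warn_threshold:
        return "proceed"   # no log
    if mode == "fail" and rate_buy_val > fail_threshold:
        return "fail"      # the harness would raise RuntimeError here
    return "warn"          # theta_overtrade_warning only


for rate in (0.18, 0.22, 0.4611):
    print(rate, overtrade_verdict(rate), overtrade_verdict(rate, mode="warn_only"))
```

The Phase A rate of 0.4611 lands in the fail band under the default mode and degrades to a warning under warn_only, which is the escape hatch described above.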
_PARAM_TYPES extension (in S19-hardening alongside the bool branch) :
4.2.3 Files touched (S19 main)
```
src/training/harness/nodes/theta_sweep.py               # add 4 kwargs + over-trade guard logic + Bug 2 (labels=[0, 1])
src/training/harness/nodes/eval_metrics.py              # Bug 2 (labels=[0, 1]), bundled here per §4.1
src/training/harness/dags/models/lightgbm_dag.py        # resolve LGB θ bounds + warn threshold ; pass to pick_threshold_on_val
src/commun/finetune/hyperparams.py                      # register THETA_MIN, THETA_MAX, RATE_BUY_WARN_THRESHOLD, RATE_BUY_FAIL_THRESHOLD in _PARAM_TYPES
documentation/adr/0090-...md                            # (optional) add a note about the new param family
tests/unit/training_harness/nodes/test_theta_sweep.py   # tests for the bounds + over-trade guard
tests/unit/training_harness/nodes/test_eval_metrics.py  # Bug 2 mono-class f1
```
PG migration : seed 4 PG keys via Console UI (per ADR-59 — Console-only, no git PR).
4.3 S19-hardening PR - 3 bug fixes
Bug 1 — class_balance.py:55-62 :
```python
# BEFORE
if n_pos == 0:
    raise ValueError("compute_class_balance: degenerate training labels — n_pos=0. ...")

# AFTER
if n_pos == 0 or n_neg == 0:
    raise ValueError(
        f"compute_class_balance: degenerate training labels — "
        f"n_pos={n_pos}, n_neg={n_neg}. Cannot train binary on a single-class split (ADR-25 fail-fast)."
    )
```
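Why the n_neg check matters can be seen on toy counts. The sketch below is a hypothetical stand-in for the class-balance helper and assumes the weight is n_neg / n_pos ; under that assumption a split with n_neg == 0 would otherwise yield a weight of 0.0 without any error.

```python
def compute_scale_pos_weight(n_pos: int, n_neg: int) -> float:
    """Hypothetical helper. ASSUMPTION: weight = n_neg / n_pos."""
    if n_pos == 0 or n_neg == 0:
        raise ValueError(f"degenerate training labels: n_pos={n_pos}, n_neg={n_neg}")
    return n_neg / n_pos


print(compute_scale_pos_weight(175, 825))  # healthy split (illustrative counts)

# Both single-class splits are rejected instead of producing a silent weight.
for n_pos, n_neg in [(0, 1000), (1000, 0)]:
    try:
        compute_scale_pos_weight(n_pos, n_neg)
    except ValueError as exc:
        print("rejected:", exc)
```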
Bug 2 - theta_sweep.py:59 + eval_metrics.py:69 (ships in S19 main per §4.1) :
Both call sites :
```python
# BEFORE
_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
f1_buy = float(f1[1]) if len(f1) > 1 else 0.0

# AFTER
_, _, f1, _ = precision_recall_fscore_support(
    y, y_pred, labels=[0, 1], average=None, zero_division=0,
)
f1_buy = float(f1[1])  # always safe with explicit labels
```
eval_metrics.evaluate_split_binary line 69 gets the same treatment for the precision/recall/f1 unpacking.
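The failure mode is easiest to see on a mono-class split. The sketch below uses a small pure-Python per-class F1 as a simplified stand-in for precision_recall_fscore_support (with zero_division=0) ; like the scikit-learn default, it scores only the labels present in the data when labels is not given.

```python
def f1_per_label(y_true, y_pred, labels=None):
    """Per-class F1, returning 0.0 where the score is undefined."""
    if labels is None:
        # Default: only labels that actually occur, in sorted order.
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


y = [1, 1, 1, 1]       # mono-class validation split
y_pred = [1, 1, 1, 1]  # model predicts BUY everywhere

f1 = f1_per_label(y, y_pred)                 # BEFORE: no explicit labels
legacy_f1_buy = float(f1[1]) if len(f1) > 1 else 0.0

f1 = f1_per_label(y, y_pred, labels=[0, 1])  # AFTER: explicit labels
fixed_f1_buy = float(f1[1])

print(legacy_f1_buy, fixed_f1_buy)
```

Without explicit labels the result has a single entry (for class 1), so the legacy length check falls through to 0.0 even though every BUY prediction is correct ; with labels=[0, 1] index 1 always refers to the BUY class.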
Bug 3 — adapters/lgb.py:42-43 :
```python
from typing import Union

import numpy as np
import pandas as pd


# BEFORE
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        x = x.to_numpy()  # ← strips column names
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    ...


# AFTER
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        expected = list(self._native.feature_name())
        actual = list(x.columns)
        if actual != expected:
            missing = set(expected) - set(actual)
            extra = set(actual) - set(expected)
            if missing or extra:
                raise ValueError(
                    f"LGBAdapter.predict_proba: feature mismatch — "
                    f"missing={sorted(missing)}, extra={sorted(extra)}"
                )
            x = x[expected]  # reorder to training-time schema
        # leave as DataFrame: lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        return np.column_stack([1.0 - proba, proba])
    return proba
```
Defensive : raises if columns don't match the training schema (rather than silently calling .to_numpy() and predicting on whatever is passed).
Committee reco #3 (expert-crypto-trader, expert-ops) — add a strict mode flag to allow a transition window. The default is strict=True (raise as above) but operators can flip to strict=False via PG (CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT=false) for an emergency graceful fallback.
Implementation note (CR PR #941 r2 reco #4) : the existing commun.finetune.hyperparams.resolve() API only handles int / float per _PARAM_TYPES ; passing "false" would crash on the float-fallback parse. S19-hardening MUST extend _PARAM_TYPES with a bool entry for FEATURE_ORDER_STRICT :
```python
# src/commun/finetune/hyperparams.py (additional case in the parser)
from typing import Final

_PARAM_TYPES: Final = {
    # ... existing entries ...
    "FEATURE_ORDER_STRICT": bool,
}


# In the parse function: explicit bool branch
def _parse_value(raw: str, expected_type: type, key: str):
    ...
    if expected_type is bool:
        s = raw.strip().lower()
        if s in {"true", "1", "yes", "on"}:
            return True
        if s in {"false", "0", "no", "off"}:
            return False
        raise RuntimeError(f"hyperparams.resolve: cannot parse {raw!r} as bool for {key!r}")
    ...
```
CR r3 reco #5 + #6 — bool extension is now a HARD prerequisite. The bool entry + parse branch land in S19-hardening (alongside Bug 1 + Bug 3) — NOT delayed. The temporary os.environ.get bridge in the LGB adapter is REMOVED before S19-hardening merges ; the adapter calls hyperparams.resolve("LGB", "5M", "FEATURE_ORDER_STRICT", fallback=True) from the start. Documented in the S19-hardening PR body as a hard merge-order dependency for S19 main.
The S19-hardening PR body explicitly lists, in its description :
- "S19 main PR (#NNN) MUST NOT merge before this PR — lightgbm_dag.py depends on the bool extension landing here"
- "If the deadline (30 days from this PR's merge) elapses without S19 main landing, this PR is reverted to roll back the unused bool branch ; the revert is scheduled in the team agenda"
The patch :
```python
# AFTER (final v2 with strict mode)
def predict_proba(self, x: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
    if isinstance(x, pd.DataFrame):
        expected = list(self._native.feature_name())
        actual = list(x.columns)
        if actual != expected:
            missing = set(expected) - set(actual)
            extra = set(actual) - set(expected)
            if missing or extra:
                # Hard mismatch: schema is wrong, not just reordered.
                # Raise regardless of strict (no safe reorder possible).
                raise ValueError(
                    f"LGBAdapter.predict_proba: feature mismatch — "
                    f"missing={sorted(missing)}, extra={sorted(extra)}"
                )
            # Same columns, different order: strict vs warn+reorder
            if _is_strict_mode():  # reads CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT, defaults True
                raise ValueError(
                    f"LGBAdapter.predict_proba: column order mismatch (strict mode). "
                    f"expected={expected[:5]}... actual={actual[:5]}..."
                )
            log_event(
                logger,
                "lgb_adapter_column_reorder",
                expected_first5=expected[:5],
                actual_first5=actual[:5],
            )
            x = x[expected]  # reorder to training-time schema
        # leave as DataFrame: lgb.Booster.predict accepts both
    if self.best_iteration is not None:
        raw = self._native.predict(x, num_iteration=int(self.best_iteration))
    else:
        raw = self._native.predict(x)
    proba = np.asarray(raw, dtype=float)
    if proba.ndim == 1:
        return np.column_stack([1.0 - proba, proba])
    return proba
```
Default strict=True keeps the ADR-25 fail-fast posture as the contract ; the strict=False PG escape hatch is a transition-window concession (mirrors ADR-90 fallback pattern). Hard mismatches (missing / extra features) ALWAYS raise — only column-order divergence on identical sets falls back to WARN+reorder.
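The three outcomes (hard mismatch, strict raise, warn+reorder) can be exercised on plain lists of column names. This is a minimal sketch of the reconciliation rule only : the function name and the feature names are hypothetical, and the logging step is omitted.

```python
def reconcile_columns(expected: list[str], actual: list[str], *, strict: bool = True) -> list[int]:
    """Return, for each expected feature, its position in `actual`.

    Missing or extra features always raise; a pure reordering raises only in strict mode.
    """
    missing = set(expected) - set(actual)
    extra = set(actual) - set(expected)
    if missing or extra:
        raise ValueError(f"feature mismatch: missing={sorted(missing)}, extra={sorted(extra)}")
    if actual != expected and strict:
        raise ValueError("column order mismatch (strict mode)")
    position = {name: i for i, name in enumerate(actual)}
    return [position[name] for name in expected]


expected = ["rsi_14", "atr_14", "volume_z"]               # hypothetical training schema
row = {"volume_z": 1.8, "rsi_14": 55.0, "atr_14": 0.02}   # same features, drifted order
actual = list(row)

order = reconcile_columns(expected, actual, strict=False)
values = list(row.values())
print([values[i] for i in order])  # values realigned to the training schema
```

Passing the drifted row positionally would feed volume_z into the rsi_14 slot, which is the silent-scrambling risk the adapter fix removes.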
Files touched (S19-hardening) :
```
src/training/harness/nodes/class_balance.py               # Bug 1 - extend n_pos guard with n_neg
src/training/harness/adapters/lgb.py                      # Bug 3 - preserve / reorder columns + raise on mismatch
src/commun/finetune/hyperparams.py                        # bool branch + FEATURE_ORDER_STRICT entry in _PARAM_TYPES
tests/unit/training_harness/nodes/test_class_balance.py   # n_neg=0 raises
tests/unit/training_harness/adapters/test_lgb_adapter.py  # column-order preservation + mismatch raise
```
Per the CR r3 resolution in §4.1, the Bug 2 changes (theta_sweep.py line 59, eval_metrics.py line 69) and their tests (test_theta_sweep.py and test_eval_metrics.py : mono-class split returns correct f1) ship in S19 main, not in this PR.
5. Implementation plan
Phase 1 - Plan dossier + committee review (this PR)
- Plan dossier finalised, committee plan_review PASSED.
- Story transitions New → Specified.
Phase 2 - S19-hardening PR (defensive first, lower risk)
- Branch fix/CVN-N001-EE-S19-hardening
- 3 bug fixes (1 per file) + unit tests for each (4 test files)
- CR cycle (4-5 rounds per memory rule)
- Committee pr_review per ADR-68 (touches src/training/harness/)
- Merge
Phase 3 - S19 main PR (H8 behavioural fix)
- Branch feat/CVN-N001-EE-S19-theta-range-and-overtrade-guard
- ADR-90 param registration (THETA_MIN, THETA_MAX, RATE_BUY_WARN_THRESHOLD, RATE_BUY_FAIL_THRESHOLD)
- theta_sweep.py API extension (4 new kwargs)
- lightgbm_dag.py wiring (resolve bounds + warn threshold + pass through)
- Unit tests for bounds + guard
- CR cycle
- Committee pr_review
- Merge, but gate on cross-fold validation success before pushing to prod
Phase 4 - Cross-fold validation (operator-driven)
Run diagnostic__s18_step1_4_chain with the SAME parameters on 4 cells :
- AAVEUSDC fold=3 (the canary)
- OPUSDC fold=3
- LDOUSDC fold=4
- ETHUSDC fold=3 (added per committee reco #1 — control cell with healthier baseline, ADR-14 alignment)
For each cell, capture BEFORE/AFTER metrics :
| Metric | Source event | Acceptance threshold |
|---|---|---|
| theta_picked | event=training_complete | 0.30 ≤ θ ≤ 0.40 (the new bound) |
| rate_buy_val | event=training_complete | ≤ 0.20 (down from 0.46) |
| raw_buy_signals | event=signal_funnel | down at least 50 % vs BEFORE |
| final_trades | event=signal_funnel | down ; exact target depends on cell |
| sortino | event=weighted_variant_evaluated | > 0 (up from -9.5) |
| return | event=weighted_variant_evaluated | > -50% (loss-bounded) |
| f1_buy_val | event=training_complete | within 5 % of post-S17 reference (no model-quality regression) |
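The acceptance table can be turned into a mechanical per-cell check. The sketch below is one possible encoding, not part of the harness : the dict keys and function name are hypothetical, "within 5 %" is read as a relative tolerance (an assumption), and the AFTER values are illustrative placeholders, not measured results. The BEFORE values are the AAVEUSDC fold=3 figures from the Phase A logs.

```python
def check_cell(before: dict, after: dict, f1_reference: float = 0.3520485) -> list[str]:
    """Return the list of failed acceptance checks for one validation cell."""
    failures = []
    if not 0.30 <= after["theta_picked"] <= 0.40:
        failures.append("theta_picked outside [0.30, 0.40]")
    if after["rate_buy_val"] > 0.20:
        failures.append("rate_buy_val > 0.20")
    if after["raw_buy_signals"] > 0.5 * before["raw_buy_signals"]:
        failures.append("raw_buy_signals not down by 50 %")
    if after["final_trades"] >= before["final_trades"]:
        failures.append("final_trades not down")
    if after["sortino"] <= 0:
        failures.append("sortino <= 0")
    if after["return_pct"] <= -50.0:
        failures.append("return <= -50 %")
    # ASSUMPTION: "within 5 %" is relative to the reference f1.
    if abs(after["f1_buy_val"] - f1_reference) > 0.05 * f1_reference:
        failures.append("f1_buy_val more than 5 % from reference")
    return failures


before = {"theta_picked": 0.2, "rate_buy_val": 0.4611, "raw_buy_signals": 1210,
          "final_trades": 251, "sortino": -9.512, "return_pct": -91.35, "f1_buy_val": 0.352}
# Placeholder AFTER values, for illustration only.
after_example = {"theta_picked": 0.35, "rate_buy_val": 0.15, "raw_buy_signals": 400,
                 "final_trades": 120, "sortino": 0.8, "return_pct": 3.0, "f1_buy_val": 0.345}

print(check_cell(before, before))         # the BEFORE state fails most checks
print(check_cell(before, after_example))  # an empty list means the cell passes
```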
Seed the new ADR-90 keys in PG (Console UI) BEFORE the validation runs.
Committee reco #5 (expert-ops) : RATE_BUY_FAIL_THRESHOLD is set to a default of 0.25 in PG seeding (5 % above the warn threshold), enabling fail-fast on extreme over-trade cases instead of leaving it unset (silent-degradation risk). The warn-only mode is preserved in the [0.20, 0.25] band ; above 0.25 the run fails.
Committee reco #6 (expert-architect, expert-ops) : a pre-deploy CI check verifies that the new ADR-90 keys, including OVERTRADE_GUARD_MODE and FEATURE_ORDER_STRICT (6 keys in total, listed below ; CR r3 reco #8), are seeded in PG before any deploy that includes the Phase 3 PR. Implementation : a new gate in the deploy CI that calls hyperparams.resolve("LGB", "5M", "THETA_MIN") etc. without the fallback ; failure to resolve = build fails with an actionable message. Keys checked :
- CVN_HPO_LGB_5M_THETA_MIN
- CVN_HPO_LGB_5M_THETA_MAX
- CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD
- CVN_HPO_LGB_5M_RATE_BUY_FAIL_THRESHOLD
- CVN_HPO_LGB_5M_OVERTRADE_GUARD_MODE
- CVN_HPO_LGB_5M_FEATURE_ORDER_STRICT
Loki query {namespace="cvntrade"} |~ "event=hpo_fallback_applied" is monitored post-deploy ; any hit on these 6 keys triggers an oncall page.
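The gate logic amounts to building the six key names from the CVN_HPO_<MODEL>_<TF>_<PARAM> scheme and failing on any that is unseeded. The sketch below works on an in-memory snapshot rather than calling hyperparams.resolve ; the function names and the snapshot contents are hypothetical.

```python
REQUIRED_PARAMS = (
    "THETA_MIN", "THETA_MAX",
    "RATE_BUY_WARN_THRESHOLD", "RATE_BUY_FAIL_THRESHOLD",
    "OVERTRADE_GUARD_MODE", "FEATURE_ORDER_STRICT",
)


def pg_key(model: str, timeframe: str, param: str) -> str:
    """Build a key following the CVN_HPO_<MODEL>_<TF>_<PARAM> scheme."""
    return f"CVN_HPO_{model}_{timeframe}_{param}".upper()


def missing_keys(seeded: dict[str, str], model: str = "LGB", timeframe: str = "5M") -> list[str]:
    """Keys the gate expects that are absent or empty in the seeded snapshot."""
    wanted = [pg_key(model, timeframe, p) for p in REQUIRED_PARAMS]
    return [k for k in wanted if not seeded.get(k, "").strip()]


# Hypothetical snapshot in which two keys were forgotten.
snapshot = {
    "CVN_HPO_LGB_5M_THETA_MIN": "0.30",
    "CVN_HPO_LGB_5M_THETA_MAX": "0.40",
    "CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD": "0.20",
    "CVN_HPO_LGB_5M_RATE_BUY_FAIL_THRESHOLD": "0.25",
}

gaps = missing_keys(snapshot)
if gaps:
    print("deploy gate FAILED, seed these keys via Console UI:", ", ".join(gaps))
```

Listing the missing keys by name gives the actionable message the gate calls for ; a real CI step would exit non-zero at that point.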
Committee reco #11 (expert-ops) — pre-deploy LGB strict-mode compatibility check : before flipping FEATURE_ORDER_STRICT=true in prod, run a dry-run backtest with strict=false for 1 hour against live FE-pipeline output ; verify that event=lgb_adapter_column_reorder does NOT fire (= no upstream caller silently reorders columns). If any reorder event is logged, fix the upstream caller before flipping strict=true. Documented as a Phase 4 prerequisite step.
Committee reco #7 (expert-architect) : Phase 4 cross-fold validation includes a rollback test — manually trigger an over-trade fail (set RATE_BUY_FAIL_THRESHOLD=0.10 so the guard fires), verify the run errors as expected, then re-set to 0.25 via Console UI in seconds. Documents the rollback latency.
Committee reco #8 (expert-architect, expert-ml-engineer) : the Bug 3 raise behaviour (LGB column-mismatch) is documented in a new ADR (e.g. ADR-91 — to be drafted in a follow-up PR after S19-hardening merges) formalising the column-order contract between the harness and the backtest engine. Until the ADR lands, the contract is documented in the S19-hardening PR body + in the LGBAdapter.predict_proba docstring.
Phase 5 - Post-validation deploy + Story closure
- If all 4 cells PASS the acceptance thresholds → push to prod via the standard deploy CI path
- Trigger an FTF mini-sweep (1 crypto, 1 fold) post-deploy ; verify event=theta_overtrade_warning does NOT fire spuriously on a fresh prod run
- OP wp#165 transitions In testing → Tested → Closed per ADR-81
- Closure note on wp#165 with PR SHA + Loki snapshot of the 4 cell verdicts
6. Risk analysis
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [0.30, 0.40] range too narrow → LGB f1 degrades > 5 % on some folds | Medium | Medium (model-quality regression) | Cross-fold validation pre-merge ; if any cell fails the f1 threshold, widen the range to [0.25, 0.45] and re-validate |
| Over-trade guard fires spuriously on healthy folds (false positive) | Low | Low (warning only between 0.20 and 0.25 ; a spurious fail above 0.25 is reversible in seconds via PG) | Operator can switch to warn_only via OVERTRADE_GUARD_MODE in PG ; post-deploy mini-sweep verifies no spurious fires |
| Bug 3 fix raises on a healthy backtest call where columns happen to be reordered (false alarm) | Medium | Medium (backtest fails instead of silently mispredicting) | Defensive raise IS the contract per ADR-25 ; the alternative (silent mispredict) is the actual bug. If the backtest engine reorders columns, the engine MUST be fixed to pass them in training order. Document this in PR body. |
| The 2 PRs land in the wrong order (main before hardening) → main is harder to revert | Low | Low | Merge strategy doc'd in PR body of S19 main : "DO NOT MERGE before S19-hardening (#NNN)" |
| PG seeding forgotten before deploy → fallback to legacy 0.05-0.95 range fires (the bug we're fixing) | Medium | High (regression re-emerges silently) | lightgbm_dag.py uses resolve(..., fallback=0.30) and fallback=0.40 so the fix IS the default ; event=hpo_fallback_applied is Loki-queryable to detect any unintended fallback path |
| Cross-fold validation reveals H8 isn't the dominant cause on OPUSDC/LDOUSDC | Medium | Medium | Treat as new finding ; halt merge ; reopen S18 (or open S20) to investigate per-crypto variance |
| Bug 3 fix exposes a hidden upstream bug (FE pipeline reorders columns sometimes) | Medium | High | The exposure IS the goal — fail-fast surfaces it. Fix the upstream caller separately if it manifests. |
7. Rollback procedure
Per phase :
- Phase 2 (hardening) PR not merged : close the PR.
- Phase 2 merged but Bug 3 starts raising in backtest : revert the PR via git revert <SHA> + emergency PR ; investigate the upstream caller that passes mis-ordered columns.
- Phase 3 (main) PR not merged : close the PR.
- Phase 3 merged but cross-fold validation reveals problem : revert PR + rollback PG seeds via Console (set THETA_MIN/MAX back to 0.05/0.95 or unset).
- Cutover after Phase 4 validation passes but issue surfaces in prod : git revert + helm rollback if needed ; PG seeds are Console-rollback in seconds.
RTO : git revert + CI build + helm upgrade ≈ 15 min.
8. Definition of done
Hard criteria for closing the Story (all must pass) :
- S19-hardening PR merged on main (3 bug fixes + 4 unit test files)
- S19 main PR merged on main (θ-sweep range + over-trade guard + ADR-90 registrations + unit tests)
- PG seeds CVN_HPO_LGB_5M_THETA_MIN/MAX + CVN_HPO_LGB_5M_RATE_BUY_WARN_THRESHOLD set via Console UI
- Cross-fold validation passed on 4 cells (AAVEUSDC fold=3 + OPUSDC fold=3 + LDOUSDC fold=4 + ETHUSDC fold=3 control) per §5 Phase 4 thresholds
- Post-deploy FTF mini-sweep emits no spurious event=theta_overtrade_warning
- Loki query {namespace="cvntrade"} |~ "event=theta_overtrade_warning" returns the validation runs only (audit trail)
- OP wp#165 transitions In testing → Tested → Closed per ADR-81 with closure note + PR SHAs
9. ADR alignment
- ADR-25 (no silent fallback) : Bug 1 fix (fail-fast on degenerate splits) + over-trade guard fail-mode + Bug 3 raise on column mismatch
- ADR-31/32/33 (structured logging) : event=theta_overtrade_warning follows key=value format
- ADR-59 (Console-only PG params) : new THETA_MIN/MAX + warn/fail thresholds via Console, NOT in git
- ADR-68 (Expert Committee) : pr_review mandatory per PR (touches src/training/harness/)
- ADR-77 (MkDocs SSoT) : this dossier under documentation/reviews/
- ADR-89 (training harness as plugin registry) : preserved ; per-model θ range fits the existing plugin model
- ADR-90 (training hyperparams in PG) : extends the existing CVN_HPO_<MODEL>_<TF>_<PARAM> scheme to a new param family (THETA_*)
10. Out-of-scope follow-ups (filed separately)
- CVN-N001-EE-S20 : XGB-specific f1=0.089 fix (parent dossier §3.1 cited the canary number ; the XGB regression is structurally different, with fixed θ=0.5 and no θ-sweep, and likely a different mechanism than H8) ; to be opened post-S19 close
- CVN-N001-EE-S21 (committee r2 reco #9, confirmed) : data-driven calibration spike for the over-trade thresholds + θ-sweep bounds per crypto, using the 4 cross-fold validation cells as the starting dataset. Replaces the current "initial heuristic" basis for 0.20 / 0.25 with empirical per-crypto bounds. Out-of-scope for S19 (heuristic is good enough for the H8 unblock).
- Pre-#891 baseline empirical re-establishment : run train_with_fixed_params_lgbm with scale_pos_weight=1.0 + θ=0.4 on the captured fold to anchor the f1≈0.42 reference cited in S18 parent dossier §3 (S18 Step 5 §7 question #3) ; could be a 1-day spike Story.
- Live inference / execution kill switch (CR r3 reco #3) : already in scope under ADR-71 + Epic CVN-N001-EG (kill-switch implementation Story tracked separately ; design dossier documentation/design/CVN-N001-EF-S02-kill-switch-design.md). S19 does NOT implement the kill switch ; it relies on the pre-existing kill-switch contract for emergency halt independent of code rollback.
- CVN-N001-EE-S22 (NEW, CR r3 reco #4) : continuous data / label / concept drift detection for production LGB models (metrics such as KS test on feature distribution, label-prior drift, prediction-distribution drift ; thresholds, alerting, runbooks). Currently no automated detection ; relies on FTF sweep cadence + operator manual inspection. Major scope ; out-of-scope for S19.
- ADR-91 : formalise the LGB column-order contract between harness adapter and backtest engine (S18 Step 5 reco #8) — already filed as a follow-up after S19-hardening merges.
11. Test plan
Unit (S19-hardening)
tests/unit/training_harness/nodes/test_class_balance.py
- test_compute_class_balance_raises_on_n_neg_zero # NEW
- test_compute_class_balance_raises_on_n_pos_zero # existing — keep as regression guard
tests/unit/training_harness/nodes/test_theta_sweep.py
- test_pick_threshold_returns_correct_f1_when_y_all_positive # NEW (Bug 2 regression)
- test_pick_threshold_returns_correct_f1_when_y_pred_all_positive # NEW
- test_pick_threshold_default_bounds_unchanged # regression guard
tests/unit/training_harness/nodes/test_eval_metrics.py
- test_evaluate_split_binary_returns_correct_f1_mono_class # NEW
tests/unit/training_harness/adapters/test_lgb_adapter.py
- test_predict_proba_strict_true_raises_on_column_order_mismatch # NEW (CR PR #941 r2 reco split)
- test_predict_proba_strict_false_reorders_and_logs # NEW (CR PR #941 r2 reco split)
- test_predict_proba_raises_on_missing_feature_regardless_of_strict # NEW (hard mismatch always raises)
- test_predict_proba_raises_on_extra_feature_regardless_of_strict # NEW (hard mismatch always raises)
- test_predict_proba_accepts_ndarray_unchanged # regression guard
Unit (S19 main)
tests/unit/training_harness/nodes/test_theta_sweep.py
- test_pick_threshold_respects_theta_min_kwarg # NEW
- test_pick_threshold_respects_theta_max_kwarg # NEW
- test_pick_threshold_excludes_anchor_05_when_outside_bounds # NEW (CR PR #941 r2 reco #1 — bounds violation regression)
- test_pick_threshold_includes_anchor_05_when_inside_bounds # NEW (positive case for the same logic)
- test_pick_threshold_raises_on_invalid_bounds # NEW (theta_min >= theta_max OR bounds outside (0, 1))
- test_pick_threshold_emits_overtrade_warning_above_warn_threshold # NEW
- test_pick_threshold_raises_on_overtrade_above_fail_threshold # NEW (mandatory fail threshold per CR r2 reco #2/3)
- test_pick_threshold_no_warning_below_warn_threshold # NEW
tests/unit/finetune/test_hyperparams_bool_extension.py
- test_resolve_parses_true_variants_when_param_type_bool # NEW (CR PR #941 r2 reco #4 — bool extension)
- test_resolve_parses_false_variants_when_param_type_bool # NEW
- test_resolve_raises_on_unparseable_bool_literal # NEW
Unit (S19 main, CR r3 additions)
tests/unit/training_harness/nodes/test_theta_sweep.py (additions to v3 plan)
- test_pick_threshold_warn_only_mode_does_not_raise_above_fail_threshold # NEW (CR r3 reco #10 — explicit mode flag coverage)
- test_pick_threshold_fail_mode_raises_above_fail_threshold # NEW (positive complement)
- test_pick_threshold_invalid_overtrade_guard_mode_raises # NEW (enum guard regression)
- test_pick_threshold_includes_eval_metrics_labels_zero_one # NEW (CR r2 Bug 2 — moved from hardening to main per r3 reco #1)
tests/unit/training_harness/nodes/test_eval_metrics.py (moved from hardening)
- test_evaluate_split_binary_returns_correct_f1_mono_class # NEW (Bug 2 — moved here from hardening)
Downstream integration (CR r3 reco #2, explicit test cases)
tests/integration/training_harness/test_lgb_downstream_integration.py
- test_autonomous_trainer_consumes_lgb_artifact_with_new_theta_range
# Run autonomous_orchestrator.train_one_crypto for AAVEUSDC fold=3 ;
# assert TrainedArtifact.threshold_buy ∈ [0.30, 0.40] ; assert
# autonomous_trained log event shows the new theta_picked field
- test_regime_trainer_propagates_overtrade_guard_event
# Run weighted_variant_trained on a synthetic over-trade case ;
# assert event=theta_overtrade_warning bubbles up the regime trainer
# log chain ; assert Loki-queryable
- test_walk_forward_predictor_uses_picked_theta_from_artifact
# Load a TrainedArtifact with threshold_buy=0.35 ; run
# WalkForwardPredictor.predict on a synthetic feature window ;
# assert the predictor gates buy signals at θ=0.35 (NOT the legacy 0.40)
These 3 tests gate the S19 main PR merge. They run as pytest -m integration and are added to the medium tier (per CLAUDE.md Pytest markers).
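For the third test, the key design point is the choice of probe probabilities : at least one value must sit between the new θ (0.35) and the legacy θ (0.40), otherwise the test cannot tell the two apart. The sketch below illustrates this with stand-in classes. ArtifactStub and PredictorStub are hypothetical placeholders, not the real TrainedArtifact or WalkForwardPredictor, and the inclusive >= gate is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ArtifactStub:
    """Hypothetical stand-in for TrainedArtifact (only the field used here)."""
    threshold_buy: float


class PredictorStub:
    """Hypothetical stand-in for WalkForwardPredictor."""

    def __init__(self, artifact: ArtifactStub) -> None:
        self._theta = artifact.threshold_buy

    def predict(self, probas: Sequence[float]) -> List[bool]:
        # ASSUMPTION: a BUY is emitted when proba >= theta (inclusive gate).
        return [p >= self._theta for p in probas]


def test_predictor_gates_at_artifact_theta() -> None:
    predictor = PredictorStub(ArtifactStub(threshold_buy=0.35))
    # 0.37 is the discriminating probe: BUY at theta=0.35, no BUY at 0.40.
    probas = [0.30, 0.37, 0.45]
    assert predictor.predict(probas) == [False, True, True]
```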
Integration (cross-fold validation, manual via DAG)¶
3 runs of diagnostic__s18_step1_4_chain :
- crypto=AAVEUSDC, fold_id=3
- crypto=OPUSDC, fold_id=3
- crypto=LDOUSDC, fold_id=4
Per cell, compare BEFORE/AFTER on the 7 metrics in §5 Phase 4.
Smoke (post-deploy)¶
- 1-crypto FTF mini-sweep ; verify Loki shows the new theta_overtrade_warning event registry but no actual warnings on a healthy run
- Grafana FTF dashboard panels still render correctly (no LogQL parse errors with the new events)
Regression (post-merge for both PRs)¶
- Existing parity tests in tests/unit/training_harness/parity/test_lgb_harness_vs_legacy.py still PASS (the bug fixes are defensive, no behavioural change on currently-tested healthy paths)
- Existing tests/unit/training_harness/test_phase4_lgb_cutover.py still PASS
12. Committee questions (for plan_review)¶
- Per-model θ range via ADR-90 keys vs hardcoded constants : the dossier proposes the ADR-90 path (CVN_HPO_LGB_5M_THETA_MIN/MAX). Is this the right level of indirection, or should the bounds live in code with the hardcoded LGB legacy values for clarity ? The ADR-90 path adds Console UI complexity but unifies the pattern.
- Over-trade guard threshold : 0.20 is the operator's recommendation from S18 Step 5 §9. Should the default warn threshold come from the legacy LightGBMConfig.threshold_buy=0.4 × n_pos/n_train ≈ 0.175 ≈ 0.20, or be data-driven per crypto ?
- PR merge order : hardening-first (defensive) then main (behavioural). Any objection ?
- Bug 3 raise behaviour : the proposed fix RAISES on column mismatch. Should it be a WARN+reorder instead (graceful fix) ? The committee opinion matters because raising could break currently-running backtests if they happen to reorder columns silently.
- Cross-fold validation gating : 3 cells before merge. Is this enough or should we add e.g. ETHUSDC fold=3 as a "control" with a healthier baseline ?
- Verdict : PASS / PASS_WITH_REVISIONS / REJECTED with explicit AC for the next concrete step (S19 implementation kickoff).
13. Committee verdict (plan_review) — 2026-05-14¶
Status : PASSED_WITH_REVISIONS / OK / strong consensus / 0 blockers (session 1f4335a2, OP Meeting #135, 5 experts).
13.1 Areas of agreement (5)¶
- 2-PR split (hardening first, main second) excellent for risk management + clean revert envelope
- Strong ADR-25 alignment via Bug 1 fail-fast + Bug 3 raise on column mismatch
- ADR-90 extension for THETA_MIN/MAX + RATE_BUY_*_THRESHOLD consistent with existing pattern
- Over-trade guard with structured logging is a crucial operational control plane
- Comprehensive risk analysis + rollback procedure
13.2 Areas of dissent (4)¶
| Topic | Pro / Against | Resolution |
|---|---|---|
| rate_buy_val=0.20 warn threshold lacks data-driven justification | 3 / 2 (data-scientist + ops) | reco #2 : document as initial heuristic, schedule follow-up calibration spike |
| Bug 3 raise vs graceful reorder | 4 / 1 (ops) | reco #3 : strict=True default + opt-in strict=False via PG |
| Cross-fold size of 3 cells (ADR-14 robustness) | 3 / 2 (data-scientist + ops) | reco #1 : add ETHUSDC fold=3 as control (4 cells total) |
| Downstream system integration not fully validated | 3 / 1 (data-scientist) | reco #4 : add explicit downstream integration tests in Phase 4 |
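The n_pos / n_train ≈ 0.175 figure behind the 0.20 heuristic can be cross-checked against the Phase A logs in §1. The arithmetic below assumes class_balance.py derives scale_pos_weight as n_neg / n_pos (the usual convention, not confirmed in this section) ; under that assumption the positive base rate is 1 / (1 + scale_pos_weight).

```python
# Values copied from the Phase A logs quoted in §1.
scale_pos_weight = 4.71
rate_buy_val_observed = 0.4611

# ASSUMPTION: scale_pos_weight = n_neg / n_pos,
# hence n_pos / n_train = 1 / (1 + scale_pos_weight).
positive_base_rate = 1.0 / (1.0 + scale_pos_weight)

# How far the observed BUY rate on val overshoots the label base rate.
overtrade_ratio = rate_buy_val_observed / positive_base_rate

print(f"positive base rate = {positive_base_rate:.3f}")
print(f"observed BUY rate / base rate = {overtrade_ratio:.2f}")
```

Read this way, a warn threshold of 0.20 sits slightly above the label base rate, while the failing cell emitted BUY at well over twice that rate.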
13.3 8 recommendations integrated¶
| # | Reco | Section updated |
|---|---|---|
| 1 | Add ETHUSDC fold=3 to cross-fold validation set | §5 Phase 4 |
| 2 | Document rate_buy_val=0.20 as initial heuristic (justification : n_pos / n_train ≈ 0.175 ≈ 0.20, follow-up data-driven calibration tracked separately) | §4.2.2 + §10 |
| 3 | Bug 3 : add strict=False PG flag for graceful WARN+reorder transition mode (default strict=True) | §4.3 (Bug 3 patch updated) |
| 4 | Validate downstream systems (autonomous trainer, regime trainer, walk-forward predictor) — add integration tests in Phase 4 | §5 Phase 4 + §11 |
| 5 | Set RATE_BUY_FAIL_THRESHOLD=0.25 default in PG seeding (5 percentage points above warn) | §4.2.2 / §5 Phase 4 |
| 6 | Pre-deploy CI check for PG seeding + monitor event=hpo_fallback_applied in Loki | §5 Phase 4 + §6 risk row |
| 7 | Add rollback test in Phase 4 (synthetic fail trigger) | §5 Phase 4 |
| 8 | Document Bug 3 raise behaviour in a new ADR-91 (follow-up after S19-hardening merges) | §10 (out-of-scope follow-up) |
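Reco #3 describes two modes for the Bug 3 column check : strict (raise on any order mismatch, the default) and a transitional non-strict mode (WARN and reorder). A minimal sketch of that decision logic is given below. The function name and signature are hypothetical, and always raising on a missing or extra column, even in non-strict mode, is an assumption ; the lgb_adapter_column_reorder event name is the one used elsewhere in this dossier.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger("lgb_adapter_sketch")  # hypothetical logger name


def resolve_feature_order(
    actual: Sequence[str],
    expected: Sequence[str],
    *,
    strict: bool = True,
) -> List[str]:
    """Return the column order to feed the model, or raise on mismatch."""
    actual, expected = list(actual), list(expected)
    if actual == expected:
        return expected

    # ASSUMPTION: a different feature *set* is never recoverable,
    # so it raises regardless of the strict flag.
    if sorted(actual) != sorted(expected):
        missing = sorted(set(expected) - set(actual))
        extra = sorted(set(actual) - set(expected))
        raise ValueError(f"feature set mismatch: missing={missing} extra={extra}")

    if strict:
        raise ValueError("feature order mismatch (strict mode)")

    # Transitional mode: same columns, different order -> warn and reorder.
    logger.warning("event=lgb_adapter_column_reorder n_features=%d", len(expected))
    return expected
```

A caller would reindex its feature frame with the returned list before calling the booster ; in strict mode it never gets that far on a mismatch.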
13.4 Decision¶
The committee unanimously approves the 2-PR split and the H8 fix posture. No blockers. The 8 revisions integrated above tighten validation scope (1 control cell + downstream integration tests + rollback test) and operational ergonomics (graceful reorder mode + default fail threshold + pre-deploy gate).
Story transition : OP wp#165 → New → Specified after this dossier merges to main.
13.5 Round 2 verdict (2026-05-14, session aa76ed46, OP Meeting #136)¶
Status : PASSED_WITH_REVISIONS / EXECUTION_RISK / strong consensus / 1 BLOCKER + 10 recommendations.
Round 2 was triggered by 5 user-surfaced corrections to v3 that have all been addressed in v4 (the current revision of this dossier). The committee then ran a fresh review against v3 and surfaced :
13.5.1 BLOCKER (resolved in v4)¶
theta_sweep.py modified by both PRs (Bug 2 in hardening + θ range in main) → contradicts "independently mergeable" claim, would produce a merge conflict regardless of merge order. Resolution : Bug 2 (both theta_sweep.py:59 AND eval_metrics.py:69) MOVED to S19 main. S19-hardening becomes strictly orthogonal on the file surface (3 files : class_balance.py for Bug 1, adapters/lgb.py for Bug 3, hyperparams.py for the bool extension). See §4.1 updated table for the post-r3 file split.
13.5.2 11 recommendations integrated in v4¶
| # | Recommendation | Section updated |
|---|---|---|
| 1 | Resolve theta_sweep.py PR conflict (BLOCKER) : Bug 2 moved to main | §4.1 (table) + §11 (test files moved) |
| 2 | Detail downstream integration tests (autonomous trainer + regime trainer + walk-forward predictor) — explicit test cases | §11 new "Downstream integration" subsection (3 explicit tests) |
| 3 | Live inference / execution kill switch — REFERENCE existing ADR-71 + Epic CVN-N001-EG, NOT in S19 scope | §10 (out-of-scope reference) |
| 4 | Continuous data / label / concept drift detection — NEW Story CVN-N001-EE-S22 filed | §10 (out-of-scope) |
| 5 | bool extension MUST land in S19-hardening ; os.environ.get bridge REMOVED before merge | §4.3 (hard prerequisite) |
| 6 | Document the bool-parsing bridge as temporary with hard 30-day deadline + revert plan | §4.3 (PR body lock) |
| 7 | Replace RATE_BUY_FAIL_THRESHOLD=1.0 magic-number escape with explicit OVERTRADE_GUARD_MODE = warn_only \| fail flag | §4.2.2 (new explicit flag + enum guard) |
| 8 | Pre-deploy CI check extended to verify FEATURE_ORDER_STRICT is seeded (5 keys → 6 keys total) | §5 Phase 4 (key list updated) |
| 9 | Schedule CVN-N001-EE-S21 calibration spike (data-driven thresholds) — confirmed | §10 (out-of-scope) |
| 10 | Unit test for warn-only mode (verifies FAIL path is unreachable in warn_only) | §11 (3 tests added : warn_only / fail / invalid mode) |
| 11 | Pre-deploy LGB strict-mode dry-run (strict=False 1h-backtest, verify no lgb_adapter_column_reorder event) | §5 Phase 4 (added prerequisite step) |
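The pre-deploy CI check in reco #8 reduces to one question : is every required key present and non-empty in the seeded configuration ? A minimal sketch of that check follows. The function name is hypothetical, the seeded values are illustrative, and the example lists only keys named in this dossier ; the authoritative 6-key list is the one in §5 Phase 4.

```python
from typing import Iterable, List, Mapping


def find_unseeded_keys(required: Iterable[str], seeded: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or blank, sorted for stable CI output."""
    return sorted(
        key for key in required
        if key not in seeded or str(seeded[key]).strip() == ""
    )


# Illustrative subset only; values are placeholders, not real seeds.
required_keys = [
    "CVN_HPO_LGB_5M_THETA_MIN",
    "CVN_HPO_LGB_5M_THETA_MAX",
    "FEATURE_ORDER_STRICT",
]
seeded_rows = {
    "CVN_HPO_LGB_5M_THETA_MIN": "0.30",
    "CVN_HPO_LGB_5M_THETA_MAX": "",
}
missing = find_unseeded_keys(required_keys, seeded_rows)
```

A CI job built on this would fail the pipeline whenever the returned list is non-empty, so an unseeded key is caught before deploy rather than surfacing later as an hpo_fallback_applied event.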
13.5.3 Areas of dissent (4)¶
| Topic | Resolution |
|---|---|
| theta_sweep.py cross-PR modification | reco #1 : file moved (BLOCKER resolved) |
| RATE_BUY_FAIL_THRESHOLD=1.0 magic-number escape ergonomics | reco #7 : explicit OVERTRADE_GUARD_MODE flag |
| Downstream integration test specificity | reco #2 : 3 explicit test cases added in §11 |
| Live inference kill switch + drift detection absence | recos #3 + #4 : referenced ADR-71 / filed S22 |
13.5.4 Round 2 decision¶
The committee unanimously approves the v4 plan once the 11 recommendations are integrated (now done). The EXECUTION_RISK code reflected the v3 BLOCKER + the operational gaps ; both addressed in v4. The dossier is READY FOR IMPLEMENTATION KICKOFF post-merge of PR #941.
Story transition (post-merge) : OP wp#165 → New → Specified.
14. References¶
- Predecessor Story : CVN-N001-EE-S18 (closed) OP wp#154
- Step 5 re-scope dossier (the design source) : documentation/missions/cvn-n001-ee-s18-diagnostic/step5-rescope-dossier.md
- Parent S18 plan dossier : 2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md
- ADR-25 (no silent fallback) : documentation/adr/0025-pas-de-fallback-silencieux-dans-les-pipelines-ml.md
- ADR-89 (training harness as plugin registry) : documentation/adr/0089-...md
- ADR-90 (training hyperparams in PG / Console only) : documentation/adr/0090-...md
- Trigger PR : #891 (the harness migration that introduced the regression)
- S18 chained DAG run that produced the H8 evidence : Loki window 2026-05-14 14:44-15:08 UTC, search {event="s18_chain_verdict"}