Observability gap — prod LightGBM HPO draws are non-auditable

Finding (§0bis, verified)

For prod LightGBM HPO draws (CVNTrade_HPO, e.g. UNIUSDC, 11 draws, run 228e9ada):

  • METRICS = {}: no validation metric logged (no val-AUC / val-AUCPR / val-logloss, no calibration).
  • ARTIFACTS root = []: no per-candle predictions persisted.
  • iteration params = best_n_estimators = 156 only — the searched HP (the tree ceiling), not the early-stopping best_iteration that the model actually used.

Why it is load-bearing

Three "replay-independent" signals one would expect for an ML output diagnostic do not exist:

  1. No val-metric: there is no replay-independent way to tell whether a draw learned to rank on its own validation set.
  2. No per-candle predictions — any p_buy must be re-derived by retraining (this is why s43 retrains, and why its outputs inherit the s18 replay / A6 caveat).
  3. No best_iteration — cannot tell whether a "156-tree" draw actually used 156 trees or early-stopped at ~1 — the precise quantity needed to diagnose the documented best_iter=1 failure mode. (The cited best_iter=1 finding came from the multi-model study's training logs, not from these HPO runs.)

Consequence: no decisive replay-independent ML-output signal is available from prod. This forced S14 to a controlled re-fit (Q1) to answer the config question pre-S09 — the gap is the root cause of "read-only cannot answer before S09".

Proposed fix (HPO logging)

Persist, per HPO draw, at training time:

  1. The validation metric at best_iteration (val-AUCPR for the rare positive, plus val-AUC and Brier/ECE), and ideally the per-iteration val curve (so early-stopping behaviour is auditable).
  2. The real best_iteration (early-stopping point) as a logged metric — alongside the searched best_n_estimators ceiling (the gap between them is the early-stopping diagnostic).
  3. A bounded sample of per-candle holdout predictions (e.g. the validation fold's p_buy), enough to compute AUPRC / reliability without a re-fit.

(ADR-25/31 key=value events; ADR-23 provenance; tie into the MLflow backbone CVN-N006.)
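
A minimal sketch of what one per-draw audit record could contain, covering the three items above. It uses only the standard library, and every name in it (build_draw_audit, to_key_value_event, early_stop_gap, the record layout) is hypothetical and not the prod schema. It assumes best_iteration is 1-based and that val_curve holds one validation-metric value per boosting round; with LightGBM these would presumably come from the booster's best_iteration attribute and the record_evaluation callback. The key=value rendering is only a guess at the ADR-25/31 event shape.

```python
import random
from typing import Dict, Sequence


def build_draw_audit(
    draw_id: str,
    n_estimators_ceiling: int,
    best_iteration: int,
    val_curve: Sequence[float],
    val_labels: Sequence[int],
    val_p_buy: Sequence[float],
    max_samples: int = 1000,
    seed: int = 0,
) -> Dict[str, object]:
    """Assemble the three audit signals for one HPO draw (hypothetical schema)."""
    if not val_labels or len(val_labels) != len(val_p_buy):
        raise ValueError("labels and predictions must be non-empty and aligned")
    # Assumption: best_iteration is 1-based and indexes into val_curve.
    if not 1 <= best_iteration <= len(val_curve):
        raise ValueError("best_iteration must lie within val_curve")

    # Brier score: mean squared error between p_buy and the 0/1 label.
    brier = sum((p - y) ** 2 for p, y in zip(val_p_buy, val_labels)) / len(val_labels)

    # Bounded, reproducible sample of holdout rows.
    rng = random.Random(seed)
    k = min(max_samples, len(val_labels))
    rows = sorted(rng.sample(range(len(val_labels)), k))

    return {
        "draw_id": draw_id,
        "metrics": {
            "best_n_estimators": n_estimators_ceiling,  # searched ceiling
            "best_iteration": best_iteration,           # early-stopping point
            "early_stop_gap": n_estimators_ceiling - best_iteration,
            "val_metric_at_best": val_curve[best_iteration - 1],
            "val_brier": brier,
        },
        "val_curve": list(val_curve),
        "prediction_sample": [
            {"row": i, "y": val_labels[i], "p_buy": val_p_buy[i]} for i in rows
        ],
    }


def to_key_value_event(metrics: Dict[str, float]) -> str:
    """Render scalar metrics as one space-separated key=value line."""
    return " ".join(f"{k}={v}" for k, v in sorted(metrics.items()))
```

A draw with a 156-tree ceiling that early-stopped at iteration 1 would then carry early_stop_gap=155 in its own record, so the best_iter=1 failure mode becomes readable without a re-fit.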

Impact / where it bites

  • EK (tradability decision protocol) — any GBDT KILL-tuple datum currently rests on un-auditable draws.
  • S14 Q2 / S09 — a persisted prediction sample would have let the rank/calibration fork run replay-independently (no S09 gate).
  • Every future LGB/CatBoost diagnostic — without these three, each one must re-fit and inherits replay caveats.
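
To illustrate the second bullet: once a sample of (label, p_buy) pairs is persisted, the rank and calibration questions can be answered from the sample alone. The sketch below is a plain standard-library version under stated assumptions: labels are 0/1, p_buy lies in [0, 1], tied scores are ranked in input order (library implementations may treat ties differently), and ECE uses equal-width bins. The function names are illustrative, not existing project code.

```python
from typing import List, Sequence, Tuple


def average_precision(y: Sequence[int], p: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (AUPRC)."""
    n_pos = sum(y)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank rows by descending score; ties keep input order (simplification).
    order = sorted(range(len(y)), key=lambda i: p[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def expected_calibration_error(
    y: Sequence[int], p: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width probability bins; assumes p values in [0, 1]."""
    if not y or len(y) != len(p):
        raise ValueError("labels and predictions must be non-empty and aligned")
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        b = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((yi, pi))
    ece = 0.0
    for rows in bins:
        if rows:
            mean_y = sum(r[0] for r in rows) / len(rows)
            mean_p = sum(r[1] for r in rows) / len(rows)
            ece += len(rows) / len(y) * abs(mean_y - mean_p)
    return ece
```

Comparing average_precision against the positive base rate gives the rank read, and expected_calibration_error gives the calibration read, both without touching the replay path.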

Status

Orthogonal to S14 (which works around it via the Q1 controlled fit). Filed as GH #1178; prioritise relative to EK's need for an auditable GBDT substrate.