Observability gap: prod LightGBM HPO draws are non-auditable
Finding (§0bis, verified)
For prod LightGBM HPO draws (CVNTrade_HPO, e.g. UNIUSDC, 11 draws, run 228e9ada):
- `METRICS = {}`: no validation metric logged (no val-AUC / val-AUCPR / val-logloss, no calibration).
- `ARTIFACTS root = []`: no per-candle predictions persisted.
- iteration params = `best_n_estimators = 156` only: the searched HP (the tree ceiling), not the early-stopping `best_iteration` that the model actually used.
Why it is load-bearing
Three "replay-independent" signals one would expect for an ML output diagnostic do not exist:
- No val-metric — cannot read whether a draw learned to rank on its own validation set, replay-independently.
- No per-candle predictions — any p_buy must be re-derived by retraining (this is why s43 retrains, and why its outputs inherit the s18 replay / A6 caveat).
- No `best_iteration`: cannot tell whether a "156-tree" draw actually used 156 trees or early-stopped at ~1, which is the precise quantity needed to diagnose the documented `best_iter=1` failure mode. (The cited best_iter=1 finding came from the multi-model study's training logs, not from these HPO runs.)
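To make the third point concrete, here is a minimal sketch of how the searched ceiling and the early-stopping point can diverge. It is a simplified model of patience-based early stopping, not LightGBM's own implementation; the function name, the `patience` value, and the validation-loss curve are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def early_stopped_iteration(val_loss: Sequence[float], patience: int) -> int:
    """Return the 1-based iteration with the lowest validation loss seen
    before `patience` consecutive non-improving rounds end training."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_loss, start=1):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_iter


# Hypothetical curve: validation loss gets worse after the very first tree.
ceiling = 156
curve = [0.50 + 0.001 * i for i in range(ceiling)]
best = early_stopped_iteration(curve, patience=20)
print(f"best_n_estimators={ceiling} best_iteration={best} gap={ceiling - best}")
```

With only the ceiling logged, this draw is indistinguishable from one that genuinely used all 156 trees; logging both numbers exposes the gap directly.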
Consequence: no decisive replay-independent ML-output signal is available from prod. This forced S14 to a controlled re-fit (Q1) to answer the config question pre-S09 — the gap is the root cause of "read-only cannot answer before S09".
Proposed fix (HPO logging)
Persist, per HPO draw, at training time:
- The validation metric at `best_iteration` (val-AUCPR for the rare positive, plus val-AUC and Brier/ECE), and ideally the per-iteration val curve (so early-stopping behaviour is auditable).
- The real `best_iteration` (early-stopping point) as a logged metric, alongside the searched `best_n_estimators` ceiling (the gap between them is the early-stopping diagnostic).
- A bounded sample of per-candle holdout predictions (e.g. the validation fold's p_buy), enough to compute AUPRC / reliability without a re-fit.
(ADR-25/31 key=value events; ADR-23 provenance; tie into the MLflow backbone CVN-N006.)
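The three items above could be assembled per draw roughly as follows. This is a sketch under stated assumptions: the record schema, the field names (`early_stop_gap`, `val_loss_at_best`, `prediction_sample`), the sampling cap, and the key=value layout are all hypothetical and are not taken from ADR-25/31 or the existing pipeline. It also recomputes the best iteration from the validation curve to stay self-contained; a real integration would more likely read the trained booster's `best_iteration` attribute directly.

```python
import random
from typing import Dict, Sequence


def build_draw_record(
    n_estimators_ceiling: int,
    val_curve: Sequence[float],
    y_true: Sequence[int],
    p_buy: Sequence[float],
    max_sample: int = 1000,
    seed: int = 0,
) -> Dict[str, object]:
    """Assemble the three proposed signals for one HPO draw (hypothetical schema)."""
    # 1-based iteration with the lowest validation loss; ties go to the earliest.
    best_iteration = min(range(len(val_curve)), key=lambda i: val_curve[i]) + 1
    # Brier score on the holdout fold.
    brier = sum((p - y) ** 2 for p, y in zip(p_buy, y_true)) / len(y_true)
    # Bounded, seeded sample so the artifact stays small and reproducible.
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(y_true)), min(max_sample, len(y_true))))
    return {
        "metrics": {
            "best_iteration": best_iteration,
            "best_n_estimators": n_estimators_ceiling,
            "early_stop_gap": n_estimators_ceiling - best_iteration,
            "val_loss_at_best": val_curve[best_iteration - 1],
            "val_brier": brier,
        },
        "val_curve": list(val_curve),
        "prediction_sample": [
            {"row": i, "y": y_true[i], "p_buy": p_buy[i]} for i in keep
        ],
    }


def to_key_value_event(name: str, metrics: Dict[str, object]) -> str:
    """Render metrics as a single key=value line (layout is an assumption)."""
    return " ".join([name] + [f"{k}={metrics[k]}" for k in sorted(metrics)])


# Toy inputs, for illustration only.
record = build_draw_record(156, [0.69, 0.60, 0.62], [0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8], max_sample=2)
print(to_key_value_event("hpo_draw", record["metrics"]))
```

Keeping the scalar metrics separate from the curve and the prediction sample mirrors the usual split between logged metrics and stored artifacts, so the scalars stay queryable while the larger payloads are persisted once per draw.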
Impact / where it bites
- EK (tradability decision protocol) — any GBDT KILL-tuple datum currently rests on un-auditable draws.
- S14 Q2 / S09 — a persisted prediction sample would have let the rank/calibration fork run replay-independently (no S09 gate).
- Every future LGB/CatBoost diagnostic — without these three, each one must re-fit and inherits replay caveats.
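As an illustration of what a persisted prediction sample would unlock, the sketch below computes a ranking metric (average precision, a common estimator of AUPRC) and a calibration metric (expected calibration error over equal-width bins) from nothing but stored labels and p_buy values, with no re-fit. The function names and the toy inputs are hypothetical; tied scores are not given special treatment here, so a library implementation may differ slightly on data with many ties.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k taken at the rank of each positive (ties not merged)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def expected_calibration_error(
    y_true: Sequence[int], probs: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean |observed positive rate - mean predicted p| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # p == 1.0 is clamped into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    n = len(y_true)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        observed = sum(y for y, _ in items) / len(items)
        predicted = sum(p for _, p in items) / len(items)
        ece += len(items) / n * abs(observed - predicted)
    return ece


# Toy persisted sample, for illustration only.
labels = [1, 0, 1, 0]
p_buy = [0.9, 0.8, 0.7, 0.1]
print(average_precision(labels, p_buy), expected_calibration_error(labels, p_buy, n_bins=5))
```

Both functions read only the stored sample, which is what would make the rank/calibration fork answerable without a replay.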
Status
Orthogonal to S14 (which works around it via the Q1 controlled fit). Filed as GH #1178; prioritise relative to EK's need for an auditable GBDT substrate.