
Plan dossier — CVN-N001-EI-S14 · LightGBM output-validity diagnostic

Review note. Plan for Story S14 (ADR-0095). Fixes what & why and pre-registers the decision rule before any code/fit. The Story is re-scoped (r3) into two forks after a §0bis check on what prod actually logs (below). Submitted for plan_review.

What changed in r3 (the §0bis pivot). r1/r2 assumed S14 could read prod's per-draw signals (val-metric, best_iteration, per-candle predictions) and answer before S09 on the persisted s43 predictions. The §0bis check falsified that for the read-only framing: prod's LightGBM HPO runs log metrics={}, artifacts=[], and only best_n_estimators (the searched ceiling) — not the early-stopping best_iteration, no val-metric, no per-candle predictions (verified on UNIUSDC, 11 draws, run 228e9ada). So the read-only rank/calibration fork is inescapably the s43 replay — fully S09-gated. But this also shows the read-only constraint was the thing coupling S14 to the suspect replay — and the actual question ("is the LightGBM config/harness able to train on rare-positive crypto data?") is a config property, reproducible pre-S09 on trusted data via a small controlled instrumented fit. Hence the re-scope.

⛔ Why before S09 — re-honoured for Q1, honest for Q2

  • Q1 — config/harness validity (PRE-S09, decisive, replay-independent). Does the LightGBM training config/harness produce a degenerate model (early-stops at ~1 of a 156-tree ceiling, or trains 156 trees that still emit the prior) on rare-positive crypto data? This is a property of the config × regime, not of any particular run's predictions — so it is reproducible on trusted data (the normal FE-split trainer, which produces the A6 canonical — the reference, not the victim; _load_captured_parquet count = 0 in autonomous_model_trainer / ablation_runner). The harness surfaces the real best_iteration — the exact signal prod does not log. No replay, no S09 dependency.
  • Q2 — fold-3 rank/calibration (POST-S09, gated). Does LGB rank/calibrate vs CatBoost on the s43 replayed fold-3? This needs the s43 per-candle predictions, which exist only via the s18 replay (A6, 5/5 FAIL) — prod persists none (artifacts=[]). S09-gated, and runs only if Q1 returns CONFIG_OK_LGB (else the cause is already found upstream). Issuing it on the suspect replay would repeat the sin s43 flagged.

Composition. Run Q1 first (cheap, pre-S09, on trusted data). If Q1 = CONFIG_DEGENERATE_LGB → the bug is found without S09, and s43's LGB arm is contaminated at the source. If Q1 = CONFIG_OK_LGB and s43 still showed compression → the compression is fold-3/replay-specific, i.e. exactly the S09 question (Q2). Q1 cannot decide that — and should not try.


Part A — Narrative

A1. Problem statement (non-technical)

s43 showed CatBoost spreads BUY-probabilities up to ~0.9 while LightGBM's never exceed ~0.30 (≈ the base rate). For weeks an LGB bug has been suspected but never characterised, because it was viewed through the economic / θ lens, which conflates scale with rank. The decisive view is upstream, in the model output — and, per the §0bis pivot, the most fundamental version of the question is does the LGB config train at all (Q1), which is testable without the suspect replay.

Three readings to separate (the Story decides between them, does not pre-judge):

  1. "CatBoost ≫ LightGBM → CB is the better model." At the same θ this is apples-to-oranges when scales differ; the scale-invariant view is at matched rate. s43 v3 at ~11–15 % rate: precision ~0.33 (LGB) vs ~0.38 (CB) over a 0.25 break-even, i.e. margins of 0.08 vs 0.13. Whether that gap is "small" or "real" is not pre-judged; the tail-rank test (Q2, §1) decides it.
  2. "p_buy capped at 0.30 → LGB is buggy." Maybe, or it is benign: LGB underfits with defaults, or CB over-fits (its θ=0.9 corner is one won trade). "CB better" ≠ "LGB buggy".
  3. "It's economic." No: it is a model-output / training-config problem one layer up.

What we measure. Q1 (config, trusted data): the real best_iteration vs the 156-tree ceiling; the training/val curve; AUPRC and ECE/reliability on a clean holdout. Q2 (fold-3, replay, post-S09): tail-rank + calibration of the persisted s43 predictions. All as lift over the measured base rate.

Honesty. Unreadable inputs / mis-aligned labels → INCONCLUSIVE_TOOLING(reason). Underpowered CIs → INCONCLUSIVE_POWER. The Q2 fork pre-S09 → INCONCLUSIVE_REPLAY_GATED.

A2. User stories

| # | As a… | I want… | so that… |
|---|---|---|---|
| US1 | quant / reviewer | to know whether the LGB config trains (best_iteration, AUPRC) on rare-positive crypto | I characterise the weeks-old suspicion without waiting on S09 |
| US2 | programme DRI | to know whether the s43/S05 GBDT verdict is contaminated at the source (Q1) | I reopen the LGB arm if Q1 shows a degenerate config |
| US3 | S09 owner | to know whether the baseline LGB is itself degenerate (a separate fact) | S09 reads its fidelity check against a known-good-or-bad baseline |
| US4 | dev2 (Epic EK) | a GBDT substrate whose family training is validated or flagged | EK's KILL-tuple datum is not built on a config that cannot train |

(US3 reworded from r2: S14 does not "explain A6" — see A5 note.)

A3. Hypotheses (pre-registered)

  • H1 — Config trains (Q1, the decisive pre-S09 test). The LGB config/harness builds a non-degenerate model on regime-matched trusted data.
  • Criterion: the real best_iteration (early-stopping) vs the searched ceiling (best_n_estimators, e.g. 156); AUPRC (lift over base rate) and the train/val curve on a clean holdout. Degenerate iff best_iteration ≤ BEST_ITER_FLOOR or AUPRC-lift ≈ chance.
  • H2 — Calibration (reliability, NOT p_buy spread). LGB's probabilities are mis-calibrated (a real score→prob defect) vs honestly calibrated for a weak signal (a weak classifier on a ~base-rate positive should compress near the base rate — that is correct, not a bug).
  • Criterion: reliability diagram + ECE (equal-mass bins) + Brier, per family. The mirror is first-class: if CatBoost is the high-ECE one, CB is over-confident and LGB is the well-calibrated family.
  • H3 — Cause localisation (the best_iteration × rank 2×2). best_iteration localises why a deficit exists; it never overrides rank:
  • best_iteration ≈ 1 (≪ ceiling) + low rank → early-stopping cause (eval metric optimised at tree 1);
  • best_iteration ≈ ceiling + low rank → capacity / feature / label / leakage (trains fully, learns nothing) — not early-stopping;
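  • best_iteration ≈ 1 + healthy rank → fine (a strong single tree), but also a coherence flag (verify the read);
  • best_iteration ≈ ceiling + healthy rank → the config trains and learns (the CONFIG_OK_LGB row of Fig 1).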

WA1 (cause, surfaced not tested). If early-stopping branch: eval metric on the imbalanced val set, early_stopping_rounds, num_boost_round/best_iteration handling. If capacity branch: min_child_samples / min_gain_to_split, is_unbalance / scale_pos_weight, features/label plumbing. The Q1 eval-metric ablation (§1) discriminates the early-stopping branch directly.

A4. State of the art

  • Config pathology is reproducible. An early-stopping metric dominated by the majority class halts boosting at ~1 (best_iteration ≈ 1), collapsing predictions to the prior — the program's best_iter=1 multi-model finding (CVN-N001, PASSED). Because it is a config×regime property, it reproduces on trusted data — the basis of Q1's pre-S09 power.
  • Tail rank, not global AUC. The economics live in the high-confidence tail; use AUPRC (rare positive) + precision@matched-rate [Davis2006], not bulk-dominated global AUC (secondary context only).
  • Reliability is the calibration test. Only the reliability diagram / ECE distinguishes a weak-but-honest classifier (compressed near base rate, correct) from a true score→prob defect [Niculescu-Mizil2005]; it tests the CB-over-confident mirror symmetrically. Equal-mass bins (not equal-width) — the high-p_buy bins are empty for LGB by construction, so equal-width ECE is bulk-dominated and blind in the tail.
  • CatBoost robustness. Ordered boosting + conservative defaults fit small/imbalanced data where LGB defaults underfit — a benign explanation the fork must be able to return (CONFIG_OK_LGB / NO_LGB_ANOMALY).
| Reference | Grounds |
|---|---|
| [Davis2006] Davis & Goadrich, PR vs ROC Curves (ICML) | AUPRC / tail-rank for the rare positive |
| [Niculescu-Mizil2005] Niculescu-Mizil & Caruana (ICML) | reliability / ECE (equal-mass) |
| CVN-N001 best_iter=1 multi-model study (PASSED) | the early-stopping cause; Q1 reproducibility basis |
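The equal-mass argument above can be made concrete with a short sketch. It is a minimal stdlib-only illustration, not the Story's implementation: the function names `equal_mass_bins`, `ece_equal_mass` and `brier_score` are hypothetical, and the default of 10 bins is an assumption (the plan does not fix a bin count here).

```python
from typing import List, Sequence


def equal_mass_bins(probs: Sequence[float], n_bins: int) -> List[List[int]]:
    """Split sample indices, sorted by predicted probability, into
    n_bins groups of near-equal size (quantile bins)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    bins = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        if hi > lo:  # skip empty groups when n < n_bins
            bins.append(order[lo:hi])
    return bins


def ece_equal_mass(y_true: Sequence[int], probs: Sequence[float],
                   n_bins: int = 10) -> float:
    """Expected calibration error: size-weighted mean gap between the
    mean predicted probability and the observed positive rate per bin.

    Caveat (assumption of this sketch): tied scores that straddle a bin
    edge are split by input order, which can inflate the ECE of heavily
    tied predictions; merge ties first if that matters."""
    if len(y_true) != len(probs) or len(probs) == 0:
        raise ValueError("y_true and probs must be non-empty and aligned")
    n = len(probs)
    ece = 0.0
    for idx in equal_mass_bins(probs, n_bins):
        confidence = sum(probs[i] for i in idx) / len(idx)
        frequency = sum(y_true[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(confidence - frequency)
    return ece


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(probs)
```

Because every bin holds the same number of samples, a family whose scores never leave the low range still fills all bins, which is the property the equal-width variant lacks. The same function applied to both families gives the CB-over-confident mirror for free.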

ADRs: ADR-0095, ADR-25/31 (no silent fallback / no print), ADR-92 (build SHA), ADR-23 (provenance).

A5. Consolidation & traceability

| Problem | Hypothesis | US | Section | Fork |
|---|---|---|---|---|
| Does the LGB config train (best_iteration, AUPRC)? | H1 | US1 | §3.Q1 | Q1 (pre-S09) |
| Is LGB mis-calibrated vs honest weak signal? | H2 | US1 | §3.Q1/Q2 | Q1 holdout + Q2 |
| Why the deficit (early-stop vs capacity)? | H3 | US1 | §3.Q1 | Q1 ablation |
| Does LGB rank on the replayed fold-3? | H1/H2 | US2 | §3.Q2 | Q2 (S09-gated) |

A6 link (corrected, kept from r2 #6). A rank-deficient LGB makes both replay and prod f1 low → they coincide, they do not diverge; A6 is a divergence (replay ≠ canonical), a fidelity problem owned by S09, orthogonal to underfit. S14 does not "explain A6"; at most Q1 tells S09 whether the baseline LGB is itself degenerate (a separate fact). The "feeds A6" claim is withdrawn.

Observability finding (the §0bis itself, orthogonal). Prod LGB draws are non-auditable: metrics={}, artifacts=[], no best_iteration (only the searched best_n_estimators). This is why nothing decisive is replay-independent in read-only — and it will bite EK and every future LGB diagnostic. → separate observability ticket (persist val-metric + best_iteration + a prediction sample at HPO).


Part B — Technical specification

§0 Provenance

  • Discovery: s43 v3 (Fig 1) — LGB p_buy capped ~0.30; CB reaches θ=0.9; matched-rate precision ~0.33 (LGB) vs ~0.38 (CB) at ~11–15 % rate.
  • §0bis (verified): prod HPO runs (CVNTrade_HPO, e.g. UNIUSDC run 228e9ada, 11 draws) log metrics={}, artifacts=[], and best_n_estimators=156 only — no best_iteration, no val-metric, no per-candle predictions. → the rank/calibration fork is inescapably the s43 replay (S09-gated); the config question is not.
  • Trusted-data path (verified): the normal trainer builds Datasets from FE splits (OHLCV→enrich→FE→label), _load_captured_parquet count = 0 → independent of the s18 capture, and it is the path that produces the A6 canonical. This is Q1's input.
  • Prior art: CVN-N001 best_iter=1 multi-model study.

§1 Objective + pre-registered decision rule

Objective. Q1 (pre-S09): decide whether the LGB config/harness trains on regime-matched trusted data → CONFIG_DEGENERATE_LGB(cause=early_stopping|capacity) / CONFIG_OK_LGB — by directly measuring best_iteration (vs the ceiling), AUPRC-lift, ECE, on a clean holdout, with an eval-metric ablation to localise the cause. Q2 (post-S09, only if Q1=OK): the fold-3 rank/calibration fork on the cleared replay.

Measured reference (not a threshold). base_rate = positive prevalence of y per cell, measured first; every metric is expressed as lift over base_rate, i.e. the ratio metric/base_rate (p_buy 0.30 at base_rate 0.30 = 1×, no lift = deficit; at base_rate 0.10 = 3× lift = signal).
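As an illustration of the lift convention, here is a minimal stdlib-only sketch of the tail-rank metrics expressed over the measured base rate. The helper names (`base_rate`, `lift`, `average_precision`, `precision_at_rate`) are hypothetical, and the step-wise average precision with ties broken by input order is a simplification, not necessarily the estimator the harness will use.

```python
from typing import Sequence


def base_rate(y_true: Sequence[int]) -> float:
    """Positive prevalence of the labels, measured before any metric."""
    if not y_true:
        raise ValueError("empty label vector")
    return sum(y_true) / len(y_true)


def lift(metric: float, base: float) -> float:
    """Ratio of a metric to the base rate; 1.0 means no gain over the prior."""
    if base <= 0:
        raise ValueError("base rate must be positive")
    return metric / base


def _ranked(scores: Sequence[float]) -> list:
    """Indices sorted by descending score (stable, so ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC: mean of precision@rank taken at each positive."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positives in y_true")
    hits, total = 0, 0.0
    for rank, i in enumerate(_ranked(scores), start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def precision_at_rate(y_true: Sequence[int], scores: Sequence[float],
                      rate: float) -> float:
    """Precision among the top `rate` fraction of samples by score."""
    k = max(1, round(rate * len(scores)))
    top = _ranked(scores)[:k]
    return sum(y_true[i] for i in top) / k
```

Usage: `lift(average_precision(y, p), base_rate(y))` gives the AUPRC-lift used in the decision rule, and `lift(precision_at_rate(y, p, 0.10), base_rate(y))` the matched-rate counterpart.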

Frozen thresholds.

  • BEST_ITER_FLOOR = 3: best_iteration ≤ 3 while the ceiling (best_n_estimators) ≫ 3 → the early-stopping signature (decisive in Q1, not a mere hint, because here it is directly measured on a clean fit).
  • Tail rank: AUPRC + precision@top-10/15 %, as lift over base_rate, CI over the draw/seed distribution (median and max). RANK_MATERIAL = 0.10 in AUPRC-lift units (ratio AUPRC/base_rate; units fixed here per referee); chance-level / CIs-separated-below → "deficit". Global AUC = secondary context only.
  • Calibration: ECE with equal-mass (quantile) bins + reliability curve + Brier. ECE_BAD = 0.10. Mirror tested (high-ECE on CB → CB over-confident, LGB the well-calibrated one).
  • Power gate: any deciding CI half-width > its material threshold → INCONCLUSIVE_POWER.
  • Replay gate (Q2 only): Q2 metrics rest on the s18 replay → INCONCLUSIVE_REPLAY_GATED until S09 clears it. Q1 is not replay-gated (trusted data).
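The power gate can be sketched as follows. The plan fixes the rule (CI half-width versus the material threshold) but not the CI estimator, so the percentile bootstrap, the 95 % level, the resample count and the names `bootstrap_ci_halfwidth` / `power_gate_ok` are all assumptions of this illustration.

```python
import random
from statistics import median
from typing import Callable, Sequence

RANK_MATERIAL = 0.10  # frozen in this plan, in AUPRC-lift units


def bootstrap_ci_halfwidth(values: Sequence[float],
                           stat: Callable[[Sequence[float]], float] = median,
                           n_boot: int = 2000,
                           alpha: float = 0.05,
                           seed: int = 0) -> float:
    """Half-width of a percentile-bootstrap CI of `stat` over per-draw values."""
    if len(values) < 2:
        raise ValueError("need at least two draws to estimate a CI")
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return (hi - lo) / 2


def power_gate_ok(lifts: Sequence[float],
                  threshold: float = RANK_MATERIAL) -> bool:
    """True when the CI is tight enough to decide; False maps to
    INCONCLUSIVE_POWER in the decision rule."""
    return bootstrap_ci_halfwidth(lifts) <= threshold
```

Passing `stat=max` gives the max-over-draws variant; note that a bootstrap of the maximum is a crude estimate on small draw counts, so the median is the safer deciding statistic in this sketch.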

Threshold provenance (referee). RANK_MATERIAL and ECE_BAD are conventional starting values, not economically derived; flagged for committee calibration. BEST_ITER_FLOOR is anchored to the documented best_iter=1 finding (a 1-tree model on a 156-ceiling is unambiguously degenerate).

Q1 — decision rule (frozen, pre-S09, on trusted data).

# Controlled instrumented fit: prod best_params (ex-MLflow) on a REGIME-MATCHED trusted fold,
# via the normal FE-split trainer (no s18 capture). Harness surfaces best_iteration + val curve.
if fit/labels unreadable or label-misaligned:  return INCONCLUSIVE_TOOLING(reason)   # ADR-25

base = base_rate(y_holdout)
bi   = best_iteration            # REAL early-stopping point (the signal prod does not log)
ceil = best_n_estimators         # searched ceiling (e.g. 156)
auprc_lift = auprc(holdout)/base ; ece = ece_equalmass(holdout)

if rank_halfwidth > RANK_MATERIAL:   return INCONCLUSIVE_POWER

if bi <= BEST_ITER_FLOOR and ceil >> bi:                 # 1 tree of 156 → degenerate
    # eval-metric ablation: refit with ES metric = AUCPR (not the majority-dominated default)
    if best_iteration_ablated >> bi:  cause = "early_stopping"     # metric was the culprit
    else:                              cause = "early_stopping(other)"
    verdict = CONFIG_DEGENERATE_LGB(cause=cause)
elif auprc_lift <= CHANCE_LIFT:                          # trains fully but learns nothing
    verdict = CONFIG_DEGENERATE_LGB(cause="capacity/feature/label")
else:
    verdict = CONFIG_OK_LGB(ece=<>, note="if s43 still compressed → fold-3/replay-specific → Q2/S09")
emit s14_q1_config_verdict verdict=<> cause=<> best_iteration=<> ceiling=<> auprc_lift=<> ece=<> base_rate=<>
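The frozen rule above can be rendered as a pure function, in the spirit of `_decide_q1`. This is a sketch under stated assumptions: the `Q1Inputs` container, the tuple return shape and the name `decide_q1` are hypothetical; `chance_lift` stands in for CHANCE_LIFT and `far_factor` for the informal "≫", and both are passed explicitly because no numeric value for them is given in this section.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BEST_ITER_FLOOR = 3    # frozen in this plan
RANK_MATERIAL = 0.10   # frozen in this plan, AUPRC-lift units


@dataclass(frozen=True)
class Q1Inputs:
    readable: bool                      # fit and labels readable and aligned
    best_iteration: int                 # real early-stopping point
    ceiling: int                        # searched best_n_estimators
    auprc_lift: float                   # AUPRC / base_rate on the holdout
    rank_halfwidth: float               # CI half-width of the rank metric
    best_iteration_ablated: Optional[int] = None  # from the ES-metric refit
    tooling_reason: str = "unspecified"


def decide_q1(x: Q1Inputs, chance_lift: float,
              far_factor: float) -> Tuple[str, Optional[str]]:
    """Return exactly one (verdict, cause) pair, in the plan's gate order."""
    if not x.readable:
        return ("INCONCLUSIVE_TOOLING", x.tooling_reason)
    if x.rank_halfwidth > RANK_MATERIAL:
        return ("INCONCLUSIVE_POWER", None)
    stopped_early = (x.best_iteration <= BEST_ITER_FLOOR
                     and x.ceiling >= far_factor * x.best_iteration)
    if stopped_early:
        ablated = x.best_iteration_ablated
        # If swapping the early-stopping metric lets boosting run far
        # longer, the metric itself was the culprit.
        if ablated is not None and ablated >= far_factor * x.best_iteration:
            return ("CONFIG_DEGENERATE_LGB", "early_stopping")
        return ("CONFIG_DEGENERATE_LGB", "early_stopping(other)")
    if x.auprc_lift <= chance_lift:
        # Trains to (near) the ceiling but ranks no better than chance.
        return ("CONFIG_DEGENERATE_LGB", "capacity/feature/label")
    return ("CONFIG_OK_LGB", None)
```

For example, `decide_q1(Q1Inputs(True, 1, 156, 1.0, 0.05, best_iteration_ablated=80), chance_lift=1.0, far_factor=10)` returns the early-stopping verdict; the values 1.0 and 10 are illustrative only. Keeping the function free of I/O makes the truth-table exhaustiveness tests straightforward.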

Q2 — decision rule (frozen, post-S09, only if Q1=CONFIG_OK_LGB). As r2: tail-rank-first on the cleared replay predictions → RANK_DEFICIT_LGB / CALIBRATION_LGB / NO_LGB_ANOMALY (CB-over-confident mirror) / INCONCLUSIVE_{POWER,REPLAY_GATED,TOOLING}. Rank decides; best_iteration (now known from Q1) annotates cause.

Tie-breaks (frozen). (1) Tooling/labels → INCONCLUSIVE_TOOLING(reason). (2) Q1 before Q2: if Q1 is degenerate, the bug is at the source and Q2 is moot (don't certify rank on a config that can't train). (3) Q2 replay gate pre-S09. (4) Power gate. (5) Rank/best_iteration decides "learned"; the ablation + 2×2 localise the cause; best_iteration is measured in Q1 (decisive) and only annotates in Q2. (6) Calibration by ECE/reliability, not p_buy spread; CB-over-confident is first-class. (7) Decide on CIs over the draw distribution (median + max), never recency.

Fig 1 — Q1 truth table.

| best_iteration vs ceiling | AUPRC-lift | verdict |
|---|---|---|
| ≤ 3 while ceiling ≫ (e.g. 156) | any | CONFIG_DEGENERATE_LGB(cause=early_stopping); ablation refines |
| ≈ ceiling | ≤ chance | CONFIG_DEGENERATE_LGB(cause=capacity/feature/label) |
| ≈ ceiling | > chance | CONFIG_OK_LGB → (Q2/S09 if s43 still compressed) |
| any | CI too wide | INCONCLUSIVE_POWER |
| unreadable / mis-aligned | | INCONCLUSIVE_TOOLING(reason) |

§2 Scope

In scope, Q1 (pre-S09): a small controlled instrumented fit (prod best_params on a regime-matched trusted fold via the normal FE-split trainer); measured base_rate, real best_iteration vs ceiling, train/val curve, AUPRC-lift, equal-mass ECE/reliability/Brier on a clean holdout; the eval-metric ablation; label-alignment assertion; verdict + cause. In scope, Q2 (post-S09, if Q1=OK): the r2 read-only rank/calibration fork on the cleared s43 replay.

Out of scope: fixing the bug (follow-up, contingent on cause); issuing Q2 pre-S09 (replay-gated); a full HPO / multi-fold campaign; XGBoost; non-GBDT; any economic/tradability verdict (s43/EK).

Cost (honest). Q1 is no longer read-only — it is a small single-fold instrumented fit + one ablation refit. The read-only constraint was self-imposed to stay pre-S09/cheap; §0bis proved it cannot deliver that, so it is lifted for Q1 only. Operator-triggered via Airflow (no autonomous launch).

§3 Approach (Hamilton-native)

Q1 probes (trusted data):

  • _probe_q1_fit: runs the controlled fit (prod best_params, regime-matched trusted fold, normal FE-split trainer), captures best_iteration, the train/val curve, and the holdout predictions. Asserts the trusted fold's label = ATR0.5_1.5_H4 and base_rate comparable to s43 (else the pathology may not fire, giving a false CONFIG_OK; see §5). Asserts label alignment.
  • _probe_q1_rank: AUPRC-lift + precision@top-10/15 % on the holdout (median + max + CI over seeds/draws).
  • _probe_q1_calib: equal-mass ECE + reliability + Brier (LGB and CB, so the CB-over-confident mirror is testable). p_buy max/std descriptive only.
  • _probe_q1_ablation: refit(s) with the early-stopping eval metric set to AUCPR (and optionally F1 / precision@rate, committee rec #11) instead of the majority-dominated default; compares best_iteration to localise the early-stopping cause.
  • _decide_q1: pure; Fig 1 → exactly one verdict.
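The fit and ablation probes could look roughly like the sketch below. It assumes the harness can call LightGBM's native training API (`lgb.train` with the `lgb.early_stopping` and `lgb.record_evaluation` callbacks, reading `Booster.best_iteration`); whether the production trainer uses this API or the scikit-learn wrapper is not stated in this plan. The function names `with_es_metric` and `instrumented_fit` are hypothetical, and `"average_precision"` is LightGBM's metric name standing in for what the plan calls AUCPR.

```python
from typing import Any, Dict


def with_es_metric(params: Dict[str, Any], metric: str) -> Dict[str, Any]:
    """Copy of `params` whose evaluation (and hence early-stopping)
    metric is replaced by `metric`; the input dict is left untouched."""
    out = dict(params)
    for alias in ("metric", "metrics", "metric_types"):  # LightGBM aliases
        out.pop(alias, None)
    out["metric"] = metric
    out["first_metric_only"] = True
    return out


def instrumented_fit(params: Dict[str, Any], X_tr, y_tr, X_val, y_val,
                     ceiling: int, stopping_rounds: int) -> Dict[str, Any]:
    """Fit with early stopping and surface the signals prod does not log."""
    import lightgbm as lgb  # imported lazily so the helper above stays dependency-free

    curves: Dict[str, Any] = {}
    train_set = lgb.Dataset(X_tr, label=y_tr)
    val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
    booster = lgb.train(
        params,
        train_set,
        num_boost_round=ceiling,            # the searched ceiling, e.g. 156
        valid_sets=[val_set],
        valid_names=["val"],
        callbacks=[
            lgb.early_stopping(stopping_rounds, verbose=False),
            lgb.record_evaluation(curves),  # per-iteration validation curve
        ],
    )
    return {
        "best_iteration": booster.best_iteration,
        "ceiling": ceiling,
        "val_curve": curves.get("val", {}),
    }
```

The ablation is then two calls on the same fold: one with the production parameters as they are, one with `with_es_metric(params, "average_precision")`, followed by a comparison of the two `best_iteration` values. This sketch records only the validation curve; capturing the training curve as well would need the training set added to `valid_sets`.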

Q2 probes (post-S09): the r2 read-only probes on the cleared replay predictions (_probe_tail_rank, _probe_calibration, _decide_q2).

Orchestration: Q1 = an operator-triggered Airflow run (small fit + ablation), emitting the verdict + curves. Q2 = a read-only recompute, post-S09, only if Q1=OK.

§4 Files

| File | Action |
|---|---|
| commun/finetune/diagnostic/s14_lgb_output_validity.py | new: Q1 probes (_probe_q1_fit/_rank/_calib/_ablation/_decide_q1) + Q2 probes (_decide_q2) |
| _data/s14_q1_fit_inpod.py | new: operator-triggered controlled fit on a regime-matched trusted fold; captures best_iteration + curve + holdout preds → _data/s14_q1.json |
| _data/s14_figures.py | new: train/val curve, best_iteration-vs-ceiling, reliability diagrams (equal-mass), AUPRC-by-draw (lift); p_buy histogram (descriptive) |
| tests/unit/finetune/diagnostic/test_s14_lgb_output_validity.py | new: Q1 + Q2 truth-table exhaustiveness, no-crash, decide-on-CI, power gate, label-align + leakage guards, regime-match assertion, + a synthetic degenerate config (early_stopping_rounds=1) validates the CONFIG_DEGENERATE_LGB path (committee rec #12) |
| documentation/stories/CVN-N001-EI-S14/{index.md,test_strategy.md} · design/ · runbook/ | new: 5 artefacts (ADR-0095) |
| Loki catalogue (doc) | extend: s14_q1_config_verdict, s14_lgb_output_verdict |

§5 Risks

| Risk | Impact | Mitigation |
|---|---|---|
| Trusted fold not regime-matched (new, decisive) | pathology may not fire → false CONFIG_OK_LGB | assert label = ATR0.5_1.5_H4 and base_rate within ±20 % of the s43 base rate (committee rec #4, frozen tolerance) (_probe_q1_fit); pick a fold/universe in the same imbalance regime |
| Look-ahead in the trusted fold (committee rec #9) | a leaking holdout fakes a CONFIG_OK | _probe_q1_fit asserts the FE pipeline is fitted on train only (never refitted on the holdout) and the triple-barrier labels respect the embargo/horizon (no look-ahead); else INCONCLUSIVE_TOOLING(reason=leakage) |
| Config selection (which draws) | testing one benign config misses a config-specific bug | test the prod best_params set incl. the recency draw + a sample across draws; report the best_iteration distribution |
| Q2 A6 inheritance | rank/calibration on the suspect replay | INCONCLUSIVE_REPLAY_GATED until S09; Q2 runs only if Q1=OK |
| Weak-but-calibrated mistaken for a bug | CALIBRATION_LGB wrongly indicts LGB | calibration by equal-mass ECE/reliability, not p_buy spread; CB-over-confident mirror first-class |
| Global AUC hides the tail | a tail deficit passes as "AUC≈CB" | keyed on AUPRC + prec@rate; global AUC secondary |
| Underpowered fork | wide CI is unfalsifiable | power gate → INCONCLUSIVE_POWER |
| Label mis-alignment | metrics on wrong y | _probe_q1_fit asserts alignment → INCONCLUSIVE_TOOLING(reason=labels) |
| best_iteration coherence | bi=1 + healthy rank, or bi=ceiling + chance | the 2×2 (H3) localises; a bi↔rank conflict flags a possible wrong-artefact read |
| Plan↔code drift | deciding value invented at run | every threshold (BEST_ITER_FLOOR, RANK_MATERIAL, ECE_BAD, CHANCE_LIFT) is in this plan |
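The regime-match precondition from the first risk row can be expressed as a small guard. This is a sketch: the function name `check_regime_match` and the choice to return a list of reasons are hypothetical, and reading "±20 %" as a relative deviation from the s43 base rate is an assumption of this example.

```python
from typing import List

EXPECTED_LABEL = "ATR0.5_1.5_H4"   # label asserted by the plan
BASE_RATE_TOLERANCE = 0.20         # frozen tolerance, read here as relative


def check_regime_match(label_name: str,
                       fold_base_rate: float,
                       s43_base_rate: float,
                       tolerance: float = BASE_RATE_TOLERANCE) -> List[str]:
    """Return the list of regime-match violations (empty means matched)."""
    if s43_base_rate <= 0:
        raise ValueError("s43 base rate must be positive")
    reasons: List[str] = []
    if label_name != EXPECTED_LABEL:
        reasons.append(f"label {label_name!r} != {EXPECTED_LABEL!r}")
    rel_dev = abs(fold_base_rate - s43_base_rate) / s43_base_rate
    if rel_dev > tolerance:
        reasons.append(
            f"base_rate {fold_base_rate:.3f} deviates {rel_dev:.0%} "
            f"from s43 {s43_base_rate:.3f} (tolerance {tolerance:.0%})"
        )
    return reasons
```

Returning reasons rather than raising lets the caller decide how a mismatch is reported, and keeps the guard usable before any metric is computed, as the success criteria require.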

§6 Success criteria + ops

  1. base_rate measured + all metrics as lift; label alignment + regime-match asserted before any metric.
  2. Q1: event=s14_q1_config_verdict with best_iteration, ceiling, AUPRC-lift, ECE, cause, ablation delta.
  3. Q2 (post-S09, if Q1=OK): event=s14_lgb_output_verdict.
  4. Exactly one verdict per fork (truth-table exhaustiveness tests green, incl. power/replay/tooling gates).
  5. Reliability (equal-mass) + Brier + AUPRC (median+max+CI) reported; global AUC + p_buy histogram secondary.
  6. No-crash → INCONCLUSIVE_TOOLING(reason) (ADR-25/31).

Routing.

  • CONFIG_DEGENERATE_LGB(early_stopping) → fix the ES eval metric / rounds (WA1); s43 LGB arm contaminated at source, reopen; EK GBDT datum provisional; bug found pre-S09.
  • …(capacity/feature/label) → deeper training/feature investigation.
  • CONFIG_OK_LGB → the config trains; if s43 still compressed → Q2/S09 (fold-3/replay-specific); else the LGB-bug suspicion is laid to rest.
  • Q2 routings as r2.

§7 Settled decisions (committee may revise)

  1. Re-scope into Q1 (config, pre-S09) + Q2 (fold-3, S09-gated) after §0bis — settled (the read-only pre-S09 framing was falsified by prod logging metrics={}/artifacts=[]).
  2. Q1 lifts read-only for a small operator-triggered controlled fit on trusted data — settled (it is the only way to a decisive pre-S09 answer; the canonical-producing FE-split path is replay-independent).
  3. Rank = tail (AUPRC/prec@rate), not global AUC — settled. Calibration by equal-mass ECE/reliability — settled.
  4. best_iteration measured in Q1 (decisive), annotates in Q2; rank decides "learned" — settled.
  5. Regime-matched trusted fold (label + base_rate) is a hard precondition of Q1 — settled.
  6. Threshold provenance flagged (RANK_MATERIAL/ECE_BAD conventional, to calibrate) — settled.

Committee plan_review (ADR-68)

Verdict: PASSED · OK — session 3713e74b · OP Meeting 272 · 5 experts, strong consensus, 0 blockers, 0 dissent. Endorsed the §0bis re-scope (Q1 config-validity pre-S09 / Q2 fold-3 S09-gated), the read-only-lift for Q1's controlled fit, the decision rule, and the replay-gating.

Recommendation disposition (12 non-blocking):

| # | Recommendation | Disposition |
|---|---|---|
| 4 | Quantify "comparable base rate" | folded: ±20 % of s43 base rate (§1/§5) |
| 9 | Confirm leakage prevention for trusted data | folded: FE-fit-on-train-only + no-look-ahead assert (§3/§5) |
| 11 | Expand eval-metric ablation (multi-metric) | folded: AUCPR + opt. F1/prec@rate (§3) |
| 12 | Validate degenerate verdict path (synthetic config) | folded: early_stopping_rounds=1 test (§4) |
| 2 | Calibrate RANK_MATERIAL/ECE_BAD economically | already flagged §1 (threshold provenance); committee calibration |
| 6 | Specify Q1 best_params sampling strategy | already §5/§7 (operator arbitrage) |
| 1 | Prioritise the observability ticket | follow-up: filed GH #1178 |
| 3,7 | Q1 runbook + kill-switch + staged rollout | follow-up (runbook artefact, at merge) |
| 5,8,10 | Trusted-fold drift checks / drift monitoring / multi-fold coverage | follow-up (ops + strengthening) |

Methodological invariants — honest application

| Invariant | S14 r3 |
|---|---|
| Pre-registration | ✅ frozen §1 (both forks); editorialising removed |
| No-crash | INCONCLUSIVE_TOOLING(reason) |
| Inconclusives first-class | INCONCLUSIVE_{POWER,REPLAY_GATED,TOOLING} |
| Decide on significance | ✅ AUPRC + ECE CI over draw distribution (median+max) |
| §0bis (verify the source the path uses) | the pivot: verified prod logs nothing decisive; Q1 sources the canonical-producing FE path |
| Replay/fidelity gating | ✅ Q2 S09-gated; Q1 replay-independent (trusted data) |
| Gate on unmeasured/unvalidated inputs | ✅ thresholds frozen; base_rate measured; regime-match + label-align asserted |

Appendix — plan_review checklist

  • Re-scope Q1 (pre-S09, config) / Q2 (S09-gated, fold-3) recorded; §0bis pivot stated.
  • Q1 input = canonical-producing FE-split trainer, regime-matched (label + base_rate) — asserted.
  • Q1 measures real best_iteration vs ceiling; eval-metric ablation localises early-stopping.
  • Calibration by equal-mass ECE/reliability; CB-over-confident mirror testable.
  • Rank = tail (AUPRC/prec@rate, lift); global AUC secondary; CI over draws (median+max).
  • Power gate; label-alignment guard; threshold provenance flagged.
  • Q2 replay-gated; runs only if Q1=CONFIG_OK.
  • A6 link withdrawn; observability finding spun out to a ticket.
  • Q1 is operator-triggered (not read-only, not autonomous); 5 artefacts planned.