Plan dossier: CVN-N001-EI-S14 · LightGBM output-validity diagnostic

Review note. Plan for Story S14 (ADR-0095). It fixes the what and the why, and pre-registers the decision rule before any code is written or any model is fit. The Story is re-scoped (r3) into two forks after a §0bis check on what prod actually logs (below). Submitted for plan_review.

What changed in r3 (the §0bis pivot). r1/r2 assumed S14 could read prod's per-draw signals (val-metric, best_iteration, per-candle predictions) and answer before S09 on the persisted s43 predictions. The §0bis check falsified that for the read-only framing: prod's LightGBM HPO runs log metrics={}, artifacts=[], and only best_n_estimators (the searched ceiling) — not the early-stopping best_iteration, no val-metric, no per-candle predictions (verified on UNIUSDC, 11 draws, run 228e9ada). So the read-only rank/calibration fork is inescapably the s43 replay — fully S09-gated. But this also shows the read-only constraint was the thing coupling S14 to the suspect replay — and the actual question ("is the LightGBM config/harness able to train on rare-positive crypto data?") is a config property, reproducible pre-S09 on trusted data via a small controlled instrumented fit. Hence the re-scope.

⛔ Why before S09: re-honoured for Q1, honest for Q2

Q1 — config/harness validity (PRE-S09, decisive, replay-independent). Does the LightGBM training config/harness produce a degenerate model (early-stops at ~1 of a 156-tree ceiling, or trains 156 trees that still emit the prior) on rare-positive crypto data? This is a property of the config × regime, not of any particular run's predictions — so it is reproducible on trusted data (the normal FE-split trainer, which produces the A6 canonical — the reference, not the victim; _load_captured_parquet count = 0 in autonomous_model_trainer / ablation_runner). The harness surfaces the real best_iteration — the exact signal prod does not log. No replay, no S09 dependency.
Q2 — fold-3 rank/calibration (POST-S09, gated). Does LGB rank/calibrate vs CatBoost on the s43 replayed fold-3? This needs the s43 per-candle predictions, which exist only via the s18 replay (A6, 5/5 FAIL) — prod persists none (artifacts=[]). S09-gated, and runs only if Q1 returns CONFIG_OK_LGB (else the cause is already found upstream). Issuing it on the suspect replay would repeat the sin s43 flagged.

Composition. Run Q1 first (cheap, pre-S09, on trusted data). If Q1 = CONFIG_DEGENERATE_LGB → the bug is found without S09, and s43's LGB arm is contaminated at the source. If Q1 = CONFIG_OK_LGB and s43 still showed compression → the compression is fold-3/replay-specific, i.e. exactly the S09 question (Q2). Q1 cannot decide that — and should not try.

Part A: Narrative

A1. Problem statement (non-technical)

s43 showed CatBoost spreads BUY-probabilities up to ~0.9 while LightGBM's never exceed ~0.30 (≈ the base rate). For weeks an LGB bug has been suspected but never characterised, because it was viewed through the economic / θ lens, which conflates scale with rank. The decisive view is upstream, in the model output — and, per the §0bis pivot, the most fundamental version of the question is does the LGB config train at all (Q1), which is testable without the suspect replay.

Three readings to separate (the Story decides between them, does not pre-judge):
1. "CatBoost ≫ LightGBM → CB is the better model." At the same θ this is apples-to-oranges when scales differ; the scale-invariant view is at matched rate. s43 v3 at ~11–15 % rate: precision ~0.33 (LGB) vs ~0.38 (CB) over a 0.25 break-even, i.e. margins of 0.08 vs 0.13. Whether that gap is "small" or "real" is not pre-judged; the tail-rank test (Q2, §1) decides it.
2. "p_buy capped at 0.30 → LGB is buggy." Maybe, or benign: LGB underfits with defaults, or CB over-fits (its θ=0.9 corner is one won trade). "CB better" ≠ "LGB buggy".
3. "It's economic." No: it is a model-output / training-config problem one layer up.

What we measure. Q1 (config, trusted data): the real best_iteration vs the 156-tree ceiling; the training/val curve; AUPRC and ECE/reliability on a clean holdout. Q2 (fold-3, replay, post-S09): tail-rank + calibration of the persisted s43 predictions. All as lift over the measured base rate.

Honesty. Unreadable inputs / mis-aligned labels → INCONCLUSIVE_TOOLING(reason). Underpowered CIs → INCONCLUSIVE_POWER. The Q2 fork pre-S09 → INCONCLUSIVE_REPLAY_GATED.

A2. User stories

#	As a…	I want…	so that…
US1	quant / reviewer	to know whether the LGB config trains (best_iteration, AUPRC) on rare-positive crypto	I characterise the weeks-old suspicion without waiting on S09
US2	programme DRI	to know whether the s43/S05 GBDT verdict is contaminated at the source (Q1)	I reopen the LGB arm if Q1 shows a degenerate config
US3	S09 owner	to know whether the baseline LGB is itself degenerate (a separate fact)	S09 reads its fidelity check against a known-good-or-bad baseline
US4	dev2 (Epic EK)	a GBDT substrate whose family training is validated or flagged	EK's KILL-tuple datum is not built on a config that cannot train

(US3 reworded from r2: S14 does not "explain A6" — see A5 note.)

A3. Hypotheses (pre-registered)

H1 — Config trains (Q1, the decisive pre-S09 test). The LGB config/harness builds a non-degenerate model on regime-matched trusted data.
Criterion: the real best_iteration (early-stopping) vs the searched ceiling (best_n_estimators, e.g. 156); AUPRC (lift over base rate) and the train/val curve on a clean holdout. Degenerate iff best_iteration ≤ BEST_ITER_FLOOR or AUPRC-lift ≈ chance.
H2 — Calibration (reliability, NOT p_buy spread). LGB's probabilities are mis-calibrated (a real score→prob defect) vs honestly calibrated for a weak signal (a weak classifier on a ~base-rate positive should compress near the base rate — that is correct, not a bug).
Criterion: reliability diagram + ECE (equal-mass bins) + Brier, per family. The mirror is first-class: if CatBoost is the high-ECE one, CB is over-confident and LGB is the well-calibrated family.
H3 — Cause localisation (the best_iteration × rank 2×2). best_iteration localises why a deficit exists; it never overrides rank:
best_iteration ≈ 1 (≪ ceiling) + low rank → early-stopping cause (eval metric optimised at tree 1);
best_iteration ≈ ceiling + low rank → capacity / feature / label / leakage (trains fully, learns nothing) — not early-stopping;
best_iteration ≈ 1 + healthy rank → fine (a strong single tree) — and a coherence flag (verify the read).
best_iteration ≈ ceiling + healthy rank → trains and learns (the CONFIG_OK_LGB row of Fig 1).

WA1 (cause, surfaced not tested). If the early-stopping branch fires: check the eval metric on the imbalanced val set, early_stopping_rounds, and num_boost_round/best_iteration handling. If the capacity branch fires: check min_child_samples/min_gain_to_split, is_unbalance/scale_pos_weight, and the features/label plumbing. The Q1 eval-metric ablation (§1) discriminates the early-stopping branch directly.
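A minimal sketch of how the eval-metric ablation could be expressed as a parameter change, written as plain dictionaries so that nothing is trained here. The parameter names are LightGBM's documented ones (`average_precision` is LightGBM's name for the metric this plan calls AUCPR). Everything else is an assumption for illustration: the values are placeholders rather than the prod `best_params`, `binary_logloss` is only LightGBM's default for a binary objective and may not be what prod uses, and `ablate_eval_metric` is a hypothetical helper.

```python
# Hypothetical baseline: placeholder values, NOT the prod best_params.
BASELINE_PARAMS = {
    "objective": "binary",
    "n_estimators": 156,            # searched ceiling quoted in this plan
    "metric": "binary_logloss",     # ASSUMPTION: LightGBM's binary default; read the real one from prod config
    "early_stopping_round": 50,     # ASSUMPTION: illustrative patience
    "first_metric_only": True,      # early-stop on the first listed metric only
}

# Knobs WA1 lists for the capacity branch (names only, no values implied).
CAPACITY_KNOBS = (
    "min_child_samples",
    "min_gain_to_split",
    "is_unbalance",
    "scale_pos_weight",
)


def ablate_eval_metric(params: dict, metric: str = "average_precision") -> dict:
    """Return a copy of params in which only the early-stopping eval metric changes."""
    ablated = dict(params)
    ablated["metric"] = metric
    return ablated


ABLATED_PARAMS = ablate_eval_metric(BASELINE_PARAMS)

# The ablation is only interpretable if the metric is the single difference.
changed = {k for k in BASELINE_PARAMS if BASELINE_PARAMS[k] != ABLATED_PARAMS[k]}
print(changed)
```

Keeping the metric as the only changed key is the point of the design: any movement in best_iteration between the two fits can then be attributed to the early-stopping metric rather than to capacity settings.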

A4. State of the art

Config pathology is reproducible. An early-stopping metric dominated by the majority class halts boosting at ~1 (best_iteration ≈ 1), collapsing predictions to the prior — the program's best_iter=1 multi-model finding (CVN-N001, PASSED). Because it is a config×regime property, it reproduces on trusted data — the basis of Q1's pre-S09 power.
Tail rank, not global AUC. The economics live in the high-confidence tail; use AUPRC (rare positive) and precision@matched-rate [Davis2006], not bulk-dominated global AUC (secondary context only).
Reliability is the calibration test. Only the reliability diagram / ECE distinguishes a weak-but-honest classifier (compressed near base rate, correct) from a true score→prob defect [Niculescu-Mizil2005]; it tests the CB-over-confident mirror symmetrically. Equal-mass bins (not equal-width) — the high-p_buy bins are empty for LGB by construction, so equal-width ECE is bulk-dominated and blind in the tail.
CatBoost robustness. Ordered boosting + conservative defaults fit small/imbalanced data where LGB defaults underfit — a benign explanation the fork must be able to return (CONFIG_OK_LGB / NO_LGB_ANOMALY).
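The equal-mass versus equal-width point can be illustrated with a small pure-Python sketch. The data below are synthetic (scores compressed into [0.20, 0.30) with a 25 % positive rate) and the function names are hypothetical; this is not the `_probe_q1_calib` implementation, only an illustration of why bin choice matters for a compressed score distribution.

```python
from typing import List, Sequence, Tuple

Bin = List[Tuple[float, int]]


def _ece(bins: List[Bin], n: int) -> float:
    """Sample-weighted mean of |mean predicted prob - observed positive rate| over non-empty bins."""
    total = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            total += len(b) / n * abs(conf - rate)
    return total


def ece_equal_width(p: Sequence[float], y: Sequence[int], n_bins: int = 10) -> Tuple[float, int]:
    """ECE with fixed-width bins on [0, 1]; also returns the number of non-empty bins."""
    bins: List[Bin] = [[] for _ in range(n_bins)]
    for pi, yi in zip(p, y):
        bins[min(int(pi * n_bins), n_bins - 1)].append((pi, yi))
    return _ece(bins, len(p)), sum(1 for b in bins if b)


def ece_equal_mass(p: Sequence[float], y: Sequence[int], n_bins: int = 10) -> Tuple[float, int]:
    """ECE with quantile bins: each bin holds (almost) the same number of samples."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    return _ece(bins, n), sum(1 for b in bins if b)


# Synthetic compressed scores, illustrative only.
scores = [(200 + i) / 1000 for i in range(100)]
labels = [1 if i % 4 == 0 else 0 for i in range(100)]

ece_w, used_w = ece_equal_width(scores, labels)
ece_m, used_m = ece_equal_mass(scores, labels)
print("non-empty bins:", used_w, "(equal-width) vs", used_m, "(equal-mass)")
```

With these synthetic scores the equal-width scheme puts every sample into a single bin, so its ECE is one bulk average, while the quantile scheme still resolves ten bins inside the compressed range.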

Reference	Grounds
[Davis2006] Davis & Goadrich, PR vs ROC Curves (ICML)	AUPRC / tail-rank for the rare positive
[Niculescu-Mizil2005] Niculescu-Mizil & Caruana (ICML)	reliability / ECE (equal-mass)
CVN-N001 best_iter=1 multi-model study (PASSED)	the early-stopping cause; Q1 reproducibility basis

ADRs: ADR-0095, ADR-25/31 (no silent fallback / no print), ADR-92 (build SHA), ADR-23 (provenance).

A5. Consolidation & traceability

Problem	Hypothesis	US	Section	Fork
Does the LGB config train (best_iteration, AUPRC)?	H1	US1	§3.Q1	Q1 (pre-S09)
Is LGB mis-calibrated vs honest weak signal?	H2	US1	§3.Q1/Q2	Q1 holdout + Q2
Why the deficit (early-stop vs capacity)?	H3	US1	§3.Q1	Q1 ablation
Does LGB rank on the replayed fold-3?	H1/H2	US2	§3.Q2	Q2 (S09-gated)

A6 link (corrected, kept from r2 #6). A rank-deficient LGB makes both replay and prod f1 low → they coincide, they do not diverge; A6 is a divergence (replay ≠ canonical), a fidelity problem owned by S09, orthogonal to underfit. S14 does not "explain A6"; at most Q1 tells S09 whether the baseline LGB is itself degenerate (a separate fact). The "feeds A6" claim is withdrawn.

Observability finding (the §0bis itself, orthogonal). Prod LGB draws are non-auditable: metrics={}, artifacts=[], no best_iteration (only the searched best_n_estimators). This is why nothing decisive is replay-independent in read-only — and it will bite EK and every future LGB diagnostic. → separate observability ticket (persist val-metric + best_iteration + a prediction sample at HPO).

Part B: Technical specification

§0 Provenance

Discovery: s43 v3 (Fig 1) — LGB p_buy capped ~0.30; CB reaches θ=0.9; matched-rate precision ~0.33 (LGB) vs ~0.38 (CB) at ~11–15 % rate.
§0bis (verified): prod HPO runs (CVNTrade_HPO, e.g. UNIUSDC run 228e9ada, 11 draws) log metrics={}, artifacts=[], and best_n_estimators=156 only — no best_iteration, no val-metric, no per-candle predictions. → the rank/calibration fork is inescapably the s43 replay (S09-gated); the config question is not.
Trusted-data path (verified): the normal trainer builds Datasets from FE splits (OHLCV→enrich→FE→label), _load_captured_parquet count = 0 → independent of the s18 capture, and it is the path that produces the A6 canonical. This is Q1's input.
Prior art: CVN-N001 best_iter=1 multi-model study.

§1 Objective + pre-registered decision rule

Objective. Q1 (pre-S09): decide whether the LGB config/harness trains on regime-matched trusted data → CONFIG_DEGENERATE_LGB(cause=early_stopping|capacity) / CONFIG_OK_LGB — by directly measuring best_iteration (vs the ceiling), AUPRC-lift, ECE, on a clean holdout, with an eval-metric ablation to localise the cause. Q2 (post-S09, only if Q1=OK): the fold-3 rank/calibration fork on the cleared replay.

Measured reference (not a threshold). base_rate = positive prevalence of y per cell, measured first; every metric is expressed as lift over base_rate, i.e. the ratio metric / base_rate. Example: p_buy 0.30 at base_rate 0.30 is 1× (no lift, a deficit); the same 0.30 at base_rate 0.10 is 3× (signal).
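The lift arithmetic can be written out as a short sketch. The helper names `base_rate` and `lift` are hypothetical, and the label vector is synthetic.

```python
from typing import Sequence


def base_rate(y: Sequence[int]) -> float:
    """Positive prevalence of a 0/1 label vector, measured before any metric."""
    if not y:
        raise ValueError("empty label vector")
    return sum(y) / len(y)


def lift(value: float, rate: float) -> float:
    """Express a probability-like metric as a ratio over the measured base rate."""
    if not 0.0 < rate < 1.0:
        raise ValueError("base rate must be strictly between 0 and 1")
    return value / rate


# Synthetic cell: 3 positives out of 10 labels.
y_cell = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
rate = base_rate(y_cell)

no_lift = lift(0.30, rate)        # same score as the prevalence
signal = round(lift(0.30, 0.10), 6)  # same score on a rarer positive
print(rate, no_lift, signal)
```

The same raw p_buy of 0.30 reads as no lift in one regime and as roughly threefold lift in another, which is why the base rate has to be measured per cell before any metric is interpreted.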

Frozen thresholds.
- BEST_ITER_FLOOR = 3: best_iteration ≤ 3 while the ceiling (best_n_estimators) ≫ 3 → the early-stopping signature (decisive in Q1, not a mere hint, because here it is directly measured on a clean fit).
- Tail rank: AUPRC + precision@top-10/15 %, as lift over base_rate, CI over the draw/seed distribution (median and max). RANK_MATERIAL = 0.10 in AUPRC-lift units (ratio AUPRC/base_rate; units fixed here per referee); chance-level / CIs-separated-below → "deficit". Global AUC = secondary context only.
- Calibration: ECE with equal-mass (quantile) bins + reliability curve + Brier. ECE_BAD = 0.10. Mirror tested (high-ECE on CB → CB over-confident, LGB the well-calibrated one).
- Power gate: any deciding CI half-width > its material threshold → INCONCLUSIVE_POWER.
- Replay gate (Q2 only): Q2 metrics rest on the s18 replay → INCONCLUSIVE_REPLAY_GATED until S09 clears it. Q1 is not replay-gated (trusted data).
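One way the power gate could be computed is sketched below. The plan only fixes "CI over the draw/seed distribution" and the half-width comparison; the percentile bootstrap of the median, the 95 % level, the resample count, and the function names are assumptions made for illustration, and the per-draw lift values are synthetic.

```python
import random
import statistics
from typing import Sequence, Tuple

RANK_MATERIAL = 0.10  # frozen in this plan, in AUPRC-lift units


def bootstrap_median_ci(
    values: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Percentile-bootstrap CI of the median over per-draw metric values."""
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(values), lo, hi


def power_gate(values: Sequence[float], material: float = RANK_MATERIAL) -> bool:
    """True when the CI half-width is within the material threshold (the fork may decide)."""
    _, lo, hi = bootstrap_median_ci(values)
    return (hi - lo) / 2 <= material


# Synthetic AUPRC-lift values over 11 draws (illustrative only).
tight_draws = [1.50, 1.52, 1.49, 1.51, 1.50, 1.53, 1.48, 1.51, 1.50, 1.52, 1.49]
wide_draws = [0.90, 2.40, 1.10, 3.00, 0.80, 2.70, 1.00, 2.90, 0.95, 2.50, 1.20]

print(power_gate(tight_draws), power_gate(wide_draws))
```

A draw set that fails the gate maps to INCONCLUSIVE_POWER regardless of where its median sits. The same resampling could be repeated for the max over draws, which the plan also reports.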

Threshold provenance (referee). RANK_MATERIAL and ECE_BAD are conventional starting values, not economically derived; flagged for committee calibration. BEST_ITER_FLOOR is anchored to the documented best_iter=1 finding (a 1-tree model on a 156-ceiling is unambiguously degenerate).

Q1 — decision rule (frozen, pre-S09, on trusted data).

# Controlled instrumented fit: prod best_params (ex-MLflow) on a REGIME-MATCHED trusted fold,
# via the normal FE-split trainer (no s18 capture). Harness surfaces best_iteration + val curve.
if fit/labels unreadable or label-misaligned:  return INCONCLUSIVE_TOOLING(reason)   # ADR-25

base = base_rate(y_holdout)
bi   = best_iteration            # REAL early-stopping point (the signal prod does not log)
ceil = best_n_estimators         # searched ceiling (e.g. 156)
auprc_lift = auprc(holdout)/base ; ece = ece_equalmass(holdout)

if rank_halfwidth > RANK_MATERIAL:   return INCONCLUSIVE_POWER

if bi <= BEST_ITER_FLOOR and ceil >> bi:                 # 1 tree of 156 → degenerate
    # eval-metric ablation: refit with ES metric = AUCPR (not the majority-dominated default)
    if best_iteration_ablated >> bi:  cause = "early_stopping"     # metric was the culprit
    else:                              cause = "early_stopping(other)"
    verdict = CONFIG_DEGENERATE_LGB(cause=cause)
elif auprc_lift <= CHANCE_LIFT:                          # trains fully but learns nothing
    verdict = CONFIG_DEGENERATE_LGB(cause="capacity/feature/label")
else:
    verdict = CONFIG_OK_LGB(ece=<>, note="if s43 still compressed → fold-3/replay-specific → Q2/S09")
emit s14_q1_config_verdict verdict=<> cause=<> best_iteration=<> ceiling=<> auprc_lift=<> ece=<> base_rate=<>
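The frozen Q1 rule above can be sketched as a pure Python function, in the spirit of the `_decide_q1` probe. This is an illustration under stated assumptions, not the implementation: `Q1Inputs` is a hypothetical container, and three constants are not fixed numerically by this plan. `CHANCE_LIFT = 1.0` follows from a random ranker's expected AUPRC being equal to the base rate; `CEILING_RATIO` and `ABLATION_RATIO` are hypothetical numeric readings of the `>>` comparisons in the pseudocode.

```python
from dataclasses import dataclass
from typing import Optional

BEST_ITER_FLOOR = 3    # frozen in this plan
RANK_MATERIAL = 0.10   # frozen in this plan (AUPRC-lift units)
CHANCE_LIFT = 1.0      # ASSUMPTION: lift of a random ranker; the plan names it without a value
CEILING_RATIO = 10     # ASSUMPTION: numeric reading of "ceil >> bi"
ABLATION_RATIO = 10    # ASSUMPTION: numeric reading of "best_iteration_ablated >> bi"


@dataclass(frozen=True)
class Q1Inputs:
    readable: bool                 # fit and labels readable and aligned
    best_iteration: int            # real early-stopping point
    ceiling: int                   # searched best_n_estimators
    auprc_lift: float              # AUPRC / base_rate on the holdout
    rank_halfwidth: float          # CI half-width of the deciding rank metric
    ece: float                     # equal-mass ECE on the holdout
    best_iteration_ablated: Optional[int] = None
    tooling_reason: str = ""


def decide_q1(x: Q1Inputs) -> str:
    """Return exactly one Q1 verdict, checking gates in the pre-registered order."""
    if not x.readable:
        return f"INCONCLUSIVE_TOOLING(reason={x.tooling_reason})"
    if x.rank_halfwidth > RANK_MATERIAL:
        return "INCONCLUSIVE_POWER"
    if (x.best_iteration <= BEST_ITER_FLOOR
            and x.ceiling >= CEILING_RATIO * x.best_iteration):
        recovered = (
            x.best_iteration_ablated is not None
            and x.best_iteration_ablated >= ABLATION_RATIO * x.best_iteration
        )
        cause = "early_stopping" if recovered else "early_stopping(other)"
        return f"CONFIG_DEGENERATE_LGB(cause={cause})"
    if x.auprc_lift <= CHANCE_LIFT:
        return "CONFIG_DEGENERATE_LGB(cause=capacity/feature/label)"
    return f"CONFIG_OK_LGB(ece={x.ece:.3f})"


example = Q1Inputs(True, 1, 156, 1.0, 0.05, 0.02, best_iteration_ablated=80)
print(decide_q1(example))
```

Keeping the function free of I/O makes the truth-table exhaustiveness tests straightforward: every combination of inputs maps to one string, and the gates are evaluated in the same order as the pseudocode.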

Q2 — decision rule (frozen, post-S09, only if Q1=CONFIG_OK_LGB). As r2: tail-rank-first on the cleared replay predictions → RANK_DEFICIT_LGB / CALIBRATION_LGB / NO_LGB_ANOMALY (CB-over-confident mirror) / INCONCLUSIVE_{POWER,REPLAY_GATED,TOOLING}. Rank decides; best_iteration (now known from Q1) annotates cause.

Tie-breaks (frozen). (1) Tooling/labels → INCONCLUSIVE_TOOLING(reason). (2) Q1 before Q2 — if Q1 is degenerate, the bug is at the source; Q2 is moot (don't certify rank on a config that can't train). (3) Q2 replay gate pre-S09. (4) Power gate. (5) Rank/best_iteration decides "learned"; the ablation + 2×2 localise the cause — best_iteration is measured in Q1 (decisive), only annotates in Q2. (6) Calibration by ECE/reliability, not p_buy spread; CB-over-confident is first-class. (7) Decide on CIs over the draw distribution (median + max), never recency.

Fig 1 — Q1 truth table.

`best_iteration` vs ceiling	AUPRC-lift	verdict
`≤ 3` while ceiling ≫ (e.g. 156)	any	`CONFIG_DEGENERATE_LGB(cause=early_stopping)` — ablation refines
`≈ ceiling`	≤ chance	`CONFIG_DEGENERATE_LGB(cause=capacity/feature/label)`
`≈ ceiling`	> chance	`CONFIG_OK_LGB` → (Q2/S09 if s43 still compressed)
any	CI too wide	`INCONCLUSIVE_POWER`
unreadable / mis-aligned	—	`INCONCLUSIVE_TOOLING(reason)`

§2 Scope

In scope (Q1, pre-S09): a small controlled instrumented fit (prod best_params on a regime-matched trusted fold via the normal FE-split trainer); measured base_rate, real best_iteration vs ceiling, train/val curve, AUPRC-lift, equal-mass ECE/reliability/Brier on a clean holdout; the eval-metric ablation; label-alignment assertion; verdict + cause.
In scope (Q2, post-S09, if Q1=OK): the r2 read-only rank/calibration fork on the cleared s43 replay.

Out of scope: fixing the bug (follow-up, contingent on cause); issuing Q2 pre-S09 (replay-gated); a full HPO / multi-fold campaign; XGBoost; non-GBDT; any economic/tradability verdict (s43/EK).

Cost (honest). Q1 is no longer read-only — it is a small single-fold instrumented fit + one ablation refit. The read-only constraint was self-imposed to stay pre-S09/cheap; §0bis proved it cannot deliver that, so it is lifted for Q1 only. Operator-triggered via Airflow (no autonomous launch).

§3 Approach (Hamilton-native)

Q1 probes (trusted data):
- _probe_q1_fit: runs the controlled fit (prod best_params, regime-matched trusted fold, normal FE-split trainer), captures best_iteration, the train/val curve, and the holdout predictions. Asserts the trusted fold's label = ATR0.5_1.5_H4 and base_rate comparable to s43 (else the pathology may not fire, giving a false CONFIG_OK; see §5). Asserts label alignment.
- _probe_q1_rank: AUPRC-lift + precision@top-10/15 % on the holdout (median + max + CI over seeds/draws).
- _probe_q1_calib: equal-mass ECE + reliability + Brier (LGB and CB, so the CB-over-confident mirror is testable). p_buy max/std descriptive only.
- _probe_q1_ablation: refit(s) with the early-stopping eval metric set to AUCPR (and optionally F1 / precision@rate, committee rec #11) instead of the majority-dominated default; compares best_iteration to localise the early-stopping cause.
- _decide_q1: pure; Fig 1 → exactly one verdict.
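The rank metrics named for `_probe_q1_rank` can be sketched in pure Python as follows. The function names are hypothetical and the inputs are synthetic. The AUPRC here is the step-wise average-precision estimate with tied scores grouped into one threshold, which is one common convention; the plan does not fix which estimator is used.

```python
from typing import Sequence


def average_precision(y: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AUPRC: sum over distinct thresholds of (recall gain) x (precision)."""
    n_pos = sum(y)
    if n_pos == 0:
        raise ValueError("no positives: AUPRC is undefined")
    order = sorted(range(len(y)), key=lambda i: scores[i], reverse=True)
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(order):
        j = i
        # Consume the whole group of tied scores as a single threshold.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += y[order[j]]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / j)
        prev_recall = recall
        i = j
    return ap


def precision_at_top(y: Sequence[int], scores: Sequence[float], frac: float) -> float:
    """Precision among the top `frac` share of samples ranked by score."""
    k = max(1, round(frac * len(y)))
    order = sorted(range(len(y)), key=lambda i: scores[i], reverse=True)
    return sum(y[i] for i in order[:k]) / k


def as_lift(metric: float, y: Sequence[int]) -> float:
    """Divide a metric by the measured base rate of y."""
    return metric / (sum(y) / len(y))


# A constant score (a model that emits the prior) has AUPRC equal to the base rate.
y_flat, s_flat = [1, 0, 0, 0], [0.3, 0.3, 0.3, 0.3]
flat_lift = as_lift(average_precision(y_flat, s_flat), y_flat)

# A perfect ranking on a balanced toy set.
y_perfect, s_perfect = [1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]
perfect_lift = as_lift(average_precision(y_perfect, s_perfect), y_perfect)
print(flat_lift, perfect_lift)
```

The constant-score case is the one that matters for this Story: a model that emits the prior scores exactly chance-level lift, which is the signature the capacity branch looks for.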

Q2 probes (post-S09): the r2 read-only probes on the cleared replay predictions (_probe_tail_rank, _probe_calibration, _decide_q2).

Orchestration: Q1 = an operator-triggered Airflow run (small fit + ablation), emitting the verdict + curves. Q2 = a read-only recompute, post-S09, only if Q1=OK.

§4 Files

File	Action
`commun/finetune/diagnostic/s14_lgb_output_validity.py`	new — Q1 probes (`_probe_q1_fit`/`_rank`/`_calib`/`_ablation`/`_decide_q1`) + Q2 probes (`_decide_q2`)
`_data/s14_q1_fit_inpod.py`	new — operator-triggered controlled fit on a regime-matched trusted fold; captures best_iteration + curve + holdout preds → `_data/s14_q1.json`
`_data/s14_figures.py`	new — train/val curve, best_iteration-vs-ceiling, reliability diagrams (equal-mass), AUPRC-by-draw (lift); p_buy histogram (descriptive)
`tests/unit/finetune/diagnostic/test_s14_lgb_output_validity.py`	new — Q1 + Q2 truth-table exhaustiveness, no-crash, decide-on-CI, power gate, label-align + leakage guards, regime-match assertion, + a synthetic degenerate config (`early_stopping_rounds=1`) validates the `CONFIG_DEGENERATE_LGB` path (committee rec #12)
`documentation/stories/CVN-N001-EI-S14/{index.md,test_strategy.md}` · design/ · runbook/	new — 5 artefacts (ADR-0095)
Loki catalogue (doc)	extend — `s14_q1_config_verdict`, `s14_lgb_output_verdict`

§5 Risks

Risk	Impact	Mitigation
Trusted fold not regime-matched (new, decisive)	pathology may not fire → false `CONFIG_OK_LGB`	assert label = `ATR0.5_1.5_H4` and `base_rate` within ±20 % of the s43 base rate (committee rec #4, frozen tolerance) (`_probe_q1_fit`); pick a fold/universe in the same imbalance regime
Look-ahead in the trusted fold (committee rec #9)	a leaking holdout fakes a `CONFIG_OK`	`_probe_q1_fit` asserts the FE pipeline is fitted on train only (never refitted on the holdout) and the triple-barrier labels respect the embargo/horizon (no look-ahead) — else `INCONCLUSIVE_TOOLING(reason=leakage)`
Config selection (which draws)	testing one benign config misses a config-specific bug	test the prod `best_params` set incl. the recency draw + a sample across draws; report the `best_iteration` distribution
Q2 A6 inheritance	rank/calibration on the suspect replay	`INCONCLUSIVE_REPLAY_GATED` until S09; Q2 runs only if Q1=OK
Weak-but-calibrated mistaken for a bug	`CALIBRATION_LGB` wrongly indicts LGB	calibration by equal-mass ECE/reliability, not p_buy spread; CB-over-confident mirror first-class
Global AUC hides the tail	a tail deficit passes as "AUC≈CB"	keyed on AUPRC + prec@rate; global AUC secondary
Underpowered fork	wide CI makes the fork unfalsifiable	power gate → `INCONCLUSIVE_POWER`
Label mis-alignment	metrics on wrong `y`	`_probe_q1_fit` asserts alignment → `INCONCLUSIVE_TOOLING(reason=labels)`
best_iteration coherence	bi=1 + healthy rank, or bi=ceiling + chance	the 2×2 (H3) localises; a bi↔rank conflict flags a possible wrong-artefact read
Plan↔code drift	deciding value invented at run	every threshold (`BEST_ITER_FLOOR`, `RANK_MATERIAL`, `ECE_BAD`, `CHANCE_LIFT`) is in this plan
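The regime-match precondition from the first risk row could be checked along these lines. This is a sketch under assumptions: `regime_match` is a hypothetical helper, the ±20 % tolerance is read as relative to the s43 base rate (the plan does not say whether it is relative or absolute), and the numeric base rates in the example are placeholders, not measured values. How a failed check maps to a verdict is left to the caller.

```python
from typing import Tuple

EXPECTED_LABEL = "ATR0.5_1.5_H4"   # label asserted in this plan
BASE_RATE_TOLERANCE = 0.20         # frozen tolerance (committee rec #4)


def regime_match(
    label: str,
    fold_base_rate: float,
    s43_base_rate: float,
    tol: float = BASE_RATE_TOLERANCE,
) -> Tuple[bool, str]:
    """Return (ok, reason). ASSUMPTION: tolerance is relative to the s43 base rate."""
    if label != EXPECTED_LABEL:
        return False, "label"
    if s43_base_rate <= 0.0:
        return False, "s43_base_rate"
    relative_gap = abs(fold_base_rate - s43_base_rate) / s43_base_rate
    if relative_gap > tol:
        return False, "base_rate"
    return True, ""


# Placeholder numbers for illustration only.
print(regime_match("ATR0.5_1.5_H4", 0.27, 0.30))
print(regime_match("ATR0.5_1.5_H4", 0.20, 0.30))
```

Returning a reason alongside the boolean keeps the no-silent-fallback rule intact: the caller can report which precondition failed instead of quietly running Q1 on a fold where the pathology may not fire.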

§6 Success criteria + ops

base_rate measured + all metrics as lift; label alignment + regime-match asserted before any metric.
Q1: event=s14_q1_config_verdict with best_iteration, ceiling, AUPRC-lift, ECE, cause, ablation delta.
Q2 (post-S09, if Q1=OK): event=s14_lgb_output_verdict.
Exactly one verdict per fork (truth-table exhaustiveness tests green, incl. power/replay/tooling gates).
Reliability (equal-mass) + Brier + AUPRC (median+max+CI) reported; global AUC + p_buy histogram secondary.
No-crash → INCONCLUSIVE_TOOLING(reason) (ADR-25/31).

Routing. CONFIG_DEGENERATE_LGB(early_stopping) → fix the ES eval metric / rounds (WA1); s43 LGB arm contaminated at source, reopen; EK GBDT datum provisional; bug found pre-S09. …(capacity/feature/label) → deeper training/feature investigation. CONFIG_OK_LGB → the config trains; if s43 still compressed → Q2/S09 (fold-3/replay-specific); else the LGB-bug suspicion is laid to rest. Q2 routings as r2.

§7 Settled decisions (committee may revise)

Re-scope into Q1 (config, pre-S09) + Q2 (fold-3, S09-gated) after §0bis — settled (the read-only pre-S09 framing was falsified by prod logging metrics={}/artifacts=[]).
Q1 lifts read-only for a small operator-triggered controlled fit on trusted data — settled (it is the only way to a decisive pre-S09 answer; the canonical-producing FE-split path is replay-independent).
Rank = tail (AUPRC/prec@rate), not global AUC — settled. Calibration by equal-mass ECE/reliability — settled.
best_iteration measured in Q1 (decisive), annotates in Q2; rank decides "learned" — settled.
Regime-matched trusted fold (label + base_rate) is a hard precondition of Q1 — settled.
Threshold provenance flagged (RANK_MATERIAL/ECE_BAD conventional, to calibrate) — settled.

Committee plan_review (ADR-68)

Verdict: PASSED · OK — session 3713e74b · OP Meeting 272 · 5 experts, strong consensus, 0 blockers, 0 dissent. Endorsed the §0bis re-scope (Q1 config-validity pre-S09 / Q2 fold-3 S09-gated), the read-only-lift for Q1's controlled fit, the decision rule, and the replay-gating.

Recommendation disposition (12 non-blocking):

#	Recommendation	Disposition
4	Quantify "comparable base rate"	✅ folded — ±20 % of s43 base rate (§1/§5)
9	Confirm leakage prevention for trusted data	✅ folded — FE-fit-on-train-only + no-look-ahead assert (§3/§5)
11	Expand eval-metric ablation (multi-metric)	✅ folded — AUCPR + opt. F1/prec@rate (§3)
12	Validate degenerate verdict path (synthetic config)	✅ folded — `early_stopping_rounds=1` test (§4)
2	Calibrate `RANK_MATERIAL`/`ECE_BAD` economically	already flagged §1 (threshold provenance) — committee calibration
6	Specify Q1 `best_params` sampling strategy	already §5/§7 (operator arbitration)
1	Prioritise the observability ticket	follow-up — filed GH #1178
3,7	Q1 runbook + kill-switch + staged rollout	follow-up (runbook artefact, at merge)
5,8,10	Trusted-fold drift checks / drift monitoring / multi-fold coverage	follow-up (ops + strengthening)

Methodological invariants: honest application

Invariant	S14 r3
Pre-registration	✅ frozen §1 (both forks); editorialising removed
No-crash	✅ `INCONCLUSIVE_TOOLING(reason)`
Inconclusives first-class	✅ `INCONCLUSIVE_{POWER,REPLAY_GATED,TOOLING}`
Decide on significance	✅ AUPRC + ECE CI over draw distribution (median+max)
§0bis (verify the source the path uses)	✅ the pivot — verified prod logs nothing decisive; Q1 sources the canonical-producing FE path
Replay/fidelity gating	✅ Q2 S09-gated; Q1 replay-independent (trusted data)
Gate on unmeasured/unvalidated inputs	✅ thresholds frozen; base_rate measured; regime-match + label-align asserted

Appendix: plan_review checklist

Re-scope Q1 (pre-S09, config) / Q2 (S09-gated, fold-3) recorded; §0bis pivot stated.
Q1 input = canonical-producing FE-split trainer, regime-matched (label + base_rate) — asserted.
Q1 measures real best_iteration vs ceiling; eval-metric ablation localises early-stopping.
Calibration by equal-mass ECE/reliability; CB-over-confident mirror testable.
Rank = tail (AUPRC/prec@rate, lift); global AUC secondary; CI over draws (median+max).
Power gate; label-alignment guard; threshold provenance flagged.
Q2 replay-gated; runs only if Q1=CONFIG_OK.
A6 link withdrawn; observability finding spun out to a ticket.
Q1 is operator-triggered (not read-only, not autonomous); 5 artefacts planned.