CVN-N001-EI-S03 — Split / regime reconstruction (Block 3b) — Plan dossier

Story: CVN-N001-EI-S03 — Split / regime reconstruction (Block 3b)
OpenProject: wp#226 (New) · GitHub issue: #1058 · Epic: CVN-N001-EI (#1055)
Supersedes: CVN-N001-EE-S31 (no code survives; conceptual heritage only)
Plan dossier: documentation/reviews/2026-05-28-cvn-n001-ei-s03-split-regime-reconstruction-plan.md
Session-type for review: plan_review (ADR-68)
Author: dev session 2026-05-28

0. Why this Story exists (provenance)

CVN-N001-EI is the autonomous diagnostic program spun out of the multi-model best_iter=1 study (committee experiment_review 24745ff4 PASSED; base document documentation/reviews/2026-05-24-cvn-n001-ee-best-iter1-multimodel-study.md). Its §5.0 provisional ranking puts validation / regime instability (C-d) as the most probable primary driver of the weak signal — ahead of LightGBM leaf-wise capacity (a modifier). The base document's §12 decision routing makes Block 3b (this Story) the home for testing C-d, and pins the action for each outcome in advance, so no re-litigation is needed when results land.

Two facts fix S03's scope precisely:

The decision routing for Block 3b (base doc §12):

"Block 3 split materially moves best_iter + AUC ⇒ the subject was never LightGBM — it is validation design (C-d / C-c) ⇒ ship the regime-aware / purged-walk-forward split (CVN-N001-EI-S03); re-run the model comparison on it."
S02 (Block 3a) closed with an unresolved reserve assigned to S03. S02's executed audit (Loki, 2026-05-25 20:25–20:44) returned 5/5 LEAK_CONFIRMED → SYSTEMATIC_LEAK → mission HALT on defi_top5 — but the operator-logged transition reserve (OP wp#225) records that every leakage/label probe was clean (fl_val_auc 0.649–0.678 < 0.90; fl_corr_drift 0.097–0.126 < 0.15; ls_psi 0.003–0.033 < 0.25; te_entropy_bits ≈ 0.66) and the verdict was driven solely by the temporal-autocorrelation probe (ta_acf 0.687–0.721 vs cut 0.20) under degraded proxies (acf_only_embargo, time_segments_regime — i.e. no real purged/embargo split was available). That was judged a likely probe-calibration false-positive, and S40_ACF_CUT recalibration was explicitly deferred to S03 #1058 with a real purged/embargo split.

So S03 inherits a gate (resolve the S02 ACF reserve) and the ablation (Block 3b proper). Both are served by the same instrument — a real purged/embargo walk-forward split — which is why they belong in one Story.

1. Objective & pre-committed decisions

In plain terms: we test whether the weak signal comes from how we cut train/test, and we first clear a temporal-leakage flag S02 left open. If a better split recovers materially better out-of-sample discrimination, the problem was validation design, not the model. Scope up front: every verdict here is specific to the defi_top5 control group — it informs the program but is not a program-wide conclusion (generalisation is a separate Story, §6).

Two load-bearing jobs:

Job A — Resolve the S02 temporal-autocorrelation reserve (the gate; runs first)

Re-measure temporal-autocorrelation leakage under a real purged + embargo walk-forward split (not the degraded acf_only_embargo proxy S02 ran under). The question: is ta_acf ≈ 0.70 a genuine future-information path, or the expected residual autocorrelation of overlapping triple-barrier labels that a correct embargo neutralises?

Pre-registered cut, no cut-shopping (round-3): the recalibrated S40_ACF_CUT is the upper 95th percentile of an empirical null — the ACF distribution under a label-permutation / time-shuffle of the post-embargo series (any structure that survives shuffling is not temporal-overlap signal). The method is fixed before the run; the cut is never chosen after seeing the observed ACF.
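
The pre-registered cut described above can be sketched as follows. This is a minimal illustration, not the S41 implementation: it assumes the probe statistic is the lag-1 autocorrelation (the dossier does not fix the lag), and the names `lag1_acf` and `null_acf_cut` are hypothetical. The null is built by shuffling the series, and the cut is the 95th percentile of the shuffled ACF values.

```python
import numpy as np


def lag1_acf(x: np.ndarray) -> float:
    """Lag-1 autocorrelation of a 1-D series (assumed probe statistic)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    if denom == 0.0:
        return 0.0  # constant series: no autocorrelation to measure
    return float(np.dot(x[:-1], x[1:]) / denom)


def null_acf_cut(series: np.ndarray, n_perm: int = 2000,
                 q: float = 95.0, seed: int = 1337) -> float:
    """Upper q-th percentile of the lag-1 ACF under time-shuffling."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    null = np.array([lag1_acf(rng.permutation(series)) for _ in range(n_perm)])
    return float(np.percentile(null, q))


# Illustrative input only: a moving average of noise mimics overlapping labels.
demo_rng = np.random.default_rng(0)
demo_series = np.convolve(demo_rng.normal(size=2000), np.ones(10) / 10, mode="valid")
demo_cut = null_acf_cut(demo_series, n_perm=500)
demo_observed = lag1_acf(demo_series)
```

Because the method (statistic, shuffle, percentile, seed) is fixed before the observed ACF is computed, the cut cannot be tuned after the fact, which is the point of the pre-registration.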

| Job-A outcome | Conclusion | Pre-committed action (base doc §12) |
| --- | --- | --- |
| ACF stays high after correct purge+embargo (genuine temporal leak) | `LEAK_CONFIRMED` is real; AUC≈0.64 partly spurious | HALT CVN-N001-EE. Fix `src/features/`/alignment; re-baseline before any modeling. Recalibrate `S40_ACF_CUT` only if the cut itself is wrong, not to silence a real leak. |
| ACF collapses below a recalibrated cut once embargo ≥ label-horizon (false positive) | S02 ACF was label-overlap autocorrelation: a probe-calibration artefact neutralised by correct embargo, not a temporal leak | Clear the reserve, scoped (round-2). `ACF_EMBARGO_FALSE_POSITIVE` certifies no label-overlap leak only. The feature-leak risk rests separately on S02's clean `fl_` probes (`fl_val_auc` < 0.90, `fl_corr_drift` < 0.15), which S03 re-affirms but does not re-derive from the ACF. Recalibrate `S40_ACF_CUT` (Console, ADR-59); record S02 → `INTEGRITY_CLEAN` (label-overlap + feature-leak, by distinct evidence). Proceed to Job B. |
| ACF drops but does not collapse (e.g. 0.70→0.35: below start, above any horizon-justified cut) | ambiguous: partial embargo sensitivity, neither clean nor a proven leak | Keep the reserve; escalate (`ACF_INDETERMINATE`). Do not auto-clear and do not auto-HALT. The full embargo→ACF curve is reported for committee judgment. Job B may run and report descriptive split deltas, but NO program-routing action may fire from Job B until the ACF reserve is resolved or explicitly waived (round-3); any Job-B conclusion in this state is tagged `ACF_UNRESOLVED_CONTEXT`, so a `SPLIT_MOVES_VALIDATION` cannot be acted on while a HALT-level integrity concern is open. An escalation playbook (manual-review criteria + paths) is documented in the results dossier (committee r4). |

Scope of a Job-A clear (round-2). Embargo ≥ H reduces label-overlap autocorrelation by construction, whether or not a feature-leak coexists — so an ACF collapse exonerates label-overlap only. A feature that peeks at the future may never surface in post-embargo ACF; that risk is carried by the fl_* probes (clean in S02), not the ACF. ACF_EMBARGO_FALSE_POSITIVE is therefore a two-part certificate (ACF explained and fl_* clean), never a blanket no-leak.
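
The two-part certificate can be expressed as a single predicate. The sketch below is illustrative only: the `JobAEvidence` container and `can_certify_false_positive` are hypothetical names, the `fl_*` thresholds are the S02 values quoted in §0, and the recalibrated ACF cut is passed in rather than computed.

```python
from dataclasses import dataclass

# Thresholds quoted from the S02 probes (§0); everything else here is hypothetical.
FL_VAL_AUC_MAX = 0.90
FL_CORR_DRIFT_MAX = 0.15


@dataclass(frozen=True)
class JobAEvidence:
    acf_post_embargo: float    # ACF measured under the real purged/embargo split
    acf_cut: float             # recalibrated, pre-registered cut
    embargo_bars: int
    label_horizon_bars: int    # H
    fl_val_auc: float
    fl_corr_drift: float


def can_certify_false_positive(ev: JobAEvidence) -> bool:
    """True only if BOTH parts hold: ACF explained AND fl_* probes clean."""
    acf_explained = (ev.embargo_bars >= ev.label_horizon_bars
                     and ev.acf_post_embargo < ev.acf_cut)
    fl_clean = (ev.fl_val_auc < FL_VAL_AUC_MAX
                and ev.fl_corr_drift < FL_CORR_DRIFT_MAX)
    return acf_explained and fl_clean
```

A collapsed ACF with a dirty `fl_*` probe returns `False`, which mirrors the rule that an ACF collapse alone never yields a blanket no-leak statement.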

Job B — Split ablation (Block 3b proper)

Two reference deltas, not one (resolves the round-1 baseline ambiguity — the central round-2 fix). The current production split is naive (no purge/embargo); the honest leak-free reference is purged-WF (family 1). S03 measures both directions explicitly, because two opposite stories are alive and a one-sided bar encodes only one:

Δ_leak = AUC(purged-WF) − AUC(naive current split). The leak-detection delta. A material negative Δ_leak means the production AUC≈0.64 was partly leak-inflated (a clean split removes signal that came from leakage) — true OOS discrimination is weaker than reported. This is the leak story, and it is exactly what Job A investigates — so it must have its own verdict row.
Δ_cd(family) = AUC(regime-aware family) − AUC(purged-WF honest baseline), families 2–6. The C-d transfer delta. A material positive Δ_cd means a better split recovers transfer the naive/purged split was destroying — the C-d story.

Bilateral verdict table (AUC-gated, FDR-corrected, power-gated; best_iter corroborating per below):

| ΔAUC outcome | Verdict | Conclusion | Pre-committed action |
| --- | --- | --- | --- |
| Δ_cd(family) CI excludes 0, point ≥ +0.02 | `SPLIT_MOVES_VALIDATION` | validation design (C-d/C-c) is the driver, not LightGBM | Trigger a generalisation Story (confirm beyond defi_top5); on confirmation the §12 program action "ship the regime-aware split" fires (scope reconciliation, §6). |
| Δ_leak CI excludes 0, point ≤ −0.02 (clean split drops AUC) | `BASELINE_LEAK_INFLATED` | the naive production AUC was partly leakage; true OOS discrimination is weaker | Feed Job A: corroborates a real leak → leak investigation / HALT (fix `src/features/`, re-baseline) before any modeling conclusion. |
| all deltas CI include 0, and power gate met | `SPLIT_STABLE` | validation design is not the primary driver | Close the validation-design line → route to S04 (capacity) / S06 (features/target). |
| power gate not met | `INCONCLUSIVE_UNDERPOWERED` | the instrument was deaf, not the effect absent | §1 power gate + the a-priori ladder (§3); do not close the C-d line. |
| Δ_cd(family) CI excludes 0, point ≤ −0.02 (a regime split is worse than purged-WF) | sub-finding (not a clean verdict) | that family hurts transfer (unexpected) | report + flag; never read as C-d support. |

best_iter corroborating, never a veto (round-1 fix retained): the C-d question is out-of-sample transfer = AUC, not the LGB-specific best_iter. A split that moves best_iter without moving AUC (CI includes 0) is a learning-dynamics result flagged for S04, not SPLIT_MOVES_VALIDATION. The arbitrary best_iter ≥ 5 gate is gone.

Power gate (round-1 fix retained): SPLIT_STABLE only at ≥ 0.80 power to detect |ΔAUC| = 0.02 at the observed clustered n; else INCONCLUSIVE_UNDERPOWERED — evidence of absence (powered) vs absence of evidence (deaf). A-priori power for regime-A→B is computed before the run, with a pre-committed ladder if it falls short (§3).
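
One way to make the power gate concrete is a normal-approximation power calculation. The dossier does not specify the power method, so this is an assumption: the function below treats ΔAUC as approximately normal with a standard error `se` supplied by the caller (for example, estimated from the clustered bootstrap), and `two_sided_power` is a hypothetical name.

```python
from statistics import NormalDist


def two_sided_power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test to detect `effect` given standard error `se`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(effect) / se
    # Probability that the test statistic lands in either rejection tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


def passes_power_gate(se: float, effect: float = 0.02, floor: float = 0.80) -> bool:
    """SPLIT_STABLE is only admissible when this returns True (§1 power gate)."""
    return two_sided_power(effect, se) >= floor
```

Under this approximation the gate is met only when the standard error of ΔAUC is roughly 0.007 or smaller, since 0.80 power at α = 0.05 needs an effect of about 2.8 standard errors.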

Practical materiality (round-3). ΔAUC = 0.02 is the pre-registered practical materiality threshold for routing decisions, not a claim that smaller effects are scientifically zero. A statistically significant effect below 0.02 is reported as a sub-material finding and does not by itself trigger program routing. (Rationale: 0.02 ≈ one CI half-width of the study's ±0.02 cross-model AUC stability band — a routing decision needs an effect at least as large as the noise floor it must clear; committee confirms — Q1.)

| ΔAUC outcome | Treatment |
| --- | --- |
| CI excludes 0 and \|Δ\| ≥ 0.02 | routing-level effect (the verdict-table rows above) |
| CI excludes 0 and 0 < \|Δ\| < 0.02 | sub-material: report only, no routing |
| CI includes 0 and power gate met | `SPLIT_STABLE` |
| CI includes 0 and power gate not met | `INCONCLUSIVE_UNDERPOWERED` |
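
The four treatment rows reduce to a small decision function. The sketch below is illustrative: `classify_delta` is a hypothetical helper, and the labels `ROUTING_LEVEL` and `SUB_MATERIAL` are placeholders for the first two rows, not members of the closed verdict catalogue.

```python
def classify_delta(point: float, ci_lo: float, ci_hi: float, power: float,
                   materiality: float = 0.02, power_floor: float = 0.80) -> str:
    """Map one delta (point estimate, 95% CI, power) to a treatment row."""
    excludes_zero = ci_lo > 0.0 or ci_hi < 0.0
    if excludes_zero:
        # Significant: routing fires only at or above the materiality bar.
        return "ROUTING_LEVEL" if abs(point) >= materiality else "SUB_MATERIAL"
    # Not significant: the power gate separates "stable" from "deaf instrument".
    return "SPLIT_STABLE" if power >= power_floor else "INCONCLUSIVE_UNDERPOWERED"
```

The sign of a routing-level delta and the reference it was measured against (Δ_leak or Δ_cd) then select the verdict row in the bilateral table.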

ADR-2 / ADR-25: no production code change ships before Blocks 2 (S04) and 3 (S03) both pass. S03 therefore measures and decides; it does not promote a new production split on its own (§3).

2. Scope — the six split families

Tested on the canonical control group crypto_group=defi_top5 (project policy: control group for all cross-fold validation), primary model LightGBM (the best_iter=1 symptom is LGB-specific), with XGBoost + CatBoost retained as descriptive cross-model context (does the effect transfer? — not gating, see Q5).

| # | Split family | What it stresses | Instrument |
| --- | --- | --- | --- |
| 0 | Naive current split (production reference) | the Δ_leak anchor: measured, not a candidate | current fold generation (no purge/embargo), present so Δ_leak (§1) is observable |
| 1 | Purged walk-forward CV (honest baseline) | label-overlap leakage, the ADR-14 reference; the baseline for Δ_cd | `PurgedKFold` (`src/training/cv/purged_kfold.py`) wired into fold generation |
| 2 | Strict temporal | chronological train→test, no shuffle, embargo = label horizon | walk-forward folds with embargo ≥ H |
| 3 | Market period | bull / bear / range epoch transfer | transparent trend-sign epochs, thresholds train-fold-only (§4); 6-code `regime_detector` cross-check (Q3) |
| 4 | Volatility bucket | calm↔volatile transfer | realised-vol quantile buckets, edges fit train-fold-only (§4) |
| 5 | Crypto | cross-asset transfer | leave-one-crypto-out within `defi_top5` |
| 6 | Train-regime-A → test-regime-B | the direct C-d stressor | absolute pre-registered vol bands + trend sign (stable regime identity, not fold-relative quantiles; §4), over `compatible_train_builder` heritage |

Baseline discipline (round-2): family 0 (naive) and family 1 (purged-WF) are the two references. Δ_leak = AUC(1) − AUC(0) is the leak-detection delta (bilateral); Δ_cd = AUC(family 2–6) − AUC(1) is the C-d transfer delta. Family 1 is therefore both the honest baseline for the C-d comparison and, against family 0, the leak measurement (§1).
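
The two reference deltas follow directly from per-family AUCs. A minimal sketch, assuming a mapping from family number to AUC has already been produced; `split_deltas` and the key names are hypothetical.

```python
from typing import Dict


def split_deltas(auc_by_family: Dict[int, float]) -> Dict[str, float]:
    """Compute Δ_leak = AUC(1) − AUC(0) and Δ_cd(f) = AUC(f) − AUC(1), f in 2..6."""
    naive, purged = auc_by_family[0], auc_by_family[1]
    deltas = {"delta_leak": purged - naive}
    for family in range(2, 7):
        if family in auc_by_family:  # secondary families may be absent
            deltas[f"delta_cd_{family}"] = auc_by_family[family] - purged
    return deltas
```

Family 1 appears on both sides: it is subtracted from families 2 to 6 as the honest baseline, and family 0 is subtracted from it for the leak measurement.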

Verdict-bearing core vs secondary (round-3, protects statistical power). The matrix (families × cryptos × folds, FDR + clustered bootstrap) is large; not every family can carry a verdict without going underpowered. So:

- Core (verdict-bearing by default): family 0 (naive), family 1 (purged-WF), family 2 (strict-temporal), family 6 (regime-A→B). These carry Job A and the highest-prior C-d test.
- Secondary (descriptive unless powered): families 3 (market-period), 4 (volatility-bucket), 5 (crypto-LOO). They become verdict-bearing only if their own a-priori power gate passes independently; otherwise they are reported descriptively and never trigger routing.

Metrics are reported decoupled (base doc Block 5 / §8): discrimination (AUC, PR-AUC, best_iter), calibration (Brier, ECE) and decision policy (f1_buy, rate_buy — descriptive only, never the verdict). The verdict keys on discrimination (AUC), with best_iter corroborating (§1). PR-AUC guard (round-3): PR-AUC is not gating, but because BUY labels are rare/imbalanced (AUC can flatter), any AUC-positive result whose PR-AUC deteriorates is flagged DISCRIMINATION_CONFLICT — the AUC gain is not believed until reconciled.

3. Technical approach

Two-layer diagnostic (Airflow + Hamilton), mirroring S40 (S02) and S27 — the canonical pattern in documentation/process/DIAGNOSTIC_HAMILTON_PATTERN.md. Airflow owns orchestration + I/O + per-cell fan-out (.expand → one pod per cell, isolation + retry); Hamilton owns pure compute (split construction → per-split fit → metrics → ΔAUC/Δbest_iter → cell verdict → group synthesis). New diagnostic id: S41.

Reuse, don't rebuild:

- PurgedKFold already implements purge + embargo (env CVN_PURGE_BARS / CVN_EMBARGO_BARS, ADR-14). Job A is a configuration + measurement of it, not a reimplementation.
- S07 DIAGNOSTIC_CAPTURE pinned-fold cache (just shipped, CVN-N001-EI-S07) makes the multi-split re-audit cheap: the captured fold is pinned once (cold), then every split family re-audits warm (use_pin=True, < 2 min) instead of replaying the ~22 min cold Phase A. S03 is the first heavy consumer S07 was built for.
- regime_detector / regime_tagger / compatible_train_builder supply the regime/market-period/volatility labels (splits 3/4/6).
- Training harness (src/training/harness/, ADR-89): each split's per-fold fit reuses train_one + canonical eval_metrics.evaluate_split_binary (single source of truth for AUC/best_iter/F1).

Statistics: clustered-bootstrap 95 % CIs (block bootstrap over folds, the §11 UQ method) on ΔAUC and Δbest_iter; BH/FDR correction across the split families × cryptos (mirrors S28); a-priori power analysis (target |ΔAUC| = 0.02, α = 0.05, clustered n) sizing each family before the run. A family below 0.80 power yields INCONCLUSIVE_UNDERPOWERED, never SPLIT_STABLE (§1 power gate).

Pre-registered verdict catalogue (closed set, ADR-33): SPLIT_MOVES_VALIDATION, SPLIT_STABLE, BASELINE_LEAK_INFLATED, INCONCLUSIVE_UNDERPOWERED, ACF_GENUINE_LEAK, ACF_EMBARGO_FALSE_POSITIVE, ACF_INDETERMINATE, INCONCLUSIVE_TOOLING, plus two non-terminal flags that attach to a verdict: DISCRIMINATION_CONFLICT (AUC up / PR-AUC down, §2) and ACF_UNRESOLVED_CONTEXT (Job B ran while Job A indeterminate, §1). No silent fallback (ADR-25); HALT-level outcomes are emitted loudly as severity=error events, not surfaced as a Python crash (feedback_no_python_crash_visible).
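
The two statistical building blocks named above can be sketched with NumPy. Both functions are illustrative and carry assumptions: `fold_bootstrap_ci` resamples whole folds with replacement and takes a percentile interval on the mean per-fold ΔAUC (the exact §11 UQ method is not reproduced here), and `benjamini_hochberg` expects one p-value per family × crypto cell, however those p-values are derived.

```python
import numpy as np


def fold_bootstrap_ci(delta_by_fold, n_boot: int = 5000,
                      level: float = 0.95, seed: int = 1337):
    """Percentile CI for mean ΔAUC, resampling folds (clusters) with replacement."""
    deltas = np.asarray(delta_by_fold, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(deltas.mean()), float(lo), float(hi)


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean rejection mask under the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))  # largest rank meeting its threshold
        reject[order[: k + 1]] = True           # reject every hypothesis up to rank k
    return reject
```

Resampling at the fold level keeps correlated rows inside a fold together, which is what makes the interval "clustered" rather than row-level.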

A-priori-underpower ladder (round-2 — pre-committed action for the most likely failure: a deaf instrument before the run). If the a-priori estimate shows regime-A→B (worst-case n — embargo across a regime boundary) cannot reach 0.80 power at |ΔAUC| = 0.02 with the available data, act in this fixed order, no run-and-hope: (1) pool the five cryptos into a single group-level regime-A→B test (trading per-crypto FDR granularity for n) and recompute power; (2) if still < 0.80, regime-A→B is pre-declared INCONCLUSIVE_UNDERPOWERED and the C-d-via-regime-A→B question is routed to a data-collection follow-on (longer history / more cryptos). The target effect size is never inflated to manufacture power. This makes the a-priori-underpower branch as pre-committed as the post-hoc verdicts. The a-priori estimate is emitted as event=s41_power_estimate family=… power=… before any fit (committee r4), so the underpower branch is auditable in Loki up front, not reconstructed after.

Gating of the contingent production split (ADR-56): the regime-aware / purged-WF split is implemented as an off-by-default, CVN_*-gated capability (e.g. CVN_SPLIT_MODE, ftf_config / Console, ADR-59) so S03 can A/B it against the current split for measurement only. It is not promoted to the production default in S03 (ADR-2 — awaits S04 pass + a closure decision). Every gated factor carries a guardrail + integration test (ADR-58).

CVN_SPLIT_MODE is experimental, not production-ready (round-3). In S03 it is an experimental diagnostic factor only — not a production-ready split switch. Its presence in ftf_config must not be read as "ready to flip"; production promotion requires a separate closure decision after S03 and S04. The Console entry will be labelled experimental accordingly.

4. Files to create / modify

Create

- src/commun/finetune/diagnostic/s41_split_ablation.py: probe helpers + verdict logic (Jobs A & B), closed verdict catalogue.
- src/commun/finetune/diagnostic/hamilton/s41_nodes.py: pure Hamilton compute nodes (split fits, ΔAUC/Δbest_iter, cell verdict).
- src/commun/finetune/diagnostic/hamilton/s41_io.py: I/O preamble (load_cell_inputs → inputs dict / verdict-on-failure), reusing the S07 pinned-fold path.
- src/training/cv/regime_split.py: split constructors for families 3/4/5/6 (market-period, volatility, crypto LOO, regime-A→B), building on regime_detector + compatible_train_builder. Each returns purge/embargo-correct (train_idx, test_idx) honouring ADR-14. Threshold-fitting policy (round-2), resolving the leak-safety ↔ regime-identity tension:
  - Partition splits (3 market-period, 4 volatility-bucket): bucket edges fit train-fold-only (never full-sample), hence leak-safe. A period's bucket may legitimately differ across folds; that is fine, because they are CV partitions, not stable-identity claims.
  - Family 6 (regime-A→B): requires a stable regime identity to define "A" vs "B". Fold-relative quantiles would label the same calendar period "A" in one fold and "B" in another and dissolve the construct. So family 6 uses absolute pre-registered vol bands + trend sign (frozen constants from domain knowledge / a reference period; no fit on data at all → leak-safe and identity-stable).

  Both policies are asserted in test_regime_split.py (no train/test index overlap; no full-sample statistic in any threshold).

Family-6 regime identity v1 (round-3 — structure fixed here; numeric values are committee-tunable but pre-registered before the run, never after seeing results):

- volatility metric : realized volatility = std(log-returns) over N bars  (v1: N = one trading day in the PTE timeframe)
- vol bands         : LOW / MID / HIGH via ABSOLUTE pre-registered thresholds
                      (frozen as the tertiles of realized-vol over a disjoint reference window,
                       computed ONCE and pinned as constants — never recomputed on evaluated folds)
- trend             : sign of rolling return over M bars  (v1: M = 7 days)
- neutral band      : |rolling return| < eps → TREND_NEUTRAL  (avoids sign churn on noise)
- regime code       : VOL_BAND × TREND_SIGN  (e.g. HIGH_VOL × DOWN)
- min samples       : >= 500 rows per train regime AND per test regime
- if min samples fail (a band empty / too thin) : family-6 verdict = INCONCLUSIVE_UNDERPOWERED
                      (feeds the §3 a-priori-underpower ladder)
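
The v1 structure above can be sketched as a pure function from recent log-returns to a regime code. The numeric constants below are placeholders invented for illustration (the real thresholds are to be frozen from the disjoint reference window), and `vol_band`, `trend_sign` and `regime_code` are hypothetical names, not the contents of regime_split.py.

```python
import numpy as np

# PLACEHOLDER constants for illustration only; the pre-registered values
# come from the disjoint reference window and are pinned before the run.
VOL_LOW_MAX = 0.01
VOL_MID_MAX = 0.03
TREND_EPS = 0.005


def vol_band(realized_vol: float) -> str:
    """Absolute (not fold-relative) volatility band."""
    if realized_vol < VOL_LOW_MAX:
        return "LOW_VOL"
    if realized_vol < VOL_MID_MAX:
        return "MID_VOL"
    return "HIGH_VOL"


def trend_sign(rolling_return: float) -> str:
    """Trend sign with a neutral band to avoid churn on noise."""
    if abs(rolling_return) < TREND_EPS:
        return "TREND_NEUTRAL"
    return "UP" if rolling_return > 0 else "DOWN"


def regime_code(log_returns, n_vol: int, m_trend: int) -> str:
    """VOL_BAND × TREND_SIGN from the last n_vol / m_trend bars of log-returns."""
    r = np.asarray(log_returns, dtype=float)
    realized_vol = float(np.std(r[-n_vol:]))    # std of log-returns over N bars
    rolling_return = float(np.sum(r[-m_trend:]))  # log-return over M bars
    return f"{vol_band(realized_vol)} × {trend_sign(rolling_return)}"
```

Since no threshold is fitted on evaluated data, the same calendar window always maps to the same code, which is the identity-stability property family 6 depends on.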

The reference window for the frozen vol thresholds is disjoint from the evaluated folds (leak-safe), and the bands are constants in regime_split.py, not derived from test data.

Runtime guard (committee r4): beyond the unit-test assertions, regime_split.py checks at execution time and emits event=s41_embargo_violation severity=error + fails the cell cleanly (no crash) if any train/test overlap or embargo < H slips through, so it is caught in the pod, not just in CI.

- dags/dag_diagnostic__s41.py (dag_id diagnostic__s41): two-layer DAG with crypto_group fan-out, params (crypto_group, fold_id, split_families, use_pin, force_recapture, skip_phase_a, artifact_dir), ADR-90 PG-sourced config, ADR-92 build provenance (dag_doc_md / make_tags / dag_loaded_event), on_failure_callback halting downstream cells on a cell failure (committee r4, containment).
- tests/unit/test_s41_split_ablation.py, tests/unit/test_regime_split.py: verdict logic, split correctness (no purge/embargo violation), determinism.
- documentation/missions/CVN-N001-EI/: results-dossier scaffold (verdict + CIs + PDF per ADR-79/80 when the run lands).

Modify (minimal touch)

- src/commun/finetune/ablation_runner.py (_generate_folds, ~L806): make the split strategy pluggable (strategy object/callable) without changing the default behaviour. Default path unchanged (regression-safe).
- pyproject.toml markers / mkdocs nav for the new diagnostic page (docs-live gate).
- documentation/stories/CVN-N001-EI-S03/mlops_readiness.md: ADR-70 (touches src/training/ + diagnostic harness).
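
A pluggable split strategy with an unchanged default might look like the sketch below. Everything here is hypothetical: the real `_generate_folds` signature is not shown in this dossier, `expanding_split` is only a stand-in for the current naive behaviour, and `walk_forward_with_embargo` is an illustrative alternative, not `PurgedKFold`.

```python
from typing import Callable, Iterator, List, Tuple

import numpy as np

Fold = Tuple[np.ndarray, np.ndarray]                  # (train_idx, test_idx)
SplitStrategy = Callable[[int, int], Iterator[Fold]]  # (n_samples, n_folds) -> folds


def _bounds(n_samples: int, n_folds: int) -> np.ndarray:
    return np.linspace(0, n_samples, n_folds + 2, dtype=int)


def expanding_split(n_samples: int, n_folds: int) -> Iterator[Fold]:
    """Stand-in for the naive default: expanding train window, no purge/embargo."""
    b = _bounds(n_samples, n_folds)
    for i in range(1, n_folds + 1):
        yield np.arange(0, b[i]), np.arange(b[i], b[i + 1])


def walk_forward_with_embargo(embargo: int) -> SplitStrategy:
    """Same folds, but the last `embargo` bars before each test block are dropped."""
    def strategy(n_samples: int, n_folds: int) -> Iterator[Fold]:
        b = _bounds(n_samples, n_folds)
        for i in range(1, n_folds + 1):
            train_end = max(int(b[i]) - embargo, 0)
            yield np.arange(0, train_end), np.arange(b[i], b[i + 1])
    return strategy


def generate_folds(n_samples: int, n_folds: int,
                   strategy: SplitStrategy = expanding_split) -> List[Fold]:
    """Callers that pass no strategy get the default behaviour unchanged."""
    return list(strategy(n_samples, n_folds))
```

Passing the strategy as a plain callable keeps the default call path identical for existing callers, which is what makes the change regression-safe.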

Config (Console only, ADR-59) — not git: S40_ACF_CUT recalibration (Job A outcome), CVN_SPLIT_MODE factor keys.

5. Risks & threats to validity

Regime circularity — regime_detector is itself a heuristic model; splitting by its output risks defining regimes to fit the story. Mitigation: prefer transparent, pre-registered regime definitions (realised-vol quantiles, trend sign over a fixed window) for splits 3/4; document thresholds; treat the 6-code detector as a secondary cross-check (Q3); fit partition thresholds (3/4) train-fold-only and define family-6 regime identity from absolute pre-registered bands (§4) — the first is leak-safe, the second is leak-safe and identity-stable. Committee r4: a sensitivity cross-check of the pre-registered bands vs the 6-code regime_detector is reported to quantify residual circularity bias (divergent regime assignments are flagged, not silently trusted).
Underpowered negative: the highest-cost error. With 6 families × 5 cryptos × folds, then FDR on top, the real danger is not a false positive but insufficient power: a real, moderate C-d effect on already-thin per-regime folds may never reach significance, and everything defaults to SPLIT_STABLE, which triggers a heavy action (close the validation-design line, route to S04/S06). Since C-d is the leading hypothesis, a false "stable" is directionally expensive (evidence of absence vs absence of evidence). Mitigation: the power gate (§1), under which SPLIT_STABLE requires ≥ 0.80 power to detect |ΔAUC| = 0.02 at the observed clustered n, else INCONCLUSIVE_UNDERPOWERED; an a-priori power estimate for regime-A→B before the run; MIN_SAMPLES guard (500) + clustered-bootstrap CIs.
Multiple comparisons (6 families × 5 cryptos × folds). Mitigation: BH/FDR correction (mirror S28); report family-level and group-level synthesis.
Embargo must hold in every new split, especially regime-A→B (embargo across a regime boundary) and crypto LOO. Mitigation: assert no train/test index overlap and embargo ≥ label horizon in regime_split.py + unit tests; fail-fast (ADR-25).
Job A false-negative risk (clearing a real leak as a "calibration artefact"). Mitigation: the pre-committed Job-A table requires the ACF to collapse below a cut justified by the label horizon (embargo ≥ H), not merely below an ad-hoc number; if ambiguous → keep the reserve / escalate, do not auto-clear.
Reproducibility — pin folds via S07 cache + input_data_sha; canonical seed 1337; parity vs the S02 baseline capture (same defi_top5, fold 3).
No production code ships here (ADR-2) — the gated split is measurement-only; promotion is a separate, post-S04 decision.
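
The embargo risk listed above lends itself to a simple index-level check. The sketch below is a simplified, symmetric version: it assumes integer bar positions and flags any train index closer than `horizon` bars to any test index, which covers both direct overlap and a too-short gap. `embargo_violations` is a hypothetical helper, not the guard in regime_split.py.

```python
import numpy as np


def embargo_violations(train_idx, test_idx, horizon: int) -> np.ndarray:
    """Return train indices within fewer than `horizon` bars of any test index."""
    train = np.asarray(train_idx, dtype=int)
    test = np.sort(np.asarray(test_idx, dtype=int))
    if train.size == 0 or test.size == 0:
        return np.array([], dtype=int)
    # Nearest test neighbours on each side of every train index.
    pos = np.searchsorted(test, train)
    left = test[np.clip(pos - 1, 0, test.size - 1)]
    right = test[np.clip(pos, 0, test.size - 1)]
    gap = np.minimum(np.abs(train - left), np.abs(train - right))
    return train[gap < horizon]  # gap == 0 means direct train/test overlap
```

An empty result means the split is clean under this check; a non-empty result is what a runtime guard would report as a structured error before failing the cell.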

6. Success criteria

Job A resolved: a pre-registered verdict (ACF_GENUINE_LEAK → HALT, or ACF_EMBARGO_FALSE_POSITIVE → reserve cleared + recalibrated S40_ACF_CUT) for defi_top5, with the ACF measured under a real purge+embargo split and the embargo-vs-horizon justification shown.
Job B verdict: per-crypto + group synthesis (SPLIT_MOVES_VALIDATION / BASELINE_LEAK_INFLATED / SPLIT_STABLE / INCONCLUSIVE_UNDERPOWERED / INCONCLUSIVE_TOOLING) across the 6 families, AUC-gated with best_iter corroborating (§1), clustered-bootstrap CIs, FDR-corrected. A SPLIT_STABLE is valid only if the power gate (≥ 0.80 power for |ΔAUC| = 0.02) is met, else INCONCLUSIVE_UNDERPOWERED; the a-priori power estimate for regime-A→B is reported.
Scope stated explicitly: the verdict is specific to defi_top5 (control group, S02 parity), not a program-wide statement on validation design. Generalisation beyond defi_top5 is out of scope and called out as such in the results dossier.
Action-scope reconciliation (round-2): the §12 routing the verdict feeds is program-level, but the verdict is control-group. A positive S03 (SPLIT_MOVES_VALIDATION on defi_top5) does not directly fire "ship the regime-aware split" program-wide — it triggers a generalisation Story (confirm the recovery beyond defi_top5) which, on confirmation, fires the §12 action. BASELINE_LEAK_INFLATED (more likely systematic) escalates to a program-level leak investigation but still confirms before any program-wide re-baseline.
Run executed warm via the S07 pinned fold (use_pin=True), demonstrating the < 2 min re-audit per split family.
All probes emit structured event=s41_* lines (ADR-31/32/33), durable in Loki; verdict JSON + results dossier (ADR-79/80).
make qa green (new unit/contract tests in the fast tier); no purge/embargo violation in any split (tested).
Docs live on docs.cvntrade.eu; MLOps readiness filed; gated factor has guardrail + integration test (ADR-58).
No production split promoted (ADR-2) — the verdict feeds the §12 routing; promotion is deferred.

7. Open questions for `plan_review`

Pre-review round 1 (2026-05-28) resolved Q2/Q5/Q6, reshaped Q1/Q3. Round 2 resolved the baseline ambiguity + bilateral materiality bar (BASELINE_LEAK_INFLATED), scoped the Job-A clear to label-overlap, pre-committed the a-priori-underpower ladder, set absolute regime bands for family 6, and reconciled control-group verdict ↔ §12 action. Round 3 (Approve-with-required-clarifications) locked all five: (1) ACF_INDETERMINATE blocks Job-B routing (ACF_UNRESOLVED_CONTEXT); (2) core vs secondary families (0/1/2/6 verdict-bearing, 3/4/5 descriptive-unless-powered); (3) Family-6 regime identity v1 spec (§4); (4) practical-materiality framing + sub-material row (§1); (5) CVN_SPLIT_MODE experimental-not-production (§3). Non-blocking #6/#7/#8 also folded (scope in objective; PR-AUC DISCRIMINATION_CONFLICT; pre-registered ACF-cut null).

Round 4 — committee plan_review (session 0854dc31, OP Meeting #234): PASS. Mistral score_b 8/9/9/8/8, zero blockers (the CONSOLIDATOR_ERROR was the Gemini spend-cap, judged on per-expert Mistral basis per operator policy). All open questions now closed. Folded committee enhancements: pre-run s41_power_estimate event (§3), runtime embargo guard (§4), S41 DAG on_failure_callback containment (§4), threshold-vs-6-code sensitivity cross-check (§5), ACF_INDETERMINATE escalation playbook (§1); DISCRIMINATION_CONFLICT reconcile-before-acting already in §2.

Q1 — Materiality bar. RESOLVED (committee r4): ΔAUC = 0.02 + 0.80 power floor confirmed and locked as pre-registered (committee: conservative, aligned with the study's noise floor). AUC-gated, best_iter corroborating-not-gating, power gate → INCONCLUSIVE_UNDERPOWERED (§1).
Q2 — Family prioritisation. RESOLVED (pre-review): load-bearing core = purged-WF + strict-temporal + regime-A→B; market-period / volatility / crypto second-tier.
Q3 — Regime definition. Reshaped (pre-review): transparent pre-registered vol-quantile / trend-sign epochs as primary (6-code detector = cross-check), with partition thresholds (families 3/4) fit train-fold-only and family 6 defined by absolute pre-registered bands (§4). For committee: sign off the specific quantile/trend definitions.
Q4 — Contingent-split boundary. RESOLVED (operator, post-committee r4): the regime-aware split lands in S03 as an ADR-56-gated, off-by-default capability (eligible for promotion only after S04 and a separate closure decision), not deferred to a follow-on. It is not promoted to production in S03 (ADR-2); CVN_SPLIT_MODE is experimental, Console-labelled (§3). (Committee was split: majority leaned defer, data-scientist leaned land-gated; operator chose land-gated.)
Q5 — Cross-model. RESOLVED (pre-review): XGB/CB stay descriptive context only; LGB gates the verdict (avoids diluting already-tight power and blurring the LGB-specific routing).
Q6 — Job-A priority. RESOLVED (pre-review): Job A resolves before any Job B conclusion — the program is under an unresolved HALT-level flag until then.

Appendix — Evidence anchors

S02 executed verdict (Loki 2026-05-25): s40_group_outcome ... status=SYSTEMATIC_LEAK halt=mission conclusive=5/5 counts={'LEAK_CONFIRMED':5} proxy_flags=['time_segments_regime','acf_only_embargo','event_halt']; per-cell fl_val_auc 0.649–0.678, fl_corr_drift 0.097–0.126, ls_psi 0.003–0.033, ta_acf 0.687–0.721.
S02 reserve (OP wp#225 transition comment, 2026-05-25): "verdict driven solely by ACF probe … leakage probes clean … likely probe-calibration false-positive; S40_ACF_CUT recalibration deferred to S03 #1058."
Base study §12 decision routing: 2026-05-24-cvn-n001-ee-best-iter1-multimodel-study.md.
Existing instruments: src/training/cv/purged_kfold.py, src/commun/finetune/ablation_runner.py:806, src/commun/regime/{regime_detector,regime_tagger,compatible_train_builder}.py, src/commun/finetune/diagnostic/s40_validation_integrity.py, dags/dag_diagnostic__s40.py, S07 diagnostic_capture_*.