S18 Step 4: design (amended post-Loki forensics)

Status : draft v2 (post-Loki amendment, pre-committee plan_review)
Parent dossier : documentation/reviews/2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md §5.4
Trigger : Step 3 verdict = REFUTED + Loki forensic query on the chained DAG run 2026-05-14 14:44 UTC
Author : Claude Opus 4.7 (under operator review)

1. Why this v2

The v1 of this dossier proposed a 7-stage replay matrix (A.GRID, A.OPT, B.GRID, … D.OPT). Loki forensics on the actual chained-DAG run obsoletes that scope :

Loki query results (2026-05-14 14:44-15:08 UTC)

event=s18_step1_trial_complete : 51 events
  best_iter=1 : 49 trials
  best_iter=2 :  2 trials

Every single Optuna trial converged to best_iter ≤ 2 across the full hyperparameter search space (50 trials + final retrain). Combined with Step 3 producing legacy_best_iter=1, harness_best_iter=1 on GRID_DEFAULT_HP_LGB :

Configuration	Params	valid_sets	best_iter
Phase A trial 0	Optuna trial 0	`[val_set]`	1
Phase A trial 1	Optuna trial 1	`[val_set]`	1
Phase A trial 2	Optuna trial 2	`[val_set]`	2
…	…	…	…
Phase A trial 50	Optuna best retrain	`[val_set]`	1
Step 3 legacy	GRID_DEFAULT	`[train_set, val_set]`	1
Step 3 harness	GRID_DEFAULT	`[val_set]`	1

53 total configurations, spanning 2 param families and 2 valid_sets configs → all early-stop at iter 1-2.

This eliminates :
- H1 (early stopping config) : uniform across 51 different param sets
- H6 (eval metric mismatch) : would have to affect 53 distinct configs identically
- H7 (HPO param override) : none of 51 Optuna combinations escape the trap

The remaining hypotheses are all data-side :
- H3 label misalignment / off-by-one
- H4 sample weights formula bug (but weights_present=false in capture → unlikely)
- H5 feature corruption (NaN leak / scaling drift / leak feature)

A 7-stage replay testing the training loop invariants is misdirected — the loop is doing the right thing on bad data.

Methodology flaw exposed by Step 3

s18_step3_parity.py:320 hardcodes y_pred = (y_proba >= 0.5).astype(int). With scale_pos_weight=4.71 compressing probas downward, ZERO trades cross θ=0.5, so f1_buy_val=0.0 and n_trades_val=0 were measurement artefacts. The true signal is best_iter=1, which does not depend on θ.

Step 4 MUST fix this : either run a θ-sweep on the trained booster's proba distribution or skip the f1_buy measurement entirely and rely on AUC / loss curves.
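As an illustration of the first option, here is a minimal θ-sweep sketch in plain Python. It assumes the grid θ ∈ [0.05, 0.50] in steps of 0.05 used elsewhere in this dossier; the function names `f1_at_threshold` and `theta_sweep` are hypothetical, and the real patch would presumably work on NumPy arrays rather than lists.

```python
from __future__ import annotations

from typing import Sequence


def f1_at_threshold(
    y_true: Sequence[int], y_proba: Sequence[float], theta: float
) -> tuple[float, int]:
    """Return (F1 of the positive class, number of predicted positives) at theta."""
    y_pred = [1 if p >= theta else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, tp + fp


def theta_sweep(
    y_true: Sequence[int],
    y_proba: Sequence[float],
    start: float = 0.05,
    stop: float = 0.50,
    step: float = 0.05,
) -> dict:
    """Scan the threshold grid and keep the first θ that maximises F1."""
    n_steps = round((stop - start) / step)
    thetas = [round(start + i * step, 2) for i in range(n_steps + 1)]
    best = {
        "best_f1_at_theta_sweep": 0.0,
        "theta_at_best_f1": None,
        "n_trades_at_best_theta": 0,
    }
    for theta in thetas:
        f1, n_trades = f1_at_threshold(y_true, y_proba, theta)
        if f1 > best["best_f1_at_theta_sweep"]:
            best = {
                "best_f1_at_theta_sweep": f1,
                "theta_at_best_f1": theta,
                "n_trades_at_best_theta": n_trades,
            }
    return best
```

On a toy fold whose probabilities all sit below 0.5, the fixed θ=0.5 cut reports zero trades while the sweep still recovers a usable threshold, which is the artefact described above.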

2. Amended Step 4 scope: data forensics

Drop the 7-stage replay matrix. Replace with a focused data inspector that runs the §5.3 invariants 1, 2 (data parity) + a tightly-scoped §5.3.4 (iter-1 training behavior) on the captured fold.

2.1 The forensic checks (in §5.3 stop-at-first-divergence order)

#	Check	Rules out / confirms	Implementation
F1	Label-index alignment	H3 off-by-one	`assert (X_train.index == y_train.index).all()` ; ditto val
F2	Label distribution sanity	H3 mapping bug	Already confirmed by Loki : positive rate = 17.5% (train) / 16.9% (val). Reconfirm + emit
F3	NaN ratio per feature	H5 NaN leak	`X.isna().mean(axis=0)` ; flag features with > 50% NaN
F4	Feature-label correlation	H5 leak feature	`np.corrcoef(X.iloc[:, j], y)[0, 1]` for each j ; top 10 by `|corr|`. If max > 0.95 → leak
F5	Train vs val feature drift	H5 scaling drift	per-feature mean/std on train vs val ; flag features with > 3σ shift
F6	Iter-1 train AUC + θ-sweep	smoking gun	Train a single tree with GRID_DEFAULT ; report `iter1_train_auc`, `iter1_val_auc`, full `proba_val` distribution stats (min/max/mean/std), `best_f1_at_theta_sweep` (θ ∈ [0.05, 0.50] step 0.05), `theta_at_best_f1`, `n_trades_at_best_theta`

F6 interpretation :
- `iter1_train_auc ≈ 1.0` AND `iter1_val_auc ≈ 1.0` → leak feature in both splits
- `iter1_train_auc ≈ 1.0` AND `iter1_val_auc ≈ 0.5` → leak on train only (val labels misaligned)
- `iter1_train_auc ≈ 0.5` → label-feature mismatch on train
- `iter1_val_auc > 0.85` AND `best_f1_at_theta_sweep < 0.20` → probas collapsed into a narrow zone where no θ recovers f1 (scale_pos_weight or objective mismatch)
- `iter1_train_auc ∈ [0.65, 0.85]` AND flat val curve → train/val drift
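The data-side checks F3 to F5 can be sketched without any dataframe library. The version below is only an illustration: it assumes each split is a dict mapping a feature name to a list of floats (with `float("nan")` for missing values), it measures drift as the shift of the val mean expressed in train standard deviations, and all function names are hypothetical. It evaluates every flag at once for readability, whereas the real module stops at the first divergence.

```python
from __future__ import annotations

import math
from statistics import fmean, pstdev


def nan_ratio(column: list[float]) -> float:
    """Fraction of NaN entries in one feature column."""
    return sum(1 for v in column if math.isnan(v)) / len(column)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation over rows where x is not NaN; 0.0 when undefined."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not math.isnan(x)]
    if len(pairs) < 2:
        return 0.0
    mx = fmean([x for x, _ in pairs])
    my = fmean([y for _, y in pairs])
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drift_sigmas(train_col: list[float], val_col: list[float]) -> float:
    """Shift of the val mean from the train mean, in units of train std."""
    train = [v for v in train_col if not math.isnan(v)]
    val = [v for v in val_col if not math.isnan(v)]
    if not train or not val:
        return 0.0
    shift = abs(fmean(val) - fmean(train))
    sd = pstdev(train)
    if sd == 0:
        return 0.0 if shift == 0 else math.inf
    return shift / sd


def flag_features(train, val, y_train, nan_thr=0.5, corr_thr=0.95, drift_thr=3.0):
    """Collect the feature names that trip the F3, F4 and F5 thresholds."""
    flags = {"F3_nan": [], "F4_leak": [], "F5_drift": []}
    for name, col in train.items():
        if nan_ratio(col) > nan_thr:
            flags["F3_nan"].append(name)
        if abs(pearson(col, y_train)) > corr_thr:
            flags["F4_leak"].append(name)
        if drift_sigmas(col, val[name]) > drift_thr:
            flags["F5_drift"].append(name)
    return flags
```

The default thresholds mirror the 50% NaN, 0.95 correlation and 3σ values from the table; whether drift should be normalised by the train std or by a pooled std is a design choice this sketch does not settle.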

Stop-at-first-divergence : if F1 fails, report and stop. Each later check runs only once every earlier check has passed (F2 requires F1 OK, F3 requires F2 OK, and so on).
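One possible shape for the stop-at-first-divergence loop is sketched below. `CheckOutcome` is a hypothetical, simplified stand-in for the result type of the real module, and `run_in_order` is an illustrative name; checks are passed as zero-argument callables so that a later check is never evaluated once an earlier one has failed.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckOutcome:
    """Hypothetical minimal result of one forensic check."""
    check_label: str   # e.g. "F1"
    status: str        # "PASS" | "FAIL"
    hypothesis: str    # hypothesis implicated if the check fails, e.g. "H3"


def run_in_order(
    checks: list[Callable[[], CheckOutcome]],
) -> tuple[str, str | None, list[CheckOutcome]]:
    """Run checks sequentially and stop at the first FAIL.

    Returns (verdict status, label of the first failing check, results so far).
    """
    results: list[CheckOutcome] = []
    for check in checks:
        outcome = check()
        results.append(outcome)
        if outcome.status == "FAIL":
            return "DIVERGENCE_LOCATED", outcome.check_label, results
    return "NO_DIVERGENCE", None, results
```

Keeping the results collected so far in the return value means the report still carries the metrics of the checks that passed before the divergence.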

2.2 Why ONE tree at iter 1 is the right probe

The 51 Phase A trials all stop at iter 1-2, so iter 1 is the locus of the divergence. A single tree at iter 1 gives us :
- train_auc → answers "does the first tree learn the train labels ?" (label-vs-feature alignment)
- val_auc → answers "does the first tree generalize to val ?" (train/val parity)
- proba distribution → answers "are the predictions concentrated near 0.5 or polarized ?"

Reading the two AUCs together :
- iter 1 train_auc ≈ 1.0 AND val_auc ≈ 1.0 → leak feature (the data has a "future leak")
- iter 1 train_auc ≈ 1.0 AND val_auc ≈ 0.5 → leak only on train (val labels misaligned)
- iter 1 train_auc ≈ 0.5 → label-feature mismatch on train

We don't need 7 stages. We need 1 tree + correlation matrix.
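The AUC reading described in this section can be made concrete with a small sketch. The pairwise AUC below is the standard rank statistic (fraction of positive/negative pairs ranked correctly, ties counted as half); the tolerance `tol=0.05` used to decide what "≈" means is an assumption of this example, not a value from the dossier, and `interpret_iter1` is a hypothetical helper name.

```python
from __future__ import annotations


def auc(y_true: list[int], scores: list[float]) -> float:
    """Pairwise AUC: probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def interpret_iter1(train_auc: float, val_auc: float, tol: float = 0.05) -> str:
    """Map the iter-1 AUC pair to the signatures listed in this section."""
    def near(value: float, target: float) -> bool:
        return abs(value - target) <= tol  # tol is an assumed tolerance

    if near(train_auc, 1.0) and near(val_auc, 1.0):
        return "leak feature in both splits"
    if near(train_auc, 1.0) and near(val_auc, 0.5):
        return "leak on train only (val labels misaligned)"
    if near(train_auc, 0.5):
        return "label-feature mismatch on train"
    return "no single-tree signature"
```

The quadratic pair count is fine for a sketch; on a full fold a sorted-rank implementation or a library AUC would be the practical choice.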

2.3 Decision matrix → next step

2.4 Module API

# src/commun/finetune/diagnostic/s18_step4_invariants.py
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ForensicResult:
    check_label: str       # "F1_label_index_alignment", etc.
    status: str            # "PASS" | "FAIL"
    notes: str
    metrics: dict          # check-specific quantitative payload

@dataclass(frozen=True)
class InvariantVerdict:
    status: str            # "DIVERGENCE_LOCATED" | "NO_DIVERGENCE" | "INCONCLUSIVE"
    first_divergence: str | None  # "F4" if F4 failed first
    implicated_hypothesis: list[str]  # ["H5"] etc.
    results: list[ForensicResult]
    notes: str


def run_step4_invariants(
    crypto: str,
    fold_id: int,
    parquet_path: Path,
    artifact_dir: Path,
    leak_corr_threshold: float = 0.95,
    nan_ratio_threshold: float = 0.5,
    drift_sigma_threshold: float = 3.0,
    iter1_grid_params: dict | None = None,  # falls back to GRID_DEFAULT_HP_LGB
    raise_on_no_divergence: bool = True,
) -> InvariantVerdict:
    """Run forensic checks F1-F6 on the captured fold. Stop at first divergence."""

2.5 DAG pattern

Chain Phase A (Step 1 capture, ~19 min) + Phase B (forensics, ~10 s) in a single pod — same pattern as #935. Operator triggers, gets verdict in ~20 min, kills cycle.

dags/dag_diagnostic__s18_step1_4_chain.py

Same params as #935 + optional skip_phase_a (re-use existing artifacts when present).

3. Open questions

Cross-fold : reproduce on a second crypto×fold cell only if F-check verdict is ambiguous, NOT pre-emptively (saves 20 min cycle).
Leak feature detection threshold : |corr| > 0.95 is a strong heuristic but may miss multi-feature leaks. Sufficient for a first implementation ; if F6 contradicts F4 we add a SHAP top-10 in a follow-up.
What is "the correct AUC" we should be measuring ? Both the harness and the legacy path use AUC. If the iter-1 AUC is > 0.9 on val we have a leak ; if it plateaus immediately around 0.7 we have something more subtle. The operator interprets the F6 output.

4. Acceptance criteria

Module s18_step4_invariants.py implementing F1-F6 forensic checks
DAG dag_diagnostic__s18_step1_4_chain.py same-pod pattern as #935
Retroactive Step 3 fix (bundled in same PR ; Fix B from operator review 2026-05-14) : in s18_step3_parity.py the hardcoded θ=0.5 is replaced by a θ-sweep over [0.05, 0.50] in steps of 0.05. New PathTrace fields : best_f1_buy_at_theta_sweep, theta_at_best_f1, final_n_trades_val_at_best_theta. final_f1_buy_val is kept (it now means f1@θ=0.5, documented as a control). Rationale : Step 3 will be re-run (cross-fold + post-fix validation) ; without the patch every run loses the f1 signal because scale_pos_weight compresses probas below 0.5.
Unit tests for each F-check : synthetic data triggers expected verdict
Unit test for the θ-sweep patch on Step 3 (final_predictions_summary carries the new fields, verdict logic unchanged)
ADR-25 compliance : task red on status != "DIVERGENCE_LOCATED", all events logged before raise
ADR-31/32/33 : event=s18_step4_* key=value
CR full cycle ; committee pr_review (substantive diagnostic code)
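The two logging criteria above (key=value events named `s18_step4_*`, and all events logged before the task is turned red) could be combined roughly as follows. This is a sketch based only on how the criteria are summarised here, not on the ADR texts themselves; the logger name, the event name `s18_step4_verdict`, the exception class and the `finalize` helper are all hypothetical.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("s18_step4")  # logger name is an assumption


def format_event(event: str, **fields: object) -> str:
    """Render one log line as space-separated key=value pairs, event first."""
    parts = [f"event={event}"]
    parts.extend(f"{key}={value}" for key, value in fields.items())
    return " ".join(parts)


class VerdictNotLocated(RuntimeError):
    """Hypothetical exception used to fail the task on a non-located verdict."""


def finalize(status: str, first_divergence: str | None) -> None:
    """Log the verdict event first, then raise if no divergence was located."""
    logger.info(
        format_event(
            "s18_step4_verdict",  # hypothetical event name
            status=status,
            first_divergence=first_divergence,
        )
    )
    if status != "DIVERGENCE_LOCATED":
        raise VerdictNotLocated(status)
```

Logging before raising keeps the verdict line queryable even when the task fails.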

5. Effort estimate

Phase	Effort
Module impl + unit tests	~4h
DAG + integration	~1h
PR + CR rounds	~1 day
Trigger + verdict + interpretation	~30 min after deploy
Total to verdict	~1.5 days

Compared with the v1 7-stage matrix (~2 days impl + ambiguous output), the forensic approach is faster AND has clearer decision branches.

6. Risks

Risk	Likelihood	Mitigation
All F-checks pass (NO_DIVERGENCE) — bug is upstream of the parquet	medium	Verdict explicitly flags this and escalates to committee with all check metrics
Captured parquet differs from prod cache (Step 1 capture has its own shape)	low	Step 1 monkey-patches lgb.Dataset but doesn't transform features ; what's captured IS what the harness fed to lgb
Step 1 re-capture cycle is 19 min and the pod still GCs the parquet	low	Run forensics in the SAME pod (chained DAG), not a separate Step 4 pod
Leak feature exists but is < 0.95 corr (gradual leak)	medium	Single-tree iter-1 train_auc is the canary — if it's >>0.9 with corr<0.95, escalate to SHAP analysis

7. Next steps

Operator validates this design (this file v2).
Optional committee plan_review if the operator wants ADR-68 cover (the substantive change vs v1 is the scope reduction — committee is fast-path acceptable but not strictly mandatory since the parent §5.4 already covered "data invariants").
Issue + branch post-validation : feat/CVN-N001-EE-S18-step4-data-forensics
Implementation per AC §4.
Trigger → verdict → Step 5.