S18 Step 4: design (amended post-Loki forensics)
Status : draft v2 (post-Loki amendment, pre-committee plan_review)
Parent dossier : documentation/reviews/2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md §5.4
Trigger : Step 3 verdict = REFUTED + Loki forensic query on the chained DAG run 2026-05-14 14:44 UTC
Author : Claude Opus 4.7 (under operator review)
1. Why this v2
The v1 of this dossier proposed a 7-stage replay matrix (A.GRID, A.OPT, B.GRID, … D.OPT). Loki forensics on the actual chained-DAG run obsoletes that scope :
Loki query results (2026-05-14 14:44-15:08 UTC)
Every single Optuna trial converged to best_iter ≤ 2 across the full hyperparameter search space (50 trials + final retrain). Combined with Step 3 producing legacy_best_iter=1, harness_best_iter=1 on GRID_DEFAULT_HP_LGB :
| Configuration | Params | valid_sets | best_iter |
|---|---|---|---|
| Phase A trial 0 | Optuna trial 0 | [val_set] | 1 |
| Phase A trial 1 | Optuna trial 1 | [val_set] | 1 |
| Phase A trial 2 | Optuna trial 2 | [val_set] | 2 |
| … | … | … | … |
| Phase A trial 50 | Optuna best retrain | [val_set] | 1 |
| Step 3 legacy | GRID_DEFAULT | [train_set, val_set] | 1 |
| Step 3 harness | GRID_DEFAULT | [val_set] | 1 |
In total, 53 configurations (51 Optuna runs + 2 Step 3 runs), spanning 2 param families (Optuna-sampled and GRID_DEFAULT) and 2 valid_sets configurations, all early-stop at iter 1-2.
This eliminates :
- H1 (early stopping config) : the outcome is uniform across 51 different param sets
- H6 (eval metric mismatch) : it would have to affect 53 distinct configs identically
- H7 (HPO param override) : none of the 51 Optuna combinations escapes the trap

The remaining hypotheses are all data-side :
- H3 label misalignment / off-by-one
- H4 sample weights formula bug (but weights_present=false in capture → unlikely)
- H5 feature corruption (NaN leak / scaling drift / leak feature)
A 7-stage replay testing the training loop invariants is misdirected — the loop is doing the right thing on bad data.
Methodology flaw exposed by Step 3
Parity.py:320 hardcoded y_pred = (y_proba >= 0.5).astype(int). With scale_pos_weight=4.71 compressing probas downward, ZERO trades cross θ=0.5. So f1_buy_val=0.0, n_trades_val=0 was a measurement artefact. The true signal is best_iter=1 (which is robust to θ).
Step 4 MUST fix this : either run a θ-sweep on the trained booster's proba distribution or skip the f1_buy measurement entirely and rely on AUC / loss curves.
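The θ-sweep option can be sketched as follows. This is a minimal illustration, not the `s18_step3_parity.py` patch itself: the function names `f1_at_threshold` and `theta_sweep` are hypothetical, labels are assumed binary with 1 = buy, and "n_trades" is assumed to mean the number of predicted positives. The returned keys reuse the F6 field names from §2.1.

```python
from __future__ import annotations

import numpy as np


def f1_at_threshold(y_true: np.ndarray, y_proba: np.ndarray, theta: float) -> tuple[float, int]:
    """Return (F1 of the positive class, number of predicted positives) at threshold theta."""
    y_pred = y_proba >= theta
    y_pos = y_true == 1
    tp = int(np.sum(y_pred & y_pos))
    fp = int(np.sum(y_pred & ~y_pos))
    fn = int(np.sum(~y_pred & y_pos))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, int(y_pred.sum())


def theta_sweep(y_true: np.ndarray, y_proba: np.ndarray, thetas: np.ndarray | None = None) -> dict:
    """Scan thresholds and keep the first one that reaches the best F1."""
    if thetas is None:
        # theta in [0.05, 0.50] step 0.05; rounding removes arange float noise.
        thetas = np.round(np.arange(0.05, 0.50 + 1e-9, 0.05), 2)
    best = {"best_f1_at_theta_sweep": 0.0, "theta_at_best_f1": None, "n_trades_at_best_theta": 0}
    for theta in thetas:
        f1, n_trades = f1_at_threshold(y_true, y_proba, float(theta))
        if f1 > best["best_f1_at_theta_sweep"]:  # strict: ties keep the lowest theta
            best = {
                "best_f1_at_theta_sweep": f1,
                "theta_at_best_f1": float(theta),
                "n_trades_at_best_theta": n_trades,
            }
    return best
```

On a proba vector that never reaches 0.5, `f1_at_threshold(..., 0.5)` reproduces the Step 3 artefact (F1 = 0, zero trades) while the sweep can still recover a usable threshold, which is the point of the fix.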
2. Amended Step 4 scope: data forensics
Drop the 7-stage replay matrix. Replace with a focused data inspector that runs the §5.3 invariants 1, 2 (data parity) + a tightly-scoped §5.3.4 (iter-1 training behavior) on the captured fold.
2.1 The forensic checks (in §5.3 stop-at-first-divergence order)
| # | Check | Rules out / confirms | Implementation |
|---|---|---|---|
| F1 | Label-index alignment | H3 off-by-one | assert (X_train.index == y_train.index).all() ; ditto val |
| F2 | Label distribution sanity | H3 mapping bug | Already confirmed by Loki : positive-class rate = 17.5% (train) / 16.9% (val), consistent with scale_pos_weight=4.71 ≈ 82.5/17.5. Reconfirm + emit |
| F3 | NaN ratio per feature | H5 NaN leak | X.isna().mean(axis=0) gives the per-feature NaN ratio ; flag features with > 50% NaN |
| F4 | Feature-label correlation | H5 leak feature | np.corrcoef(X.iloc[:, j], y)[0, 1] for each j, on rows where feature j is not NaN ; top 10 by abs(corr). If max > 0.95 → leak |
| F5 | Train vs val feature drift | H5 scaling drift | per-feature mean/std on train vs val ; flag features with > 3σ shift |
| F6 | Iter-1 train AUC + θ-sweep | smoking gun | Train 1 single tree with GRID_DEFAULT ; report iter1_train_auc, iter1_val_auc, full proba_val distribution stats (min/max/mean/std), best_f1_at_theta_sweep (θ ∈ [0.05, 0.50] step 0.05), theta_at_best_f1, n_trades_at_best_theta |

F6 interpretation :
- iter1_train_auc ≈ 1.0 AND iter1_val_auc ≈ 1.0 → leak feature in both splits
- iter1_train_auc ≈ 1.0 AND iter1_val_auc ≈ 0.5 → leak on train only, or val labels misaligned
- iter1_train_auc ≈ 0.5 → label-feature mismatch on train
- iter1_val_auc > 0.85 AND best_f1_at_theta_sweep < 0.20 → model over-polarised (scale_pos_weight or objective mismatch : all probas in a narrow zone, no θ recovers f1)
- iter1_train_auc ∈ [0.65, 0.85] AND flat val curve → train/val drift

Stop-at-first-divergence : if F1 fails, report and stop. F2 needs F1 OK. Etc.
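The data-side checks F1, F3, F4 and F5 could look roughly like the sketch below. It assumes the captured fold loads as pandas objects (a `DataFrame` of numeric features and a binary label `Series`). All four function names are hypothetical, and "3σ shift" is read here as the shift of the val mean measured in train standard deviations, which is one possible interpretation of the table.

```python
import numpy as np
import pandas as pd


def check_index_alignment(X: pd.DataFrame, y: pd.Series) -> bool:
    """F1: features and labels must share the same index, in the same order."""
    return X.index.equals(y.index)


def nan_ratio_offenders(X: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """F3: per-feature NaN ratio, restricted to features above the threshold."""
    ratios = X.isna().mean(axis=0)
    return ratios[ratios > threshold]


def top_label_correlations(X: pd.DataFrame, y: pd.Series, k: int = 10) -> pd.Series:
    """F4: top-k features by absolute Pearson correlation with the label."""
    corr = X.corrwith(y.astype(float)).abs()
    # Constant features have no defined correlation and are dropped.
    return corr.dropna().sort_values(ascending=False).head(k)


def drift_offenders(
    X_train: pd.DataFrame, X_val: pd.DataFrame, sigma_threshold: float = 3.0
) -> pd.Series:
    """F5: features whose val mean moved by more than sigma_threshold train std."""
    std = X_train.std(axis=0).replace(0.0, np.nan)  # avoid dividing by zero
    shift = ((X_val.mean(axis=0) - X_train.mean(axis=0)).abs() / std).dropna()
    return shift[shift > sigma_threshold]
```

Each helper returns the offending features rather than a bare boolean, so the caller can put them in the `metrics` payload before deciding PASS or FAIL against the thresholds (0.95, 0.5, 3.0).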
2.2 Why ONE tree at iter 1 is the right probe
The 51 Phase A trials all stop at iter 1-2. So iter 1 is the locus of the divergence. A single tree at iter 1 gives us :
- train_auc → answers "does the first tree learn the train labels ?" (label-vs-feature alignment)
- val_auc → answers "does the first tree generalize to val ?" (train/val parity)
- proba distribution → answers "are the predictions concentrated near 0.5 or polarized ?"

If iter 1 train_auc ≈ 1.0 AND val_auc ≈ 1.0 → leak feature (the data has a "future leak"). If iter 1 train_auc ≈ 1.0 AND val_auc ≈ 0.5 → leak only on train (val labels misaligned). If iter 1 train_auc ≈ 0.5 → label-feature mismatch on train.
We don't need 7 stages. We need 1 tree + correlation matrix.
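The reading of the two iter-1 AUCs can be written down as a small rule set. The sketch below is illustrative only: `binary_auc` and `interpret_iter1` are hypothetical names, the tolerance used for "≈" (0.05) is an assumption, and the AUC is computed with a rank-based formula that averages ties, since a single tree emits only a few distinct leaf values.

```python
import numpy as np


def binary_auc(y_true, y_score) -> float:
    """ROC AUC via the Mann-Whitney U statistic, with average ranks for ties."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score), dtype=float)
    i = 0
    while i < len(sorted_scores):
        j = i
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i : j + 1]] = (i + j) / 2 + 1  # 1-based average rank of the tie group
        i = j + 1
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def interpret_iter1(train_auc: float, val_auc: float, tol: float = 0.05) -> str:
    """Map the iter-1 AUC pair to the diagnosis listed in the text."""

    def near(value: float, target: float) -> bool:
        return abs(value - target) <= tol

    if near(train_auc, 0.5):
        return "label_feature_mismatch_on_train"
    if near(train_auc, 1.0) and near(val_auc, 1.0):
        return "leak_feature_both_splits"
    if near(train_auc, 1.0) and near(val_auc, 0.5):
        return "leak_on_train_only_or_val_labels_misaligned"
    return "inconclusive"
```

Anything outside these three patterns is deliberately returned as `"inconclusive"`, leaving the interpretation to the operator.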
2.3 Decision matrix → next step
| Outcome | Next step |
|---|---|
| F1 fail | H3 confirmed → S19 = fix the temporal join / dataset construction |
| F3 fail (NaN > 50%) | H5 confirmed → S19 = fix the FE pipeline NaN handling |
| F4 fail (corr > 0.95) | leak feature → identify the feature + remove from FE pipeline |
| F5 fail | train/val drift → fix split / scaling in FE pipeline |
| F6 iter-1 train_auc ≈ 1.0 | corroborates F4 |
| F6 iter-1 train_auc ≈ 0.5 | corroborates F1 or F3 |
| All checks pass | Escalate to committee : the bug is upstream of the captured parquet (FTF cache, labeling, or somewhere we didn't capture) |
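Stop-at-first-divergence plus the decision matrix can be folded into a verdict along the lines below. This is a sketch under assumptions: the two dataclasses are compact stand-ins for the shapes declared in §2.4, `fold_results` and `HYPOTHESIS_BY_CHECK` are hypothetical names, and the check-to-hypothesis mapping is read off the §2.1 table (F6 is left unmapped because it only corroborates other checks).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ForensicResult:  # compact stand-in for the §2.4 dataclass
    check_label: str
    status: str  # "PASS" | "FAIL"
    notes: str = ""
    metrics: dict = field(default_factory=dict)


@dataclass(frozen=True)
class InvariantVerdict:  # compact stand-in for the §2.4 dataclass
    status: str
    first_divergence: str | None
    implicated_hypothesis: list[str]
    results: list[ForensicResult]
    notes: str = ""


# Assumed mapping from failing check to implicated hypothesis.
HYPOTHESIS_BY_CHECK = {"F1": ["H3"], "F2": ["H3"], "F3": ["H5"], "F4": ["H5"], "F5": ["H5"]}


def fold_results(checks: list[Callable[[], ForensicResult]]) -> InvariantVerdict:
    """Run checks in order; later checks are never executed after the first FAIL."""
    results: list[ForensicResult] = []
    for check in checks:
        result = check()
        results.append(result)
        if result.status == "FAIL":
            check_id = result.check_label.split("_", 1)[0]  # "F4_..." -> "F4"
            return InvariantVerdict(
                status="DIVERGENCE_LOCATED",
                first_divergence=check_id,
                implicated_hypothesis=HYPOTHESIS_BY_CHECK.get(check_id, []),
                results=results,
                notes=result.notes,
            )
    return InvariantVerdict(
        status="NO_DIVERGENCE",
        first_divergence=None,
        implicated_hypothesis=[],
        results=results,
        notes="all checks passed; escalate to committee",
    )
```

Passing the checks as zero-argument callables keeps the expensive ones (the iter-1 training in F6) from running once an earlier check has already located the divergence.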
2.4 Module API
```python
# src/commun/finetune/diagnostic/s18_step4_invariants.py
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ForensicResult:
    check_label: str  # "F1_label_index_alignment", etc.
    status: str  # "PASS" | "FAIL"
    notes: str
    metrics: dict  # check-specific quantitative payload


@dataclass(frozen=True)
class InvariantVerdict:
    status: str  # "DIVERGENCE_LOCATED" | "NO_DIVERGENCE" | "INCONCLUSIVE"
    first_divergence: str | None  # "F4" if F4 failed first
    implicated_hypothesis: list[str]  # ["H5"] etc.
    results: list[ForensicResult]
    notes: str


def run_step4_invariants(
    crypto: str,
    fold_id: int,
    parquet_path: Path,
    artifact_dir: Path,
    leak_corr_threshold: float = 0.95,
    nan_ratio_threshold: float = 0.5,
    drift_sigma_threshold: float = 3.0,
    iter1_grid_params: dict | None = None,  # falls back to DEFAULT_LGB_PARAMS
    raise_on_no_divergence: bool = True,
) -> InvariantVerdict:
    """Run forensic checks F1-F6 on the captured fold. Stop at first divergence."""
```
2.5 DAG pattern
Chain Phase A (Step 1 capture, ~19 min) + Phase B (forensics, ~10 s) in a single pod — same pattern as #935. Operator triggers, gets verdict in ~20 min, kills cycle.
Same params as #935 + optional skip_phase_a (re-use existing artifacts when present).
3. Open questions
- Cross-fold : reproduce on a second crypto×fold cell only if F-check verdict is ambiguous, NOT pre-emptively (saves 20 min cycle).
- Leak feature detection threshold : `|corr| > 0.95` is a strong heuristic but may miss multi-feature leaks. Sufficient for v1 ; if F6 contradicts F4 we add a SHAP top-10 in a follow-up.
- What is "the correct AUC" we should be measuring ? The harness uses AUC. The legacy used AUC. If the iter-1 AUC is > 0.9 on val we have a leak ; if it's 0.7 plateau-immediate we have something more subtle. Trust the operator to interpret the F6 output.
4. Acceptance criteria
- Module `s18_step4_invariants.py` implementing F1-F6 forensic checks
- DAG `dag_diagnostic__s18_step1_4_chain.py` same-pod pattern as #935
- Retroactive Step 3 fix (bundled in same PR ; Fix B from operator review 2026-05-14) : `s18_step3_parity.py` θ=0.5 hardcoded replaced by θ-sweep [0.05, 0.50] step 0.05. New `PathTrace` fields : `best_f1_buy_at_theta_sweep`, `theta_at_best_f1`, `final_n_trades_val_at_best_theta`. `final_f1_buy_val` kept (now means f1@θ=0.5, documented as control). Rationale : Step 3 will be re-run (cross-fold + post-fix validation) ; without the patch every run loses the f1 signal due to `scale_pos_weight` compressing probas below 0.5.
- Unit tests for each F-check : synthetic data triggers expected verdict
- Unit test for the θ-sweep patch on Step 3 (`final_predictions_summary` carries the new fields, verdict logic unchanged)
- ADR-25 compliance : task red on `status != "DIVERGENCE_LOCATED"`, all events logged before raise
- ADR-31/32/33 : `event=s18_step4_*` key=value
- CR full cycle ; committee `pr_review` (substantive diagnostic code)
5. Effort estimate
| Phase | Effort |
|---|---|
| Module impl + unit tests | ~4h |
| DAG + integration | ~1h |
| PR + CR rounds | ~1 day |
| Trigger + verdict + interpretation | ~30 min after deploy |
| Total to verdict | ~1.5 days |
vs the v1 7-stage matrix (~2 days impl + ambiguous output). The forensic approach is faster AND has clearer decision branches.
6. Risks
| Risk | Likelihood | Mitigation |
|---|---|---|
| All F-checks pass (NO_DIVERGENCE) — bug is upstream of the parquet | medium | Verdict explicitly flags this and escalates to committee with all check metrics |
| Captured parquet differs from prod cache (Step 1 capture has its own shape) | low | Step 1 monkey-patches lgb.Dataset but doesn't transform features ; what's captured IS what the harness fed to lgb |
| Step 1 re-capture cycle is 19 min and the pod still GCs the parquet | low | Run forensics in the SAME pod (chained DAG), not a separate Step 4 pod |
| Leak feature exists but is < 0.95 corr (gradual leak) | medium | Single-tree iter-1 train_auc is the canary — if it's >>0.9 with corr<0.95, escalate to SHAP analysis |
7. Next steps
- Operator validates this design (this file v2).
- Optional committee `plan_review` if the operator wants ADR-68 cover (the substantive change vs v1 is the scope reduction ; committee is fast-path acceptable but not strictly mandatory since the parent §5.4 already covered "data invariants").
- Issue + branch post-validation : `feat/CVN-N001-EE-S18-step4-data-forensics`
- Implementation per AC §4.
- Trigger → verdict → Step 5.