S18 Step 4 — design (amended post-Loki forensics)

Status : draft v2 (post-Loki amendment, pre-committee plan_review)
Parent dossier : documentation/reviews/2026-05-13-cvn-n001-ee-s18-harness-shallow-training-diagnostic-plan.md §5.4
Trigger : Step 3 verdict = REFUTED + Loki forensic query on the chained DAG run 2026-05-14 14:44 UTC
Author : Claude Opus 4.7 (under operator review)


1. Why this v2

The v1 of this dossier proposed a 7-stage replay matrix (A.GRID, A.OPT, B.GRID, … D.OPT). Loki forensics on the actual chained-DAG run obsoletes that scope :

Loki query results (2026-05-14 14:44-15:08 UTC)

event=s18_step1_trial_complete : 51 events
  best_iter=1 : 49 trials
  best_iter=2 :  2 trials

Every single Optuna trial converged to best_iter ≤ 2 across the full hyperparameter search space (50 trials + final retrain). Combined with Step 3 producing legacy_best_iter=1, harness_best_iter=1 on GRID_DEFAULT_HP_LGB :

| Configuration | Params | valid_sets | best_iter |
|---|---|---|---|
| Phase A trial 0 | Optuna trial 0 | [val_set] | 1 |
| Phase A trial 1 | Optuna trial 1 | [val_set] | 1 |
| Phase A trial 2 | Optuna trial 2 | [val_set] | 2 |
| … | … | … | … |
| Phase A trial 50 | Optuna best retrain | [val_set] | 1 |
| Step 3 legacy | GRID_DEFAULT | [train_set, val_set] | 1 |
| Step 3 harness | GRID_DEFAULT | [val_set] | 1 |

53 total configurations (51 Phase A + 2 Step 3), spanning 2 param families and 2 valid_sets configs → all early-stop at iter 1-2.

This eliminates :
- H1 (early stopping config) : uniform across 51 different param sets
- H6 (eval metric mismatch) : would have to affect 53 distinct configs identically
- H7 (HPO param override) : none of 51 Optuna combinations escape the trap

The remaining hypotheses are all data-side :
- H3 label misalignment / off-by-one
- H4 sample weights formula bug (but weights_present=false in capture → unlikely)
- H5 feature corruption (NaN leak / scaling drift / leak feature)

A 7-stage replay testing the training loop invariants is misdirected — the loop is doing the right thing on bad data.

Methodology flaw exposed by Step 3

Parity.py:320 hardcodes y_pred = (y_proba >= 0.5).astype(int). With scale_pos_weight=4.71 compressing probas downward, zero trades cross θ=0.5, so f1_buy_val=0.0 and n_trades_val=0 were measurement artefacts rather than evidence about the model. The true signal is best_iter=1, which does not depend on θ.

Step 4 MUST fix this : either run a θ-sweep on the trained booster's proba distribution or skip the f1_buy measurement entirely and rely on AUC / loss curves.
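
A minimal sketch of the θ-sweep option, assuming the validation labels and the booster's validation probabilities are already available as plain sequences. The function names (`f1_at_theta`, `theta_sweep`) and the toy data are hypothetical; only the output field names are taken from this dossier. Pure Python is used so the block runs without dependencies; the real patch would presumably operate on numpy arrays.

```python
from __future__ import annotations


def f1_at_theta(y_true: list[int], y_proba: list[float], theta: float) -> tuple[float, int]:
    """Return (F1 of the positive class, number of predicted positives) at threshold theta."""
    tp = fp = fn = 0
    for label, proba in zip(y_true, y_proba):
        predicted = proba >= theta
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return (2 * tp / denom if denom else 0.0), tp + fp


def theta_sweep(y_true, y_proba, start=0.05, stop=0.50, step=0.05) -> dict:
    """Scan thresholds in [start, stop] and keep the first one reaching the highest F1."""
    best = {
        "best_f1_at_theta_sweep": 0.0,
        "theta_at_best_f1": None,
        "n_trades_at_best_theta": 0,
    }
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        theta = round(start + i * step, 10)  # rebuild theta each step to avoid float drift
        f1, n_trades = f1_at_theta(y_true, y_proba, theta)
        if f1 > best["best_f1_at_theta_sweep"]:
            best = {
                "best_f1_at_theta_sweep": f1,
                "theta_at_best_f1": theta,
                "n_trades_at_best_theta": n_trades,
            }
    return best


# Toy data (hypothetical): every proba sits below 0.5, as in the Step 3 run.
y_true = [1, 1, 0, 0, 0, 1]
y_proba = [0.32, 0.28, 0.11, 0.09, 0.26, 0.12]
print(f1_at_theta(y_true, y_proba, 0.5))  # control measurement at the old hardcoded θ
print(theta_sweep(y_true, y_proba))
```

On data like this the θ=0.5 control reports no trades at all, while the sweep still recovers a usable threshold, which is the distinction Step 4 needs to make.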

2. Amended Step 4 scope — data forensics

Drop the 7-stage replay matrix. Replace with a focused data inspector that runs the §5.3 invariants 1, 2 (data parity) + a tightly-scoped §5.3.4 (iter-1 training behavior) on the captured fold.

2.1 The forensic checks (in §5.3 stop-at-first-divergence order)

| # | Check | Rules out / confirms | Implementation |
|---|---|---|---|
| F1 | Label-index alignment | H3 off-by-one | assert (X_train.index == y_train.index).all() ; ditto val |
| F2 | Label distribution sanity | H3 mapping bug | Already confirmed by Loki : positive rate = 17.5% train / 16.9% val. Reconfirm + emit |
| F3 | NaN ratio per feature | H5 NaN leak | X.isna().mean(axis=0).max() ; flag features with > 50% NaN |
| F4 | Feature-label correlation | H5 leak feature | np.corrcoef(X[:, j], y) for each j ; top 10 by abs(corr). If max abs(corr) > 0.95 → leak |
| F5 | Train vs val feature drift | H5 scaling drift | per-feature mean/std on train vs val ; flag features with > 3σ shift |
| F6 | Iter-1 train AUC + θ-sweep | smoking gun | Train 1 single tree with GRID_DEFAULT ; report iter1_train_auc, iter1_val_auc, full proba_val distribution stats (min/max/mean/std), best_f1_at_theta_sweep (θ ∈ [0.05, 0.50] step 0.05), theta_at_best_f1, n_trades_at_best_theta |

F6 interpretation :
- iter1_train_auc ≈ 1.0 AND iter1_val_auc ≈ 1.0 → leak feature in both splits
- iter1_train_auc ≈ 1.0 AND iter1_val_auc ≈ 0.5 → leak on train only, or val labels misaligned
- iter1_train_auc ≈ 0.5 → label-feature mismatch on train
- iter1_val_auc > 0.85 AND best_f1_at_theta < 0.20 → model over-polarised (scale_pos_weight or objective mismatch : all probas in a narrow zone, no θ recovers f1)
- iter1_train_auc ∈ [0.65, 0.85] AND flat val curve → train/val drift

Stop-at-first-divergence : if F1 fails, report and stop. Each later check runs only if every earlier check passed (F2 needs F1 OK, F3 needs F2 OK, and so on).
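
The ordering of the data-side checks can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: features are plain column lists rather than DataFrames, NaN ratio is measured on train only, and "3σ shift" is read as the val mean moving by more than three train standard deviations. All function names are hypothetical.

```python
from __future__ import annotations

import math


def _finite(column):
    """Drop None/NaN entries from a column."""
    return [v for v in column if v is not None and not math.isnan(v)]


def nan_ratio(column) -> float:
    """F3: share of missing values in one feature column."""
    return 1.0 - len(_finite(column)) / len(column)


def pearson(xs, ys) -> float:
    """F4: Pearson correlation between a feature and the labels, skipping NaN rows."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and not math.isnan(x)]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: correlation undefined, treated as no signal
    return cov / math.sqrt(vx * vy)


def drift_sigmas(train_col, val_col) -> float:
    """F5 (assumed definition): |mean_val - mean_train| in units of train std."""
    train, val = _finite(train_col), _finite(val_col)
    if not train or not val:
        return math.inf
    mean_t = sum(train) / len(train)
    std_t = math.sqrt(sum((v - mean_t) ** 2 for v in train) / len(train))
    shift = abs(sum(val) / len(val) - mean_t)
    if std_t == 0.0:
        return 0.0 if shift == 0.0 else math.inf
    return shift / std_t


def first_divergence(train, val, y_train, nan_thr=0.5, corr_thr=0.95, drift_thr=3.0):
    """Run F3 → F4 → F5 in order; return (check, feature) for the first failure, else None."""
    for name, col in train.items():
        if nan_ratio(col) > nan_thr:
            return ("F3", name)
    for name, col in train.items():
        if abs(pearson(col, y_train)) > corr_thr:
            return ("F4", name)
    for name, col in train.items():
        if drift_sigmas(col, val[name]) > drift_thr:
            return ("F5", name)
    return None


# Toy fold (hypothetical): the second feature is a copy of the label.
y = [0, 1, 0, 1, 0, 1, 0, 1]
train = {
    "ret_1h": [0.1, 0.3, -0.2, 0.2, 0.0, 0.4, -0.1, 0.1],
    "future_label_copy": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}
val = {"ret_1h": [0.0, 0.2, 0.1, 0.3], "future_label_copy": [0.0, 1.0, 0.0, 1.0]}
print(first_divergence(train, val, y))
```

Because the function returns on the first failing check, a leak feature is reported as F4 even if the same fold would also fail F5, which matches the stop-at-first-divergence rule.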

2.2 Why ONE tree at iter 1 is the right probe

The 51 Phase A trials all stop at iter 1-2, so iter 1 is the locus of the divergence. A single tree at iter 1 gives us :
- train_auc → answers "does the first tree learn the train labels ?" (label-vs-feature alignment)
- val_auc → answers "does the first tree generalize to val ?" (train/val parity)
- proba distribution → answers "are the predictions concentrated near 0.5 or polarized ?"

If iter 1 train_auc ≈ 1.0 AND val_auc ≈ 1.0 → leak feature (the data has a "future leak"). If iter 1 train_auc ≈ 1.0 AND val_auc ≈ 0.5 → leak only on train (val labels misaligned). If iter 1 train_auc ≈ 0.5 → label-feature mismatch on train.

We don't need 7 stages. We need 1 tree + correlation matrix.
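
To make the iter-1 reading concrete, here is a hedged sketch of how the two AUC values could be computed and mapped to a diagnosis. The rank-based `auc` is a dependency-free stand-in for whatever metric implementation the harness uses; `interpret_iter1`, the tolerance of 0.05 used for "≈", and the returned strings are all assumptions made for illustration.

```python
from __future__ import annotations


def auc(y_true: list[int], scores: list[float]) -> float:
    """Rank-based AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def interpret_iter1(
    train_auc: float,
    val_auc: float,
    best_f1_at_theta: float | None = None,
    val_curve_flat: bool = False,
    tol: float = 0.05,  # assumed meaning of "≈"
) -> str:
    """Map iter-1 metrics to the interpretation branches, in the order they are listed."""

    def near(value: float, target: float) -> bool:
        return abs(value - target) <= tol

    if near(train_auc, 1.0) and near(val_auc, 1.0):
        return "leak feature in both splits"
    if near(train_auc, 1.0) and near(val_auc, 0.5):
        return "leak on train only, or val labels misaligned"
    if near(train_auc, 0.5):
        return "label-feature mismatch on train"
    if val_auc > 0.85 and best_f1_at_theta is not None and best_f1_at_theta < 0.20:
        return "model over-polarised"
    if 0.65 <= train_auc <= 0.85 and val_curve_flat:
        return "train/val drift"
    return "inconclusive"


# Toy scores (hypothetical): one negative outranks one positive.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(interpret_iter1(train_auc=0.99, val_auc=0.98))
```

The branches are evaluated in listing order, so a fold with both AUCs near 1.0 is reported as a leak before the over-polarisation branch is considered.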

2.3 Decision matrix → next step

| Outcome | Verdict / next step |
|---|---|
| F1 fail | → H3 confirmed → S19 = fix the temporal join / dataset construction |
| F3 fail (NaN > 50%) | → H5 confirmed → S19 = fix the FE pipeline NaN handling |
| F4 fail (corr > 0.95) | → leak feature → identify the feature + remove from FE pipeline |
| F5 fail | → train/val drift → fix split / scaling in FE pipeline |
| F6 iter-1 train_auc ≈ 1.0 | corroborates F4 |
| F6 iter-1 train_auc ≈ 0.5 | corroborates F1 or F3 |
| All checks pass | Escalate to committee : the bug is upstream of the captured parquet (FTF cache, labeling, or somewhere we didn't capture) |

2.4 Module API

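# src/commun/finetune/diagnostic/s18_step4_invariants.py

from __future__ import annotations  # allows "str | None" on older Python versions

from dataclasses import dataclass
from pathlib import Path
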
@dataclass(frozen=True)
class ForensicResult:
    check_label: str       # "F1_label_index_alignment", etc.
    status: str            # "PASS" | "FAIL"
    notes: str
    metrics: dict          # check-specific quantitative payload

@dataclass(frozen=True)
class InvariantVerdict:
    status: str            # "DIVERGENCE_LOCATED" | "NO_DIVERGENCE" | "INCONCLUSIVE"
    first_divergence: str | None  # "F4" if F4 failed first
    implicated_hypothesis: list[str]  # ["H5"] etc.
    results: list[ForensicResult]
    notes: str


def run_step4_invariants(
    crypto: str,
    fold_id: int,
    parquet_path: Path,
    artifact_dir: Path,
    leak_corr_threshold: float = 0.95,
    nan_ratio_threshold: float = 0.5,
    drift_sigma_threshold: float = 3.0,
    iter1_grid_params: dict | None = None,  # falls back to DEFAULT_LGB_PARAMS
    raise_on_no_divergence: bool = True,
) -> InvariantVerdict:
    """Run forensic checks F1-F6 on the captured fold. Stop at first divergence."""

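The way a verdict could be assembled from the individual check results is sketched below, assuming each check is wrapped in a zero-argument callable. The dataclasses are re-declared so the block is self-contained; `assemble_verdict`, `HYPOTHESIS_BY_CHECK`, and the use of `RuntimeError` are hypothetical choices, and the real module would also log every event before raising.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ForensicResult:
    check_label: str
    status: str  # "PASS" | "FAIL"
    notes: str
    metrics: dict


@dataclass(frozen=True)
class InvariantVerdict:
    status: str  # "DIVERGENCE_LOCATED" | "NO_DIVERGENCE" | "INCONCLUSIVE"
    first_divergence: str | None
    implicated_hypothesis: list[str]
    results: list[ForensicResult]
    notes: str


# Assumed mapping, read off the forensic checks table (F6 is a corroborating probe).
HYPOTHESIS_BY_CHECK = {"F1": ["H3"], "F2": ["H3"], "F3": ["H5"], "F4": ["H5"], "F5": ["H5"]}


def assemble_verdict(
    checks: list[tuple[str, Callable[[], ForensicResult]]],
    raise_on_no_divergence: bool = True,
) -> InvariantVerdict:
    """Run checks in order, stop at the first FAIL, and build the verdict."""
    results: list[ForensicResult] = []
    for check_id, run in checks:
        result = run()
        results.append(result)
        if result.status == "FAIL":
            return InvariantVerdict(
                status="DIVERGENCE_LOCATED",
                first_divergence=check_id,
                implicated_hypothesis=HYPOTHESIS_BY_CHECK.get(check_id, []),
                results=results,
                notes=f"stopped at {check_id}",
            )
    verdict = InvariantVerdict("NO_DIVERGENCE", None, [], results, "all checks passed")
    if raise_on_no_divergence:
        # A real implementation would emit its log events before this point.
        raise RuntimeError(f"s18_step4 verdict={verdict.status}")
    return verdict


# Toy run (hypothetical): F3 fails, so F4 is never executed.
executed: list[str] = []


def make_check(check_id: str, status: str) -> Callable[[], ForensicResult]:
    def run() -> ForensicResult:
        executed.append(check_id)
        return ForensicResult(check_id, status, "", {})
    return run


demo = assemble_verdict(
    [("F1", make_check("F1", "PASS")), ("F3", make_check("F3", "FAIL")), ("F4", make_check("F4", "PASS"))]
)
print(demo.status, demo.first_divergence, demo.implicated_hypothesis, executed)
```

Keeping the checks as callables means later checks cost nothing when an earlier one fails, which is the point of the stop-at-first-divergence order.
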
2.5 DAG pattern

Chain Phase A (Step 1 capture, ~19 min) + Phase B (forensics, ~10 s) in a single pod — same pattern as #935. Operator triggers, gets verdict in ~20 min, kills cycle.

dags/dag_diagnostic__s18_step1_4_chain.py

Same params as #935 + optional skip_phase_a (re-use existing artifacts when present).

3. Open questions

  1. Cross-fold : reproduce on a second crypto×fold cell only if F-check verdict is ambiguous, NOT pre-emptively (saves 20 min cycle).
  2. Leak feature detection threshold : |corr| > 0.95 is a strong heuristic but may miss multi-feature leaks. Sufficient for v1 ; if F6 contradicts F4 we add a SHAP top-10 in a follow-up.
  3. Which AUC reading counts as diagnostic ? Both the harness and the legacy path use AUC, so the metric itself is not in question. An iter-1 val AUC > 0.9 indicates a leak ; an AUC that plateaus immediately around 0.7 indicates something more subtle. The operator interprets the F6 output.

4. Acceptance criteria

  • Module s18_step4_invariants.py implementing F1-F6 forensic checks
  • DAG dag_diagnostic__s18_step1_4_chain.py same-pod pattern as #935
  • Retroactive Step 3 fix (bundled in same PR — Fix B from operator review 2026-05-14) : s18_step3_parity.py θ=0.5 hardcoded replaced by θ-sweep [0.05, 0.50] step 0.05. New PathTrace fields : best_f1_buy_at_theta_sweep, theta_at_best_f1, final_n_trades_val_at_best_theta. final_f1_buy_val kept (now means f1@θ=0.5, documented as control). Rationale : Step 3 will be re-run (cross-fold + post-fix validation) — without the patch every run loses the f1 signal due to scale_pos_weight compressing probas below 0.5.
  • Unit tests for each F-check : synthetic data triggers expected verdict
  • Unit test for the θ-sweep patch on Step 3 (final_predictions_summary carries the new fields, verdict logic unchanged)
  • ADR-25 compliance : task red on status != "DIVERGENCE_LOCATED", all events logged before raise
  • ADR-31/32/33 : event=s18_step4_* key=value
  • CR full cycle ; committee pr_review (substantive diagnostic code)

5. Effort estimate

| Phase | Effort |
|---|---|
| Module impl + unit tests | ~4h |
| DAG + integration | ~1h |
| PR + CR rounds | ~1 day |
| Trigger + verdict + interpretation | ~30 min after deploy |
| Total to verdict | ~1.5 days |

vs the v1 7-stage matrix (~2 days impl + ambiguous output). The forensic approach is faster AND has clearer decision branches.

6. Risks

| Risk | Likelihood | Mitigation |
|---|---|---|
| All F-checks pass (NO_DIVERGENCE) : bug is upstream of the parquet | medium | Verdict explicitly flags this and escalates to committee with all check metrics |
| Captured parquet differs from prod cache (Step 1 capture has its own shape) | low | Step 1 monkey-patches lgb.Dataset but doesn't transform features ; what's captured IS what the harness fed to lgb |
| Step 1 re-capture cycle is 19 min and the pod still GCs the parquet | low | Run forensics in the SAME pod (chained DAG), not a separate Step 4 pod |
| Leak feature exists but is < 0.95 corr (gradual leak) | medium | Single-tree iter-1 train_auc is the canary : if it's >>0.9 with corr<0.95, escalate to SHAP analysis |

7. Next steps

  1. Operator validates this design (this file v2).
  2. Optional committee plan_review if the operator wants ADR-68 cover (the substantive change vs v1 is the scope reduction — committee is fast-path acceptable but not strictly mandatory since the parent §5.4 already covered "data invariants").
  3. Issue + branch post-validation : feat/CVN-N001-EE-S18-step4-data-forensics
  4. Implementation per AC §4.
  5. Trigger → verdict → Step 5.