
Plan dossier — CVN-N001-EE-S18 — Harness shallow-training regression diagnostic

Date : 2026-05-13
Story : CVN-N001-EE-S18 (GH #909, OP wp#154)
Plan_review verdict : session 4298520f PASSED / OK / strong consensus, OP Meeting #133. 0 blockers, 6 forward-looking recos (strict budget cap adherence, trading posture confirmation, bisect audit trail, staging-env validation pre-Step 1, expand parity reproducer to all 3 models + per-iter AUC, cross-validate on additional folds if Step 4 ambiguous), all accepted, integrated in §5.2 + §9 implementation.
Type : diagnostic Story (not a fix-in-flight). Reports root cause + recommends remediation Story scope. No PERMANENT production code change in S18 itself ; temporary instrumentation allowed under §8 amendment.

Amendment 2026-05-13 (post-PASS reviewer feedback) : §5 has been rewritten to operationalise the diagnostic as a deterministic divergence isolation process. Key additions : §5.1 7-hypothesis space (H1-H7) ordered by likelihood ; §5.3 6 mandatory parity invariants with stop-at-first-divergence rule ; §5.2 Step 4 replaced (4-stage replay A→D, not commit-bisect) ; §5.4 explicit probability-distribution diagnostics ; §7.1 escalation rule tightened (Step 2, not Step 3) ; §8 temporary instrumentation allowed (must be removed before closure). The plan_review committee verdict 4298520f PASSED OK stands ; this is a refinement of the methodology under the same scope.

Author : Dominique (operator) + Claude

Predecessor : CVN-N001-EE-S17 (PR #904 merged, OP wp#153). S17 externalized 455 hyperparams to PG Console (ADR-90) and correctly bundled byte-for-byte legacy values, but the post-merge FTF canary on defi_top5 5m ATR0.5_1.5_H4 exposed that the regression introduced by the #891 harness migration (which predates S17) is still active. Training stops after 1-2 iterations for LightGBM and 11-26 for XGBoost. The regression is masked for LGB+CB by the val-tuned θ-sweep, which lowers the threshold to 0.2 to compensate, and is exposed crudely on XGBoost, which stays at fixed θ=0.5 and produces a near-empty buy classifier (f1_buy mean 0.089).


1. Request to the reviewer

The committee is asked to validate the following decisions :

  1. Root-cause hypothesis ranking : the 7 hypotheses (H1-H7) in §5.1, ordered by my priors. Is the ordering reasonable, or do you see a more likely root cause I missed ?
  2. Diagnostic methodology — the proposed "fold capture + dual-path replay" approach in §5. Is this sufficient to converge on a single root cause without changing prod ?
  3. What is explicitly out-of-scope for S18 (§8) — namely, S18 stops at the diagnostic verdict + remediation Story scope ; the actual fix is a follow-up Story. Is this scope discipline acceptable, or should S18 include a hot-path remediation if the root cause is small ?
  4. Success criteria (§7) — the verdict tree LOCK / KEEP_AVAILABLE / ABANDON adapted to a diagnostic Story. Is the decision tree complete ?
  5. Risk tolerance — during the diagnostic, FTF canaries continue to fail (LGB+CB at f1≈0.36, XGB collapsed). The operator's stated tolerance is "no live trading impact, FTF leaderboard is informational". Is this risk envelope acceptable, or should the harness be helm-rollback'd to pre-#891 during S18 ?
  6. Verdict tree — recommend among PASSED / PASSED-WITH-CHANGES / NEEDS-REVISION / REJECTED with rationale per finding.

2. Project context

CVNTrade is a crypto ML trading system. The training pipeline runs daily-ish FTF sweeps across 5 cryptos × 3 model types (XGBoost / LightGBM / CatBoost) × 3 folds × N Optuna trials. Each trial trains a binary BUY/HOLD classifier on 5-min OHLCV+derived features (~319 features post-FE), with class imbalance ~18% positive.

The "training harness" (src/training/harness/, ADR-89) is a Hamilton-based plugin registry that replaced the legacy train_with_fixed_params* autonomous trainers in PR #891 (CVN-N001-EE-S16) + #896 + #899 (CVN-N001-EE-S16 Phase 4.1 cutover that DELETED the legacy code paths).

S17 (PR #904 merged 2026-05-12) externalized 455 training hyperparameters into PG ftf_config.base_env (Console-managed surface, ADR-90), with byte-for-byte parity tests pinning the bundled values against the pre-#899 git history.


3. Observed problem (numbers, not adjectives)

3.1 FTF run ftf_20260512_184337_fdee27_ATR0.5_1.5_H4 (post-S17 canary)

| Property | Value |
|---|---|
| Trigger | operator manual Airflow UI trigger post-S17 seed --apply of 452 keys (3 EQUAL pre-existing) |
| Started | 2026-05-12 18:44:09 UTC |
| Completed | 2026-05-13 04:15:32 UTC |
| Duration | 9h 31min (34 282 s) |
| Factor | model_type (3 variants : xgboost / lightgbm / catboost) |
| Cryptos | UNIUSDC, OPUSDC, ARBUSDC, AAVEUSDC, LDOUSDC |
| Strategy | ATR0.5_1.5_H4, timeframe 5m |
| Folds | 3 (fold_id 2, 3, 4) |
| Results | 45 rows × finetune_results, 0 errors |
| event=hpo_fallback_applied count | 0 : Console seeding 100 % coverage ✅ (S17 validated) |

3.2 Per-model aggregate metrics (FTF run, mean across 15 (crypto × fold) cells per model)

| Model | f1_buy mean | f1_buy [min, max] | precision_mean | recall_mean | n_trades_mean | sortino_mean | training_time_mean |
|---|---|---|---|---|---|---|---|
| catboost | 0.389 | [0.333, 0.441] | 0.295 | 0.591 | 154 | -1.76 | 1 322 s |
| lightgbm | 0.368 | [0.309, 0.428] | 0.261 | 0.664 | 169 | -2.59 | 8 119 s |
| xgboost | 0.089 | [0.013, 0.211] | 0.650 | 0.051 | 19 | +0.39 | 1 307 s |

3.3 Reference baselines

| Run | f1_buy (defi_top5 5m ATR0.5_1.5_H4) | Notes |
|---|---|---|
| Pre-#891 (legacy autonomous trainers) | ≈ 0.42 | The "what we want back" target |
| Post-#891 (harness migration, pre-S17 seeding) | ≈ 0.22 | The 2026-05-11 incident that motivated S17 |
| Post-S17 (this run) | LGB 0.37, CB 0.39, XGB 0.09 | LGB+CB up from 0.22 ; XGB completely collapsed |
| Expected post-S17 (S17 plan dossier §4 W1 canary criterion) | ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types | NOT MET for any model type, catastrophically not met for XGB |

3.4 The smoking gun — best_iteration distribution from Loki event=training_complete

Captured from {namespace="cvntrade"} |= "event=training_complete" over the FTF window :

| Model | Trials captured | best_iter p25 | best_iter p50 | best_iter p75 | min | max |
|---|---|---|---|---|---|---|
| lightgbm | 53 | 1 | 1 | 1 | 1 | 2 |
| catboost | 147 | 8 | 14 | 55 | 2 | 336 |
| xgboost | 500 | 11 | 17 | 26 | 3 | 155 |

Translation : LightGBM stops training at iteration 1 or 2 in every captured trial (min 1, max 2) ; the val metric (AUC, first_metric_only=True) peaks immediately, then never improves through 50 early-stopping rounds. XGBoost typically stops between iterations 11 and 26 (p25 to p75, median 17) ; the val metric usually peaks within the first 30 rounds, then never improves through 150 early-stopping rounds. CatBoost has a bimodal distribution (some trials reach 336 iterations, others stop at 2) : partial pathology.
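The early-stopping behaviour described above can be illustrated with a simplified, library-independent model of patience-based stopping. This is a sketch only : it is not LightGBM's or XGBoost's implementation, and both validation curves are hypothetical values invented for the illustration.

```python
from typing import Sequence, Tuple


def simulate_early_stopping(val_metric: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (best_iteration, last_iteration_run), both 1-indexed.

    Higher is better (e.g. AUC). Training halts once `patience` iterations
    have elapsed since the best value without any improvement.
    """
    best_value = float("-inf")
    best_iteration = 0
    for iteration, value in enumerate(val_metric, start=1):
        if value > best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(val_metric)


# Hypothetical pathological curve: val AUC peaks on round 1, then drifts lower.
pathological = [0.62] + [0.61 - 0.0001 * i for i in range(200)]
# Hypothetical healthy curve: val AUC improves for 120 rounds, then plateaus lower.
healthy = [0.55 + 0.001 * i for i in range(120)] + [0.66] * 80

print(simulate_early_stopping(pathological, patience=50))  # (1, 51)
print(simulate_early_stopping(healthy, patience=50))       # (120, 170)
```

With a patience of 50, the pathological curve yields best_iteration=1 after only 51 rounds of work, which is the shape of the LightGBM rows in the table above.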

A model that hits its best val score at iteration 1 has effectively not learned ; it produces a near-random probability distribution. LGB and CB compensate via the val-tuned θ-sweep, which lowers the decision threshold to ~0.2, capturing the noisy tail of the distribution and grabbing enough BUYs to score f1_buy ≈ 0.37. XGB does not run the θ-sweep : per committee verdict 8.3 of 2026-05-09-cvn-n001-ee-s16-pr-review-dossier.md §6.3, XGB stays at fixed θ=0.5 because the legacy walk-forward calibrator handled its threshold separately, and that calibrator is NOT wired into the harness DAG yet. Its near-random proba therefore rarely crosses 0.5, and it produces 1-44 trades per fold vs 100-280 for LGB/CB.
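A minimal sketch of why a val-tuned θ-sweep masks a weak model while a fixed θ=0.5 exposes it. The function names, the `>=` comparison and the 20-candidate grid are assumptions for illustration (the grid size echoes n_candidates=20 in the Loki events) ; the ten-sample dataset is synthetic, not taken from the FTF run.

```python
from typing import Sequence, Tuple


def f1_buy(y_true: Sequence[int], proba: Sequence[float], theta: float) -> float:
    """F1 of the BUY class (label 1) when predicting BUY iff proba >= theta."""
    tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= theta)
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= theta)
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p < theta)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_theta(y_true: Sequence[int], proba: Sequence[float],
                candidates: Sequence[float]) -> Tuple[float, float]:
    """Return (theta, f1) with the highest F1; ties keep the lowest theta."""
    best = (candidates[0], f1_buy(y_true, proba, candidates[0]))
    for theta in candidates[1:]:
        score = f1_buy(y_true, proba, theta)
        if score > best[1]:
            best = (theta, score)
    return best


# Synthetic example: 2 BUY rows out of 10, no probability above 0.30.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
proba = [0.30, 0.22, 0.25, 0.15, 0.12, 0.10, 0.18, 0.08, 0.19, 0.05]
candidates = [round(0.05 * k, 2) for k in range(1, 21)]  # 0.05 .. 1.00

theta_picked, f1_swept = sweep_theta(y_true, proba, candidates)
f1_fixed = f1_buy(y_true, proba, 0.5)
print(theta_picked, round(f1_swept, 3), f1_fixed)  # 0.2 0.8 0.0
```

On this toy data the sweep settles on θ=0.2 and recovers a usable F1, while the fixed θ=0.5 path predicts no BUY at all. Both paths consume the same probabilities, so the difference is entirely in the threshold.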

3.5 Sample Loki training_complete events (verbatim)

lightgbm best_iteration=1 training_time_sec=1.5 theta_picked=0.2 f1_buy_val=0.3477 f1_buy_at_threshold=0.3477 n_candidates=20
lightgbm best_iteration=1 training_time_sec=1.6 theta_picked=0.2 f1_buy_val=0.3470 f1_buy_at_threshold=0.3470 n_candidates=20
lightgbm best_iteration=2 training_time_sec=1.7 theta_picked=0.2 f1_buy_val=0.3291 f1_buy_at_threshold=0.3291 n_candidates=20

xgboost  best_iteration=24 training_time_sec=2.502 theta_picked=0.5 f1_buy_val=0.0601
xgboost  best_iteration=15 training_time_sec=2.685 theta_picked=0.5 f1_buy_val=0.0056
xgboost  best_iteration=7  training_time_sec=2.1   theta_picked=0.5 f1_buy_val=0.0056
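For readers who want to reproduce the aggregates from raw events, the sketch below parses lines in the shape shown above and computes a per-model median of best_iteration. The helper names are illustrative, and the parser assumes the whitespace-separated `key=value` layout of these six sample lines, not the full Loki record format.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List

SAMPLE_LINES = [
    "lightgbm best_iteration=1 training_time_sec=1.5 theta_picked=0.2 f1_buy_val=0.3477 f1_buy_at_threshold=0.3477 n_candidates=20",
    "lightgbm best_iteration=1 training_time_sec=1.6 theta_picked=0.2 f1_buy_val=0.3470 f1_buy_at_threshold=0.3470 n_candidates=20",
    "lightgbm best_iteration=2 training_time_sec=1.7 theta_picked=0.2 f1_buy_val=0.3291 f1_buy_at_threshold=0.3291 n_candidates=20",
    "xgboost  best_iteration=24 training_time_sec=2.502 theta_picked=0.5 f1_buy_val=0.0601",
    "xgboost  best_iteration=15 training_time_sec=2.685 theta_picked=0.5 f1_buy_val=0.0056",
    "xgboost  best_iteration=7  training_time_sec=2.1   theta_picked=0.5 f1_buy_val=0.0056",
]


def parse_event(line: str) -> Dict[str, str]:
    """Split 'model key=value key=value ...' into a flat dict of strings."""
    model, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {"model": model, **fields}


def median_best_iteration(lines: Iterable[str]) -> Dict[str, float]:
    """Group events by model and return the median best_iteration per model."""
    by_model: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        if line.strip():
            event = parse_event(line)
            by_model[event["model"]].append(int(event["best_iteration"]))
    return {model: statistics.median(values) for model, values in by_model.items()}


print(median_best_iteration(SAMPLE_LINES))  # {'lightgbm': 1, 'xgboost': 15}
```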

3.6 Class balancing audit (Loki event=class_balance_applied)

xgboost  binary=True scale_pos_weight=4.4659 n_pos=1597 n_neg=7132 imbalance_ratio=4.47

Class balancing is correctly applied (scale_pos_weight=4.47 matches the 18.3 % positive rate). The class imbalance handling is NOT the regression source.
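The consistency claim can be checked by hand from the logged counts. The snippet assumes the usual convention scale_pos_weight = n_neg / n_pos ; the variable names are illustrative.

```python
# Counts copied from the class_balance_applied event above.
n_pos, n_neg = 1597, 7132

scale_pos_weight = n_neg / n_pos          # assumed convention: negatives / positives
positive_rate = n_pos / (n_pos + n_neg)   # share of BUY labels in the training fold

print(round(scale_pos_weight, 4))      # 4.4659, matching the logged value
print(round(100 * positive_rate, 1))   # 18.3 (percent)
```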


4. What has already been tried (and why it did not work)

| Attempt | Date | Outcome |
|---|---|---|
| Patch defaults in source + helm upgrade | 2026-05-11 | Failed silently. Live Loki confirmed event=training_started learning_rate=0.016149 even after deploying an image with learning_rate=0.1 hardcoded : Optuna's suggest_* overrides every in-code default. This established the operator's rule "Rule 1 : a parameter change should never require a new deployment". |
| S17 : externalize hyperparameters to PG Console | 2026-05-12 | Did its job correctly. Seeded 455 keys byte-for-byte matching pre-#899 legacy values, zero event=hpo_fallback_applied post-merge, Optuna picks (e.g. LGB learning_rate=0.081 ... 0.099) ARE within the seeded ranges. But this fixed the wrong layer : the regression is in the training itself, not the hyperparameter values. |
| Helm tag pin to pre-S17 SHA | not attempted | Would not help : the regression predates S17 (it was introduced by the #891 harness migration). Pre-S17 (post-#891) was already at f1=0.22. The harness binary is the issue. |

The pattern : we kept fixing layers above the actual broken layer. S18 goes one layer deeper, into the training itself.


5. Hypothesis and proposed plan

Amendment (2026-05-13, post-PASS) : diagnostic discipline reinforcement. The plan_review committee verdict 4298520f PASSED OK strong consensus stands (0 blockers, 6 forward-looking recos integrated). Post-PASS reviewer feedback flagged that "byte-for-byte parity reproducer" was under-specified, that the bisect step was too vague to guarantee bug-finding (likelihood ~60 %), and that the hypothesis space was 4 broad bins rather than 7 actionable ones. This §5 has been rewritten to operationalise the diagnostic as a deterministic divergence isolation process, not a heuristic investigation. Each step enforces strict parity invariants between legacy and harness execution, and the process stops at the first observed divergence to localise the regression with certainty. The amendment adds :

  • (a) §5.1 : a 7-hypothesis space ordered by likelihood ;
  • (b) §5.3 : mandatory parity invariants per step ;
  • (c) §5.2 Step 4 : rewritten as deterministic divergence localisation (4-stage replay A→D, not commit-bisect) ;
  • (d) §5.4 : explicit probability-distribution diagnostics (separates "model never crosses θ" from "θ broken") ;
  • (e) §7.1 : escalation rule tightened (stop at end of Step 2 if no divergence, not Step 3) ;
  • (f) §8 : amended to allow temporary instrumentation (must be removed before closure).

5.1 Seven root-cause hypotheses, ordered by likelihood (amended)

The hypothesis space is enumerated to guide the divergence isolation — each step of §5.2 maps to invariants in §5.3 that selectively confirm or rule out hypotheses below.

| # | Hypothesis | Likelihood | Mapped invariants (§5.3) |
|---|---|---|---|
| H1 | Early stopping misconfigured : the harness sets early_stopping_rounds / metric / first_metric_only in a way that triggers stop on the first val evaluation (e.g. AUC saturates at iter 1 with scale_pos_weight=4.47). Most likely because LGB best_iter ≤ 2 across 53/53 trials (best_iter=1 at p25, p50 and p75) is exactly the failure mode of mis-set early stopping on the val metric. | highest | §5.3.3 hyperparams + §5.3.4 training behavior (best_iter + per-iter eval curve) |
| H2 | eval_set incorrect (train used as val) : the harness valid_sets=[val_set] may be receiving the train Dataset instead, OR the val Dataset is constructed with reference=train_set in a way that shares the bin edges incorrectly. Val AUC then mirrors train AUC and peaks at iter 1 on trivial early splits. | high | §5.3.1 dataset (val n_samples + label dist) + §5.3.4 per-iter eval curve |
| H3 | Labels misaligned (shift / mapping) : the harness Dataset construction or the temporal split shifted the label vector by 1 row (off-by-one timestamp join), producing val labels that are random noise vs the val features. Model can't generalise → val metric peaks early. | high | §5.3.1 label distribution + index equality + dataset hash |
| H4 | Sample weights broken (class imbalance handling) : scale_pos_weight=4.47 vs sample_weight=[…] semantics differ between LGB Dataset and the harness path. Wrong weighting → first iter overshoots → val metric never recovers. | medium | §5.3.3 hyperparams (effective_params + class_balance applied) + §5.3.4 training loss curve first 5 iters |
| H5 | Feature matrix corrupted (NaN / scaling) : the harness FE pipeline output differs from legacy. Either NaN leaking into LGB (which treats NaN as a separate split direction) or a feature scaled differently. Model splits on the NaN ↔ no-NaN partition immediately. | medium | §5.3.2 feature parity (feature list + matrix hash + NaN ratio) |
| H6 | Objective / metric mismatch : the harness passes metric="auc" but the legacy used metric="binary_logloss" (or vice versa) ; AND/OR the metric direction is inverted (maximise vs minimise) in one path. AUC peaking at iter 1 is consistent with a metric that's already saturated by the class-weight setup. | medium | §5.3.3 (eval metric name) + §5.3.4 (per-iter metric value direction) |
| H7 | HPO override of critical params : Optuna's suggest_* in _hpo_space returns a param that the legacy never picked (e.g. num_leaves=15 with max_depth=10 produces unbalanced trees that overfit iter 1). The seeded ranges look right (verified S17 audit), but the COMBINATION at trial-zero might be pathological. | lowest | §5.3.3 (effective_params snapshot at the captured trial) + Step 4 Stage C (harness + no HPO) |

Goal of the diagnostic : confirm or rule out each hypothesis using §5.3 invariants. The process stops at the first divergence observed in §5.3, not when all hypotheses are tested. The hypothesis space exists to give the operator a checklist when reading the reproducer output — not to prescribe an exhaustive trial-by-trial search.

5.2 Diagnostic methodology — fold capture + dual-path replay

Step 0 (committee reco #4) — Pre-validate the captured fold in staging. Before Step 1 burns budget on prod-shape data, sanity-check that a local replay of the chosen fold reproduces the FTF-observed metrics within ε=0.005 on f1_buy_val. If the local replay diverges, ESCALATE BEFORE proceeding — the divergence itself is a secondary diagnostic signal (env drift) requiring its own Story before S18 can succeed. Wall budget : 0.25 day.

Step 1 : Capture a reference fold from the FTF run. Pick the worst LGB cell (AAVEUSDC fold=3, f1_buy=0.352, best_iter=1) for high signal/noise. Re-run the harness DAG manually with verbose logging to capture :

  • lgb_native_params (the dict actually passed to lgb.train) ;
  • lgb.Dataset construction args (label distribution, feature_count, presence of categorical_feature) ;
  • the val_set construction (same checks) ;
  • per-iteration AUC + binary_logloss for the first 30 iterations (lgb.log_evaluation(period=1) instead of period=0).

Step 2 — Reconstruct the pre-#891 legacy training call. The legacy train_with_fixed_params_lgbm was deleted in #899. We replay from git :

git show e75418ca^:src/training/LightGBM/cvntrade_LightGBM_config.py | grep -A 30 "def train_with_fixed_params"
git show e75418ca^:src/training/LightGBM/grid_utils.py | sed -n '26,275p'

Pull the exact lgb.train call signature + Dataset construction + early-stopping args + class weight handling from that pre-#899 snapshot. Compare side-by-side with the harness lgb_booster_and_time at src/training/harness/dags/models/lightgbm_dag.py:103.

Step 3 : Build a minimal byte-for-byte parity reproducer (committee reco #5, expanded scope). Single Python script in scripts/diagnostic_s18_harness_vs_legacy.py (NOT in the prod path) that :

  • loads the SAME train/val/test splits the FTF run used (from cached parquet OR by re-running enrichment + FE) ;
  • trains all 3 models (LGB / XGB / CB) twice : once with the harness pipeline (from training.harness.dags.models import lightgbm_dag etc.), once with the reconstructed pre-#899 legacy code path ;
  • logs best_iteration, theta_picked (if applicable), f1_buy_val, per-iteration AUC + per-iteration binary_logloss for both harness and legacy paths ;
  • emits a side-by-side trajectory plot (harness-eval-trace.log + legacy-eval-trace.log) : the operator wants to SEE the iter-by-iter divergence, not just the final best_iter delta ;
  • asserts on the divergence.

If the legacy path produces best_iter > 100 and the harness path produces best_iter = 1, the diff between the two code paths IS the regression. Localise it with the Step 4 staged replay.

Step 4 — Deterministic divergence localisation (amended — replaces commit bisect). The bisect approach was rejected as too vague : the bug can come from data, features, labels, weights, training config, HPO, or the evaluation metric — bisect across commits would test a moving target. Replace with a 4-stage replay that pins each variable in turn :

| Stage | Setup | Pins / what it isolates |
|---|---|---|
| A | same captured dataset (Step 1 parquet) + legacy trainer (Step 2 reconstruction) | the legacy baseline on this exact fold ; expected : best_iter > 100, f1_buy_val ≥ 0.40. If A fails to match the pre-#891 baseline → the captured fold itself is contaminated → escalate (Step 0 should have caught this, but defence-in-depth here). |
| B | same dataset + harness trainer + same Optuna trial params as A's run | the harness training loop in isolation. If A passes but B fails → the divergence is in the training loop (early stopping config, Dataset construction, eval_set, sample weights, metric). Confirm via §5.3.3 + §5.3.4 invariants. Maps to H1-H6. |
| C | same dataset + harness + no HPO (fixed legacy params, no Optuna call) | isolates HPO from the training loop. If B fails but C passes → HPO is the culprit (suggest_* returns a pathological combination) → maps to H7. If both B and C fail → training loop is the culprit regardless of HPO → maps to H1-H6. |
| D | same dataset + harness + fixed legacy params + same metric / early_stopping config as legacy | full-control replay. If D passes → we've reconstructed the legacy behaviour within the harness, the diff between B/C and D points to the exact misconfiguration. If D still fails → root cause is upstream (Dataset construction itself, not the call signature) → maps to H2/H3/H5. |

Goal : isolate whether divergence is due to data / training loop / HPO / evaluation metric. The 4 stages collectively cover the hypothesis space §5.1. Bisect across the #891+#896+#899 commits is now only used as a Stage 4.5 if A-D fail to converge.
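The stage table reads as a small decision procedure. The sketch below encodes it as one function, under stated assumptions : each stage is reduced to a boolean "matches the legacy baseline", the function name and return strings are illustrative, and the "B passes" branch is not covered by the table, so its wording here is an assumption.

```python
def localise(a_ok: bool, b_ok: bool, c_ok: bool, d_ok: bool) -> str:
    """Map stage A-D outcomes to the hypothesis group named in the stage table."""
    if not a_ok:
        # Stage A: the legacy trainer must reproduce the pre-#891 baseline.
        return "escalate: captured fold does not reproduce the legacy baseline"
    if b_ok:
        # Assumption: not covered by the table; the regression did not reproduce.
        return "no divergence at stage B: re-check the capture before continuing"
    if c_ok:
        # B fails, C passes: removing HPO removes the pathology.
        return "H7: HPO returns a pathological parameter combination"
    if d_ok:
        # B and C fail, D passes: the legacy config repairs the harness.
        return "H1-H6: harness training-loop misconfiguration (diff B/C against D)"
    # D still fails with legacy params and legacy early-stopping config.
    return "H2/H3/H5: root cause upstream, in Dataset construction"


print(localise(a_ok=True, b_ok=False, c_ok=False, d_ok=True))
```

Evaluating the stages in this fixed order means each later stage is only interpreted once the earlier ones have pinned their variable.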

Step 5 : Report the verdict. Final dossier documentation/missions/cvn-n001-ee-s18-diagnostic/report.md with :

  • root cause identified at line-level granularity ;
  • hypothesis ranking validated or revised ;
  • remediation Story scope (CVN-N001-EE-S19 if needed) : surgical fix vs. broader refactor decision ;
  • ADR addendum if the regression revealed a missing invariant in ADR-89.

5.3 Diagnostic invariants (hard requirement, amended)

The §5.2 reproducer is operational ONLY if it enforces strict parity assertions between legacy and harness at each of the 6 checkpoints below. The reproducer MUST stop at the first divergence and emit the diff as a structured artifact (under documentation/missions/cvn-n001-ee-s18-diagnostic/divergence-<step>.md). This converts the diagnostic from a heuristic investigation into a deterministic divergence isolation process.

Implementation contract — the reproducer Python script has 6 assertion functions, each runs in order, each calls pytest.fail (or raises) on mismatch with a structured message :

5.3.1 Dataset parity

assert legacy_dataset.n_samples == harness_dataset.n_samples
assert (legacy_dataset.index == harness_dataset.index).all()        # tz-aware Timestamp Index
assert legacy_dataset.label_dist == harness_dataset.label_dist      # {-1, 0, +1} counts equal
assert hash_dataset(legacy_dataset) == hash_dataset(harness_dataset)  # sha256 over (index, labels, feature_values)

Rules out H3 (label misalignment), partially confirms / rules out H2 (val construction). Maps to Step 4 stage A vs B.
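One possible shape for hash_dataset, as a stdlib-only sketch. The dossier only specifies "sha256 over (index, labels, feature_values)" ; the argument layout (ISO-8601 timestamp strings, integer labels, rows of floats) and the byte encoding are assumptions made for the illustration.

```python
import hashlib
import struct
from typing import Sequence


def hash_dataset(index: Sequence[str], labels: Sequence[int],
                 features: Sequence[Sequence[float]]) -> str:
    """SHA-256 over (index, labels, feature values), row by row.

    Assumed inputs: ISO-8601 timestamp strings, integer labels, rows of floats.
    """
    if not (len(index) == len(labels) == len(features)):
        raise ValueError("index, labels and features must have the same length")
    digest = hashlib.sha256()
    for timestamp, label, row in zip(index, labels, features):
        digest.update(timestamp.encode("utf-8") + b"\x00")   # separator avoids ambiguity
        digest.update(struct.pack("<q", label))               # 64-bit little-endian int
        digest.update(struct.pack(f"<{len(row)}d", *row))     # IEEE-754 doubles
    return digest.hexdigest()


# Hypothetical three-row fold, used only to show the off-by-one case of H3.
index = ["2026-05-01T00:00:00+00:00", "2026-05-01T00:05:00+00:00", "2026-05-01T00:10:00+00:00"]
labels = [0, 1, 0]
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

reference = hash_dataset(index, labels, features)
shifted = hash_dataset(index, [labels[-1]] + labels[:-1], features)  # labels shifted by one row
print(reference == shifted)  # False
```

A one-row label shift leaves n_samples and the label distribution unchanged, so only the index-aligned hash catches it. That is why the hash assertion sits alongside the cheaper checks.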

5.3.2 Feature parity

assert legacy_features == harness_features                          # ordered list of feature names
assert hash_matrix(X_legacy) == hash_matrix(X_harness)              # sha256 over the 2D feature array
assert X_legacy.isna().sum().sum() == X_harness.isna().sum().sum()  # total NaN count

Rules out H5 (feature matrix corrupted / NaN leak / scaling drift).

5.3.3 Hyperparameter parity (effective params, post-Optuna)

assert legacy_effective_params == harness_effective_params  # dict equality on the params actually fed to lgb.train / xgb.train / CB.fit
assert legacy_early_stopping_config == harness_early_stopping_config  # {rounds, eval_set_name, metric_name, first_metric_only}
assert legacy_eval_metric == harness_eval_metric  # ordered list ; first metric drives early stopping in LGB

Confirms or rules out H1 (early stopping misconfigured), H4 (sample weights), H6 (eval metric mismatch), H7 (HPO override).
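When the dict-equality assertion fails, the useful artifact is the list of keys that differ. A minimal sketch of such a diff helper follows ; the function name, the "<missing>" marker and the example parameter values are all hypothetical and are not taken from the harness or from the legacy code.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"


def diff_params(legacy: Dict[str, Any], harness: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (legacy_value, harness_value)} for every key that differs."""
    differences = {}
    for key in sorted(set(legacy) | set(harness)):
        legacy_value = legacy.get(key, MISSING)
        harness_value = harness.get(key, MISSING)
        if legacy_value != harness_value:
            differences[key] = (legacy_value, harness_value)
    return differences


# Hypothetical effective params, for illustration only.
legacy_params = {"learning_rate": 0.05, "metric": ["auc"], "early_stopping_rounds": 50}
harness_params = {
    "learning_rate": 0.05,
    "metric": ["binary_logloss", "auc"],
    "early_stopping_rounds": 50,
    "first_metric_only": True,
}

for key, (legacy_value, harness_value) in diff_params(legacy_params, harness_params).items():
    print(f"{key}: legacy={legacy_value!r} harness={harness_value!r}")
```

Comparing the metric as an ordered list matters here, since the invariant above notes that the first metric drives early stopping in LGB.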

5.3.4 Training behavior parity

assert abs(legacy_best_iteration - harness_best_iteration) <= 5  # within ±5 iters tolerance
assert np.allclose(legacy_training_loss_curve[:20], harness_training_loss_curve[:20], rtol=0, atol=0.01)  # element-wise within ε=0.01
assert np.allclose(legacy_eval_metric_curve[:20], harness_eval_metric_curve[:20], rtol=0, atol=0.01)  # element-wise within ε=0.01

The per-iter curves are the smoking gun : if legacy iter 5 has val AUC=0.65 while harness iter 5 has val AUC=0.78 and then declines, the iter-5 divergence point IS the bug location. Maps to H1, H2, H4, H6.
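Locating the divergence point can be sketched as a scan over the two curves. The helper name and the example curves are hypothetical ; the curves reuse the iter-5 figures from the sentence above, and the tolerance and window match the ε=0.01 / first-20-iterations convention of this section.

```python
from typing import Optional, Sequence


def first_divergence(legacy: Sequence[float], harness: Sequence[float],
                     eps: float = 0.01, window: int = 20) -> Optional[int]:
    """Return the first 1-indexed iteration where the curves differ by more than eps."""
    pairs = zip(legacy[:window], harness[:window])
    for iteration, (legacy_value, harness_value) in enumerate(pairs, start=1):
        if abs(legacy_value - harness_value) > eps:
            return iteration
    return None  # curves agree within eps over the compared window


# Hypothetical val-AUC curves: identical for 4 iterations, diverging at iteration 5.
legacy_auc = [0.55, 0.58, 0.61, 0.63, 0.65, 0.66]
harness_auc = [0.55, 0.58, 0.61, 0.63, 0.78, 0.70]

print(first_divergence(legacy_auc, harness_auc))  # 5
```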

5.3.5 Raw prediction parity (BEFORE threshold)

assert legacy_proba.shape == harness_proba.shape  # (n_val, 2) for binary
assert np.allclose(legacy_proba, harness_proba, atol=1e-4)  # element-wise on the val set

This is the single most critical check — separates "model OK but threshold broken" from "model never crosses θ". If 5.3.4 passes but 5.3.5 fails → the trained Booster makes different predictions on the SAME val set → the divergence is in predict_proba itself (adapter shim, feature_name argument, missing feature padding). If 5.3.5 passes → the model is identical, the divergence is in the threshold path (θ-sweep, fixed θ=0.5 fallback for XGB).

5.3.6 Final prediction parity (AFTER threshold)

assert np.array_equal(legacy_y_pred, harness_y_pred)  # element-wise binary {0, 1}
assert abs(legacy_f1_buy_val - harness_f1_buy_val) <= 0.005  # within ε=0.005

Only meaningful if 5.3.5 passed. Confirms θ-sweep semantics + class assignment.

Stop-at-first-divergence rule : the reproducer runs assertions 5.3.1 → 5.3.6 in order. The FIRST assertion to fail is the divergence locus. The reproducer emits the locus + the diff (e.g., "Step 5.3.4 failed at iter 5 : legacy val_auc=0.65, harness val_auc=0.78, ratio 1.20") and stops. Do NOT try to "see how many invariants fail" — the first one IS the bug, downstream ones are consequences.
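The rule can be sketched as a tiny runner that executes named checks in order and returns at the first AssertionError. The runner and the stub checks are illustrative assumptions ; the real reproducer would plug in the six §5.3 assertion functions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[], None]]


def run_in_order(checks: Sequence[Check]) -> Optional[str]:
    """Run checks sequentially; return the first failure as 'name failed: message'."""
    for name, check in checks:
        try:
            check()
        except AssertionError as exc:
            return f"{name} failed: {exc}"  # stop: later checks are NOT executed
    return None  # every invariant held


executed: List[str] = []


def stub(name: str, failure: Optional[str] = None) -> Callable[[], None]:
    """Build a placeholder check that records its execution and optionally fails."""
    def check() -> None:
        executed.append(name)
        if failure is not None:
            raise AssertionError(failure)
    return check


checks = [
    ("5.3.1", stub("5.3.1")),
    ("5.3.2", stub("5.3.2")),
    ("5.3.3", stub("5.3.3")),
    ("5.3.4", stub("5.3.4", "iter 5 : legacy val_auc=0.65, harness val_auc=0.78")),
    ("5.3.5", stub("5.3.5")),
    ("5.3.6", stub("5.3.6")),
]

locus = run_in_order(checks)
print(locus)
print(executed)  # 5.3.5 and 5.3.6 never ran
```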

5.4 Probability-distribution diagnostics (amended, for the "near-empty buy classifier" symptom)

The XGB f1_buy=0.09 + recall=0.05 + 1-44 trades observation is consistent with two distinct failure modes that the reproducer MUST separate. For each model, emit :

counts, bin_edges = np.histogram(harness_proba[:, 1], bins=50, range=(0.0, 1.0))  # raw BUY-class probability
mean_proba = harness_proba[:, 1].mean()
std_proba = harness_proba[:, 1].std()
pct_above_0_5 = (harness_proba[:, 1] > 0.5).mean()
pct_above_theta = (harness_proba[:, 1] > theta_picked).mean()

Interpretation table :

| Symptom | Mode |
|---|---|
| All proba < 0.2 | Model never learned to discriminate BUY ; θ irrelevant. Upstream training problem (H1-H6). |
| Mean proba ≈ 0.18 (= class prior) | Model output collapsed to the base rate. Confirms training-loop failure. |
| Bimodal distribution but pct_above_0_5 = 0 | Model discriminates but never crosses fixed θ=0.5 → θ broken (XGB case). The model itself is OK. |
| Mean proba ≈ 0.4 with high std | Model partial signal ; θ-sweep at 0.2 catches enough → LGB/CB f1≈0.37 case. |

This diagnostic separates "fix the training" (most cases) from "fix the threshold path" (XGB-specific case). Maps directly to S19 scope decision.
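A stdlib-only sketch of the summary statistics and a coarse first-pass triage. The tolerance, the ordering of the rules and the return strings are assumptions ; bimodality is deliberately not detected here and is left to visual inspection of the histogram. All probability vectors are synthetic.

```python
import statistics
from typing import Dict, Sequence


def summarise_proba(buy_proba: Sequence[float], theta_picked: float) -> Dict[str, float]:
    """Summary statistics of the BUY-class probability on the val set."""
    n = len(buy_proba)
    return {
        "mean": statistics.mean(buy_proba),
        "std": statistics.pstdev(buy_proba),
        "max": max(buy_proba),
        "pct_above_0_5": sum(p > 0.5 for p in buy_proba) / n,
        "pct_above_theta": sum(p > theta_picked for p in buy_proba) / n,
    }


def coarse_mode(summary: Dict[str, float], class_prior: float, tol: float = 0.02) -> str:
    """First-pass triage; tol and the rule order are illustrative assumptions."""
    if abs(summary["mean"] - class_prior) <= tol and summary["std"] <= tol:
        return "collapsed to the base rate"
    if summary["max"] < 0.2:
        return "never learned to discriminate BUY"
    if summary["pct_above_0_5"] == 0.0:
        return "never crosses fixed theta=0.5 : inspect histogram for bimodality"
    return "partial signal : inspect histogram"


# Synthetic probability vectors, one per failure mode.
collapsed = [0.18] * 100
flat_low = [0.05, 0.10, 0.15, 0.12] * 25
two_clusters = [0.10] * 80 + [0.45] * 20

for name, proba in [("collapsed", collapsed), ("flat_low", flat_low), ("two_clusters", two_clusters)]:
    print(name, "->", coarse_mode(summarise_proba(proba, theta_picked=0.5), class_prior=0.18))
```

The third vector has a mean close to the class prior but a large spread, which is why the base-rate rule checks both the mean and the standard deviation.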

5.5 Concrete deliverables

| Deliverable | Path | Owner |
|---|---|---|
| Diagnostic script | scripts/diagnostic_s18_harness_vs_legacy.py | Claude |
| Reference fold capture artifact | documentation/missions/cvn-n001-ee-s18-diagnostic/captured-fold-aaveusdc-3.parquet | Claude |
| Per-iter eval log (harness path) | documentation/missions/cvn-n001-ee-s18-diagnostic/harness-eval-trace.log | Claude |
| Per-iter eval log (legacy path) | documentation/missions/cvn-n001-ee-s18-diagnostic/legacy-eval-trace.log | Claude |
| Side-by-side diff doc | documentation/missions/cvn-n001-ee-s18-diagnostic/code-diff-harness-vs-legacy.md | Claude |
| Bisect log | documentation/missions/cvn-n001-ee-s18-diagnostic/bisect.md | Claude |
| Final verdict + remediation scope | documentation/missions/cvn-n001-ee-s18-diagnostic/report.md (also PDF via existing tooling) | Claude + operator review |

No PERMANENT production code change in S18 (amended). The diagnostic script lives in scripts/, not src/. Permanent PRs do NOT touch src/training/ or src/commun/. Temporary instrumentation of src/training/harness/ IS allowed when the parity invariants of §5.3 cannot be measured without it (e.g., printing per-iter eval metric values, capturing lgb.Dataset construction args at call site). Temporary instrumentation MUST be :

  • introduced on a throwaway branch separate from the diagnostic PR ;
  • removed BEFORE S18 closure (the closure ritual MUST verify git diff main..HEAD -- src/ is empty) ;
  • captured in the deliverables under documentation/missions/cvn-n001-ee-s18-diagnostic/temp-instrumentation-diff.md so reviewers can audit what was probed.

The remediation Story (S19, if needed) follows S18 with its own plan_review committee.


6. What we ruled out (and why)

6.1 "Just revert PR #891" (the harness migration)

Reverting #891 + #896 + #899 would restore the f1_buy ≈ 0.42 baseline immediately. Rejected because :

  • it loses the ADR-89 plugin registry that makes ensemble work easier ;
  • S17 (just merged) externalized hyperparams that the legacy code doesn't read ;
  • it forces a re-do of the harness migration with the same risk ;
  • we'd lose the FTF leaderboard infrastructure rebuilt around the harness.

The right move is to diagnose what went wrong in the harness, fix the specific bug, keep the plugin registry. Revert is the fallback if S18 + S19 fail to converge in 30 days (escalation per mlops_readiness.md §6.bis T+30 yellow flag — applied to S18 too).

6.2 "Just disable θ-sweep on LGB+CB so they show the same pathology as XGB"

Would expose the regression more crudely but doesn't help diagnose. The θ-sweep is doing what it was designed to do (val-tuned threshold). The actual bug is upstream of the θ-sweep.

6.3 "Run S18 + S19 as a single Story"

Rejected — diagnostic Stories carry different risk profile than remediation Stories. The committee sees S18 fresh (no implementation pressure to confirm the hypothesis prematurely). S19 (the fix) gets its own plan_review committee with the diagnostic report as input. This is the canonical workflow (ADR-81).

6.4 "Patch the symptom by widening Optuna's learning_rate range lower bound"

Tempting because the post-#891 pre-S17 incident report (event=training_started learning_rate=0.016149) suggested the model picked a tiny LR. But this is treating the symptom : if LGB picks LR=0.016 OR LR=0.099 and both produce best_iter=1, the LR isn't the problem. The val metric peaking at iter 1 is the problem.

6.5 "Just compare lgb_native_params to legacy" (skip steps 1, 3, 4)

Tempting because it's fast. Rejected because the regression might be in :

  • the Dataset construction (categorical_feature, feature_name, reference) ;
  • the val_set construction (different label distribution if walk-forward index changed) ;
  • the class weight (scale_pos_weight vs sample_weight semantics) ;
  • the training_started log filter (if a param the harness USES isn't being LOGGED, we wouldn't see the divergence).

A 1h side-by-side parity reproducer is cheaper than guessing.


7. Success criteria

S18 produces a verdict per the ADR-79 decision tree, adapted for a diagnostic Story :

| Verdict | Trigger | Next action |
|---|---|---|
| LOCK (= "root cause identified, fix is small + obvious") | Bisect converges on a single commit ; fix is < 30 LOC AND parity reproducer validates the fix | Open CVN-N001-EE-S19 with a 1-week implementation plan + plan_review. |
| KEEP_AVAILABLE (= "root cause identified, fix is non-trivial") | Bisect converges but fix needs design (e.g. categorical feature plumbing across 3 model adapters + ensemble DAGs) | Open CVN-N001-EE-S19 with a 2-3-sprint implementation plan + design dossier + plan_review. |
| ABANDON (= "root cause NOT identifiable within budget OR fix riskier than revert") | 30 days of S18 work and no convergence ; OR identified fix has > 3x the surface area of #891 revert | Open CVN-N001-EE-S19 to revert #891 + #896 + #899, accept the ADR-89 loss, file a successor Epic to redo the harness migration with stricter parity tests at training layer (not just API layer). |

S18 itself succeeds if it produces a defensible verdict — not necessarily a fix.

7.1 Budget

| Phase | Wall budget |
|---|---|
| Step 0 (staging pre-validation, committee reco #4) | 0.25 day |
| Step 1 (fold capture) | 1 day |
| Step 2 (legacy reconstruction) | 0.5 day |
| Step 3 (parity reproducer : 3 models, per-iter AUC + logloss, committee reco #5) | 2 days |
| Step 4 (bisect, 3 candidate commits) | 1 day |
| Step 5 (final dossier) | 0.5 day |
| Total | ~5.25 days (1 sprint slot, single-WIP) |

The operator has an explicit budget cap (amended). If the Step 3 parity reproducer finds NO divergence by the "end of Step 2" checkpoint of the §5.3 sequence, i.e. after dataset (5.3.1), feature (5.3.2) and hyperparameter (5.3.3) parity have all passed, STOP and re-evaluate the §5.1 hypothesis space. A clean parity at all 3 levels means the divergence is downstream (training behavior, predict_proba, or threshold). If §5.3.4 + §5.3.5 + §5.3.6 ALSO pass, the bug is not where this dossier predicted : escalate immediately to committee with the parity-trace artefacts and re-scope the diagnostic. Don't burn 2 weeks chasing a phantom. The Step 4 audit trail (sub-stages A/B/C/D outcomes) MUST be documented as Step 4 progresses, per committee reco #3.

Trading posture check-in (committee reco #2) : at Step 0 + Step 3 + Step 5, operator explicitly confirms in wp#154 comment "no live trading promotion contemplated during S18, FTF leaderboard remains informational". If trading posture changes mid-S18, the diagnostic pauses and S18 yields to a different priority.

Cross-fold validation (committee reco #6) : if Step 4 bisect is ambiguous on the captured AAVEUSDC fold=3 cell, repeat Step 1 + Step 3 on 2 additional cells (e.g. OPUSDC fold=3 + LDOUSDC fold=4) to rule out crypto-regime dependence. Additional budget : ~1 day if invoked.


8. What is explicitly out of scope for S18

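  • Any PERMANENT change to src/training/, src/commun/pipeline/, src/commun/finetune/ : diagnostic only. Temporary instrumentation of src/training/harness/ is allowed under the §5.5 conditions and must be removed before closure.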
  • Any change to ftf_config.base_env — S17 just shipped, the values are correct.
  • Any new FTF DAG trigger on prod data — Step 1 captures ONE fold ; Steps 2-4 use the captured artifact, not live data.
  • The XGBoost-specific calibrator wiring — the legacy walk-forward calibrator for XGB θ-thresholding is a separate known gap (out-of-scope for S18, may be addressed in S19 if it correlates with the root cause).
  • Adding an XGB θ-sweep where currently disabled : committee verdict 8.3 stands. Out-of-scope.
  • Any change to the autonomous ensemble trainer or G5 grep gate — those are separate forward-looking recos (Tracks 2 + 5 of the post-S17 plan), unrelated to the shallow-training regression.

9. Risks + mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| The captured fold doesn't reproduce the pathology locally (different env, library versions) | medium | high (no diagnostic possible) | Step 1 includes a sanity assertion : f1_buy_val of the local replay must match the FTF-reported value within ε=0.005. If it diverges → escalate to committee with the divergence as additional evidence (suggests env drift, separate Story needed). |
| Bisect points at a commit with too many independent changes (#891 was a big PR) | high | medium | Manual narrowing within the commit (cherry-pick sub-diffs to test). If still ambiguous, the dossier reports "regression in [commit X], suspected line range Y-Z, awaiting operator + reviewer pairing for final narrowing." |
| Root cause is data-shape-dependent (cross-crypto variance) | low | medium | Step 1 captures ONE crypto+fold ; if Step 3 reproduces on it AND legacy reconstructed code shows >100 best_iter on the same data, we have enough signal. Cross-validate on 2 additional cryptos if Step 4 ambiguous. |
| Root cause is in a non-harness file (e.g. commun.pipeline.feature_engineering_api) | low | high (S18 scope inadequate) | Step 5 explicitly states scope expansion if needed ; not blocking for the verdict. |
| Diagnostic takes longer than 5.25 days | medium | low | Budget cap in §7.1 ; escalation rule amended : if §5.3 invariants 5.3.1–5.3.3 all pass without divergence, STOP at end of Step 2 (parity reproducer + early invariants) and re-evaluate hypothesis space §5.1 before continuing. Don't chase a phantom into Step 4. |

The FTF leaderboard during the S18 diagnostic window will continue to show LGB+CB at f1≈0.36 and XGB at f1≈0.09. This is the operator's stated risk envelope ("FTF leaderboard is informational, no live trading impact"). If a live trading promotion is contemplated during S18, the operator pauses promotions until S18+S19 closes (escalation noted in mlops_readiness.md §6.bis T+30 if no progress).


10. Sign-off checklist (gate before plan_review submission)

  • §1 question to committee : 6 explicit decisions to validate
  • §2 project context : 1 paragraph, self-contained for a reviewer unfamiliar with cvntrade
  • §3 problem observed : 6 sub-tables / quotes, all numerical, zero adjectives
  • §4 prior attempts : 3 documented with outcomes
  • §5 hypothesis + plan : 7 ranked hypotheses (H1-H7, amended from 4) + 6-step methodology (Step 0 staging pre-validation + Steps 1-5) + 7 concrete deliverables
  • §6 explicit anti-suggestions : 5 alternatives ruled out with rationale
  • §7 success criteria : LOCK / KEEP_AVAILABLE / ABANDON adapted for diagnostic Story + 5.25d budget
  • §8 scope discipline : 6 explicit out-of-scope items
  • §9 risks : 5 entries with mitigations
  • §10 committee recos integration (session 4298520f, PASSED OK strong consensus, 0 blockers) :
  • reco #1 strict budget cap → §7.1 wording strengthened (escalate at Step 3 = 2d, not 1.5d post expansion)
  • reco #2 trading posture confirmation → §7.1 explicit check-in points (Step 0, 3, 5)
  • reco #3 bisect audit trail → §7.1 MUST document sub-diffs tested + outcomes
  • reco #4 staging pre-validation → §5.2 NEW Step 0 (0.25 day budget)
  • reco #5 expand reproducer to 3 models + per-iter AUC + logloss → §5.2 Step 3 wording updated
  • reco #6 cross-fold validation if ambiguous → §7.1 explicit invocation criterion (+1d budget if invoked)