Skip to content

PR review dossier — CVN-N001-EE-S16 Training Harness (PR #891)

Date : 2026-05-09 Story : CVN-N001-EE-S16 (GH #890, OP wp#143) PR : #891 (4 commits, ready-for-review) Author : Dominique (operator) + Claude Session type : pr_review (mandatory per CLAUDE.md §8 — substantial PR touching src/training/) Plan dossier : 2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md Plan_review verdict : session 7f13b78d PASSED / OK / strong consensus, OP Meeting #125


1. What this PR delivers

The PR unifies the XGB / LGB / CB autonomous training paths into a single Hamilton-based training harness with plugin registries for models and ensembles. Lives at src/training/harness/. Public surface :

  • train_one(model_name, datasets, hpo_params) → TrainedArtifact — single-model training entry point
  • train_ensemble(name, datasets, hpo_params) → TrainedArtifact — ensemble training entry point
  • MODEL_REGISTRY / ENSEMBLE_REGISTRY — convention-scan plugin registries
  • Typed contracts (Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion)

The 4 phases ship in 4 distinct commits on the same branch :

Phase Commit Scope
1 8415c919 Skeleton + XGB adapter + 1e-9 byte-for-byte parity vs legacy train_with_fixed_params
2 153c5c6e LGB + CB adapters + 3-way parity + logging contract test
3 5defb939 Ensemble registry + stack_3model_avg + stack_3model_logreg_shrink DAGs
4 30072b22 LGB autonomous routed via harness behind CVN_USE_HARNESS=1 flag + ADR-89 + synthetic 4th-model pickup test

Side-fixes shipped on the same branch (commits 2b5750e1 + 0ea1e72e + within 153c5c6e) : - Bandit pre-commit hook : pass_filenames: false + .bandit-baseline.json waiver (13 pre-existing medium+ findings tracked under follow-up Story #892) - Flake8 hook : block-style YAML for --ignore=E501,W503,E203 (commas were splitting in flow style) - .gitignore : catboost_info/ (CB.fit writes TFEvents + .tsv locally)

The PR is purely additive — zero call-site changes. Phase 4 cutover is behind CVN_USE_HARNESS=1 (default OFF). Phase 4.1 follow-up Story handles the full cutover (XGB + CB + ablation_runner registry-driven + legacy deletion).

2. Trigger — why this Story exists

The Track 11 sweep ftf_20260508_210336_1a1160 (factor ensemble_diversity, status running until 2026-05-09) showed LGB structurally over-trading on the live PG finetune_results table :

variant fold n_trades avg sortino avg win_rate avg
cb_only 3 68 +0.93 53%
lgb_only 3 251 −5.87 32%
none 3 44 +0.50 57%

LGB took 5× more trades than every other variant, with a 32% win_rate vs 53-65% for the others. Root cause = the divergent LGB-specific θ-sweep added by PR #872 hotfix that pushed the decision threshold too low under scale_pos_weight. Third LGB-specific bug in 2 weeks rooted in the trainer divergence.

Sweep aborted on 2026-05-09 per committee mandate 8.1 (PG status=aborted, 5 K8s pods deleted). Track 11 verdict will be re-computed on the harness post-cutover (Phase 4.1).

3. Operator's 3 non-negotiable requirements — DEMONSTRATED

The operator stated three explicit requirements during the 2026-05-09 design session ; the PR demonstrates all three end-to-end :

  1. New model = drop a DAG file — proven by xgboost_dag.py + lightgbm_dag.py + catboost_dag.py. The synthetic 4th-model pickup test (tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup) registers a dummy_constant model in-test and verifies train_one() picks it up with zero edits to FTF runner / autonomous trainers / any other call site.
  2. Stacking = composition, easy — proven by blend_avg_dag.py + stack_logreg_dag.py. Base models declared as Hamilton inputs, resolved + cached automatically across ensemble DAGs sharing the same bases.
  3. Trainer = commodity — single train_one(model_name, datasets, hpo_params) + train_ensemble(name, datasets, hpo_params) entry points. No caller-side branching on model_type. Phase 4 LGB autonomous now optionally a thin wrapper via CVN_USE_HARNESS=1.

4. Architecture in one diagram

src/training/harness/
├─ contracts.py              # Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion (typed)
├─ registry.py               # MODEL_REGISTRY + ENSEMBLE_REGISTRY (convention scan, committee 8.2)
├─ adapters/                 # The 15% per-model code
│  ├─ base.py                # BaseAdapter
│  ├─ xgb.py                 # wraps xgb.Booster + predict_proba(n,2) shim
│  ├─ lgb.py                 # wraps lgb.Booster + predict_proba(n,2) shim (PR #872 Bug #1 canonical fix)
│  ├─ cb.py                  # wraps CatBoostClassifier
│  └─ ensemble.py            # wraps N base TrainedArtifacts + meta aggregator
├─ nodes/                    # The 85% reused code (Hamilton-pure functions)
│  ├─ class_balance.py       # scale_pos_weight | sample_weights (single canonical impl)
│  ├─ eval_metrics.py        # f1/auc/brier/ECE/logloss (single source of truth — PR #872 Bug #7 canonical)
│  ├─ theta_sweep.py         # val-tuned threshold (single canonical algorithm — PR #872 Bug #6 canonical)
│  └─ log_emit.py            # event=training_started | event=training_complete | event=class_balance_applied | event=theta_picked
├─ dags/models/              # 1 file = 1 model
│  ├─ xgboost_dag.py         # mirrors legacy train_with_fixed_params byte-for-byte
│  ├─ lightgbm_dag.py        # opts INTO theta_sweep (committee 8.3 — Option B)
│  └─ catboost_dag.py        # opts INTO theta_sweep
└─ dags/ensembles/           # 1 file = 1 ensemble strategy
   ├─ blend_avg_dag.py       # FixedAverageAggregator (no fitted state)
   └─ stack_logreg_dag.py    # sklearn LogReg fitted on val + canonical L2 floor

5. Acceptance gate — current state

Check Status
Harness tests ✅ 36/36 (5 XGB + 4 LGB byte-parity 1e-9 + 5 CB + 5 logging contract + 11 ensembles + 6 cutover)
Existing training tests ✅ 116/116 (XGB / LGB / CB autonomous + ftf-7bugs-hotfix lock tests)
Total 162/162 in ~13s
Pre-commit hooks ✅ All passing (bandit + flake8 + mkdocs strict + secrets)
Mkdocs --strict ✅ Clean
ADR-89 + index ✅ Written + index updated (max → 0089)
Plan_review committee ✅ session 7f13b78d PASSED / OK / strong consensus

6. Specific questions for the committee

  1. Phase 4 feature flag is the right approach ? The flag (CVN_USE_HARNESS=1) lets the operator validate the harness on a real Track 11 re-run before flipping default-on. Default OFF preserves legacy behaviour. Phase 4.1 flips default + deletes legacy. Trade-off : delays the production benefit until operator validation. Alternative would be a hard cutover in this PR. Is the flag-gated approach acceptable ?

  2. 3-way parity contract — XGB and LGB are byte-for-byte 1e-9 vs legacy. CB is shape-equivalent because the legacy CB used a divergent metrics calculator (CVNTrade_ClassificationMetrics) that doesn't decompose to the canonical 13-field SplitMetrics. The CB harness path uses the canonical evaluate_split_binary which produces sensible values (AUC > 0.7 on synthetic) but isn't a strict match to legacy CB. Is this acceptable, or should CB get a "shadow run on real data" verification before Phase 4.1 cutover ?

  3. θ-sweep is opt-in (committee verdict 8.3 — Option B) — XGB stays at threshold=0.5 (legacy walk-forward calibrator handles it separately). LGB and CB opt in. This means the harness produces DIFFERENT XGB-vs-LGB threshold strategies, which is a known divergence. Should a follow-up Story evaluate XGB+canonical θ-sweep on val, or is the current asymmetry acceptable forever ?

  4. Ensemble DAGs train base models via train_one() — Hamilton resolves + caches them. But if multiple ensembles run in the same FTF (e.g., stack_3model_avg AND stack_3model_logreg_shrink), each ensemble DAG independently calls train_one() and Hamilton's per-driver caching may NOT prevent re-training across drivers. Is this a problem for the FTF re-run, or do we accept the 2× training cost in Phase 3 and address it in Phase 4.1 ?

  5. stack_3model_xgb_meta deferred — the existing XGBMetaAggregator.predict_proba is a WIP stub raising NotImplementedError (impl tracked under wp#45). The harness can't wire to a stub. We deferred to a separate Phase 3.1 follow-up Story. Is this acceptable, or should the PR include a placeholder ensemble that returns AUC=0.5 (so the registry-driven FTF doesn't break on iteration over ENSEMBLE_REGISTRY) ?

  6. Bandit baseline waiver — the PR fixes the bandit hook (was misconfigured, never ran) and adds a .bandit-baseline.json capturing 13 pre-existing medium+ findings. Hook now blocks NEW issues. Companion Story #892 tracks the triage. Is the baseline acceptable as a tech-debt waiver pattern (same as .secrets.baseline), or should the 13 findings block this PR ?

  7. Legacy code NOT deleted yettrain_with_fixed_params (XGB), train_with_fixed_params_lgbm (LGB), cvntrade_lightgbm_adapter.py (orphan) all stay in tree. They're parallel to the harness path. Phase 4.1 follow-up Story deletes them after the cutover. Is the parallel-coexistence acceptable, or should this PR delete the orphan adapter at minimum (it's confirmed dead code per committee verdict 8.5) ?

  8. Verdict tree — recommend among PASSED / PASSED-WITH-CHANGES / NEEDS-REVISION / REJECTED with rationale per finding.

7. Cross-references

  • Plan dossier : 2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md
  • Plan_review verdict : session 7f13b78d PASSED + 5 enhancements (integrated in §8.bis of plan dossier)
  • ADR-89 : documentation/adr/0089-training-harness-as-plugin-registry.md
  • Trigger PR : #872 (FTF 7-bug hotfix — demonstrated cost of triplicated trainers)
  • Related ADRs : ADR-25 (no silent fallback), ADR-30/32/38 (structured logs), ADR-61 (Hamilton for batch DAGs), ADR-67 (plugin registry pattern), ADR-68 (committee mandatory)
  • Sister registry : src/commun/finetune/feature_selection/registry.py (ADR-67 — same pattern)
  • Reused : commun.trading.ensemble_aggregator (existing v2 aggregators — FixedAverageAggregator, LogRegShrinkAggregator)
  • Companion Story : #892 (bandit baseline triage)
  • Follow-up Stories : Phase 4.1 cutover + Phase 3.1 xgb_meta ensemble (filed alongside this PR review submission)