PR review dossier — CVN-N001-EE-S16 Training Harness (PR #891)¶
Date : 2026-05-09
Story : CVN-N001-EE-S16 (GH #890, OP wp#143)
PR : #891 (4 commits, ready-for-review)
Author : Dominique (operator) + Claude
Session type : pr_review (mandatory per CLAUDE.md §8 — substantial PR touching src/training/)
Plan dossier : 2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md
Plan_review verdict : session 7f13b78d PASSED / OK / strong consensus, OP Meeting #125
1. What this PR delivers¶
The PR unifies the XGB / LGB / CB autonomous training paths into a single Hamilton-based training harness with plugin registries for models and ensembles. Lives at src/training/harness/. Public surface :
train_one(model_name, datasets, hpo_params) → TrainedArtifact— single-model training entry pointtrain_ensemble(name, datasets, hpo_params) → TrainedArtifact— ensemble training entry pointMODEL_REGISTRY/ENSEMBLE_REGISTRY— convention-scan plugin registries- Typed contracts (
Datasets,HPOParams,SplitMetrics,TrainedArtifact,FeatureVersion)
The 4 phases ship in 4 distinct commits on the same branch :
| Phase | Commit | Scope |
|---|---|---|
| 1 | 8415c919 |
Skeleton + XGB adapter + 1e-9 byte-for-byte parity vs legacy train_with_fixed_params |
| 2 | 153c5c6e |
LGB + CB adapters + 3-way parity + logging contract test |
| 3 | 5defb939 |
Ensemble registry + stack_3model_avg + stack_3model_logreg_shrink DAGs |
| 4 | 30072b22 |
LGB autonomous routed via harness behind CVN_USE_HARNESS=1 flag + ADR-89 + synthetic 4th-model pickup test |
Side-fixes shipped on the same branch (commits 2b5750e1 + 0ea1e72e + within 153c5c6e) :
- Bandit pre-commit hook : pass_filenames: false + .bandit-baseline.json waiver (13 pre-existing medium+ findings tracked under follow-up Story #892)
- Flake8 hook : block-style YAML for --ignore=E501,W503,E203 (commas were splitting in flow style)
- .gitignore : catboost_info/ (CB.fit writes TFEvents + .tsv locally)
The PR is purely additive — zero call-site changes. Phase 4 cutover is behind CVN_USE_HARNESS=1 (default OFF). Phase 4.1 follow-up Story handles the full cutover (XGB + CB + ablation_runner registry-driven + legacy deletion).
2. Trigger — why this Story exists¶
The Track 11 sweep ftf_20260508_210336_1a1160 (factor ensemble_diversity, status running until 2026-05-09) showed LGB structurally over-trading on the live PG finetune_results table :
| variant | fold | n_trades avg | sortino avg | win_rate avg |
|---|---|---|---|---|
cb_only |
3 | 68 | +0.93 | 53% |
lgb_only |
3 | 251 | −5.87 | 32% |
none |
3 | 44 | +0.50 | 57% |
LGB took 5× more trades than every other variant, with a 32% win_rate vs 53-65% for the others. Root cause = the divergent LGB-specific θ-sweep added by PR #872 hotfix that pushed the decision threshold too low under scale_pos_weight. Third LGB-specific bug in 2 weeks rooted in the trainer divergence.
Sweep aborted on 2026-05-09 per committee mandate 8.1 (PG status=aborted, 5 K8s pods deleted). Track 11 verdict will be re-computed on the harness post-cutover (Phase 4.1).
3. Operator's 3 non-negotiable requirements — DEMONSTRATED¶
The operator stated three explicit requirements during the 2026-05-09 design session ; the PR demonstrates all three end-to-end :
- New model = drop a DAG file — proven by
xgboost_dag.py+lightgbm_dag.py+catboost_dag.py. The synthetic 4th-model pickup test (tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup) registers adummy_constantmodel in-test and verifiestrain_one()picks it up with zero edits to FTF runner / autonomous trainers / any other call site. - Stacking = composition, easy — proven by
blend_avg_dag.py+stack_logreg_dag.py. Base models declared as Hamilton inputs, resolved + cached automatically across ensemble DAGs sharing the same bases. - Trainer = commodity — single
train_one(model_name, datasets, hpo_params)+train_ensemble(name, datasets, hpo_params)entry points. No caller-side branching onmodel_type. Phase 4 LGB autonomous now optionally a thin wrapper viaCVN_USE_HARNESS=1.
4. Architecture in one diagram¶
src/training/harness/
├─ contracts.py # Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion (typed)
├─ registry.py # MODEL_REGISTRY + ENSEMBLE_REGISTRY (convention scan, committee 8.2)
│
├─ adapters/ # The 15% per-model code
│ ├─ base.py # BaseAdapter
│ ├─ xgb.py # wraps xgb.Booster + predict_proba(n,2) shim
│ ├─ lgb.py # wraps lgb.Booster + predict_proba(n,2) shim (PR #872 Bug #1 canonical fix)
│ ├─ cb.py # wraps CatBoostClassifier
│ └─ ensemble.py # wraps N base TrainedArtifacts + meta aggregator
│
├─ nodes/ # The 85% reused code (Hamilton-pure functions)
│ ├─ class_balance.py # scale_pos_weight | sample_weights (single canonical impl)
│ ├─ eval_metrics.py # f1/auc/brier/ECE/logloss (single source of truth — PR #872 Bug #7 canonical)
│ ├─ theta_sweep.py # val-tuned threshold (single canonical algorithm — PR #872 Bug #6 canonical)
│ └─ log_emit.py # event=training_started | event=training_complete | event=class_balance_applied | event=theta_picked
│
├─ dags/models/ # 1 file = 1 model
│ ├─ xgboost_dag.py # mirrors legacy train_with_fixed_params byte-for-byte
│ ├─ lightgbm_dag.py # opts INTO theta_sweep (committee 8.3 — Option B)
│ └─ catboost_dag.py # opts INTO theta_sweep
│
└─ dags/ensembles/ # 1 file = 1 ensemble strategy
├─ blend_avg_dag.py # FixedAverageAggregator (no fitted state)
└─ stack_logreg_dag.py # sklearn LogReg fitted on val + canonical L2 floor
5. Acceptance gate — current state¶
| Check | Status |
|---|---|
| Harness tests | ✅ 36/36 (5 XGB + 4 LGB byte-parity 1e-9 + 5 CB + 5 logging contract + 11 ensembles + 6 cutover) |
| Existing training tests | ✅ 116/116 (XGB / LGB / CB autonomous + ftf-7bugs-hotfix lock tests) |
| Total | ✅ 162/162 in ~13s |
| Pre-commit hooks | ✅ All passing (bandit + flake8 + mkdocs strict + secrets) |
Mkdocs --strict |
✅ Clean |
| ADR-89 + index | ✅ Written + index updated (max → 0089) |
| Plan_review committee | ✅ session 7f13b78d PASSED / OK / strong consensus |
6. Specific questions for the committee¶
-
Phase 4 feature flag is the right approach ? The flag (
CVN_USE_HARNESS=1) lets the operator validate the harness on a real Track 11 re-run before flipping default-on. Default OFF preserves legacy behaviour. Phase 4.1 flips default + deletes legacy. Trade-off : delays the production benefit until operator validation. Alternative would be a hard cutover in this PR. Is the flag-gated approach acceptable ? -
3-way parity contract — XGB and LGB are byte-for-byte 1e-9 vs legacy. CB is shape-equivalent because the legacy CB used a divergent metrics calculator (
CVNTrade_ClassificationMetrics) that doesn't decompose to the canonical 13-fieldSplitMetrics. The CB harness path uses the canonicalevaluate_split_binarywhich produces sensible values (AUC > 0.7 on synthetic) but isn't a strict match to legacy CB. Is this acceptable, or should CB get a "shadow run on real data" verification before Phase 4.1 cutover ? -
θ-sweep is opt-in (committee verdict 8.3 — Option B) — XGB stays at threshold=0.5 (legacy walk-forward calibrator handles it separately). LGB and CB opt in. This means the harness produces DIFFERENT XGB-vs-LGB threshold strategies, which is a known divergence. Should a follow-up Story evaluate XGB+canonical θ-sweep on val, or is the current asymmetry acceptable forever ?
-
Ensemble DAGs train base models via
train_one()— Hamilton resolves + caches them. But if multiple ensembles run in the same FTF (e.g.,stack_3model_avgANDstack_3model_logreg_shrink), each ensemble DAG independently callstrain_one()and Hamilton's per-driver caching may NOT prevent re-training across drivers. Is this a problem for the FTF re-run, or do we accept the 2× training cost in Phase 3 and address it in Phase 4.1 ? -
stack_3model_xgb_metadeferred — the existingXGBMetaAggregator.predict_probais a WIP stub raisingNotImplementedError(impl tracked under wp#45). The harness can't wire to a stub. We deferred to a separate Phase 3.1 follow-up Story. Is this acceptable, or should the PR include a placeholder ensemble that returns AUC=0.5 (so the registry-driven FTF doesn't break on iteration overENSEMBLE_REGISTRY) ? -
Bandit baseline waiver — the PR fixes the bandit hook (was misconfigured, never ran) and adds a
.bandit-baseline.jsoncapturing 13 pre-existing medium+ findings. Hook now blocks NEW issues. Companion Story #892 tracks the triage. Is the baseline acceptable as a tech-debt waiver pattern (same as.secrets.baseline), or should the 13 findings block this PR ? -
Legacy code NOT deleted yet —
train_with_fixed_params(XGB),train_with_fixed_params_lgbm(LGB),cvntrade_lightgbm_adapter.py(orphan) all stay in tree. They're parallel to the harness path. Phase 4.1 follow-up Story deletes them after the cutover. Is the parallel-coexistence acceptable, or should this PR delete the orphan adapter at minimum (it's confirmed dead code per committee verdict 8.5) ? -
Verdict tree — recommend among PASSED / PASSED-WITH-CHANGES / NEEDS-REVISION / REJECTED with rationale per finding.
7. Cross-references¶
- Plan dossier :
2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md - Plan_review verdict : session
7f13b78dPASSED + 5 enhancements (integrated in §8.bis of plan dossier) - ADR-89 :
documentation/adr/0089-training-harness-as-plugin-registry.md - Trigger PR : #872 (FTF 7-bug hotfix — demonstrated cost of triplicated trainers)
- Related ADRs : ADR-25 (no silent fallback), ADR-30/32/38 (structured logs), ADR-61 (Hamilton for batch DAGs), ADR-67 (plugin registry pattern), ADR-68 (committee mandatory)
- Sister registry :
src/commun/finetune/feature_selection/registry.py(ADR-67 — same pattern) - Reused :
commun.trading.ensemble_aggregator(existing v2 aggregators —FixedAverageAggregator,LogRegShrinkAggregator) - Companion Story : #892 (bandit baseline triage)
- Follow-up Stories : Phase 4.1 cutover + Phase 3.1 xgb_meta ensemble (filed alongside this PR review submission)