ADR-0089: Training harness as plugin registry (Hamilton)
Status: active
Date: 2026-05-09
Introduced by: CVN-N001-EE-S16 (GH #890, OP wp#143, PR #891)
Supersedes: none
Related: ADR-25 (no silent fallback), ADR-30/32/38 (structured logs), ADR-61 (Hamilton for batch DAGs), ADR-67 (pluggable feature-selection registry — same plugin pattern), ADR-68 (plan_review committee mandatory)
Context
Until 2026-05-09, the project had three independent autonomous trainers
(CVNTrade_XGBoostAutonomousTrainer 1785 lines,
CVNTrade_LightGBMAutonomousTrainer 407 lines,
CVNTrade_CatBoostAutonomousTrainer 376 lines), each with its own
training loop, evaluation pipeline, threshold tuning, calibration, and
logging schema. Bugs were systematically triplicated — the FTF 7-bug
hotfix PR #872 fixed seven distinct issues of which four were
LGB-specific divergence from the XGB reference (Bug #1 LGB Booster
without predict_proba, Bug #6 LGB-only θ-sweep regression, Bug #7
LGB ECE/brier metrics returning NULL, Bug #2 CB nested vs flat metrics
shape).
The trigger that forced the architectural decision was the Track 11
sweep ftf_20260508_210336_1a1160 (factor ensemble_diversity,
2026-05-08) where LGB structurally over-traded — 251 trades per fold
versus 28-68 for the other variants, sortino −5.87 vs +0.93, win_rate
32% vs 53-65%. Root cause: the bespoke LGB θ-sweep added by PR #872
(intended as a hotfix) picked a too-low decision threshold under
scale_pos_weight, causing the model to over-emit BUY signals.
Three failure modes structurally rooted in the trainer divergence:
- Triplicated bugs: every fix had to be applied (or forgotten) in three places. The four LGB-specific bugs of PR #872 are illustrative ; over the project lifetime, dozens of similar divergences accumulated.
- Observability gap: the XGB autonomous trainer emits ~10 distinct event= log types (event=threshold_method, event=calibrator_aggregated, event=focal_loss_active, …) ; the LGB autonomous trainer emits 1, the CB autonomous trainer emits 0. The Track 11 LGB anomaly was discovered only because the operator queried the PostgreSQL finetune_results table directly ; Loki / Grafana could not have surfaced it.
- High friction for new models / ensembles: adding a new model (e.g., LightGBM-GOSS, ExtraTrees, RandomForest) required ~400 lines of copy-pasted orchestration. Adding a new ensemble (e.g., a stacker variant) required surgical edits to ablation_runner.py's hardcoded if/elif dispatch.
The operator's three non-negotiable requirements for the resolution were stated explicitly during the 2026-05-09 design session :
- New model = drop a DAG file (no edits to FTF runner, autonomous trainers, or any other call site).
- Stacking = composition (declare which base predictions the ensemble consumes, the framework resolves dependencies).
- Trainer = commodity (single canonical training entry point ; no caller-side branching on model_type).
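The "stacking = composition" requirement relies on name-based dependency resolution: a node declares what it consumes through its parameter names, and the framework wires the graph. The sketch below is a toy resolver written for illustration only; it is not Hamilton and not the harness code, and the node names (xgb_proba, lgb_proba, blend_avg) and the scale input are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def resolve(target: str, nodes: Dict[str, Callable[..., Any]],
            inputs: Dict[str, Any], cache: Dict[str, Any]) -> Any:
    """Compute `target` by resolving each of its parameters by name."""
    if target in inputs:          # externally supplied value
        return inputs[target]
    if target in cache:           # already computed in this run
        return cache[target]
    fn = nodes[target]
    kwargs = {name: resolve(name, nodes, inputs, cache)
              for name in inspect.signature(fn).parameters}
    cache[target] = fn(**kwargs)
    return cache[target]


# Hypothetical base-model nodes: each returns validation probabilities.
def xgb_proba(scale: float) -> list:
    return [0.2 * scale, 0.8 * scale]


def lgb_proba(scale: float) -> list:
    return [0.4 * scale, 0.6 * scale]


# Hypothetical ensemble node: its parameter names declare what it consumes.
def blend_avg(xgb_proba: list, lgb_proba: list) -> list:
    return [(a + b) / 2 for a, b in zip(xgb_proba, lgb_proba)]


NODES = {fn.__name__: fn for fn in (xgb_proba, lgb_proba, blend_avg)}
blended = resolve("blend_avg", NODES, {"scale": 1.0}, {})
```

The caller asks only for blend_avg ; the two base predictions are computed once and memoised in the cache dict, which is the property the requirement asks the framework to provide.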
A plan dossier was prepared
(documentation/reviews/2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md)
and submitted to the Expert Committee per ADR-68. Verdict : session
7f13b78d PASSED / OK / strong consensus, 5 experts scoring 7.5-8.5
(mean 8.1), OP Meeting #125.
Decision
The training path is unified into a single Hamilton-based training
harness with plugin registries for models and ensembles.
The harness lives at src/training/harness/ and its public surface is :
- train_one(model_name, datasets, hpo_params) → TrainedArtifact: canonical entry point for single-model training.
- train_ensemble(name, datasets, hpo_params) → TrainedArtifact: canonical entry point for ensemble training.
- MODEL_REGISTRY, ENSEMBLE_REGISTRY: plugin registries populated by convention scan of dags/models/*.py and dags/ensembles/*.py.
- Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion: typed contracts shared by every adapter, every Hamilton node, and every downstream consumer.
Internal layout :
| Layer | Responsibility | Files |
|---|---|---|
| contracts.py | Typed payloads (single source of truth) | 1 |
| registry.py | Plugin registries + decorators | 1 |
| adapters/{base,xgb,lgb,cb,ensemble}.py | The 15% per-model code (predict_proba shim, native handle wrap) | 5 |
| nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py | The 85% reused code (Hamilton-pure functions + Optuna HPO) | 5 |
| dags/models/{xgboost,lightgbm,catboost}_dag.py | One file = one model: declares the Hamilton DAG + the per-model HPO search space | 3 (extensible) |
| dags/ensembles/{blend_avg,stack_logreg,stack_xgb_meta}_dag.py | One file = one ensemble | 3 (extensible) |
| autonomous_model_trainer.py | Generic cache-aware wrapper around train_one(model_type, …), used by the orchestrator for any registered model | 1 |
| autonomous_ensemble_trainer.py | Generic cache-aware wrapper around train_ensemble(ensemble_name, …), used by the orchestrator for any registered ensemble | 1 |
The autonomous trainers (autonomous_*_trainer.py) intentionally live
inside the harness module (not under src/training/{XGBoost,LightGBM,CatBoost}/)
because they are 100% generic — the per-model file scaffold no longer
exists. Adding a new model = drop a DAG file under dags/models/,
register it via register_model(...). The orchestrator
_create_autonomous_trainer looks up the registry and instantiates
CVNTrade_AutonomousModelTrainer(model_type=<new_name>, …) — zero
new wrapper file, zero orchestrator edit.
The registry mechanism is convention scan over in-tree DAG modules (committee verdict 8.2 — Option A). The future entry_points extension for third-party plugins is a non-breaking addition.
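A minimal sketch of how a decorator-backed registry and a convention scan can fit together. The names MODEL_REGISTRY and register_model come from this ADR, but the decorator signature, the scan_dag_dir helper, the *_dag.py glob, and the trick of injecting the decorator into each scanned module are assumptions made to keep the example self-contained; a real DAG file would presumably import register_model from the registry module.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable, Dict

MODEL_REGISTRY: Dict[str, Callable] = {}


def register_model(name: str) -> Callable[[Callable], Callable]:
    """Decorator that records a DAG builder under `name` (assumed signature)."""
    def decorator(builder: Callable) -> Callable:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = builder
        return builder
    return decorator


def scan_dag_dir(dag_dir: Path) -> None:
    """Import every *_dag.py file so its decorators populate the registry."""
    for path in sorted(dag_dir.glob("*_dag.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Demo shortcut: expose the decorator as a module global before execution.
        module.register_model = register_model
        spec.loader.exec_module(module)


# "Drop a DAG file": write a hypothetical new model file, then scan.
with tempfile.TemporaryDirectory() as tmp:
    dag_file = Path(tmp) / "extratrees_dag.py"
    dag_file.write_text(
        "@register_model('extratrees')\n"
        "def build():\n"
        "    return 'extratrees-dag'\n"
    )
    scan_dag_dir(Path(tmp))

registered = sorted(MODEL_REGISTRY)
```

No call site is edited to pick up the new model: the scan imports the file, and the import side effect performs the registration. Raising on a duplicate name keeps a silent override from shadowing an existing model.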
The θ-sweep is opt-in via wrapper (committee verdict 8.3, Option B). LGB and CB call the canonical sweep ; XGB stays on its legacy walk-forward calibrator path. A follow-up Story will evaluate XGB + canonical θ-sweep under its own FTF.
The harness uses full Hamilton for all training DAGs (committee verdict 8.4 — per ADR-61 "batch DAGs use Hamilton, not imperative code").
The orphan adapter src/training/patterns/adapters/cvntrade_lightgbm_adapter.py
is deleted in Phase 4.1 (committee verdict 8.5 — confirmed dead code).
Phased rollout
| Phase | Scope | Status |
|---|---|---|
| 1 | Skeleton + XGB adapter + 1e-9 byte-for-byte parity vs legacy | ✅ shipped (commit 8415c919) |
| 2 | LGB + CB adapters + 3-way parity + logging contract test | ✅ shipped (commit 153c5c6e) |
| 3 | Ensemble registry + blend_avg + stack_logreg DAGs | ✅ shipped (commit 5defb939) |
| 3.1 | stack_xgb_meta (deferred until XGBMetaAggregator.predict_proba ships) | ⏳ blocked on wp#45 |
| 4 | CVN_USE_HARNESS=1 flag routes LGB autonomous via harness ; ADR-89 ; 4th-model pickup test | ✅ shipped (PR #891) |
| 4.1 | Full cutover : XGB + CB autonomous routed via harness, ablation matrix wires CVN_MODEL_TYPE=stack_3model_* for ensemble dispatch, generic CVNTrade_AutonomousModelTrainer + CVNTrade_AutonomousEnsembleTrainer replace per-model autonomous wrappers, ~2400 LOC of legacy autonomous trainers deleted | ✅ shipped (PR #896 + #898) |
Compatibility & migration
Phase 4 ships a feature flag (CVN_USE_HARNESS) that, when set to
1 in production, routes the LGB autonomous trainer through the
harness. Default OFF preserves legacy behaviour. The legacy result-dict
shape consumed by regime_trainer.train_weighted_variant is preserved
on both paths — the cutover is transparent to downstream consumers.
The flag will flip default-on in Phase 4.1 once the operator validates
the harness path on a real Track 11 re-run. The legacy
train_with_fixed_params_lgbm is removed only after Phase 4.1 closes.
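The flag-gated routing described above can be sketched as follows. CVN_USE_HARNESS and the _via_harness marker are named in this ADR ; the function names, the result-dict fields, and the marker's value are hypothetical placeholders, not the project's actual code.

```python
import os
from typing import Any, Dict, Mapping


def _train_lgb_legacy(params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the legacy path; the metric value is a placeholder.
    return {"model_type": "lightgbm", "metrics": {"f1": 0.0}}


def _train_lgb_via_harness(params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the harness path; it returns the same result-dict shape.
    return {"model_type": "lightgbm", "metrics": {"f1": 0.0}}


def train_lgb(params: Dict[str, Any],
              env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Route through the harness only when the flag is exactly "1"."""
    if env.get("CVN_USE_HARNESS") == "1":
        result = _train_lgb_via_harness(params)
        result["_via_harness"] = True  # marker so the path can be confirmed
        return result
    return _train_lgb_legacy(params)   # default OFF keeps legacy behaviour


legacy = train_lgb({}, env={})
routed = train_lgb({}, env={"CVN_USE_HARNESS": "1"})
```

Apart from the marker, both paths return the same keys, which is what keeps the cutover transparent to a downstream consumer that reads the result dict.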
XGB and CB autonomous trainers stay on their legacy paths until Phase 4.1. Their MLflow registry, walk-forward optimisation, and meta-label training features are NOT yet wrapped in the harness — they will be moved to harness nodes in Phase 4.1 with the same parity-test gate.
What this ADR does NOT prescribe
- The harness is a TRAINING-path concern. The inference path (src/commun/inference/, src/commun/pipeline/) is unchanged.
- The harness CONSUMES the existing feature_selection_result contract produced by CVNTrade_AutonomousFeatureSelector. CUSUM, FE, labelling, and the FE pipeline are out of scope.
- Multiclass training paths in the harness raise NotImplementedError with an explicit pointer (Phase 1.7 follow-up). All current production use is binary (per CVN_BINARY_CLASSIFICATION=1).
- The harness does NOT replace the CVNTrade_AutonomousOrchestrator dispatch ; the orchestrator's if/elif model_type == will be replaced by MODEL_REGISTRY[model_type] lookup in Phase 4.1.
Consequences
Positive :
- Bug recurrence structurally prevented. The single canonical theta_sweep, eval_metrics, class_balance, and log_emit nodes make a future LGB-specific bug impossible by construction.
- Observability uniform. Every model emits event=training_started, event=class_balance_applied, event=training_complete with identical field schema. LGB and CB additionally emit event=theta_picked. A single Loki query covers all 3 models.
- New model = one file. The synthetic 4th-model pickup test (tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup) demonstrates this structurally.
- Stacking = composition. The EnsembleAdapter wraps N base TrainedArtifacts ; Hamilton resolves and caches the base model outputs across sibling ensemble DAGs.
- Parity tests prove the cutover is safe. XGB and LGB are byte-for-byte 1e-9 matches vs their legacy paths on synthetic data. CB is shape-equivalent (legacy CB had a divergent metrics calculator).
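To illustrate why an identical field schema lets one query cover every model, here is a small sketch of a shared emitter. The event name training_complete comes from this ADR ; the emit_event helper and the fold and n_train_rows fields are hypothetical, and the list filter stands in for a log-store query.

```python
import json
from typing import Any, Dict, List


def emit_event(event: str, model: str, **fields: Any) -> str:
    """Serialise one structured log record; every model uses this one path."""
    record: Dict[str, Any] = {"event": event, "model": model, **fields}
    return json.dumps(record, sort_keys=True)


# All three models go through the same emitter with the same fields.
log_lines: List[str] = [
    emit_event("training_complete", model, fold=0, n_train_rows=1000)
    for model in ("xgboost", "lightgbm", "catboost")
]


def query(lines: List[str], event: str) -> List[Dict[str, Any]]:
    """Stand-in for a log query: keep the records matching one event type."""
    return [rec for rec in map(json.loads, lines) if rec["event"] == event]


completed = query(log_lines, "training_complete")
schemas = {tuple(sorted(rec)) for rec in completed}
```

Because the three records share one key set, a single filter on the event field returns all of them with no per-model branching.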
Negative / trade-offs :
- Hamilton has a learning curve. Future contributors must read the Hamilton documentation to understand how nodes resolve. Mitigated by the existing precedent in src/commun/finetune/feature_selection/dags/ which uses the same pattern.
- The harness has a startup cost (Hamilton driver build per call) of ~50ms. Acceptable for the FTF use case (training takes seconds-to-minutes per fold). To be benchmarked in Phase 4.1 ; if > 5% regression vs legacy, cache the driver per model_name.
- Phase 4 ships behind a flag, so the production benefit is delayed until the operator flips it. Mitigated by the explicit _via_harness marker in the result dict + event=lgb_autonomous_via_harness log so the operator can confirm the path activation in Loki.
Tracked follow-ups :
- Phase 3.1: stack_xgb_meta ensemble DAG (blocked on wp#45's XGBMetaAggregator.predict_proba implementation).
- Phase 4.1: full cutover of XGB + CB + ablation_runner ; delete legacy train_with_fixed_params + train_with_fixed_params_lgbm + orphan cvntrade_lightgbm_adapter.py.
- Companion Story #892 (CVN-N015-EA-S10): triage the 13 bandit medium+ findings waived in .bandit-baseline.json (introduced by the bandit hook fix shipped alongside this Story).
- Future Story: evaluate XGB + canonical θ-sweep on val under its own FTF (committee verdict 8.3 follow-up).
Documented architectural trade-offs (pr_review committee dd30c7eb)
These trade-offs are explicitly accepted as part of the Phase 4 ship and documented here for future maintainers per the committee's "transparency for future maintainers" recommendation.
θ-sweep asymmetry across XGB / LGB / CB
The harness intentionally ships two threshold strategies in parallel :
- XGB : threshold = 0.5 (legacy walk-forward calibrator handles per-fold threshold separately, outside the harness). Reason : XGB legacy behaviour is preserved byte-for-byte to enable 1e-9 parity testing in Phase 1, and XGB's walk-forward calibrator (autonomous_trainer.py L1297-L1425) implements a different, validated algorithm (f1_binary + fallback expectancy).
- LGB and CB : opt INTO the canonical pick_threshold_on_val algorithm (val-tuned θ, same byte-for-byte algorithm used by the legacy train_with_fixed_params_lgbm).
This asymmetry is temporary. A follow-up Story (filed as a "high priority" recommendation by the pr_review committee) will evaluate XGB + canonical θ-sweep on val under its own FTF, and either fold XGB into the canonical path OR document the asymmetry as a permanent design choice with measured rationale.
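For readers unfamiliar with a val-tuned θ, the sketch below shows the general shape of a threshold sweep on a validation split. It is not the project's pick_threshold_on_val: the F1 objective, the grid, the function names, and the toy data are all assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def f1_at(theta: float, y_true: Sequence[int], proba: Sequence[float]) -> float:
    """Binary F1 when every probability >= theta is treated as a BUY signal."""
    tp = sum(1 for y, p in zip(y_true, proba) if p >= theta and y == 1)
    fp = sum(1 for y, p in zip(y_true, proba) if p >= theta and y == 0)
    fn = sum(1 for y, p in zip(y_true, proba) if p < theta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(y_true: Sequence[int], proba: Sequence[float],
                    grid: Sequence[float]) -> Tuple[float, float]:
    """Return the (theta, score) pair with the highest validation F1."""
    best_theta, best_score = 0.5, -1.0
    for theta in grid:
        score = f1_at(theta, y_true, proba)
        if score > best_score:   # strict ">" keeps the first theta on ties
            best_theta, best_score = theta, score
    return best_theta, best_score


# Toy validation split (hypothetical labels and predicted probabilities).
y_val = [0, 0, 0, 1, 0, 1, 1, 0]
p_val = [0.10, 0.30, 0.35, 0.40, 0.55, 0.65, 0.80, 0.20]
theta, score = sweep_threshold(y_val, p_val, grid=[0.3, 0.4, 0.5, 0.6, 0.7])

# Signal counts at two thresholds: a lower theta emits more BUY signals.
signals_low = sum(p >= 0.3 for p in p_val)
signals_picked = sum(p >= theta for p in p_val)
```

On this toy data the sweep picks θ = 0.4, and moving θ down to 0.3 raises the signal count from 4 to 6. That sensitivity is the mechanism behind the over-trading described in the Context section, where a too-low threshold multiplied the number of trades.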
Ensemble training cost: base model duplication across drivers
The current harness builds one Hamilton driver per train_one() /
train_ensemble() call. When multiple ensemble DAGs run in the same
FTF (e.g., stack_3model_avg AND stack_3model_logreg_shrink for the
same crypto + fold), each ensemble's driver independently calls
train_one("xgboost", ...), train_one("lightgbm", ...),
train_one("catboost", ...). Hamilton caches within a driver, NOT
across drivers, so the 3 base models get trained 2× (once per
ensemble variant) on the same data + HP.
This is accepted in Phase 3 for code-simplicity reasons (each
ensemble DAG is self-contained). Phase 4.1 (CVN-N001-EE-S16-cutover,
GH #893, OP wp#145) addresses cross-driver caching by hoisting base
model training into the FTF runner level (one train_one() call per
(crypto, fold, model), reused across all ensemble variants).
Estimated FTF wall-clock cost of the temporary duplication : ~2× the ensemble phase runtime, ~25% of the total FTF runtime (since base model training dominates). Acceptable for the single Track 11 re-run that unblocks the Story closure. Phase 4.1 amortises it.
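The duplication and the planned hoisting can be made concrete by counting base-model training calls. This is a sketch with stub functions: train_one_stub, the cache keyed by (crypto, fold, model), and the "BTC" symbol are hypothetical ; only the model and ensemble names are taken from this ADR.

```python
from typing import Dict, List, Tuple

BASE_MODELS = ("xgboost", "lightgbm", "catboost")
ENSEMBLES = ("stack_3model_avg", "stack_3model_logreg_shrink")

train_calls: List[Tuple[str, int, str]] = []


def train_one_stub(crypto: str, fold: int, model: str) -> str:
    """Stand-in for an expensive base-model training call."""
    train_calls.append((crypto, fold, model))
    return f"{model}-artifact"


def run_per_ensemble(crypto: str, fold: int) -> int:
    """Each ensemble trains its own base models (one driver per ensemble)."""
    train_calls.clear()
    for _ensemble in ENSEMBLES:
        for model in BASE_MODELS:
            train_one_stub(crypto, fold, model)
    return len(train_calls)


def run_hoisted(crypto: str, fold: int) -> int:
    """Base models are trained once per (crypto, fold, model) and reused."""
    train_calls.clear()
    cache: Dict[Tuple[str, int, str], str] = {}
    for _ensemble in ENSEMBLES:
        for model in BASE_MODELS:
            key = (crypto, fold, model)
            if key not in cache:
                cache[key] = train_one_stub(*key)
    return len(train_calls)


calls_duplicated = run_per_ensemble("BTC", 0)
calls_hoisted = run_hoisted("BTC", 0)
```

With two ensemble variants over three base models, the per-ensemble layout makes 6 training calls and the hoisted layout makes 3, which matches the 2× figure quoted for the ensemble phase.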
Multi-class (3-class BUY/SELL/HOLD) handling
The harness ships binary-only in PR #891. Every xgb_metrics_val /
lgb_metrics_val / cb_metrics_val Hamilton node raises
NotImplementedError("multiclass path not yet wired (Phase 1.7)") if
called with is_binary_classification() == False. This is intentional
fail-fast per ADR-25 — the current production sweep is binary
(CVN_BINARY_CLASSIFICATION=1 per ADR-63 = "FTF mission mode is binary
BUY/NOT_BUY ; 3-class is legacy"). Re-enabling 3-class training requires
a deliberate Story that revisits the canonical θ-sweep semantics for
multi-class (which has no single threshold). Tracked as a future Story
under CVN-N001-EE-S16 follow-up family.
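A minimal sketch of the fail-fast guard described here. The error message is quoted from this ADR ; the metrics_val function, its binary argument, and the accuracy metric are hypothetical simplifications of the real xgb_metrics_val / lgb_metrics_val / cb_metrics_val nodes.

```python
from typing import Dict, Sequence


def metrics_val(y_true: Sequence[int], y_pred: Sequence[int],
                binary: bool) -> Dict[str, float]:
    """Compute validation metrics, refusing any non-binary request outright."""
    if not binary:
        # No silent fallback: an unsupported mode stops the run immediately.
        raise NotImplementedError("multiclass path not yet wired (Phase 1.7)")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {"accuracy": hits / len(y_true)}


binary_metrics = metrics_val([1, 0, 1, 1], [1, 0, 0, 1], binary=True)

try:
    metrics_val([0, 1, 2], [0, 1, 1], binary=False)
    failed_fast = False
except NotImplementedError as exc:
    failed_fast = True
    message = str(exc)
```

Raising at the node boundary means a misconfigured 3-class run stops before any metrics are written, instead of producing binary-shaped numbers for a multi-class problem.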
cvntrade_lightgbm_adapter.py: NOT actually orphan
The plan_review committee verdict 8.5 (session 7f13b78d) and the
pr_review committee blocker #1 (session dd30c7eb) both classified
src/training/patterns/adapters/cvntrade_lightgbm_adapter.py as dead
code. This classification is incorrect. Discovered during the
Phase 4 implementation : the file is referenced as a STRING-VALUED
config in src/training/patterns/cvntrade_model_registry.py L134-L135
(hpo_interface_class / model_interface_class), which is dynamically
imported via importlib.import_module() at registry resolution time.
The chain is :
cache_interface.get_hpo_params() → on cache MISS →
CVNTrade_AutonomousHPO →
model_registry.optimize_hyperparameters("lightgbm_direct") →
_load_interface() → importlib.import_module("…cvntrade_lightgbm_adapter")
A grep for static import statements (which both committees relied on)
misses this dynamic resolution. The file is therefore kept in this
PR ; deletion is deferred to Phase 4.1 once the legacy
CVNTrade_AutonomousHPO path is also retired (its only LGB-relevant
caller is the legacy LGB autonomous trainer which Phase 4 already
opt-in routes through the harness).
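The blind spot is easy to reproduce. In the sketch below, a module is loaded from a string-valued config via importlib.import_module, and a static scan of import statements (done here with the ast module as a stand-in for grep) never sees it. The "module.Class" string format and the use of collections.OrderedDict as the target are assumptions for the demo ; only the hpo_interface_class key name comes from this ADR.

```python
import ast
import importlib

# Source of a hypothetical registry module: the dependency is only a string.
SOURCE = '''
import importlib
CONFIG = {"hpo_interface_class": "collections.OrderedDict"}

def load():
    module_name, _, attr = CONFIG["hpo_interface_class"].rpartition(".")
    return getattr(importlib.import_module(module_name), attr)
'''


def static_imports(source: str) -> set:
    """Collect the module names that appear in import statements only."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


seen_statically = static_imports(SOURCE)

# At runtime the string is resolved and the "unreferenced" module is loaded.
namespace: dict = {}
exec(SOURCE, namespace)
loaded = namespace["load"]()
```

The static scan reports only importlib, yet load() returns a class from collections. A dead-code check therefore also needs to search for the module name as a string literal, or trace the registry resolution at runtime.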
The pr_review committee will be re-engaged on this clarification ; if they confirm acceptance, this section becomes the canonical record.
References
- Plan dossier : documentation/reviews/2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md
- Committee verdict : session 7f13b78d (OP Meeting #125)
- Trigger PR : #872 (FTF 7-bug hotfix, demonstrated cost of triplicated trainers)
- Implementation PR : #891
- Sister registry pattern : src/commun/finetune/feature_selection/registry.py (ADR-67)