ADR-0089: Training harness as plugin registry (Hamilton)

Status: active
Date: 2026-05-09
Introduced by: CVN-N001-EE-S16 (GH #890, OP wp#143, PR #891)
Supersedes: none
Related: ADR-25 (no silent fallback), ADR-30/32/38 (structured logs), ADR-61 (Hamilton for batch DAGs), ADR-67 (pluggable feature-selection registry, same plugin pattern), ADR-68 (plan_review committee mandatory)

Context

Until 2026-05-09, the project had three independent autonomous trainers (CVNTrade_XGBoostAutonomousTrainer, 1785 lines; CVNTrade_LightGBMAutonomousTrainer, 407 lines; CVNTrade_CatBoostAutonomousTrainer, 376 lines), each with its own training loop, evaluation pipeline, threshold tuning, calibration, and logging schema. Bugs were systematically triplicated: the FTF 7-bug hotfix PR #872 fixed seven distinct issues, of which four were model-specific divergences from the XGB reference, three in LGB and one in CB (Bug #1 LGB Booster without predict_proba, Bug #6 LGB-only θ-sweep regression, Bug #7 LGB ECE/brier metrics returning NULL, Bug #2 CB nested vs flat metrics shape).

The trigger that forced the architectural decision was the Track 11 sweep ftf_20260508_210336_1a1160 (factor ensemble_diversity, 2026-05-08) where LGB structurally over-traded — 251 trades per fold versus 28-68 for the other variants, sortino −5.87 vs +0.93, win_rate 32% vs 53-65%. Root cause: the bespoke LGB θ-sweep added by PR #872 (intended as a hotfix) picked a too-low decision threshold under scale_pos_weight, causing the model to over-emit BUY signals.

Three failure modes structurally rooted in the trainer divergence:

Triplicated bugs: every fix had to be applied (or forgotten) in three places. The four model-specific bugs of PR #872 (three LGB, one CB) are illustrative ; over the project lifetime, dozens of similar divergences accumulated.
Observability gap — the XGB autonomous trainer emits ~10 distinct event= log types (event=threshold_method, event=calibrator_aggregated, event=focal_loss_active, …) ; the LGB autonomous trainer emits 1, the CB autonomous trainer emits 0. The Track 11 LGB anomaly was discovered only because the operator queried the PostgreSQL finetune_results table directly — Loki / Grafana could not have surfaced it.
High friction for new models / ensembles — adding a new model (e.g., LightGBM-GOSS, ExtraTrees, RandomForest) required ~400 lines of copy-pasted orchestration. Adding a new ensemble (e.g., a stacker variant) required surgical edits to ablation_runner.py's hardcoded if/elif dispatch.

The operator's three non-negotiable requirements for the resolution were stated explicitly during the 2026-05-09 design session :

New model = drop a DAG file (no edits to FTF runner, autonomous trainers, or any other call site).
Stacking = composition (declare which base predictions the ensemble consumes, the framework resolves dependencies).
Trainer = commodity (single canonical training entry point ; no caller-side branching on model_type).

A plan dossier was prepared (documentation/reviews/2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md) and submitted to the Expert Committee per ADR-68. Verdict : session 7f13b78d PASSED / OK / strong consensus, 5 experts scoring 7.5-8.5 (mean 8.1), OP Meeting #125.

Decision

The training path is unified into a single Hamilton-based training harness with plugin registries for models and ensembles. The harness lives at src/training/harness/ and its public surface is :

train_one(model_name, datasets, hpo_params) → TrainedArtifact — canonical entry point for single-model training.
train_ensemble(name, datasets, hpo_params) → TrainedArtifact — canonical entry point for ensemble training.
MODEL_REGISTRY, ENSEMBLE_REGISTRY — plugin registries populated by convention scan of dags/models/*.py and dags/ensembles/*.py.
Datasets, HPOParams, SplitMetrics, TrainedArtifact, FeatureVersion — typed contracts shared by every adapter, every Hamilton node, and every downstream consumer.
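The typed contracts can be pictured as frozen dataclasses. The sketch below is illustrative only: the class names come from this ADR, but every field name is an assumption made for the example, not the actual contents of contracts.py.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class Datasets:
    """Train / validation splits (field names are hypothetical)."""
    X_train: Sequence[Sequence[float]]
    y_train: Sequence[int]
    X_val: Sequence[Sequence[float]]
    y_val: Sequence[int]


@dataclass(frozen=True)
class HPOParams:
    """Hyperparameters chosen upstream and passed through unchanged."""
    values: Mapping[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class SplitMetrics:
    """Metrics for one split; every model fills the same fields."""
    split: str
    f1: float
    brier: float


@dataclass(frozen=True)
class TrainedArtifact:
    """What a training entry point hands back to its caller."""
    model_name: str
    threshold: float
    metrics: Sequence[SplitMetrics]
    native_model: Any = None


artifact = TrainedArtifact(
    model_name="lightgbm",
    threshold=0.5,
    metrics=(SplitMetrics(split="val", f1=0.6, brier=0.2),),
)
```

Freezing the payloads means an adapter or node cannot mutate a shared artifact in place, which is one plausible way to keep a single source of truth across consumers.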

Internal layout :

Layer	Responsibility	File count
`contracts.py`	Typed payloads (single source of truth)	1
`registry.py`	Plugin registries + decorators	1
`adapters/{base,xgb,lgb,cb,ensemble}.py`	The 15% per-model code (predict_proba shim, native handle wrap)	5
`nodes/{class_balance,eval_metrics,theta_sweep,log_emit,hpo_optuna}.py`	The 85% reused code (Hamilton-pure functions + Optuna HPO)	5
`dags/models/{xgboost,lightgbm,catboost}_dag.py`	One file = one model — declares the Hamilton DAG + the per-model HPO search space	3 (extensible)
`dags/ensembles/{blend_avg,stack_logreg,stack_xgb_meta}_dag.py`	One file = one ensemble	3 (extensible)
`autonomous_model_trainer.py`	Generic cache-aware wrapper around `train_one(model_type, …)` — used by the orchestrator for any registered model	1
`autonomous_ensemble_trainer.py`	Generic cache-aware wrapper around `train_ensemble(ensemble_name, …)` — used by the orchestrator for any registered ensemble	1
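As an illustration of the "predict_proba shim" that the adapter layer provides, the sketch below wraps a native handle that only exposes predict() returning P(class 1). FakeBooster and BoosterAdapter are hypothetical names for this example; the real adapters in adapters/ may be shaped differently.

```python
from typing import List, Protocol, Sequence


class ModelAdapter(Protocol):
    """The single prediction interface every downstream node relies on."""

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]: ...


class FakeBooster:
    """Stand-in for a native handle whose predict() returns P(class 1) only."""

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [min(1.0, max(0.0, sum(row) / len(row))) for row in X]


class BoosterAdapter:
    """Wraps a predict()-only handle and exposes a two-column predict_proba."""

    def __init__(self, native: FakeBooster) -> None:
        self.native = native

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        positive = self.native.predict(X)
        # Column 0 is P(class 0), column 1 is P(class 1).
        return [[1.0 - p, p] for p in positive]


adapter: ModelAdapter = BoosterAdapter(FakeBooster())
proba = adapter.predict_proba([[0.2, 0.4], [0.9, 0.7]])
```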

The autonomous trainers (autonomous_*_trainer.py) intentionally live inside the harness module (not under src/training/{XGBoost,LightGBM,CatBoost}/) because they are 100% generic — the per-model file scaffold no longer exists. Adding a new model = drop a DAG file under dags/models/, register it via register_model(...). The orchestrator _create_autonomous_trainer looks up the registry and instantiates CVNTrade_AutonomousModelTrainer(model_type=<new_name>, …) — zero new wrapper file, zero orchestrator edit.
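A minimal sketch of the decorator-plus-lookup mechanism, assuming register_model takes the model name as its only argument (the ADR does not state its signature). The builder function and the dict return value are simplifications for the example; the real train_one returns a TrainedArtifact.

```python
from typing import Any, Callable, Dict

MODEL_REGISTRY: Dict[str, Callable[[dict], dict]] = {}


def register_model(name: str):
    """Decorator that records a model builder under `name` (signature assumed)."""

    def decorator(build_fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = build_fn
        return build_fn

    return decorator


@register_model("extratrees")
def build_extratrees(hpo_params: dict) -> dict:
    # Placeholder for the per-model DAG; returns a dict instead of a real artifact.
    return {"model": "extratrees", "params": dict(hpo_params)}


def train_one(model_name: str, datasets: Any, hpo_params: dict) -> dict:
    """Single entry point: no branching on model_type, only a registry lookup."""
    try:
        build = MODEL_REGISTRY[model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY)) or "<none>"
        # Fail loudly instead of falling back to a default model.
        raise KeyError(f"unknown model {model_name!r}; registered: {known}") from None
    return build(hpo_params)


result = train_one("extratrees", datasets=None, hpo_params={"n_estimators": 200})
```

Raising on an unknown name, rather than defaulting to some model, matches the no-silent-fallback stance of ADR-25.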

The registry mechanism is convention scan over in-tree DAG modules (committee verdict 8.2 — Option A). The future entry_points extension for third-party plugins is a non-breaking addition.
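A convention scan can be sketched with the standard library alone. The example below assumes DAG files follow the `<name>_dag.py` naming seen in the layout table and that the part before the suffix is the model name; both the helper name and that rule are assumptions for illustration.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Dict


def scan_dag_modules(directory: Path, suffix: str = "_dag.py") -> Dict[str, ModuleType]:
    """Import every file ending in `suffix` and key it by the name before the suffix."""
    found: Dict[str, ModuleType] = {}
    for path in sorted(directory.glob(f"*{suffix}")):
        name = path.name[: -len(suffix)]
        spec = importlib.util.spec_from_file_location(f"dags_models_{name}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        found[name] = module
    return found


# Demo on a throwaway directory: two DAG files and one helper that must be ignored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "xgboost_dag.py").write_text("MODEL_NAME = 'xgboost'\n")
    (root / "extratrees_dag.py").write_text("MODEL_NAME = 'extratrees'\n")
    (root / "helpers.py").write_text("raise RuntimeError('not a DAG file')\n")
    scanned = scan_dag_modules(root)

print(sorted(scanned))  # ['extratrees', 'xgboost']
```

Because the scan only looks at file names, dropping a new `*_dag.py` file into the directory is enough for it to be picked up; no call site changes.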

The θ-sweep is opt-in via wrapper (committee verdict 8.3, Option B). LGB and CB call the canonical sweep ; XGB stays on its legacy walk-forward calibrator path. A follow-up Story will evaluate XGB + canonical θ-sweep under its own FTF.

The harness uses full Hamilton for all training DAGs (committee verdict 8.4 — per ADR-61 "batch DAGs use Hamilton, not imperative code").
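For readers new to Hamilton, its core convention is that each function is a node named after the function, and each parameter name refers to another node or to a runtime input. The toy resolver below imitates that convention with the standard library only. It does not use the Hamilton library or its API, and the node functions are invented for the example.

```python
import inspect
from typing import Any, Callable, Dict, List


def xgb_proba(x: List[float]) -> List[float]:
    return [min(1.0, v * 0.9) for v in x]


def lgb_proba(x: List[float]) -> List[float]:
    return [min(1.0, v * 1.1) for v in x]


def blend_avg(xgb_proba: List[float], lgb_proba: List[float]) -> List[float]:
    # Parameter names declare which upstream nodes this node consumes.
    return [(a + b) / 2 for a, b in zip(xgb_proba, lgb_proba)]


def execute(target: str, nodes: Dict[str, Callable], inputs: Dict[str, Any]) -> Any:
    """Resolve `target` by recursively computing the nodes its parameters name."""
    cache: Dict[str, Any] = dict(inputs)

    def resolve(name: str) -> Any:
        if name not in cache:
            if name not in nodes:
                raise KeyError(f"no node or input named {name!r}")
            fn = nodes[name]
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**kwargs)  # each node is computed at most once
        return cache[name]

    return resolve(target)


NODES = {fn.__name__: fn for fn in (xgb_proba, lgb_proba, blend_avg)}
blended = execute("blend_avg", NODES, {"x": [0.5, 0.2]})
```

The ensemble node never calls the base models directly; it only names them, which is the sense in which stacking becomes composition.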

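The adapter src/training/patterns/adapters/cvntrade_lightgbm_adapter.py, which committee verdict 8.5 classified as dead code, is scheduled for deletion in Phase 4.1. That classification was later found to be incorrect because the file is imported dynamically, so the deletion depends on first retiring the legacy CVNTrade_AutonomousHPO path (see the section on cvntrade_lightgbm_adapter.py under the documented trade-offs).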

Phased rollout

Phase	Scope	Status
1	Skeleton + XGB adapter + 1e-9 byte-for-byte parity vs legacy	✅ shipped (commit `8415c919`)
2	LGB + CB adapters + 3-way parity + logging contract test	✅ shipped (commit `153c5c6e`)
3	Ensemble registry + `blend_avg` + `stack_logreg` DAGs	✅ shipped (commit `5defb939`)
3.1	`stack_xgb_meta` (deferred until `XGBMetaAggregator.predict_proba` ships)	⏳ blocked on wp#45
4	`CVN_USE_HARNESS=1` flag routes LGB autonomous via harness ; ADR-89 ; 4th-model pickup test	✅ shipped (PR #891)
4.1	Full cutover : XGB + CB autonomous routed via harness, ablation matrix wires `CVN_MODEL_TYPE=stack_3model_*` for ensemble dispatch, generic `CVNTrade_AutonomousModelTrainer` + `CVNTrade_AutonomousEnsembleTrainer` replace per-model autonomous wrappers, ~2400 LOC of legacy autonomous trainers deleted	✅ shipped (PR #896 + #898)
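The parity gate used in Phases 1 and 2 can be sketched as a comparison of metric dictionaries. This is a hypothetical helper, and it assumes the 1e-9 figure is an absolute tolerance; the actual parity tests may compare different structures.

```python
import math
from typing import Mapping


def assert_parity(
    legacy: Mapping[str, float],
    harness: Mapping[str, float],
    tol: float = 1e-9,
) -> None:
    """Raise AssertionError if the two paths disagree on keys or on any value."""
    if legacy.keys() != harness.keys():
        diff = sorted(set(legacy) ^ set(harness))
        raise AssertionError(f"metric keys differ: {diff}")
    for key in sorted(legacy):
        # rel_tol=0.0 makes this a pure absolute-tolerance comparison (assumption).
        if not math.isclose(legacy[key], harness[key], rel_tol=0.0, abs_tol=tol):
            raise AssertionError(
                f"{key}: legacy={legacy[key]!r} harness={harness[key]!r}"
            )


legacy_metrics = {"f1": 0.61234, "brier": 0.2101}
harness_metrics = {"f1": 0.61234, "brier": 0.2101}
assert_parity(legacy_metrics, harness_metrics)
```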

Compatibility & migration

Phase 4 ships a feature flag (CVN_USE_HARNESS) that, when set to 1 in production, routes the LGB autonomous trainer through the harness. Default OFF preserves legacy behaviour. The legacy result-dict shape consumed by regime_trainer.train_weighted_variant is preserved on both paths — the cutover is transparent to downstream consumers.
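The flag routing can be sketched as follows. CVN_USE_HARNESS and the _via_harness marker come from this ADR; the function names and the callables standing in for the two training paths are assumptions for the example.

```python
import os
from typing import Callable, Dict, Mapping, Optional


def use_harness(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when CVN_USE_HARNESS is exactly '1'; unset means legacy (default OFF)."""
    env = os.environ if env is None else env
    return env.get("CVN_USE_HARNESS", "0") == "1"


def train_lgb(
    legacy_path: Callable[[], Dict],
    harness_path: Callable[[], Dict],
    env: Optional[Mapping[str, str]] = None,
) -> Dict:
    """Route to one path and return the same result-dict shape either way."""
    if use_harness(env):
        result = dict(harness_path())
        result["_via_harness"] = True  # marker so the operator can confirm the path
        return result
    return dict(legacy_path())


def fake_legacy() -> Dict:
    return {"f1": 0.60}


def fake_harness() -> Dict:
    return {"f1": 0.60}


off = train_lgb(fake_legacy, fake_harness, env={})
on = train_lgb(fake_legacy, fake_harness, env={"CVN_USE_HARNESS": "1"})
```

Passing the environment as a parameter keeps the routing testable without mutating the process environment.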

The flag will flip default-on in Phase 4.1 once the operator validates the harness path on a real Track 11 re-run. The legacy train_with_fixed_params_lgbm is removed only after Phase 4.1 closes.

XGB and CB autonomous trainers stay on their legacy paths until Phase 4.1. Their MLflow registry, walk-forward optimisation, and meta-label training features are NOT yet wrapped in the harness — they will be moved to harness nodes in Phase 4.1 with the same parity-test gate.

What this ADR does NOT prescribe

The harness is a TRAINING-path concern. The inference path (src/commun/inference/, src/commun/pipeline/) is unchanged.
The harness CONSUMES the existing feature_selection_result contract produced by CVNTrade_AutonomousFeatureSelector. CUSUM, FE, labelling, and the FE pipeline are out of scope.
Multiclass training paths in the harness raise NotImplementedError with an explicit pointer (Phase 1.7 follow-up). All current production use is binary (per CVN_BINARY_CLASSIFICATION=1).
The harness does NOT replace the CVNTrade_AutonomousOrchestrator dispatch ; the orchestrator's if/elif chain on model_type will be replaced by a MODEL_REGISTRY[model_type] lookup in Phase 4.1.

Consequences

Positive :

Bug recurrence structurally prevented in the shared stages. The single canonical theta_sweep, eval_metrics, class_balance, and log_emit nodes mean a fix lands once for every model that uses them ; model-specific divergence can now only arise in the thin per-model adapter code.
Observability uniform. Every model emits event=training_started, event=class_balance_applied, event=training_complete with identical field schema. LGB and CB additionally emit event=theta_picked. A single Loki query covers all 3 models.
New model = one file. The synthetic 4th-model pickup test (tests/unit/training_harness/test_phase4_lgb_cutover.py::TestSyntheticFourthModelPickup) demonstrates this structurally.
Stacking = composition. The EnsembleAdapter wraps N base TrainedArtifacts ; Hamilton resolves and caches the base model outputs across sibling ensemble DAGs.
Parity tests prove the cutover is safe. XGB and LGB are byte-for-byte 1e-9 matches vs their legacy paths on synthetic data. CB is shape-equivalent (legacy CB had a divergent metrics calculator).
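The uniform logging contract can be illustrated with a single formatter shared by all models. The event names below are the ones listed in this ADR; the field names (model, crypto, fold, theta) and the helper functions are assumptions for the example, not the real log_emit schema.

```python
from typing import Any, Dict


def format_event(event: str, model: str, crypto: str, fold: int, **extra: Any) -> str:
    """Render one key=value log line; the first four fields are always present."""
    fields: Dict[str, Any] = {"event": event, "model": model, "crypto": crypto, "fold": fold}
    fields.update(sorted(extra.items()))  # optional fields follow in a stable order
    return " ".join(f"{key}={value}" for key, value in fields.items())


def parse_event(line: str) -> Dict[str, str]:
    """Inverse of format_event for values that contain no whitespace."""
    return dict(part.split("=", 1) for part in line.split())


started = [
    format_event("training_started", model, "BTC", 0)
    for model in ("xgboost", "lightgbm", "catboost")
]
theta_line = format_event("theta_picked", "lightgbm", "BTC", 0, theta=0.35)
```

Since every model goes through the same formatter, one query on the event field covers all three models without per-model parsing rules.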

Negative / trade-offs :

Hamilton has a learning curve. Future contributors must read the Hamilton documentation to understand how nodes resolve. Mitigated by the existing precedent in src/commun/finetune/feature_selection/dags/ which uses the same pattern.
The harness has a startup cost (Hamilton driver build per call) of ~50 ms. Acceptable for the FTF use case (training takes seconds-to-minutes per fold). To be benchmarked in Phase 4.1 ; if > 5% regression vs legacy, cache the driver per model_name.
Phase 4 ships behind a flag, so the production benefit is delayed until the operator flips it. Mitigated by the explicit _via_harness marker in the result dict + event=lgb_autonomous_via_harness log so the operator can confirm the path activation in Loki.
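If the startup cost does turn out to matter, caching the driver per model_name could look like the sketch below. build_driver is a placeholder for the real Hamilton driver construction, which is not shown in this ADR.

```python
from functools import lru_cache
from typing import Dict

build_calls: Dict[str, int] = {}


def build_driver(model_name: str) -> dict:
    """Placeholder for the expensive driver build; counts how often it runs."""
    build_calls[model_name] = build_calls.get(model_name, 0) + 1
    return {"model_name": model_name}


@lru_cache(maxsize=None)
def get_driver(model_name: str) -> dict:
    """Build once per model_name, then reuse the same object on every call."""
    return build_driver(model_name)


for _ in range(3):
    get_driver("lightgbm")
get_driver("xgboost")
```

The cache is keyed on model_name only, so this is safe only if a driver carries no per-call state such as datasets or hyperparameters.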

Tracked follow-ups :

Phase 3.1 — stack_xgb_meta ensemble DAG (blocked on wp#45's XGBMetaAggregator.predict_proba implementation).
Phase 4.1 — full cutover of XGB + CB + ablation_runner ; delete legacy train_with_fixed_params + train_with_fixed_params_lgbm + orphan cvntrade_lightgbm_adapter.py.
Companion Story #892 (CVN-N015-EA-S10) — triage the 13 bandit medium+ findings waived in .bandit-baseline.json (introduced by the bandit hook fix shipped alongside this Story).
Future Story — evaluate XGB + canonical θ-sweep on val under its own FTF (committee verdict 8.3 follow-up).

Documented architectural trade-offs (pr_review committee `dd30c7eb`)

These trade-offs are explicitly accepted as part of the Phase 4 ship and documented here for future maintainers per the committee's "transparency for future maintainers" recommendation.

θ-sweep asymmetry across XGB / LGB / CB

The harness intentionally ships two threshold strategies in parallel :

XGB : threshold = 0.5 (legacy walk-forward calibrator handles per-fold threshold separately, outside the harness). Reason : XGB legacy behaviour is preserved byte-for-byte to enable 1e-9 parity testing in Phase 1, and XGB's walk-forward calibrator (autonomous_trainer.py L1297-L1425) implements a different, validated algorithm (f1_binary + fallback expectancy).
LGB and CB : opt INTO the canonical pick_threshold_on_val algorithm (val-tuned θ, same byte-for-byte algorithm used by the legacy train_with_fixed_params_lgbm).

This asymmetry is temporary. A follow-up Story (filed as a "high priority" recommendation by the pr_review committee) will evaluate XGB + canonical θ-sweep on val under its own FTF, and either fold XGB into the canonical path OR document the asymmetry as a permanent design choice with measured rationale.
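To make "val-tuned θ" concrete, here is a generic threshold sweep on validation predictions. It is not the project's pick_threshold_on_val: the F1 objective, the 0.05 grid, and the function names are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def f1_at(theta: float, proba: Sequence[float], y: Sequence[int]) -> float:
    """Binary F1 when every probability >= theta is treated as a BUY signal."""
    tp = sum(1 for p, t in zip(proba, y) if p >= theta and t == 1)
    fp = sum(1 for p, t in zip(proba, y) if p >= theta and t == 0)
    fn = sum(1 for p, t in zip(proba, y) if p < theta and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(
    proba: Sequence[float],
    y: Sequence[int],
    grid: Optional[List[float]] = None,
) -> Tuple[float, float]:
    """Return (theta, f1) for the first grid point with the highest validation F1."""
    grid = grid or [round(0.05 * i, 2) for i in range(1, 20)]
    best_theta, best_f1 = 0.5, -1.0
    for theta in grid:
        score = f1_at(theta, proba, y)
        if score > best_f1:
            best_theta, best_f1 = theta, score
    return best_theta, best_f1


val_proba = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
val_y = [0, 0, 1, 0, 1, 1]
theta, f1 = sweep_threshold(val_proba, val_y)
signals_at_theta = sum(1 for p in val_proba if p >= theta)
signals_at_low = sum(1 for p in val_proba if p >= 0.1)
```

On this toy data the sweep picks θ = 0.35 and emits 4 signals, while θ = 0.1 would emit 6. The same mechanism, at a larger scale, is how a too-low threshold inflates trade counts.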

Ensemble training cost: base model duplication across drivers

The current harness builds one Hamilton driver per train_one() / train_ensemble() call. When multiple ensemble DAGs run in the same FTF (e.g., stack_3model_avg AND stack_3model_logreg_shrink for the same crypto + fold), each ensemble's driver independently calls train_one("xgboost", ...), train_one("lightgbm", ...), train_one("catboost", ...). Hamilton caches within a driver, NOT across drivers, so the 3 base models get trained 2× (once per ensemble variant) on the same data + HP.

This is accepted in Phase 3 for code-simplicity reasons (each ensemble DAG is self-contained). Phase 4.1 (CVN-N001-EE-S16-cutover, GH #893, OP wp#145) addresses cross-driver caching by hoisting base model training into the FTF runner level (one train_one() call per (crypto, fold, model), reused across all ensemble variants).
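The difference between per-driver caching and hoisting can be shown by counting base-model trainings. This is a simplified model of the behaviour described above, with a hypothetical helper; the cache key (crypto, fold, model) follows the text, and the ensemble names are the two examples given.

```python
from typing import Dict, Sequence, Tuple

BASE_MODELS = ("xgboost", "lightgbm", "catboost")
Key = Tuple[str, int, str]  # (crypto, fold, model)


def count_base_trainings(
    ensembles: Sequence[str], crypto: str, fold: int, hoisted: bool
) -> int:
    """Count train_one() calls when the cache is per ensemble vs shared by the runner."""
    shared: Dict[Key, str] = {}
    trainings = 0
    for _ensemble in ensembles:
        # Without hoisting, each ensemble driver starts with an empty cache.
        cache: Dict[Key, str] = shared if hoisted else {}
        for model in BASE_MODELS:
            key = (crypto, fold, model)
            if key not in cache:
                cache[key] = f"artifact:{model}"  # stands in for a train_one() call
                trainings += 1
    return trainings


variants = ["stack_3model_avg", "stack_3model_logreg_shrink"]
per_driver = count_base_trainings(variants, "BTC", 0, hoisted=False)
runner_level = count_base_trainings(variants, "BTC", 0, hoisted=True)
```

With two ensemble variants this gives 6 trainings per (crypto, fold) without hoisting and 3 with it, which is the 2× factor quoted for the ensemble phase.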

Estimated FTF wall-clock cost of the temporary duplication : ~2× the ensemble phase runtime, ~25% of the total FTF runtime (since base model training dominates). Acceptable for the single Track 11 re-run that unblocks the Story closure. Phase 4.1 amortises it.

Multi-class (3-class BUY/SELL/HOLD) handling

The harness ships binary-only in PR #891. Every xgb_metrics_val / lgb_metrics_val / cb_metrics_val Hamilton node raises NotImplementedError("multiclass path not yet wired (Phase 1.7)") if called with is_binary_classification() == False. This is intentional fail-fast per ADR-25 — the current production sweep is binary (CVN_BINARY_CLASSIFICATION=1 per ADR-63 = "FTF mission mode is binary BUY/NOT_BUY ; 3-class is legacy"). Re-enabling 3-class training requires a deliberate Story that revisits the canonical θ-sweep semantics for multi-class (which has no single threshold). Tracked as a future Story under CVN-N001-EE-S16 follow-up family.
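The fail-fast guard can be sketched as below. The error message and the CVN_BINARY_CLASSIFICATION variable are taken from this ADR; how is_binary_classification() actually reads its setting, the default when the variable is unset, and the metrics_val function are assumptions for the example.

```python
import os
from typing import Mapping, Optional, Sequence


def is_binary_classification(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed to read CVN_BINARY_CLASSIFICATION; treats unset as not binary."""
    env = os.environ if env is None else env
    return env.get("CVN_BINARY_CLASSIFICATION", "0") == "1"


def metrics_val(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Compute validation metrics, refusing to run on the unwired multiclass path."""
    if not is_binary_classification(env):
        # Raise rather than silently computing binary metrics on 3-class labels.
        raise NotImplementedError("multiclass path not yet wired (Phase 1.7)")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {"accuracy": hits / len(y_true)}


binary_env = {"CVN_BINARY_CLASSIFICATION": "1"}
metrics = metrics_val([1, 0, 1, 1], [1, 0, 0, 1], env=binary_env)
```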

`cvntrade_lightgbm_adapter.py`: NOT actually orphan

The plan_review committee verdict 8.5 (session 7f13b78d) and the pr_review committee blocker #1 (session dd30c7eb) both classified src/training/patterns/adapters/cvntrade_lightgbm_adapter.py as dead code. This classification is incorrect. Discovered during the Phase 4 implementation : the file is referenced as a STRING-VALUED config in src/training/patterns/cvntrade_model_registry.py L134-L135 (hpo_interface_class / model_interface_class), which is dynamically imported via importlib.import_module() at registry resolution time. The chain is :

cache_interface.get_hpo_params() → on cache MISS →
  CVNTrade_AutonomousHPO →
    model_registry.optimize_hyperparameters("lightgbm_direct") →
      _load_interface() → importlib.import_module("…cvntrade_lightgbm_adapter")

A grep for static import statements (which both committees relied on) misses this dynamic resolution. The file is therefore kept in this PR ; deletion is deferred to Phase 4.1 once the legacy CVNTrade_AutonomousHPO path is also retired (its only LGB-relevant caller is the legacy LGB autonomous trainer which Phase 4 already opt-in routes through the harness).
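The blind spot can be reproduced in a few lines. The sketch below uses a stand-in source snippet with a string-valued class reference (the "module:Class" format and the json.decoder target are invented for the example) and shows that a scan of import statements does not see the module that is loaded at run time, while a scan of string constants does.

```python
import ast

LEGACY_REGISTRY_SOURCE = '''
import importlib

CONFIG = {"model_interface_class": "json.decoder:JSONDecoder"}

def load_interface():
    module_path, class_name = CONFIG["model_interface_class"].split(":")
    return getattr(importlib.import_module(module_path), class_name)
'''


def static_imports(source: str) -> set:
    """Modules named in import statements: what a grep for imports would find."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def dotted_string_constants(source: str) -> set:
    """String literals containing a dot: candidates for dynamic module references."""
    return {
        node.value
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and "." in node.value
    }


namespace: dict = {}
exec(LEGACY_REGISTRY_SOURCE, namespace)
loaded = namespace["load_interface"]()  # imports json.decoder at call time

seen_statically = static_imports(LEGACY_REGISTRY_SOURCE)
seen_in_strings = dotted_string_constants(LEGACY_REGISTRY_SOURCE)
```

A dead-code check that also searches string constants for module paths would have flagged the adapter as still in use.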

The pr_review committee will be re-engaged on this clarification ; if they confirm acceptance, this section becomes the canonical record.

References

Plan dossier : documentation/reviews/2026-05-09-cvn-n001-ee-s16-training-harness-unification-plan.md
Committee verdict : session 7f13b78d (OP Meeting #125)
Trigger PR : #872 (FTF 7-bug hotfix — demonstrated cost of triplicated trainers)
Implementation PR : #891
Sister registry pattern : src/commun/finetune/feature_selection/registry.py (ADR-67)