Plan dossier, CVN-N001-EE-S16 : unified Training Harness (Hamilton + plugin registry)

Date : 2026-05-09
Story : CVN-N001-EE-S16 (GH #890 · OP wp#143)
Author : Dominique (operator) + Claude
Session type : plan_review (mandatory before implementation per ADR-68 + CLAUDE.md §3)
Status : ✅ committee verdict obtained, PASSED / OK / strong consensus (session 7f13b78d, OP Meeting #125, 5 experts, scores 7.5-8.5, mean 8.1)
Sizing estimate : 5-7 days
Branch : feat/CVN-N001-EE-S16-training-harness-unification (active)

1. The discovery : what triggered this Story

1.1 Symptom : LGB completely broken in Track 11 sweep

The Track 11 sweep (ensemble_diversity factor, run ftf_20260508_210336_1a1160, currently running) shows structurally invalid LGB results :

| variant | fold | n_trades avg | sortino avg | sortino min / max | win_rate avg |
|---|---|---|---|---|---|
| `cb_only` | 3 | 68 | +0.93 | -1.96 / +2.27 | 53% |
| `cb_only` | 4 | 67 | +0.76 | -0.38 / +1.61 | 48% |
| `lgb_only` | 3 | 251 | -5.87 | -6.56 / -5.25 | 32% |
| `lgb_only` | 4 | 142 | -1.64 | -3.28 / +0.77 | 34% |
| `none` | 3 | 44 | +0.50 | -0.58 / +1.08 | 57% |
| `stack_3model_avg` | 3 | 28 | +0.92 | -0.70 / +3.14 | 60% |
| `stack_3model_logreg_shrink` | 3 | 26 | +0.91 | -0.21 / +1.82 | 64% |

Smoking gun : lgb_only takes roughly 2× to 10× more trades than the other variants (251 vs 26-68 on fold 3, 142 vs 67 on fold 4) and its win rate is 32-34% (vs 48-64% elsewhere). Sortino has no floor here : it collapses to -5.87 on fold 3 because LGB systematically over-trades.

Probable root cause : the val-tuned theta_sweep in src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:160-175 (added by PR #872 hotfix) picks too low a threshold under scale_pos_weight. The predicted probabilities concentrate above 0.05, so the sweep settles on θ ≈ 0.05-0.10 and the model over-trades on test. XGB does NOT have this bug because its θ-tuning lives in the walk-forward calibrator (_calibrator_aggregated, line 1425). CB has no θ-sweep at all.

This is the third LGB-specific bug found in 2 weeks (Bugs #1, #6, #7 of the FTF 7-bug hotfix PR #872 were all LGB-specific). The pattern is structural : LGB has its own training/eval/threshold codepath, divergent from XGB and from CB.

1.2 Symptom : logging gap

XGB autonomous trainer = 1785 lines with ~10 distinct event= log types (event=threshold_method, event=calibrator_aggregated, event=focal_loss_active, event=temperature_scaling_fit, event=xgboost_training_failed, ...). LGB autonomous = 407 lines with 1 event= log (event=lgb_autonomous_threshold_default). CB autonomous = 376 lines with 0 structured logs.

Consequence : LGB and CB training runs are invisible in Loki. None of best_iteration, training_time_sec, n_estimators_final, class_distribution, dataset_size, or the θ actually retained can be queried in production. We discovered the LGB Track 11 anomaly only because the operator looked at the PG finetune_results table ; a smaller anomaly would have gone unnoticed.

1.3 Root cause : three divergent codepaths

Today there are three separate trainers with overlapping but divergent responsibilities :

| Concern | XGB | LGB | CB |
|---|---|---|---|
| Train function | `train_with_fixed_params` (in autonomous_trainer.py L1627) | `train_with_fixed_params_lgbm` (grid_utils.py) | inline in `_regenerate` (autonomous L247) |
| Eval metrics | `_eval_split` (L1717), 11 metrics | inline in grid_utils L177-268, 14 metrics | `trainer.train()` returns nested dict |
| θ-tuning | walk-forward calibrator + `_calibrator_aggregated` | ad-hoc sweep in grid_utils L160-175 (PR #872 hotfix) | none |
| Class balancing | `compute_class_weight` + `sample_weights` | `scale_pos_weight` (binary) OR `sample_weights` (multi) | (catboost native) |
| Calibration | isotonic + temperature_scaling per fold | none | none |
| Meta-label | yes | no | no |
| Walk-forward | yes (`_walk_forward_optimization`) | no | no |
| MLflow registry | yes (model card + features artifact + LdP params) | no | no |
| Logging | rich (~10 events, 50+ INFO) | minimal (1 event, 6 INFO) | minimal (0 events, 8 INFO) |

Each new bug found in any trainer requires a triplicated fix or, more realistically, gets fixed in one trainer and forgotten in the others, which is exactly what Bugs #1, #6, #7 demonstrated. Each new model adds another full codepath (~400 lines minimum). Each new ensemble strategy is hardcoded inline in the FTF runner rather than registered as a plugin.

The Training Harness unification addresses all three issues with a single architectural move.

2. Goals : operator's explicit requirements

The operator has stated three non-negotiable requirements :

1. New model = drop a DAG file : adding LightGBM-GOSS, ExtraTrees, RandomForest, Logreg-baseline must NOT touch the training core, the FTF runner, the autonomous orchestrator, or the walk-forward path. One file, one registration, the system picks it up.
2. Stacking = composition, not bespoke code : adding a new stacker (mean, weighted, logreg meta, xgb meta, attention meta, ...) must be a one-file affair that DECLARES which base-model predictions it consumes. Hamilton resolves the dependencies.
3. Trainer = commodity : the autonomous trainers, the FTF screening loop and the walk-forward predictor all use the SAME canonical training function train_one(model_name, datasets, hpo_params) → TrainedArtifact. No model-specific orchestration code at the call sites.

Acceptance bar for "done" :
- Adding a 4th model (e.g., logistic regression baseline) takes < 1 day end-to-end (file + adapter + registration + test).
- Adding a new ensemble variant takes < 0.5 day.
- A single Loki query returns training events for ALL models with the SAME field schema.
- Bug #6 of PR #872 (LGB-only θ-sweep regression) becomes structurally impossible (single canonical θ-sweep).

3. Architecture sketch

src/training/harness/
├─ contracts.py           # TypedDict / dataclass : Datasets, Predictions, Metrics, TrainedArtifact, HPOParams
├─ registry.py            # @register_model("xgboost"), @register_ensemble("stack_logreg")
│                         # plugin discovery via convention scan + (optional) entry_points
│
├─ adapters/              # ModelAdapter implementations (per-model — the 15% that differs)
│  ├─ base.py             # protocol : fit(), predict_proba(), best_iteration, native_handle
│  ├─ xgb.py              # @register_model — wrap xgb.Booster
│  ├─ lgb.py              # @register_model — wrap lgb.Booster + predict_proba(n,2) shim
│  └─ cb.py               # @register_model — wrap catboost.CatBoostClassifier
│
├─ nodes/                 # Hamilton nodes — pure typed functions (the 85% reused)
│  ├─ class_balance.py    # scale_pos_weight | sample_weights (single canonical impl)
│  ├─ theta_sweep.py      # val-tuned threshold (ONE single algorithm, so the LGB θ bug becomes structurally impossible)
│  ├─ eval_metrics.py     # f1/auc/brier/ECE/logloss (ONE single shared function)
│  ├─ calibrate.py        # isotonic / sigmoid / temperature
│  ├─ persist.py          # MLflow + cache.store_*
│  ├─ log_emit.py         # event=training_complete + event=training_started + event=theta_picked … common payload
│  └─ predict_oos.py      # walk-forward OOS predictions (model-agnostic via adapter)
│
├─ dags/models/           # 1 file = 1 model
│  ├─ xgboost.py          # @register_model — Hamilton DAG "xgboost_predictions"
│  ├─ lightgbm.py         # @register_model — Hamilton DAG "lightgbm_predictions"
│  ├─ catboost.py         # @register_model — Hamilton DAG "catboost_predictions"
│  └─ <new_model>.py      # ← drop a file ⇒ model available
│
├─ dags/ensembles/        # 1 file = 1 ensemble strategy
│  ├─ blend_avg.py        # @register_ensemble — average of base model predictions
│  ├─ blend_weighted.py   # weighted by AUC val
│  ├─ stack_logreg.py     # logreg meta on stacked OOS predictions
│  ├─ stack_xgb_meta.py   # xgboost meta
│  └─ <new_stack>.py      # ← drop a file ⇒ stack available
│
├─ dags/wrappers/         # meta-DAGs that compose the others
│  ├─ hpo.py              # Optuna around any model DAG
│  ├─ walk_forward.py     # walk-forward folds around any model DAG
│  └─ fixed_params.py     # FTF screening — fixed HP, around any model DAG
│
└─ tests/
   ├─ parity/             # XGB ≡ LGB ≡ CB on synthetic data (same dataset → same metrics shape)
   ├─ contract/           # adapter contract per model + registry round-trip
   ├─ logging/            # all 3 emit same events with same fields
   └─ plugin/             # add a synthetic 4th model in test, verify pickup

3.1 Concrete API surface

Adding a new model :

# src/training/harness/dags/models/extreme_trees.py
from sklearn.ensemble import ExtraTreesClassifier
from training.harness.registry import register_model
from training.harness.contracts import Datasets, HPOParams, TrainedArtifact
from training.harness.adapters.sklearn_proba import SklearnAdapter

@register_model("extreme_trees")
def train(datasets: Datasets, hpo_params: HPOParams) -> TrainedArtifact:
    model = ExtraTreesClassifier(**hpo_params.to_sklearn()).fit(*datasets.train)
    return TrainedArtifact(adapter=SklearnAdapter(model), best_iteration=None)

The FTF, the autonomous trainer, the walk-forward — all see this model automatically. Zero other files modified.

Adding a new stacker :

# src/training/harness/dags/ensembles/stack_attention.py
from training.harness.registry import register_ensemble
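# Datasets and StackFeatures are used in the signatures below; StackFeatures is assumed to live in contracts
from training.harness.contracts import Datasets, OOSPredictions, StackFeatures, TrainedArtifact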

@register_ensemble("stack_attention", base_models=["xgboost", "lightgbm", "catboost"])
def meta_features(predictions_per_model: dict[str, OOSPredictions]) -> StackFeatures: ...
def meta_model(meta_features: StackFeatures, datasets: Datasets) -> TrainedArtifact: ...

Hamilton automatically composes the dependency on the three base-model DAGs.
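To make the composition claim concrete, here is a minimal sketch of name-based dependency resolution, the principle Hamilton relies on: a function's parameter names are matched to other functions (or to inputs) with the same name. This is a toy resolver written for illustration, not Hamilton's API, and the three node functions are hypothetical stand-ins for real model DAGs.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def resolve(target: str, nodes: Dict[str, Callable[..., Any]],
            inputs: Dict[str, Any], cache: Optional[Dict[str, Any]] = None) -> Any:
    """Compute `target` by resolving each parameter name to an input or another node."""
    cache = {} if cache is None else cache
    if target in inputs:
        return inputs[target]
    if target not in cache:
        if target not in nodes:
            raise KeyError(f"no node or input named {target!r}")
        fn = nodes[target]
        # Each parameter name is itself a dependency to resolve first.
        kwargs = {name: resolve(name, nodes, inputs, cache)
                  for name in inspect.signature(fn).parameters}
        cache[target] = fn(**kwargs)  # computed once, reused by every consumer
    return cache[target]


# Hypothetical base-model nodes: each one maps raw rows to fake probabilities.
def xgboost_predictions(rows):
    return [min(1.0, r * 0.5) for r in rows]


def lightgbm_predictions(rows):
    return [min(1.0, r * 0.25) for r in rows]


# The stacker only DECLARES what it consumes, through its parameter names.
def blend_avg(xgboost_predictions, lightgbm_predictions):
    return [(a + b) / 2 for a, b in zip(xgboost_predictions, lightgbm_predictions)]


NODES = {f.__name__: f for f in (xgboost_predictions, lightgbm_predictions, blend_avg)}
blended = resolve("blend_avg", NODES, {"rows": [1.0, 2.0]})
```

The per-call cache is what keeps a base model from being recomputed when several stackers consume it, which is the concern behind Risk #6.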

Calling the harness from FTF :

# src/commun/finetune/ablation_runner.py (refactored)
from training.harness.registry import ModelRegistry, EnsembleRegistry

for variant_name in factor.variants:
    if variant_name in ModelRegistry:
        artifact = ModelRegistry[variant_name].train(datasets, hpo_params)
    elif variant_name in EnsembleRegistry:
        artifact = EnsembleRegistry[variant_name].train(datasets, hpo_params, base_predictions=...)

FTF doesn't know what XGB/LGB/CB are anymore — it iterates over registry entries.
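A decorator-backed registry of this kind can be sketched in a few lines. The sketch below is an assumption about one possible shape, not the harness implementation: the `Registry` class, the `majority_baseline` model and the dict-based `datasets` are hypothetical, and entries here are plain callables rather than objects exposing `.train(...)`.

```python
from typing import Any, Callable, Dict, List


class Registry:
    """Name -> training callable; filled as a side effect of decorators."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._entries: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._entries:
                # Two files claiming the same name is a bug, not something to paper over.
                raise ValueError(f"{self.kind} {name!r} is already registered")
            self._entries[name] = fn
            return fn
        return decorator

    def __contains__(self, name: str) -> bool:
        return name in self._entries

    def __getitem__(self, name: str) -> Callable[..., Any]:
        if name not in self._entries:
            raise KeyError(f"unknown {self.kind} {name!r}; known: {self.names()}")
        return self._entries[name]

    def names(self) -> List[str]:
        return sorted(self._entries)


ModelRegistry = Registry("model")
register_model = ModelRegistry.register


@register_model("majority_baseline")  # hypothetical model, for illustration only
def train_majority(datasets: dict, hpo_params: dict) -> dict:
    labels = datasets["train_labels"]
    return {"model": "majority_baseline", "prediction": max(set(labels), key=labels.count)}


def train_one(model_name: str, datasets: dict, hpo_params: dict) -> dict:
    """Single canonical entry point: the call site never branches on model type."""
    return ModelRegistry[model_name](datasets, hpo_params)
```

Raising on an unknown name, rather than skipping the variant, is in line with the "No silent fallback" rule cited from ADR-25.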

3.2 The TrainedArtifact contract (single-source-of-truth payload)

from dataclasses import dataclass

@dataclass(frozen=True)
class TrainedArtifact:
    adapter: ModelAdapter            # exposes predict_proba, native_handle, best_iteration
    metrics_val: SplitMetrics        # f1_buy, auc_buy, brier, ECE, …
    metrics_test: SplitMetrics       # same shape
    threshold_buy: float             # val-tuned, canonical algo
    cusum_metadata: CusumMetadata    # fitted_sigma, threshold_h_calibrated
    selected_features: list[str]
    feature_names: list[str]         # alias for backward compat
    hyperparams: dict
    training_time_sec: float
    dataset_shape: DatasetShape      # n_train, n_val, n_test, n_features, class_dist

Every consumer (regime_trainer, FTF agg, autonomous cache.store, MLflow) reads from this same contract. No more "shape divergence" bugs like Bug #2 of PR #872 (CB nested vs flat metrics).

4. Migration plan : 4 phases, sequenced for safety

Phase 1 : Skeleton + XGB adapter (1.5 d)

Create src/training/harness/{contracts,registry,adapters,nodes,dags,tests}/
Implement ModelAdapter protocol + XGBAdapter (the reference impl — XGB already has the cleanest separation)
Implement Hamilton dags/models/xgboost.py reproducing train_with_fixed_params from autonomous_trainer.py L1627
Implement core nodes : class_balance, eval_metrics, theta_sweep, log_emit
Parity test : harness XGB on synthetic dataset must produce metrics within 1e-9 of legacy train_with_fixed_params
Gate : parity tests green before phase 2.

Phase 2 : LGB + CB adapters + parity (1.5 d)

Implement LGBAdapter (wraps lgb.Booster, exposes predict_proba(n,2) shim — folds in PR #872 Bug #1 fix as the canonical solution)
Implement CBAdapter (wraps catboost.CatBoostClassifier)
Implement Hamilton dags/models/lightgbm.py and dags/models/catboost.py using the SAME nodes as XGB
Parity test : LGB and CB on synthetic dataset produce metrics with the same SHAPE and SAME schema as XGB
Logging contract test : all 3 emit event=training_started, event=training_complete, event=theta_picked, event=class_balance_applied with identical field set
Gate : 3-way parity green; FTF run in shadow mode (write to finetune_results_shadow table) shows harness output ≈ legacy output for 5 cryptos × 3 folds.
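The predict_proba(n,2) shim mentioned for LGBAdapter can be sketched as follows. This assumes the wrapped object's predict() returns a 1-D array of positive-class probabilities, which is the usual behaviour of lgb.Booster with a binary objective; `BinaryProbaShim` and `FakeBooster` are hypothetical names, and the fake booster only exists so the example runs without LightGBM installed.

```python
import numpy as np


class BinaryProbaShim:
    """Expose a scikit-learn style predict_proba on top of a booster-like object."""

    def __init__(self, booster) -> None:
        self.native_handle = booster  # kept so callers can still reach the raw model

    def predict_proba(self, X) -> np.ndarray:
        p1 = np.asarray(self.native_handle.predict(X), dtype=float).reshape(-1)
        if np.any((p1 < 0.0) | (p1 > 1.0)):
            # Raw margins instead of probabilities would silently corrupt thresholds.
            raise ValueError("predict() did not return probabilities in [0, 1]")
        # Column 0 = P(class 0), column 1 = P(class 1), shape (n, 2).
        return np.column_stack([1.0 - p1, p1])


class FakeBooster:
    """Stand-in for a trained booster: a logistic function of the row sum."""

    def predict(self, X) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-np.asarray(X, dtype=float).sum(axis=1)))


proba = BinaryProbaShim(FakeBooster()).predict_proba([[0.0, 0.0], [1.0, 2.0]])
```

Every downstream node can then index `proba[:, 1]` regardless of which model produced it.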

Phase 3 : Wrappers + ensemble registry (1.5 d)

Implement dags/wrappers/{hpo,walk_forward,fixed_params}.py (Hamilton DAGs that take a base-model DAG name as parameter)
Implement EnsembleRegistry + 4 ensemble DAGs reproducing the current Track 11 variants (blend_avg, stack_logreg_shrink, stack_xgb_meta, none baseline)
Parity test : ensemble outputs match the current Track 11 sweep on the SAME synthetic seed, except for the LGB θ-sweep bug fix, which is an intentional behaviour change
Gate : ensemble parity green except for documented intentional fixes. The LGB θ-sweep fix is the entire point of this Story, so its delta is documented and accepted, not silently introduced.

Phase 4 : Migration of call sites + cutover (1.5 d)

Refactor XGBoostAutonomousTrainer._regenerate to call harness.train_one("xgboost", datasets, hpo_params) — trainer shrinks from 1785 to ~80 lines
Same for LightGBMAutonomousTrainer and CatBoostAutonomousTrainer
Refactor src/commun/finetune/ablation_runner.py to iterate over ModelRegistry and EnsembleRegistry instead of hardcoded if/elif branches
Refactor regime_trainer / walk-forward consumers to read from TrainedArtifact contract
Delete the legacy train_with_fixed_params* functions from autonomous_trainer.py and grid_utils.py, plus the orphan adapter cvntrade_lightgbm_adapter.py, which has zero current callers
Update CLAUDE.md "Architecture" section pointing to the new harness layout
New ADR : ADR-89 — Training harness as plugin registry (Hamilton)

Total : 6 days × 1 dev. Sized 5-7 days = realistic with 1 day buffer.

5. Parity contract : how we prove the harness is safe

The migration must NOT silently change ML behaviour. Parity is enforced at three levels :

5.1 Unit parity (synthetic data)

tests/parity/test_xgb_harness_vs_legacy.py :
- Seed : np.random.seed(42), synthetic dataset n=2000, features=20, binary label
- Train via legacy train_with_fixed_params AND via harness.train_one("xgboost", ...)
- Assert : metrics_val.f1_buy, auc_buy, brier, ECE match within 1e-9 (or 1e-6 for non-deterministic ops)
- Same test for LGB and CB
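The assertion step of such a parity test could look like the helper below. It is a sketch under the assumption that both trainers expose their validation metrics as flat dicts of floats; the helper name and the metric values in the usage lines are hypothetical.

```python
import math
from typing import Dict


def assert_metrics_parity(legacy: Dict[str, float], harness: Dict[str, float],
                          abs_tol: float = 1e-9) -> None:
    """Fail if the two metric dicts differ in names or in any value beyond abs_tol."""
    if legacy.keys() != harness.keys():
        raise AssertionError(f"metric names differ: {sorted(set(legacy) ^ set(harness))}")
    drift = {
        name: (legacy[name], harness[name])
        for name in legacy
        # Absolute tolerance only: these metrics live in [0, 1], so a relative
        # tolerance would be needlessly strict near zero.
        if not math.isclose(legacy[name], harness[name], rel_tol=0.0, abs_tol=abs_tol)
    }
    if drift:
        raise AssertionError(f"metrics drifted beyond {abs_tol}: {drift}")


# Hypothetical values, for illustration only.
legacy_metrics = {"f1_buy": 0.41, "auc_buy": 0.71, "brier": 0.12, "ECE": 0.03}
harness_metrics = dict(legacy_metrics)
assert_metrics_parity(legacy_metrics, harness_metrics)              # strict 1e-9 default
assert_metrics_parity(legacy_metrics, harness_metrics, abs_tol=1e-6)  # relaxed variant
```

Checking the key sets first means a missing metric is reported as a schema problem rather than as a numeric drift.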

5.2 Integration parity (FTF shadow run)

Phase 2 gate : run the harness in shadow mode alongside the legacy training. Both write to PG, the harness to finetune_results_shadow. Compare 5 cryptos × 3 folds × 3 models on the SAME run :
- For XGB and CB : delta on sortino < 0.05, on n_trades < 5%, on f1_buy < 0.01
- For LGB : intentional behaviour change (the θ-sweep fix), delta documented and reviewed by the operator before phase 3
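The XGB / CB thresholds above translate directly into a per-row check. The sketch below assumes each (crypto, fold, model) result is available as a dict with `sortino`, `n_trades` and `f1_buy` keys; that row shape and the function name are assumptions, and the f1_buy values are made up for the example.

```python
from typing import Dict, List


def shadow_gate_violations(legacy: Dict[str, float], shadow: Dict[str, float]) -> List[str]:
    """Return the names of the checks that a shadow row fails against its legacy row."""
    violations = []
    if abs(shadow["sortino"] - legacy["sortino"]) >= 0.05:
        violations.append("sortino")
    base = legacy["n_trades"]
    if base:
        rel_trades = abs(shadow["n_trades"] - base) / base
    else:
        # No legacy trades: any shadow trade counts as a 100% deviation.
        rel_trades = 0.0 if shadow["n_trades"] == 0 else 1.0
    if rel_trades >= 0.05:
        violations.append("n_trades")
    if abs(shadow["f1_buy"] - legacy["f1_buy"]) >= 0.01:
        violations.append("f1_buy")
    return violations


legacy_row = {"sortino": 0.93, "n_trades": 68, "f1_buy": 0.41}
ok_row = {"sortino": 0.95, "n_trades": 70, "f1_buy": 0.415}
bad_row = {"sortino": 0.95, "n_trades": 251, "f1_buy": 0.415}
ok_violations = shadow_gate_violations(legacy_row, ok_row)
bad_violations = shadow_gate_violations(legacy_row, bad_row)
```

Returning the list of failed checks, instead of a bare boolean, gives the operator something to sign off on when a delta is intentional.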

5.3 Logging contract test

tests/logging/test_event_schema_parity.py :
- Capture all event= logs emitted during a synthetic training run for each of the 3 models
- Assert : same event names, same field names, same dtype per field
- Specifically check : event=training_complete carries model_type, n_estimators_final, best_iteration, training_time_sec, dataset_n_train, dataset_n_features, class_dist_buy_pct, theta_picked, auc_val, f1_buy_val, brier_val, ECE_val for ALL 3 models
- This test will structurally prevent the kind of "logging gap" we have today
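One way to get identical fields by construction is to route every model through a single emitter that validates the payload before logging. The sketch below uses the field list above; the Python types, the choice to allow `best_iteration=None` (for models without early stopping) and the JSON-after-prefix log format are assumptions, not the project's actual log format.

```python
import json
import logging

# Field names come from the contract test above; the types are assumed.
TRAINING_COMPLETE_SCHEMA = {
    "model_type": str,
    "n_estimators_final": int,
    "best_iteration": (int, type(None)),
    "training_time_sec": float,
    "dataset_n_train": int,
    "dataset_n_features": int,
    "class_dist_buy_pct": float,
    "theta_picked": float,
    "auc_val": float,
    "f1_buy_val": float,
    "brier_val": float,
    "ECE_val": float,
}


def emit_training_complete(logger: logging.Logger, **fields) -> None:
    """Log event=training_complete only if the payload matches the schema exactly."""
    missing = TRAINING_COMPLETE_SCHEMA.keys() - fields.keys()
    extra = fields.keys() - TRAINING_COMPLETE_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in TRAINING_COMPLETE_SCHEMA.items():
        if not isinstance(fields[name], expected):
            raise TypeError(f"{name} has type {type(fields[name]).__name__}")
    # sort_keys gives a stable field order, which keeps log lines diffable.
    logger.info("event=training_complete %s", json.dumps(fields, sort_keys=True))
```

Because the emitter rejects both missing and extra fields, a model adapter cannot drift from the shared schema without failing loudly.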

6. Risk matrix

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Parity tests reveal LGB/CB legacy was actually using a divergent algo (e.g., different random seed) → harness "regression" is in fact a bugfix the operator wants to review | High | Medium | Phase 2 gate forces explicit operator sign-off on documented deltas before phase 3. |
| 2 | Hamilton DAG composition slow at startup (driver.execute overhead × N variants × N folds) | Medium | Medium | Benchmark in phase 2 ; if > 5% slowdown vs legacy, cache the driver per model_name. |
| 3 | Registry plugin discovery breaks if an adapter file fails to import (e.g., catboost not installed in some environment) | Medium | Low | Lazy import per adapter ; registry skips failed registrations and logs `event=adapter_unavailable`. |
| 4 | Refactor of regime_trainer + ablation_runner is invasive and breaks unrelated tests | Medium | High | Phase 4 gated by full `make test-unit` + `make test-integration` green ; staged commits per file (small reviewable diffs). |
| 5 | Operator runs a FTF sweep mid-migration on a stale branch → results conflated with the harness | Low | High | Branch protection : harness work in `feat/CVN-N001-EE-S16-*` branch only ; FTF DAGs on main keep old codepath until phase 4 cutover commit. |
| 6 | Stacking ensembles need OOS predictions from base models, so base models must run before the stacker. Hamilton resolves this, but the OOS prediction artifacts must be cached ; otherwise re-training base models for each stacker triples the FTF runtime | Medium | High | Phase 3 spec : `dags/ensembles/*` consume `OOSPredictions` from a Hamilton-cached node ; runner orchestrates base models first, stackers second. Already a documented Hamilton pattern. |
| 7 | The 4 phases are too coupled to parallelize → critical path is 6 days serial | Medium | Low | Accepted ; Story sized 5-7 d explicitly assumes serial execution. |

7. Decoupling : what does this Story NOT do

This Story is infrastructure. It does NOT :
- Add new models (out of scope ; future Stories under the registry pattern).
- Add new ensembles (out of scope ; same).
- Resolve the Track 11 sweep verdict (the sweep is abandoned per §8.1 below ; re-run on the harness post-cutover).
- Touch the inference path (src/commun/inference/) : only the training path changes.
- Touch the CUSUM / FE / labels : it only consumes the existing feature_selection_result contract.

This Story DOES :
- Refactor the 3 trainers + the FTF runner + the walk-forward predictor.
- Establish the registry as the single point of truth for "what models / ensembles exist".
- Establish the TrainedArtifact contract as the single payload all consumers read from.
- Establish the canonical theta_sweep algorithm (resolving the LGB θ-sweep regression).
- Establish the canonical logging schema (resolving the XGB-vs-LGB-vs-CB observability gap).

8. Open decisions for the committee

8.1 Track 11 sweep currently running : abandon or wait ?

The sweep ftf_20260508_210336_1a1160 (ensemble_diversity factor) is currently running. With LGB structurally over-trading, the lgb_only variant sortino is unusable, and the 3 stacking variants that include LGB are contaminated.

Operator's preliminary recommendation : abandon the current sweep. Rationale : (a) re-running on the harness post-cutover gives valid results in ~1 day extra, (b) finishing the current sweep wastes compute on known-broken results, (c) any verdict drawn from the contaminated sweep would have to be re-computed anyway.

Committee question : do experts agree, or is there value in completing the current sweep for partial signal (cb_only + none baseline) before killing it ?

8.2 Registry mechanism : convention scan vs entry_points ?

Two options for plugin discovery :

Option A (convention scan) : registry.py walks dags/models/*.py at import time, calls import_module() on each, and the @register_model decorator side-effects into a global dict.
- Pros : zero packaging overhead, drop a file = available.
- Cons : import-order coupling, harder to install third-party model plugins as a separate pip package.

Option B (entry_points) : models register via pyproject.toml entry_points. Third-party plugins install as separate packages and self-register.
- Pros : true plugin architecture, future-proof.
- Cons : requires editing pyproject.toml for every new in-tree model = friction the operator explicitly wants to avoid.

Operator's preliminary recommendation : A (convention scan) for in-tree models, OPEN to B as a future addition if a third-party plugin use case materialises. Committee question : is the convention scan robust enough, or do we want both (in-tree A + third-party B) from day 1 ?
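A convention scan along the lines of Option A, combined with the skip-and-log mitigation of Risk #3, might be sketched as below. It loads files by path with importlib.util rather than calling import_module() on a package, purely so the example is self-contained; the function name, the synthetic module-name prefix and the log format are assumptions.

```python
import importlib.util
import logging
from pathlib import Path
from typing import Dict, List

log = logging.getLogger(__name__)


def discover_plugins(directory: Path) -> Dict[str, List[str]]:
    """Import every public *.py file in `directory`; decorators register as a side effect."""
    loaded: List[str] = []
    skipped: List[str] = []
    for path in sorted(directory.glob("*.py")):  # sorted => deterministic import order
        if path.name.startswith("_"):
            continue  # __init__.py and private helpers are not plugins
        spec = importlib.util.spec_from_file_location(f"harness_plugin_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except ImportError as exc:
            # Only missing dependencies are tolerated; any other error still fails fast.
            log.warning("event=adapter_unavailable plugin=%s error=%r", path.stem, exc)
            skipped.append(path.stem)
        else:
            loaded.append(path.stem)
    return {"loaded": loaded, "skipped": skipped}
```

Sorting the file list removes one source of the import-order coupling listed under the cons of Option A, although plugins that depend on each other at import time would still need care.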

8.3 Should θ-sweep be a node OR an opt-in wrapper ?

Today XGB does NOT use a per-fold θ-sweep on val (it has the walk-forward calibrator instead). LGB has the buggy θ-sweep. CB has none.

Option A (always-on θ-sweep node) : every model runs through nodes/theta_sweep.py ; XGB's behaviour changes (new threshold strategy).
Option B (opt-in via wrapper DAG) : models that want θ-sweep call dags/wrappers/theta_sweep.py ; XGB stays as-is, LGB and CB get the (correct) sweep.

Operator's preliminary recommendation : B (opt-in) for phase 1-3, with a follow-up Story to evaluate whether XGB benefits from the θ-sweep on val (a question that needs its own FTF sweep — not in scope here). Committee question : agree, or push for unified θ-sweep from day 1 ?
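For reference, a val-tuned θ-sweep can be as small as the sketch below. The dossier does not state the sweep's objective, so maximising F1 on the BUY class over a fixed grid is an assumption, as are the grid itself and the tie-break toward the higher θ (fewer trades). The fail-fast checks follow the "no default 0.5" rule cited from ADR-25.

```python
from typing import List, Optional, Sequence


def f1_at(probs: Sequence[float], labels: Sequence[int], theta: float) -> float:
    """F1 of the positive (BUY) class when predicting BUY for p >= theta."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted_buy = p >= theta
        if predicted_buy and y == 1:
            tp += 1
        elif predicted_buy:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def theta_sweep(val_probs: List[float], val_labels: List[int],
                grid: Optional[List[float]] = None) -> float:
    """Pick the threshold maximising validation F1; refuse to guess on unusable data."""
    if not val_probs or len(val_probs) != len(val_labels):
        raise ValueError("theta_sweep needs non-empty, aligned validation data")
    if 1 not in val_labels:
        raise ValueError("validation split has no positive label; cannot tune theta")
    if grid is None:
        grid = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    # Ties on F1 go to the higher threshold, i.e. the more conservative trader.
    return max(grid, key=lambda t: (f1_at(val_probs, val_labels, t), t))


theta = theta_sweep([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

A sweep like this is only as good as the probabilities it sees, which is why the committee's calibration-before-sweep recommendation in §8.bis matters.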

8.4 Hamilton vs lightweight DI ?

ADR-61 mandates Hamilton for batch DAGs. This Story uses Hamilton end-to-end. But Hamilton has a learning curve and the 3 current trainers are imperative Python — the cognitive shift for future contributors is non-trivial.

Committee question : is full Hamilton justified for this work, or should the harness use a lighter-weight DI pattern (e.g., simple registry of callables + manual orchestration) and reserve Hamilton for the wrappers ?

8.5 What about the orphan `cvntrade_lightgbm_adapter.py` ?

This file (src/training/patterns/adapters/cvntrade_lightgbm_adapter.py) is dead code — zero imports across the codebase, dating back to a prior aborted refactor. Phase 4 deletes it. Committee question : confirm acceptable ?

8.6 Verdict tree

Recommend among the 4 phases : KEEP / DROP / SPLIT / DEFER. Recommend on the 5 open decisions above. Identify any 8th risk or 6th open decision the operator + Claude missed.

8.bis Committee verdict applied (2026-05-09, session 7f13b78d)

Verdict : ✅ PASSED / OK / strong consensus · 5 experts (data-scientist, crypto-trader, ml-engineer, architect, ops), scores 7.5-8.5 (mean 8.1) · OP Meeting #125.

Decisions taken on §8 open questions

| § | Question | Verdict | Notes |
|---|---|---|---|
| 8.1 | Track 11 sweep : abandon or wait ? | ABANDON (unanimous) | Sweep `ftf_20260508_210336_1a1160` aborted 2026-05-09 (PG `status=aborted`, 5 K8s pods deleted) ; comments posted on OP wp#102 + wp#45 ; Track 11 verdict re-computed on harness post-cutover. |
| 8.2 | Registry : convention scan vs entry_points ? | Option A (convention scan) | Design the registry so that future Option B (entry_points) is a non-breaking extension if/when a third-party plugin use case materialises. |
| 8.3 | θ-sweep : always-on node vs opt-in wrapper ? | Option B (opt-in wrapper) | Phase 1-3 ; XGB stays as-is (walk-forward calibrator) ; LGB and CB get the canonical sweep. Follow-up Story to evaluate XGB+θ-sweep on val (own FTF needed, out of scope here). |
| 8.4 | Hamilton vs lightweight DI ? | Full Hamilton | Per ADR-61. Onboarding doc pointing to Hamilton patterns added in Phase 4. |
| 8.5 | Orphan adapter `cvntrade_lightgbm_adapter.py` ? | DELETE in Phase 4 | Confirmed dead code (zero imports). |

Enhancement recommendations to integrate (5, non-blocking)

These extend the plan ; they are integrated into the relevant phases below.

1. Performance monitoring (extends §6 Risk #2 + §11 acceptance) : measure per-fold training time for XGB / LGB / CB on the SAME synthetic dataset, legacy vs harness, in the Phase 2 benchmark. Establish the SLO : harness driver overhead ≤ 5% wall-clock vs legacy. Alarm in Grafana on regression. Carry training_time_sec as a field of event=training_complete (already in the §3.2 contract, reaffirmed here) so Loki can query per-model training latency over time.
2. Calibration always-on (extends §3 nodes/calibrate.py) : per the ML-engineer's ask, calibrate.py becomes an always-on node for every model that produces probabilities. Default = isotonic ; per-model override possible via the adapter. Rationale : θ-sweep on miscalibrated probabilities is the bug LGB just exhibited ; canonical calibration BEFORE θ-sweep prevents recurrence by construction.
3. Meta-labeling clarification (extends §3 nodes/meta_label.py) : the dossier was ambiguous. XGB does meta-label as a secondary training step on TP/FP, while the LdP filter chain has a post-inference meta_label filter. They are different. The harness nodes/meta_label.py covers ONLY the training-time variant ; it produces a secondary classifier surfaced in TrainedArtifact.meta_label_model. The post-inference filter stays in src/commun/filters/meta_label.py untouched (out of scope). Documented in ADR-89.
4. Security of model artifacts (extends §6 Risk + ADR-89) : ADR-89 must include a "Security model" section : (a) AuthN/AuthZ for the MLflow registry, leveraging existing IAM with no new secret store ; (b) TrainedArtifact does NOT serialize raw HPO datasets (already true by contract, only the dataset_shape summary is persisted) ; (c) MLflow params are scrubbed of any env values matching *_KEY|*_TOKEN|*_PASSWORD patterns (regex filter in nodes/persist.py). No encryption-at-rest change required at this stage ; tracked in a follow-up Story under CVN-N001-EF (operational prereqs).
5. Feature store integration with Feast (extends §3.2 Datasets contract) : Datasets carries an additional field feature_version: FeatureVersion (commit SHA + Feast registry hash + selected_features hash). nodes/persist.py writes this field into MLflow as tags and into cache.store_trained_model as a column on mlflow_models (migration in Phase 4). Future Feast integration becomes a non-event : just populate the field from Feast instead of from the FE pipeline. Documented in ADR-89.
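The secret-scrub filter of enhancement 4 could be sketched as follows. The glob-style patterns *_KEY, *_TOKEN and *_PASSWORD are read here as "parameter name ends with one of these suffixes", and matching values are masked rather than dropped; both readings are assumptions, as are the function name and the mask string.

```python
import re
from typing import Any, Dict

# Suffix match on the parameter NAME, case-insensitive (assumed interpretation).
SECRET_NAME = re.compile(r"(_KEY|_TOKEN|_PASSWORD)$", re.IGNORECASE)
MASK = "***"


def scrub_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `params` with secret-looking values masked before persisting."""
    return {
        name: (MASK if SECRET_NAME.search(name) else value)
        for name, value in params.items()
    }


scrubbed = scrub_params({
    "learning_rate": 0.1,
    "MLFLOW_TOKEN": "abc123",      # hypothetical secret
    "db_password": "hunter2",      # hypothetical secret
    "keyboard_layout": "azerty",   # contains "key" but not as a suffix: kept
})
```

Masking instead of dropping keeps the parameter visible in MLflow, so a reviewer can still see that a secret was present at training time without seeing its value.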

Phase impact of the 5 enhancements

| Enhancement | Phase | Extra effort |
|---|---|---|
| 1. Perf SLO | Phase 2 (benchmark) | +0.25 d (Grafana panel + SLO doc) |
| 2. Calibration always-on | Phase 1 (node) | +0.25 d (always-on default + per-model override hook) |
| 3. Meta-label clarification | Phase 1 (ADR draft) + Phase 4 (ADR-89 final) | +0.1 d (mostly docs) |
| 4. Security section | Phase 4 (ADR-89) | +0.25 d (write Security model + scrub regex impl) |
| 5. Feast / FeatureVersion | Phase 1 (contract) + Phase 4 (migration + persist) | +0.5 d |
| Total extra | distributed | +1.35 d |

Sizing revised : 6 d + 1.35 d = 7.35 d, i.e. ~7 d with the enhancements integrated. This consumes the +1 d buffer and puts the estimate at the top of the 5-7 d window with ~0 days slack. Flag for monitoring during implementation ; if the schedule slips, calibration always-on (#2) and Feast (#5) can be split into a Phase 5 "stabilisation" Story without breaking the harness MVP.

Acceptance criteria addendum (§11)

Per-fold training time logged and within SLO (≤ +5% vs legacy) on the Phase 2 benchmark for XGB / LGB / CB.
calibrate.py always-on, per-model override hook in adapter ; documented in ADR-89.
Meta-labeling : training-time vs post-inference distinction in ADR-89 + harness covers only training-time variant.
ADR-89 Security model section written and reviewed by ops expert ; secret-scrub regex shipped in nodes/persist.py.
Datasets contract carries feature_version field ; persisted to MLflow + mlflow_models table.

9. Cross-references

Today's broken sweep : ftf_20260508_210336_1a1160_ATR0.5_1.5_H4 (factor ensemble_diversity ; running when this dossier was drafted, aborted 2026-05-09 per §8.bis), the immediate trigger for this Story.
PR #872 (FTF 7-bug hotfix, sha 461a39b2, merged 2026-05-08) : the PR that demonstrated the cost of triplicated trainers — Bug #1 (LGB Booster wrap), Bug #2 (CB nested metrics), Bug #6 (LGB θ-sweep) all rooted in trainer divergence.
CVN-N001-EE-S06 (issue #802, PR #803) : the previous Track 11 dispatcher — establishes CVN_MODEL_TYPE env var routing, which the harness consumes from the registry instead of from the dispatcher.
ADR-61 : "Batch DAGs use Hamilton, not imperative code" — this Story is a direct application.
ADR-25 : "No silent fallback". The theta_sweep node fails fast on missing val data instead of defaulting to 0.5.
ADR-30/32/38 : structured logs — the log_emit.py node is the canonical implementation.
ADR-67 : pluggable feature-selection framework — the harness applies the same plugin pattern to models + ensembles.
CLAUDE.md §3 process : plan_review committee mandatory before substantial implementation — this dossier IS that artifact.

10. Sizing summary

| Phase | Effort (days) | Critical artifact |
|---|---|---|
| 1. Skeleton + XGB adapter + parity | 1.5 | `harness.train_one("xgboost", ...)` matches legacy within 1e-9 |
| 2. LGB + CB adapters + 3-way parity + logging contract | 1.5 | All 3 models emit identical `event=training_complete` schema |
| 3. Wrappers + ensemble registry | 1.5 | `EnsembleRegistry` round-trips all 4 current Track 11 variants |
| 4. Migration of call sites + cutover + ADR-89 | 1.5 | 3 autonomous trainers ≤ 80 lines each, FTF runner registry-driven |
| Total | 6 d | All `make test-unit` + `make test-integration` green |

Buffer +1 d for committee CR rounds and unexpected Hamilton resolution issues.