Plan dossier - CVN-N001-EE-S16 : Unified Hamilton Training Harness + plugin registry

Date : 2026-05-09
Story : CVN-N001-EE-S16 (GH #890 · OP wp#143)
Author : Dominique (operator) + Claude
Session type : plan_review (mandatory before implementation per ADR-68 + CLAUDE.md §3)
Status : ✅ committee verdict obtained : PASSED / OK / strong consensus (session 7f13b78d, OP Meeting #125, 5 experts, scores 7.5-8.5, mean 8.1)
Sizing estimate : 5-7 days
Branch : feat/CVN-N001-EE-S16-training-harness-unification (active)


1. The discovery — what triggered this Story

1.1 Symptom : LGB completely broken in Track 11 sweep

The Track 11 sweep (ensemble_diversity factor, run ftf_20260508_210336_1a1160, currently running) shows structurally invalid LGB results :

| variant | fold | n_trades avg | sortino avg | sortino min/max | win_rate avg |
|---|---|---|---|---|---|
| cb_only | 3 | 68 | +0.93 | -1.96 / +2.27 | 53% |
| cb_only | 4 | 67 | +0.76 | -0.38 / +1.61 | 48% |
| lgb_only | 3 | 251 | -5.87 | -6.56 / -5.25 | 32% |
| lgb_only | 4 | 142 | -1.64 | -3.28 / +0.77 | 34% |
| none | 3 | 44 | +0.50 | -0.58 / +1.08 | 57% |
| stack_3model_avg | 3 | 28 | +0.92 | -0.70 / +3.14 | 60% |
| stack_3model_logreg_shrink | 3 | 26 | +0.91 | -0.21 / +1.82 | 64% |

Smoking gun : in fold 3, LGB takes 251 trades against 26-68 for every other variant (roughly 4× to 10× more), with a win rate of 32-34% against 48-64% elsewhere. Sortino has no lower bound here : it crashes to about -6 (avg -5.87) because LGB over-trades systematically.

Probable root cause : the val-tuned theta_sweep in src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:160-175 (added by the PR #872 hotfix) picks too low a threshold under scale_pos_weight. The calibrated probabilities concentrate just above 0.05, so the sweep chooses θ ≈ 0.05-0.10 and the model over-trades in test. XGB does NOT have this bug because its θ-tuning lives in the walk-forward calibrator (_calibrator_aggregated, line 1425). CB has no θ-sweep at all.
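The failure mode can be illustrated with a minimal sketch of a val-tuned threshold sweep. It assumes the selection criterion is F1 on the buy class, which this dossier does not confirm ; `f1_at`, `theta_sweep`, `n_trades_at` and the 0.05-step grid are hypothetical names, not the code in grid_utils.py. The sketch fails fast on empty validation data rather than defaulting to 0.5.

```python
from typing import List, Optional, Sequence


def f1_at(theta: float, probs: Sequence[float], labels: Sequence[int]) -> float:
    """F1 of the buy class when predicting buy for prob >= theta."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= theta and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= theta and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < theta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def theta_sweep(probs: Sequence[float], labels: Sequence[int],
                grid: Optional[List[float]] = None) -> float:
    """Return the grid threshold with the best validation F1 (assumed criterion)."""
    if len(probs) == 0 or len(probs) != len(labels):
        # No silent fallback to 0.5: missing validation data is an error.
        raise ValueError("theta_sweep needs non-empty, aligned val probs and labels")
    if grid is None:
        grid = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05 .. 0.95
    # max() keeps the first maximum, so ties resolve to the lowest theta.
    return max(grid, key=lambda t: f1_at(t, probs, labels))


def n_trades_at(theta: float, probs: Sequence[float]) -> int:
    """Number of rows that would trigger a buy at this threshold."""
    return sum(1 for p in probs if p >= theta)


# Toy validation set whose probabilities are squeezed into a low band.
val_probs = [0.06, 0.07, 0.08, 0.09, 0.12, 0.13]
val_labels = [0, 0, 1, 0, 1, 1]
theta = theta_sweep(val_probs, val_labels)  # 0.1 for this toy data

# If test probabilities drift slightly upward, every row crosses the threshold.
test_probs = [0.11, 0.14, 0.20, 0.30, 0.12, 0.16]
trades = n_trades_at(theta, test_probs)  # 6 of 6 rows trade
```

On this toy data the sweep is right on validation yet brittle : a threshold of 0.1 leaves almost no margin, so a small upward shift in test probabilities turns every row into a trade.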

This is the third LGB-specific bug found in 2 weeks (Bugs #1, #6, #7 of the FTF 7-bug hotfix PR #872 were all LGB-specific). The pattern is structural : LGB has its own training/eval/threshold codepath, divergent from XGB and from CB.

1.2 Symptom : logging gap

XGB autonomous trainer = 1785 lines with ~10 distinct event= log types (event=threshold_method, event=calibrator_aggregated, event=focal_loss_active, event=temperature_scaling_fit, event=xgboost_training_failed, ...). LGB autonomous = 407 lines with 1 event= log (event=lgb_autonomous_threshold_default). CB autonomous = 376 lines with 0 structured logs.

Consequence : LGB and CB training runs are invisible in Loki. None of best_iteration, training_time_sec, n_estimators_final, class_distribution, dataset_size, or the θ actually retained can be queried in production. We discovered the LGB Track 11 anomaly only because the operator looked at the PG finetune_results table ; a smaller anomaly would have gone unnoticed.

1.3 Root cause — three divergent codepaths

Today there are three separate trainers with overlapping but divergent responsibilities :

| Concern | XGB | LGB | CB |
|---|---|---|---|
| Train function | train_with_fixed_params (in autonomous_trainer.py L1627) | train_with_fixed_params_lgbm (grid_utils.py) | inline in _regenerate (autonomous L247) |
| Eval metrics | _eval_split (L1717), 11 metrics | inline in grid_utils L177-268, 14 metrics | trainer.train() returns nested dict |
| θ-tuning | walk-forward calibrator + _calibrator_aggregated | ad-hoc sweep in grid_utils L160-175 (PR #872 hotfix) | none |
| Class balancing | compute_class_weight + sample_weights | scale_pos_weight (binary) OR sample_weights (multi) | (catboost native) |
| Calibration | isotonic + temperature_scaling per fold | none | none |
| Meta-label | yes | no | no |
| Walk-forward | yes (_walk_forward_optimization) | no | no |
| MLflow registry | yes (model card + features artifact + LdP params) | no | no |
| Logging | rich (~10 events, 50+ INFO) | minimal (1 event, 6 INFO) | minimal (0 events, 8 INFO) |

Each new bug found in any trainer requires a triplicated fix (or, more realistically, gets fixed in one trainer and forgotten in the others, which is exactly what Bugs #1, #6, #7 demonstrated). Each new model adds another full codepath (~400 lines minimum). Each new ensemble strategy is hardcoded inline in the FTF runner rather than registered as a plugin.

The Training Harness unification addresses all three issues with a single architectural move.


2. Goals — operator's explicit requirements

The operator has stated three non-negotiable requirements :

  1. New model = drop a DAG file — adding LightGBM-GOSS, ExtraTrees, RandomForest, Logreg-baseline must NOT touch the training core, the FTF runner, the autonomous orchestrator, or the walk-forward path. One file, one registration, the system picks it up.

  2. Stacking = composition, not bespoke code — adding a new stacker (mean, weighted, logreg meta, xgb meta, attention meta, ...) must be a one-file affair that DECLARES which base-model predictions it consumes. Hamilton resolves the dependencies.

  3. Trainer = commodity — the autonomous trainers, the FTF screening loop, the walk-forward predictor, all use the SAME canonical training function train_one(model_name, datasets, hpo_params) → TrainedArtifact. No model-specific orchestration code at the call sites.

Acceptance bar for "done" :

  • Adding a 4th model (e.g., logistic regression baseline) takes < 1 day end-to-end (file + adapter + registration + test).
  • Adding a new ensemble variant takes < 0.5 day.
  • A single Loki query returns training events for ALL models with the SAME field schema.
  • Bug #6 of PR #872 (LGB-only θ-sweep regression) becomes structurally impossible (single canonical θ-sweep).


3. Architecture sketch

src/training/harness/
├─ contracts.py           # TypedDict / dataclass : Datasets, Predictions, Metrics, TrainedArtifact, HPOParams
├─ registry.py            # @register_model("xgboost"), @register_ensemble("stack_logreg")
│                         # plugin discovery via convention scan + (optional) entry_points
├─ adapters/              # ModelAdapter implementations (per-model — the 15% that differs)
│  ├─ base.py             # protocol : fit(), predict_proba(), best_iteration, native_handle
│  ├─ xgb.py              # @register_model — wrap xgb.Booster
│  ├─ lgb.py              # @register_model — wrap lgb.Booster + predict_proba(n,2) shim
│  └─ cb.py               # @register_model — wrap catboost.CatBoostClassifier
├─ nodes/                 # Hamilton nodes — pure typed functions (the 85% reused)
│  ├─ class_balance.py    # scale_pos_weight | sample_weights (single canonical impl)
│  ├─ theta_sweep.py      # val-tuned threshold (ONE algorithm : the LGB θ bug becomes structurally impossible)
│  ├─ eval_metrics.py     # f1/auc/brier/ECE/logloss (ONE shared function)
│  ├─ calibrate.py        # isotonic / sigmoid / temperature
│  ├─ persist.py          # MLflow + cache.store_*
│  ├─ log_emit.py         # event=training_complete + event=training_started + event=theta_picked … common payload
│  └─ predict_oos.py      # walk-forward OOS predictions (model-agnostic via adapter)
├─ dags/models/           # 1 file = 1 model
│  ├─ xgboost.py          # @register_model — Hamilton DAG "xgboost_predictions"
│  ├─ lightgbm.py         # @register_model — Hamilton DAG "lightgbm_predictions"
│  ├─ catboost.py         # @register_model — Hamilton DAG "catboost_predictions"
│  └─ <new_model>.py      # ← drop a file ⇒ model available
├─ dags/ensembles/        # 1 file = 1 ensemble strategy
│  ├─ blend_avg.py        # @register_ensemble — average of base model predictions
│  ├─ blend_weighted.py   # weighted by AUC val
│  ├─ stack_logreg.py     # logreg meta on stacked OOS predictions
│  ├─ stack_xgb_meta.py   # xgboost meta
│  └─ <new_stack>.py      # ← drop a file ⇒ stack available
├─ dags/wrappers/         # meta-DAGs that compose the others
│  ├─ hpo.py              # Optuna around any model DAG
│  ├─ walk_forward.py     # walk-forward folds around any model DAG
│  └─ fixed_params.py     # FTF screening — fixed HP, around any model DAG
└─ tests/
   ├─ parity/             # XGB ≡ LGB ≡ CB on synthetic data (same dataset → same metrics shape)
   ├─ contract/           # adapter contract per model + registry round-trip
   ├─ logging/            # all 3 emit same events with same fields
   └─ plugin/             # add a synthetic 4th model in test, verify pickup

3.1 Concrete API surface

Adding a new model :

# src/training/harness/dags/models/extreme_trees.py
from sklearn.ensemble import ExtraTreesClassifier
from training.harness.registry import register_model
from training.harness.contracts import Datasets, HPOParams, TrainedArtifact
from training.harness.adapters.sklearn_proba import SklearnAdapter

@register_model("extreme_trees")
def train(datasets: Datasets, hpo_params: HPOParams) -> TrainedArtifact:
    model = ExtraTreesClassifier(**hpo_params.to_sklearn()).fit(*datasets.train)
    return TrainedArtifact(adapter=SklearnAdapter(model), best_iteration=None)
The FTF, the autonomous trainer, the walk-forward — all see this model automatically. Zero other files modified.
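How a decorator makes a dropped file discoverable can be sketched as follows. This is a minimal illustration, not the harness's registry.py : it stores plain callables in a dict, whereas the FTF call site shown later in this section expects registry entries exposing a .train method. The `dummy_classifier` entry and its dict return value are placeholders for illustration only.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory registry: model name -> training function.
ModelRegistry: Dict[str, Callable[[Any, Any], Any]] = {}


def register_model(name: str) -> Callable[[Callable], Callable]:
    """Decorator that records a training function under a unique name."""
    def decorator(train_fn: Callable) -> Callable:
        if name in ModelRegistry:
            # Two files claiming the same name is a configuration error.
            raise ValueError(f"model {name!r} is already registered")
        ModelRegistry[name] = train_fn
        return train_fn
    return decorator


def train_one(model_name: str, datasets: Any, hpo_params: Any) -> Any:
    """Single entry point: look the model up and delegate to it."""
    try:
        train_fn = ModelRegistry[model_name]
    except KeyError:
        raise KeyError(
            f"unknown model {model_name!r}; registered: {sorted(ModelRegistry)}"
        ) from None
    return train_fn(datasets, hpo_params)


@register_model("dummy_classifier")
def _train_dummy(datasets: Any, hpo_params: Any) -> Dict[str, Any]:
    # Placeholder artifact; a real adapter would return a TrainedArtifact.
    return {"model": "dummy", "n_train": len(datasets["train"])}


artifact = train_one("dummy_classifier", {"train": [1, 2, 3]}, hpo_params={})
```

Registration happens as a side effect of importing the module, so the call sites only ever deal with names.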

Adding a new stacker :

# src/training/harness/dags/ensembles/stack_attention.py
from training.harness.registry import register_ensemble
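# Datasets and StackFeatures are used below ; StackFeatures is assumed to live in contracts.py
from training.harness.contracts import Datasets, OOSPredictions, StackFeatures, TrainedArtifact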

@register_ensemble("stack_attention", base_models=["xgboost", "lightgbm", "catboost"])
def meta_features(predictions_per_model: dict[str, OOSPredictions]) -> StackFeatures: ...
def meta_model(meta_features: StackFeatures, datasets: Datasets) -> TrainedArtifact: ...
Hamilton automatically composes the dependency on the three base-model DAGs.
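Hamilton derives a DAG from function names (the nodes) and parameter names (the dependencies). The sketch below is a plain-Python stand-in for that idea, not Hamilton's API : `resolve` and the three toy nodes are hypothetical, and the sketch has no cycle detection or type checking.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional


def resolve(target: str, nodes: Dict[str, Callable], inputs: Dict[str, Any],
            cache: Optional[Dict[str, Any]] = None) -> Any:
    """Compute `target`, recursively computing any node named by its parameters."""
    cache = {} if cache is None else cache
    if target in inputs:
        return inputs[target]
    if target in cache:
        return cache[target]  # each node is computed at most once per run
    if target not in nodes:
        raise KeyError(f"no node or input named {target!r}")
    fn = nodes[target]
    kwargs = {name: resolve(name, nodes, inputs, cache)
              for name in inspect.signature(fn).parameters}
    cache[target] = fn(**kwargs)
    return cache[target]


# Toy base-model nodes: each depends only on the `datasets` input.
def xgboost_predictions(datasets: List[float]) -> List[float]:
    return [0.25, 0.75]


def lightgbm_predictions(datasets: List[float]) -> List[float]:
    return [0.5, 0.5]


# Toy ensemble node: its parameter names declare which base models it consumes.
def blend_avg(xgboost_predictions: List[float],
              lightgbm_predictions: List[float]) -> List[float]:
    return [(a + b) / 2 for a, b in zip(xgboost_predictions, lightgbm_predictions)]


nodes = {f.__name__: f for f in (xgboost_predictions, lightgbm_predictions, blend_avg)}
blended = resolve("blend_avg", nodes, inputs={"datasets": [1.0, 2.0]})
```

The ensemble never calls the base models itself ; naming them as parameters is enough for the resolver to run them first.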

Calling the harness from FTF :

# src/commun/finetune/ablation_runner.py (refactored)
from training.harness.registry import ModelRegistry, EnsembleRegistry

for variant_name in factor.variants:
    if variant_name in ModelRegistry:
        artifact = ModelRegistry[variant_name].train(datasets, hpo_params)
    elif variant_name in EnsembleRegistry:
        artifact = EnsembleRegistry[variant_name].train(datasets, hpo_params, base_predictions=...)
FTF doesn't know what XGB/LGB/CB are anymore — it iterates over registry entries.

3.2 The TrainedArtifact contract (single-source-of-truth payload)

from dataclasses import dataclass

@dataclass(frozen=True)
class TrainedArtifact:
    adapter: ModelAdapter            # exposes predict_proba, native_handle, best_iteration
    metrics_val: SplitMetrics        # f1_buy, auc_buy, brier, ECE, …
    metrics_test: SplitMetrics       # same shape
    threshold_buy: float             # val-tuned, canonical algo
    cusum_metadata: CusumMetadata    # fitted_sigma, threshold_h_calibrated
    selected_features: list[str]
    feature_names: list[str]         # alias for backward compat
    hyperparams: dict
    training_time_sec: float
    dataset_shape: DatasetShape      # n_train, n_val, n_test, n_features, class_dist

Every consumer (regime_trainer, FTF agg, autonomous cache.store, MLflow) reads from this same contract. No more "shape divergence" bugs like Bug #2 of PR #872 (CB nested vs flat metrics).


4. Migration plan — 4 phases, sequenced for safety

Phase 1 — Skeleton + XGB adapter (1.5 d)

  • Create src/training/harness/{contracts,registry,adapters,nodes,dags,tests}/
  • Implement ModelAdapter protocol + XGBAdapter (the reference impl — XGB already has the cleanest separation)
  • Implement Hamilton dags/models/xgboost.py reproducing train_with_fixed_params from autonomous_trainer.py L1627
  • Implement core nodes : class_balance, eval_metrics, theta_sweep, log_emit
  • Parity test : harness XGB on synthetic dataset must produce metrics within 1e-9 of legacy train_with_fixed_params
  • Gate : parity tests green before phase 2.
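The class_balance node named above covers two strategies. A minimal sketch of both, assuming the usual conventions : scale_pos_weight as the negative-to-positive count ratio, and "balanced" sample weights as n_samples / (n_classes × class_count). The function names are hypothetical and the legacy trainers may compute these differently.

```python
from collections import Counter
from typing import List, Sequence


def scale_pos_weight(labels: Sequence[int]) -> float:
    """Binary case: ratio of negative to positive rows."""
    counts = Counter(labels)
    n_pos, n_neg = counts.get(1, 0), counts.get(0, 0)
    if n_pos == 0:
        raise ValueError("no positive rows: scale_pos_weight is undefined")
    return n_neg / n_pos


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """Per-row weights so that every class carries the same total weight."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}
    return [class_weight[y] for y in labels]


labels = [0, 0, 0, 1]
spw = scale_pos_weight(labels)            # 3 negatives / 1 positive
weights = balanced_sample_weights(labels)  # minority row gets the largest weight
```

With balanced weights the total weight equals the number of rows, so the loss scale stays comparable to the unweighted case.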

Phase 2 — LGB + CB adapters + parity (1.5 d)

  • Implement LGBAdapter (wraps lgb.Booster, exposes predict_proba(n,2) shim — folds in PR #872 Bug #1 fix as the canonical solution)
  • Implement CBAdapter (wraps catboost.CatBoostClassifier)
  • Implement Hamilton dags/models/lightgbm.py and dags/models/catboost.py using the SAME nodes as XGB
  • Parity test : LGB and CB on synthetic dataset produce metrics with the same SHAPE and SAME schema as XGB
  • Logging contract test : all 3 emit event=training_started, event=training_complete, event=theta_picked, event=class_balance_applied with identical field set
  • Gate : 3-way parity green; FTF run in shadow mode (write to finetune_results_shadow table) shows harness output ≈ legacy output for 5 cryptos × 3 folds.
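The predict_proba(n,2) shim mentioned for LGBAdapter can be sketched without importing lightgbm. The sketch assumes the wrapped object's predict(X) returns one positive-class probability per row, as a binary lgb.Booster does ; `BinaryBoosterAdapter` and `FakeBooster` are hypothetical names used only for illustration.

```python
from typing import Any, List, Optional, Protocol, Sequence


class ModelAdapter(Protocol):
    """Assumed shape of the adapter contract: what call sites may rely on."""
    native_handle: Any
    best_iteration: Optional[int]

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]: ...


class BinaryBoosterAdapter:
    """Wraps a booster whose predict(X) yields P(class=1) per row."""

    def __init__(self, booster: Any, best_iteration: Optional[int] = None) -> None:
        self.native_handle = booster
        self.best_iteration = best_iteration

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        p1 = self.native_handle.predict(X)
        # Expand the 1-D positive-class vector into (n, 2): [P(0), P(1)].
        return [[1.0 - float(p), float(p)] for p in p1]


class FakeBooster:
    """Stand-in for a trained binary booster, for illustration only."""

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [0.25, 0.75][: len(X)]


adapter: ModelAdapter = BinaryBoosterAdapter(FakeBooster(), best_iteration=10)
proba = adapter.predict_proba([[0.0], [1.0]])
```

Putting the shim in the adapter means downstream nodes always index column 1 for the buy probability, whichever library produced it.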

Phase 3 — Wrappers + ensemble registry (1.5 d)

  • Implement dags/wrappers/{hpo,walk_forward,fixed_params}.py (Hamilton DAGs that take a base-model DAG name as parameter)
  • Implement EnsembleRegistry + 4 ensemble DAGs reproducing the current Track 11 variants (blend_avg, stack_logreg_shrink, stack_xgb_meta, none baseline)
  • Parity test : ensemble outputs match the current Track 11 sweep on the SAME synthetic seed (modulo the LGB θ-sweep bug fix, which is an intentional behaviour change)
  • Gate : ensemble parity green except for documented intentional fixes (the LGB θ-sweep regression fix, which is the entire point — this is documented and accepted, not silently introduced).

Phase 4 — Migration of call sites + cutover (1.5 d)

  • Refactor XGBoostAutonomousTrainer._regenerate to call harness.train_one("xgboost", datasets, hpo_params) — trainer shrinks from 1785 to ~80 lines
  • Same for LightGBMAutonomousTrainer and CatBoostAutonomousTrainer
  • Refactor src/commun/finetune/ablation_runner.py to iterate over ModelRegistry and EnsembleRegistry instead of hardcoded if/elif branches
  • Refactor regime_trainer / walk-forward consumers to read from TrainedArtifact contract
  • Delete the legacy train_with_fixed_params* functions from autonomous_trainer.py + grid_utils.py + (orphan adapter cvntrade_lightgbm_adapter.py which has zero current callers)
  • Update CLAUDE.md "Architecture" section pointing to the new harness layout
  • New ADR : ADR-89 — Training harness as plugin registry (Hamilton)

Total : 6 days × 1 dev. Sized 5-7 days = realistic with 1 day buffer.


5. Parity contract — how we prove the harness is safe

The migration must NOT silently change ML behaviour. Parity is enforced at three levels :

5.1 Unit parity (synthetic data)

tests/parity/test_xgb_harness_vs_legacy.py :

  • Seed : np.random.seed(42), synthetic dataset n=2000, features=20, binary label
  • Train via legacy train_with_fixed_params AND via harness.train_one("xgboost", ...)
  • Assert : metrics_val.f1_buy, auc_buy, brier, ECE match within 1e-9 (or 1e-6 for non-deterministic ops)
  • Same test for LGB and CB
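The comparison at the heart of such a parity test can be sketched as below, assuming both codepaths expose their validation metrics as flat name-to-float dicts. `metric_mismatches` is a hypothetical helper, and the metric values are illustrative placeholders rather than measured results.

```python
import math
from typing import Dict, List


def metric_mismatches(legacy: Dict[str, float], harness: Dict[str, float],
                      abs_tol: float = 1e-9) -> List[str]:
    """Return human-readable differences; an empty list means parity holds."""
    problems: List[str] = []
    if legacy.keys() != harness.keys():
        problems.append(f"key sets differ: {sorted(set(legacy) ^ set(harness))}")
    for name in sorted(legacy.keys() & harness.keys()):
        # Absolute tolerance only: these metrics live on a [0, 1] scale.
        if not math.isclose(legacy[name], harness[name], rel_tol=0.0, abs_tol=abs_tol):
            problems.append(f"{name}: legacy={legacy[name]!r} harness={harness[name]!r}")
    return problems


# Illustrative values only.
legacy_metrics = {"f1_buy": 0.61, "auc_buy": 0.72, "brier": 0.21, "ECE": 0.04}
harness_metrics = dict(legacy_metrics, auc_buy=0.72 + 1e-12)
drifted_metrics = dict(legacy_metrics, f1_buy=0.61 + 1e-6)

ok = metric_mismatches(legacy_metrics, harness_metrics)
bad = metric_mismatches(legacy_metrics, drifted_metrics)
```

Returning the list of differences instead of a boolean keeps the test failure message specific about which metric drifted.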

5.2 Integration parity (FTF shadow run)

Phase 2 gate : run the harness in shadow mode alongside the legacy training. Both write to PG, the harness to finetune_results_shadow. Compare 5 cryptos × 3 folds × 3 models on the SAME run :

  • For XGB and CB : delta on sortino < 0.05, on n_trades < 5%, on f1_buy < 0.01
  • For LGB : intentional behaviour change (the θ-sweep fix) ; delta documented + reviewed by operator before phase 3
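The XGB / CB tolerances above translate into a small predicate. The sketch assumes the sortino and f1_buy deltas are absolute and the n_trades delta is relative to the legacy count, which is the most natural reading of the thresholds ; `FoldResult` and `within_shadow_tolerance` are hypothetical names and the f1_buy values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldResult:
    sortino: float
    n_trades: int
    f1_buy: float


def within_shadow_tolerance(legacy: FoldResult, shadow: FoldResult) -> bool:
    """True when the harness (shadow) row is within the Phase 2 gate of legacy."""
    sortino_ok = abs(shadow.sortino - legacy.sortino) < 0.05
    f1_ok = abs(shadow.f1_buy - legacy.f1_buy) < 0.01
    if legacy.n_trades == 0:
        # A relative delta is undefined; require an exact match instead.
        trades_ok = shadow.n_trades == 0
    else:
        trades_ok = abs(shadow.n_trades - legacy.n_trades) / legacy.n_trades < 0.05
    return sortino_ok and f1_ok and trades_ok


legacy_row = FoldResult(sortino=0.93, n_trades=68, f1_buy=0.600)
close_row = FoldResult(sortino=0.95, n_trades=70, f1_buy=0.605)
far_row = FoldResult(sortino=0.95, n_trades=251, f1_buy=0.605)

close_ok = within_shadow_tolerance(legacy_row, close_row)
far_ok = within_shadow_tolerance(legacy_row, far_row)
```

LGB rows would be expected to fail this predicate by design, which is why their deltas go to operator review instead of the automated gate.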

5.3 Logging contract test

tests/logging/test_event_schema_parity.py :

  • Capture all event= logs emitted during a synthetic training run for each of the 3 models
  • Assert : same event names, same field names, same dtype per field
  • Specifically check : event=training_complete carries model_type, n_estimators_final, best_iteration, training_time_sec, dataset_n_train, dataset_n_features, class_dist_buy_pct, theta_picked, auc_val, f1_buy_val, brier_val, ECE_val for ALL 3 models
  • This test will structurally prevent the kind of "logging gap" we have today
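A single emit function with a fixed field list is one way to make that schema check pass by construction. The sketch below uses the field names listed above ; the key=value line format, the logger name, and the helper names are assumptions.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("training.harness")  # hypothetical logger name

TRAINING_COMPLETE_FIELDS = (
    "model_type", "n_estimators_final", "best_iteration", "training_time_sec",
    "dataset_n_train", "dataset_n_features", "class_dist_buy_pct",
    "theta_picked", "auc_val", "f1_buy_val", "brier_val", "ECE_val",
)


def format_training_complete(payload: Dict[str, Any]) -> str:
    """Build the log line, rejecting payloads that deviate from the schema."""
    missing = [f for f in TRAINING_COMPLETE_FIELDS if f not in payload]
    extra = sorted(set(payload) - set(TRAINING_COMPLETE_FIELDS))
    if missing or extra:
        raise ValueError(f"schema violation: missing={missing} extra={extra}")
    body = " ".join(f"{f}={payload[f]}" for f in TRAINING_COMPLETE_FIELDS)
    return "event=training_complete " + body


def emit_training_complete(payload: Dict[str, Any]) -> None:
    logger.info(format_training_complete(payload))


def schema_of(payload: Dict[str, Any]) -> Dict[str, str]:
    """Field name -> Python type name, for cross-model comparison."""
    return {name: type(value).__name__ for name, value in payload.items()}
```

One point to settle in the real contract : a model without early stopping may report best_iteration=None, which would break a strict same-dtype assertion unless the schema allows an optional integer.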


6. Risk matrix

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Parity tests reveal LGB/CB legacy was actually using a divergent algo (e.g., different random seed) → harness "regression" is in fact a bugfix the operator wants to review | High | Medium | Phase 2 gate forces explicit operator sign-off on documented deltas before phase 3. |
| 2 | Hamilton DAG composition slow at startup (driver.execute overhead × N variants × N folds) | Medium | Medium | Benchmark in phase 2 ; if > 5% slowdown vs legacy, cache the driver per model_name. |
| 3 | Registry plugin discovery breaks if an adapter file fails to import (e.g., catboost not installed in some environment) | Medium | Low | Lazy import per adapter ; registry skips failed registrations and logs event=adapter_unavailable. |
| 4 | Refactor of regime_trainer + ablation_runner is invasive : breaks unrelated tests | Medium | High | Phase 4 gated by full make test-unit + make test-integration green ; staged commits per file (small reviewable diffs). |
| 5 | Operator runs a FTF sweep mid-migration on a stale branch → results conflated with the harness | Low | High | Branch protection : harness work in feat/CVN-N001-EE-S16-* branch only ; FTF DAGs on main keep old codepath until phase 4 cutover commit. |
| 6 | Stacking ensembles need OOS predictions from base models : base models must run before stacker. Hamilton resolves this but the OOS prediction artifacts must be cached, otherwise re-training base models for each stacker triples the FTF runtime | Medium | High | Phase 3 spec : dags/ensembles/* consume OOSPredictions from a Hamilton-cached node ; runner orchestrates base models first, stackers second. Already a documented Hamilton pattern. |
| 7 | The 4 phases are too coupled to parallelize → critical path is 6 days serial | Medium | Low | Accepted ; Story sized 5-7 d explicitly assumes serial execution. |
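The mitigation for Risk #6 amounts to memoising base-model OOS predictions per (model, fold). The sketch below shows the idea in plain Python ; it does not use Hamilton's own caching, and `OOSPredictionCache` plus the toy training function are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class OOSPredictionCache:
    """Trains each (model_name, fold) at most once and reuses the predictions."""

    def __init__(self, train_and_predict: Callable[[str, int], List[float]]) -> None:
        self._train_and_predict = train_and_predict
        self._store: Dict[Tuple[str, int], List[float]] = {}
        self.n_trainings = 0  # exposed so a benchmark can verify reuse

    def get(self, model_name: str, fold: int) -> List[float]:
        key = (model_name, fold)
        if key not in self._store:
            self._store[key] = self._train_and_predict(model_name, fold)
            self.n_trainings += 1
        return self._store[key]


def toy_train_and_predict(model_name: str, fold: int) -> List[float]:
    # Stand-in for an expensive base-model training run.
    return [0.5, 0.5]


cache = OOSPredictionCache(toy_train_and_predict)
base_models = ["xgboost", "lightgbm", "catboost"]

# Two stackers consuming the same three base models on the same fold.
for _stacker in ("stack_logreg", "stack_xgb_meta"):
    predictions = {name: cache.get(name, fold=3) for name in base_models}
```

With the cache, adding a stacker costs one meta-model fit instead of three additional base-model trainings.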

7. Decoupling — what does this Story NOT do

This Story is infrastructure. It does NOT :

  • Add new models (out of scope ; future Stories under the registry pattern).
  • Add new ensembles (out of scope ; same).
  • Resolve Track 11 sweep verdict (the sweep is abandoned per §8.1 below ; re-run on the harness post-cutover).
  • Touch the inference path (src/commun/inference/) : only the training path.
  • Touch the CUSUM / FE / labels : it only consumes the existing feature_selection_result contract.

This Story DOES :

  • Refactor the 3 trainers + the FTF runner + the walk-forward predictor.
  • Establish the registry as the single point of truth for "what models / ensembles exist".
  • Establish the TrainedArtifact contract as the single payload all consumers read from.
  • Establish the canonical theta_sweep algorithm (resolving the LGB θ-sweep regression).
  • Establish the canonical logging schema (resolving the XGB-vs-LGB-vs-CB observability gap).


8. Open decisions for the committee

8.1 Track 11 sweep currently running — abandon or wait ?

The sweep ftf_20260508_210336_1a1160 (ensemble_diversity factor) is currently running. With LGB structurally over-trading, the lgb_only variant sortino is unusable, and the 3 stacking variants that include LGB are contaminated.

Operator's preliminary recommendation : abandon the current sweep. Rationale : (a) re-running on the harness post-cutover gives valid results in ~1 day extra, (b) finishing the current sweep wastes compute on known-broken results, (c) any verdict drawn from the contaminated sweep would have to be re-computed anyway.

Committee question : do experts agree, or is there value in completing the current sweep for partial signal (cb_only + none baseline) before kill ?

8.2 Registry mechanism — convention scan vs entry_points ?

Two options for plugin discovery :

Option A (convention scan) : registry.py walks dags/models/*.py at import time, calls import_module() on each, the @register_model decorator side-effects into a global dict.

  • Pros : zero packaging overhead, drop a file = available.
  • Cons : import-order coupling, harder to install third-party model plugins as a separate pip package.

Option B (entry_points) : models register via pyproject.toml entry_points. Third-party plugins install as separate packages and self-register.

  • Pros : true plugin architecture, future-proof.
  • Cons : requires editing pyproject.toml for every new in-tree model = friction the operator explicitly wants to avoid.

Operator's preliminary recommendation : A (convention scan) for in-tree models, OPEN to B as a future addition if a third-party plugin use case materialises. Committee question : is the convention scan robust enough, or do we want both (in-tree A + third-party B) from day 1 ?
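Option A can be sketched with the standard library alone. The function below imports every module of a package so that decorator side effects run, and skips modules whose dependencies are missing while logging event=adapter_unavailable (the Risk #3 mitigation). `discover_plugins` is a hypothetical name, and only ImportError is swallowed so that genuine bugs in a plugin file still surface.

```python
import importlib
import logging
import pkgutil
from types import ModuleType
from typing import List

logger = logging.getLogger("training.harness.registry")  # hypothetical logger name


def discover_plugins(package: ModuleType) -> List[str]:
    """Import each direct submodule of `package`; return the names that loaded."""
    loaded: List[str] = []
    prefix = package.__name__ + "."
    for info in pkgutil.iter_modules(package.__path__, prefix=prefix):
        try:
            # Importing runs any @register_model decorators in the module.
            importlib.import_module(info.name)
        except ImportError as exc:
            # Missing optional dependency (e.g. a library not installed here).
            logger.warning("event=adapter_unavailable module=%s reason=%s",
                           info.name, exc)
            continue
        loaded.append(info.name)
    return sorted(loaded)
```

A call such as discover_plugins(training.harness.dags.models) at registry import time would be the whole discovery step under this option ; sorting the result keeps registration order independent of filesystem ordering.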

8.3 Should θ-sweep be a node OR an opt-in wrapper ?

Today XGB does NOT use a per-fold θ-sweep on val (it has the walk-forward calibrator instead). LGB has the buggy θ-sweep. CB has none.

Option A (always-on θ-sweep node) : every model runs through nodes/theta_sweep.py ; XGB's behaviour changes (new threshold strategy).

Option B (opt-in via wrapper DAG) : models that want θ-sweep call dags/wrappers/theta_sweep.py ; XGB stays as-is, LGB and CB get the (correct) sweep.

Operator's preliminary recommendation : B (opt-in) for phase 1-3, with a follow-up Story to evaluate whether XGB benefits from the θ-sweep on val (a question that needs its own FTF sweep — not in scope here). Committee question : agree, or push for unified θ-sweep from day 1 ?

8.4 Hamilton vs lightweight DI ?

ADR-61 mandates Hamilton for batch DAGs. This Story uses Hamilton end-to-end. But Hamilton has a learning curve and the 3 current trainers are imperative Python — the cognitive shift for future contributors is non-trivial.

Committee question : is full Hamilton justified for this work, or should the harness use a lighter-weight DI pattern (e.g., simple registry of callables + manual orchestration) and reserve Hamilton for the wrappers ?

8.5 What about the orphan cvntrade_lightgbm_adapter.py ?

This file (src/training/patterns/adapters/cvntrade_lightgbm_adapter.py) is dead code — zero imports across the codebase, dating back to a prior aborted refactor. Phase 4 deletes it. Committee question : confirm acceptable ?

8.6 Verdict tree

For each of the 4 phases, recommend KEEP / DROP / SPLIT / DEFER. Recommend on the 5 open decisions above. Identify any 8th risk or 6th open decision the operator + Claude missed.


8.bis Committee verdict applied (2026-05-09 — session 7f13b78d)

Verdict : ✅ PASSED / OK / strong consensus · 5 experts (data-scientist, crypto-trader, ml-engineer, architect, ops), scores 7.5-8.5 (mean 8.1) · OP Meeting #125.

Decisions taken on §8 open questions

| § | Question | Verdict | Notes |
|---|---|---|---|
| 8.1 | Track 11 sweep : abandon or wait ? | ABANDON (unanimous) | Sweep ftf_20260508_210336_1a1160 aborted 2026-05-09 (PG status=aborted, 5 K8s pods deleted) ; comments posted on OP wp#102 + wp#45 ; Track 11 verdict re-computed on harness post-cutover. |
| 8.2 | Registry : convention scan vs entry_points ? | Option A (convention scan) | Design the registry so that future Option B (entry_points) is a non-breaking extension if/when a third-party plugin use case materialises. |
| 8.3 | θ-sweep : always-on node vs opt-in wrapper ? | Option B (opt-in wrapper) | Phase 1-3 ; XGB stays as-is (walk-forward calibrator) ; LGB and CB get the canonical sweep. Follow-up Story to evaluate XGB+θ-sweep on val (own FTF needed, out of scope here). |
| 8.4 | Hamilton vs lightweight DI ? | Full Hamilton | Per ADR-61. Onboarding doc pointing to Hamilton patterns added in Phase 4. |
| 8.5 | Orphan adapter cvntrade_lightgbm_adapter.py ? | DELETE in Phase 4 | Confirmed dead code (zero imports). |

Enhancement recommendations to integrate (5 — non-blocking)

These extend the plan ; they are integrated into the relevant phases below.

  1. Performance monitoring (extends §6 Risk #2 + §11 acceptance) : measure per-fold training time for XGB / LGB / CB on the SAME synthetic dataset, legacy vs harness, in the Phase 2 benchmark. Establish an SLO : harness driver overhead ≤ 5% wall-clock vs legacy. Alarm in Grafana on regression. Carry training_time_sec on event=training_complete (already in the §3.2 contract, reaffirmed here) so Loki can query per-model training latency over time.

  2. Calibration always-on (extends §3 nodes/calibrate.py) — per the ML-engineer's ask, calibrate.py becomes an always-on node for every model that produces probabilities. Default = isotonic ; per-model override possible via the adapter. Rationale : θ-sweep on miscalibrated probabilities is the bug LGB just exhibited ; canonical calibration BEFORE θ-sweep prevents recurrence by construction.

  3. Meta-labeling clarification (extends §3 nodes/meta_label.py) — the dossier was ambiguous : XGB does meta-label as a secondary training step on TP/FP, while the LdP filter chain has a post-inference meta_label filter. They are different. The harness nodes/meta_label.py covers ONLY the training-time variant ; it produces a secondary classifier surfaced in TrainedArtifact.meta_label_model. The post-inference filter stays in src/commun/filters/meta_label.py untouched (out of scope). Documented in ADR-89.

  4. Security of model artifacts (extends §6 Risk + ADR-89) — ADR-89 must include a "Security model" section : (a) AuthN/AuthZ for the MLflow registry — leverage existing IAM, no new secret store ; (b) TrainedArtifact does NOT serialize raw HPO datasets (already true by contract — only dataset_shape summary is persisted) ; (c) MLflow params are scrubbed of any env values matching *_KEY|*_TOKEN|*_PASSWORD patterns (regex filter in nodes/persist.py). No encryption-at-rest change required at this stage ; tracked in a follow-up Story under CVN-N001-EF (operational prereqs).

  5. Feature store integration with Feast (extends §3.2 Datasets contract) : Datasets carries an additional field feature_version: FeatureVersion (commit SHA + Feast registry hash + selected_features hash). nodes/persist.py writes this field into MLflow as tags + into cache.store_trained_model as a column on mlflow_models (migration in Phase 4). Future Feast integration becomes a non-event : just populate the field from Feast instead of from the FE pipeline. Documented in ADR-89.
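The secret scrub from enhancement 4 could look like the sketch below. The dossier's wording leaves the exact rule open, so two assumptions are made here : a param is redacted if its own name matches the *_KEY / *_TOKEN / *_PASSWORD pattern, or if its value contains the value of an environment variable whose name matches. `scrub_params` and the redaction marker are hypothetical.

```python
import re
from typing import Any, Dict, Mapping

# Names ending in _KEY, _TOKEN or _PASSWORD (case-insensitive, an assumption).
SECRET_NAME_RE = re.compile(r".*_(KEY|TOKEN|PASSWORD)$", re.IGNORECASE)
REDACTED = "***REDACTED***"


def scrub_params(params: Mapping[str, Any], environ: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `params` that is safe to log as MLflow params."""
    # Non-empty values of secret-looking environment variables.
    secret_values = {v for k, v in environ.items() if SECRET_NAME_RE.match(k) and v}
    scrubbed: Dict[str, str] = {}
    for name, value in params.items():
        text = str(value)
        leaks_value = any(secret in text for secret in secret_values)
        if SECRET_NAME_RE.match(name) or leaks_value:
            scrubbed[name] = REDACTED
        else:
            scrubbed[name] = text
    return scrubbed


params = {
    "max_depth": 6,
    "tracking_uri": "https://user:s3cr3t@mlflow.example",
    "API_TOKEN": "abc123",
}
environ = {"MLFLOW_PASSWORD": "s3cr3t", "HOME": "/root"}
safe = scrub_params(params, environ)
```

In practice the environ argument would be os.environ ; passing it explicitly keeps the function pure and easy to test.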

Phase impact of the 5 enhancements

| Enhancement | Phase | Extra effort |
|---|---|---|
| 1 : Perf SLO | Phase 2 (benchmark) | +0.25 d (Grafana panel + SLO doc) |
| 2 : Calibration always-on | Phase 1 (node) | +0.25 d (always-on default + per-model override hook) |
| 3 : Meta-label clarification | Phase 1 (ADR draft) + Phase 4 (ADR-89 final) | +0.1 d (mostly docs) |
| 4 : Security section | Phase 4 (ADR-89) | +0.25 d (write Security model + scrub regex impl) |
| 5 : Feast / FeatureVersion | Phase 1 (contract) + Phase 4 (migration + persist) | +0.5 d |
| Total extra | distributed | +1.35 d |

Sizing revised : 6 d + 1.35 d ≈ 7 d with the enhancements integrated. This consumes the +1 d buffer and leaves ~0 days of slack inside the 5-7 d window. Flag for monitoring during implementation ; if the schedule slips, calibration always-on (#2) and Feast (#5) can be split into a Phase 5 "stabilisation" Story without breaking the harness MVP.

Acceptance criteria addendum (§11)

  • Per-fold training time logged and within SLO (≤ +5% legacy) on Phase 2 benchmark for XGB / LGB / CB.
  • calibrate.py always-on, per-model override hook in adapter ; documented in ADR-89.
  • Meta-labeling : training-time vs post-inference distinction in ADR-89 + harness covers only training-time variant.
  • ADR-89 Security model section written and reviewed by ops expert ; secret-scrub regex shipped in nodes/persist.py.
  • Datasets contract carries feature_version field ; persisted to MLflow + mlflow_models table.

9. Cross-references

  • Today's broken sweep : ftf_20260508_210336_1a1160_ATR0.5_1.5_H4 (factor ensemble_diversity, running when this dossier was drafted ; aborted 2026-05-09 per §8.bis) : the immediate trigger for this Story.
  • PR #872 (FTF 7-bug hotfix, sha 461a39b2, merged 2026-05-08) : the PR that demonstrated the cost of triplicated trainers — Bug #1 (LGB Booster wrap), Bug #2 (CB nested metrics), Bug #6 (LGB θ-sweep) all rooted in trainer divergence.
  • CVN-N001-EE-S06 (issue #802, PR #803) : the previous Track 11 dispatcher — establishes CVN_MODEL_TYPE env var routing, which the harness consumes from the registry instead of from the dispatcher.
  • ADR-61 : "Batch DAGs use Hamilton, not imperative code" — this Story is a direct application.
  • ADR-25 : "No silent fallback" — theta_sweep node fail-fast on missing val data, no default 0.5.
  • ADR-30/32/38 : structured logs — the log_emit.py node is the canonical implementation.
  • ADR-67 : pluggable feature-selection framework — the harness applies the same plugin pattern to models + ensembles.
  • CLAUDE.md §3 process : plan_review committee mandatory before substantial implementation — this dossier IS that artifact.

10. Sizing summary

| Phase | Effort (days) | Critical artifact |
|---|---|---|
| 1 : Skeleton + XGB adapter + parity | 1.5 | harness.train_one("xgboost", ...) matches legacy within 1e-9 |
| 2 : LGB + CB adapters + 3-way parity + logging contract | 1.5 | All 3 models emit identical event=training_complete schema |
| 3 : Wrappers + ensemble registry | 1.5 | EnsembleRegistry round-trips all 4 current Track 11 variants |
| 4 : Migration of call sites + cutover + ADR-89 | 1.5 | 3 autonomous trainers ≤ 80 lines each, FTF runner registry-driven |
| Total | 6 d | All make test-unit + make test-integration green |

Buffer +1 d for committee CR rounds and unexpected Hamilton resolution issues.


11. Acceptance criteria (Story closure gate)

  • All make test-unit + make test-integration green on feat/CVN-N001-EE-S16-* branch.
  • Parity tests green for the 3 current models (XGB / LGB / CB) on synthetic data.
  • FTF shadow run on 5 cryptos × 3 folds shows harness output within tolerance of legacy (XGB / CB) and documents intentional deltas (LGB θ-sweep fix).
  • All 3 autonomous trainers ≤ 80 lines each.
  • FTF ablation_runner.py iterates over registry instead of hardcoded if/elif.
  • Single Loki query returns training events for all 3 models with same field schema.
  • Test that adds a synthetic 4th model (e.g., dummy_classifier) verifies the registry picks it up with no other change.
  • CLAUDE.md "Architecture" section + new ADR-89 documenting the harness merged.
  • Track 11 sweep re-run on the harness ; verdict on ensemble_diversity factor recorded.
  • PR review : CodeRabbit + committee pr_review (mandatory per CLAUDE.md §8 — substantial PR touching src/commun/finetune/ + src/training/).
  • MLOps readiness template completed (ADR-70 — touches model artefacts + training path).
  • CVN-N001-EE-S16 OP wp transitions In testing → Tested → Closed per ADR-81.