Plan dossier, CVN-N001-EE-S16 : unified Hamilton Training Harness + plugin registry
Date : 2026-05-09
Story : CVN-N001-EE-S16 (GH #890 · OP wp#143)
Author : Dominique (operator) + Claude
Session type : plan_review (mandatory before implementation per ADR-68 + CLAUDE.md §3)
Status : ✅ committee verdict obtained — PASSED / OK / strong consensus (session 7f13b78d, OP Meeting #125, 5 experts scores 7.5-8.5, mean 8.1)
Sizing estimate : 5-7 days
Branch : feat/CVN-N001-EE-S16-training-harness-unification (active)
1. The discovery : what triggered this Story
1.1 Symptom : LGB completely broken in Track 11 sweep
The Track 11 sweep (ensemble_diversity factor, run ftf_20260508_210336_1a1160, currently running) shows structurally invalid LGB results :
| variant | fold | n_trades avg | sortino avg | sortino min / max | win_rate avg |
|---|---|---|---|---|---|
| cb_only | 3 | 68 | +0.93 | -1.96 / +2.27 | 53% |
| cb_only | 4 | 67 | +0.76 | -0.38 / +1.61 | 48% |
| lgb_only | 3 | 251 | -5.87 | -6.56 / -5.25 | 32% |
| lgb_only | 4 | 142 | -1.64 | -3.28 / +0.77 | 34% |
| none | 3 | 44 | +0.50 | -0.58 / +1.08 | 57% |
| stack_3model_avg | 3 | 28 | +0.92 | -0.70 / +3.14 | 60% |
| stack_3model_logreg_shrink | 3 | 26 | +0.91 | -0.21 / +1.82 | 64% |
Smoking gun : on fold 3, LGB takes 251 trades versus 26-68 for the other variants (roughly 4-10× more), and 142 versus 67 on fold 4, with a win_rate of 32-34% (vs 48-64% elsewhere). Sortino is not bounded below, so it crashes to about -6 because LGB over-trades systematically.
Probable root cause : the val-tuned theta_sweep in src/training/LightGBM/cvntrade_LightGBM_grid_utils.py:160-175 (added by the PR #872 hotfix) picks too low a threshold under scale_pos_weight : the predicted probabilities concentrate just above 0.05, so the sweep chooses θ ≈ 0.05-0.10 and the model over-trades in test. XGB does NOT have this bug because its θ-tuning lives in the walk-forward calibrator (_calibrator_aggregated, line 1425). CB has no θ-sweep at all.
This is the third LGB-specific bug found in 2 weeks (Bugs #1, #6, #7 of the FTF 7-bug hotfix PR #872 were all LGB-specific). The pattern is structural : LGB has its own training/eval/threshold codepath, divergent from XGB and from CB.
1.2 Symptom : logging gap
XGB autonomous trainer = 1785 lines with ~10 distinct event= log types (event=threshold_method, event=calibrator_aggregated, event=focal_loss_active, event=temperature_scaling_fit, event=xgboost_training_failed, ...). LGB autonomous = 407 lines with 1 event= log (event=lgb_autonomous_threshold_default). CB autonomous = 376 lines with 0 structured logs.
Consequence : LGB and CB training runs are invisible in Loki. None of best_iteration, training_time_sec, n_estimators_final, class_distribution, dataset_size, or the θ actually retained can be queried in production. We discovered the LGB Track 11 anomaly only because the operator looked at the PG finetune_results table ; a smaller anomaly would have gone unnoticed.
1.3 Root cause : three divergent codepaths
Today there are three separate trainers with overlapping but divergent responsibilities :
| Concern | XGB | LGB | CB |
|---|---|---|---|
| Train function | train_with_fixed_params (in autonomous_trainer.py L1627) | train_with_fixed_params_lgbm (grid_utils.py) | inline in _regenerate (autonomous L247) |
| Eval metrics | _eval_split (L1717), 11 metrics | inline in grid_utils L177-268, 14 metrics | trainer.train() returns nested dict |
| θ-tuning | walk-forward calibrator + _calibrator_aggregated | ad-hoc sweep in grid_utils L160-175 (PR #872 hotfix) | none |
| Class balancing | compute_class_weight + sample_weights | scale_pos_weight (binary) OR sample_weights (multi) | (catboost native) |
| Calibration | isotonic + temperature_scaling per fold | none | none |
| Meta-label | yes | no | no |
| Walk-forward | yes (_walk_forward_optimization) | no | no |
| MLflow registry | yes (model card + features artifact + LdP params) | no | no |
| Logging | rich (~10 events, 50+ INFO) | minimal (1 event, 6 INFO) | minimal (0 events, 8 INFO) |
Each new bug found in any trainer requires a triplicated fix (or, more realistically, is fixed in one trainer and forgotten in the others, which is exactly what Bugs #1, #6, #7 demonstrated). Each new model adds another full codepath (~400 lines minimum). Each new ensemble strategy is hardcoded inline in the FTF runner rather than registered as a plugin.
The Training Harness unification addresses all three issues with a single architectural move.
2. Goals : operator's explicit requirements
The operator has stated three non-negotiable requirements :
- New model = drop a DAG file : adding LightGBM-GOSS, ExtraTrees, RandomForest, Logreg-baseline must NOT touch the training core, the FTF runner, the autonomous orchestrator, or the walk-forward path. One file, one registration, the system picks it up.
- Stacking = composition, not bespoke code : adding a new stacker (mean, weighted, logreg meta, xgb meta, attention meta, ...) must be a one-file affair that DECLARES which base-model predictions it consumes. Hamilton resolves the dependencies.
- Trainer = commodity : the autonomous trainers, the FTF screening loop, and the walk-forward predictor all use the SAME canonical training function train_one(model_name, datasets, hpo_params) → TrainedArtifact. No model-specific orchestration code at the call sites.
Acceptance bar for "done" :
- Adding a 4th model (e.g., logistic regression baseline) takes < 1 day end-to-end (file + adapter + registration + test).
- Adding a new ensemble variant takes < 0.5 day.
- A single Loki query returns training events for ALL models with the SAME field schema.
- Bug #6 of PR #872 (LGB-only θ-sweep regression) becomes structurally impossible (single canonical θ-sweep).
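The "one file, one registration" requirement can be sketched as a decorator-backed registry. This is a minimal illustration, not the harness implementation : MODEL_REGISTRY, the two-field TrainedArtifact stand-in, and the majority_baseline model are hypothetical ; only the names register_model and train_one come from this dossier.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical stand-in for the dossier's TrainedArtifact (two fields only).
@dataclass(frozen=True)
class TrainedArtifact:
    model_name: str
    payload: Any

TrainFn = Callable[[Any, Dict[str, Any]], Any]

# Assumed global registry: model name -> training function.
MODEL_REGISTRY: Dict[str, TrainFn] = {}


def register_model(name: str) -> Callable[[TrainFn], TrainFn]:
    """Decorator that records a training function under `name`."""
    def decorator(fn: TrainFn) -> TrainFn:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = fn
        return fn
    return decorator


def train_one(model_name: str, datasets: Any, hpo_params: Dict[str, Any]) -> TrainedArtifact:
    """Single entry point: look the model up, train it, wrap the result."""
    if model_name not in MODEL_REGISTRY:
        raise KeyError(f"unknown model {model_name!r}; registered: {sorted(MODEL_REGISTRY)}")
    payload = MODEL_REGISTRY[model_name](datasets, hpo_params)
    return TrainedArtifact(model_name=model_name, payload=payload)


# Hypothetical toy model: predicts the most frequent training label.
@register_model("majority_baseline")
def train_majority(datasets: Dict[str, list], hpo_params: Dict[str, Any]) -> int:
    labels = datasets["train_labels"]
    return max(set(labels), key=labels.count)
```

Call sites only ever see train_one("majority_baseline", datasets, hpo_params) ; adding a model means adding one decorated function and nothing else.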
3. Architecture sketch
src/training/harness/
├─ contracts.py # TypedDict / dataclass : Datasets, Predictions, Metrics, TrainedArtifact, HPOParams
├─ registry.py # @register_model("xgboost"), @register_ensemble("stack_logreg")
│ # plugin discovery via convention scan + (optional) entry_points
│
├─ adapters/ # ModelAdapter implementations (per-model — the 15% that differs)
│ ├─ base.py # protocol : fit(), predict_proba(), best_iteration, native_handle
│ ├─ xgb.py # @register_model — wrap xgb.Booster
│ ├─ lgb.py # @register_model — wrap lgb.Booster + predict_proba(n,2) shim
│ └─ cb.py # @register_model — wrap catboost.CatBoostClassifier
│
├─ nodes/ # Hamilton nodes — pure typed functions (the 85% reused)
│ ├─ class_balance.py # scale_pos_weight | sample_weights (single canonical impl)
│ ├─ theta_sweep.py # val-tuned threshold (ONE algorithm : the LGB θ bug becomes structurally impossible)
│ ├─ eval_metrics.py # f1/auc/brier/ECE/logloss (ONE shared function)
│ ├─ calibrate.py # isotonic / sigmoid / temperature
│ ├─ persist.py # MLflow + cache.store_*
│ ├─ log_emit.py # event=training_complete + event=training_started + event=theta_picked … common payload
│ └─ predict_oos.py # walk-forward OOS predictions (model-agnostic via adapter)
│
├─ dags/models/ # 1 file = 1 model
│ ├─ xgboost.py # @register_model — Hamilton DAG "xgboost_predictions"
│ ├─ lightgbm.py # @register_model — Hamilton DAG "lightgbm_predictions"
│ ├─ catboost.py # @register_model — Hamilton DAG "catboost_predictions"
│ └─ <new_model>.py # ← drop a file ⇒ model available
│
├─ dags/ensembles/ # 1 file = 1 ensemble strategy
│ ├─ blend_avg.py # @register_ensemble — average of base model predictions
│ ├─ blend_weighted.py # weighted by AUC val
│ ├─ stack_logreg.py # logreg meta on stacked OOS predictions
│ ├─ stack_xgb_meta.py # xgboost meta
│ └─ <new_stack>.py # ← drop a file ⇒ stack available
│
├─ dags/wrappers/ # meta-DAGs that compose the others
│ ├─ hpo.py # Optuna around any model DAG
│ ├─ walk_forward.py # walk-forward folds around any model DAG
│ └─ fixed_params.py # FTF screening — fixed HP, around any model DAG
│
└─ tests/
├─ parity/ # XGB ≡ LGB ≡ CB on synthetic data (same dataset → same metrics shape)
├─ contract/ # adapter contract per model + registry round-trip
├─ logging/ # all 3 emit same events with same fields
└─ plugin/ # add a synthetic 4th model in test, verify pickup
3.1 Concrete API surface
Adding a new model :
# src/training/harness/dags/models/extreme_trees.py
from sklearn.ensemble import ExtraTreesClassifier

from training.harness.adapters.sklearn_proba import SklearnAdapter
from training.harness.contracts import Datasets, HPOParams, TrainedArtifact
from training.harness.registry import register_model


@register_model("extreme_trees")
def train(datasets: Datasets, hpo_params: HPOParams) -> TrainedArtifact:
    # Fit on the train split ; the adapter exposes predict_proba to the harness.
    model = ExtraTreesClassifier(**hpo_params.to_sklearn()).fit(*datasets.train)
    # ExtraTrees has no early stopping, hence best_iteration=None.
    return TrainedArtifact(adapter=SklearnAdapter(model), best_iteration=None)
Adding a new stacker :
# src/training/harness/dags/ensembles/stack_attention.py
from training.harness.contracts import Datasets, OOSPredictions, StackFeatures, TrainedArtifact  # StackFeatures assumed to live in contracts
from training.harness.registry import register_ensemble

@register_ensemble("stack_attention", base_models=["xgboost", "lightgbm", "catboost"])
def meta_features(predictions_per_model: dict[str, OOSPredictions]) -> StackFeatures: ...

def meta_model(meta_features: StackFeatures, datasets: Datasets) -> TrainedArtifact: ...
Calling the harness from FTF :
# src/commun/finetune/ablation_runner.py (refactored)
from training.harness.registry import EnsembleRegistry, ModelRegistry

for variant_name in factor.variants:
    if variant_name in ModelRegistry:
        artifact = ModelRegistry[variant_name].train(datasets, hpo_params)
    elif variant_name in EnsembleRegistry:
        artifact = EnsembleRegistry[variant_name].train(datasets, hpo_params, base_predictions=...)
    else:
        # Fail fast on an unregistered variant instead of silently skipping it.
        raise KeyError(f"unknown variant: {variant_name}")
3.2 The TrainedArtifact contract (single-source-of-truth payload)
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainedArtifact:
    adapter: ModelAdapter          # exposes predict_proba, native_handle, best_iteration
    metrics_val: SplitMetrics      # f1_buy, auc_buy, brier, ECE, …
    metrics_test: SplitMetrics     # same shape
    threshold_buy: float           # val-tuned, canonical algo
    cusum_metadata: CusumMetadata  # fitted_sigma, threshold_h_calibrated
    selected_features: list[str]
    feature_names: list[str]       # alias for backward compat
    hyperparams: dict
    training_time_sec: float
    dataset_shape: DatasetShape    # n_train, n_val, n_test, n_features, class_dist
Every consumer (regime_trainer, FTF agg, autonomous cache.store, MLflow) reads from this same contract. No more "shape divergence" bugs like Bug #2 of PR #872 (CB nested vs flat metrics).
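The adapter side of the contract can be sketched as a structural protocol plus a thin shim. This is an illustration under assumptions : the member names (predict_proba, native_handle, best_iteration) come from the architecture tree above, while PositiveScoreShim and the plain-list types are hypothetical. The shim assumes the wrapped scorer returns only the positive-class probability per row, which is what a binary lgb.Booster.predict produces.

```python
from typing import Any, Callable, List, Optional, Protocol, Sequence


class ModelAdapter(Protocol):
    """Structural contract every per-model adapter is assumed to satisfy."""
    native_handle: Any
    best_iteration: Optional[int]

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        ...


class PositiveScoreShim:
    """Hypothetical adapter: turns P(class=1) scores into (n, 2) probabilities."""

    def __init__(
        self,
        score_fn: Callable[[Sequence[Sequence[float]]], Sequence[float]],
        best_iteration: Optional[int] = None,
    ) -> None:
        self.native_handle = score_fn
        self.best_iteration = best_iteration

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        out: List[List[float]] = []
        for p in self.native_handle(rows):
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"score {p} is not a probability")
            # Column 0 = P(class 0), column 1 = P(class 1), as sklearn-style consumers expect.
            out.append([1.0 - p, p])
        return out


def positive_column(adapter: ModelAdapter, rows: Sequence[Sequence[float]]) -> List[float]:
    """Model-agnostic consumer: only relies on the protocol, never on the model type."""
    return [row[1] for row in adapter.predict_proba(rows)]
```

Downstream nodes (θ-sweep, eval metrics, OOS prediction) would be written against ModelAdapter only, so the output-shape difference is handled once, inside the adapter.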
4. Migration plan : 4 phases, sequenced for safety
Phase 1 : Skeleton + XGB adapter (1.5 d)
- Create src/training/harness/{contracts,registry,adapters,nodes,dags,tests}/
- Implement the ModelAdapter protocol + XGBAdapter (the reference impl, since XGB already has the cleanest separation)
- Implement Hamilton dags/models/xgboost.py reproducing train_with_fixed_params from autonomous_trainer.py L1627
- Implement core nodes : class_balance, eval_metrics, theta_sweep, log_emit
- Parity test : harness XGB on synthetic dataset must produce metrics within 1e-9 of legacy train_with_fixed_params
- Gate : parity tests green before phase 2.
Phase 2 : LGB + CB adapters + parity (1.5 d)
- Implement LGBAdapter (wraps lgb.Booster, exposes a predict_proba(n,2) shim ; folds in the PR #872 Bug #1 fix as the canonical solution)
- Implement CBAdapter (wraps catboost.CatBoostClassifier)
- Implement Hamilton dags/models/lightgbm.py and dags/models/catboost.py using the SAME nodes as XGB
- Parity test : LGB and CB on synthetic dataset produce metrics with the same SHAPE and SAME schema as XGB
- Logging contract test : all 3 emit event=training_started, event=training_complete, event=theta_picked, event=class_balance_applied with identical field set
- Gate : 3-way parity green ; FTF run in shadow mode (write to finetune_results_shadow table) shows harness output ≈ legacy output for 5 cryptos × 3 folds.
Phase 3 : Wrappers + ensemble registry (1.5 d)
- Implement dags/wrappers/{hpo,walk_forward,fixed_params}.py (Hamilton DAGs that take a base-model DAG name as parameter)
- Implement EnsembleRegistry + 4 ensemble DAGs reproducing the current Track 11 variants (blend_avg, stack_logreg_shrink, stack_xgb_meta, none baseline)
- Parity test : ensemble outputs match the current Track 11 sweep on the SAME synthetic seed (modulo the LGB θ-sweep bug fix, which is an intentional behaviour change)
- Gate : ensemble parity green except for documented intentional fixes (the LGB θ-sweep fix, which is the entire point ; it is documented and accepted, not silently introduced).
Phase 4 : Migration of call sites + cutover (1.5 d)
- Refactor XGBoostAutonomousTrainer._regenerate to call harness.train_one("xgboost", datasets, hpo_params) ; the trainer shrinks from 1785 to ~80 lines
- Same for LightGBMAutonomousTrainer and CatBoostAutonomousTrainer
- Refactor src/commun/finetune/ablation_runner.py to iterate over ModelRegistry and EnsembleRegistry instead of hardcoded if/elif branches
- Refactor regime_trainer / walk-forward consumers to read from the TrainedArtifact contract
- Delete the legacy train_with_fixed_params* functions from autonomous_trainer.py + grid_utils.py, plus the orphan adapter cvntrade_lightgbm_adapter.py (which has zero current callers)
- Update CLAUDE.md "Architecture" section pointing to the new harness layout
- New ADR : ADR-89, Training harness as plugin registry (Hamilton)
Total : 6 days × 1 dev. The 5-7 day sizing is realistic with a 1-day buffer.
5. Parity contract : how we prove the harness is safe
The migration must NOT silently change ML behaviour. Parity is enforced at three levels :
5.1 Unit parity (synthetic data)
tests/parity/test_xgb_harness_vs_legacy.py :
- Seed : np.random.seed(42), synthetic dataset n=2000 features=20 binary label
- Train via legacy train_with_fixed_params AND via harness.train_one("xgboost", ...)
- Assert : metrics_val.f1_buy, auc_buy, brier, ECE match within 1e-9 (or 1e-6 for non-deterministic ops)
- Same test for LGB and CB
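The assertion at the heart of the unit parity test can be sketched as a small helper. This is a sketch under assumptions : assert_metrics_parity and its relaxed argument are hypothetical names, and metrics are represented as plain dicts rather than the SplitMetrics contract. The default tolerance of 1e-9 and the 1e-6 relaxation for non-deterministic ops are the values stated above.

```python
import math
from typing import Dict, Mapping, Optional, Tuple


def assert_metrics_parity(
    legacy: Mapping[str, float],
    harness: Mapping[str, float],
    tol: float = 1e-9,
    relaxed: Optional[Mapping[str, float]] = None,
) -> None:
    """Raise AssertionError unless both metric dicts agree within tolerance.

    `relaxed` maps a metric name to a looser absolute tolerance (e.g. 1e-6)
    for metrics that depend on non-deterministic operations.
    """
    if set(legacy) != set(harness):
        differing = sorted(set(legacy) ^ set(harness))
        raise AssertionError(f"metric names differ: {differing}")

    overrides = relaxed or {}
    failures: Dict[str, Tuple[float, float, float]] = {}
    for name in sorted(legacy):
        limit = overrides.get(name, tol)
        # Absolute comparison only: metrics live in [0, 1], relative tolerance adds nothing.
        if not math.isclose(legacy[name], harness[name], rel_tol=0.0, abs_tol=limit):
            failures[name] = (legacy[name], harness[name], limit)

    if failures:
        raise AssertionError(f"parity broken (legacy, harness, tol): {failures}")
```

Checking the key sets first means a metric that silently disappears from one codepath fails the test just as loudly as a numeric drift.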
5.2 Integration parity (FTF shadow run)
Phase 2 gate : run the harness in shadow mode alongside the legacy training. Both write to PG, harness to finetune_results_shadow. Compare 5 cryptos × 3 folds × 3 models on the SAME run :
- For XGB and CB : delta on sortino < 0.05, on n_trades < 5%, on f1_buy < 0.01
- For LGB : intentional behaviour change (the θ-sweep fix alters the results by design) ; the delta is documented and reviewed by the operator before phase 3
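The XGB / CB tolerances above can be expressed as a per-fold check. A minimal sketch : FoldResult and shadow_violations are hypothetical names, and the rows are assumed to be loaded from finetune_results and finetune_results_shadow by code not shown here. The three thresholds are the ones listed in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FoldResult:
    """Hypothetical per-fold summary read from either results table."""
    sortino: float
    n_trades: int
    f1_buy: float


def shadow_violations(
    legacy: FoldResult,
    shadow: FoldResult,
    max_sortino_delta: float = 0.05,
    max_trades_rel: float = 0.05,
    max_f1_delta: float = 0.01,
) -> List[str]:
    """Return a human-readable list of tolerance breaches (empty list = parity)."""
    violations: List[str] = []

    sortino_delta = abs(shadow.sortino - legacy.sortino)
    if sortino_delta >= max_sortino_delta:
        violations.append(f"sortino delta {sortino_delta:.3f} >= {max_sortino_delta}")

    # Relative change in trade count; guard against a zero-trade legacy fold.
    trades_rel = abs(shadow.n_trades - legacy.n_trades) / max(legacy.n_trades, 1)
    if trades_rel >= max_trades_rel:
        violations.append(f"n_trades delta {trades_rel:.1%} >= {max_trades_rel:.0%}")

    f1_delta = abs(shadow.f1_buy - legacy.f1_buy)
    if f1_delta >= max_f1_delta:
        violations.append(f"f1_buy delta {f1_delta:.3f} >= {max_f1_delta}")

    return violations
```

Returning the list of breaches rather than a boolean gives the operator the documented deltas needed for the phase 3 sign-off.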
5.3 Logging contract test
tests/logging/test_event_schema_parity.py :
- Capture all event= logs emitted during a synthetic training run for each of the 3 models
- Assert : same event names, same field names, same dtype per field
- Specifically check : event=training_complete carries model_type, n_estimators_final, best_iteration, training_time_sec, dataset_n_train, dataset_n_features, class_dist_buy_pct, theta_picked, auc_val, f1_buy_val, brier_val, ECE_val for ALL 3 models
- This test will structurally prevent the kind of "logging gap" we have today
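The schema comparison can be sketched as follows, assuming captured events are available as plain dicts with an "event" key. event_schema and schema_mismatches are hypothetical helper names, and using XGB as the reference model is an assumption borrowed from its role as reference implementation in Phase 1.

```python
from typing import Any, Dict, Iterable, List, Mapping

# event name -> {field name -> dtype name}
Schema = Dict[str, Dict[str, str]]


def event_schema(records: Iterable[Mapping[str, Any]]) -> Schema:
    """Reduce captured log records to their (event, field, dtype) shape."""
    schema: Schema = {}
    for record in records:
        fields = {key: type(value).__name__ for key, value in record.items() if key != "event"}
        schema[record["event"]] = fields
    return schema


def schema_mismatches(
    per_model: Mapping[str, List[Mapping[str, Any]]],
    reference_model: str = "xgboost",
) -> List[str]:
    """List every event whose field set or dtypes differ from the reference model."""
    schemas = {model: event_schema(records) for model, records in per_model.items()}
    reference = schemas[reference_model]
    problems: List[str] = []
    for model in sorted(schemas):
        if model == reference_model:
            continue
        for event in sorted(set(reference) | set(schemas[model])):
            if reference.get(event) != schemas[model].get(event):
                problems.append(f"{model}: event={event} schema differs from {reference_model}")
    return problems
```

Values are deliberately ignored : model_type differs across models by design, but its dtype and presence must not.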
6. Risk matrix
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Parity tests reveal LGB/CB legacy was actually using a divergent algo (e.g., different random seed) → harness "regression" is in fact a bugfix the operator wants to review | High | Medium | Phase 2 gate forces explicit operator sign-off on documented deltas before phase 3. |
| 2 | Hamilton DAG composition slow at startup (driver.execute overhead × N variants × N folds) | Medium | Medium | Benchmark in phase 2 ; if > 5% slowdown vs legacy, cache the driver per model_name. |
| 3 | Registry plugin discovery breaks if an adapter file fails to import (e.g., catboost not installed in some environment) | Medium | Low | Lazy import per adapter ; registry skips failed registrations and logs event=adapter_unavailable. |
| 4 | Refactor of regime_trainer + ablation_runner is invasive — breaks unrelated tests | Medium | High | Phase 4 gated by full make test-unit + make test-integration green ; staged commits per file (small reviewable diffs). |
| 5 | Operator runs a FTF sweep mid-migration on a stale branch → results conflated with the harness | Low | High | Branch protection : harness work in feat/CVN-N001-EE-S16-* branch only ; FTF DAGs on main keep old codepath until phase 4 cutover commit. |
| 6 | Stacking ensembles need OOS predictions from base models — base models must run before stacker. Hamilton resolves this but the OOS prediction artifacts must be cached, otherwise re-training base models for each stacker triples the FTF runtime | Medium | High | Phase 3 spec : dags/ensembles/* consume OOSPredictions from a Hamilton-cached node ; runner orchestrates base models first, stackers second. Already a documented Hamilton pattern. |
| 7 | The 4 phases are too coupled to parallelize → critical path is 6 days serial | Medium | Low | Accepted ; Story sized 5-7 d explicitly assumes serial execution. |
7. Decoupling : what this Story does NOT do
This Story is infrastructure. It does NOT :
- Add new models (out of scope ; future Stories under the registry pattern).
- Add new ensembles (out of scope ; same).
- Resolve Track 11 sweep verdict (the sweep is abandoned per §8.1 below ; re-run on the harness post-cutover).
- Touch the inference path (src/commun/inference/) — only the training path.
- Touch the CUSUM / FE / labels — only consumes the existing feature_selection_result contract.
This Story DOES :
- Refactor the 3 trainers + the FTF runner + the walk-forward predictor.
- Establish the registry as the single point of truth for "what models / ensembles exist".
- Establish the TrainedArtifact contract as the single payload all consumers read from.
- Establish the canonical theta_sweep algorithm (resolving the LGB θ-sweep regression).
- Establish the canonical logging schema (resolving the XGB-vs-LGB-vs-CB observability gap).
8. Open decisions for the committee
8.1 Track 11 sweep currently running : abandon or wait ?
The sweep ftf_20260508_210336_1a1160 (ensemble_diversity factor) is currently running. With LGB structurally over-trading, the lgb_only variant sortino is unusable, and the 3 stacking variants that include LGB are contaminated.
Operator's preliminary recommendation : abandon the current sweep. Rationale : (a) re-running on the harness post-cutover gives valid results in ~1 day extra, (b) finishing the current sweep wastes compute on known-broken results, (c) any verdict drawn from the contaminated sweep would have to be re-computed anyway.
Committee question : do experts agree, or is there value in completing the current sweep for partial signal (cb_only + none baseline) before killing it ?
8.2 Registry mechanism : convention scan vs entry_points ?
Two options for plugin discovery :
Option A (convention scan) : registry.py walks dags/models/*.py at import time, calls import_module() on each, the @register_model decorator side-effects into a global dict.
- Pros : zero packaging overhead, drop a file = available.
- Cons : import-order coupling, harder to install third-party model plugins as a separate pip package.
Option B (entry_points) : models register via pyproject.toml entry_points. Third-party plugins install as separate packages and self-register.
- Pros : true plugin architecture, future-proof.
- Cons : requires editing pyproject.toml for every new in-tree model = friction the operator explicitly wants to avoid.
Operator's preliminary recommendation : A (convention scan) for in-tree models, OPEN to B as a future addition if a third-party plugin use case materialises. Committee question : is the convention scan robust enough, or do we want both (in-tree A + third-party B) from day 1 ?
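Option A combined with the mitigation of Risk #3 can be sketched with the standard library alone. This is an illustration, not the planned registry.py : discover_plugins and the logger name are hypothetical, and each imported module is assumed to self-register through its @register_model decorator as a side effect of the import.

```python
import importlib
import logging
import pkgutil
from typing import Dict, List, Tuple

logger = logging.getLogger("harness.registry")  # hypothetical logger name


def discover_plugins(package_name: str) -> Tuple[List[str], Dict[str, str]]:
    """Import every module directly under `package_name`.

    Returns (loaded module names, {unavailable module name: reason}).
    A module whose import fails (e.g. optional dependency missing) is
    skipped and logged rather than aborting the whole scan.
    """
    package = importlib.import_module(package_name)
    loaded: List[str] = []
    unavailable: Dict[str, str] = {}

    # Sort by name so registration order does not depend on the filesystem.
    for info in sorted(pkgutil.iter_modules(package.__path__), key=lambda m: m.name):
        full_name = f"{package_name}.{info.name}"
        try:
            importlib.import_module(full_name)
        except ImportError as exc:
            logger.warning("event=adapter_unavailable module=%s reason=%s", full_name, exc)
            unavailable[full_name] = str(exc)
        else:
            loaded.append(full_name)

    return loaded, unavailable
```

Only ImportError is swallowed here ; any other exception raised while importing a plugin still propagates, so a genuinely broken model file is not hidden. Sorting by module name is one way to reduce the import-order coupling listed under the cons of Option A.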
8.3 Should θ-sweep be a node OR an opt-in wrapper ?
Today XGB does NOT use a per-fold θ-sweep on val (it has the walk-forward calibrator instead). LGB has the buggy θ-sweep. CB has none.
Option A (always-on θ-sweep node) : every model runs through nodes/theta_sweep.py ; XGB's behaviour changes (new threshold strategy).
Option B (opt-in via wrapper DAG) : models that want θ-sweep call dags/wrappers/theta_sweep.py ; XGB stays as-is, LGB and CB get the (correct) sweep.
Operator's preliminary recommendation : B (opt-in) for phase 1-3, with a follow-up Story to evaluate whether XGB benefits from the θ-sweep on val (a question that needs its own FTF sweep — not in scope here). Committee question : agree, or push for unified θ-sweep from day 1 ?
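Whichever option is retained, the sweep itself is a small pure function. The sketch below is an assumption-laden illustration : the dossier does not specify the objective, so F1 on the buy class, the 0.05-0.95 grid, and the tie-break toward the higher threshold are all hypothetical choices. The fail-fast behaviour on missing validation data follows the ADR-25 "no silent fallback" rule cited in this dossier.

```python
from typing import List, Optional, Sequence, Tuple


def f1_at(probs: Sequence[float], labels: Sequence[int], theta: float) -> float:
    """F1 of the positive (buy) class when predicting buy iff p >= theta."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= theta and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= theta and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < theta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_theta(
    val_probs: Sequence[float],
    val_labels: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (theta, f1) maximising validation F1 over the grid.

    Fails fast instead of falling back to a default threshold.
    """
    if len(val_probs) == 0 or len(val_probs) != len(val_labels):
        raise ValueError("theta sweep needs non-empty, aligned validation data")
    if len(set(val_labels)) < 2:
        raise ValueError("theta sweep needs both classes in the validation labels")

    # Assumed default grid: 0.05, 0.10, ..., 0.95.
    thetas: List[float] = list(grid) if grid is not None else [i / 100 for i in range(5, 100, 5)]
    if not thetas:
        raise ValueError("theta grid is empty")

    best_theta, best_f1 = thetas[0], -1.0
    for theta in thetas:
        score = f1_at(val_probs, val_labels, theta)
        # '>=' sends ties to the higher threshold, i.e. fewer trades (assumption).
        if score >= best_f1:
            best_theta, best_f1 = theta, score
    return best_theta, best_f1
```

Under Option B, a wrapper DAG would call a function like this only for the models that opt in ; under Option A it would sit on every model's path.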
8.4 Hamilton vs lightweight DI ?
ADR-61 mandates Hamilton for batch DAGs. This Story uses Hamilton end-to-end. But Hamilton has a learning curve and the 3 current trainers are imperative Python — the cognitive shift for future contributors is non-trivial.
Committee question : is full Hamilton justified for this work, or should the harness use a lighter-weight DI pattern (e.g., simple registry of callables + manual orchestration) and reserve Hamilton for the wrappers ?
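To make the trade-off concrete, here is what the lightweight alternative amounts to : a plain-Python resolver where a function's parameter names identify its upstream nodes. This is NOT Hamilton and does not use its API ; it only mimics the name-based wiring idea. All names (execute, NODES, CALLS, the toy prediction functions) are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional

CALLS: List[str] = []  # records which base models actually ran


def xgboost_predictions(datasets: Any) -> List[float]:
    CALLS.append("xgboost")
    return [0.25, 0.75]  # toy values standing in for OOS predictions


def lightgbm_predictions(datasets: Any) -> List[float]:
    CALLS.append("lightgbm")
    return [0.5, 0.5]


def blend_avg(xgboost_predictions: List[float], lightgbm_predictions: List[float]) -> List[float]:
    return [(a + b) / 2 for a, b in zip(xgboost_predictions, lightgbm_predictions)]


def blend_max(xgboost_predictions: List[float], lightgbm_predictions: List[float]) -> List[float]:
    return [max(a, b) for a, b in zip(xgboost_predictions, lightgbm_predictions)]


NODES: Dict[str, Callable[..., Any]] = {
    fn.__name__: fn for fn in (xgboost_predictions, lightgbm_predictions, blend_avg, blend_max)
}


def execute(
    nodes: Dict[str, Callable[..., Any]],
    target: str,
    inputs: Dict[str, Any],
    cache: Optional[Dict[str, Any]] = None,
) -> Any:
    """Compute `target`, resolving each parameter name as an input or upstream node."""
    cache = {} if cache is None else cache
    if target in cache:
        return cache[target]
    if target in inputs:
        return inputs[target]
    fn = nodes[target]  # KeyError if the dependency is not declared anywhere
    kwargs = {name: execute(nodes, name, inputs, cache) for name in inspect.signature(fn).parameters}
    cache[target] = fn(**kwargs)
    return cache[target]
```

Sharing one cache across two stackers runs each base model once, which is the behaviour Risk #6 requires. The sketch has no type checking, no cycle detection, and no visualisation ; those are the kinds of features the committee is weighing against Hamilton's learning curve.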
8.5 What about the orphan cvntrade_lightgbm_adapter.py ?
This file (src/training/patterns/adapters/cvntrade_lightgbm_adapter.py) is dead code — zero imports across the codebase, dating back to a prior aborted refactor. Phase 4 deletes it. Committee question : confirm acceptable ?
8.6 Verdict tree
Recommend among the 4 phases : KEEP / DROP / SPLIT / DEFER. Recommend on the 5 open decisions above. Identify any 8th risk or 6th open decision the operator + Claude missed.
8.bis Committee verdict applied (2026-05-09, session 7f13b78d)
Verdict : ✅ PASSED / OK / strong consensus · 5 experts (data-scientist, crypto-trader, ml-engineer, architect, ops), scores 7.5-8.5 (mean 8.1) · OP Meeting #125.
Decisions taken on §8 open questions
| § | Question | Verdict | Notes |
|---|---|---|---|
| 8.1 | Track 11 sweep — abandon or wait ? | ABANDON (unanimous) | Sweep ftf_20260508_210336_1a1160 aborted 2026-05-09 (PG status=aborted, 5 K8s pods deleted) ; comments posted on OP wp#102 + wp#45 ; Track 11 verdict re-computed on harness post-cutover. |
| 8.2 | Registry — convention scan vs entry_points ? | Option A (convention scan) | Design the registry so that future Option B (entry_points) is a non-breaking extension if/when a third-party plugin use case materialises. |
| 8.3 | θ-sweep — always-on node vs opt-in wrapper ? | Option B (opt-in wrapper) | Phase 1-3 ; XGB stays as-is (walk-forward calibrator) ; LGB and CB get the canonical sweep. Follow-up Story to evaluate XGB+θ-sweep on val (own FTF needed, out of scope here). |
| 8.4 | Hamilton vs lightweight DI ? | Full Hamilton | Per ADR-61. Onboarding doc pointing to Hamilton patterns added in Phase 4. |
| 8.5 | Orphan adapter cvntrade_lightgbm_adapter.py ? | DELETE in Phase 4 | Confirmed dead code (zero imports). |
Enhancement recommendations to integrate (5, non-blocking)
These extend the plan ; they are integrated into the relevant phases below.
1. Performance monitoring (extends §6 Risk #2 + §11 acceptance) : measure per-fold training time for XGB / LGB / CB on the SAME synthetic dataset, legacy vs harness, in the Phase 2 benchmark. Establish the SLO : harness driver overhead ≤ 5% wall-clock vs legacy. Alarm in Grafana on regression. Add a training_time_sec field to event=training_complete (already in the §3.2 contract, reaffirmed here) so Loki can query per-model training latency over time.
2. Calibration always-on (extends §3 nodes/calibrate.py) : per the ML-engineer's ask, calibrate.py becomes an always-on node for every model that produces probabilities. Default = isotonic ; per-model override possible via the adapter. Rationale : θ-sweep on miscalibrated probabilities is the bug LGB just exhibited ; canonical calibration BEFORE θ-sweep prevents recurrence by construction.
3. Meta-labeling clarification (extends §3 nodes/meta_label.py) : the dossier was ambiguous. XGB does meta-label as a secondary training step on TP/FP, while the LdP filter chain has a post-inference meta_label filter. They are different. The harness nodes/meta_label.py covers ONLY the training-time variant ; it produces a secondary classifier surfaced in TrainedArtifact.meta_label_model. The post-inference filter stays in src/commun/filters/meta_label.py untouched (out of scope). Documented in ADR-89.
4. Security of model artifacts (extends §6 Risk + ADR-89) : ADR-89 must include a "Security model" section : (a) AuthN/AuthZ for the MLflow registry, leveraging existing IAM with no new secret store ; (b) TrainedArtifact does NOT serialize raw HPO datasets (already true by contract, since only the dataset_shape summary is persisted) ; (c) MLflow params are scrubbed of any env values matching *_KEY|*_TOKEN|*_PASSWORD patterns (regex filter in nodes/persist.py). No encryption-at-rest change required at this stage ; tracked in a follow-up Story under CVN-N001-EF (operational prereqs).
5. Feature store integration with Feast (extends §3.2 Datasets contract) : Datasets carries an additional field feature_version: FeatureVersion (commit SHA + Feast registry hash + selected_features hash). nodes/persist.py writes this field into MLflow as tags and into cache.store_trained_model as a column on mlflow_models (migration in Phase 4). Future Feast integration becomes a non-event : just populate the field from Feast instead of from the FE pipeline. Documented in ADR-89.
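The secret scrub of enhancement 4 (c) can be sketched as a pure function. The dossier's wording leaves the exact rule open, so this sketch assumes one interpretation : redact any param whose name ends in _KEY, _TOKEN or _PASSWORD, and any param whose value equals the value of an environment variable with such a name. scrub_params, SECRET_NAME and REDACTED are hypothetical names ; the env mapping is passed in explicitly to keep the function testable.

```python
import re
from typing import Any, Dict, Mapping

# Assumed reading of the *_KEY|*_TOKEN|*_PASSWORD patterns from the dossier.
SECRET_NAME = re.compile(r".*_(KEY|TOKEN|PASSWORD)$", re.IGNORECASE)
REDACTED = "***"


def scrub_params(params: Mapping[str, Any], env: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `params` that is safe to log as MLflow params."""
    # Values of secret-looking environment variables (empty values ignored).
    secret_values = {value for name, value in env.items() if SECRET_NAME.match(name) and value}

    clean: Dict[str, Any] = {}
    for name, value in params.items():
        leaked_by_name = bool(SECRET_NAME.match(name))
        leaked_by_value = str(value) in secret_values
        clean[name] = REDACTED if (leaked_by_name or leaked_by_value) else value
    return clean
```

In production the second argument would typically be os.environ. Redacting instead of dropping the key keeps the param schema stable across runs, which matters for the logging contract in §5.3.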
Phase impact of the 5 enhancements
| Enhancement | Phase | Extra effort |
|---|---|---|
| 1 — Perf SLO | Phase 2 (benchmark) | +0.25 d (Grafana panel + SLO doc) |
| 2 — Calibration always-on | Phase 1 (node) | +0.25 d (always-on default + per-model override hook) |
| 3 — Meta-label clarification | Phase 1 (ADR draft) + Phase 4 (ADR-89 final) | +0.1 d (mostly docs) |
| 4 — Security section | Phase 4 (ADR-89) | +0.25 d (write Security model + scrub regex impl) |
| 5 — Feast / FeatureVersion | Phase 1 (contract) + Phase 4 (migration + persist) | +0.5 d |
| Total extra | distributed | +1.35 d |
Sizing revised : 6 d + 1.35 d = 7.35 d, i.e. ~7 d with the enhancements integrated. This consumes the +1 d buffer and leaves ~0 days of slack in the 5-7 d window. Flag for monitoring during implementation ; if the schedule slips, calibration always-on (#2) and Feast (#5) can be split into a Phase 5 "stabilisation" Story without breaking the harness MVP.
Acceptance criteria addendum (§11)
- Per-fold training time logged and within SLO (≤ +5% legacy) on Phase 2 benchmark for XGB / LGB / CB.
- calibrate.py always-on, per-model override hook in adapter ; documented in ADR-89.
- Meta-labeling : training-time vs post-inference distinction in ADR-89 + harness covers only training-time variant.
- ADR-89 Security model section written and reviewed by ops expert ; secret-scrub regex shipped in nodes/persist.py.
- Datasets contract carries feature_version field ; persisted to MLflow + mlflow_models table.
9. Cross-references
- Today's broken sweep : ftf_20260508_210336_1a1160_ATR0.5_1.5_H4 (factor ensemble_diversity, status running), the immediate trigger for this Story.
- PR #872 (FTF 7-bug hotfix, sha 461a39b2, merged 2026-05-08) : the PR that demonstrated the cost of triplicated trainers ; Bug #1 (LGB Booster wrap), Bug #2 (CB nested metrics), Bug #6 (LGB θ-sweep) are all rooted in trainer divergence.
- CVN-N001-EE-S06 (issue #802, PR #803) : the previous Track 11 dispatcher ; establishes CVN_MODEL_TYPE env var routing, which the harness consumes from the registry instead of from the dispatcher.
- ADR-61 : "Batch DAGs use Hamilton, not imperative code" ; this Story is a direct application.
- ADR-25 : "No silent fallback" ; the theta_sweep node fails fast on missing val data, no default 0.5.
- ADR-30/32/38 : structured logs ; the log_emit.py node is the canonical implementation.
- ADR-67 : pluggable feature-selection framework ; the harness applies the same plugin pattern to models + ensembles.
- CLAUDE.md §3 process : plan_review committee mandatory before substantial implementation ; this dossier IS that artifact.
10. Sizing summary
| Phase | Effort (days) | Critical artifact |
|---|---|---|
| 1 — Skeleton + XGB adapter + parity | 1.5 | harness.train_one("xgboost", ...) matches legacy 1e-9 |
| 2 — LGB + CB adapters + 3-way parity + logging contract | 1.5 | All 3 models emit identical event=training_complete schema |
| 3 — Wrappers + ensemble registry | 1.5 | EnsembleRegistry round-trips all 4 current Track 11 variants |
| 4 — Migration of call sites + cutover + ADR-89 | 1.5 | 3 autonomous trainers ≤ 80 lines each, FTF runner registry-driven |
| Total | 6 d | All make test-unit + make test-integration green |
Buffer +1 d for committee CR rounds and unexpected Hamilton resolution issues.
11. Acceptance criteria (Story closure gate)
- All make test-unit + make test-integration green on feat/CVN-N001-EE-S16-* branch.
- Parity tests green for the 3 current models (XGB / LGB / CB) on synthetic data.
- FTF shadow run on 5 cryptos × 3 folds shows harness output within tolerance of legacy (XGB / CB) and documents intentional deltas (LGB θ-sweep fix).
- All 3 autonomous trainers ≤ 80 lines each.
- FTF ablation_runner.py iterates over registry instead of hardcoded if/elif.
- Single Loki query returns training events for all 3 models with same field schema.
- Test that adds a synthetic 4th model (e.g., dummy_classifier) verifies the registry picks it up with no other change.
- CLAUDE.md "Architecture" section + new ADR-89 documenting the harness merged.
- Track 11 sweep re-run on the harness ; verdict on ensemble_diversity factor recorded.
- PR review : CodeRabbit + committee pr_review (mandatory per CLAUDE.md §8, as a substantial PR touching src/commun/finetune/ + src/training/).
- MLOps readiness template completed (ADR-70, since the Story touches model artefacts + training path).
- CVN-N001-EE-S16 OP wp transitions In testing → Tested → Closed per ADR-81.