Architecture: Pluggable Feature-Selection Framework (CVN-N001-ED extension)
Status: Implemented in PR #685, bound by ADR-67.
Version: 1.0
Date: 2026-04-25
ADRs in scope: ADR-25 (no silent fallback), ADR-56 (every change A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel observability), ADR-64 (preflight first-class, amended by ADR-67), ADR-65 (Console-driven config), ADR-67 (this framework)
Source: GitHub issue #684
1. Purpose
Replace the hardcoded variance / FI selection in cvntrade_autonomous_fe.py with a Hamilton-DAG-based framework that:
- Treats every selector (variance, covariance, FI, SHAP, future permutation / SHAP / mutual_info / ...) as a 1-file drop-in
- Lets the FTF compare 4 ranking paradigms × 2 compute scopes in a single ablation run (variance / covariance / fi / shap × global / perfold)
- Eliminates the stale-FI defect: per-fold scope mirrors production retrain cadence (operator weekly retrain ⇒ fresh FI on the latest training window)
- Makes K-overshoot a structured failure mode (code=k_above_floor JSON in finetune_results.error) instead of an opaque RuntimeError("training_failed")
2. Component view
The framework is internal to the FTF — it slots between regime_trainer (which knows the fold) and cvntrade_autonomous_fe (which knows the FE-transformed (X, y)). All compute is Hamilton per ADR-61; the dispatcher is orchestration only.
flowchart LR
    operator([Operator])
    console[Console<br/>ftf_config.base_env]
    airflow[Airflow<br/>finetune__pte DAG]
    regime[regime_trainer<br/>fold loop]
    fe[CVNTrade_Autonomous<br/>FeatureEngineering]
    dispatcher[feature_selection<br/>dispatch.get_selection]
    registry[REGISTRY<br/>method → DAG]
    subgraph dags[Hamilton DAGs]
        shared[shared.py<br/>x_train / y_train /<br/>reference_model]
        fi[fi_dag.py<br/>fi_scores]
        shap[shap_dag.py<br/>shap_scores]
        var[variance_dag.py<br/>variance_scores]
        cov[covariance_dag.py<br/>covariance_scores]
    end
    cache[(cache/feature_selection/<br/>method/scope/...)]
    pg[(PostgreSQL<br/>finetune_results)]
    grafana[Grafana<br/>OTel events]
    operator -- "set CVN_FEATURE_SELECTION_*" --> console
    console -- "base_env" --> airflow
    airflow -- "trigger" --> regime
    regime -- "for each fold:<br/>set CVN_FOLD_ID,<br/>invoke FE" --> fe
    fe -- "method, scope, fold_id,<br/>X_train, y_train" --> dispatcher
    dispatcher -- "lookup" --> registry
    dispatcher -- "cache hit?" --> cache
    dispatcher -- "build Driver(shared + selector_dag)" --> dags
    shared -- "reference_model<br/>(shared by fi + shap)" --> fi
    shared --> shap
    dags -- "scores: pd.Series" --> dispatcher
    dispatcher -- "save" --> cache
    dispatcher -- "OTel events" --> grafana
    fe -- "top-K applied,<br/>error_payload(KAboveFloorError)<br/>on overshoot" --> pg
Two key invariants visible in the diagram:
- Anti-leakage by signature: dispatcher.get_selection(...) requires (x_train, y_train) from the caller. The Hamilton DAG never queries the cache directly. Per-fold scope passes the fold's training window; global scope passes the preflight reference window. No code path can confuse train and test.
- Shared-node deduplication: reference_model lives in shared.py. When fi_dag and shap_dag both consume it (same execute call), Hamilton materialises it exactly once. Adding a 5th model-based selector would inherit the same dedup for free.
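The deduplication invariant can be illustrated with a toy resolver. This is not Hamilton's implementation, only a minimal sketch of the same idea: nodes are plain functions, dependencies are inferred from parameter names, and each node is computed at most once per execute call. The node bodies and the `execute` helper are hypothetical stand-ins.

```python
import inspect

calls = {"reference_model": 0}  # counts how often the expensive node runs

def x_train():
    return [[1.0, 0.0], [0.0, 1.0]]

def reference_model(x_train):
    calls["reference_model"] += 1
    return {"n_rows": len(x_train)}  # stand-in for a fitted model

def fi_scores(reference_model):
    return {"source": "fi", "rows": reference_model["n_rows"]}

def shap_scores(reference_model):
    return {"source": "shap", "rows": reference_model["n_rows"]}

NODES = {f.__name__: f for f in (x_train, reference_model, fi_scores, shap_scores)}

def execute(outputs, inputs=None):
    """Resolve requested nodes by parameter name, memoising every node."""
    memo = dict(inputs or {})

    def resolve(name):
        if name not in memo:
            fn = NODES[name]
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            memo[name] = fn(**kwargs)
        return memo[name]

    return {name: resolve(name) for name in outputs}

result = execute(["fi_scores", "shap_scores"])
```

Requesting both outputs in one call leaves `calls["reference_model"]` at 1, which is the behaviour the second invariant relies on.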
3. Hamilton DAGs (visualizations)
Each (method, scope) combination produces a distinct DAG. The visualizations below are rendered from commun.finetune.feature_selection.dispatch.visualize() — they reflect the actual execution graph Hamilton resolves at run time.
3.1 variance: inline (target-blind, no model fit)
3.2 covariance: inline (univariate target correlation)
3.3 fi: global / perfold (XGBoost gain on shared reference_model)
fi_global and fi_perfold share the same DAG; only the input window differs at runtime (and therefore the cache key).
3.4 shap: global / perfold (Shapley values on the same shared reference_model)
When fi_scores and shap_scores are requested in the same execute call, Hamilton sees one reference_model node and trains XGBoost exactly once. SHAP becomes a ~30 s extractor on the existing model rather than a 5 min retrain.
4. Cache layout
cache/feature_selection/
├── fi/
│ ├── global/<SYMBOL>_<strategy>_<tf>.json
│ └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── shap/
│ ├── global/<SYMBOL>_<strategy>_<tf>.json
│ └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── variance/
│ └── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json
└── covariance/
└── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json
Per-scope cache key semantics (refined post PR #686):
| Scope | fold_id in path | Sharing | Mismatch behaviour |
|---|---|---|---|
| global | no | shared across all folds (sharing intent) | warning event feature_selection_cache_window_mismatch, scores still served |
| perfold | required | per-fold isolation | raise RuntimeError (programming bug: caller passed a different (X, y) for the same fold) |
| inline | when supplied | per-fold isolation when fold_id present, else single shared file | same as perfold: raise |
Variants on the same (method, scope, fold_id) SHARE the cache file (e.g. fi_30_perfold and fi_150_perfold on fold 3 ⇒ one compute, two top-K reads). Top-K splitting happens AFTER the cache load.
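A minimal sketch of how such a path could be derived, assuming a hypothetical `cache_path` signature (the real helper's arguments are not shown in this document). K is deliberately not part of the key, which is why variants that differ only in K land on the same file.

```python
from pathlib import Path
from typing import Optional

CACHE_ROOT = Path("cache/feature_selection")

def cache_path(method: str, scope: str, symbol: str, strategy: str,
               timeframe: str, fold_id: Optional[int] = None) -> Path:
    """Build the cache file path for one (method, scope, fold) combination."""
    stem = f"{symbol}_{strategy}_{timeframe}"
    if scope == "perfold":
        if fold_id is None:
            raise ValueError("fold_id is required for scope=perfold")
        stem += f"_fold{fold_id}"
    elif scope == "inline" and fold_id is not None:
        stem += f"_fold{fold_id}"
    # scope == "global": fold_id never enters the path, so all folds share one file
    return CACHE_ROOT / method / scope / f"{stem}.json"
```

With these rules, `cache_path("fi", "perfold", "UNIUSDC", "ATR0.5_1.5_H4", "15m", 3)` resolves to the same file whether the variant is fi_30_perfold or fi_150_perfold.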
The cache fingerprint stored in each payload is a SHA-256 of (index.min, index.max, shape, ordered column names) — including the column list (PR #686 CR pass 2) closes a false-negative where two X_train matrices with the same row range and shape but different feature columns would otherwise be served interchangeably.
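The fingerprint could be computed along these lines. The function name `window_fingerprint` and the JSON serialisation are assumptions for illustration; only the four hashed ingredients come from the description above.

```python
import hashlib
import json

def window_fingerprint(index_min, index_max, shape, columns) -> str:
    """SHA-256 over the training window bounds, shape and ordered column names."""
    payload = json.dumps(
        {
            "index_min": str(index_min),
            "index_max": str(index_max),
            "shape": list(shape),
            "columns": list(columns),  # order matters: a list, not a set
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

same_rows_a = window_fingerprint("2025-01-01", "2025-06-30", (1000, 2), ["rsi_14", "atr_14"])
same_rows_b = window_fingerprint("2025-01-01", "2025-06-30", (1000, 2), ["rsi_14", "obv"])
```

The two calls share the row range and shape but differ in one column name, so their digests differ. That is the case the column list was added to catch.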
A sample payload (truncated):
{
  "method": "fi",
  "scope": "perfold",
  "symbol": "UNIUSDC",
  "strategy": "ATR0.5_1.5_H4",
  "timeframe": "15m",
  "fold_id": 3,
  "name": "fi_score",
  "scores": {
    "rsi_14": 0.0432,
    "macd_signal_diff": 0.0418,
    "...": 0.0
  }
}
5. Sequence: per-fold compute (regime_trainer → dispatcher → Hamilton)
sequenceDiagram
    autonumber
    participant RT as regime_trainer<br/>(fold N loop)
    participant FE as Autonomous_FE
    participant Dis as dispatcher.get_selection
    participant Cache as cache/feature_selection/<br/>fi/perfold/...json
    participant Hamilton
    participant XGB as XGBoost<br/>(reference_model)
    participant DB as finetune_results
    RT->>RT: set CVN_FOLD_ID=N,<br/>CVN_FEATURE_SELECTION_SCOPE=perfold
    RT->>FE: invoke FE pipeline
    FE->>FE: stationarise + standardise<br/>→ X_train_transformed, y_train
    FE->>Dis: get_selection(method=fi, scope=perfold, fold_id=N,<br/>x_train=X_train_transformed, y_train)
    Dis->>Dis: registry.resolve(fi, perfold) → SelectorEntry
    Dis->>Cache: lookup fi/perfold/UNI_..._fold3.json
    alt cache hit
        Cache-->>Dis: scores Series
    else cache miss
        Dis->>Hamilton: execute([fi_scores], inputs={x_train_override, y_train_override, symbol})
        Hamilton->>Hamilton: resolve shared.x_train + shared.y_train + shared.sample_weights
        Hamilton->>XGB: train (depth=10, rounds=500, seed=42)
        XGB-->>Hamilton: Booster
        Hamilton->>Hamilton: fi_scores(reference_model, x_train) → pd.Series
        Hamilton-->>Dis: {fi_scores: Series}
        Dis->>Cache: save (atomic .tmp + replace)
    end
    Dis-->>FE: scores Series
    FE->>FE: select_top_k(scores, K=80, policy=fail)
    alt K ≤ non_zero
        FE->>FE: apply top-K cap to X_train/val/test
        FE-->>RT: cap applied (continues training)
    else K > non_zero
        FE->>FE: raise KAboveFloorError
        FE->>DB: error_payload → {code:k_above_floor, requested_k, available_features, ...}
    end
Critical points:
- Step 1: regime_trainer exports CVN_FOLD_ID alongside the existing CVN_FOLD_* env vars. The FE component reads it via the dispatcher when scope=perfold.
- Step 4: the dispatcher signature requires (x_train, y_train). The FE component is the only place that can produce them with the right window, which keeps the anti-leakage contract auditable from a single call site.
- Step 10 (XGBoost training): parameters (depth=10, rounds=500) are surfaced as Console knobs (CVN_FI_REFERENCE_DEPTH, CVN_FI_REFERENCE_ROUNDS). Bumped from 6/200 to push the non-zero feature floor higher (the KAboveFloorError ceiling). Bounds enforced at the edge: depth ∈ [1, 30], rounds ∈ [10, 5000]. Out-of-range values raise ValueError from _parse_positive_int instead of crashing inside XGBoost (PR #686 guardrail).
- Step 19 (KAboveFloorError): structured payload {code: "k_above_floor", requested_k, available_features, selector, scope, symbol} lands in finetune_results.error as JSON. The dashboard can filter by error::json->>'code' = 'k_above_floor' to distinguish K-overshoot from real training crashes. Three other structured codes share the same payload shape: selector_not_registered (unknown method), scope_not_supported (invalid combo), training_failed (legacy fallback for non-framework exceptions).
- Score alignment (post-cache, pre-top-K): aligned = scores.reindex(X_train.columns), then aligned.loc[~aligned.index.isin(scores.index)] = 0.0. Features absent from the cache are de-prioritised to 0, while features present with NaN keep their NaN (selector-explicit "undefined") and are filtered by select_top_k's > 0 mask. PR #686 CR pass 5 fix: the previous .fillna(0.0) collapsed both cases.
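The alignment and top-K rules can be sketched without pandas, using a dict as a stand-in for the scores Series. The helper bodies and the two-argument `KAboveFloorError` constructor below are illustrative assumptions, not the framework's actual code; the semantics (absent features become 0, NaN stays NaN, only scores > 0 are eligible) follow the description above.

```python
import math

class KAboveFloorError(RuntimeError):
    """Raised when K exceeds the number of features with a positive score."""
    def __init__(self, requested_k, available_features):
        super().__init__(f"K={requested_k} exceeds {available_features} usable features")
        self.requested_k = requested_k
        self.available_features = available_features

def align_scores(scores, columns):
    # Absent from the cache -> 0.0; present with NaN -> NaN is preserved.
    return {col: scores.get(col, 0.0) for col in columns}

def select_top_k(aligned, k, policy="fail"):
    if policy not in ("fail", "truncate"):
        raise ValueError(f"unknown overshoot policy: {policy!r}")
    # NaN > 0 is False, so NaN scores drop out together with zeros.
    usable = {col: s for col, s in aligned.items() if s > 0}
    if k > len(usable):
        if policy == "fail":
            raise KAboveFloorError(k, len(usable))
        k = len(usable)  # policy == "truncate"
    return sorted(usable, key=usable.get, reverse=True)[:k]

cached = {"rsi_14": 0.0432, "macd_signal_diff": 0.0418, "atr_14": math.nan}
aligned = align_scores(cached, ["rsi_14", "macd_signal_diff", "atr_14", "new_feature"])
top2 = select_top_k(aligned, 2)
```

Here `new_feature` is scored 0.0 and `atr_14` stays NaN, so only two features are usable; asking for K=3 raises under `fail` and returns two names under `truncate`.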
6. Adding a new selector: checklist
Per ADR-67: drop-in surface is a single new file + a single registry entry. Demo with a hypothetical permutation importance:
- Create src/commun/finetune/feature_selection/dags/permutation_dag.py:

    import pandas as pd

    def permutation_scores(reference_model, x_train, y_train, symbol) -> pd.Series:
        """Hamilton node: mean permutation importance per feature, highest first."""
        from sklearn.inspection import permutation_importance
        # permutation_importance expects a fitted scikit-learn-compatible estimator.
        imp = permutation_importance(reference_model, x_train, y_train, n_repeats=5, random_state=42)
        scores = pd.Series(imp.importances_mean, index=x_train.columns).sort_values(ascending=False)
        # _emit is assumed to be the framework's event helper; its import is omitted in this demo.
        _emit("feature_selection_node", node="permutation_scores", selector="permutation",
              symbol=symbol, n_features=int(scores.shape[0]), non_zero=int((scores > 0).sum()))
        return scores

- Add to registry.py:

    "permutation": SelectorEntry(
        dag_module=permutation_dag,
        output_node="permutation_scores",
        supports_scopes=("perfold",),  # too expensive globally
        description="Permutation importance: measures actual predictive contribution.",
    ),

- Variants in ablation_matrix.py get permutation_50_perfold, permutation_100_perfold, etc.
That's it. No edits to dispatch.py, no edits to the FE pipeline, no edits to the cache layout, no edits to regime_trainer.py. Hamilton automatically wires the new node to reference_model (and the rest of the shared chain) based on its parameter names.
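For orientation, the registry side of this contract might look roughly like the sketch below. The `SelectorEntry` shown here omits the `dag_module` field to stay self-contained, and the bodies of `resolve` and `default_scope_for_method` are assumptions inferred from how this document describes them.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

class UnknownSelectorError(LookupError):
    """The requested method has no registry entry."""

class SelectorScopeError(ValueError):
    """The method exists but does not support the requested scope."""

@dataclass(frozen=True)
class SelectorEntry:
    output_node: str
    supports_scopes: Tuple[str, ...]
    description: str = ""

REGISTRY: Dict[str, SelectorEntry] = {
    "fi": SelectorEntry("fi_scores", ("global", "perfold")),
    "variance": SelectorEntry("variance_scores", ("inline",)),
    "permutation": SelectorEntry("permutation_scores", ("perfold",)),
}

def resolve(method: str, scope: str) -> SelectorEntry:
    """Return the entry for (method, scope) or raise a structured error."""
    if method not in REGISTRY:
        raise UnknownSelectorError(f"selector {method!r} is not registered")
    entry = REGISTRY[method]
    if scope not in entry.supports_scopes:
        raise SelectorScopeError(
            f"{method!r} supports {entry.supports_scopes}, got {scope!r}"
        )
    return entry

def default_scope_for_method(method: str) -> str:
    # The first declared scope doubles as the default.
    return REGISTRY[method].supports_scopes[0]
```

Because the dispatcher only talks to `resolve`, adding the `permutation` entry is the only registry change a new selector needs in this sketch.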
Auto-wired side-effects (no extra work needed when adding a selector):
- The FE pipeline's scope-fallback hook calls registry.default_scope_for_method(method), which returns supports_scopes[0]. So permutation declaring supports_scopes=("perfold",) makes perfold the default scope when the operator sets CVN_FEATURE_SELECTION_METHOD=permutation without an explicit scope.
- The Console save-time validator (scripts/ftf_config_ui.py::_validate_feature_selection_combo) reads REGISTRY dynamically when the framework module is importable; otherwise it falls back to a hardcoded mirror that is sync-tested against the live REGISTRY (tests/unit/test_feature_selection_framework.py::test_console_validator_mirror_matches_registry). Adding a selector trips the test if the fallback isn't updated.
- The structured error catalogue (KAboveFloorError, SelectorScopeError, UnknownSelectorError) is shared: any new selector inherits the same error_payload(exc) JSON shape automatically.
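A shared error catalogue of this kind can be sketched as a small exception hierarchy plus one serialiser. The base class `FeatureSelectionError`, the `fields` attribute and the constructor signature are hypothetical; the codes and payload keys are the ones listed in this document.

```python
import json

class FeatureSelectionError(Exception):
    """Hypothetical base class: carries a machine-readable code and extra fields."""
    code = "training_failed"

    def __init__(self, message, **fields):
        super().__init__(message)
        self.fields = fields

class KAboveFloorError(FeatureSelectionError):
    code = "k_above_floor"

class SelectorScopeError(FeatureSelectionError):
    code = "scope_not_supported"

class UnknownSelectorError(FeatureSelectionError):
    code = "selector_not_registered"

def error_payload(exc: Exception) -> dict:
    """Map any exception to the JSON-serialisable shape stored in the DB."""
    if isinstance(exc, FeatureSelectionError):
        return {"code": exc.code, **exc.fields}
    # Legacy fallback for exceptions raised outside the framework.
    return {"code": "training_failed", "message": str(exc)}

overshoot = KAboveFloorError(
    "K above floor", requested_k=80, available_features=61,
    selector="fi", scope="perfold", symbol="UNIUSDC",
)
stored = json.dumps(error_payload(overshoot))
```

A new selector raises the same exception types, so its failures serialise to the same keys without any selector-specific code.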
7. Console keys
All knobs live in ftf_config.base_env (Console UI). Defaults documented in config/ftf_baseline.json.
| Key | Default | Effect |
|---|---|---|
| CVN_PREFLIGHT_ENABLED | 1 | Master kill-switch (existing) |
| CVN_GLOBAL_PREFLIGHT | 1 | Gate the preflight global compute |
| CVN_PERFOLD_PREFLIGHT | 1 | Gate the per-fold compute hook |
| CVN_FI_REFERENCE_DEPTH | 10 | XGBoost depth for fi/shap reference model (was 6) |
| CVN_FI_REFERENCE_ROUNDS | 500 | XGBoost rounds (was 200) |
| CVN_FEATURE_SELECTION_METHOD | (per variant) | variance / covariance / fi / shap |
| CVN_FEATURE_SELECTION_SCOPE | (per variant) | inline / global / perfold |
| CVN_FEATURE_SELECTION_K_OVERSHOOT_POLICY | fail | fail (ADR-25) / truncate |
| CVN_FOLD_ID | (set by regime_trainer) | Per-fold cache key for scope=perfold |
8. Observability: OTel events catalogue
Every Hamilton node + every dispatcher transition emits a structured event via commun.observability.otel.emit_event (ADR-62). Event names and key attributes:
| Event | Source | Key attributes |
|---|---|---|
| feature_selection_cache_hit | dispatcher | method, scope, symbol, strategy, timeframe, fold_id, cache_path |
| feature_selection_compute_start | dispatcher | + n_train_rows, n_features_in |
| feature_selection_compute_end | dispatcher | + n_features_out, non_zero, elapsed_s, cache_path |
| feature_selection_compute_failed | dispatcher | + error_payload (JSON of the structured exception) |
| feature_selection_cache_window_mismatch | dispatcher | method, scope=global, symbol, fold_id, cached_fingerprint, caller_fingerprint. Fires when a global cache hit is served to a caller whose training window doesn't match the cached one (sharing intent preserved, mismatch surfaced for audit). perfold / inline mismatches raise instead of emit. |
| feature_selection_node | each Hamilton node | node (= function name), selector, symbol, elapsed_s, plus node-specific attrs. Every node emits, including n_classes, sample_weights, fi_reference_depth, fi_reference_rounds (PR #686 CR pass 1 fix). |
| feature_selection_truncated | top_k helper | requested_k, actual_k, selector, scope, symbol |
Grafana Loki query example to pull all per-fold FI compute durations for a run:
{namespace="cvntrade"}
| json
| event="feature_selection_compute_end"
| method="fi"
| scope="perfold"
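Events in this catalogue are emitted on a best-effort basis: a failing collector must not fail the compute. A minimal sketch of that pattern follows, with a hypothetical `safe_emit` wrapper around whatever emitter is injected; it does not reproduce the actual commun.observability.otel.emit_event code.

```python
import logging

logger = logging.getLogger("feature_selection")

def safe_emit(emit, event, **attrs):
    """Call the emitter; on any failure log at debug level and carry on."""
    try:
        emit(event, **attrs)
        return True
    except Exception:
        logger.debug("emit failed for %s", event, exc_info=True)
        return False

received = []

def working_emitter(event, **attrs):
    received.append((event, attrs))

def unreachable_collector(event, **attrs):
    raise ConnectionError("collector unreachable")

ok = safe_emit(working_emitter, "feature_selection_compute_end",
               method="fi", scope="perfold", elapsed_s=12.5)
failed = safe_emit(unreachable_collector, "feature_selection_compute_end",
                   method="fi", scope="perfold", elapsed_s=12.5)
```

The second call returns False instead of raising, so the caller's control flow is identical whether or not the collector is reachable.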
9. Failure modes and mitigations
| Failure | Detection | Mitigation |
|---|---|---|
| K > non_zero in cache | KAboveFloorError raised by select_top_k | Structured code=k_above_floor payload in DB; dashboard filter; operator either lowers K, deepens reference model, or switches variant policy to truncate |
| Selector / scope combo unsupported | SelectorScopeError at parse time (registry.resolve) | Variant rejected before any pod is provisioned. Editable from Console without redeploy. |
| Hamilton node raises (e.g. xgboost OOM) | Bubbles up; dispatcher emits feature_selection_compute_failed with error_payload | Standard error_payload(exc) fallback to {code: training_failed, message} so legacy dashboards keep working |
| OTel collector unreachable | emit_event swallows exception, logs at debug level | Observability never breaks compute; it falls back to the standard logger silently |
| Cache file corrupted | _load_cache raises JSONDecodeError | Operator deletes the file → next run recomputes |
| fold_id missing for scope=perfold | cache_path raises ValueError | Caught by dispatcher, becomes structured error in DB |
| Two variants on same fold race the cache write | _save_cache writes to .tmp then .replace (atomic) | No corrupted partial file; later writer wins (same data anyway) |
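The write-then-replace pattern from the last row can be sketched as follows. The function names mirror the table but the bodies are assumptions; the temp file name also includes the process id, a small variation on the plain .tmp suffix so that two concurrent writers never share a temp file.

```python
import json
import os
from pathlib import Path

def save_cache(path: Path, payload: dict) -> None:
    """Write to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(f"{path.name}.{os.getpid()}.tmp")
    try:
        tmp.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    finally:
        if tmp.exists():  # only true if the write or replace failed
            tmp.unlink()

def load_cache(path: Path) -> dict:
    """Raises json.JSONDecodeError if the file is corrupted."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Readers therefore see either the previous complete file or the new complete file, which is why a race between two variants on the same fold is harmless.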
10. Migration from legacy preflight
The legacy cache/feature_importance/<key>.json layout (written by FiReferenceStep, ADR-64) is untouched by this PR. Two coexistence paths:
- Operator action: re-run preflight after the deploy → new framework writes to cache/feature_selection/fi/global/... and both caches are present until cleanup.
- No action: the legacy load_fi_reference() reader stays callable. Variants using CVN_FEATURE_SELECTION_METHOD=fi without explicit scope default to global and read from the legacy path until the global cache gets repopulated through the new dispatcher.
Cleanup: planned as a follow-up issue once the framework has been validated in production for one full FTF cycle.
11. References
- Binding ADR: ADR-67 — Pluggable Feature-Selection Framework
- Issue: #684 — full design discussion
- Implementation PR: #685
- Source: src/commun/finetune/feature_selection/
- Tests: tests/unit/test_feature_selection_framework.py (32 tests)
- Related ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel), ADR-64 (preflight first-class, amended by ADR-67), ADR-65 (Console-driven config)
- Triggering analysis: feature_importance run ftf_20260424_223437 showing fi_65/80 failing at a 50% rate due to the FI cache capacity ceiling