Architecture — Pluggable Feature-Selection Framework (CVN-N001-ED extension)¶

Status: Implemented in PR #685, bound by ADR-67. Version: 1.0 Date: 2026-04-25 ADRs in scope: ADR-25 (no silent fallback), ADR-56 (every change A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel observability), ADR-64 (preflight first-class — amended by ADR-67), ADR-65 (Console-driven config), ADR-67 (this framework) Source: GitHub issue #684

1. Purpose¶

Replace the hardcoded variance / FI selection in cvntrade_autonomous_fe.py with a Hamilton-DAG-based framework that:

Treats every selector (variance, covariance, FI, SHAP, future permutation / SHAP / mutual_info / ...) as a 1-file drop-in
Lets the FTF compare 4 ranking paradigms × 2 compute scopes in a single ablation run (variance / covariance / fi / shap × global / perfold)
Eliminates the stale-FI defect: per-fold scope mirrors production retrain cadence (operator weekly retrain ⇒ fresh FI on the latest training window)
Makes K-overshoot a structured failure mode (code=k_above_floor JSON in finetune_results.error) instead of an opaque RuntimeError("training_failed")

2. Component view¶

The framework is internal to the FTF — it slots between regime_trainer (which knows the fold) and cvntrade_autonomous_fe (which knows the FE-transformed (X, y)). All compute is Hamilton per ADR-61; the dispatcher is orchestration only.

flowchart LR
    operator([Operator])
    console[Console
ftf_config.base_env]
    airflow[Airflow
finetune__pte DAG]
    regime[regime_trainer
fold loop]
    fe[CVNTrade_Autonomous
FeatureEngineering]
    dispatcher[feature_selection
dispatch.get_selection]
    registry[REGISTRY
method → DAG]
    subgraph dags[Hamilton DAGs]
        shared[shared.py
x_train / y_train /
reference_model]
        fi[fi_dag.py
fi_scores]
        shap[shap_dag.py
shap_scores]
        var[variance_dag.py
variance_scores]
        cov[covariance_dag.py
covariance_scores]
    end
    cache[(cache/feature_selection/
method/scope/...)]
    pg[(PostgreSQL
finetune_results)]
    grafana[Grafana
OTel events]

    operator -- "set CVN_FEATURE_SELECTION_*" --> console
    console -- "base_env" --> airflow
    airflow -- "trigger" --> regime
    regime -- "for each fold:
set CVN_FOLD_ID,
invoke FE" --> fe
    fe -- "method, scope, fold_id,
X_train, y_train" --> dispatcher
    dispatcher -- "lookup" --> registry
    dispatcher -- "cache hit?" --> cache
    dispatcher -- "build Driver(shared + selector_dag)" --> dags
    shared -- "reference_model
(shared by fi + shap)" --> fi
    shared --> shap
    dags -- "scores: pd.Series" --> dispatcher
    dispatcher -- "save" --> cache
    dispatcher -- "OTel events" --> grafana
    fe -- "top-K applied,
error_payload(KAboveFloorError)
on overshoot" --> pg

Two key invariants visible in the diagram:

Anti-leakage by signature — dispatcher.get_selection(...) requires (x_train, y_train) from the caller. The Hamilton DAG never queries the cache directly. Per-fold scope passes the fold's training window; global scope passes the preflight reference window. No code path can confuse train and test.
Shared-node deduplication — reference_model lives in shared.py. When fi_dag and shap_dag both consume it (same execute call), Hamilton materialises it exactly once. Adding a 5th model-based selector would inherit the same dedup for free.

3. Hamilton DAGs (visualizations)¶

Each (method, scope) combination produces a distinct DAG. The visualizations below are rendered from commun.finetune.feature_selection.dispatch.visualize() — they reflect the actual execution graph Hamilton resolves at run time.

variance_inline

3.2 covariance — inline (univariate target correlation)¶

covariance_inline

3.3 fi — global / perfold (XGBoost gain on shared reference_model)¶

fi_global and fi_perfold share the same DAG; only the input window differs at runtime (and therefore the cache key).

fi_global

3.4 shap — global / perfold (Shapley values on the same shared reference_model)¶

When fi_scores and shap_scores are requested in the same execute call, Hamilton sees one reference_model node and trains XGBoost exactly once. SHAP becomes a ~30 s extractor on the existing model rather than a 5 min retrain.

shap_global

4. Cache layout¶

cache/feature_selection/
├── fi/
│   ├── global/<SYMBOL>_<strategy>_<tf>.json
│   └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── shap/
│   ├── global/<SYMBOL>_<strategy>_<tf>.json
│   └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── variance/
│   └── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json
└── covariance/
    └── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json

Per-scope cache key semantics (refined post PR #686):

Scope	fold_id in path	Sharing	Mismatch behaviour
`global`	no	shared across all folds (sharing intent)	warning event `feature_selection_cache_window_mismatch`, scores still served
`perfold`	required	per-fold isolation	raise RuntimeError (programming bug — caller passed a different `(X, y)` for the same fold)
`inline`	when supplied	per-fold isolation when fold_id present, else single shared file	same as `perfold` — raise

Variants on the same (method, scope, fold_id) SHARE the cache file (e.g. fi_30_perfold and fi_150_perfold on fold 3 ⇒ one compute, two top-K reads). Top-K splitting happens AFTER the cache load.

The cache fingerprint stored in each payload is a SHA-256 of (index.min, index.max, shape, ordered column names) — including the column list (PR #686 CR pass 2) closes a false-negative where two X_train matrices with the same row range and shape but different feature columns would otherwise be served interchangeably.

A sample payload (truncated):

{
  "method": "fi",
  "scope": "perfold",
  "symbol": "UNIUSDC",
  "strategy": "ATR0.5_1.5_H4",
  "timeframe": "15m",
  "fold_id": 3,
  "name": "fi_score",
  "scores": {
    "rsi_14": 0.0432,
    "macd_signal_diff": 0.0418,
    "...": 0.0
  }
}

5. Sequence — per-fold compute (regime_trainer → dispatcher → Hamilton)¶

sequenceDiagram
    autonumber
    participant RT as regime_trainer
(fold N loop)
    participant FE as Autonomous_FE
    participant Dis as dispatcher.get_selection
    participant Cache as cache/feature_selection/
fi/perfold/...json
    participant Hamilton
    participant XGB as XGBoost
(reference_model)
    participant DB as finetune_results

    RT->>RT: set CVN_FOLD_ID=N,
CVN_FEATURE_SELECTION_SCOPE=perfold
    RT->>FE: invoke FE pipeline
    FE->>FE: stationarise + standardise
→ X_train_transformed, y_train
    FE->>Dis: get_selection(method=fi, scope=perfold, fold_id=N,
x_train=X_train_transformed, y_train)
    Dis->>Dis: registry.resolve(fi, perfold) → SelectorEntry
    Dis->>Cache: lookup fi/perfold/UNI_..._fold3.json
    alt cache hit
        Cache-->>Dis: scores Series
    else cache miss
        Dis->>Hamilton: execute([fi_scores], inputs={x_train_override, y_train_override, symbol})
        Hamilton->>Hamilton: resolve shared.x_train + shared.y_train + shared.sample_weights
        Hamilton->>XGB: train (depth=10, rounds=500, seed=42)
        XGB-->>Hamilton: Booster
        Hamilton->>Hamilton: fi_scores(reference_model, x_train) → pd.Series
        Hamilton-->>Dis: {fi_scores: Series}
        Dis->>Cache: save (atomic .tmp + replace)
    end
    Dis-->>FE: scores Series
    FE->>FE: select_top_k(scores, K=80, policy=fail)
    alt K ≤ non_zero
        FE->>FE: apply top-K cap to X_train/val/test
        FE-->>RT: cap applied (continues training)
    else K > non_zero
        FE->>FE: raise KAboveFloorError
        FE->>DB: error_payload → {code:k_above_floor, requested_k, available_features, ...}
    end

Critical points:

Step 1: regime_trainer exports CVN_FOLD_ID alongside the existing CVN_FOLD_* env vars. The FE component reads it via the dispatcher when scope=perfold.
Step 4: the dispatcher signature requires (x_train, y_train) — the FE component is the only place that can produce them with the right window, which keeps the anti-leakage contract auditable from a single call site.
Step 8 (XGBoost training): parameters (depth=10, rounds=500) are surfaced as Console knobs (CVN_FI_REFERENCE_DEPTH, CVN_FI_REFERENCE_ROUNDS). Bumped from 6/200 to push the non-zero feature floor higher (the KAboveFloorError ceiling). Bounds enforced at the edge: depth ∈ [1, 30], rounds ∈ [10, 5000] — out-of-range values raise ValueError from _parse_positive_int instead of crashing inside XGBoost (PR #686 guardrail).
Step 14 (KAboveFloorError): structured payload {code: "k_above_floor", requested_k, available_features, selector, scope, symbol} lands in finetune_results.error as JSON. The dashboard can filter by error::json->>'code' = 'k_above_floor' to distinguish K-overshoot from real training crashes. Three other structured codes share the same payload shape: selector_not_registered (unknown method), scope_not_supported (invalid combo), training_failed (legacy fallback for non-framework exceptions).
Score alignment (post-cache, pre-top-K): aligned = scores.reindex(X_train.columns) then aligned.loc[~aligned.index.isin(scores.index)] = 0.0 — features absent from the cache are de-prioritised to 0, while features present with NaN keep their NaN (selector-explicit "undefined") and are filtered by select_top_k's > 0 mask. PR #686 CR pass 5 fix: the previous .fillna(0.0) collapsed both cases.

6. Adding a new selector — checklist¶

Per ADR-67: drop-in surface is a single new file + a single registry entry. Demo with a hypothetical permutation importance:

Create src/commun/finetune/feature_selection/dags/permutation_dag.py:

def permutation_scores(reference_model, x_train, y_train, symbol) -> pd.Series:
    from sklearn.inspection import permutation_importance
    imp = permutation_importance(reference_model, x_train, y_train, n_repeats=5, random_state=42)
    scores = pd.Series(imp.importances_mean, index=x_train.columns).sort_values(ascending=False)
    _emit("feature_selection_node", node="permutation_scores", selector="permutation",
          symbol=symbol, n_features=int(scores.shape[0]), non_zero=int((scores > 0).sum()))
    return scores

Add to registry.py:

"permutation": SelectorEntry(
    dag_module=permutation_dag,
    output_node="permutation_scores",
    supports_scopes=("perfold",),  # too expensive globally
    description="Permutation importance — measures actual predictive contribution.",
),

Variants in ablation_matrix.py get permutation_50_perfold, permutation_100_perfold, etc.

That's it. No edits to dispatch.py, no edits to the FE pipeline, no edits to the cache layout, no edits to regime_trainer.py. Hamilton automatically wires the new node to reference_model (and the rest of the shared chain) on its parameter names.

Auto-wired side-effects (no extra work needed when adding a selector):

The FE pipeline's scope-fallback hook calls registry.default_scope_for_method(method) which returns supports_scopes[0]. So permutation declaring supports_scopes=("perfold",) makes perfold the default scope when the operator sets CVN_FEATURE_SELECTION_METHOD=permutation without an explicit scope.
The Console save-time validator (scripts/ftf_config_ui.py::_validate_feature_selection_combo) reads REGISTRY dynamically when the framework module is importable; otherwise it falls back to a hardcoded mirror that is sync-tested against the live REGISTRY (tests/unit/test_feature_selection_framework.py::test_console_validator_mirror_matches_registry). Adding a selector trips the test if the fallback isn't updated.
The structured error catalogue (KAboveFloorError, SelectorScopeError, UnknownSelectorError) is shared — any new selector inherits the same error_payload(exc) JSON shape automatically.

7. Console keys¶

All knobs live in ftf_config.base_env (Console UI). Defaults documented in config/ftf_baseline.json.

Key	Default	Effect
`CVN_PREFLIGHT_ENABLED`	`1`	Master kill-switch (existing)
`CVN_GLOBAL_PREFLIGHT`	`1`	Gate the preflight global compute
`CVN_PERFOLD_PREFLIGHT`	`1`	Gate the per-fold compute hook
`CVN_FI_REFERENCE_DEPTH`	`10`	XGBoost depth for fi/shap reference model (was 6)
`CVN_FI_REFERENCE_ROUNDS`	`500`	XGBoost rounds (was 200)
`CVN_FEATURE_SELECTION_METHOD`	(per variant)	`variance` / `covariance` / `fi` / `shap`
`CVN_FEATURE_SELECTION_SCOPE`	(per variant)	`inline` / `global` / `perfold`
`CVN_FEATURE_SELECTION_K_OVERSHOOT_POLICY`	`fail`	`fail` (ADR-25) / `truncate`
`CVN_FOLD_ID`	(set by regime_trainer)	Per-fold cache key for scope=perfold

8. Observability — OTel events catalogue¶

Every Hamilton node + every dispatcher transition emits a structured event via commun.observability.otel.emit_event (ADR-62). Event names and key attributes:

Event	Source	Key attributes
`feature_selection_cache_hit`	dispatcher	`method`, `scope`, `symbol`, `strategy`, `timeframe`, `fold_id`, `cache_path`
`feature_selection_compute_start`	dispatcher	+ `n_train_rows`, `n_features_in`
`feature_selection_compute_end`	dispatcher	+ `n_features_out`, `non_zero`, `elapsed_s`, `cache_path`
`feature_selection_compute_failed`	dispatcher	+ `error_payload` (JSON of the structured exception)
`feature_selection_cache_window_mismatch`	dispatcher	`method`, `scope=global`, `symbol`, `fold_id`, `cached_fingerprint`, `caller_fingerprint` — fires when a `global` cache hit is served to a caller whose training window doesn't match the cached one (sharing intent preserved, mismatch surfaced for audit). `perfold` / `inline` mismatches raise instead of emit.
`feature_selection_node`	each Hamilton node	`node` (= function name), `selector`, `symbol`, `elapsed_s`, plus node-specific attrs. Every node emits — including `n_classes`, `sample_weights`, `fi_reference_depth`, `fi_reference_rounds` (PR #686 CR pass 1 fix).
`feature_selection_truncated`	top_k helper	`requested_k`, `actual_k`, `selector`, `scope`, `symbol`

Grafana Loki query example to pull all per-fold FI compute durations for a run:

{namespace="cvntrade"}
| json
| event="feature_selection_compute_end"
| method="fi"
| scope="perfold"

9. Failure modes and mitigations¶

Failure	Detection	Mitigation
K > non_zero in cache	`KAboveFloorError` raised by `select_top_k`	Structured `code=k_above_floor` payload in DB; dashboard filter; operator either lowers K, deepens reference model, or switches variant policy to `truncate`
Selector / scope combo unsupported	`SelectorScopeError` at parse time (registry.resolve)	Variant rejected before any pod is provisioned. Editable from Console without redeploy.
Hamilton node raises (e.g. xgboost OOM)	Bubbles up; dispatcher emits `feature_selection_compute_failed` with `error_payload`	Standard `error_payload(exc)` fallback to `{code: training_failed, message}` so legacy dashboards keep working
OTel collector unreachable	`emit_event` swallows exception, logs at debug level	Observability never breaks compute — falls back to standard logger silently
Cache file corrupted	`_load_cache` raises JSONDecodeError	Operator deletes the file → next run recomputes
`fold_id` missing for scope=perfold	`cache_path` raises ValueError	Caught by dispatcher, becomes structured error in DB
Two variants on same fold race the cache write	`_save_cache` writes to `.tmp` then `.replace` (atomic)	No corrupted partial file; later writer wins (same data anyway)

10. Migration from legacy preflight¶

The legacy cache/feature_importance/<key>.json layout (written by FiReferenceStep, ADR-64) is untouched by this PR. Two coexistence paths:

Operator action: re-run preflight after the deploy → new framework writes to cache/feature_selection/fi/global/.... Both caches present until cleanup.
No action: the legacy load_fi_reference() reader stays callable. Variants using CVN_FEATURE_SELECTION_METHOD=fi without explicit scope default to global and read from the legacy path until the global cache gets repopulated through the new dispatcher.

Cleanup: planned as a follow-up issue once the framework has been validated in production for one full FTF cycle.

11. References¶

Binding ADR: ADR-67 — Pluggable Feature-Selection Framework
Issue: #684 — full design discussion
Implementation PR: #685
Source: src/commun/finetune/feature_selection/
Tests: tests/unit/test_feature_selection_framework.py (32 tests)
Related ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel), ADR-64 (preflight first-class — amended by ADR-67), ADR-65 (Console-driven config)
Triggering analysis: feature_importance run ftf_20260424_223437 showing fi_65/80 fail at 50% rate due to FI cache capacity ceiling