
Architecture — Pluggable Feature-Selection Framework (CVN-N001-ED extension)

Status: Implemented in PR #685, bound by ADR-67.
Version: 1.0
Date: 2026-04-25
ADRs in scope: ADR-25 (no silent fallback), ADR-56 (every change A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel observability), ADR-64 (preflight first-class, amended by ADR-67), ADR-65 (Console-driven config), ADR-67 (this framework)
Source: GitHub issue #684


1. Purpose

Replace the hardcoded variance / FI selection in cvntrade_autonomous_fe.py with a Hamilton-DAG-based framework that:

  1. Treats every selector (variance, covariance, FI, SHAP, and future ones such as permutation, mutual_info, ...) as a 1-file drop-in
  2. Lets the FTF compare 4 ranking paradigms × 2 compute scopes in a single ablation run (variance / covariance / fi / shap × global / perfold)
  3. Eliminates the stale-FI defect: per-fold scope mirrors production retrain cadence (operator weekly retrain ⇒ fresh FI on the latest training window)
  4. Makes K-overshoot a structured failure mode (code=k_above_floor JSON in finetune_results.error) instead of an opaque RuntimeError("training_failed")

2. Component view

The framework is internal to the FTF — it slots between regime_trainer (which knows the fold) and cvntrade_autonomous_fe (which knows the FE-transformed (X, y)). All compute is Hamilton per ADR-61; the dispatcher is orchestration only.

flowchart LR
    operator([Operator])
    console[Console<br/>ftf_config.base_env]
    airflow[Airflow<br/>finetune__pte DAG]
    regime[regime_trainer<br/>fold loop]
    fe[CVNTrade_Autonomous<br/>FeatureEngineering]
    dispatcher[feature_selection<br/>dispatch.get_selection]
    registry[REGISTRY<br/>method → DAG]
    subgraph dags[Hamilton DAGs]
        shared[shared.py<br/>x_train / y_train /<br/>reference_model]
        fi[fi_dag.py<br/>fi_scores]
        shap[shap_dag.py<br/>shap_scores]
        var[variance_dag.py<br/>variance_scores]
        cov[covariance_dag.py<br/>covariance_scores]
    end
    cache[(cache/feature_selection/<br/>method/scope/...)]
    pg[(PostgreSQL<br/>finetune_results)]
    grafana[Grafana<br/>OTel events]
    operator -- "set CVN_FEATURE_SELECTION_*" --> console
    console -- "base_env" --> airflow
    airflow -- "trigger" --> regime
    regime -- "for each fold:<br/>set CVN_FOLD_ID,<br/>invoke FE" --> fe
    fe -- "method, scope, fold_id,<br/>X_train, y_train" --> dispatcher
    dispatcher -- "lookup" --> registry
    dispatcher -- "cache hit?" --> cache
    dispatcher -- "build Driver(shared + selector_dag)" --> dags
    shared -- "reference_model<br/>(shared by fi + shap)" --> fi
    shared --> shap
    dags -- "scores: pd.Series" --> dispatcher
    dispatcher -- "save" --> cache
    dispatcher -- "OTel events" --> grafana
    fe -- "top-K applied,<br/>error_payload(KAboveFloorError)<br/>on overshoot" --> pg

Two key invariants visible in the diagram:

  • Anti-leakage by signature: dispatcher.get_selection(...) requires (x_train, y_train) from the caller. The Hamilton DAG never queries the cache directly. Per-fold scope passes the fold's training window; global scope passes the preflight reference window. No code path can confuse train and test.
  • Shared-node deduplication: reference_model lives in shared.py. When fi_dag and shap_dag both consume it (same execute call), Hamilton materialises it exactly once. Adding a 5th model-based selector would inherit the same dedup for free.
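The anti-leakage invariant can be illustrated with keyword-only parameters. This is a minimal sketch, not the real dispatcher: the function body, the return value, and the toy data are assumptions made for illustration; only the requirement that the caller supplies (x_train, y_train) comes from the text above.

```python
from typing import List, Optional, Sequence


def get_selection(
    *,
    method: str,
    scope: str,
    x_train: Sequence[Sequence[float]],
    y_train: Sequence[int],
    fold_id: Optional[int] = None,
) -> dict:
    """Hypothetical stand-in for dispatch.get_selection.

    x_train / y_train have no default, so a caller cannot obtain scores
    without stating which training window they belong to.
    """
    if len(x_train) != len(y_train):
        raise ValueError("x_train and y_train must have the same number of rows")
    return {"method": method, "scope": scope, "fold_id": fold_id, "n_rows": len(x_train)}


# The FE component passes the window it has just transformed.
toy_x: List[List[float]] = [[0.1, 0.2], [0.3, 0.4]]
request = get_selection(method="fi", scope="perfold", fold_id=3, x_train=toy_x, y_train=[0, 1])

# Omitting the window is rejected by Python itself, before any compute starts.
try:
    get_selection(method="fi", scope="perfold", fold_id=3)
    missing_window_rejected = False
except TypeError:
    missing_window_rejected = True
```

Because the check lives in the signature, a reviewer only has to audit the single call site in the FE component to see which window reaches the selector.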

3. Hamilton DAGs (visualizations)

Each (method, scope) combination produces a distinct DAG. The visualizations below are rendered from commun.finetune.feature_selection.dispatch.visualize() — they reflect the actual execution graph Hamilton resolves at run time.

3.1 variance — inline (target-blind, no model fit)

[Figure: variance_inline DAG]

3.2 covariance — inline (univariate target correlation)

[Figure: covariance_inline DAG]

3.3 fi — global / perfold (XGBoost gain on shared reference_model)

fi_global and fi_perfold share the same DAG; only the input window differs at runtime (and therefore the cache key).

[Figure: fi_global DAG]

3.4 shap — global / perfold (Shapley values on the same shared reference_model)

When fi_scores and shap_scores are requested in the same execute call, Hamilton sees one reference_model node and trains XGBoost exactly once. SHAP becomes a ~30 s extractor on the existing model rather than a 5 min retrain.

[Figure: shap_global DAG]
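The "train once, extract twice" behaviour follows from resolving nodes by parameter name and memoising each result within one execute call. The sketch below is a stdlib stand-in for that idea, not Hamilton itself: the `execute` helper, the `TRAIN_CALLS` counter, and the dict-based "model" are hypothetical and exist only to make the deduplication observable.

```python
import inspect
from typing import Any, Callable, Dict, List

TRAIN_CALLS: List[str] = []  # records every (pretend) model fit


def reference_model(x_train: list, y_train: list) -> dict:
    TRAIN_CALLS.append("fit")  # stands in for the expensive XGBoost training
    return {"n_rows": len(x_train)}


def fi_scores(reference_model: dict, x_train: list) -> dict:
    return {"source": "fi", "n_rows": reference_model["n_rows"]}


def shap_scores(reference_model: dict, x_train: list) -> dict:
    return {"source": "shap", "n_rows": reference_model["n_rows"]}


NODES: Dict[str, Callable[..., Any]] = {
    fn.__name__: fn for fn in (reference_model, fi_scores, shap_scores)
}


def execute(outputs: List[str], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve each requested node by matching parameter names to node names."""
    memo: Dict[str, Any] = dict(inputs)

    def resolve(name: str) -> Any:
        if name not in memo:  # each node is computed at most once per call
            fn = NODES[name]
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            memo[name] = fn(**kwargs)
        return memo[name]

    return {name: resolve(name) for name in outputs}


result = execute(["fi_scores", "shap_scores"], {"x_train": [[1.0], [2.0]], "y_train": [0, 1]})
```

Requesting both outputs in one call leaves a single entry in `TRAIN_CALLS`; requesting them in two separate calls would fit twice, which is why the text stresses "the same execute call".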

4. Cache layout

cache/feature_selection/
├── fi/
│   ├── global/<SYMBOL>_<strategy>_<tf>.json
│   └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── shap/
│   ├── global/<SYMBOL>_<strategy>_<tf>.json
│   └── perfold/<SYMBOL>_<strategy>_<tf>_fold<N>.json
├── variance/
│   └── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json
└── covariance/
    └── inline/<SYMBOL>_<strategy>_<tf>[_fold<N>].json

Per-scope cache key semantics (refined post PR #686):

| Scope | fold_id in path | Sharing | Mismatch behaviour |
|---|---|---|---|
| global | no | shared across all folds (sharing intent) | warning event feature_selection_cache_window_mismatch, scores still served |
| perfold | required | per-fold isolation | raise RuntimeError (programming bug: caller passed a different (X, y) for the same fold) |
| inline | when supplied | per-fold isolation when fold_id present, else single shared file | same as perfold: raise |

Variants on the same (method, scope, fold_id) SHARE the cache file (e.g. fi_30_perfold and fi_150_perfold on fold 3 ⇒ one compute, two top-K reads). Top-K splitting happens AFTER the cache load.
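The layout and key semantics above can be sketched as a small path builder. This is an illustration under stated assumptions: the function name mirrors the `cache_path` mentioned in section 9, but its signature and the `CACHE_ROOT` constant are hypothetical.

```python
from pathlib import Path
from typing import Optional

CACHE_ROOT = Path("cache/feature_selection")  # assumed root, taken from the tree above


def cache_path(
    method: str,
    scope: str,
    symbol: str,
    strategy: str,
    timeframe: str,
    fold_id: Optional[int] = None,
) -> Path:
    stem = f"{symbol}_{strategy}_{timeframe}"
    if scope == "perfold":
        if fold_id is None:
            raise ValueError("scope=perfold requires fold_id")
        stem += f"_fold{fold_id}"
    elif scope == "inline" and fold_id is not None:
        stem += f"_fold{fold_id}"
    # scope == "global": fold_id is ignored on purpose so every fold shares one file.
    return CACHE_ROOT / method / scope / f"{stem}.json"


perfold_file = cache_path("fi", "perfold", "UNIUSDC", "ATR0.5_1.5_H4", "15m", fold_id=3)
global_file = cache_path("fi", "global", "UNIUSDC", "ATR0.5_1.5_H4", "15m", fold_id=3)
```

K is not part of the key, so variants such as fi_30_perfold and fi_150_perfold resolve to the same file on a given fold and differ only in the top-K read that follows the load.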

The cache fingerprint stored in each payload is a SHA-256 of (index.min, index.max, shape, ordered column names). The column list was added in PR #686 (CR pass 2) to close a false negative in mismatch detection: two X_train matrices with the same row range and shape but different feature columns would otherwise be served interchangeably.
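A fingerprint of this kind might be computed as follows. The serialisation (JSON with sorted keys) and the function name are assumptions; the source only states which four ingredients go into the SHA-256.

```python
import hashlib
import json
from typing import Sequence, Tuple


def window_fingerprint(
    index_min: str,
    index_max: str,
    shape: Tuple[int, int],
    columns: Sequence[str],
) -> str:
    """Hash the training-window identity; column order is preserved on purpose."""
    payload = json.dumps(
        {
            "index_min": index_min,
            "index_max": index_max,
            "shape": list(shape),
            "columns": list(columns),  # sort_keys only sorts dict keys, not this list
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same row range and shape, different feature columns (hypothetical feature names).
fp_a = window_fingerprint("2026-01-01", "2026-03-31", (1000, 2), ["rsi_14", "macd_signal_diff"])
fp_b = window_fingerprint("2026-01-01", "2026-03-31", (1000, 2), ["rsi_14", "atr_14"])
```

Without the column list, `fp_a` and `fp_b` would hash identical inputs and the mismatch would go undetected.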

A sample payload (truncated):

{
  "method": "fi",
  "scope": "perfold",
  "symbol": "UNIUSDC",
  "strategy": "ATR0.5_1.5_H4",
  "timeframe": "15m",
  "fold_id": 3,
  "name": "fi_score",
  "scores": {
    "rsi_14": 0.0432,
    "macd_signal_diff": 0.0418,
    "...": 0.0
  }
}
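Section 9 notes that payloads like this one are written to a .tmp file and then moved into place. A minimal sketch of that pattern, assuming hypothetical `save_cache` / `load_cache` helpers (the real `_save_cache` and `_load_cache` may differ):

```python
import json
import os
import tempfile
from pathlib import Path


def save_cache(path: Path, payload: dict) -> None:
    """Write to a sibling .tmp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # os.replace is atomic when source and target are on the same filesystem,
    # so readers see either the old file or the complete new one.
    os.replace(tmp, path)


def load_cache(path: Path) -> dict:
    # A corrupted file surfaces as json.JSONDecodeError, as listed in section 9.
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "fi" / "perfold" / "UNIUSDC_ATR0.5_1.5_H4_15m_fold3.json"
    save_cache(target, {"method": "fi", "scope": "perfold", "fold_id": 3,
                        "scores": {"rsi_14": 0.0432}})
    roundtrip = load_cache(target)
    leftovers = sorted(p.name for p in target.parent.iterdir())
```

After the swap only the final .json remains in the directory; no partial file is ever visible under the target name.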

5. Sequence — per-fold compute (regime_trainer → dispatcher → Hamilton)

sequenceDiagram
    autonumber
    participant RT as regime_trainer<br/>(fold N loop)
    participant FE as Autonomous_FE
    participant Dis as dispatcher.get_selection
    participant Cache as cache/feature_selection/<br/>fi/perfold/...json
    participant Hamilton
    participant XGB as XGBoost<br/>(reference_model)
    participant DB as finetune_results
    RT->>RT: set CVN_FOLD_ID=N,<br/>CVN_FEATURE_SELECTION_SCOPE=perfold
    RT->>FE: invoke FE pipeline
    FE->>FE: stationarise + standardise<br/>→ X_train_transformed, y_train
    FE->>Dis: get_selection(method=fi, scope=perfold, fold_id=N,<br/>x_train=X_train_transformed, y_train)
    Dis->>Dis: registry.resolve(fi, perfold) → SelectorEntry
    Dis->>Cache: lookup fi/perfold/UNI_..._fold3.json
    alt cache hit
        Cache-->>Dis: scores Series
    else cache miss
        Dis->>Hamilton: execute([fi_scores], inputs={x_train_override, y_train_override, symbol})
        Hamilton->>Hamilton: resolve shared.x_train + shared.y_train + shared.sample_weights
        Hamilton->>XGB: train (depth=10, rounds=500, seed=42)
        XGB-->>Hamilton: Booster
        Hamilton->>Hamilton: fi_scores(reference_model, x_train) → pd.Series
        Hamilton-->>Dis: {fi_scores: Series}
        Dis->>Cache: save (atomic .tmp + replace)
    end
    Dis-->>FE: scores Series
    FE->>FE: select_top_k(scores, K=80, policy=fail)
    alt K ≤ non_zero
        FE->>FE: apply top-K cap to X_train/val/test
        FE-->>RT: cap applied (continues training)
    else K > non_zero
        FE->>FE: raise KAboveFloorError
        FE->>DB: error_payload → {code:k_above_floor, requested_k, available_features, ...}
    end

Critical points:

  • Step 1: regime_trainer exports CVN_FOLD_ID alongside the existing CVN_FOLD_* env vars. The FE component reads it via the dispatcher when scope=perfold.
  • Step 4: the dispatcher signature requires (x_train, y_train) — the FE component is the only place that can produce them with the right window, which keeps the anti-leakage contract auditable from a single call site.
  • Step 10 (XGBoost training): parameters (depth=10, rounds=500) are surfaced as Console knobs (CVN_FI_REFERENCE_DEPTH, CVN_FI_REFERENCE_ROUNDS). Bumped from 6/200 to push the non-zero feature floor higher (the KAboveFloorError ceiling). Bounds enforced at the edge: depth ∈ [1, 30], rounds ∈ [10, 5000]. Out-of-range values raise ValueError from _parse_positive_int instead of crashing inside XGBoost (PR #686 guardrail).
  • Step 19 (KAboveFloorError): structured payload {code: "k_above_floor", requested_k, available_features, selector, scope, symbol} lands in finetune_results.error as JSON. The dashboard can filter by error::json->>'code' = 'k_above_floor' to distinguish K-overshoot from real training crashes. Three other structured codes share the same payload shape: selector_not_registered (unknown method), scope_not_supported (invalid combo), training_failed (legacy fallback for non-framework exceptions).
  • Score alignment (post-cache, pre-top-K): aligned = scores.reindex(X_train.columns) then aligned.loc[~aligned.index.isin(scores.index)] = 0.0 — features absent from the cache are de-prioritised to 0, while features present with NaN keep their NaN (selector-explicit "undefined") and are filtered by select_top_k's > 0 mask. PR #686 CR pass 5 fix: the previous .fillna(0.0) collapsed both cases.
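The alignment rule and the top-K policy from the critical points can be sketched together. This assumes pandas and NumPy are available; `align_scores`, the simplified `KAboveFloorError` constructor, and the feature names `flat_feature` / `new_feature` are hypothetical, while the reindex-then-zero-fill logic and the `> 0` mask follow the description above.

```python
import numpy as np
import pandas as pd


class KAboveFloorError(RuntimeError):
    """Simplified stand-in: requested K exceeds the number of usable features."""

    def __init__(self, requested_k: int, available_features: int):
        super().__init__(f"K={requested_k} > {available_features} usable features")
        self.requested_k = requested_k
        self.available_features = available_features


def align_scores(scores: pd.Series, columns) -> pd.Series:
    aligned = scores.reindex(columns)
    # Only features absent from the cache become 0.0; cached NaN stays NaN.
    aligned.loc[~aligned.index.isin(scores.index)] = 0.0
    return aligned


def select_top_k(scores: pd.Series, k: int, policy: str = "fail") -> list:
    # NaN > 0 is False, so both NaN and zero scores are dropped by the mask.
    usable = scores[scores > 0].sort_values(ascending=False)
    if k > len(usable):
        if policy == "fail":
            raise KAboveFloorError(k, len(usable))
        k = len(usable)  # policy == "truncate"
    return list(usable.index[:k])


cached = pd.Series({"rsi_14": 0.0432, "macd_signal_diff": 0.0418, "flat_feature": np.nan})
aligned = align_scores(cached, ["rsi_14", "macd_signal_diff", "flat_feature", "new_feature"])
top = select_top_k(aligned, k=2)
truncated = select_top_k(aligned, k=5, policy="truncate")
```

Keeping NaN distinct from 0.0 preserves the difference between "the selector could not score this feature" and "the cache has never seen this feature", even though both are excluded from the top-K.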

6. Adding a new selector — checklist

Per ADR-67: drop-in surface is a single new file + a single registry entry. Demo with a hypothetical permutation importance:

  1. Create src/commun/finetune/feature_selection/dags/permutation_dag.py:
import pandas as pd

def permutation_scores(reference_model, x_train, y_train, symbol) -> pd.Series:
    # Imported lazily so scikit-learn is only loaded when this selector runs.
    from sklearn.inspection import permutation_importance
    # reference_model must expose the scikit-learn estimator API for this call.
    imp = permutation_importance(reference_model, x_train, y_train, n_repeats=5, random_state=42)
    # Mean importance per feature, highest first.
    scores = pd.Series(imp.importances_mean, index=x_train.columns).sort_values(ascending=False)
    # Emit the per-node OTel event (feature_selection_node, see section 8).
    _emit("feature_selection_node", node="permutation_scores", selector="permutation",
          symbol=symbol, n_features=int(scores.shape[0]), non_zero=int((scores > 0).sum()))
    return scores
  2. Add to registry.py:
"permutation": SelectorEntry(
    dag_module=permutation_dag,
    output_node="permutation_scores",
    supports_scopes=("perfold",),  # too expensive globally
    description="Permutation importance — measures actual predictive contribution.",
),
  3. Variants in ablation_matrix.py get permutation_50_perfold, permutation_100_perfold, etc.

That's it. No edits to dispatch.py, no edits to the FE pipeline, no edits to the cache layout, no edits to regime_trainer.py. Hamilton automatically wires the new node to reference_model (and the rest of the shared chain) on its parameter names.

Auto-wired side-effects (no extra work needed when adding a selector):

  • The FE pipeline's scope-fallback hook calls registry.default_scope_for_method(method) which returns supports_scopes[0]. So permutation declaring supports_scopes=("perfold",) makes perfold the default scope when the operator sets CVN_FEATURE_SELECTION_METHOD=permutation without an explicit scope.
  • The Console save-time validator (scripts/ftf_config_ui.py::_validate_feature_selection_combo) reads REGISTRY dynamically when the framework module is importable; otherwise it falls back to a hardcoded mirror that is sync-tested against the live REGISTRY (tests/unit/test_feature_selection_framework.py::test_console_validator_mirror_matches_registry). Adding a selector trips the test if the fallback isn't updated.
  • The structured error catalogue (KAboveFloorError, SelectorScopeError, UnknownSelectorError) is shared — any new selector inherits the same error_payload(exc) JSON shape automatically.
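The registry behaviour described in this section (scope validation at resolve time, first supported scope as the default) can be sketched as below. This is a simplified model: `SelectorEntry` omits `dag_module`, the three entries and their scope tuples are illustrative, and the exception base classes are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class UnknownSelectorError(LookupError):
    """Method name is not present in the registry."""


class SelectorScopeError(ValueError):
    """Method exists but does not support the requested scope."""


@dataclass(frozen=True)
class SelectorEntry:
    output_node: str
    supports_scopes: Tuple[str, ...]  # first element doubles as the default scope
    description: str = ""


REGISTRY: Dict[str, SelectorEntry] = {
    "variance": SelectorEntry("variance_scores", ("inline",)),
    "fi": SelectorEntry("fi_scores", ("global", "perfold")),
    "permutation": SelectorEntry("permutation_scores", ("perfold",)),
}


def _entry(method: str) -> SelectorEntry:
    try:
        return REGISTRY[method]
    except KeyError:
        raise UnknownSelectorError(f"selector {method!r} is not registered") from None


def default_scope_for_method(method: str) -> str:
    return _entry(method).supports_scopes[0]


def resolve(method: str, scope: str) -> SelectorEntry:
    entry = _entry(method)
    if scope not in entry.supports_scopes:
        raise SelectorScopeError(f"{method!r} supports {entry.supports_scopes}, got {scope!r}")
    return entry
```

For example, `default_scope_for_method("permutation")` yields `"perfold"`, while `resolve("permutation", "global")` is rejected before any compute is scheduled.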

7. Console keys

All knobs live in ftf_config.base_env (Console UI). Defaults documented in config/ftf_baseline.json.

| Key | Default | Effect |
|---|---|---|
| CVN_PREFLIGHT_ENABLED | 1 | Master kill-switch (existing) |
| CVN_GLOBAL_PREFLIGHT | 1 | Gate the preflight global compute |
| CVN_PERFOLD_PREFLIGHT | 1 | Gate the per-fold compute hook |
| CVN_FI_REFERENCE_DEPTH | 10 | XGBoost depth for fi/shap reference model (was 6) |
| CVN_FI_REFERENCE_ROUNDS | 500 | XGBoost rounds (was 200) |
| CVN_FEATURE_SELECTION_METHOD | (per variant) | variance / covariance / fi / shap |
| CVN_FEATURE_SELECTION_SCOPE | (per variant) | inline / global / perfold |
| CVN_FEATURE_SELECTION_K_OVERSHOOT_POLICY | fail | fail (ADR-25) / truncate |
| CVN_FOLD_ID | (set by regime_trainer) | Per-fold cache key for scope=perfold |
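The reference-model knobs in this table are bounded at the edge (depth in [1, 30], rounds in [10, 5000], per section 5). A sketch of such a guardrail, assuming a hypothetical `parse_bounded_int` helper rather than the actual `_parse_positive_int`:

```python
from typing import Mapping


def parse_bounded_int(env: Mapping[str, str], key: str, default: int, lo: int, hi: int) -> int:
    """Read an integer knob and reject values outside [lo, hi] before training starts."""
    raw = env.get(key, str(default))
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{key}={raw!r} is not an integer") from exc
    if not lo <= value <= hi:
        raise ValueError(f"{key}={value} is outside [{lo}, {hi}]")
    return value


# A plain dict stands in for the Console-provided environment.
base_env = {"CVN_FI_REFERENCE_DEPTH": "10"}
depth = parse_bounded_int(base_env, "CVN_FI_REFERENCE_DEPTH", default=10, lo=1, hi=30)
rounds = parse_bounded_int(base_env, "CVN_FI_REFERENCE_ROUNDS", default=500, lo=10, hi=5000)
```

Failing here gives the operator a message naming the offending key, instead of an XGBoost stack trace from deep inside the training pod.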

8. Observability — OTel events catalogue

Every Hamilton node + every dispatcher transition emits a structured event via commun.observability.otel.emit_event (ADR-62). Event names and key attributes:

| Event | Source | Key attributes |
|---|---|---|
| feature_selection_cache_hit | dispatcher | method, scope, symbol, strategy, timeframe, fold_id, cache_path |
| feature_selection_compute_start | dispatcher | + n_train_rows, n_features_in |
| feature_selection_compute_end | dispatcher | + n_features_out, non_zero, elapsed_s, cache_path |
| feature_selection_compute_failed | dispatcher | + error_payload (JSON of the structured exception) |
| feature_selection_cache_window_mismatch | dispatcher | method, scope=global, symbol, fold_id, cached_fingerprint, caller_fingerprint. Fires when a global cache hit is served to a caller whose training window doesn't match the cached one (sharing intent preserved, mismatch surfaced for audit). perfold / inline mismatches raise instead of emit. |
| feature_selection_node | each Hamilton node | node (= function name), selector, symbol, elapsed_s, plus node-specific attrs. Every node emits, including n_classes, sample_weights, fi_reference_depth, fi_reference_rounds (PR #686 CR pass 1 fix). |
| feature_selection_truncated | top_k helper | requested_k, actual_k, selector, scope, symbol |

Grafana Loki query example to pull all per-fold FI compute durations for a run:

{namespace="cvntrade"}
| json
| event="feature_selection_compute_end"
| method="fi"
| scope="perfold"
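Section 9 states that event emission must never break compute when the collector is unreachable. One way to sketch that contract, assuming a hypothetical `make_safe_emitter` wrapper around an arbitrary transport function (this is not the actual commun.observability.otel.emit_event):

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger("feature_selection")


def make_safe_emitter(send: Callable[[str, Dict[str, Any]], None]) -> Callable[..., None]:
    """Wrap a transport so that emission failures are logged, never raised."""

    def emit(event: str, **attrs: Any) -> None:
        try:
            send(event, attrs)
        except Exception:
            # Observability is best-effort: record at debug level and carry on.
            logger.debug("event %s dropped", event, exc_info=True)

    return emit


# Healthy transport: events are collected in a list for inspection.
sent: List[Tuple[str, Dict[str, Any]]] = []
emit_ok = make_safe_emitter(lambda event, attrs: sent.append((event, attrs)))
emit_ok("feature_selection_cache_hit", method="fi", scope="perfold", fold_id=3)


def unreachable_collector(event: str, attrs: Dict[str, Any]) -> None:
    raise ConnectionError("collector unreachable")


# Broken transport: the call returns normally and compute would continue.
emit_down = make_safe_emitter(unreachable_collector)
emit_down("feature_selection_compute_end", method="fi", scope="perfold", elapsed_s=1.2)
```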

9. Failure modes and mitigations

| Failure | Detection | Mitigation |
|---|---|---|
| K > non_zero in cache | KAboveFloorError raised by select_top_k | Structured code=k_above_floor payload in DB; dashboard filter; operator either lowers K, deepens reference model, or switches variant policy to truncate |
| Selector / scope combo unsupported | SelectorScopeError at parse time (registry.resolve) | Variant rejected before any pod is provisioned. Editable from Console without redeploy. |
| Hamilton node raises (e.g. xgboost OOM) | Bubbles up; dispatcher emits feature_selection_compute_failed with error_payload | Standard error_payload(exc) fallback to {code: training_failed, message} so legacy dashboards keep working |
| OTel collector unreachable | emit_event swallows exception, logs at debug level | Observability never breaks compute; falls back to standard logger silently |
| Cache file corrupted | _load_cache raises JSONDecodeError | Operator deletes the file → next run recomputes |
| fold_id missing for scope=perfold | cache_path raises ValueError | Caught by dispatcher, becomes structured error in DB |
| Two variants on same fold race the cache write | _save_cache writes to .tmp then .replace (atomic) | No corrupted partial file; later writer wins (same data anyway) |
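The structured-versus-legacy split in the table can be sketched as a single `error_payload` function. The class hierarchy, the `details` attribute, and the sample numbers (K=80 against 62 usable features) are illustrative assumptions; the codes and field names follow section 5.

```python
import json


class StructuredSelectionError(RuntimeError):
    """Hypothetical base class: framework errors carry a code and JSON-safe details."""

    code = "training_failed"

    def __init__(self, message: str, **details):
        super().__init__(message)
        self.details = details


class KAboveFloorError(StructuredSelectionError):
    code = "k_above_floor"


class SelectorScopeError(StructuredSelectionError):
    code = "scope_not_supported"


class UnknownSelectorError(StructuredSelectionError):
    code = "selector_not_registered"


def error_payload(exc: BaseException) -> str:
    """Serialise any exception into the JSON stored in finetune_results.error."""
    if isinstance(exc, StructuredSelectionError):
        return json.dumps({"code": exc.code, **exc.details})
    # Non-framework exceptions keep the legacy shape so old dashboards still work.
    return json.dumps({"code": "training_failed", "message": str(exc)})


overshoot = json.loads(error_payload(KAboveFloorError(
    "requested K exceeds usable features",
    requested_k=80, available_features=62,
    selector="fi", scope="perfold", symbol="UNIUSDC",
)))
legacy = json.loads(error_payload(MemoryError("xgboost OOM")))
```

A dashboard can then separate the two cases with the filter shown in section 5, since only framework errors carry a code other than training_failed.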

10. Migration from legacy preflight

The legacy cache/feature_importance/<key>.json layout (written by FiReferenceStep, ADR-64) is untouched by this PR. Two coexistence paths:

  • Operator action: re-run preflight after the deploy → new framework writes to cache/feature_selection/fi/global/.... Both caches present until cleanup.
  • No action: the legacy load_fi_reference() reader stays callable. Variants using CVN_FEATURE_SELECTION_METHOD=fi without explicit scope default to global and read from the legacy path until the global cache gets repopulated through the new dispatcher.

Cleanup: planned as a follow-up issue once the framework has been validated in production for one full FTF cycle.

11. References

  • Binding ADR: ADR-67 — Pluggable Feature-Selection Framework
  • Issue: #684 — full design discussion
  • Implementation PR: #685
  • Source: src/commun/finetune/feature_selection/
  • Tests: tests/unit/test_feature_selection_framework.py (32 tests)
  • Related ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable), ADR-61 (Hamilton for batch DAGs), ADR-62 (OTel), ADR-64 (preflight first-class — amended by ADR-67), ADR-65 (Console-driven config)
  • Triggering analysis: feature_importance run ftf_20260424_223437 showing fi_65/80 fail at 50% rate due to FI cache capacity ceiling