SHAP × {global, perfold})

Status: active
Date: 2026-04-25
Introduced by: CVN-N001-ED extension / issue #684
Supersedes: amends ADR-64 (preflight is no longer the unique site of FI compute)

Context

Two structural defects surfaced in the 2026-04-24/25 feature_importance FTF run series:

Single-shot, fold-0-only FI cache. The legacy FiReferenceStep (ADR-64) trains XGBoost on fold 0, persists scores to a single global cache, then reuses that ranking for every training fold. In production the operator retrains on the latest 9 months and recomputes feature importance at every release, so the FTF was measuring something different from production behaviour. Operator quote (translated from French): "it is not possible for variance_100 to be better than an FI". It IS possible if today's FI is artificially stale.
Hardcoded ranking method. Adding a new selector (SHAP, covariance, mutual information, ...) required editing cvntrade_autonomous_fe.py (an if/elif/else block) plus the preflight step. There was no extension surface.

Two adjacent failure modes amplified the issues:

The FE pipeline's if method == 'fi' branch raised an opaque RuntimeError("FI cache has N non-zero features ... need K") — the dashboard could not distinguish "K-too-high for this dataset" from "real training crash".
The reference XGBoost model used depth=6, rounds=200, producing only ~50 features with non-zero importance per crypto. Variants requesting K > 50 always failed regardless of data.

Decision

Introduce a pluggable feature-selection framework under commun/finetune/feature_selection/, organised as a Hamilton DAG registry per ADR-61.

Components

commun/finetune/feature_selection/
├── dags/
│   ├── shared.py            ← x_train, y_train, sample_weights, reference_model
│   ├── variance_dag.py      ← variance_scores
│   ├── covariance_dag.py    ← covariance_scores
│   ├── fi_dag.py            ← fi_scores (consumes shared.reference_model)
│   └── shap_dag.py          ← shap_scores (consumes shared.reference_model)
├── registry.py              ← REGISTRY: name → (dag_module, output_node, supports_scopes)
├── dispatch.py              ← get_selection() entry point + cache + viz
├── top_k.py                 ← select_top_k(scores, K, policy={fail, truncate})
└── errors.py                ← KAboveFloorError, SelectorScopeError, UnknownSelectorError, error_payload

Selectors shipped

| Name | Scope(s) | Compute |
|---|---|---|
| `variance` | `inline` | `var(X[:, i])` post-stationarisation. Cheap. Target-blind. |
| `covariance` | `inline` | `\|cov(X[:, i], y)\|` univariate target correlation. |
| `fi` | `global`, `perfold` | XGBoost gain on the shared `reference_model`. |
| `shap` | `global`, `perfold` | `mean(\|SHAP value\|)` from `shap.TreeExplainer` on the same `reference_model`. ~30s if `fi` already trained the model in the same execute call. |
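The registry shape listed under Components (name → dag_module, output_node, supports_scopes) and the scopes in the table above could be sketched roughly as follows. This is a minimal illustration, not the shipped registry.py: the `SelectorSpec` dataclass, the `resolve` helper, the dotted module strings, and the ordering of `supports_scopes` are assumptions; only `REGISTRY`, `default_scope_for_method`, the node names, and the error class names come from this ADR.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class UnknownSelectorError(KeyError):
    """Selector name is not a registry key."""


class SelectorScopeError(ValueError):
    """Selector was requested under a scope it does not declare."""


@dataclass(frozen=True)
class SelectorSpec:  # hypothetical container for one registry entry
    dag_module: str                   # dotted path of the Hamilton module
    output_node: str                  # node that yields the scores
    supports_scopes: Tuple[str, ...]  # first entry acts as the default scope


_DAGS = "commun.finetune.feature_selection.dags"

# Scope ordering is assumed to follow the "Selectors shipped" table.
REGISTRY: Dict[str, SelectorSpec] = {
    "variance": SelectorSpec(f"{_DAGS}.variance_dag", "variance_scores", ("inline",)),
    "covariance": SelectorSpec(f"{_DAGS}.covariance_dag", "covariance_scores", ("inline",)),
    "fi": SelectorSpec(f"{_DAGS}.fi_dag", "fi_scores", ("global", "perfold")),
    "shap": SelectorSpec(f"{_DAGS}.shap_dag", "shap_scores", ("global", "perfold")),
}


def resolve(method: str, scope: str) -> SelectorSpec:
    """Look up a selector and check that it supports the requested scope."""
    try:
        spec = REGISTRY[method]
    except KeyError:
        raise UnknownSelectorError(method) from None
    if scope not in spec.supports_scopes:
        raise SelectorScopeError(
            f"{method!r} supports {spec.supports_scopes}, got {scope!r}"
        )
    return spec


def default_scope_for_method(method: str) -> str:
    """Default scope is the first declared scope of the selector."""
    if method not in REGISTRY:
        raise UnknownSelectorError(method)
    return REGISTRY[method].supports_scopes[0]
```

Under these assumptions, adding a selector amounts to one new `SelectorSpec` entry, and both the scope check and the default scope follow from it without further edits.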

Scopes

inline — computed in the FE pipeline at training time (no preflight cache).
global — preflight one-shot, one cache per (symbol, strategy, timeframe). Cheap to read across folds, but stale (= legacy semantics).
perfold — recomputed at the start of each fold's training, on that fold's training window. Mirrors production retrain cadence; one cache per (symbol, strategy, timeframe, fold_id).
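How the three scopes map to cache files can be illustrated with a small path builder. The path layout is the one given under Invariants (`cache/feature_selection/<method>/<scope>/<SYMBOL>_<strategy>_<tf>[_fold<N>].json`); the function name `cache_path`, its signature, and the example symbol and strategy values are hypothetical.

```python
from pathlib import Path
from typing import Optional

CACHE_ROOT = Path("cache/feature_selection")


def cache_path(
    method: str,
    scope: str,
    symbol: str,
    strategy: str,
    tf: str,
    fold_id: Optional[int] = None,
) -> Path:
    """Build the cache file path for one (method, scope, symbol, strategy, tf[, fold])."""
    if scope == "perfold" and fold_id is None:
        # perfold keys are only meaningful with a fold identifier
        raise ValueError("perfold scope requires fold_id")
    stem = f"{symbol}_{strategy}_{tf}"
    # global: never fold-suffixed (shared across folds)
    # perfold: always fold-suffixed
    # inline: fold-suffixed only when a fold_id is supplied
    if scope != "global" and fold_id is not None:
        stem += f"_fold{fold_id}"
    return CACHE_ROOT / method / scope / f"{stem}.json"
```

For example, `cache_path("fi", "perfold", "BTCUSDT", "trend", "1h", fold_id=3)` and the same call with `scope="global"` resolve to different directories and different file names, which is what keeps stale global scores from being mistaken for per-fold ones.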

Console keys

CVN_PREFLIGHT_ENABLED                     master kill-switch (existing)
CVN_GLOBAL_PREFLIGHT                      gate the global compute
CVN_PERFOLD_PREFLIGHT                     gate the per-fold compute
CVN_FI_REFERENCE_DEPTH                    XGBoost depth for fi/shap (default 10, was 6)
CVN_FI_REFERENCE_ROUNDS                   XGBoost rounds (default 500, was 200)
CVN_FEATURE_SELECTION_METHOD              selector name (registry key)
CVN_FEATURE_SELECTION_SCOPE               inline | global | perfold
CVN_FEATURE_SELECTION_K_OVERSHOOT_POLICY  fail | truncate (default fail, ADR-25)
CVN_FOLD_ID                               injected by regime_trainer for perfold scope
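As an illustration of how the two reference-model keys might be validated at the edge, the sketch below applies the defaults listed above and the bounds stated under Invariants (depth in [1, 30], rounds in [10, 5000]). The helper name `read_bounded_int`, the `BOUNDS` table, and reading from an environment-style mapping are assumptions; the actual Console parsing code is not shown in this ADR.

```python
import os
from typing import Mapping, Optional

# key -> (min, max, default); values taken from this ADR
BOUNDS = {
    "CVN_FI_REFERENCE_DEPTH": (1, 30, 10),
    "CVN_FI_REFERENCE_ROUNDS": (10, 5000, 500),
}


def read_bounded_int(key: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Parse an integer knob, failing fast instead of crashing inside XGBoost."""
    env = os.environ if env is None else env
    low, high, default = BOUNDS[key]
    raw = env.get(key)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # covers non-integer input such as "10.5" or "deep"
        raise ValueError(f"{key}={raw!r} is not an integer") from None
    if not low <= value <= high:
        raise ValueError(f"{key}={value} is outside [{low}, {high}]")
    return value
```

Passing the mapping explicitly keeps the check testable without mutating the process environment.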

Invariants

Adding a selector is a 2-step change. (1) New dags/<name>_dag.py with a single output function <name>_scores(...) returning pd.Series. (2) New entry in REGISTRY. No edits to dispatch, cache layout, FE pipeline, or regime_trainer.
All compute is Hamilton. Per ADR-61, no imperative class hierarchy. Selectors are typed pure functions; Hamilton resolves the DAG. The dispatcher orchestrates (cache + Hamilton execute), it does not compute.
Shared nodes deduplicate. When a single execute call requests multiple outputs that share a node (e.g. fi_scores + shap_scores both consume reference_model), Hamilton materialises the shared node exactly once. Zero double training.
Anti-leakage by construction. The dispatcher signature requires (x_train, y_train) from the caller. The DAG never queries the cache directly. Per-fold scope passes the fold's training window; global scope passes the preflight reference window. No code path can confuse train and test.
Cache key strict isolation. cache/feature_selection/<method>/<scope>/<SYMBOL>_<strategy>_<tf>[_fold<N>].json. Methods, scopes, folds, and runs cannot collide. Variants on the same fold (fi_50_perfold + fi_150_perfold) share the cache by design and never double-compute.
K-overshoot is structured. When select_top_k(scores, K, policy='fail') finds K > non-zero count, it raises KAboveFloorError carrying {requested_k, available_features, selector, scope, symbol}. The persistence layer serialises this to a JSON payload in finetune_results.error so the dashboard distinguishes code=k_above_floor from real training crashes (legacy opaque 'training_failed' still works during migration via error_payload()).
OTel events at every transition. Each Hamilton node emits event=feature_selection_node node=<name> via commun.observability.otel.emit_event. The dispatcher emits feature_selection_cache_hit, feature_selection_compute_start, feature_selection_compute_end, feature_selection_compute_failed, plus feature_selection_cache_window_mismatch when a global cache is served to a caller whose training window differs from the one the scores were computed on (sharing intent preserved, audit trail kept). Loki/Grafana introspection without log-grep.
Per-scope cache key + mismatch behaviour (PR #686 CR pass 2):
global : key = (method, sym, strat, tf) — shared across folds. Mismatch on cache hit → warning event, scores still served.
perfold : key = (method, sym, strat, tf, fold_id) — fold_id required. Mismatch on cache hit → RuntimeError (programming bug — caller passed a different (X, y) for the same fold).
inline : key includes fold_id when supplied (= same isolation as perfold), else falls back to the global-shaped key for ad-hoc one-off compute. Mismatch behaviour same as perfold.
Reference-model hyperparameter bounds (PR #686 CR pass 1): CVN_FI_REFERENCE_DEPTH ∈ [1, 30], CVN_FI_REFERENCE_ROUNDS ∈ [10, 5000]. Out-of-range or non-integer values raise ValueError at the edge (Console parse time) instead of crashing inside XGBoost.
Score alignment preserves NaN (PR #686 CR pass 5). The FE pipeline aligns scores to the actual feature columns via aligned = scores.reindex(...) then aligned.loc[~aligned.index.isin(scores.index)] = 0.0. Features absent from the cache get 0 (de-prioritised); features present in the cache with NaN keep NaN (selector-explicit "undefined") and are filtered by select_top_k's > 0 mask. Distinct semantics, not collapsed via .fillna(0).
Single canonical scope-fallback (PR #686 CR pass 4): commun.finetune.feature_selection.default_scope_for_method(method) returns REGISTRY[method].supports_scopes[0] and is the only place in the codebase that picks a default scope for a selector. The FE pipeline calls it directly; the Console save-time validator mirrors it. Adding a selector with a non-standard supports_scopes ordering automatically updates the default everywhere.
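The NaN-preserving score alignment described above can be demonstrated on a toy series. This is a sketch under assumptions: the helper name `align_scores`, the argument passed to `reindex`, and the feature names are illustrative only; the two pandas statements mirror the ones quoted in the invariant.

```python
import numpy as np
import pandas as pd


def align_scores(scores: pd.Series, feature_columns) -> pd.Series:
    """Align cached scores to the actual feature columns without collapsing NaN."""
    aligned = scores.reindex(feature_columns)
    # Only features that were absent from the cache are set to 0.0;
    # NaN values that the selector itself produced are left untouched.
    aligned.loc[~aligned.index.isin(scores.index)] = 0.0
    return aligned


# Hypothetical cache content: one scored feature, one selector-undefined feature.
cached = pd.Series({"rsi_14": 0.42, "atr_20": np.nan})
aligned = align_scores(cached, ["rsi_14", "atr_20", "vol_z"])

# The "> 0" mask drops both the zero (absent) and the NaN (undefined) entries.
eligible = aligned[aligned > 0].index.tolist()

# For contrast, fillna(0) would erase the distinction between the two cases.
collapsed = cached.reindex(["rsi_14", "atr_20", "vol_z"]).fillna(0)
```

Here `aligned` keeps `atr_20` as NaN and sets `vol_z` to 0.0, whereas `collapsed` reports 0 for both, so the "undefined" signal would be lost to any later audit.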

Alternatives rejected

OOP class hierarchy with abstract FeatureSelector base. Conflicts with ADR-61 ("batch DAGs use Hamilton, not imperative code"). Loses Hamilton's automatic shared-node deduplication. Rejected as soon as the operator caught it, before phase 2 of the implementation.
Per-method preflight steps. Each new selector would be a PreflightStep class. Multiplies the preflight surface; no clean way to share reference_model between FI and SHAP without a global cache; doesn't address the per-fold need.
Padding with zero-importance features when K > non-zero count. Violates ADR-25 (no silent fallback). Pads the model with noise features chosen at random — would skew Sortino comparisons silently.
Truncating by default when K > non-zero count. Easier on the operator but loses the signal that the FI procedure has hit its capacity ceiling. ADR-25 default is fail; truncate is opt-in per variant.
PCA / ICA / autoencoder selectors plugged into the same framework. They are dimensionality reduction, not feature selection — they output linear combinations of features, not a subset. Forcing them through select_top_k would be a semantic lie. Tracked as a separate factor dimensionality_reduction with its own framework.
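The fail-by-default versus opt-in truncate behaviour chosen over padding can be sketched as follows. To stay dependency-free, a plain mapping stands in for the `pd.Series` the framework uses, and the keyword arguments carrying `selector`, `scope`, and `symbol` are hypothetical, since the ADR only gives the signature `select_top_k(scores, K, policy)`. The payload fields and the `k_above_floor` code are taken from the Invariants section.

```python
from typing import List, Mapping


class KAboveFloorError(ValueError):
    """K exceeds the number of features with a strictly positive score."""

    def __init__(self, **payload):
        self.payload = {"code": "k_above_floor", **payload}
        super().__init__(str(self.payload))


def select_top_k(
    scores: Mapping[str, float],
    k: int,
    policy: str = "fail",
    *,
    selector: str = "",
    scope: str = "",
    symbol: str = "",
) -> List[str]:
    """Return the k best feature names, never padding with zero-score ones."""
    if policy not in ("fail", "truncate"):
        raise ValueError(f"unknown policy {policy!r}")
    # NaN > 0 evaluates to False, so undefined scores drop out with the zeros.
    positive = [name for name, value in scores.items() if value > 0]
    ranked = sorted(positive, key=lambda name: scores[name], reverse=True)
    if k > len(ranked):
        if policy == "fail":
            raise KAboveFloorError(
                requested_k=k,
                available_features=len(ranked),
                selector=selector,
                scope=scope,
                symbol=symbol,
            )
        return ranked  # truncate: hand back everything that is usable
    return ranked[:k]
```

In this sketch neither policy ever returns a zero-importance feature: `fail` surfaces the capacity ceiling as a structured error, and `truncate` returns fewer than K names.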

Consequences

Forcing functions

New selector = new file + registry entry. Reviewer load minimal.
Variant naming carries the (method, scope, K) tuple explicitly: fi_80_perfold, shap_30_global. Dashboards group by prefix without parsing config.
The structured KAboveFloorError payload becomes a first-class column lens in Grafana (filter by error_code = 'k_above_floor'), so K-overshoot is visible without grep.

Costs

Per-fold compute adds N folds × ~3 min of XGBoost training per crypto (was 1 × ~3 min global). For defi_top5 with 5 folds and one reference_model per fold shared by the FI+SHAP combo, that is 5 cryptos × 5 folds × ~3 min ≈ 75 min of added wall-clock per FTF run. Acceptable; balanced by the deeper model and per-fold caching.
Migration of the legacy cache/feature_importance/<key>.json files: the new framework writes to cache/feature_selection/fi/global/<key>.json. A one-shot migration script is OUT OF SCOPE for this PR — operators can either (a) re-run preflight to populate the new layout, or (b) symlink the old paths into the new structure. The legacy load_fi_reference() reader stays callable for the existing preflight step.

What becomes easier

Cross-method comparison in the same FTF run. A single ablation factor exercises 4 methods × 2 scopes simultaneously, with shared OHLCV / labels / FE pipeline and identical Optuna seeds. Statistical claims about "FI is better than variance" become defensible.
Production parity. Per-fold scope mirrors the deployment cadence (operator retrains weekly, recomputes FI on the new training window). The FTF measures what production will see.
Drop-in selectors. permutation, mutual_info, custom XGB-based variants (different hyperparameters), or third-party rankers (feature_engine.selection.*) can ship as PRs of ~80 lines each.

What becomes harder

Debugging "why did variant X fail" now requires reading the JSON error.code instead of grepping for "training_failed". Mitigated by the dashboard panel that surfaces error codes.
Schema migration of finetune_results.error. The column is text today; the framework writes JSON into it. Backward-compatible (legacy reads still work) but dashboards that string-match error = 'training_failed' need updating to error::json->>'code' = 'training_failed' (a one-time SQL fix).
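During the migration the error column can hold either a legacy plain string or the new JSON payload, so a reader has to tolerate both. The sketch below shows one way a consumer might extract the code; the helper name `read_error_code` is hypothetical and is not the framework's `error_payload()`.

```python
import json
from typing import Optional


def read_error_code(raw: Optional[str]) -> Optional[str]:
    """Return the error code from a finetune_results.error value, old or new style."""
    if raw is None or raw == "":
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Legacy rows store an opaque string such as 'training_failed'.
        return raw
    if isinstance(parsed, dict) and "code" in parsed:
        return str(parsed["code"])
    # Valid JSON without a 'code' field: fall back to the raw text.
    return raw
```

With this fallback, a legacy row containing `training_failed` and a new row containing `{"code": "k_above_floor", ...}` can be grouped by the same key.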

Ownership

DRI for the framework: assigned at first follow-up issue. Responsibilities: registry hygiene (no duplicate names, scope declarations match implementation), cache layout enforcement, ADR-67 amendments when new scopes are added.

References

Parent: ADR-61 (Hamilton for batch dataflow), ADR-64 (preflight is a first-class phase — amended here)
Triggering analysis: GitHub issue #684, conversation 2026-04-24/25
Related: ADR-25 (no silent fallback, preserved by KAboveFloorError), ADR-56 (every change A/B testable, achieved by parametrising on (method, scope, K)), ADR-62 (OTel observability — selectors emit structured events at every node), ADR-65 (Console-driven toggles — all framework knobs surfaced in ftf_config.base_env)
Implementation: commun/finetune/feature_selection/ package