CVN-N001-EI-S03: Decision on the regime-signal source for the S41 split families
- Story: CVN-N001-EI-S03 (wp#226, In progress) · GitHub: #1058 · Epic: CVN-N001-EI (#1055)
- Type: implementation decision (instrument design); arbitrate before wiring s41_nodes/regime_split
- Branch: feat/CVN-N001-EI-S03-split-regime-reconstruction
- Date: 2026-05-28
1. The decision
S41's regime/vol split families need a per-bar volatility signal and a trend signal to partition a fold's rows:
| Family | Needs |
|---|---|
| 3 market_period | trend sign (causal rolling return) |
| 4 volatility_bucket | realized vol (train-region tertiles) |
| 6 regime_ab | realized vol + trend sign → regime code, against absolute pre-registered bands |
The plan §4 wrote these in terms of realized vol of close (std(log-returns)) and rolling return of close. But the S41 cell consumes the S07-pinned captured parquet, whose columns are: the model feature set (all non-_ columns) + _label + _split + optional _weight (s22_a1_reproduction._load_captured_parquet). Raw close is not guaranteed to survive as a feature column — enrichment emits derived features (price_volatility_20, ATRr_16, distance_SMA_192, trend_strength, …), and close may have been dropped.
Question: where does the regime signal come from? This is load-bearing for instrument validity, reproducibility, and the S07 pinned-fold contract — hence a dossier, not a guess.
Prerequisite fact (confirm at wiring): inspect a captured-fold parquet's columns to see whether raw close survived. The recommendation below (Option B) holds either way; Options A/C depend on the answer.
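The prerequisite check can be sketched as below. This is a minimal illustration, not the project's loader: summarize_columns is a hypothetical helper, and the only rule it encodes is the one stated above (feature columns are the non-underscore columns; _label, _split and the optional _weight are metadata).

```python
def summarize_columns(columns):
    """Split captured-parquet column names into features and metadata.

    Assumption: metadata columns are exactly those starting with "_",
    as described for the captured parquet layout.
    """
    features = [c for c in columns if not c.startswith("_")]
    meta = [c for c in columns if c.startswith("_")]
    return {
        "features": features,
        "meta": meta,
        # Raw close counts as "survived" only if it is a feature column.
        "has_close": "close" in features,
    }


# Illustrative column list; a real one would come from the pinned parquet.
example_columns = [
    "price_volatility_20", "ATRr_16", "distance_SMA_192",
    "trend_strength", "_label", "_split",
]
summary = summarize_columns(example_columns)
```

The column names can be read without loading any rows, for instance with `pyarrow.parquet.read_schema(path).names`, and passed straight to the helper.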
2. Options
Option A: close if present, else a captured vol feature (fallback path)
Use raw close when the parquet carries it; otherwise fall back to price_volatility_20 / ATRr_16.
- + Always works; uses true close when available.
- − Dual code path → the instrument's behaviour is data-dependent: a verdict's meaning ("which signal split this cell?") varies by crypto/fold. Reviewers (and the §1 pre-registration discipline) disfavour a branch that changes the measurement. Two paths to test, two to reason about.
Option B: always a captured vol/trend feature (single path); recommended
Define the regime signal purely from features already in the model's input set: a pre-registered vol feature (e.g. price_volatility_20) for the vol bands, and a pre-registered trend feature (e.g. distance_SMA_192 or trend_strength) for the trend sign. The family-6 absolute bands are re-pre-registered in that feature's units (frozen tertiles of the feature over a disjoint reference window — same leak-safe construction, different unit).
- + Reproducible from the pinned parquet alone — no external data, so the S07 reproducibility contract is preserved (the warm re-audit stays a pure function of the pinned fold).
- + Single path — every cell's regime is defined the same way; a verdict is unambiguous.
- + Leak-safe: the features are already causal and in X; splitting by them is "split the data the model actually sees", which is arguably the more faithful test of validation-design transfer.
- + No new I/O.
- − The pre-registered absolute bands move from close-realized-vol units to feature units → a one-time re-pre-registration (committee-confirmable, Q3-adjacent). regime_split generalises slightly: accept a vol-signal / trend-signal array (or a chosen feature column) instead of requiring close.
- − A derived feature (e.g. a 20-bar rolling vol) is a smoothed proxy for per-bar realized vol: coarser, but consistent and transparent.
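The "frozen tertiles over a disjoint reference window" construction can be sketched as follows. This is a minimal illustration under stated assumptions: freeze_tertile_edges and vol_band are hypothetical names, the feature is treated as a plain sequence of numbers, and both the inclusive quantile method and the tie rule at an edge are illustrative choices rather than pre-registered ones.

```python
from bisect import bisect_left
from statistics import quantiles


def freeze_tertile_edges(reference_values):
    """Return the two tertile cut points of a disjoint reference window.

    The edges are computed once and then treated as constants, so the
    band definition never depends on the fold being evaluated.
    """
    lower, upper = quantiles(reference_values, n=3, method="inclusive")
    return (lower, upper)


def vol_band(value, edges):
    """Map a feature value to band 0 (low), 1 (mid) or 2 (high).

    Assumption: a value exactly equal to an edge falls in the lower band.
    """
    return bisect_left(edges, value)


# Illustrative numbers only, in arbitrary feature units.
reference_window = [10, 20, 30, 40, 50, 60, 70]
EDGES = freeze_tertile_edges(reference_window)
fold_bands = [vol_band(v, EDGES) for v in [25, 40, 90]]
```

Because the edges come only from the reference window, re-running the banding on the pinned fold gives the same partition every time, which is the property Option B relies on.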
Option C: thread raw OHLCV close explicitly
Widen s41_io to load the raw close/OHLCV series alongside X/y (from the parquet if present, else re-derive from cache/source).
- + Most faithful to the plan's original close-based pre-registered bands; no re-pre-registration.
- − If close is not in the parquet, re-fetching/re-deriving it re-introduces an external data dependency, and a fold re-fetched outside the pinned parquet can drift vs the pinned fold, breaking the S07 reproducibility guarantee the whole capture cache was built to provide (CVN-N001-EI-S07). That is a material regression for a warm-re-audit instrument.
- − Adds I/O + a recoverability assumption.
3. Criteria summary
| Criterion | A (close-else-feature) | B (feature, single-path) | C (thread close) |
|---|---|---|---|
| Reproducible from pinned parquet alone | partial | yes | only if close in parquet |
| Preserves S07 pinned-fold contract | partial | yes | at risk (re-fetch drift) |
| Single, unambiguous measurement | no (dual path) | yes | yes |
| Leak-safe | yes | yes | yes |
| Faithful to plan's close-based bands | when close present | needs re-pre-registration | yes |
| New I/O dependency | none | none | yes |
| Code paths to test | 2 | 1 | 1 (+ recovery) |
4. DECISION (operator, 2026-05-28): Option B, UNCONDITIONAL
Define the S41 regime signal from captured features, single-path, for every cell, unconditionally: no fallback to Option C even when close is present. Rationale (decided): B is the only option that keeps the instrument a pure function of the S07-pinned fold (no external data, no drift). Crucially, the dossier's earlier "C if close present" clause is struck: a mix of B-cells (close absent) and C-cells (close present) is a program-level dual path, the exact "two definitions of the measurement" defect that disqualified Option A. Inter-cell consistency of the measurement requires B to be unconditional; C's marginal fidelity to close-based bands does not justify re-introducing the incoherence. The cost is a one-time, committee-confirmable re-pre-registration of the family-6 bands in feature units.
Implementation under B: regime_split helpers generalise to take a vol_signal / trend_signal (pd.Series) — realized_vol/rolling_return/trend_sign keep their close convenience overload; s41_nodes builds the signals from the pre-registered feature columns and feeds the splits. Family-6 band constants become feature-unit constants (frozen from a disjoint reference window).
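The generalisation can be sketched as below, for the trend side only. This is not the actual regime_split code: the real helpers take pd.Series, whereas this sketch uses plain lists to stay dependency-free. trend_sign_from_close is a hypothetical name for the close convenience overload, and mapping undefined bars to sign 0 is an assumption.

```python
def rolling_return(close, window):
    """Causal rolling return: close[t] / close[t - window] - 1.

    The first `window` bars have no look-back and yield None.
    """
    out = [None] * len(close)
    for t in range(window, len(close)):
        out[t] = close[t] / close[t - window] - 1.0
    return out


def trend_sign(trend_signal):
    """Sign of any trend signal: +1, -1, or 0 (flat or undefined).

    Under Option B the signal is a captured feature column, not close.
    """
    signs = []
    for value in trend_signal:
        if value is None or value == 0:
            signs.append(0)
        else:
            signs.append(1 if value > 0 else -1)
    return signs


def trend_sign_from_close(close, window):
    """Convenience overload for callers that still hold raw close."""
    return trend_sign(rolling_return(close, window))


# Feature path (Option B): pass the pre-registered trend feature directly.
feature_signs = trend_sign([-0.5, 0.0, 0.3])
# Close path: kept only as a convenience, not used by the S41 cells.
close_signs = trend_sign_from_close([100, 101, 102, 101, 99], window=2)
```

Keeping the sign logic in one function means the feature path and the close overload cannot diverge in how they treat flat or undefined bars.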
5. Reservation: family 6 construct validity (the C-d-verdict-bearing family)
Feature-derived regime is harmless for families 3/4 (a smoothed vol/trend proxy is fine for a CV partition). But family 6 (regime_ab) is the direct C-d stressor that the §12 routing depends on, and a single in-model feature would reduce "regime-A→B" to "train on a low slice, test on a high slice of an axis the model already sees". That is extrapolation along a seen axis, not transfer across market regimes, so the regime-A→B label would over-promise what the partition actually delivers. Resolution (decided):
- Multivariate regime code for family 6: compose the regime from multiple captured features (vol band × trend sign × a third structural axis, e.g. SMA-distance band), so the regime stays genuinely multivariate even though it is feature-derived. (My current regime_ab is already bivariate, vol × trend; this adds a structural axis and pins it as the family-6 contract.) Bands re-pre-registered in the composite feature units, frozen from a disjoint reference window.
- AND honest verdict scope: the family-6 result is reported as "transfer across the captured-feature-defined regime partition", not "across market regimes". The §12 routing must not over-read a feature-derived proxy as a full market-regime claim.
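One way to compose the three axes into a single regime code is sketched below. The encoding is purely illustrative: regime_code, decode_regime and the band counts are assumptions, the third structural axis is still TBD in this dossier, and which codes count as regime A or regime B remains a pre-registration matter that the sketch does not settle.

```python
N_VOL_BANDS = 3            # assumed: low / mid / high tertile bands
N_STRUCT_BANDS = 3         # assumed: same banding for the structural axis
TREND_SIGNS = (-1, 0, 1)   # down / flat / up


def regime_code(vol_band, trend, struct_band):
    """Pack (vol band, trend sign, structural band) into one integer."""
    if not 0 <= vol_band < N_VOL_BANDS:
        raise ValueError(f"vol_band out of range: {vol_band}")
    if not 0 <= struct_band < N_STRUCT_BANDS:
        raise ValueError(f"struct_band out of range: {struct_band}")
    trend_index = TREND_SIGNS.index(trend)  # ValueError if not -1, 0 or 1
    return (vol_band * len(TREND_SIGNS) + trend_index) * N_STRUCT_BANDS + struct_band


def decode_regime(code):
    """Invert regime_code, so a reported code stays human-readable."""
    rest, struct_band = divmod(code, N_STRUCT_BANDS)
    vol_band, trend_index = divmod(rest, len(TREND_SIGNS))
    return vol_band, TREND_SIGNS[trend_index], struct_band


low_vol_downtrend = regime_code(0, -1, 0)
high_vol_uptrend = regime_code(2, 1, 2)
```

A reversible code keeps the partition auditable: any cell's regime can be decoded back to the three band values that produced it.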
6. Q-feature: selection criteria for regime-defining features (a B-specific confound)
Beyond causal + present in all cells (a missing feature must NOT trigger a fallback — that re-opens the dual path), a confound specific to B:
The regime-defining feature must not be among the most label-predictive. If the vol feature strongly predicts _label (plausible: volatility often predicts barrier outcomes), then low-vs-high tertiles ≈ low-signal vs high-signal regions; training on one and testing on the other makes ΔAUC mix regime transfer with intrinsic difficulty of the test region → the C-d verdict is confounded (difficulty-stratification disguised as transfer).
Selection rule (decided): choose regime features that are informative about market state but have LOW label correlation, and report |corr(feature, _label)| for every regime-defining feature in the results dossier. Proposed gate: a feature with |Spearman corr| > 0.10 to the label is rejected as a regime axis (or, if retained for coverage reasons, the family-6 verdict carries a DIFFICULTY_CONFOUND flag). Proposed signals to vet against this: vol (price_volatility_20), trend (distance_SMA_192), structure (a third axis TBD) — each confirmed causal, present in all cells, and label-corr reported before use.
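The proposed gate can be sketched as below. It is a dependency-free illustration: vet_regime_feature and its return fields are hypothetical, the Spearman coefficient is computed by hand as Pearson correlation on average ranks, and the example inputs are made-up toy values, not results.

```python
from math import sqrt

LABEL_CORR_GATE = 0.10  # proposed threshold on |Spearman corr|


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant input")
    return cov / sqrt(vx * vy)


def vet_regime_feature(name, feature, label, gate=LABEL_CORR_GATE):
    """Report |corr(feature, _label)| and apply the proposed gate."""
    abs_rho = abs(spearman(feature, label))
    passes = abs_rho <= gate
    return {
        "feature": name,
        "abs_spearman": abs_rho,
        "passes_gate": passes,
        # A feature retained despite failing carries the confound flag.
        "flag": None if passes else "DIFFICULTY_CONFOUND",
    }


# Toy inputs only: the first feature tracks the label, the second does not.
predictive = vet_regime_feature("toy_vol", [1, 2, 3, 4], [0, 0, 1, 1])
neutral = vet_regime_feature("toy_trend", [1, 2, 3, 4], [0, 1, 1, 0])
```

In a pandas-based pipeline the same number would normally come from `Series.corr(method="spearman")` or `scipy.stats.spearmanr`; the hand-rolled version only keeps the sketch self-contained.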
Decision status: Option B adopted (unconditional) + family-6 multivariate regime + low-label-corr feature criterion. Wiring proceeds on this basis; the family-6 composite bands + the per-feature label-corr report are committee-confirmable at pr_review.