
CVN-N001-EI-S03 — Decision: regime-signal source for the S41 split families

  • Story: CVN-N001-EI-S03 (wp#226, In progress) · GitHub: #1058 · Epic: CVN-N001-EI (#1055)
  • Type: implementation decision (instrument design) — arbitrate before wiring s41_nodes / regime_split
  • Branch: feat/CVN-N001-EI-S03-split-regime-reconstruction
  • Date: 2026-05-28

1. The decision

S41's regime/vol split families need a per-bar volatility signal and a trend signal to partition a fold's rows:

Family              | Needs
3 market_period     | trend sign (causal rolling return)
4 volatility_bucket | realized vol (train-region tertiles)
6 regime_ab         | realized vol + trend sign → regime code, against absolute pre-registered bands

The plan §4 wrote these in terms of realized vol of close (std(log-returns)) and rolling return of close. But the S41 cell consumes the S07-pinned captured parquet, whose columns are: the model feature set (all non-_ columns) + _label + _split + optional _weight (s22_a1_reproduction._load_captured_parquet). Raw close is not guaranteed to survive as a feature column — enrichment emits derived features (price_volatility_20, ATRr_16, distance_SMA_192, trend_strength, …), and close may have been dropped.

Question: where does the regime signal come from? This is load-bearing for instrument validity, reproducibility, and the S07 pinned-fold contract — hence a dossier, not a guess.

Prerequisite fact (confirm at wiring): inspect a captured-fold parquet's columns to see whether raw close survived. The recommendation below (Option B) holds either way; Options A/C depend on the answer.
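
The prerequisite check could look roughly like the sketch below. It classifies a captured fold's column names using the convention stated above (feature columns are the ones that do not start with `_`). The helper name `describe_captured_columns` and the returned keys are hypothetical, and the column list is a toy example; at wiring time the names would come from the actual parquet schema.

```python
def describe_captured_columns(columns):
    """Classify captured-parquet column names (hypothetical helper).

    Feature columns are those not starting with "_"; the rest are the
    bookkeeping columns (_label, _split, optional _weight).
    """
    features = [c for c in columns if not c.startswith("_")]
    meta = [c for c in columns if c.startswith("_")]
    return {
        "features": features,
        "meta": meta,
        "has_close": "close" in features,
        "has_weight": "_weight" in meta,
    }


# Toy column list for illustration only, not a real captured fold.
example = describe_captured_columns(
    ["price_volatility_20", "ATRr_16", "distance_SMA_192",
     "trend_strength", "_label", "_split"]
)
print("raw close survived:", example["has_close"])
```

Under Option B the answer only documents the fold; it does not change which signal is used.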


2. Options

Option A — close if present, else a captured vol feature (fallback path)

Use raw close when the parquet carries it; otherwise fall back to price_volatility_20 / ATRr_16.

  • + Always works; uses true close when available.
  • − Dual code path → the instrument's behaviour is data-dependent: a verdict's meaning ("which signal split this cell?") varies by crypto/fold. Reviewers (and the §1 pre-registration discipline) disfavour a branch that changes the measurement. Two paths to test, two to reason about.

Option B — captured features only, single path

Define the regime signal purely from features already in the model's input set: a pre-registered vol feature (e.g. price_volatility_20) for the vol bands, and a pre-registered trend feature (e.g. distance_SMA_192 or trend_strength) for the trend sign. The family-6 absolute bands are re-pre-registered in that feature's units (frozen tertiles of the feature over a disjoint reference window; same leak-safe construction, different unit).

  • + Reproducible from the pinned parquet alone — no external data, so the S07 reproducibility contract is preserved (the warm re-audit stays a pure function of the pinned fold).
  • + Single path — every cell's regime is defined the same way; a verdict is unambiguous.
  • + Leak-safe — the features are already causal and in X; splitting by them is "split the data the model actually sees", which is arguably the more faithful test of validation-design transfer.
  • + No new I/O.
  • − The pre-registered absolute bands move from close-realized-vol units to feature units → a one-time re-pre-registration (committee-confirmable, Q3-adjacent). regime_split generalises slightly: accept a vol-signal / trend-signal array (or a chosen feature column) instead of requiring close.
  • − A derived feature (e.g. a 20-bar rolling vol) is a smoothed proxy for per-bar realized vol: coarser, but consistent and transparent.
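
The band construction described for Option B can be sketched as below: tertile edges are frozen once from a reference sample of the pre-registered vol feature and then applied unchanged to any fold. This is a minimal sketch under stated assumptions, not the project's regime_split code; the function names `freeze_tertile_edges` and `apply_bands` and the toy numbers are invented for illustration.

```python
import numpy as np


def freeze_tertile_edges(reference_values):
    """Return the (1/3, 2/3) quantile edges of a reference sample.

    The reference sample is assumed to be disjoint from the rows that
    will later be bucketed (e.g. an earlier window of the same feature).
    """
    ref = np.asarray(reference_values, dtype=float)
    low_edge, high_edge = np.quantile(ref, [1.0 / 3.0, 2.0 / 3.0])
    return float(low_edge), float(high_edge)


def apply_bands(feature_values, edges):
    """Map each row to 0 (low), 1 (mid) or 2 (high) using frozen edges."""
    values = np.asarray(feature_values, dtype=float)
    return np.digitize(values, np.asarray(edges, dtype=float))


# Toy values standing in for price_volatility_20 (illustrative only).
reference_window = np.arange(1.0, 10.0)      # 1.0 .. 9.0
edges = freeze_tertile_edges(reference_window)
bands = apply_bands([2.0, 5.0, 8.0], edges)  # one row falls in each band
```

Because the edges are plain constants once frozen, they can be recorded in the pre-registration and compared across cells without recomputation.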

Option C — thread raw OHLCV close explicitly

Widen s41_io to load the raw close/OHLCV series alongside X/y (from the parquet if present, else re-derive from cache/source).

  • + Most faithful to the plan's original close-based pre-registered bands; no re-pre-registration.
  • − If close is not in the parquet, re-fetching/re-deriving it re-introduces an external data dependency, and a fold re-fetched outside the pinned parquet can drift vs the pinned fold, breaking the S07 reproducibility guarantee the whole capture cache was built to provide (CVN-N001-EI-S07). That is a material regression for a warm-re-audit instrument.
  • − Adds I/O + a recoverability assumption.

3. Criteria summary

Criterion                              | A (close-else-feature) | B (feature, single-path)  | C (thread close)
Reproducible from pinned parquet alone | partial                | yes                       | only if close in parquet
Preserves S07 pinned-fold contract     | partial                | yes                       | at risk (re-fetch drift)
Single, unambiguous measurement        | no (dual path)         | yes                       | yes
Leak-safe                              | yes                    | yes                       | yes
Faithful to plan's close-based bands   | when close present     | needs re-pre-registration | yes
New I/O dependency                     | none                   | none                      | yes
Code paths to test                     | 2                      | 1                         | 1 (+ recovery)

4. DECISION (operator, 2026-05-28) — Option B, UNCONDITIONAL

Define the S41 regime signal from captured features, single-path, for every cell, unconditionally: no fallback to Option C even when close is present. Rationale (decided): B is the only option that keeps the instrument a pure function of the S07-pinned fold (no external data, no drift). Crucially, the dossier's earlier "C if close present" clause is struck: a mix of B-cells (close absent) and C-cells (close present) is a program-level dual path, the exact "two definitions of the measurement" defect that disqualified Option A. Inter-cell consistency of the measurement requires B to be unconditional; C's marginal fidelity to close-based bands does not justify re-introducing the incoherence. The cost is a one-time, committee-confirmable re-pre-registration of the family-6 bands in feature units.

Implementation under B: regime_split helpers generalise to take a vol_signal / trend_signal (pd.Series) — realized_vol/rolling_return/trend_sign keep their close convenience overload; s41_nodes builds the signals from the pre-registered feature columns and feeds the splits. Family-6 band constants become feature-unit constants (frozen from a disjoint reference window).
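
One possible shape for the generalised helpers is sketched below. It assumes pandas is available and that price_volatility_20 and distance_SMA_192 are the pre-registered columns; the signatures, the default window of 20, and the `build_signals` helper are hypothetical and are not taken from the existing regime_split or s41_nodes modules.

```python
import numpy as np
import pandas as pd

VOL_FEATURE = "price_volatility_20"   # assumed pre-registered vol column
TREND_FEATURE = "distance_SMA_192"    # assumed pre-registered trend column


def realized_vol(close: pd.Series, window: int = 20) -> pd.Series:
    """Close convenience path: causal rolling std of log-returns."""
    return np.log(close).diff().rolling(window).std()


def trend_sign(trend_signal: pd.Series) -> pd.Series:
    """Sign (-1, 0, +1) of any per-bar trend signal; NaNs are not handled."""
    return np.sign(trend_signal).astype(int)


def build_signals(X: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Pull the pre-registered columns from X.

    A missing column raises instead of falling back to another signal,
    so every cell is measured the same way.
    """
    missing = [c for c in (VOL_FEATURE, TREND_FEATURE) if c not in X.columns]
    if missing:
        raise KeyError(f"regime feature(s) absent from captured fold: {missing}")
    return X[VOL_FEATURE], X[TREND_FEATURE]


# Toy frame for illustration only.
X = pd.DataFrame({
    VOL_FEATURE: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    TREND_FEATURE: [-1.0, 2.0, -0.5, 0.0, 1.5, -2.0],
})
vol_signal, trend_signal = build_signals(X)
signs = trend_sign(trend_signal)
```

Keeping `realized_vol` as a separate close-based helper preserves the convenience overload without letting the split itself depend on close.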


5. Reservation — family 6 construct validity (the C-d-verdict-bearing family)

Feature-derived regime is anodyne for families 3/4 (a smoothed vol/trend proxy — fine for a CV partition). But family 6 (regime_ab) is the direct C-d stressor that the §12 routing depends on, and a single in-model feature would reduce "regime-A→B" to "train on a low slice, test on a high slice of an axis the model already sees" — that is extrapolation along a seen axis, not transfer across market regimes. The regime-A→B label would over-promise what the partition realises. Resolution (decided):

  1. Multivariate regime code for family 6 — compose the regime from multiple captured features (vol band × trend sign × a third structural axis, e.g. SMA-distance band), so the regime stays genuinely multivariate even though it is feature-derived. (My current regime_ab is already bivariate, vol × trend; this adds a structural axis and pins it as the family-6 contract.) Bands re-pre-registered in the composite feature units, frozen from a disjoint reference window.
  2. AND honest verdict scope — the family-6 result is reported as "transfer across the captured-feature-defined regime partition", not "across market regimes". The §12 routing must not over-read a feature-derived proxy as a full market-regime claim.
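
A composite regime code of the kind described in point 1 could be built as in the sketch below: three vol bands, two trend states and three structural bands are packed into a single integer. The function name `regime_code`, the packing order, the choice to treat a zero trend as "not up", and all band edges are assumptions for illustration; the real edges would be the frozen, pre-registered constants.

```python
import numpy as np


def band(values, edges):
    """Return 0/1/2 per row from two increasing, frozen band edges."""
    return np.digitize(np.asarray(values, dtype=float),
                       np.asarray(edges, dtype=float))


def regime_code(vol, trend, structure, vol_edges, structure_edges):
    """Pack vol band (3) x trend state (2) x structure band (3) into 0..17."""
    vol_band = band(vol, vol_edges)
    # Assumption: zero trend is grouped with "down" (state 0).
    trend_state = (np.asarray(trend, dtype=float) > 0).astype(int)
    structure_band = band(structure, structure_edges)
    return (vol_band * 2 + trend_state) * 3 + structure_band


# Toy inputs and edges (illustrative only, not pre-registered values).
codes = regime_code(
    vol=[0.1, 0.9],
    trend=[-1.0, 1.0],
    structure=[0.0, 5.0],
    vol_edges=(0.3, 0.6),
    structure_edges=(1.0, 3.0),
)
```

A regime-A→B split would then select disjoint sets of codes for train and test, so the partition differs on more than one axis at a time.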

6. Q-feature — selection criteria for regime-defining features (a B-specific confound)

Beyond causal + present in all cells (a missing feature must NOT trigger a fallback — that re-opens the dual path), a confound specific to B:

The regime-defining feature must not be among the most label-predictive. If the vol feature strongly predicts _label (plausible — volatility often predicts barrier outcomes), then low-vs-high tertiles ≈ low-signal vs high-signal regions; training on one and testing on the other makes ΔAUC mix regime transfer with intrinsic difficulty of the test region → the C-d verdict is confounded (difficulty-stratification disguised as transfer).

Selection rule (decided): choose regime features that are informative about market state but have LOW label correlation, and report |corr(feature, _label)| for every regime-defining feature in the results dossier. Proposed gate: a feature with |Spearman corr| > 0.10 to the label is rejected as a regime axis (or, if retained for coverage reasons, the family-6 verdict carries a DIFFICULTY_CONFOUND flag). Proposed signals to vet against this: vol (price_volatility_20), trend (distance_SMA_192), structure (a third axis TBD) — each confirmed causal, present in all cells, and label-corr reported before use.
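
The proposed gate can be sketched as follows. Spearman correlation is computed here as the Pearson correlation of average ranks (ties share their mean rank, which matters because _label is discrete). The names `average_ranks`, `spearman_corr` and `vet_regime_feature` are hypothetical, inputs are assumed to contain no NaNs, and the toy arrays are not real fold data.

```python
import numpy as np

LABEL_CORR_GATE = 0.10  # proposed threshold from the selection rule


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    x = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                # highest rank in each tie group
    mean_rank = last_rank - (counts - 1) / 2.0   # centre of each tie group
    return mean_rank[inverse]


def spearman_corr(feature, label):
    """Spearman rho as Pearson correlation of average ranks."""
    rf, rl = average_ranks(feature), average_ranks(label)
    if rf.std() == 0.0 or rl.std() == 0.0:
        return float("nan")  # undefined when either input is constant
    return float(np.corrcoef(rf, rl)[0, 1])


def vet_regime_feature(feature, label, gate=LABEL_CORR_GATE):
    """Report |rho| and whether the feature passes the gate.

    An undefined (NaN) correlation does not pass.
    """
    rho = spearman_corr(feature, label)
    return {"abs_spearman": abs(rho), "passes_gate": abs(rho) <= gate}


# Toy data: the first label tracks the feature, the second does not.
confounded = vet_regime_feature([1, 2, 3, 4], [0, 0, 1, 1])
neutral = vet_regime_feature([1, 2, 3, 4], [0, 1, 1, 0])
```

A feature that fails the gate would either be rejected as a regime axis or, if retained for coverage, would cause the family-6 verdict to carry the DIFFICULTY_CONFOUND flag.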

Decision status: Option B adopted (unconditional) + family-6 multivariate regime + low-label-corr feature criterion. Wiring proceeds on this basis; the family-6 composite bands + the per-feature label-corr report are committee-confirmable at pr_review.