Stateful Enrichment Refactor: Design document
Status: v4, committee-reviewed (v3 PASSED, strong consensus, 8.6/10 average). Phase 1 shipped 2026-04-21 (scaffolding + EMA stub + shadow harness), 63/63 tests green (post-CR-round-2 and post-rebase, 2026-05-03; the original 57 grew to 63 with the FTF lock and streaming-skip tests added during CR). Phase 2 (indicator bulk port) is unblocked and can start.

Authors: CVNTrade research, 2026-04-20 (v1–v3) / 2026-04-21 (v4)

Parent issue: #599 (shadow divergence diagnostic)

Supersedes: the three fix options outlined in PR #609. All three were either incorrect (option C splits the code path) or intractable (option B cannot handle infinite-memory indicators).
Changelog
- v4 (2026-04-21): Phase 1 shipped and informs the doc:
  - Protocol revision: update and emit are now separate methods (§4.2), resolving the ordering subtlety identified in the 2026-04-21 design discussion. R/B classes update on every candle for state correctness; emit is lazy and skipped on CUSUM-filtered candles. The v3 conflated update → (state', value) form is superseded.
  - §4.3 new: CUSUM/Enrichment ordering. Post-refactor, enrichment update runs BEFORE CUSUM gating; CUSUM gates emit, not update. This is a direct consequence of stateful correctness and was implicit in v3; it is now explicit.
  - §6 Phase 1 marked ✅ done with links to the shipped module.
- v3 (2026-04-20) — post committee review. Integrates 13 recommendations: timeline extended 4w → 6–8w, open questions §11 converted to decisions, operational safeguards added (kill switch, runbook, drift logging), shadow harness gains a divergence budget, Phase 0 inventory flagged as needing senior engineering time. No architectural change; the core proposal was unanimously supported.
- v2 (2026-04-20) — scaling semantics (§8bis) and multi-tenant separation of concerns (§8ter) added to preempt "stateful doesn't scale" objection.
- v1 (2026-04-20) — initial draft.
1. Executive summary
CVNTrade's enrichment stage today has a silent structural flaw. The
invariant advertised by EnrichmentAPI, namely that the batch and streaming entry points return the same feature values at the same bar,
is false whenever the caller passes different windows to the two public methods. In production this is the norm, not the exception:
- the monolith backtest engine calls enrich_batch(full_history) once at start,
- the pipeline runner calls enrich_streaming(ohlcv.iloc[T-500:T+1]) at every bar (see src/commun/pipeline/runner/steps.py:122).
For indicators with unbounded memory (EMA, Wilder RSI, ADX, MACD) and
multi-timeframe resamples (_1h, _6h), the 500-bar rolling window
produces measurably different feature values than the batch path.
Confirmed by PR #609 test_batch_full_vs_streaming_rolling_window_reproduces_shadow_drift:
15 features drift between batch and streaming at T=1000, top offender
atr_1h_zscore is NaN in streaming vs -1.54 in batch.
This drift propagates downstream:
- p(BUY) shifts up to 0.12 at the same bar (#599 Loki evidence)
- Trade counts diverge ±1/±3 per crypto per fold
- Backtest no longer predicts live behaviour — because the live path
(pipeline runner) is exactly the broken rolling-window path
The fix is not to hide the divergence behind two code paths ("batch for backtest, stream for live") — that would move the drift from "monolith vs pipeline" to "backtest vs live", where it becomes invisible.
The fix is a stateful, incremental enrichment engine shared by all modes. The state is initialised once (bootstrap on available history, batch or live) then updated one candle at a time via closed-form recursive rules. The output at any bar T is identical whether we got there by ingesting the whole history in batch or one candle at a time in streaming — by construction, not by testing.
This document proposes the architecture, the rollout plan, and the validation criteria. Implementation is a dedicated child issue, not this document.
2. Problem statement
2.1 Evidence
- PR #609 regression test: 15 columns drift at T=1000 on synthetic OHLCV (atr_1h_zscore NaN vs -1.54; atr_1h_price 4.6% drift; trend_strength_6h 1.1%; ADX_48, RSI_48, MACD_* all non-zero).
- Shadow run ftf_20260420_133512_8328fb: 8 first-divergence events observed across OP/AAVE/LDO with same_bar_prob_drift, |mono_p_buy − pipe_p_buy| up to 0.12, gap_to_test_end_bars 4556–5853 (drift scatters across the test window, not at boundary).
- Trade-count divergence: monolith vs pipeline open ±1 to ±3 trades per crypto per fold.
2.2 Root cause
CVNTrade_Enrich.process() is a pure batch function: it takes a
DataFrame, computes indicators on that DataFrame from scratch, returns
the enriched DataFrame. Calling it with a partial history produces
results that reflect only that partial history.
Indicator classes affected:
| Class | Memory | Example | Warmup needed |
|---|---|---|---|
| SMA / Bollinger | finite, N bars | SMA_20 | N bars |
| Wilder smoothing | unbounded | RSI, ADX, ATR | full history, never converges exactly |
| EMA / MACD | unbounded | EMA_50, MACD | full history |
| Multi-timeframe resamples | depends on tf | atr_1h_*, trend_strength_6h | N × (higher_tf / base_tf) bars minimum |
| Rolling quantiles | finite N, but needs the ring | *_zscore | N bars |
| Custom xgb_* features | case-by-case | xgb_mining_pressure_* | individual analysis |
For rolling-500 vs full-history on 15m candles:
- RSI_48, ADX_48, Wilder ATR: technically never exactly equal to
batch (Wilder smoothing has geometric memory), differences
measurable
- EMA_50, MACD_24_64_18: same — EMA recursion never forgets
- atr_1h_zscore: requires enough candles to compute 1h-resampled
ATR + its rolling z-score. On 15m base, 1h = 4 bars, z-score over
N=50 → 200+ bars minimum just for the window, plus the ATR warmup
on the 1h axis. 500 candles is not enough; hence NaN.
- trend_strength_6h: 6h = 24 bars of 15m, over a long-horizon
indicator → even more warmup.
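
The geometric-memory argument above can be checked numerically. The sketch below is a minimal, self-contained illustration and not the project's code: it applies Wilder-style smoothing with a 48-bar period to a synthetic close series, once over the full history and once over the last 500 bars only, then compares the two values at T=1000. The synthetic series, the helper name recursive_smooth and the seed-with-first-value convention are assumptions made for the example.

```python
import math


def recursive_smooth(values, alpha):
    """Recursive smoothing s_t = alpha*x_t + (1-alpha)*s_{t-1}, seeded with values[0]."""
    s = values[0]
    for x in values[1:]:
        s = alpha * x + (1.0 - alpha) * s
    return s


# Synthetic closes (illustrative only): slow trend plus a slow oscillation.
closes = [100.0 + 0.05 * t + 5.0 * math.sin(t / 40.0) for t in range(1001)]

T = 1000
alpha = 1.0 / 48.0  # Wilder smoothing factor for a 48-bar period

full_history = recursive_smooth(closes[: T + 1], alpha)         # batch path
rolling_500 = recursive_smooth(closes[T - 500 : T + 1], alpha)  # rolling-window path

drift = abs(full_history - rolling_500)
# Weight still carried by everything older than the 500-bar window.
residual_weight = (1.0 - alpha) ** 500
print(f"drift={drift:.3e}  residual_weight={residual_weight:.3e}")
```

The residual weight is small but never zero, which is why widening the window shrinks the drift without ever removing it.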
2.3 Why "two paths" is the wrong answer
Considered and rejected in PR #609 discussion: make the pipeline runner
call enrich_batch on the full history in backtest, keep
enrich_streaming rolling window only for paper/live. This was
rejected by the user during the design discussion of 2026-04-20 with
the decisive argument:
"if we have 2 different paths between the backtest and trading, we are backtesting nothing, that's stupid" (translated from French)
This is correct. Backtest must simulate live. If the enrichment code used in backtest diverges from the enrichment code used in live, the backtest predicts a version of the strategy that will never run in production. The drift we see today between monolith and pipeline would become invisible drift between backtest and live, which is worse, not better.
Non-negotiable design constraint: one enrichment code path, shared by backtest, paper trading and live trading.
3. Design principles
P1: One code path
All callers (backtest engine, pipeline runner, paper trading adapter, live adapter) go through the same enrichment engine. The only parameter that varies between modes is the initial state: backtest bootstraps from the full test history at run start; live bootstraps from a fetched chunk of recent candles at startup; both then update candle-by-candle.
P2: Stateful by design, update rule is closed-form
Each indicator exposes an update rule of the form (state', output) = update(state, candle)
(split into separate update and emit calls in the v4 contract, §4.2),
where state carries just enough information to produce the correct
output at the next candle (EMA carries its previous value; SMA
carries a ring buffer; Wilder carries its smoothed running values;
multi-tf carries the current higher-tf partial bar + the state of the
higher-tf indicator engine).
The rule must be closed-form — no recomputing from history at each candle (the O(T²) trap). For indicators where no closed-form exists, we either keep a ring buffer (finite-window indicators) or approximate explicitly with a documented tolerance (rare case, to avoid if possible).
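
As a rough illustration of the two kinds of state described above, the sketch below shows an O(1) EMA step (state is one float) next to a ring-buffer SMA (state is the last N prices). The names ema_update and SmaRing are hypothetical and chosen for this example only; the NaN-until-full behaviour is an assumption mirroring a rolling-mean warmup.

```python
from collections import deque


def ema_update(prev_ema, price, alpha):
    """Closed-form EMA step: the whole state is the previous EMA value."""
    return alpha * price + (1.0 - alpha) * prev_ema


class SmaRing:
    """SMA over the last n prices: state is a ring buffer plus a running sum."""

    def __init__(self, n):
        self.buf = deque(maxlen=n)
        self.total = 0.0

    def update(self, price):
        if len(self.buf) == self.buf.maxlen:
            self.total -= self.buf[0]  # remove the price about to be evicted
        self.buf.append(price)         # deque(maxlen=n) drops the oldest item
        self.total += price

    def emit(self):
        # Assumed convention: NaN until the window is full.
        if len(self.buf) < self.buf.maxlen:
            return float("nan")
        return self.total / len(self.buf)


sma = SmaRing(3)
for price in [1.0, 2.0, 3.0, 4.0]:
    sma.update(price)
print(sma.emit(), ema_update(10.0, 20.0, 0.5))
```

One design caveat: a running sum can differ from a freshly computed sum in the last bits, so an implementation that targets bit-identical parity may prefer to re-sum the buffer at emit time, at O(N) cost.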
P3: Bootstrap is the only bridge to batch
Enricher.bootstrap(history_df) is called once at startup. It walks
the history candle by candle internally (or batch-computes where
equivalent) and emerges with a state that is identical to what an
infinite streaming history would have produced. After bootstrap, the
streaming update(candle) takes over.
This means the entire _enrich_core body of today's code moves from
"orchestration of batch computations" to "bootstrap routine". The
public API gains a persistent state object.
P4: Parity is proven by construction, not by tests
The test suite will pin the parity invariant (the PR #609 test will flip from strict xfail to passing), but the real guarantee is that there is only one computation path. "Backtest consumes the state transitions in one shot; live consumes them gradually": by definition, they produce the same output at the same bar.
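
The by-construction claim can be made concrete with a toy EMA. In the sketch below (hypothetical names, synthetic prices), bootstrap is nothing more than a loop over the same update function, so consuming the history in one shot or splitting it into a bootstrap followed by incremental updates runs the exact same sequence of floating-point operations.

```python
from dataclasses import dataclass

ALPHA = 2.0 / (50 + 1)  # EMA_50 smoothing factor


@dataclass(frozen=True)
class EmaState:
    value: float


def update(state, price):
    """Single-candle transition; the only place where the EMA is computed."""
    return EmaState(ALPHA * price + (1.0 - ALPHA) * state.value)


def bootstrap(prices):
    """Walk the history candle by candle through the same update rule."""
    state = EmaState(prices[0])  # assumed seed: first price
    for price in prices[1:]:
        state = update(state, price)
    return state


# Synthetic, deterministic price series (illustrative only).
prices = [100.0 + ((t * 37) % 17) * 0.25 for t in range(2000)]

one_shot = bootstrap(prices)       # "backtest": whole history at once
gradual = bootstrap(prices[:500])  # "live": bootstrap on a prefix ...
for price in prices[500:]:
    gradual = update(gradual, price)  # ... then one candle at a time

print(one_shot == gradual)
```

The equality holds bit-for-bit here because both routes call update on the same inputs in the same order.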
P5: Deterministic and reproducible
Bootstrap is deterministic: same history_df in → same initial
state out. Updates are deterministic: same (state, candle) in →
same (state', output) out. Consequence: a paper-live replay from a
known starting state reproduces the same decisions bit-for-bit.
P6: State is an MLflow-logged artifact (optional, phase 2)
Just as CUSUM sigma is version-pinned in MLflow today, the bootstrapped enrichment state could be persisted per model version. Not required for v1 (we re-bootstrap from history each run) but it opens up: faster paper/live startup, auditable state, offline replay from a pinned state. Deferred to a follow-up issue by the §11 decision on state serialisation.
P7: Fail-fast on divergence (with budget)
During the migration, both code paths coexist briefly. We ship a
shadow-comparison assertion: on each candle, if the stateful output
differs from the legacy batch output by more than a documented
tolerance (1e-9 relative, per §11.1 decision), we log a WARN.
Divergence budget (committee recommendation 11): the run blocks
if more than 5 WARNings per feature per fold fire. This keeps
isolated float-non-associativity noise tolerable while guaranteeing
any systematic drift is caught immediately. Past that budget, the
run emits CRIT enrichment_shadow_budget_exceeded and fails fast
(ADR-25).
Continuous shadow in production (optional, flag-gated): a
CVN_ENRICHMENT_SHADOW_MODE env var can keep the comparison active
on live trading beyond the migration window — useful as a long-term
drift sentinel. Defaults off after migration closes; ops can flip
it to on targeted at a single crypto for investigation without a
redeploy.
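
A minimal sketch of the budget rule, assuming a per-(fold, feature) counter. The class and exception names are hypothetical, and logging is reduced to print for brevity; only the tolerance (1e-9 relative), the budget of 5 and the two event names come from the text above.

```python
import math
from collections import Counter

REL_TOL = 1e-9  # relative tolerance quoted in the text
BUDGET = 5      # WARNs tolerated per (fold, feature) before the run blocks


class ShadowBudgetExceeded(RuntimeError):
    """Raised when one feature exceeds its divergence budget within a fold."""


class ShadowComparator:
    def __init__(self, rel_tol=REL_TOL, budget=BUDGET):
        self.rel_tol = rel_tol
        self.budget = budget
        self.warns = Counter()  # (fold, feature) -> number of WARNs so far

    def check(self, fold, feature, stateful, legacy):
        """Return True when the two values agree within tolerance."""
        if math.isnan(stateful) and math.isnan(legacy):
            return True  # both paths agree that the value is undefined
        if math.isclose(stateful, legacy, rel_tol=self.rel_tol, abs_tol=0.0):
            return True
        key = (fold, feature)
        self.warns[key] += 1
        print(f"WARN enrichment_shadow_divergence fold={fold} feature={feature}")
        if self.warns[key] > self.budget:
            print(f"CRIT enrichment_shadow_budget_exceeded fold={fold} feature={feature}")
            raise ShadowBudgetExceeded(
                f"{feature} exceeded {self.budget} WARNs in fold {fold}"
            )
        return False
```

With these settings the first five divergences of a feature only log a WARN; the sixth raises.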
P8: Shared state, per-user decisions
Enrichment state is deterministic from OHLCV: identical input +
identical indicator rules → identical state. Consequence: the
enricher state is shared across all users who consume the same
(symbol, timeframe, feature_set_version). It is not a per-user
resource. Per-user customisation happens strictly downstream of
enrichment — at inference threshold, PTE, sizing, filters and
execution. This holds today (stateless recompute already produces
the same features for any two users of the same feature set) and is
preserved by the refactor; the stateful engine does not change the
multi-tenancy semantics.
The state key is therefore (symbol, timeframe, feature_set_version),
not (user_id, …). One enricher instance per unique key observed,
serving all concurrent users of that key. See §8ter for the full
separation of concerns.
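
One possible shape for the "one instance per key" rule is a process-local registry. This is only a sketch: EnricherRegistry, the factory argument and the version string v3-def456 are hypothetical, and eviction is left out.

```python
from typing import Callable, Dict, Tuple

EnricherKey = Tuple[str, str, str]  # (symbol, timeframe, feature_set_version)


class EnricherRegistry:
    """Process-local registry: one enricher instance per state key."""

    def __init__(self, factory: Callable[[EnricherKey], object]):
        self._factory = factory
        self._instances: Dict[EnricherKey, object] = {}

    def get(self, symbol: str, timeframe: str, feature_set_version: str) -> object:
        key = (symbol, timeframe, feature_set_version)
        if key not in self._instances:
            # First user of this key: build (and, in practice, bootstrap) once.
            self._instances[key] = self._factory(key)
        return self._instances[key]


# Stand-in factory; a real one would build and bootstrap an enricher.
registry = EnricherRegistry(factory=lambda key: {"key": key})

user_a = registry.get("BTCUSDC", "15m", "v2-abc123")
user_b = registry.get("BTCUSDC", "15m", "v2-abc123")  # same key, same instance
user_c = registry.get("BTCUSDC", "15m", "v3-def456")  # other feature set, own instance
print(user_a is user_b, user_a is user_c)
```

Note that user_id never appears in the key, which is the point of P8.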
4. Architecture
4.1 Component layout
```
src/commun/pipeline/
    enrichment_api.py             (thin public API, unchanged surface)
    enrichment/
        state.py                  (EnrichmentState dataclass, serialisable)
        engine.py                 (StatefulEnricher: bootstrap + update)
        indicators/
            simple_recursive.py   (EMA, MACD, Wilder smoothing)
            ring_buffer.py        (SMA, rolling std, quantiles)
            multi_timeframe.py    (resample + sub-engine per tf)
            xgb_custom.py         (xgb_* features, per-feature)
```
EnrichmentAPI becomes a facade over StatefulEnricher:
```python
from typing import Optional

import pandas as pd

# EnrichmentState and StatefulEnricher come from enrichment/state.py and
# enrichment/engine.py (§4.1).


class EnrichmentAPI:
    def __init__(self, config: EnrichmentConfig):
        self._engine = StatefulEnricher(config)
        self._state: Optional[EnrichmentState] = None

    def enrich_batch(self, ohlcv: pd.DataFrame) -> EnrichmentResult:
        """Bootstrap + emit all rows. Used by backtest at run start."""
        self._state, rows = self._engine.bootstrap_and_emit_all(ohlcv)
        return EnrichmentResult(features=rows, ...)

    def enrich_incremental(self, candle: pd.Series) -> pd.Series:
        """Single-candle update. Used by pipeline runner / paper / live."""
        if self._state is None:
            raise RuntimeError("must bootstrap() before enrich_incremental()")
        self._state, row = self._engine.update(self._state, candle)
        return row

    def bootstrap(self, history: pd.DataFrame) -> None:
        """Explicit bootstrap; used by paper/live after fetching history."""
        self._state, _ = self._engine.bootstrap_and_emit_all(history)
```
The deprecated enrich_streaming(window) stays during migration but
emits a DeprecationWarning and internally runs self.bootstrap(window.iloc[:-1])
followed by return self.enrich_incremental(window.iloc[-1]), so that the last candle is ingested exactly once. Removed in phase 4.
4.2 Indicator contract (v4: update and emit separated)
```python
from typing import Protocol

import pandas as pd


class Indicator(Protocol):
    name: str  # column key in output

    def bootstrap_state(self, history: pd.DataFrame) -> IndicatorState:
        """Compute initial state from a batch of candles."""
        ...

    def update(self, state: IndicatorState, candle: pd.Series) -> IndicatorState:
        """Ingest one candle, return the updated state. MANDATORY on every
        candle for R/B classes: skipping corrupts state. No-op for class S.
        """
        ...

    def emit(self, state: IndicatorState) -> float:
        """Read the current value from state. O(1), no recomputation.
        Called lazily (only on CUSUM-pass candles for compute economy)."""
        ...
```
v4 rationale: the v3 contract conflated update → (state', value)
in a single call. That was too coarse — stateful correctness demands
update on every candle (including CUSUM-filtered ones) while emit
is pure output and can be skipped when the value will not be consumed.
Splitting resolves the tension without adding a feature flag.
Every existing indicator is ported to this contract. Where a closed-form update exists (EMA, Wilder), we use it. Where only a finite-window dependency exists (SMA-N), state is a deque of the last N prices. For multi-timeframe, state includes the current partial higher-tf bar + the sub-indicator state (recursive structure).
Phase 1 shipped (2026-04-21) with the EMA stub implementing this contract
and a StatefulEnricher engine that orchestrates the three ops. Phase 2
ports the rest of the inventory class by class.
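
To show what a port to this contract might look like for a Wilder-family indicator, here is a hedged sketch of a Wilder-smoothed true range. It is not the shipped code: candles are plain dicts and the history is a list (instead of pd.Series / pd.DataFrame) to keep the example dependency-free, and seeding the first ATR with the first candle's high-low range is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WilderAtrState:
    prev_close: float
    atr: float


class WilderAtr:
    """Wilder-smoothed true range using the bootstrap_state / update / emit split."""

    def __init__(self, period):
        self.name = f"ATR_{period}"
        self.period = period

    @staticmethod
    def _true_range(candle, prev_close):
        return max(
            candle["high"] - candle["low"],
            abs(candle["high"] - prev_close),
            abs(candle["low"] - prev_close),
        )

    def bootstrap_state(self, history):
        # Assumed seeding convention: first ATR = high - low of the first candle.
        first = history[0]
        state = WilderAtrState(first["close"], first["high"] - first["low"])
        for candle in history[1:]:
            state = self.update(state, candle)
        return state

    def update(self, state, candle):
        tr = self._true_range(candle, state.prev_close)
        atr = state.atr + (tr - state.atr) / self.period  # Wilder recursion, O(1)
        return WilderAtrState(prev_close=candle["close"], atr=atr)

    def emit(self, state):
        return state.atr  # pure read: no effect on future bars


# Synthetic candles (illustrative only).
candles = [
    {"high": 101.0 + (i % 7), "low": 99.0 - (i % 3), "close": 100.0 + (i % 5)}
    for i in range(200)
]
atr = WilderAtr(14)
full = atr.bootstrap_state(candles)
partial = atr.bootstrap_state(candles[:120])
for candle in candles[120:]:
    partial = atr.update(partial, candle)
print(atr.emit(full) == atr.emit(partial))
```

Because bootstrap_state reuses update, the per-indicator parity check reduces to comparing the two final states.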
4.3 CUSUM / Enrichment ordering (v4: explicit)
The post-refactor pipeline reorders CUSUM to sit after enrichment's
update but before enrichment's emit:
```
OHLCV candle T
      │
      ▼
[Enrichment.update]   runs for EVERY candle, O(1), keeps state correct
      │
      ▼
[CUSUM gate]          reads returns (raw OHLCV), gates downstream
      │
      ├── non-event → skip emit + FE + inference
      │
      └── event → continue
      ▼
[Enrichment.emit]     reads state, ~free for R/B, full compute for S
      │
      ▼
[FE + Inference + LdP v2 chain]
```
Why the reorder vs today's CUSUM → enrich_streaming order:
- Today's enrich_streaming is stateless-per-call (rebuilds from a rolling 500-bar window on every invocation). Skipping it on CUSUM-filtered candles was a valid compute saving because re-entering later would not notice the skip.
- Post-refactor, enrichment is stateful. Skipping update on even a single candle means the EMA / Wilder / ring-buffer state diverges from the "would-have-been" state if the candle had been seen. The drift accumulates and breaks the cross-mode parity claim (§P4).
- Because update is O(1) per candle in the stateful design, running it unconditionally costs effectively nothing. The compute saving that justified the old order has evaporated.
- emit remains skippable because it has no side effect on future bars: it only produces the value the downstream consumer wants. Skipping it on CUSUM-filtered candles keeps the savings for FE + inference + filter chain.
This reorder was implicit in the v3 design and was made explicit
on 2026-04-21 after operator feedback. Phase 2 call sites MUST honour this
ordering: an integration test in Phase 2 will fail any call site that
skips update.
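
The ordering can be illustrated with a toy loop. The sketch below assumes a simplified symmetric CUSUM filter on simple returns and an EMA as the only indicator; the threshold, the EMA span and the synthetic price series are illustrative, and CusumGate is a hypothetical name. It tracks two states side by side: one updated on every candle (the required order) and one updated only on CUSUM events (the order to avoid).

```python
import math

ALPHA = 2.0 / (20 + 1)  # EMA_20, illustrative
H = 0.02                # CUSUM threshold on cumulative returns, illustrative


def ema_update(state, price):
    return ALPHA * price + (1.0 - ALPHA) * state


class CusumGate:
    """Simplified symmetric CUSUM filter on simple returns (assumed form)."""

    def __init__(self, threshold):
        self.h = threshold
        self.s_pos = 0.0
        self.s_neg = 0.0

    def is_event(self, ret):
        self.s_pos = max(0.0, self.s_pos + ret)
        self.s_neg = min(0.0, self.s_neg + ret)
        if self.s_pos > self.h or self.s_neg < -self.h:
            self.s_pos = self.s_neg = 0.0  # reset after firing
            return True
        return False


closes = [100.0 + 3.0 * math.sin(t / 5.0) for t in range(300)]

gate = CusumGate(H)
state_always = closes[0]  # updated on every candle
state_gated = closes[0]   # updated only on CUSUM events
emitted = []

for prev, price in zip(closes, closes[1:]):
    state_always = ema_update(state_always, price)  # 1. update, unconditionally
    if not gate.is_event(price / prev - 1.0):       # 2. CUSUM gate
        continue                                    #    non-event: skip emit and downstream
    state_gated = ema_update(state_gated, price)
    emitted.append(state_always)                    # 3. emit: pure read of the state

print(len(emitted), "events; state gap:", abs(state_always - state_gated))
```

The non-zero gap between the two states is the drift that an update-skipping call site would introduce.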
4.4 Call patterns
Backtest (monolith):
```python
engine.bootstrap(full_ohlcv)  # O(T), once at init
for t in range(len(ohlcv)):
    row = features.iloc[t]  # already computed at bootstrap
    ...
```
Backtest (pipeline runner):
```python
engine.bootstrap(full_ohlcv_up_to_test_start)  # same O(T) call
for t in test_range:
    row = engine.enrich_incremental(ohlcv.iloc[t])  # O(1) per candle
    ...
```
Because both paths share engine.bootstrap + update, the output at
bar T is identical (P4).
Paper / live:
```python
history = fetch_recent_candles(crypto, n=WARMUP_BARS)
engine.bootstrap(history)
while True:
    candle = await next_candle()
    row = engine.enrich_incremental(candle)
    ...
```
Same bootstrap + update. WARMUP_BARS is chosen from the per-indicator
analysis (§6 Phase 0 inventory) to guarantee convergence.
5. Scope
5.1 In scope
- CVNTrade_Enrich (tech indicators) → port every indicator to the Indicator contract
- CVNTrade_XGBoostFeatureGenerator (xgb_* features) → same
- EnrichmentAPI refactor (facade + state management)
- Pipeline runner's EnrichmentStep → bootstrap + incremental
- Monolith backtest engine → bootstrap + iloc lookup (unchanged externally, cleaner internally)
- PR #609 xfail test → passes after refactor
- New tests: per-indicator bootstrap/update parity, full-pipeline parity on real OHLCV, paper/live replay determinism
5.2 Out of scope (later, separate issues)
- State persistence in MLflow (P6): deferred to a follow-up issue (§11 decision)
- Paper/live warmup-during-gap strategy — separate operational issue
- Performance optimisation past the "correct" baseline — measure first
- Other pipeline stages (FE, inference, filters, signal manager) — same principle applies but each deserves its own design
- Removing the monolith backtest engine entirely (#568) — separate track
5.3 Explicitly not changing
- Public signature of EnrichmentAPI.enrich_batch: backward compatible
- Output DataFrame shape: same columns, same order
- Downstream consumers (FE pipeline, inference) — they see the same rows
- MLflow model artifact layout — unchanged
6. Rollout plan
Phase 0: Inventory (2–3 days, senior engineering time required; committee rec. 13)
- Enumerate every column produced by today's CVNTrade_Enrich and CVNTrade_XGBoostFeatureGenerator
- For each, classify: simple recursive / ring buffer / multi-tf / custom / trivially stateless
- For every custom xgb_* feature: confirm whether the current computation is genuinely incremental-compatible, approximate, or inherently batch (if any). Flag the inherently-batch cases as blockers for a re-scoping discussion before Phase 2 starts.
- Determine WARMUP_BARS per indicator class (for §9 runtime validation)
- Output: a spreadsheet (committed to documentation/architecture/ENRICHMENT_INDICATORS.md) that the implementation team ticks off as each indicator is ported
Phase 1: Core engine ✅ DONE 2026-04-21
- EnrichmentState dataclass: pure functional, set() returns new instance
- Indicator Protocol: bootstrap_state / update / emit separated (v4 contract)
- StatefulEnricher engine: register, bootstrap, update, emit, bootstrap_and_emit_all
- EMA stub indicator: textbook recurrence, parity with pd.Series.ewm at atol=1e-12
- Parity harness: assert_indicator_parity helper, reusable one-liner per feature for Phase 2
- EnrichmentAPI facade flag dispatch: CVN_ENRICHMENT_STATEFUL_ENABLED ∈ {0, shadow, 1}; 0 is the legacy zero-cost path, shadow runs both and logs divergences, 1 is reserved for Phase 3 (raises in Phase 1)
- Shadow-comparison harness (§P7): structured log events (enrichment_shadow_divergence, _ok, _no_overlap, _budget_exceeded), per-call divergence budget enforcement, defensive exception catching so shadow failures never take down a run
- 63/63 tests green (engine, EMA parity, shadow harness, facade dispatch; original 57 + 6 added during CR rounds 1-2)
- Zero behaviour change in default mode: legacy path pixel-identical
Files shipped:
- src/commun/pipeline/enrichment/{state,protocol,engine,shadow}.py
- src/commun/pipeline/enrichment/indicators/ema.py
- tests/unit/pipeline/enrichment/{test_engine,test_ema,test_shadow,test_facade_dispatch}.py
- tests/unit/pipeline/enrichment/parity_harness.py
- src/commun/pipeline/enrichment_api.py (facade hook added)
Phase 2: Port indicators by class (2–3 weeks)
- Simple recursive (EMA, MACD, Wilder family) first — small, closed-form
- Ring-buffer next (SMA, rolling quantiles)
- Multi-timeframe last — most complex, depends on the above
- Custom xgb_* alongside as they typically wrap the above
- Each port: unit test (bootstrap batch output == incremental output on the same history), integration test (full pipeline parity)
- The shadow-comparison harness stays green throughout
Phase 3: Flip the flag (1 day)
- Set CVN_ENRICHMENT_STATEFUL_ENABLED=1 in ftf_config.base_env (Console snapshot)
- Run a full FTF over the 3 exploration cryptos, verify the triptych matches the pre-flip baseline (tolerance: 1e-6 on feature values, identical trade decisions)
- Keep the old path reachable via CVN_ENRICHMENT_STATEFUL_ENABLED=0 for 1 sprint as rollback
Phase 4: Decommission (1 day, 1 sprint after flip)
- Remove the old _enrich_core batch-recompute path
- Drop enrich_streaming(window) deprecated alias
- Flip #609 xfail → passing
- Update EnrichmentAPI docstring: the invariant is now true by construction
Total wall-clock: 6–8 weeks for 1 FTE (committee rec. 1 — the initial 4-week estimate was too aggressive given xgb_* and multi-tf complexity). Alternative: 1.5–2 FTE for 4 weeks. Default plan assumes 1 FTE × 7 weeks.
7. Validation criteria
7.1 Automated
- test_batch_full_vs_streaming_rolling_window_reproduces_shadow_drift (#609): xpass
- New test_bootstrap_then_incremental_equals_batch: bootstrap on history[:N], then apply update candle-by-candle from N to T, compare final state to a fresh bootstrap_and_emit_all(history[:T]). Must be bit-identical.
- New test_paper_live_replay_deterministic: from a pickled EnrichmentState, replay a candle stream, assert outputs bit-identical to a full-history bootstrap.
- Per-indicator unit tests covering bootstrap, update, edge cases (gap in data, first candle, state carrying NaN).
- Shadow-comparison assertion (§P7) fires zero WARN on a week of production backtest runs.
7.2 Functional
- A full FTF run (3 cryptos × 5 folds × 3 variants) produces identical p(BUY) distributions pre-flip and post-flip (KS test p > 0.99).
- Shadow divergence (#599) dashboard stops firing pipeline_shadow_first_divergence events with side=same_bar_prob_drift on all exploration cryptos.
- Monolith-vs-pipeline trade count difference → zero on a full FTF run.
7.3 Operational
- Backtest performance: no regression > 10% wall-clock per fold.
- Memory footprint: EnrichmentState serialised size documented, no pod OOM at steady state.
- Paper/live bootstrap time: documented, < 30 s for a typical WARMUP_BARS value.
7.4 Performance profiling (committee rec. 9)
- Per-indicator update() profiled on production-like OHLCV: report median, p95 and p99 latency per class (recursive / ring / multi-tf / xgb_*).
- End-to-end per-candle latency budget documented (what we promise paper/live).
- Any indicator whose update() exceeds the documented budget gets a flagged follow-up issue before Phase 4 cutover; this is not a blocker if the impact is bounded.
7.5 Edge case coverage (committee rec. 10)
- Data gaps (missing candles, exchange outage): state update must tolerate time-gaps without corruption; test with synthetic gaps of 1, 5, 50 candles.
- NaN propagation: any NaN in an indicator output must be isolated to that indicator's column (not contaminate downstream).
- First-candle bootstrap: the very first update() after a bootstrap() must produce the correct row, not a transient artifact.
- Late candle replay (out-of-order arrival): documented behaviour, test covered.
- Empty bootstrap (zero history): should raise an explicit error, not fail silently.
8. Why this is the right fix
| Alternative considered | Why rejected |
|---|---|
| A — pre-compute multi-tf features batch-only and inject | still "one path in backtest, a different one in live" — doesn't fix live |
| B — extend the rolling window until warmup is sufficient | impossible for EMA / Wilder (unbounded memory); multi-tf would need windows of thousands of bars per higher-tf, making the approach pathological |
| C — monolith's enriched DataFrame shared with pipeline on backtest path only | kills the backtest-live equivalence (user veto, 2026-04-20) |
| E — fully stateless, recompute full history on every candle | O(T²) wall-clock (~110 h/fold on 6k bars vs 25 min today) — not viable |
| F — fully stateless, approximate EMA/Wilder with bounded windows | reintroduces drift by construction — same bug class we are fixing |
| D (this proposal) — incremental engine with explicit state threading, everywhere | aligns backtest and live by construction; unit-testable per indicator; no warmup-window tuning; standard pattern in streaming trading systems |
8bis. Scaling semantics and multi-tenancy
The word "stateful" invites a scaling objection ("can't scale if it's stateful"). The objection conflates two different meanings of state:
- Distributed shared state (bad): a cluster-wide store with locking, consensus, eventual consistency headaches. Not this design.
- Explicit local state threaded through pure functions (good): new_state, output = update(old_state, candle). The state is a local variable, owned by the caller, passed in and out. Same pattern as Redux, Kafka Streams operators, Flink operator state, or any streaming trading engine (QuantConnect, Backtrader streaming, TradingView Pine). This IS the design.
What parallelises and what does not:
| Dimension | Scales? | How |
|---|---|---|
| Different symbols (BTC / ETH / SOL) | Linearly | One enricher instance per symbol, zero cross-talk |
| Different folds in FTF backtest | Linearly | Each fold bootstraps its own enricher — already the case in today's FTF multiprocessing |
| Different feature set versions | Linearly | One instance per (symbol, timeframe, feature_set_version) |
| Concurrent users with the same feature set | Share one instance (§P8) | Features are deterministic → identical output → no reason to duplicate state; users diverge downstream |
| Pod failover / restart | Bootstrap is idempotent | Re-fetch history → re-bootstrap → steady state restored. Current paper/live already does this every restart. |
| Multiple replicas of the same enricher | N/A by design | Per (symbol, timeframe, feature_set) there is no value in running two replicas; the state would diverge on first candle due to network-ordering non-determinism. Scale-out goes by additional keys, not by additional replicas per key. |
What does not parallelise — and this is mathematics, not engineering:
- Within a single stream (one symbol, time advancing): bar T depends on bar T-1 via the EMA recursion ema_t = α·x_t + (1−α)·ema_{t−1}. No stateless reformulation can break this dependency chain without either recomputing from scratch (option E, O(T²)) or approximating (option F, drift). The sequential constraint is intrinsic to the indicator family we use.
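
Unrolling the recursion W steps makes the dependency explicit. The derivation below uses the same symbols as the recursion above; W (window length) and the "full" / "window" superscripts are notation introduced here, and the last line assumes the windowed computation is seeded with the first price of the window, x_{T-W}.

```latex
\begin{aligned}
\mathrm{ema}_t
  &= \alpha x_t + (1-\alpha)\,\mathrm{ema}_{t-1} \\
  &= \alpha \sum_{k=0}^{W-1} (1-\alpha)^k x_{t-k} + (1-\alpha)^W\,\mathrm{ema}_{t-W} \\[4pt]
\mathrm{ema}_T^{\text{full}} - \mathrm{ema}_T^{\text{window}}
  &= (1-\alpha)^W \left( \mathrm{ema}_{T-W}^{\text{full}} - x_{T-W} \right)
\end{aligned}
```

The factor (1−α)^W decays geometrically but is never zero for finite W, which is the formal version of "EMA recursion never forgets".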
8ter. Who-owns-what (multi-tenant architecture)
```
OHLCV BTCUSDC 15m
             │
             ▼
┌────────────────────────────────────────┐
│ Enricher: state key = (BTCUSDC, 15m,   │  ← shared across all users
│   feature_set_version=v2-abc123)       │    of this feature set
│ • EMA/Wilder/multi-tf state            │
│ • ring buffers, partial higher-tf bar  │
└────────────┬───────────────────────────┘
             │
             ▼
features_df (deterministic, shared output)
             │
             ├──► User A: model → threshold(mode=A) → PTE_A → Kelly_A → exec account A
             ├──► User B: model → threshold(mode=B) → PTE_B → Kelly_B → exec account B
             └──► User C: different trained model (→ different feature_set_version
                          → own enricher instance)
```
Shared layer (enricher state):
- (symbol, timeframe, feature_set_version) tuple → one instance
- deterministic from OHLCV
- memory cost: ≈ a few MB per instance (a few floats per indicator, plus ring buffers of ~50 × timeframe periods × 8 B)
- lifecycle: bootstrap at startup, survives as long as the key is used, discarded when no user references it

Per-user layer (downstream of enrichment):
- threshold mode (#608), PTE, Kelly fraction, filter flags
- active positions, PnL, account state
- exchange API keys, routing decisions
- none of this touches the enricher
How users opt into a different enrichment: by picking a different
trained model. A different model = a different
feature_set_version = a different enricher key = its own instance.
This is already how the system works today (model artifacts are
version-pinned in MLflow per ADR-23) and this refactor does not
change that boundary.
8quater. Memory budget worked example
Let us put numbers on "a few MB per key". Assume a worst-case indicator set: 20 Wilder/EMA indicators (each ≈ 3 float state values), 10 ring-buffer indicators (each ≈ N=50 floats), 4 multi-timeframe engines (each ≈ a second-level state of the same size), and a pandas ring of the last 2000 bars (for the output DataFrame shape consumers expect).
- 20 × 3 × 8 B = 480 B
- 10 × 50 × 8 B = 4 kB
- 4 × (480 B + 4 kB) ≈ 18 kB
- Output ring of 2000 bars × (say) 150 features × 8 B = 2.4 MB

Total ≈ 2.5 MB per (symbol, timeframe, feature_set) key. For 10
cryptos × 1 timeframe × 1 feature_set, that is ≈ 25 MB; for 100 cryptos, ≈ 250 MB.
Within reach of any single pod. If we ever cross the threshold
(thousands of concurrent feature sets?) the state key becomes a
natural sharding axis.
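
The same arithmetic as a small script, so the assumptions can be changed and re-run. All counts are the worst-case figures assumed in the text; the variable names are made up for the example.

```python
FLOAT_BYTES = 8

recursive_bytes = 20 * 3 * FLOAT_BYTES             # 20 Wilder/EMA indicators, 3 floats each
ring_bytes = 10 * 50 * FLOAT_BYTES                 # 10 ring buffers of N=50 floats
multi_tf_bytes = 4 * (recursive_bytes + ring_bytes)  # 4 higher-tf sub-engines
output_bytes = 2000 * 150 * FLOAT_BYTES            # last 2000 bars x 150 features

per_key_bytes = recursive_bytes + ring_bytes + multi_tf_bytes + output_bytes

for n_keys in (1, 10, 100):
    print(f"{n_keys:>3} key(s): {n_keys * per_key_bytes / 1e6:.1f} MB")
```

The exact total is slightly under the rounded 2.5 MB figure, and the output ring dominates it by two orders of magnitude, so the indicator state itself is negligible.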
9. Risks
9.1 Implementation risks
- Indicator-by-indicator regressions: one mis-ported indicator silently produces different values → caught by the shadow-comparison harness (P7) and per-indicator unit tests. Mitigated by ticking off the inventory one class at a time.
- Multi-timeframe subtleties: resample boundaries (e.g. a 1h bar closes at HH:00, partial bar at start) are bug-prone. Mitigated by dedicated multi-tf test harness with synthetic data and realistic gap cases.
- Custom xgb_* features that don't fit any class: may require bespoke porting per feature. Inventory phase surfaces these early.
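
To make the resample-boundary risk concrete, here is a hedged sketch of the partial-bar bookkeeping for a 15m → 1h aggregation. HigherTfAggregator is a hypothetical name; timestamps are assumed to be candle open times in epoch seconds, and an unfinished bar is simply dropped when a new hour starts (one possible gap policy among several).

```python
BASE_SEC = 900     # 15m base candles
HIGHER_SEC = 3600  # 1h higher timeframe


class HigherTfAggregator:
    """Builds 1h bars from 15m candles; returns a bar only when it closes."""

    def __init__(self):
        self.partial = None  # current unfinished higher-tf bar, or None

    def update(self, ts, o, h, l, c):
        """Ingest one base candle (ts = open time in epoch seconds).

        Returns the closed higher-tf bar as a dict, or None while the bar
        is still partial.
        """
        bucket = ts // HIGHER_SEC
        if self.partial is None or self.partial["bucket"] != bucket:
            # New higher-tf bar; an unfinished previous bar is dropped here.
            self.partial = {"bucket": bucket, "open": o, "high": h, "low": l, "close": c}
        else:
            bar = self.partial
            bar["high"] = max(bar["high"], h)
            bar["low"] = min(bar["low"], l)
            bar["close"] = c
        if (ts + BASE_SEC) % HIGHER_SEC == 0:  # this candle ends on the hour
            closed, self.partial = self.partial, None
            return closed
        return None


agg = HigherTfAggregator()
candles_15m = [
    (0, 10.0, 11.0, 9.5, 10.5),
    (900, 10.5, 12.0, 10.0, 11.5),
    (1800, 11.5, 11.8, 9.0, 9.8),
    (2700, 9.8, 10.2, 9.6, 10.0),
]
closed_bars = [agg.update(*candle) for candle in candles_15m]
print(closed_bars)
```

The sub-indicator on the 1h axis would then call its own update only when a closed bar is returned.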
9.2 Timing risks
- Underestimated scope: the initial 4-week estimate was optimistic for a team of 1, which is why §6 now plans 6–8 weeks; the xgb_* custom features may still hide many exotic cases. Inventory phase will confirm.
- P0-A tuning pause: we shouldn't freeze lever measurements for the duration of the refactor. Mitigation: P0-A continues with the old enrichment (today's state); the stateful refactor lands behind a flag and is validated in parallel; the FTF triptych comparison post-flip confirms no behavioural shift.
9.3 Residual risks post-delivery
- Backtest-live parity now depends on the bootstrap history being "the same" both times. If live fetches a shorter warmup than backtest's bootstrap, the first N candles in live still drift. Documentation must state the required WARMUP_BARS per indicator class, and live deployment must fetch at least that many.
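
A minimal sketch of the corresponding runtime guard, following the rule "WARMUP_BARS = max of the per-indicator requirements, raise by default". The per-class numbers, the function names and the exception type are hypothetical; only the log event name enrichment_bootstrap_insufficient_history comes from this document.

```python
import logging

log = logging.getLogger("enrichment")

# Hypothetical per-class requirements in base-timeframe bars; the real
# values are a deliverable of the Phase 0 inventory.
WARMUP_REQUIREMENTS = {"ring_buffer": 50, "recursive": 1000, "multi_tf_6h": 2400}


class InsufficientHistoryError(ValueError):
    """Raised when the bootstrap history is shorter than the required warmup."""


def required_warmup_bars(requirements):
    return max(requirements.values())


def check_bootstrap_history(n_bars, requirements=WARMUP_REQUIREMENTS):
    """Return the required warmup, or raise if the history is too short."""
    needed = required_warmup_bars(requirements)
    if n_bars < needed:
        log.error(
            "enrichment_bootstrap_insufficient_history have=%d need=%d",
            n_bars,
            needed,
        )
        raise InsufficientHistoryError(f"need {needed} bars, got {n_bars}")
    return needed
```

An adapter that supports auto-fetch could catch the exception, fetch the missing bars and retry.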
10. Impact on ongoing P0-A tuning
- PR #609 stays as-is (regression test, strict xfail).
- Lever #2 run completes under the current (broken) enrichment; the data is usable for within-run comparisons but not trustworthy for live-performance prediction — a caveat we've already internalised.
- After the stateful refactor flips, we re-run the P0-A baseline reference (tw=18, autocal=0, binary=0) with the new enrichment as the new ground truth. Prior runs become historical context, not comparators.
11. Decisions taken (post committee review; was "open questions" in v1/v2)
- Tolerance floor. DECISION: target bit-identical parity in unit tests. If float non-associativity forces a relaxation, document per-indicator with the smallest feasible tolerance (≤ 1e-9 relative), justify in the test, and verify the divergence has zero impact on downstream decisions (signal, threshold crossings, trade generation). Committee rec. 2.
- State serialisation / MLflow persistence (P6). DECISION: defer to a follow-up issue post-refactor. Scope this issue to the core engine only. The follow-up unlocks fast paper/live startup + auditable replay; to be prioritised immediately after Phase 4 cutover. Committee rec. 3.
- Multi-timeframe boundary policy. DECISION: keep the aligned-close policy (1h indicator updates only when the 1h bar closes). Matches today's enrich_batch semantics and standard chart conventions, so there is no backward-incompatibility risk. A future flag for partial-bar updates is tracked as a separate enhancement if a latency-sensitive strategy ever needs it. Committee rec. 4.
- Paper/live bootstrap history. DECISION: the bootstrap window is computed per-indicator from the inventory (§6 Phase 0 deliverable). WARMUP_BARS = max(per-indicator-requirements). Runtime enforcement in Enricher.bootstrap(history): if len(history) < WARMUP_BARS, log ERROR enrichment_bootstrap_insufficient_history and either (a) auto-fetch more from the adapter if supported, or (b) raise a hard exception. Default: raise; adapters can opt into auto-fetch. Committee rec. 5.
- Backward compatibility of persisted models. DECISION: strictly enforced. The refactor must produce bit-identical numerical values for all existing features so deployed models keep working without retraining. If any indicator needs a deliberate semantic change during the port, it must emit a new feature_set_version and trigger full model retraining; no in-place semantic drift. Committee rec. 6.
- API rename. DECISION: rename now. New public API is bootstrap() + enrich_incremental(). enrich_streaming() becomes a deprecation shim emitting DeprecationWarning during Phases 1–3; removed entirely in Phase 4. enrich_batch() stays (it's still the backtest startup entry point, now implemented as "bootstrap + emit-all-rows"). Committee rec. 7.
- Multi-tenant enricher lifecycle (§8ter). DECISION deferred to Phase 1 design: default to a runtime registry per process with eviction on zero references for 10 minutes. If we ever run more than one user per process concurrently, revisit.
- Replica policy (§8bis). DECISION: enforced at infrastructure level via Kubernetes StatefulSet + partition-by-key on (symbol, timeframe, feature_set_version). A Phase 4 deliverable is updating infra/helm/*/values-prod.yaml accordingly. Committee rec. 8.
- Operational safeguards (committee rec. 12). DECISION: added to the scope. Three components:
  - Runtime kill switch CVN_ENRICHMENT_PAUSE=1: freezes state updates in place, emits last-known features each candle, logs CRIT enrichment_paused_by_operator. Used when we suspect a bad state after an upstream data incident.
  - Incident response runbook documentation/runbooks/enrichment_incident.md: Phase 2 deliverable, to be written alongside the threshold_calibrator_* runbooks (#608).
  - Long-term drift monitoring: log per-indicator distributional summaries (mean, std, quantiles of outputs) to MLflow/W&B at model-training time; Grafana alert on distributional shift vs historical training window.
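To make the kill-switch semantics from the operational safeguards concrete, here is a minimal sketch of how CVN_ENRICHMENT_PAUSE=1 could interact with the separated update/emit protocol. Only the environment variable name and the log event name come from this document; the EmaIndicator and PausableEnricher classes, the per-candle check of the environment, and the on_candle signature are assumptions made for illustration.

```python
import logging
import os
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional

logger = logging.getLogger("enrichment")

PAUSE_ENV_VAR = "CVN_ENRICHMENT_PAUSE"


def enrichment_paused(environ: Mapping[str, str]) -> bool:
    """True when the operator has set the kill switch to "1"."""
    return environ.get(PAUSE_ENV_VAR) == "1"


@dataclass
class EmaIndicator:
    """Toy EMA standing in for any stateful indicator (hypothetical class)."""
    alpha: float
    value: Optional[float] = None

    def update(self, close: float) -> None:
        # The first candle seeds the state; later candles apply the recursion.
        if self.value is None:
            self.value = close
        else:
            self.value = self.alpha * close + (1.0 - self.alpha) * self.value

    def emit(self) -> Optional[float]:
        return self.value


@dataclass
class PausableEnricher:
    """Hypothetical wrapper: checks the kill switch on every candle."""
    indicators: Dict[str, EmaIndicator]
    environ: Mapping[str, str] = field(default_factory=lambda: os.environ)

    def on_candle(self, close: float) -> Dict[str, Optional[float]]:
        if enrichment_paused(self.environ):
            # State is frozen in place: no update() call while paused.
            logger.critical("enrichment_paused_by_operator")
        else:
            for indicator in self.indicators.values():
                indicator.update(close)
        # emit() runs either way, so a paused engine returns last-known features.
        return {name: ind.emit() for name, ind in self.indicators.items()}
```

Reading the variable per candle lets an operator flip the switch without restarting the process, at the cost of one mapping lookup per candle. Whether candles received during the pause should be replayed into the state on resume is not specified here and is left open in this sketch.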
12. References
- Diagnostic PR: #609 (enrichment parity test, xfail strict)
- Run evidence: ftf_20260420_133512_8328fb_ATR1.5_3.0_H5 (lever #2, shadow logs)
- Parent issue: #599
- User design veto (2026-04-20): "if we have two different paths between the backtest and trading, we are backtesting nothing"
- Relevant ADRs:
  - ADR-23 (version-pinned MLflow artifacts): extends naturally to enrichment state (P6)
  - ADR-25 (no silent fallback): applies to the shadow-comparison (fail fast, not warn-and-continue)
  - ADR-40 (paper/live same kernel, only the adapter differs): directly aligned; this refactor is the enforcement of that ADR in the enrichment stage
- Prior art: streaming indicator engines in pandas-ta, ta-lib, CCXT's pro streaming adapters, TradingView's Pine Script (bootstrap + per-bar update is their canonical pattern)
13. Acceptance criteria for this design document
- Problem quantified with run evidence + regression test
- Design principles (P1–P8) explicit and ranked
- Component layout + indicator contract specified
- Rollout plan in phases with exit criteria
- Validation criteria (automated / functional / operational / perf / edge-case)
- Risks + mitigations
- Impact on ongoing P0-A stated
- All open questions resolved (9/9) — see §11
- Alternative approaches considered and rejected with rationale
- Scaling semantics + multi-tenancy (§8bis, §8ter)
- Committee review PASSED (session scores: Architect 9.0 / ML 9.0 / Data 8.5 / Crypto 8.5 / Ops 8.0, strong consensus, 13 recommendations integrated into v3)
- Child issue of #599 created to execute Phases 0–4
This document is the spec of that child issue.
14. Committee submission (for §11 review)
Title: Stateful Enrichment Refactor — design review
Question (target score: 8+):
Review of documentation/../design/CVN-N005-stateful-enrichment.md before we commit to the 4-week refactor. The diagnostic phase is complete (#609, failing regression test, root cause identified as 500-bar rolling window vs full-history batch on Wilder / EMA / multi-tf indicators).
Three prior options were rejected (§8). The proposal (Option D) is: single stateful+incremental enrichment engine, shared by backtest and paper/live, proven parity by construction rather than by tolerance.
Committee review requested on:
- The non-negotiable constraint (§2.3, P1): is "one code path, same in backtest and live" the right hill to die on, or is there a principled way to have two paths that preserves backtest fidelity? (User veto of 2026-04-20 was blunt but we want the committee's independent read.)
- Indicator contract (§4.2): the bootstrap_state + update protocol. Is it sufficient for all indicator classes we face (simple recursive, ring buffer, Wilder, multi-timeframe, custom xgb_*)? Any class we've missed?
- Shadow-comparison harness (P7): is "run both paths, fail fast on divergence > tolerance" the right safety net, or do we need something stronger (e.g. production-blocking)?
- The 4-week estimate (§6): realistic for a team of 1? If not, what's the minimum staffing we should plan for?
- Scaling semantics (§8bis): is "explicit-state-threaded-through-pure-functions" the correct mental model, and is §8ter's shared-state / per-user-decisions separation sound? Specifically: are we right that enrichment state is never per-user, only per-(symbol, timeframe, feature_set_version)? Any use case the committee sees where a user would legitimately want "their own" enrichment without picking a different feature set?
- Open questions §11 (1–8): direction preferred on each, with rationale.
- Blind spots: what's the most expensive mistake we could make executing this plan that we haven't surfaced in §9?
Deliverable: per-expert opinion + consolidated verdict with named blockers if any.