Stateful Enrichment Refactor — Design document

Status: v4, committee-reviewed (v3 PASSED, strong consensus, 8.6/10 average). Phase 1 shipped 2026-04-21: scaffolding + EMA stub + shadow harness, 63/63 tests green (post-CR-round-2 + post-rebase, 2026-05-03; the original 57 grew to 63 with the FTF lock + streaming-skip tests added during CR). Phase 2 (indicator bulk port) is unblocked and can start.

Authors: CVNTrade research, 2026-04-20 (v1–v3) / 2026-04-21 (v4)

Parent issue: #599 (shadow divergence diagnostic)

Supersedes: the three fix options outlined in PR #609 (all three were either incorrect, as option C splits the code path, or intractable, as option B cannot handle infinite-memory indicators).

Changelog

  • v4 (2026-04-21) — Phase 1 shipped and informs the doc:
  • Protocol revision: update and emit are now separate methods (§4.2), resolving the ordering subtlety identified in the 2026-04-21 design discussion. R/B classes (recursive / ring-buffer, per the Phase 0 classification) update on every candle for state correctness; emit is lazy and skipped on CUSUM-filtered candles. The v3 conflated update → (state', value) form is superseded.
  • §4.3 new — CUSUM/Enrichment ordering. Post-refactor, enrichment update runs BEFORE CUSUM gating ; CUSUM gates emit, not update. This is a direct consequence of stateful correctness and was implicit in v3 ; now explicit.
  • §6 Phase 1 marked ✅ done with links to the shipped module.
  • v3 (2026-04-20) — post committee review. Integrates 13 recommendations: timeline extended 4w → 6–8w, open questions §11 converted to decisions, operational safeguards added (kill switch, runbook, drift logging), shadow harness gains a divergence budget, Phase 0 inventory flagged as needing senior engineering time. No architectural change; the core proposal was unanimously supported.
  • v2 (2026-04-20) — scaling semantics (§8bis) and multi-tenant separation of concerns (§8ter) added to preempt "stateful doesn't scale" objection.
  • v1 (2026-04-20) — initial draft.

1. Executive summary

CVNTrade's enrichment stage today has a silent structural flaw. The invariant advertised by EnrichmentAPI:

enrich_batch(ohlcv).features.iloc[-1] == enrich_streaming(ohlcv).features.iloc[-1]

is false whenever the caller passes different windows to the two public methods. This is the norm — not the exception — in production:

  • the monolith backtest engine calls enrich_batch(full_history) once at start,
  • the pipeline runner calls enrich_streaming(ohlcv.iloc[T-500:T+1]) at every bar (see src/commun/pipeline/runner/steps.py:122).

For indicators with unbounded memory (EMA, Wilder RSI, ADX, MACD) and multi-timeframe resamples (_1h, _6h), the 500-bar rolling window produces measurably different feature values than the batch path. Confirmed by PR #609 test_batch_full_vs_streaming_rolling_window_reproduces_shadow_drift: 15 features drift between batch and streaming at T=1000, top offender atr_1h_zscore is NaN in streaming vs -1.54 in batch.

This drift propagates downstream:

  • p(BUY) shifts by up to 0.12 at the same bar (#599 Loki evidence)
  • Trade counts diverge by ±1 to ±3 per crypto per fold
  • Backtest no longer predicts live behaviour, because the live path (pipeline runner) is exactly the broken rolling-window path

The fix is not to hide the divergence behind two code paths ("batch for backtest, stream for live") — that would move the drift from "monolith vs pipeline" to "backtest vs live", where it becomes invisible.

The fix is a stateful, incremental enrichment engine shared by all modes. The state is initialised once (bootstrap on available history, batch or live) then updated one candle at a time via closed-form recursive rules. The output at any bar T is identical whether we got there by ingesting the whole history in batch or one candle at a time in streaming — by construction, not by testing.

This document proposes the architecture, the rollout plan, and the validation criteria. Implementation is a dedicated child issue, not this document.


2. Problem statement

2.1 Evidence

  • PR #609 regression test: 15 columns drift at T=1000 on synthetic OHLCV (atr_1h_zscore NaN vs -1.54 ; atr_1h_price 4.6% drift ; trend_strength_6h 1.1% ; ADX_48, RSI_48, MACD_* all non-zero).
  • Shadow run ftf_20260420_133512_8328fb: 8 first-divergence events observed across OP/AAVE/LDO with same_bar_prob_drift, |mono_p_buy − pipe_p_buy| up to 0.12, gap_to_test_end_bars 4556–5853 (drift scatters across the test window, not at boundary).
  • Trade-count divergence: monolith vs pipeline open ±1 to ±3 trades per crypto per fold.

2.2 Root cause

CVNTrade_Enrich.process() is a pure batch function: it takes a DataFrame, computes indicators on that DataFrame from scratch, returns the enriched DataFrame. Calling it with a partial history produces results that reflect only that partial history.

Indicator classes affected:

| Class | Memory | Example | Warmup needed |
|---|---|---|---|
| SMA / Bollinger | finite, N bars | SMA_20 | N bars |
| Wilder smoothing | unbounded | RSI, ADX, ATR | full history, never converges exactly |
| EMA / MACD | unbounded | EMA_50, MACD | full history |
| Multi-timeframe resamples | depends on tf | atr_1h_*, trend_strength_6h | N × (higher_tf / base_tf) bars minimum |
| Rolling quantiles | finite N, but needs the ring | *_zscore | N bars |
| Custom xgb_* features | case-by-case | xgb_mining_pressure_* | individual analysis |

For rolling-500 vs full-history on 15m candles:

  • RSI_48, ADX_48, Wilder ATR: technically never exactly equal to batch (Wilder smoothing has geometric memory); the differences are measurable.
  • EMA_50, MACD_24_64_18: same, the EMA recursion never forgets.
  • atr_1h_zscore: requires enough candles to compute the 1h-resampled ATR plus its rolling z-score. On a 15m base, 1h = 4 bars, so a z-score over N=50 needs 200+ bars just for the window, plus the ATR warmup on the 1h axis. 500 candles is not enough, hence NaN.
  • trend_strength_6h: 6h = 24 bars of 15m, over a long-horizon indicator, so even more warmup.
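To make the geometric-memory point concrete, the sketch below runs a Wilder-style smoother (α = 1/48) once over a full synthetic series and once over only its last 500 values. The random-walk data, the seeding on the first value and the helper name `smooth` are illustrative assumptions, not the production indicator code. The only claim is that the weight left on pre-window history, (1 − α)^499, is small (roughly 3e-5) but not zero, so the two results differ.

```python
import random

def smooth(values, alpha):
    """Recursive smoother s_t = alpha*x_t + (1-alpha)*s_{t-1}, seeded with the first value."""
    s = values[0]
    for x in values[1:]:
        s = alpha * x + (1.0 - alpha) * s
    return s

random.seed(0)
# Synthetic random-walk "prices" (illustrative only, not market data).
prices = [100.0]
for _ in range(5999):
    prices.append(prices[-1] + random.gauss(0.0, 1.0))

alpha = 1.0 / 48.0                         # Wilder smoothing with period 48
full = smooth(prices, alpha)               # batch path: full history
windowed = smooth(prices[-500:], alpha)    # streaming path: rolling 500-bar window

# Weight still carried by everything that happened before the window starts.
residual_weight = (1.0 - alpha) ** 499
drift = abs(full - windowed)
print(f"residual weight = {residual_weight:.2e}, drift = {drift:.2e}")
```

The drift equals the residual weight times the gap between the full-history smoothed value and the raw price at the window start, which is why it never reaches exactly zero for any finite window.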

2.3 Why "two paths" is the wrong answer

Considered and rejected in the PR #609 discussion: make the pipeline runner call enrich_batch on the full history in backtest, and keep the enrich_streaming rolling window only for paper/live. The user rejected this during the design discussion of 2026-04-20 with the decisive argument:

"if we have 2 different paths between backtest and trading, we are backtesting nothing, that's dumb"

— which is correct. Backtest must simulate live. If the enrichment code used in backtest diverges from the enrichment code used in live, the backtest predicts a version of the strategy that will never run in production. The drift we see today between monolith and pipeline would become invisible drift between backtest and live — worse, not better.

Non-negotiable design constraint: one enrichment code path, shared by backtest, paper trading and live trading.


3. Design principles

P1 — One code path

All callers (backtest engine, pipeline runner, paper trading adapter, live adapter) go through the same enrichment engine. The only parameter that varies between modes is the initial state: backtest bootstraps from the full test history at run start ; live bootstraps from a fetched chunk of recent candles at startup ; both then update candle-by-candle.

P2 — Stateful by design, update rule is closed-form

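Each indicator exposes an update rule (written here in its conceptual combined form; the v4 contract in §4.2 splits it into update and emit):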

state', output = indicator.update(state, candle)

where state carries just enough information to produce the correct output at the next candle (EMA carries its previous value ; SMA carries a ring buffer ; Wilder carries its smoothed running values ; multi-tf carries the current higher-tf partial bar + the state of the higher-tf indicator engine).

The rule must be closed-form — no recomputing from history at each candle (the O(T²) trap). For indicators where no closed-form exists, we either keep a ring buffer (finite-window indicators) or approximate explicitly with a documented tolerance (rare case, to avoid if possible).
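As a minimal sketch of what "closed-form" means here, the two functions below follow the conceptual (state', output) form for an EMA, whose state is a single float, and an SMA, whose state is a ring buffer. The function names and the plain-float inputs are assumptions made for illustration; the real indicators consume candles.

```python
from collections import deque
from typing import Deque, Tuple

def ema_update(prev: float, price: float, alpha: float) -> Tuple[float, float]:
    """EMA carries one float (its previous value). Returns (state', output)."""
    new = alpha * price + (1.0 - alpha) * prev
    return new, new

def sma_update(window: Deque[float], price: float) -> Tuple[Deque[float], float]:
    """SMA-N carries a ring buffer of the last N prices; maxlen drops the oldest."""
    new_window = deque(window, maxlen=window.maxlen)   # copy, so the old state is untouched
    new_window.append(price)
    return new_window, sum(new_window) / len(new_window)

prices = [10.0, 11.0, 12.0, 13.0, 14.0]
ema_state = prices[0]                                  # seed on the first price
sma_state: Deque[float] = deque([prices[0]], maxlen=3)
for p in prices[1:]:
    ema_state, ema_out = ema_update(ema_state, p, alpha=0.5)
    sma_state, sma_out = sma_update(sma_state, p)

print(ema_out, sma_out)
```

Neither function looks at anything older than its own state, so the cost per candle is constant for the EMA and bounded by N for the SMA, independent of how long the stream has been running.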

P3 — Bootstrap is the only bridge to batch

Enricher.bootstrap(history_df) is called once at startup. It walks the history candle by candle internally (or batch-computes where equivalent) and emerges with a state that is identical to what an infinite streaming history would have produced. After bootstrap, the streaming update(candle) takes over.

This means the entire _enrich_core body of today's code moves from "orchestration of batch computations" to "bootstrap routine". The public API gains a persistent state object.

P4 — Parity is proven by construction, not by tests

The test suite will pin the parity invariant (PR #609 will go green strict-xfail→passed), but the real guarantee is that there is only one computation path. "Backtest consumes the state transitions in one shot ; live consumes them gradually" — by definition, they produce the same output at the same bar.

P5 — Deterministic and reproducible

Bootstrap is deterministic: same history_df in → same initial state out. Updates are deterministic: same (state, candle) in → same (state', output) out. Consequence: a paper-live replay from a known starting state reproduces the same decisions bit-for-bit.

P6 — State is an MLflow-logged artifact (optional, phase 2)

Just as CUSUM sigma is version-pinned in MLflow today, the bootstrapped enrichment state could be persisted per model version. Not required for v1 (we re-bootstrap from history each run) but it opens up: faster paper/live startup, auditable state, offline replay from a pinned state. Deferred to a follow-up issue (§11, decision 2).

P7 — Fail-fast on divergence (with budget)

During the migration, both code paths coexist briefly. We ship a shadow-comparison assertion: on each candle, if the stateful output differs from the legacy batch output by more than a documented tolerance (1e-9 relative, per §11.1 decision), we log a WARN.

Divergence budget (committee recommendation 11): the run blocks if more than 5 WARNings per feature per fold fire. This keeps isolated float-non-associativity noise tolerable while guaranteeing any systematic drift is caught immediately. Past that budget, the run emits CRIT enrichment_shadow_budget_exceeded and fails fast (ADR-25).
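The budget rule can be sketched as a small counter keyed by (feature, fold). This is only an illustration under assumptions: the class and exception names are hypothetical, the shipped shadow harness may be structured differently, and treating two NaNs as agreement is a choice made for this sketch.

```python
import math
from collections import Counter

REL_TOL = 1e-9   # relative tolerance, per §11 decision 1
BUDGET = 5       # maximum WARNs per feature per fold

class ShadowBudgetExceeded(RuntimeError):
    """Hypothetical stand-in for the CRIT enrichment_shadow_budget_exceeded path."""

class ShadowBudget:
    def __init__(self, rel_tol: float = REL_TOL, budget: int = BUDGET):
        self.rel_tol = rel_tol
        self.budget = budget
        self.warns = Counter()   # (feature, fold) -> number of WARNs so far

    def check(self, feature: str, fold: int, stateful: float, legacy: float) -> bool:
        """Return True if the values agree; count a WARN otherwise; raise past the budget."""
        both_nan = math.isnan(stateful) and math.isnan(legacy)
        if both_nan or math.isclose(stateful, legacy, rel_tol=self.rel_tol, abs_tol=0.0):
            return True
        self.warns[(feature, fold)] += 1
        if self.warns[(feature, fold)] > self.budget:
            raise ShadowBudgetExceeded(
                f"enrichment_shadow_budget_exceeded: {feature} fold={fold}")
        return False

budget = ShadowBudget()
ok = budget.check("EMA_50", 0, 1.0, 1.0 + 1e-12)   # within tolerance
```

Five divergences on the same (feature, fold) pair only return False; the sixth raises, which matches "more than 5 WARNings" above. Counters for other features or folds are unaffected.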

Continuous shadow in production (optional, flag-gated): a CVN_ENRICHMENT_SHADOW_MODE env var can keep the comparison active on live trading beyond the migration window, which is useful as a long-term drift sentinel. It defaults to off after migration closes; ops can switch it on for a single crypto for investigation without a redeploy.

P8 — Shared state, per-user decisions

Enrichment state is deterministic from OHLCV: identical input + identical indicator rules → identical state. Consequence: the enricher state is shared across all users who consume the same (symbol, timeframe, feature_set_version). It is not a per-user resource. Per-user customisation happens strictly downstream of enrichment — at inference threshold, PTE, sizing, filters and execution. This holds today (stateless recompute already produces the same features for any two users of the same feature set) and is preserved by the refactor; the stateful engine does not change the multi-tenancy semantics.

The state key is therefore (symbol, timeframe, feature_set_version), not (user_id, …). One enricher instance per unique key observed, serving all concurrent users of that key. See §8bis and §8ter for the full separation of concerns.
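One way to picture the keying is a registry that hands out a single instance per (symbol, timeframe, feature_set_version) tuple. The `EnricherRegistry` class and its `factory` argument are hypothetical names for this sketch; the document does not specify how instances are actually held.

```python
from typing import Callable, Dict, Tuple

EnricherKey = Tuple[str, str, str]   # (symbol, timeframe, feature_set_version)

class EnricherRegistry:
    """Hypothetical registry: one enricher instance per key, shared by all users of that key."""

    def __init__(self, factory: Callable[[EnricherKey], object]):
        self._factory = factory
        self._instances: Dict[EnricherKey, object] = {}

    def get(self, symbol: str, timeframe: str, feature_set_version: str) -> object:
        key = (symbol, timeframe, feature_set_version)
        if key not in self._instances:
            # First user of this key: build (and later bootstrap) the instance once.
            self._instances[key] = self._factory(key)
        return self._instances[key]

# A dict stands in for a real enricher object in this sketch.
registry = EnricherRegistry(factory=lambda key: {"key": key})
user_a = registry.get("BTCUSDC", "15m", "v2-abc123")
user_b = registry.get("BTCUSDC", "15m", "v2-abc123")   # same key, same instance
user_c = registry.get("BTCUSDC", "15m", "v3-def456")   # different feature set, own instance
```

Note that no user identifier appears anywhere in the key, which is the point of P8.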


4. Architecture

4.1 Component layout

src/commun/pipeline/
  enrichment_api.py                     (thin public API, unchanged surface)
  enrichment/
    state.py                            (EnrichmentState dataclass, serialisable)
    engine.py                           (StatefulEnricher: bootstrap + update)
    indicators/
      simple_recursive.py               (EMA, MACD, Wilder smoothing)
      ring_buffer.py                    (SMA, rolling std, quantiles)
      multi_timeframe.py                (resample + sub-engine per tf)
      xgb_custom.py                     (xgb_* features, per-feature)

EnrichmentAPI becomes a facade over StatefulEnricher:

from typing import Optional

import pandas as pd


class EnrichmentAPI:
    def __init__(self, config: EnrichmentConfig):
        self._engine = StatefulEnricher(config)
        self._state: Optional[EnrichmentState] = None

    def enrich_batch(self, ohlcv: pd.DataFrame) -> EnrichmentResult:
        """Bootstrap + emit all rows. Used by backtest at run start."""
        self._state, rows = self._engine.bootstrap_and_emit_all(ohlcv)
        return EnrichmentResult(features=rows, ...)

    def enrich_incremental(self, candle: pd.Series) -> pd.Series:
        """Single-candle update. Used by pipeline runner / paper / live."""
        if self._state is None:
            raise RuntimeError("must bootstrap() before enrich_incremental()")
        self._state, row = self._engine.update(self._state, candle)
        return row

    def bootstrap(self, history: pd.DataFrame) -> None:
        """Explicit bootstrap; used by paper/live after fetching history."""
        self._state, _ = self._engine.bootstrap_and_emit_all(history)

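The deprecated enrich_streaming(window) stays during migration but emits a DeprecationWarning and internally runs: self.bootstrap(window.iloc[:-1]); return self.enrich_incremental(window.iloc[-1]). The bootstrap stops one row short so that the last candle is ingested once, not twice. Removed in phase 4.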

4.2 Indicator contract (v4 — update and emit separated)

class Indicator(Protocol):
    name: str                            # column key in output

    def bootstrap_state(self, history: pd.DataFrame) -> IndicatorState:
        """Compute initial state from a batch of candles."""

    def update(self, state: IndicatorState, candle: pd.Series
               ) -> IndicatorState:
        """Ingest one candle, return the updated state. MANDATORY on every
        candle for R/B classes — skipping corrupts state. No-op for class S.
        """

    def emit(self, state: IndicatorState) -> float:
        """Read the current value from state. O(1) — no recomputation.
        Called lazily (only on CUSUM-pass candles for compute economy)."""

v4 rationale : the v3 contract conflated update → (state', value) in a single call. That was too coarse — stateful correctness demands update on every candle (including CUSUM-filtered ones) while emit is pure output and can be skipped when the value will not be consumed. Splitting resolves the tension without adding a feature flag.

Every existing indicator is ported to this contract. Where a closed-form update exists (EMA, Wilder), we use it. Where only a finite-window dependency exists (SMA-N), state is a deque of the last N prices. For multi-timeframe, state includes the current partial higher-tf bar + the sub-indicator state (recursive structure).

Phase 1 shipped (2026-04-21) with the EMA stub implementing this contract and a StatefulEnricher engine that orchestrates the three ops. Phase 2 ports the rest of the inventory class by class.
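For readers who want to see the three operations side by side, here is a minimal sketch of an indicator following the v4 contract. It is not the shipped EMA stub: it implements Wilder smoothing, works on a plain sequence of floats instead of pd.DataFrame / pd.Series, and seeds on the first value. The class names and the example column name are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class WilderState:
    value: float   # current smoothed value
    count: int     # number of values ingested so far

class WilderSmoother:
    """Sketch of bootstrap_state / update / emit on a float series (simplified inputs)."""

    def __init__(self, name: str, period: int):
        self.name = name
        self.period = period

    def bootstrap_state(self, history: Sequence[float]) -> WilderState:
        if not history:
            raise ValueError("bootstrap requires at least one value")
        state = WilderState(value=history[0], count=1)   # seeding choice for this sketch
        for x in history[1:]:
            state = self.update(state, x)                # bootstrap reuses the update rule
        return state

    def update(self, state: WilderState, x: float) -> WilderState:
        # Wilder recurrence: s_t = s_{t-1} + (x_t - s_{t-1}) / period
        new_value = state.value + (x - state.value) / self.period
        return WilderState(value=new_value, count=state.count + 1)

    def emit(self, state: WilderState) -> float:
        return state.value                               # O(1) read, no side effect

ind = WilderSmoother(name="wilder_demo", period=2)       # hypothetical column name
history = [1.0, 2.0, 4.0, 8.0]
full = ind.bootstrap_state(history)
partial = ind.bootstrap_state(history[:2])
for x in history[2:]:
    partial = ind.update(partial, x)
```

Because emit only reads the state, calling it zero times or many times between updates leaves the next update unchanged, which is what makes it safe to skip.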

4.3 CUSUM / Enrichment ordering (v4 — explicit)

The post-refactor pipeline reorders CUSUM to sit after enrichment's update but before enrichment's emit:

OHLCV candle T
[Enrichment.update]  — runs for EVERY candle, O(1), keeps state correct
[CUSUM gate]         — reads returns (raw OHLCV), gates downstream
  ├── non-event → skip emit + FE + inference
  └── event → continue
[Enrichment.emit]    — reads state, ~free for R/B, full compute for S
[FE + Inference + LdP v2 chain]

Why the reorder vs today's CUSUM → enrich_streaming order:

  • Today's enrich_streaming is stateless-per-call (rebuilds from a rolling 500-bar window on every invocation). Skipping it on CUSUM-filtered candles was a valid compute-saving because re-entering later would not notice the skip.
  • Post-refactor, enrichment is stateful. Skipping update on even a single candle means the EMA / Wilder / ring-buffer state diverges from the "would-have-been" state if the candle had been seen. The drift accumulates and breaks the cross-mode parity claim (§P4).
  • Because update is O(1) per candle in the stateful design, running it unconditionally costs effectively nothing. The compute-saving that justified the old order has evaporated.
  • emit remains skippable because it has no side effect on future bars — it only produces the value the downstream consumer wants. Skipping it on CUSUM-filtered candles keeps the savings for FE + inference + filter chain.

This reorder is implicit in the design but was only made explicit 2026-04-21 after operator feedback. Phase 2 call sites MUST honour this ordering — an integration test in Phase 2 will fail any call site that skips update.
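The effect of the ordering can be reproduced on a toy EMA. In the sketch below the gate is a stand-in predicate (the real CUSUM filter reads returns), and both function names are hypothetical. The first loop updates on every candle and emits only on events; the second applies the old "skip everything on non-events" order to a stateful engine.

```python
def run_update_always(prices, alpha, is_event):
    """Post-refactor order: update on every candle, emit only when the gate fires."""
    state = prices[0]
    emitted = {}
    for t in range(1, len(prices)):
        state = alpha * prices[t] + (1.0 - alpha) * state   # Enrichment.update, unconditional
        if is_event(t):                                     # gate (stand-in predicate)
            emitted[t] = state                              # Enrichment.emit, lazy
    return emitted

def run_update_gated(prices, alpha, is_event):
    """Old order applied to a stateful engine: update is skipped on non-events."""
    state = prices[0]
    emitted = {}
    for t in range(1, len(prices)):
        if is_event(t):
            state = alpha * prices[t] + (1.0 - alpha) * state
            emitted[t] = state
    return emitted

prices = [100.0, 101.0, 99.0, 102.0, 98.0, 103.0]
gate = lambda t: t % 2 == 1                    # events at t = 1, 3, 5
reference = run_update_always(prices, 0.5, lambda t: True)
correct = run_update_always(prices, 0.5, gate)
broken = run_update_gated(prices, 0.5, gate)
print(reference[5], correct[5], broken[5])
```

With update running on every candle, the emitted values at event bars match the ungated reference exactly. With update gated, the value at t = 5 no longer matches, because the candles at t = 2 and t = 4 never reached the state.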

4.4 Call patterns

Backtest (monolith):

features = engine.enrich_batch(full_ohlcv).features   # bootstrap + emit all rows, O(T), once at init
for t in range(len(full_ohlcv)):
    row = features.iloc[t]              # already computed at bootstrap
    ...

Backtest (pipeline runner):

engine.bootstrap(full_ohlcv_up_to_test_start)   # same O(T) call
for t in test_range:
    row = engine.enrich_incremental(ohlcv.iloc[t])  # O(1) per candle
    ...

Because both paths share engine.bootstrap + update, the output at bar T is identical (P4).

Paper / live:

history = fetch_recent_candles(crypto, n=WARMUP_BARS)
engine.bootstrap(history)
while True:
    candle = await next_candle()
    row = engine.enrich_incremental(candle)
    ...

Same bootstrap + update. WARMUP_BARS is derived from the per-indicator inventory (§6 Phase 0; enforced per §11, decision 4) to guarantee convergence.


5. Scope

5.1 In scope

  • CVNTrade_Enrich (tech indicators) → port every indicator to the Indicator contract
  • CVNTrade_XGBoostFeatureGenerator (xgb_* features) → same
  • EnrichmentAPI refactor (facade + state management)
  • Pipeline runner's EnrichmentStep → bootstrap + incremental
  • Monolith backtest engine → bootstrap + iloc lookup (unchanged externally, cleaner internally)
  • PR #609 xfail test → passes after refactor
  • New tests: per-indicator bootstrap/update parity, full-pipeline parity on real OHLCV, paper/live replay determinism

5.2 Out of scope (later, separate issues)

  • State persistence in MLflow (P6): deferred to a follow-up issue (§11, decision 2)
  • Paper/live warmup-during-gap strategy — separate operational issue
  • Performance optimisation past the "correct" baseline — measure first
  • Other pipeline stages (FE, inference, filters, signal manager) — same principle applies but each deserves its own design
  • Removing the monolith backtest engine entirely (#568) — separate track

5.3 Explicitly not changing

  • Public signature of EnrichmentAPI.enrich_batch — backward compatible
  • Output DataFrame shape — same columns, same order
  • Downstream consumers (FE pipeline, inference) — they see the same rows
  • MLflow model artifact layout — unchanged

6. Rollout plan

Phase 0 — Inventory (2–3 days, senior engineering time required — committee rec. 13)

  • Enumerate every column produced by today's CVNTrade_Enrich and CVNTrade_XGBoostFeatureGenerator
  • For each, classify: simple recursive / ring buffer / multi-tf / custom / trivially stateless
  • For every custom xgb_* feature: confirm whether the current computation is genuinely incremental-compatible, approximate, or inherently batch (if any). Flag the inherently-batch cases as blockers for a re-scoping discussion before Phase 2 starts.
  • Determine WARMUP_BARS per indicator class (input to the runtime enforcement in §11, decision 4, and to the residual risk in §9.3)
  • Output: a spreadsheet (committed to documentation/architecture/ENRICHMENT_INDICATORS.md) that the implementation team ticks off as each indicator is ported

Phase 1 — Core engine ✅ DONE 2026-04-21

  • EnrichmentState dataclass — pure functional, set() returns new instance
  • Indicator Protocol — bootstrap_state / update / emit separated (v4 contract)
  • StatefulEnricher engine — register, bootstrap, update, emit, bootstrap_and_emit_all
  • EMA stub indicator — textbook recurrence, parity with pd.Series.ewm at atol=1e-12
  • Parity harness — assert_indicator_parity helper, reusable one-liner per feature for Phase 2
  • EnrichmentAPI facade flag dispatch — CVN_ENRICHMENT_STATEFUL_ENABLED ∈ {0, shadow, 1} ; 0 legacy zero-cost, shadow runs both and logs divergences, 1 reserved for Phase 3 (raises in Phase 1)
  • Shadow-comparison harness (§P7) — structured log events (enrichment_shadow_divergence, _ok, _no_overlap, _budget_exceeded), per-call divergence budget enforcement, defensive exception catching so shadow failures never take down a run
  • 63/63 tests green (engine, EMA parity, shadow harness, facade dispatch ; original 57 + 6 added during CR rounds 1-2)
  • Zero behaviour change in default mode — legacy path pixel-identical

Files shipped:

  • src/commun/pipeline/enrichment/{state,protocol,engine,shadow}.py
  • src/commun/pipeline/enrichment/indicators/ema.py
  • tests/unit/pipeline/enrichment/{test_engine,test_ema,test_shadow,test_facade_dispatch}.py
  • tests/unit/pipeline/enrichment/parity_harness.py
  • src/commun/pipeline/enrichment_api.py (facade hook added)

Phase 2 — Port indicators by class (2–3 weeks)

  • Simple recursive (EMA, MACD, Wilder family) first — small, closed-form
  • Ring-buffer next (SMA, rolling quantiles)
  • Multi-timeframe last — most complex, depends on the above
  • Custom xgb_* alongside as they typically wrap the above
  • Each port: unit test (bootstrap batch output == incremental output on the same history), integration test (full pipeline parity)
  • The shadow-comparison harness stays green throughout

Phase 3 — Flip the flag (1 day)

  • Set CVN_ENRICHMENT_STATEFUL_ENABLED=1 in ftf_config.base_env (Console snapshot)
  • Run a full FTF over the 3 exploration cryptos, verify the triptych matches the pre-flip baseline (tolerance: 1e-6 on feature values, identical trade decisions)
  • Keep the old path reachable via =0 for 1 sprint as rollback

Phase 4 — Decommission (1 day, 1 sprint after flip)

  • Remove the old _enrich_core batch-recompute path
  • Drop enrich_streaming(window) deprecated alias
  • Flip #609 xfail → passing
  • Update EnrichmentAPI docstring — the invariant is now true by construction

Total wall-clock: 6–8 weeks for 1 FTE (committee rec. 1 — the initial 4-week estimate was too aggressive given xgb_* and multi-tf complexity). Alternative: 1.5–2 FTE for 4 weeks. Default plan assumes 1 FTE × 7 weeks.


7. Validation criteria

7.1 Automated

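  • test_batch_full_vs_streaming_rolling_window_reproduces_shadow_drift (#609): passes once its strict xfail marker is removed in Phase 4 (under strict xfail, an unexpected pass is reported as a failure)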
  • New test_bootstrap_then_incremental_equals_batch — bootstrap on history[:N], then apply update candle-by-candle from N to T, compare final state to a fresh bootstrap_and_emit_all(history[:T]). Must be bit-identical.
  • New test_paper_live_replay_deterministic — from a pickled EnrichmentState, replay a candle stream, assert outputs bit-identical to a full-history bootstrap.
  • Per-indicator unit tests covering bootstrap, update, edge cases (gap in data, first candle, state carrying NaN).
  • Shadow-comparison assertion (§P7) fires zero WARN on a week of production backtest runs.
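The bootstrap-then-incremental criterion above can be expressed as a generic helper that tries every split point. The helper below is a hypothetical sketch, not the shipped assert_indicator_parity, whose signature this document does not give. The toy EMA it is exercised on works on plain floats.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")

def assert_split_parity(bootstrap: Callable[[Sequence[float]], S],
                        update: Callable[[S, float], S],
                        emit: Callable[[S], float],
                        history: Sequence[float]) -> None:
    """For every split N: bootstrap(history[:N]) plus incremental updates must emit
    exactly what a fresh bootstrap on the full history emits."""
    expected = emit(bootstrap(history))
    for n in range(1, len(history)):
        state = bootstrap(history[:n])
        for x in history[n:]:
            state = update(state, x)
        got = emit(state)
        assert got == expected, f"split at N={n}: {got!r} != {expected!r}"

# Toy EMA whose bootstrap reuses the same update rule (illustrative only).
ALPHA = 2.0 / (20 + 1)

def ema_update(s: float, x: float) -> float:
    return ALPHA * x + (1.0 - ALPHA) * s

def ema_bootstrap(hist: Sequence[float]) -> float:
    s = hist[0]
    for x in hist[1:]:
        s = ema_update(s, x)
    return s

def ema_emit(s: float) -> float:
    return s

history = [100.0 + 0.37 * i - 0.01 * i * i for i in range(200)]
assert_split_parity(ema_bootstrap, ema_update, ema_emit, history)
```

Bit-identity is automatic here because the bootstrap walks the same update rule. A vectorised bootstrap is where a documented tolerance (§11, decision 1) may become necessary, and a helper of this shape would surface it at the first failing split.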

7.2 Functional

  • A full FTF run (3 cryptos × 5 folds × 3 variants) produces identical p(BUY) distributions pre-flip and post-flip (KS test p > 0.99).
  • Shadow divergence (#599) dashboard stops firing pipeline_shadow_first_divergence events with side=same_bar_prob_drift on all exploration cryptos.
  • Monolith-vs-pipeline trade count difference → zero on a full FTF run.

7.3 Operational

  • Backtest performance: no regression > 10% wall-clock per fold.
  • Memory footprint: EnrichmentState serialised size documented, no pod OOM at steady state.
  • Paper/live bootstrap time: documented, < 30 s for WARMUP_BARS typical value.

7.4 Performance profiling (committee rec. 9)

  • Per-indicator update() profiled on production-like OHLCV: report median, p95 and p99 latency per class (recursive / ring / multi-tf / xgb_*).
  • End-to-end per-candle latency budget documented (what we promise paper/live).
  • Any indicator whose update() exceeds documented budget gets a flagged follow-up issue before Phase 4 cutover — not a blocker if impact is bounded.

7.5 Edge case coverage (committee rec. 10)

  • Data gaps (missing candles, exchange outage): state update must tolerate time-gaps without corruption — test with synthetic gaps of 1, 5, 50 candles.
  • NaN propagation: any NaN in an indicator output must be isolated to that indicator's column (not contaminate downstream).
  • First-candle bootstrap: the very first update() after a bootstrap() must produce the correct row, not a transient artifact.
  • Late candle replay (out-of-order arrival): documented behaviour, test covered.
  • Empty bootstrap (zero history): should raise an explicit error, not crash silently.

8. Why this is the right fix

| Alternative considered | Why rejected |
|---|---|
| A: pre-compute multi-tf features batch-only and inject | still "one path in backtest, a different one in live"; doesn't fix live |
| B: extend the rolling window until warmup is sufficient | impossible for EMA / Wilder (unbounded memory); multi-tf would need windows of thousands of bars per higher-tf, making the approach pathological |
| C: monolith's enriched DataFrame shared with pipeline on backtest path only | kills the backtest-live equivalence (user veto, 2026-04-20) |
| E: fully stateless, recompute full history on every candle | O(T²) wall-clock (~110 h/fold on 6k bars vs 25 min today); not viable |
| F: fully stateless, approximate EMA/Wilder with bounded windows | reintroduces drift by construction; same bug class we are fixing |
| D (this proposal): incremental engine with explicit state threading, everywhere | aligns backtest and live by construction; unit-testable per indicator; no warmup-window tuning; standard pattern in streaming trading systems |

8bis. Scaling semantics and multi-tenancy

The word "stateful" invites a scaling objection ("can't scale if it's stateful"). The objection conflates two different meanings of state:

  • Distributed shared state (bad) — a cluster-wide store with locking, consensus, eventual consistency headaches. Not this design.
  • Explicit local state threaded through pure functions (good) — new_state, output = update(old_state, candle). The state is a local variable, owned by the caller, passed in and out. Same pattern as Redux, Kafka Streams operators, Flink operator state, or any streaming trading engine (QuantConnect, Backtrader streaming, TradingView Pine). This IS the design.

What parallelises and what does not:

| Dimension | Scales? | How |
|---|---|---|
| Different symbols (BTC / ETH / SOL) | Linearly | One enricher instance per symbol, zero cross-talk |
| Different folds in FTF backtest | Linearly | Each fold bootstraps its own enricher; already the case in today's FTF multiprocessing |
| Different feature set versions | Linearly | One instance per (symbol, timeframe, feature_set_version) |
| Concurrent users with the same feature set | Share one instance (§P8) | Features are deterministic → identical output → no reason to duplicate state; users diverge downstream |
| Pod failover / restart | Bootstrap is idempotent | Re-fetch history → re-bootstrap → steady state restored. Current paper/live already does this every restart. |
| Multiple replicas of the same enricher | N/A by design | Per (symbol, timeframe, feature_set) there is no value in running two replicas; the state would diverge on first candle due to network-ordering non-determinism. Scale-out goes by additional keys, not by additional replicas per key. |

What does not parallelise — and this is mathematics, not engineering:

  • Within a single stream (one symbol, time advancing): bar T depends on bar T-1 via the EMA recursion ema_t = α·x_t + (1−α)·ema_{t−1}. No stateless reformulation can break this dependency chain without either recomputing from scratch (option E, O(T²)) or approximating (option F, drift). The sequential constraint is intrinsic to the indicator family we use.
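Unrolling the recursion makes the dependency chain explicit. This is the standard identity for the recurrence quoted above, with ema_0 denoting the seed value:

```latex
\mathrm{ema}_t \;=\; \alpha \sum_{k=0}^{t-1} (1-\alpha)^{k}\, x_{t-k} \;+\; (1-\alpha)^{t}\, \mathrm{ema}_0
```

Every past input x_{t−k} keeps a non-zero weight α(1−α)^k, so no finite window reproduces the value exactly: truncating to the last W inputs discards terms whose combined weight is (1−α)^W.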

8ter. Who-owns-what (multi-tenant architecture)

OHLCV BTCUSDC 15m
┌────────────────────────────────────────┐
│ Enricher — state key = (BTCUSDC, 15m,  │   ← shared across all users
│  feature_set_version=v2-abc123)        │     of this feature set
│  • EMA/Wilder/multi-tf state           │
│  • ring buffers, partial higher-tf bar │
└────────────┬───────────────────────────┘
        features_df  (deterministic, shared output)
             ├──► User A: model → threshold(mode=A) → PTE_A → Kelly_A → exec account A
             ├──► User B: model → threshold(mode=B) → PTE_B → Kelly_B → exec account B
             └──► User C: different trained model   (→ different feature_set_version
                                                     → own enricher instance)

Shared layer (enricher state):

  • (symbol, timeframe, feature_set_version) tuple → one instance
  • deterministic from OHLCV
  • memory cost: ≈ a few MB per instance (a few floats per indicator, plus ring buffers of ~50 × timeframe periods × 8 B)
  • lifecycle: bootstrap at startup, survives as long as the key is used, discarded when no user references it

Per-user layer (downstream of enrichment):

  • threshold mode (#608), PTE, Kelly fraction, filter flags
  • active positions, PnL, account state
  • exchange API keys, routing decisions
  • none of this touches the enricher

How users opt into a different enrichment: by picking a different trained model. A different model = a different feature_set_version = a different enricher key = its own instance. This is already how the system works today (model artifacts are version-pinned in MLflow per ADR-23) and this refactor does not change that boundary.

8quater. Memory budget worked example

Let us put numbers on "a few MB per key". Assume a worst-case indicator set: 20 Wilder/EMA indicators (each ≈ 3 float state values), 10 ring-buffer indicators (each ≈ N=50 floats), 4 multi-timeframe engines (each ≈ a second-level state of the same size), and growing pandas Series of the last 2000 bars (for the output DataFrame shape consumers expect).

  • 20 × 3 × 8 B = 480 B
  • 10 × 50 × 8 B = 4 kB
  • 4 × (480 + 4000) B ≈ 18 kB
  • Output ring of 2000 × (say) 150 features × 8 B = 2.4 MB

≈ 2.5 MB per (symbol, timeframe, feature_set) key. For 10 cryptos × 1 timeframe × 1 feature_set = 25 MB. For 100 cryptos = 250 MB. Within reach of any single pod. If we ever cross the threshold (thousands of concurrent feature sets?) the state key becomes a natural sharding axis.
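The arithmetic above can be re-run as a short script, using only the counts stated in this section. The variable names are illustrative.

```python
FLOAT_BYTES = 8

recursive = 20 * 3 * FLOAT_BYTES          # 20 Wilder/EMA indicators, ~3 floats each
ring = 10 * 50 * FLOAT_BYTES              # 10 ring buffers of N=50 floats
multi_tf = 4 * (recursive + ring)         # 4 higher-tf engines, each a second-level state
output_ring = 2000 * 150 * FLOAT_BYTES    # last 2000 bars x ~150 features

per_key = recursive + ring + multi_tf + output_ring

def total_mb(n_keys: int) -> float:
    """Total footprint in MB (10^6 bytes) for n_keys enricher instances."""
    return n_keys * per_key / 1e6

print(per_key, total_mb(10), total_mb(100))
```

The exact per-key sum is 2,422,400 B (about 2.42 MB), which the text rounds up to 2.5 MB. The output ring accounts for nearly all of it, so the indicator state itself is negligible by comparison.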


9. Risks

9.1 Implementation risks

  • Indicator-by-indicator regressions: one mis-ported indicator silently produces different values → caught by the shadow-comparison harness (P7) and per-indicator unit tests. Mitigated by ticking off the inventory one class at a time.
  • Multi-timeframe subtleties: resample boundaries (e.g. a 1h bar closes at HH:00, partial bar at start) are bug-prone. Mitigated by dedicated multi-tf test harness with synthetic data and realistic gap cases.
  • Custom xgb_* features that don't fit any class: may require bespoke porting per feature. Inventory phase surfaces these early.
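The resample-boundary risk is easier to reason about with a concrete state machine. The sketch below aggregates 15m candles into 1h bars and updates a 1h EMA only on the aligned close (the policy kept in §11, decision 3). It is an assumption-laden illustration: timestamps are candle open times in minutes, all names are hypothetical, and incomplete bars (mid-hour start, or a missing opening candle) are simply dropped, which is one possible policy and not a documented one.

```python
from dataclasses import dataclass
from typing import Optional

BASE_MIN = 15     # base timeframe, minutes
HIGHER_MIN = 60   # higher timeframe, minutes

@dataclass(frozen=True)
class PartialBar:
    high: float
    low: float
    close: float

@dataclass(frozen=True)
class MultiTfState:
    partial: Optional[PartialBar]   # still-open 1h bar, None until an aligned open is seen
    ema_1h: Optional[float]         # sub-indicator state on the 1h axis

def update(state: MultiTfState, ts_min: int, high: float, low: float,
           close: float, alpha: float) -> MultiTfState:
    """Ingest one 15m candle that opened at minute ts_min."""
    opens_bar = ts_min % HIGHER_MIN == 0
    closes_bar = (ts_min + BASE_MIN) % HIGHER_MIN == 0
    if opens_bar:
        partial = PartialBar(high, low, close)
    elif state.partial is None:
        return state                # no open bar to extend: ignore until an aligned open
    else:
        p = state.partial
        partial = PartialBar(max(p.high, high), min(p.low, low), close)
    if not closes_bar:
        return MultiTfState(partial, state.ema_1h)
    # Aligned close: the 1h bar is complete, so the 1h EMA advances by one step.
    if state.ema_1h is None:
        ema = partial.close
    else:
        ema = alpha * partial.close + (1.0 - alpha) * state.ema_1h
    return MultiTfState(None, ema)

state = MultiTfState(partial=None, ema_1h=None)
candles = [(30, 1.0), (45, 2.0),                              # mid-hour start, dropped
           (60, 10.0), (75, 11.0), (90, 12.0), (105, 13.0),   # first full 1h bar
           (120, 20.0), (135, 21.0), (150, 22.0), (165, 23.0)]
for ts, c in candles:
    state = update(state, ts, high=c + 1.0, low=c - 1.0, close=c, alpha=0.5)
```

Between closes the 1h EMA is untouched, so a reader of the state sees the last completed hour, never a half-built bar.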

9.2 Timing risks

  • Underestimated scope: the original 4-week estimate was optimistic for a team of 1; the plan now assumes 6–8 weeks (§6), and even that may slip if the xgb_* custom features have many exotic cases. The inventory phase will confirm.
  • P0-A tuning pause: we shouldn't freeze lever measurements for the duration of the refactor. Mitigation: P0-A continues with the old enrichment (today's state); the stateful refactor lands behind a flag and is validated in parallel; the FTF triptych comparison post-flip confirms no behavioural shift.

9.3 Residual risks post-delivery

  • Backtest-live parity now depends on the bootstrap history being "the same" both times. If live fetches a shorter warmup than backtest's bootstrap, the first N candles in live still drift. Documentation must state the required WARMUP_BARS per indicator class, and live deployment must fetch at least that many.

10. Impact on ongoing P0-A tuning

  • PR #609 stays as-is (regression test, strict xfail).
  • Lever #2 run completes under the current (broken) enrichment; the data is usable for within-run comparisons but not trustworthy for live-performance prediction — a caveat we've already internalised.
  • After the stateful refactor flips, we re-run the P0-A baseline reference (tw=18, autocal=0, binary=0) with the new enrichment as the new ground truth. Prior runs become historical context, not comparators.

11. Decisions taken (post committee review — was "open questions" in v1/v2)

  1. Tolerance floor. DECISION: target bit-identical parity in unit tests. If float non-associativity forces a relaxation, document per-indicator with the smallest feasible tolerance (≤ 1e-9 relative), justify in the test, and verify the divergence has zero impact on downstream decisions (signal, threshold crossings, trade generation). Committee rec. 2.

  2. State serialisation / MLflow persistence (P6). DECISION: defer to a follow-up issue post-refactor. Scope this issue to the core engine only. The follow-up unlocks fast paper/live startup + auditable replay; to be prioritised immediately after Phase 4 cutover. Committee rec. 3.

  3. Multi-timeframe boundary policy. DECISION: keep the aligned-close policy (1h indicator updates only when the 1h bar closes). Matches today's enrich_batch semantics and standard chart conventions, so there is no backward-incompatibility risk. A future flag for partial-bar updates is tracked as a separate enhancement if a latency-sensitive strategy ever needs it. Committee rec. 4.

  4. Paper/live bootstrap history. DECISION: the bootstrap window is computed per-indicator from the inventory (§6 Phase 0 deliverable). WARMUP_BARS = max(per-indicator-requirements). Runtime enforcement in Enricher.bootstrap(history): if len(history) < WARMUP_BARS, log ERROR enrichment_bootstrap_insufficient_history and either (a) auto-fetch more from the adapter if supported, or (b) raise a hard exception. Default: raise; adapters can opt into auto-fetch. Committee rec. 5.

  5. Backward compatibility of persisted models. DECISION: strictly enforced. The refactor must produce bit-identical numerical values for all existing features so deployed models keep working without retraining. If any indicator needs a deliberate semantic change during the port, it must emit a new feature_set_version and trigger full model retraining; in-place semantic drift is not allowed. Committee rec. 6.

  6. API rename. DECISION: rename now. New public API is bootstrap() + enrich_incremental(). enrich_streaming() becomes a deprecation shim emitting DeprecationWarning during Phases 1–3; removed entirely in Phase 4. enrich_batch() stays (it's still the backtest startup entry point, now implemented as "bootstrap + emit-all-rows"). Committee rec. 7.

  7. Multi-tenant enricher lifecycle (§8ter) — DECISION deferred to Phase 1 design: default to a runtime registry per process with eviction on zero references for 10 minutes. If we ever run more than one user per process concurrently, revisit.

  8. Replica policy (§8bis) — DECISION: enforced at infrastructure level via Kubernetes StatefulSet + partition-by-key on (symbol, timeframe, feature_set_version). A Phase 4 deliverable is updating infra/helm/*/values-prod.yaml accordingly. Committee rec. 8.

  9. Operational safeguards (committee rec. 12) — DECISION: added to the scope. Three components:

     • Runtime kill switch CVN_ENRICHMENT_PAUSE=1: freezes state updates in place, emits last-known features each candle, logs CRIT enrichment_paused_by_operator. Used when we suspect a bad state after an upstream data incident.
     • Incident response runbook documentation/runbooks/enrichment_incident.md: Phase 2 deliverable, to be written alongside the threshold_calibrator_* runbooks (#608).
     • Long-term drift monitoring: log per-indicator distributional summaries (mean, std, quantiles of outputs) to MLflow/W&B at model-training time; Grafana alert on distributional shift vs historical training window.
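The kill-switch behaviour described under decision 9 can be sketched as a thin guard around the per-candle step. This is an illustration under assumptions: the `step` wrapper, its argument order and the `update` / `emit` callables are hypothetical stand-ins for the indicator protocol, and CUSUM gating is left out for brevity. Only the `CVN_ENRICHMENT_PAUSE=1` variable and the `enrichment_paused_by_operator` log key come from this document; "CRIT" is mapped to Python's `CRITICAL` level.

```python
import logging
import os
from typing import Any, Callable, Mapping, Tuple

logger = logging.getLogger("enrichment")

PAUSE_ENV_VAR = "CVN_ENRICHMENT_PAUSE"


def enrichment_paused(env: Mapping[str, str] = os.environ) -> bool:
    """True when the operator has set the kill switch to "1"."""
    return env.get(PAUSE_ENV_VAR) == "1"


def step(
    state: Any,
    last_features: Any,
    candle: Any,
    update: Callable[[Any, Any], Any],
    emit: Callable[[Any], Any],
    env: Mapping[str, str] = os.environ,
) -> Tuple[Any, Any]:
    """Process one candle, honouring the kill switch.

    Paused: the state is left untouched and the last-known features are
    re-emitted. Otherwise: update the state, then emit from the new state.
    """
    if enrichment_paused(env):
        logger.critical("enrichment_paused_by_operator")
        return state, last_features
    new_state = update(state, candle)
    return new_state, emit(new_state)
```

Reading the variable on every candle, rather than once at startup, lets an operator pause and resume a running process without a restart, assuming the deployment can change the process environment or substitutes another `Mapping` for `env`.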

12. References

  • Diagnostic PR: #609 — enrichment parity test, xfail strict
  • Run evidence: ftf_20260420_133512_8328fb_ATR1.5_3.0_H5 (lever #2, shadow logs)
  • Parent issue: #599
  • User design veto (2026-04-20): "if we have two different paths between the backtest and trading, we are backtesting nothing"
  • Relevant ADRs:
     • ADR-23 (version-pinned MLflow artifacts): extends naturally to enrichment state (P6)
     • ADR-25 (no silent fallback): applies to the shadow-comparison (fail fast, not warn-and-continue)
     • ADR-40 (paper/live same kernel, only the adapter differs): directly aligned; this refactor is the enforcement of that ADR in the enrichment stage
  • Prior art: streaming indicator engines in pandas-ta, ta-lib, CCXT's pro streaming adapters, TradingView's Pine Script (bootstrap + per-bar update is their canonical pattern)

13. Acceptance criteria for this design document

  • Problem quantified with run evidence + regression test
  • Design principles (P1–P8) explicit and ranked
  • Component layout + indicator contract specified
  • Rollout plan in phases with exit criteria
  • Validation criteria (automated / functional / operational / perf / edge-case)
  • Risks + mitigations
  • Impact on ongoing P0-A stated
  • All open questions resolved (9/9) — see §11
  • Alternative approaches considered and rejected with rationale
  • Scaling semantics + multi-tenancy (§8bis, §8ter)
  • Committee review PASSED (session scores: Architect 9.0 / ML 9.0 / Data 8.5 / Crypto 8.5 / Ops 8.0, strong consensus, 13 recommendations integrated into v3)
  • Child issue of #599 created to execute Phases 0–4

This document is the spec of that child issue.


14. Committee submission (for §11 review)

Title: Stateful Enrichment Refactor — design review

Question (target score: 8+):

Review of documentation/../design/CVN-N005-stateful-enrichment.md before we commit to the 4-week refactor. The diagnostic phase is complete (#609, failing regression test, root cause identified as 500-bar rolling window vs full-history batch on Wilder / EMA / multi-tf indicators).

Three prior options were rejected (§8). The proposal (Option D) is: single stateful+incremental enrichment engine, shared by backtest and paper/live, proven parity by construction rather than by tolerance.

Committee review requested on:

  1. The non-negotiable constraint (§2.3, P1): is "one code path, same in backtest and live" the right hill to die on, or is there a principled way to have two paths that preserves backtest fidelity? (User veto of 2026-04-20 was blunt but we want the committee's independent read.)

  2. Indicator contract (§4.2): the bootstrap_state + update protocol. Sufficient for all indicator classes we face (simple recursive, ring buffer, Wilder, multi-timeframe, custom xgb_*)? Any class we've missed?

  3. Shadow-comparison harness (P7): is "run both paths, fail fast on divergence > tolerance" the right safety net, or do we need something stronger (e.g. production-blocking)?

  4. The 4-week estimate (§6): realistic for a team of 1? If no, what's the minimum staffing we should plan for?

  5. Scaling semantics (§8bis): is "explicit-state-threaded-through-pure-functions" the correct mental model, and is §8ter's shared-state / per-user-decisions separation sound? Specifically: are we right that enrichment state is never per-user, only per-(symbol, timeframe, feature_set_version)? Any use case the committee sees where a user would legitimately want "their own" enrichment without picking a different feature set?

  6. Open questions §11 (1–8): direction preferred on each, with rationale.

  7. Blind spots: what's the most expensive mistake we could make executing this plan that we haven't surfaced in §9?

Deliverable: per-expert opinion + consolidated verdict with named blockers if any.