FTF ↔ Runner Parity Certification — Design document

Status: v2, committee-reviewed PASSED (Architect 9.0, ML Engineer 9.0, Crypto Trader 9.0, Ops 9.0, Data Scientist 8.5; avg 8.9/10, strong consensus). 7 recommendations integrated below.

Authors: CVNTrade research, 2026-04-21

Sibling issues: #599 (StatefulEnrichment: provides per-feature equivalence-by-construction), #568 (Pipeline runner Phase 1+: provides the streaming execution path), #614 (execution track for this doc)

Prerequisite: #599 design v3 PASSED (2026-04-20, committee session). The parity guarantee proposed here composes on top of the stateful refactor.

Changelog

  • v2 (2026-04-21): post committee review, PASSED. 7 recommendations integrated:
    • Tolerance ladder: kept 1e-9 / strict, added per-feature documentation requirement in §4 P4 + inventory.
    • Vectorised with proof: canonical definition must be simple and direct (explicit rule in §4 P3).
    • E2E fixture: expanded to 3 cryptos (BTC/ETH/SOL) × {quiet, volatile, stress} scenarios (§5 Eq. 5).
    • Phase E governance: joint sign-off DRI + senior MLOps/SRE lead or designated committee member (§6 Phase E).
    • Monolith deprecation: soft-deprecate 2–4 weeks after harness stable, then full removal (new §6 Phase F).
    • Scoping: folded into #599 Phase 2 with dedicated engineer owning the parity suite (§9).
    • New §11bis "Blind spots addressed" covering all 7 committee-flagged angles:
      • Upstream OHLCV trust boundary + monitoring
      • Model artifact serialisation/deserialisation bit-identity
      • Runtime environment consistency (OS/Python/lib versions)
      • Training determinism (HPO seeds → bit-identical artifacts, ADR-23)
      • Timestamp/timezone consistency (DST, tz boundaries)
      • Property-based testing (Hypothesis) for vectorised features
      • Filter starvation (ADR-45)
  • v1 (2026-04-21) — initial draft submitted to committee.

1. Executive summary

CVNTrade faces a structural dilemma between speed of experimentation and fidelity to production:

  • Vectorised batch backtest — fast (~25–30 min per FTF run), powerful for multi-lever sweeps, but uses a code path that is not the one trading in production.
  • Streaming pipeline runner — matches paper/live behaviour, but 2–5× slower per backtest, turning a 10-crypto FTF cycle into tens of hours.

At today's FTF volume (~45 cells × 15 HPO trials × multiple levers × re-runs on config changes), running the tuning in streaming mode is not economically viable. Running it in batch mode raises a fundamental question: do the hyperparameters tuned on the batch path transfer to what will actually trade in production?

The usual answer ("they should, same code underneath") is an article of empirical faith, not a demonstration. This document specifies a formal alternative: prove equivalence at each atomic stage (features, inference, filters, execution), and by composition prove that a batch run and a streaming run on the same data produce bit-identical trading decisions. The FTF can then stay vectorised with mathematical certainty, and the migration to the runner at the end of tuning is a no-op by construction, not a hope.

Total effort for the certification harness: ~1 week senior engineer, once. Permanent CI gate afterwards — any regression on parity blocks merge.


2. Problem statement

2.1 The dilemma (quantified)

Empirical measurements on FTF runs of 2026-04-20:

| Config | Duration | Notes |
|---|---|---|
| Baseline (tw=18, shadow=off) | ~30 min / run | vectorised backtest only |
| Lever #2 (tw=9, shadow=on) | ~3h30 / run | monolith + streaming in parallel; enrichment rolling 500-bar window dominates |
| Baseline-ref (tw=18, shadow=off) | ~56 min / run | vectorised only, with autocal=0 |

Projected full FTF cycle on defi_top5:

  • 5 levers × 3 variants per lever × 10 cryptos × 5 folds × 15 HPO trials = ~11 250 cells
  • At 25 min / 45-cell run: ~100 h of vectorised compute for one full sweep
  • At 5× streaming overhead: ~500 h of streaming compute — untenable

We cannot do FTF in full streaming mode. We must do it in batch. So we need a certificate that the batch result transfers.
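The projection above can be reproduced as a back-of-envelope calculation. The sketch below uses only the figures quoted in this section; the variable names are illustrative, and the 5× factor is the upper end of the 2–5× streaming overhead stated in §1.

```python
# Back-of-envelope projection using only the figures quoted in section 2.1.
levers, variants, cryptos, folds, hpo_trials = 5, 3, 10, 5, 15
cells = levers * variants * cryptos * folds * hpo_trials

cells_per_run = 45        # one FTF run covers ~45 cells
minutes_per_run = 25      # vectorised duration per run
streaming_overhead = 5    # upper end of the 2-5x range

runs = cells / cells_per_run
batch_hours = runs * minutes_per_run / 60
stream_hours = batch_hours * streaming_overhead

print(cells, runs, round(batch_hours), round(stream_hours))
# prints: 11250 250.0 104 521
```

The rounded values (~104 h and ~521 h) match the "~100 h" and "~500 h" orders of magnitude used in the argument.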

2.2 What "transfers" must mean

A lever conclusion like "isotonic beats platt by 8% on Sortino net-15bps" is only actionable if the winner will actually produce that Sortino when it trades. If the batch path and the streaming path produce different trade sequences on the same OHLCV, the FTF ranking may not transfer.

The certification claim we want: given a fixed (model, hyperparams, feature config), the trade sequence produced by the streaming pipeline runner on a given OHLCV window is bit-identical to the trade sequence produced by the vectorised batch path on the same window.

From there, Sortino, PnL, f1_buy, action_rate — all downstream metrics — are equal by construction.

2.3 What "bit-identical" means precisely

  • Feature values: equal at machine precision (≤ 1e-12 relative) or differ only by documented float non-associativity with bounded-magnitude tolerance (≤ 1e-9 relative, per-feature).
  • Inference outputs (p_buy per candle): same tolerance.
  • Trading signals (signal ∈ {HOLD, BUY} per candle): strictly equal, no tolerance.
  • Trades (entry bar, entry price, exit bar, exit price, pnl): strictly equal.
  • Aggregate metrics (Sortino, f1, etc.): equal by construction if the trade list is equal.

Strict equality at the signal level is the key contract — because any signal-level divergence triggers a different trade and cascades.
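As a minimal sketch of how the two comparison regimes above could be expressed in a test helper: continuous values are compared with a relative tolerance, discrete outputs with plain equality. The helper names (`continuous_equal`, `discrete_equal`) are hypothetical and not part of the codebase; `math.isclose` is used here so that the tolerance is explicitly relative.

```python
import math

REL_TOL_DEFAULT = 1e-9   # ceiling for continuous values, per the list above

def continuous_equal(a, b, rel_tol=REL_TOL_DEFAULT):
    """Element-wise relative comparison for feature values / probabilities."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=0.0) for x, y in zip(a, b)
    )

def discrete_equal(a, b):
    """Signals and trades: strict equality, no tolerance."""
    return list(a) == list(b)

# Continuous: a 1e-12 relative wobble passes, a 1e-6 one does not.
assert continuous_equal([0.61, 0.42], [0.61 * (1 + 1e-12), 0.42])
assert not continuous_equal([0.61], [0.61 * (1 + 1e-6)])

# Discrete: any single differing signal is a failure.
assert discrete_equal(["HOLD", "BUY"], ["HOLD", "BUY"])
assert not discrete_equal(["HOLD", "BUY"], ["BUY", "BUY"])
```

With `abs_tol=0.0`, values that are exactly zero on one side must be exactly zero on the other; whether that is the desired behaviour for features that legitimately cross zero is a per-feature decision.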


3. Complexity analysis

3.1 Training

Both modes train the same model on the same training data (guaranteed by the #596 train_window fix, see §9). Training is not a parity concern: it produces a deterministic artifact from an OHLCV window + hyperparameters + seed.

3.2 Backtest

| Stage | Batch (vectorised) | Streaming (runner) |
|---|---|---|
| Feature generation | O(T·F) pandas/numpy, SIMD-friendly, well-amortised | O(T) candles × O(1) update + O(1) emit; slower constant factor (Python interpreter) |
| Inference | model.predict_proba(X_batch): single call, batched | model.predict_proba(X_single) per candle: T function calls |
| Decision loop | O(T), inherently sequential (positions, cooldown, drawdown gate) | identical: both are already sequential here |
| Execution engine | O(T), sequential | identical |

Observation: the decision loop and the execution engine are already sequential in both modes. The batch optimisation applies only to feature generation and inference. So the parity concern narrows to those two stages.

3.3 Inference (paper/live)

Streaming only. O(1) per candle. No batch counterpart — but this also means inference parity is a property of the trained model + feature vector, not of the execution mode. As long as features are equal at T, predictions are equal.


4. Design principles

P1 — Separate "value computation" from "decision loop"

The decision loop (filter chain + execution) is inherently sequential in both modes. We do not try to vectorise it. The parity concern lives strictly in feature generation and inference, which are the only stages where batch wins on speed.

P2 — Prove atomic equivalence, then compose

Instead of proving batch_pipeline ≡ stream_pipeline globally (intractable), we prove each atomic component equivalent:

  • feature_i_batch ≡ feature_i_stream for every i
  • inference_batch ≡ inference_stream
  • filter_chain_batch ≡ filter_chain_stream (trivially: same code, both sequential)
  • execution_batch ≡ execution_stream (trivially: same code)

By composition: if all atomic pairs are equivalent, the full pipelines are equivalent.

P3 — Vectorised path allowed, but certificated

A vectorised implementation of a feature can exist (and should, for speed), but only if accompanied by a test proving it equivalent to the canonical update+emit definition. The canonical definition is the streaming path (#599's update + emit contract); the vectorised path is an optimisation, not a source of truth.

Committee v2 reinforcement: the canonical definition must be as simple and direct as possible — a naive for-loop implementing the feature's mathematical definition line-by-line. Any cleverness (memoisation, batch shortcuts) lives in the vectorised path, proven equivalent. The canonical is the ground-truth oracle, not the performance path. If the canonical is hard to read, the certification chain collapses.

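import numpy as np

def update_ema(state, candle, alpha): ...      # canonical (streaming) state update
def emit_ema(state): ...                        # canonical (streaming) emit

def batch_ema_vectorised(ohlcv, alpha):
    # optimised path: only valid once proven equal to update/emit
    return ohlcv['close'].ewm(alpha=alpha, adjust=False).mean().values

def test_ema_vectorised_equals_sequential():
    # property-based test or fixed fixtures
    # rtol is set explicitly: np.allclose defaults to rtol=1e-5, which would
    # silently dominate a 1e-12 bound passed through atol alone
    assert np.allclose(
        batch_ema_vectorised(ohlcv, alpha=0.1),
        batch_ema_from_update_emit(ohlcv, alpha=0.1),   # replays update_ema + emit_ema per bar
        rtol=1e-12,
        atol=0.0,
    )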

If no vectorised form can be proven equivalent for feature X, we keep the loop for that feature. The cost is localised — we don't lose the SIMD gains on all other features.
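To make the canonical/optimised split concrete, here is a self-contained sketch in pure Python. It is an illustration under stated assumptions, not the project's implementation: the state layout (a dict with one `ema` key), the `bootstrap_ema` helper and `batch_ema_fast` are hypothetical, and `batch_ema_fast` merely stands in for the pandas `ewm` path so the example runs without third-party packages.

```python
import math
from itertools import accumulate

def bootstrap_ema(first_close):
    # Hypothetical state layout: a dict holding the running EMA value.
    return {"ema": first_close}

def update_ema(state, candle, alpha):
    # Canonical definition, written directly from the recurrence
    # ema_t = alpha * close_t + (1 - alpha) * ema_{t-1}.
    state["ema"] = alpha * candle["close"] + (1 - alpha) * state["ema"]
    return state

def emit_ema(state):
    return state["ema"]

def batch_ema_from_update_emit(candles, alpha):
    """Replay the canonical path bar by bar and collect every emit."""
    state = bootstrap_ema(candles[0]["close"])
    out = [emit_ema(state)]
    for candle in candles[1:]:
        out.append(emit_ema(update_ema(state, candle, alpha)))
    return out

def batch_ema_fast(candles, alpha):
    """Stand-in for the optimised path (pandas ewm in the text above)."""
    closes = [c["close"] for c in candles]
    return list(accumulate(closes, lambda prev, x: alpha * x + (1 - alpha) * prev))

# Deterministic synthetic closes (illustrative only, all positive).
candles = [{"close": 100.0 + 0.37 * i - 0.002 * i * i} for i in range(300)]
reference = batch_ema_from_update_emit(candles, alpha=0.1)
fast = batch_ema_fast(candles, alpha=0.1)

assert len(fast) == len(reference)
assert all(
    math.isclose(f, r, rel_tol=1e-12, abs_tol=0.0) for f, r in zip(fast, reference)
)
```

The canonical functions stay deliberately naive; any speed-up lives in the second path, which earns its place only through the assertion at the end.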

P4 — Bit-identical on signals, float tolerance on continuous values

The tolerance ladder is explicit:

  • Continuous values (feature outputs, probabilities, PnL): 1e-9 relative tolerance (default)
  • Discrete outputs (signal, trade list): strict equality, no tolerance
  • Aggregate metrics: equal by construction from equal trade lists

Committee v2 refinement: any feature requiring a tolerance looser than 1e-9 must carry a documented justification in documentation/architecture/ENRICHMENT_INDICATORS.md (the per-feature line gains a tolerance_override column with reason — e.g., "rolling quantile uses pandas linear interpolation, deviates by 2e-9 on denormalised ranges; proven safe by ..."). Where feasible, tighten to 1e-12 — the default 1e-9 is a ceiling, not a target. No per-run ad-hoc loosening; tolerance is a versioned feature property.
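One way to make "tolerance is a versioned feature property" enforceable is to reject, at definition time, any loosened tolerance that has no recorded reason. The sketch below is hypothetical: the `FeatureTolerance` class and the feature names are invented for illustration, and only the `tolerance_override` idea and the 1e-9 default come from this document.

```python
from dataclasses import dataclass

DEFAULT_REL_TOL = 1e-9   # the ceiling from the tolerance ladder

@dataclass(frozen=True)
class FeatureTolerance:
    feature: str
    rel_tol: float = DEFAULT_REL_TOL
    tolerance_override_reason: str = ""

    def __post_init__(self):
        # Tightening is always allowed; loosening needs a written reason.
        loosened = self.rel_tol > DEFAULT_REL_TOL
        if loosened and not self.tolerance_override_reason.strip():
            raise ValueError(
                f"{self.feature}: rel_tol={self.rel_tol} is looser than "
                f"{DEFAULT_REL_TOL} and carries no documented justification"
            )

# Tightened entry: no justification needed.
ema = FeatureTolerance("ema_20", rel_tol=1e-12)

# Loosened entry: accepted only with a reason attached.
quantile = FeatureTolerance(
    "rolling_q90",
    rel_tol=2e-9,
    tolerance_override_reason="pandas linear interpolation, see inventory entry",
)
```

Keeping such entries in the repository next to the inventory means a tolerance change shows up in code review like any other diff.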

P5 — CI-gated, not negotiable

All parity tests run on every PR. Breaking parity is a blocker, same level as a failed unit test. No exceptions — a regression on parity is a regression on the scientific foundation of the entire FTF → Production workflow.

P6 — No "empirical proximity"

We explicitly reject claims of the form "the two produce similar enough results". Either the atomic equivalences hold (and composition gives us bit-identity by construction), or they don't (and we fix the component). There is no middle ground in CI.


5. Architecture — the 4 + 1 equivalences

Eq. 1 — Feature equivalence (per-feature)

Per the #599 taxonomy (Recursive / RingBuffer / Stateless):

For every feature, prove:

∀ T,  batch_emit(ohlcv[:T+1])  ==  stream_emit ∘ (stream_update)^T  (bootstrap(ohlcv[:warmup]))

Tests:

  • Unit per-feature (test_feature_<name>_batch_stream_parity): fixed-seed random OHLCV, run batch, run stream, assert last-value equal within tolerance. Hypothesis-based property tests for extra coverage.
  • Sweep over T: not only the last bar; assert parity at 10 random intermediate T values to catch mid-window drift.
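The test shape described above can be sketched for a RingBuffer-class feature. This is an illustrative example, not code from #599: `batch_rolling_mean` and `StreamRollingMean` are hypothetical names, the window of 20 is arbitrary, and the seeded random walk stands in for the "fixed-seed random OHLCV". The streaming side keeps a running sum, so it accumulates rounding differently from the batch recomputation, which is exactly the kind of drift the sweep over T is meant to bound.

```python
import math
import random
from collections import deque

WINDOW = 20

def batch_rolling_mean(closes, window=WINDOW):
    """Batch side: recompute the mean of the last `window` closes."""
    tail = closes[-window:]
    return sum(tail) / len(tail)

class StreamRollingMean:
    """Streaming side: ring buffer with a running sum (O(1) per candle)."""

    def __init__(self, window=WINDOW):
        self.buf = deque(maxlen=window)
        self.total = 0.0

    def update(self, close):
        if len(self.buf) == self.buf.maxlen:
            self.total -= self.buf[0]      # value about to be evicted
        self.buf.append(close)
        self.total += close

    def emit(self):
        return self.total / len(self.buf)

# Fixed-seed random walk of 2000 closes.
rng = random.Random(42)
closes = [100.0]
for _ in range(1999):
    closes.append(closes[-1] * (1 + rng.gauss(0, 0.01)))

# 10 random intermediate T values plus the last bar.
checkpoints = set(rng.sample(range(WINDOW, len(closes)), 10)) | {len(closes) - 1}

stream = StreamRollingMean()
for t, close in enumerate(closes):
    stream.update(close)
    if t in checkpoints:
        expected = batch_rolling_mean(closes[: t + 1])
        assert math.isclose(expected, stream.emit(), rel_tol=1e-9, abs_tol=0.0), t
```

Passing the candle index `t` as the assertion message gives the first-divergence index for free when a checkpoint fails.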

Eq. 2 — Inference equivalence

∀ X, i,  model.predict_proba(X)[i]  ==  model.predict_proba(X[i:i+1])[0]

Standard property of sklearn / XGBoost models when features are stateless at the model level (no batch-norm, no recurrent layer). Test: batch 1000 rows, compare to 1000 individual calls, assert element-wise equality.
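A minimal sketch of the batch-versus-single comparison follows. The helper `assert_inference_parity` is hypothetical and relies only on the `predict_proba` method and on slicing `X[i:i + 1]`, so it should work with plain lists as well as NumPy arrays; `ToyStump` is an invented stand-in so the example runs without any ML library. In the real test the pinned model artifact would be passed in, and whether strict equality holds for that library build is precisely what the test establishes.

```python
def assert_inference_parity(model, X):
    """Compare one batched predict_proba call with one call per row."""
    batch = model.predict_proba(X)
    for i in range(len(X)):
        single = model.predict_proba(X[i:i + 1])
        # Element-wise equality, no tolerance.
        assert list(batch[i]) == list(single[0]), f"row {i} diverged"

class ToyStump:
    """Hypothetical stand-in model: one threshold on the first feature."""

    def predict_proba(self, X):
        return [[0.3, 0.7] if row[0] > 0.5 else [0.8, 0.2] for row in X]

# 1000 rows, as in the test description above.
X = [[i / 1000.0, 1.0 - i / 1000.0] for i in range(1000)]
assert_inference_parity(ToyStump(), X)
```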

Eq. 3 — Filter chain equivalence

Each plugin is either:

  • Stateless per candle (trend, cost, kelly) → trivially batch=stream
  • Stateful with explicitly-versioned state (cooldown ticker, concurrency tracker, regime tracker) → update the state the same way in both modes

Test: feed a deterministic signal sequence of 500 candles; run through FilterChainExecutor in batch accumulating mode vs candle-by-candle mode; assert the output signal sequence is strictly equal.
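The following sketch illustrates the test shape for a stateful plugin. It does not use the real FilterChainExecutor: `CooldownFilter` and its `to_state` / `from_state` methods are hypothetical, invented to show what "explicitly-versioned state" buys. The batch mode keeps one long-lived instance; the candle-by-candle mode rebuilds the plugin from its serialised state on every candle, so any state hidden outside `to_state` would surface as a divergence.

```python
import random

class CooldownFilter:
    """Hypothetical plugin: an accepted BUY suppresses the next `cooldown` candles."""

    def __init__(self, cooldown, remaining=0):
        self.cooldown = cooldown
        self.remaining = remaining

    def step(self, signal):
        if self.remaining > 0:
            self.remaining -= 1
            return "HOLD"
        if signal == "BUY":
            self.remaining = self.cooldown
        return signal

    def to_state(self):
        # Everything needed to resume must be captured here.
        return {"version": 1, "remaining": self.remaining}

    @classmethod
    def from_state(cls, state, cooldown):
        assert state["version"] == 1
        return cls(cooldown, remaining=state["remaining"])

def run_batch_accumulating(signals, cooldown=5):
    flt = CooldownFilter(cooldown)                 # one long-lived instance
    return [flt.step(s) for s in signals]

def run_candle_by_candle(signals, cooldown=5):
    state, out = CooldownFilter(cooldown).to_state(), []
    for s in signals:
        flt = CooldownFilter.from_state(state, cooldown)   # rebuilt every candle
        out.append(flt.step(s))
        state = flt.to_state()
    return out

# Deterministic 500-candle signal sequence.
rng = random.Random(7)
signals = [rng.choice(["HOLD", "HOLD", "BUY"]) for _ in range(500)]
assert run_batch_accumulating(signals) == run_candle_by_candle(signals)   # strict
```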

Eq. 4 — Execution engine equivalence

Given the same signal sequence and the same OHLCV, entry prices, SL/TP exit logic, fees, and slippage are deterministic arithmetic on the inputs.

Test: fixed signal fixture (~50 signals), run cvntrade_backtest_engine.execute in both modes, assert trades list bit-identical (entry_bar, entry_price, exit_bar, exit_price, exit_reason, pnl).

Eq. 5 — End-to-end integration certificate

On real OHLCV windows pinned as Parquet fixtures in tests/fixtures/parity/. Committee v2 expansion: the fixture covers 3 cryptos × 3 scenarios (9 test cases) to broaden market-condition coverage:

| Crypto | Scenario | Period | Rationale |
|---|---|---|---|
| BTC | Quiet | 2024-09 (mean vol) | baseline case |
| BTC | Volatile | 2024-11 (post-halving rally) | broad swings, many CUSUM events |
| BTC | Stress | 2022-06 (Terra/3AC) | gap days, cross-timezone incidents |
| ETH | Quiet | 2024-07 | different asset characteristics |
| ETH | Volatile | 2024-03 (shanghai upgrade) | regime shift |
| ETH | Stress | 2022-05 | correlated with BTC stress |
| SOL | Quiet | 2024-08 | lower-cap asset |
| SOL | Volatile | 2023-11 (FTX rally) | high-vol small-cap |
| SOL | Stress | 2022-11 (FTX collapse) | extreme volatility + liquidity gaps |

import pytest

# 3 cryptos x 3 scenarios = 9 pinned cases, one per row of the fixture table
FIXTURES = [(c, s) for c in ("BTC", "ETH", "SOL") for s in ("quiet", "volatile", "stress")]

@pytest.mark.parametrize("crypto,scenario", FIXTURES)
def test_e2e_parity(crypto, scenario):
    ohlcv = load_fixture(crypto, scenario)     # pinned Parquet, ~2000 bars
    model = load_fixture_model(crypto)         # pinned MLflow artifact
    # `config` is the pipeline config under test (assumed defined at module level)

    batch_trades, batch_metrics = run_batch_pipeline(ohlcv, model, config)
    stream_trades, stream_metrics = run_stream_pipeline(ohlcv, model, config)

    assert batch_trades == stream_trades                           # strict
    assert batch_metrics.n_trades == stream_metrics.n_trades       # strict
    assert abs(batch_metrics.sortino - stream_metrics.sortino) < 1e-9
    assert abs(batch_metrics.total_return - stream_metrics.total_return) < 1e-9

Aggregate runtime target: < 90 s for the 9 cases (parallelised). Part of pre-merge CI. If runtime grows past 90 s as features accrete, either parallelise harder or rotate scenarios weekly (always 3 cases per PR, full 9 nightly).


6. Rollout plan

Phase A — Per-feature parity tests (2 days)

Extends the #599 test suite. One test per feature class (R / B / S / T). Leverages the update / emit contract that #599 already introduces. For vectorised variants, add a parity test against the sequential reference.

Exit criterion: every feature in the enrichment inventory (documentation/architecture/ENRICHMENT_INDICATORS.md) has a passing parity test.

Phase B — Inference + filter chain + execution parity (1.5 days)

Three targeted tests, independent of each other. Small surface, quick to write.

Exit criterion: the three parity tests run in CI and pass.

Phase C — End-to-end integration harness (2 days)

Builds the fixtures (pinned ~2000-bar OHLCV windows for BTC/ETH/SOL, stored as Parquet in tests/fixtures/parity/, per §5 Eq. 5), implements the side-by-side runner, writes the assertions.

Exit criterion: E2E test runs in CI on every PR, < 90 s aggregate (the §5 Eq. 5 target), passes on main.

Phase D — CI gate + documentation (1 day)

  • CI config: parity tests run on every PR, failing = merge blocked
  • Update CLAUDE.md with the "parity contract" section
  • Write the operator-facing runbook: "what to do if a parity test fails"

Exit criterion: parity test results visible in PR checks, documented recovery path.

Phase E — Certification ceremony at FTF close (ad-hoc, ~1 hour)

When FTF P0-A is closed and we're ready to move to production:

  1. Run the full E2E parity test on the exact config snapshot that was selected at the end of FTF.
  2. Run the streaming pipeline runner on the top-3 crypto × test-window slices that drove the decision.
  3. Confirm trade lists and metrics are bit-identical to the batch results FTF used.
  4. Joint sign-off (committee v2 recommendation 4): the certification document requires two signatures:
    • DRI (Directly Responsible Individual) for the MLflow backbone / FTF flow
    • Senior MLOps / SRE lead (or a committee-designated member acting as governance reviewer)

No solo sign-off. Shared accountability: technical + operational + governance.

Exit criterion: a one-page certification document (timestamp, config git_sha, fixture SHAs, test outputs, two signatures) archived in committee/sessions/ as the production-ready bill of health.

Phase F — Monolith deprecation (1 sprint after Phase E, committee v2 recommendation 5)

Once the certification harness is fully live and stable in CI (Phase D exit + Phase E first successful execution):

  • Week 0: src/backtest/cvntrade_backtest_engine.py gets a DeprecationWarning at import; the 2–4 week soft-deprecation window starts.
  • During the soft-deprecation window:
    • The pipeline runner is the declared primary; the monolith remains callable for comparison / rollback.
    • Production FTF / paper / live do not call the monolith.
    • CI parity tests stay green (regression monitor on the runner).
    • Any warning-on-import hit during the window triggers a call-site investigation.
  • End of window (2–4 weeks, operator decision):
    • If no warnings hit and parity tests have been green continuously: full removal of cvntrade_backtest_engine.py and all its callers.
    • If any warning hit: investigate, extend window, or open a regression issue.
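A minimal sketch of the import-time warning and of how a hit can be detected follows. The function name and message are hypothetical; in the real module the `warnings.warn` call would sit at module level. One point worth keeping in mind for the "any warning hit" rule: Python's default filters hide DeprecationWarning unless it is triggered by code in `__main__` (test runners such as pytest typically surface it), so the monitoring has to enable or capture it explicitly.

```python
import warnings

def warn_monolith_import(module_name="cvntrade_backtest_engine"):
    """Body of the deprecation hook; illustrative wording."""
    warnings.warn(
        f"{module_name} is soft-deprecated; use the pipeline runner instead",
        DeprecationWarning,
        stacklevel=2,   # attribute the warning to the caller of this function
    )

# Capture the hit explicitly instead of relying on default filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_monolith_import()

hits = [w for w in caught if issubclass(w.category, DeprecationWarning)]
assert len(hits) == 1
print(hits[0].message)
```

Running the suite with `-W error::DeprecationWarning` is another option during the window, since it turns any remaining call site into a hard failure.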

Exit criterion: monolith code removed from repo, all historical references migrated to git log archaeology only.


7. Validation criteria

Automated (CI)

  • All per-feature parity tests pass (classes R / B / S / T from #599 inventory)
  • Inference parity test passes
  • Filter chain parity test passes
  • Execution engine parity test passes
  • End-to-end integration parity test passes (on the pinned fixture)

Operational

  • Parity tests run in ≤ 90 s total (not blocking CI throughput)
  • Clear failure messages: on test failure, output which feature / stage diverged, with first-divergence candle index (same pattern as the shadow diagnostic in #599)

Process

  • Pre-merge CI gate active
  • Documentation in CLAUDE.md + operator runbook
  • Phase E ceremony executed once before P0-A conclusion lands in production

8. Risks

| Risk | Mitigation |
|---|---|
| A feature has no feasible vectorised equivalent → forced to stay in the slow path | Acceptable if localised; the FTF stays fast on the 99% of features that do vectorise. Documented in the inventory per feature. |
| Float non-associativity causes spurious test failures | Per-feature tolerance calibration (≤ 1e-9 relative, documented in the test). If a feature consistently requires a higher tolerance, the tolerance is recorded in its inventory entry + justified. |
| Parity test runtime grows as features are added | Test parallelisation; keep per-test runtime < 1 s; the aggregate can grow linearly with feature count without impact. |
| Developer bypasses CI by marking parity tests as xfail to merge fast | Governance: @pytest.mark.xfail on a parity test requires a one-line reference to a tracking issue that commits to removing the xfail within a defined sprint. Reviewed in CR. |
| E2E test depends on a fixture that rots (OHLCV changes, exchange delisting) | Fixture is pinned (Parquet in-repo), does not fetch from exchange at test time. Independent of external state. |
| False sense of security if parity tests pass but FTF config differs from Phase E config | The Phase E ceremony explicitly re-runs parity on the exact config that is being promoted, not on a historical one. |

9. Interfaces with in-flight work + scoping (v2)

  • #599 StatefulEnrichment: provides Eq. 1 largely by construction (the update/emit contract is a shared definition). This document extends #599's §7 validation with formal per-feature tests.
  • #568 Pipeline runner Phase 1+: provides the streaming side of the equation. Eq. 2–5 require the runner being callable as a library in tests (already the case).
  • #596 train_window fix (merged as commit 0a1cdc71): pre-requisite, ✅ done. The "same config → same training" assumption is now guaranteed.
  • #608 ThresholdCalibrator: orthogonal. The calibrator lives inside apply_thresholds and is version-pinned in MLflow — equivalent in both paths by construction.
  • #612 Inference gate audit: orthogonal. Confirms that the decision gate location is consistent between paths, which is a precondition of Eq. 2.

Scoping (committee v2 recommendation 6)

Folded into #599 Phase 2 as a corollary — shared test suite, same update/emit contract, same week of wall-clock. BUT with an explicit constraint:

  • A dedicated engineer owns the parity test suite within the #599 scope. Not a junior, not a rotation slot. The same person drives Eq. 1 (feature parity) and Eq. 2–5 (inference / filter / execution / E2E), so the test architecture stays coherent and the composition proof isn't split across contributors.
  • Naming this person is part of the Phase A kickoff for #599. Without a named owner, the parity track becomes an afterthought folded into unrelated PRs, which is exactly what we're trying to avoid (see Phase 2.9a refactor's leftover surface that produced #612 last week).

10. Decisions taken (v2 — was "open questions" in v1)

All v1 open questions are resolved post committee review. Kept here as a decision log.

  1. Tolerance ladder (§4 P4) — DECISION: 1e-9 relative on continuous / strict on discrete. Any looser tolerance requires a per-feature justification in the enrichment inventory. Default is a ceiling, not a target; tighten to 1e-12 where feasible.

  2. Vectorised path allowed with proof (§4 P3) — DECISION: allowed with mandatory parity test. Additional rule: canonical definition must be simple and direct (committee v2 rec. 2).

  3. Fixture for E2E (§5 Eq. 5) — DECISION: 3 cryptos × 3 scenarios = 9 cases, pinned Parquet, < 90 s aggregate CI runtime. Rotate nightly if needed (committee v2 rec. 3).

  4. Phase E governance (§6 Phase E) — DECISION: joint sign-off DRI + senior MLOps/SRE (or committee-designated). No solo sign-off (committee v2 rec. 4).

  5. Monolith deprecation (§6 Phase F) — DECISION: soft-deprecate 2–4 weeks after Phase D stable + Phase E first success, then remove. Safer than immediate removal, cheaper than keeping as permanent fallback (committee v2 rec. 5).

  6. Scoping vs #599 (§9) — DECISION: folded into #599 Phase 2 with dedicated engineer named at kickoff (committee v2 rec. 6).

  7. Blind spots — 7 covered explicitly in new §11bis (committee v2 rec. 7).

Remaining open questions (for future iterations)

  1. Parity test runtime ceiling: aggregate target is < 90 s today. As the feature inventory grows, at what point do we pivot from "all 9 scenarios per PR" to "3 per PR + 9 nightly"? Proposal: trigger the pivot at aggregate > 180 s, monitored via CI metrics.
  2. Continuous parity monitoring in production: the Phase E certification is a one-shot ceremony. Should we extend with continuous parity sampling (e.g. 1% of live trades shadow-run through the batch path and compare)? Not in scope of v1 implementation, tracked as a potential Phase G.

11. Acceptance criteria for this plan document

  • Problem statement quantified with real FTF durations
  • Complexity decomposition across training / backtest / inference
  • Six design principles (P1–P6) explicit
  • Four atomic equivalences + one integration equivalence specified with concrete test shapes
  • Rollout in 6 phases (A–F) with exit criteria
  • Risks + mitigations
  • Interfaces with #599, #568, #596, #608, #612 + scoping (§9)
  • Decisions converted to firm commitments (v2, §10)
  • Blind spots addressed (v2, §11bis — 7 committee-flagged angles)
  • Committee review PASSED (scores 9.0 / 9.0 / 9.0 / 9.0 / 8.5, avg 8.9/10, strong consensus)

Plan is now the spec of child issue #614 — FTF-to-Runner Parity Certification, executed as a corollary of #599 Phase 2.


11bis. Blind spots addressed (committee v2 rec. 7)

Seven angles flagged by the committee. Each becomes an explicit test or operational control.

11bis.1 Upstream OHLCV trust boundary

Concern: the parity proof assumes "same OHLCV in" → "same everything out". But if the data ingestion path between batch and live produces even slightly different OHLCV (rounding, exchange aggregation, late bars), the certificate is vacuous.

Control:

  • Document the trust boundary explicitly: parity starts at the OHLCV DataFrame handed to the enricher. Anything upstream (ETL, Feast, exchange adapter) is out of scope of this document but must satisfy its own parity (tracked under a separate observability issue).
  • Add a smoke test at pipeline entry: checksum the first 100 rows of OHLCV ingested by each path; emit event=ohlcv_checksum to Loki. Drift → alert on the shadow dashboard.
  • The Phase E ceremony explicitly reads the OHLCV checksum from both the FTF run that selected the config AND the streaming run used for certification; they must match.
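The entry-point checksum could look like the sketch below. It is an assumption-laden illustration: the row layout (timestamp string plus five floats), the function name `ohlcv_checksum` and the sample values are invented, and the real pipeline would read rows from the DataFrame handed to the enricher. `float.hex()` is used because it is an exact text form of a double, so a one-ulp rounding difference between the two ingestion paths changes the digest.

```python
import hashlib

def ohlcv_checksum(rows, n=100):
    """SHA-256 over the first `n` rows of (ts, open, high, low, close, volume)."""
    h = hashlib.sha256()
    for ts, o, hi, lo, c, v in rows[:n]:
        fields = [ts] + [float(x).hex() for x in (o, hi, lo, c, v)]
        h.update(",".join(fields).encode("utf-8") + b"\n")
    return h.hexdigest()

# Illustrative rows: the live path differs by a tiny rounding on one close.
batch_rows = [("2024-09-01T00:00:00Z", 57000.0, 57120.5, 56890.0, 57010.25, 12.5)]
live_rows = [("2024-09-01T00:00:00Z", 57000.0, 57120.5, 56890.0, 57010.25 + 1e-9, 12.5)]

events = [
    {"event": "ohlcv_checksum", "path": "batch", "sha256": ohlcv_checksum(batch_rows)},
    {"event": "ohlcv_checksum", "path": "live", "sha256": ohlcv_checksum(live_rows)},
]
drift = events[0]["sha256"] != events[1]["sha256"]
print("drift detected:", drift)
```

Comparing digests rather than raw rows keeps the emitted event small, and the same digest can be recorded in the Phase E certification document.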

11bis.2 Model artifact serialisation / deserialisation bit-identity

Concern: a trained XGBoost model saved in environment A, loaded in environment B, might produce slightly different predictions due to library version differences or serialisation-format drift.

Control:

  • New unit test test_model_save_load_bit_identical_inference:

model = train(X_train, y_train)
preds_before = model.predict_proba(X_test)
save_to_mlflow(model, path)
loaded = load_from_mlflow(path)
preds_after = loaded.predict_proba(X_test)
# STRICT: np.allclose(..., atol=0) would still apply its default rtol=1e-5
assert np.array_equal(preds_before, preds_after)

  • Plus a cross-environment variant: save on CI, load in a fresh container, compare. Run nightly.

11bis.3 Runtime environment consistency

Concern: OS, Python version, and key library versions (numpy, pandas, XGBoost, scikit-learn) drifting between batch (CI runner) and production (airflow pod or runtime pod).

Control:

  • Document the "runtime contract" in documentation/runtime_contract.md: pinned versions of numpy, pandas, XGBoost, LightGBM, CatBoost, scikit-learn, Python.
  • CI uses the same base Docker image as production (airflow_docker/Dockerfile.k8s), not a lighter CI image. Shared Dockerfile = shared runtime invariants.
  • Pre-merge CI gate: any pip install or dependency-change PR must pass the full parity suite (not just linted).
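The runtime contract can also be checked mechanically at process start. The sketch below is a hypothetical helper built on the standard-library `importlib.metadata`; the pinned version strings are placeholders, not the project's actual pins, and the `get_version` parameter exists only so the check can be exercised without those packages installed.

```python
from importlib import metadata

def check_runtime_contract(pinned, get_version=metadata.version):
    """Return {package: (pinned, installed)} for every mismatch; empty means compliant."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None            # package missing from this environment
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches

# Placeholder pins and a fake environment, for illustration only.
PINNED = {"numpy": "1.26.4", "pandas": "2.2.2", "xgboost": "2.0.3"}
FAKE_ENV = {"numpy": "1.26.4", "pandas": "2.2.1"}

def fake_version(name):
    if name not in FAKE_ENV:
        raise metadata.PackageNotFoundError(name)
    return FAKE_ENV[name]

report = check_runtime_contract(PINNED, get_version=fake_version)
print(report)
# prints: {'pandas': ('2.2.2', '2.2.1'), 'xgboost': ('2.0.3', None)}
```

Called with the default `get_version`, the same function inspects the live environment, which makes it usable both in CI and in the production pod.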

11bis.4 Training determinism

Concern: HPO with different seeds or non-deterministic parallelism produces different hyperparameters → different models → different predictions → parity violated upstream of the harness.

Control:

  • ADR-23 (version-pinned MLflow artifacts) already enforces this at the artifact level. Extension: a new test test_training_determinism:

model_a = train(ohlcv, hpo_seed=42)
model_b = train(ohlcv, hpo_seed=42)   # re-run with same seed
assert serialise(model_a) == serialise(model_b)   # byte-identical

  • All HPO configs must fix CVN_HPO_SAMPLER_SEED (already the case via regime_trainer.py:525).
  • Any future introduction of GPU training must document determinism controls or be gated out.

11bis.5 Timestamp / timezone consistency

Concern: OHLCV timestamps in batch may be tz-aware while a live stream delivers tz-naive timestamps; features that depend on dt.hour or dt.dayofweek would then diverge around DST transitions.

Control:

  • CLAUDE.md already mandates tz-aware UTC indexes. Extension: add an explicit assertion in the E2E fixture loader:

assert ohlcv.index.tz is not None
assert str(ohlcv.index.tz) == "UTC"

  • Dedicated parity test test_cyclic_features_across_dst_transition: a fixture covering an autumn DST event, assert hour_sin, hour_cos, day_sin, day_cos are bit-identical across both paths.
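To illustrate why the UTC guard matters for cyclic features, here is a standard-library sketch. `require_utc` and `hour_features` are hypothetical helpers, and the fixed offsets (+02:00, +01:00) stand in for Central European summer and winter time around the 2024-10-27 01:00 UTC transition. Two distinct UTC candles collapse onto the same tz-naive wall-clock time, so any path working on naive timestamps would compute identical hour features for them.

```python
import math
from datetime import datetime, timedelta, timezone

def require_utc(ts):
    """Reject tz-naive timestamps and any offset other than UTC."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError(f"tz-naive timestamp: {ts!r}")
    if ts.utcoffset() != timedelta(0):
        raise ValueError(f"non-UTC timestamp: {ts!r}")
    return ts

def hour_features(ts):
    """hour_sin / hour_cos computed on the UTC hour of the candle."""
    angle = 2 * math.pi * require_utc(ts).hour / 24
    return math.sin(angle), math.cos(angle)

# Two distinct candles around the autumn transition.
before = datetime(2024, 10, 27, 0, 30, tzinfo=timezone.utc)
after = datetime(2024, 10, 27, 1, 30, tzinfo=timezone.utc)
assert hour_features(before) != hour_features(after)

# The same instants as tz-naive local wall-clock time both read 02:30.
cest, cet = timezone(timedelta(hours=2)), timezone(timedelta(hours=1))
naive_before = before.astimezone(cest).replace(tzinfo=None)
naive_after = after.astimezone(cet).replace(tzinfo=None)
assert naive_before == naive_after
```

Raising on naive input, rather than silently assuming UTC, keeps the failure at the boundary where it is cheap to diagnose.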

11bis.6 Property-based testing for vectorised features

Concern: hand-crafted fixed fixtures may miss edge cases (zero-variance windows, all-NaN stretches, discontinuities).

Control:

  • All per-feature parity tests (Eq. 1) gain a Hypothesis strategy that generates randomised but well-formed OHLCV sequences (monotonic timestamps, positive volumes, bounded prices).
  • Runtime budget: 10 Hypothesis examples per feature per CI run. Nightly: 100 examples per feature.
  • Any finding shrinks to a minimal failing input; the failing fixture gets pinned as a fixed-fixture test for fast regression coverage afterwards.

11bis.7 Filter starvation (ADR-45)

Concern: filters with internal state (cooldown, concurrency) may exhibit different behaviour under rapid-fire signal bursts vs sparse signals. A batch path may see a burst collapsed into a single pandas operation while streaming sees the same burst as sequential discrete decisions.

Control:

  • Dedicated test test_filter_chain_under_starvation: a signal sequence with (a) a 20-candle burst of BUYs, (b) a 500-candle drought, (c) another 20-candle burst. Run through the filter chain in both modes. Assert strictly-equal output signal sequence.
  • The cooldown and concurrency plugins are already explicitly ADR-45-aware; this test pins the behaviour as regression-proof.
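The burst / drought / burst fixture can be sketched as follows. Both functions are hypothetical illustrations, not the project's plugins: `cooldown_sequential` is a candle-by-candle reference with a 5-candle cooldown, and `cooldown_collapsed` is a deliberately wrong batch shortcut that treats each burst as a single event, which is the failure mode described in the concern above. The fixture makes the two disagree, and the first-divergence index shows where.

```python
def cooldown_sequential(signals, cooldown=5):
    """Reference: an accepted BUY blocks the next `cooldown` candles."""
    out, remaining = [], 0
    for s in signals:
        if remaining > 0:
            remaining -= 1
            out.append("HOLD")
        else:
            out.append(s)
            if s == "BUY":
                remaining = cooldown
    return out

def cooldown_collapsed(signals):
    """Deliberately wrong shortcut: keep only the first BUY of each run."""
    return [
        "BUY" if s == "BUY" and (i == 0 or signals[i - 1] != "BUY") else "HOLD"
        for i, s in enumerate(signals)
    ]

# (a) 20-candle burst, (b) 500-candle drought, (c) another 20-candle burst.
fixture = ["BUY"] * 20 + ["HOLD"] * 500 + ["BUY"] * 20

reference = cooldown_sequential(fixture)
shortcut = cooldown_collapsed(fixture)

first_divergence = next(
    i for i, (a, b) in enumerate(zip(reference, shortcut)) if a != b
)
print(reference.count("BUY"), shortcut.count("BUY"), first_divergence)
# prints: 8 2 6
```

In the real test both sides would be the same filter chain run in the two modes, and the assertion would be strict equality; the wrong shortcut is shown here only to demonstrate that the fixture has the power to catch a collapsed burst.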



12. References

  • Parent track: #568 (pipeline runner), #599 (enrichment refactor)
  • Prereq PR: #613 (train_window fix) — must merge before Phase E ceremony
  • Related ADRs:
  • ADR-40 (paper/live same kernel) — this document is the formal enforcement mechanism
  • ADR-23 (version-pinned MLflow artifacts) — supports the assumption that "same config → same model"
  • ADR-25 (no silent fallback) — parity test failure must never be silent
  • Prior art:
  • Google's "Training-Serving Skew" chapter in Reliable ML (Kreuzberger et al.) — same problem, different domain
  • Deep learning: TF Serving's "same graph at train and serve" — structurally analogous
  • Ray Train + Ray Serve parity tests — industry pattern we're adapting

13. Committee submission — CLOSED, PASSED

v1 submitted 2026-04-21. Verdict: PASSED (strong consensus).

  • Architect: 9.0 / 10
  • ML Engineer: 9.0 / 10
  • Crypto Trader: 9.0 / 10
  • Ops: 9.0 / 10
  • Data Scientist: 8.5 / 10
  • Average: 8.9 / 10

All 7 recommendations integrated in this v2 (see §Changelog at the top). No resubmission planned unless the v2 revisions themselves require review.

Archived artefact: committee/sessions/ (session json to be committed alongside implementation PR).


14. (historic) v1 Submission template — kept for reference

Title: FTF ↔ Runner Parity Certification — plan review

Question (target score: 8+):

Review of documentation/../design/CVN-N005-parity-certification.md before we fold this work into the #599 stateful refactor scope. The underlying problem is the classic training-serving skew recast as FTF-vs-production: we must tune FTF in batch for speed but need mathematical certainty that the results transfer to the streaming runner.

The plan codifies six design principles (§4), four atomic equivalences + one integration (§5), and a 5-phase rollout (§6).

Committee review requested on:

  1. Tolerance ladder (§4 P4): 1e-9 relative on continuous, strict on discrete signals and trades. Right bar? Too loose? Per-class justification preferred?

  2. Vectorised allowed with proof (§4 P3): we allow vectorised feature implementations as long as they carry a parity test against the canonical update/emit definition. Alternative is banning them outright for safety, paying a 3–5× runtime penalty. What's the committee's tradeoff?

  3. Fixture selection for Eq. 5 (§5): one quiet + one volatile BTC window? Three cryptos? The fixture is the basis of the pre-merge gate — the choice shapes what "parity" means in practice.

  4. Phase E ceremony governance (§6 Phase E): who signs off on the certification before production? Operator alone, DRI, committee?

  5. Monolith deprecation (§9): once parity is certified, the monolith backtest engine is redundant. Immediate removal, soft-deprecate, or keep as a third reference path forever?

  6. Relationship to the enrichment refactor #599: we propose folding this work into #599 Phase 2 as a corollary (same engineer, same test suite). Is that the right scoping, or should it be a sibling issue with its own engineer?

  7. Blind spots: what's the most expensive mistake we could make executing this plan that isn't surfaced in §8 Risks?

Artifacts referenced:

  • documentation/design/CVN-N005-parity-certification.md (this doc, ~400 lines)
  • documentation/design/CVN-N005-stateful-enrichment.md v3 (committee PASSED 2026-04-20)
  • documentation/architecture/ENRICHMENT_INDICATORS.md (Phase 0 deliverable)

Deliverable expected: per-expert opinion (score, confidence, findings, risks, recommendations) + consolidated verdict (PROCEED / REVISE / REJECT). If PROCEED, child issue #614 opened and folded into #599 scope.