
Plan dossier — CVN-N001-EE-S07 : Track 2 Order book microstructure features (FE + sweep)

Date : 2026-05-05 (initial) / amended v2 addressing committee 2dc76b50 METHODOLOGY_FLAW recommendations / amended v2.1 realigning §0 + §4.6 + §5#2 with the BTC-scope-split shipped in PR #850 (cache-key extension deferred to S07 v2 follow-up ; committee 4076bdca pr_review blocker)
Story : CVN-N001-EE-S07 (OP wp#46)
GH issue : #718
Sister Story : CVN-N001-EE-S15 (wp#126), Binance L2 ingestion infra (DEPENDENCY : must ship + 1 month forward data accumulated before this Story's first sweep)
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2 ; Track 2 of F1_buy boost ; data-tier lever ; tier 1 expected effect size
Committee plan_review : ✅ PASSED / METHODOLOGY_FLAW (session 2dc76b50, 2026-05-05, 5 experts strong consensus, 0 blockers, 15 recs, $0.20). OP Meeting : #99. v2 dossier addresses 12/15 recs ; 3 deferred with rationale (see §0 below).


0. 2026-05-05 v2 amendment — committee 2dc76b50 METHODOLOGY_FLAW recommendations

The committee passed the plan with 0 blockers but flagged 3 methodological gap classes (statistical rigor, feature definitions + monitoring, economic realism + operational safety) that warrant amendment before impl. v2 incorporates 12/15 recs ; 3 are explicitly deferred.

Adoptions (12)

| # | Rec | Where amended |
| --- | --- | --- |
| 1 | BH multiple-testing correction across all variant comparisons (not just leakage check) | §2 hypothesis + §5 acceptance criteria #4 |
| 2 | Cohen's d 95 % bootstrap CIs in deep-sweep results | §5 acceptance criteria #6b |
| 3 | Power analysis OR window extension : pre-registered fallback to 6-month window if 3-month sanity sweep shows insufficient power | §3.1 new + §7 falsifiability |
| 4 | Pin canonical Kyle 1985 + window for l2_kyle_lambda (cross-ref to S15 §4.3 schema owner) | §4.1 + new explicit citation |
| 5 | Real-time L2 feature data quality monitoring (freshness, NaN, range, dist shift) ; operator runbook | new §4.7 |
| 7 | Cache key extension (l2_schema_version in cache key) | Deferred to S07 v2 follow-up PR (mirrors Track 1 BTC scope split : cache key + InferenceAPI loader + MLflow artefact persistence are the inference-path coherent unit, all 3 land together). v2 amendment originally proposed "upfront" but the actual scope-split decision (operator-confirmed) keeps this with the inference path. Committee 4076bdca pr_review flagged the dossier-vs-code mismatch ; v2.1 realigned dossier with shipped code. v2.2 amendment per CR pass on PR #850 (2026-05-05 21:10) : the previously-claimed "v1 disables cache when l2_features_enabled=True" mitigation does NOT exist in code (no such gate in enrichment_api.py or contracts.py). The actual gate is the operator-discipline rule from MLOps readiness §7.1 : "models trained with l2_features_enabled=True MUST NOT be deployed to inference until S07 v2 follow-up merges". No L2-trained model reaches inference until then → no cache collision in prod possible. The silent-data-discrepancy risk lives only on the training side (FTF sweeps re-running with same factor / different schema_version would mis-cache) and is mitigated by ADR-79's per-run-id cache namespacing (each FTF run has its own run_id partition). |
| 9 | Sortino + expectancy after fees and spread informational reporting in deep sweep | §5 acceptance criteria #5b new |
| 10 | Runtime kill-switch / feature-flag for L2 features (disable without retraining) | new §4.8 |
| 11 | Staged rollout strategy (shadow → canary → full) for L2-enabled models | new §4.9 + §5 acceptance criteria #12 |
| 13 | Feature importance ablation in deep-mode sweep | §5 acceptance criteria #6c new |
| 14 | Per-class metrics (precision/recall BUY/SELL) reporting alongside f1_buy | §5 acceptance criteria #6d new |
| 15 | Regime-stability monitoring post-LOCK + drift thresholds for retraining | new §4.10 + §8 risks updated |

Deferred (3)

| # | Rec | Rationale |
| --- | --- | --- |
| 6 | ADR for L2 schema evolution | Schema is owned by S15 (sister Story) ; ADR governance belongs to ADR-77 docs SSoT Story or a dedicated S15 follow-up. Out-of-scope here ; tracked as a known gap. |
| 8 | Accelerate #711 dynamic slippage | Operator decision 8a explicitly defers #711 to a separate Story after Track 2 LOCK. Documented disagreement with the committee : the operator-directed mission scope is f1_buy-primary (per F1 plan §6 derogation), not joint-economic-metric. #711 ships as its own Story under economic-metric mission scope. |
| 12 | Label horizon sensitivity sweep across multiple horizons | Adds a 2nd FTF factor (label_horizon) interacting with l2_features ; doubles the sweep cell count and stretches the Story's scope. Tracked as a follow-up Story (likely under the future filter-tuning mission), NOT a v1 Track 2 deliverable. |

Track 2 of the F1_buy boost plan (F1_BUY_BOOST_PLAN.md §5 Track 2) hypothesises that adding L2 order book microstructure features lifts f1_buy materially over the OHLCV-only baseline. The model currently sees only OHLCV candles ; microstructure data (spread, depth imbalance, order flow) carries information about short-horizon price pressure that's invisible from candles alone.

Operator decision context (wp#46 comment 679, 2026-05-05) :

  • Track 14 (5m timeframe) LOCK at +0.125 Δf1_buy proves data-tier productive ; Track 2 is the next data-tier lever
  • Track 2 starts in parallel with Track 12 (frac diff) ; the original quick-win bundle gate is moot since Track 14 already cleared the +0.05 spirit
  • Scope split : data ingestion infra in S15 (sister Story) ; this Story = FE + FTF sweep only
  • Feature catalogue : 6 standard microstructure features (per operator decision 6b)
  • FTF factor structure : reuse btc_features template (per operator decision 7)

Cross-track lesson absorbed : Track 1 (BTC features) ABANDONED on leakage gate (currently under investigation in wp#103). Track 2 design must front-load the same leakage check (standard_purge0 variant in the matrix) to surface the same class of pathology early.

2. Hypothesis (falsifiable)

Adding 6 standard L2 microstructure features lifts f1_buy materially over the OHLCV-only baseline at the current dataset / model regime. Specifically :

  • H0 (null) : mean(f1_buy | l2_features=enabled) - mean(f1_buy | disabled) is indistinguishable from 0 (CI95 includes 0) → ABANDON.
  • H1 (alternative) : Δf1_buy ≥ +0.03 (Story-specific bar per wp#46 acceptance criteria ; double the standard +0.015 because microstructure is expected to be a strong signal for short-horizon predictions) with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.
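The four H1 conditions are conjunctive, which is easy to lose when reading results tables. A minimal sketch of the gate follows, assuming a hypothetical `ComparisonSummary` container ; the class and field names are illustrative and not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonSummary:
    """Hypothetical summary of one variant-vs-none comparison (illustrative names)."""
    delta_f1_buy: float       # mean f1_buy(variant) - mean f1_buy(none)
    ci95_low: float           # lower bound of the 95 % bootstrap CI on the delta
    cryptos_improved: int     # cryptos whose f1_buy improved individually (out of 5)
    cohens_d: float


def h1_holds(s: ComparisonSummary) -> bool:
    """All four H1 conditions from §2 must hold simultaneously."""
    return (
        s.delta_f1_buy >= 0.03        # Story-specific effect-size bar
        and s.ci95_low > 0.0          # CI excludes 0 on the positive side
        and s.cryptos_improved >= 4   # at least 4/5 cryptos improve
        and s.cohens_d >= 0.3
    )
```

A large point estimate with a CI that straddles 0 therefore fails the gate, as does a strong aggregate lift carried by only 3 cryptos.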

Mandatory leakage check (same pattern as Track 1 plan §4.6) : paired t-test on f1_buy(standard_purge0) - f1_buy(standard) over 25 (crypto, fold) cells, BH-corrected at α = 0.05. Failure → ABANDON pending root-cause investigation (à la wp#103).

Statistical contract (per committee 2dc76b50 rec #1, v2) : ALL pairwise comparisons (each of min, standard, flow_only vs none, plus standard_purge0 vs standard, plus standard_purge10 vs standard) MUST be BH-corrected jointly at α = 0.05. BH controls the false discovery rate (not the family-wise Type I error rate), and that α = 0.05 budget is shared across the 5 candidate comparisons ; reporting unadjusted p-values for any single comparison is non-compliant.
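As an illustration of the joint correction, here is a small pure-Python sketch of Benjamini-Hochberg adjusted p-values. The raw p-values are made up for the example, and the dictionary keys are illustrative labels for the 5 comparisons, not identifiers from the sweep tooling.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for the 5 comparisons named in the statistical contract.
raw = {
    "min_vs_none": 0.004,
    "standard_vs_none": 0.012,
    "flow_only_vs_none": 0.020,
    "standard_purge0_vs_standard": 0.40,
    "standard_purge10_vs_standard": 0.20,
}
adjusted = dict(zip(raw, bh_adjust(list(raw.values()))))
significant = {name: q < 0.05 for name, q in adjusted.items()}
```

Adjusting all 5 p-values in one call is the point : correcting the leakage comparison separately from the candidate comparisons would give each group its own budget.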

3. Variant matrix

5 variants per the F1 plan §4.2 convention + 1 leakage sanity (matching the proven btc_features template per operator decision 7) :

| Variant | What it does | Features added | Notes |
| --- | --- | --- | --- |
| none (baseline) | OHLCV-only | 0 | Reference |
| min | Minimal, directional only | spread_bps, depth_imbalance_l1, mid_price | Cheapest variant ; tests "spread + L1 imbalance is enough" |
| standard (canonical) | F1 plan standard set | spread_bps, depth_imbalance_l1, mid_price, depth_imbalance_l5, kyle_lambda, ofi_15m | Operator decision 6b ; the LOCK candidate |
| standard_purge0 | Same as standard but no purging (l2_purge_bars=0) | Same 6 | Mandatory leakage-detection sanity per Track 1 plan §4.6 pattern. NOT a candidate for lock. |
| standard_purge10 | Same as standard with l2_purge_bars=10 (sensitivity) | Same 6 | Empirically justifies the canonical purge_bars value |
| flow_only | Order-flow-only (no static spread) | kyle_lambda, ofi_15m, depth_imbalance_l5 | Tests "flow signal without static spread/depth carries the lift" |

Total 6 variants — same shape as btc_features factor.

3.1 Pre-registered window-extension fallback (per committee 2dc76b50 rec #3, v2)

Operator decision 5c locks the canonical data window at 3 months. Committee flagged this as potentially under-powered for cross-regime statistical robustness. Pre-registered fallback :

  1. First sanity sweep at 1 month (operator decision 5c sub-clause) reports per-variant Cohen's d + 95 % bootstrap CI. If |CI_width| / |d| > 0.8 for the standard vs none comparison (i.e., effect size estimate has > 80 % relative uncertainty), the data window is judged under-powered.
  2. Action on under-power : automatic extension to 6 months at the deep-mode sweep. If 6 months still insufficient, escalate to 12 months with operator decision (cost = +1 month FTF compute).
  3. Action on adequate power : proceed with the 3-month canonical (no extension).

This is a falsifiable pre-registered protocol — the window length is data-driven, not operator-discretionary post-hoc.
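The under-power rule in step 1 can be sketched as follows, assuming the effect size is a paired Cohen's d over per-cell f1_buy differences and the CI is a percentile bootstrap. Both are assumptions about how the sweep tooling computes them, and all function names are hypothetical.

```python
import random
import statistics


def paired_cohens_d(diffs):
    """Cohen's d for paired differences: mean(diff) / sample std(diff)."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def bootstrap_ci_d(diffs, n_boot=2000, seed=0):
    """Percentile 95 % bootstrap CI for the paired Cohen's d."""
    rng = random.Random(seed)
    n = len(diffs)
    stats = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if statistics.pstdev(sample) > 0:  # skip degenerate (constant) resamples
            stats.append(paired_cohens_d(sample))
    stats.sort()
    lo = stats[int(0.025 * len(stats))]
    hi = stats[int(0.975 * len(stats)) - 1]
    return lo, hi


def is_underpowered(d, ci_low, ci_high, max_rel_width=0.8):
    """§3.1 rule: under-powered when |CI_width| / |d| > 0.8."""
    if d == 0:
        return True  # relative width undefined; treated as under-powered (assumption)
    return abs(ci_high - ci_low) / abs(d) > max_rel_width
```

For example, d = 0.5 with CI [0.4, 0.7] has a relative width of 0.6 and keeps the 3-month window, while the same d with CI [0.1, 0.9] has a relative width of 1.6 and triggers the 6-month extension.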

4. Implementation path

4.1 FE module (src/commun/pipeline/orderbook_features.py, NEW)

Mirror of src/commun/pipeline/btc_features.py (Track 1 module that survived 4 CR passes + committee review). Same structure :

_L2_FEATURE_COLUMNS = {
    "min":         frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price"}),
    "standard":    frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price",
                              "l2_depth_imbalance_l5", "l2_kyle_lambda", "l2_ofi_15m"}),
    "flow_only":   frozenset({"l2_kyle_lambda", "l2_ofi_15m", "l2_depth_imbalance_l5"}),
}

def compute_l2_features(target_ohlcv, l2_snapshots, feature_set="standard", purge_bars=20):
    """Read pre-computed L2 columns from S15's Timescale snapshots, shift by purge_bars,
    join to target_ohlcv index. ADR-14 invariant : every L2 feature uses only data
    ≤ t - purge_bars to prevent same-bar leakage from book state into label window."""

Canonical feature definitions (per committee 2dc76b50 rec #4, v2) — pinned by S15 schema at the ingestion layer ; this Story's FE module reads them as-is :

  • l2_kyle_lambda : Kyle 1985 lambda computed as the OLS regression slope of Δ(mid_price) on signed_volume over a rolling 15-minute window (the canonical Kyle illiquidity measure). Pre-computed in S15's l2_snapshots.kyle_lambda column ; this Story does NOT recompute it. Reference : Kyle, A. S. (1985), "Continuous auctions and insider trading", Econometrica 53(6), 1315-1335.
  • l2_ofi_15m : Order Flow Imbalance over a 15-minute rolling window per Cont, Kukanov, Stoikov (2014) "The price impact of order book events", Journal of Financial Econometrics 12(1), 47-88. Pre-computed in S15's l2_snapshots.ofi_15m column.
  • l2_spread_bps, l2_depth_imbalance_l1, l2_depth_imbalance_l5, l2_mid_price : standard OB metrics, definitions documented in S15 plan dossier §4.3 schema.

If S15's pre-computation diverges from these definitions, Track 2's results are non-comparable to the literature ; cross-Story validation = early sanity sweep checks l2_kyle_lambda distribution moments against published values for crypto majors (mean ~ O(1e-6) per USD-normalised volume per Easley et al. 2012).
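For the cross-Story sanity check, the regression slope described for l2_kyle_lambda can be recomputed on a single window as below. This is a sketch only : the Story reads the value from S15 and does not recompute it, the function name is hypothetical, and fitting with an intercept is an assumption about S15's implementation.

```python
def kyle_lambda(mid_price_changes, signed_volumes):
    """OLS slope of delta(mid_price) on signed_volume over one window (with intercept)."""
    n = len(signed_volumes)
    if n != len(mid_price_changes) or n < 2:
        raise ValueError("need two equal-length series with at least 2 observations")
    mean_x = sum(signed_volumes) / n
    mean_y = sum(mid_price_changes) / n
    sxx = sum((x - mean_x) ** 2 for x in signed_volumes)
    if sxx == 0:
        raise ValueError("signed_volume has zero variance in the window")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(signed_volumes, mid_price_changes)
    )
    return sxy / sxx  # price change per unit of signed volume
```

Comparing this slope against l2_snapshots.kyle_lambda on a handful of windows would surface a definition mismatch before the first sweep.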

ADR-14 purging invariant identical to Track 1 : every L2 feature column is .shift(purge_bars) so column row i carries information from L2 row i - purge_bars. The 6 feature scalars come pre-computed from S15's l2_snapshots table (per S15 §4.3 schema) — this module just reads + shifts + joins, no recomputation.

ADR-25 fail-fast : if l2_features_enabled=True is requested but the l2_snapshots table has < 95 % coverage of the target time window, raise RuntimeError rather than imputing zeros.
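A minimal sketch of the coverage gate, assuming coverage is defined as the fraction of target bar timestamps that have a matching L2 snapshot. The dossier does not pin the exact definition, so both that definition and the function name are assumptions.

```python
def assert_l2_coverage(target_bar_times, l2_bar_times, min_coverage=0.95):
    """Raise RuntimeError when too few target bars have an L2 snapshot."""
    target = set(target_bar_times)
    if not target:
        raise RuntimeError("target window is empty")
    covered = len(target & set(l2_bar_times))
    coverage = covered / len(target)
    if coverage < min_coverage:
        # Fail fast rather than imputing zeros for the missing bars.
        raise RuntimeError(
            f"L2 coverage {coverage:.1%} is below the required {min_coverage:.0%}"
        )
    return coverage
```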

4.2 EnrichmentAPI extension

Same pattern as Track 1's BTC features wiring :

  1. Extend EnrichmentConfig (in src/commun/pipeline/contracts.py) with l2_features_enabled: bool = False, l2_features_set: Literal["min","standard","flow_only"] = "standard", l2_purge_bars: int = 20.
  2. Extend CVNTrade_Enrich.process() to accept optional l2_snapshots: Optional[pd.DataFrame] kwarg.
  3. The L2 DataFrame is loaded by the ETL orchestration layer from Timescale (NOT by the enrichment layer per ADR-25 — fail-fast if l2_features_enabled=True and l2_snapshots is None).
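The three steps above can be sketched as follows. The real EnrichmentConfig may be a dataclass, a pydantic model, or something else ; `L2EnrichmentFields` and `check_l2_input` are hypothetical stand-ins that only show the new fields and the fail-fast rule.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class L2EnrichmentFields:
    """Hypothetical stand-in for the three new EnrichmentConfig fields."""
    l2_features_enabled: bool = False
    l2_features_set: Literal["min", "standard", "flow_only"] = "standard"
    l2_purge_bars: int = 20

    def __post_init__(self):
        # Literal is not enforced at runtime, so validate explicitly.
        if self.l2_features_set not in ("min", "standard", "flow_only"):
            raise ValueError(f"unknown l2_features_set: {self.l2_features_set!r}")
        if self.l2_purge_bars < 0:
            raise ValueError("l2_purge_bars must be >= 0")


def check_l2_input(config: L2EnrichmentFields, l2_snapshots: Optional[object]) -> None:
    """Fail fast when L2 features are requested but no snapshots were passed in."""
    if config.l2_features_enabled and l2_snapshots is None:
        raise RuntimeError("l2_features_enabled=True but l2_snapshots is None")
```

Defaulting `l2_features_enabled` to False keeps every existing caller on the OHLCV-only path unless it opts in.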

4.3 ETL extension (src/ETL/cvntrade_etl_pipeline.py)

When training with l2_features_enabled=True, fetch the L2 snapshots from Timescale alongside OHLCV :

l2_snapshots = pd.read_sql(
    "SELECT * FROM l2_snapshots WHERE symbol = %s AND bar_time BETWEEN %s AND %s "
    "AND schema_version = 'l2_schema_v1' "
    "ORDER BY bar_time",
    con=engine,
    params=(crypto_symbol, start_ts, end_ts),
)

For paper / live, the streaming kernel pre-fetches a rolling L2 window per candle (same pattern as the BTC ohlcv pre-fetch).

4.4 FTF factor (per ADR-58)

File : src/commun/finetune/ablation_matrix.py — add l2_features factor mirroring btc_features :

AblationFactor(
    name="l2_features",
    factor_type="training",
    category="data",
    description="L2 order book microstructure features (Track 2 of F1_buy boost). "
                "6 variants matching the btc_features template. ADR-14 invariant : "
                "training time t uses only L2 data <= t - purge_bars.",
    env_vars={
        "none":             {"CVN_L2_FEATURES_ENABLED": "0"},
        "min":              {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "min", "CVN_L2_PURGE_BARS": "20"},
        "standard":         {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "20"},
        "standard_purge0":  {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "0"},
        "standard_purge10": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "10"},
        "flow_only":        {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "flow_only", "CVN_L2_PURGE_BARS": "20"},
    },
),
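How the three env vars map back to config values can be sketched as below. The parsing helper is hypothetical (the dossier does not show where the env vars are read) ; only the variable names and their values come from the factor definition above.

```python
import os

_ALLOWED_SETS = ("min", "standard", "flow_only")


def l2_settings_from_env(env=None):
    """Parse the CVN_L2_* variables into a plain dict (hypothetical helper)."""
    env = os.environ if env is None else env
    feature_set = env.get("CVN_L2_FEATURES_SET", "standard")
    purge_bars = int(env.get("CVN_L2_PURGE_BARS", "20"))
    if feature_set not in _ALLOWED_SETS:
        raise ValueError(f"unknown CVN_L2_FEATURES_SET: {feature_set!r}")
    if purge_bars < 0:
        raise ValueError("CVN_L2_PURGE_BARS must be >= 0")
    return {
        "l2_features_enabled": env.get("CVN_L2_FEATURES_ENABLED", "0") == "1",
        "l2_features_set": feature_set,
        "l2_purge_bars": purge_bars,
    }
```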

4.5 Guardrail tests (ADR-58)

  • Test 1 (unit) : test_l2_features_shift_invariant ; assert compute_l2_features(..., purge_bars=20) produces columns shifted by exactly 20 rows.
  • Test 2 (unit) : test_l2_features_set_columns ; for each of the 3 feature sets, assert feature_columns_for_set(name) returns the canonical column list.
  • Test 3 (integration) : test_ablation_matrix_l2_features_variants ; assert the FTF factor has 6 variants with expected env_vars.
  • Test 4 (integration) : test_l2_features_etl_failfast ; assert RuntimeError raised when l2_features_enabled=True + < 95 % L2 coverage.
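A sketch of what the shift-invariance test could look like. Because compute_l2_features is not implemented in this dossier, the example tests a hypothetical stand-in `shift_l2_columns` that performs only the `.shift(purge_bars)` step on synthetic data.

```python
import numpy as np
import pandas as pd


def shift_l2_columns(l2: pd.DataFrame, purge_bars: int) -> pd.DataFrame:
    """Stand-in for the shift step: row i carries L2 values from row i - purge_bars."""
    return l2.shift(purge_bars)


def test_l2_features_shift_invariant():
    idx = pd.date_range("2026-01-01", periods=50, freq="15min")
    l2 = pd.DataFrame({"l2_spread_bps": np.arange(50, dtype=float)}, index=idx)
    purge_bars = 20
    shifted = shift_l2_columns(l2, purge_bars)
    # The first purge_bars rows have no admissible history and must be NaN.
    assert shifted["l2_spread_bps"].iloc[:purge_bars].isna().all()
    # Every later row i must equal the raw value at row i - purge_bars.
    np.testing.assert_array_equal(
        shifted["l2_spread_bps"].iloc[purge_bars:].to_numpy(),
        l2["l2_spread_bps"].iloc[:-purge_bars].to_numpy(),
    )
```

Using a strictly increasing synthetic column makes an off-by-one shift fail loudly, which a constant column would hide.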

4.6 Feature contract pinning via MLflow artefact + cache key extension (deferred per S07 v2 follow-up scope split)

Per ADR-23 (features version-pinned, fail-fast), the trained model's MLflow artefact includes a new l2_enrichment_config.json capturing :

  • l2_features_enabled: bool
  • l2_features_set: str
  • l2_purge_bars: int
  • l2_schema_version: str (= l2_schema_v1 from S15)

InferenceAPI loads this config at inference time and fail-fasts if the runtime env disagrees with the model's pinned config (same pattern as Track 1's enrichment_config.json).
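The fail-fast comparison could look like the sketch below. The function name and the way the runtime settings are passed in are assumptions ; only the four pinned keys come from the artefact description above.

```python
import json

PINNED_KEYS = (
    "l2_features_enabled",
    "l2_features_set",
    "l2_purge_bars",
    "l2_schema_version",
)


def assert_runtime_matches_pinned(pinned_json: str, runtime: dict) -> None:
    """Raise RuntimeError if any pinned L2 setting differs from the runtime value."""
    pinned = json.loads(pinned_json)
    mismatches = {
        key: (pinned.get(key), runtime.get(key))
        for key in PINNED_KEYS
        if pinned.get(key) != runtime.get(key)
    }
    if mismatches:
        # Report every mismatch at once as (pinned, runtime) pairs.
        raise RuntimeError(f"L2 config mismatch (pinned, runtime): {mismatches}")
```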

Cache key extension — deferred to S07 v2 follow-up (committee 4076bdca v2.1 alignment + CR pass v2.2) : the v1 PR ships the FE module + factor + tests + EnrichmentAPI wiring only ; the cache key extension + MLflow artefact persistence + InferenceAPI loader land together in the inference-path follow-up PR (S07 v2), mirroring the BTC Track 1 scope split per EnrichmentConfig Track 1 docstring §4.1bis. The original v2 §0 amendment said "upfront" ; committee 4076bdca pr_review correctly flagged the dossier-vs-shipped-code mismatch ; v2.1 realigned the dossier with the committed code.

v2.2 (CR pass on PR #850 2026-05-05 21:10) : a "v1 disables cache when l2_features_enabled=True" mitigation was claimed in v2.1 but does NOT exist in the shipped code. Honest mitigation chain :

  1. Inference-side : the MLOps readiness §7.1 OOS rule is the gate : models trained with l2_features_enabled=True MUST NOT be deployed to inference until S07 v2 ships the cache key extension + InferenceAPI loader together. No L2-trained model reaches prod until then → no inference-time cache collision possible.
  2. Training-side : ADR-79 namespaces FTF cache by run_id, so two sweeps with different l2_schema_version use different cache partitions. Within a single FTF run, the operator pins one l2_schema_version per the BASE_ENV at run start ; mid-run schema-mismatch is impossible by construction.
  3. What's NOT mitigated : if the operator manually invokes compute_l2_features outside an FTF run + outside the inference path (e.g., a notebook running enrich_batch with l2_features_enabled=True against a re-pinned PG schema_version), they could populate the L2 columns under a stale schema_version. This is an operator-discipline bound and lives in §7.1 OOS until the cache key extension lands.

4.7 Real-time L2 feature data quality monitoring (per committee 2dc76b50 rec #5, v2 NEW)

Once L2-enabled models reach inference (post-LOCK), the runtime emits the following per-prediction quality signals to the existing observability stack (Loki + Grafana per ADR-26) :

| Signal | Threshold | Alert action |
| --- | --- | --- |
| l2_feature_freshness_seconds (delta between prediction time and most recent L2 snapshot) | warn > 60s, P2 alert > 300s | switch to L2-blind fallback model (per §4.8) |
| l2_feature_nan_rate (% of L2 columns with NaN values per prediction) | warn > 5 %, P2 alert > 20 % | switch to L2-blind fallback model |
| l2_feature_value_oor (% of L2 column values outside reference range, e.g. spread_bps > 1000 is OOR for crypto majors) | warn > 1 %, P2 alert > 5 % | flag for operator review ; do NOT auto-switch (could be regime change) |
| l2_kyle_lambda_drift_psi (PSI of l2_kyle_lambda distribution vs training reference, rolling 24h) | warn > 0.10, P2 alert > 0.25 | trigger retraining workflow per §4.10 |

Reference ranges + thresholds calibrated from the deep-sweep training distribution ; stored in the model's MLflow artefact as l2_quality_reference.json.
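For the PSI signal, a common formulation is PSI = Σ (a_i − e_i) · ln(a_i / e_i) over bins, with e_i the reference proportions and a_i the live proportions. The sketch below assumes decile bins cut on the training reference and a small floor for empty bins ; the dossier does not specify the binning, so both are assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index using quantile bins of the reference sample."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    ref = sorted(expected)
    # Interior cut points at the reference quantiles (n_bins - 1 of them).
    cuts = [ref[min(len(ref) - 1, int(len(ref) * k / n_bins))] for k in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > c for c in cuts)] += 1  # bin index = number of cuts below v
        return [max(c / len(values), eps) for c in counts]  # floor avoids log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical samples give a PSI of 0, and a live distribution shifted well outside the reference deciles lands far above the 0.25 alert threshold.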

4.8 Runtime kill-switch for L2 features (per committee 2dc76b50 rec #10, v2 NEW)

A feature flag cvn_pipeline.l2_features.runtime_disable (PG-stored per ADR-59) lets the operator disable L2-enabled models in production WITHOUT retraining. When set to true :

  1. InferenceAPI reads the flag on every inference call (value cached for 60s, so a flip takes effect within about 60s)
  2. Routes inference to the registered champion_no_l2 rollback model (per §7 + ADR-15) instead of the L2-enabled champion
  3. Logs a kill_switch=l2_features event per ADR-32 ; emits a Grafana annotation
  4. Operator re-enables by flipping the flag back to false

The kill-switch is NOT the same as the trading kill-switch (ADR-71 = halts all trading). This is per-feature : trading continues but with the L2-blind model. Both kill-switches can be combined (trading off + L2 off = full safety) ; they're orthogonal.
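The flag read with its 60s cache and the routing decision can be sketched as below. `CachedFlag`, `select_model` and the `champion_l2` label are hypothetical ; the real flag lives in PG per ADR-59 and the fetch callable stands in for that lookup.

```python
import time


class CachedFlag:
    """Boolean flag whose backing lookup is refreshed at most once per TTL."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch            # callable returning the current flag value
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for tests
        self._value = False
        self._expires_at = float("-inf")

    def get(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:    # cache expired: hit the backing store
            self._value = bool(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


def select_model(l2_runtime_disabled: bool) -> str:
    """Route to the L2-blind rollback model when the kill-switch is on."""
    return "champion_no_l2" if l2_runtime_disabled else "champion_l2"
```

The TTL bounds both the load on the flag store and the worst-case delay between the operator flipping the flag and inference switching models.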

4.9 Staged rollout : shadow → canary → full (per committee 2dc76b50 rec #11, v2 NEW)

Post-LOCK deployment of an L2-enabled model follows a 3-stage rollout (mirroring the standard MLOps pattern + ADR-42 atomic per-crypto promotion) :

| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
| --- | --- | --- | --- | --- |
| Shadow | 0 % (predictions logged but not executed) | 7 days | l2_feature_freshness_seconds_p99 < 30s AND l2_feature_nan_rate_p99 < 1 % AND prediction agreement with L2-blind ≥ 90 % | any quality signal threshold breach (§4.7) |
| Canary (1 of 5 cryptos) | per-crypto promotion via Console (ADR-42) | 7 days | f1_buy on the canary crypto matches the deep-sweep estimate within ±20 % | drift trigger from §4.10 OR operator manual rollback via §4.8 kill-switch |
| Full (5/5 cryptos) | per-crypto promotion sequence over 5 days (1 crypto / day) | 7 days steady-state | no kill-switch triggers, no drift alerts | drift trigger from §4.10 OR operator manual rollback |

The shadow stage is the cheapest catch ; canary catches per-crypto pathologies that aggregate stats hide. Full rollout is the steady-state.

4.10 Regime-stability monitoring + retraining triggers (per committee 2dc76b50 rec #15, v2 NEW)

Post-LOCK, the L2-enabled model carries 3 drift signals that, on threshold breach, trigger an automated retraining workflow :

| Signal | Threshold | Action |
| --- | --- | --- |
| Concept drift : rolling f1_buy on last 30d vs training-window f1_buy | gap > 0.05 absolute | open ticket + auto-launch shadow retraining job (no auto-deploy) |
| Feature drift : PSI of l2_kyle_lambda, l2_ofi_15m, l2_depth_imbalance_l5 vs training reference (rolling 30d) | any one > 0.25 | same as concept drift |
| Regime change : l2_spread_bps_p99 rolling 7d vs rolling 90d ratio | ratio > 3.0 (microstructure regime shift, e.g., Binance fee changes) | escalate to operator review BEFORE auto-retraining |

Retraining workflow (when triggered) : standard CVNTrade retrain DAG with the original FTF config, freshest training window. The retrained model goes through the §4.9 staged rollout (shadow first), NOT auto-promoted. Operator approval gate at the canary stage.
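One way to read the three triggers together is sketched below. Two points are assumptions rather than stated policy : a regime-change breach takes precedence and holds back auto-retraining, and the concept-drift gap is taken as an absolute difference in either direction. The function name and return shape are hypothetical.

```python
def evaluate_drift(f1_train, f1_rolling_30d, psi_by_feature,
                   spread_p99_7d, spread_p99_90d):
    """Return which drift signals fired and the resulting action."""
    concept = abs(f1_rolling_30d - f1_train) > 0.05
    feature = any(value > 0.25 for value in psi_by_feature.values())
    regime = spread_p99_90d > 0 and (spread_p99_7d / spread_p99_90d) > 3.0

    if regime:
        action = "operator_review"   # escalate BEFORE any auto-retraining
    elif concept or feature:
        action = "shadow_retrain"    # ticket + shadow retraining job, no auto-deploy
    else:
        action = "none"
    return {"concept": concept, "feature": feature, "regime": regime, "action": action}
```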

5. Acceptance criteria

| # | Criterion | Evidence |
| --- | --- | --- |
| 1 | 6 NEW variants live in l2_features factor matrix | python -c "from commun.finetune.ablation_matrix import DATA_FACTORS; print([v for f in DATA_FACTORS if f.name=='l2_features' for v in f.env_vars])" shows 6 entries |
| 2 | Unit tests pass (3 new test modules) | pytest tests/unit/test_orderbook_features.py tests/unit/test_l2_features_ablation.py tests/unit/pipeline/test_enrichment_api_l2_wiring.py green. Cache-key-extension test deferred to S07 v2 follow-up PR per §0 row #7 + §4.6 v2.1 alignment (the inference-path coherent unit). |
| 3 | First sanity sweep (1 month data, smoke power_mode) complete | Run ID + paired t-test results + power assessment per §3.1 |
| 4 | Mandatory leakage check passes, AND all 5 candidate variant comparisons are BH-corrected jointly (v2 rec #1) | BH-corrected p ≥ 0.05 on standard_purge0 vs standard ; BH-corrected p < 0.05 on at-least-one of min/standard/flow_only vs none for the alt-hypothesis to hold |
| 5 | Deep-mode sweep complete (3 months OR 6 months per §3.1 power-driven extension) | Run ID + ADR-79 8-step results dossier ; window length justified |
| 5b | Sortino + expectancy after fees + spread reported informationally (v2 rec #9) | results dossier §X new ; not a hard gate, but flags economic-regression risk for downstream #711 Story scoping |
| 6 | F1 mission gates clear (Δf1_buy ≥ +0.03 + CI95 excluding 0 + ≥ 4/5 cryptos + Cohen d ≥ 0.3) | results dossier §6 |
| 6b | Cohen's d 95 % bootstrap CIs reported per variant (v2 rec #2) | results dossier §6 : point estimate + CI |
| 6c | Feature importance ablation in deep sweep (v2 rec #13) | per-feature SHAP / permutation importance ranking on the LOCK candidate variant |
| 6d | Per-class metrics (precision/recall BUY/SELL) reported alongside f1_buy (v2 rec #14) | results dossier §6 : confusion matrix + per-class precision/recall |
| 7 | MLOps readiness completed (incl. §4.7 monitoring + §4.8 kill-switch + §4.9 rollout + §4.10 drift triggers) | documentation/stories/CVN-N001-EE-S07/mlops_readiness.md |
| 8 | Verdict recorded (LOCK / KEEP_AVAILABLE / ABANDON) per ADR-79 decision tree | results dossier §11 verdict |
| 9 | If LOCK : champion_no_l2 model registered in MLflow as rollback target (per Track 1 plan §7 model-switching rollback pattern) | MLflow Registry tag champion_no_l2 |
| 10 | Committee plan_review PASSED ; DONE : session 2dc76b50 (OP Meeting #99), 0 blockers, code METHODOLOGY_FLAW addressed in v2 amendment | this dossier §0 |
| 11 | Committee pr_review PASSED on impl PR | OP Meeting per ADR-82 + session JSON link |
| 12 | Staged rollout complete (shadow → canary → full) (v2 rec #11) | per §4.9 per-stage Grafana annotations + operator approval logs |

6. Out of scope

  • L2 data ingestion infra — sister Story S15 (wp#126) covers the Timescale schema + DAGs + reconstruction
  • #711 dynamic slippage model — operator decision 8a : separate Story after Track 2 LOCK
  • Order book imbalance > L5 — VAMP, queue position, micro-volatility — future research Stories
  • Multi-exchange L2 — Binance only (S15 scope) ; Coinbase / Kraken later
  • Real-time streaming consumption of L2 in paper / live — backtest-only for v1 ; paper / live integration is a follow-up Story if Track 2 LOCKs (similar pattern to Track 1's parent plan §6)
  • Joint metric (Sortino + expectancy) gates : F1 mission scope is f1_buy-primary per F1 plan §6 derogation ; gate 2 (joint metric) is filter-tuning follow-up Story

7. Falsifiability + rollback

  • Falsifiability per H0 : pre-registered Δf1_buy ≥ +0.03 with CI95 excluding 0 + per-asset 4/5. If Δf1_buy ∈ [-0.01, +0.025] with CI95 including 0 OR per-asset improves on < 4/5 cryptos, ABANDON.
  • Falsifiability on leakage : if standard_purge0 outperforms standard on f1_buy (BH p < 0.05), ABANDON pending leakage investigation (à la wp#103 for Track 1).
  • Rollback at LOCK time (per Track 1 plan §7 model-switching pattern) :
  • The MLOps promotion workflow already handles model artefact swaps per ADR-15 + ADR-42 (atomic per-crypto promotion). Rollback = promote the previous L2-blind champion via Console flow on mlflow_promotion.
  • champion_no_l2 model MUST be registered in MLflow Registry as a deployable rollback target before any Track 2 model becomes champion (mandatory pre-LOCK artefact, mirroring Track 1 plan §7).
  • Hot-fix path for code bugs : standard PR + retrain with fix + atomic promotion (NOT runtime env-flag toggle, which would dimension-mismatch the model per ADR-23).

8. Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| L2 data not yet accumulated when Story is impl-ready | high | low | Story is sequenced AFTER S15 ships ; first sanity sweep runs at 1 month of forward data ; deep sweep at 3 months. The PR can land first ; sweeps wait. |
| Same leakage class as Track 1 (per-bar L2 info contaminating labels) | medium | high | Mandatory standard_purge0 variant in matrix surfaces leakage as a gate violation. Plus the purge=20 shift is enforced by a regression test. |
| Microstructure features unstable across cryptos (high spread variance, low-liquidity altcoins) | medium | medium | Per-crypto FTF variant gates (per-asset 4/5) catch crypto-specific failures ; failing crypto can be excluded from a per-asset Story later |
| Reconstruction quality from S15 not enough for FE (ADR-25 fail-fast triggers) | medium | medium | S15 §7 has its own falsifiability ; if it ships forward-only with reduced window, S07 sanity sweep at 1 month still works |
| Pipeline cache key extension needed for L2 (cache invalidation) | low | low | v1 has NO code-level cache-disable gate (CR pass v2.2 caught the false claim). The actual mitigation is the operator-discipline rule in MLOps readiness §7.1 (no L2-trained model deployed until S07 v2) + ADR-79's per-run_id FTF cache namespacing. Cache extension is the S07 v2 inference-path follow-up. |
| MLflow artefact bloat from l2_enrichment_config.json | low | low | Negligible ; same JSON pattern as Track 1's enrichment_config.json |
| Feature drift over time (microstructure regime shifts post-Binance volume changes) | medium | high | Quarterly review per MLOps readiness §3 ; Loki alert if l2_kyle_lambda_p99 drifts > 3σ over rolling 30 days |

9. Sequencing + cross-Story impact

S15 (data infra) ────┐
                     ├─ S15 deployed + ~1 month forward data
                     │       │
                     │       ▼
                     │   S07 PR opened (impl ready, sanity sweep at 1 month data)
                     │       │
                     │       ▼
                     │   ~2 more months pass (3-month canonical window)
                     │       │
                     │       ▼
                     │   S07 deep-mode sweep + verdict per ADR-79
                     │       │
                     │       ▼
                     │   If LOCK : downstream Story #711 dynamic slippage
Track 1 leakage (wp#103) ─┐
                          ├─ Track 12 (frac diff) gated on Track 1 verdict — independent of Track 2
Track 14 (5m TF) LOCK ────┘

Cross-Epic impact :

  • Track 14 (5m timeframe) : independent, LOCK shipped ; Track 2 sweep runs at 15-min bars (default) but COULD run at 5m if Track 14 canonical is adopted globally (separate Story decision)
  • Track 1 leakage investigation : independent ; same leakage-check pattern adopted here as a defensive measure
  • Track 12 (frac diff) : independent ; gated on Track 1, no L2 dependency
  • #711 dynamic slippage : downstream of Track 2 LOCK
  • CVN-N010-EA (KPI Store) : if Track 2 LOCKs, the L2 features become production columns ; emit l2_kyle_lambda quantile via emit_kpi for drift monitoring

10. References

11. Plan-review questions for committee

  1. Effect-size bar (Δf1_buy ≥ +0.03) : the wp#46 acceptance criterion is +0.03 (vs the F1 plan's standard +0.015). Is this calibrated correctly given microstructure's expected effect size, or should it match the standard +0.015 to avoid biasing toward ABANDON?
  2. Leakage check pattern reuse : Track 2's leakage check mirrors Track 1's (standard_purge0 vs standard paired t-test). The Track 1 leakage investigation (wp#103) is still in flight ; should Track 2 wait for the Track 1 verdict before launching its first sweep, or proceed in parallel knowing the same pathology might surface?
  3. flow_only variant inclusion : 4 variants is the minimum (none + min + standard + standard_purge0) ; standard_purge10 and flow_only add to 6. Is the 6-variant matrix the right power-vs-cost tradeoff?
  4. l2_kyle_lambda definition : Kyle's lambda has multiple definitions in the literature (regression slope of price impact on signed volume is the canonical Kyle 1985). Confirm S15's pre-computation matches the canonical definition + window size (15-min rolling regression).
  5. Pre-impl OR post-impl-PR sweep : the PR can land before any data has accumulated (just the FE module + factor + tests). Should we (a) ship the PR + wait for data, or (b) wait until 1 month data accumulated then ship PR + sanity sweep in same submission? (a) keeps Story In progress shorter ; (b) bundles evidence with code.
  6. l2_features_enabled=True cache invalidation : ~~v1 disables cache (fail-safe)~~ — CR pass v2.2 caught this as a false claim. There is no code-level cache-disable gate in v1 ; EnrichmentAPI.enrich_batch does not branch on l2_features_enabled. The actual mitigation chain is (i) inference-side OOS-only rule (cache hit on training-window slice never poisons fresh inference), (ii) per-run_id ADR-79 namespacing (Track-2 sweeps cannot collide with Track-1), (iii) the operator-discipline rule documented in MLOps §7.1. Question for committee : is the operator-discipline bound (iii) acceptable for v1, or should we (a) ship the cache-key extension (l2_feature_set_version in cache key) upfront in Track 2 v1 instead of deferring to S07 v2 follow-up PR, or (b) add a code-level cache-disable gate keyed on l2_features_enabled=True as a true fail-safe? (Note : this whole row aligns with §0 row #7 — the cache-key extension is the inference-path coherent unit alongside the InferenceAPI loader + MLflow artefact persistence, deferred to S07 v2 follow-up PR per the operator-confirmed BTC-pattern scope split. "Track 12" is unrelated and was a stale draft reference.)