Plan dossier - CVN-N001-EE-S07 : Track 2 Order book microstructure features (FE + sweep)

Date : 2026-05-05 (initial) ; amended v2 addressing committee 2dc76b50 METHODOLOGY_FLAW recommendations ; amended v2.1 realigning §0 + §4.6 + §5#2 with the BTC-scope-split shipped in PR #850 (cache-key extension deferred to S07 v2 follow-up, committee 4076bdca pr_review blocker)
Story : CVN-N001-EE-S07 (OP wp#46)
GH issue : #718
Sister Story : CVN-N001-EE-S15 (wp#126), Binance L2 ingestion infra (DEPENDENCY : must ship and accumulate 1 month of forward data before this Story's first sweep)
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2 ; Track 2 of F1_buy boost ; data-tier lever ; tier 1 expected effect size
Committee plan_review : ✅ PASSED / METHODOLOGY_FLAW (session 2dc76b50, 2026-05-05, 5 experts strong consensus, 0 blockers, 15 recs, $0.20). OP Meeting : #99. v2 dossier addresses 12/15 recs ; 3 deferred with rationale (see §0 below).

0. 2026-05-05 v2 amendment : committee `2dc76b50` METHODOLOGY_FLAW recommendations

The committee passed the plan with 0 blockers but flagged 3 methodological gap classes (statistical rigor, feature definitions + monitoring, economic realism + operational safety) that warrant amendment before impl. v2 incorporates 12/15 recs ; 3 are explicitly deferred.

Adoptions (12)

#	Rec	Where amended
1	BH multiple-testing correction across all variant comparisons (not just leakage check)	§2 hypothesis + §5 acceptance criteria #4
2	Cohen's d 95 % bootstrap CIs in deep-sweep results	§5 acceptance criteria #6b
3	Power analysis OR window extension : pre-registered fallback to a 6-month window if the 1-month sanity sweep shows insufficient power for the 3-month canonical window	§3.1 new + §7 falsifiability
4	Pin canonical Kyle 1985 + window for `l2_kyle_lambda` (cross-ref to S15 §4.3 schema owner)	§4.1 + new explicit citation
5	Real-time L2 feature data quality monitoring (freshness, NaN, range, dist shift) — operator runbook	new §4.7
7	Cache key extension (`l2_schema_version` in cache key)	Deferred to S07 v2 follow-up PR (mirrors Track 1 BTC scope split — cache key + InferenceAPI loader + MLflow artefact persistence are the inference-path coherent unit, all 3 land together). v2 amendment originally proposed "upfront" but the actual scope-split decision (operator-confirmed) keeps this with the inference path. Committee `4076bdca` pr_review flagged the dossier-vs-code mismatch ; v2.1 realigned dossier with shipped code. v2.2 amendment per CR pass on PR #850 (2026-05-05 21:10) : the previously-claimed "v1 disables cache when `l2_features_enabled=True`" mitigation does NOT exist in code (no such gate in `enrichment_api.py` or `contracts.py`). The actual gate is the operator-discipline rule from MLOps readiness §7.1 : "models trained with `l2_features_enabled=True` MUST NOT be deployed to inference until S07 v2 follow-up merges". No L2-trained model reaches inference until then → no cache collision in prod possible. The silent-data-discrepancy risk lives only on the training side (FTF sweeps re-running with same factor / different schema_version would mis-cache) and is mitigated by ADR-79's per-run-id cache namespacing (each FTF run has its own `run_id` partition).
9	Sortino + expectancy after fees and spread informational reporting in deep sweep	§5 acceptance criteria #5b new
10	Runtime kill-switch / feature-flag for L2 features (disable without retraining)	new §4.8
11	Staged rollout strategy (shadow → canary → full) for L2-enabled models	new §4.9 + §5 acceptance criteria #12
13	Feature importance ablation in deep-mode sweep	§5 acceptance criteria #6c new
14	Per-class metrics (precision/recall BUY/SELL) reporting alongside f1_buy	§5 acceptance criteria #6d new
15	Regime-stability monitoring post-LOCK + drift thresholds for retraining	new §4.10 + §8 risks updated

Deferred (3)

#	Rec	Rationale
6	ADR for L2 schema evolution	Schema is owned by S15 (sister Story) ; ADR governance belongs to ADR-77 docs SSoT Story or a dedicated S15 follow-up. Out-of-scope here ; tracked as a known gap.
8	Accelerate #711 dynamic slippage	Operator decision 8a explicitly defers #711 to a separate Story after Track 2 LOCK. Documented disagreement with the committee : the operator-directed mission scope is f1_buy-primary (per F1 plan §6 derogation), not joint-economic-metric. #711 ships as its own Story under economic-metric mission scope.
12	Label horizon sensitivity sweep across multiple horizons	Adds a 2nd FTF factor (`label_horizon`) interacting with `l2_features` ; doubles the sweep cell count and stretches the Story's scope. Tracked as a follow-up Story (likely under the future filter-tuning mission) — NOT a v1 Track 2 deliverable.

Track 2 of the F1_buy boost plan (F1_BUY_BOOST_PLAN.md §5 Track 2) hypothesises that adding L2 order book microstructure features lifts f1_buy materially over the OHLCV-only baseline. The model currently sees only OHLCV candles ; microstructure data (spread, depth imbalance, order flow) carries information about short-horizon price pressure that's invisible from candles alone.

Operator decision context (wp#46 comment 679, 2026-05-05) :
- Track 14 (5m timeframe) LOCK at +0.125 Δf1_buy proves data-tier productive ; Track 2 is the next data-tier lever
- Track 2 starts in parallel with Track 12 (frac diff) ; the original quick-win bundle gate is moot since Track 14 already cleared the +0.05 spirit
- Scope split : data ingestion infra in S15 (sister Story) ; this Story = FE + FTF sweep only
- Feature catalogue : 6 standard microstructure features (per operator decision 6b)
- FTF factor structure : reuse btc_features template (per operator decision 7)

Cross-track lesson absorbed : Track 1 (BTC features) ABANDONED on leakage gate (currently under investigation in wp#103). Track 2 design must front-load the same leakage check (standard_purge0 variant in the matrix) to surface the same class of pathology early.

2. Hypothesis (falsifiable)

Adding 6 standard L2 microstructure features lifts f1_buy materially over the OHLCV-only baseline at the current dataset / model regime. Specifically :

H0 (null) : mean(f1_buy | l2_features=enabled) - mean(f1_buy | disabled) is indistinguishable from 0 (CI95 includes 0) → ABANDON.
H1 (alternative) : Δf1_buy ≥ +0.03 (Story-specific bar per wp#46 acceptance criteria — slightly higher than the standard +0.015 because microstructure is expected to be a strong signal for short-horizon predictions) with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.
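The four H1 conditions could be combined into a single gate along the following lines. This is a minimal sketch : `SweepSummary` and `h1_gates_clear` are hypothetical names (not part of the FTF codebase), and "individually improve" is assumed to mean a strictly positive per-crypto Δf1_buy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SweepSummary:
    """Hypothetical container for one variant-vs-baseline comparison."""
    delta_f1_buy: float                 # mean f1_buy(enabled) - mean f1_buy(disabled)
    ci95: tuple[float, float]           # bootstrap 95 % CI on delta_f1_buy
    per_crypto_delta: dict[str, float]  # one delta per crypto
    cohens_d: float


def h1_gates_clear(s: SweepSummary, min_delta: float = 0.03,
                   min_improved: int = 4, min_d: float = 0.3) -> bool:
    """Return True only when all four H1 conditions hold."""
    lo, hi = s.ci95
    ci_excludes_zero = lo > 0.0 or hi < 0.0
    # Assumption: a crypto "improves" when its own delta is strictly positive.
    improved = sum(1 for d in s.per_crypto_delta.values() if d > 0.0)
    return (
        s.delta_f1_buy >= min_delta
        and ci_excludes_zero
        and improved >= min_improved
        and s.cohens_d >= min_d
    )
```

Keeping the thresholds as keyword arguments makes the +0.03 bar explicit at the call site, without changing the pre-registered defaults.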

Mandatory leakage check (same pattern as Track 1 plan §4.6) : paired t-test on f1_buy(standard_purge0) - f1_buy(standard) over 25 (crypto, fold) cells, BH-corrected at α = 0.05. Failure → ABANDON pending root-cause investigation (à la wp#103).

Statistical contract (per committee 2dc76b50 rec #1, v2) : ALL pairwise variant-vs-baseline comparisons (each of min, standard, flow_only vs none, plus standard_purge0 vs standard, plus standard_purge10 vs standard) MUST be BH-corrected jointly at α = 0.05. The Type I error budget is shared across the 5 candidate comparisons ; reporting unadjusted p-values for any single comparison is non-compliant.
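For reference, the joint correction over the 5 comparisons can be sketched with a plain Benjamini-Hochberg step-up procedure. The helper names `bh_adjust` and `bh_reject` are illustrative ; the actual FTF statistics module may use a library implementation instead.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone (each one is capped by the next larger rank).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bh_reject(p_by_comparison, alpha=0.05):
    """Map each comparison name to True when its BH-adjusted p is below alpha."""
    names = list(p_by_comparison)
    adjusted = bh_adjust([p_by_comparison[n] for n in names])
    return {n: adj < alpha for n, adj in zip(names, adjusted)}
```

The dictionary passed to `bh_reject` would carry the five comparisons named above (`min`, `standard`, `flow_only` vs `none`, plus `standard_purge0` and `standard_purge10` vs `standard`), so that no comparison is ever adjusted in isolation.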

3. Variant matrix

5 variants per the F1 plan §4.2 convention + 1 leakage sanity (matching the proven btc_features template per operator decision 7) :

Variant	What it does	Features added	Notes
`none` (baseline)	OHLCV-only	0	Reference
`min`	Minimal — directional only	`spread_bps`, `depth_imbalance_l1`, `mid_price`	Cheapest variant — tests "spread + L1 imbalance is enough"
`standard` (canonical)	F1 plan standard set	`spread_bps`, `depth_imbalance_l1`, `mid_price`, `depth_imbalance_l5`, `kyle_lambda`, `ofi_15m`	Operator decision 6b ; the LOCK candidate
`standard_purge0`	Same as `standard` but no purging (`l2_purge_bars=0`)	Same 6	Mandatory leakage-detection sanity per Track 1 plan §4.6 pattern. NOT a candidate for lock.
`standard_purge10`	Same as `standard` with `l2_purge_bars=10` (sensitivity)	Same 6	Empirically justifies the canonical purge_bars value
`flow_only`	Order-flow-only (no static spread)	`kyle_lambda`, `ofi_15m`, `depth_imbalance_l5`	Tests "flow signal without static spread/depth carries the lift"

Total 6 variants — same shape as btc_features factor.

3.1 Pre-registered window-extension fallback (per committee `2dc76b50` rec #3, v2)

Operator decision 5c locks the canonical data window at 3 months. Committee flagged this as potentially under-powered for cross-regime statistical robustness. Pre-registered fallback :

First sanity sweep at 1 month (operator decision 5c sub-clause) reports per-variant Cohen's d + 95 % bootstrap CI. If |CI_width| / |d| > 0.8 for the standard vs none comparison (i.e., effect size estimate has > 80 % relative uncertainty), the data window is judged under-powered.
Action on under-power : automatic extension to 6 months at the deep-mode sweep. If 6 months still insufficient, escalate to 12 months with operator decision (cost = +1 month FTF compute).
Action on adequate power : proceed with the 3-month canonical (no extension).

This is a falsifiable pre-registered protocol — the window length is data-driven, not operator-discretionary post-hoc.
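The under-power rule can be sketched as follows. Two assumptions are made here that the dossier does not pin down : Cohen's d is taken in its paired form (mean of the per-cell differences over their standard deviation), and the CI is a percentile bootstrap over cells. All function names are illustrative.

```python
import random
import statistics


def cohens_d_paired(diffs):
    """Paired effect size: mean difference divided by its sample std dev."""
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci_d(diffs, n_boot=2000, seed=0):
    """Percentile bootstrap 95 % CI for the paired Cohen's d."""
    rng = random.Random(seed)
    n = len(diffs)
    stats = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if statistics.pstdev(sample) == 0.0:
            continue  # degenerate resample: d is undefined, skip it
        stats.append(cohens_d_paired(sample))
    stats.sort()
    lo = stats[int(0.025 * len(stats))]
    hi = stats[int(0.975 * len(stats)) - 1]
    return lo, hi


def is_under_powered(d, ci, max_rel_width=0.8):
    """Apply the pre-registered rule |CI_width| / |d| > 0.8."""
    if d == 0.0:
        return True  # relative width is unbounded when the estimate is zero
    lo, hi = ci
    return abs(hi - lo) / abs(d) > max_rel_width
```

With `diffs` holding f1_buy(standard) - f1_buy(none) per (crypto, fold) cell, `is_under_powered(cohens_d_paired(diffs), bootstrap_ci_d(diffs))` returning True would trigger the 6-month extension.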

4. Implementation path

4.1 FE module (`src/commun/pipeline/orderbook_features.py`, NEW)

Mirror of src/commun/pipeline/btc_features.py (Track 1 module that survived 4 CR passes + committee review). Same structure :

_L2_FEATURE_COLUMNS = {
    "min":         frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price"}),
    "standard":    frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price",
                              "l2_depth_imbalance_l5", "l2_kyle_lambda", "l2_ofi_15m"}),
    "flow_only":   frozenset({"l2_kyle_lambda", "l2_ofi_15m", "l2_depth_imbalance_l5"}),
}

def compute_l2_features(target_ohlcv, l2_snapshots, feature_set="standard", purge_bars=20):
    """Read pre-computed L2 columns from S15's Timescale snapshots, shift by purge_bars,
    join to target_ohlcv index. ADR-14 invariant : every L2 feature uses only data
    ≤ t - purge_bars to prevent same-bar leakage from book state into label window."""

Canonical feature definitions (per committee 2dc76b50 rec #4, v2) — pinned by S15 schema at the ingestion layer ; this Story's FE module reads them as-is :

l2_kyle_lambda : Kyle 1985 lambda computed as the OLS regression slope of Δ(mid_price) on signed_volume over a rolling 15-minute window (the canonical Kyle illiquidity measure). Pre-computed in S15's l2_snapshots.kyle_lambda column ; this Story does NOT recompute it. Reference : Kyle, A. S. (1985), "Continuous auctions and insider trading", Econometrica 53(6), 1315-1335.
l2_ofi_15m : Order Flow Imbalance over a 15-minute rolling window per Cont, Kukanov, Stoikov (2014) "The price impact of order book events", Journal of Financial Econometrics 12(1), 47-88. Pre-computed in S15's l2_snapshots.ofi_15m column.
l2_spread_bps, l2_depth_imbalance_l1, l2_depth_imbalance_l5, l2_mid_price : standard OB metrics, definitions documented in S15 plan dossier §4.3 schema.

If S15's pre-computation diverges from these definitions, Track 2's results are not comparable to the literature. As a cross-Story validation, the early sanity sweep checks the l2_kyle_lambda distribution moments against published values for crypto majors (mean ~ O(1e-6) per USD-normalised volume per Easley et al. 2012).

ADR-14 purging invariant identical to Track 1 : every L2 feature column is .shift(purge_bars) so column row i carries information from L2 row i - purge_bars. The 6 feature scalars come pre-computed from S15's l2_snapshots table (per S15 §4.3 schema) — this module just reads + shifts + joins, no recomputation.
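The shift invariant can be illustrated without any dependency. The `shift` helper below is a plain-Python stand-in for the pandas `.shift(purge_bars)` call described above (non-negative shifts only) ; it is a sketch, not the module's code.

```python
import math


def shift(values, purge_bars):
    """Stand-in for Series.shift(purge_bars): row i receives the value of row i - purge_bars."""
    if purge_bars < 0:
        raise ValueError("purge_bars must be >= 0")
    n = len(values)
    head = [math.nan] * min(purge_bars, n)       # rows with no admissible history
    tail = list(values[: max(n - purge_bars, 0)])
    return head + tail


# Row i of the shifted column only carries information from row i - purge_bars.
raw = [float(i) for i in range(30)]
shifted = shift(raw, purge_bars=20)
```

The first `purge_bars` rows come out as NaN, which is why the first rows of each training window carry no L2 information under the canonical `purge_bars=20`.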

ADR-25 fail-fast : if l2_features_enabled=True is requested but the l2_snapshots table has < 95 % coverage of the target time window, raise RuntimeError rather than imputing zeros.
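A possible shape for the coverage gate, assuming coverage is defined as the share of target bar timestamps that have a matching L2 snapshot (the dossier does not spell out the exact definition). `l2_coverage` and `require_l2_coverage` are hypothetical names.

```python
from datetime import datetime, timedelta


def l2_coverage(expected_bars, snapshot_bars):
    """Fraction of expected bar timestamps that have an L2 snapshot."""
    expected = set(expected_bars)
    if not expected:
        raise ValueError("empty target window")
    return len(expected & set(snapshot_bars)) / len(expected)


def require_l2_coverage(expected_bars, snapshot_bars, min_coverage=0.95):
    """Raise instead of imputing when coverage is below the threshold."""
    coverage = l2_coverage(expected_bars, snapshot_bars)
    if coverage < min_coverage:
        raise RuntimeError(
            f"L2 coverage {coverage:.1%} is below the required "
            f"{min_coverage:.0%}; refusing to impute zeros"
        )
    return coverage


# Example window: one hundred 15-minute bars.
start = datetime(2026, 1, 1)
bars = [start + timedelta(minutes=15 * i) for i in range(100)]
```

Calling `require_l2_coverage(bars, bars[:96])` returns the coverage, while `bars[:90]` as the snapshot set raises `RuntimeError`.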

4.2 EnrichmentAPI extension

Same pattern as Track 1's BTC features wiring :

Extend EnrichmentConfig (in src/commun/pipeline/contracts.py) with l2_features_enabled: bool = False, l2_features_set: Literal["min","standard","flow_only"] = "standard", l2_purge_bars: int = 20.
Extend CVNTrade_Enrich.process() to accept optional l2_snapshots: Optional[pd.DataFrame] kwarg.
The L2 DataFrame is loaded by the ETL orchestration layer from Timescale (NOT by the enrichment layer per ADR-25 — fail-fast if l2_features_enabled=True and l2_snapshots is None).
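The three new fields and the fail-fast check might look like the sketch below. The real `EnrichmentConfig` base type is not shown in this dossier, so a frozen dataclass named `EnrichmentConfigSketch` is assumed ; `from_env` and `check_l2_inputs` are hypothetical helpers, and the `CVN_L2_*` names are the ones used by the FTF factor in §4.4.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Mapping, Optional

L2FeatureSet = Literal["min", "standard", "flow_only"]
_VALID_SETS = ("min", "standard", "flow_only")


@dataclass(frozen=True)
class EnrichmentConfigSketch:
    """Illustrative stand-in holding only the three new L2 fields."""
    l2_features_enabled: bool = False
    l2_features_set: L2FeatureSet = "standard"
    l2_purge_bars: int = 20

    def __post_init__(self) -> None:
        if self.l2_features_set not in _VALID_SETS:
            raise ValueError(f"unknown l2_features_set: {self.l2_features_set!r}")
        if self.l2_purge_bars < 0:
            raise ValueError("l2_purge_bars must be >= 0")

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "EnrichmentConfigSketch":
        """Build the config from the CVN_L2_* variables (assumed parsing rules)."""
        return cls(
            l2_features_enabled=env.get("CVN_L2_FEATURES_ENABLED", "0") == "1",
            l2_features_set=env.get("CVN_L2_FEATURES_SET", "standard"),
            l2_purge_bars=int(env.get("CVN_L2_PURGE_BARS", "20")),
        )


def check_l2_inputs(config: EnrichmentConfigSketch, l2_snapshots: Optional[object]) -> None:
    """Fail fast when L2 features are requested without the snapshots frame."""
    if config.l2_features_enabled and l2_snapshots is None:
        raise RuntimeError("l2_features_enabled=True but l2_snapshots is None")
```

Defaulting `l2_features_enabled` to False keeps every existing caller on the OHLCV-only path unless a variant opts in explicitly.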

4.3 ETL extension (`src/ETL/cvntrade_etl_pipeline.py`)

When training with l2_features_enabled=True, fetch the L2 snapshots from Timescale alongside OHLCV :

l2_snapshots = pd.read_sql(
    "SELECT * FROM l2_snapshots WHERE symbol = %s AND bar_time BETWEEN %s AND %s "
    "AND schema_version = 'l2_schema_v1' "
    "ORDER BY bar_time",
    con=engine,
    params=(crypto_symbol, start_ts, end_ts),
)

For paper / live, the streaming kernel pre-fetches a rolling L2 window per candle (same pattern as the BTC ohlcv pre-fetch).

4.4 FTF factor (per ADR-58)

File : src/commun/finetune/ablation_matrix.py — add l2_features factor mirroring btc_features :

AblationFactor(
    name="l2_features",
    factor_type="training",
    category="data",
    description="L2 order book microstructure features (Track 2 of F1_buy boost). "
                "6 variants matching the btc_features template. ADR-14 invariant : "
                "training time t uses only L2 data <= t - purge_bars.",
    env_vars={
        "none":             {"CVN_L2_FEATURES_ENABLED": "0"},
        "min":              {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "min", "CVN_L2_PURGE_BARS": "20"},
        "standard":         {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "20"},
        "standard_purge0":  {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "0"},
        "standard_purge10": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "10"},
        "flow_only":        {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "flow_only", "CVN_L2_PURGE_BARS": "20"},
    },
),

4.5 Guardrail tests (ADR-58)

Test 1 (unit) : test_l2_features_shift_invariant asserts that compute_l2_features(..., purge_bars=20) produces columns shifted by exactly 20 rows.
Test 2 (unit) : test_l2_features_set_columns asserts, for each of the 3 feature sets, that feature_columns_for_set(name) returns the canonical column list.
Test 3 (integration) : test_ablation_matrix_l2_features_variants asserts that the FTF factor has 6 variants with the expected env_vars.
Test 4 (integration) : test_l2_features_etl_failfast asserts that RuntimeError is raised when l2_features_enabled=True and L2 coverage is < 95 %.
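As an illustration of Test 3, the check below runs against an inline copy of the `env_vars` mapping from §4.4. In the repository the test would presumably import the factor from `ablation_matrix` instead ; `check_l2_factor_variants` is a hypothetical helper name.

```python
# Inline copy of the §4.4 mapping, so that the sketch is self-contained.
L2_FACTOR_ENV_VARS = {
    "none":             {"CVN_L2_FEATURES_ENABLED": "0"},
    "min":              {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "min", "CVN_L2_PURGE_BARS": "20"},
    "standard":         {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "20"},
    "standard_purge0":  {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "0"},
    "standard_purge10": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "10"},
    "flow_only":        {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "flow_only", "CVN_L2_PURGE_BARS": "20"},
}

EXPECTED_VARIANTS = {"none", "min", "standard", "standard_purge0", "standard_purge10", "flow_only"}


def check_l2_factor_variants(env_vars):
    """Assert the factor exposes the 6 variants with coherent env vars."""
    assert set(env_vars) == EXPECTED_VARIANTS
    assert env_vars["none"] == {"CVN_L2_FEATURES_ENABLED": "0"}
    for name in EXPECTED_VARIANTS - {"none"}:
        env = env_vars[name]
        assert env["CVN_L2_FEATURES_ENABLED"] == "1"
        assert env["CVN_L2_FEATURES_SET"] in {"min", "standard", "flow_only"}
    # The two purge variants must differ from `standard` only by the purge value.
    purge = {n: env_vars[n]["CVN_L2_PURGE_BARS"]
             for n in ("standard", "standard_purge0", "standard_purge10")}
    assert purge == {"standard": "20", "standard_purge0": "0", "standard_purge10": "10"}


def test_ablation_matrix_l2_features_variants():
    check_l2_factor_variants(L2_FACTOR_ENV_VARS)
```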

4.6 Feature contract pinning via MLflow artefact + cache key extension (deferred per S07 v2 follow-up scope split)

Per ADR-23 (features version-pinned, fail-fast), the trained model's MLflow artefact includes a new l2_enrichment_config.json capturing :
- l2_features_enabled: bool
- l2_features_set: str
- l2_purge_bars: int
- l2_schema_version: str (= l2_schema_v1 from S15)

InferenceAPI loads this config at inference time and fail-fasts if the runtime env disagrees with the model's pinned config (same pattern as Track 1's enrichment_config.json).
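The pinned-config check could be sketched as below. This is only an illustration of the fail-fast comparison : the loader itself is deferred to the S07 v2 follow-up, and `assert_runtime_matches_pinned` is a hypothetical name.

```python
import json

# Keys captured in l2_enrichment_config.json per the list above.
PINNED_KEYS = (
    "l2_features_enabled",
    "l2_features_set",
    "l2_purge_bars",
    "l2_schema_version",
)


def assert_runtime_matches_pinned(pinned_json, runtime):
    """Raise when any pinned key differs from the runtime configuration."""
    pinned = json.loads(pinned_json)
    mismatches = {
        key: (pinned.get(key), runtime.get(key))
        for key in PINNED_KEYS
        if pinned.get(key) != runtime.get(key)
    }
    if mismatches:
        details = ", ".join(
            f"{k}: model={m!r} runtime={r!r}" for k, (m, r) in mismatches.items()
        )
        raise RuntimeError(f"L2 enrichment config mismatch ({details})")
```

Reporting every mismatching key in one error, rather than stopping at the first, gives the operator the full picture in a single failed inference call.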

Cache key extension — deferred to S07 v2 follow-up (committee 4076bdca v2.1 alignment + CR pass v2.2) : the v1 PR ships the FE module + factor + tests + EnrichmentAPI wiring only ; the cache key extension + MLflow artefact persistence + InferenceAPI loader land together in the inference-path follow-up PR (S07 v2), mirroring the BTC Track 1 scope split per EnrichmentConfig Track 1 docstring §4.1bis. The original v2 §0 amendment said "upfront" ; committee 4076bdca pr_review correctly flagged the dossier-vs-shipped-code mismatch ; v2.1 realigned the dossier with the committed code.

v2.2 (CR pass on PR #850 2026-05-05 21:10) : a "v1 disables cache when l2_features_enabled=True" mitigation was claimed in v2.1 but does NOT exist in the shipped code. Honest mitigation chain :
1. Inference-side : the MLOps readiness §7.1 OOS rule is the gate. Models trained with l2_features_enabled=True MUST NOT be deployed to inference until S07 v2 ships the cache key extension + InferenceAPI loader together. No L2-trained model reaches prod until then → no inference-time cache collision possible.
2. Training-side : ADR-79 namespaces FTF cache by run_id, so two sweeps with different l2_schema_version use different cache partitions. Within a single FTF run, the operator pins one l2_schema_version per the BASE_ENV at run start ; mid-run schema-mismatch is impossible by construction.
3. What's NOT mitigated : if the operator manually invokes compute_l2_features outside an FTF run + outside the inference path (e.g., a notebook running enrich_batch with l2_features_enabled=True against a re-pinned PG schema_version), they could populate the L2 columns under a stale schema_version. This is an operator-discipline bound and lives in §7.1 OOS until the cache key extension lands.

4.7 Real-time L2 feature data quality monitoring (per committee `2dc76b50` rec #5, v2 NEW)

Once L2-enabled models reach inference (post-LOCK), the runtime emits the following per-prediction quality signals to the existing observability stack (Loki + Grafana per ADR-26) :

Signal	Threshold	Alert action
`l2_feature_freshness_seconds` (delta between prediction time and most recent L2 snapshot)	warn > 60s, P2 alert > 300s	switch to L2-blind fallback model (per §4.8)
`l2_feature_nan_rate` (% of L2 columns with NaN values per prediction)	warn > 5 %, P2 alert > 20 %	switch to L2-blind fallback model
`l2_feature_value_oor` (% of L2 column values outside reference range — e.g. `spread_bps > 1000` is OOR for crypto majors)	warn > 1 %, P2 alert > 5 %	flag for operator review ; do NOT auto-switch (could be regime change)
`l2_kyle_lambda_drift_psi` (PSI of `l2_kyle_lambda` distribution vs training reference, rolling 24h)	warn > 0.10, P2 alert > 0.25	trigger retraining workflow per §4.10

Reference ranges + thresholds calibrated from the deep-sweep training distribution ; stored in the model's MLflow artefact as l2_quality_reference.json.
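For the `l2_kyle_lambda_drift_psi` signal, a Population Stability Index can be computed along these lines. The binning scheme (10 equal-width bins over the reference range, with a small floor on empty bins) is an assumption ; the dossier only fixes the warn and alert thresholds.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        total = len(values)
        return [max(c / total, eps) for c in counts]   # floor avoids log(0)

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def psi_level(value, warn=0.10, alert=0.25):
    """Map a PSI value to the warn / P2 alert thresholds of the table above."""
    if value > alert:
        return "alert"
    if value > warn:
        return "warn"
    return "ok"
```

Here `reference` would come from the training distribution stored in l2_quality_reference.json and `current` from the rolling 24h window.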

4.8 Runtime kill-switch for L2 features (per committee `2dc76b50` rec #10, v2 NEW)

A feature flag cvn_pipeline.l2_features.runtime_disable (PG-stored per ADR-59) lets the operator disable L2-enabled models in production WITHOUT retraining. When set to true :

InferenceAPI detects the flag at every inference call (cached for 60s)
Routes inference to the registered champion_no_l2 rollback model (per §7 + ADR-15) instead of the L2-enabled champion
Logs a kill_switch=l2_features event per ADR-32 ; emits a Grafana annotation
Operator re-enables by flipping the flag back to false

The kill-switch is NOT the same as the trading kill-switch (ADR-71 = halts all trading). This is per-feature : trading continues but with the L2-blind model. Both kill-switches can be combined (trading off + L2 off = full safety) ; they're orthogonal.
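The flag lookup with its 60s cache, and the routing decision, might be sketched as follows. `L2KillSwitch` and `select_model` are hypothetical names ; `read_flag` stands for whatever reads `cvn_pipeline.l2_features.runtime_disable` from PG, and "champion" is a placeholder for the L2-enabled champion's registry name.

```python
import time


class L2KillSwitch:
    """Caches the runtime_disable flag for ttl_seconds between reads."""

    def __init__(self, read_flag, ttl_seconds=60.0, clock=time.monotonic):
        self._read_flag = read_flag   # callable returning the current flag value
        self._ttl = ttl_seconds
        self._clock = clock           # injectable for tests
        self._cached = False
        self._fetched_at = None

    def disabled(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._cached = bool(self._read_flag())
            self._fetched_at = now
        return self._cached


def select_model(switch, l2_champion="champion", fallback="champion_no_l2"):
    """Route to the L2-blind rollback model while the kill-switch is on."""
    return fallback if switch.disabled() else l2_champion
```

One consequence of the 60s cache : a flag flip can take up to one minute to reach a given InferenceAPI process, which bounds how fast the switch acts.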

4.9 Staged rollout : shadow → canary → full (per committee `2dc76b50` rec #11, v2 NEW)

Post-LOCK deployment of an L2-enabled model follows a 3-stage rollout (mirroring the standard MLOps pattern + ADR-42 atomic per-crypto promotion) :

Stage	Traffic %	Duration	Pass criteria	Rollback trigger
Shadow	0 % (predictions logged but not executed)	7 days	`l2_feature_freshness_seconds_p99 < 30s` AND `l2_feature_nan_rate_p99 < 1 %` AND prediction agreement with L2-blind ≥ 90 %	any quality signal threshold breach (§4.7)
Canary (1 of 5 cryptos)	per-crypto promotion via Console (ADR-42)	7 days	f1_buy on the canary crypto matches the deep-sweep estimate within ±20 %	drift trigger from §4.10 OR operator manual rollback via §4.8 kill-switch
Full (5/5 cryptos)	per-crypto promotion sequence over 5 days (1 crypto / day)	7 days steady-state	no kill-switch triggers, no drift alerts	drift trigger from §4.10 OR operator manual rollback

The shadow stage is the cheapest catch ; canary catches per-crypto pathologies that aggregate stats hide. Full rollout is the steady-state.

4.10 Regime-stability monitoring + retraining triggers (per committee `2dc76b50` rec #15, v2 NEW)

Post-LOCK, the L2-enabled model carries 3 drift signals that, on threshold breach, trigger an automated retraining workflow :

Signal	Threshold	Action
Concept drift : rolling f1_buy on last 30d vs training-window f1_buy	gap > 0.05 absolute	open ticket + auto-launch shadow retraining job (no auto-deploy)
Feature drift : PSI of `l2_kyle_lambda`, `l2_ofi_15m`, `l2_depth_imbalance_l5` vs training reference (rolling 30d)	any one > 0.25	same as concept drift
Regime change : `l2_spread_bps_p99` rolling 7d vs rolling 90d ratio	ratio > 3.0 (microstructure regime shift, e.g., Binance fee changes)	escalate to operator review BEFORE auto-retraining

Retraining workflow (when triggered) : standard CVNTrade retrain DAG with the original FTF config, freshest training window. The retrained model goes through the §4.9 staged rollout (shadow first), NOT auto-promoted. Operator approval gate at the canary stage.
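The three triggers can be evaluated together as in the sketch below. `DriftSnapshot` and `evaluate_drift` are hypothetical names, and one interpretation is assumed : when the regime-change signal fires, the automatic shadow retraining is held until the operator has reviewed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DriftSnapshot:
    """Hypothetical bundle of the inputs to the three drift signals."""
    f1_buy_training: float
    f1_buy_rolling_30d: float
    psi_30d: dict[str, float]   # PSI per monitored feature, rolling 30d
    spread_p99_7d: float
    spread_p99_90d: float


def evaluate_drift(s: DriftSnapshot) -> dict:
    reasons = []
    if abs(s.f1_buy_training - s.f1_buy_rolling_30d) > 0.05:
        reasons.append("concept_drift")
    if any(value > 0.25 for value in s.psi_30d.values()):
        reasons.append("feature_drift")
    regime_change = (
        s.spread_p99_90d > 0 and s.spread_p99_7d / s.spread_p99_90d > 3.0
    )
    if regime_change:
        reasons.append("regime_change")
    return {
        # Assumption: a regime change holds the auto-launch until operator review.
        "launch_shadow_retraining": bool(reasons) and not regime_change,
        "operator_review_first": regime_change,
        "reasons": reasons,
    }
```

In every case the output only decides whether a shadow retraining job starts ; promotion still goes through the §4.9 stages.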

5. Acceptance criteria

#	Criterion	Evidence
1	6 NEW variants live in `l2_features` factor matrix	`python -c "from commun.finetune.ablation_matrix import DATA_FACTORS; print([v for f in DATA_FACTORS if f.name=='l2_features' for v in f.env_vars])"` shows 6 entries
2	Unit tests pass (3 new test modules)	`pytest tests/unit/test_orderbook_features.py tests/unit/test_l2_features_ablation.py tests/unit/pipeline/test_enrichment_api_l2_wiring.py` green. Cache-key-extension test deferred to S07 v2 follow-up PR per §0 row #7 + §4.6 v2.1 alignment (the inference-path coherent unit).
3	First sanity sweep (1 month data, smoke power_mode) complete	Run ID + paired t-test results + power assessment per §3.1
4	Mandatory leakage check passes — AND all 5 candidate variant comparisons are BH-corrected jointly (v2 rec #1)	BH-corrected p ≥ 0.05 on `standard_purge0 vs standard` ; BH-corrected p < 0.05 on at-least-one of `min/standard/flow_only vs none` for the alt-hypothesis to hold
5	Deep-mode sweep complete (3 months OR 6 months per §3.1 power-driven extension)	Run ID + ADR-79 8-step results dossier ; window length justified
5b	Sortino + expectancy after fees + spread reported informationally (v2 rec #9)	results dossier §X new ; not a hard gate, but flags economic-regression risk for downstream #711 Story scoping
6	F1 mission gates clear (Δf1_buy ≥ +0.03 + CI95 excluding 0 + ≥ 4/5 cryptos + Cohen d ≥ 0.3)	results dossier §6
6b	Cohen's d 95 % bootstrap CIs reported per variant (v2 rec #2)	results dossier §6 — point estimate + CI
6c	Feature importance ablation in deep sweep (v2 rec #13)	per-feature SHAP / permutation importance ranking on the LOCK candidate variant
6d	Per-class metrics (precision/recall BUY/SELL) reported alongside f1_buy (v2 rec #14)	results dossier §6 — confusion matrix + per-class precision/recall
7	MLOps readiness completed (incl. §4.7 monitoring + §4.8 kill-switch + §4.9 rollout + §4.10 drift triggers)	`documentation/stories/CVN-N001-EE-S07/mlops_readiness.md`
8	Verdict recorded (LOCK / KEEP_AVAILABLE / ABANDON) per ADR-79 decision tree	results dossier §11 verdict
9	If LOCK : `champion_no_l2` model registered in MLflow as rollback target (per Track 1 plan §7 model-switching rollback pattern)	MLflow Registry tag `champion_no_l2`
10	Committee plan_review PASSED — DONE : session `2dc76b50` (OP Meeting #99), 0 blockers, code METHODOLOGY_FLAW addressed in v2 amendment	this dossier §0
11	Committee pr_review PASSED on impl PR	OP Meeting per ADR-82 + session JSON link
12	Staged rollout complete (shadow → canary → full) (v2 rec #11) per §4.9	per-stage Grafana annotations + operator approval logs

6. Out of scope

L2 data ingestion infra — sister Story S15 (wp#126) covers the Timescale schema + DAGs + reconstruction
#711 dynamic slippage model — operator decision 8a : separate Story after Track 2 LOCK
Order book imbalance > L5 — VAMP, queue position, micro-volatility — future research Stories
Multi-exchange L2 — Binance only (S15 scope) ; Coinbase / Kraken later
Real-time streaming consumption of L2 in paper / live — backtest-only for v1 ; paper / live integration is a follow-up Story if Track 2 LOCKs (similar pattern to Track 1's parent plan §6)
Joint metric (Sortino + expectancy) gates — F1 mission scope is f1_buy-primary per §6 derogation ; gate 2 (joint metric) is filter-tuning follow-up Story

7. Falsifiability + rollback

Falsifiability per H0 : pre-registered Δf1_buy ≥ +0.03 with CI95 excluding 0 + per-asset 4/5. If Δf1 ∈ [-0.01, +0.025] with CI95 including 0 OR per-asset improves on < 4/5 cryptos, ABANDON.
Falsifiability on leakage : if standard_purge0 outperforms standard on f1_buy (BH p < 0.05), ABANDON pending leakage investigation (à la wp#103 for Track 1).
Rollback at LOCK time (per Track 1 plan §7 model-switching pattern) :
The MLOps promotion workflow already handles model artefact swaps per ADR-15 + ADR-42 (atomic per-crypto promotion). Rollback = promote the previous L2-blind champion via Console flow on mlflow_promotion.
champion_no_l2 model MUST be registered in MLflow Registry as a deployable rollback target before any Track 2 model becomes champion (mandatory pre-LOCK artefact, mirroring Track 1 plan §7).
Hot-fix path for code bugs : standard PR + retrain with fix + atomic promotion (NOT runtime env-flag toggle, which would dimension-mismatch the model per ADR-23).

8. Risks

Risk	Likelihood	Impact	Mitigation
L2 data not yet accumulated when Story is impl-ready	high	low	Story is sequenced AFTER S15 ships ; first sanity sweep runs at 1 month of forward data ; deep sweep at 3 months. The PR can land first ; sweeps wait.
Same leakage class as Track 1 (per-bar L2 info contaminating labels)	medium	high	Mandatory `standard_purge0` variant in matrix surfaces leakage as a gate violation. Plus the `purge=20` shift is enforced by a regression test.
Microstructure features unstable across cryptos (high spread variance, low-liquidity altcoins)	medium	medium	Per-crypto FTF variant gates (`per-asset 4/5`) catch crypto-specific failures ; failing crypto can be excluded from a per-asset Story later
Reconstruction quality from S15 not enough for FE (ADR-25 fail-fast triggers)	medium	medium	S15 §7 has its own falsifiability ; if it ships forward-only with reduced window, S07 sanity sweep at 1 month still works
Pipeline cache key extension needed for L2 (cache invalidation)	low	low	v1 has NO code-level cache-disable gate (CR pass v2.2 caught the false claim). The actual mitigation is the operator-discipline rule in MLOps readiness §7.1 (no L2-trained model deployed until S07 v2) + ADR-79's per-`run_id` FTF cache namespacing. Cache extension is the S07 v2 inference-path follow-up.
MLflow artefact bloat from `l2_enrichment_config.json`	low	low	Negligible ; same JSON pattern as Track 1's `enrichment_config.json`
Feature drift over time (microstructure regime shifts post-Binance volume changes)	medium	high	Quarterly review per MLOps readiness §3 ; Loki alert if `l2_kyle_lambda_p99` drifts > 3σ over rolling 30 days

9. Sequencing + cross-Story impact

S15 (data infra) ────┐
                     ├─ S15 deployed + ~1 month forward data
                     │       │
                     │       ▼
                     │   S07 PR opened (impl ready, sanity sweep at 1 month data)
                     │       │
                     │       ▼
                     │   ~2 more months pass (3-month canonical window)
                     │       │
                     │       ▼
                     │   S07 deep-mode sweep + verdict per ADR-79
                     │       │
                     │       ▼
                     │   If LOCK : downstream Story #711 dynamic slippage
                     │
Track 1 leakage (wp#103) ─┐
                          ├─ Track 12 (frac diff) gated on Track 1 verdict — independent of Track 2
Track 14 (5m TF) LOCK ────┘

Cross-Epic impact :
- Track 14 (5m timeframe) : independent, LOCK shipped. The Track 2 sweep runs at 15-min bars (default) but COULD run at 5m if the Track 14 canonical is adopted globally (separate Story decision)
- Track 1 leakage investigation : independent ; the same leakage-check pattern is adopted here as a defensive measure
- Track 12 (frac diff) : independent ; gated on Track 1, no L2 dependency
- #711 dynamic slippage : downstream of Track 2 LOCK
- CVN-N010-EA (KPI Store) : if Track 2 LOCKs, the L2 features become production columns ; emit l2_kyle_lambda quantile via emit_kpi for drift monitoring

10. References

OP wp#46 (S07)
GH issue #718
Sister Story OP wp#126 (S15)
F1 plan : documentation/F1_BUY_BOOST_PLAN.md §5 Track 2 + §6 sequencing
Track 1 plan dossier (pattern reuse) : 2026-04-30-track1-btc-features-plan.md
Track 1 results (leakage diagnostic) : missions/ml-boost/2026-05-02-track1-btc-features-results.md §5.1
Track 14 results (data-tier confirmation) : missions/ml-boost/2026-05-02-track14-timeframe-results.md
Operator decision audit : wp#46 comment 679 (2026-05-05)
ADR-14 : purging invariant (training time t uses only data ≤ t − purge_bars)
ADR-23 : features version-pinned, fail-fast
ADR-25 : no silent fallback
ADR-58 : every FTF factor must have a guardrail + integration test
ADR-79 : FTF Story closure 8-step workflow
ADR-80 : FTF post-run extraction + dossier mechanics
ADR-82 : every committee session logged as OP Meeting

11. Plan-review questions for committee

Effect-size bar (Δf1_buy ≥ +0.03) : the wp#46 acceptance criterion is +0.03 (vs the F1 plan's standard +0.015). Is this calibrated correctly given microstructure's expected effect size, or should it match the standard +0.015 to avoid biasing toward ABANDON?
Leakage check pattern reuse : Track 2's leakage check mirrors Track 1's (standard_purge0 vs standard paired t-test). The Track 1 leakage investigation (wp#103) is still in flight ; should Track 2 wait for the Track 1 verdict before launching its first sweep, or proceed in parallel knowing the same pathology might surface?
flow_only variant inclusion : 4 variants is the minimum (none + min + standard + standard_purge0) ; standard_purge10 and flow_only bring the total to 6. Is the 6-variant matrix the right power-vs-cost tradeoff?
l2_kyle_lambda definition : Kyle's lambda has multiple definitions in the literature (regression slope of price impact on signed volume is the canonical Kyle 1985). Confirm S15's pre-computation matches the canonical definition + window size (15-min rolling regression).
Pre-impl OR post-impl-PR sweep : the PR can land before any data has accumulated (just the FE module + factor + tests). Should we (a) ship the PR + wait for data, or (b) wait until 1 month data accumulated then ship PR + sanity sweep in same submission? (a) keeps Story In progress shorter ; (b) bundles evidence with code.
l2_features_enabled=True cache invalidation : ~~v1 disables cache (fail-safe)~~ — CR pass v2.2 caught this as a false claim. There is no code-level cache-disable gate in v1 ; EnrichmentAPI.enrich_batch does not branch on l2_features_enabled. The actual mitigation chain is (i) inference-side OOS-only rule (cache hit on training-window slice never poisons fresh inference), (ii) per-run_id ADR-79 namespacing (Track-2 sweeps cannot collide with Track-1), (iii) the operator-discipline rule documented in MLOps §7.1. Question for committee : is the operator-discipline bound (iii) acceptable for v1, or should we (a) ship the cache-key extension (l2_feature_set_version in cache key) upfront in Track 2 v1 instead of deferring to S07 v2 follow-up PR, or (b) add a code-level cache-disable gate keyed on l2_features_enabled=True as a true fail-safe? (Note : this whole row aligns with §0 row #7 — the cache-key extension is the inference-path coherent unit alongside the InferenceAPI loader + MLflow artefact persistence, deferred to S07 v2 follow-up PR per the operator-confirmed BTC-pattern scope split. "Track 12" is unrelated and was a stale draft reference.)