Plan dossier - CVN-N001-EE-S07 : Track 2 Order book microstructure features (FE + sweep)
Date : 2026-05-05 (initial) / amended v2 addressing committee 2dc76b50 METHODOLOGY_FLAW recommendations / amended v2.1 realigning §0 + §4.6 + §5#2 with the BTC-scope-split shipped in PR #850 (cache-key extension deferred to S07 v2 follow-up — committee 4076bdca pr_review blocker)
Story : CVN-N001-EE-S07 (OP wp#46)
GH issue : #718
Sister Story : CVN-N001-EE-S15 (wp#126) — Binance L2 ingestion infra (DEPENDENCY — must ship + 1 month forward data accumulated before this Story's first sweep)
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P2 — Track 2 of F1_buy boost ; data-tier lever ; tier 1 expected effect size
Committee plan_review : ✅ PASSED / METHODOLOGY_FLAW (session 2dc76b50, 2026-05-05, 5 experts strong consensus, 0 blockers, 15 recs, $0.20). OP Meeting : #99. v2 dossier addresses 12/15 recs ; 3 deferred with rationale (see §0 below).
0. 2026-05-05 v2 amendment : committee 2dc76b50 METHODOLOGY_FLAW recommendations
The committee passed the plan with 0 blockers but flagged 3 methodological gap classes (statistical rigor, feature definitions + monitoring, economic realism + operational safety) that warrant amendment before impl. v2 incorporates 12/15 recs ; 3 are explicitly deferred.
Adoptions (12)
| # | Rec | Where amended |
|---|---|---|
| 1 | BH multiple-testing correction across all variant comparisons (not just leakage check) | §2 hypothesis + §5 acceptance criteria #4 |
| 2 | Cohen's d 95 % bootstrap CIs in deep-sweep results | §5 acceptance criteria #6b |
| 3 | Power analysis OR window extension : pre-registered fallback to 6-month window if 3-month sanity sweep shows insufficient power | §3.1 new + §7 falsifiability |
| 4 | Pin canonical Kyle 1985 + window for l2_kyle_lambda (cross-ref to S15 §4.3 schema owner) | §4.1 + new explicit citation |
| 5 | Real-time L2 feature data quality monitoring (freshness, NaN, range, dist shift) — operator runbook | new §4.7 |
| 7 | Cache key extension (l2_schema_version in cache key) | Deferred to S07 v2 follow-up PR (mirrors Track 1 BTC scope split : cache key + InferenceAPI loader + MLflow artefact persistence are the inference-path coherent unit, all 3 land together). The v2 amendment originally proposed "upfront", but the actual scope-split decision (operator-confirmed) keeps this with the inference path. Committee 4076bdca pr_review flagged the dossier-vs-code mismatch ; v2.1 realigned the dossier with the shipped code. v2.2 amendment per CR pass on PR #850 (2026-05-05 21:10) : the previously-claimed "v1 disables cache when l2_features_enabled=True" mitigation does NOT exist in code (no such gate in enrichment_api.py or contracts.py). The actual gate is the operator-discipline rule from MLOps readiness §7.1 : "models trained with l2_features_enabled=True MUST NOT be deployed to inference until S07 v2 follow-up merges". No L2-trained model reaches inference until then → no cache collision in prod possible. The silent-data-discrepancy risk lives only on the training side (FTF sweeps re-running with same factor / different schema_version would mis-cache) and is mitigated by ADR-79's per-run-id cache namespacing (each FTF run has its own run_id partition). |
| 9 | Sortino + expectancy after fees and spread informational reporting in deep sweep | §5 acceptance criteria #5b new |
| 10 | Runtime kill-switch / feature-flag for L2 features (disable without retraining) | new §4.8 |
| 11 | Staged rollout strategy (shadow → canary → full) for L2-enabled models | new §4.9 + §5 acceptance criteria #12 |
| 13 | Feature importance ablation in deep-mode sweep | §5 acceptance criteria #6c new |
| 14 | Per-class metrics (precision/recall BUY/SELL) reporting alongside f1_buy | §5 acceptance criteria #6d new |
| 15 | Regime-stability monitoring post-LOCK + drift thresholds for retraining | new §4.10 + §8 risks updated |
Deferred (3)
| # | Rec | Rationale |
|---|---|---|
| 6 | ADR for L2 schema evolution | Schema is owned by S15 (sister Story) ; ADR governance belongs to ADR-77 docs SSoT Story or a dedicated S15 follow-up. Out-of-scope here ; tracked as a known gap. |
| 8 | Accelerate #711 dynamic slippage | Operator decision 8a explicitly defers #711 to a separate Story after Track 2 LOCK. Documented disagreement with the committee : the operator-directed mission scope is f1_buy-primary (per F1 plan §6 derogation), not joint-economic-metric. #711 ships as its own Story under economic-metric mission scope. |
| 12 | Label horizon sensitivity sweep across multiple horizons | Adds a 2nd FTF factor (label_horizon) interacting with l2_features ; doubles the sweep cell count and stretches the Story's scope. Tracked as a follow-up Story (likely under the future filter-tuning mission) — NOT a v1 Track 2 deliverable. |
Track 2 of the F1_buy boost plan (F1_BUY_BOOST_PLAN.md §5 Track 2) hypothesises that adding L2 order book microstructure features lifts f1_buy materially over the OHLCV-only baseline. The model currently sees only OHLCV candles ; microstructure data (spread, depth imbalance, order flow) carries information about short-horizon price pressure that's invisible from candles alone.
Operator decision context (wp#46 comment 679, 2026-05-05) :
- Track 14 (5m timeframe) LOCK at +0.125 Δf1_buy proves data-tier productive ; Track 2 is the next data-tier lever
- Track 2 starts in parallel with Track 12 (frac diff) — the original quick-win bundle gate is moot since Track 14 already cleared the +0.05 spirit
- Scope split : data ingestion infra in S15 (sister Story) ; this Story = FE + FTF sweep only
- Feature catalogue : 6 standard microstructure features (per operator decision 6b)
- FTF factor structure : reuse btc_features template (per operator decision 7)
Cross-track lesson absorbed : Track 1 (BTC features) ABANDONED on leakage gate (currently under investigation in wp#103). Track 2 design must front-load the same leakage check (standard_purge0 variant in the matrix) to surface the same class of pathology early.
2. Hypothesis (falsifiable)
Adding 6 standard L2 microstructure features lifts f1_buy materially over the OHLCV-only baseline at the current dataset / model regime. Specifically :
- H0 (null) : `mean(f1_buy | l2_features=enabled) - mean(f1_buy | disabled)` is indistinguishable from 0 (CI95 includes 0) → ABANDON.
- H1 (alternative) : Δf1_buy ≥ +0.03 (Story-specific bar per wp#46 acceptance criteria, slightly higher than the standard +0.015 because microstructure is expected to be a strong signal for short-horizon predictions) with 95 % bootstrap CI excluding 0, AND ≥ 4/5 cryptos individually improve, AND Cohen's d ≥ 0.3.
Mandatory leakage check (same pattern as Track 1 plan §4.6) : paired t-test on f1_buy(standard_purge0) - f1_buy(standard) over 25 (crypto, fold) cells, BH-corrected at α = 0.05. Failure → ABANDON pending root-cause investigation (à la wp#103).
Statistical contract (per committee 2dc76b50 rec #1, v2) : ALL pairwise variant-vs-baseline comparisons (each of min, standard, flow_only vs none, plus standard_purge0 vs standard, plus standard_purge10 vs standard) MUST be BH-corrected jointly at α = 0.05. The Type I error budget is shared across the 5 candidate comparisons ; reporting unadjusted p-values for any single comparison is non-compliant.
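To make the joint correction concrete, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment applied to the 5 comparisons listed above. The raw p-values are invented for illustration, and the helper name `benjamini_hochberg` is an assumption, not an existing project function.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted_p, reject) for the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest rank downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted, adjusted <= alpha


# Hypothetical raw p-values, one per comparison in the statistical contract.
comparisons = [
    "min vs none",
    "standard vs none",
    "flow_only vs none",
    "standard_purge0 vs standard",
    "standard_purge10 vs standard",
]
raw_p = [0.030, 0.004, 0.200, 0.600, 0.045]

adjusted_p, reject = benjamini_hochberg(raw_p, alpha=0.05)
for name, raw, adj, rej in zip(comparisons, raw_p, adjusted_p, reject):
    print(f"{name:32s} raw={raw:.3f} adjusted={adj:.3f} reject={bool(rej)}")
```

With these illustrative inputs, three raw p-values sit below 0.05 but only one adjusted value does, which is the situation the contract guards against when it forbids reporting unadjusted p-values.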
3. Variant matrix
5 variants per the F1 plan §4.2 convention + 1 leakage sanity (matching the proven btc_features template per operator decision 7) :
| Variant | What it does | Features added | Notes |
|---|---|---|---|
| `none` (baseline) | OHLCV-only | 0 | Reference |
| `min` | Minimal, directional only | spread_bps, depth_imbalance_l1, mid_price | Cheapest variant ; tests "spread + L1 imbalance is enough" |
| `standard` (canonical) | F1 plan standard set | spread_bps, depth_imbalance_l1, mid_price, depth_imbalance_l5, kyle_lambda, ofi_15m | Operator decision 6b ; the LOCK candidate |
| `standard_purge0` | Same as standard but no purging (`l2_purge_bars=0`) | Same 6 | Mandatory leakage-detection sanity per Track 1 plan §4.6 pattern. NOT a candidate for lock. |
| `standard_purge10` | Same as standard with `l2_purge_bars=10` (sensitivity) | Same 6 | Empirically justifies the canonical purge_bars value |
| `flow_only` | Order-flow-only (no static spread) | kyle_lambda, ofi_15m, depth_imbalance_l5 | Tests "flow signal without static spread/depth carries the lift" |
Total 6 variants — same shape as btc_features factor.
3.1 Pre-registered window-extension fallback (per committee 2dc76b50 rec #3, v2)
Operator decision 5c locks the canonical data window at 3 months. Committee flagged this as potentially under-powered for cross-regime statistical robustness. Pre-registered fallback :
- First sanity sweep at 1 month (operator decision 5c sub-clause) reports per-variant Cohen's d + 95 % bootstrap CI. If `|CI_width| / |d| > 0.8` for the `standard` vs `none` comparison (i.e., the effect size estimate has > 80 % relative uncertainty), the data window is judged under-powered.
- Action on under-power : automatic extension to 6 months at the deep-mode sweep. If 6 months is still insufficient, escalate to 12 months with operator decision (cost = +1 month FTF compute).
- Action on adequate power : proceed with the 3-month canonical (no extension).
This is a falsifiable pre-registered protocol — the window length is data-driven, not operator-discretionary post-hoc.
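The under-power rule can be sketched as follows. This is an illustration under stated assumptions: the dossier does not pin which Cohen's d variant or bootstrap scheme is used, so the sketch assumes a paired-difference d over the per-cell Δf1_buy values and a percentile bootstrap. All function names and the synthetic deltas are hypothetical.

```python
import numpy as np


def cohens_d_paired(delta):
    """Paired-difference effect size: mean(delta) / sample std(delta). Assumed variant."""
    delta = np.asarray(delta, dtype=float)
    return float(delta.mean() / delta.std(ddof=1))


def bootstrap_ci(delta, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the paired Cohen's d, resampling cells with replacement."""
    rng = np.random.default_rng(seed)
    delta = np.asarray(delta, dtype=float)
    idx = rng.integers(0, delta.size, size=(n_boot, delta.size))
    samples = delta[idx]
    d_boot = samples.mean(axis=1) / samples.std(axis=1, ddof=1)
    tail = 100.0 * (1.0 - level) / 2.0
    low, high = np.percentile(d_boot, [tail, 100.0 - tail])
    return float(low), float(high)


def is_under_powered(d, ci_low, ci_high, max_relative_width=0.8):
    """Apply the pre-registered rule |CI_width| / |d| > 0.8."""
    if d == 0:
        return True  # relative width is undefined, treat as under-powered
    return abs(ci_high - ci_low) / abs(d) > max_relative_width


# Synthetic per-cell deltas (standard minus none), for illustration only.
rng = np.random.default_rng(42)
delta = rng.normal(loc=0.03, scale=0.05, size=25)

d = cohens_d_paired(delta)
ci_low, ci_high = bootstrap_ci(delta)
extend_to_six_months = is_under_powered(d, ci_low, ci_high)
```

Because the decision is a pure function of `d` and the CI bounds, it can be computed and logged by the sweep itself, which keeps the window choice out of post-hoc operator discretion.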
4. Implementation path
4.1 FE module (src/commun/pipeline/orderbook_features.py, NEW)
Mirror of src/commun/pipeline/btc_features.py (Track 1 module that survived 4 CR passes + committee review). Same structure :
```python
_L2_FEATURE_COLUMNS = {
    "min": frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price"}),
    "standard": frozenset({"l2_spread_bps", "l2_depth_imbalance_l1", "l2_mid_price",
                           "l2_depth_imbalance_l5", "l2_kyle_lambda", "l2_ofi_15m"}),
    "flow_only": frozenset({"l2_kyle_lambda", "l2_ofi_15m", "l2_depth_imbalance_l5"}),
}


def compute_l2_features(target_ohlcv, l2_snapshots, feature_set="standard", purge_bars=20):
    """Read pre-computed L2 columns from S15's Timescale snapshots, shift by purge_bars,
    join to target_ohlcv index. ADR-14 invariant : every L2 feature uses only data
    ≤ t - purge_bars to prevent same-bar leakage from book state into label window."""
```
Canonical feature definitions (per committee 2dc76b50 rec #4, v2) — pinned by S15 schema at the ingestion layer ; this Story's FE module reads them as-is :
- `l2_kyle_lambda` : Kyle 1985 lambda computed as the OLS regression slope of Δ(mid_price) on signed_volume over a rolling 15-minute window (the canonical Kyle illiquidity measure). Pre-computed in S15's `l2_snapshots.kyle_lambda` column ; this Story does NOT recompute it. Reference : Kyle, A. S. (1985), "Continuous auctions and insider trading", Econometrica 53(6), 1315-1335.
- `l2_ofi_15m` : Order Flow Imbalance over a 15-minute rolling window per Cont, Kukanov, Stoikov (2014) "The price impact of order book events", Journal of Financial Econometrics 12(1), 47-88. Pre-computed in S15's `l2_snapshots.ofi_15m` column.
- `l2_spread_bps`, `l2_depth_imbalance_l1`, `l2_depth_imbalance_l5`, `l2_mid_price` : standard OB metrics, definitions documented in S15 plan dossier §4.3 schema.
If S15's pre-computation diverges from these definitions, Track 2's results are non-comparable to the literature ; cross-Story validation = early sanity sweep checks l2_kyle_lambda distribution moments against published values for crypto majors (mean ~ O(1e-6) per USD-normalised volume per Easley et al. 2012).
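For readers checking S15's output against the pinned definition, the sketch below shows what "OLS slope of Δ(mid_price) on signed_volume over a rolling window" means numerically. It is not S15's implementation: the window is counted in observations rather than minutes, an intercept is assumed, and the function name and synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_kyle_lambda(mid_price: pd.Series, signed_volume: pd.Series, window: int) -> pd.Series:
    """Rolling OLS slope (with intercept) of delta(mid_price) on signed_volume."""
    d_mid = mid_price.diff()
    # For a simple regression with intercept, slope = cov(x, y) / var(x).
    cov = d_mid.rolling(window).cov(signed_volume)
    var = signed_volume.rolling(window).var()
    return cov / var


# Synthetic series whose price impact per unit of signed volume is known.
rng = np.random.default_rng(0)
signed_volume = pd.Series(rng.normal(0.0, 1_000.0, size=60))
true_lambda = 2e-6
mid_price = 100.0 + (true_lambda * signed_volume).cumsum()

estimate = rolling_kyle_lambda(mid_price, signed_volume, window=15)
print(estimate.dropna().tail(3))
```

On noise-free synthetic data the estimate recovers `true_lambda`; on real snapshots the same identity gives a quick cross-check of the `kyle_lambda` column during the early sanity sweep.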
ADR-14 purging invariant identical to Track 1 : every L2 feature column is .shift(purge_bars) so column row i carries information from L2 row i - purge_bars. The 6 feature scalars come pre-computed from S15's l2_snapshots table (per S15 §4.3 schema) — this module just reads + shifts + joins, no recomputation.
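The read + shift + join step can be sketched as below. This is an assumption-laden illustration rather than the module itself: it assumes the L2 frame is already indexed by the same bar timestamps as the OHLCV frame, and the helper name `shift_and_join` is hypothetical.

```python
import numpy as np
import pandas as pd


def shift_and_join(target_ohlcv, l2_snapshots, columns, purge_bars):
    """Shift the selected L2 columns by purge_bars rows, then left-join on the bar index."""
    shifted = l2_snapshots[sorted(columns)].shift(purge_bars)
    return target_ohlcv.join(shifted, how="left")


index = pd.date_range("2026-01-01", periods=30, freq="15min")
ohlcv = pd.DataFrame({"close": np.linspace(100.0, 129.0, 30)}, index=index)
# The L2 value equals its own row number, so the shift is easy to read off.
l2 = pd.DataFrame({"l2_spread_bps": np.arange(30, dtype=float)}, index=index)

enriched = shift_and_join(ohlcv, l2, {"l2_spread_bps"}, purge_bars=20)
print(enriched.iloc[[19, 20, 25]])
```

Row 25 carries the L2 value from row 5, and the first 20 rows are NaN. Note that a row shift equals a time shift only when the L2 index has no missing bars, which is one reason the coverage check matters.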
ADR-25 fail-fast : if l2_features_enabled=True is requested but the l2_snapshots table has < 95 % coverage of the target time window, raise RuntimeError rather than imputing zeros.
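A minimal sketch of the coverage gate follows. The dossier does not define how coverage is measured, so this assumes it is the share of target bars that have a matching L2 snapshot timestamp ; the helper name is hypothetical.

```python
import pandas as pd


def require_l2_coverage(target_index, l2_index, min_coverage=0.95):
    """Raise instead of imputing when too few target bars have an L2 snapshot."""
    if len(target_index) == 0:
        raise RuntimeError("Empty target window: L2 coverage is undefined")
    coverage = float(target_index.isin(l2_index).mean())
    if coverage < min_coverage:
        raise RuntimeError(
            f"L2 coverage {coverage:.1%} is below the required {min_coverage:.0%}"
        )
    return coverage


bars = pd.date_range("2026-01-01", periods=100, freq="15min")
coverage_ok = require_l2_coverage(bars, bars[:97])  # 97 of 100 bars covered

try:
    require_l2_coverage(bars, bars[:90])  # 90 of 100 bars covered
except RuntimeError as exc:
    print(f"fail-fast: {exc}")
```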
4.2 EnrichmentAPI extension
Same pattern as Track 1's BTC features wiring :
- Extend `EnrichmentConfig` (in `src/commun/pipeline/contracts.py`) with `l2_features_enabled: bool = False`, `l2_features_set: Literal["min","standard","flow_only"] = "standard"`, `l2_purge_bars: int = 20`.
- Extend `CVNTrade_Enrich.process()` to accept an optional `l2_snapshots: Optional[pd.DataFrame]` kwarg.
- The L2 DataFrame is loaded by the ETL orchestration layer from Timescale (NOT by the enrichment layer per ADR-25 : fail-fast if `l2_features_enabled=True` and `l2_snapshots is None`).
4.3 ETL extension (src/ETL/cvntrade_etl_pipeline.py)
When training with l2_features_enabled=True, fetch the L2 snapshots from Timescale alongside OHLCV :
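```python
import pandas as pd

# engine, crypto_symbol, start_ts and end_ts come from the surrounding ETL context.
# Values are passed as bound parameters, never interpolated into the SQL string.
l2_snapshots = pd.read_sql(
    "SELECT * FROM l2_snapshots WHERE symbol = %s AND bar_time BETWEEN %s AND %s "
    "AND schema_version = 'l2_schema_v1' "
    "ORDER BY bar_time",
    con=engine,
    params=(crypto_symbol, start_ts, end_ts),
)
```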
For paper / live, the streaming kernel pre-fetches a rolling L2 window per candle (same pattern as the BTC ohlcv pre-fetch).
4.4 FTF factor (per ADR-58)
File : src/commun/finetune/ablation_matrix.py — add l2_features factor mirroring btc_features :
```python
AblationFactor(
    name="l2_features",
    factor_type="training",
    category="data",
    description="L2 order book microstructure features (Track 2 of F1_buy boost). "
                "6 variants matching the btc_features template. ADR-14 invariant : "
                "training time t uses only L2 data <= t - purge_bars.",
    env_vars={
        "none": {"CVN_L2_FEATURES_ENABLED": "0"},
        "min": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "min", "CVN_L2_PURGE_BARS": "20"},
        "standard": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "20"},
        "standard_purge0": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "0"},
        "standard_purge10": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "standard", "CVN_L2_PURGE_BARS": "10"},
        "flow_only": {"CVN_L2_FEATURES_ENABLED": "1", "CVN_L2_FEATURES_SET": "flow_only", "CVN_L2_PURGE_BARS": "20"},
    },
),
```
4.5 Guardrail tests (ADR-58)
Test 1 (unit) : test_l2_features_shift_invariant — assert compute_l2_features(..., purge_bars=20) produces columns shifted by exactly 20 rows.
Test 2 (unit) : test_l2_features_set_columns — for each of the 3 feature sets, assert feature_columns_for_set(name) returns the canonical column list.
Test 3 (integration) : test_ablation_matrix_l2_features_variants — assert the FTF factor has 6 variants with expected env_vars.
Test 4 (integration) : test_l2_features_etl_failfast — assert RuntimeError raised when l2_features_enabled=True + < 95 % L2 coverage.
4.6 Feature contract pinning via MLflow artefact + cache key extension (deferred per S07 v2 follow-up scope split)
Per ADR-23 (features version-pinned, fail-fast), the trained model's MLflow artefact includes a new l2_enrichment_config.json capturing :
- l2_features_enabled: bool
- l2_features_set: str
- l2_purge_bars: int
- l2_schema_version: str (= l2_schema_v1 from S15)
InferenceAPI loads this config at inference time and fail-fasts if the runtime env disagrees with the model's pinned config (same pattern as Track 1's enrichment_config.json).
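The fail-fast comparison could look like the sketch below once the deferred S07 v2 work lands. It is illustrative only: the env var names are taken from §4.4, the defaults from §4.2, and the function name plus the way the runtime schema version is supplied are assumptions.

```python
import json


def check_pinned_l2_config(pinned_json: str, env: dict, runtime_schema_version: str) -> None:
    """Raise RuntimeError if the runtime settings differ from the model's pinned config."""
    pinned = json.loads(pinned_json)
    runtime = {
        "l2_features_enabled": env.get("CVN_L2_FEATURES_ENABLED", "0") == "1",
        "l2_features_set": env.get("CVN_L2_FEATURES_SET", "standard"),
        "l2_purge_bars": int(env.get("CVN_L2_PURGE_BARS", "20")),
        # Assumption: the caller resolves the schema version (e.g. from the snapshot table).
        "l2_schema_version": runtime_schema_version,
    }
    mismatches = {
        key: {"pinned": pinned.get(key), "runtime": value}
        for key, value in runtime.items()
        if pinned.get(key) != value
    }
    if mismatches:
        raise RuntimeError(f"L2 enrichment config mismatch: {mismatches}")


# Example contents of l2_enrichment_config.json for a model trained on the standard set.
pinned_json = json.dumps({
    "l2_features_enabled": True,
    "l2_features_set": "standard",
    "l2_purge_bars": 20,
    "l2_schema_version": "l2_schema_v1",
})
matching_env = {
    "CVN_L2_FEATURES_ENABLED": "1",
    "CVN_L2_FEATURES_SET": "standard",
    "CVN_L2_PURGE_BARS": "20",
}
check_pinned_l2_config(pinned_json, matching_env, "l2_schema_v1")  # passes silently
```

Reporting every mismatching key at once, rather than stopping at the first, tends to shorten the operator's debugging loop when several env vars drift together.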
Cache key extension — deferred to S07 v2 follow-up (committee 4076bdca v2.1 alignment + CR pass v2.2) : the v1 PR ships the FE module + factor + tests + EnrichmentAPI wiring only ; the cache key extension + MLflow artefact persistence + InferenceAPI loader land together in the inference-path follow-up PR (S07 v2), mirroring the BTC Track 1 scope split per EnrichmentConfig Track 1 docstring §4.1bis. The original v2 §0 amendment said "upfront" ; committee 4076bdca pr_review correctly flagged the dossier-vs-shipped-code mismatch ; v2.1 realigned the dossier with the committed code.
v2.2 (CR pass on PR #850 2026-05-05 21:10) : a "v1 disables cache when l2_features_enabled=True" mitigation was claimed in v2.1 but does NOT exist in the shipped code. Honest mitigation chain :
1. Inference-side : the MLOps readiness §7.1 OOS rule is the gate — models trained with l2_features_enabled=True MUST NOT be deployed to inference until S07 v2 ships the cache key extension + InferenceAPI loader together. No L2-trained model reaches prod until then → no inference-time cache collision possible.
2. Training-side : ADR-79 namespaces FTF cache by run_id, so two sweeps with different l2_schema_version use different cache partitions. Within a single FTF run, the operator pins one l2_schema_version per the BASE_ENV at run start ; mid-run schema-mismatch is impossible by construction.
3. What's NOT mitigated : if the operator manually invokes compute_l2_features outside an FTF run + outside the inference path (e.g., a notebook running enrich_batch with l2_features_enabled=True against a re-pinned PG schema_version), they could populate the L2 columns under a stale schema_version. This is an operator-discipline bound and lives in §7.1 OOS until the cache key extension lands.
4.7 Real-time L2 feature data quality monitoring (per committee 2dc76b50 rec #5, v2 NEW)
Once L2-enabled models reach inference (post-LOCK), the runtime emits the following per-prediction quality signals to the existing observability stack (Loki + Grafana per ADR-26) :
| Signal | Threshold | Alert action |
|---|---|---|
| `l2_feature_freshness_seconds` (delta between prediction time and most recent L2 snapshot) | warn > 60s, P2 alert > 300s | switch to L2-blind fallback model (per §4.8) |
| `l2_feature_nan_rate` (% of L2 columns with NaN values per prediction) | warn > 5 %, P2 alert > 20 % | switch to L2-blind fallback model |
| `l2_feature_value_oor` (% of L2 column values outside reference range, e.g. spread_bps > 1000 is OOR for crypto majors) | warn > 1 %, P2 alert > 5 % | flag for operator review ; do NOT auto-switch (could be regime change) |
| `l2_kyle_lambda_drift_psi` (PSI of l2_kyle_lambda distribution vs training reference, rolling 24h) | warn > 0.10, P2 alert > 0.25 | trigger retraining workflow per §4.10 |
Reference ranges + thresholds calibrated from the deep-sweep training distribution ; stored in the model's MLflow artefact as l2_quality_reference.json.
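As an illustration of the PSI signal, the sketch below computes a Population Stability Index against a training reference and maps it onto the warn / alert thresholds from the table. The binning scheme (10 quantile bins of the reference, with a small floor on empty bins) is an assumption, as are the function names ; inputs are assumed to have NaNs already dropped.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples, using quantile bins derived from the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior edges are reference quantiles, so each bin holds about 1/n_bins of it.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    # Floor the fractions so empty bins do not produce log(0).
    ref_frac = np.clip(ref_counts / reference.size, eps, None)
    cur_frac = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


def psi_status(psi, warn=0.10, alert=0.25):
    """Map a PSI value onto the thresholds of the monitoring table."""
    if psi > alert:
        return "alert"
    if psi > warn:
        return "warn"
    return "ok"


rng = np.random.default_rng(7)
training_reference = rng.normal(0.0, 1.0, size=5_000)
shifted_window = training_reference + 1.0  # synthetic one-sigma location shift

print(psi_status(population_stability_index(training_reference, training_reference)))
print(psi_status(population_stability_index(training_reference, shifted_window)))
```

Deriving the bin edges from the reference sample means they can be stored once alongside `l2_quality_reference.json` and reused at inference time without access to the training data.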
4.8 Runtime kill-switch for L2 features (per committee 2dc76b50 rec #10, v2 NEW)
A feature flag cvn_pipeline.l2_features.runtime_disable (PG-stored per ADR-59) lets the operator disable L2-enabled models in production WITHOUT retraining. When set to true :
- InferenceAPI detects the flag at every inference call (cached for 60s)
- Routes inference to the registered `champion_no_l2` rollback model (per §7 + ADR-15) instead of the L2-enabled champion
- Logs a `kill_switch=l2_features` event per ADR-32 ; emits a Grafana annotation
- Operator re-enables by flipping the flag back to `false`
The kill-switch is NOT the same as the trading kill-switch (ADR-71 = halts all trading). This is per-feature : trading continues but with the L2-blind model. Both kill-switches can be combined (trading off + L2 off = full safety) ; they're orthogonal.
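The "checked at every call, cached for 60s" behaviour can be sketched with a small time-to-live cache around the flag lookup. Everything here is hypothetical scaffolding: `CachedFlag`, the `fetch` callable standing in for the PG read, and the model name `champion` for the L2-enabled champion are illustrative, not project APIs.

```python
import time


class CachedFlag:
    """Cache a boolean flag lookup for ttl_seconds to avoid one DB read per prediction."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning the current flag value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for tests
        self._value = False
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = bool(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


def select_model_name(l2_disabled):
    """Route to the L2-blind rollback model while the kill-switch is on."""
    return "champion_no_l2" if l2_disabled else "champion"


# Simulated flag store and clock, so the sketch runs without a database.
state = {"flag": False, "now": 0.0}
runtime_disable = CachedFlag(lambda: state["flag"], ttl_seconds=60.0, clock=lambda: state["now"])

print(select_model_name(runtime_disable.get()))  # flag off
state["flag"] = True                              # operator flips the flag
state["now"] = 30.0
print(select_model_name(runtime_disable.get()))  # still the cached value
state["now"] = 60.0
print(select_model_name(runtime_disable.get()))  # cache expired, flag is re-read
```

The trade-off of the cache is a delay of up to one TTL between the operator flipping the flag and inference switching models, which is worth stating in the runbook.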
4.9 Staged rollout : shadow → canary → full (per committee 2dc76b50 rec #11, v2 NEW)
Post-LOCK deployment of an L2-enabled model follows a 3-stage rollout (mirroring the standard MLOps pattern + ADR-42 atomic per-crypto promotion) :
| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | 0 % (predictions logged but not executed) | 7 days | l2_feature_freshness_seconds_p99 < 30s AND l2_feature_nan_rate_p99 < 1 % AND prediction agreement with L2-blind ≥ 90 % | any quality signal threshold breach (§4.7) |
| Canary (1 of 5 cryptos) | per-crypto promotion via Console (ADR-42) | 7 days | f1_buy on the canary crypto matches the deep-sweep estimate within ±20 % | drift trigger from §4.10 OR operator manual rollback via §4.8 kill-switch |
| Full (5/5 cryptos) | per-crypto promotion sequence over 5 days (1 crypto / day) | 7 days steady-state | no kill-switch triggers, no drift alerts | drift trigger from §4.10 OR operator manual rollback |
The shadow stage is the cheapest catch ; canary catches per-crypto pathologies that aggregate stats hide. Full rollout is the steady-state.
4.10 Regime-stability monitoring + retraining triggers (per committee 2dc76b50 rec #15, v2 NEW)
Post-LOCK, the L2-enabled model carries 3 drift signals that, on threshold breach, trigger an automated retraining workflow :
| Signal | Threshold | Action |
|---|---|---|
| Concept drift : rolling f1_buy on last 30d vs training-window f1_buy | gap > 0.05 absolute | open ticket + auto-launch shadow retraining job (no auto-deploy) |
| Feature drift : PSI of l2_kyle_lambda, l2_ofi_15m, l2_depth_imbalance_l5 vs training reference (rolling 30d) | any one > 0.25 | same as concept drift |
| Regime change : l2_spread_bps_p99 rolling 7d vs rolling 90d ratio | ratio > 3.0 (microstructure regime shift, e.g., Binance fee changes) | escalate to operator review BEFORE auto-retraining |
Retraining workflow (when triggered) : standard CVNTrade retrain DAG with the original FTF config, freshest training window. The retrained model goes through the §4.9 staged rollout (shadow first), NOT auto-promoted. Operator approval gate at the canary stage.
5. Acceptance criteria
| # | Criterion | Evidence |
|---|---|---|
| 1 | 6 NEW variants live in l2_features factor matrix | `python -c "from commun.finetune.ablation_matrix import DATA_FACTORS; print([v for f in DATA_FACTORS if f.name=='l2_features' for v in f.env_vars])"` shows 6 entries |
| 2 | Unit tests pass (3 new test modules) | pytest tests/unit/test_orderbook_features.py tests/unit/test_l2_features_ablation.py tests/unit/pipeline/test_enrichment_api_l2_wiring.py green. Cache-key-extension test deferred to S07 v2 follow-up PR per §0 row #7 + §4.6 v2.1 alignment (the inference-path coherent unit). |
| 3 | First sanity sweep (1 month data, smoke power_mode) complete | Run ID + paired t-test results + power assessment per §3.1 |
| 4 | Mandatory leakage check passes — AND all 5 candidate variant comparisons are BH-corrected jointly (v2 rec #1) | BH-corrected p ≥ 0.05 on standard_purge0 vs standard ; BH-corrected p < 0.05 on at-least-one of min/standard/flow_only vs none for the alt-hypothesis to hold |
| 5 | Deep-mode sweep complete (3 months OR 6 months per §3.1 power-driven extension) | Run ID + ADR-79 8-step results dossier ; window length justified |
| 5b | Sortino + expectancy after fees + spread reported informationally (v2 rec #9) | results dossier §X new ; not a hard gate, but flags economic-regression risk for downstream #711 Story scoping |
| 6 | F1 mission gates clear (Δf1_buy ≥ +0.03 + CI95 excluding 0 + ≥ 4/5 cryptos + Cohen d ≥ 0.3) | results dossier §6 |
| 6b | Cohen's d 95 % bootstrap CIs reported per variant (v2 rec #2) | results dossier §6 — point estimate + CI |
| 6c | Feature importance ablation in deep sweep (v2 rec #13) | per-feature SHAP / permutation importance ranking on the LOCK candidate variant |
| 6d | Per-class metrics (precision/recall BUY/SELL) reported alongside f1_buy (v2 rec #14) | results dossier §6 — confusion matrix + per-class precision/recall |
| 7 | MLOps readiness completed (incl. §4.7 monitoring + §4.8 kill-switch + §4.9 rollout + §4.10 drift triggers) | documentation/stories/CVN-N001-EE-S07/mlops_readiness.md |
| 8 | Verdict recorded (LOCK / KEEP_AVAILABLE / ABANDON) per ADR-79 decision tree | results dossier §11 verdict |
| 9 | If LOCK : `champion_no_l2` model registered in MLflow as rollback target (per Track 1 plan §7 model-switching rollback pattern) | MLflow Registry tag `champion_no_l2` |
| 10 | Committee plan_review PASSED : DONE, session 2dc76b50 (OP Meeting #99), 0 blockers, code METHODOLOGY_FLAW addressed in v2 amendment | this dossier §0 |
| 11 | Committee pr_review PASSED on impl PR | OP Meeting per ADR-82 + session JSON link |
| 12 | Staged rollout complete (shadow → canary → full) (v2 rec #11) per §4.9 | per-stage Grafana annotations + operator approval logs |
6. Out of scope
- L2 data ingestion infra — sister Story S15 (wp#126) covers the Timescale schema + DAGs + reconstruction
- #711 dynamic slippage model — operator decision 8a : separate Story after Track 2 LOCK
- Order book imbalance > L5 — VAMP, queue position, micro-volatility — future research Stories
- Multi-exchange L2 — Binance only (S15 scope) ; Coinbase / Kraken later
- Real-time streaming consumption of L2 in paper / live — backtest-only for v1 ; paper / live integration is a follow-up Story if Track 2 LOCKs (similar pattern to Track 1's parent plan §6)
- Joint metric (Sortino + expectancy) gates — F1 mission scope is f1_buy-primary per §6 derogation ; gate 2 (joint metric) is filter-tuning follow-up Story
7. Falsifiability + rollback
- Falsifiability per H0 : pre-registered Δf1_buy ≥ +0.03 with CI95 excluding 0 + per-asset 4/5. If Δf1 ∈ [-0.01, +0.025] with CI95 including 0 OR per-asset improves on < 4/5 cryptos, ABANDON.
- Falsifiability on leakage : if `standard_purge0` outperforms `standard` on f1_buy (BH p < 0.05), ABANDON pending leakage investigation (à la wp#103 for Track 1).
- Rollback at LOCK time (per Track 1 plan §7 model-switching pattern) :
  - The MLOps promotion workflow already handles model artefact swaps per ADR-15 + ADR-42 (atomic per-crypto promotion). Rollback = promote the previous L2-blind champion via Console flow on `mlflow_promotion`.
  - `champion_no_l2` model MUST be registered in MLflow Registry as a deployable rollback target before any Track 2 model becomes champion (mandatory pre-LOCK artefact, mirroring Track 1 plan §7).
  - Hot-fix path for code bugs : standard PR + retrain with fix + atomic promotion (NOT runtime env-flag toggle, which would dimension-mismatch the model per ADR-23).
8. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| L2 data not yet accumulated when Story is impl-ready | high | low | Story is sequenced AFTER S15 ships ; first sanity sweep runs at 1 month of forward data ; deep sweep at 3 months. The PR can land first ; sweeps wait. |
| Same leakage class as Track 1 (per-bar L2 info contaminating labels) | medium | high | Mandatory standard_purge0 variant in matrix surfaces leakage as a gate violation. Plus the purge=20 shift is enforced by a regression test. |
| Microstructure features unstable across cryptos (high spread variance, low-liquidity altcoins) | medium | medium | Per-crypto FTF variant gates (per-asset 4/5) catch crypto-specific failures ; failing crypto can be excluded from a per-asset Story later |
| Reconstruction quality from S15 not enough for FE (ADR-25 fail-fast triggers) | medium | medium | S15 §7 has its own falsifiability ; if it ships forward-only with reduced window, S07 sanity sweep at 1 month still works |
| Pipeline cache key extension needed for L2 (cache invalidation) | low | low | v1 has NO code-level cache-disable gate (CR pass v2.2 caught the false claim). The actual mitigation is the operator-discipline rule in MLOps readiness §7.1 (no L2-trained model deployed until S07 v2) + ADR-79's per-run_id FTF cache namespacing. Cache extension is the S07 v2 inference-path follow-up. |
| MLflow artefact bloat from `l2_enrichment_config.json` | low | low | Negligible ; same JSON pattern as Track 1's enrichment_config.json |
| Feature drift over time (microstructure regime shifts post-Binance volume changes) | medium | high | Quarterly review per MLOps readiness §3 ; Loki alert if l2_kyle_lambda_p99 drifts > 3σ over rolling 30 days |
9. Sequencing + cross-Story impact
S15 (data infra) ────┐
├─ S15 deployed + ~1 month forward data
│ │
│ ▼
│ S07 PR opened (impl ready, sanity sweep at 1 month data)
│ │
│ ▼
│ ~2 more months pass (3-month canonical window)
│ │
│ ▼
│ S07 deep-mode sweep + verdict per ADR-79
│ │
│ ▼
│ If LOCK : downstream Story #711 dynamic slippage
│
Track 1 leakage (wp#103) ─┐
├─ Track 12 (frac diff) gated on Track 1 verdict — independent of Track 2
Track 14 (5m TF) LOCK ────┘
Cross-Epic impact :
- Track 14 (5m timeframe) : independent, LOCK shipped — Track 2 sweep runs at 15-min bars (default) but COULD run at 5m if Track 14 canonical is adopted globally (separate Story decision)
- Track 1 leakage investigation : independent — same leakage-check pattern adopted here as a defensive measure
- Track 12 (frac diff) : independent — gated on Track 1, no L2 dependency
- #711 dynamic slippage : downstream of Track 2 LOCK
- CVN-N010-EA (KPI Store) : if Track 2 LOCKs, the L2 features become production columns ; emit l2_kyle_lambda quantile via emit_kpi for drift monitoring
10. References
- OP wp#46 (S07)
- GH issue #718
- Sister Story OP wp#126 (S15)
- F1 plan : `documentation/F1_BUY_BOOST_PLAN.md` §5 Track 2 + §6 sequencing
- Track 1 plan dossier (pattern reuse) : `2026-04-30-track1-btc-features-plan.md`
- Track 1 results (leakage diagnostic) : `missions/ml-boost/2026-05-02-track1-btc-features-results.md` §5.1
- Track 14 results (data-tier confirmation) : `missions/ml-boost/2026-05-02-track14-timeframe-results.md`
- Operator decision audit : wp#46 comment 679 (2026-05-05)
- ADR-14 : purging invariant (training time t uses only data ≤ t − purge_bars)
- ADR-23 : features version-pinned, fail-fast
- ADR-25 : no silent fallback
- ADR-58 : every FTF factor must have a guardrail + integration test
- ADR-79 : FTF Story closure 8-step workflow
- ADR-80 : FTF post-run extraction + dossier mechanics
- ADR-82 : every committee session logged as OP Meeting
11. Plan-review questions for committee
- Effect-size bar (Δf1_buy ≥ +0.03) : the wp#46 acceptance criterion is +0.03 (vs the F1 plan's standard +0.015). Is this calibrated correctly given microstructure's expected effect size, or should it match the standard +0.015 to avoid biasing toward ABANDON?
- Leakage check pattern reuse : Track 2's leakage check mirrors Track 1's (`standard_purge0` vs `standard` paired t-test). The Track 1 leakage investigation (wp#103) is still in flight ; should Track 2 wait for the Track 1 verdict before launching its first sweep, or proceed in parallel knowing the same pathology might surface?
- `flow_only` variant inclusion : 4 variants is the minimum (none + min + standard + standard_purge0) ; `standard_purge10` and `flow_only` bring the total to 6. Is the 6-variant matrix the right power-vs-cost tradeoff?
- `l2_kyle_lambda` definition : Kyle's lambda has multiple definitions in the literature (regression slope of price impact on signed volume is the canonical Kyle 1985). Confirm S15's pre-computation matches the canonical definition + window size (15-min rolling regression).
- Pre-impl OR post-impl-PR sweep : the PR can land before any data has accumulated (just the FE module + factor + tests). Should we (a) ship the PR + wait for data, or (b) wait until 1 month data accumulated then ship PR + sanity sweep in same submission? (a) keeps Story In progress shorter ; (b) bundles evidence with code.
- `l2_features_enabled=True` cache invalidation : ~~v1 disables cache (fail-safe)~~ : CR pass v2.2 caught this as a false claim. There is no code-level cache-disable gate in v1 ; `EnrichmentAPI.enrich_batch` does not branch on `l2_features_enabled`. The actual mitigation chain is (i) inference-side OOS-only rule (cache hit on training-window slice never poisons fresh inference), (ii) per-`run_id` ADR-79 namespacing (Track-2 sweeps cannot collide with Track-1), (iii) the operator-discipline rule documented in MLOps §7.1. Question for committee : is the operator-discipline bound (iii) acceptable for v1, or should we (a) ship the cache-key extension (`l2_feature_set_version` in cache key) upfront in Track 2 v1 instead of deferring to S07 v2 follow-up PR, or (b) add a code-level cache-disable gate keyed on `l2_features_enabled=True` as a true fail-safe? (Note : this whole row aligns with §0 row #7 ; the cache-key extension is the inference-path coherent unit alongside the InferenceAPI loader + MLflow artefact persistence, deferred to S07 v2 follow-up PR per the operator-confirmed BTC-pattern scope split. "Track 12" is unrelated and was a stale draft reference.)