
F1_buy Boost Plan

Status: ✅ APPROVED ; 🟡 EXECUTION FROZEN (S18/S19 baseline restoration required) (committee round 3, session 9d4942cb, 2026-04-27 ; governance reset 2026-05-13 v5, see §0bis)
Issue: #707
Author: Dominique (operator) + Claude
Date: 2026-04-27 (last in-place update : 2026-05-13, v5 Tier 7 refinement)

Single source of truth for the F1 mission. This file is the canonical plan, the live track status table (§10), and the gate criteria (§6). Three sibling docs that used to overlap with this plan are now thin stubs pointing back here :


0bis. Current execution state — S18/S19 freeze

Decision state as of 2026-05-13 : This plan remains the canonical F1 uplift plan, but all metric-based LOCK / ABANDON decisions depending on post-#891 harness runs are frozen until S18 / S19 restore a valid baseline. Implementation work that improves data availability, observability, diagnostics or feature infrastructure may continue. Model-quality verdicts must NOT be closed against the regressed baseline. F1_LOCK and TRADING_LOCK are now distinct statuses (see §4).

0bis.1 Governance state matrix

Area | Status | Rationale
Historical baseline | pre-#891 ~0.42 | reference target
Current harness baseline | regressed (LGB best_iter=1 in 53/53 trials, XGB f1_buy ~0.09 per S18) | comparison floor is broken
S18 | diagnostic in flight (plan dossier merged ; impl awaits Step 0) | identifies root cause
S19 | remediation (scope TBD by S18 verdict) | restores the baseline
Track metric gates | suspended when based on post-#891 FTF | comparison to regressed median
F1_LOCK | predictive uplift only ; does NOT authorize trading | research-grade verdict
TRADING_LOCK | requires separate economic gate clearance (§4) | production-grade verdict
New alpha work | allowed only if not metric-gated (infra, data, observability) | unblocks parallel work
Track 5/6/9/1 ABANDONED | firm: alternatives lost to a regressed median, abandon is correct regardless of S18 | preserves audit trail
Track 14 LOCK on 5m | firm: pre-S17/S18 evidence, +0.1252 deep-mode lift on pre-#891 path | not affected by harness regression
Track 11 / Track 12 / Track 2-impl | metric verdict paused ; implementation can continue | rerun after S19 with valid baseline

0bis.2 Per-Track status table (S18 impact)

Track | Current verdict | S18 impact | Action required
Track 5 (label smoothing) | ABANDONED | remains firm | none
Track 6 (focal loss) | ABANDONED | remains firm | none
Track 9 (per-régime threshold) | ABANDONED | remains firm | none
Track 1 (BTC cross-asset features) | ABANDONED + leakage caveat (GH #806) | re-evaluate after S18+S19 (the leakage investigation may interact with the harness regression diagnosis) | hold the leakage investigation merge until S18 verdict
Track 14 (timeframe sweep) | LOCK on 5m (F1_LOCK only, see §4 v4 amendment) | remains valid ; pre-#891 evidence | retroactively re-classified as F1_LOCK, not TRADING_LOCK (Sortino + MaxDD regress on 5m)
Track 11 (ensemble diversity) | In testing | rerun after S19 | pause metric closure ; keep implementation alive
Track 12 (frac diff + interactions) | In testing | pause metric closure ; impl + parity tests can continue | resume gate eval after S19
Track 2 (order book features, CVN-N001-EE-S07-impl) | In progress | implementation can continue ; metric verdict paused | resume gate eval after S19

0bis.3 What can still ship today (not metric-gated)

The freeze applies only to metric-based LOCK / ABANDON closures that compare to post-#891 FTF runs. Other work continues :

  • Data infrastructure : Binance L2 ingester (CVN-N001-EE-S15, merged), Feast feature store hardening, FTF cache improvements.
  • Observability : Grafana dashboards (incl. cvntrade-hp-coverage from S17), Loki query templates, OTel spans.
  • Feature infrastructure : Track 2 (S07-impl) ships the 6 L2 features + FTF factor + guardrail tests + cache key extension. Gate evaluation is deferred ; the code lands.
  • Track 12 frac-diff impl (PR #885 merged, S05) : code + 8 FTF variants + InferenceAPI guardrail are in. Gate evaluation deferred.
  • Diagnostic Stories : S18 (in flight), follow-up S19 (if any), S20-S22 alpha tracks defined in §5 v4 amendment.
  • Documentation hygiene : SSoT consolidation, F1 plan governance reset (this v4), break-glass runbook, OOS validation per ADR-14 once S18 lands.

The single load-bearing constraint : no Track Story flips from In testing to Closed with a LOCK / ABANDON verdict that compares post-S17 numbers to the regressed harness baseline. That gate is held by S19.

Approval history

Round | Verdict | Avg | Min expert | Session | Dossier
1 | PASSED / EXECUTION_RISK | 6.7 | 5.0 (Ops) | 8db2529d | reviews/2026-04-27-f1-buy-boost-plan.md
2 | REJECTED / EXECUTION_RISK | 8.36 | 7.2 (Crypto-Trader) | 08655dce | reviews/2026-04-27-f1-buy-boost-plan-v2.md
3 | PASSED / OK | 8.96 | 8.8 (Data-Scientist) | 9d4942cb | this document

The two earlier dossiers stay in documentation/reviews/ as committee audit trail. This file is the live, canonical version that drives implementation.

Operational prerequisites before kickoff

Per Ops sign-off (round 3), the following must be DESIGN-signed (not necessarily implemented) before Track 5 (first quick-win) merges:

  • #708 — Kill-switch DESIGN sign-off
  • #709 — MLOps readiness template DESIGN sign-off (each later track then completes a copy of the template before its merge)

Other deferred follow-ups (not blocking the plan):

  • #710 — 9-filter chain ablation study (orthogonal diagnostic)
  • #711 — Dynamic slippage model (depends on Track 2 order book features; until it ships, an interim 10 bps static slippage applies — see §4)


0. Executive summary

CVNTrade's f1_buy has plateaued at 0.40–0.46 across 400 runs in Phase 2 (median 0.418, peak 0.541) — below the economic break-even (~0.500 at realistic cost) on the median fold. Tuning inside the existing OHLCV+XGBoost+binary-triple-barrier surface has reached an asymptote. To break it, we need to expand the input space (data) or the output space (label structure) or the function class (model architecture).

This plan proposes 13 tracks organised in 6 tiers (data / label / loss / architecture / calibration / semi-sup), with a quick-win bundle of 5 tracks shipped sequentially (~8-10 days) and a big-bet bundle deferred until the quick-win delivers or plateaus.

Phase | Tracks | Effort | Status
Quick-win (sequential) | 5 → 6 → 9 → 1 → 12 | ~8-10 days | Approved, ready to start
Big bet (gated) | 11 → 2 → 8 → 7 | ~3-4 weeks | Deferred until quick-win evaluated
Stress robustness | 13 | ~3-4 days | Mandatory before live deployment
Pseudo-labeling | 10 | TBD | Deferred indefinitely (research plan needed first)

Joint primary metrics: f1_buy + expectancy_net + sortino_ratio + max_drawdown + n_trades (no metric used in isolation — see §4 for the explicit cost formula).

Current focus (v5 — post-S18 reframing)

The F1 mission is temporarily restricted to three workstreams ; everything else is deferred until the S18/S19 baseline is restored :

  • Signal validation — S18 diagnostic (Step 0 → 5) is the current top priority. Closes when the post-#891 baseline is either explained (env drift Story) or restored (S19 remediation Story).
  • Data + feature improvements — implementation work on Tracks 1/2/12 + Tier 7 alpha tracks may proceed, BUT metric-based LOCK / ABANDON closures are paused (see §0bis).
  • Problem formulation — Tier 7 alpha tracks (selective prediction, edge regression, regime-aware modelling, continuous labels, simple-model baseline, feature-block ablation, horizon sweep, class purity) explore the predictivity surface beyond v1-v3 loss/label tweaks.

Architecture scaling (Track 8 sequence model), advanced loss-function innovation (Track 7 cost-sensitive) and pseudo-labelling (Track 10) are explicitly deferred until baseline restoration — see §6 "Post-S18 freeze list".


0ter. Target ladder (predictive objective framing — v5)

F1 targets are now interpreted as a graduated ladder, not a single number. This frames Track verdicts against operational thresholds instead of an unreachable global maximum :

Threshold | Meaning
f1_buy ≥ 0.40 | baseline restored (post-S18 + S19 gate ; re-opens metric-based verdicts)
f1_buy ≥ 0.50 | economically viable candidate (above the ~45 bps round-trip break-even)
f1_buy ≥ 0.60 | strong predictive model
f1_buy ≥ 0.70 | high-quality predictive regime
f1_buy ≥ 0.80 | achievable only via selective prediction (confidence-gated subset, Track 15) ; NOT a global objective

The technical mapping of these thresholds onto F1_LOCK vs TRADING_LOCK authorisations lives in §4. A Track hitting 0.50 globally + 0.80 on a confidence-gated subset is more valuable than a Track hitting 0.60 globally with no abstention.


1. Project context (condensed, unchanged from v1)

CVNTrade trades crypto altcoins on 15m candles via a unified pipeline:

OHLCV → Enrichment → FE pipeline (StandardScaler) → Selection → XGBoost
   → calibration (isotonic) → 9-filter chain → execution

PTE (production target encoding): ATR0.5_1.5_H4 = SL 0.5×ATR, TP 1.5×ATR, horizon 4h. Binary classification (BUY vs HOLD).

Chronic problem: f1_buy plateaus at 0.40–0.46 across 400 runs in Phase 2 (median 0.418, peak 0.541). Below the economic break-even (~0.500 at realistic cost) on the median fold.


2. Inventory — what we've already tried (unchanged from v1)

See v1 §2 (committee session 8db2529d, dossier archived in committee/sessions/) — 12 levers tried, all stayed inside the OHLCV+XGBoost+binary surface, all hit the same asymptote.


3. Committee v1 recommendations — integration mapping

# | v1 recommendation | Source | v2 integration
1 | Test tracks sequentially (not parallel) with true holdout | Data-Scientist | §6 sequencing: quick-win tracks shipped one-by-one with attribution measurement between each
2 | Add explicit purging/embargo for BTC features | ML-Eng + Ops | §5 Track 1: explicit embargo block added (BTC features lag-purged like own-asset features)
3 | Min 50 BUY trades per fold for f1 stability | Crypto-Trader | §6 fold-acceptance gate: runs with <50 BUY/fold rejected from variant comparison
4 | Confident learning / label noise detection for tracks 4-5 | Crypto-Trader + Data-Scientist | §5 Track 4 + §5 Track 5: cleanlab integration step added explicitly
5 | Mandate MLOps readiness plan for each track | Ops | §6: every track must complete the template from #709 before merge
6 | System-wide kill-switch | Ops | Deferred to #708 (cross-cutting, not F1-specific), referenced as production prerequisite
7 | Integrate P&L metrics alongside f1_buy as primary metrics | Crypto-Trader | §4: joint primary metrics (f1_buy + expectancy + sortino + max_drawdown + n_trades) for ALL tracks
8 | Uncertainty reporting (95% CI, fold variance, per-asset, class distribution) | Data-Scientist | §6 reporting standard: all results pass through commun/audit/baselines.py bootstrap helpers
9 | 9-filter chain ablation | ML-Eng + Ops | Deferred to #710 (orthogonal diagnostic)
10 | Ensemble diversity (LightGBM, CatBoost, snapshot ensembles) | ML-Eng + Architect + Crypto-Trader + Data-Scientist | §5 NEW Track 11: added to tier 4 architecture
11 | Fractional differentiation, explicit feature interactions, dynamic feature generation | ML-Eng + Architect | §5 NEW Track 12: added to tier 1 data
12 | Dynamic slippage model | Crypto-Trader | Deferred to #711 (dependency on Track 2 order book)
13 | Track 7 VETO for initial production until simpler tracks demonstrate positive expectancy | Crypto-Trader + Data-Scientist + Ops + Architect | §6: Track 7 explicitly gated behind quick-win delivery + positive-expectancy demonstration
14 | Track 8 deferral until GPU infrastructure provisioned | Ops + Crypto-Trader + Data-Scientist + Architect | §6: Track 8 deferred + GPU infra checklist required
15 | Track 10 deferral confirmed (pseudo-labeling) | All experts | §5 Track 10: explicitly marked "DO NOT IMPLEMENT" until quick-win results validated

Cross-cutting deferred issues (kill-switch pattern): #708 (kill-switch), #709 (MLOps readiness template), #710 (9-filter ablation), #711 (dynamic slippage). These are operational/process artefacts, not F1_buy improvements — split out to keep this plan focused per operator instruction.

v2 → v3 mapping (additional deltas from v2 holdouts)

# | v2 holdout finding | Source | v3 integration
16 | Kill-switch DESIGN must be signed off before any track ships | Ops 7.8 | §6: operational prerequisites added; kill-switch DESIGN sign-off (#708) is gating Track 5 (no implementation requirement, just design)
17 | MLOps readiness template DESIGN must be defined immediately | Ops 7.8 | §6: MLOps template DESIGN sign-off (#709) is gating Track 5; each track then completes a copy of the template before its own merge
18 | Live drift detection mechanism must be specified | Ops 7.8 | §6: drift detection (PSI + rolling f1 gap + KL div) specified as part of #709 template; thresholds TBD per track
19 | Production alerting + runbooks + escalation paths | Ops 7.8 | Folded into #709 (MLOps readiness template), section 2 of the template
20 | Standardize staged rollout (canary, shadow) | Ops 7.8 | Folded into #709, section 4 of the template
21 | Define expectancy as net of all realistic costs with explicit formula | Crypto-Trader 7.2 | §4: explicit formula added: gross_pnl - taker_fee - spread - slippage - funding. Round-trip ≈ 45 bps with v3 interim assumptions
22 | Apply interim conservative slippage (vs deferred dynamic) | Crypto-Trader 7.2 | §4: interim slippage raised from 5 → 10 bps until #711 ships. One-line change in commun/audit/economic_thresholds.py ProductContext.for_spot()
23 | Explicit cost assumptions in every backtest | Crypto-Trader 7.2 | §4: table of cost components with values + sources. All backtest reports include this footer block.
24 | Stress-case liquidity track | Crypto-Trader 7.2 | §5 NEW Track 13: added (LOW PRIORITY). Mandatory before live deployment of any winning track.

4. Joint primary metrics (revised in v3)

Per Crypto-Trader recommendation, f1_buy alone is insufficient. Every track in this plan must report all of:

  • f1_buy — primary headline metric (the original target)
  • expectancy_net : average P&L per BUY trade after all realistic costs (formula below)
  • sortino_ratio — risk-adjusted return (downside-only volatility), computed on the net P&L series
  • max_drawdown — peak-to-trough loss in the OOS window, computed on the net P&L series
  • n_trades — sample size of BUY trades (rejected if <50 BUY trades/fold per Recommendation #3)

expectancy — explicit cost-net formula (v3, per Crypto-Trader Recommendation #1)

For one BUY trade:

gross_pnl  = (exit_price - entry_price) / entry_price        # decimal, signed
costs      = taker_fee_in + spread_half + slippage_in        # entry leg
           + taker_fee_out + spread_half + slippage_out      # exit leg
           + funding_cost  (= 0 for spot, ≠ 0 for perps)
net_pnl    = gross_pnl - costs                               # decimal, signed

with the v3 interim cost assumptions applied to ALL backtests until #711 ships:

Component | v3 interim | Source / Rationale
taker_fee (each leg) | 10 bps (0.001) | Binance spot taker, conservative
spread_half (each leg) | 2.5 bps (0.00025) | Conservative for defi_top5 altcoins, mid-range of observed spreads
slippage (each leg) | 10 bps (0.001) | v3 interim, raised from previous 5 bps per Crypto-Trader Recommendation #2
funding_cost | 0 | Spot trading; perps add this when applicable

Total round-trip cost = ~45 bps per BUY trade (was ~35 bps with old 5-bps slippage). This means expectancy reported in v3 is strictly more conservative than past Phase 1/2 numbers. All historical Phase 1/2 expectancy values must be re-computed with the v3 cost model before any cross-version comparison.
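The round-trip arithmetic above can be checked with a minimal sketch. The `SpotCosts` class and `net_pnl` helper are hypothetical names used for illustration only; they are not the actual `ProductContext` API, whose signature is not shown in this plan. The per-leg values are the v3 interim assumptions from the table.

```python
from dataclasses import dataclass

BPS = 1e-4  # one basis point expressed as a decimal


@dataclass(frozen=True)
class SpotCosts:
    """Hypothetical container for per-leg costs, in basis points."""
    taker_fee_bps: float = 10.0
    spread_half_bps: float = 2.5
    slippage_bps: float = 10.0   # v3 interim value (was 5.0)
    funding_bps: float = 0.0     # spot: zero; perps would set this

    def round_trip(self) -> float:
        """Entry leg + exit leg + funding, as a decimal fraction."""
        per_leg = self.taker_fee_bps + self.spread_half_bps + self.slippage_bps
        return (2 * per_leg + self.funding_bps) * BPS


def net_pnl(entry_price: float, exit_price: float, costs: SpotCosts) -> float:
    """Signed decimal P&L of one BUY trade after all costs."""
    gross_pnl = (exit_price - entry_price) / entry_price
    return gross_pnl - costs.round_trip()


v3_costs = SpotCosts()
old_costs = SpotCosts(slippage_bps=5.0)
print(v3_costs.round_trip(), old_costs.round_trip())
```

Under these assumptions a trade that gains 1% gross keeps roughly 0.55% net with the v3 costs, versus roughly 0.65% with the old 5 bps slippage.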

Implementation: refactor commun/audit/economic_thresholds.py ProductContext.for_spot() defaults to use 10 bps slippage (was 5 bps). One-line change. Apply on next FTF run kickoff. Until #711 (dynamic slippage) ships, this static 10 bps is the floor.

Acceptance criteria (per-track) — v4 amendment : F1_LOCK vs TRADING_LOCK separation

The 6 gates below define a track as F1_LOCK (predictive uplift verified) OR TRADING_LOCK (full production-ready). The two statuses are now distinct :

  • F1_LOCK : predictive-uplift verified. Authorises the track to ship its code to main but does NOT authorise live trading promotion. Gates 1 + 5 + 6.
  • TRADING_LOCK : production-ready economic verdict. Requires F1_LOCK plus the 3 economic gates (2 + 3 + 4). Only TRADING_LOCK unlocks live trading promotion of the variant.

Gate | Criterion | F1_LOCK | TRADING_LOCK
1: F1_buy uplift | f1_buy improves with 95% bootstrap CI excluding 0 (Δ ≥ +0.015) | ✅ required | ✅ required
2: Economic non-degradation | expectancy_net ≥ baseline | optional | ✅ required
3: Risk-adjusted | sortino_ratio ≥ baseline | optional | ✅ required
4: Tail risk | max_drawdown ≤ baseline + 1% | optional | ✅ required
5: Per-asset breadth | ≥ 4/5 cryptos improve on f1_buy | ✅ required | ✅ required
6: Sample size | ≥ 50 BUY trades/fold | ✅ required | ✅ required

A track that boosts f1_buy while dropping expectancy_net cannot become TRADING_LOCK — it MAY become F1_LOCK (if gates 1+5+6 pass) but its variant is marked "research-only, not trading-ready" in the LOCK comment.
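As a rough illustration of how the six gates combine, the sketch below maps gate inputs to a verdict. All names (`GateInputs`, `classify_track`, the `NO_LOCK` label) are hypothetical, and two readings are assumptions: "baseline + 1%" is treated as one percentage point of drawdown, and drawdown is expressed as a positive fraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    delta_f1: float               # candidate f1_buy minus baseline f1_buy
    delta_f1_ci_low: float        # lower bound of the 95% bootstrap CI on delta_f1
    expectancy_net: float
    baseline_expectancy_net: float
    sortino: float
    baseline_sortino: float
    max_drawdown: float           # positive fraction, e.g. 0.12 for 12%
    baseline_max_drawdown: float
    assets_improved: int          # number of cryptos (out of 5) with better f1_buy
    min_buy_trades_per_fold: int


def classify_track(g: GateInputs) -> str:
    gate1 = g.delta_f1 >= 0.015 and g.delta_f1_ci_low > 0.0
    gate2 = g.expectancy_net >= g.baseline_expectancy_net
    gate3 = g.sortino >= g.baseline_sortino
    gate4 = g.max_drawdown <= g.baseline_max_drawdown + 0.01  # assumed: +1 point
    gate5 = g.assets_improved >= 4
    gate6 = g.min_buy_trades_per_fold >= 50

    if not (gate1 and gate5 and gate6):
        return "NO_LOCK"          # hypothetical label for a failed predictive gate
    if gate2 and gate3 and gate4:
        return "TRADING_LOCK"
    return "F1_LOCK"              # research-only, not trading-ready


research_only = GateInputs(0.03, 0.01, 0.0010, 0.0015, 1.1, 1.3, 0.15, 0.12, 4, 60)
print(classify_track(research_only))
```

The example input improves f1_buy but degrades expectancy, Sortino and drawdown, so it lands on F1_LOCK, mirroring the Track 14 re-classification described in this section.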

Retroactive re-classification (2026-05-13) : Track 14 (5m timeframe LOCK, 2026-05-03) was originally LOCKed under the f1_buy-primary derogation (§6 below). Under the v4 separation, it is F1_LOCK on 5m, TRADING_LOCK NOT granted because of the operator-flagged Sortino + MaxDD regression on 5m. Track 14's status table entry in §10 is updated accordingly.

Target ladder (v4 — replaces the single "f1_buy ≥ 0.75" objective)

The original mission set f1_buy ≥ 0.75 median as the singular target. v4 acknowledges that a single number is misleading — f1_buy=0.75 global is likely unreachable on this surface, but f1_buy=0.80 on a confidence-gated subset IS plausible (and economically more useful). The ladder below replaces "the target" with operationally meaningful tiers :

Tier | f1_buy threshold | Meaning | Trading authorisation
T0 | ≥ 0.40 | baseline restored (pre-#891 ~0.42, the S18+S19 closure gate) | none (gate to resume metric verdicts)
T1 | ≥ 0.50 | economically interesting candidate (above ~breakeven at 45 bps round-trip) | none (F1_LOCK eligible if other gates pass)
T2 | ≥ 0.60 | strong model candidate | TRADING_LOCK eligible (subject to §4 gates 2-4)
T3 | ≥ 0.70 | high-confidence subset target | TRADING_LOCK + abstention policy required
T4 | ≥ 0.80 | confidence-gated subset only (not global) ; requires Track 17 abstention | TRADING_LOCK with explicit abstention contract + Track 13 stress-case clearance

A track that hits T1 globally + T4 on a confidence-gated subset is more valuable than a track that hits T2 globally without confidence gating. The ladder makes this trade-off explicit. F1=0.80 is NOT a global objective in v4 : it is a confidence-gated subset objective (see Track 17 in §5 v4 amendment ; per the Tier 7 mapping in §5, the v4 "Confidence gating" track is Track 15 in the v5 numbering).


5. The 12 tracks (10 from v1 + 2 added in v2)

Each track is structured: Hypothesis · Implementation · Cost · Adjacent tried · Falsifiability. Tracks 1-10 unchanged from v1 except where v1 recommendations apply (marked [v2]).

Tier 1 — DATA

Track 1. Cross-asset BTC features

  • Hypothesis: altcoins follow BTC's macro regime; current model misses BTC's volatility / direction state.
  • Implementation: add btc_return_1h, btc_return_4h, btc_return_24h, btc_realized_vol_24h, btc_z_score_close, btc_correlation_15m_lag5 via cvntrade_enrich.py.
  • [v2] Recommendation #2: explicit purging/embargo on BTC features identical to own-asset features (purge_bars=20, embargo_bars=10 per ADR-14 standard). BTC values at time t use only data ≤ t - purge_bars. Documented invariant: no BTC feature at training time uses information ≤ 30 candles ahead of label time.
  • Cost: 1.5-2 days (FE + explicit purging audit).
  • Adjacent tried: never; no cross-asset feature in current FE.
  • Falsifiability: mean f1_buy across 5 cryptos must improve by ≥ 0.02 AND expectancy must not degrade. Else abandon.
  • Risk: 🟢 low.
  • Δ f1_buy expected: +0.03 to +0.06.
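The purge/embargo rule for Track 1 can be pictured with a small index-level sketch. It assumes the common purged-split convention (drop `purge_bars` of training data before the test block and `embargo_bars` after it); the exact ADR-14 definition is not reproduced in this plan, and `purged_train_indices` is a hypothetical helper name.

```python
def purged_train_indices(n_bars, test_start, test_end,
                         purge_bars=20, embargo_bars=10):
    """Return training bar indices for one fold.

    The test block is [test_start, test_end). Training bars in the
    purge_bars immediately before the block, and in the embargo_bars
    immediately after it, are dropped so that neither own-asset nor BTC
    features can share information with test-period labels.
    """
    if not 0 <= test_start < test_end <= n_bars:
        raise ValueError("test block must lie inside [0, n_bars]")
    left_stop = max(0, test_start - purge_bars)
    right_start = min(n_bars, test_end + embargo_bars)
    return list(range(0, left_stop)) + list(range(right_start, n_bars))


train_idx = purged_train_indices(n_bars=100, test_start=50, test_end=60)
print(len(train_idx), train_idx[29], train_idx[30])
```

With 100 bars and a test block of bars 50 to 59, bars 30 to 49 are purged and bars 60 to 69 are embargoed, leaving 60 training bars. The same index set would be applied to the BTC feature columns and the own-asset columns alike.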

Track 2. Order book microstructure features [DEFERRED to big-bet]

(Same as v1 — full description in v1 §4.)

  • [v2] Recommendation #13: prioritize forward data collection over historical reconstruction. If reconstruction is pursued, allocate dedicated data validation effort + clearly quantify the impact of a reduced training window. Design data quality monitoring + alerting up-front (not retrofitted).
  • Gating (per §6): only proceed if quick-win bundle delivers <+0.05 f1_buy combined.

Track 12. Fractional differentiation + feature interactions [NEW v2]

  • Hypothesis (per Recommendation #11): the FE pipeline's _lag_N features are integer-differenced (binary stationarity hammer). Fractional differentiation (López de Prado, Advances in Financial Machine Learning chapter 5) preserves more memory of the price series while still passing the ADF test. Combined with explicit feature interactions (e.g., RSI_14 × BB_position), it widens the FE input space.
  • Implementation:
  • Add frac_diff_close_d0.4 as a new feature (d=0.4 chosen to preserve ~80% memory while passing ADF). One-time fit per crypto, applied bar-by-bar in inference.
  • Add 5-10 explicit interaction features (e.g., rsi14_x_atr_normalized, volume_z_x_close_change_pct) chosen from domain knowledge.
  • Cost: 2-3 days (frac diff requires statsmodels + custom transformer; interactions are ~30 lines).
  • Adjacent tried: integer differencing (lag features) only. No fractional, no interactions.
  • Falsifiability: combined f1_buy improvement ≥ 0.02. Else abandon individual sub-features (frac diff and interactions can be evaluated independently if needed).
  • Risk: 🟢 low.
  • Δ f1_buy expected: +0.02 to +0.04.
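For readers unfamiliar with fractional differentiation, here is a minimal pure-Python sketch of the fixed-width-window variant. The `window=50` default and the function names are illustrative assumptions, not the Track 12 implementation; only `d=0.4` comes from this plan.

```python
def frac_diff_weights(d, size):
    """Weights of the binomial expansion of (1 - B)^d, most recent bar first."""
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d=0.4, window=50):
    """Fixed-width-window fractional differencing.

    Returns a list as long as `series`. The first window - 1 entries are
    None because a full window of past bars is not yet available. Each
    value uses only the current and past bars, so it can be applied
    bar-by-bar at inference time.
    """
    w = frac_diff_weights(d, window)
    out = [None] * len(series)
    for t in range(window - 1, len(series)):
        out[t] = sum(w[k] * series[t - k] for k in range(window))
    return out


print(frac_diff_weights(0.4, 4))
print(frac_diff([1.0, 2.0, 4.0, 7.0], d=1.0, window=2))
```

With d=1 the weights collapse to [1, -1, 0, ...], which is ordinary first differencing; with d=0.4 the weights decay slowly, which is what preserves memory of the price level.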

Track 13. Stress-case liquidity analysis [NEW v3, LOW PRIORITY]

  • Hypothesis (per Crypto-Trader Recommendation #4): the model's evaluation assumes normal-regime fills. Crypto markets have known liquidity stress events (flash crashes, exchange outages, spread blow-outs) where actual slippage can hit 100-500 bps for seconds-to-minutes. A model that happens to BUY into these events incurs catastrophic execution losses not reflected in the static-cost backtests.
  • Implementation:
  • Identify historical stress events in the training window (e.g., realized_vol_5m > 99th percentile OR spread > 5x rolling median).
  • Re-evaluate each track's accepted variant on a stress-only subset of the OOS data.
  • Report expectancy_stress and worst_trade_pnl_stress separately from main metrics.
  • Cost: 3-4 days (stress event identification + new evaluation pipeline).
  • Priority: LOW. Run AFTER the quick-win bundle delivers a winning configuration, but mandatory before live deployment of any winning track.
  • Adjacent tried: never. All Phase 1/2 evaluations assumed uniform fill quality.
  • Falsifiability: if the winning quick-win configuration shows expectancy_stress < 0, block live deployment until stress robustness is improved (e.g., add a stress-state pre-filter that vetoes BUYs during high-vol windows).
  • Risk: 🟢 low (analysis only, not training).
  • Δ f1_buy expected: 0 (this is a guardrail, not a booster).
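The stress-event rule in the implementation bullets can be sketched as a boolean mask over bars. This is an illustration under stated assumptions: the percentile is a nearest-rank percentile over the whole supplied window, the rolling median looks back 96 bars and uses strictly past bars, and the helper names are hypothetical.

```python
import math
from statistics import median


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def stress_mask(realized_vol_5m, spread, vol_q=99, spread_mult=5.0, window=96):
    """True where a bar counts as a liquidity stress event.

    A bar is flagged when its realized volatility exceeds the vol_q-th
    percentile, OR its spread exceeds spread_mult times the median spread
    of the previous `window` bars.
    """
    vol_cut = nearest_rank_percentile(realized_vol_5m, vol_q)
    mask = []
    for t, (vol, spr) in enumerate(zip(realized_vol_5m, spread)):
        history = spread[max(0, t - window):t]
        spread_hit = bool(history) and spr > spread_mult * median(history)
        mask.append(vol > vol_cut or spread_hit)
    return mask


vol = [1.0] * 10
spr = [1.0] * 5 + [10.0] + [1.0] * 4
print(stress_mask(vol, spr))
```

Metrics such as expectancy_stress would then be computed only on trades whose entry bar has the mask set.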

Tier 2 — LABEL ENGINEERING

Track 4. Soft labels with partial credit

(Same as v1.)

  • [v2] Recommendation #4: integrate cleanlab (or equivalent confident-learning library) as a label-noise detection step. Before training, identify samples whose labels are likely mislabeled (cleanlab's find_label_issues API). Filter or reweight those samples. Apply both to soft labels (Track 4) and asymmetric smoothing (Track 5).
  • Cost updated: 2.5 days (added cleanlab integration).

Track 5. Asymmetric label smoothing

(Same as v1, with cleanlab integration per Recommendation #4.)

  • Cost updated: 1 day.

Tier 3 — LOSS FUNCTION

Track 6. Focal loss

(Same as v1.)

Track 7. Cost-sensitive loss aligned with P&L [VETO + DEFERRED]

(Same as v1 description.)

  • [v2] Recommendation #13: VETO for initial production until simpler tracks demonstrate positive expectancy. If revisited, requires:
  • Formal verification or extensive numerical validation of the custom loss's gradient and Hessian against known test cases
  • Validation of P&L curves on held-out data showing the custom loss actually correlates with expectancy improvement
  • Gating (per §6): only proceed if quick-win bundle delivers positive expectancy AND f1_buy ≥ 0.50.

Tier 4 — MODEL ARCHITECTURE

Track 8. Sequence model (TCN/Transformer) as residual [DEFERRED]

(Same as v1.)

  • [v2] Recommendation #14: deferred until quick-win exhausted AND dedicated GPU infrastructure provisioned and monitored. Specific prerequisites:
  • GPU node pool in Scaleway Kapsule with autoscaler
  • Monitoring (utilisation, memory, OOM alerts)
  • Failure-isolation: TCN failures must not cascade to XGBoost predictions
  • Trust boundary: TCN scoring stays in shadow mode for 14d before α > 0.0 in stack

Track 11. Ensemble diversity (LightGBM + CatBoost + snapshot ensembles) [NEW v2]

  • Hypothesis (per Recommendation #10): a single XGBoost model has known biases (axis-aligned splits, sensitivity to feature scaling). Diversifying across:
  • LightGBM (leaf-wise growth, often outperforms XGBoost on tabular)
  • CatBoost (ordered boosting, less prone to target leakage)
  • XGBoost snapshot ensemble (cyclical learning rate → multiple checkpoints averaged)

Then averaging predictions (or learning weights via stacking) reduces variance and often improves f1.

  • Implementation:
  • Train each model independently per (crypto, fold) in parallel
  • Stacking layer: simple logistic regression on [xgb_prob, lgb_prob, cb_prob] → final prob
  • Stored as 3 separate MLflow registered models per crypto
  • Cost: 4-5 days (LightGBM + CatBoost trainers exist as patterns in src/training/, just need wiring + stacking layer).
  • Adjacent tried: never. XGBoost only.
  • Falsifiability: stacked f1_buy ≥ best-single + 0.02 AND no single model dominates the stack weights (>80%) ; else simplify back to single best.
  • Risk: 🟡 medium (3× training cost, 3× model storage).
  • Δ f1_buy expected: +0.02 to +0.05.
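The Track 11 falsifiability rule can be expressed as a small check. This is a sketch: the plan does not define how "dominates the stack weights" is measured, so the code assumes the share of absolute stacking coefficients, and `stack_verdict` plus its return labels are hypothetical.

```python
def stack_verdict(stacked_f1, single_f1, stack_weights,
                  min_lift=0.02, max_share=0.80):
    """Apply the two Track 11 conditions to one evaluation.

    single_f1:     f1_buy per base model, e.g. {"xgb": 0.44, ...}
    stack_weights: stacking-layer coefficient per base model
    """
    best_name = max(single_f1, key=single_f1.get)
    total = sum(abs(w) for w in stack_weights.values())
    if total == 0:
        return f"FALL_BACK_TO_{best_name}"
    # Assumed dominance measure: share of absolute coefficient mass.
    shares = {name: abs(w) / total for name, w in stack_weights.items()}
    lifted = stacked_f1 >= single_f1[best_name] + min_lift
    balanced = max(shares.values()) <= max_share
    return "KEEP_STACK" if lifted and balanced else f"FALL_BACK_TO_{best_name}"


singles = {"xgb": 0.44, "lgb": 0.43, "cb": 0.41}
print(stack_verdict(0.47, singles, {"xgb": 1.0, "lgb": 0.8, "cb": 0.6}))
print(stack_verdict(0.47, singles, {"xgb": 5.0, "lgb": 0.3, "cb": 0.2}))
```

In the second call one model carries about 91% of the coefficient mass, so the sketch falls back to the best single model even though the f1 lift is met.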

Tier 5 — CALIBRATION

Track 9. Threshold optimization per-régime

(Same as v1.)

Tier 6 — SEMI-SUPERVISED

Track 10. Pseudo-labeling [DEFERRED — DO NOT IMPLEMENT]

  • [v2] Recommendation #15: confirmed deferral. Only revisit after significant validated improvements from other tracks AND a formal research plan to detect, quantify, and mitigate self-reinforcing biases.

Tier 7 — POST-S18 EXTENSIONS (gated by baseline restoration — v5)

The 8 tracks below were rescoped in v5 (2026-05-13) after the operator's review : the gates have moved from "find a loss / label tweak that bumps f1_buy on the base classifier" (the v1-v3 strategy that produced 4 ABANDONED Tracks in a row) toward selective prediction + problem reformulation + regime-aware + diagnostic approaches. These tracks are metric-gate-suspended during the S18/S19 freeze (§0bis) but their implementation work can begin once S18 Step 0 verdict is PASS.

Tier 7 replaces the v4 5-track structure (regime-engine, market-context, confidence-gating, feature-block ablation, model-quality ladder) with the operator's 8-track refinement below. The v4 tracks are absorbed as follows : v4 "Regime Engine" → v5 Track 17 (Regime-aware modeling), v4 "Confidence gating" → v5 Track 15 (Confidence gating / selective prediction), v4 "Feature block ablation" → v5 Track 20, v4 "Market Context" merged into Track 17, v4 "Model quality ladder" replaced by v5 Track 19 (Simple model baseline, a sharper diagnostic).

Track 15. Confidence gating / selective prediction

  • Hypothesis : the model contains predictive signal that is diluted across low-confidence predictions. Restricting execution to high-confidence predictions increases f1_buy.
  • Implementation :
  • Add FTF metrics : precision@k, f1_buy@k for k ∈ {10%, 20%, 30%}
  • Add FTF factor confidence_threshold ∈ {none, p70, p80, p90}
  • Optional : percentile-based gating on calibrated probabilities (depends on calibration quality — see Track 19)
  • Cost : ~1-1.5 days
  • Adjacent tried : Track 9 (threshold sweep) — but per-asset / per-regime threshold, no global confidence selection
  • Falsifiability : f1_buy@top20% exceeds the global baseline f1_buy by ≥ 0.10 AND ≥ 4/5 cryptos improve. Else ABANDON.
  • Risk : 🟢 low
  • Δ f1_buy expected : +0.05 to +0.20 (subset-dependent — directly unlocks T4 of the target ladder)
  • Status : Tier-7, highest-priority alpha track post-S19 because it directly unlocks the f1_buy ≥ 0.80 confidence-gated tier.
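A minimal sketch of the precision@k metric for Track 15 follows. It assumes k is a fraction of bars ranked by predicted BUY probability; the plan does not spell out how f1_buy@k is defined on the gated subset, so only precision is shown, and the function name is hypothetical.

```python
def precision_at_top_fraction(y_true, p_buy, fraction):
    """Share of true BUY labels among the top `fraction` most confident bars.

    y_true:   0/1 labels (1 = BUY)
    p_buy:    predicted BUY probabilities, same length as y_true
    fraction: e.g. 0.10, 0.20, 0.30
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # round() avoids float artefacts such as 10 * 0.3 == 3.0000000000000004
    k = max(1, round(len(p_buy) * fraction))
    ranked = sorted(range(len(p_buy)), key=lambda i: p_buy[i], reverse=True)
    return sum(y_true[i] for i in ranked[:k]) / k


labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
probs = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
for frac in (0.10, 0.20, 0.30):
    print(frac, precision_at_top_fraction(labels, probs, frac))
```

Comparing these values with the unconditional BUY rate (0.3 in the toy data) shows whether confidence actually concentrates true BUYs, which is the premise the track tests.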

Track 16. Edge regression (target reformulation)

  • Hypothesis : binary BUY/HOLD classification is a poor proxy for trading decisions — the model is asked to discretise a continuous quantity. Predicting expected net return improves separability.
  • Implementation :
  • Target : y = net_return_after_costs (regression, not classification)
  • Train regression model (XGBRegressor / LGBMRegressor plug-ins into the harness adapter surface)
  • Decision rule at inference : trade if y_pred > threshold (threshold calibrated OOS per ADR-15)
  • Cost : ~2-3 days
  • Adjacent tried : none (new problem formulation — v1-v3 stayed inside binary classification)
  • Falsifiability : corr(y_pred, realized_pnl) > 0.2 AND derived f1_buy ≥ baseline (after thresholding y_pred). Else ABANDON.
  • Risk : 🟡 medium — regression metric ≠ classification metric, so calibration of the threshold is now load-bearing.
  • Δ f1_buy expected : +0.02 to +0.08
  • Status : Tier-7, new formulation surface.
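The Track 16 falsifiability test combines a correlation check with a thresholded F1. The sketch below is illustrative only: it assumes Pearson correlation, assumes the derived f1_buy is scored against the existing binary BUY labels, and uses hypothetical function names.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def f1_from_threshold(y_pred, is_buy, threshold):
    """F1 of the rule 'trade if y_pred > threshold' against BUY labels."""
    pred = [p > threshold for p in y_pred]
    tp = sum(1 for p, t in zip(pred, is_buy) if p and t)
    fp = sum(1 for p, t in zip(pred, is_buy) if p and not t)
    fn = sum(1 for p, t in zip(pred, is_buy) if not p and t)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def track16_passes(y_pred, realized_pnl, is_buy, threshold,
                   baseline_f1, min_corr=0.2):
    corr_ok = pearson(y_pred, realized_pnl) > min_corr
    f1_ok = f1_from_threshold(y_pred, is_buy, threshold) >= baseline_f1
    return corr_ok and f1_ok


y_pred = [0.01, -0.02, 0.03, -0.01]
realized = [0.02, -0.01, 0.01, -0.03]
buy_labels = [1, 0, 1, 0]
print(track16_passes(y_pred, realized, buy_labels, 0.0, baseline_f1=0.42))
```

Because the threshold turns a regression output into a trade decision, it would be calibrated out of sample, as the implementation bullet notes.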

Track 17. Regime-aware modeling

  • Hypothesis : mixing heterogeneous market regimes (high-vol / low-vol, BTC bull / bear, sector-rotation phases) degrades model performance ; explicit regime modeling improves signal extraction. Absorbs the v4 "Regime Engine" + v4 "Market Context" tracks into one structural lever.
  • Implementation :
  • Add regime features : volatility bucket (rolling-σ percentile), trend state (BTC dominance + slope), market-context indices (DeFi / L1 / L2 cohort returns)
  • Variant A : regime features added to the feature matrix (single global model learns regime conditioning)
  • Variant B : train per-regime models (regime classifier routes inference to the matching sub-model)
  • Cost : ~2 days (Variant A) / ~3-4 days (Variant B with classifier)
  • Adjacent tried : CUSUM (used as pre-inference filter, not as feature/router) ; Track 1 BTC features (limited to BTC, not sector)
  • Falsifiability : Δ f1_buy ≥ +0.02 AND reduced fold variance (the latter is the regime-discrimination signal — lower variance = the model is no longer blending regimes).
  • Risk : 🟡 medium — regime classifier overfitting + cold-start when a new regime emerges + sector membership stability (DeFi vs L2 boundary fuzzy).
  • Δ f1_buy expected : +0.02 to +0.05 globally ; +0.05 to +0.10 regime-specific (Variant B).
  • Status : Tier-7, depends on Track 1 leakage investigation resolving (GH #806) — Track 1's macro features are a Track 17 ingredient.

Track 18. Continuous / probabilistic labels

  • Hypothesis : discrete triple-barrier labels (TP / SL / TIMEOUT → {BUY, HOLD}) lose information about how close the trade came to TP. Continuous targets improve the learning signal.
  • Implementation :
  • Target option (a) : normalised return at horizon (regression — overlaps with Track 16's surface)
  • Target option (b) : soft probability proxy — P(TP) derived from rolling stats over historical barrier outcomes
  • Train regression or soft-classification model with KL divergence / cross-entropy loss
  • Cost : ~2-3 days
  • Adjacent tried : Track 4 (soft labels, limited to label smoothing — did NOT change target generation pipeline)
  • Falsifiability : Δ f1_buy ≥ +0.015 AND improved calibration (ECE ↓ ≥ 20 % vs binary baseline).
  • Risk : 🟡 medium — overlaps with Track 16 (must clarify which lever drives the lift if both ship).
  • Δ f1_buy expected : +0.01 to +0.04
  • Status : Tier-7.
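The calibration half of the Track 18 gate relies on expected calibration error (ECE). A minimal sketch with equal-width bins follows; the 10-bin choice and the helper names are assumptions, and "ECE ↓ ≥ 20 %" is read as a relative reduction versus the binary baseline.

```python
def expected_calibration_error(y_true, p_buy, n_bins=10):
    """Equal-width-bin ECE: bin-weighted |observed BUY rate - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, p_buy):
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes to last bin
        bins[idx].append((label, prob))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(label for label, _ in members) / len(members)
            confidence = sum(prob for _, prob in members) / len(members)
            ece += len(members) / n * abs(observed - confidence)
    return ece


def ece_gate(ece_candidate, ece_baseline, min_relative_drop=0.20):
    """True when the candidate's ECE is at least 20% below the baseline's."""
    return ece_candidate <= (1 - min_relative_drop) * ece_baseline


print(expected_calibration_error([1, 1, 0, 0], [1.0, 1.0, 0.0, 0.0]))
print(ece_gate(0.04, 0.06), ece_gate(0.05, 0.06))
```

A perfectly calibrated toy model scores 0.0; a model that predicts 0.8 on bars that never turn out to be BUY scores 0.8.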

Track 19. Simple model baseline (signal sanity check — NEW v5)

  • Hypothesis : a deliberately simple model (logistic regression + shallow tree, depth ≤ 3) reveals whether signal exists or a bug in the complex pipeline is present. This is the diagnostic complement to S18 : S18 isolates the harness regression ; Track 19 isolates whether the feature/label/architecture stack delivers learnable signal at all.
  • Implementation :
  • Logistic regression + decision tree (max_depth=3) trained on the same train/val/test splits the FTF uses
  • No HPO, no CUSUM, no θ-sweep — strip every optionality to the bare minimum
  • Compare resulting f1_buy to the full pipeline
  • Cost : ~0.5-1 day
  • Adjacent tried : none formally (the v1-v3 baselines were all XGBoost-based)
  • Falsifiability :
  • simple ≈ complex → signal issue (the complex pipeline isn't extracting more than what's trivially there)
  • simple > complex → bug confirmed (a regression in the harness — feeds into S18 verdict)
  • simple ≪ complex → complex pipeline IS adding value (expected case, validates the architecture)
  • Risk : 🟢 low (diagnostic only)
  • Δ f1_buy expected : 0 (this is a diagnostic, not a booster)
  • Status : Tier-7, second-priority alpha track — pairs with S18 to triangulate the regression.
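The three falsifiability outcomes above can be read as a small decision rule. The sketch below is illustrative only: the function name, the outcome strings, and the `tol` equivalence band are assumptions, since the plan does not fix how close "≈" is.

```python
def triage_simple_vs_complex(
    simple_f1: float, complex_f1: float, tol: float = 0.01
) -> str:
    """Map a simple-vs-complex f1_buy comparison to one of the three outcomes.

    tol is a hypothetical equivalence band, not a value taken from the plan.
    """
    gap = complex_f1 - simple_f1
    if gap > tol:
        return "complex_adds_value"   # simple << complex, expected case
    if gap < -tol:
        return "bug_suspected"        # simple > complex, feeds the S18 verdict
    return "signal_issue"             # simple ~ complex


outcome = triage_simple_vs_complex(simple_f1=0.30, complex_f1=0.09)
```

In practice the band would need to account for fold-to-fold variance before any outcome is treated as a verdict.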

Track 20. Feature block ablation

  • Hypothesis : some feature families degrade performance (noise dilution > marginal signal). Block-level ablation identifies which to drop.
  • Implementation :
  • Define feature blocks : momentum, volatility, volume, trend, microstructure (when L2 available), BTC cross-asset, sector context, regime context
  • FTF factor feature_block_mask toggles each block on/off (one variant per block-removed)
  • Cost : ~2 days
  • Adjacent tried : feature cap (top-N by importance) — block-aware tagging absent
  • Falsifiability : identify ≥ 1 block whose removal improves f1_buy ≥ +0.02 (with per-asset gate). Otherwise the diagnostic still has value (knowing nothing is removable).
  • Risk : 🟢 low
  • Δ f1_buy expected : +0.01 to +0.05
  • Status : Tier-7, infrastructure track. The block tagging and the feature_block_mask factor can be built during the S18/S19 freeze (non-metric-gated work) ; the ablation verdict itself requires a restored baseline to compare against.
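The "one variant per block-removed" design can be sketched as a mask generator. The block identifiers and variant names below are hypothetical spellings of the blocks listed above; the real FTF factor registration may look different.

```python
from typing import Dict, List

# Hypothetical identifiers for the feature blocks named in the plan.
FEATURE_BLOCKS: List[str] = [
    "momentum", "volatility", "volume", "trend",
    "microstructure", "btc_cross_asset", "sector_context", "regime_context",
]


def block_removed_variants(blocks: List[str]) -> Dict[str, Dict[str, bool]]:
    """Build one mask with every block on, plus one mask per removed block."""
    variants = {"all_blocks": {b: True for b in blocks}}
    for dropped in blocks:
        variants[f"no_{dropped}"] = {b: b != dropped for b in blocks}
    return variants


def select_columns(columns_by_block: Dict[str, List[str]],
                   mask: Dict[str, bool]) -> List[str]:
    """Keep only the feature columns whose block is enabled in the mask."""
    return [c for b, cols in columns_by_block.items() if mask.get(b, False)
            for c in cols]


variants = block_removed_variants(FEATURE_BLOCKS)
```

Each `no_<block>` variant is then compared against `all_blocks`, so a lift is attributable to a single removed block.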

Track 21. Temporal horizon sweep

  • Hypothesis : the current horizon (H=4h via ATR0.5_1.5_H4) is suboptimal — too short → noise dominates ; too long → cost amortisation kicks in but signal decays.
  • Implementation :
  • FTF factor horizon_hours sweeping H ∈ {2, 4, 8, 12}
  • Composes with Track 14's 5m timeframe LOCK (5m × H=2 = 24 candles ; 5m × H=12 = 144 candles)
  • Cost : ~1-2 days
  • Adjacent tried : indirect via PTE envelope strategies (ATR0.5_1.5_H4 is a single point on this surface, not a swept variable)
  • Falsifiability : Δ f1_buy ≥ +0.02 on ≥ 4/5 cryptos at the best horizon.
  • Risk : 🟢 low (data-only sweep)
  • Δ f1_buy expected : +0.02 to +0.06
  • Status : Tier-7. Composes with Track 14 LOCK on 5m.
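The candle counts quoted for the 5m composition follow from a simple conversion. A minimal sketch, with a hypothetical helper name:

```python
def horizon_in_candles(horizon_hours: int, timeframe_minutes: int) -> int:
    """Convert a label horizon in hours to a number of bars at a timeframe."""
    total_minutes = horizon_hours * 60
    if total_minutes % timeframe_minutes != 0:
        raise ValueError("horizon is not a whole number of candles")
    return total_minutes // timeframe_minutes


# Sweep values from the plan, evaluated on the 5m timeframe.
SWEEP_5M = {h: horizon_in_candles(h, 5) for h in (2, 4, 8, 12)}
```

On 5m bars the sweep therefore spans 24 to 144 candles, a 6× range in label look-ahead.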

Track 22. Class definition refinement (class purity)

  • Hypothesis : the BUY class contains noisy samples — trades that hit TP marginally or via path-dependent fluctuations. Filtering or weighting by return magnitude tightens the class definition.
  • Implementation :
  • Filter low-return BUY samples (drop trades with realised return < ε threshold)
  • OR weight samples by return magnitude in the loss function
  • Cost : ~1-2 days
  • Adjacent tried : Track 6 (focal loss) — indirect attempt at class re-weighting, ABANDONED
  • Falsifiability : precision ↑, recall ↓, f1_buy ↑ (the trade-off of a tighter class definition).
  • Risk : 🟡 medium — risk of overfitting to clean cases (class-definition leakage : the same trades that "would have been profitable" are easy to predict in-sample).
  • Δ f1_buy expected : +0.02 to +0.05
  • Status : Tier-7.
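Both implementation options can be sketched on plain lists. The helper names, the label encoding (1 = BUY, 0 = HOLD), and the mean-1 normalisation of the weights are assumptions made for illustration.

```python
from typing import List


def filter_low_return_buys(labels: List[int], returns: List[float],
                           eps: float) -> List[int]:
    """Return indices to keep: every HOLD, and BUY only if return >= eps."""
    return [i for i, (y, r) in enumerate(zip(labels, returns))
            if y != 1 or r >= eps]


def magnitude_weights(labels: List[int], returns: List[float]) -> List[float]:
    """Weight BUY samples by |return|, scaled so BUY weights average 1.

    HOLD samples keep weight 1.0 (an assumption, the plan does not say).
    """
    buy_abs = [abs(r) for y, r in zip(labels, returns) if y == 1]
    scale = sum(buy_abs) / len(buy_abs) if buy_abs else 0.0
    if scale == 0.0:
        return [1.0] * len(labels)
    return [abs(r) / scale if y == 1 else 1.0
            for y, r in zip(labels, returns)]


labels = [1, 0, 1, 1]
returns = [0.001, -0.02, 0.03, 0.02]
kept = filter_low_return_buys(labels, returns, eps=0.005)
weights = magnitude_weights(labels, returns)
```

Because both options use realised returns, they apply to training samples only; using them at evaluation time would be exactly the class-definition leakage flagged under Risk.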

Tier 7 sequencing (v5) : the 8 tracks are NOT strictly sequential. The priority order is :

  1. Track 19 (Simple model baseline) — pairs with S18 for diagnostic ; cheapest (~0.5-1d).
  2. Track 15 (Confidence gating) — directly unlocks T4 of the target ladder.
  3. Track 17 (Regime-aware) — absorbs v4 Regime Engine + Market Context, single biggest structural lever.
  4. Track 16 (Edge regression) OR Track 18 (Continuous labels) — problem reformulation surface, pick one then evaluate the other on its result.
  5. Track 20 (Feature block ablation) — infrastructure, parallelizable with metric-gated tracks.
  6. Track 21 (Horizon sweep) — composes with Track 14 LOCK ; sweep when baseline restored.
  7. Track 22 (Class purity) — last because the risk of class-definition leakage requires the other tracks' diagnostic surface to interpret cleanly.

6. Sequencing, gates & operational prerequisites (revised in v3)

Operational prerequisites (NEW in v3, per Ops holdout)

Before Track 5 (the first quick-win) merges to main, the following design artefacts must be signed off — implementation can land in parallel:

  • Kill-switch DESIGN sign-off (#708): API surface + state storage (PostgreSQL) + multi-channel triggers (Console UI button + CLI + emergency env var + Grafana auto-trigger) + audit log format + reversibility flow + bypass-cache invariant. Design doc reviewed and signed by Ops. Implementation can proceed in parallel and does not block the design sign-off.
  • MLOps readiness template DESIGN sign-off (#709): the markdown template at documentation/templates/mlops_readiness_template.md defining the 6 sections (production monitoring + alerting/runbooks + drift detection + staged rollout + rollback plan + DRI). Each track in this plan then completes a copy of this template before its own merge.
  • Live drift detection mechanism specified (in #709 template): data drift via PSI on feature distributions, concept drift via rolling f1_buy gap (last 30d vs training window), model behaviour drift via prediction distribution shift (KL divergence). Specific thresholds are TBD per track. The mechanism lives in the MLOps template and is not implemented separately.

These three are DESIGN deliverables with operator sign-off — they do not require full implementation before Track 5. They DO require existence as approved documents.
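For the data-drift item, the Population Stability Index can be sketched as below. This is an illustration under stated assumptions: equal-width bins on the reference range (quantile bins are also common), a small `eps` floor for empty bins, and a hypothetical function name. No alert threshold is set, since thresholds are TBD per track.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` reference."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant feature

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1   # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
psi_same = psi(reference, reference)
psi_shift = psi(reference, shifted)
```

The index is zero when the live distribution matches the training one and grows as mass moves between bins.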

Phase 1: Quick-win bundle — sequential delivery

Per Recommendation #1, tracks must be tested sequentially to attribute gains correctly. Order chosen for fastest validation cycle:

  1. Track 5 (asymmetric label smoothing) — 1 day, cheapest. Baseline: Phase 2 rerun (ftf_20260427_170614_06626e_*). Gated by operational prereqs above.
  2. Track 6 (focal loss) — 1 day after Track 5 stable. Baseline: post-Track 5.
  3. Track 9 (per-regime threshold) — 2 days after Track 6 stable. Baseline: post-Track 6.
  4. Track 1 (BTC features) — 1.5-2 days after Track 9 stable. Baseline: post-Track 9.
  5. Track 12 (fractional diff + interactions) — 2-3 days after Track 1 stable. Baseline: post-Track 1.

Total quick-win bundle: ~8-10 days serial. Each track must clear the gate before the next starts.

Gates (apply to every track)

A track is ACCEPTED (kept in production) if and only if all conditions hold on a 5-fold OOS evaluation:

  1. F1_buy gate: mean Δf1_buy ≥ +0.015 (with 95% bootstrap CI excluding 0)
  2. Joint metric gate: Δexpectancy ≥ 0 AND Δsortino_ratio ≥ 0 AND Δmax_drawdown ≤ +1% (no significant degradation)
  3. Stability gate: per-fold variance of f1_buy ≤ 0.05 (per Recommendation #8)
  4. Per-asset gate: f1_buy improves on ≥ 4/5 cryptos (no all-eggs-in-one-basket)
  5. Sample size gate: ≥ 50 BUY trades per fold (per Recommendation #3)
  6. MLOps gate: completed readiness plan from #709 template

A track that fails ANY gate is REJECTED (or revised + re-evaluated). Gates apply to every track without exception, subject to the f1_buy-primary derogation defined immediately below.
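The six gates can be expressed as a single pass/fail map. This is a sketch of the rule as written above, not the FTF implementation: the dataclass, its field names, and the sign convention for max drawdown (percentage points, positive meaning a deeper drawdown) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TrackEvaluation:
    mean_delta_f1_buy: float
    delta_f1_ci95: Tuple[float, float]   # 95% bootstrap CI on mean delta f1_buy
    delta_expectancy: float
    delta_sortino: float
    delta_max_drawdown_pct: float        # assumed: points, positive = deeper
    f1_buy_fold_variance: float
    assets_improved: int                 # out of the 5 cryptos
    buy_trades_per_fold: List[int]
    mlops_plan_complete: bool


def evaluate_gates(ev: TrackEvaluation) -> Dict[int, bool]:
    """Return pass/fail per gate, keyed by the gate numbers listed above."""
    ci_low, _ = ev.delta_f1_ci95
    return {
        1: ev.mean_delta_f1_buy >= 0.015 and ci_low > 0.0,
        2: (ev.delta_expectancy >= 0.0 and ev.delta_sortino >= 0.0
            and ev.delta_max_drawdown_pct <= 1.0),
        3: ev.f1_buy_fold_variance <= 0.05,
        4: ev.assets_improved >= 4,
        5: min(ev.buy_trades_per_fold) >= 50,
        6: ev.mlops_plan_complete,
    }


def is_accepted(ev: TrackEvaluation) -> bool:
    """A track is ACCEPTED only if every gate passes."""
    return all(evaluate_gates(ev).values())


example = TrackEvaluation(0.02, (0.005, 0.035), 0.1, 0.05, 0.5, 0.01,
                          4, [60, 55, 70, 80, 52], True)
```

Returning the per-gate map rather than a single boolean keeps the failing gate visible in a results dossier.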

F1-mission f1_buy-primary derogation (added 2026-05-03 per operator directive)

Per operator directive 2026-05-02 ("the mission is f1_buy, not Sortino or win rate") the F1 mission's LOCK verdict is computed on the f1_buy-primary subset of the 6 gates above :

  • In-scope for LOCK in this mission : gates 1 (f1_buy lift), 3 (stability), 4 (per-asset), 5 (sample size), 6 (MLOps) → 5/6 gates
  • Deferred from LOCK in this mission : gate 2 (joint metric — Δexpectancy / Δsortino / Δmax_drawdown) — handled as a separate filter-tuning follow-up Story scope under the "filter tuning" mission, not as a blocker on the f1_buy LOCK

Rationale : the F1 mission's primary metric is f1_buy (per §0 + §4 joint primary metric design). Gate 2 (joint economic metric) was scoped as a guard against pathological joint-metric regressions, not as a co-primary acceptance criterion. The operator-directed mission-scope is explicitly f1_buy-first ; subordinating the f1_buy LOCK to joint-economic clearance would block the mission's primary deliverable on a metric that belongs to a different mission's scope (filter-tuning / economic-value-growth).

Scope of the derogation :

  • Mission-scoped : applies ONLY to the F1_buy boost mission. Other missions (filter tuning, economic value growth, kill-switch, etc.) retain the full 6-gate spec without derogation.
  • Documented per LOCK : every Track LOCK in this mission MUST cite this §6 derogation explicitly in its results dossier and call out the deferred-to-follow-up status of gate 2.
  • ADR-0079 interaction : ADR-0079 invariant 5 ("LOCK requires lock rule cleared AND all 6 official gates pass") is interpreted in this mission as "the f1_buy-primary gate subset (5/6) passes" per this derogation. ADR-0079 itself is not amended (it remains the canonical 6-gate spec for non-F1 missions and for the F1 mission's gate 2 follow-up Story).
  • Audit trail : the deferred gate 2 evaluation MUST land as a follow-up Story result under the filter-tuning mission, with its own dossier closure per ADR-0079 — the derogation defers gate 2, it does not eliminate it.

Applied to :

  • Track 14 (5m timeframe) : LOCK on f1_buy-primary derogation, gate 2 deferred to filter-tuning follow-up Story (filed separately). See results dossier §13.

Post-S18 freeze list (v5 — explicit gating)

Until the S18/S19 baseline is restored, the following Tracks are frozen (implementation paused, metric verdicts not closeable) :

  • Track 8 (sequence model TCN/Transformer residual) — assumes a stable validated signal in the base classifier ; running it on a regressed baseline would amplify noise and invalidate the residual decomposition.
  • Track 7 (cost-sensitive loss) — already VETOED in §6 phase 2 ; the v5 freeze adds the explicit S18/S19-gating annotation to the VETO.
  • Track 10 (pseudo-labelling) — was already deferred indefinitely in v1-v4 ; the v5 freeze reaffirms that pseudo-labelling on a regressed baseline self-reinforces the regression, making it strictly worse than the v1-v4 indefinite deferral.

Track 13 (stress-case liquidity) remains mandatory before live deployment but is deprioritised during the signal recovery phase — there is no production candidate to stress-test while the baseline is regressed.

The Tier 7 alpha tracks (15-22) are metric-gate-suspended during the freeze (their LOCK / ABANDON verdicts cannot close), BUT their implementation work CAN proceed once S18 Step 0 verdict is PASS — see §0bis.3 "What can still ship today" + the Tier 7 sequencing block at the end of §5.

LOCK types reminder (v5 — surfaced from §4)

The 6 gates in §4 acceptance criteria define two distinct verdicts :

  • F1_LOCK — predictive uplift verified only (f1_buy gate + stability + per-asset + sample size + MLOps). Authorises code merge to main, NOT live trading.
  • TRADING_LOCK — full joint metric validation (F1_LOCK gates plus expectancy_net ≥ baseline + sortino non-regress + max_drawdown bound). Required for live trading promotion.

A Tier 7 track may earn F1_LOCK without satisfying TRADING_LOCK ; promotion to live trading requires a separate filter-tuning Story closing the economic gap.
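The split between the two verdicts can be sketched from a per-gate pass/fail map. Gate numbers follow the §6 list (gate 2 is the joint economic gate); the verdict strings and the function name are hypothetical.

```python
from typing import Dict

F1_PRIMARY_GATES = frozenset({1, 3, 4, 5, 6})   # f1_buy, stability, per-asset, sample size, MLOps
ECONOMIC_GATES = frozenset({2})                 # joint economic metric gate


def lock_verdict(gate_results: Dict[int, bool]) -> str:
    """Classify a track from its per-gate results.

    TRADING_LOCK_CANDIDATE is not TRADING_LOCK itself: the plan also requires
    stress-case clearance (Track 13), which is not modelled here.
    """
    f1_ok = all(gate_results[g] for g in F1_PRIMARY_GATES)
    econ_ok = all(gate_results[g] for g in ECONOMIC_GATES)
    if f1_ok and econ_ok:
        return "TRADING_LOCK_CANDIDATE"
    if f1_ok:
        return "F1_LOCK"
    return "NO_LOCK"


# A track with an f1_buy lift but an economic regression.
verdict = lock_verdict({1: True, 2: False, 3: True, 4: True, 5: True, 6: True})
```

A track in that position can merge to main under F1_LOCK while the economic gap is handled by the separate filter-tuning Story.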

Phase 2: Big bet — only if quick-win plateaus

If the quick-win bundle delivers combined Δf1_buy ≥ 0.05 AND Δexpectancy > 0, big-bet tracks remain deferred (no need to rush).

If the quick-win bundle delivers combined Δf1_buy < 0.05, evaluate the big-bet tracks in this order:

  1. Track 11 (ensemble diversity) — 4-5 days; lowest infra cost of the big bets.
  2. Track 2 (order book features) — 5-7 days; needs forward data collection start.
  3. Track 8 (sequence model residual) — 7-10 days; gated by GPU infra.
  4. Track 7 (cost-sensitive loss) — 2 days; VETOED until quick-win + ensemble demonstrate positive expectancy.

Phase 3: Pseudo-labeling

Only after Phase 1 + Phase 2 are exhausted, with a formal research plan per Recommendation #15.

Outcomes (live — updated as Tracks complete)

This subsection records the per-Track verdict from each FTF sweep. Updated immediately on Story closure (per STORY_WORKFLOW.md §2.5 ML/FTF closure contract).

| Track | Phase | Verdict | Date | Results dossier |
| --- | --- | --- | --- | --- |
| 5 — Asymmetric label smoothing + cleanlab | quick-win 1 | ABANDONED (both branches : label_smoothing variants mean Δf1 ∈ [-0.085, -0.072] Cohen's d ∈ [-1.8, -1.1] ; cleanlab variants filter / reweight mean Δf1 ∈ [-0.081, -0.076] Cohen's d ∈ [-1.4, -1.2] BH p ≤ 5e-06 ; 0/5 cryptos improve on either branch) | 2026-04-29 | 2026-04-29-track5-label-smoothing-results.md (§6.2 cleanlab update) |
| 6 — Focal loss | quick-win 2 | ABANDONED (all 4 focal variants ; Δf1_buy ∈ [-0.10, -0.05] CI95 excluding 0 wrong-side, Cohen's d ≤ -1.2, BH p < 1.3e-07, 0/5 cryptos improve) | 2026-04-29 | 2026-04-29-track6-focal-loss-results.md |
| 9 — Per-regime threshold | quick-win 3 | ABANDONED (4 variants : per_regime_f1, per_regime_expectancy, per_regime_f1_with_floor, coarse_3regime ; lock rule fail 0/4 metrics on every variant ; max effect size \|d\|=0.45 in baseline's favor not significant p-adj=0.21 ; f1_buy regresses Δ ∈ [-0.016, -0.004] on every variant ; AAVEUSDC actively harmed (negative Sortino) on per_regime_f1 + per_regime_f1_with_floor ; per-asset gate ≤ 3/5) | 2026-05-01 | 2026-05-01-track9-per-regime-threshold-results.md |
| 11 — Ensemble diversity | big bet | INCONCLUSIVE — closure RETRACTED 2026-05-02 (original ABANDON of 2026-05-01 retracted ; supersession argument was mathematically unsound — uniform averaging dilutes orthogonal signals so the 3-variant null does NOT rule out lgb_only/cb_only signal ; pending autotrainer dispatcher PR + re-sweep with full 6-variant coverage) | 2026-05-01 (retracted 2026-05-02) | 2026-05-01-track11-ensemble-diversity-results.md §13 POST-CLOSURE REOPEN |
| 1 — BTC cross-asset features | quick-win 4 | ABANDONED — leakage gate fail per Track 1 plan dossier 2026-04-30-track1-btc-features-plan.md §4.6, pending root-cause investigation (sweep ftf_20260501_230526_2483f9_ATR0.5_1.5_H4, 150 results 0 errors, full 6-variant coverage ; mandatory paired t-test on f1_buy(btc_full_purge0) − f1_buy(btc_full) gives p=0.0401, d=+0.434 → leakage suspected per Track 1 plan dossier 2026-04-30-track1-btc-features-plan.md §4.6 ; Sortino diverges (canonical wins by 0.32) but the gate is on f1_buy not Sortino ; promising btc_full vs none Expectancy d=+0.61 + Win Rate d=+0.57 cannot be cleanly attributed until leakage source localised ; f1_buy lift gate also fails Δ=+0.007 vs +0.015 needed ; factor stays in MODEL_FACTORS for the investigation re-sweep) | 2026-05-02 | 2026-05-02-track1-btc-features-results.md |
| 14 — Timeframe sweep (5m / 15m / 30m / 1h) | data tier (meta) | F1_LOCK on 5m — TRADING_LOCK NOT granted (retroactively re-classified 2026-05-13 under v4 §4 amendment ; graduated from KEEP_AVAILABLE on 2026-05-03 after deep-mode confirmation ; standard sweep ftf_20260502_145754_54d6d1_ATR0.5_1.5_H4 Δf1_buy=+0.060 5/5 cryptos ; deep sweep ftf_20260502_222942_c34370_ATR0.5_1.5_H4 Δf1_buy=+0.125 p_BH=3.4e-10 d=+0.811 13/13 cryptos ; first Track to clear F1_LOCK gates ; TRADING_LOCK NOT granted — Sortino + MaxDD regress on 5m, F1_LOCK gates 4-6 NOT met ; Console flip operator-triggered per ADR-42 was provisional under pre-v4 derogation, now reviewed under v4 ; promotion to TRADING_LOCK requires economic regression resolution via filter-tuning Story) | 2026-05-03 (F1_LOCK) · TRADING_LOCK pending | 2026-05-02-track14-timeframe-results.md |

Cross-track lesson (4 ABANDON + 1 INCONCLUSIVE + 1 LOCK — first f1_buy-positive verdict) : Tracks 5/6/9 closed ABANDON for lack of signal (training signal manipulation + threshold calibration are not productive levers at the current dataset). Track 11 is INCONCLUSIVE pending autotrainer dispatcher. Track 1 closed ABANDON-on-leakage-gate : the f1_buy paired t-test on purge0 − canonical is statistically significant in favour of purge0 (the leakage-detector variant outperforms canonical, p=0.0401, d=+0.434) which per Track 1 plan dossier 2026-04-30-track1-btc-features-plan.md §4.6 mandates ABANDON pending investigation — the Sortino lift on canonical (1.710 vs 1.547) is REAL but cannot be cleanly attributed until the leakage source is localised. Track 14 (timeframe 5m) is the first LOCK of the F1 mission : standard mode delivered Δf1_buy=+0.060 (4× the +0.015 gate, 5/5 cryptos, p=0.0002), deep mode confirmed and doubled the lift to Δf1_buy=+0.125 (13/13 cryptos, p_BH=3.4e-10, Cohen d=+0.811 large). Per the operator's f1_buy-primary directive (2026-05-02), the LOCK verdict is on f1_buy alone — the 5m economic regression (Sortino, MaxDD) is out of scope of the F1 mission and flagged for a separate filter-tuning Story.

Strategic state — post Track 1 ABANDON-with-caveat + Track 14 LOCK on 5m (revised 2026-05-03) : the data tier is CONFIRMED PRODUCTIVE by Track 14's LOCK on 5m timeframe — finer bar resolution unlocks +0.125 Δf1_buy with p_BH=3.4e-10 and 13/13 per-asset coverage, the strongest result of the F1 mission. Track 1 (BTC features) is the second data-tier signal pending the leakage investigation, which will produce one of two outcomes :

  • (a) leakage is a real ADR-14 violation → canonical purge=20 already plugs it → re-sweep at deep mode → corrected verdict (likely KEEP_AVAILABLE or LOCK if effect sizes hold)
  • (b) leakage is production-exploitable signal mistakenly purged → adjust purge_bars sensitivity sweep, find production-feasible minimum → corrected canonical → re-sweep → potentially better lift than the current canonical

The Track 1 investigation is the gating step for the BTC-features lever. Track 12 (frac diff) is now conditionally clearable : Track 14's LOCK signals the data tier IS productive, which provides the strategic justification for Track 12 launch — but the formal precondition (Track 1 leakage investigation) still applies if Track 12's design depends on the BTC-features baseline. Independent of Track 1's investigation, Track 12 can launch on the 5m timeframe to compose with Track 14's LOCK.

Process lesson (Track 1 specific) : the leakage check spec must be applied verbatim from the plan dossier — the original draft of the Track 1 results dossier compared Sortino instead of f1_buy (per the Sortino-favours-canonical signal) and shipped a wrong-spec KEEP_AVAILABLE that was retracted in CR pass 1. Future closures with mandatory gates : extract the per-cell paired data from PG finetune_results directly when the plan calls for a paired test ; the canonical PDF report aggregates only show per-variant means, not the paired t-test breakdown.
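Once the per-cell pairs are extracted, the mandatory check reduces to a paired t statistic and an effect size on the f1_buy differences. The sketch below assumes Cohen's d is computed on the paired differences (mean over standard deviation); the dossier's exact definition is not stated here, and the function name and sample values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_and_d(purge0: Sequence[float],
                   canonical: Sequence[float]) -> Tuple[float, float, int]:
    """Return (t statistic, Cohen's d on differences, degrees of freedom).

    Inputs are per-cell f1_buy values, paired by position (same crypto/fold).
    """
    if len(purge0) != len(canonical) or len(purge0) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(purge0, canonical)]
    sd = stdev(diffs)                      # sample standard deviation
    if sd == 0:
        raise ValueError("differences have zero variance")
    d = mean(diffs) / sd
    t = d * math.sqrt(len(diffs))
    return t, d, len(diffs) - 1


# Illustrative values only, not taken from any sweep.
t_stat, cohen_d, dof = paired_t_and_d([0.40, 0.42, 0.41, 0.45],
                                      [0.39, 0.40, 0.41, 0.42])
```

The p-value follows from the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes the statistic and p-value directly when SciPy is available.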

Implication for §6 sequencing (revised 2026-05-03 post Track 14 LOCK) :

  • Highest-priority next operator action : trigger the Console flip on factor_timeframe=5m per ADR-42 atomic per-crypto promotion — the LOCK verdict authorises it, the operator decides timing (immediately on the f1_buy LOCK, or after the filter-tuning follow-up addresses the economic regression).
  • Parallel filter-tuning follow-up Story : investigate whether the f1_buy lift on 5m can be preserved while fixing the economic regression (Sortino 0.068 vs 1.242 and MaxDD -16% vs -10.5%, both against the 15m baseline). Candidates : CUSUM threshold re-tuning for 5m bars (current CVN_CUSUM_THRESHOLD_H=3.0 calibrated for 15m), cost-aware confidence threshold, dynamic position sizing tied to per-bar volatility. Likely scope under "filter tuning" mission, not F1 boost.
  • Investigate the deep-mode 17 % fold OOM : 14/17 mapped tasks failed mid-run on 5m × 36mo deep — likely OOM at the standard pod profile. Filed as follow-up issue ; recommend either heavy pod profile when 5m ∈ variants AND power_mode=deep OR a 5m-specific resource override in forecast_resources(). Doesn't change the LOCK verdict (332/400 = 83 % completion was sufficient for p_BH=3.4e-10) but matters for future deep sweeps that depend on the timeframe factor.
  • Track 1 leakage root-cause investigation : ship the per-feature ablation + purge_bars sensitivity sweep (filed as GH issue #806). ~1-2 days dev + ~3-4h re-sweep. Compose with Track 14 LOCK : re-sweep BTC-features at 5m timeframe to test interaction with the data-tier LOCK.
  • In parallel : ship Track 11 block A (issue #802 — autotrainer dispatcher) to unblock the architecture tier re-sweep. Independent of Tracks 1 + 14.
  • Track 12 (fractional differentiation) : strategic justification cleared by Track 14's LOCK. Sweep can launch at 5m timeframe to test compositional gain on top of the LOCK ; no longer blocked by Track 1's ABANDON.
  • Track 7 (cost-sensitive loss) — already VETOED in §6 phase 2 — veto re-evaluable : Track 14's LOCK on f1_buy is the first f1_buy-positive lever, but the Sortino regression on 5m means the cost-sensitive loss may be worth re-considering on the 5m baseline. Operator triage.
  • Re-evaluation hooks : Track 9 retry candidate IF Track 14's LOCK + Track 1 leakage investigation produce composable variants (per-regime threshold on top of 5m + BTC-enriched features may shift optimum) ; Track 11 retry REQUIRED post autotrainer-dispatcher.

7. Reporting standard (NEW in v2)

Per Recommendation #8, every track's evaluation produces:

  • Mean Δf1_buy with 95% bootstrap CI (10000 resamples)
  • Per-fold variance of f1_buy (and joint metrics)
  • Per-asset breakdown (5 cryptos × all metrics)
  • Class distribution per fold (BUY %, HOLD %)
  • Joint metric table: f1_buy + expectancy + sortino + max_drawdown + n_trades, all with CIs

Format: standard FTF PDF report extended with these sections. Reused for every track's evaluation.
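The first reporting item, mean Δf1_buy with a 95% bootstrap CI over 10000 resamples, can be sketched with a percentile bootstrap. The resampling unit (one Δf1_buy value per fold or per cell), the fixed seed, and the function name are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(deltas: Sequence[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, ci_low, ci_high) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)              # seeded for reproducible reports
    values = list(deltas)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n
                   for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, low, high


# Illustrative per-fold deltas, not taken from any sweep.
point, ci_low, ci_high = bootstrap_ci([0.05, 0.06, 0.07, 0.055, 0.065])
```

The F1_buy gate's "CI excluding 0" condition then amounts to checking that the lower bound is above zero.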


8. Cross-cutting concerns

  • All tracks respect ADR-25 (no silent fallback)
  • All tracks integrate with FTF (ADR-58)
  • All tracks reversible via env var (ADR-59)
  • Comparison baseline: Phase 2 rerun (ftf_20260427_170614_06626e_*) — clean post-#704 baseline

Operational prerequisites (cross-cutting deferred issues)

  • #708 — System kill-switch: required before ANY track enters live (paper acceptable without it during evaluation)
  • #709 — MLOps readiness template: required deliverable for each track
  • #710 — 9-filter ablation: orthogonal diagnostic, may inform Track 6/7 design
  • #711 — Dynamic slippage: depends on Track 2; same gating

9. Out of scope (v4 amendment — F1_LOCK scope only)

The v1-v3 phrasing "Sortino, win_rate, n_trades, expectancy — separate plan" was misleading in light of the v4 §4 F1_LOCK / TRADING_LOCK separation. Corrected wording (v4) :

  • Out of scope for F1_LOCK (predictive-uplift verdict — Tracks closing as F1_LOCK per §4) :
  • Sortino ratio, win_rate, n_trades, expectancy_net — these are TRADING_LOCK gates, not F1_LOCK gates. A Track can achieve F1_LOCK without meeting them.
  • Multi-crypto portfolio effects — per-crypto model level only.
  • Required for TRADING_LOCK (production-ready verdict — Tracks promoted to live trading) :
  • Sortino ratio ≥ baseline (gate 3 §4)
  • max_drawdown ≤ baseline + 1% (gate 4 §4)
  • expectancy_net ≥ baseline (gate 2 §4)
  • Stress-case clearance (Track 13)
  • Out of scope for both :
  • PTE changes — locked (separate Epic CVN-N001-EC)
  • Track A audit completion — diagnostic, not action
  • Production deployment automation — separate Epic CCP wp#149

10. Implementation tracking

Each track in §5 will be implemented via a dedicated child issue of #707 following the standard PR/CR/committee-pr_review cycle (per ADR-68 + project memory feedback_dev_process.md).

Track status

| Track | Phase | GH issue / OP wp | Status |
| --- | --- | --- | --- |
| 5 — Asymmetric label smoothing + cleanlab | quick-win 1 | #712 (S01) · OP wp#40 | Closed ABANDONED (both branches verified post S08 + S10 + sympy fixes) — see results §6.2 |
| 6 — Focal loss | quick-win 2 | #713 (S02) · OP wp#41 | Closed ABANDONED — see results |
| 9 — Per-regime threshold | quick-win 3 | #714 (S03) · OP wp#42 | Closed ABANDONED (sweep ftf_20260430_194027_3d0171_ATR0.5_1.5_H4, lock rule fail 0/4 metrics on every variant) — see results |
| 1 — BTC cross-asset features | quick-win 4 | #715 (S04) · OP wp#43 | Closed ABANDONED — leakage gate fail per Track 1 plan dossier 2026-04-30-track1-btc-features-plan.md §4.6, pending root-cause investigation (sweep ftf_20260501_230526_2483f9_ATR0.5_1.5_H4, 150 results 0 errors, full 6-variant coverage ; mandatory paired t-test on f1_buy(btc_full_purge0) − f1_buy(btc_full) gives p=0.0401, d=+0.434 — purge=0 outperforms canonical with stat sig → leakage suspected per Track 1 plan dossier 2026-04-30-track1-btc-features-plan.md §4.6 verbatim ; Sortino diverges (canonical wins by 0.32, +23%) but the gate is on f1_buy not Sortino — divergence is informative not contradictory ; promising effect sizes on btc_full vs none Expectancy d=+0.61 / Win Rate d=+0.57 cannot be cleanly attributed until leakage source localised ; f1_buy lift gate also fails Δ=+0.007 vs +0.015 needed ; factor stays in MODEL_FACTORS for the investigation re-sweep — see GH #806 leakage investigation) — see results |
| 12 — Frac diff + interactions | quick-win 5 | #716 (S05) · OP wp#44 | Closed (CLOSE-DELIVERY — EE upgrade 2026-06-08, plan_review 477b3923 ; metric gate FROZEN under ML_USELESS, no LOCK/ABANDON/tradability verdict ; metric re-evaluation = separate Story after recovery) — impl PR #885 merged 2026-05-08T23:45Z (squash sha 0a407688). Ships frac-diff (López de Prado AFML eq. 5.2, default min_w=5e-4 per committee plan_review 2681aa97 Option 2) + 5 explicit domain interactions + 8 FTF variants registered (incl. Q2 sensitivity _w1e3 / _w1e5 per committee) + InferenceAPI deploy-time guardrail blocking models with frac_diff_* features pre-loader (committee pr_review df09258b round 2 98b16083 PASSED-WITH-CHANGES). Awaiting FTF sweep launch on factor=frac_diff crypto_group=defi_top5 per the 2-stage protocol (Stage 1: 6 variants ; Stage 2: 2 sensitivity variants only if Stage 1 clears the +0.020 Δf1_buy gate). 6 round-2 residuals tracked as follow-ups (loader S05a/wp#140, DAG automation S05b/wp#141, future ADR #888). Next gate Tested → Closed per ADR-0079 verdict tree LOCK / KEEP_AVAILABLE / ABANDON. |
| 11 — Ensemble diversity | big bet | #717 (S06) · OP wp#45 | Closed (CLOSE-DELIVERY — EE upgrade 2026-06-08, plan_review 477b3923 ; metric gate FROZEN, no verdict ; re-evaluation after recovery) — 7-bug hotfix PR #872 merged 2026-05-08T20:32Z (sha 461a39b2) addresses the autotrainer-dispatcher gaps that blocked the v1 sweep (LGB Booster predict_proba wrap, CB metrics shape, CUSUM sigma propagation, AUC early-stop + θ-sweep, advanced ML metrics). Validation sweep manual__2026-05-08T21:00:53_post_872_v2 (factor=ensemble_diversity, group=defi_top5) running on the patched image. Next gate (per ADR-0081 §2.6) → Tested on per-track gate decision (LOCK / KEEP_AVAILABLE / ABANDON per ADR-0079). Original sweep ftf_20260501_152755_d84267_ATR0.5_1.5_H4 data + retracted 2026-05-02 verdict kept in results dossier §13 as audit trail. |
| 2 — Order book features | big bet | planning #718 (S07, wp#46 Closed) · impl #859 (S07-impl, wp#128 Rejected — ABANDON in EE, upgrade 2026-06-08, no named L2 destination) | Planning Closed 2026-05-07 — plan dossier merged via PR #850 (squash 7d25aa39, committee plan_review 4076bdca PASSED + v2.1/v2.2/v2.3 amendments + committee pr_review PASSED). Operator decisions locked wp#46 comment 679 (2026-05-05) — A=b, B=b, C=b, D=b, E=c, F=a, G=same as btc, H=a. Implementation ABANDONED (Rejected) — follow-up Story CVN-N001-EE-S07-impl (wp#128, GH #859) ABANDON in EE (upgrade 2026-06-08, plan_review 477b3923) : metric bet under the ML_USELESS freeze, no named L2 destination. L2 ingestion (S15) remains delivered. Reopening : Track 2 re-sponsored via a named L2/data Epic + owner + EI evidence. |
| 8 — Sequence model residual | big bet | #719 (S08) · OP wp#47 | Rejected (ABANDON — EE upgrade 2026-06-08, plan_review 477b3923 ; big bet off-trajectory under the ML_USELESS freeze ; never started ; reopening per §4quater) |
| 7 — Cost-sensitive loss | big bet | #720 (S09) · OP wp#48 | Rejected (ABANDON — EE upgrade 2026-06-08, plan_review 477b3923 ; VETO maintained, never started ; reopening per §4quater) |
| 13 — Stress-case liquidity | guardrail | #721 (S10) · OP wp#49 | Rejected (ABANDON — EE upgrade 2026-06-08, plan_review 477b3923 ; required only pre-live, moot under the freeze ; reopening per §4quater) |
| 10 — Pseudo-labeling | semi-sup | #722 (S11) · OP wp#50 | Rejected (ABANDON — EE upgrade 2026-06-08, plan_review 477b3923 ; do-not-implement under the regressed baseline ; reopening per §4quater) |
| 14 — Timeframe sweep (5m / 15m / 30m / 1h) | data tier (meta) | #808 (S12) · OP wp#101 | Closed LOCK (5m) — graduated from KEEP_AVAILABLE → LOCK on 2026-05-03 after deep-mode confirmation (sweep ftf_20260502_222942_c34370_ATR0.5_1.5_H4, 332/400 useful cells = 83 % completion, 17 % fold OOM/timeout — see follow-up issue ; deep-mode 5m vs 1h on f1_buy : Δ=+0.1252 (2× the standard-mode lift), CI95=[+0.092, +0.160], paired t p_BH=3.4e-10, Cohen d=+0.811 large, per-asset 13/13 cryptos win — strongest result of the F1 mission to date) ; standard-mode evidence preserved (sweep ftf_20260502_145754_54d6d1_ATR0.5_1.5_H4, 99/100 cells, Δf1_buy=+0.060 4× the gate, p=0.0002, d=+0.888, 5/5 cryptos) ; 30m + 1h ABANDON on f1_buy gate (deep p_BH=0.053 / 0.074 borderline non-significant) ; 15m baseline retained as production timeframe pending operator decision on Console flip per ADR-42 ; operator caveat (preserved) : 5m has economic regression (Sortino, MaxDD) at standard mode — out of scope of f1_buy LOCK per F1 plan §6 spec, flagged for filter-tuning follow-up Story — see results §13 |

Update this table as each child issue is opened and progressed. Each track's gates (§6) must be cleared before the next sequential track starts.


11. References

11.1 Cross-cutting mission (Pipeline Contract Hardening)

The Track 5 implementation surfaced an "implicit contract" class of bugs between apply_label_pipeline and downstream consumers (3 production incidents in 24h on the same surface, documented in OPERATIONS.md §17.1 / §17.2 / §17.3). These motivated the sibling Epic CVN-N011-EA — Pipeline Contract Hardening, which hosts cross-cutting Stories shared by the F1 mission and other epics :

| Story | Title | Status | PR |
| --- | --- | --- | --- |
| S07 | Calibration on hard labels (Bug #1) | Closed | #765 |
| S08 | Cleanlab class-aware drop cap | Closed (squash 9ff3966e) | #769 |
| S09 | ECE returns 0 silently on soft labels | Open | TBD |
| S10 | gRPC fork deadlock blocks HPO/FTF sweeps (P1) | Open (wp#91, #774) | TBD |

The CVN-N011-EA Epic is the authoritative tracker for hardening Stories ; this table is a convenience pointer.

11.2 Tests covering this mission

See the tests index section "ML Boost — Track 5 + 6 + hardening" for the full list. Highlights :

  • tests/integration/test_track5_label_smoothing.py — full Track 5 variant matrix end-to-end (incl. type preservation, calibration assertions)
  • tests/integration/test_track6_focal_loss.py — Track 6 5-variant matrix + temperature scaling + joint Track 5 × Track 6
  • tests/unit/training/labels/test_label_pipeline.py::TestSuspectMaskPerClassCap — S08 regression bar (8 tests, deterministic)
  • tests/unit/training/XGBoost/test_focal_loss_formal_verification.py — Track 6 SymPy symbolic verification (32 tests)

11.3 Plan + audit trail

  • v1 plan + v1 verdict: documentation/reviews/2026-04-27-f1-buy-boost-plan.md, committee session 8db2529d
  • v2 plan + v2 verdict: documentation/reviews/2026-04-27-f1-buy-boost-plan-v2.md, committee session 08655dce
  • v3 plan + v3 verdict (this document, drives implementation): documentation/reviews/2026-04-27-f1-buy-boost-plan-v3.md, committee session 9d4942cb
  • Pre-13-tracks v2 design (superseded 2026-04-27, archived): documentation/_archive/CVN-N001-f1-mission-design-v2-2026-04-22.md

11.4 Adjacent contexts

  • Track A umbrella: #690
  • Phase 2 audit report: documentation/reviews/2026-04-27-phase2-track-a-pre-predictions.md
  • Today's prerequisites: #700/#701, #703/#704, #706
  • Cross-cutting deferred: #708, #709, #710, #711
  • ADR-14, ADR-15, ADR-25, ADR-58, ADR-67
  • Existing FTF tuning protocol: documentation/TUNING_PROTOCOL.md
  • Sibling epics: CVN-N001-EC (PTE envelope, #630), CVN-N001-ED (FI ablation, #640+#682+#685+#688), CVN-N006 (MLflow backbone), CVN-N011-EA (hardening, see §11.1 above)
  • Parent need: CVN-N001 (F1 mission, issue #608)
  • Post-S17 / S18 regression context: documentation/architecture/TRAINING_PIPELINE.md §6 — explains the S18 freeze note at the top of this plan