PTE envelope sweep — test plan

  • Issue: #608 (F1 mission, phase "lock the envelope")
  • Factor: pte_envelope (11 variants, added in PR #630)
  • PR branch: feat/608-pte-envelope-factor
  • Date: 2026-04-23
  • Operator: dco


1. Context

Two FTF runs on 2026-04-22 triangulated the real lever for the mission:

| Run | PTE | Factor | Sortino | f1_buy | action_rate |
|---|---|---|---|---|---|
| ftf_20260422_174216 | ATR1.5_3.0_H5 | threshold_method | 0.68 | 0.06 | 0.008 |
| ftf_20260422_220929 | ATR0.5_1.5_H4 | horizon | 1.69 | 0.40 | 0.161 |

Switching the envelope from ATR1.5_3.0 to ATR0.5_1.5 multiplied Sortino by ~2.5 (0.68 → 1.69), action_rate by ~20 (0.008 → 0.161), and return by ~3.5. The horizon sweep (H3..H12) on the new envelope came back flat: 0/4 metrics significant, all Sortinos between 1.5 and 1.7. Horizon is not the knob.

The mission depends on finding the (SL_mult, TP_mult) pair that simultaneously:

  • Keeps action_rate ≥ 0.05 (operator criterion: "enough trades to learn")
  • Maximizes f1_buy − const_F1 (the only F1 advantage that means anything — raw F1 moves mechanically with pos_rate)
  • Maximizes Sortino at a drawdown we can live with

No envelope sweep has ever been run as a coupled factor. Existing tp_multiplier and sl_multiplier factors sweep one axis at a time, which mixes volume effect and edge effect in the cross-variant variance.

2. Hypothesis

H1 — There is a measurable (SL, TP) pair at which the model's advantage = f1_buy − const_F1 is strictly > 0 and Sortino > 1.5, with BH-corrected paired-t significance on at least 2 of {Sortino, expectancy, total_return, win_rate} — the lock criterion from ADR-14 / Issue #595 Phase A.

H2 — The best pair is not sl0.5_tp1.5 (current γ anchor). The γ anchor was inherited from the label-quality scan as a reasonable starting point, not from an empirical sweep. If H2 holds, γ's base_env should be updated post-run.

H3 — The RR ratio (TP/SL) matters more than the absolute scale. Specifically, tight RR (1:1 to 1:2) and wide RR (1:4 to 1:5) should produce different regimes:

  • Tight RR → high pos_rate, high trade volume, low per-trade edge, fragile to costs.
  • Wide RR → low pos_rate, fewer trades, higher per-trade edge, more variance.

The sweep will tell which regime wins on this market.

Null hypothesis — Every variant lands within ±0.15 Sortino of the anchor, no BH significance, no advantage above +0.02 anywhere. In that case the PTE is not the lever and we need to revisit the feature/model axes before any other envelope work.

3. Grid — 11 coupled (SL, TP) variants

| Variant | SL × ATR | TP × ATR | RR (SL:TP) | Hypothesis role |
|---|---|---|---|---|
| sl0.3_tp1.0 | 0.3 | 1.0 | 1:3.3 | Tight floor, max trade density |
| sl0.3_tp1.5 | 0.3 | 1.5 | 1:5 | Very tight SL + ambitious TP |
| sl0.5_tp1.0 | 0.5 | 1.0 | 1:2 | Current SL, compact TP |
| sl0.5_tp1.5 | 0.5 | 1.5 | 1:3 | γ anchor (baseline) |
| sl0.5_tp2.0 | 0.5 | 2.0 | 1:4 | Current SL, wider TP |
| sl0.5_tp2.5 | 0.5 | 2.5 | 1:5 | Ambitious RR, same SL |
| sl1.0_tp1.0 | 1.0 | 1.0 | 1:1 | Symmetric: tests whether edge exists without RR advantage |
| sl1.0_tp1.5 | 1.0 | 1.5 | 1:1.5 | Modest RR, medium scale |
| sl1.0_tp2.0 | 1.0 | 2.0 | 1:2 | Balanced medium |
| sl1.0_tp3.0 | 1.0 | 3.0 | 1:3 | Medium SL, wide TP |
| sl1.5_tp3.0 | 1.5 | 3.0 | 1:2 | Pre-mission envelope (reference) |

Horizon held constant at base_env default (H=4h). Binary mode + f1_binary HPO + ThresholdCalibrator.f1_binary already pinned in base_env via γ (2026-04-22). No other factor varied in this run.
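The grid can be written down as data so that variant names and RR ratios are derived rather than typed by hand. This is a minimal sketch: the (SL, TP) pairs are copied from the table above, while GRID, variant_name and reward_risk are illustrative names and not part of the pte_envelope factor implementation.

```python
from fractions import Fraction

# (SL multiplier, TP multiplier) pairs, in ATR units, copied from the grid table.
GRID = [
    ("0.3", "1.0"), ("0.3", "1.5"),
    ("0.5", "1.0"), ("0.5", "1.5"), ("0.5", "2.0"), ("0.5", "2.5"),
    ("1.0", "1.0"), ("1.0", "1.5"), ("1.0", "2.0"), ("1.0", "3.0"),
    ("1.5", "3.0"),
]


def variant_name(sl: str, tp: str) -> str:
    """Build the variant label used in the plan, e.g. 'sl0.5_tp1.5'."""
    return f"sl{sl}_tp{tp}"


def reward_risk(sl: str, tp: str) -> float:
    """TP/SL ratio, i.e. the x in an RR of '1:x'. Fractions avoid float noise."""
    return float(Fraction(tp) / Fraction(sl))


# Map each variant label to its RR ratio.
variants = {variant_name(sl, tp): reward_risk(sl, tp) for sl, tp in GRID}

for name, rr in variants.items():
    print(f"{name}: 1:{rr:g}")
```

Keeping the multipliers as strings preserves the exact labels (sl1.0, not sl1) and lets Fraction compute the ratio exactly.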

4. Resource estimate

| Quantity | Value |
|---|---|
| Variants | 11 |
| Cryptos (defi_top5) | 5 (AAVEUSDC kept per operator decision) |
| Folds | 5 |
| HPO trials / variant / fold | 50 |
| Total pods (factor × crypto) | 11 × 5 = 55 |
| Model-fits total | 11 × 5 × 5 = 275 |
| Profile | standard (4 CPU / 8Gi per pod) |
| Airflow max_active_tasks | 24 |
| Wall-time estimate | ~8h (extrapolated from run ftf_20260422_220929, which did 25 pods in 3h19m at the same config) |

confirm_long_run=true required because forecast > 3h.
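As a sanity check on the ~8h figure, a naive linear scaling of the reference run can be sketched as follows. It assumes wall time grows proportionally with pod count at the same max_active_tasks, which the plan does not state; all variable names are illustrative.

```python
from datetime import timedelta

# Reference run ftf_20260422_220929: 25 pods in 3h19m (from the table above).
ref_pods = 25
ref_wall = timedelta(hours=3, minutes=19)

# This run: 11 variants x 5 cryptos.
new_pods = 11 * 5

# Assumption: wall time scales linearly with the number of pods.
linear_estimate = ref_wall * new_pods / ref_pods

# The plan requires confirm_long_run=true when the forecast exceeds 3h.
needs_confirm = linear_estimate > timedelta(hours=3)

print(linear_estimate, needs_confirm)
```

Under that assumption the estimate comes out at about 7h18m, so the ~8h figure leaves roughly 40 minutes of headroom.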

5. Success criteria (decision rule)

LOCK — promote the winner to base_env

A variant V is locked iff all three conditions hold:

  1. Sortino V ranks first AND BH-adjusted p < 0.05 vs the anchor sl0.5_tp1.5 on at least 2 of the 4 primary metrics (Sortino / expectancy / total_return / win_rate), Cohen's d ≥ 0.3 in winner direction. (ADR-14 lock rule.)
  2. advantage > +0.02 averaged across cryptos (f1_buy − const_F1). Captures that the model adds value above naive constant-BUY.
  3. AAVEUSDC Sortino > -1.0 under V. Tolerance: AAVE can stay negative but must not amplify beyond what it showed in run B (-0.6 to -0.8). Hard floor -1.0 catches regimes where the envelope makes AAVE structurally broken.

If all three pass → update ftf_config.base_env via Console:

CVN_SL_MULT = "<winner SL>"
CVN_TP_MULT = "<winner TP>"

NOT-LOCK — keep γ anchor

Any variant tied or better than anchor on Sortino WITHOUT BH significance → no promotion, anchor stays. Move to next lever (Δf1 reduction, feature count, or model ensemble).

PIVOT — if null hypothesis wins

All 11 variants within ±0.15 Sortino of anchor, 0/11 with advantage > +0.02 → envelope is not the lever. Open separate audit issue on features/label quality before any further envelope work. Document this as a dead-end in the mission plan v2.
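The three outcomes above can be sketched as a single decision function. This is an illustration only, not the logic of scripts/analyze_pte_envelope_run.py: VariantStats, is_lock and decide are hypothetical names, and the sketch assumes that the Cohen's d ≥ 0.3 requirement applies per metric to the same metrics that pass the BH threshold, which is one reading of condition 1.

```python
from dataclasses import dataclass

PRIMARY = ("sortino", "expectancy", "total_return", "win_rate")


@dataclass
class VariantStats:
    name: str
    sortino: float       # aggregate Sortino for the variant
    advantage: float     # f1_buy - const_F1, averaged across cryptos
    aave_sortino: float  # AAVEUSDC Sortino under this variant
    p_adj: dict          # BH-adjusted p vs anchor, keyed by primary metric
    cohen_d: dict        # Cohen's d vs anchor (positive = variant better)


def is_lock(v, all_variants):
    """Apply the three LOCK conditions to one candidate variant."""
    ranks_first = v.sortino == max(x.sortino for x in all_variants)
    # Assumption: a metric counts only if it is significant AND d >= 0.3.
    n_sig = sum(1 for m in PRIMARY if v.p_adj[m] < 0.05 and v.cohen_d[m] >= 0.3)
    return (ranks_first and n_sig >= 2
            and v.advantage > 0.02
            and v.aave_sortino > -1.0)


def decide(variants, anchor_name="sl0.5_tp1.5"):
    """Return (decision, variant name or None)."""
    anchor = next(v for v in variants if v.name == anchor_name)
    others = [v for v in variants if v.name != anchor_name]
    for v in others:
        if is_lock(v, variants):
            return "LOCK", v.name
    # PIVOT: everything within +/-0.15 Sortino of anchor and no advantage > +0.02.
    flat = all(abs(v.sortino - anchor.sortino) <= 0.15 for v in others)
    no_adv = all(v.advantage <= 0.02 for v in variants)
    if flat and no_adv:
        return "PIVOT", None
    return "NOT_LOCK", anchor_name
```

Checking PIVOT only after LOCK fails keeps the two rules from conflicting, and anything in between falls through to NOT_LOCK with the anchor retained.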

6. What we measure

Primary (decision-driving)

  • f1_buy, precision_buy, recall_buy, action_rate, Δf1 (overfit gap)
  • advantage = f1_buy − const_F1 computed post-hoc, added to the analysis notebook — not currently in the PDF report (follow-up to add to report_pdf.py)
  • Sortino, expectancy_per_trade, total_return_pct, win_rate, max_dd, n_trades
  • BH-corrected paired-t p-values per (variant × metric)
  • Cohen's d per (variant × metric)
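For reference, the two statistics in the last bullets can be sketched in plain Python. The paired-t p-values themselves are taken as inputs here (they could come from scipy.stats.ttest_rel, for example). Two points are assumptions rather than something the plan specifies: the family of tests the BH correction is applied over, and the use of the paired form of Cohen's d (mean difference divided by the standard deviation of the differences). The function names are illustrative.

```python
from statistics import mean, stdev


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def paired_cohen_d(variant_vals, anchor_vals):
    """Paired Cohen's d: mean of the differences over their sample std dev.

    Positive values mean the variant beats the anchor. Raises
    ZeroDivisionError if all differences are identical.
    """
    diffs = [v - a for v, a in zip(variant_vals, anchor_vals)]
    return mean(diffs) / stdev(diffs)
```

The inputs to paired_cohen_d are expected to be aligned cell by cell, for example one value per (crypto, fold) for the variant and for the anchor.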

Secondary (diagnostics)

  • Per-fold stability (variance across folds 3-7 — fold 4 has been a consistent outlier)
  • Per-crypto breakdown (especially AAVEUSDC trajectory)
  • Signal funnel (CUSUM block rate, concurrency block rate, total survival)
  • Calibration quality (Brier, ECE) — will deteriorate at high pos_rate, expected

Red flags (abort criteria during run)

  • Any variant's median action_rate < 0.01 → label too sparse, fold will crash on threshold calibration. Known fragility of the f1_binary path.
  • Per-pod OOM → bump profile from standard (8Gi) to heavy (24Gi) via power_mode=deep. Don't go full deep unless a pod actually OOMs — deep pulls in defi_full (17 cryptos) which we don't want here.
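The first red flag can be checked mechanically while the run is in progress. A minimal sketch, assuming per-cell action_rate values are available as a plain dict; the input shape and the name sparse_variants are hypothetical.

```python
from statistics import median

ACTION_RATE_FLOOR = 0.01  # abort threshold from the red-flag list


def sparse_variants(action_rates):
    """Return variants whose median action_rate is below the floor.

    action_rates: {variant: [action_rate per completed (crypto, fold)]}
    Variants with no completed cells yet are skipped.
    """
    return sorted(
        variant
        for variant, rates in action_rates.items()
        if rates and median(rates) < ACTION_RATE_FLOOR
    )
```

Using the median rather than the mean matches the wording of the criterion and keeps a single dense fold from masking an otherwise sparse variant.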

7. Analysis plan (post-run)

Local SQL after the run completes:

SELECT
  variant,
  crypto,
  fold_id,
  f1_buy, precision_buy, recall_buy, action_rate,
  n_trades_val, sortino_val, expectancy_val,
  sortino, total_return_pct, win_rate, max_dd_pct,
  n_trades_backtest
FROM finetune_results
WHERE run_id = '<new_run_id>'
ORDER BY variant, crypto, fold_id;

One-shot analyzer: scripts/analyze_pte_envelope_run.py <path_to_pdf> parses the FTF report and applies the full 3-condition lock rule (§5) automatically. Emits a markdown summary + JSON, prints LOCK / NOT_LOCK / PIVOT to stdout. Runs in seconds, no DB / MLflow needed. Tested against the horizon run (2026-04-22).

Manual pipeline (if the analyzer breaks on a PDF format change):

  1. Download PDF from Console → Runs → extract summary table.
  2. Join f1_buy (Couche A) + Sortino (Couche C) per (variant, crypto, fold).
  3. Compute pos_rate — not yet in finetune_results. Two sources:

a. Preferred (exact): re-run scripts/label_quality_scan.py with the winning (SL_mult, TP_mult) from the sweep. That script replays the triple-barrier on raw OHLCV and returns pos_rate per (crypto, variant). Cost: ~2 min / config on defi_top5.

b. Heuristic (fast, noisy): use the model's observed action_rate as a proxy for pos_rate. At an F1-maximizing threshold the two land within roughly 10 percentage points of each other on balanced-ish labels (we saw action_rate ≈ 0.16 against the label scan's pos_rate ≈ 0.25 on the H4 anchor). Good enough for ranking variants while the true rate isn't yet plumbed through, but when action_rate underestimates pos_rate, as in that example, const_F1 comes out too low and advantage is overstated.

Follow-up gated on this run: extend report_pdf.py (or the finetune_results ETL upstream) to persist pos_rate per (variant, crypto, fold) so subsequent runs skip step 3. Currently none of the factors produce this column.

  4. Compute const_F1 = 2 × pos_rate / (1 + pos_rate) per (variant, crypto).
  5. Compute advantage = f1_buy − const_F1 for each cell; aggregate (median across folds × cryptos).
  6. Lock / not-lock / pivot decision per §5.
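The const_F1 and advantage steps can be sketched as below. The formula follows from a constant-BUY predictor having precision = pos_rate and recall = 1, so F1 = 2 × pos_rate / (1 + pos_rate). The function names and the input shape are illustrative, not taken from the analyzer.

```python
from statistics import median


def const_f1(pos_rate):
    """F1 of an always-BUY predictor: precision = pos_rate, recall = 1."""
    return 2 * pos_rate / (1 + pos_rate)


def advantage(f1_buy, pos_rate):
    """Model F1 minus the constant-BUY baseline for one cell."""
    return f1_buy - const_f1(pos_rate)


def variant_advantage(cells):
    """Median advantage over cells.

    cells: iterable of (f1_buy, pos_rate), one per (crypto, fold).
    """
    return median(advantage(f1, p) for f1, p in cells)


# Figures quoted in this plan for the H4 anchor: f1_buy 0.40, pos_rate ~0.25.
print(round(const_f1(0.25), 3), round(advantage(0.40, 0.25), 3))
```

With those quoted figures const_F1 is 0.40, so an f1_buy of 0.40 corresponds to an advantage of roughly zero, which illustrates why raw f1_buy is not read on its own.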

8. Follow-ups gated on this run

  • If lock → update base_env, re-run a short validation FTF with a secondary factor (candidate next: overfit-reduction factor hpo_regularization_band {loose/tight}) to confirm the locked PTE is robust to model complexity changes.
  • If not-lock or pivot → AAVEUSDC audit, then overfit-reduction on anchor.
  • Either way → extend report_pdf.py to compute and display advantage alongside f1_buy by default. Raw f1_buy without the pos_rate reference is misleading and we relearned this twice already.

9. Timeline

| Step | ETA |
|---|---|
| PR #630 merged | today, post-CR |
| Console flip if needed (operator) | today |
| Trigger FTF w/ factor=pte_envelope | today |
| Run completes | +8h wall |
| Analysis + decision | +9h total |

10. Stories (retro-registered in OP — 2026-06-09)

This Epic (experiment of 2026-04-23/24) had never been tracked in OpenProject. Registered retroactively: Epic wp#258 (GH #1146), parent Need CVN-N001.

| Story | Title | GH · OP | Status |
|---|---|---|---|
| CVN-N001-EC-S01 | pte_envelope factor: coupled (SL,TP) sweep impl | #1147 · wp#259 | Closed (PR #630 merged) |
| CVN-N001-EC-S02 | sweep run, 11 variants + LOCK/NOT-LOCK/PIVOT decision | #1148 · wp#260 | Closed (decision: PTE ATR0.5_1.5_H4, ~2026-04-24) |

The §8 follow-ups were not pursued (never started; the program pivoted to the ML_USELESS freeze), so no Stories were created for them: advantage display in report_pdf.py · FTF validation with hpo_regularization_band / AAVEUSDC audit. Reopen via a new Story if PTE work resumes.