PTE envelope sweep: test plan
Issue: #608 (F1 mission — phase "lock the envelope")
Factor: pte_envelope (11 variants, added in PR #630)
PR branch: feat/608-pte-envelope-factor
Date: 2026-04-23
Operator: dco
1. Context
Two FTF runs on 2026-04-22 triangulated the real lever for the mission:
| Run | PTE | Factor | Sortino | f1_buy | action_rate |
|---|---|---|---|---|---|
| `ftf_20260422_174216` | `ATR1.5_3.0_H5` | `threshold_method` | 0.68 | 0.06 | 0.008 |
| `ftf_20260422_220929` | `ATR0.5_1.5_H4` | `horizon` | 1.69 | 0.40 | 0.161 |
Switching the envelope from ATR1.5_3.0 to ATR0.5_1.5 multiplied Sortino ×2.5, action_rate ×20, return ×3.5. The horizon sweep (H3..H12) on the new envelope came back totally flat — 0/4 metrics significant, all Sortinos 1.5-1.7. Horizon is not the knob.
The mission depends on finding the (SL_mult, TP_mult) pair that simultaneously:
- Keeps `action_rate ≥ 0.05` (operator criterion: "enough trades to learn")
- Maximizes `f1_buy − const_F1` (the only F1 advantage that means anything, since raw F1 moves mechanically with pos_rate)
- Maximizes Sortino at a drawdown we can live with
No envelope sweep has ever been run as a coupled factor. Existing tp_multiplier and sl_multiplier factors sweep one axis at a time, which mixes volume effect and edge effect in the cross-variant variance.
2. Hypothesis
H1 — There is a measurable (SL, TP) pair at which the model's advantage = f1_buy − const_F1 is strictly > 0 and Sortino > 1.5, with BH-corrected paired-t significance on at least 2 of {Sortino, expectancy, total_return, win_rate} — the lock criterion from ADR-14 / Issue #595 Phase A.
H2 — The best pair is not sl0.5_tp1.5 (current γ anchor). The γ anchor was inherited from the label-quality scan as a reasonable starting point, not from an empirical sweep. If H2 holds, γ's base_env should be updated post-run.
H3 — The RR ratio (TP/SL) matters more than the absolute scale. Specifically, tight RR (1:1 to 1:2) and wide RR (1:4 to 1:5) should produce different regimes:
- Tight RR → high pos_rate, high trade volume, low per-trade edge, fragile to costs.
- Wide RR → low pos_rate, fewer trades, higher per-trade edge, more variance.
The sweep will tell which regime wins on this market.
Null hypothesis — Every variant lands within ±0.15 Sortino of the anchor, no BH significance, no advantage above +0.02 anywhere. In that case the PTE is not the lever and we need to revisit the feature/model axes before any other envelope work.
3. Grid: 11 coupled (SL, TP) variants
| Variant | SL × ATR | TP × ATR | RR | Hypothesis role |
|---|---|---|---|---|
| `sl0.3_tp1.0` | 0.3 | 1.0 | 1:3.3 | Tight floor: max trade density |
| `sl0.3_tp1.5` | 0.3 | 1.5 | 1:5 | Very tight SL + ambitious TP |
| `sl0.5_tp1.0` | 0.5 | 1.0 | 1:2 | Current SL, compact TP |
| `sl0.5_tp1.5` | 0.5 | 1.5 | 1:3 | γ anchor (baseline) |
| `sl0.5_tp2.0` | 0.5 | 2.0 | 1:4 | Current SL, wider TP |
| `sl0.5_tp2.5` | 0.5 | 2.5 | 1:5 | Ambitious RR, same SL |
| `sl1.0_tp1.0` | 1.0 | 1.0 | 1:1 | Symmetric: test if edge exists without RR advantage |
| `sl1.0_tp1.5` | 1.0 | 1.5 | 1:1.5 | Modest RR, medium scale |
| `sl1.0_tp2.0` | 1.0 | 2.0 | 1:2 | Balanced medium |
| `sl1.0_tp3.0` | 1.0 | 3.0 | 1:3 | Medium SL, wide TP |
| `sl1.5_tp3.0` | 1.5 | 3.0 | 1:2 | Pre-mission envelope (reference) |
Horizon held constant at base_env default (H=4h). Binary mode + f1_binary HPO + ThresholdCalibrator.f1_binary already pinned in base_env via γ (2026-04-22). No other factor varied in this run.
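As a quick sanity check on the grid, the RR column can be recomputed from the variant names. This is a minimal sketch that assumes the `sl<SL>_tp<TP>` naming shown in the table; `parse_variant` and `rr_ratio` are hypothetical helpers, not part of the factor code.

```python
import re

# The 11 coupled variants from the grid table.
VARIANTS = [
    "sl0.3_tp1.0", "sl0.3_tp1.5",
    "sl0.5_tp1.0", "sl0.5_tp1.5", "sl0.5_tp2.0", "sl0.5_tp2.5",
    "sl1.0_tp1.0", "sl1.0_tp1.5", "sl1.0_tp2.0", "sl1.0_tp3.0",
    "sl1.5_tp3.0",
]

def parse_variant(name: str) -> tuple[float, float]:
    """Return (SL multiplier, TP multiplier) parsed from a variant name."""
    m = re.fullmatch(r"sl(\d+(?:\.\d+)?)_tp(\d+(?:\.\d+)?)", name)
    if m is None:
        raise ValueError(f"unexpected variant name: {name!r}")
    return float(m.group(1)), float(m.group(2))

def rr_ratio(name: str) -> float:
    """Reward-to-risk ratio TP/SL, i.e. the N in '1:N'."""
    sl, tp = parse_variant(name)
    return tp / sl

for v in VARIANTS:
    sl, tp = parse_variant(v)
    print(f"{v}: SL={sl} TP={tp} RR=1:{rr_ratio(v):.1f}")
```

Grouping the output by RR makes the tight (1:1 to 1:2) and wide (1:4 to 1:5) regimes from H3 easy to pick out.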
4. Resource estimate
| Quantity | Value |
|---|---|
| Variants | 11 |
| Cryptos (defi_top5) | 5 (AAVEUSDC kept per operator decision) |
| Folds | 5 |
| HPO trials / variant / fold | 50 |
| Total pods (factor × crypto) | 55 |
| Model-fits total | 11 × 5 × 5 = 275 |
| Profile | standard (4 CPU / 8Gi per pod) |
| Airflow `max_active_tasks` | 24 |
| Wall-time estimate | ~8h (extrapolated from run ftf_20260422_220929 which did 25 pods in 3h19m at same config) |
confirm_long_run=true required because forecast > 3h.
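The wall-time figure can be reproduced with a simple pods-to-time extrapolation. This is only a sketch: it assumes wall time scales linearly with pod count at the same config, which ignores scheduling waves under `max_active_tasks`. The function name is illustrative.

```python
from datetime import timedelta

def extrapolate_wall_time(ref_pods: int, ref_wall: timedelta, new_pods: int) -> timedelta:
    """Scale a reference wall time linearly with pod count (assumption)."""
    return ref_wall * (new_pods / ref_pods)

# Reference run ftf_20260422_220929: 25 pods in 3h19m.
ref_wall = timedelta(hours=3, minutes=19)
new_pods = 11 * 5  # variants x cryptos

estimate = extrapolate_wall_time(25, ref_wall, new_pods)
hours = estimate.total_seconds() / 3600
needs_confirm = estimate > timedelta(hours=3)  # forecast above 3h requires confirm_long_run

print(f"linear estimate: {hours:.1f}h, confirm_long_run needed: {needs_confirm}")
```

The linear estimate comes out near 7.3h, so the ~8h figure in the table includes some headroom.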
5. Success criteria (decision rule)
LOCK: promote the winner to base_env
A variant V is locked iff all three conditions hold:
- Sortino: `V` ranks first AND BH-adjusted p < 0.05 vs the anchor `sl0.5_tp1.5` on at least 2 of the 4 primary metrics (Sortino / expectancy / total_return / win_rate), Cohen's d ≥ 0.3 in winner direction. (ADR-14 lock rule.)
- advantage > +0.02 averaged across cryptos (`f1_buy − const_F1`). Captures that the model adds value above naive constant-BUY.
- AAVEUSDC Sortino > -1.0 under `V`. Tolerance: AAVE can stay negative but must not amplify beyond what it showed in run B (-0.6 to -0.8). Hard floor -1.0 catches regimes where the envelope makes AAVE structurally broken.
If all three pass → update ftf_config.base_env via Console.
NOT-LOCK: keep γ anchor
Any variant tied or better than anchor on Sortino WITHOUT BH significance → no promotion, anchor stays. Move to next lever (Δf1 reduction, feature count, or model ensemble).
PIVOT: if null hypothesis wins
All 11 variants within ±0.15 Sortino of anchor, 0/11 with advantage > +0.02 → envelope is not the lever. Open separate audit issue on features/label quality before any further envelope work. Document this as a dead-end in the mission plan v2.
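The three outcomes above can be folded into one decision function. The sketch below assumes per-variant summaries have already been computed; `VariantSummary` and its fields are hypothetical names, not the schema used by the real analyzer. Two readings are also assumptions: PIVOT is checked before LOCK, and the Cohen's d ≥ 0.3 condition is applied per significant metric.

```python
from dataclasses import dataclass
from typing import Optional

PRIMARY = ("sortino", "expectancy", "total_return", "win_rate")

@dataclass
class VariantSummary:
    name: str
    sortino: float              # aggregate Sortino for the variant
    advantage: float            # f1_buy - const_F1, averaged across cryptos
    aave_sortino: float         # AAVEUSDC Sortino under this variant
    p_adj: dict[str, float]     # BH-adjusted p vs anchor, per primary metric
    cohen_d: dict[str, float]   # effect size vs anchor, positive = variant better

def decide(variants: list[VariantSummary],
           anchor: str = "sl0.5_tp1.5") -> tuple[str, Optional[str]]:
    """Return (decision, winner) with decision in LOCK / NOT_LOCK / PIVOT."""
    base = next(v for v in variants if v.name == anchor)
    others = [v for v in variants if v.name != anchor]

    # PIVOT: all variants within +/-0.15 Sortino of anchor, none above +0.02 advantage.
    flat = all(abs(v.sortino - base.sortino) <= 0.15 for v in others)
    no_edge = all(v.advantage <= 0.02 for v in variants)
    if flat and no_edge:
        return "PIVOT", None

    best = max(variants, key=lambda v: v.sortino)
    if best.name == anchor:
        return "NOT_LOCK", None

    # Condition 1: BH significance with adequate effect size on >= 2 primary metrics.
    significant = [m for m in PRIMARY
                   if best.p_adj[m] < 0.05 and best.cohen_d[m] >= 0.3]
    locked = (len(significant) >= 2
              and best.advantage > 0.02        # condition 2
              and best.aave_sortino > -1.0)    # condition 3
    return ("LOCK", best.name) if locked else ("NOT_LOCK", None)
```

Returning the winner alongside the decision keeps the function usable both for the stdout verdict and for the follow-up `base_env` update.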
6. What we measure
Primary (decision-driving)
- `f1_buy`, `precision_buy`, `recall_buy`, `action_rate`, `Δf1` (overfit gap)
- `advantage = f1_buy − const_F1` computed post-hoc, added to the analysis notebook; not currently in the PDF report (follow-up to add to `report_pdf.py`)
- `Sortino`, `expectancy_per_trade`, `total_return_pct`, `win_rate`, `max_dd`, `n_trades`
- BH-corrected paired-t p-values per (variant × metric)
- Cohen's d per (variant × metric)
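For reference, the two statistics in the last bullets can be computed with the standard library alone. The sketch below shows a Benjamini-Hochberg step-up adjustment and a paired Cohen's d (mean of the paired differences over their standard deviation). Two things are assumptions: that all (variant × metric) p-values form a single BH family, and that the paired form of d is the one intended. The paired-t p-values themselves are taken as inputs.

```python
from statistics import mean, stdev

def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def paired_cohen_d(variant: list[float], anchor: list[float]) -> float:
    """Cohen's d for paired samples: mean(diff) / stdev(diff).

    Positive values mean the variant beats the anchor on this metric.
    """
    if len(variant) != len(anchor):
        raise ValueError("paired samples must have the same length")
    diffs = [v - a for v, a in zip(variant, anchor)]
    return mean(diffs) / stdev(diffs)
```

Pairs here would be matched (crypto, fold) cells, so each variant is compared with the anchor on the same data splits.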
Secondary (diagnostics)
- Per-fold stability (variance across folds 3-7 — fold 4 has been a consistent outlier)
- Per-crypto breakdown (especially AAVEUSDC trajectory)
- Signal funnel (CUSUM block rate, concurrency block rate, total survival)
- Calibration quality (Brier, ECE) — will deteriorate at high pos_rate, expected
Red flags (abort criteria during run)
- Any variant's `median action_rate < 0.01` → label too sparse, fold will crash on threshold calibration. Known fragility of the f1_binary path.
- Per-pod OOM → bump profile from `standard` (8Gi) to `heavy` (24Gi) via `power_mode=deep`. Don't go full deep unless a pod actually OOMs: `deep` pulls in defi_full (17 cryptos), which we don't want here.
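The first abort criterion is easy to check while the run is in flight. This is a minimal sketch that assumes per-fold `action_rate` values have been pulled into a mapping keyed by variant; the data layout and function name are illustrative.

```python
from statistics import median

ACTION_RATE_FLOOR = 0.01  # abort threshold from the red-flag list

def sparse_variants(action_rates: dict[str, list[float]]) -> list[str]:
    """Return variants whose median action_rate falls below the floor."""
    return sorted(
        name for name, rates in action_rates.items()
        if rates and median(rates) < ACTION_RATE_FLOOR
    )
```

Any name returned marks a variant whose labels are likely too sparse for threshold calibration.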
7. Analysis plan (post-run)
Local SQL after the run completes:
SELECT
variant,
crypto,
fold_id,
f1_buy, precision_buy, recall_buy, action_rate,
n_trades_val, sortino_val, expectancy_val,
sortino, total_return_pct, win_rate, max_dd_pct,
n_trades_backtest
FROM finetune_results
WHERE run_id = '<new_run_id>'
ORDER BY variant, crypto, fold_id;
One-shot analyzer: scripts/analyze_pte_envelope_run.py <path_to_pdf> parses the FTF report and applies the full 3-condition lock rule (§5) automatically. Emits a markdown summary + JSON, prints LOCK / NOT_LOCK / PIVOT to stdout. Runs in seconds, no DB / MLflow needed. Tested against the horizon run (2026-04-22).
Manual pipeline (if the analyzer breaks on a PDF format change):
1. Download PDF from Console → Runs → extract summary table.
2. Join `f1_buy` (Couche A) + Sortino (Couche C) per (variant, crypto, fold).
3. Compute `pos_rate`, which is not yet in `finetune_results`. Two sources:

   a. Preferred (exact): re-run `scripts/label_quality_scan.py` with the winning (SL_mult, TP_mult) from the sweep. That script replays the triple-barrier on raw OHLCV and returns pos_rate per (crypto, variant). Cost: ~2 min / config on defi_top5.

   b. Heuristic (fast, noisy): use the model's observed action_rate as a proxy for pos_rate. At an F1-maximizing threshold the two land within ~0.10 (absolute) of each other on balanced-ish labels (we saw action_rate ≈ 0.16 against the label-scan's pos_rate ≈ 0.25 on the H4 anchor). Good enough for ranking variants when the true rate isn't yet plumbed through.

   Follow-up gated on this run: extend `report_pdf.py` (or the `finetune_results` ETL upstream) to persist pos_rate per (variant, crypto, fold) so subsequent runs skip step 3. Currently none of the factors produce this column.
4. Compute `const_F1 = 2 × pos_rate / (1 + pos_rate)` per (variant, crypto).
5. Compute `advantage = f1_buy − const_F1` for each cell; aggregate (median across folds × cryptos).
6. Lock / not-lock / pivot decision per §5.
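The const_F1 and advantage computations above reduce to a few lines. A constant-BUY classifier has recall 1 and precision equal to pos_rate, which is where `2 × pos_rate / (1 + pos_rate)` comes from. The sketch below assumes rows are available as (variant, crypto, fold_id, f1_buy) tuples and pos_rate as a mapping keyed by (variant, crypto); both layouts are hypothetical.

```python
from collections import defaultdict
from statistics import median

def const_f1(pos_rate: float) -> float:
    """F1 of an always-BUY classifier: precision = pos_rate, recall = 1."""
    return 2 * pos_rate / (1 + pos_rate)

def advantage_by_variant(rows, pos_rate):
    """Median of (f1_buy - const_F1) across folds x cryptos, per variant.

    rows: iterable of (variant, crypto, fold_id, f1_buy)
    pos_rate: dict mapping (variant, crypto) -> label positive rate
    """
    cells = defaultdict(list)
    for variant, crypto, _fold_id, f1_buy in rows:
        cells[variant].append(f1_buy - const_f1(pos_rate[(variant, crypto)]))
    return {variant: median(vals) for variant, vals in cells.items()}
```

As a rough illustration with figures quoted in this plan: a pos_rate near 0.25 gives a const_F1 of about 0.40, so a raw f1_buy of 0.40 on those labels would correspond to an advantage of roughly zero.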
8. Follow-ups gated on this run
- If lock → update `base_env`, re-run a short validation FTF with a secondary factor (candidate next: overfit-reduction factor `hpo_regularization_band` {loose/tight}) to confirm the locked PTE is robust to model complexity changes.
- If not-lock or pivot → AAVEUSDC audit, then overfit-reduction on anchor.
- Either way → extend `report_pdf.py` to compute and display `advantage` alongside `f1_buy` by default. Raw `f1_buy` without the pos_rate reference is misleading and we relearned this twice already.
9. Timeline
| Step | ETA |
|---|---|
| PR #630 merged | today, post-CR |
| Console flip if needed (operator) | today |
| Trigger FTF w/ `factor=pte_envelope` | today |
| Run completes | +8h wall |
| Analysis + decision | +9h total |
10. Stories (retro-registered in OP, 2026-06-09)
This Epic (experiment of 2026-04-23/24) had never been tracked in OpenProject. Registered retroactively: Epic wp#258 (GH #1146), parent Need CVN-N001.
| Story | Title | GH · OP | Status |
|---|---|---|---|
| CVN-N001-EC-S01 | `pte_envelope` factor: coupled (SL,TP) sweep impl | #1147 · wp#259 | Closed (PR #630 merged) |
| CVN-N001-EC-S02 | sweep run 11 variants + LOCK/NOT-LOCK/PIVOT decision | #1148 · wp#260 | Closed (decision: PTE ATR0.5_1.5_H4, ~2026-04-24) |
§8 follow-ups not pursued (never started; the program pivoted to the ML_USELESS freeze) and not created as Stories: `advantage` display in `report_pdf.py` · validation FTF `hpo_regularization_band` / AAVEUSDC audit. Reopen via a new Story if PTE work resumes.