Track 5 — Label smoothing results & gate decision
Story : CVN-N001-EE-S01 (wp#40) — Track 5 of F1_buy boost mission
Date : 2026-04-29
Authors : Dominique (operator) + Claude
Sweep run_id : ftf_20260428_181144_3163de
Verdict : ABANDON label_smoothing variants (mild + aggressive) — both regress f1_buy on 5/5 cryptos with statistical significance
Lock decision : factor_label_smoothing=none remains baseline. No Console flip.
1. Sweep state
Per psql query on finetune_results (table backing the FTF stats stack), the 2026-04-28 sweep produced :
| Factor | Variant | Useful rows (n_trades > 0) | Coverage (cryptos × folds) |
|---|---|---|---|
| label_smoothing | none (baseline) | 34 | 5 cryptos × 5 folds + retries |
| label_smoothing | mild | 25 | 5 × 5 |
| label_smoothing | aggressive | 25 | 5 × 5 |
| cleanlab | off (baseline) | 35 partial | 3 cryptos × 5 folds + 2 partial |
| cleanlab | filter | 0 | runtime hang + per-class cap bug |
| cleanlab | reweight | 0 | same |
The label_smoothing branch is complete : there is enough data for the per-track gate decision now. The cleanlab branch is separately blocked by CVN-N011-EA-S08 (per-class cap fix) and is to be re-run post-merge.
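The "Useful rows" column above is a filter-and-count over finetune_results. A minimal sketch of that count, assuming the rows have already been fetched as dicts; the keys `factor`, `variant` and `n_trades` are assumed for illustration and may not match the real table schema, and the rows below are toy values, not sweep data.

```python
from collections import Counter

def count_useful_rows(rows):
    """Count rows with n_trades > 0 per (factor, variant).

    `rows` is an iterable of dicts; the key names are assumptions,
    not taken from the real finetune_results schema.
    """
    counts = Counter()
    for row in rows:
        if row["n_trades"] > 0:
            counts[(row["factor"], row["variant"])] += 1
    return counts

# Toy rows for illustration only (not sweep data).
example_rows = [
    {"factor": "label_smoothing", "variant": "none", "n_trades": 36},
    {"factor": "label_smoothing", "variant": "mild", "n_trades": 0},
    {"factor": "label_smoothing", "variant": "mild", "n_trades": 31},
]
useful = count_useful_rows(example_rows)
```

Rows with zero trades are excluded because they carry no trading metrics to compare.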
2. Per-crypto means (5 folds each, baseline = none)
| Crypto | Variant | mean f1_buy | mean sortino | mean return | mean trades | mean ECE |
|---|---|---|---|---|---|---|
| AAVEUSDC | none | 0.4271 | 0.075 | +0.31 | 36 | 0.0363 |
| AAVEUSDC | mild | 0.3949 | 0.033 | -1.19 | 31 | 0.0000 ⚠ |
| AAVEUSDC | aggressive | 0.3644 | 0.108 | +1.50 | 32 | 0.0000 ⚠ |
| ARBUSDC | none | 0.4436 | 3.252 | +68.52 | 70 | 0.0385 |
| ARBUSDC | mild | 0.3918 | 3.298 | +65.95 | 59 | 0.0000 ⚠ |
| ARBUSDC | aggressive | 0.3783 | 2.472 | +51.55 | 46 | 0.0000 ⚠ |
| LDOUSDC | none | 0.3943 | 2.070 | +50.55 | 34 | 0.0500 |
| LDOUSDC | mild | 0.3476 | 1.996 | +53.97 | 25 | 0.0000 ⚠ |
| LDOUSDC | aggressive | 0.2994 | 2.071 | +48.75 | 29 | 0.0000 ⚠ |
| OPUSDC | none | 0.3806 | 1.993 | +37.56 | 72 | 0.0645 |
| OPUSDC | mild | 0.2773 | 2.004 | +32.24 | 42 | 0.0000 ⚠ |
| OPUSDC | aggressive | 0.3111 | 1.391 | +27.08 | 35 | 0.0000 ⚠ |
| UNIUSDC | none | 0.4120 | 1.997 | +46.49 | 64 | 0.0396 |
| UNIUSDC | mild | 0.3414 | 1.372 | +27.63 | 29 | 0.0000 ⚠ |
| UNIUSDC | aggressive | 0.3362 | 1.479 | +29.38 | 22 | 0.0000 ⚠ |
⚠ ECE anomaly : every soft-label row reads exactly 0.00000 (bit-for-bit zero), whereas baseline rows show plausible values. This is tracked as a separate bug, CVN-N011-EA-S09 (wp#87 / #770). It doesn't change the gate decision (the f1_buy lift criterion is the dominant signal), but it means the ECE_HOLD ≤ baseline + 0.01 gate is a no-op for soft variants until S09 lands.
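For reference, a sketch of the standard binned ECE, to show why an exact zero on every row is implausible: it requires accuracy to equal mean confidence in every occupied bin. This is the textbook definition, not the pipeline's implementation (which is not shown in this dossier), and `ece_is_suspicious` is a hypothetical guard, not an existing function.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

def ece_is_suspicious(ece):
    """Hypothetical guard: flag an exactly-zero ECE for manual review."""
    return ece == 0.0

# Toy predictions for illustration only.
toy_ece = expected_calibration_error([0.9, 0.8, 0.6, 0.7], [True, True, False, True])
```

A guard of this kind would surface the anomaly at logging time instead of letting a zero flow silently into the gate.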
3. Per-fold paired deltas (variant - none)
Statistical analysis on n=25 pairs per variant (5 cryptos × 5 folds). All metrics are paired : same (crypto, fold) seed, only the variant differs.
| Variant | mean Δf1_buy | mean Δsortino | mean Δreturn | Cohen's d | CI95 bootstrap (n=10000) |
|---|---|---|---|---|---|
| mild | -0.0722 | -0.002 | n/a | -1.11 (large) | [-0.098, -0.049] excludes 0 |
| aggressive | -0.0849 | -0.238 | n/a | -1.82 (very large) | [-0.103, -0.067] excludes 0 |
Interpretation : both variants have a highly significant negative effect on f1_buy. Each CI95 lies entirely below 0, i.e. on the wrong side, and |Cohen's d| > 1 is beyond the conventional 0.8 threshold for a large effect. This is statistically conclusive evidence of regression, not noise.
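The two statistics in the table can be sketched as follows. The dossier does not say which Cohen's d formula or bootstrap flavour the FTF stats stack uses, so this assumes the paired form (mean of deltas over their sample standard deviation) and a percentile bootstrap on the mean delta. The deltas below are toy values, not the sweep's.

```python
import random
import statistics

def paired_cohens_d(deltas):
    """Paired effect size: mean(delta) / sample standard deviation of delta."""
    return statistics.mean(deltas) / statistics.stdev(deltas)

def bootstrap_ci95(deltas, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI95 for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper

# Toy paired deltas (variant - none), for illustration only.
toy_deltas = [-0.05, -0.09, -0.07, -0.10, -0.06, -0.08]
d = paired_cohens_d(toy_deltas)
ci_low, ci_high = bootstrap_ci95(toy_deltas)
```

A CI whose upper bound is below 0, as in both rows of the table, means the regression survives resampling of the (crypto, fold) pairs.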
4. Gate evaluation per F1_BUY_BOOST_PLAN.md §6
| Criterion | mild | aggressive |
|---|---|---|
| f1_buy lift ≥ +0.015 with CI95 excluding 0 | ❌ Δ = -0.072, CI [-0.098, -0.049] (significant in WRONG direction) | ❌ Δ = -0.085, CI [-0.103, -0.067] |
| ≥ 4/5 cryptos improve f1_buy individually | ❌ 0/5 | ❌ 0/5 |
| Cohen's d ≥ 0.3 | ❌ -1.11 | ❌ -1.82 |
| Story-specific Δ f1_buy ≥ +0.02 | ❌ | ❌ |
| sortino ≥ baseline | ⚠ ARBUSDC slightly improves (+0.42), 4/5 regress | ❌ aggressive 4/5 regress |
| expectancy_net ≥ baseline | ❌ | ❌ |
| max_drawdown ≤ baseline + 1 % | (not analysed : moot given f1_buy gate fail) | (same) |
| ECE_HOLD ≤ baseline + 0.01 | ⚠ Cannot evaluate (S09 bug) | ⚠ Same |
| ≥ 50 BUY trades / fold | ❌ Most cryptos < 50 | ❌ Most cryptos < 50 |
Verdict per criterion : every primary criterion fails for both variants. No further analysis warranted.
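The three f1_buy criteria from the table reduce to a few comparisons. A minimal sketch, with thresholds and variant statistics copied from the tables above; `VariantStats` and `primary_gate` are illustrative names, not part of the FTF stats stack.

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    mean_delta_f1_buy: float
    ci95: tuple            # (lower, upper) bounds on the mean delta f1_buy
    cohens_d: float
    cryptos_improved: int  # how many of the 5 cryptos improve f1_buy

def primary_gate(stats, min_lift=0.015, min_d=0.3, min_cryptos=4):
    """Evaluate the three f1_buy criteria and return (passed, per-check detail)."""
    lower, _upper = stats.ci95
    checks = {
        # A positive lift whose CI excludes 0 needs the lower bound above 0.
        "lift": stats.mean_delta_f1_buy >= min_lift and lower > 0,
        "per_crypto": stats.cryptos_improved >= min_cryptos,
        "effect_size": stats.cohens_d >= min_d,
    }
    return all(checks.values()), checks

mild = VariantStats(-0.0722, (-0.098, -0.049), -1.11, 0)
aggressive = VariantStats(-0.0849, (-0.103, -0.067), -1.82, 0)
mild_passed, mild_checks = primary_gate(mild)
aggressive_passed, aggressive_checks = primary_gate(aggressive)
```

Both variants fail all three checks, which is why the secondary criteria were not analysed further.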
5. Why label smoothing fails on this data
Hypotheses worth recording (not actionable, but informs future work) :
- Already-balanced training via class_balancing : CVN_CLASS_BALANCING=1 (active in the baseline) already weights BUY samples up. Adding label smoothing on top dilutes the BUY signal further → model becomes too conservative → fewer BUY signals → lower f1_buy.
- Trade signal is rare-event learning, not classification confidence calibration : Müller (2019) results assume a high-volume, well-balanced classification task. On rare-event trading signals, the optimization target isn't to "soften overconfident predictions" but to "not miss the rare positive" ; these point in opposite directions.
- Calibration was already isotonic : production already applies isotonic calibration post-train. Adding smoothing pre-train + calibration post-train double-corrects → over-smoothing.
These are post-hoc rationalizations, not pre-registered hypotheses, so they don't carry weight for decisions, but they're a useful prompt for the joint Track 6 (focal loss) decision : if focal loss also smooths by design, similar caution applies.
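To make the "dilution" in the first hypothesis concrete, here is the usual label smoothing transform, (1 - ε) · one_hot + ε / K, applied to a BUY target. The class count (3), the BUY index and ε = 0.1 are assumptions for illustration; the dossier does not give the ε used by the mild and aggressive variants.

```python
def smooth_one_hot(class_index, n_classes, epsilon):
    """Return (1 - epsilon) * one_hot + epsilon / n_classes as a list."""
    uniform = epsilon / n_classes
    return [
        (1.0 - epsilon) + uniform if i == class_index else uniform
        for i in range(n_classes)
    ]

# Hypothetical setup: 3 classes with BUY at index 0, epsilon = 0.1.
buy_target = smooth_one_hot(0, 3, 0.1)
```

The BUY target drops from 1.0 to about 0.933, with the remainder spread over the other classes. This only illustrates the mechanism the hypothesis refers to; it does not confirm the hypothesis.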
6. Decisions
6.1 Lock variant
No lock. Keep factor_label_smoothing=none as baseline. Console state unchanged.
6.2 Cleanlab branch — also ABANDONED
Operator re-triggered the cleanlab variants on 2026-04-29 16:34 UTC after S08 (class-aware cap, PR #769 squash 9ff3966e) + S10 (gRPC fork deadlock fix, PR #777 squash 0f6220fc) + sympy hotfix (PR #775 squash 510b10db) had all landed and deployed.
Run : ftf_20260429_163445_5a99ff_ATR0.5_1.5_H4, status failed. 76 rows were logged in finetune_results : the 75 expected from the 5 cryptos × 5 folds × 3 variants (off, filter, reweight) matrix, plus 1 fold retry. 75 are useful after excluding 1 training_failed row on UNIUSDC fold 3 (off baseline), an unrelated cluster transient.
Stats vs off baseline (paired by crypto × fold, n=24 after excluding the failed fold pair) :
| Variant | mean Δf1_buy | CI95 | Cohen's d | BH p (m=2) |
|---|---|---|---|---|
| filter | -0.0811 | [-0.109, -0.057] | -1.21 | 4.87e-06 |
| reweight | -0.0762 | [-0.099, -0.056] | -1.38 | 1.36e-06 |
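The "BH p (m=2)" column refers to Benjamini-Hochberg adjustment over the two cleanlab comparisons. A sketch of the standard step-up adjustment follows. The raw p-values and the underlying test are not given in this dossier, so the inputs below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for two comparisons (m = 2).
adjusted = benjamini_hochberg([0.01, 0.04])
```

With m = 2 the smaller raw p-value is at most doubled and the larger is unchanged, so the correction cannot overturn p-values of the order reported in the table.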
6 gates verdict (per F1 plan §6) — both variants fail 4 of 6 :
- F1_buy gate : FAIL × 2 (CI95 excludes 0 in wrong direction)
- Joint metric : FAIL × 2 (Δsortino negative on both, -0.19 / -0.34)
- Stability : PASS × 2 (max var 0.025 / 0.017)
- Per-asset : FAIL × 2 (0/5 cryptos improve on either variant)
- Sample size : FAIL × 2 (mean n_trades 35.5 / 31.4 vs the 50-trade minimum ; cleanlab drops samples by design, which compounds the BUY-trade scarcity)
- MLOps : PASS × 2
Decision : ABANDON for both filter and reweight. The S08 per-class cap fix correctly bounded the BUY drop to ≤ 5 % (visible in production logs buy_drop_pct=4.99 ≤ cap_pct=5.0) — the cap works as designed. But the cap working did NOT make cleanlab a positive intervention on this data : the dropped samples (whatever their distribution) cost more f1_buy than the noise removal earns.
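For readers unfamiliar with the idea, a class-aware cap limits how many flagged samples may be dropped from each class, so that a rare class such as BUY cannot lose more than cap_pct % of its rows. The sketch below is an illustration of that idea only, not the S08 implementation; the function name, the inputs and the toy labels are all assumptions.

```python
from collections import Counter

def cap_drops_per_class(labels, drop_candidates, cap_pct=5.0):
    """Keep at most cap_pct % of each class among the samples to drop.

    `drop_candidates` is a list of sample indices, most suspect first.
    """
    class_sizes = Counter(labels)
    budget = {cls: int(size * cap_pct / 100.0) for cls, size in class_sizes.items()}
    kept_drops = []
    for idx in drop_candidates:
        cls = labels[idx]
        if budget[cls] > 0:
            budget[cls] -= 1
            kept_drops.append(idx)
    return kept_drops

# Toy data: 40 BUY rows then 60 HOLD rows; ten BUY rows and two HOLD rows flagged.
toy_labels = ["BUY"] * 40 + ["HOLD"] * 60
flagged = list(range(10)) + [50, 51]
dropped = cap_drops_per_class(toy_labels, flagged)
buy_drop_pct = 100.0 * sum(1 for i in dropped if toy_labels[i] == "BUY") / 40
```

In this toy case only 2 of the 10 flagged BUY rows are dropped, which keeps buy_drop_pct at the 5 % cap.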
Per-crypto f1_buy summary (all variants regress vs off) :
| Crypto | off | filter | reweight |
|---|---|---|---|
| AAVEUSDC | 0.4586 | 0.4141 | 0.3775 |
| ARBUSDC | 0.4552 | 0.3904 | 0.4121 |
| LDOUSDC | 0.4124 | 0.2681 | 0.2952 |
| OPUSDC | 0.3891 | 0.2973 | 0.3160 |
| UNIUSDC | 0.4514 | 0.3548 | 0.3566 |
LDOUSDC shows the worst regression (~14 pts on filter). On no crypto does either cleanlab variant even match off.
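The two observations above can be checked directly from the table. The sketch uses the per-crypto means exactly as tabulated; these are means of means, so they will not reproduce the paired n=24 statistics.

```python
# Per-crypto mean f1_buy, copied from the table above.
F1_BUY = {
    "AAVEUSDC": {"off": 0.4586, "filter": 0.4141, "reweight": 0.3775},
    "ARBUSDC": {"off": 0.4552, "filter": 0.3904, "reweight": 0.4121},
    "LDOUSDC": {"off": 0.4124, "filter": 0.2681, "reweight": 0.2952},
    "OPUSDC": {"off": 0.3891, "filter": 0.2973, "reweight": 0.3160},
    "UNIUSDC": {"off": 0.4514, "filter": 0.3548, "reweight": 0.3566},
}

# Delta of each variant against the off baseline, per crypto.
deltas = {
    (crypto, variant): round(scores[variant] - scores["off"], 4)
    for crypto, scores in F1_BUY.items()
    for variant in ("filter", "reweight")
}
worst = min(deltas, key=deltas.get)                       # most negative delta
matched_or_better = [k for k, v in deltas.items() if v >= 0]
```

Here `worst` is the LDOUSDC filter cell (delta of -0.1443) and `matched_or_better` is empty.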
6.3 Joint variant
Skipped. A joint mild × cleanlab=* variant has no reasonable path to clearing the gate, given that mild alone regresses f1_buy by about 0.07 absolute (mean Δ = -0.0722). The joint sweep (25 additional rows) would be wasted compute.
6.4 OP Story closure
wp#40 (CVN-N001-EE-S01) → status Closed with verdict ABANDONED (both branches : label_smoothing + cleanlab). Comment links this dossier + the S08 / S10 / sympy fix PRs that unblocked the cleanlab re-run.
6.5 Follow-ups
- CVN-N011-EA-S09 : fix ECE returning 0 silently on soft-label rows (P2)
- CVN-N011-EA-S08 : class-aware cap fix landed (PR #769 merged 2026-04-29, squash 9ff3966e)
- CVN-N011-EA-S10 : gRPC fork deadlock blocks HPO/FTF sweeps (P1, #774) ; landed before the 2026-04-29 cleanlab re-run (PR #777 merged, squash 0f6220fc)
7. Linked context
- Plan dossier : 2026-04-28-track5-label-smoothing-plan.md
- Hotfix v1 (DataFrame coercion) : 2026-04-28-track5-hotfix-dataframe-coercion.md (PR #752)
- Hotfix v2 (type preservation) : 2026-04-28-track5-hotfix-v2-type-preservation.md (PR #754)
- Bug #1 calibration refactor : 2026-04-28-track5-bug1-calibration-refactor-plan.md (PR #765)
- Cleanlab class-aware cap : PR #769 (Story CVN-N011-EA-S08)
- Operations incident log : OPERATIONS.md §17.1, §17.2, §17.3
- Mission overview : ML Boost — F1_buy 75 mission