Track 5: Label smoothing results & gate decision

- Story: CVN-N001-EE-S01 (wp#40), Track 5 of the F1_buy boost mission
- Date: 2026-04-29
- Authors: Dominique (operator) + Claude
- Sweep run_id: ftf_20260428_181144_3163de
- Verdict: ABANDON the label_smoothing variants (mild and aggressive). Both regress f1_buy on 5/5 cryptos with statistical significance.
- Lock decision: factor_label_smoothing=none remains the baseline. No Console flip.

1. Sweep state

Per a psql query on finetune_results (the table backing the FTF stats stack), the 2026-04-28 sweep produced:

| Factor | Variant | Useful rows (n_trades > 0) | Coverage (cryptos × folds) |
|---|---|---|---|
| label_smoothing | none (baseline) | 34 | 5 cryptos × 5 folds + retries |
| label_smoothing | mild | 25 | 5 × 5 |
| label_smoothing | aggressive | 25 | 5 × 5 |
| cleanlab | off (baseline) | 35 (partial) | 3 cryptos × 5 folds + 2 partial |
| cleanlab | filter | 0 | runtime hang + per-class cap bug |
| cleanlab | reweight | 0 | same as filter |

label_smoothing is complete: there is now enough data for the per-track gate decision. The cleanlab branch was blocked separately by CVN-N011-EA-S08 (per-class cap fix) and was to be re-run after that fix merged.
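The "useful rows" column in the table above can be reproduced from exported rows with a few lines of Python. This is a minimal sketch, assuming each finetune_results row has been exported as a dict with the hypothetical keys `factor`, `variant` and `n_trades`; the real column names are not given in this dossier.

```python
from collections import Counter


def useful_rows(rows):
    """Count rows with n_trades > 0, grouped by (factor, variant).

    Assumes each exported row is a dict with the hypothetical keys
    "factor", "variant" and "n_trades"; the real finetune_results
    column names may differ.
    """
    counts = Counter()
    for row in rows:
        # Rows with zero trades are not counted as useful in the sweep-state table.
        if row["n_trades"] > 0:
            counts[(row["factor"], row["variant"])] += 1
    return counts


# Illustrative rows only, not real sweep data.
rows = [
    {"factor": "label_smoothing", "variant": "mild", "n_trades": 31},
    {"factor": "label_smoothing", "variant": "mild", "n_trades": 0},
    {"factor": "cleanlab", "variant": "filter", "n_trades": 0},
]
counts = useful_rows(rows)
```

A `Counter` returns 0 for missing keys, so a variant with no useful rows (like cleanlab filter above) reads as 0 rather than raising.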

2. Per-crypto means (5 folds each, baseline = none)

| Crypto | Variant | mean f1_buy | mean sortino | mean return | mean trades | mean ECE |
|---|---|---|---|---|---|---|
| AAVEUSDC | none | 0.4271 | 0.075 | +0.31 | 36 | 0.0363 |
| AAVEUSDC | mild | 0.3949 | 0.033 | -1.19 | 31 | 0.0000 ⚠ |
| AAVEUSDC | aggressive | 0.3644 | 0.108 | +1.50 | 32 | 0.0000 ⚠ |
| ARBUSDC | none | 0.4436 | 3.252 | +68.52 | 70 | 0.0385 |
| ARBUSDC | mild | 0.3918 | 3.298 | +65.95 | 59 | 0.0000 ⚠ |
| ARBUSDC | aggressive | 0.3783 | 2.472 | +51.55 | 46 | 0.0000 ⚠ |
| LDOUSDC | none | 0.3943 | 2.070 | +50.55 | 34 | 0.0500 |
| LDOUSDC | mild | 0.3476 | 1.996 | +53.97 | 25 | 0.0000 ⚠ |
| LDOUSDC | aggressive | 0.2994 | 2.071 | +48.75 | 29 | 0.0000 ⚠ |
| OPUSDC | none | 0.3806 | 1.993 | +37.56 | 72 | 0.0645 |
| OPUSDC | mild | 0.2773 | 2.004 | +32.24 | 42 | 0.0000 ⚠ |
| OPUSDC | aggressive | 0.3111 | 1.391 | +27.08 | 35 | 0.0000 ⚠ |
| UNIUSDC | none | 0.4120 | 1.997 | +46.49 | 64 | 0.0396 |
| UNIUSDC | mild | 0.3414 | 1.372 | +27.63 | 29 | 0.0000 ⚠ |
| UNIUSDC | aggressive | 0.3362 | 1.479 | +29.38 | 22 | 0.0000 ⚠ |

⚠ ECE anomaly: every soft-label row reads exactly 0.00000 (bit-for-bit zero), whereas baseline rows show plausible values (0.036 to 0.065). This is tracked as a separate bug, CVN-N011-EA-S09 (wp#87 / #770). It does not change the gate decision, since the f1_buy regression is the dominant signal, but it means the ECE_HOLD ≤ baseline + 0.01 gate is a no-op for soft-label variants until S09 lands.
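For reference, a standard binned ECE for a single class rarely lands on exactly 0.0 with real predictions, which is why the all-zero soft-label rows read as a bug rather than perfect calibration. The sketch below is a generic class-wise ECE plus a plausibility guard; it is not the FTF stack's implementation, the S09 root cause is not known here, and `min_samples` is an arbitrary illustrative threshold.

```python
def class_ece(probs, labels, target, n_bins=10):
    """Binned expected calibration error for a single class (e.g. HOLD).

    probs:  predicted probability of `target` for each sample, in [0, 1]
    labels: hard true label for each sample
    """
    n = len(probs)
    if n == 0 or n != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y == target))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)      # mean predicted probability
            freq = sum(hit for _, hit in b) / len(b)  # observed frequency of target
            ece += len(b) / n * abs(freq - conf)
    return ece


def assert_plausible_ece(ece, n_samples, min_samples=100):
    """Flag an exact-zero ECE on a non-trivial sample as a likely bug.

    min_samples is an arbitrary illustrative threshold.
    """
    if n_samples >= min_samples and ece == 0.0:
        raise RuntimeError("ECE is exactly 0.0: likely a computation bug, not perfect calibration")
```

A guard like `assert_plausible_ece` would have turned the silent zeros into a visible failure at metric-logging time.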

3. Per-fold paired deltas (variant - none)

Statistical analysis on n=25 pairs per variant (5 cryptos × 5 folds). All metrics are paired: each pair shares the same (crypto, fold) and seed, and only the variant differs.

| Variant | mean Δf1_buy | mean Δsortino | mean Δreturn | Cohen's d | CI95 of Δf1_buy (bootstrap, n=10000) |
|---|---|---|---|---|---|
| mild | -0.0722 | -0.002 | n/a | -1.11 (large) | [-0.098, -0.049], excludes 0 |
| aggressive | -0.0849 | -0.238 | n/a | -1.82 (very large) | [-0.103, -0.067], excludes 0 |

Interpretation: both variants have a highly significant negative effect on f1_buy. Each CI95 lies entirely below 0, and both |d| values exceed 0.8, the conventional threshold for a large effect. This is strong statistical evidence of a real regression, not noise.
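The paired statistics above can be sketched with the standard library alone. This assumes Cohen's d is the paired variant d_z (mean of the deltas divided by their standard deviation) and a percentile bootstrap over pairs; the dossier does not state which definitions the FTF stats stack uses, so the numbers may not match it exactly.

```python
import random
import statistics


def paired_effect(deltas, n_boot=10_000, seed=0):
    """Mean, Cohen's d_z and percentile-bootstrap CI95 of paired deltas.

    deltas: one (variant - baseline) f1_buy value per (crypto, fold) pair.
    d_z = mean(deltas) / sd(deltas) is one common paired effect size; it is
    an assumption that this matches the definition used in the FTF stack.
    """
    rng = random.Random(seed)
    n = len(deltas)
    mean = statistics.fmean(deltas)
    d_z = mean / statistics.stdev(deltas)
    # Resample pairs with replacement and record the mean of each resample.
    boot = sorted(statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    ci = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])
    return mean, d_z, ci
```

Resampling whole pairs, rather than each arm independently, keeps the shared (crypto, fold) seed structure intact, which is what makes the paired CI much tighter than an unpaired one.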

4. Gate evaluation per F1_BUY_BOOST_PLAN.md §6

| Criterion | mild | aggressive |
|---|---|---|
| f1_buy lift ≥ +0.015 with CI95 excluding 0 | ❌ Δ = -0.072, CI [-0.098, -0.049] (significant in the WRONG direction) | ❌ Δ = -0.085, CI [-0.103, -0.067] |
| ≥ 4/5 cryptos improve f1_buy individually | ❌ 0/5 | ❌ 0/5 |
| Cohen's d ≥ 0.3 | ❌ -1.11 | ❌ -1.82 |
| Story-specific Δf1_buy ≥ +0.02 | ❌ | ❌ |
| sortino ≥ baseline | ⚠ ARBUSDC slightly improves (+0.42); 4/5 regress | ❌ 4/5 regress |
| expectancy_net ≥ baseline | ❌ | ❌ |
| max_drawdown ≤ baseline + 1% | not analysed (moot given the f1_buy gate failure) | not analysed (same) |
| ECE_HOLD ≤ baseline + 0.01 | ⚠ cannot evaluate (S09 bug) | ⚠ cannot evaluate (S09 bug) |
| ≥ 50 BUY trades / fold | ❌ 4/5 cryptos average < 50 trades (ARBUSDC: 59) | ❌ 5/5 cryptos average < 50 trades |

Verdict per criterion: every primary criterion fails for both variants. No further analysis is warranted.
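The primary gates in the table above reduce to a handful of threshold checks. The sketch below encodes them with the thresholds listed in the table; the `VariantStats` container and its field names are hypothetical, and the sample-size gate is simplified to per-crypto mean trade counts because per-fold counts are not tabulated here.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    mean_delta_f1: float       # mean paired Δf1_buy (variant - baseline)
    ci95: tuple                # (low, high) bootstrap CI95 of Δf1_buy
    cohens_d: float
    cryptos_improved: int      # cryptos whose f1_buy beats the baseline
    n_cryptos: int
    mean_trades: list          # mean trades per crypto (section 2 table)


def primary_gates(s):
    """Evaluate the primary f1_buy gates with the thresholds from the gate table."""
    return {
        # Lift must be large enough AND the whole CI must sit above 0.
        "f1_lift": s.mean_delta_f1 >= 0.015 and s.ci95[0] > 0,
        # At least 4 of 5 cryptos must improve (integer form of >= 80%).
        "per_asset": 5 * s.cryptos_improved >= 4 * s.n_cryptos,
        "effect_size": s.cohens_d >= 0.3,
        "story_lift": s.mean_delta_f1 >= 0.02,
        # Simplification: uses per-crypto means, not per-fold counts.
        "sample_size": all(t >= 50 for t in s.mean_trades),
    }


# Figures for the mild variant, taken from the tables in sections 2 and 3.
mild = VariantStats(-0.0722, (-0.098, -0.049), -1.11, 0, 5, [31, 59, 25, 42, 29])
gates = primary_gates(mild)
```

Feeding in the mild figures reproduces the table: every primary gate evaluates to False.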

5. Why label smoothing fails on this data

Hypotheses worth recording (not actionable, but they inform future work):

- Training is already balanced via class_balancing. CVN_CLASS_BALANCING=1 (active in the baseline) already up-weights BUY samples. Adding label smoothing on top dilutes the BUY signal further → the model becomes too conservative → fewer BUY signals → lower f1_buy.
- The trade signal is a rare-event learning problem, not a confidence-calibration problem. The results in Müller et al. (2019) assume a high-volume, well-balanced classification task. On rare-event trading signals, the goal is not to "soften overconfident predictions" but to "not miss the rare positive", and these two goals point in opposite directions.
- Calibration was already isotonic. Production already applies isotonic calibration after training, so adding smoothing before training plus calibration after it double-corrects → over-smoothing.
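The first hypothesis can be made concrete with the standard label-smoothing formula, y_smooth = (1 - eps) * y + eps / K. The sketch below is illustrative only: the eps values behind "mild" and "aggressive", the number of classes K and the BUY class weight are not stated in this dossier, so K = 3, eps = 0.1 and weight 2.0 are assumptions.

```python
def smooth(onehot, eps):
    """Standard label smoothing: (1 - eps) * y + eps / K over K classes."""
    k = len(onehot)
    return [(1.0 - eps) * y + eps / k for y in onehot]


def off_target_mass(eps, k, class_weight=1.0):
    """Weighted target mass that one positive sample pushes onto the other classes.

    With a per-class loss weight w, the whole per-sample loss is scaled by w,
    so up-weighting BUY also up-weights the smoothed mass pointing away from BUY.
    """
    return class_weight * eps * (k - 1) / k


# Illustrative values only: K = 3 classes, eps = 0.1, BUY weight 2.0.
buy_target = smooth([1.0, 0.0, 0.0], 0.1)
pushed_away = off_target_mass(0.1, 3, class_weight=2.0)
```

With these assumed values the BUY target drops from 1.0 to about 0.933, and the class weight doubles the mass each BUY sample pushes toward the other classes, which is the "dilution" mechanism the first hypothesis describes.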

These are post-hoc rationalizations, not pre-registered hypotheses, so they carry no weight for decisions. They are still a useful prompt for the joint Track 6 (focal loss) decision: if focal loss also softens the training signal by design, similar caution applies.

6. Decisions

6.1 Lock variant

No lock. Keep factor_label_smoothing=none as baseline. Console state unchanged.

6.2 Cleanlab branch: also ABANDONED

The operator re-triggered the cleanlab variants on 2026-04-29 at 16:34 UTC, once three fixes had all landed and been deployed: S08 (class-aware cap, PR #769, squash 9ff3966e), S10 (gRPC fork deadlock fix, PR #777, squash 0f6220fc) and the sympy hotfix (PR #775, squash 510b10db).

Run: ftf_20260429_163445_5a99ff_ATR0.5_1.5_H4, status failed. finetune_results holds 76 rows: the expected 75-row matrix (5 cryptos × 5 folds × 3 variants: off, filter, reweight) plus 1 fold retry. 75 rows are usable after excluding one training_failed row (UNIUSDC fold 3, off baseline), attributed to an unrelated cluster transient.

Stats vs the off baseline (paired by crypto × fold, n=24 after excluding the failed fold pair):

| Variant | mean Δf1_buy | CI95 | Cohen's d | BH-adjusted p (m=2) |
|---|---|---|---|---|
| `filter` | -0.0811 | [-0.109, -0.057] | -1.21 | 4.87e-06 |
| `reweight` | -0.0762 | [-0.099, -0.056] | -1.38 | 1.36e-06 |
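The BH column applies the Benjamini-Hochberg correction across the m=2 variant comparisons. Below is a minimal sketch of the standard step-up adjustment; the raw p-values and the test that produced them are not given in this dossier, so the usage inputs are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for (filter, reweight).
adj = bh_adjust([0.04, 0.01])
```

With m=2 the smaller p-value is doubled and the larger one kept, then capped so that adjusted values never decrease with rank; here the result is [0.04, 0.02].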

Six-gate verdict (per F1 plan §6): both variants fail 4 of 6 gates.

- F1_buy gate: FAIL × 2 (CI95 excludes 0 in the wrong direction)
- Joint metric: FAIL × 2 (Δsortino negative on both, -0.19 / -0.34)
- Stability: PASS × 2 (max var 0.025 / 0.017)
- Per-asset: FAIL × 2 (0/5 cryptos improve on either variant)
- Sample size: FAIL × 2 (mean n_trades 35.5 / 31.4 vs the 50-trade floor; cleanlab drops samples by design, which compounds the BUY-trade scarcity)
- MLOps: PASS × 2

Decision: ABANDON both filter and reweight. The S08 per-class cap fix correctly bounded the BUY drop to ≤ 5% (production logs show buy_drop_pct=4.99 ≤ cap_pct=5.0), so the cap works as designed. But a working cap did NOT make cleanlab a positive intervention on this data: the dropped samples, whatever their distribution, cost more f1_buy than the noise removal gains.
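A class-aware cap of this kind can be sketched as a per-class drop budget: among the samples flagged as likely label issues (for example by cleanlab's `find_label_issues`), drop the lowest-quality ones, but never more than `cap_pct` percent of any one class. This is a hypothetical illustration of the idea, not the code merged in PR #769; the function name, inputs and usage numbers are made up.

```python
import math
from collections import defaultdict


def capped_drop_indices(labels, flagged, scores, cap_pct):
    """Indices to drop, at most cap_pct % of each class.

    labels:  class label per sample
    flagged: True where the sample is a suspected label issue
    scores:  label-quality score per sample (lower = more suspect)
    """
    class_size = defaultdict(int)
    candidates = defaultdict(list)
    for i, y in enumerate(labels):
        class_size[y] += 1
        if flagged[i]:
            candidates[y].append(i)
    drop = []
    for y, idx in candidates.items():
        budget = math.floor(class_size[y] * cap_pct / 100.0)  # per-class cap
        idx.sort(key=lambda i: scores[i])  # most suspect first
        drop.extend(idx[:budget])
    return sorted(drop)


# Made-up example: 401 BUY and 599 HOLD samples, all flagged.
labels = ["BUY"] * 401 + ["HOLD"] * 599
drop = capped_drop_indices(labels, [True] * 1000, [i / 1000 for i in range(1000)], 5.0)
buy_drop_pct = round(100 * sum(labels[i] == "BUY" for i in drop) / len(labels[:401]), 2)
```

Flooring the budget keeps the realised drop at or just under the cap (20 of 401 BUY samples, about 4.99%), which is the shape of the bound the production logs report.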

Per-crypto f1_buy summary (both cleanlab variants regress vs off):

| Crypto | off | filter | reweight |
|---|---|---|---|
| AAVEUSDC | 0.4586 | 0.4141 | 0.3775 |
| ARBUSDC | 0.4552 | 0.3904 | 0.4121 |
| LDOUSDC | 0.4124 | 0.2681 | 0.2952 |
| OPUSDC | 0.3891 | 0.2973 | 0.3160 |
| UNIUSDC | 0.4514 | 0.3548 | 0.3566 |

LDOUSDC shows the worst regression (~14 points on filter). On no crypto does either cleanlab variant even match off.

6.3 Joint variant

Skipped. A joint mild × cleanlab=* variant has no reasonable path to clearing the gate: mild alone regresses f1_buy by 0.072 (about 7 points), and both cleanlab variants regress on their own as well. The joint sweep (25 additional rows) would be wasted compute.

6.4 OP Story closure

wp#40 (CVN-N001-EE-S01) → status Closed with verdict ABANDONED (both branches: label_smoothing and cleanlab). The comment links this dossier and the S08, S10 and sympy fix PRs (#769, #777, #775) that unblocked the cleanlab re-run.

6.5 Follow-ups

- CVN-N011-EA-S09: fix ECE silently returning 0 on soft-label rows (P2)
- CVN-N011-EA-S08: class-aware cap fix landed (PR #769, merged 2026-04-29, squash 9ff3966e)
- CVN-N011-EA-S10: gRPC fork deadlock blocking HPO/FTF sweeps (P1, #774); the fix landed before the 2026-04-29 cleanlab re-run (PR #777, squash 0f6220fc)

7. Linked context

- Plan dossier: 2026-04-28-track5-label-smoothing-plan.md
- Hotfix v1 (DataFrame coercion): 2026-04-28-track5-hotfix-dataframe-coercion.md (PR #752)
- Hotfix v2 (type preservation): 2026-04-28-track5-hotfix-v2-type-preservation.md (PR #754)
- Bug #1 calibration refactor: 2026-04-28-track5-bug1-calibration-refactor-plan.md (PR #765)
- Cleanlab class-aware cap: PR #769 (Story CVN-N011-EA-S08)
- Operations incident log: OPERATIONS.md §17.1, §17.2, §17.3
- Mission overview: ML Boost — F1_buy 75 mission