
Track 5 — Label smoothing results & gate decision

Story : CVN-N001-EE-S01 (wp#40), Track 5 of F1_buy boost mission
Date : 2026-04-29
Authors : Dominique (operator) + Claude
Sweep run_id : ftf_20260428_181144_3163de
Verdict : ABANDON label_smoothing variants (mild + aggressive) ; both regress f1_buy on 5/5 cryptos with statistical significance
Lock decision : factor_label_smoothing=none remains baseline. No Console flip.


1. Sweep state

Per psql query on finetune_results (table backing the FTF stats stack), the 2026-04-28 sweep produced :

| Factor | Variant | Useful rows (n_trades > 0) | Coverage (cryptos × folds) |
|---|---|---|---|
| label_smoothing | none (baseline) | 34 | 5 cryptos × 5 folds + retries |
| label_smoothing | mild | 25 | 5 × 5 |
| label_smoothing | aggressive | 25 | 5 × 5 |
| cleanlab | off (baseline) | 35 partial | 3 cryptos × 5 folds + 2 partial |
| cleanlab | filter | 0 | runtime hang + per-class cap bug |
| cleanlab | reweight | 0 | same |

label_smoothing complete : enough data for the per-track gate decision now. cleanlab branch separately blocked by CVN-N011-EA-S08 (per-class cap fix) — to be re-run post-merge.

2. Per-crypto means (5 folds each, baseline = none)

| Crypto | Variant | mean f1_buy | mean sortino | mean return | mean trades | mean ECE |
|---|---|---|---|---|---|---|
| AAVEUSDC | none | 0.4271 | 0.075 | +0.31 | 36 | 0.0363 |
| AAVEUSDC | mild | 0.3949 | 0.033 | -1.19 | 31 | 0.0000 |
| AAVEUSDC | aggressive | 0.3644 | 0.108 | +1.50 | 32 | 0.0000 |
| ARBUSDC | none | 0.4436 | 3.252 | +68.52 | 70 | 0.0385 |
| ARBUSDC | mild | 0.3918 | 3.298 | +65.95 | 59 | 0.0000 |
| ARBUSDC | aggressive | 0.3783 | 2.472 | +51.55 | 46 | 0.0000 |
| LDOUSDC | none | 0.3943 | 2.070 | +50.55 | 34 | 0.0500 |
| LDOUSDC | mild | 0.3476 | 1.996 | +53.97 | 25 | 0.0000 |
| LDOUSDC | aggressive | 0.2994 | 2.071 | +48.75 | 29 | 0.0000 |
| OPUSDC | none | 0.3806 | 1.993 | +37.56 | 72 | 0.0645 |
| OPUSDC | mild | 0.2773 | 2.004 | +32.24 | 42 | 0.0000 |
| OPUSDC | aggressive | 0.3111 | 1.391 | +27.08 | 35 | 0.0000 |
| UNIUSDC | none | 0.4120 | 1.997 | +46.49 | 64 | 0.0396 |
| UNIUSDC | mild | 0.3414 | 1.372 | +27.63 | 29 | 0.0000 |
| UNIUSDC | aggressive | 0.3362 | 1.479 | +29.38 | 22 | 0.0000 |

ECE anomaly : every soft-label row reads exactly 0.00000 (bit-for-bit zero) vs reasonable values on baseline. Documented as separate bug CVN-N011-EA-S09 (wp#87 / #770). Doesn't change the gate decision (the lift on f1_buy is the dominant signal) but means ECE_HOLD ≤ baseline + 0.01 gate would be a no-op for soft variants until S09 lands.
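For reference, ECE as usually defined is a bin-weighted gap between accuracy and mean confidence, so an exact 0 requires accuracy to equal mean confidence in every populated bin, which is implausible on real folds. The sketch below is a generic hard-label implementation, not the project's metric code ; the function name, the equal-width binning and the sample values are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic binned ECE: sum over bins of (n_bin / N) * |accuracy - mean confidence|.

    `confidences` are predicted probabilities for the chosen class,
    `correct` are booleans saying whether that prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions, NOT sweep data: 90 % confident, right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE = {ece:.4f}")
```

A value that is bit-for-bit 0.0 on every row is therefore more consistent with the metric not being computed on soft-label rows than with perfect calibration ; the actual cause is what S09 has to establish.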

3. Per-fold paired deltas (variant - none)

Statistical analysis on n=25 pairs per variant (5 cryptos × 5 folds). All metrics are paired : same (crypto, fold) seed, only the variant differs.

| Variant | mean Δf1_buy | mean Δsortino | mean Δreturn | Cohen's d | CI95 bootstrap (n=10000) |
|---|---|---|---|---|---|
| mild | -0.0722 | -0.002 | n/a | -1.11 (large) | [-0.098, -0.049] excludes 0 |
| aggressive | -0.0849 | -0.238 | n/a | -1.82 (very large) | [-0.103, -0.067] excludes 0 |

Interpretation : both variants have a highly significant negative effect on f1_buy. Each CI95 excludes 0 on the wrong (negative) side, and |d| > 1 falls in the largest effect-size category. This is statistically conclusive evidence of regression, not noise.
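The paired statistics above can be reproduced along the following lines. This is a minimal sketch, assuming Cohen's d is computed as the mean of the paired deltas over their sample standard deviation and that the CI95 is a percentile bootstrap on the mean ; the function names are hypothetical and the deltas are synthetic placeholders, not the sweep data.

```python
import random
import statistics


def paired_cohens_d(deltas):
    """Paired-samples Cohen's d: mean of the deltas over their sample std."""
    return statistics.mean(deltas) / statistics.stdev(deltas)


def bootstrap_ci95(deltas, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI95 for the mean of the paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the n pairs with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return low, high


# Synthetic per-(crypto, fold) deltas (variant minus none), n=25. NOT real results.
gen = random.Random(42)
deltas = [gen.gauss(-0.07, 0.06) for _ in range(25)]

d = paired_cohens_d(deltas)
ci_low, ci_high = bootstrap_ci95(deltas)
excludes_zero = ci_high < 0 or ci_low > 0
print(f"mean={statistics.fmean(deltas):+.4f}  d={d:+.2f}  "
      f"CI95=[{ci_low:+.3f}, {ci_high:+.3f}]  excludes 0: {excludes_zero}")
```

Resampling whole (crypto, fold) pairs keeps the pairing intact, which is why the deltas, not the raw per-variant scores, are the unit of resampling.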

4. Gate evaluation per F1_BUY_BOOST_PLAN.md §6

| Criterion | mild | aggressive |
|---|---|---|
| f1_buy lift ≥ +0.015 with CI95 excluding 0 | ❌ Δ = -0.072, CI [-0.098, -0.049] (significant in WRONG direction) | ❌ Δ = -0.085, CI [-0.103, -0.067] |
| ≥ 4/5 cryptos improve f1_buy individually | 0/5 | 0/5 |
| Cohen's d ≥ 0.3 | ❌ -1.11 | ❌ -1.82 |
| Story-specific Δ f1_buy ≥ +0.02 | | |
| sortino ≥ baseline | ⚠ ARBUSDC slightly improves (+0.42), 4/5 regress | ❌ aggressive 4/5 regress |
| expectancy_net ≥ baseline | | |
| max_drawdown ≤ baseline + 1 % | (not analysed ; moot given f1_buy gate fail) | (idem) |
| ECE_HOLD ≤ baseline + 0.01 | ⚠ Cannot evaluate (S09 bug) | ⚠ Same |
| ≥ 50 BUY trades / fold | ❌ Most cryptos < 50 | ❌ Most cryptos < 50 |

Verdict per criterion : every primary criterion fails for both variants. No further analysis warranted.
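The three f1_buy criteria are mechanical enough to encode directly. The sketch below is illustrative only : the `VariantStats` container and `primary_gates` function are hypothetical names, the thresholds are the ones quoted in the table above, and the inputs are the reported paired statistics.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    mean_delta_f1_buy: float
    ci95: tuple[float, float]   # bootstrap CI95 on mean delta f1_buy
    cohens_d: float
    cryptos_improved: int       # cryptos (out of 5) whose f1_buy improves


def primary_gates(s, min_lift=0.015, min_cryptos=4, min_d=0.3):
    """Evaluate the f1_buy gates; each entry is True when the gate passes."""
    ci_low, _ = s.ci95
    return {
        # A positive lift must clear the threshold AND have a CI95 above 0.
        "f1_buy_lift": s.mean_delta_f1_buy >= min_lift and ci_low > 0,
        "per_crypto": s.cryptos_improved >= min_cryptos,
        "effect_size": s.cohens_d >= min_d,
    }


# Values as reported in sections 3 and 4 of this dossier.
mild = VariantStats(-0.0722, (-0.098, -0.049), -1.11, 0)
aggressive = VariantStats(-0.0849, (-0.103, -0.067), -1.82, 0)

for name, stats in (("mild", mild), ("aggressive", aggressive)):
    gates = primary_gates(stats)
    verdict = "PASS" if all(gates.values()) else "FAIL"
    print(f"{name}: {gates} -> {verdict}")
```

Note that the CI test is one-sided on purpose : a CI95 that excludes 0 from below, as here, must not count as a pass.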

5. Why label smoothing fails on this data

Hypotheses worth recording (not actionable, but informs future work) :

  1. Already-balanced training via class_balancing : CVN_CLASS_BALANCING=1 (active in the baseline) already weights BUY samples up. Adding label smoothing on top dilutes the BUY signal further → model becomes too conservative → fewer BUY signals → lower f1_buy.
  2. Trade signal is rare-event learning, not classification confidence calibration : the Müller et al. (2019) results assume a high-volume, well-balanced classification task. On rare-event trading signals, the optimization target isn't to "soften overconfident predictions" but to "not miss the rare positive" ; these point in opposite directions.
  3. Calibration was already isotonic : production already applies isotonic calibration post-train. Adding smoothing pre-train + calibration post-train double-corrects → over-smoothing.

These are post-hoc rationalizations, not pre-registered hypotheses, so they don't carry weight for decisions, but they're a useful prompt for the joint Track 6 (focal loss) decision : if focal loss also smooths by design, similar caution applies.
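To make the "dilutes the BUY signal" argument in hypothesis 1 concrete, here is the standard label-smoothing target, (1 - ε) · one_hot + ε / K. This is a sketch under stated assumptions : the three-class BUY / HOLD / SELL layout and the ε values attached to mild and aggressive are hypothetical, since the dossier does not state them.

```python
def smooth_one_hot(class_index, n_classes, epsilon):
    """Standard label smoothing: (1 - eps) * one_hot + eps / K."""
    uniform = epsilon / n_classes
    return [
        (1.0 - epsilon) + uniform if k == class_index else uniform
        for k in range(n_classes)
    ]


CLASSES = ["BUY", "HOLD", "SELL"]  # assumed class layout
BUY = CLASSES.index("BUY")

# Epsilon values are placeholders, not the sweep's actual settings.
for variant, eps in (("none", 0.0), ("mild", 0.05), ("aggressive", 0.2)):
    target = smooth_one_hot(BUY, len(CLASSES), eps)
    formatted = ", ".join(f"{c}={p:.3f}" for c, p in zip(CLASSES, target))
    print(f"{variant:<10} eps={eps:.2f} -> {formatted}")
```

Every true BUY sample hands part of its target mass to HOLD and SELL, so the larger ε is, the weaker the gradient pushing the model toward a confident BUY.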

6. Decisions

6.1 Lock variant

No lock. Keep factor_label_smoothing=none as baseline. Console state unchanged.

6.2 Cleanlab branch — also ABANDONED

Operator re-triggered the cleanlab variants on 2026-04-29 16:34 UTC after S08 (class-aware cap, PR #769 squash 9ff3966e) + S10 (gRPC fork deadlock fix, PR #777 squash 0f6220fc) + sympy hotfix (PR #775 squash 510b10db) had all landed and deployed.

Run : ftf_20260429_163445_5a99ff_ATR0.5_1.5_H4, status failed. 76 rows were logged in finetune_results : the 5 cryptos × 5 folds × 3 variants [off, filter, reweight] matrix (75 expected) plus 1 fold retry. 75 rows are useful after excluding 1 training_failed row on UNIUSDC fold 3 (off baseline), an unrelated cluster transient.

Stats vs off baseline (paired by crypto × fold, n=24 after excluding the failed fold pair) :

| Variant | mean Δf1_buy | CI95 | Cohen's d | BH p (m=2) |
|---|---|---|---|---|
| filter | -0.0811 | [-0.109, -0.057] | -1.21 | 4.87e-06 |
| reweight | -0.0762 | [-0.099, -0.056] | -1.38 | 1.36e-06 |
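The "BH p (m=2)" column refers to Benjamini-Hochberg adjustment across the two cleanlab comparisons. The sketch below shows the generic step-up procedure ; the function name is hypothetical and the raw p-values are made-up inputs, since the dossier reports only the adjusted values.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted ascending; rank 1 is the smallest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for [filter, reweight]; NOT taken from the sweep.
raw = [0.01, 0.04]
print(benjamini_hochberg(raw))
```

With m=2 the correction is mild : the larger p-value is unchanged and the smaller one is at most doubled, so it cannot turn effects of this size into non-significant ones.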

6 gates verdict (per F1 plan §6) — both variants fail 4 of 6 :

  • F1_buy gate : FAIL × 2 (CI95 excludes 0 in wrong direction)
  • Joint metric : FAIL × 2 (Δsortino negative on both, -0.19 / -0.34)
  • Stability : PASS × 2 (max var 0.025 / 0.017)
  • Per-asset : FAIL × 2 (0/5 cryptos improve on either variant)
  • Sample size : FAIL × 2 (mean n_trades 35.5 / 31.4 vs the 50-trade minimum ; cleanlab drops samples by design, which compounds the BUY-trade scarcity)
  • MLOps : PASS × 2

Decision : ABANDON for both filter and reweight. The S08 per-class cap fix correctly bounded the BUY drop to ≤ 5 % (visible in production logs : buy_drop_pct=4.99 cap_pct=5.0), so the cap works as designed. But a working cap did NOT make cleanlab a positive intervention on this data : the dropped samples (whatever their distribution) cost more f1_buy than the noise removal earns.

Per-crypto f1_buy summary (all variants regress vs off) :

| Crypto | off | filter | reweight |
|---|---|---|---|
| AAVEUSDC | 0.4586 | 0.4141 | 0.3775 |
| ARBUSDC | 0.4552 | 0.3904 | 0.4121 |
| LDOUSDC | 0.4124 | 0.2681 | 0.2952 |
| OPUSDC | 0.3891 | 0.2973 | 0.3160 |
| UNIUSDC | 0.4514 | 0.3548 | 0.3566 |

LDO is the worst regression (~14 pts on filter). No crypto where cleanlab even matches off.
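The two claims above (worst regression on LDO filter, no crypto matching off) can be checked directly from the per-crypto summary. A small sketch using the table values as printed ; variable names are illustrative.

```python
# Per-crypto mean f1_buy, copied from the summary table above.
F1_BUY = {
    "AAVEUSDC": {"off": 0.4586, "filter": 0.4141, "reweight": 0.3775},
    "ARBUSDC":  {"off": 0.4552, "filter": 0.3904, "reweight": 0.4121},
    "LDOUSDC":  {"off": 0.4124, "filter": 0.2681, "reweight": 0.2952},
    "OPUSDC":   {"off": 0.3891, "filter": 0.2973, "reweight": 0.3160},
    "UNIUSDC":  {"off": 0.4514, "filter": 0.3548, "reweight": 0.3566},
}

# Delta of each cleanlab variant against the off baseline, per crypto.
deltas = {
    (crypto, variant): row[variant] - row["off"]
    for crypto, row in F1_BUY.items()
    for variant in ("filter", "reweight")
}

worst = min(deltas, key=deltas.get)
matches_or_beats_off = [key for key, delta in deltas.items() if delta >= 0]

print(f"worst regression: {worst} ({deltas[worst] * 100:+.1f} pts)")
print(f"cells matching or beating off: {len(matches_or_beats_off)} of {len(deltas)}")
```

These are differences of per-crypto means, so they will not average exactly to the paired mean Δf1_buy reported earlier, which is computed on n=24 fold pairs.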

6.3 Joint variant

Skipped. A joint mild × cleanlab=* variant has no reasonable path to clearing the gate, given that mild alone regresses f1_buy by about 7 points (mean Δf1_buy = -0.072). The joint sweep (additional 25 rows) would be wasted compute.

6.4 OP Story closure

wp#40 (CVN-N001-EE-S01) → status Closed with verdict ABANDONED (both branches : label_smoothing + cleanlab). Comment links this dossier + the S08 / S10 / sympy fix PRs that unblocked the cleanlab re-run.

6.5 Follow-ups

  • CVN-N011-EA-S09 — fix ECE returning 0 silently on soft-label rows (P2)
  • CVN-N011-EA-S08 — class-aware cap fix landed (PR #769 merged 2026-04-29 squash 9ff3966e)
  • CVN-N011-EA-S10 — gRPC fork deadlock blocks HPO/FTF sweeps (P1, #774) — landed before the 2026-04-29 cleanlab re-run (PR #777 merged squash 0f6220fc)

7. Linked context