Track 6 — Focal loss : results dossier (ABANDON)

Date : 2026-04-29
Story : CVN-N001-EE-S02 (OP wp#41)
Track : 6 of F1_BUY_BOOST_PLAN.md (focal loss for XGBoost binary classification)
Plan dossier : 2026-04-28-track6-focal-loss-plan.md (committee 4ef337af PASSED OK after v1+v2 REJECTED)
PR review dossier : 2026-04-28-track6-focal-loss-pr-review.md (committee 13fd89c9 PASSED OK)
MLOps readiness : mlops_readiness.md (6 sections complete per ADR-70)
Implementation PR : #767 (squash db33e7dd, 2026-04-29)
Sister hotfix PR : #775 (sympy missing, blocked first sweep ; see OPERATIONS.md §17.4)
FTF run : ftf_20260429_121011_282ec7_ATR0.5_1.5_H4 (125 rows, 0 errors, status=completed)

TL;DR — verdict

ABANDON for all 4 focal variants (mild_focus, standard, aggressive_focus, class_balanced). The hypothesis "focal loss concentrates training on the minority class and lifts f1_buy" is rejected by 5/5 cryptos with large effect size in the wrong direction (Cohen's d ∈ [-1.6, -1.2]).

none (γ=0, equivalent to standard binary cross-entropy) stays as the production champion. Console state unchanged ; no flip in ftf_config.base_env.

1. Sweep configuration

| Param | Value |
|---|---|
| Factor | focal_loss (Track 6) |
| Variants (5) | none (baseline, γ=0, α=0.5) ; mild_focus (γ=1, α=0.25) ; standard (γ=2, α=0.25) ; aggressive_focus (γ=4, α=0.25) ; class_balanced (γ=2, α=0.75) |
| Cryptos (5) | AAVEUSDC, ARBUSDC, LDOUSDC, OPUSDC, UNIUSDC (defi_top5) |
| Folds | 5 per crypto (purged k-fold per ADR-14) |
| Total useful rows | 5 × 5 × 5 = 125 ✅ (full coverage) |
| Power mode | standard (50 HPO trials per fold) |
| Strategy | ATR0.5_1.5_H4 |
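For reference, the (γ, α) pairs above parameterise the binary focal loss FL = -α_t (1 - p_t)^γ log(p_t). The sketch below is a minimal illustration, not the code from PR #767. It assumes the common convention in which α weights the positive (BUY) class and 1 - α the negative class, and the function names are hypothetical. It shows why the none variant (γ=0, α=0.5) behaves like binary cross-entropy: the loss is exactly half the BCE, so it has the same minimiser.

```python
import math

def focal_loss(p: float, y: int, gamma: float, alpha: float) -> float:
    """Binary focal loss for a single sample.

    p     : predicted probability of the positive (BUY) class, 0 < p < 1
    y     : true label (1 = BUY, 0 = not BUY)
    Assumption: alpha weights the positive class, (1 - alpha) the negative one.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def bce(p: float, y: int) -> float:
    """Standard binary cross-entropy for a single sample."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

# With gamma=0 and alpha=0.5 the focal loss is 0.5 * BCE on every sample.
samples = [(0.9, 1), (0.3, 1), (0.2, 0), (0.7, 0)]
for p, y in samples:
    assert math.isclose(focal_loss(p, y, gamma=0.0, alpha=0.5), 0.5 * bce(p, y))

# With gamma=2 a confident correct prediction contributes far less than a miss.
easy = focal_loss(0.9, 1, gamma=2.0, alpha=0.25)
hard = focal_loss(0.1, 1, gamma=2.0, alpha=0.25)
print(f"easy BUY loss={easy:.5f}  hard BUY loss={hard:.5f}")
```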

2. Per-crypto × variant f1_buy (mean across 5 folds)

| Crypto | none (baseline) | mild_focus | standard | aggressive_focus | class_balanced |
|---|---|---|---|---|---|
| AAVEUSDC | 0.4560 | 0.3951 | 0.3541 | 0.3658 | 0.3845 |
| ARBUSDC | 0.4417 | 0.3911 | 0.3592 | 0.3569 | 0.3620 |
| LDOUSDC | 0.3892 | 0.3165 | 0.2910 | 0.3238 | 0.3053 |
| OPUSDC | 0.3679 | 0.3095 | 0.2986 | 0.2947 | 0.2772 |
| UNIUSDC | 0.4330 | 0.3270 | 0.3517 | 0.3487 | 0.3311 |

The none baseline has the highest f1_buy on every crypto. No focal variant wins on any of them.
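The table can be cross-checked against the paired deltas in section 3. Every crypto contributes the same number of folds, so the mean of the per-crypto differences should equal the mean Δf1_buy over all 25 fold pairs, up to rounding. A short sketch using only the values printed above:

```python
from statistics import fmean

# Per-crypto mean f1_buy, copied from the table above.
f1_buy = {
    "AAVEUSDC": {"none": 0.4560, "mild_focus": 0.3951, "standard": 0.3541,
                 "aggressive_focus": 0.3658, "class_balanced": 0.3845},
    "ARBUSDC":  {"none": 0.4417, "mild_focus": 0.3911, "standard": 0.3592,
                 "aggressive_focus": 0.3569, "class_balanced": 0.3620},
    "LDOUSDC":  {"none": 0.3892, "mild_focus": 0.3165, "standard": 0.2910,
                 "aggressive_focus": 0.3238, "class_balanced": 0.3053},
    "OPUSDC":   {"none": 0.3679, "mild_focus": 0.3095, "standard": 0.2986,
                 "aggressive_focus": 0.2947, "class_balanced": 0.2772},
    "UNIUSDC":  {"none": 0.4330, "mild_focus": 0.3270, "standard": 0.3517,
                 "aggressive_focus": 0.3487, "class_balanced": 0.3311},
}

# Best variant per crypto.
winners = {crypto: max(scores, key=scores.get) for crypto, scores in f1_buy.items()}

# Mean delta vs the none baseline, averaged over cryptos.
variants = ["mild_focus", "standard", "aggressive_focus", "class_balanced"]
mean_delta = {
    v: fmean(scores[v] - scores["none"] for scores in f1_buy.values())
    for v in variants
}

print(winners)
for v in variants:
    print(f"{v:17s} {mean_delta[v]:+.4f}")
```

The resulting means (-0.0697, -0.0866, -0.0796, -0.0855) agree with the mean Δf1_buy column in section 3.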

3. Paired delta vs baseline (variant - none, paired by crypto × fold)

n=25 paired samples per variant. Bootstrap CI95 (10,000 resamples, seed=42).

| Variant | mean Δf1_buy | CI95 low | CI95 high | Cohen's d | raw paired t p | BH p (m=4) |
|---|---|---|---|---|---|---|
| mild_focus | -0.0697 | -0.0874 | -0.0526 | -1.541 | 6.10e-08 | 1.22e-07 |
| standard | -0.0866 | -0.1092 | -0.0655 | -1.510 | 8.63e-08 | 1.15e-07 |
| aggressive_focus | -0.0796 | -0.0988 | -0.0611 | -1.598 | 3.24e-08 | 1.30e-07 |
| class_balanced | -0.0855 | -0.1146 | -0.0616 | -1.240 | 2.09e-06 | 2.09e-06 |
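The CI and effect-size columns can be reproduced along the following lines. This is a sketch under stated assumptions: a percentile bootstrap over the paired deltas, and Cohen's d computed as the mean delta divided by the standard deviation of the deltas (the dossier does not state which estimator was used). The per-fold deltas are not listed in this document, so the input below is synthetic, and the function names are hypothetical.

```python
import random
import statistics

def paired_bootstrap_ci(deltas, n_resamples=10_000, seed=42, level=0.95):
    """Percentile bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the paired deltas with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lo = means[round((1 - level) / 2 * (n_resamples - 1))]
    hi = means[round((1 + level) / 2 * (n_resamples - 1))]
    return lo, hi

def cohens_d_paired(deltas):
    """Effect size for paired samples: mean delta / sample std of the deltas."""
    return statistics.fmean(deltas) / statistics.stdev(deltas)

# Synthetic stand-in for the 25 (crypto x fold) deltas of one variant.
data_rng = random.Random(0)
deltas = [data_rng.gauss(-0.07, 0.045) for _ in range(25)]

ci_low, ci_high = paired_bootstrap_ci(deltas)
d = cohens_d_paired(deltas)
print(f"mean={statistics.fmean(deltas):+.4f} CI95=[{ci_low:+.4f}, {ci_high:+.4f}] d={d:+.3f}")
```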

Every variant's CI95 excludes 0 in the wrong direction. Cohen's d ≤ -1.2 indicates a very large effect. All BH-corrected p-values are ≤ 2.09e-06 : overwhelming statistical evidence that focal loss regresses f1_buy under our dataset and labelling regime.
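The BH column can be recomputed from the raw p-values. The sketch below returns two sets of values: the rank-scaled p-values (p × m / rank), and the conventional Benjamini-Hochberg adjusted p-values, which additionally take a running minimum from the largest rank downwards. The BH p column in the table matches the rank-scaled values; with the running minimum applied, the three smallest would all be about 1.15e-07. The conclusion is the same either way. The function name is hypothetical.

```python
# Raw paired t-test p-values, copied from the table above.
raw_p = {
    "mild_focus": 6.10e-08,
    "standard": 8.63e-08,
    "aggressive_focus": 3.24e-08,
    "class_balanced": 2.09e-06,
}

def benjamini_hochberg(pvals):
    """Return (rank-scaled, monotone-adjusted) BH p-values keyed by name."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    # Step 1: scale each p-value by m / rank (rank 1 = smallest p).
    scaled = {name: p * m / rank for rank, (name, p) in enumerate(ordered, start=1)}
    # Step 2: enforce monotonicity with a running minimum from the largest rank.
    adjusted = {}
    running_min = 1.0
    for name, _ in reversed(ordered):
        running_min = min(running_min, scaled[name])
        adjusted[name] = running_min
    return scaled, adjusted

scaled, adjusted = benjamini_hochberg(raw_p)
for name in raw_p:
    print(f"{name:17s} scaled={scaled[name]:.2e} adjusted={adjusted[name]:.2e}")
```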

4. Per-track gate verdict (F1_BUY_BOOST_PLAN.md §6)

The 6 official gates (every gate must clear for lock) :

| # | Gate and threshold | Observed | Verdict |
|---|---|---|---|
| 1 | F1_buy lift ≥ +0.015 with CI95 excluding 0 | every variant ∈ [-0.10, -0.05] with CI95 excluding 0 in the WRONG direction | FAIL × 4 |
| 2 | Joint metric (Δexpectancy ≥ 0 AND Δsortino ≥ 0 AND Δmax_drawdown ≤ +1 %) | Δsortino negative on all 4 variants ; Δmax_dd > 1 % on 3/4 | FAIL × 4 |
| 3 | Stability : per-fold f1_buy variance ≤ 0.05 | max variance 0.021 (class_balanced on OPUSDC) | PASS × 4 |
| 4 | Per-asset : f1_buy improves on ≥ 4/5 cryptos | 0/5 improve on every variant | FAIL × 4 |
| 5 | Sample size : ≥ 50 BUY trades / fold | mean n_trades ≈ 34 (vs 47.4 baseline) | FAIL × 4 |
| 6 | MLOps readiness | mlops_readiness.md complete (PR #767) | PASS × 4 |

Verdict per variant : FAIL on 4 of 6 gates (1, 2, 4 and 5) → all 4 focal variants ABANDON.
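The all-gates-must-clear rule can be written down as a small check. This is an illustrative sketch rather than the project's gate code: the class and function names are hypothetical, gate 5 is assumed to be evaluated on the mean trade count, and the stability input uses the 0.021 maximum as an upper bound because per-variant variances are not listed here. The example feeds in the standard variant's figures from this dossier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariantResult:
    delta_f1: float           # mean paired delta of f1_buy vs baseline
    ci_low: float             # bootstrap CI95 lower bound of delta_f1
    delta_expectancy: float
    delta_sortino: float
    delta_max_dd_pct: float   # delta max drawdown, in %
    max_fold_variance: float  # worst per-fold f1_buy variance
    cryptos_improved: int     # cryptos (out of 5) where f1_buy improved
    mean_buy_trades: float    # assumption: gate 5 checked on the mean
    mlops_ready: bool

def evaluate_gates(r: VariantResult) -> dict:
    """Map each of the six gates to True (pass) or False (fail)."""
    return {
        "1_f1_lift": r.delta_f1 >= 0.015 and r.ci_low > 0,
        "2_joint_metric": (r.delta_expectancy >= 0 and r.delta_sortino >= 0
                           and r.delta_max_dd_pct <= 1.0),
        "3_stability": r.max_fold_variance <= 0.05,
        "4_per_asset": r.cryptos_improved >= 4,
        "5_sample_size": r.mean_buy_trades >= 50,
        "6_mlops": r.mlops_ready,
    }

# Figures for the "standard" variant (gamma=2, alpha=0.25) from this dossier.
standard = VariantResult(
    delta_f1=-0.0866, ci_low=-0.1092,
    delta_expectancy=18.6, delta_sortino=-0.05, delta_max_dd_pct=1.28,
    max_fold_variance=0.021, cryptos_improved=0,
    mean_buy_trades=34, mlops_ready=True,
)

gates = evaluate_gates(standard)
failed = sorted(name for name, ok in gates.items() if not ok)
verdict = "LOCK" if not failed else "ABANDON"
print(verdict, failed)
```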

5. Supporting metrics (context, not gating)

Joint metric breakdown (mean across cryptos)

| Variant | Δexpectancy | Δsortino | Δmax_drawdown |
|---|---|---|---|
| mild_focus | +10.7 ✓ | -0.13 ✗ | +0.68 ✓ |
| standard | +18.6 ✓ | -0.05 ✗ | +1.28 ✗ |
| aggressive_focus | +23.9 ✓ | -0.13 ✗ | +1.21 ✗ |
| class_balanced | +17.3 ✓ | -0.28 ✗ | +1.49 ✗ |

Δexpectancy > 0 on all variants is misleading — the modest expectancy lift is dwarfed by the large f1_buy regression and the negative sortino delta. The trade quality is worse, not better.
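The ✓ / ✗ marks in the breakdown follow directly from the three joint-metric conditions of gate 2. A short sketch that recomputes them from the table values (the function name is hypothetical):

```python
# (delta_expectancy, delta_sortino, delta_max_drawdown in %), from the table above.
joint = {
    "mild_focus": (10.7, -0.13, 0.68),
    "standard": (18.6, -0.05, 1.28),
    "aggressive_focus": (23.9, -0.13, 1.21),
    "class_balanced": (17.3, -0.28, 1.49),
}

def joint_checks(d_expectancy, d_sortino, d_max_dd_pct):
    """Gate 2 conditions: all three must hold for the joint metric to pass."""
    return {
        "expectancy": d_expectancy >= 0,
        "sortino": d_sortino >= 0,
        "max_drawdown": d_max_dd_pct <= 1.0,
    }

failed_by_variant = {}
for variant, deltas in joint.items():
    checks = joint_checks(*deltas)
    failed_by_variant[variant] = [name for name, ok in checks.items() if not ok]
    print(f"{variant:17s} pass={all(checks.values())} failed={failed_by_variant[variant]}")
```

mild_focus fails on sortino alone, while the other three fail on both sortino and max drawdown, so a positive Δexpectancy is never enough to clear the gate.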

Why focal loss regresses on this data

Three plausible explanations (consistent with the Track 5 label_smoothing pattern documented in 2026-04-29-track5-label-smoothing-results.md §6) :

  1. Class balance is already addressed upstream : the CVN_CLASS_BALANCING=1 setting (sklearn compute_class_weight) already applies α-weighting at the loss level. Adding focal's (1-p_t)^γ modulator on top compounds two corrections that the data does not need.
  2. Rare-event vs calibration tension : f1_buy requires both recall (catch the BUY signal) and precision (don't fire on noise). Focal pushes the gradient onto hard examples, which on noisy crypto data are the noisy labels themselves — so the model down-weights the cleanest signal in favour of training noise.
  3. HPO drift across runs : the HPO objective for the focal runs was f1_binary (per CVN_HPO_OBJECTIVE), but each focal variant converged to a different operating point per fold ; the none baseline benefits from the most stable HPO landscape.

These are hypotheses ; the gate decision does not depend on which is correct.
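Hypotheses 1 and 2 can be made concrete with a little arithmetic. The sketch below combines a balanced class weight (the same formula sklearn's compute_class_weight uses for "balanced": n_samples / (n_classes × class_count)) with focal's (1 - p_t)^γ modulator. The 300 / 700 class split is purely hypothetical and the function names are illustrative; the actual BUY share is not stated in this dossier.

```python
import math

def balanced_class_weights(n_pos: int, n_neg: int):
    """Balanced weights for a binary problem: n_samples / (2 * class_count)."""
    n = n_pos + n_neg
    return n / (2 * n_pos), n / (2 * n_neg)

def focal_modulator(p_t: float, gamma: float) -> float:
    """Focal down-weighting factor for a sample whose true class has probability p_t."""
    return (1.0 - p_t) ** gamma

# Hypothetical split: 300 BUY vs 700 non-BUY samples.
w_buy, w_other = balanced_class_weights(300, 700)

gamma = 2.0
easy_buy = w_buy * focal_modulator(0.9, gamma)  # clean, confidently correct BUY
hard_buy = w_buy * focal_modulator(0.1, gamma)  # BUY label the model disbelieves

print(f"class weights: BUY={w_buy:.3f} other={w_other:.3f}")
print(f"effective weight: easy BUY={easy_buy:.4f} hard BUY={hard_buy:.4f}")
print(f"hard/easy ratio: {hard_buy / easy_buy:.1f}")
```

With γ=2 the hard example carries about 81 times the weight of the easy one, on top of the class weight already applied. If many of the hard BUY samples are mislabelled, that is where the gradient goes.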

6. Decisions

6.1 Lock variant

No lock. Keep CVN_LOSS_FUNCTION=binary:logistic (baseline) as production. Console state unchanged. No flip required in ftf_config.base_env.

6.2 OP Story closure

wp#41 (CVN-N001-EE-S02) → status Closed with verdict ABANDONED. Comment links this dossier + the implementation PR #767.

6.3 Joint variant (Track 5 × Track 6)

Skipped. A joint mild_smoothing × mild_focus variant would compound two abandoned interventions ; the prior probability that it clears the gates is essentially zero. The additional 25-row joint sweep would be wasted compute.

6.4 Track 6 Story Phase

Phase 4 (FTF sweep) and Phase 5 (gate decision) of TEMPLATE_story_phases_ml_ftf.md are now complete with verdict abandon. Closes the Story.

6.5 Follow-ups

  • CVN-N011-EA-S10 : gRPC fork deadlock fix landed (PR #777 squash 0f6220fc). The Cleanlab branch of Track 5 is unblocked ; the operator can re-trigger it once the deploy is validated.
  • CVN-N011-EA-S11 — post-mortem of the missing-dep regression that delayed this Track (sympy hotfix PR #775). Open, P2.
  • Tracks 7-13 — sequencing per F1 plan §6 risk #4. Two consecutive ABANDON results (Track 5 mild + Track 6 focal) indicate that loss-function manipulation is not the productive lever on this data ; the next viable track families are data (Track 1, 12) or architecture (Track 11). Operator decision pending.

7. Linked context