Plan dossier : Track 6 Focal loss for XGBoost binary classification

Date : 2026-04-28
Story : CVN-N001-EE-S02 (OP wp#41), Track 6 of F1_buy boost plan
GH issue : #713
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P1, quick-win sequence #2 (gated by Track 5 stable)
Revision history :
- v1 (committee 7a5c1d73) : REJECTED METHODOLOGY_FLAW (consensus split). Single blocker : numerical-only verification didn't meet Track 7 reco #13's "formal verification" mandate. 3 medium-priority recos.
- v2 (committee c90e7749) : REJECTED EXECUTION_RISK (consensus strong). Methodology blocker fixed. 2 NEW blockers : (a) absence of ADR-25 fallback / kill-switch, (b) lack of staged rollout / rollback. Both addressed in §11 : they are platform-level concerns already covered (ADR-71 + wp#56 closed for kill-switch ; FTF sweep + Console flip for staged rollout per ADR-42 / ADR-59). Same EXECUTION_RISK pattern as Bug #1 plan_review session 986ea335 which PASSED with these same concerns acknowledged.
- v3 (this revision) : adds §11 explicit response to EXECUTION_RISK blockers + revised question for committee. Re-submitted.

1. Context

CVNTrade's chronic ceiling on f1_buy is documented in documentation/F1_BUY_BOOST_PLAN.md (canonical, committee-approved session 9d4942cb PASSED OK 8.96). Track 5 (asymmetric label smoothing) is currently In testing post-FTF sweep.

Track 6 = next quick-win in the bundle — focal loss as XGBoost objective.

2. Hypothesis

XGBoost's binary:logistic objective treats easy and hard examples symmetrically (cross-entropy). With ~20 % BUY rate and abundant easy NOT_BUY examples (the model already classifies those correctly with high probability), the gradient is dominated by the easy negatives, leaving the BUY (minority) class under-modeled.

Focal loss (Lin et al. 2017, Focal Loss for Dense Object Detection) introduces a (1 − p_t)^γ modulating factor that down-weights well-classified examples and concentrates training updates on hard / minority cases :

FL(p_t) = -α_t · (1 − p_t)^γ · log(p_t)

where p_t = p if y=1 else 1 - p, γ ≥ 0 is the focusing parameter (γ=0 reduces to weighted cross-entropy), and α_t ∈ [0, 1] is the class-balancing weight.
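To make the effect of the modulating factor concrete, the sketch below evaluates the formula above for single examples. The function names and the sample probabilities are illustrative assumptions, not part of the codebase.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example: p is the predicted P(y=1), y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def weighted_cross_entropy(p, y, alpha=0.25):
    """Alpha-weighted cross-entropy, the gamma = 0 special case."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(p_t)


# gamma = 0 collapses to alpha-weighted cross-entropy.
assert math.isclose(focal_loss(0.3, 1, gamma=0.0), weighted_cross_entropy(0.3, 1))

# Illustrative pair: an easy negative (p_t = 0.95) and a hard positive (p_t = 0.30),
# both with alpha = 0.5 so that only the focusing term differs.
easy_ce = focal_loss(0.05, 0, gamma=0.0, alpha=0.5)
hard_ce = focal_loss(0.30, 1, gamma=0.0, alpha=0.5)
easy_fl = focal_loss(0.05, 0, gamma=2.0, alpha=0.5)
hard_fl = focal_loss(0.30, 1, gamma=2.0, alpha=0.5)

print(f"hard/easy loss ratio at gamma=0: {hard_ce / easy_ce:.1f}")
print(f"hard/easy loss ratio at gamma=2: {hard_fl / easy_fl:.1f}")
```

With these sample values the hard example's share of the loss grows by a factor of (0.70 / 0.05)^2 = 196 when moving from γ=0 to γ=2, which is the down-weighting of easy examples the hypothesis relies on.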

Expected effect : f1_buy lift via better recall on the BUY class, without sacrificing precision (because the modulating factor down-weights BOTH classes' easy examples — it doesn't bias toward BUY blindly).

Adjacent precedent in the codebase : we already have CVN_CLASS_BALANCING (sklearn's compute_class_weight) which is α-only (no γ focus). Focal loss is the strict generalization with the added focusing component.

3. Variant matrix

Per documentation/F1_BUY_BOOST_PLAN.md §4.2 convention (5 unique configs per FTF factor, including baseline) :

Variant	gamma (focusing)	alpha (class weight)	Notes
`none` (baseline)	0.0	0.5	Reduces to cross-entropy scaled by a constant 0.5 (α_t = 0.5 for both classes) ; identity short-circuit
`mild_focus`	1.0	0.25	Light down-weighting of easy examples
`standard`	2.0	0.25	Default value from Lin et al. 2017 paper
`aggressive_focus`	4.0	0.25	Strong down-weighting (paper showed gains up to γ=5)
`class_balanced`	2.0	0.75	Standard γ + heavier weight on BUY class

5 variants. Joint config (Track 5 × Track 6) postponed to Phase 4 if individual gates lock.

4. Implementation path

4.1 Custom XGBoost objective

XGBoost's xgb.train() accepts a custom obj parameter — a callable returning (grad, hess). Closed-form derivatives for focal loss (per Lin et al. 2017 Appendix A + standard derivation) :

import numpy as np


def focal_loss_obj(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Returns (grad, hess) for XGBoost binary focal loss.

    y_pred is the raw margin (logit, not probability). XGBoost calls this
    per boosting iteration with the current ensemble's margin output.

    Numerical safety : clip p to [1e-7, 1-1e-7] so log(p_t) stays finite,
    and floor the Hessian at 1e-7 for tree splitting.
    """
    p = 1.0 / (1.0 + np.exp(-y_pred))
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # d(p_t)/dz = sign * p_t * (1 - p_t), with sign = +1 for y=1 and -1 for y=0
    sign = np.where(y_true == 1, 1.0, -1.0)
    log_pt = np.log(p_t)  # safe : p_t is already clipped away from 0
    focus = (1.0 - p_t) ** gamma

    # Gradient of focal loss w.r.t. logit ; gamma=0 gives alpha_t * (p - y)
    grad = sign * alpha_t * focus * (gamma * p_t * log_pt + p_t - 1.0)

    # Hessian w.r.t. logit ; gamma=0 gives alpha_t * p * (1 - p).
    # Not positive everywhere : focal loss is non-convex in the logit and the
    # exact second derivative turns negative for hard examples (small p_t).
    hess = alpha_t * p_t * focus * (
        gamma * (1.0 - (1.0 + gamma) * p_t) * log_pt
        + (2.0 * gamma + 1.0) * (1.0 - p_t)
    )
    hess = np.maximum(hess, 1e-7)  # numerical safety for tree splitting
    return grad, hess
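For reference, the derivatives can also be written out by hand from the definition in §2. This is a sketch of the standard chain-rule derivation, assuming z is the raw margin, p = σ(z), and s = 2y − 1 (a sign symbol introduced here for compactness, not used elsewhere in this dossier); it is meant as a cross-check target for the Phase 2 symbolic verification, not as a substitute for it.

```latex
\begin{aligned}
\mathrm{FL}(p_t) &= -\alpha_t\,(1-p_t)^{\gamma}\,\log p_t,
\qquad \frac{\partial p_t}{\partial z} = s\,p_t\,(1-p_t), \quad s = 2y - 1 \\[4pt]
\frac{\partial\,\mathrm{FL}}{\partial z}
  &= s\,\alpha_t\,(1-p_t)^{\gamma}\,\bigl[\gamma\,p_t \log p_t + p_t - 1\bigr] \\[4pt]
\frac{\partial^2\,\mathrm{FL}}{\partial z^2}
  &= \alpha_t\,p_t\,(1-p_t)^{\gamma}\,
     \bigl[\gamma\,\bigl(1-(1+\gamma)\,p_t\bigr)\log p_t + (2\gamma+1)\,(1-p_t)\bigr]
\end{aligned}
```

At γ = 0 these reduce to α_t (p − y) and α_t p (1 − p), the familiar logistic-loss derivatives scaled by α_t. For γ > 0 the bracket in the second derivative becomes negative when p_t is small (for γ = 2, below roughly p_t ≈ 0.06), so a floor on the Hessian does real work for hard examples rather than acting only as a rounding guard.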

Validation — formal verification (per Track 7 committee veto reco #13) :

Phase 2 implementation MUST include symbolic verification of grad + hess via SymPy, not just numerical finite differences. This addresses Track 7 reco #13's explicit "formal verification" mandate (committee 7a5c1d73 blocker on a previous draft of this dossier).

Two-layer verification :

Symbolic layer (SymPy) — derive grad + hess analytically from the focal loss expression FL(p_t) = -α_t · (1 − p_t)^γ · log(p_t) with respect to the logit z, simplify, and compare element-wise to the closed-form numpy implementation. Tolerance : exact symbolic equivalence after sympy.simplify. Test code lives in tests/unit/training/XGBoost/test_focal_loss_formal_verification.py.
Numerical sanity layer (finite differences) — secondary check on toy data (n=20 points, both classes) with central differences (f(x+h) − f(x−h)) / 2h, h=1e-5, tolerance < 1e-4. Catches implementation bugs (typos, broadcasting errors) that the symbolic layer doesn't see (the symbolic check operates on the math, not the numpy code).

Both layers MUST pass for the variant to be eligible for FTF sweep. If symbolic verification reveals a discrepancy in the closed-form, the implementation MUST be regenerated from SymPy's lambdify output rather than hand-coded — this is the durable mechanism Track 7 reco #13 was demanding.

4.2 Custom eval metric

XGBoost's early stopping needs an eval metric that mirrors the loss. Add focal_loss_eval(y_true, y_pred, gamma, alpha) returning the scalar focal loss on dval. Without this, early stopping uses logloss which is misaligned with the optimization target.
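A minimal sketch of such a metric is shown below. The name and argument order follow the text above; the mean reduction, the `eps` clipping bound and the pure-Python loop are assumptions made to keep the example dependency-free (it accepts any sequence, including numpy arrays, but a production version would presumably be vectorized).

```python
import math


def _sigmoid(z):
    """Numerically stable sigmoid for a single raw margin."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def focal_loss_eval(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss over a validation set; y_pred holds raw margins."""
    total = 0.0
    count = 0
    for y, z in zip(y_true, y_pred):
        p = min(max(_sigmoid(z), eps), 1.0 - eps)  # keep log() finite
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        count += 1
    return total / count


# Toy check: at margin 0 every p_t is 0.5, so gamma=0, alpha=0.5 gives 0.5 * ln(2).
value = focal_loss_eval([1, 0], [0.0, 0.0], gamma=0.0, alpha=0.5)
assert math.isclose(value, 0.5 * math.log(2.0))
```

Lower values are better, so early stopping has to treat the metric as a minimisation target.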

4.3 FTF factor + guardrail (per ADR-58)

src/commun/finetune/factors/focal_loss.py — registers the factor with its variant matrix (§3)
src/commun/finetune/guardrails/focal_loss_guardrail.py — rejects :
gamma < 0 (math-undefined)
gamma > 10 (gradient instability empirical bound)
alpha ∉ [0, 1] (class weight must be a probability)
gamma=0 AND alpha=0.5 allowed only as baseline (otherwise the variant is identical to baseline)
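The four rejection rules can be sketched as a single pure function. The names `VARIANTS` and `check_focal_params`, and the choice to return a list of reasons, are hypothetical; the actual guardrail interface required by ADR-58 is not described in this dossier.

```python
# Variant matrix from §3 as (gamma, alpha) pairs.
VARIANTS = {
    "none": (0.0, 0.5),
    "mild_focus": (1.0, 0.25),
    "standard": (2.0, 0.25),
    "aggressive_focus": (4.0, 0.25),
    "class_balanced": (2.0, 0.75),
}


def check_focal_params(gamma, alpha, is_baseline=False):
    """Return the list of rejection reasons; an empty list means the config passes."""
    reasons = []
    if gamma < 0:
        reasons.append("gamma < 0")
    if gamma > 10:
        reasons.append("gamma > 10")
    if not 0.0 <= alpha <= 1.0:
        reasons.append("alpha outside [0, 1]")
    if gamma == 0 and alpha == 0.5 and not is_baseline:
        reasons.append("gamma=0 with alpha=0.5 duplicates the baseline")
    return reasons


# Every variant of the matrix should pass, with "none" flagged as the baseline.
for name, (gamma, alpha) in VARIANTS.items():
    assert check_focal_params(gamma, alpha, is_baseline=(name == "none")) == []
```

Returning all reasons rather than failing on the first one makes a rejected sweep config easier to diagnose in one pass.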

4.4 Env var wiring (per ADR-56)

CVN_LOSS_FUNCTION ∈ {binary:logistic (default, baseline), focal}
CVN_FOCAL_GAMMA (default 2.0, only read when CVN_LOSS_FUNCTION=focal)
CVN_FOCAL_ALPHA (default 0.25, only read when CVN_LOSS_FUNCTION=focal)

Trainer logic in cvntrade_XGBoost_trainer.py :

if os.getenv("CVN_LOSS_FUNCTION", "binary:logistic") == "focal":
    gamma = float(os.getenv("CVN_FOCAL_GAMMA", "2.0"))
    alpha = float(os.getenv("CVN_FOCAL_ALPHA", "0.25"))
    obj = lambda y_pred, dmat: focal_loss_obj(dmat.get_label(), y_pred, gamma, alpha)
    eval_metric = lambda y_pred, dmat: ("focal_loss", focal_loss_eval(dmat.get_label(), y_pred, gamma, alpha))
    self.model = xgb.train(xgb_params, dtrain, ..., obj=obj, custom_metric=eval_metric, ...)
else:
    self.model = xgb.train(xgb_params, dtrain, ...)  # standard binary:logistic

Note : when obj is custom, XGBoost outputs raw margins (not probabilities). We need a sigmoid wrap on predict() for downstream calibration + thresholding. Already handled in XGBWrapper.predict_proba for the calibration path ; needs verification for the direct predict path.
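The sigmoid wrap itself is small. The sketch below shows a numerically stable version for raw margins; the helper names are hypothetical and it makes no claim about how `XGBWrapper` is actually implemented.

```python
import math


def margin_to_probability(margin):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def margins_to_probabilities(margins):
    """Map raw margins (logits) from a custom-objective model to P(BUY)."""
    return [margin_to_probability(m) for m in margins]


probabilities = margins_to_probabilities([-1000.0, -2.0, 0.0, 2.0, 1000.0])
assert all(0.0 <= p <= 1.0 for p in probabilities)
```

The branch on the sign of the margin avoids the overflow that `math.exp(-margin)` would raise for very negative margins.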

4.5 Calibration strategy : proactive temperature scaling (per committee reco #2)

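Focal-loss-trained models are known to output probability distributions with a different confidence profile than cross-entropy models : Mukhoti et al. 2020 report that focal loss acts as an entropy regulariser and yields less over-confident outputs. Isotonic and Platt calibration may therefore not be optimal on focal-loss outputs. Temperature scaling (Guo et al. 2017) is a standard recalibration technique that handles miscalibration in either direction : a single learnable temperature parameter T rescales the logit before the link function, softmax(logits / T) in the multi-class case and sigmoid(logit / T) for this binary model, with T > 1 softening and T < 1 sharpening the output.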

Per committee 7a5c1d73 reco #2 (Proactive Calibration Implementation) :

Add temperature_scaling as a calibration method alongside isotonic, sigmoid, platt, hybrid, none in XGBoostConfig.calibration
When CVN_LOSS_FUNCTION=focal, the FTF variant matrix MUST include both calibration=isotonic (existing baseline calibration) AND calibration=temperature_scaling so we can compare ECE_HOLD across the two
Temperature scaling implementation : fit T via L-BFGS minimizing NLL on the validation set ; integrates cleanly with the new _invoke_calibration helper (PR #765) and respects the hard-label contract from _assert_calibration_targets_discrete

If temperature scaling outperforms isotonic on ECE_HOLD for focal variants → lock as the default calibration when CVN_LOSS_FUNCTION=focal. If not → keep isotonic, document in results dossier.
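As an illustration of what fitting T involves, the sketch below minimises the validation NLL of sigmoid(logit / T) over a single scalar. It swaps L-BFGS for a golden-section search on log T to stay within the standard library; the function names and the toy data are assumptions, and the real implementation is expected to go through `_invoke_calibration`.

```python
import math


def _log_sigmoid(x):
    """Stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nll_at_temperature(logits, labels, temperature):
    """Mean negative log-likelihood of hard labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        total -= _log_sigmoid(s) if y == 1 else _log_sigmoid(-s)
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iterations=80):
    """Golden-section search on log T (stand-in for L-BFGS in this sketch)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    f_c = nll_at_temperature(logits, labels, math.exp(c))
    f_d = nll_at_temperature(logits, labels, math.exp(d))
    for _ in range(iterations):
        if f_c < f_d:  # minimum lies in [a, d]
            b, d, f_d = d, c, f_c
            c = b - inv_phi * (b - a)
            f_c = nll_at_temperature(logits, labels, math.exp(c))
        else:  # minimum lies in [c, b]
            a, c, f_c = c, d, f_d
            d = a + inv_phi * (b - a)
            f_d = nll_at_temperature(logits, labels, math.exp(d))
    return math.exp((a + b) / 2.0)


# Toy validation set: margins of +/-4 claim ~98 % confidence but are right 75 % of the time.
toy_logits = [-4.0, -4.0, -4.0, -4.0, 4.0, 4.0, 4.0, 4.0]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 0]
fitted_t = fit_temperature(toy_logits, toy_labels)
print(f"fitted temperature: {fitted_t:.3f}")
```

On this toy set the optimum has a closed form: the NLL is minimised when sigmoid(4 / T) = 0.75, i.e. T = 4 / ln 3 ≈ 3.64, which gives a convenient unit-test target. Searching in log T keeps T positive without needing a bound-constrained optimiser.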

4.6 Comprehensive predict() path verification (per committee reco #3)

Custom obj outputs raw margins (logits), not probabilities. Every prediction path MUST apply sigmoid where probabilities are expected. Phase 2 acceptance gate :

trainer.predict() (line 680) — verify it returns probabilities in [0, 1] when CVN_LOSS_FUNCTION=focal
XGBWrapper.predict_proba() (line 148) — already applies sigmoid via the binary flag ; verify it works with focal-loss-trained model
XGBWrapper.predict() (line 136) — verify the proba >= 0.5 threshold logic still applies correctly
_evaluate_model() (line 467) — Brier + ECE computation expects probabilities ; verify the if self.calibrated_model branch + the raw xgb.train branch both produce probabilities
Direct downstream consumers : grep model.predict\|booster.predict across src/ and audit each call site
Integration test : end-to-end trainer.train(...) → trainer.predict(X_test) with CVN_LOSS_FUNCTION=focal MUST return values in [0, 1]

This gates merge of the focal loss implementation. Any path that returns raw margins where probabilities are expected = blocker.
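A range assertion of the kind the integration test needs might look like the sketch below. Both helpers are hypothetical. A range check alone is necessary but not sufficient, because a raw margin can happen to fall inside [0, 1]; the second helper therefore compares an uncalibrated prediction path against sigmoid(margin) directly.

```python
import math


def assert_probabilities(values, source="predict()"):
    """Raise if any value is non-finite or outside [0, 1]."""
    for index, value in enumerate(values):
        if not math.isfinite(value) or value < 0.0 or value > 1.0:
            raise ValueError(
                f"{source} returned {value!r} at index {index}: "
                "expected a probability in [0, 1]"
            )


def assert_matches_sigmoid(probabilities, margins, tol=1e-6):
    """For an uncalibrated path: outputs must equal sigmoid(raw margin)."""
    for index, (p, m) in enumerate(zip(probabilities, margins)):
        expected = 1.0 / (1.0 + math.exp(-m)) if m >= 0 else math.exp(m) / (1.0 + math.exp(m))
        if abs(p - expected) > tol:
            raise ValueError(f"index {index}: got {p!r}, expected sigmoid({m!r}) = {expected!r}")


assert_probabilities([0.0, 0.37, 1.0])
assert_matches_sigmoid([0.5], [0.0])
```

The second check does not apply after calibration, since a calibrated probability is no longer a plain sigmoid of the margin.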

4.7 Integration with Track 5 hotfixes

The recent calibration refactor (CVN-N011-EA-S07 PR #765) moved calibration to (X_val, y_val) and added _assert_calibration_targets_discrete. Focal loss output is raw margins → calibration sees the same input shape it always did (after sigmoid in XGBWrapper.predict_proba). No interaction is expected, but the Phase 2 integration test MUST verify that the joint variant eps_buy=0.15 × focal_loss=standard (a Track 5 × Track 6 sample) passes end-to-end through trainer.train(...) with calibration enabled.

5. Acceptance criteria

6. Falsifiability criteria

The hypothesis "focal loss improves f1_buy by focusing on hard / minority examples" is falsified if any of :

Δ f1_buy < +0.005 across ALL (gamma, alpha) configs → focal loss doesn't help on this data — abandon
Best variant has CI95 including 0 → no statistically significant improvement → keep available, don't lock
aggressive_focus (γ=4) WORSE than mild_focus (γ=1) → over-focusing destabilizes training → at minimum, restrict variant space or abandon
expectancy_net regresses by ≥ 0.005 vs baseline → loss change broke the precision/recall trade-off in trading-cost space → abandon
ECE_HOLD > baseline + 0.02 → focal loss outputs are systematically miscalibrated even after isotonic/platt calibration → abandon (untradeable)

Any falsifier triggers : either narrow variant scope + retry, or document negative result + abandon Track 6.
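The five falsifiers can be encoded as a small checker over sweep results. This is a sketch under stated assumptions: the dictionary keys (`delta_f1_buy`, `ci95`, `expectancy_net`, `ece_hold`) are hypothetical field names, "worse" is read as lower Δ f1_buy, and the expectancy and ECE falsifiers are applied to the best variant, which the text above does not specify.

```python
def triggered_falsifiers(results, baseline):
    """results: variant name -> metrics dict; baseline: baseline metrics dict."""
    triggered = []

    if all(r["delta_f1_buy"] < 0.005 for r in results.values()):
        triggered.append("no variant reaches +0.005 f1_buy")

    best = max(results.values(), key=lambda r: r["delta_f1_buy"])
    ci_low, ci_high = best["ci95"]
    if ci_low <= 0.0 <= ci_high:
        triggered.append("best variant CI95 includes 0")

    if "aggressive_focus" in results and "mild_focus" in results:
        if results["aggressive_focus"]["delta_f1_buy"] < results["mild_focus"]["delta_f1_buy"]:
            triggered.append("aggressive_focus worse than mild_focus")

    if baseline["expectancy_net"] - best["expectancy_net"] >= 0.005:
        triggered.append("expectancy_net regression >= 0.005")

    if best["ece_hold"] > baseline["ece_hold"] + 0.02:
        triggered.append("ECE_HOLD exceeds baseline + 0.02")

    return triggered


# Made-up numbers for illustration only; not sweep results.
example_results = {
    "mild_focus": {"delta_f1_buy": 0.012, "ci95": (0.002, 0.022),
                   "expectancy_net": 0.031, "ece_hold": 0.045},
    "aggressive_focus": {"delta_f1_buy": 0.004, "ci95": (-0.006, 0.014),
                         "expectancy_net": 0.028, "ece_hold": 0.050},
}
example_baseline = {"expectancy_net": 0.030, "ece_hold": 0.040}
print(triggered_falsifiers(example_results, example_baseline))
```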

7. Risks

#	Risk	Probability	Impact	Mitigation
1	Custom obj breaks early stopping signal	high	medium	Implement custom eval_metric (§4.2) mirroring focal loss ; verify in unit test
2	Gradient instability for extreme γ or very small p_t	medium	high	Numerical clipping `p ∈ [1e-7, 1-1e-7]` + `hess = max(hess, 1e-7)` ; reject γ > 10 in guardrail
3	Interaction with Track 5 label smoothing causes new soft-label-on-calibration bug	medium	high	Phase 2 integration test explicitly covers joint variant ; calibration assertion (PR #765) catches if it surfaces
4	Custom obj output is raw margin not probability — downstream code may not handle it	medium	medium	Verify `XGBWrapper.predict_proba` applies sigmoid ; explicit test for `trainer.predict()` path
5	Closed-form math error (silent prod bug)	low (after symbolic check)	critical	Two-layer verification per §4.1 : (a) SymPy symbolic equivalence between closed-form and analytical derivative — exact match after `sympy.simplify`, (b) numerical finite differences as secondary sanity check. If symbolic discrepancy found, regenerate from `sympy.lambdify`. Addresses Track 7 reco #13 + committee `7a5c1d73` blocker.
6	Sweep cost : 5 variants × 5 cryptos × 5 folds = 125 runs ; each ~5-10 min	low	low	Standard FTF cost envelope, no new cluster scaling needed
7	Focal loss changes the confidence profile of the output distribution ; isotonic/Platt may miscalibrate	high (well-documented in literature)	medium	Proactive temperature scaling per §4.5, added as Phase 2 deliverable, not follow-up. FTF variant matrix compares isotonic vs temperature_scaling per ECE_HOLD. Best wins. Per committee `7a5c1d73` reco #2.
8	Phase 4 sweep results are statistically significant on f1_buy but variance across cryptos > expected → narrow alpha range may have missed sweet spot	medium	low	If sweep results are promising but suboptimal, follow-up sweep with expanded alpha range `[0.1, 0.25, 0.5, 0.75, 0.9]` × best gamma. Per committee `7a5c1d73` reco #4.

8. Cost estimate

Phase	Effort	Wall-clock
1 (Plan + committee)	2 h	1 day (committee + recos round-trip)
2 (Implementation + tests + docs)	4 h	1.5 days (custom obj + symbolic and numerical verification + integration tests + MLOps readiness)
3 (PR + CR + committee pr_review + merge + deploy)	4 h	2 days (CR cycles 4-5 + committee + image build/deploy)
4 (FTF sweep + stats)	30 min effort	1 day (cluster compute)
5 (Results dossier + gate + lock + close)	2 h	1 day
Total	~12 h effort	~6 days wall-clock

Slightly more than Track 5 because of (a) gradient verification work (symbolic + numerical, per §4.1), (b) custom obj/metric being a higher-risk surface than label smoothing.

9. Out of scope (tracked separately)

Joint Track 5 × Track 6 lock decision → Phase 5 follow-up if individual gates pass
Calibration methods beyond the isotonic vs temperature_scaling comparison already scoped in §4.5 → potential follow-up if Risk #7 materializes
Generalized custom-loss verification framework → potential infra Story under CVN-N013-EA if other Tracks need similar verification (Track 7 vetoed for now)
Track 7 (cost-sensitive P&L loss) — VETOED per F1_BUY_BOOST_PLAN.md §5 unless quick-win bundle delivers positive expectancy AND f1_buy ≥ 0.50

10. Question for the committee

Is the variant matrix in §3 (5 configs spanning γ ∈ {0, 1, 2, 4} × α ∈ {0.25, 0.5, 0.75} subset) appropriate, or should it be expanded / restricted ?

Is the two-layer gradient verification approach of §4.1 (SymPy symbolic equivalence plus finite differences in unit tests, tolerance < 1e-4) sufficient given Track 7 committee veto reco #13 mandated formal verification for custom losses ?

Should focal loss explicitly require an alternative calibration method (e.g., temperature scaling) given the known issue that focal-loss-trained models output probabilities with a different confidence profile than cross-entropy models, which may not be ideal for isotonic/platt calibration, or is the existing ECE_HOLD gate the right way to detect this in-band ?

11. Response to committee `c90e7749` EXECUTION_RISK blockers

Committee c90e7749 (v2 plan_review) raised 2 blockers under code EXECUTION_RISK :

Blocker 1 : "Absence of ADR-25 Compliant Fallback / Kill-Switch"

Response : already implemented at the platform level, NOT in scope of this Story.

ADR-71 (Trading kill-switch) is the system-wide invariant : single PostgreSQL source, operator-only disengage, fail-safe on connectivity loss, halt latency < 1s. Covers ALL trading decisions regardless of which loss function trained the underlying model.
wp#56 (CVN-N001-EF-S02 — Implement system-wide trading kill-switch) is CLOSED (merged + production-deployed).
Per ADR-25 ("no silent fallback") : the trainer-side _assert_dmatrix_contract (PR #754) and _assert_calibration_targets_discrete (PR #765) are the fail-fast assertions for THIS code path. If focal loss produces an invalid model artifact, training fails loudly ; no silent degradation reaches production.
The "revert to baseline" mechanism at the loss-function level = CVN_LOSS_FUNCTION=binary:logistic (default). Operator flips one Console value, next training run uses cross-entropy. No code change required, no rollback PR needed.

Blocker 2 : "Lack of Staged Rollout and Formal Rollback"

Response : FTF sweep + Console-controlled lock IS the project's staged rollout convention.

Phase 4 of this plan (FTF sweep) is shadow-mode by definition : 125 training runs across 5 cryptos × 5 folds × 5 variants, results land in finetune_results PostgreSQL table, NO trading decisions are influenced. This is a 1-day shadow pre-production validation.
Per-track gate decision (Phase 5) requires statistical significance (BH-corrected p < 0.05, Cohen's d ≥ 0.3, CI95 excluding 0) AND business viability (expectancy_net ≥ baseline, sortino ≥ baseline, ECE_HOLD ≤ baseline + 0.01). Failure to clear ANY criterion → keep available or abandon, no lock.
Lock = Console flip in ftf_baseline.json (per ADR-59). Operator controls, atomic, instantly reversible by another flip.
Rollback at the model level : per ADR-42 (atomic crypto-level promotion) + ADR-12 (engine version frozen), every model deployment is snapshot-versioned and the previous champion stays callable. Rollback = MLflow model promotion to previous version — operator action, < 5 min wall-clock.
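The Phase 5 gate described in this list combines a multiple-testing correction with fixed thresholds. The sketch below shows one way to express it: a standard Benjamini-Hochberg adjustment followed by the statistical and business checks. Function names and metric keys are hypothetical, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def passes_gate(variant, baseline, adjusted_p):
    """True only if every statistical and business criterion is cleared."""
    ci_low, ci_high = variant["ci95"]
    statistical = (
        adjusted_p < 0.05
        and variant["cohens_d"] >= 0.3
        and not (ci_low <= 0.0 <= ci_high)  # CI95 must exclude 0
    )
    business = (
        variant["expectancy_net"] >= baseline["expectancy_net"]
        and variant["sortino"] >= baseline["sortino"]
        and variant["ece_hold"] <= baseline["ece_hold"] + 0.01
    )
    return statistical and business


# Illustrative raw p-values for the four non-baseline variants.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p_values = benjamini_hochberg(raw_p)
print(adjusted_p_values)
```

With these inputs, only the first variant stays below 0.05 after correction, even though three raw p-values were below it. That illustrates why the gate is stated on BH-corrected values.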

Why this dossier doesn't add new infra for these concerns

Adding kill-switch / canary infrastructure to a single FTF factor Story would (a) duplicate existing platform invariants (ADR-71, ADR-42, ADR-59), (b) bloat scope from a 12-h Story to multi-week infra work, (c) contradict the operator's explicit Pattern C convention (Story phases focus on the deliverable, not on platform infra that's already in place).

The committee's recommendations #1-#3 are LEGITIMATE concerns at the platform level ; they belong in CVN-N013 (Infrastructure & DevX) or CVN-N011-EA (Pipeline Contract Hardening) follow-ups, not in this single FTF factor Story. If platform-level coverage is judged inadequate at re-review, a dedicated Story should be opened in those Epics — not bolted onto Track 6.

Operator decision request : confirm that platform-level kill-switch + Console-flip rollback are sufficient for this Story's scope, OR explicit instruction to expand scope to include focal-loss-specific infra (in which case Story should be re-sized + re-staffed).

12. Linked context

Canonical plan : documentation/F1_BUY_BOOST_PLAN.md §5 Track 6 + §4.2 (variant matrix convention) + §6 (per-track gate criteria)
Predecessor Story : CVN-N001-EE-S01 Track 5 (wp#40, in testing post FTF sweep with the recent Bug #1 + #2 fixes)
Pipeline contract hardening : CVN-N011-EA (wp#68) — defense-in-depth applied during Track 5 incidents (_assert_dmatrix_contract, _assert_calibration_targets_discrete, _invoke_calibration helper)
Track 7 committee veto : reco #13 of session 9d4942cb mandates formal verification of custom losses
ADR-15 : theta calibrated OOS (precedent for hold-out calibration)
ADR-56 : every change gated by env var + FTF factor (CVN_LOSS_FUNCTION + CVN_FOCAL_*)
ADR-58 : every FTF factor must have guardrail + integration test
ADR-68 : Expert Committee for substantive ML PR (this Story qualifies)
ADR-70 : MLOps readiness template mandatory before merge to main