Plan dossier : Track 6 Focal loss for XGBoost binary classification
Date : 2026-04-28
Story : CVN-N001-EE-S02 (OP wp#41) — Track 6 of F1_buy boost plan
GH issue : #713
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P1 — quick-win sequence #2 (gated by Track 5 stable)
Revision history :
- v1 (committee 7a5c1d73) : REJECTED METHODOLOGY_FLAW (consensus split). Single blocker : numerical-only verification didn't meet Track 7 reco #13's "formal verification" mandate. 3 medium-priority recos.
- v2 (committee c90e7749) : REJECTED EXECUTION_RISK (consensus strong). Methodology blocker fixed. 2 NEW blockers : (a) absence of ADR-25 fallback / kill-switch, (b) lack of staged rollout / rollback. Both addressed in §11 — they are platform-level concerns already covered (ADR-71 + wp#56 closed for kill-switch ; FTF sweep + Console flip for staged rollout per ADR-42 / ADR-59). Same EXECUTION_RISK pattern as Bug #1 plan_review session 986ea335 which PASSED with these same concerns acknowledged.
- v3 (this revision) : adds §11 explicit response to EXECUTION_RISK blockers + revised question for committee. Re-submitted.
1. Context
CVNTrade's chronic ceiling on f1_buy is documented in documentation/F1_BUY_BOOST_PLAN.md (canonical, committee-approved session 9d4942cb PASSED OK 8.96). Track 5 (asymmetric label smoothing) is currently In testing post-FTF sweep.
Track 6 = next quick-win in the bundle — focal loss as XGBoost objective.
2. Hypothesis
XGBoost's binary:logistic objective treats easy and hard examples symmetrically (cross-entropy). With ~20 % BUY rate and abundant easy NOT_BUY examples (the model already classifies those correctly with high probability), the gradient is dominated by the easy negatives, leaving the BUY (minority) class under-modeled.
Focal loss (Lin et al. 2017, Focal Loss for Dense Object Detection) introduces a (1 − p_t)^γ modulating factor that down-weights well-classified examples and concentrates training updates on hard / minority cases :

$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

where p_t = p if y=1 else 1 - p, γ ≥ 0 is the focusing parameter (γ=0 reduces to α-weighted cross-entropy), and α_t ∈ [0, 1] is the class-balancing weight (α_t = α if y=1 else 1 − α).
Expected effect : f1_buy lift via better recall on the BUY class, without sacrificing precision (because the modulating factor down-weights BOTH classes' easy examples — it doesn't bias toward BUY blindly).
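To make the modulating factor concrete, the sketch below evaluates (1 − p_t)^γ for one easy and one hard example over the γ values of the variant matrix. The two p_t values (0.95 and 0.30) are illustrative assumptions, not measurements from CVNTrade data, and `modulating_factor` is a name introduced only for this example.

```python
import numpy as np


def modulating_factor(p_t, gamma):
    """(1 - p_t)^gamma : the weight focal loss applies on top of cross-entropy."""
    return (1.0 - np.asarray(p_t, dtype=float)) ** gamma


easy, hard = 0.95, 0.30  # illustrative p_t values (assumed)
for gamma in (0.0, 1.0, 2.0, 4.0):
    ratio = modulating_factor(hard, gamma) / modulating_factor(easy, gamma)
    print(f"gamma={gamma}: hard/easy weight ratio = {ratio:.1f}")
```

With these values the ratio equals 14^γ, so the hard example carries 196 times the weight of the easy one at γ=2, regardless of its class.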
Adjacent precedent in the codebase : we already have CVN_CLASS_BALANCING (sklearn's compute_class_weight) which is α-only (no γ focus). Focal loss is the strict generalization with the added focusing component.
3. Variant matrix
Per documentation/F1_BUY_BOOST_PLAN.md §4.2 convention (5 unique configs per FTF factor, including baseline) :
| Variant | gamma (focusing) | alpha (class weight) | Notes |
|---|---|---|---|
| none (baseline) | 0.0 | 0.5 | Reduces to standard cross-entropy ; identity short-circuit |
| mild_focus | 1.0 | 0.25 | Light down-weighting of easy examples |
| standard | 2.0 | 0.25 | Default value from Lin et al. 2017 paper |
| aggressive_focus | 4.0 | 0.25 | Strong down-weighting (paper showed gains up to γ=5) |
| class_balanced | 2.0 | 0.75 | Standard γ + heavier weight on BUY class |
5 variants. Joint config (Track 5 × Track 6) postponed to Phase 4 if individual gates lock.
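As a hedged illustration, the matrix above could be held in code as a plain mapping. The name `FOCAL_LOSS_VARIANTS` and the dict layout are assumptions ; the dossier does not show how FTF factors are actually registered.

```python
# Hypothetical in-code form of the §3 variant matrix (name and layout assumed).
FOCAL_LOSS_VARIANTS = {
    "none": {"gamma": 0.0, "alpha": 0.5},  # baseline, identity short-circuit
    "mild_focus": {"gamma": 1.0, "alpha": 0.25},
    "standard": {"gamma": 2.0, "alpha": 0.25},
    "aggressive_focus": {"gamma": 4.0, "alpha": 0.25},
    "class_balanced": {"gamma": 2.0, "alpha": 0.75},
}

# The convention requires 5 unique configs, baseline included.
unique_configs = {(v["gamma"], v["alpha"]) for v in FOCAL_LOSS_VARIANTS.values()}
runs = len(FOCAL_LOSS_VARIANTS) * 5 * 5  # variants x cryptos x folds
print(len(unique_configs), runs)
```

The run count matches the 125 rows expected in `finetune_results` after the sweep.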
4. Implementation path
4.1 Custom XGBoost objective
XGBoost's xgb.train() accepts a custom obj parameter — a callable returning (grad, hess). Closed-form derivatives for focal loss (per Lin et al. 2017 Appendix A + standard derivation) :
```python
import numpy as np


def focal_loss_obj(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Returns (grad, hess) for XGBoost binary focal loss.

    y_pred is the raw margin (logit, not probability). XGBoost calls this
    per boosting iteration with the current ensemble's margin output.

    Numerical safety : clip p to [1e-7, 1-1e-7] to avoid log(0) and a
    vanishing Hessian.
    """
    p = 1.0 / (1.0 + np.exp(-y_pred))
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # dp_t/dz = sign * p_t * (1 - p_t), with sign = +1 for y=1 and -1 for y=0
    sign = np.where(y_true == 1, 1.0, -1.0)
    log_pt = np.log(p_t)  # safe : p_t is already clipped away from 0
    focus = alpha_t * (1.0 - p_t) ** gamma
    # Gradient of focal loss w.r.t. the logit z
    grad = sign * focus * (gamma * p_t * log_pt + p_t - 1.0)
    # Hessian w.r.t. z. Not guaranteed positive : it turns negative for
    # small p_t when gamma > 0, hence the floor below.
    hess = focus * p_t * (
        gamma * (1.0 - (1.0 + gamma) * p_t) * log_pt
        + (2.0 * gamma + 1.0) * (1.0 - p_t)
    )
    hess = np.maximum(hess, 1e-7)  # numerical safety for tree splitting
    return grad, hess
```
Validation — formal verification (per Track 7 committee veto reco #13) :
Phase 2 implementation MUST include symbolic verification of grad + hess via SymPy, not just numerical finite differences. This addresses Track 7 reco #13's explicit "formal verification" mandate (committee 7a5c1d73 blocker on a previous draft of this dossier).
Two-layer verification :
- Symbolic layer (SymPy) : derive grad + hess analytically from the focal loss expression `FL(p_t) = -α_t · (1 − p_t)^γ · log(p_t)` with respect to the logit `z`, simplify, and compare element-wise to the closed-form numpy implementation. Tolerance : exact symbolic equivalence after `sympy.simplify`. Test code lives in `tests/unit/training/XGBoost/test_focal_loss_formal_verification.py`.
- Numerical sanity layer (finite differences) : secondary check on toy data (n=20 points, both classes) with central differences `(f(x+h) − f(x−h)) / 2h`, h=1e-5, tolerance < 1e-4. Catches implementation bugs (typos, broadcasting errors) that the symbolic layer doesn't see (the symbolic check operates on the math, not the numpy code).
Both layers MUST pass for the variant to be eligible for FTF sweep. If symbolic verification reveals a discrepancy in the closed-form, the implementation MUST be regenerated from SymPy's lambdify output rather than hand-coded — this is the durable mechanism Track 7 reco #13 was demanding.
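The regeneration path and the finite-difference layer can be sketched together, as an illustration only. The helper names `build_focal_functions` and `max_fd_error` are hypothetical ; the grid of 20 logits in [-3, 3], γ=2.0 and α=0.25 are assumed toy values. The functions are produced by `sympy.lambdify` from the symbolic derivatives and then compared against central differences with h=1e-5.

```python
import numpy as np
import sympy as sp

z, gamma, alpha_t = sp.symbols("z gamma alpha_t", real=True)


def build_focal_functions(y):
    """Return numpy callables (loss, grad, hess) of (z, gamma, alpha_t) for label y."""
    p = 1 / (1 + sp.exp(-z))
    p_t = p if y == 1 else 1 - p
    loss = -alpha_t * (1 - p_t) ** gamma * sp.log(p_t)
    grad = sp.diff(loss, z)
    hess = sp.diff(grad, z)
    args = (z, gamma, alpha_t)
    return tuple(sp.lambdify(args, expr, "numpy") for expr in (loss, grad, hess))


def max_fd_error(y, gamma_val=2.0, alpha_val=0.25, h=1e-5):
    """Largest gap between generated derivatives and central differences."""
    loss_f, grad_f, hess_f = build_focal_functions(y)
    a_t = alpha_val if y == 1 else 1.0 - alpha_val
    zs = np.linspace(-3.0, 3.0, 20)  # toy logits, 20 points per class
    fd_grad = (loss_f(zs + h, gamma_val, a_t) - loss_f(zs - h, gamma_val, a_t)) / (2 * h)
    fd_hess = (grad_f(zs + h, gamma_val, a_t) - grad_f(zs - h, gamma_val, a_t)) / (2 * h)
    grad_err = np.max(np.abs(fd_grad - grad_f(zs, gamma_val, a_t)))
    hess_err = np.max(np.abs(fd_hess - hess_f(zs, gamma_val, a_t)))
    return grad_err, hess_err


for label in (0, 1):
    print(label, max_fd_error(label))
```

Generated functions do not include the probability clipping or the Hessian floor, so a production version would still need those safeguards applied on top.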
4.2 Custom eval metric
XGBoost's early stopping needs an eval metric that mirrors the loss. Add focal_loss_eval(y_true, y_pred, gamma, alpha) returning the scalar focal loss on dval. Without this, early stopping uses logloss which is misaligned with the optimization target.
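One possible shape for `focal_loss_eval`, offered as a sketch rather than the planned implementation. It assumes the metric is the mean focal loss over the evaluation set and that `y_pred` holds raw margins, as is the case with a custom objective. Using `np.logaddexp` for the log-sigmoid is a choice made here to avoid overflow on large margins.

```python
import numpy as np


def focal_loss_eval(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Mean focal loss computed from raw margins (sketch, averaging is assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    z = np.asarray(y_pred, dtype=float)
    # log(sigmoid(z)) and log(1 - sigmoid(z)) without overflow
    log_p = -np.logaddexp(0.0, -z)
    log_1mp = -np.logaddexp(0.0, z)
    log_pt = np.where(y_true == 1, log_p, log_1mp)
    one_minus_pt = np.exp(np.where(y_true == 1, log_1mp, log_p))
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * one_minus_pt ** gamma * log_pt))


# With gamma=0 and alpha=0.5 the value is half the usual log loss.
print(focal_loss_eval([1, 0], [0.0, 0.0], gamma=0.0, alpha=0.5))
```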
4.3 FTF factor + guardrail (per ADR-58)
- `src/commun/finetune/factors/focal_loss.py` : registers the factor with its variant matrix (§3)
- `src/commun/finetune/guardrails/focal_loss_guardrail.py` : rejects :
  - `gamma < 0` (math-undefined)
  - `gamma > 10` (gradient instability empirical bound)
  - `alpha ∉ [0, 1]` (class weight must be a probability)
  - `gamma=0 AND alpha=0.5` allowed only as baseline (otherwise the variant is identical to baseline)
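The rejection rules above could be expressed as a small validation function. This is a sketch under assumptions : `check_focal_params` and its `is_baseline` flag are hypothetical names, and the real guardrail interface required by ADR-58 is not shown in this dossier.

```python
def check_focal_params(gamma, alpha, is_baseline=False):
    """Raise ValueError for configurations the guardrail rejects (hypothetical helper)."""
    if gamma < 0:
        raise ValueError(f"gamma={gamma} is negative")
    if gamma > 10:
        raise ValueError(f"gamma={gamma} exceeds the empirical stability bound of 10")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"alpha={alpha} is outside [0, 1]")
    if gamma == 0 and alpha == 0.5 and not is_baseline:
        raise ValueError("gamma=0, alpha=0.5 duplicates the baseline")


check_focal_params(2.0, 0.25)                    # standard variant passes
check_focal_params(0.0, 0.5, is_baseline=True)   # baseline passes
```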
4.4 Env var wiring (per ADR-56)
- `CVN_LOSS_FUNCTION` ∈ {`binary:logistic` (default, baseline), `focal`}
- `CVN_FOCAL_GAMMA` (default 2.0, only read when `CVN_LOSS_FUNCTION=focal`)
- `CVN_FOCAL_ALPHA` (default 0.25, only read when `CVN_LOSS_FUNCTION=focal`)
Trainer logic in cvntrade_XGBoost_trainer.py :
```python
if os.getenv("CVN_LOSS_FUNCTION", "binary:logistic") == "focal":
    gamma = float(os.getenv("CVN_FOCAL_GAMMA", "2.0"))
    alpha = float(os.getenv("CVN_FOCAL_ALPHA", "0.25"))
    # XGBoost calls both callables with (raw margins, DMatrix)
    obj = lambda y_pred, dmat: focal_loss_obj(dmat.get_label(), y_pred, gamma, alpha)
    eval_metric = lambda y_pred, dmat: ("focal_loss", focal_loss_eval(dmat.get_label(), y_pred, gamma, alpha))
    self.model = xgb.train(xgb_params, dtrain, ..., obj=obj, custom_metric=eval_metric, ...)
else:
    self.model = xgb.train(xgb_params, dtrain, ...)  # standard binary:logistic
```
Note : when obj is custom, XGBoost outputs raw margins (not probabilities). We need a sigmoid wrap on predict() for downstream calibration + thresholding. Already handled in XGBWrapper.predict_proba for the calibration path ; needs verification for the direct predict path.
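The sigmoid wrap mentioned in the note could look like the following sketch. `margins_to_proba` is a hypothetical helper name, not the existing code in `XGBWrapper` ; the log-sum-exp form is used here so that very large margins do not overflow.

```python
import numpy as np


def margins_to_proba(margins):
    """Numerically stable sigmoid for raw margins (hypothetical helper)."""
    z = np.asarray(margins, dtype=float)
    # sigmoid(z) = exp(-log(1 + exp(-z)))
    return np.exp(-np.logaddexp(0.0, -z))


raw = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])  # illustrative margins
proba = margins_to_proba(raw)
labels = (proba >= 0.5).astype(int)  # thresholding happens on probabilities
print(proba, labels)
```

Thresholding raw margins at 0.5 instead of probabilities would silently shift the decision boundary, which is why every predict path needs the same conversion.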
4.5 Calibration strategy : proactive temperature scaling (per committee reco #2)
Focal-loss-trained models are known to output sharper probability distributions than cross-entropy models (Mukhoti et al. 2020). Isotonic and Platt calibration may not be optimal : they were designed for cross-entropy outputs. Temperature scaling (Guo et al. 2017) is the standard recalibration technique for sharper outputs : a single learnable temperature parameter T > 0 divides the logit before the link function, i.e. sigmoid(z / T) for this binary model (softmax(logits / T) in the multi-class formulation of Guo et al.).
Per committee 7a5c1d73 reco #2 (Proactive Calibration Implementation) :
- Add `temperature_scaling` as a calibration method alongside `isotonic`, `sigmoid`, `platt`, `hybrid`, `none` in `XGBoostConfig.calibration`
- When `CVN_LOSS_FUNCTION=focal`, the FTF variant matrix MUST include both `calibration=isotonic` (existing baseline calibration) AND `calibration=temperature_scaling` so we can compare ECE_HOLD across the two
- Temperature scaling implementation : fit T via L-BFGS minimizing NLL on the validation set ; integrates cleanly with the new `_invoke_calibration` helper (PR #765) and respects the hard-label contract from `_assert_calibration_targets_discrete`
If temperature scaling outperforms isotonic on ECE_HOLD for focal variants → lock as the default calibration when CVN_LOSS_FUNCTION=focal. If not → keep isotonic, document in results dossier.
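A minimal sketch of the temperature fit, assuming SciPy is available and that validation logits and hard 0/1 labels are passed as arrays. `fit_temperature` and `apply_temperature` are hypothetical names, and optimising log T (so that T stays positive) is a choice made for this example rather than something the plan prescribes. The demo data is synthetic.

```python
import numpy as np
from scipy.optimize import minimize


def fit_temperature(val_logits, val_labels):
    """Fit T > 0 by minimising binary NLL of sigmoid(z / T) with L-BFGS-B."""
    z = np.asarray(val_logits, dtype=float)
    y = np.asarray(val_labels, dtype=float)

    def nll(log_t):
        scaled = z / np.exp(log_t[0])
        # Binary NLL from logits : log(1 + exp(s)) - y * s
        return np.mean(np.logaddexp(0.0, scaled) - y * scaled)

    result = minimize(nll, x0=np.array([0.0]), method="L-BFGS-B")
    return float(np.exp(result.x[0]))


def apply_temperature(logits, temperature):
    """Calibrated probabilities sigmoid(z / T)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float) / temperature))


# Synthetic demo : logits that are twice too confident should give T near 2.
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 1.5, size=5000)
labels = (rng.random(5000) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)
temperature = fit_temperature(2.0 * true_logits, labels)
print(temperature)
```

Because a single scalar is fitted, the ranking of predictions is unchanged ; only the confidence level moves, which is what the ECE_HOLD comparison against isotonic is meant to measure.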
4.6 Comprehensive predict() path verification (per committee reco #3)
Custom obj outputs raw margins (logits), not probabilities. Every prediction path MUST apply sigmoid where probabilities are expected. Phase 2 acceptance gate :
- `trainer.predict()` (line 680) : verify it returns probabilities in [0, 1] when `CVN_LOSS_FUNCTION=focal`
- `XGBWrapper.predict_proba()` (line 148) : already applies sigmoid via the `binary` flag ; verify it works with focal-loss-trained model
- `XGBWrapper.predict()` (line 136) : verify the `proba >= 0.5` threshold logic still applies correctly
- `_evaluate_model()` (line 467) : Brier + ECE computation expects probabilities ; verify the `if self.calibrated_model` branch + the raw `xgb.train` branch both produce probabilities
- Direct downstream consumers : grep `model.predict\|booster.predict` across `src/` and audit each call site
- Integration test : end-to-end `trainer.train(...) → trainer.predict(X_test)` with `CVN_LOSS_FUNCTION=focal` MUST return values in [0, 1]
This gates merge of the focal loss implementation. Any path that returns raw margins where probabilities are expected = blocker.
4.7 Integration with Track 5 hotfixes
The recent calibration refactor (CVN-N011-EA-S07 PR #765) moved calibration to (X_val, y_val) and added _assert_calibration_targets_discrete. Focal loss outputs raw margins, but calibration sees the same input as before because XGBWrapper.predict_proba applies the sigmoid first. No interaction is expected, but the Phase 2 integration test MUST verify that the joint variant eps_buy=0.15 × focal_loss=standard (Track 5 × Track 6 sample) passes end-to-end through trainer.train(...) with calibration enabled.
5. Acceptance criteria
- Plan dossier reviewed by Expert Committee (`plan_review`) : verdict PASSED (committee `7a5c1d73` initial REJECTED METHODOLOGY_FLAW addressed in this revision)
- SymPy symbolic verification of grad + hess (closed-form vs symbolic, exact equivalence after `sympy.simplify`) : per Track 7 reco #13 formal verification mandate
- Numerical sanity : finite differences against closed-form on toy data, tolerance < 1e-4 (secondary check for implementation bugs)
- `CVN_LOSS_FUNCTION` env var + `CVN_FOCAL_GAMMA` + `CVN_FOCAL_ALPHA` plumbed through trainer
- `focal_loss_obj` + `focal_loss_eval` implemented + unit-tested
- FTF factor + guardrail registered (per ADR-58)
- Temperature scaling calibration (per committee reco #2) : added as method alongside isotonic/sigmoid/platt/hybrid ; FTF variant matrix includes both `isotonic` and `temperature_scaling` for focal variants → ECE_HOLD comparison drives lock decision
- Comprehensive predict() path verification (per committee reco #3) : `trainer.predict()`, `XGBWrapper.predict_proba`, `XGBWrapper.predict`, `_evaluate_model`, and any other audited consumer all return probabilities in [0, 1] when `CVN_LOSS_FUNCTION=focal` : gate item, blocker if any path returns raw margin
- Integration test `tests/integration/test_track6_focal_loss.py` mirrors `test_track5_label_smoothing.py` pattern : variant matrix end-to-end through `trainer.train(...)` with calibration enabled (regression bar)
- Joint variant test : `eps_buy=0.15 × focal_loss=standard` no crash + reasonable predictions
- MLOps readiness template filled in `documentation/stories/CVN-N001-EE-S02/mlops_readiness.md`
- Local validation : `make qa` + `pytest tests/integration/test_track6_focal_loss.py` + `mkdocs build --strict`
- Committee `pr_review` PASSED before merge
- Operator triggers `dag_finetune__pte --factor=focal_loss` post-merge → 125 rows in `finetune_results`
- Stats : BH-corrected p < 0.05, Cohen's d ≥ 0.3, ECE_HOLD ≤ baseline + 0.01
- Per-track gate cleared (cf. `F1_BUY_BOOST_PLAN.md` §6) :
  - f1_buy lift ≥ +0.015 with CI95 excluding 0
  - Story-specific : Δ f1_buy ≥ +0.02
  - expectancy_net ≥ baseline, sortino ≥ baseline, max_drawdown ≤ baseline + 1 %
  - ≥ 4/5 cryptos improve f1_buy individually
  - ≥ 50 BUY trades / fold
- Decision : lock / keep available / abandon (Console flip if lock)
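The statistical criteria (BH-corrected p, Cohen's d) can be illustrated with a short sketch. It assumes the standard Benjamini-Hochberg step-up adjustment and the pooled-standard-deviation form of Cohen's d ; the dossier does not say which variant of d (pooled or paired) the FTF stats pipeline uses, so that choice is an assumption. The input numbers are made up for the example.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def cohens_d(treatment, baseline):
    """Cohen's d with pooled standard deviation (assumed variant)."""
    t = np.asarray(treatment, dtype=float)
    b = np.asarray(baseline, dtype=float)
    pooled_var = ((t.size - 1) * t.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (
        t.size + b.size - 2
    )
    return float((t.mean() - b.mean()) / np.sqrt(pooled_var))


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])  # illustrative p-values
effect = cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])        # illustrative samples
print(adjusted, effect)
```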
6. Falsifiability criteria
The hypothesis "focal loss improves f1_buy by focusing on hard / minority examples" is falsified if any of :
- Δ f1_buy < +0.005 across ALL (gamma, alpha) configs → focal loss doesn't help on this data — abandon
- Best variant has CI95 including 0 → no statistically significant improvement → keep available, don't lock
- Aggressive_focus (γ=4) WORSE than mild_focus (γ=1) → over-focusing destabilizes training → at minimum, restrict variant space or abandon
- expectancy_net regresses by ≥ 0.005 vs baseline → loss change broke the precision/recall trade-off in trading-cost space → abandon
- ECE_HOLD > baseline + 0.02 → focal loss outputs are systematically miscalibrated even after isotonic/platt calibration → abandon (untradeable)
Any falsifier triggers : either narrow variant scope + retry, or document negative result + abandon Track 6.
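The five falsifiers can be read as a mechanical check over the sweep summary. The sketch below is one possible encoding : the function name, the result keys (`delta_f1_buy`, `ci95_low`, `delta_expectancy_net`, `ece_hold`) and the rule that the expectancy and ECE falsifiers are evaluated on the best variant by Δ f1_buy are all assumptions, and the sample numbers are invented for illustration.

```python
def triggered_falsifiers(results, baseline_ece_hold):
    """Return the names of falsifiers triggered by a sweep summary (hypothetical keys)."""
    candidates = {name: r for name, r in results.items() if name != "none"}
    best = max(candidates.values(), key=lambda r: r["delta_f1_buy"])
    triggered = []
    if all(r["delta_f1_buy"] < 0.005 for r in candidates.values()):
        triggered.append("no_lift_any_config")
    if best["ci95_low"] <= 0.0:
        triggered.append("best_ci95_includes_zero")
    if candidates["aggressive_focus"]["delta_f1_buy"] < candidates["mild_focus"]["delta_f1_buy"]:
        triggered.append("aggressive_worse_than_mild")
    if best["delta_expectancy_net"] <= -0.005:
        triggered.append("expectancy_regression")
    if best["ece_hold"] > baseline_ece_hold + 0.02:
        triggered.append("miscalibrated")
    return triggered


# Invented sample summary, for illustration only.
sample = {
    "mild_focus": {"delta_f1_buy": 0.010, "ci95_low": 0.002, "delta_expectancy_net": 0.001, "ece_hold": 0.040},
    "standard": {"delta_f1_buy": 0.022, "ci95_low": 0.006, "delta_expectancy_net": 0.003, "ece_hold": 0.045},
    "aggressive_focus": {"delta_f1_buy": 0.015, "ci95_low": 0.001, "delta_expectancy_net": 0.000, "ece_hold": 0.050},
    "class_balanced": {"delta_f1_buy": 0.018, "ci95_low": 0.004, "delta_expectancy_net": 0.002, "ece_hold": 0.042},
}
print(triggered_falsifiers(sample, baseline_ece_hold=0.040))
```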
7. Risks
| # | Risk | Probability | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Custom obj breaks early stopping signal | high | medium | Implement custom eval_metric (§4.2) mirroring focal loss ; verify in unit test |
| 2 | Gradient instability for extreme γ or very small p_t | medium | high | Numerical clipping p ∈ [1e-7, 1-1e-7] + hess = max(hess, 1e-7) ; reject γ > 10 in guardrail |
| 3 | Interaction with Track 5 label smoothing causes new soft-label-on-calibration bug | medium | high | Phase 2 integration test explicitly covers joint variant ; calibration assertion (PR #765) catches if it surfaces |
| 4 | Custom obj output is raw margin not probability — downstream code may not handle it | medium | medium | Verify XGBWrapper.predict_proba applies sigmoid ; explicit test for trainer.predict() path |
| 5 | Closed-form math error (silent prod bug) | low (after symbolic check) | critical | Two-layer verification per §4.1 : (a) SymPy symbolic equivalence between closed-form and analytical derivative — exact match after sympy.simplify, (b) numerical finite differences as secondary sanity check. If symbolic discrepancy found, regenerate from sympy.lambdify. Addresses Track 7 reco #13 + committee 7a5c1d73 blocker. |
| 6 | Sweep cost : 5 variants × 5 cryptos × 5 folds = 125 runs ; each ~5-10 min | low | low | Standard FTF cost envelope, no new cluster scaling needed |
| 7 | Focal loss outputs sharper distributions ; isotonic/Platt may miscalibrate | high (well-documented in literature) | medium | Proactive temperature scaling per §4.5 — added as Phase 2 deliverable, not follow-up. FTF variant matrix compares isotonic vs temperature_scaling per ECE_HOLD. Best wins. Per committee 7a5c1d73 reco #2. |
| 8 | Phase 4 sweep results are statistically significant on f1_buy but variance across cryptos > expected → narrow alpha range may have missed sweet spot | medium | low | If sweep results are promising but suboptimal, follow-up sweep with expanded alpha range [0.1, 0.25, 0.5, 0.75, 0.9] × best gamma. Per committee 7a5c1d73 reco #4. |
8. Cost estimate
| Phase | Effort | Wall-clock |
|---|---|---|
| 1 (Plan + committee) | 2 h | 1 day (committee + recos round-trip) |
| 2 (Implem + tests + docs) | 4 h | 1.5 days (custom obj + symbolic and numerical verification + integration tests + MLOps readiness) |
| 3 (PR + CR + committee pr_review + merge + deploy) | 4 h | 2 days (CR cycles 4-5 + committee + image build/deploy) |
| 4 (FTF sweep + stats) | 30 min effort | 1 day (cluster compute) |
| 5 (Results dossier + gate + lock + close) | 2 h | 1 day |
| Total | ~12 h effort | ~6 days wall-clock |
Slightly more than Track 5 because of (a) the symbolic + numerical gradient verification work, (b) custom obj/metric being a higher-risk surface than label smoothing.
9. Out of scope (tracked separately)
- Joint Track 5 × Track 6 lock decision → Phase 5 follow-up if individual gates pass
- Temperature scaling as alternative calibration method for focal loss outputs → potential follow-up if Risk #7 materializes
- Generalized custom-loss verification framework → potential infra Story under `CVN-N013-EA` if other Tracks need similar verification (Track 7 vetoed for now)
- Track 7 (cost-sensitive P&L loss) : VETOED per `F1_BUY_BOOST_PLAN.md` §5 unless quick-win bundle delivers positive expectancy AND `f1_buy ≥ 0.50`
10. Question for the committee
Is the variant matrix in §3 (5 configs spanning γ ∈ {0, 1, 2, 4} × α ∈ {0.25, 0.5, 0.75} subset) appropriate, or should it be expanded / restricted ?
Is the numerical gradient verification approach (finite differences in unit tests, tolerance < 1e-4) sufficient given Track 7 committee veto reco #13 mandated formal verification for custom losses ?
Should focal loss explicitly require an alternative calibration method (e.g., temperature scaling) given the known issue that focal-loss-trained models output sharper probabilities that may not be ideal for isotonic/platt calibration — or is the existing ECE_HOLD gate the right way to detect this in-band ?
11. Response to committee c90e7749 EXECUTION_RISK blockers
Committee c90e7749 (v2 plan_review) raised 2 blockers under code EXECUTION_RISK :
Blocker 1 : "Absence of ADR-25 Compliant Fallback / Kill-Switch"
Response : already implemented at the platform level, NOT in scope of this Story.
- ADR-71 (Trading kill-switch) is the system-wide invariant : single PostgreSQL source, operator-only disengage, fail-safe on connectivity loss, halt latency < 1s. Covers ALL trading decisions regardless of which loss function trained the underlying model.
- wp#56 (`CVN-N001-EF-S02` : Implement system-wide trading kill-switch) is CLOSED (merged + production-deployed).
- Per ADR-25 ("no silent fallback") : the trainer-side `_assert_dmatrix_contract` (PR #754) and `_assert_calibration_targets_discrete` (PR #765) are the fail-fast assertions for THIS code path. If focal loss produces an invalid model artifact, training fails loudly ; no silent degradation reaches production.
- The "revert to baseline" mechanism at the loss-function level = `CVN_LOSS_FUNCTION=binary:logistic` (default). Operator flips one Console value, next training run uses cross-entropy. No code change required, no rollback PR needed.
Blocker 2 : "Lack of Staged Rollout and Formal Rollback"
Response : FTF sweep + Console-controlled lock IS the project's staged rollout convention.
- Phase 4 of this plan (FTF sweep) is shadow-mode by definition : 125 training runs across 5 cryptos × 5 folds × 5 variants, results land in `finetune_results` PostgreSQL table, NO trading decisions are influenced. This is a 1-day shadow pre-production validation.
- Per-track gate decision (Phase 5) requires statistical significance (BH-corrected p < 0.05, Cohen's d ≥ 0.3, CI95 excluding 0) AND business viability (expectancy_net ≥ baseline, sortino ≥ baseline, ECE_HOLD ≤ baseline + 0.01). Failure to clear ANY criterion → `keep available` or `abandon`, no lock.
- Lock = Console flip in `ftf_baseline.json` (per ADR-59). Operator controls, atomic, instantly reversible by another flip.
- Rollback at the model level : per ADR-42 (atomic crypto-level promotion) + ADR-12 (engine version frozen), every model deployment is snapshot-versioned and the previous champion stays callable. Rollback = MLflow model promotion to previous version : operator action, < 5 min wall-clock.
Why this dossier doesn't add new infra for these concerns
Adding kill-switch / canary infrastructure to a single FTF factor Story would (a) duplicate existing platform invariants (ADR-71, ADR-42, ADR-59), (b) bloat scope from a 12-h Story to multi-week infra work, (c) contradict the operator's explicit Pattern C convention (Story phases focus on the deliverable, not on platform infra that's already in place).
The committee's recommendations #1-#3 are LEGITIMATE concerns at the platform level ; they belong in CVN-N013 (Infrastructure & DevX) or CVN-N011-EA (Pipeline Contract Hardening) follow-ups, not in this single FTF factor Story. If platform-level coverage is judged inadequate at re-review, a dedicated Story should be opened in those Epics — not bolted onto Track 6.
Operator decision request : confirm that platform-level kill-switch + Console-flip rollback are sufficient for this Story's scope, OR explicit instruction to expand scope to include focal-loss-specific infra (in which case Story should be re-sized + re-staffed).
12. Linked context
- Canonical plan : `documentation/F1_BUY_BOOST_PLAN.md` §5 Track 6 + §4.2 (variant matrix convention) + §6 (per-track gate criteria)
- Predecessor Story : `CVN-N001-EE-S01` Track 5 (wp#40, in testing post FTF sweep with the recent Bug #1 + #2 fixes)
- Pipeline contract hardening : `CVN-N011-EA` (wp#68), defense-in-depth applied during Track 5 incidents (`_assert_dmatrix_contract`, `_assert_calibration_targets_discrete`, `_invoke_calibration` helper)
- Track 7 committee veto : reco #13 of session `9d4942cb` mandates formal verification of custom losses
- ADR-15 : theta calibrated OOS (precedent for hold-out calibration)
- ADR-56 : every change gated by env var + FTF factor (CVN_LOSS_FUNCTION + CVN_FOCAL_*)
- ADR-58 : every FTF factor must have guardrail + integration test
- ADR-68 : Expert Committee for substantive ML PR (this Story qualifies)
- ADR-70 : MLOps readiness template mandatory before merge to main