Committee pr_review dossier: Track 6 Focal loss (PR #767)
Date : 2026-04-28
Story : CVN-N001-EE-S02 (OP wp#41)
GH issue : #713
PR : #767 commit b2bbb513
Session type : pr_review (per ADR-68)
Predecessor : plan_review session 4ef337af PASSED OK consensus strong (after v1+v2 REJECTED rounds)
Revision history :
- v1 (committee c47b2afa) : REJECTED INSUFFICIENT_EVIDENCE consensus strong. 2 blockers : (a) MLOps readiness lacked detail (committee couldn't fetch the external mlops_readiness.md file referenced by link), (b) OOS validation evidence missing (ADR-14/15 mandate not visible in artifact).
- v2 (committee 9c529319) : REJECTED EXECUTION_RISK consensus strong. Single blocker : runbooks for P1 + P2 alerts marked "Phase 3 deliverable" not yet in repo — operational risk on deployment.
- v3 (this revision) : runbooks runbook_focal_loss_negative_expectancy.md + runbook_focal_training_crash.md authored + committed in this PR. Both with diagnosis steps, 3 remediation paths each, escalation criteria, post-incident checklist. Re-submitted.
Context
Implementation of focal loss as XGBoost custom objective per the plan dossier validated by 4ef337af. This dossier focuses on the implementation faithfulness + tests coverage, not on the methodology itself (which was settled in plan_review).
Diff summary
9 files changed, ~1300 insertions
- src/training/XGBoost/focal_loss.py (NEW, 177 lines)
- src/training/XGBoost/cvntrade_XGBoost_trainer.py (+~150 lines : focal wiring + temp scaling + 4 predict() audit sites)
- src/commun/finetune/ablation_matrix.py (+40 lines : focal_loss factor)
- src/commun/finetune/guardrails.py (+40 lines : focal_loss validator)
- tests/unit/training/XGBoost/test_focal_loss_formal_verification.py (NEW, 32 tests)
- tests/integration/test_track6_focal_loss.py (NEW, 19 tests)
- documentation/stories/CVN-N001-EE-S02/mlops_readiness.md (NEW, ADR-70)
Implementation summary
Formal verification (committee 7a5c1d73 blocker, Track 7 reco #13)
- focal_loss.py derives grad + hess via sympy.lambdify; ZERO hand-coded math
- _build_focal_lambdas() runs at module import (~50ms), produces 4 numpy callables
- Test TestFocalLossSymbolicEquivalence: 20 parametrized cases asserting symbolic equivalence at rel=1e-12
- Test TestFocalLossNumericalSanity: 8 finite-difference cases (grad tol 1e-4, hess tol 1e-3)
- Test TestFocalLossInvariants: math invariants (gamma=0 reduces to weighted CE, hess > 0, no NaN/inf at extreme logits, focal rewards correct predictions)
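The numerical-sanity and invariant checks listed above can be illustrated with a standard-library sketch. This is not the code in focal_loss.py: the loss form (alpha-balanced binary focal loss on a raw logit z), the function names and the default gamma/alpha are assumptions for illustration, and the derivative comes from central finite differences rather than from sympy.lambdify.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def focal_loss(z: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Alpha-balanced binary focal loss on a raw logit z (assumed form)."""
    # Clip p to [1e-7, 1 - 1e-7], the range quoted for Risk #2.
    p = min(max(sigmoid(z), 1e-7), 1.0 - 1e-7)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def fd_grad(z: float, y: int, h: float = 1e-5, **kw: float) -> float:
    """Central finite-difference estimate of d(loss)/dz."""
    return (focal_loss(z + h, y, **kw) - focal_loss(z - h, y, **kw)) / (2.0 * h)


# Invariant: with gamma=0 the gradient is the weighted cross-entropy one,
# alpha_t * (p - y).
for z in (-2.0, 0.3, 1.5):
    for y in (0, 1):
        alpha_t = 0.25 if y == 1 else 0.75
        expected = alpha_t * (sigmoid(z) - y)
        assert abs(fd_grad(z, y, gamma=0.0, alpha=0.25) - expected) < 1e-6

# Invariant: a confident correct prediction is penalised less when gamma > 0.
assert focal_loss(3.0, 1, gamma=2.0) < focal_loss(3.0, 1, gamma=0.0)

# Invariant: the loss stays finite at extreme logits thanks to the clipping.
assert all(math.isfinite(focal_loss(z, y)) for z in (-800.0, 800.0) for y in (0, 1))
```

A finite-difference check of this kind only bounds the error of a derivative; the symbolic-equivalence suite is what rules out a closed-form mistake.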
Predict() path audit (committee reco #3, plan §4.6)
Snapshot disclaimer : line numbers below reflect commit 57b366f6 (the
state reviewed by committee 13fd89c9). For post-merge reading, use
grep -n "raw = self.model.predict" to locate the current sites — the
contract behavior matters more than the literal line numbers.
4 sites updated to handle focal raw margin output :
- XGBWrapper.predict_proba (line 210) : applies sigmoid when raw_output_logits=True
- XGBWrapper.predict (line 195) : uses logit threshold 0 instead of P=0.5
- _evaluate_model (line 665) : sigmoid before reshape to (n, 2)
- trainer.predict() (line 881) : sigmoid before reshape to (n, 2)
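The contract behind these four sites can be sketched in a few lines. The class below is a hypothetical stand-in, not XGBWrapper itself: the name RawMarginWrapper, the predict_raw callable and the list-based return types are assumptions, kept dependency-free so the logic is easy to follow.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class RawMarginWrapper:
    """Hypothetical sklearn-style wrapper around a model's predict function."""

    def __init__(self, predict_raw: Callable[[Sequence[float]], Sequence[float]],
                 raw_output_logits: bool) -> None:
        # The flag is fixed at construction: the trainer knows which loss it used.
        self._predict_raw = predict_raw
        self.raw_output_logits = raw_output_logits

    def predict_proba(self, X: Sequence[float]) -> List[List[float]]:
        raw = self._predict_raw(X)
        # Focal (custom objective): raw margins, so apply the sigmoid first.
        pos = [sigmoid(r) for r in raw] if self.raw_output_logits else list(raw)
        # Shape (n, 2): column 0 is P(class 0), column 1 is P(class 1).
        return [[1.0 - p, p] for p in pos]

    def predict(self, X: Sequence[float]) -> List[int]:
        raw = self._predict_raw(X)
        # Logit 0 corresponds to P = 0.5, so the decision rule is unchanged.
        cut = 0.0 if self.raw_output_logits else 0.5
        return [1 if r > cut else 0 for r in raw]


# Usage with a fake model that echoes its input as the raw margin.
focal_like = RawMarginWrapper(lambda X: list(X), raw_output_logits=True)
labels = focal_like.predict([-2.0, 0.5, 3.0])
probas = focal_like.predict_proba([-2.0, 0.5, 3.0])
```

Thresholding the margin at 0 and thresholding the sigmoid output at 0.5 give the same labels, which is why the two predict paths can stay consistent.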
Temperature scaling (committee reco #2, plan §4.5)
- New _apply_temperature_scaling method using scipy.optimize.minimize_scalar (method="bounded" over T ∈ [0.1, 10.0]) on calibration set NLL
- Wrapped in TemperatureScaledModel (module-level class, extracted in CR pass 1+2 fix for MLflow pickle compat) exposing sklearn-compatible predict_proba / predict
- Handles both focal-active (raw logits direct) and binary:logistic (logit recovered from probability)
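As a rough illustration of single-parameter temperature scaling, the sketch below fits T by minimizing calibration-set NLL over the same interval [0.1, 10.0]. To stay dependency-free it substitutes a hand-written golden-section search for scipy.optimize.minimize_scalar; the function names and the toy data are assumptions, not the PR's code.

```python
import math
from typing import Sequence


def nll(logits: Sequence[float], labels: Sequence[int], T: float) -> float:
    """Mean negative log-likelihood of sigmoid(logit / T) against binary labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / T if y == 1 else -z / T
        # -log(sigmoid(s)) written in a numerically stable way.
        total += math.log1p(math.exp(-s)) if s >= 0 else -s + math.log1p(math.exp(s))
    return total / len(logits)


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    lo: float = 0.1, hi: float = 10.0, tol: float = 1e-6) -> float:
    """Golden-section search for the T in [lo, hi] that minimizes the NLL."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, c) < nll(logits, labels, d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


# Toy calibration set: very confident logits (|z| = 4) but only 6 of 8 correct.
cal_logits = [4.0, 4.0, 4.0, -4.0, -4.0, -4.0, 4.0, -4.0]
cal_labels = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(cal_logits, cal_labels)
# T > 1 here: the scores are overconfident and get softened.
assert T > 1.0 and nll(cal_logits, cal_labels, T) < nll(cal_logits, cal_labels, 1.0)
```

The NLL is convex in 1/T, hence unimodal in T, which is what makes a bracketing search like this (or the bounded scalar minimizer) appropriate. For a binary:logistic model the logit would first be recovered as log(p / (1 - p)).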
FTF factor + guardrail (ADR-58)
- ablation_matrix.MODEL_FACTORS: focal_loss factor with 5 variants
- guardrails._validate_focal_loss: rejects loss_function ∉ {binary:logistic, focal}, gamma ∉ [0, 10], alpha ∉ [0, 1]
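The three rejection rules translate into a small validator. The sketch below is hypothetical: the function name, signature and error messages are assumptions and do not reproduce guardrails._validate_focal_loss.

```python
ALLOWED_LOSS_FUNCTIONS = {"binary:logistic", "focal"}


def validate_focal_loss(loss_function: str, gamma: float, alpha: float) -> None:
    """Raise ValueError when a focal-loss configuration is out of bounds."""
    if loss_function not in ALLOWED_LOSS_FUNCTIONS:
        raise ValueError(f"loss_function must be one of {sorted(ALLOWED_LOSS_FUNCTIONS)}")
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not 0.0 <= gamma <= 10.0:
        raise ValueError(f"gamma must be in [0, 10], got {gamma!r}")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"alpha must be in [0, 1], got {alpha!r}")


# A valid variant passes silently; an out-of-range gamma is rejected.
validate_focal_loss("focal", gamma=2.0, alpha=0.25)
try:
    validate_focal_loss("focal", gamma=10.5, alpha=0.25)
except ValueError as exc:
    rejected_message = str(exc)
```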
MLOps readiness (ADR-70)
6 sections filled : monitoring (5 metrics), alerting (2 with runbooks), drift (3 detectors including focal-specific output sharpness), staged rollout (FTF sweep + canary BTCUSDC + full), rollback (Console flip < 1min), DRI sunset 2026-07-28.
Validation
Snapshot disclaimer : the test count below is a snapshot at commit
b2bbb513 (the state v1 of the dossier was reviewed against). Latest count
is 284/284 (1 new MLflow pickle test added in commit 57b366f6 per CR pass
1+2). Refer to the merged PR for the definitive state.
- pytest tests/unit/training/XGBoost/ tests/integration/test_track6_focal_loss.py tests/integration/test_track5_label_smoothing.py tests/unit/training/labels/ tests/unit/test_ftf_guardrails.py tests/unit/test_calibration_metrics.py tests/unit/test_oos_calibrator.py tests/unit/test_model_trainers.py → 284 passed (originally 283, +1 pickle regression test)
- black --line-length=120 → OK
- mkdocs build --strict → OK (only pre-existing INFO warnings)
Notable design choices
- SymPy lambdify over hand-coded math : per Track 7 reco #13. Trade-off : ~50ms import-time cost (one-shot) vs zero math drift risk forever. Numerical perf is identical (lambdify produces vectorized numpy code).
- Temperature scaling proactive (not follow-up) : per committee 4ef337af reco #2. Mukhoti et al. 2020 documents that focal-trained models output sharper distributions than CE-trained, so isotonic/Platt may not be the right calibrator. We add temperature_scaling alongside isotonic/sigmoid/platt/hybrid so the FTF sweep can compare ECE_HOLD across calibration methods.
- Predict() raw_output_logits flag on XGBWrapper : the wrapper is sklearn-facing (CalibratedClassifierCV uses it), but XGBoost's custom obj makes its predict() return raw margins. We add a flag rather than auto-detecting at call time so the contract is explicit at construction time (the trainer knows whether it trained with focal or not).
- Variant matrix kept at 5 configs : per plan §3, follows F1_BUY_BOOST_PLAN.md §4.2 convention. If results are promising but suboptimal, follow-up sweep with expanded alpha range per Risk #8.
Risks acknowledged in plan + carried into PR
- Risk #1 (custom obj breaks early stopping) : mitigated via custom_metric. Verified in integration tests (no crash at training time).
- Risk #2 (gradient instability extreme γ or small p_t) : numerical clipping p ∈ [1e-7, 1-1e-7] + hess >= 1e-7; guardrail rejects γ > 10. Verified in test_grad_hess_finite_at_extreme_logits.
- Risk #5 (closed-form math error) : ELIMINATED by lambdify approach + symbolic equivalence test.
- Risk #7 (focal output too sharp for isotonic) : mitigated proactively via temperature_scaling Phase 2 deliverable.
- Risk #8 (variant matrix may miss sweet spot) : Phase 4 follow-up if needed.
6. MLOps readiness: full content (per committee c47b2afa blocker #1)
Embedded inline so the committee can assess directly without external file fetch :
6.1 Production monitoring (5 metrics)
| Metric | Type | Source | Threshold (warn / crit) | Owner |
|---|---|---|---|---|
| xgboost_training_failed{loss_function=focal} count | counter | log_event in trainer | warn 1/h, crit 5/h | dococeven |
| focal_loss_active{gamma,alpha} flag | gauge | log_event focal_loss_active | informational | dococeven |
| signals.buy_proba quantile distribution per crypto | histogram | OTel span prediction.buy_proba | warn if drift > 0.10 vs training | dococeven |
| expectancy_net_realized per crypto | gauge | trade journal Postgres | warn < 0 over 24h, CRIT < 0 over 48h | dococeven |
| inference_latency_p99 | histogram | OTel span inference_api.predict | warn > 100ms p99, crit > 500ms p99 | dococeven |
6.2 Alerts (2)
| Alert | Trigger | Severity | Channel | Runbook |
|---|---|---|---|---|
| expectancy_negative_focal | expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1 | P1 | Slack #cvntrade-alerts + SMS DRI | runbook_focal_loss_negative_expectancy.md (committed in this PR) |
| focal_training_crash | xgboost_training_failed{loss_function=focal} > 5/h | P2 | Slack #cvntrade-alerts | runbook_focal_training_crash.md (committed in this PR) |
6.3 Drift detection (3 detectors)
| Drift type | Method | Window | Threshold (warn/crit) | Action on trigger |
|---|---|---|---|---|
| Data drift | PSI on top-5 features by FI | rolling 7d vs training | PSI > 0.2 / 0.5 | shadow re-train candidate ; NO silent action (per ADR-25) |
| Concept drift | rolling perf gap (live f1_buy vs training OOS f1_buy) | 14d window, per crypto | gap > 0.05 / 0.10 | retrain candidate, hold live, alert DRI |
| Output sharpness drift (focal-specific) | rolling mean of max(p, 1-p) (predicted confidence) | 7d window | warn if > training_mean + 0.10 | check temperature scaling fit ; recalibrate if persistent ; reverts to baseline if 2 consecutive triggers |
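Two of the detectors in the table reduce to short formulas. The sketch below shows a Population Stability Index over fixed bins and the mean predicted confidence max(p, 1-p); the binning scheme, the eps floor for empty bins and the function names are assumptions, since the dossier does not specify how the production detectors are implemented.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        cut_points: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples, binned at cut_points."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(cut_points) + 1)
        for v in values:
            # Bin index = number of cut points strictly below the value.
            counts[sum(v > c for c in cut_points)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def mean_confidence(probas: Sequence[float]) -> float:
    """Mean of max(p, 1 - p): the sharpness statistic tracked for focal models."""
    return sum(max(p, 1.0 - p) for p in probas) / len(probas)


training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live = [0.6, 0.7, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9]
cuts = [0.25, 0.5, 0.75]

data_drift_crit = psi(training, live, cuts) > 0.5  # crit threshold from the table
sharpness_warn = mean_confidence(live) > mean_confidence(training) + 0.10
```

With identical samples the PSI is exactly zero, which makes a convenient smoke test for any binning choice.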
6.4 Kill-switch (per committee c47b2afa reco #3)
Two-tier rollback :
- Factor-level kill-switch (PRIMARY, < 1min) : flip CVN_LOSS_FUNCTION=binary:logistic in ftf_config.base_env via Console (per ADR-59). Effect : next training run uses cross-entropy ; in-flight model unaffected (operator may also re-promote previous champion via MLflow alias for instant effect).
- System-wide kill-switch (per ADR-71) : already implemented (wp#56 closed). Halts ALL trading regardless of which loss trained the underlying model. Sub-1s halt latency, fail-safe on connectivity loss.
Activation criteria :
- Auto-trigger : 2 consecutive P1 alerts on expectancy_negative_focal within 1h → operator paged + factor-level kill-switch flipped automatically (operator confirms within 15min or revert is held)
- Manual : DRI dococeven invokes Console flip when drift / sharpness / canary fails
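The auto-trigger rule can be read as a simple predicate over alert timestamps. The sketch below is one possible interpretation, assuming "2 consecutive P1 alerts within 1h" means two successive firings of expectancy_negative_focal no more than one hour apart; the function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def should_auto_flip(p1_alert_times: Sequence[datetime],
                     window: timedelta = timedelta(hours=1)) -> bool:
    """True when two successive P1 alerts fired within `window` of each other."""
    times = sorted(p1_alert_times)
    return any(later - earlier <= window for earlier, later in zip(times, times[1:]))


t0 = datetime(2026, 4, 28, 9, 0)
flip_now = should_auto_flip([t0, t0 + timedelta(minutes=30)])  # 30 min apart
hold = should_auto_flip([t0, t0 + timedelta(hours=2)])         # 2 h apart
```

The operator confirmation step (15min) is deliberately left out of the predicate, since it is a workflow concern rather than a detection rule.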
7. OOS validation infrastructure (per committee c47b2afa blocker #2)
The FTF sweep IS the OOS validation — it leverages the existing infrastructure proven on Track 5 (PR #734) and committee-validated (session 989a6567 PASSED OK). For Track 6 :
7.1 Multi-fold evaluation (ADR-14)
- 5 cryptos × 5 folds × 5 variants = 125 training runs per FTF sweep
- Each fold is a purged + embargoed time-series split (per ADR-14, López de Prado AFML Ch 7), implemented in commun.cv.purged_kfold.PurgedKFold
- Purge horizon = label horizon (4 candles for SL0.5_TP1_H4 PTE strategy, configurable via CVN_PURGE_HORIZON)
- Embargo = 0.5% of dataset size (configurable via CVN_EMBARGO_PCT)
- Splits are deterministic (seeded) to ensure reproducibility across factor variants
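For readers unfamiliar with purging and embargo, the sketch below generates contiguous time-ordered folds and removes training samples near each test block. It is a simplified illustration, not the PurgedKFold class: it assumes every label spans a fixed horizon of `purge` samples, purges that many samples on both sides of the test block, and then applies the embargo after it.

```python
from typing import Iterator, List, Tuple


def purged_kfold_indices(n_samples: int, n_splits: int = 5, purge: int = 4,
                         embargo_pct: float = 0.005
                         ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, purged + embargoed folds."""
    embargo = int(round(n_samples * embargo_pct))
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if k == n_splits - 1 else start + fold_size
        test = list(range(start, stop))
        # Purge: drop train samples whose label window overlaps the test block.
        left_end = max(0, start - purge)
        # Embargo: additionally skip a buffer right after the test block.
        right_start = min(n_samples, stop + purge + embargo)
        train = list(range(0, left_end)) + list(range(right_start, n_samples))
        yield train, test


# 1000 samples, 4-candle horizon, 0.5% embargo (5 samples).
splits = list(purged_kfold_indices(1000, n_splits=5, purge=4, embargo_pct=0.005))
train_idx, test_idx = splits[1]  # test block covers indices 200..399
```

No shuffling is involved, so the folds are deterministic by construction; a seeded implementation would achieve the same reproducibility if it randomizes anything.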
7.2 Theta calibrated OOS (ADR-15)
- Threshold for BUY decision (CVN_THRESHOLD_BUY) is calibrated on a separate hold-out fold that's neither train nor val, isolated from any per-fold optimization to avoid leakage
- Bug #1 fix (PR #765) explicitly aligned _apply_calibration to use (X_val, y_val) per ADR-15 precedent
- Track 6 inherits this contract : focal loss outputs go through the same calibration on hold-out path
- Temperature scaling adds an OOS recalibration option specifically for focal-trained models
7.3 Statistical gates (per F1_BUY_BOOST_PLAN.md §6)
After 125 rows in finetune_results, the sweep runs the standard analysis pipeline (existing infrastructure) :
- BH-corrected p-values (Benjamini-Hochberg FDR control across the 4 non-baseline variants × 5 folds)
- Bootstrap CI 95% on f1_buy delta vs baseline
- Cohen's d effect size on f1_buy
- Per-asset breakdown : ≥ 4/5 cryptos must individually improve
- Per-fold variance reported alongside median (per F1_BUY_BOOST_PLAN.md §7)
- ECE_HOLD (calibration on holdout fold) ≤ baseline + 0.01
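Two of these gates are easy to show in isolation. The sketch below implements the Benjamini-Hochberg step-up procedure and a pooled-SD Cohen's d using only the standard library; it is not the existing analysis pipeline, and the function names, the q=0.05 level and the example numbers are assumptions (a paired effect size may be more appropriate when folds are matched across variants).

```python
import math
import statistics
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under rank/m * q.
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


def cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Cohen's d with a pooled sample standard deviation."""
    n1, n2 = len(treated), len(baseline)
    pooled_var = ((n1 - 1) * statistics.variance(treated)
                  + (n2 - 1) * statistics.variance(baseline)) / (n1 + n2 - 2)
    return (statistics.mean(treated) - statistics.mean(baseline)) / math.sqrt(pooled_var)


# Illustrative p-values only, not sweep results.
decisions = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

In the example only the two smallest p-values survive the correction, even though five of them are below 0.05 individually.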
7.4 PBO / CSCV considerations
The FTF sweep at 5 variants is small enough that PBO (Probability of Backtest Overfitting) risk is low (López de Prado et al. 2014 — PBO rises with the number of variants explored). The 4 non-baseline variants vs 1 baseline = 4 hypotheses ; with 5-fold CV, false-discovery probability under H0 is well-controlled by BH correction.
If a follow-up sweep expands the alpha range (Risk #8 mitigation), PBO analysis MUST be added to the results dossier explicitly. Tracked as Phase 4 conditional acceptance criterion.
7.5 Pre-merge OOS evidence
This PR ships the implementation. The OOS results will land in Phase 4 (sweep) → Phase 5 (results dossier 2026-MM-DD-track6-focal-loss-results.md). The integration tests in this PR are NOT a substitute for the FTF sweep — they verify code correctness, not statistical validity. Lock decision (Phase 5) requires the full statistical evidence.
Question for the committee
Is the focal loss implementation faithful to the plan validated by 4ef337af, with appropriate formal verification (SymPy + numerical) and predict() path audit, or does any of (a) the temperature scaling implementation choices (bounded scalar minimization over T, single param), (b) the XGBWrapper raw_output_logits flag design, (c) the variant matrix at 5 configs raise a blocker before merge?
Bonus : are the MLOps readiness 5 metrics + 2 alerts + 3 drift detectors sufficient for shadow + canary stages, or should additional production observability be wired BEFORE the FTF sweep launches (Phase 4) ?