Committee `pr_review` dossier — Track 6 Focal loss (PR #767)

Date : 2026-04-28
Story : CVN-N001-EE-S02 (OP wp#41)
GH issue : #713
PR : #767 commit b2bbb513
Session type : pr_review (per ADR-68)
Predecessor : plan_review session 4ef337af PASSED OK consensus strong (after v1+v2 REJECTED rounds)
Revision history :
- v1 (committee c47b2afa) : REJECTED INSUFFICIENT_EVIDENCE consensus strong. 2 blockers : (a) MLOps readiness lacked detail (committee couldn't fetch the external mlops_readiness.md file referenced by link), (b) OOS validation evidence missing (ADR-14/15 mandate not visible in artifact).
- v2 (committee 9c529319) : REJECTED EXECUTION_RISK consensus strong. Single blocker : runbooks for P1 + P2 alerts marked "Phase 3 deliverable" not yet in repo, an operational risk on deployment.
- v3 (this revision) : runbooks runbook_focal_loss_negative_expectancy.md + runbook_focal_training_crash.md authored + committed in this PR. Both include diagnosis steps, 3 remediation paths each, escalation criteria, post-incident checklist. Re-submitted.

Context

Implementation of focal loss as an XGBoost custom objective per the plan dossier validated by 4ef337af. This dossier focuses on implementation faithfulness and test coverage, not on the methodology itself (which was settled in plan_review).

Diff summary

9 files changed, ~1300 insertions
- src/training/XGBoost/focal_loss.py (NEW, 177 lines)
- src/training/XGBoost/cvntrade_XGBoost_trainer.py (+~150 lines : focal wiring + temp scaling + 4 predict() audit sites)
- src/commun/finetune/ablation_matrix.py (+40 lines : focal_loss factor)
- src/commun/finetune/guardrails.py (+40 lines : focal_loss validator)
- tests/unit/training/XGBoost/test_focal_loss_formal_verification.py (NEW, 32 tests)
- tests/integration/test_track6_focal_loss.py (NEW, 19 tests)
- documentation/stories/CVN-N001-EE-S02/mlops_readiness.md (NEW, ADR-70)

Implementation summary

Formal verification (committee `7a5c1d73` blocker — Track 7 reco #13)

focal_loss.py derives grad + hess via sympy.lambdify ; ZERO hand-coded math
_build_focal_lambdas() runs at module import (~50ms), produces 4 numpy callables
Test TestFocalLossSymbolicEquivalence : 20 parametrized cases asserting symbolic equivalence at rel=1e-12
Test TestFocalLossNumericalSanity : 8 finite-difference cases (grad tol 1e-4, hess tol 1e-3)
Test TestFocalLossInvariants : math invariants (gamma=0 reduces to weighted CE, hess > 0, no NaN/inf at extreme logits, focal rewards correct predictions)
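
The finite-difference and invariant checks listed above can be illustrated with a minimal standard-library sketch. It is not the code in focal_loss.py (which derives grad + hess via sympy.lambdify) ; the function names, the scalar interface and the step sizes are assumptions made for illustration. It differentiates the focal loss numerically and compares the result at gamma=0 with the known weighted cross-entropy derivatives.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def focal_loss(z: float, y: int, gamma: float, alpha: float) -> float:
    """Binary focal loss on a raw logit z (hypothetical scalar helper).

    p is clipped to [1e-7, 1 - 1e-7], mirroring the clipping named in Risk #2.
    """
    p = min(max(sigmoid(z), 1e-7), 1.0 - 1e-7)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def fd_grad(z: float, y: int, gamma: float, alpha: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of d(loss)/dz."""
    up = focal_loss(z + h, y, gamma, alpha)
    down = focal_loss(z - h, y, gamma, alpha)
    return (up - down) / (2.0 * h)


def fd_hess(z: float, y: int, gamma: float, alpha: float, h: float = 1e-4) -> float:
    """Central finite-difference approximation of d2(loss)/dz2."""
    up = focal_loss(z + h, y, gamma, alpha)
    mid = focal_loss(z, y, gamma, alpha)
    down = focal_loss(z - h, y, gamma, alpha)
    return (up - 2.0 * mid + down) / (h * h)


if __name__ == "__main__":
    # At gamma=0 the loss is weighted CE, whose derivative is alpha_t * (p - y).
    z, y, alpha = 0.5, 1, 0.25
    print(fd_grad(z, y, 0.0, alpha), alpha * (sigmoid(z) - y))
```

The sketch only asserts hess > 0 at gamma=0, where the second derivative is alpha_t * p * (1 - p) ; for gamma > 0 the raw second derivative is not guaranteed positive, which is presumably why a hess floor is applied.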

Predict() path audit (committee reco #3, plan §4.6)

Snapshot disclaimer : line numbers below reflect commit 57b366f6 (the state reviewed by committee 13fd89c9). For post-merge reading, use grep -n "raw = self.model.predict" to locate the current sites — the contract behavior matters more than the literal line numbers.

4 sites updated to handle focal raw margin output :
- XGBWrapper.predict_proba (line 210) : applies sigmoid when raw_output_logits=True
- XGBWrapper.predict (line 195) : uses logit threshold 0 instead of P=0.5
- _evaluate_model (line 665) : sigmoid before reshape to (n, 2)
- trainer.predict() (line 881) : sigmoid before reshape to (n, 2)
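
The contract behind these four sites can be sketched as follows. The class below is a hypothetical stand-in, not XGBWrapper itself : `MarginWrapper`, its constructor and the `raw_scores_fn` callable are assumptions, and only the two behaviors named above are shown (sigmoid before building the (n, 2) output, and a logit threshold of 0 being equivalent to P=0.5).

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class MarginWrapper:
    """Hypothetical wrapper illustrating the raw_output_logits contract."""

    def __init__(self, raw_scores_fn: Callable[[object], Sequence[float]],
                 raw_output_logits: bool) -> None:
        # raw_scores_fn stands in for the underlying model's predict():
        # raw margins when trained with a custom objective, probabilities otherwise.
        self._raw_scores_fn = raw_scores_fn
        self.raw_output_logits = raw_output_logits

    def predict_proba(self, X: object) -> List[List[float]]:
        raw = self._raw_scores_fn(X)
        probs = [sigmoid(r) for r in raw] if self.raw_output_logits else list(raw)
        # Two columns per row: P(class 0), P(class 1).
        return [[1.0 - p, p] for p in probs]

    def predict(self, X: object) -> List[int]:
        raw = self._raw_scores_fn(X)
        # sigmoid(0) == 0.5, so thresholding margins at 0 matches P > 0.5.
        cut = 0.0 if self.raw_output_logits else 0.5
        return [1 if r > cut else 0 for r in raw]


if __name__ == "__main__":
    margins = [-2.0, -0.1, 0.3, 4.0]
    wrapper = MarginWrapper(lambda X: margins, raw_output_logits=True)
    print(wrapper.predict(None), wrapper.predict_proba(None))
```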

Temperature scaling (committee reco #2, plan §4.5)

New _apply_temperature_scaling method using scipy.optimize.minimize_scalar (method="bounded" over T ∈ [0.1, 10.0]) on calibration set NLL
Wrapped in TemperatureScaledModel (module-level class — extracted in CR pass 1+2 fix for MLflow pickle compat) exposing sklearn-compatible predict_proba / predict
Handles both focal-active (raw logits direct) and binary:logistic (logit recovered from probability)
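
A minimal sketch of the idea, under stated assumptions : the PR uses scipy.optimize.minimize_scalar with method="bounded" ; to stay dependency-free this illustration swaps in a plain golden-section search over the same interval T ∈ [0.1, 10.0]. The function names and the toy calibration set are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def nll(T: float, logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / T
        total -= log_sigmoid(s) if y == 1 else log_sigmoid(-s)
    return total / len(logits)


def logit_from_proba(p: float, eps: float = 1e-7) -> float:
    """Recover a logit from a probability (the binary:logistic case)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    lo: float = 0.1, hi: float = 10.0, tol: float = 1e-6) -> float:
    """Golden-section search for the T in [lo, hi] minimizing calibration NLL.

    Stand-in for a bounded scalar minimizer; assumes the NLL is unimodal in T.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = nll(c, logits, labels), nll(d, logits, labels)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(c, logits, labels)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(d, logits, labels)
    return (a + b) / 2.0


if __name__ == "__main__":
    # Toy calibration set: confident logits (|z| = 4) but only 6 of 8 correct.
    logits = [4.0, 4.0, 4.0, -4.0, -4.0, -4.0, 4.0, -4.0]
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    print(fit_temperature(logits, labels))
```

On this toy set the optimum has a closed form : the calibrated confidence must equal the 75% hit rate, so sigmoid(4/T) = 0.75 and T = 4 / ln 3. A fitted T > 1 softens overconfident outputs ; T < 1 would sharpen them.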

FTF factor + guardrail (ADR-58)

ablation_matrix.MODEL_FACTORS : focal_loss factor with 5 variants
guardrails._validate_focal_loss : rejects loss_function ∉ {binary:logistic, focal}, gamma ∉ [0, 10], alpha ∉ [0, 1]
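
The three rejection rules can be sketched as below. This is an illustration, not the body of guardrails._validate_focal_loss : the function name, the dict-based config and its key names (`loss_function`, `gamma`, `alpha`) are assumptions ; only the allowed set and the two ranges come from the line above.

```python
from typing import Any, List, Mapping

ALLOWED_LOSS_FUNCTIONS = {"binary:logistic", "focal"}


def validate_focal_loss_config(config: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable violations (empty list means valid)."""
    errors: List[str] = []

    loss = config.get("loss_function", "binary:logistic")
    if loss not in ALLOWED_LOSS_FUNCTIONS:
        errors.append(f"loss_function={loss!r} not in {sorted(ALLOWED_LOSS_FUNCTIONS)}")

    gamma = config.get("gamma")
    if gamma is not None and not (0.0 <= gamma <= 10.0):
        errors.append(f"gamma={gamma} outside [0, 10]")

    alpha = config.get("alpha")
    if alpha is not None and not (0.0 <= alpha <= 1.0):
        errors.append(f"alpha={alpha} outside [0, 1]")

    return errors


if __name__ == "__main__":
    print(validate_focal_loss_config({"loss_function": "focal", "gamma": 2.0, "alpha": 0.25}))
    print(validate_focal_loss_config({"loss_function": "hinge", "gamma": 11, "alpha": -0.1}))
```

Returning every violation at once, rather than raising on the first, is a design choice of this sketch ; the real validator may behave differently.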

MLOps readiness (ADR-70)

6 sections filled : monitoring (5 metrics), alerting (2 with runbooks), drift (3 detectors including focal-specific output sharpness), staged rollout (FTF sweep + canary BTCUSDC + full), rollback (Console flip < 1min), DRI sunset 2026-07-28.

Validation

Snapshot disclaimer : the test count was first recorded at commit b2bbb513 (the state v1 of the dossier was reviewed against), when it stood at 283. The latest count is 284/284 (1 new MLflow pickle test added in commit 57b366f6 per CR pass 1+2). Refer to the merged PR for the definitive state.

pytest tests/unit/training/XGBoost/ tests/integration/test_track6_focal_loss.py tests/integration/test_track5_label_smoothing.py tests/unit/training/labels/ tests/unit/test_ftf_guardrails.py tests/unit/test_calibration_metrics.py tests/unit/test_oos_calibrator.py tests/unit/test_model_trainers.py → 284 passed (originally 283, +1 pickle regression test)
black --line-length=120 → OK
mkdocs build --strict → OK (only pre-existing INFO warnings)

Notable design choices

SymPy lambdify over hand-coded math : per Track 7 reco #13. Trade-off : ~50ms import-time cost (one-shot) vs zero math drift risk forever. Numerical perf is identical (lambdify produces vectorized numpy code).
Temperature scaling proactive (not follow-up) : per committee 4ef337af reco #2. Mukhoti et al. 2020 documents that focal-trained models output sharper distributions than CE-trained — isotonic/Platt may not be the right calibrator. We add temperature_scaling alongside isotonic/sigmoid/platt/hybrid so the FTF sweep can compare ECE_HOLD across calibration methods.
Predict() raw_output_logits flag on XGBWrapper : the wrapper is sklearn-facing (CalibratedClassifierCV uses it), but XGBoost's custom obj makes its predict() return raw margins. We add a flag rather than auto-detecting at call time so the contract is explicit at construction time (the trainer knows whether it trained with focal or not).
Variant matrix kept at 5 configs : per plan §3, follows F1_BUY_BOOST_PLAN.md §4.2 convention. If results are promising but suboptimal, follow-up sweep with expanded alpha range per Risk #8.

Risks acknowledged in plan + carried into PR

Risk #1 (custom obj breaks early stopping) : mitigated via custom_metric. Verified in integration tests (no crash at training time).
Risk #2 (gradient instability at extreme γ or small p_t) : numerical clipping p ∈ [1e-7, 1-1e-7] + hess >= 1e-7 ; guardrail rejects γ > 10. Verified in test_grad_hess_finite_at_extreme_logits.
Risk #5 (closed-form math error) : ELIMINATED by lambdify approach + symbolic equivalence test.
Risk #7 (focal output too sharp for isotonic) : mitigated proactively via temperature_scaling Phase 2 deliverable.
Risk #8 (variant matrix may miss sweet spot) : Phase 4 follow-up if needed.

6. MLOps readiness — full content (per committee `c47b2afa` blocker #1)

Embedded inline so the committee can assess directly without external file fetch :

6.1 Production monitoring (5 metrics)

| Metric | Type | Source | Threshold (warn / crit) | Owner |
| --- | --- | --- | --- | --- |
| `xgboost_training_failed{loss_function=focal}` count | counter | log_event in trainer | warn 1/h, crit 5/h | dococeven |
| `focal_loss_active{gamma,alpha}` flag | gauge | log_event `focal_loss_active` | informational | dococeven |
| `signals.buy_proba` quantile distribution per crypto | histogram | OTel span `prediction.buy_proba` | warn if drift > 0.10 vs training | dococeven |
| `expectancy_net_realized` per crypto | gauge | trade journal Postgres | warn < 0 over 24h, CRIT < 0 over 48h | dococeven |
| `inference_latency_p99` | histogram | OTel span `inference_api.predict` | warn > 100ms p99, crit > 500ms p99 | dococeven |

6.2 Alerts (2)

| Alert | Trigger | Severity | Channel | Runbook |
| --- | --- | --- | --- | --- |
| `expectancy_negative_focal` | `expectancy_net_realized < 0 over 48h on >= 2 cryptos` AND `focal_loss_active=1` | P1 | Slack #cvntrade-alerts + SMS DRI | `runbook_focal_loss_negative_expectancy.md` (committed in this PR) |
| `focal_training_crash` | `xgboost_training_failed{loss_function=focal} > 5/h` | P2 | Slack #cvntrade-alerts | `runbook_focal_training_crash.md` (committed in this PR) |

6.3 Drift detection (3 detectors)

| Drift type | Method | Window | Threshold (warn/crit) | Action on trigger |
| --- | --- | --- | --- | --- |
| Data drift | PSI on top-5 features by FI | rolling 7d vs training | PSI > 0.2 / 0.5 | shadow re-train candidate ; NO silent action (per ADR-25) |
| Concept drift | rolling perf gap (live `f1_buy` vs training OOS `f1_buy`) | 14d window, per crypto | gap > 0.05 / 0.10 | retrain candidate, hold live, alert DRI |
| Output sharpness drift (focal-specific) | rolling mean of `max(p, 1-p)` (predicted confidence) | 7d window | warn if > training_mean + 0.10 | check temperature scaling fit ; recalibrate if persistent ; reverts to baseline if 2 consecutive triggers |
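
Two of the detectors above reduce to short formulas, sketched here under stated assumptions : PSI is computed from pre-binned counts (the binning scheme, the epsilon floor and all function names are hypothetical), and the sharpness statistic is the plain mean of max(p, 1-p). The thresholds are the ones from the table.

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms over the same bins."""
    e_total = float(sum(expected_counts))
    a_total = float(sum(actual_counts))
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the bin fractions so empty bins do not produce log(0).
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


def psi_level(value: float, warn: float = 0.2, crit: float = 0.5) -> str:
    """Map a PSI value onto the warn / crit thresholds of the drift table."""
    if value > crit:
        return "crit"
    if value > warn:
        return "warn"
    return "ok"


def mean_confidence(probas: Sequence[float]) -> float:
    """Mean of max(p, 1 - p), the predicted-confidence statistic."""
    return sum(max(p, 1.0 - p) for p in probas) / len(probas)


def sharpness_warn(live_probas: Sequence[float], training_mean: float,
                   margin: float = 0.10) -> bool:
    """True when live confidence exceeds the training mean by more than margin."""
    return mean_confidence(live_probas) > training_mean + margin


if __name__ == "__main__":
    print(psi_level(psi([50, 50], [90, 10])))
    print(sharpness_warn([0.99, 0.01], training_mean=0.8))
```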

6.4 Kill-switch (per committee `c47b2afa` reco #3)

Two-tier rollback :

Factor-level kill-switch (PRIMARY, < 1min) : flip CVN_LOSS_FUNCTION=binary:logistic in ftf_config.base_env via Console (per ADR-59). Effect : next training run uses cross-entropy ; in-flight model unaffected (operator may also re-promote previous champion via MLflow alias for instant effect).
System-wide kill-switch (per ADR-71) : already implemented (wp#56 closed). Halts ALL trading regardless of which loss trained the underlying model. Sub-1s halt latency, fail-safe on connectivity loss.

Activation criteria :
- Auto-trigger : 2 consecutive P1 alerts on expectancy_negative_focal within 1h → operator paged + factor-level kill-switch flipped automatically (operator confirms within 15min or revert is held)
- Manual : DRI dococeven invokes Console flip when drift / sharpness / canary fails
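
The auto-trigger rule can be read as a check on consecutive alert timestamps. The sketch below is one possible reading, not the deployed logic : the function name and the list-of-timestamps input are assumptions, and the 15min operator confirmation step is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Sequence


def should_auto_flip(p1_alert_times: Sequence[datetime],
                     window: timedelta = timedelta(hours=1)) -> bool:
    """True when two consecutive P1 alerts fired within `window` of each other."""
    times = sorted(p1_alert_times)
    # Compare each alert with the one immediately after it.
    return any(later - earlier <= window for earlier, later in zip(times, times[1:]))


if __name__ == "__main__":
    t0 = datetime(2026, 4, 28, 12, 0)
    print(should_auto_flip([t0, t0 + timedelta(minutes=45)]))
    print(should_auto_flip([t0, t0 + timedelta(hours=2)]))
```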

7. OOS validation infrastructure (per committee `c47b2afa` blocker #2)

The FTF sweep IS the OOS validation — it leverages the existing infrastructure proven on Track 5 (PR #734) and committee-validated (session 989a6567 PASSED OK). For Track 6 :

7.1 Multi-fold evaluation (ADR-14)

5 cryptos × 5 folds × 5 variants = 125 training runs per FTF sweep
Each fold is a purged + embargoed time-series split (per ADR-14, López de Prado AFML Ch 7) — implemented in commun.cv.purged_kfold.PurgedKFold
Purge horizon = label horizon (4 candles for SL0.5_TP1_H4 PTE strategy, configurable via CVN_PURGE_HORIZON)
Embargo = 0.5% of dataset size (configurable via CVN_EMBARGO_PCT)
Splits are deterministic (seeded) to ensure reproducibility across factor variants
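
How purging and embargo shape a fold can be sketched on plain indices. This is a simplified illustration, not commun.cv.purged_kfold.PurgedKFold : the function name and signature are hypothetical, folds are contiguous blocks, and the sketch assumes the purge is applied on both sides of the test block with the embargo added after it.

```python
from typing import Iterator, List, Tuple


def purged_embargoed_splits(n_samples: int, n_splits: int = 5, purge: int = 4,
                            embargo_pct: float = 0.005
                            ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) for contiguous time-ordered folds.

    purge: label horizon in samples; training samples this close to the test
           block are dropped because their labels overlap it in time.
    embargo_pct: extra fraction of the dataset dropped after the test block.
    """
    embargo = round(n_samples * embargo_pct)
    bounds = [(i * n_samples) // n_splits for i in range(n_splits + 1)]
    for i in range(n_splits):
        start, stop = bounds[i], bounds[i + 1]
        left_cut = max(0, start - purge)                  # purge before the test block
        right_cut = min(n_samples, stop + purge + embargo)  # purge + embargo after it
        train = list(range(0, left_cut)) + list(range(right_cut, n_samples))
        test = list(range(start, stop))
        yield train, test


if __name__ == "__main__":
    for train, test in purged_embargoed_splits(1000):
        print(len(train), test[0], test[-1])
```

With 1000 samples, purge = 4 and a 0.5% embargo (5 samples), the middle fold tests on indices 400 to 599 and drops 396 to 608 from training. No randomness is involved here, so the splits are trivially reproducible ; the seeded determinism mentioned above belongs to the real implementation.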

7.2 Theta calibrated OOS (ADR-15)

Threshold for BUY decision (CVN_THRESHOLD_BUY) is calibrated on a separate hold-out fold that's neither train nor val — isolated from any per-fold optimization to avoid leakage
Bug #1 fix (PR #765) explicitly aligned _apply_calibration to use (X_val, y_val) per ADR-15 precedent
Track 6 inherits this contract — focal loss outputs go through the same calibration on hold-out path
Temperature scaling adds an OOS recalibration option specifically for focal-trained models

7.3 Statistical gates (per F1_BUY_BOOST_PLAN.md §6)

After 125 rows in finetune_results, the sweep runs the standard analysis pipeline (existing infrastructure) :
- BH-corrected p-values (Benjamini-Hochberg FDR control across the 4 non-baseline variants × 5 folds)
- Bootstrap CI 95% on f1_buy delta vs baseline
- Cohen's d effect size on f1_buy
- Per-asset breakdown : ≥ 4/5 cryptos must individually improve
- Per-fold variance reported alongside median (per F1_BUY_BOOST_PLAN.md §7)
- ECE_HOLD (calibration on holdout fold) ≤ baseline + 0.01
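
Two of these gates are simple enough to sketch directly. The code below is illustrative and not the existing analysis pipeline : the function names, the FDR level q = 0.05 and the asset keys are assumptions. It shows the Benjamini-Hochberg step-up procedure and the ≥ 4/5 per-asset rule.

```python
from typing import List, Mapping, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up: flag hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its stepped threshold.
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def per_asset_gate(f1_buy_delta: Mapping[str, float], min_improved: int = 4) -> bool:
    """True when at least `min_improved` assets show a positive f1_buy delta."""
    return sum(1 for delta in f1_buy_delta.values() if delta > 0) >= min_improved


if __name__ == "__main__":
    print(benjamini_hochberg([0.001, 0.2, 0.03, 0.04]))
    # Asset keys are placeholders, not the sweep's actual symbols.
    print(per_asset_gate({"A": 0.02, "B": 0.01, "C": 0.03, "D": -0.01, "E": 0.005}))
```

Note that BH rejects every hypothesis up to the largest passing rank, even if an intermediate p-value missed its own threshold ; that is what distinguishes it from a per-test cutoff.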

7.4 PBO / CSCV considerations

The FTF sweep at 5 variants is small enough that PBO (Probability of Backtest Overfitting) risk is low (López de Prado et al. 2014 — PBO rises with the number of variants explored). The 4 non-baseline variants vs 1 baseline = 4 hypotheses ; with 5-fold CV, false-discovery probability under H0 is well-controlled by BH correction.

If a follow-up sweep expands the alpha range (Risk #8 mitigation), PBO analysis MUST be added to the results dossier explicitly. Tracked as Phase 4 conditional acceptance criterion.

7.5 Pre-merge OOS evidence

This PR ships the implementation. The OOS results will land in Phase 4 (sweep) → Phase 5 (results dossier 2026-MM-DD-track6-focal-loss-results.md). The integration tests in this PR are NOT a substitute for the FTF sweep — they verify code correctness, not statistical validity. Lock decision (Phase 5) requires the full statistical evidence.

Question for the committee

Is the focal loss implementation faithful to the plan validated by 4ef337af, with appropriate formal verification (SymPy + numerical) and predict() path audit, or does any of (a) the temperature scaling implementation choices (bounded scalar minimization over T, single param), (b) the XGBWrapper raw_output_logits flag design, (c) the variant matrix at 5 configs, raise a blocker before merge?

Bonus : are the MLOps readiness 5 metrics + 2 alerts + 3 drift detectors sufficient for shadow + canary stages, or should additional production observability be wired BEFORE the FTF sweep launches (Phase 4) ?
