Committee pr_review dossier — Track 6 Focal loss (PR #767)

Date : 2026-04-28
Story : CVN-N001-EE-S02 (OP wp#41)
GH issue : #713
PR : #767 commit b2bbb513
Session type : pr_review (per ADR-68)
Predecessor : plan_review session 4ef337af PASSED OK consensus strong (after v1+v2 REJECTED rounds)
Revision history :
- v1 (committee c47b2afa) : REJECTED INSUFFICIENT_EVIDENCE consensus strong. 2 blockers : (a) MLOps readiness lacked detail (committee couldn't fetch the external mlops_readiness.md file referenced by link), (b) OOS validation evidence missing (ADR-14/15 mandate not visible in artifact).
- v2 (committee 9c529319) : REJECTED EXECUTION_RISK consensus strong. Single blocker : runbooks for P1 + P2 alerts marked "Phase 3 deliverable" not yet in repo (operational risk on deployment).
- v3 (this revision) : runbooks runbook_focal_loss_negative_expectancy.md + runbook_focal_training_crash.md authored + committed in this PR. Both with diagnosis steps, 3 remediation paths each, escalation criteria, post-incident checklist. Re-submitted.

Context

Implementation of focal loss as an XGBoost custom objective, per the plan dossier validated by 4ef337af. This dossier focuses on implementation faithfulness and test coverage, not on the methodology itself (which was settled in plan_review).

Diff summary

9 files changed, ~1300 insertions
- src/training/XGBoost/focal_loss.py (NEW, 177 lines)
- src/training/XGBoost/cvntrade_XGBoost_trainer.py (+~150 lines : focal wiring + temp scaling + 4 predict() audit sites)
- src/commun/finetune/ablation_matrix.py (+40 lines : focal_loss factor)
- src/commun/finetune/guardrails.py (+40 lines : focal_loss validator)
- tests/unit/training/XGBoost/test_focal_loss_formal_verification.py (NEW, 32 tests)
- tests/integration/test_track6_focal_loss.py (NEW, 19 tests)
- documentation/stories/CVN-N001-EE-S02/mlops_readiness.md (NEW, ADR-70)

Implementation summary

Formal verification (committee 7a5c1d73 blocker — Track 7 reco #13)

  • focal_loss.py derives grad + hess via sympy.lambdify ; ZERO hand-coded math
  • _build_focal_lambdas() runs at module import (~50ms), produces 4 numpy callables
  • Test TestFocalLossSymbolicEquivalence : 20 parametrized cases asserting symbolic equivalence at rel=1e-12
  • Test TestFocalLossNumericalSanity : 8 finite-difference cases (grad tol 1e-4, hess tol 1e-3)
  • Test TestFocalLossInvariants : math invariants (gamma=0 reduces to weighted CE, hess > 0, no NaN/inf at extreme logits, focal rewards correct predictions)
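The lambdify approach listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of focal_loss.py : the names build_focal_lambdas and focal_grad_hess, the symbol names, and the choice of one grad/hess pair per class (which yields 4 callables) are hypothetical. The loss used is the standard binary focal loss expressed on the raw logit z.

```python
import numpy as np
import sympy as sp


def build_focal_lambdas():
    """Derive focal-loss grad/hess w.r.t. the logit symbolically, then compile to numpy.

    Hypothetical helper: names and return layout are illustrative only.
    """
    z, gamma, alpha = sp.symbols("z gamma alpha", real=True)
    p = 1 / (1 + sp.exp(-z))  # sigmoid of the raw margin

    # Standard binary focal loss, written separately for each class.
    losses = {
        "pos": -alpha * (1 - p) ** gamma * sp.log(p),      # y = 1
        "neg": -(1 - alpha) * p ** gamma * sp.log(1 - p),  # y = 0
    }

    callables = {}
    for name, loss in losses.items():
        grad = sp.diff(loss, z)  # first derivative w.r.t. the logit
        hess = sp.diff(grad, z)  # second derivative w.r.t. the logit
        args = (z, gamma, alpha)
        callables[f"grad_{name}"] = sp.lambdify(args, grad, modules="numpy")
        callables[f"hess_{name}"] = sp.lambdify(args, hess, modules="numpy")
    return callables


def focal_grad_hess(logits, labels, gamma, alpha, fns):
    """Select the per-class derivative for each sample."""
    logits = np.asarray(logits, dtype=float)
    is_pos = np.asarray(labels) == 1
    grad = np.where(is_pos, fns["grad_pos"](logits, gamma, alpha), fns["grad_neg"](logits, gamma, alpha))
    hess = np.where(is_pos, fns["hess_pos"](logits, gamma, alpha), fns["hess_neg"](logits, gamma, alpha))
    return grad, hess
```

With gamma=0 the positive-class gradient reduces to alpha * (p - 1), the weighted cross-entropy gradient, which is the kind of invariant the test classes above assert.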

Predict() path audit (committee reco #3, plan §4.6)

Snapshot disclaimer : line numbers below reflect commit 57b366f6 (the state reviewed by committee 13fd89c9). For post-merge reading, use grep -n "raw = self.model.predict" to locate the current sites — the contract behavior matters more than the literal line numbers.

4 sites updated to handle focal raw margin output :
- XGBWrapper.predict_proba (line 210) : applies sigmoid when raw_output_logits=True
- XGBWrapper.predict (line 195) : uses logit threshold 0 instead of P=0.5
- _evaluate_model (line 665) : sigmoid before reshape to (n, 2)
- trainer.predict() (line 881) : sigmoid before reshape to (n, 2)
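The contract these sites share can be sketched as below. MarginAwareWrapper and stable_sigmoid are hypothetical stand-ins written for illustration, not the actual XGBWrapper code ; the sketch only assumes that the underlying model returns raw margins when trained with a custom objective and probabilities otherwise.

```python
import numpy as np


def stable_sigmoid(z):
    """Logistic function in tanh form, which avoids exp overflow at large |z|."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))


class MarginAwareWrapper:
    """Hypothetical stand-in for the wrapper contract (not the real XGBWrapper)."""

    def __init__(self, predict_fn, raw_output_logits):
        self._predict_fn = predict_fn            # returns margins or probabilities
        self.raw_output_logits = raw_output_logits  # fixed at construction time

    def predict_proba(self, X):
        raw = np.asarray(self._predict_fn(X), dtype=float)
        p = stable_sigmoid(raw) if self.raw_output_logits else raw
        return np.column_stack([1.0 - p, p])     # shape (n, 2), sklearn convention

    def predict(self, X):
        raw = np.asarray(self._predict_fn(X), dtype=float)
        # A logit of 0 is the same decision boundary as P = 0.5.
        threshold = 0.0 if self.raw_output_logits else 0.5
        return (raw > threshold).astype(int)
```

Thresholding the margin at 0 avoids a sigmoid call on the predict path while giving the same labels as thresholding the probability at 0.5.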

Temperature scaling (committee reco #2, plan §4.5)

  • New _apply_temperature_scaling method using scipy.optimize.minimize_scalar (method="bounded" over T ∈ [0.1, 10.0]) on calibration set NLL
  • Wrapped in TemperatureScaledModel (module-level class — extracted in CR pass 1+2 fix for MLflow pickle compat) exposing sklearn-compatible predict_proba / predict
  • Handles both focal-active (raw logits direct) and binary:logistic (logit recovered from probability)
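A minimal sketch of the temperature fit described above, assuming binary labels and a single scalar T. The function names fit_temperature and logit_from_proba are hypothetical ; only scipy.optimize.minimize_scalar with method="bounded" and the [0.1, 10.0] bounds come from the dossier.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def logit_from_proba(p, eps=1e-7):
    """Recover a logit from a probability (binary:logistic path); eps is an assumed clip."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def fit_temperature(logits, y, bounds=(0.1, 10.0)):
    """Find T minimizing the binary NLL of sigmoid(logits / T) on a calibration set."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y, dtype=float)

    def nll(temperature):
        z = logits / temperature
        # Binary NLL in logit form: log(1 + exp(z)) - y * z, computed stably.
        return float(np.mean(np.logaddexp(0.0, z) - y * z))

    result = minimize_scalar(nll, bounds=bounds, method="bounded")
    return float(result.x)
```

For a focal-trained model the raw margins would be passed directly ; for a binary:logistic model the probabilities would first go through logit_from_proba. A fitted T above 1 softens the outputs, a T below 1 sharpens them.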

FTF factor + guardrail (ADR-58)

  • ablation_matrix.MODEL_FACTORS : focal_loss factor with 5 variants
  • guardrails._validate_focal_loss : rejects loss_function ∉ {binary:logistic, focal}, gamma ∉ [0, 10], alpha ∉ [0, 1]
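The guardrail rules above can be illustrated with a small validator. This is a sketch only : the function name, the dict-based config, the key names and the error-list return type are assumptions, and the real guardrails._validate_focal_loss may be shaped differently.

```python
ALLOWED_LOSS_FUNCTIONS = {"binary:logistic", "focal"}


def validate_focal_loss(config):
    """Return a list of violation messages; an empty list means the config passes.

    Hypothetical sketch: key names and return type are illustrative.
    """
    errors = []

    loss_function = config.get("loss_function", "binary:logistic")
    if loss_function not in ALLOWED_LOSS_FUNCTIONS:
        errors.append(f"loss_function must be one of {sorted(ALLOWED_LOSS_FUNCTIONS)}")

    gamma = config.get("gamma")
    if gamma is not None and not (0.0 <= gamma <= 10.0):
        errors.append("gamma must be in [0, 10]")

    alpha = config.get("alpha")
    if alpha is not None and not (0.0 <= alpha <= 1.0):
        errors.append("alpha must be in [0, 1]")

    return errors
```

Collecting all violations instead of raising on the first one lets a sweep report every misconfigured factor in one pass.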

MLOps readiness (ADR-70)

6 sections filled : monitoring (5 metrics), alerting (2 with runbooks), drift (3 detectors including focal-specific output sharpness), staged rollout (FTF sweep + canary BTCUSDC + full), rollback (Console flip < 1min), DRI sunset 2026-07-28.

Validation

Snapshot disclaimer : the test count below is a snapshot at commit b2bbb513 (the state v1 of the dossier was reviewed against). Latest count is 284/284 (1 new MLflow pickle test added in commit 57b366f6 per CR pass 1+2). Refer to the merged PR for the definitive state.

  • pytest tests/unit/training/XGBoost/ tests/integration/test_track6_focal_loss.py tests/integration/test_track5_label_smoothing.py tests/unit/training/labels/ tests/unit/test_ftf_guardrails.py tests/unit/test_calibration_metrics.py tests/unit/test_oos_calibrator.py tests/unit/test_model_trainers.py → 284 passed (originally 283, +1 pickle regression test)
  • black --line-length=120 → OK
  • mkdocs build --strict → OK (only pre-existing INFO warnings)

Notable design choices

  • SymPy lambdify over hand-coded math : per Track 7 reco #13. Trade-off : ~50ms import-time cost (one-shot) vs zero math drift risk forever. Numerical perf is identical (lambdify produces vectorized numpy code).
  • Temperature scaling proactive (not follow-up) : per committee 4ef337af reco #2. Mukhoti et al. 2020 documents that focal-trained models output sharper distributions than CE-trained — isotonic/Platt may not be the right calibrator. We add temperature_scaling alongside isotonic/sigmoid/platt/hybrid so the FTF sweep can compare ECE_HOLD across calibration methods.
  • Predict() raw_output_logits flag on XGBWrapper : the wrapper is sklearn-facing (CalibratedClassifierCV uses it), but XGBoost's custom obj makes its predict() return raw margins. We add a flag rather than auto-detecting at call time so the contract is explicit at construction time (the trainer knows whether it trained with focal or not).
  • Variant matrix kept at 5 configs : per plan §3, follows F1_BUY_BOOST_PLAN.md §4.2 convention. If results are promising but suboptimal, follow-up sweep with expanded alpha range per Risk #8.

Risks acknowledged in plan + carried into PR

  • Risk #1 (custom obj breaks early stopping) : mitigated via custom_metric. Verified in integration tests (no crash at training time).
  • Risk #2 (gradient instability extreme γ or small p_t) : numerical clipping p ∈ [1e-7, 1-1e-7] + hess >= 1e-7 ; guardrail rejects γ > 10. Verified in test_grad_hess_finite_at_extreme_logits.
  • Risk #5 (closed-form math error) : ELIMINATED by lambdify approach + symbolic equivalence test.
  • Risk #7 (focal output too sharp for isotonic) : mitigated proactively via temperature_scaling Phase 2 deliverable.
  • Risk #8 (variant matrix may miss sweet spot) : Phase 4 follow-up if needed.
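The Risk #2 mitigation (probability clipping plus a Hessian floor) can be sketched as a thin adapter around any grad/hess pair. The names make_clipped_objective, grad_fn and hess_fn are hypothetical, and the derivative callables are assumed here to take the clipped probability and the labels ; the (preds, dtrain) -> (grad, hess) shape follows the XGBoost native custom objective convention.

```python
import numpy as np


def make_clipped_objective(grad_fn, hess_fn, p_eps=1e-7, hess_floor=1e-7):
    """Wrap derivative callables into an XGBoost-style objective with numerical guards.

    Hypothetical sketch: grad_fn / hess_fn take (p, y) and return per-sample arrays.
    """

    def objective(preds, dtrain):
        y = np.asarray(dtrain.get_label(), dtype=float)
        # Raw margins -> probabilities, in a form that cannot overflow.
        p = 0.5 * (1.0 + np.tanh(0.5 * np.asarray(preds, dtype=float)))
        p = np.clip(p, p_eps, 1.0 - p_eps)            # keep log / division terms finite
        grad = grad_fn(p, y)
        hess = np.maximum(hess_fn(p, y), hess_floor)  # keep Newton steps bounded
        return grad, hess

    return objective
```

An objective of this shape would typically be passed to xgb.train through its obj argument, with early stopping driven by a separate custom_metric as noted under Risk #1.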

6. MLOps readiness — full content (per committee c47b2afa blocker #1)

Embedded inline so the committee can assess directly without external file fetch :

6.1 Production monitoring (5 metrics)

| Metric | Type | Source | Threshold (warn / crit) | Owner |
|---|---|---|---|---|
| xgboost_training_failed{loss_function=focal} count | counter | log_event in trainer | warn 1/h, crit 5/h | dococeven |
| focal_loss_active{gamma,alpha} flag | gauge | log_event focal_loss_active | informational | dococeven |
| signals.buy_proba quantile distribution per crypto | histogram | OTel span prediction.buy_proba | warn if drift > 0.10 vs training | dococeven |
| expectancy_net_realized per crypto | gauge | trade journal Postgres | warn < 0 over 24h, CRIT < 0 over 48h | dococeven |
| inference_latency_p99 | histogram | OTel span inference_api.predict | warn > 100ms p99, crit > 500ms p99 | dococeven |

6.2 Alerts (2)

| Alert | Trigger | Severity | Channel | Runbook |
|---|---|---|---|---|
| expectancy_negative_focal | expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1 | P1 | Slack #cvntrade-alerts + SMS DRI | runbook_focal_loss_negative_expectancy.md (committed in this PR) |
| focal_training_crash | xgboost_training_failed{loss_function=focal} > 5/h | P2 | Slack #cvntrade-alerts | runbook_focal_training_crash.md (committed in this PR) |

6.3 Drift detection (3 detectors)

| Drift type | Method | Window | Threshold (warn/crit) | Action on trigger |
|---|---|---|---|---|
| Data drift | PSI on top-5 features by FI | rolling 7d vs training | PSI > 0.2 / 0.5 | shadow re-train candidate ; NO silent action (per ADR-25) |
| Concept drift | rolling perf gap (live f1_buy vs training OOS f1_buy) | 14d window, per crypto | gap > 0.05 / 0.10 | retrain candidate, hold live, alert DRI |
| Output sharpness drift (focal-specific) | rolling mean of max(p, 1-p) (predicted confidence) | 7d window | warn if > training_mean + 0.10 | check temperature scaling fit ; recalibrate if persistent ; reverts to baseline if 2 consecutive triggers |
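Two of the detector statistics above are simple enough to sketch. The helper names, the 10 quantile bins and the eps floor are assumptions made for illustration ; the dossier specifies only PSI for data drift and the mean of max(p, 1-p) for output sharpness.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training (expected) and a live (actual) feature sample.

    Bins are quantiles of the training sample; n_bins and eps are assumed defaults.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

    def bin_fractions(values):
        idx = np.searchsorted(inner_edges, values, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        return np.clip(counts / values.size, eps, None)  # avoid log(0)

    e, a = bin_fractions(expected), bin_fractions(actual)
    return float(np.sum((a - e) * np.log(a / e)))


def mean_confidence(p):
    """Output sharpness: mean of max(p, 1 - p) over a window of predictions."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(np.maximum(p, 1.0 - p)))
```

In this sketch a PSI above 0.2 on a top-5 feature would correspond to the warn level, and a live mean_confidence exceeding the training value by more than 0.10 would correspond to the sharpness warning.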

6.4 Kill-switch (per committee c47b2afa reco #3)

Two-tier rollback :

  1. Factor-level kill-switch (PRIMARY, < 1min) : flip CVN_LOSS_FUNCTION=binary:logistic in ftf_config.base_env via Console (per ADR-59). Effect : next training run uses cross-entropy ; in-flight model unaffected (operator may also re-promote previous champion via MLflow alias for instant effect).
  2. System-wide kill-switch (per ADR-71) : already implemented (wp#56 closed). Halts ALL trading regardless of which loss trained the underlying model. Sub-1s halt latency, fail-safe on connectivity loss.

Activation criteria :
- Auto-trigger : 2 consecutive P1 alerts on expectancy_negative_focal within 1h → operator paged + factor-level kill-switch flipped automatically (operator confirms within 15min or revert is held)
- Manual : DRI dococeven invokes Console flip when drift / sharpness / canary fails
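One possible reading of the auto-trigger rule, sketched as a pure function : the kill-switch fires when two successive expectancy_negative_focal alerts are at most one hour apart. The function name and the interpretation of "consecutive" are assumptions ; the actual alerting pipeline is not shown in this dossier.

```python
from datetime import datetime, timedelta


def should_auto_flip(alert_times, window=timedelta(hours=1)):
    """True if any two successive P1 alerts fall within `window` of each other.

    Hypothetical sketch: alert_times is an iterable of datetime objects.
    """
    ordered = sorted(alert_times)
    return any(later - earlier <= window for earlier, later in zip(ordered, ordered[1:]))
```

A single alert, or two alerts more than an hour apart, would leave the factor-level switch untouched and fall back to the manual path.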

7. OOS validation infrastructure (per committee c47b2afa blocker #2)

The FTF sweep IS the OOS validation — it leverages the existing infrastructure proven on Track 5 (PR #734) and committee-validated (session 989a6567 PASSED OK). For Track 6 :

7.1 Multi-fold evaluation (ADR-14)

  • 5 cryptos × 5 folds × 5 variants = 125 training runs per FTF sweep
  • Each fold is a purged + embargoed time-series split (per ADR-14, López de Prado AFML Ch 7) — implemented in commun.cv.purged_kfold.PurgedKFold
  • Purge horizon = label horizon (4 candles for SL0.5_TP1_H4 PTE strategy, configurable via CVN_PURGE_HORIZON)
  • Embargo = 0.5% of dataset size (configurable via CVN_EMBARGO_PCT)
  • Splits are deterministic (seeded) to ensure reproducibility across factor variants
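The purge and embargo mechanics can be sketched on contiguous folds. This is a simplified illustration, not the commun.cv.purged_kfold.PurgedKFold implementation : the generator name is hypothetical, folds are plain contiguous blocks (so no seed is needed), and every sample's label is assumed to span exactly `purge` candles.

```python
import numpy as np


def purged_embargoed_splits(n_samples, n_splits=5, purge=4, embargo_pct=0.005):
    """Yield (train_idx, test_idx) with purged and embargoed training sets.

    Hypothetical sketch: assumes time-ordered samples and a fixed label horizon.
    """
    embargo = int(np.ceil(n_samples * embargo_pct))
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        # Drop the test block, the samples whose labels overlap it on either
        # side (purge), and an extra buffer after it (embargo).
        lo = max(0, start - purge)
        hi = min(n_samples, stop + purge + embargo)
        train_mask[lo:hi] = False
        yield indices[train_mask], test_idx
```

With 5 cryptos, 5 folds from a generator like this one, and 5 variants, the sweep reaches the 125 training runs quoted above.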

7.2 Theta calibrated OOS (ADR-15)

  • Threshold for BUY decision (CVN_THRESHOLD_BUY) is calibrated on a separate hold-out fold that's neither train nor val — isolated from any per-fold optimization to avoid leakage
  • Bug #1 fix (PR #765) explicitly aligned _apply_calibration to use (X_val, y_val) per ADR-15 precedent
  • Track 6 inherits this contract — focal loss outputs go through the same calibration on hold-out path
  • Temperature scaling adds an OOS recalibration option specifically for focal-trained models

7.3 Statistical gates (per F1_BUY_BOOST_PLAN.md §6)

After 125 rows in finetune_results, the sweep runs the standard analysis pipeline (existing infrastructure) :
- BH-corrected p-values (Benjamini-Hochberg FDR control across the 4 non-baseline variants × 5 folds)
- Bootstrap CI 95% on f1_buy delta vs baseline
- Cohen's d effect size on f1_buy
- Per-asset breakdown : ≥ 4/5 cryptos must individually improve
- Per-fold variance reported alongside median (per F1_BUY_BOOST_PLAN.md §7)
- ECE_HOLD (calibration on holdout fold) ≤ baseline + 0.01
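The first three gates can be sketched with numpy alone. These helpers are illustrative and are not the existing analysis pipeline : the function names, the percentile bootstrap on paired deltas, and the pooled-variance form of Cohen's d are assumptions.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def bootstrap_ci(deltas, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI on the mean of paired f1_buy deltas."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, deltas.size, size=(n_boot, deltas.size))
    means = deltas[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(lo), float(hi)


def cohens_d(treatment, baseline):
    """Cohen's d with a pooled standard deviation (independent-samples form)."""
    t = np.asarray(treatment, dtype=float)
    b = np.asarray(baseline, dtype=float)
    pooled_var = ((t.size - 1) * t.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (t.size + b.size - 2)
    return float((t.mean() - b.mean()) / np.sqrt(pooled_var))
```

A variant would pass the first gate in this sketch when its adjusted p-value is below the chosen FDR level and its bootstrap interval on the delta excludes zero.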

7.4 PBO / CSCV considerations

The FTF sweep at 5 variants is small enough that PBO (Probability of Backtest Overfitting) risk is low (López de Prado et al. 2014 — PBO rises with the number of variants explored). The 4 non-baseline variants vs 1 baseline = 4 hypotheses ; with 5-fold CV, false-discovery probability under H0 is well-controlled by BH correction.

If a follow-up sweep expands the alpha range (Risk #8 mitigation), PBO analysis MUST be added to the results dossier explicitly. Tracked as Phase 4 conditional acceptance criterion.

7.5 Pre-merge OOS evidence

This PR ships the implementation. The OOS results will land in Phase 4 (sweep) → Phase 5 (results dossier 2026-MM-DD-track6-focal-loss-results.md). The integration tests in this PR are NOT a substitute for the FTF sweep — they verify code correctness, not statistical validity. Lock decision (Phase 5) requires the full statistical evidence.

Question for the committee

Is the focal loss implementation faithful to the plan validated by 4ef337af, with appropriate formal verification (SymPy + numerical) and predict() path audit ? Or does any of (a) the temperature scaling implementation choices (bounded scalar minimization over T, single param), (b) the XGBWrapper raw_output_logits flag design, (c) the variant matrix at 5 configs, raise a blocker before merge ?

Bonus : are the MLOps readiness 5 metrics + 2 alerts + 3 drift detectors sufficient for shadow + canary stages, or should additional production observability be wired BEFORE the FTF sweep launches (Phase 4) ?