
MLOps readiness — CVN-N001-EE-S02 — Focal loss (Track 6, F1_buy boost)

  • Story: CVN-N001-EE-S02, GH #713 + OP wp#41
  • Owner: dococeven (DRI for production behaviour of this change)
  • Filled on: 2026-04-28
  • Reviewed by committee: plan_review session 4ef337af (verdict PASSED OK, consensus strong) ; pr_review pending Phase 3


1. Production monitoring (MUST)

| Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|
| xgboost_training_failed count | counter | log_event in cvntrade_XGBoost_trainer.py | Grafana → Pipeline Health panel "Trainer errors" | warn 1/h, crit 5/h | dococeven |
| focal_loss_active flag | gauge | log_event event=focal_loss_active gamma=X alpha=Y | Grafana → MLOps Overview panel "Active loss function per crypto" | informational | dococeven |
| signals.buy_proba quantile distribution per crypto | histogram | OTel span prediction.buy_proba per inference call | Grafana → Testing & Backtest panel "Probability distribution" | warn if drift > 0.10 | dococeven |
| expectancy_net_realized per crypto | gauge | trade journal Postgres realized_pnl aggregation | Grafana → MLOps Overview panel "Per-crypto expectancy" | warn < 0 over 24h, crit < 0 over 48h | dococeven |
| inference_latency_p99 | histogram | OTel span inference_api.predict | Grafana → Pipeline Health panel "Inference latency" | warn > 100ms p99, crit > 500ms p99 | dococeven |

Required minima ✅ all 3 covered :

  • Prediction-rate metric : signals.buy_proba quantile distribution per crypto
  • Outcome metric : expectancy_net_realized per crypto
  • Health metric : inference_latency_p99 + xgboost_training_failed count
  • All metrics tagged with FTF factor focal_loss=<variant> per ADR-30

Cost monitoring : focal_loss adds ~5-10% training time (Hessian via SymPy lambdify is slightly heavier than analytical CE) — track via existing pipeline_latency golden signal (OPERATIONS §2). No new dashboard needed.
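The Hessian mentioned above is the second derivative of the focal loss with respect to the raw score. As a minimal sketch of where the extra cost comes from, the pure-Python example below evaluates a binary focal loss, its hand-derived gradient, and a finite-difference Hessian. It is not the project's SymPy lambdify path: the function names, the alpha-balanced form of the loss, the default gamma and alpha, and the Hessian floor are all assumptions made for illustration.

```python
import math

EPS = 1e-12  # assumed clamp to keep log() finite


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _pt_at(z: float, y: int, alpha: float):
    """Probability assigned to the true class, and its class weight."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return min(max(p_t, EPS), 1.0 - EPS), a_t


def focal_loss(z: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """FL = -a_t * (1 - p_t)^gamma * log(p_t)."""
    p_t, a_t = _pt_at(z, y, alpha)
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Analytic dFL/dz; reduces to a_t * (p - y) when gamma == 0."""
    p_t, a_t = _pt_at(z, y, alpha)
    g = a_t * (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)
    return g if y == 1 else -g  # sign flips because p_t = 1 - p for y == 0


def focal_hess(z: float, y: int, gamma: float = 2.0, alpha: float = 0.25,
               h: float = 1e-4, floor: float = 1e-6) -> float:
    """Central finite difference of the gradient, floored to stay positive.

    Stand-in for a symbolically derived Hessian; the floor is an assumed
    safeguard for second-order boosters, not a documented project setting.
    """
    d = (focal_grad(z + h, y, gamma, alpha) - focal_grad(z - h, y, gamma, alpha)) / (2.0 * h)
    return max(d, floor)
```

In a boosted-tree trainer these two values would be computed per row and returned together as the custom objective's gradient and Hessian vectors. Compared with cross-entropy, each row needs an extra power and logarithm, which is consistent with a modest training-time overhead.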

2. Alerting & runbooks (MUST)

| Alert | Trigger | Runbook | Severity | Notification channel |
|---|---|---|---|---|
| expectancy_negative_focal | expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1 | documentation/runbooks/focal_loss_negative_expectancy.md (to be created Phase 3) | P1 | Slack #cvntrade-alerts + SMS dococeven |
| focal_training_crash | xgboost_training_failed{loss_function="focal"} > 5/h | documentation/runbooks/focal_training_crash.md (to be created Phase 3) | P2 | Slack #cvntrade-alerts |

Required minima ✅ : 1 P1 alert covering silent revenue loss path. Runbooks to be authored in Phase 3 alongside the implementation PR.
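To make the P1 trigger concrete, here is a minimal sketch of the expectancy_negative_focal condition as a pure function. The `CryptoWindow` type and its field names are hypothetical, and the sketch assumes that focal_loss_active is evaluated per crypto and that the 48h expectancy has already been aggregated upstream; the real alert would presumably live in the alerting stack rather than in application code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CryptoWindow:
    """Hypothetical 48h snapshot for one crypto."""
    symbol: str
    expectancy_net_48h: float  # realized expectancy aggregated over the trailing 48h
    focal_loss_active: bool    # mirrors the focal_loss_active gauge


def breaching_symbols(windows: Iterable[CryptoWindow]) -> List[str]:
    """Symbols running focal loss whose 48h expectancy is negative."""
    return [w.symbol for w in windows
            if w.focal_loss_active and w.expectancy_net_48h < 0]


def expectancy_negative_focal(windows: Iterable[CryptoWindow],
                              min_cryptos: int = 2) -> bool:
    """Fire when at least `min_cryptos` focal-trained cryptos are breaching."""
    return len(breaching_symbols(windows)) >= min_cryptos
```

A single losing crypto does not page under this reading of the trigger; two or more do, which matches the ">= 2 cryptos" condition in the table.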

3. Drift detection (MUST)

| Drift type | Method | Window | Threshold | Action on trigger |
|---|---|---|---|---|
| Data drift | PSI on top-5 features by FI | rolling 7d vs training distribution | PSI > 0.2 (warn), PSI > 0.5 (crit) | shadow re-train candidate |
| Concept drift | rolling perf gap (live f1_buy vs training OOS f1_buy) | 14d window, per crypto | gap > 0.05 (warn), gap > 0.10 (crit) | retrain candidate, hold live |
| Output sharpness drift (focal-specific) | rolling mean of max(p, 1-p) (predicted confidence) | 7d window | warn if > training_mean + 0.10 | check temperature scaling fit ; recalibrate if persistent |
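For the data-drift row, a Population Stability Index (PSI) check can be sketched as follows. This is an illustrative version under stated assumptions, not the project's implementation: it uses ten equal-width bins derived from the training sample, clips out-of-range live values into the edge bins, and replaces empty bins with a small epsilon. The function names are hypothetical; the 0.2 and 0.5 thresholds come from the table above.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bin fractions."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # degenerate training sample: everything lands in bin 0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip to edge bins
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def psi_severity(value: float, warn: float = 0.2, crit: float = 0.5) -> str:
    """Map a PSI value onto the warn / crit thresholds."""
    if value > crit:
        return "crit"
    if value > warn:
        return "warn"
    return "ok"
```

The training sample would be the reference distribution for each of the top-5 features, and the rolling 7d live sample the comparison. Quantile-based bins are a common alternative when features are heavy-tailed.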

Required minima ✅ : data + concept drift implemented. Output sharpness drift is focal-specific and addresses the known issue that focal-trained models drift toward sharper output distributions over time (Mukhoti et al. 2020).
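The output sharpness check reduces to a single statistic, the mean of max(p, 1-p) over a window. A minimal sketch, with hypothetical function names and the 0.10 margin taken from the drift table:

```python
from typing import Sequence


def mean_confidence(probas: Sequence[float]) -> float:
    """Mean predicted confidence: average of max(p, 1 - p)."""
    return sum(max(p, 1.0 - p) for p in probas) / len(probas)


def sharpness_warn(live_probas: Sequence[float], training_mean: float,
                   margin: float = 0.10) -> bool:
    """Warn when live confidence exceeds the training mean by more than margin."""
    return mean_confidence(live_probas) > training_mean + margin
```

The statistic lies in [0.5, 1.0]: 0.5 means every prediction is a coin flip, 1.0 means every prediction is fully confident. `training_mean` is assumed to be computed once on the training-time predictions and stored alongside the model.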

Drift metrics emitted as OpenTelemetry spans per ADR-30. Triggered drift produces a drift_alert runbook entry, NOT silent retraining (ADR-25).

4. Staged rollout (MUST)

| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | 0% (predictions logged via FTF sweep, not acted on) | 1 sweep run (~24h cluster compute) | 125 rows in finetune_results ; statistical gate : f1_buy lift CI95 excludes 0, ECE_HOLD ≤ baseline + 0.01, Cohen's d ≥ 0.3, BH p < 0.05 | gate KO → keep available (no lock) |
| Canary | 1 crypto out of 5 | ≥ 7d | live expectancy_net ≥ baseline OR within CI95 ; n_trades ≥ 50 ; no drift alert ; per-crypto ECE_HOLD ≤ baseline + 0.01 | live expectancy_net < 0 over 48h |
| Full | 100% (all 5 cryptos) | | all canary criteria + Sortino ≥ baseline | per ADR-26 oncall procedure |

Required minima ✅ :

  • Shadow stage = the FTF sweep itself (industry-standard for ML factor experiments per project convention)
  • Canary stage MUST run on a single crypto first. Canary crypto = BTCUSDC (deepest order book, highest signal-to-noise, lowest blast radius if focal regresses)
  • Full rollout requires explicit operator sign-off in OP Story closure ; no auto-promotion
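The shadow-stage statistical gate combines an effect size, a multiple-testing correction, and a calibration bound. The sketch below shows one way those pieces could fit together, using only the standard library. All function names are hypothetical. It assumes Cohen's d uses the pooled standard deviation, that "BH p < 0.05" means surviving the Benjamini-Hochberg step-up procedure at q = 0.05, and that the lift must be positive, so the CI95 lower bound has to sit above 0.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def cohens_d(treatment: Sequence[float], baseline: Sequence[float]) -> float:
    """Standardised mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(baseline)
    pooled = ((n1 - 1) * variance(treatment)
              + (n2 - 1) * variance(baseline)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / math.sqrt(pooled)


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k / m * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_k
    return rejected


def shadow_gate(lift_ci95: Tuple[float, float], ece_hold: float,
                ece_baseline: float, d: float, bh_rejected: bool) -> bool:
    """All four shadow criteria from the rollout table must hold."""
    ci_low, _ = lift_ci95
    return (ci_low > 0                           # f1_buy lift CI95 excludes 0 (positive side)
            and ece_hold <= ece_baseline + 0.01  # calibration not degraded
            and d >= 0.3                         # Cohen's d threshold
            and bh_rejected)                     # survives BH at q = 0.05
```

A variant that fails any single criterion is reported as gate KO, which per the table keeps it available without locking it in.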

5. Rollback plan (MUST)

| Mechanism | Description | Revert SLA |
|---|---|---|
| FTF factor flag (PRIMARY) | CVN_LOSS_FUNCTION=binary:logistic set in ftf_config.base_env Console (per ADR-59) | < 1 min |
| Model version pinning | cvntrade_XGBoost_<SYMBOL>_<STRATEGY> MLflow alias points to previous champion | < 5 min |
| Feature artefact pinning | FeatureEngineeringAPI.from_mlflow_run(run_id=<previous>) | < 5 min |

Required minima ✅ :

  • The change is revertible by config alone : flip CVN_LOSS_FUNCTION in Console without a code deploy
  • Specific config value : set factor_focal_loss=none in ftf_config.base_env (or directly CVN_LOSS_FUNCTION=binary:logistic)
  • Tested in shadow stage : the FTF sweep includes the none baseline variant, which exercises the rollback path
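To illustrate why a config flip is enough to roll back, here is a minimal sketch of how a trainer might resolve CVN_LOSS_FUNCTION at startup. The `resolve_objective` helper, the accepted value "focal", and the fail-safe fallback are assumptions for illustration; only the variable name and the binary:logistic / none values appear in this document.

```python
import os
from typing import Mapping, Optional

DEFAULT_OBJECTIVE = "binary:logistic"


def resolve_objective(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the loss to train with, defaulting to the baseline objective.

    Unknown or empty values fall back to the baseline so that a typo in the
    Console cannot keep focal loss enabled by accident (assumed behaviour).
    """
    source = os.environ if env is None else env
    value = source.get("CVN_LOSS_FUNCTION", DEFAULT_OBJECTIVE).strip()
    if value == "focal":  # hypothetical value selecting the focal objective
        return "focal"
    return DEFAULT_OBJECTIVE  # covers "binary:logistic", "none", "" and unknowns
```

Because the baseline is both the default and the fallback, removing the variable or setting it to binary:logistic returns the trainer to the previous behaviour on its next run.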

Secondary path (emergency only) : hotfix flow per OPERATIONS.md §16.7 if FTF runner is itself broken.

6. Owner & DRI (MUST)

  • DRI: dococeven (single human, accountable for production behaviour of focal_loss)
  • Backup DRI: dococeven (currently sole maintainer ; backup engagement deferred to staffing milestone)
  • Decision authority (rollback flag flip without committee) : dococeven
  • Sunset date: 2026-07-28 (90 days post full rollout) — by this date focal_loss is either accepted as part of the champion stack OR revisited

If sole-maintainer status changes, this file is updated and OP Story comment posted.


Sign-off checklist (gate before PR merge)

  • §1 monitoring : ≥ 1 prediction + ≥ 1 outcome + ≥ 1 health metric defined ✅
  • §2 alerting : ≥ 1 P1 alert exists ✅ — runbook files authored in Phase 3
  • §3 drift : data + concept (+ output sharpness focal-specific) wired ✅
  • §4 rollout : shadow (FTF sweep) + canary (BTCUSDC) + full ✅
  • §5 rollback : CVN_LOSS_FUNCTION=binary:logistic Console flip < 1min ✅
  • §6 DRI : dococeven, sunset 2026-07-28 ✅
  • Committee pr_review session reviews this filled template (Phase 3 — pending)
  • OP Story comment links the committee session id (Phase 3 — pending)

The PR description MUST link this file. Per ADR-70 + ADR-25 — required for merge.