MLOps readiness — CVN-N001-EE-S02 — Focal loss (Track 6, F1_buy boost)
Story: CVN-N001-EE-S02 — GH #713 + OP wp#41
Owner: dococeven (DRI for production behaviour of this change)
Filled on: 2026-04-28
Reviewed by committee: plan_review session 4ef337af (verdict PASSED OK consensus strong) ; pr_review pending Phase 3
1. Production monitoring (MUST)
| Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|
| xgboost_training_failed count | counter | log_event in cvntrade_XGBoost_trainer.py | Grafana → Pipeline Health panel "Trainer errors" | warn 1/h, crit 5/h | dococeven |
| focal_loss_active flag | gauge | log_event event=focal_loss_active gamma=X alpha=Y | Grafana → MLOps Overview panel "Active loss function per crypto" | informational | dococeven |
| signals.buy_proba quantile distribution per crypto | histogram | OTel span prediction.buy_proba per inference call | Grafana → Testing & Backtest panel "Probability distribution" | warn if drift > 0.10 | dococeven |
| expectancy_net_realized per crypto | gauge | trade journal Postgres realized_pnl aggregation | Grafana → MLOps Overview panel "Per-crypto expectancy" | warn < 0 over 24h, crit < 0 over 48h | dococeven |
| inference_latency_p99 | histogram | OTel span inference_api.predict | Grafana → Pipeline Health panel "Inference latency" | warn > 100ms p99, crit > 500ms p99 | dococeven |
Required minima ✅ all 3 covered :
- Prediction-rate metric : signals.buy_proba quantile distribution per crypto
- Outcome metric : expectancy_net_realized per crypto
- Health metric : inference_latency_p99 + xgboost_training_failed count
- All metrics tagged with FTF factor focal_loss=<variant> per ADR-30
Cost monitoring : focal_loss adds ~5-10% training time (Hessian via SymPy lambdify is slightly heavier than analytical CE) — track via existing pipeline_latency golden signal (OPERATIONS §2). No new dashboard needed.
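To make the cost note concrete, the sketch below writes out a binary focal loss with its first and second derivatives with respect to the raw margin, in plain Python. It assumes the common alpha-balanced form of focal loss. The function names, the example gamma/alpha values, and the finite-difference Hessian are illustrative assumptions; they are not the trainer's SymPy-generated code.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(z: float, y: int, gamma: float, alpha: float) -> float:
    """Alpha-balanced binary focal loss on a raw margin z (assumed form)."""
    p = sigmoid(z)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def focal_grad(z: float, y: int, gamma: float, alpha: float) -> float:
    """Analytical d(loss)/dz, using the chain rule with dp/dz = p * (1 - p)."""
    p = sigmoid(z)
    if y == 1:
        return alpha * (1.0 - p) ** gamma * (gamma * p * math.log(p) + p - 1.0)
    return (1.0 - alpha) * p ** gamma * (p - gamma * (1.0 - p) * math.log(1.0 - p))


def focal_hess(z: float, y: int, gamma: float, alpha: float, eps: float = 1e-5) -> float:
    """Second derivative, approximated by a central difference of the gradient."""
    up = focal_grad(z + eps, y, gamma, alpha)
    down = focal_grad(z - eps, y, gamma, alpha)
    return (up - down) / (2.0 * eps)
```

With gamma = 0 and alpha = 0.5 the gradient reduces to 0.5 * (p - y), i.e. a scaled cross-entropy gradient, which is a convenient sanity check for any generated derivative code.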
2. Alerting & runbooks (MUST)
| Alert | Trigger | Runbook | Severity | Notification channel |
|---|---|---|---|---|
| expectancy_negative_focal | expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1 | documentation/runbooks/focal_loss_negative_expectancy.md (to be created Phase 3) | P1 | Slack #cvntrade-alerts + SMS dococeven |
| focal_training_crash | xgboost_training_failed{loss_function="focal"} > 5/h | documentation/runbooks/focal_training_crash.md (to be created Phase 3) | P2 | Slack #cvntrade-alerts |
Required minima ✅ : 1 P1 alert covering silent revenue loss path. Runbooks to be authored in Phase 3 alongside the implementation PR.
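The two triggers in the table can be read as simple predicates. The sketch below is one possible reading, assuming the focal_loss_active condition is evaluated per crypto and the 48h expectancy has already been aggregated upstream; the `CryptoWindow` type and both function signatures are hypothetical, not the alerting stack's actual rule syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CryptoWindow:
    symbol: str
    expectancy_net_48h: float  # expectancy_net_realized aggregated over the last 48h
    focal_loss_active: bool


def expectancy_negative_focal(windows: list[CryptoWindow], min_cryptos: int = 2) -> bool:
    """P1 trigger: negative 48h expectancy on >= min_cryptos cryptos running focal loss."""
    offenders = [
        w.symbol
        for w in windows
        if w.focal_loss_active and w.expectancy_net_48h < 0
    ]
    return len(offenders) >= min_cryptos


def focal_training_crash(failures_last_hour: int, threshold: int = 5) -> bool:
    """P2 trigger: more than `threshold` focal training failures in the last hour."""
    return failures_last_hour > threshold
```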
3. Drift detection (MUST)
| Drift type | Method | Window | Threshold | Action on trigger |
|---|---|---|---|---|
| Data drift | PSI on top-5 features by FI | rolling 7d vs training distribution | PSI > 0.2 (warn), PSI > 0.5 (crit) | shadow re-train candidate |
| Concept drift | rolling perf gap (live f1_buy vs training OOS f1_buy) | 14d window, per crypto | gap > 0.05 (warn), gap > 0.10 (crit) | retrain candidate, hold live |
| Output sharpness drift (focal-specific) | rolling mean of max(p, 1-p) (predicted confidence) | 7d window | warn if > training_mean + 0.10 | check temperature scaling fit ; recalibrate if persistent |
Required minima ✅ : data + concept drift implemented. Output sharpness drift is focal-specific and addresses the known issue that focal-trained models drift toward sharper output distributions over time (Mukhoti et al. 2020).
Drift metrics emitted as OpenTelemetry spans per ADR-30. Triggered drift produces a drift_alert runbook entry, NOT silent retraining (ADR-25).
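As an illustration of the data-drift and sharpness rows, the sketch below computes a Population Stability Index and the mean of max(p, 1-p), then maps values onto the warn/crit thresholds from the table. The equal-width binning, the bin count, the epsilon floor, and all function names are assumptions made for the example; the project's actual binning scheme is not specified here.

```python
import math


def psi(expected: list[float], actual: list[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index with equal-width bins taken from the expected sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def shares(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def severity(value: float, warn: float, crit: float) -> str:
    """Map a drift value onto the table's warn / crit thresholds."""
    if value > crit:
        return "crit"
    if value > warn:
        return "warn"
    return "ok"


def sharpness(probas: list[float]) -> float:
    """Mean predicted confidence, max(p, 1 - p), over a window of predictions."""
    return sum(max(p, 1.0 - p) for p in probas) / len(probas)


def sharpness_drifted(live: list[float], training_mean: float, margin: float = 0.10) -> bool:
    return sharpness(live) > training_mean + margin
```

For the data-drift row this would be called as `severity(psi(train_values, live_values), warn=0.2, crit=0.5)` on each of the top-5 features.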
4. Staged rollout (MUST)
| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | 0% (predictions logged via FTF sweep, not acted) | 1 sweep run (~24h cluster compute) | 125 rows in finetune_results ; statistical gate : f1_buy lift CI95 excludes 0, ECE_HOLD ≤ baseline + 0.01, Cohen's d ≥ 0.3, BH p < 0.05 | gate KO → keep available (no lock) |
| Canary | 1 crypto out of 5 | ≥ 7d | live expectancy_net ≥ baseline OR within CI95 ; n_trades ≥ 50 ; no drift alert ; per-crypto ECE_HOLD ≤ baseline + 0.01 | live expectancy_net < 0 over 48h |
| Full | 100% (all 5 cryptos) | — | all canary criteria + Sortino ≥ baseline | per ADR-26 oncall procedure |
Required minima ✅ :
- Shadow stage = the FTF sweep itself (industry-standard for ML factor experiments per project convention)
- Canary stage MUST run on a single crypto first — canary crypto = BTCUSDC (deepest order book, highest signal-to-noise, lowest blast radius if focal regresses)
- Full rollout requires explicit operator sign-off in OP Story closure ; no auto-promotion
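The shadow-stage statistical gate combines four checks. The sketch below shows one way they could be evaluated, assuming the CI95 bounds and raw p-values come from the sweep analysis; the helper names and argument layout are hypothetical, and the pooled-variance Cohen's d and Benjamini-Hochberg adjustment are the textbook versions rather than the project's confirmed implementation.

```python
from statistics import mean, stdev


def cohens_d(treated: list[float], baseline: list[float]) -> float:
    """Cohen's d using a pooled standard deviation."""
    n1, n2 = len(treated), len(baseline)
    pooled_var = (
        (n1 - 1) * stdev(treated) ** 2 + (n2 - 1) * stdev(baseline) ** 2
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(baseline)) / pooled_var ** 0.5


def bh_adjust(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def shadow_gate(ci95_low: float, ci95_high: float, ece_hold: float,
                ece_baseline: float, d: float, p_adjusted: float) -> bool:
    """All four shadow criteria from the rollout table must hold."""
    ci_excludes_zero = ci95_low > 0 or ci95_high < 0
    return (
        ci_excludes_zero
        and ece_hold <= ece_baseline + 0.01
        and d >= 0.3  # a positive d also enforces the direction of the lift
        and p_adjusted < 0.05
    )
```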
5. Rollback plan (MUST)
| Mechanism | Description | Revert SLA |
|---|---|---|
| FTF factor flag (PRIMARY) | CVN_LOSS_FUNCTION=binary:logistic set in ftf_config.base_env Console (per ADR-59) | < 1 min |
| Model version pinning | cvntrade_XGBoost_<SYMBOL>_<STRATEGY> MLflow alias points to previous champion | < 5 min |
| Feature artefact pinning | FeatureEngineeringAPI.from_mlflow_run(run_id=<previous>) | < 5 min |
Required minima ✅ :
- The change is config-only revertable — flip CVN_LOSS_FUNCTION in Console without code deploy
- Specific config value : set factor_focal_loss=none in ftf_config.base_env (or directly CVN_LOSS_FUNCTION=binary:logistic)
- Tested in shadow stage : the FTF sweep includes the none baseline variant which exercises the rollback path
Secondary path (emergency only) : hotfix flow per OPERATIONS.md §16.7 if FTF runner is itself broken.
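A minimal sketch of the primary, config-only revert follows, assuming the trainer resolves its objective from the CVN_LOSS_FUNCTION variable and falls back to binary:logistic when it is unset. Both helper functions are hypothetical, and the value "focal" in the usage note is an assumed name for the focal variant; only the variable name and the binary:logistic value come from this plan.

```python
DEFAULT_OBJECTIVE = "binary:logistic"


def resolve_loss_function(env: dict[str, str]) -> str:
    """Return the configured loss function, defaulting to the baseline objective."""
    value = env.get("CVN_LOSS_FUNCTION", DEFAULT_OBJECTIVE).strip()
    return value or DEFAULT_OBJECTIVE


def apply_rollback(base_env: dict[str, str]) -> dict[str, str]:
    """Return a copy of base_env with the loss function pinned to the baseline."""
    rolled_back = dict(base_env)  # leave the caller's mapping untouched
    rolled_back["CVN_LOSS_FUNCTION"] = DEFAULT_OBJECTIVE
    return rolled_back
```

For example, `resolve_loss_function(apply_rollback({"CVN_LOSS_FUNCTION": "focal"}))` returns `"binary:logistic"` without any code change, which is the property the rollback table relies on.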
6. Owner & DRI (MUST)
- DRI: dococeven (single human, accountable for production behaviour of focal_loss)
- Backup DRI: dococeven (currently sole maintainer ; backup engagement deferred to staffing milestone)
- Decision authority (rollback flag flip without committee) : dococeven
- Sunset date: 2026-07-28 (90 days post full rollout) — by this date focal_loss is either accepted as part of the champion stack OR revisited
If sole-maintainer status changes, this file is updated and OP Story comment posted.
Sign-off checklist (gate before PR merge)
- §1 monitoring : ≥ 1 prediction + ≥ 1 outcome + ≥ 1 health metric defined ✅
- §2 alerting : ≥ 1 P1 alert exists ✅ — runbook files authored in Phase 3
- §3 drift : data + concept (+ output sharpness focal-specific) wired ✅
- §4 rollout : shadow (FTF sweep) + canary (BTCUSDC) + full ✅
- §5 rollback : CVN_LOSS_FUNCTION=binary:logistic Console flip < 1 min ✅
- §6 DRI : dococeven, sunset 2026-07-28 ✅
- Committee pr_review session reviews this filled template (Phase 3 — pending)
- OP Story comment links the committee session id (Phase 3 — pending)
The PR description MUST link this file. Per ADR-70 + ADR-25 — required for merge.