Runbook — Focal loss model produces negative expectancy¶
Alert : expectancy_negative_focal — expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1
Severity : P1 (page on-call DRI)
Impact : focal-loss-trained model is generating losing trades in production. Direct revenue impact ; if not addressed within 24h, can erode > 1% of portfolio per affected crypto/day.
Owner : dococeven (DRI Track 6)
Affected dashboards :
- MLOps Overview panel "Per-crypto expectancy"
- Testing & Backtest panel "Probability distribution"
- Per-crypto trade journal in Trading Engines
Immediate human actions (0–5 min)¶
1. Confirm the alert is real¶
# Query the source : trade journal Postgres aggregation over 48h window
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
SELECT crypto_symbol, AVG(realized_pnl_pct) AS expectancy_48h, COUNT(*) AS n_trades
FROM trade_journal
WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours'
AND model_loss_function = 'focal'
GROUP BY crypto_symbol
HAVING AVG(realized_pnl_pct) < 0;
"
Expected output : ≥ 2 rows with negative expectancy_48h. If 0–1 rows → alert was a transient false positive ; check Grafana for noisy data, log via incident_log.
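The triage rule above (two or more symbols with negative 48h expectancy means the alert is real) can be sketched as a small helper. This is a minimal sketch, not part of the alerting code: the tuple layout mirrors the columns of the query above, and the function names `affected_symbols` and `alert_is_real` are hypothetical.

```python
from typing import List, Sequence, Tuple

# One tuple per row returned by the query above:
# (crypto_symbol, expectancy_48h, n_trades)
Row = Tuple[str, float, int]


def affected_symbols(rows: Sequence[Row]) -> List[str]:
    """Return the symbols whose 48h expectancy is strictly negative."""
    return sorted(symbol for symbol, expectancy, _ in rows if expectancy < 0)


def alert_is_real(rows: Sequence[Row], min_affected: int = 2) -> bool:
    """Apply the runbook rule: real alert if >= min_affected symbols are negative."""
    return len(affected_symbols(rows)) >= min_affected
```

The list returned by `affected_symbols` is also the set of cryptos to carry into the rollback in step 3.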
2. Check the active config¶
# Verify focal is currently active per ftf_config (single-row JSONB table —
# see infra/migrations/007_ftf_config.sql)
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
SELECT
base_env->>'CVN_LOSS_FUNCTION' AS loss_function,
base_env->>'CVN_FOCAL_GAMMA' AS focal_gamma,
base_env->>'CVN_FOCAL_ALPHA' AS focal_alpha,
version,
updated_at
FROM ftf_config
WHERE id = 1;
"
If CVN_LOSS_FUNCTION != 'focal' → alert is stale (focal was already rolled back). Acknowledge alert, no action.
3. PRIMARY ROLLBACK (< 1 min revert SLA)¶
Flip CVN_LOSS_FUNCTION back to baseline via Console UI :
- URL : https://console.cvntrade.eu/config
- Find CVN_LOSS_FUNCTION key
- Set value to binary:logistic
- Submit (the change is atomic ; next training run picks it up)
For instant effect on live inference (rather than waiting for the next training run), also revert the model alias in MLflow. Revert every affected crypto, not just one, when multiple are affected (CR pass 4 Major fix). Model names follow the canonical convention CVNTrade_{MODEL_TYPE}_{SYMBOL}_{STRATEGY} ; note the capitalized C (CR pass 5 Major).
MLflow version IDs are per-model (not shared across symbols), so the rollback is a two-step lookup-then-flip per crypto.
Step A — discover the previous version ID per affected crypto (the version BEFORE the focal-loss one, typically N-1 if focal is N) :
for s in BTCUSDC ETHUSDC SOLUSDC XRPUSDC ADAUSDC; do
  echo "=== $s ==="
  mlflow models list-versions --name "CVNTrade_XGBoost_${s}_SL0.5_TP1_H4" \
    --order-by version_number DESC --max-results 5
done
# → record the previous_version_id for each affected symbol from this output.
# Example output (BTCUSDC) :
# version=42 run_id=abc123 tags={loss_function: focal} ← current focal
# version=41 run_id=def456 tags={loss_function: binary} ← previous_version_id (target)
Step B — flip the champion alias per crypto (substitute the IDs collected in Step A — one explicit invocation per affected symbol, NOT a templated loop, to leave a clear audit trail) :
mlflow models set-alias --name CVNTrade_XGBoost_BTCUSDC_SL0.5_TP1_H4 --alias champion --version <prev_id_BTCUSDC>
mlflow models set-alias --name CVNTrade_XGBoost_ETHUSDC_SL0.5_TP1_H4 --alias champion --version <prev_id_ETHUSDC>
# … repeat for each affected crypto identified in §1 above
The previous champion (binary:logistic) takes effect on the next inference call (< 30s propagation via cache invalidation).
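The same lookup-then-flip can be done from Python with the MLflow client, which may be convenient when scripting the rollback. The sketch below assumes that model versions carry a `loss_function` tag (as in the example output of Step A) and that the alias is named `champion`. The helper names are hypothetical. `search_model_versions`, `get_model_version_by_alias` and `set_registered_model_alias` are existing `MlflowClient` methods.

```python
def rollback_target(client, model_name: str, alias: str = "champion") -> str:
    """Find the newest non-focal version older than the current alias target."""
    current = client.get_model_version_by_alias(model_name, alias)
    versions = client.search_model_versions(f"name='{model_name}'")
    older_non_focal = [
        v for v in versions
        if int(v.version) < int(current.version)
        and v.tags.get("loss_function") != "focal"  # tag name is an assumption
    ]
    if not older_non_focal:
        raise LookupError(f"no non-focal version to roll back to for {model_name}")
    return max(older_non_focal, key=lambda v: int(v.version)).version


def rollback_champion(client, model_name: str, alias: str = "champion") -> str:
    """Point the alias at the rollback target and return the chosen version."""
    target = rollback_target(client, model_name, alias)
    client.set_registered_model_alias(model_name, alias, target)
    return target


# Usage (one explicit call per affected symbol, to keep the audit trail clear):
#   from mlflow.tracking import MlflowClient
#   rollback_champion(MlflowClient(), "CVNTrade_XGBoost_BTCUSDC_SL0.5_TP1_H4")
```

Record the returned version per symbol in the incident log, and still check the Step A listing by eye before flipping: the tag-based filter is only as reliable as the tagging.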
Diagnosis (5–30 min)¶
Why did focal regress in production after passing the FTF sweep gate ?¶
Likely causes (in priority order) :
- Distribution drift : input features / labels drifted since training. Check Grafana → Data Drift panel for PSI per crypto.
- Calibration drift : focal output sharpness shifted. Check Grafana → Output Sharpness Drift panel ; if alerting, recalibrate via temperature scaling.
- Concept drift : market regime changed, focal's hard-example focus is hurting (e.g. choppy market with no clear hard examples to learn from).
- Implementation regression : a code change post-merge silently broke focal output handling. Check git log src/training/XGBoost/focal_loss.py for recent edits.
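For reference when reading the drift panel, the Population Stability Index (PSI) compares the binned distribution of a feature at training time with its recent distribution. The sketch below is a generic textbook version using equal-width bins over the training range. It is not necessarily how the dashboard computes PSI, and the bin count and epsilon floor are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), on bin proportions."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the training range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb (a convention, not a threshold taken from this runbook) reads PSI below 0.1 as stable and above 0.25 as a significant shift.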
Confirm via per-fold analysis¶
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
SELECT
crypto_symbol,
AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '7 days') AS exp_7d,
AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours') AS exp_48h,
AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '14 days' AND entry_time < (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours') AS exp_14d_baseline
FROM trade_journal
WHERE model_loss_function = 'focal'
GROUP BY crypto_symbol;
"
If exp_48h < exp_14d_baseline - 0.01 → distribution / regime shift ; rolling back is the expected response.
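The comparison rule can be written down explicitly, including the case where a window has no trades and the `AVG(...) FILTER` column comes back NULL. This is a minimal sketch: the function name is hypothetical, and the margin of 0.01 is taken as-is from the rule above, in whatever unit `realized_pnl_pct` is stored.

```python
from typing import Optional


def regime_shift_suspected(exp_48h: Optional[float],
                           exp_14d_baseline: Optional[float],
                           margin: float = 0.01) -> bool:
    """True when the 48h expectancy sits more than `margin` below the baseline."""
    if exp_48h is None or exp_14d_baseline is None:
        # An empty window (NULL average) gives no evidence either way.
        return False
    return exp_48h < exp_14d_baseline - margin
```

For example, `regime_shift_suspected(-0.02, 0.0)` is true, while a 48h expectancy of -0.005 against the same baseline stays within the margin.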
Remediation paths¶
Path A — Immediate rollback to baseline (default)¶
Already done in step 3 above. Document the incident in OPERATIONS.md §17.X (next available number).
Path B — Recalibrate focal model with temperature scaling¶
If diagnosis points to calibration drift only (not concept drift) :
# Retrigger training with temperature_scaling calibration via Console
# Set CVN_CALIBRATION=temperature_scaling
# Trigger : Airflow DAG launch__training (single crypto, recalibrate only)
Verify post-recalibration ECE_HOLD via Grafana → Calibration Quality panel.
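As background for Path B, temperature scaling divides the model's raw margin (logit) by a single scalar T fitted on held-out data, which changes sharpness without changing the ranking of predictions. The sketch below is a generic, standard-library illustration and not the project's calibration code. It assumes binary labels and access to raw margins, and it fits T by golden-section search on 1/T, where the negative log-likelihood is convex.

```python
import math
from typing import Sequence


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(_sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    t_min: float = 0.05, t_max: float = 20.0,
                    iters: int = 60) -> float:
    """Fit T on held-out data via golden-section search over beta = 1/T."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 1.0 / t_max, 1.0 / t_min

    def loss(beta: float) -> float:
        return nll(logits, labels, 1.0 / beta)

    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if loss(c) < loss(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return 2.0 / (a + b)  # T at the midpoint of the final beta interval
```

A fitted T above 1 means the model was overconfident and its probabilities get pulled toward 0.5. A fitted T below 1 means the opposite.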
Path C — Lower the focal aggressiveness¶
If diagnosis is "focal was too aggressive for current market" :
Console flip :
- CVN_FOCAL_GAMMA : current → 1.0 (mild) instead of 2.0 (standard)
- Trigger retraining via Airflow
If still negative, fall back to baseline (Path A).
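To see what lowering gamma changes, recall the standard focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t): the factor (1 - p_t)^gamma down-weights examples the model already classifies confidently. The sketch below evaluates that scalar loss only. It is not the custom XGBoost objective in focal_loss.py, and the default alpha of 0.25 is the value from the original focal loss paper, not necessarily the value of CVN_FOCAL_ALPHA.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for predicted probability p of class 1 and label y."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # keep the logarithm finite
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def easy_example_weight(p_t: float, gamma: float) -> float:
    """Modulating factor applied to an example with true-class probability p_t."""
    return (1.0 - p_t) ** gamma
```

For a well-classified example with p_t = 0.9, the modulating factor is about 0.01 at gamma = 2.0 and about 0.1 at gamma = 1.0, so the milder setting leaves easy examples roughly ten times more influence on training.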
Escalation¶
If rollback (Path A) does NOT restore expectancy within 4h post-revert :
- Escalate to Operator (dococeven) for portfolio-wide kill-switch decision
- ADR-71 system-wide halt is the last resort (trading completely stopped)
- File P1 incident in OPERATIONS.md §17.X with full timeline + dashboard screenshots
Post-incident checklist¶
- Incident logged in OPERATIONS.md §17.X
- Root cause analyzed (path A/B/C decision)
- Per-track gate criteria from F1_BUY_BOOST_PLAN.md §6 re-checked
- If focal abandoned : OP Story CVN-N001-EE-S02 reopened with verdict abandoned
- Committee pr_review session opened if abandoning the variant (audit trail)