Runbook: Focal loss training crash rate elevated
Alert : focal_training_crash — xgboost_training_failed{loss_function="focal"} > 5/h
Severity : P2 (Slack #cvntrade-alerts ; respond within 1 business hour)
Impact : focal-loss training jobs are failing in the HPO / production retrain pipeline. Models are not refreshing, but there is no direct trading impact (existing models keep running). Risk : if not addressed, the model staleness alert (model_freshness > 7d) triggers within ~7 days.
Owner : dococeven (DRI Track 6)
Affected dashboards :
- Pipeline Health panel "Trainer errors"
- Model Registry panel "Model freshness per crypto"
Immediate human actions (0–15 min)
1. Identify the failure mode
# Tail the most recent failed training pods
kubectl -n airflow logs -l app=airflow-worker --tail=100 | grep -E "xgboost_training_failed|ValueError|Calibration contract|Hessian"
Common failure modes (in priority order) :
| Symptom | Likely cause | Fix path |
|---|---|---|
| ValueError: Calibration contract violation: target y must be discrete | Bug regression: soft labels reaching calibration | Check if apply_label_pipeline was modified post-merge ; check _invoke_calibration is intact |
| Hessian not positive definite or XGBoost split failed | Numerical instability with extreme γ or unusual data | Lower CVN_FOCAL_GAMMA (Path B below) |
| ModuleNotFoundError: No module named 'sympy' | Cluster image missing dep | Check airflow_docker/requirements.txt includes sympy ; rebuild image (per CVN-N013-EA-S01 pattern) |
| numpy overflow encountered in exp | Logits exceeding float64 range | Should be auto-clipped by _P_EPS ; if seen, file a P1 : the clipping logic is broken |
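To speed up classification when many pods have failed, the symptom strings above can be matched mechanically. The following is a minimal sketch, not part of the existing tooling: the helper names `classify_line` and `count_modes` are hypothetical, and the regexes are simply copied from the Symptom column.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Regexes copied from the Symptom column; the mode names mirror the
# post-incident classification (calibration / Hessian / dep / numerical).
FAILURE_MODES = [
    ("calibration", re.compile(r"Calibration contract violation")),
    ("hessian", re.compile(r"Hessian not positive definite|XGBoost split failed")),
    ("dep", re.compile(r"ModuleNotFoundError: No module named")),
    ("numerical", re.compile(r"overflow encountered in exp")),
]


def classify_line(line: str) -> Optional[str]:
    """Return the first failure mode whose pattern matches, else None."""
    for mode, pattern in FAILURE_MODES:
        if pattern.search(line):
            return mode
    return None


def count_modes(lines: Iterable[str]) -> Counter:
    """Count how many log lines fall into each failure mode."""
    counts = Counter()
    for line in lines:
        mode = classify_line(line)
        if mode is not None:
            counts[mode] += 1
    return counts
```

Feeding it the saved output of the kubectl command from this step gives a per-mode count; the dominant mode indicates which fix path to start with.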
2. Check the active config
# Single-row ftf_config table with JSONB base_env column
# (see infra/migrations/007_ftf_config.sql)
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
SELECT
base_env->>'CVN_LOSS_FUNCTION' AS loss_function,
base_env->>'CVN_FOCAL_GAMMA' AS focal_gamma,
base_env->>'CVN_FOCAL_ALPHA' AS focal_alpha
FROM ftf_config
WHERE id = 1;
"
If CVN_FOCAL_GAMMA > 6.0 AND failure mode is "Hessian not positive definite" → push gamma DOWN (Path B).
3. Quantify the blast radius
# How many training jobs are currently failing per crypto ?
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
SELECT crypto_symbol,
COUNT(*) FILTER (WHERE status='FAILED' AND loss_function='focal') AS failed_focal,
COUNT(*) FILTER (WHERE status='SUCCESS' AND loss_function='focal') AS success_focal
FROM training_runs
WHERE created_at >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '4 hours'
GROUP BY crypto_symbol
ORDER BY failed_focal DESC;
"
If failure rate per crypto > 50% → roll back to baseline (Path A). If < 50% → investigate root cause first (Path C).
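The threshold rule above can be applied to the query result row by row. This is a sketch under stated assumptions: `choose_path` and `triage` are hypothetical helpers, the sample rows are made up, a crypto with no runs in the window is reported as "no-data", and a rate of exactly 50% (which the rule does not cover) is treated as Path C.

```python
from typing import Dict, Iterable, Tuple


def choose_path(failed: int, succeeded: int) -> str:
    """Map one crypto's failed/success counts to a remediation path."""
    total = failed + succeeded
    if total == 0:
        return "no-data"  # assumption: no focal runs in the 4 h window
    # > 50% failures: roll back (Path A); otherwise investigate (Path C).
    return "A" if failed / total > 0.5 else "C"


def triage(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, str]:
    """rows are (crypto_symbol, failed_focal, success_focal) tuples."""
    return {symbol: choose_path(failed, ok) for symbol, failed, ok in rows}


# Made-up sample rows in the shape returned by the query above.
sample_rows = [("BTC", 9, 1), ("ETH", 1, 9), ("SOL", 0, 0)]
print(triage(sample_rows))
```

Path B is decided separately from the failure mode and the active gamma (step 2), so it is intentionally not part of this function.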
Remediation paths
Path A: Roll back to baseline if failures are widespread (> 50% per crypto)
Same as Path A in runbook_focal_loss_negative_expectancy.md :
- Console flip CVN_LOSS_FUNCTION=binary:logistic
- Verify : training jobs restart successfully on the baseline objective
Path B: Lower focal aggressiveness if numerical instability
If failures are concentrated on Hessian not positive definite :
Console flip :
- CVN_FOCAL_GAMMA : reduce by 1.0 step (e.g., 4.0 → 3.0 → 2.0 → 1.0)
- Re-trigger 1 training job manually (Airflow UI → launch__training with single crypto override)
- If success → leave reduced gamma in production
- If still failing → fall back to Path A
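The step-down procedure above can be written out as a small loop. This is an illustrative sketch only: `gamma_schedule` and `step_down` are hypothetical names, `trial_succeeds` stands in for the manual single-crypto re-trigger in the Airflow UI, and the floor of 1.0 is an assumption taken from the last value in the example sequence.

```python
from typing import Callable, List, Optional, Tuple


def gamma_schedule(current: float, step: float = 1.0, floor: float = 1.0) -> List[float]:
    """Candidate gammas below `current`, one step at a time, down to `floor`."""
    values = []
    gamma = current - step
    while gamma >= floor:
        values.append(gamma)
        gamma -= step
    return values


def step_down(
    current: float,
    trial_succeeds: Callable[[float], bool],
    step: float = 1.0,
    floor: float = 1.0,
) -> Tuple[str, Optional[float]]:
    """Try each reduced gamma once; keep the first that trains, else Path A."""
    for gamma in gamma_schedule(current, step, floor):
        if trial_succeeds(gamma):
            return ("keep", gamma)
    return ("path_a", None)


# Example: pretend the manual trial only succeeds once gamma is at most 2.0.
print(step_down(4.0, lambda gamma: gamma <= 2.0))
```

If the active gamma is already at the floor, the schedule is empty and the sketch goes straight to Path A.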
Path C: Investigate without immediate rollback
If failures are sparse (< 20% per crypto, no clear pattern) :
# Save logs from the latest failed job for offline analysis
kubectl -n airflow cp <failed_pod>:/opt/airflow/logs/xgboost_failure.log /tmp/focal_failure_$(date +%s).log
# Run the formal verification test locally to rule out math drift
.venv_airflow/bin/python -m pytest tests/unit/training/XGBoost/test_focal_loss_formal_verification.py -v
If formal verification fails → P1 escalation : this is a math regression. Roll back via Path A, and land a hotfix PR before the next training cycle (per OPERATIONS.md §17.3 pattern from Bug #1).
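For intuition about what "math drift" means here, the sketch below compares an analytical focal-loss gradient against a central finite difference. It is not the repository's test, whose contents are not shown in this runbook. It assumes the standard alpha-weighted binary focal loss, L = -α_t (1 - p_t)^γ log(p_t) with p = sigmoid(z); the production objective may differ, and all function names are hypothetical.

```python
import math


def _p_t_and_alpha_t(z: float, y: int, alpha: float):
    """Probability and class weight of the true class for logit z."""
    p = 1.0 / (1.0 + math.exp(-z))
    return (p, alpha) if y == 1 else (1.0 - p, 1.0 - alpha)


def focal_loss(z: float, y: int, gamma: float, alpha: float) -> float:
    p_t, a_t = _p_t_and_alpha_t(z, y, alpha)
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z: float, y: int, gamma: float, alpha: float) -> float:
    """Analytical dL/dz, using dp_t/dz = s * p_t * (1 - p_t), s = +1 or -1."""
    p_t, a_t = _p_t_and_alpha_t(z, y, alpha)
    s = 1.0 if y == 1 else -1.0
    return s * a_t * (
        gamma * p_t * (1.0 - p_t) ** gamma * math.log(p_t)
        - (1.0 - p_t) ** (gamma + 1.0)
    )


def max_grad_error(gamma: float, alpha: float, h: float = 1e-5) -> float:
    """Largest gap between analytical and numerical gradient on z in [-4, 4]."""
    worst = 0.0
    for y in (0, 1):
        for i in range(-40, 41):
            z = i / 10.0
            numeric = (
                focal_loss(z + h, y, gamma, alpha)
                - focal_loss(z - h, y, gamma, alpha)
            ) / (2.0 * h)
            worst = max(worst, abs(numeric - focal_grad(z, y, gamma, alpha)))
    return worst


print(max_grad_error(gamma=2.0, alpha=0.25))
```

A gap far above finite-difference noise would indicate that the gradient no longer matches the loss, which is the kind of regression that warrants the P1 escalation. With γ = 0 and α_t = 1 the gradient reduces to p - y, the binary:logistic case, which is a quick sanity check.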
Escalation
| Condition | Action |
|---|---|
| All 5 cryptos failing | P1 : page DRI immediately, Path A within 5 min |
| Formal verification test fails locally | P1 : Path A + escalate to math review (committee pr_review re-opened on Track 6) |
| Path A doesn't restore training success rate | P1 : broader investigation, suspect cluster / image / dep issue (cf. CVN-N013-EA Stories) |
Post-incident checklist
- Failure mode classified (calibration / Hessian / dep / numerical)
- Path A/B/C decision documented
- If math regression → OP Story CVN-N011-EA-S07 linked + hotfix PR opened
- If dep issue → OP Story under CVN-N013-EA opened (sister to the cleanlab issue)
- Incident logged in OPERATIONS.md §17.X