Runbook — Focal loss training crash rate elevated

Alert : focal_training_crash (fires when xgboost_training_failed{loss_function="focal"} > 5/h)

Severity : P2 (Slack #cvntrade-alerts ; respond within 1 business hour)

Impact : focal-loss training jobs are failing in the HPO / production retrain pipeline. Models are not refreshing, but there is no direct trading impact (existing models keep running). Risk : if not addressed, the model staleness alert (model_freshness > 7d) triggers within ~7 days.

Owner : dococeven (DRI Track 6)

Affected dashboards :

  • Pipeline Health panel "Trainer errors"
  • Model Registry panel "Model freshness per crypto"


Immediate human actions (0–15 min)

1. Identify the failure mode

# Tail the most recent failed training pods
kubectl -n airflow logs -l app=airflow-worker --tail=100 | grep -E "xgboost_training_failed|ValueError|Calibration contract|Hessian"

Common failure modes (in priority order) :

| Symptom | Likely cause | Fix path |
|---|---|---|
| ValueError: Calibration contract violation: target y must be discrete | Bug regression : soft labels reaching calibration | Check whether apply_label_pipeline was modified post-merge ; check that _invoke_calibration is intact |
| Hessian not positive definite or XGBoost split failed | Numerical instability with extreme γ or unusual data | Lower CVN_FOCAL_GAMMA (Path B below) |
| ModuleNotFoundError: No module named 'sympy' | Cluster image missing dep | Check that airflow_docker/requirements.txt includes sympy ; rebuild image (per CVN-N013-EA-S01 pattern) |
| numpy overflow encountered in exp | Logits exceeding float64 range | Should be auto-clipped by _P_EPS ; if seen, file P1 (clipping logic broken) |
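
When the grep output is long, the symptom matching above can be done mechanically. The following is a minimal sketch, not project code: the function names are hypothetical, the patterns are copied from the symptom column, and the category labels follow the post-incident checklist.

```python
import re
from collections import Counter

# Patterns come from the symptom column of the failure-mode list; order follows
# its priority order. Names here are illustrative only.
FAILURE_PATTERNS = [
    ("calibration", re.compile(r"Calibration contract violation")),
    ("hessian", re.compile(r"Hessian not positive definite|XGBoost split failed")),
    ("dep", re.compile(r"ModuleNotFoundError")),
    ("numerical", re.compile(r"overflow encountered in exp")),
]


def classify_failure(log_line: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for category, pattern in FAILURE_PATTERNS:
        if pattern.search(log_line):
            return category
    return "unknown"


def count_failures(log_lines) -> Counter:
    """Count failure categories over an iterable of log lines."""
    return Counter(classify_failure(line) for line in log_lines)
```

Feeding it the lines returned by the kubectl command gives a quick view of which failure mode dominates, which is the input needed for the Path A/B/C decision.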

2. Check the active config

# Single-row ftf_config table with JSONB base_env column
# (see infra/migrations/007_ftf_config.sql)
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT
    base_env->>'CVN_LOSS_FUNCTION' AS loss_function,
    base_env->>'CVN_FOCAL_GAMMA'   AS focal_gamma,
    base_env->>'CVN_FOCAL_ALPHA'   AS focal_alpha
  FROM ftf_config
  WHERE id = 1;
"

If CVN_FOCAL_GAMMA > 6.0 AND failure mode is "Hessian not positive definite" → push gamma DOWN (Path B).
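
To give some intuition for why γ and Hessian errors are linked, here is a small sketch. It assumes the standard focal loss for a positive label, loss = -(1 - p)^γ · log(p) with p = sigmoid(z) and α omitted, which may differ from the objective actually used in this pipeline. It estimates the second derivative with respect to the logit by finite differences; all function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def focal_loss_pos(z: float, gamma: float) -> float:
    """Focal loss for a positive label (y=1, alpha omitted) at logit z.

    With p = sigmoid(z): loss = -(1 - p)**gamma * log(p).
    """
    log_p = -softplus(-z)           # log(sigmoid(z))
    log_one_minus_p = -softplus(z)  # log(1 - sigmoid(z))
    return -math.exp(gamma * log_one_minus_p) * log_p


def focal_hessian_pos(z: float, gamma: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d2(loss)/dz2 at logit z."""
    return (
        focal_loss_pos(z + h, gamma)
        - 2.0 * focal_loss_pos(z, gamma)
        + focal_loss_pos(z - h, gamma)
    ) / (h * h)


if __name__ == "__main__":
    # Compare curvature at a badly misclassified positive (z = -4) across gammas.
    for gamma in (0.0, 2.0, 4.0, 6.0):
        print(gamma, focal_hessian_pos(-4.0, gamma))
```

Under this formulation the curvature is positive everywhere at γ = 0 (plain log loss) but turns negative for badly misclassified positives once γ > 0. That is one plausible route to a Hessian error; whether it is the mechanism behind a given failing job still has to be confirmed from the logs.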

3. Quantify the blast radius

# How many training jobs are currently failing per crypto ?
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT crypto_symbol,
         COUNT(*) FILTER (WHERE status='FAILED' AND loss_function='focal') AS failed_focal,
         COUNT(*) FILTER (WHERE status='SUCCESS' AND loss_function='focal') AS success_focal
  FROM training_runs
  WHERE created_at >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '4 hours'
  GROUP BY crypto_symbol
  ORDER BY failed_focal DESC;
"

If the failure rate for a crypto is > 50% → roll back to baseline (Path A). If < 50% → investigate the root cause first (Path B when failures are concentrated on Hessian errors, otherwise Path C).
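
The per-crypto decision can be written down as a small function. This is a sketch under assumptions: the function names and the hessian_dominant flag are hypothetical, a rate of exactly 50% is treated as "not above 50%" because the runbook does not say otherwise, and Path B is chosen only when the failures are concentrated on Hessian errors, as described under the remediation paths.

```python
def failure_rate(failed: int, succeeded: int) -> float:
    """Share of failed focal runs; 0.0 when there were no runs at all."""
    total = failed + succeeded
    return failed / total if total else 0.0


def choose_path(failed: int, succeeded: int, hessian_dominant: bool = False) -> str:
    """Map the failed_focal / success_focal counts for one crypto to a path."""
    rate = failure_rate(failed, succeeded)
    if rate > 0.5:
        return "A"  # widespread failures: roll back to baseline
    if hessian_dominant:
        return "B"  # numerical instability: lower CVN_FOCAL_GAMMA
    return "C"      # investigate without immediate rollback
```

For example, `choose_path(8, 2)` selects Path A, while `choose_path(1, 9)` selects Path C.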


Remediation paths

Path A — Rollback to baseline if failures are widespread (> 50% per crypto)

Same as Path A in runbook_focal_loss_negative_expectancy.md :

  • Console flip CVN_LOSS_FUNCTION=binary:logistic
  • Verify : training jobs restart successfully on the baseline objective

Path B — Lower focal aggressiveness if numerical instability

If failures are concentrated on Hessian not positive definite :

Console flip :

  • CVN_FOCAL_GAMMA : reduce in steps of 1.0 (e.g., 4.0 → 3.0 → 2.0 → 1.0)
  • Re-trigger 1 training job manually (Airflow UI → launch__training with single crypto override)
  • If success → leave reduced gamma in production
  • If still failing → fall back to Path A
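
To keep the step-down values explicit during an incident, the candidate gammas can be listed up front. A minimal sketch: the helper name is hypothetical, and the floor of 1.0 is an assumption taken from the last value in the example sequence, not a documented limit.

```python
from __future__ import annotations


def gamma_candidates(current: float, step: float = 1.0, floor: float = 1.0) -> list[float]:
    """List the reduced gamma values to try, from highest to lowest.

    The floor default mirrors the example sequence and is an assumption.
    """
    candidates = []
    gamma = current - step
    # Small tolerance so a value that lands on the floor is still included.
    while gamma >= floor - 1e-9:
        candidates.append(round(gamma, 6))
        gamma -= step
    return candidates
```

An empty list means there is no lower value left to try, in which case the fallback is Path A.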

Path C — Investigate without immediate rollback

If failures are sparse (< 20% per crypto, no clear pattern) :

# Save logs from the latest failed job for offline analysis
kubectl -n airflow cp <failed_pod>:/opt/airflow/logs/xgboost_failure.log /tmp/focal_failure_$(date +%s).log

# Run the formal verification test locally to rule out math drift
.venv_airflow/bin/python -m pytest tests/unit/training/XGBoost/test_focal_loss_formal_verification.py -v

If formal verification fails → P1 escalation : this is a math regression. Roll back (Path A) AND land a hotfix PR before the next training cycle (per OPERATIONS.md §17.3 pattern from Bug #1).


Escalation

| Condition | Action |
|---|---|
| All 5 cryptos failing | P1 : page DRI immediately, Path A within 5 min |
| Formal verification test fails locally | P1 : Path A + escalate to math review (committee pr_review re-opened on Track 6) |
| Path A doesn't restore training success rate | P1 : broader investigation, suspect cluster / image / dep issue (cf. CVN-N013-EA Stories) |

Post-incident checklist

  • Failure mode classified (calibration / Hessian / dep / numerical)
  • Path A/B/C decision documented
  • If math regression → OP Story CVN-N011-EA-S07 linked + hotfix PR opened
  • If dep issue → OP Story under CVN-N013-EA opened (sister to the cleanlab issue)
  • Incident logged in OPERATIONS.md §17.X