Runbook — Focal loss training crash rate elevated¶

Alert : focal_training_crash — xgboost_training_failed{loss_function="focal"} > 5/h

Severity : P2 (Slack #cvntrade-alerts ; respond within 1 business hour)

Impact : focal-loss training jobs are failing in HPO / production retrain pipeline. Models are not refreshing, but no direct trading impact (existing models keep running). Risk : if not addressed, model staleness alert (model_freshness > 7d) triggers within ~7 days.

Owner : dococeven (DRI Track 6)

Affected dashboards : - Pipeline Health panel "Trainer errors" - Model Registry panel "Model freshness per crypto"

Immediate human actions (0–15 min)¶

1. Identify the failure mode¶

# Tail the most recent failed training pods
kubectl -n airflow logs -l app=airflow-worker --tail=100 | grep -E "xgboost_training_failed|ValueError|Calibration contract|Hessian"

Common failure modes (in priority order) :

Symptom	Likely cause	Fix path
`ValueError: Calibration contract violation: target y must be discrete`	Bug regression — soft labels reaching calibration	Check if `apply_label_pipeline` was modified post-merge ; check `_invoke_calibration` is intact
`Hessian not positive definite` or `XGBoost split failed`	Numerical instability with extreme γ or unusual data	Lower `CVN_FOCAL_GAMMA` (Path B below)
`ModuleNotFoundError: No module named 'sympy'`	Cluster image missing dep	Check `airflow_docker/requirements.txt` includes sympy ; rebuild image (per CVN-N013-EA-S01 pattern)
`numpy overflow encountered in exp`	Logits exceeding float64 range	Should be auto-clipped by `_P_EPS` ; if seen, file P1 — clipping logic broken

2. Check the active config¶

# Single-row ftf_config table with JSONB base_env column
# (see infra/migrations/007_ftf_config.sql)
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT
    base_env->>'CVN_LOSS_FUNCTION' AS loss_function,
    base_env->>'CVN_FOCAL_GAMMA'   AS focal_gamma,
    base_env->>'CVN_FOCAL_ALPHA'   AS focal_alpha
  FROM ftf_config
  WHERE id = 1;
"

If CVN_FOCAL_GAMMA > 6.0 AND failure mode is "Hessian not positive definite" → push gamma DOWN (Path B).

3. Quantify the blast radius¶

# How many training jobs are currently failing per crypto ?
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT crypto_symbol,
         COUNT(*) FILTER (WHERE status='FAILED' AND loss_function='focal') AS failed_focal,
         COUNT(*) FILTER (WHERE status='SUCCESS' AND loss_function='focal') AS success_focal
  FROM training_runs
  WHERE created_at >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '4 hours'
  GROUP BY crypto_symbol
  ORDER BY failed_focal DESC;
"

If failure rate per crypto > 50% → roll back to baseline (Path A). If < 50% → investigate root cause first (Path C).

Remediation paths¶

Path A — Rollback to baseline if failures are widespread (> 50% per crypto)¶

Same as Path A in runbook_focal_loss_negative_expectancy.md : - Console flip CVN_LOSS_FUNCTION=binary:logistic - Verify : training jobs restart successfully on baseline obj

Path B — Lower focal aggressiveness if numerical instability¶

If failures are concentrated on Hessian not positive definite :

Console flip : - CVN_FOCAL_GAMMA : reduce by 1.0 step (e.g., 4.0 → 3.0 → 2.0 → 1.0) - Re-trigger 1 training job manually (Airflow UI → launch__training with single crypto override) - If success → leave reduced gamma in production - If still failing → fall back to Path A

Path C — Investigate without immediate rollback¶

If failures are sparse (< 20% per crypto, no clear pattern) :

# Save logs from the latest failed job for offline analysis
kubectl -n airflow cp <failed_pod>:/opt/airflow/logs/xgboost_failure.log /tmp/focal_failure_$(date +%s).log

# Run the formal verification test locally to rule out math drift
.venv_airflow/bin/python -m pytest tests/unit/training/XGBoost/test_focal_loss_formal_verification.py -v

If formal verification fails → P1 escalation, this is a math regression and must be rolled back AND a hotfix PR must land before the next training cycle (per OPERATIONS.md §17.3 pattern from Bug #1).

Escalation¶

Condition	Action
All 5 cryptos failing	P1 — page DRI immediately, Path A within 5min
Formal verification test fails locally	P1 — Path A + escalate to math review (committee `pr_review` re-opened on Track 6)
Path A doesn't restore training success rate	P1 — broader investigation, suspect cluster / image / dep issue (cf. CVN-N013-EA Stories)

Post-incident checklist¶

Failure mode classified (calibration / Hessian / dep / numerical)
Path A/B/C decision documented
If math regression → OP Story CVN-N011-EA-S07 linked + hotfix PR opened
If dep issue → OP Story under CVN-N013-EA opened (sister to the cleanlab issue)
Incident logged in OPERATIONS.md §17.X