Runbook — Focal loss model produces negative expectancy

Alert : expectancy_negative_focal (fires when expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1)

Severity : P1 (page on-call DRI)

Impact : the focal-loss-trained model is generating losing trades in production. Direct revenue impact ; if not addressed within 24h, losses can exceed 1% of portfolio per affected crypto per day.

Owner : dococeven (DRI Track 6)

Affected dashboards :

  • MLOps Overview panel "Per-crypto expectancy"
  • Testing & Backtest panel "Probability distribution"
  • Per-crypto trade journal in Trading Engines


Immediate human actions (0–5 min)

1. Confirm the alert is real

# Query the source : trade journal Postgres aggregation over 48h window
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT crypto_symbol, AVG(realized_pnl_pct) AS expectancy_48h, COUNT(*) AS n_trades
  FROM trade_journal
  WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours'
    AND model_loss_function = 'focal'
  GROUP BY crypto_symbol
  HAVING AVG(realized_pnl_pct) < 0;
"

Expected output : ≥ 2 rows with negative expectancy_48h. If 0–1 rows → alert was a transient false positive ; check Grafana for noisy data, log via incident_log.
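The decision rule above can be sketched in Python, which may help when the query result is consumed by a script rather than read by eye. This is a minimal sketch, assuming rows arrive as (crypto_symbol, expectancy_48h, n_trades) tuples ; the function name confirm_alert is hypothetical and the sample rows are illustrative, not real trade data.

```python
from typing import Iterable, List, Tuple

# One row of the 48h query: (crypto_symbol, expectancy_48h, n_trades)
Row = Tuple[str, float, int]


def confirm_alert(rows: Iterable[Row], min_symbols: int = 2) -> Tuple[bool, List[str]]:
    """Return (alert_is_real, affected_symbols).

    The alert counts as real when at least `min_symbols` cryptos show a
    negative 48h expectancy, mirroring the alert condition (>= 2 cryptos).
    """
    # Re-check the sign here so the helper also works on unfiltered rows.
    affected = sorted(sym for sym, expectancy, _n in rows if expectancy < 0)
    return len(affected) >= min_symbols, affected


# Illustrative rows only (hypothetical values).
rows = [("BTCUSDC", -0.12, 31), ("ETHUSDC", -0.04, 18), ("SOLUSDC", 0.07, 22)]
is_real, affected = confirm_alert(rows)
print(is_real, affected)
```

The affected list is the set of symbols to carry into the rollback step ; n_trades is kept in the row so a reviewer can judge how much weight a small sample deserves.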

2. Check the active config

# Verify focal is currently active per ftf_config (single-row JSONB table —
# see infra/migrations/007_ftf_config.sql)
psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT
    base_env->>'CVN_LOSS_FUNCTION' AS loss_function,
    base_env->>'CVN_FOCAL_GAMMA'   AS focal_gamma,
    base_env->>'CVN_FOCAL_ALPHA'   AS focal_alpha,
    version,
    updated_at
  FROM ftf_config
  WHERE id = 1;
"

If CVN_LOSS_FUNCTION != 'focal' → alert is stale (focal was already rolled back). Acknowledge alert, no action.

3. PRIMARY ROLLBACK (< 1 min revert SLA)

Flip CVN_LOSS_FUNCTION back to baseline via Console UI :

  • URL : https://console.cvntrade.eu/config
  • Find the CVN_LOSS_FUNCTION key
  • Set value to binary:logistic
  • Submit (the change is atomic ; next training run picks it up)

For instant effect on live inference (rather than next training run), also revert the model alias in MLflow. Use a templated symbol — do NOT run on a single crypto if multiple are affected (CR pass 4 Major fix). Model name follows the canonical convention CVNTrade_{MODEL_TYPE}_{SYMBOL}_{STRATEGY} — note the capitalized C (CR pass 5 Major).

MLflow version IDs are per-model (not shared across symbols), so the rollback is a two-step lookup-then-flip per crypto.

Step A — discover the previous version ID per affected crypto (the version BEFORE the focal-loss one, typically N-1 if focal is N) :

for s in BTCUSDC ETHUSDC SOLUSDC XRPUSDC ADAUSDC; do
  echo "=== $s ==="
  mlflow models list-versions --name CVNTrade_XGBoost_${s}_SL0.5_TP1_H4 \
    --order-by version_number DESC --max-results 5
done
# → record the previous_version_id for each affected symbol from this output.
# Example output (BTCUSDC) :
#   version=42  run_id=abc123  tags={loss_function: focal}     ← current focal
#   version=41  run_id=def456  tags={loss_function: binary}    ← previous_version_id (target)

Step B — flip the champion alias per crypto (substitute the IDs collected in Step A — one explicit invocation per affected symbol, NOT a templated loop, to leave a clear audit trail) :

mlflow models set-alias --name CVNTrade_XGBoost_BTCUSDC_SL0.5_TP1_H4 --alias champion --version <prev_id_BTCUSDC>
mlflow models set-alias --name CVNTrade_XGBoost_ETHUSDC_SL0.5_TP1_H4 --alias champion --version <prev_id_ETHUSDC>
# … repeat for each affected crypto identified in §1 above

The previous champion (binary:logistic) takes effect on the next inference call (< 30s propagation via cache invalidation).
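The lookup-then-flip can also be sketched with the MLflow Python client, in case the mlflow models subcommands shown above are not available in the installed MLflow release. This is a minimal sketch, assuming every version carries a loss_function tag as in the example output ; pick_rollback_version, fetch_versions and flip_champion are hypothetical helper names, and the client calls used (search_model_versions, set_registered_model_alias) require an MLflow release with alias support.

```python
from typing import Dict, List, Optional, Tuple

# (version_number, tags) for one registered model version
Version = Tuple[int, Dict[str, str]]


def pick_rollback_version(versions: List[Version]) -> Optional[int]:
    """Newest non-focal version that is older than the newest focal version.

    Returns None when there is no focal version (nothing to roll back)
    or no older non-focal version (no safe target).
    """
    focal = [v for v, tags in versions if tags.get("loss_function") == "focal"]
    if not focal:
        return None
    current = max(focal)
    older = [
        v for v, tags in versions
        if v < current and tags.get("loss_function") != "focal"
    ]
    return max(older) if older else None


def fetch_versions(name: str) -> List[Version]:
    """Read all versions of one model (needs mlflow and a reachable registry)."""
    from mlflow.tracking import MlflowClient  # lazy import: only needed here

    found = MlflowClient().search_model_versions(f"name='{name}'")
    return [(int(mv.version), dict(mv.tags)) for mv in found]


def flip_champion(name: str, version: int) -> None:
    """Point the champion alias at `version` for one model."""
    from mlflow.tracking import MlflowClient

    MlflowClient().set_registered_model_alias(name, "champion", str(version))


# Same shape as the BTCUSDC example output above.
example = [(42, {"loss_function": "focal"}), (41, {"loss_function": "binary"})]
print(pick_rollback_version(example))
```

Keeping the selection logic separate from the registry calls lets the on-call engineer review the chosen version before flip_champion is invoked, one explicit call per affected symbol.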


Diagnosis (5–30 min)

Why did focal regress in production after passing the FTF sweep gate ?

Likely causes (in priority order) :

  1. Distribution drift — input features / labels drifted since training. Check Grafana → Data Drift panel for PSI per crypto.
  2. Calibration drift — focal output sharpness shifted. Check Grafana → Output Sharpness Drift panel ; if alerting, recalibrate via temperature scaling.
  3. Concept drift — market regime changed, focal's hard-example focus is hurting (e.g. choppy market with no clear hard examples to learn from).
  4. Implementation regression — a code change post-merge silently broke focal output handling. Check git log src/training/XGBoost/focal_loss.py for recent edits.
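For cause 1, the PSI shown in the Data Drift panel can be cross-checked by hand on a single feature. This is a minimal sketch of the standard Population Stability Index, assuming equal-width bins taken from the training sample and a small floor to avoid log(0) ; the binning used by the Grafana panel is not documented here and may differ, and the sample data is synthetic.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training and a live sample.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i
    are the per-bin proportions of the expected and actual samples.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in values:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic example: the live sample is the training sample shifted upward.
train = [i / 100 for i in range(100)]
live = [x + 0.3 for x in train]
print(psi(train, train), psi(train, live))
```

An unchanged distribution gives a PSI of zero, and the value grows as mass moves between bins, so a per-crypto comparison against the panel is a quick sanity check before blaming the model.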

Confirm via per-window analysis

psql -h cvntrade-pg -U cvntrade -d cvntrade -c "
  SELECT
    crypto_symbol,
    AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '7 days') AS exp_7d,
    AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours') AS exp_48h,
    AVG(realized_pnl_pct) FILTER (WHERE entry_time >= (NOW() AT TIME ZONE 'UTC') - INTERVAL '14 days' AND entry_time < (NOW() AT TIME ZONE 'UTC') - INTERVAL '48 hours') AS exp_14d_baseline
  FROM trade_journal
  WHERE model_loss_function = 'focal'
  GROUP BY crypto_symbol;
"

If exp_48h < exp_14d_baseline - 0.01 → a distribution / regime shift is the likely cause ; rolling back is the expected response.
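The comparison rule above can be sketched as a small Python helper for scripted checks. This is a minimal sketch, assuming the margin of 0.01 is expressed in the same units as realized_pnl_pct ; the function name is hypothetical. It also treats a missing window as "no verdict", since AVG over an empty FILTER returns NULL.

```python
from typing import Optional


def regime_shift_suspected(exp_48h: Optional[float],
                           exp_14d_baseline: Optional[float],
                           margin: float = 0.01) -> bool:
    """True when the last 48h underperform the 14-day baseline by more than `margin`.

    A NULL from SQL arrives as None in Python ; with no trades in one of
    the windows there is nothing to compare, so the check returns False.
    """
    if exp_48h is None or exp_14d_baseline is None:
        return False
    return exp_48h < exp_14d_baseline - margin


# Hypothetical values for illustration.
print(regime_shift_suspected(-0.05, 0.02))   # 48h well below baseline
print(regime_shift_suspected(0.015, 0.02))   # within the margin
print(regime_shift_suspected(None, 0.02))    # no trades in the 48h window
```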


Remediation paths

Path A — Immediate rollback to baseline (default)

Already done in step 3 above. Document the incident in OPERATIONS.md §17.X (next available number).

Path B — Recalibrate focal model with temperature scaling

If diagnosis points to calibration drift only (not concept drift) :

# Retrigger training with temperature_scaling calibration via Console
# Set CVN_CALIBRATION=temperature_scaling
# Trigger : Airflow DAG launch__training (single crypto, recalibrate only)

Verify post-recalibration ECE_HOLD via Grafana → Calibration Quality panel.
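For orientation, temperature scaling fits a single scalar T on held-out data and divides the raw model margin (logit) by it before the sigmoid. This is a minimal sketch, assuming binary labels and access to raw logits ; it uses a plain grid search over T, and the calibration code triggered by CVN_CALIBRATION=temperature_scaling may be implemented differently. The sample data is synthetic.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of sigmoid(logit / temperature)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    grid: Optional[List[float]] = None) -> float:
    """Pick the temperature on the grid that minimizes held-out NLL."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))


# Synthetic overconfident model: logits of +/-4 but only 75% of calls are right.
logits = [4.0] * 4 + [-4.0] * 4
labels = [1, 1, 1, 0] + [0, 0, 0, 1]
t = fit_temperature(logits, labels)
print(t, nll(logits, labels, 1.0), nll(logits, labels, t))
```

A fitted T above 1 softens overconfident probabilities and a T below 1 sharpens them ; the ranking of predictions is unchanged, which is why this path only addresses calibration drift and not concept drift.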

Path C — Lower the focal aggressiveness

If diagnosis is "focal was too aggressive for current market" :

Console flip :

  • CVN_FOCAL_GAMMA : lower from the current value to 1.0 (mild) ; 2.0 is the standard setting
  • Trigger retraining via Airflow

If still negative, fall back to baseline (Path A).
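To see why a lower gamma is milder, the standard binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) can be evaluated directly. This is a minimal sketch of the textbook formula ; the default alpha of 0.25 is an assumption for illustration (the runbook does not state the value of CVN_FOCAL_ALPHA), and the project's own implementation in src/training/XGBoost/focal_loss.py may differ.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one example.

    p is the predicted probability of the positive class, y is 0 or 1.
    p_t is the probability assigned to the true class ; (1 - p_t)^gamma
    shrinks the loss of well-classified (easy) examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare an easy positive (p_t = 0.9) with a hard positive (p_t = 0.3).
for gamma in (1.0, 2.0):
    easy = focal_loss(0.9, 1, gamma=gamma)
    hard = focal_loss(0.3, 1, gamma=gamma)
    print(f"gamma={gamma}: easy={easy:.5f} hard={hard:.5f} hard/easy={hard / easy:.1f}")
```

With gamma at 1.0 the hard-to-easy loss ratio is smaller than at 2.0, so training leans less heavily on hard examples ; gamma at 0 reduces the expression to alpha-weighted cross-entropy.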


Escalation

If rollback (Path A) does NOT restore expectancy within 4h post-revert :

  • Escalate to Operator (dococeven) for portfolio-wide kill-switch decision
  • ADR-71 system-wide halt is the last resort (trading completely stopped)
  • File P1 incident in OPERATIONS.md §17.X with full timeline + dashboard screenshots

Post-incident checklist

  • Incident logged in OPERATIONS.md §17.X
  • Root cause analyzed (path A/B/C decision)
  • Per-track gate criteria from F1_BUY_BOOST_PLAN.md §6 re-checked
  • If focal abandoned : OP Story CVN-N001-EE-S02 reopened with verdict abandoned
  • Committee pr_review session opened if abandoning the variant (audit trail)