Runbooks: incident response
Runbooks are the operator's first reference during an incident. Each is structured the same way:
- Alert trigger + severity + impact
- Immediate human actions (0–15 min): confirm + diagnose + immediate rollback
- Diagnosis (5–30 min): root cause hypotheses + queries
- Remediation paths (typically 3): full rollback / partial mitigation / investigate without rollback
- Escalation criteria: when to wake whom
- Post-incident checklist: incident log + audit trail
All runbooks are linked from their corresponding alert in the MLOps readiness templates for the relevant Story.
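Because every runbook follows the same six-part structure, a draft can be checked mechanically before review. The sketch below is one way to do that, under two assumptions that the source does not confirm: each part appears as a Markdown heading, and the heading starts with the short title used in the list above. The `REQUIRED_SECTIONS` list and the `missing_sections` helper are hypothetical names.

```python
import re

# Hypothetical section titles, taken from the six-part structure listed above.
# Assumption: a real runbook starts each section heading with one of these titles.
REQUIRED_SECTIONS = [
    "Alert trigger",
    "Immediate human actions",
    "Diagnosis",
    "Remediation paths",
    "Escalation criteria",
    "Post-incident checklist",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required titles that no Markdown heading in the text starts with."""
    # Collect the text of every ATX heading (# through ######), lowercased.
    headings = [
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+)$", runbook_text, flags=re.MULTILINE)
    ]
    return [
        title
        for title in REQUIRED_SECTIONS
        if not any(heading.startswith(title.lower()) for heading in headings)
    ]
```

An empty return value means all six sections were found; anything else lists what the draft still lacks.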
Active runbooks
| Runbook | Severity | Triggering alert | Owner |
|---|---|---|---|
| runbook_focal_loss_negative_expectancy.md | P1 | expectancy_net_realized < 0 over 48h on >= 2 cryptos AND focal_loss_active=1 | dococeven |
| runbook_hpo_fork_deadlock.md | P1 | finetune_pod_idle_seconds > 300 AND finetune_rows_persisted_per_min == 0 (cleanlab variants) | dococeven |
| runbook_focal_training_crash.md | P2 | xgboost_training_failed{loss_function=focal} > 5/h | dococeven |
| runbook_otel_collector_down.md | P2 | OTel collector pods down → structured events stop landing in Loki | platform |
| break-glass-hyperparams.md | P1 (FTF blocked) / P2 (Console-only outage) | ftf_config.base_env corrupted OR Console UI down + urgent edit needed OR id=1 row missing. Invoke ONLY during the ADR-90 transition window (until Epic CCP wp#149 ships). 4 procedures (A/B/C/D) + 30d dry-run drill + post-mortem template. | dococeven |
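To make the compound trigger conditions in the table concrete, the sketch below restates two of them as plain Python predicates. It is illustrative only and is not the alerting rule itself. The function and class names are hypothetical, and it assumes that "over 48h" means the expectancy has already been aggregated per crypto over a 48-hour window before the comparison.

```python
from dataclasses import dataclass


@dataclass
class FinetuneSample:
    """One reading of the two metrics named in the hpo_fork_deadlock row."""
    pod_idle_seconds: float
    rows_persisted_per_min: float


def hpo_fork_deadlock_firing(sample: FinetuneSample, idle_threshold_s: float = 300.0) -> bool:
    # Both conditions must hold: the pod has been idle longer than the threshold
    # AND no rows are being persisted.
    return sample.pod_idle_seconds > idle_threshold_s and sample.rows_persisted_per_min == 0


def negative_expectancy_firing(
    expectancy_48h_by_crypto: dict[str, float],
    focal_loss_active: bool,
    min_cryptos: int = 2,
) -> bool:
    # Assumption: each value is expectancy_net_realized already aggregated over 48h.
    negative = [name for name, value in expectancy_48h_by_crypto.items() if value < 0]
    return bool(focal_loss_active and len(negative) >= min_cryptos)
```

Writing a trigger out this way makes the AND semantics explicit: a single crypto with negative expectancy, or negative expectancy while focal loss is inactive, does not fire under this reading.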
Where to add a new runbook
When opening a new Story that introduces production-monitored behavior, the MLOps readiness template §2 mandates a runbook for every alert. The runbook is:
- Authored alongside the Story PR (not as a Phase 5 deliverable; the committee enforces this, see the 2026-04-28-track6-focal-loss-pr-review.md v2 EXECUTION_RISK rejection for the precedent)
- Filed under documentation/runbooks/runbook_<slug>.md
- Referenced from the alert definition AND the MLOps readiness file of the Story
- Indexed here
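The filing and cross-referencing rules above lend themselves to a small pre-merge check. The following is a minimal sketch, not an existing tool: `check_new_runbook` is a hypothetical helper, the slug character set is an assumption, and the three text arguments stand in for the contents of this index page, the alert definition, and the Story's MLOps readiness file.

```python
import re
from pathlib import PurePosixPath

RUNBOOK_DIR = PurePosixPath("documentation/runbooks")
# Assumption: slugs consist of lowercase letters, digits, and underscores.
RUNBOOK_NAME = re.compile(r"^runbook_[a-z0-9_]+\.md$")


def check_new_runbook(path: str, index_text: str, alert_text: str, readiness_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    runbook = PurePosixPath(path)
    problems = []
    if runbook.parent != RUNBOOK_DIR:
        problems.append(f"{path}: not filed under {RUNBOOK_DIR}/")
    if not RUNBOOK_NAME.match(runbook.name):
        problems.append(f"{runbook.name}: does not match runbook_<slug>.md")
    # The runbook file name must be mentioned in all three places.
    for label, text in (
        ("runbook index", index_text),
        ("alert definition", alert_text),
        ("MLOps readiness file", readiness_text),
    ):
        if runbook.name not in text:
            problems.append(f"{runbook.name}: not referenced from the {label}")
    return problems
```

Note that the existing break-glass-hyperparams.md entry would fail the name check as written, so a real check would likely need an allow-list for such exceptions.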
Operational principle
Per ADR-26, Grafana is the single entry point for observability. Runbooks always reference the Grafana dashboard for the affected service, then drill down to logs (Loki) or traces (Tempo) as needed. Never ask the operator to open multiple tools as the first step — the runbook hands them the queries.