Runbook: Ensemble diversity (Track 11, P2)
Severity : P2 (production governance + drift on the ensemble surface ; alerts but no immediate trade halt — the champion_xgb_only rollback path is symmetric and operator-controlled. P1 only on H2 violation OR feature_names SHA256 mismatch which trip ADR-25 fail-fast.)
Owner : @dococeven
Story : CVN-N001-EE-S06 (wp#45) · plan dossier 2026-05-01-track11-ensemble-diversity-plan.md
Linked code : src/commun/trading/ensemble_aggregator.py · src/commun/trading/ensemble_inference.py · src/commun/finetune/guardrails.py
This runbook covers five symptoms specific to the Track 11 ensemble path. The symmetric rollback for all of them is the registered champion_xgb_only fallback model — no runtime env-flag toggle (per ADR-23 + Track 1 lessons).
1. Symptom : event=h2_gate_violation (single base model > 80 % weight)
Detection : Grafana panel cvntrade-track11-ensemble shows the H2 alert firing OR the FTF results dossier reports max_normalised_weight > 0.80 for the locked variant. Loki query : {event="h2_gate_violation"} | model=....
Likely causes :
- One base model genuinely dominates — XGB / LGB / CB has a structural edge for this crypto + dataset, and the LogReg meta correctly weighs the others near zero.
- Base-model staleness — two of the three base models are trained on outdated data ; the third saw the freshest distribution and dominates by accident.
- Hyper-parameter mis-calibration — one base model overfits, producing extreme probabilities the meta model amplifies.
Action :
- Pull the locked aggregator's normalised weights from the MLflow registry (artefact aggregator.json).
- Identify which base model is dominating + whether the weight vector is consistent across the 5 cryptos OR varies (single-crypto bias suggests cause 1, all-cryptos same dominant model suggests cause 2 / 3).
- If genuine dominance (cause 1) → ABANDON the stack, lock lgb_only / cb_only / XGB-baseline (whichever dominated) via Console promotion. The stacking variant is removed from the FTF matrix for follow-up sweeps. Rollback latency < 5 min.
- If staleness (cause 2) → re-train the lagging base models on the same data snapshot as the dominant one (committee reco R8). Re-fit the aggregator. Re-evaluate H2 + 5 % floor before re-locking.
- If overfitting (cause 3) → bisect recent commits to the base-model adapter (xgboost / lightgbm / catboost) ; rollback to champion_xgb_only while investigating.
Quantitative threshold for abandoning Track 11 :
- H2 violation on > 60 % of cryptos in the locked sweep → ABANDON Track 11, F1 plan §6 sequencing implications apply (defer Track 8 sequence model).
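The H2 gate and the abandon threshold above can be sketched as a small offline check. This is a minimal illustration, not the project's guardrail code: the function names, the input shape (one weight dict per crypto) and the normalisation rule (absolute weights scaled to sum to 1) are assumptions.

```python
from typing import Dict

H2_MAX_WEIGHT = 0.80     # single-base weight ceiling from the H2 gate
ABANDON_FRACTION = 0.60  # share of cryptos in violation that triggers ABANDON


def normalise(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale absolute weights so they sum to 1 (assumed normalisation rule)."""
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        raise ValueError("all aggregator weights are zero")
    return {name: abs(w) / total for name, w in weights.items()}


def h2_violations(per_crypto: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return {crypto: dominant_base} for every crypto breaching the H2 gate."""
    violations = {}
    for crypto, weights in per_crypto.items():
        norm = normalise(weights)
        dominant = max(norm, key=norm.get)
        if norm[dominant] > H2_MAX_WEIGHT:
            violations[crypto] = dominant
    return violations


def should_abandon(per_crypto: Dict[str, Dict[str, float]]) -> bool:
    """True when strictly more than 60 % of cryptos violate H2."""
    return len(h2_violations(per_crypto)) / len(per_crypto) > ABANDON_FRACTION
```

The dominant base per crypto also supports the triage step: the same dominant model on every crypto points towards cause 2 / 3, while a single-crypto violation points towards cause 1.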
2. Symptom : feature_names mismatch on base-model load (ADR-25 fail-fast)
Detection : RuntimeError raised by commun.trading.ensemble_inference.run_ensemble_inference at inference time. Loki query : {event="ensemble_inference"} |~ "feature_names drift". Grafana panel : cvntrade-track11-ensemble "Feature contract integrity".
Likely causes :
- One base model was retrained under a different enrichment config (e.g., btc_features toggled between training the 3 bases).
- MLflow registry tag drift : the feature_names SHA256 in the registry doesn't match the artefact's actual feature_names (committee reco R8).
- Manual artefact replacement bypassing the autotrainer.
Action :
- Identify which base model has the mismatched feature_names from the error message.
- Immediate : revert via Console promotion of champion_xgb_only (atomic per-crypto, ADR-15 + ADR-42). Rollback latency < 5 min.
- Investigation : inspect the offending base model's MLflow run : mlflow.get_run(run_id).data.tags["feature_names_sha256"] vs the artefact's actual feature_names.
- Re-train the offending base model under the unified enrichment config.
- Re-fit the aggregator on the consistent base set.
- Document the root cause in the OP Committee work_package created post-Track-11 (reference the linked Story).
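The investigation step compares a registry tag against the artefact's real feature list. A minimal sketch of that comparison follows. The hashing scheme (SHA256 over a compact JSON encoding of the ordered list) and both function names are assumptions; the actual scheme used by the autotrainer must be checked before relying on this.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def feature_names_sha256(feature_names: Sequence[str]) -> str:
    """Hash an ordered feature list (assumed scheme: SHA256 of compact JSON).

    Order is part of the contract, so reordering features changes the hash.
    """
    payload = json.dumps(list(feature_names), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_mismatched_bases(
    registry_tags: Dict[str, str],
    artefact_features: Dict[str, Sequence[str]],
) -> List[str]:
    """Return base-model names whose registry tag disagrees with the artefact.

    registry_tags maps base name -> the feature_names_sha256 tag, e.g. read
    from mlflow.get_run(run_id).data.tags["feature_names_sha256"].
    """
    return sorted(
        name
        for name, features in artefact_features.items()
        if registry_tags.get(name) != feature_names_sha256(features)
    )
```

A base that is missing its tag entirely is reported as mismatched, which fits the fail-fast intent of ADR-25.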
3. Symptom : LogReg coefficient drift > 30 % (only _logreg_shrink locked)
Detection : Prometheus alert fires when the normalised weight of any base model shifts > 30 % from its value in the training-time aggregator. Grafana panel cvntrade-track11-ensemble "LogReg meta drift". Per committee reco R6, this detection is automated, not a manual review.
Likely causes :
- Base-model probability distributions shifted (regime change → one base now dominates the meta linear combo).
- Concept drift — the relationship between base-model outputs and ground truth changed.
- A base model was silently retrained without re-fitting the aggregator.
Action :
- Auto-fallback to _avg is the runbook's first line (committee reco R6) : the runtime check fires this automatically when drift exceeds 30 %. Verify in Loki : {event="ensemble_aggregator_fallback"} reason="logreg_drift".
- If _avg still produces acceptable f1_buy → keep the fallback ; schedule a quarterly re-fit of _logreg_shrink for the next FTF sweep.
- If _avg doesn't beat single-model performance either → revert to champion_xgb_only via Console.
- Always : weekly Grafana review of LogReg coefficients across the 5 cryptos ; if ≥ 2 cryptos drift > 30 % in the same week → Track 11 stability concern, file investigation Story.
Note : stack_3model_avg has no learned weights ; this drift mode is impossible for the canonical v2 variant. The auto-fallback path applies only when _logreg_shrink is the locked variant.
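The drift rule can be illustrated with a short check. This sketch assumes the 30 % threshold is a relative shift against the training-time normalised weight (the runbook does not say whether the shift is relative or absolute), and the function names are hypothetical.

```python
from typing import Dict, List

DRIFT_THRESHOLD = 0.30  # 30 % shift, interpreted here as relative (assumption)


def drifted_bases(
    train_weights: Dict[str, float],
    live_weights: Dict[str, float],
    threshold: float = DRIFT_THRESHOLD,
    eps: float = 1e-9,
) -> List[str]:
    """Return base models whose normalised weight moved more than `threshold`
    relative to the training-time aggregator."""
    drifted = []
    for name, ref in train_weights.items():
        # eps guards against a zero training-time weight
        shift = abs(live_weights[name] - ref) / max(abs(ref), eps)
        if shift > threshold:
            drifted.append(name)
    return sorted(drifted)


def needs_avg_fallback(
    locked_variant: str,
    train_weights: Dict[str, float],
    live_weights: Dict[str, float],
) -> bool:
    """Only the _logreg_shrink variant has learned weights that can drift."""
    if not locked_variant.endswith("_logreg_shrink"):
        return False
    return bool(drifted_bases(train_weights, live_weights))
```

Variants without learned weights short-circuit to False, matching the note that this drift mode cannot occur for the averaging variant.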
4. Symptom : Inference-latency p95 > 3.0× single-model SLO
Detection : Grafana panel cvntrade-track11-ensemble "Ensemble inference latency SLO" exceeds the threshold. Loki query : {event="ensemble_inference"} | total_latency_ms > 3 * single_model_baseline_ms.
Likely causes :
- 3-base sequential predict path (MVP implementation runs them sequentially) is the bottleneck.
- One base model has a model-size regression (committed without bench).
- Aggregator latency is non-trivial (LogReg with very large weight magnitudes can slow numpy multiply on cold caches — rare).
Action :
- Per committee reco R10 : the InferenceAPI extension SHOULD parallelise the 3 base predicts via concurrent.futures.ThreadPoolExecutor(max_workers=3). Verify wiring is enabled (env CVN_ENSEMBLE_PARALLEL_PREDICT=1).
- If parallelism doesn't recover budget → identify the slowest base via the per_base_latency_ms field in event=ensemble_inference ; rollback to lgb_only / cb_only (the simpler variant locked in the prior FTF sweep) via Console promotion.
- If all 3 base models slow simultaneously → infrastructure regression (Kubernetes pod sizing, OPA latency, MLflow load) ; escalate to @cvntrade-ops per §6 of the readiness doc.
- Long-term : if Track 11 LOCKs but latency is borderline → distillation back to a single model is plan §6 OOS for v1, but can become a follow-up Story under Track 14 if latency budgets tighten.
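The parallel predict path from reco R10 could look roughly like the sketch below. It is not the InferenceAPI code: the helper names and the callable-based model interface are assumptions, and only ThreadPoolExecutor(max_workers=3), the CVN_ENSEMBLE_PARALLEL_PREDICT flag and the per_base_latency_ms field come from this runbook.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Sequence, Tuple

Predict = Callable[[Sequence[float]], float]


def _timed(predict: Predict, row: Sequence[float]) -> Tuple[float, float]:
    """Run one base predict and return (probability, latency in ms)."""
    start = time.perf_counter()
    proba = predict(row)
    return proba, (time.perf_counter() - start) * 1000.0


def predict_bases(
    bases: Dict[str, Predict],
    row: Sequence[float],
    parallel: Optional[bool] = None,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Predict with every base model, sequentially or on a thread pool."""
    if parallel is None:
        parallel = os.environ.get("CVN_ENSEMBLE_PARALLEL_PREDICT") == "1"
    if parallel:
        with ThreadPoolExecutor(max_workers=3) as pool:
            futures = {name: pool.submit(_timed, fn, row) for name, fn in bases.items()}
            results = {name: fut.result() for name, fut in futures.items()}
    else:
        results = {name: _timed(fn, row) for name, fn in bases.items()}
    probas = {name: res[0] for name, res in results.items()}
    per_base_latency_ms = {name: res[1] for name, res in results.items()}
    return probas, per_base_latency_ms
```

Threads only shorten wall-clock time if each library's predict call releases the GIL, which should be verified per base model. The slowest base is then `max(per_base_latency_ms, key=per_base_latency_ms.get)`.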
5. Kill-switch interaction (ADR-71)
Detection : event=ensemble_inference_aborted reason=kill_switch_engaged fires. This is not a Track-11-specific failure — it's the inherited ADR-71 halt path surfacing through the ensemble layer for traceability.
Action :
- Verify halt latency : Loki query : {event="ensemble_inference_aborted"} | total_latency_ms should be < 100 ms p95 (committee reco R11). The integration test test_kill_switch_short_circuits_ensemble enforces this at CI time.
- No Track-11-specific intervention required : the kill-switch state lives in PG per ADR-71 ; operator-only disengage per the ADR-71 runbook (Epic CVN-N001-EG implementation).
- If halt latency exceeds 100 ms → Track-11-specific bug (the kill-switch check should be the FIRST operation in run_ensemble_inference, before any base-model load OR predict). Bisect recent commits to commun.trading.ensemble_inference ; verify the integration test still passes.
- If the halt event repeats without operator-driven engage/disengage cycles → suspect a flapping PG kill-switch state ; escalate to @cvntrade-ops.
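The ordering requirement (kill-switch check before any base-model load or predict) can be shown with a toy version of the entry point. This is a sketch under assumptions: the function, exception and parameter names are hypothetical, and the real run_ensemble_inference reads the kill-switch state from PG rather than from an injected callable.

```python
from typing import Callable, Dict, Sequence

Predict = Callable[[Sequence[float]], float]


class EnsembleInferenceAborted(RuntimeError):
    """Raised when the kill-switch is engaged (hypothetical exception name)."""


def run_ensemble_inference_sketch(
    kill_switch_engaged: Callable[[], bool],
    load_bases: Callable[[], Dict[str, Predict]],
    row: Sequence[float],
) -> Dict[str, float]:
    # The kill-switch check is the FIRST operation: nothing is loaded or
    # predicted when the switch is engaged, which keeps halt latency low.
    if kill_switch_engaged():
        raise EnsembleInferenceAborted("kill_switch_engaged")
    bases = load_bases()
    return {name: predict(row) for name, predict in bases.items()}
```

A test in the spirit of test_kill_switch_short_circuits_ensemble would assert that load_bases is never called while the switch is engaged.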
Cross-reference : symmetric rollback path
Every Track-11 LOCK ships with a registered champion_xgb_only fallback model. Reversion = Console promotion of the fallback (atomic per-crypto per ADR-15 + ADR-42). The env variable CVN_ENSEMBLE_VARIANT is training-time only : flipping it at inference would dimension-mismatch the loaded bundle (caught by ADR-23 contract pinning).
| Symptom | Rollback | Latency |
|---|---|---|
| §1 H2 violation | Console promotion of champion_xgb_only | < 5 min |
| §2 feature_names mismatch | Console promotion of champion_xgb_only | < 5 min |
| §3 LogReg drift > 30 % | Auto-fallback to _avg (runtime check) ; OR Console promotion if _avg insufficient | automatic / < 5 min |
| §4 latency SLO breach | Threadpool parallelisation OR Console promotion of single-model variant | < 5 min |
| §5 kill-switch engaged | inherited halt ; no Track-11-specific action | < 1 s (ADR-71 SLO) |
Cross-reference : known follow-up
The autotrainer + InferenceAPI auto-routing + MLflow registry pattern for the aggregator artefact land in the Track 11 follow-up PR per the readiness doc §7. Until that ships, models trained with CVN_ENSEMBLE_VARIANT=stack_* MUST NOT be deployed to inference. The follow-up PR closes this gap (mirrors Track 1's PR #792 → follow-up pattern).