Runbook — Ensemble diversity (Track 11, P2)

Severity : P2 (production governance + drift on the ensemble surface ; alerts but no immediate trade halt, because the champion_xgb_only rollback path is symmetric and operator-controlled. P1 only on H2 violation OR feature_names SHA256 mismatch which trip ADR-25 fail-fast.)
Owner : @dococeven
Story : CVN-N001-EE-S06 (wp#45) · plan dossier 2026-05-01-track11-ensemble-diversity-plan.md
Linked code : src/commun/trading/ensemble_aggregator.py · src/commun/trading/ensemble_inference.py · src/commun/finetune/guardrails.py

This runbook covers five symptoms specific to the Track 11 ensemble path. The symmetric rollback for all of them is the registered champion_xgb_only fallback model — no runtime env-flag toggle (per ADR-23 + Track 1 lessons).


1. Symptom : event=h2_gate_violation (single base model > 80 % weight)

Detection : Grafana panel cvntrade-track11-ensemble shows the H2 alert firing OR the FTF results dossier reports max_normalised_weight > 0.80 for the locked variant. Loki query : {event="h2_gate_violation"} | model=....

Likely causes :

  1. One base model genuinely dominates — XGB / LGB / CB has a structural edge for this crypto + dataset, and the LogReg meta correctly weighs the others near zero.
  2. Base-model staleness — two of the three base models are trained on outdated data ; the third saw the freshest distribution and dominates by accident.
  3. Hyper-parameter mis-calibration — one base model overfits, producing extreme probabilities the meta model amplifies.

Action :

  1. Pull the locked aggregator's normalised weights from the MLflow registry (artefact aggregator.json).
  2. Identify which base model is dominating + whether the weight vector is consistent across the 5 cryptos OR varies (single-crypto bias suggests cause 1, all-cryptos same dominant model suggests cause 2 / 3).
  3. If genuine dominance (cause 1) → ABANDON the stack, lock lgb_only / cb_only / XGB-baseline (whichever dominated) via Console promotion. The stacking variant is removed from the FTF matrix for follow-up sweeps. Rollback latency < 5 min.
  4. If staleness (cause 2) → re-train the lagging base models on the same data snapshot as the dominant one (committee reco R8). Re-fit the aggregator. Re-evaluate H2 + 5 % floor before re-locking.
  5. If overfitting (cause 3) → bisect recent commits to the base-model adapter (xgboost / lightgbm / catboost) ; roll back to champion_xgb_only while investigating.

Quantitative threshold for abandoning Track 11 :

  • H2 violation on > 60 % of cryptos in the locked sweep → ABANDON Track 11, F1 plan §6 sequencing implications apply (defer Track 8 sequence model).
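
The H2 gate and the abandonment threshold above can be sketched as a small check. This is a minimal illustration, not the guardrails.py implementation : the function names, the dict layout of the per-crypto weights, and the normalisation by absolute value are all assumptions made for the example.

```python
from typing import Dict

H2_MAX_WEIGHT = 0.80      # single-base weight ceiling from the H2 gate
ABANDON_FRACTION = 0.60   # share of cryptos in violation that triggers ABANDON


def normalise(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale absolute weights so they sum to 1 (assumed normalisation scheme)."""
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {name: abs(w) / total for name, w in weights.items()}


def h2_report(per_crypto_weights: Dict[str, Dict[str, float]]) -> Dict[str, dict]:
    """Return the dominant base, its normalised weight and the H2 verdict per crypto."""
    report = {}
    for crypto, weights in per_crypto_weights.items():
        norm = normalise(weights)
        dominant = max(norm, key=norm.get)
        report[crypto] = {
            "dominant": dominant,
            "max_weight": norm[dominant],
            "violation": norm[dominant] > H2_MAX_WEIGHT,
        }
    return report


def should_abandon(report: Dict[str, dict]) -> bool:
    """True when H2 is violated on strictly more than 60 % of cryptos."""
    violations = sum(1 for entry in report.values() if entry["violation"])
    return violations / len(report) > ABANDON_FRACTION
```

Comparing the "dominant" field across cryptos gives the step 2 diagnosis : a single crypto with a dominant base points to cause 1, the same dominant base everywhere points to cause 2 / 3.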

2. Symptom : feature_names mismatch on base-model load (ADR-25 fail-fast)

Detection : RuntimeError raised by commun.trading.ensemble_inference.run_ensemble_inference at inference time. Loki query : {event="ensemble_inference"} |~ "feature_names drift". Grafana panel : cvntrade-track11-ensemble "Feature contract integrity".

Likely causes :

  1. One base model was retrained under a different enrichment config (e.g., btc_features toggled between training the 3 bases).
  2. MLflow registry tag drift — the feature_names SHA256 in the registry doesn't match the artefact's actual feature_names (committee reco R8).
  3. Manual artefact replacement bypassing the autotrainer.

Action :

  1. Identify which base model has the mismatched feature_names from the error message.
  2. Immediate : revert via Console promotion of champion_xgb_only (atomic per-crypto, ADR-15 + ADR-42). Rollback latency < 5 min.
  3. Investigation : inspect the offending base model's MLflow run and compare mlflow.get_run(run_id).data.tags["feature_names_sha256"] with the SHA256 of the artefact's actual feature_names.
  4. Re-train the offending base model under the unified enrichment config.
  5. Re-fit the aggregator on the consistent base set.
  6. Document the root cause in the OP Committee work_package created post-Track-11 (reference the linked Story).
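
The comparison in the investigation step can be sketched as below. The canonicalisation (ordered names joined by a newline, UTF-8 encoded) and both function names are assumptions for illustration ; the real hash recipe lives in the training code and must be matched exactly. The registry value would come from the feature_names_sha256 tag on the MLflow run.

```python
import hashlib
from typing import Sequence


def feature_names_sha256(feature_names: Sequence[str]) -> str:
    """Hash an ordered feature list.

    ASSUMPTION: names are joined with a newline and UTF-8 encoded. The
    project's actual canonicalisation may differ.
    """
    payload = "\n".join(feature_names).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_feature_contract(registry_tag: str, artefact_feature_names: Sequence[str]) -> None:
    """Fail fast when the registry tag and the artefact disagree."""
    actual = feature_names_sha256(artefact_feature_names)
    if actual != registry_tag:
        raise RuntimeError(
            f"feature_names drift: registry={registry_tag[:12]} artefact={actual[:12]}"
        )
```

Because the hash covers the ordered list, a base model trained with the same features in a different column order also fails the check, which is the intended behaviour for a positional feature contract.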

3. Symptom : LogReg coefficient drift > 30 % (only _logreg_shrink locked)

Detection : Prometheus alert fires when normalised weight of any base shifts > 30 % from training-time aggregator. Grafana panel cvntrade-track11-ensemble "LogReg meta drift". Committee reco R6 — automated, not manual review.

Likely causes :

  1. Base-model probability distributions shifted (regime change → one base now dominates the meta linear combo).
  2. Concept drift — the relationship between base-model outputs and ground truth changed.
  3. A base model was silently retrained without re-fitting the aggregator.

Action :

  1. Auto-fallback to _avg is the runbook's first line (committee reco R6) — the runtime check fires this automatically when drift exceeds 30 %. Verify in Loki : {event="ensemble_aggregator_fallback"} reason="logreg_drift".
  2. If _avg still produces acceptable f1_buy → keep the fallback ; schedule a quarterly re-fit of _logreg_shrink for the next FTF sweep.
  3. If _avg doesn't beat single-model performance either → revert to champion_xgb_only via Console.
  4. Always : weekly Grafana review of LogReg coefficients across the 5 cryptos ; if ≥ 2 cryptos drift > 30 % in the same week → Track 11 stability concern, file investigation Story.

Note : stack_3model_avg has no learned weights ; this drift mode is impossible for the canonical v2 variant. The auto-fallback path applies only when _logreg_shrink is the locked variant.
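
A minimal sketch of the drift check and the auto-fallback decision follows. It assumes the 30 % threshold is a relative shift of each normalised weight against its training-time value ; the runbook does not say whether the shift is relative or absolute, and the function names are hypothetical.

```python
from typing import Dict

DRIFT_THRESHOLD = 0.30  # 30 % shift, interpreted here as relative (assumption)


def max_relative_shift(train: Dict[str, float], live: Dict[str, float]) -> float:
    """Largest relative change of any base's normalised weight."""
    shifts = []
    for name, ref in train.items():
        delta = abs(live[name] - ref)
        if ref == 0.0:
            # A weight that was zero at training time has no relative scale.
            shifts.append(float("inf") if delta > 0 else 0.0)
        else:
            shifts.append(delta / ref)
    return max(shifts)


def choose_variant(train: Dict[str, float], live: Dict[str, float],
                   threshold: float = DRIFT_THRESHOLD) -> str:
    """Return '_avg' when drift exceeds the threshold, else keep '_logreg_shrink'."""
    if max_relative_shift(train, live) > threshold:
        return "_avg"
    return "_logreg_shrink"
```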

4. Symptom : Inference-latency p95 > 3.0× single-model SLO

Detection : Grafana panel cvntrade-track11-ensemble "Ensemble inference latency SLO" exceeds the threshold. Loki query : {event="ensemble_inference"} | total_latency_ms > 3 * single_model_baseline_ms.

Likely causes :

  1. 3-base sequential predict path (MVP implementation runs them sequentially) is the bottleneck.
  2. One base model has a model-size regression (committed without bench).
  3. Aggregator latency is non-trivial (LogReg with very large weight magnitudes can slow numpy multiply on cold caches — rare).

Action :

  1. Per committee reco R10 : the InferenceAPI extension SHOULD parallelise the 3 base predicts via concurrent.futures.ThreadPoolExecutor(max_workers=3). Verify wiring is enabled (env CVN_ENSEMBLE_PARALLEL_PREDICT=1).
  2. If parallelism doesn't recover budget → identify the slowest base via the per_base_latency_ms field in event=ensemble_inference ; roll back to lgb_only / cb_only (the simpler variant locked in the prior FTF sweep) via Console promotion.
  3. If all 3 base models slow simultaneously → infrastructure regression (Kubernetes pod sizing, OPA latency, MLflow load) ; escalate to @cvntrade-ops per §6 of the readiness doc.
  4. Long-term : if Track 11 LOCKs but latency is borderline → distillation back to a single model is out of scope (OOS) for v1 per plan §6, but becomes a follow-up Story under Track 14 if latency budgets tighten.
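
The parallel predict path from action 1 can be sketched with the standard library. This is an illustration under assumptions : the bases are modelled as plain callables, and the helper names are hypothetical rather than the InferenceAPI's. Threads only shorten wall time when each predict call releases the GIL, which should be verified for each base library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence, Tuple

Predict = Callable[[Sequence[float]], float]


def timed_predict(predict: Predict, row: Sequence[float]) -> Tuple[float, float]:
    """Run one base predict and return (probability, latency in ms)."""
    start = time.perf_counter()
    prob = predict(row)
    return prob, (time.perf_counter() - start) * 1000.0


def parallel_predict(bases: Dict[str, Predict],
                     row: Sequence[float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Run the base predicts concurrently and collect per-base latencies."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(timed_predict, fn, row) for name, fn in bases.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    probs = {name: res[0] for name, res in results.items()}
    per_base_latency_ms = {name: res[1] for name, res in results.items()}
    return probs, per_base_latency_ms
```

The slowest base for action 2 is then max(per_base_latency_ms, key=per_base_latency_ms.get).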

5. Kill-switch interaction (ADR-71)

Detection : event=ensemble_inference_aborted reason=kill_switch_engaged fires. This is not a Track-11-specific failure — it's the inherited ADR-71 halt path surfacing through the ensemble layer for traceability.

Action :

  1. Verify halt latency. Loki query : {event="ensemble_inference_aborted"} ; total_latency_ms should be < 100 ms at p95 (committee reco R11). The integration test test_kill_switch_short_circuits_ensemble enforces this at CI time.
  2. No Track-11-specific intervention required — the kill-switch state lives in PG per ADR-71 ; operator-only disengage per the ADR-71 runbook (Epic CVN-N001-EG implementation).
  3. If halt latency exceeds 100 ms → Track-11-specific bug (the kill-switch check should be the FIRST operation in run_ensemble_inference, before any base-model load OR predict). Bisect recent commits to commun.trading.ensemble_inference ; verify the integration test still passes.
  4. If the halt event repeats without operator-driven engage/disengage cycles → suspect a flapping PG kill-switch state ; escalate to @cvntrade-ops.
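
The ordering requirement in action 3 (kill-switch check first, before any load or predict) can be sketched as follows. Everything here is hypothetical : the function signature, the exception class, and the injected callables stand in for the real run_ensemble_inference and its PG-backed kill-switch lookup.

```python
from typing import Callable, List, Sequence

Predict = Callable[[Sequence[float]], float]


class KillSwitchEngaged(RuntimeError):
    """Raised when inference is aborted by the kill-switch."""


def run_ensemble_inference_sketch(
    row: Sequence[float],
    is_kill_switch_engaged: Callable[[], bool],
    load_bases: Callable[[], List[Predict]],
    aggregate: Callable[[List[float]], float],
) -> float:
    # The kill-switch check comes FIRST, so a halt never pays the cost
    # of loading or running the base models.
    if is_kill_switch_engaged():
        raise KillSwitchEngaged("kill_switch_engaged")
    bases = load_bases()
    probs = [predict(row) for predict in bases]
    return aggregate(probs)
```

A test in the spirit of test_kill_switch_short_circuits_ensemble would assert that load_bases is never called while the switch is engaged.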

Cross-reference : symmetric rollback path

Every Track-11 LOCK ships with a registered champion_xgb_only fallback model. Reversion = Console promotion of the fallback (atomic per-crypto per ADR-15 + ADR-42). The env variable CVN_ENSEMBLE_VARIANT is training-time only : flipping it at inference would dimension-mismatch the loaded bundle (caught by ADR-23 contract pinning).

Symptom | Rollback | Latency
--- | --- | ---
§1 H2 violation | Console promotion of champion_xgb_only | < 5 min
§2 feature_names mismatch | Console promotion of champion_xgb_only | < 5 min
§3 LogReg drift > 30 % | Auto-fallback to _avg (runtime check) ; OR Console promotion if _avg insufficient | automatic / < 5 min
§4 latency SLO breach | Threadpool parallelisation OR Console promotion of single-model variant | < 5 min
§5 kill-switch engaged | inherited halt ; no Track-11-specific action | < 1 s (ADR-71 SLO)

Cross-reference : known follow-up

The autotrainer + InferenceAPI auto-routing + MLflow registry pattern for the aggregator artefact land in the Track 11 follow-up PR per the readiness doc §7. Until that ships, models trained with CVN_ENSEMBLE_VARIANT=stack_* MUST NOT be deployed to inference. The follow-up PR closes this gap (mirrors Track 1's PR #792 → follow-up pattern).