Runbook — BTC cross-asset features drift (Track 1, P2)¶
Severity : P2 (production drift on input features ; alerts but no immediate trade halt — the champion_btc_blind rollback path is symmetric and operator-controlled)
Owner : @dococeven
Story : CVN-N001-EE-S04 (wp#43) · plan dossier 2026-04-30-track1-btc-features-plan.md
Linked code : src/commun/pipeline/btc_features.py · src/commun/finetune/guardrails.py
This runbook covers four drift / quality symptoms specific to the BTC cross-asset feature path (Track 1). The symmetric rollback for all of them is the registered champion_btc_blind fallback model — no runtime env-flag toggle (per ADR-23).
1. Symptom : KS test p < 0.01 on a BTC feature over 14 days¶
Detection : Grafana panel cvntrade-track1-btc-features shows distribution drift for any of the 6 BTC features against its training-time distribution. Loki query : {event="btc_features_drift_alert"} | feature=....
Likely causes :
- Concept drift — BTC's correlation with altcoins shifted (e.g., post-halving regime, ETF flows changing the autocorrelation structure). This is the expected drift mode, validates Track 1 robustness.
- BTC OHLCV feed change — Binance schema or aggregation window changed, altering the realised vol calc. Cross-check with §3 below.
- Bug in
compute_btc_features— recent change altered the window or shift logic.
Action :
- Pull last 14 days of
event=btc_features_appliedfrom Loki + the Grafana KS panel. - Identify which feature(s) drifted (often it's the lagged correlation
btc_correlation_15m_lag5first, since correlation regimes shift faster than absolute returns). - If concept drift only (signal is real, model can't track it) → schedule a re-fit (re-train + re-calibrate) for the affected crypto in the next sprint.
- If multiple features drift simultaneously AND the BTC ohlcv quality alerts are silent → suspect a code regression. Bisect recent commits to
commun/pipeline/btc_features.py. - If sustained > 30 days with no operator action → revert to
champion_btc_blindmodel via Console promotion (atomic per-crypto, ADR-15 + ADR-42). Rollback latency < 5 min.
Quantitative threshold for revisiting Track 1 :
- Per-feature drift > 3σ from training distribution AND sustained > 14 days → file a Story to revisit (re-train OR consider abandoning Track 1 in favour of a richer cross-asset feature set OR pivot to Track 12 fractional-diff features).
2. Symptom : BTC-altcoin correlation drift > 3σ¶
Detection : rolling 30d Pearson correlation of target altcoin's pct_change(1) vs BTC's pct_change(1).shift(5) (the lag the feature uses) drifts > 3σ from training-time correlation.
Action :
- This is the deeper drift signal — it says the underlying assumption of Track 1 (altcoins follow BTC's macro regime) is weakening.
- If a single altcoin drifts → revert that altcoin's deployment to
champion_btc_blind. The other 4 cryptos can stay on the BTC-enabled model. - If multiple altcoins drift simultaneously → quarterly re-fit triggered ; possibly Track 1 ABANDONED in the next quarter if drift sustained.
- Document the drift in the next quarterly model report. Operator decides re-fit vs revert vs full ABANDON.
3. Symptom : event=btc_ohlcv_quality_alert fires (committee reco #5)¶
Detection : Loki query {event="btc_ohlcv_quality_alert"} fires > 5 times in 1 hour.
Reasons :
outlier_returns: returns > 5σ vs 30d baseline (typically Binance flash crashes, halt+resume, exchange API hiccup)outlier_volume: volume drops > 80% vs 30d median (typically API throttling, hours-long Binance maintenance)wick_to_body: wick > 5× body on a single bar (typically wash trading / liquidation cascade)
Action :
- Verify Binance status at https://binance.com/en/support/announcement and the
binance-apiSlack channel. If it's an exchange-side issue, no action — alerts will subside when feed normalises. - If sustained > 1 hour AND data quality affecting our enrichment → temporarily revert affected cryptos to
champion_btc_blindvia Console promotion. Re-revert when feed is healthy again. - Hot-fix path : if our outlier detection thresholds are too tight (false positives), tune
commun.pipeline.btc_featuresconstants in a separate PR. Don't suppress alerts in the runbook itself — that masks real issues.
4. Symptom : event=enrichment_config_mismatch (P1, ADR-25 fail-fast)¶
Detection : RuntimeError from InferenceAPI on model load. Message includes the model run_id, the env value, and the artefact value.
Cause : the runtime env (CVN_BTC_FEATURES_*) doesn't match what's pinned in the model's enrichment_config.json artefact. Most common reasons :
- Operator deployed a BTC-enabled model under env that says
CVN_BTC_FEATURES_ENABLED=0(or vice versa) enrichment_config.jsonartefact was corrupted / partially uploaded → SHA256 mismatch (committee reco v2.1)- The env was deliberately changed mid-deploy without retraining
Action :
- Don't restart the pod loop — it'll just re-fail. Stop the deploy.
- Compare the env vars in
cvntrade-env-configConfigMap (per CLAUDE.md §13) vs the model'senrichment_config.json(mlflow artifacts download --run-id <run> --artifact-path enrichment_config.json). - If env wrong → rollback the env (revert the Helm change). Promotion of correct env → restart pods.
- If artefact corrupted →
champion_btc_blindpromotion as immediate revert ; investigate the partial upload (check themlflow_promotionworkflow logs). - Never silently override — the ADR-25 fail-fast is doing its job. If you hit this, the contract is broken somewhere.
5. Pre-LOCK dry run failure (committee reco v2.4)¶
Detection : at LOCK decision time, the operator runs the registered champion_btc_blind for 24h shadow on the same day's data ; one of these gates fails :
feature_namesschema mismatch betweenchampion_btc_blindand current inference enrichment- f1_buy on the shadow window < baseline - 0.01
Action :
- LOCK decision is BLOCKED. Don't promote the new BTC-enabled model.
- Diagnose root cause :
feature_namesmismatch →champion_btc_blindmay have been trained under an older feature contract. Re-train a baseline model on the latest contract OR keep the previous champion as fallback.- f1_buy degraded > 0.01 →
champion_btc_blinditself has degraded since registration (concept drift on baseline). Re-train baseline ; defer Track 1 LOCK until baseline is known-good. - Document the dry-run failure in the LOCK decision audit trail (this Story's
mlops_readiness.md§5).
Rollback decision tree¶
BTC features production behaviour wrong ?
│
├── Was it a code change (window misalignment, off-by-one) ? → Hot-fix PR ; revert via Console promotion of champion_btc_blind in the meantime.
│
├── Is `event=enrichment_config_mismatch` (P1) ? → Stop deploy. Diagnose env vs artefact. Revert env OR promote champion_btc_blind. ADR-25 fail-fast working as intended.
│
├── Is it BTC OHLCV quality (Binance feed) ? → Verify exchange status. If sustained, revert affected cryptos via Console promotion.
│
├── Is it concept drift (KS p < 0.01 over 14d, > 3σ) ? → Revert affected cryptos. Schedule re-fit. If sustained > 30d, file Story to abandon Track 1.
│
└── Is it Pre-LOCK dry-run failure ? → BLOCK the LOCK. Do not promote. Investigate champion_btc_blind health.
The rollback for every symptom is a model swap via Console promotion (atomic per-crypto per ADR-15 + ADR-42), NOT a runtime env-flag toggle. Per ADR-23, the feature contract travels with the model artefact — flipping the env on a deployed model would dimension-mismatch (caught) or silently impute zero (ruled out by §4.1bis pinning, would trigger ADR-25 fail-fast at load).
Linked context¶
- Plan dossier —
2026-04-30-track1-btc-features-plan.md - MLOps readiness —
stories/CVN-N001-EE-S04/mlops_readiness.md - Source code —
src/commun/pipeline/btc_features.py - ADR-14 (purging+embargo) —
BTC at t uses only data ≤ t - purge_barsinvariant - ADR-15 (atomic promotion) — model-artefact rollback path
- ADR-23 (features version-pinned) — feature contract pinned in
enrichment_config.json - ADR-25 (no silent fallback) — fail-fast on env↔artefact mismatch
- ADR-32 (event=key=value structured logs) — every BTC feature decision emits an event
- ADR-42 (atomic per-crypto promotion) — rollback workflow
- ADR-59 (Console-only param flips) — operator UX