Runbook : Track 12 frac-diff + domain interactions (P2)
Severity : P2 (production drift / staged-launch oversight on the new feature ; alerts + operator action, no automatic trade halt — the champion_pre_track12 rollback path is symmetric and Console-driven)
Owner : @dococeven
Story : CVN-N001-EE-S05 (wp#44) · plan dossier 2026-05-03-cvn-n001-ee-s05-track12-frac-diff-plan.md · amendment 2026-05-08-cvn-n001-ee-s05-track12-warmup-gap-amendment-plan.md
Committee : plan_review session 2681aa97 (OP Meeting #120) ; pr_review session df09258b (#121) ; pr_review round 2 98b16083 (#122)
Linked code : src/commun/pipeline/frac_diff.py · src/commun/pipeline/enrichment_api.py (step 3c) · src/ETL/post_enrichment/cvntrade_xgboost_feature_generator.py::_add_domain_interaction_features · src/commun/finetune/guardrails.py::_validate_frac_diff · src/commun/finetune/ablation_matrix.py (factor frac_diff, 8 variants)
This runbook covers the operator workflow for the 2-stage launch protocol (round 2 reco #3) AND four drift / quality symptoms specific to the Track 12 path. The symmetric rollback for all of them is the registered champion_pre_track12 fallback model — no runtime env-flag toggle (per ADR-23 + ADR-15 + ADR-42).
1. 2-stage launch protocol (round 2 reco #3, operator workflow)
Why a 2-stage protocol : the FTF factor frac_diff registers 8 variants, but 2 of them (frac_diff_d04_w1e3, frac_diff_d04_w1e5) are sensitivity rows that only make sense if the production default frac_diff_d04 is itself productive. Always running them inflates the BH-correction (8 vs 6 simultaneous tests) AND wastes ~25 % compute when Stage 1 fails. The committee mandated conditional gating ; for now this is operator-driven via the FTF DAG --variants filter, with automation tracked as a follow-up Story.
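To make the BH-correction argument concrete, here is a minimal sketch of the Benjamini-Hochberg step-up rule applied to 6 and then 8 simultaneous tests. The p-values are purely illustrative (they are not Track 12 results) and the helper name bh_rejections is hypothetical; the FTF pipeline's own correction code may differ.

```python
def bh_rejections(p_values, alpha=0.05):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    k_max = 0
    for k, p in enumerate(sorted(p_values), start=1):
        # The k-th smallest p-value is compared against (k / m) * alpha.
        if p <= k / m * alpha:
            k_max = k
    return k_max


# Illustrative p-values only.
stage1_only = [0.008, 0.020, 0.030, 0.200, 0.500, 0.900]  # 6 tests
with_sensitivity = stage1_only + [0.600, 0.700]           # 8 tests

print(bh_rejections(stage1_only))       # 1
print(bh_rejections(with_sensitivity))  # 0
```

With 6 tests the smallest p-value clears its threshold (0.05 / 6); adding two uninformative sensitivity rows tightens that threshold to 0.05 / 8 and the same result no longer survives.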
Stage 1 : always run (6 variants)
Operator triggers dag_finetune__pte with :
factor = frac_diff
crypto_group = defi_top5
phase = manual
power_mode = standard
confirm_long_run = true
variants = none,frac_diff_d04,frac_diff_d05,interactions_only,combined_d04,combined_d04_purge0
Wall-clock estimate : ~3 hours (90 fits at ~2 min/fit on the FTF runner).
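The trigger parameters for both stages can be kept in one place so the variant lists are never retyped by hand. The sketch below only builds the conf payload and a fit count; the helper names are hypothetical, the way the payload is submitted to dag_finetune__pte is not covered, and the figure of 15 fits per variant is inferred from the estimates in this runbook (90 fits for 6 variants, 30 fits for 2).

```python
import json

STAGE1_VARIANTS = [
    "none", "frac_diff_d04", "frac_diff_d05",
    "interactions_only", "combined_d04", "combined_d04_purge0",
]
STAGE2_VARIANTS = ["frac_diff_d04_w1e3", "frac_diff_d04_w1e5"]

# Assumption: 90 fits / 6 variants = 30 fits / 2 variants = 15 fits per variant.
FITS_PER_VARIANT = 15


def build_conf(variants):
    """Build the DAG conf payload for one stage (hypothetical helper)."""
    return {
        "factor": "frac_diff",
        "crypto_group": "defi_top5",
        "phase": "manual",
        "power_mode": "standard",
        "confirm_long_run": True,
        "variants": ",".join(variants),
    }


def estimated_fits(variants):
    """Number of model fits implied by a variant list."""
    return len(variants) * FITS_PER_VARIANT


print(json.dumps(build_conf(STAGE1_VARIANTS), indent=2))
print(estimated_fits(STAGE1_VARIANTS), estimated_fits(STAGE2_VARIANTS))
```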
Stage 1 PASS criteria (all must hold) :
- frac_diff_d04 clears the F1 plan §6 standard gates : Δ f1_buy ≥ +0.020 with 95 % bootstrap CI excluding 0, Cohen's d ≥ 0.3, BH-corrected p ≤ 0.05 ;
- ≥ 4/5 cryptos individually improve ;
- ≥ 50 BUY trades per fold ;
- expectancy_net ≥ baseline, sortino ≥ baseline, max_drawdown ≤ baseline + 1 % ;
- mandatory leakage check : combined_d04_purge0 does NOT outperform combined_d04 on the f1_buy paired t-test (BH-corrected p < 0.05). If it does → mandatory ABANDON Track 12 per plan §4.4 (do NOT proceed to Stage 2).
Stage 1 FAIL : ABANDON Track 12 ; document the verdict in documentation/missions/F1_buy_boost/reports/2026-XX-XX-track12-frac-diff-results.md. Do NOT run Stage 2.
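The Stage 1 gates can be evaluated mechanically once the sweep report is in hand. The following is a sketch under stated assumptions, not the FTF verdict code: the Stage1Metrics field names are hypothetical, drawdown is assumed to be reported as a positive magnitude in percentage points, and the CI gate is read as "lower bound above 0".

```python
from dataclasses import dataclass, replace


@dataclass
class Stage1Metrics:
    delta_f1_buy: float             # frac_diff_d04 minus baseline
    ci_low: float                   # lower bound of the 95 % bootstrap CI
    cohens_d: float
    bh_p: float                     # BH-corrected p-value
    cryptos_improved: int           # out of 5
    min_buy_trades_per_fold: int
    expectancy_net_delta: float     # variant minus baseline
    sortino_delta: float            # variant minus baseline
    max_drawdown_delta_pct: float   # variant minus baseline, percentage points
    purge0_minus_combined_f1: float # combined_d04_purge0 minus combined_d04
    purge0_bh_p: float              # BH-corrected paired t-test p-value


def stage1_verdict(m):
    """Return ("PASS" | "ABANDON", list of failed gates)."""
    failures = []
    if m.delta_f1_buy < 0.020:
        failures.append("delta_f1_buy")
    if m.ci_low <= 0.0:
        failures.append("ci_includes_zero")
    if m.cohens_d < 0.3:
        failures.append("cohens_d")
    if m.bh_p > 0.05:
        failures.append("bh_p")
    if m.cryptos_improved < 4:
        failures.append("cryptos_improved")
    if m.min_buy_trades_per_fold < 50:
        failures.append("buy_trades")
    if m.expectancy_net_delta < 0 or m.sortino_delta < 0:
        failures.append("risk_adjusted_return")
    if m.max_drawdown_delta_pct > 1.0:
        failures.append("max_drawdown")
    # Leakage: the purge=0 variant significantly beats the purged one.
    if m.purge0_minus_combined_f1 > 0 and m.purge0_bh_p < 0.05:
        failures.append("leakage")
    return ("PASS" if not failures else "ABANDON", failures)


# Illustrative numbers only.
ok = Stage1Metrics(0.031, 0.006, 0.42, 0.01, 4, 63, 0.002, 0.05, 0.4, -0.004, 0.60)
leaky = replace(ok, purge0_minus_combined_f1=0.012, purge0_bh_p=0.01)
print(stage1_verdict(ok))     # ('PASS', [])
print(stage1_verdict(leaky))  # ('ABANDON', ['leakage'])
```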
Stage 2 : gated (2 sensitivity variants)
Triggered only on Stage 1 PASS :
factor = frac_diff
crypto_group = defi_top5
phase = manual
power_mode = standard
confirm_long_run = true
variants = frac_diff_d04_w1e3,frac_diff_d04_w1e5
Wall-clock estimate : ~1 hour (30 fits).
Stage 2 outcomes :
- All three (_w1e3, _d04 (5e-4), _w1e5) within ±0.005 of each other → keep production default at 5e-4 (Option 2 per committee 2681aa97). LOCK frac_diff_d04.
- _w1e3 outperforms _d04 by > 0.005 on f1_buy → flip production default to 1e-3, LOCK frac_diff_d04_w1e3. Re-evaluate the warmup-gap amendment dossier §11 follow-up #1 conclusion.
- _w1e5 outperforms by > 0.005 → revert to AFML default 1e-5, LOCK frac_diff_d04_w1e5 ; the loss-budget hit (16 % training rows) becomes acceptable in light of the empirical lift.
Whichever variant LOCKs, the operator updates ftf_config.base_env via Console (ADR-59) so the production EnrichmentConfig honours the LOCK'd min_w_threshold. The MLflow enrichment_config.json artefact pin happens automatically once the Track-1-follow-up loader PR is merged.
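The three Stage 2 outcomes reduce to a small decision rule on the f1_buy scores. The sketch below is one reading of that rule, with a hypothetical helper name: "within ±0.005 of each other" is interpreted as a max-min spread of at most 0.005, and any case the outcomes do not cover (for example both sensitivity variants beating _d04) returns None so the operator escalates instead of guessing.

```python
TOL = 0.005


def stage2_lock(f1_w1e3, f1_d04, f1_w1e5, tol=TOL):
    """Return (variant_to_lock, min_w_threshold), or (None, None) if unspecified."""
    scores = (f1_w1e3, f1_d04, f1_w1e5)
    if max(scores) - min(scores) <= tol:
        return ("frac_diff_d04", 5e-4)
    w1e3_wins = f1_w1e3 - f1_d04 > tol
    w1e5_wins = f1_w1e5 - f1_d04 > tol
    if w1e3_wins and not w1e5_wins:
        return ("frac_diff_d04_w1e3", 1e-3)
    if w1e5_wins and not w1e3_wins:
        return ("frac_diff_d04_w1e5", 1e-5)
    # Not covered by the listed outcomes: escalate.
    return (None, None)


# Illustrative scores only.
print(stage2_lock(0.412, 0.410, 0.409))  # ('frac_diff_d04', 0.0005)
print(stage2_lock(0.420, 0.410, 0.409))  # ('frac_diff_d04_w1e3', 0.001)
print(stage2_lock(0.420, 0.410, 0.421))  # (None, None)
```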
2. Symptom : KS test p < 0.01 on frac_diff_close_d{NN} over 14 days
Detection : Grafana panel "Frac-diff drift" shows distribution drift for the production frac_diff_close_d04 (or _d05) against its training-time distribution. Loki query : {event="frac_diff_drift_alert"} | feature=....
Likely causes :
- Volatility-regime shift : frac-diff captures long-memory ; a sustained vol regime change (e.g., crypto-wide capitulation or a structural break post-halving) pushes the feature distribution.
- Close-price feed quality : a Binance close-aggregation change, or an exchange outage leaving NaN gaps that the frac-diff convolution interprets as zeros.
- Bug in frac_diff.compute : a recent change altered the weight recurrence or the warmup gap.
Action :
- Pull last 14 days of event=frac_diff_applied d=... min_w=... from Loki + the Grafana KS panel.
- If single feature drifts (just _d04) AND vol regime is normal → check feed quality first (§4 below).
- If both _d04 and _d05 drift in the same direction → likely volatility-regime shift, real signal change. Schedule re-fit + re-calibrate in the next sprint.
- If sustained > 30 days with no operator action → revert to champion_pre_track12 via Console (atomic per-crypto, ADR-15 + ADR-42). RTO < 5 min.
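When reproducing the alert offline, a two-sample KS test between the training-time feature values and the last 14 days is enough to confirm or dismiss the panel. The sketch below is a dependency-free approximation (asymptotic Kolmogorov series with a common small-sample correction); the function name is hypothetical and the production alert may well use a library implementation with slightly different p-values.

```python
import bisect
import math


def ks_two_sample(reference, recent):
    """Two-sample KS statistic D and an asymptotic p-value."""
    a, b = sorted(reference), sorted(recent)
    n, m = len(a), len(b)
    d = 0.0
    for x in a + b:
        # Empirical CDFs of both samples evaluated at x.
        cdf_a = bisect.bisect_right(a, x) / n
        cdf_b = bisect.bisect_right(b, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.1:
        # The series below does not converge near 0; the p-value is ~1 there.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


# Synthetic data: the recent window is shifted by half the reference range.
reference = list(range(100))
recent = [x + 50 for x in reference]
d_stat, p_value = ks_two_sample(reference, recent)
print(d_stat, p_value < 0.01)  # 0.5 True
```

Running the same function on _d04 and _d05 separately gives the "single feature vs both features" split that the action list relies on.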
3. Symptom : warmup-row loss > 20 % per-crypto
Detection : Loki event=frac_diff_warmup_dropped_rows pct_dropped=... exceeds 20 % for any crypto in the FTF run logs. Grafana panel "Warmup-row loss %".
Likely causes :
- min_w_threshold typo : operator set CVN_FRAC_DIFF_MIN_W_THRESHOLD=1e-5 (AFML canonical) without realising the warmup cost. This should be caught by the FTF guardrail _validate_frac_diff ; if the alert fires, the guardrail was bypassed.
- Short training window : operator reduced CVN_TRAIN_WINDOW_MONTHS ; a 3-month window at 5m can be > 90 % wiped at min_w=1e-5, d=0.4.
- Bug in frac_diff.compute : weight recurrence regression.
Action :
- Cross-check the env vars in the FTF run conf vs the loss-budget table in the amendment dossier §2.
- If min_w_threshold=1e-5 was intentional, accept the loss but document it as a sweep choice.
- If unintentional, re-launch the sweep with the production default 5e-4.
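Two quick checks help separate a configuration problem from a weight-recurrence regression. The sketch below computes the first fractional-differencing weights from the standard binomial recurrence for (1 - B)^d, which can be compared against what frac_diff.compute produces, and recomputes the dropped-row percentage against the 20 % alert level. Both helper names are hypothetical, the row counts are illustrative, and how min_w_threshold maps onto dropped rows is defined by the project code and the amendment dossier §2, not by this sketch.

```python
ALERT_PCT = 20.0  # alert level from this runbook


def frac_diff_weights(d, n):
    """First n weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def warmup_loss_pct(rows_in, rows_out):
    """Percentage of rows dropped by the warmup gap."""
    return 100.0 * (rows_in - rows_out) / rows_in


# Reference values for d = 0.4: 1, -0.4, -0.12, -0.064
print(frac_diff_weights(0.4, 4))

# Illustrative row counts only.
pct = warmup_loss_pct(rows_in=100_000, rows_out=84_000)
print(pct, pct > ALERT_PCT)  # 16.0 False
```

If the first few weights from the production code diverge from these reference values, the regression hypothesis takes priority over the env-var cross-check.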
4. Symptom : event=enrichment_config_mismatch field=frac_diff_d|frac_diff_min_w_threshold
Detection : Loki event=enrichment_config_mismatch fires AT inference time — the runtime EnrichmentConfig disagrees with the model's pinned enrichment_config.json artefact. P1 severity — model is deployed under wrong env config (ADR-23 violation).
Action :
- Halt new inferences immediately : Console-side flip the crypto's status to "paused" on the trading dashboard.
- Pull the model run's enrichment_config.json from MLflow vs the runtime env vars.
- Identify which field mismatches (frac_diff_d or frac_diff_min_w_threshold).
- Fix the runtime to match the artefact, OR rollback to champion_pre_track12 if the runtime is the intended state.
- Post-mortem : how did the runtime drift ? If it was an operator-set env via Console, document the change vector.
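Once the artefact and the runtime values are both in hand, the field-level comparison is straightforward. The sketch below assumes, without confirmation from the source, that enrichment_config.json is a flat JSON object keyed by the two field names from the alert; the helper name and the tolerance are hypothetical.

```python
import math

PINNED_FIELDS = ("frac_diff_d", "frac_diff_min_w_threshold")


def config_mismatches(artefact, runtime, rel_tol=1e-9):
    """Return (field, pinned_value, runtime_value) for every disagreeing field."""
    mismatches = []
    for field in PINNED_FIELDS:
        pinned, live = artefact.get(field), runtime.get(field)
        both_numeric = isinstance(pinned, (int, float)) and isinstance(live, (int, float))
        if both_numeric:
            same = math.isclose(pinned, live, rel_tol=rel_tol)
        else:
            # Covers a missing key (None) on either side.
            same = pinned == live
        if not same:
            mismatches.append((field, pinned, live))
    return mismatches


# Illustrative values: the runtime was switched to the AFML threshold.
artefact = {"frac_diff_d": 0.4, "frac_diff_min_w_threshold": 5e-4}
runtime = {"frac_diff_d": 0.4, "frac_diff_min_w_threshold": 1e-5}
print(config_mismatches(artefact, runtime))
```

An empty list means the two fields agree and the alert needs a different explanation, such as a stale artefact lookup.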
5. Symptom : domain interactions raise RuntimeError at training time
Detection : FTF sweep fails with RuntimeError: Track 12 domain-interactions enabled but required source columns are missing: [...].
Likely causes :
- Upstream FE pipeline column rename : RSI / MACD / ADX / BB column naming shifted in cvntrade_enrich.py.
- CVN_DOMAIN_INTERACTIONS_ENABLED=1 set in a context that doesn't run the full enrichment (e.g., a unit-test fixture or a custom-config sweep).
- Operator passes a custom crypto_group whose feed lacks volume (extreme low-liquidity tokens).
Action :
- Read the error message : the missing columns are listed with canonical prefixes (RSI_*, atr_normalized, volume, close, MACD_*, BB_upper/lower, momentum_*, ADX_*).
- Cross-check the upstream enrichment config (cvntrade_enrich.py) for renames.
- If the rename is intentional, update the _add_domain_interaction_features source-column lookup AND re-run the sweep.
- Do NOT bypass by disabling the factor : that masks the real bug. Per ADR-25 (committee df09258b P0 #1).
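To check a dataframe's columns before launching a sweep, the fail-fast behaviour can be approximated as below. This is a sketch, not the body of _add_domain_interaction_features: the split between prefix matches and exact names is an assumption derived from the list in the error message, and both helper names are hypothetical.

```python
# Assumption: these are matched as prefixes.
REQUIRED_PREFIXES = ("RSI_", "MACD_", "momentum_", "ADX_")
# Assumption: these are matched as exact column names.
REQUIRED_EXACT = ("atr_normalized", "volume", "close", "BB_upper", "BB_lower")


def missing_sources(columns):
    """List required source columns (or prefix patterns) absent from columns."""
    cols = list(columns)
    missing = [
        prefix + "*"
        for prefix in REQUIRED_PREFIXES
        if not any(c.startswith(prefix) for c in cols)
    ]
    missing += [name for name in REQUIRED_EXACT if name not in cols]
    return missing


def assert_sources_present(columns):
    """Raise instead of silently skipping the interaction features."""
    missing = missing_sources(columns)
    if missing:
        raise RuntimeError(
            "Track 12 domain-interactions enabled but required source "
            f"columns are missing: {missing}"
        )


# Illustrative column set lacking volume and any ADX column.
cols = ["close", "atr_normalized", "RSI_14", "MACD_hist",
        "BB_upper", "BB_lower", "momentum_10"]
print(missing_sources(cols))  # ['ADX_*', 'volume']
```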
6. Cross-references
- Plan dossier : documentation/reviews/2026-05-03-cvn-n001-ee-s05-track12-frac-diff-plan.md
- Amendment dossier : documentation/reviews/2026-05-08-cvn-n001-ee-s05-track12-warmup-gap-amendment-plan.md
- PR review dossier : documentation/reviews/2026-05-08-cvn-n001-ee-s05-track12-pr-review-dossier.md
- MLOps readiness : documentation/stories/CVN-N001-EE-S05/mlops_readiness.md
- Sibling Track 1 runbook (style + 4-symptom layout reference) : documentation/runbooks/runbook_btc_features_drift.md
- F1 plan canonical : documentation/F1_BUY_BOOST_PLAN.md §5 Track 12, §6 sequencing, §7 reporting standard