Runbook — Track 12 frac-diff + domain interactions (P2)

Severity : P2 (production drift / staged-launch oversight on the new feature ; alerts + operator action, no automatic trade halt ; the champion_pre_track12 rollback path is symmetric and Console-driven)
Owner : @dococeven
Story : CVN-N001-EE-S05 (wp#44) · plan dossier 2026-05-03-cvn-n001-ee-s05-track12-frac-diff-plan.md · amendment 2026-05-08-cvn-n001-ee-s05-track12-warmup-gap-amendment-plan.md
Committee : plan_review session 2681aa97 (OP Meeting #120) ; pr_review session df09258b (#121) ; pr_review round 2 98b16083 (#122)
Linked code : src/commun/pipeline/frac_diff.py · src/commun/pipeline/enrichment_api.py (step 3c) · src/ETL/post_enrichment/cvntrade_xgboost_feature_generator.py::_add_domain_interaction_features · src/commun/finetune/guardrails.py::_validate_frac_diff · src/commun/finetune/ablation_matrix.py (factor frac_diff, 8 variants)

This runbook covers the operator workflow for the 2-stage launch protocol (round 2 reco #3) AND four drift / quality symptoms specific to the Track 12 path. The symmetric rollback for all of them is the registered champion_pre_track12 fallback model — no runtime env-flag toggle (per ADR-23 + ADR-15 + ADR-42).


1. 2-stage launch protocol (round 2 reco #3 — operator workflow)

Why a 2-stage protocol : the FTF factor frac_diff registers 8 variants, but 2 of them (frac_diff_d04_w1e3, frac_diff_d04_w1e5) are sensitivity rows that only make sense if the production default frac_diff_d04 is itself productive. Always running them inflates the BH-correction (8 vs 6 simultaneous tests) AND wastes ~25 % compute when Stage 1 fails. The committee mandated conditional gating ; for now this is operator-driven via the FTF DAG --variants filter, with automation tracked as a follow-up Story.
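
To make the multiple-testing cost concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to the same p-values with 6 and with 8 simultaneous tests. The p-values are invented for illustration and are not sweep results ; bh_reject is a hypothetical helper, not a function from the FTF codebase.

```python
def bh_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k_max
    return reject


# Illustrative p-values only (index 0 plays the role of frac_diff_d04).
stage1_only = [0.008, 0.30, 0.45, 0.60, 0.70, 0.90]
with_sensitivity_rows = stage1_only + [0.55, 0.65]

print(bh_reject(stage1_only)[0])            # 6 tests
print(bh_reject(with_sensitivity_rows)[0])  # 8 tests
```

With these made-up numbers the first hypothesis clears the threshold at 6 tests (0.008 ≤ 0.05/6) but not at 8 (0.008 > 0.05/8), which is the effect the gating is meant to avoid when the sensitivity rows are not themselves significant.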

Stage 1 — always run (6 variants)

Operator triggers dag_finetune__pte with :

factor = frac_diff
crypto_group = defi_top5
phase = manual
power_mode = standard
confirm_long_run = true
variants = none,frac_diff_d04,frac_diff_d05,interactions_only,combined_d04,combined_d04_purge0

Wall-clock estimate : ~3 hours (90 fits at ~2 min/fit on the FTF runner).

Stage 1 PASS criteria (all must hold) :

  • frac_diff_d04 clears the F1 plan §6 standard gates : Δ f1_buy ≥ +0.020 with 95 % bootstrap CI excluding 0, Cohen's d ≥ 0.3, BH-corrected p ≤ 0.05 ;
  • ≥ 4/5 cryptos individually improve ;
  • ≥ 50 BUY trades per fold ;
  • expectancy_net ≥ baseline, sortino ≥ baseline, max_drawdown ≤ baseline + 1 % ;
  • mandatory leakage check : combined_d04_purge0 must NOT outperform combined_d04 on f1_buy (paired t-test, BH-corrected p < 0.05). If it does → mandatory ABANDON Track 12 per plan §4.4 (do NOT proceed to Stage 2).
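
The criteria above can be sketched as a single checklist function. This is an illustration under stated assumptions, not the FTF implementation : Stage1Metrics, its field names, and stage1_verdict are hypothetical, drawdown is assumed to be a positive fraction (so "+ 1 %" becomes 0.01), and the leakage check is assumed to arrive as an already-computed boolean.

```python
from dataclasses import dataclass


@dataclass
class Stage1Metrics:
    delta_f1_buy: float          # frac_diff_d04 minus baseline
    ci_low: float                # 95 % bootstrap CI bounds for delta_f1_buy
    ci_high: float
    cohens_d: float
    bh_p: float                  # BH-corrected p-value
    cryptos_improved: int        # out of 5
    min_buy_trades_per_fold: int
    expectancy_net_delta: float  # variant minus baseline
    sortino_delta: float         # variant minus baseline
    max_drawdown_delta: float    # variant minus baseline, fraction (assumed)
    purge0_beats_combined: bool  # True if the leakage check tripped


def stage1_verdict(m):
    """Return (verdict, failed_checks); any failed check means ABANDON."""
    checks = {
        "delta_f1_buy": m.delta_f1_buy >= 0.020,
        "ci_excludes_0": not (m.ci_low <= 0.0 <= m.ci_high),
        "cohens_d": m.cohens_d >= 0.3,
        "bh_p": m.bh_p <= 0.05,
        "cryptos_improved": m.cryptos_improved >= 4,
        "buy_trades": m.min_buy_trades_per_fold >= 50,
        "expectancy_net": m.expectancy_net_delta >= 0.0,
        "sortino": m.sortino_delta >= 0.0,
        "max_drawdown": m.max_drawdown_delta <= 0.01,
        "leakage": not m.purge0_beats_combined,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("PASS" if not failed else "ABANDON", failed)
```

Returning the list of failed checks alongside the verdict keeps the reason available for the results report.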

Stage 1 FAIL : ABANDON Track 12 ; document the verdict in documentation/missions/F1_buy_boost/reports/2026-XX-XX-track12-frac-diff-results.md. Do NOT run Stage 2.

Stage 2 — gated (2 sensitivity variants)

Triggered only on Stage 1 PASS :

factor = frac_diff
crypto_group = defi_top5
phase = manual
power_mode = standard
confirm_long_run = true
variants = frac_diff_d04_w1e3,frac_diff_d04_w1e5

Wall-clock estimate : ~1 hour (30 fits).

Stage 2 outcomes :

  • All three (_w1e3, _d04 (5e-4), _w1e5) within ±0.005 of each other on f1_buy → keep production default at 5e-4 (Option 2 per committee 2681aa97). LOCK frac_diff_d04.
  • _w1e3 outperforms _d04 by > 0.005 on f1_buy → flip production default to 1e-3, LOCK frac_diff_d04_w1e3. Re-evaluate the warmup-gap amendment dossier §11 follow-up #1 conclusion.
  • _w1e5 outperforms _d04 by > 0.005 on f1_buy → revert to AFML default 1e-5, LOCK frac_diff_d04_w1e5 ; the loss-budget hit (16 % of training rows) becomes acceptable in light of the empirical lift.
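
A minimal sketch of the three outcomes as a decision function follows. The function name and input shape are hypothetical. Two cases are not covered by the list above and are handled here by assumption only : if both challengers beat _d04 by more than 0.005 the larger lift wins, and if a challenger merely underperforms the default stays.

```python
TOLERANCE = 0.005  # f1_buy margin from the Stage 2 outcomes

CHALLENGERS = ("frac_diff_d04_w1e3", "frac_diff_d04_w1e5")


def stage2_lock(f1_buy):
    """Pick the variant to LOCK from a {variant_name: f1_buy} mapping."""
    base = f1_buy["frac_diff_d04"]
    # Lift of each sensitivity variant over the production default.
    lifts = {name: f1_buy[name] - base for name in CHALLENGERS}
    best = max(lifts, key=lifts.get)
    # Assumption: only a lift strictly above the tolerance flips the default.
    return best if lifts[best] > TOLERANCE else "frac_diff_d04"


print(stage2_lock({
    "frac_diff_d04": 0.400,
    "frac_diff_d04_w1e3": 0.402,
    "frac_diff_d04_w1e5": 0.397,
}))
```

The example values are invented ; with all three within the margin, the call keeps frac_diff_d04.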

Whichever variant LOCKs, the operator updates ftf_config.base_env via Console (ADR-59) so the production EnrichmentConfig honours the LOCK'd min_w_threshold. The MLflow enrichment_config.json artefact pin happens automatically once the Track-1-follow-up loader PR is merged.


2. Symptom : KS test p < 0.01 on frac_diff_close_d{NN} over 14 days

Detection : Grafana panel "Frac-diff drift" shows distribution drift for the production frac_diff_close_d04 (or _d05) against its training-time distribution. Loki query : {event="frac_diff_drift_alert"} | feature=....
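
For reference, the kind of check behind such a panel can be sketched as a two-sample Kolmogorov-Smirnov test between the training-time sample and a recent window. This is a standard-library illustration using the asymptotic p-value series, not the production alerting code ; ks_two_sample and the toy data are assumptions.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample KS statistic D and its asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d = 0.0
    # Both empirical CDFs are step functions, so the supremum of their
    # absolute difference is reached at one of the observed points.
    for x in a + b:
        diff = abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        d = max(d, diff)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The alternating series converges poorly here; the true value is ~1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Toy data: a reference sample and a copy shifted by 0.3.
reference = [i / 1000 for i in range(1000)]
shifted = [x + 0.3 for x in reference]

d_stat, p_value = ks_two_sample(reference, shifted)
print(p_value < 0.01)
```

The asymptotic formula is only an approximation for small samples ; a library routine with exact small-sample handling would be the safer choice in practice.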

Likely causes :

  1. Volatility-regime shift — frac-diff captures long-memory ; a sustained vol regime change (e.g., crypto-wide capitulation or a structural break post-halving) pushes the feature distribution.
  2. Close-price feed quality : a Binance close-aggregation change, or an exchange outage leaving NaN gaps that the frac-diff convolution interprets as zeros.
  3. Bug in frac_diff.compute — recent change altered the weight recurrence or the warmup gap.

Action :

  1. Pull last 14 days of event=frac_diff_applied d=... min_w=... from Loki + the Grafana KS panel.
  2. If a single feature drifts (just _d04) AND the vol regime is normal → check feed quality first (likely cause 2 above).
  3. If both _d04 and _d05 drift in the same direction → likely volatility-regime shift, real signal change. Schedule re-fit + re-calibrate in the next sprint.
  4. If sustained > 30 days with no operator action → revert to champion_pre_track12 via Console (atomic per-crypto, ADR-15 + ADR-42). RTO < 5 min.

3. Symptom : warmup-row loss > 20 % per-crypto

Detection : Loki event=frac_diff_warmup_dropped_rows pct_dropped=... exceeds 20 % for any crypto in the FTF run logs. Grafana panel "Warmup-row loss %".

Likely causes :

  1. min_w_threshold typo — operator set CVN_FRAC_DIFF_MIN_W_THRESHOLD=1e-5 (AFML canonical) without realising the warmup cost. This should be caught by the FTF guardrail _validate_frac_diff ; if the alert fires, the guardrail was bypassed.
  2. Short training window — operator reduced CVN_TRAIN_WINDOW_MONTHS ; a 3-month window at 5m can be > 90 % wiped at min_w=1e-5, d=0.4.
  3. Bug in frac_diff.compute — weight recurrence regression.
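
As background for causes 1 and 3, the sketch below shows the textbook fractional-differencing weight recurrence with a minimum-weight cutoff, and how a smaller threshold lengthens the weight window (and therefore the warmup). It assumes a fixed-width-window truncation ; the real frac_diff.compute may truncate differently, so the lengths printed here are not the loss-budget figures from the amendment dossier. frac_diff_weights is a hypothetical name.

```python
def frac_diff_weights(d, min_w_threshold, max_len=100_000):
    """Weights of (1 - B)^d, truncated once |w_k| drops below the threshold."""
    weights = [1.0]  # w_0 = 1
    k = 1
    while k < max_len:
        # Recurrence: w_k = -w_{k-1} * (d - k + 1) / k
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < min_w_threshold:
            break
        weights.append(w)
        k += 1
    return weights


default_window = frac_diff_weights(0.4, 5e-4)
afml_window = frac_diff_weights(0.4, 1e-5)

# A smaller cutoff keeps more weights, so more leading rows lack a full window.
print(len(default_window), len(afml_window))
```

A quick regression check for cause 3 is that the first two weights are 1 and -d, and that every later weight is negative for 0 < d < 1.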

Action :

  1. Cross-check the env vars in the FTF run conf vs the loss-budget table in the amendment dossier §2.
  2. If min_w_threshold=1e-5 was intentional, accept the loss but document it as a sweep choice.
  3. If unintentional, re-launch the sweep with the production default 5e-4.

4. Symptom : event=enrichment_config_mismatch field=frac_diff_d|frac_diff_min_w_threshold

Detection : Loki event=enrichment_config_mismatch fires at inference time when the runtime EnrichmentConfig disagrees with the model's pinned enrichment_config.json artefact. This symptom is escalated to P1 severity, above the runbook's default P2, because the model is deployed under the wrong env config (ADR-23 violation).

Action :

  1. Halt new inferences immediately : in Console, flip the crypto's status to "paused" on the trading dashboard.
  2. Pull the model run's enrichment_config.json from MLflow vs the runtime env vars.
  3. Identify which field mismatches (frac_diff_d or frac_diff_min_w_threshold).
  4. Fix the runtime to match the artefact, OR roll back to champion_pre_track12 if the runtime is the intended state.
  5. Post-mortem : how did the runtime drift ? If it was an operator-set env via Console, document the change vector.
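
Steps 2 and 3 can be sketched as a small field-by-field comparison. This assumes the pinned artefact and the runtime config can both be read into flat dictionaries keyed by the two field names above ; the actual enrichment_config.json layout may differ, and config_mismatches is a hypothetical helper.

```python
import math

PINNED_FIELDS = ("frac_diff_d", "frac_diff_min_w_threshold")


def config_mismatches(runtime, pinned, fields=PINNED_FIELDS):
    """Return {field: (runtime_value, pinned_value)} for every disagreement."""
    mismatches = {}
    for field in fields:
        r, p = runtime.get(field), pinned.get(field)
        numeric = isinstance(r, (int, float)) and isinstance(p, (int, float))
        # Compare floats with a tolerance so 5e-4 and 0.0005 count as equal.
        same = math.isclose(r, p, rel_tol=1e-12) if numeric else r == p
        if not same:
            mismatches[field] = (r, p)
    return mismatches


# Invented values: the runtime threshold disagrees with the pinned artefact.
runtime_cfg = {"frac_diff_d": 0.4, "frac_diff_min_w_threshold": 1e-5}
pinned_cfg = {"frac_diff_d": 0.4, "frac_diff_min_w_threshold": 5e-4}

print(config_mismatches(runtime_cfg, pinned_cfg))
```

A missing key shows up as None on that side, which makes an absent runtime variable visible instead of silently matching.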

5. Symptom : domain interactions raise RuntimeError at training time

Detection : FTF sweep fails with RuntimeError: Track 12 domain-interactions enabled but required source columns are missing: [...].
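
The fail-fast behaviour can be pictured with the sketch below, which checks a list of column names before any interaction feature is built. It is not the body of _add_domain_interaction_features : the split between exact names and prefixes is an assumption based on the column names listed in the Action steps of this section, and check_interaction_sources is a hypothetical name.

```python
# Assumed split: which sources are exact names and which are prefixes.
REQUIRED_EXACT = ("atr_normalized", "volume", "close", "BB_upper", "BB_lower")
REQUIRED_PREFIXES = ("RSI_", "MACD_", "momentum_", "ADX_")


def check_interaction_sources(columns):
    """Raise RuntimeError listing every missing source column or prefix."""
    cols = list(columns)
    missing = [name for name in REQUIRED_EXACT if name not in cols]
    missing += [
        prefix + "*"
        for prefix in REQUIRED_PREFIXES
        if not any(c.startswith(prefix) for c in cols)
    ]
    if missing:
        raise RuntimeError(
            "Track 12 domain-interactions enabled but required source "
            f"columns are missing: {missing}"
        )


# Invented frame layout in which the volume column is absent.
example_columns = [
    "close", "atr_normalized", "BB_upper", "BB_lower",
    "RSI_14", "MACD_12_26", "momentum_10", "ADX_14",
]
try:
    check_interaction_sources(example_columns)
except RuntimeError as exc:
    print(exc)
```

Raising instead of silently skipping the interactions keeps an upstream rename from producing a model trained on a partial feature set.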

Likely causes :

  1. Upstream FE pipeline column rename — RSI / MACD / ADX / BB column naming shifted in cvntrade_enrich.py.
  2. CVN_DOMAIN_INTERACTIONS_ENABLED=1 set in a context that doesn't run the full enrichment (e.g., a unit-test fixture or a custom-config sweep).
  3. Operator passes a custom crypto_group whose feed lacks volume — extreme low-liquidity tokens.

Action :

  1. Read the error message : the missing columns are listed with canonical prefixes (RSI_, atr_normalized, volume, close, MACD_, BB_upper/lower, momentum_, ADX_).
  2. Cross-check the upstream enrichment config (cvntrade_enrich.py) for renames.
  3. If the rename is intentional, update _add_domain_interaction_features source-column lookup AND re-run the sweep.
  4. Do NOT bypass by disabling the factor : that masks the real bug (per ADR-25, committee df09258b P0 #1).

6. Cross-references

  • Plan dossier : documentation/reviews/2026-05-03-cvn-n001-ee-s05-track12-frac-diff-plan.md
  • Amendment dossier : documentation/reviews/2026-05-08-cvn-n001-ee-s05-track12-warmup-gap-amendment-plan.md
  • PR review dossier : documentation/reviews/2026-05-08-cvn-n001-ee-s05-track12-pr-review-dossier.md
  • MLOps readiness : documentation/stories/CVN-N001-EE-S05/mlops_readiness.md
  • Sibling Track 1 runbook (style + 4-symptom layout reference) : documentation/runbooks/runbook_btc_features_drift.md
  • F1 plan canonical : documentation/F1_BUY_BOOST_PLAN.md §5 Track 12, §6 sequencing, §7 reporting standard