FTF Run Guardrails: Pre-Flight Checks & Runtime Guards
Version: 2.1
Design date: 2026-04-15 · Implementation status last reviewed: 2026-04-23
Issues: #551 (original), #608 (mission guardrails), #630 (PTE envelope guardrails)
Status: PASSED (OK, 7.9/10); V2 incorporates 6 committee recommendations
ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable by design), ADR-58 (every factor must have a guardrail)
Committee: Session 62046fc5 (2026-04-15)
Read me first (2026-04-23): this doc describes the §1-13 design from #551. The actual implementation in src/commun/finetune/guardrails.py today differs; see §0 below for what's running. The design sections are kept as reference for the roadmap, but don't describe live behavior.
0. Implementation status (as of 2026-04-23)
Actually live in src/commun/finetune/guardrails.py (called from ablation_runner._run_training_variant + _run_runtime_variant before os.environ mutation):
| Check | Raises | Source |
|---|---|---|
| CVN_THRESHOLD_METHOD=f1_binary without CVN_BINARY_CLASSIFICATION=1 | VariantGuardrailError | PR #626 δ-2 round 1 |
| CVN_HPO_OBJECTIVE=f1_binary without CVN_BINARY_CLASSIFICATION=1 | VariantGuardrailError | PR #626 δ-2 round 1 (hoisted from per-trial check) |
| CVN_SL_MULT out of bounds [0.1, 3.0] | VariantGuardrailError | PR #630 |
| CVN_TP_MULT out of bounds [0.3, 5.0] | VariantGuardrailError | PR #630 |
| TP < SL (strictly inverted economics; TP == SL accepted for 1:1 RR variant) | VariantGuardrailError | PR #630 (relaxed round 4) |
| Non-numeric / non-finite SL, TP raw | VariantGuardrailError | PR #630 |
| Overlap warning: pte_envelope + sl_multiplier | audit_factor_selection (logger.warning, not raise) | PR #630 |
| Overlap warning: pte_envelope + tp_multiplier | audit_factor_selection (logger.warning, not raise) | PR #630 |
Runtime invariant (tested): validate_variant_env runs before the os.environ[k] = v loop in _run_training_variant (src/commun/finetune/ablation_runner.py). A raise leaves the operator's process env byte-identical to the pre-call snapshot. Locked by TestRunnerInvariant (6 parametrized tests).
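The validate-then-mutate ordering can be illustrated with a minimal sketch. It is not the code in guardrails.py: the signature of validate_variant_env (a plain dict of the variant's env vars), the wrapper apply_variant_env, and the choice to compare TP and SL on the raw multipliers are all assumptions made for illustration. Only the bounds and the error class name come from the table above.

```python
import math
import os


class VariantGuardrailError(ValueError):
    """Raised when a variant's env vars are inconsistent or out of bounds."""


# Bounds copied from the table above.
_BOUNDS = {"CVN_SL_MULT": (0.1, 3.0), "CVN_TP_MULT": (0.3, 5.0)}


def _parse_mult(name: str, raw: str) -> float:
    """Parse a multiplier, rejecting non-numeric, non-finite and out-of-bounds values."""
    try:
        value = float(raw)
    except (TypeError, ValueError) as exc:
        raise VariantGuardrailError(f"{name}={raw!r} is not numeric") from exc
    if not math.isfinite(value):
        raise VariantGuardrailError(f"{name}={raw!r} is not finite")
    low, high = _BOUNDS[name]
    if not low <= value <= high:
        raise VariantGuardrailError(f"{name}={value} outside [{low}, {high}]")
    return value


def validate_variant_env(variant_env: dict) -> None:
    """Assumed simplified signature: inspects only the variant's own env vars."""
    binary = variant_env.get("CVN_BINARY_CLASSIFICATION") == "1"
    for key in ("CVN_THRESHOLD_METHOD", "CVN_HPO_OBJECTIVE"):
        if variant_env.get(key) == "f1_binary" and not binary:
            raise VariantGuardrailError(
                f"{key}=f1_binary requires CVN_BINARY_CLASSIFICATION=1"
            )
    parsed = {n: _parse_mult(n, variant_env[n]) for n in _BOUNDS if n in variant_env}
    # TP == SL is accepted (1:1 RR variant); only strictly inverted economics fail.
    if len(parsed) == 2 and parsed["CVN_TP_MULT"] < parsed["CVN_SL_MULT"]:
        raise VariantGuardrailError("TP < SL: inverted economics")


def apply_variant_env(variant_env: dict) -> None:
    """Hypothetical wrapper: validate first, mutate os.environ only on success."""
    validate_variant_env(variant_env)  # may raise; nothing has been mutated yet
    for key, value in variant_env.items():
        os.environ[key] = value
```

Because the validation call sits above the mutation loop, a rejected variant cannot leave a partially applied environment behind, which is the property TestRunnerInvariant is described as locking.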
What's NOT yet implemented (still in the §1-13 design below, tracked for future):
| Guardrail | Design §§ | Status |
|---|---|---|
| preflight_check Airflow task | §3 | Not implemented. Sample count / memory / time estimates not verified pre-training. |
| quick_sample_count helper | §3 | Not implemented. |
| Runtime slow-run detection | §4 | Not implemented. Relies on Airflow task timeout only. |
| CUSUM env var clarity (CVN_CUSUM_TRAINING_ENABLED split) | §5 | Not done. Current flag CVN_CUSUM_TRAINING_MODE has 5 values. |
| Grafana slow-run alert | §8 | Not implemented. |
| Guardrail override mechanism (FORCE_GUARDRAIL_BYPASS) | §10 | Not implemented. |
| Staged rollout | §11 | Not implemented. |
| Sensitive value masking in logs | §12 | Not implemented. |
The 2026-04-14/15 post-mortem (§1) triggered this design. Since then, new classes of bugs emerged (f1_binary × 3-class mode fail-fast; PTE envelope bounds) and got their own guardrails via ADR-58. The §1-13 design is still the North Star; §0 is the ground truth.
1. Problem Statement
Post-mortem: 2 sessions lost
| Session | Duration | Samples | Expected | Root cause | Cost |
|---|---|---|---|---|---|
| 14/04 09:25–23:00 | 14h | 28,000 | ~1,000 | CUSUM default=disabled in code | ~€5 |
| 15/04 00:00–06:15 | 6h | 28,000 | ~1,000 | CUSUM mode="enabled" interpreted as "stable" | ~€3.50 |
| Total | 21h | | | | ~€8.50 |
No guardrail caught the problem. Pods trained for hours before anyone noticed the sample count was 28× expected. The pipeline happily ran 400× slower without any alert.
Why this matters
At 2-3 FTF sessions per week, each misconfigured session wastes:
- €4-8 compute cost
- 4-8h calendar time waiting for results
- Iteration velocity: 1 day lost per incident
2. Design Principle: Block Defaults, Allow Explicit Choices
   ┌─────────────────────────────────────┐
   │ Is the unusual value set by the     │
   │ FTF factor explicitly?              │
   │                                     │
   │ _active_factor_env_vars contains    │
   │ the env var that caused the alert?  │
   └──────────┬──────────────────────────┘
              │
      ┌───────┴───────┐
      │               │
     YES             NO
      │               │
┌─────┴─────┐    ┌────┴─────┐
│ ALLOW     │    │ BLOCK    │
│ Log info  │    │ Fail     │
│ Continue  │    │ fast     │
└───────────┘    └──────────┘
Rationale: When the FTF ablation_runner explicitly sets CVN_CUSUM_TRAINING_MODE=disabled, it's a deliberate A/B test (ADR-56). When the same value appears because of a code default bug, it's an accident that should be caught immediately.
Implementation: Factor Env Var Tracking
# In run_factor_crypto (DAG task):
for env_key, env_value in factor.env_vars[variant_name].items():
    os.environ[env_key] = env_value
# Pass active factor env vars to all downstream checks
validated["_active_factor_env_vars"] = factor.env_vars[variant_name]

# In any guardrail:
def _is_explicit_choice(env_var: str, validated: dict) -> bool:
    """True if env_var was set by the active FTF factor (not a default)."""
    return env_var in validated.get("_active_factor_env_vars", {})
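As a rough illustration of how a guardrail could turn that lookup into the ALLOW/BLOCK decision from the diagram, here is a minimal sketch. The helper name allow_or_block is hypothetical and not part of the design; _is_explicit_choice is repeated only so the snippet runs on its own.

```python
import logging

logger = logging.getLogger(__name__)


def _is_explicit_choice(env_var: str, validated: dict) -> bool:
    """True if env_var was set by the active FTF factor (not a default)."""
    return env_var in validated.get("_active_factor_env_vars", {})


def allow_or_block(env_var: str, validated: dict, message: str) -> None:
    """Hypothetical helper: log and continue if explicit, otherwise fail fast."""
    if _is_explicit_choice(env_var, validated):
        logger.info("GUARDRAIL OK (explicit %s): %s", env_var, message)
        return
    raise ValueError(f"GUARDRAIL BLOCKED: {message}")


# Explicit A/B choice: allowed, logged at info level.
allow_or_block(
    "CVN_CUSUM_TRAINING_MODE",
    {"_active_factor_env_vars": {"CVN_CUSUM_TRAINING_MODE": "disabled"}},
    "28000 training samples",
)
```

Calling the same helper with an empty _active_factor_env_vars raises ValueError, which corresponds to the BLOCK branch.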
3. Guardrail 1: Pre-Flight Validation Task
Position in DAG
validate_params → resolve_factors → build_pairs
    → preflight_check (NEW, ~30s) ← fails fast here
    → run_factor_crypto (expensive, ~3 min/run)
    → aggregate → report → notify
Checks
| Check | Threshold | Default behavior | Explicit FTF choice |
|---|---|---|---|
| Training samples > 10,000 | 10K | BLOCK | ALLOW (log info) |
| Training samples < 50 | 50 | BLOCK (always) | BLOCK (always) |
| Features > 200 | 200 | WARN (log) | ALLOW (log info) |
| Estimated memory > 24Gi | 24Gi | BLOCK (always) | BLOCK (always) |
| Unknown env var value | whitelist | BLOCK (always) | BLOCK (always) |
| Estimated time > 30 min/run | 30 min | WARN + Slack | ALLOW (log info) |
Implementation
@task()
def preflight_check(validated: dict, pair: dict) -> dict:
    """Fast validation before expensive training. ~30 seconds."""
    import logging
    import os
    import sys

    logger = logging.getLogger("airflow.task")
    for p in ["/opt/airflow/src", "/opt/airflow"]:
        if p not in sys.path:
            sys.path.insert(0, p)

    factor_env = validated.get("_active_factor_env_vars", {})
    crypto = pair["crypto"]

    # --- 1. Sample count (quick: load labels + apply CUSUM, count rows) ---
    from commun.finetune.guardrails import quick_sample_count

    samples, features = quick_sample_count(
        crypto=crypto,
        strategy=validated["pte"],
        history_months=validated.get("history_months", 24),
    )
    if samples > 10000:
        if "CVN_CUSUM_TRAINING_MODE" in factor_env or "CVN_CUSUM_TRAINING_ENABLED" in factor_env:
            logger.info(
                "GUARDRAIL OK: %d samples (high but explicit: %s)",
                samples, {k: v for k, v in factor_env.items() if "CUSUM" in k},
            )
        else:
            raise ValueError(
                f"GUARDRAIL BLOCKED: {samples} training samples (expected <2000). "
                f"CUSUM may be misconfigured. CVN_CUSUM_TRAINING_MODE="
                f"{os.environ.get('CVN_CUSUM_TRAINING_MODE', 'unset')}"
            )
    if samples < 50:
        raise ValueError(f"GUARDRAIL BLOCKED: {samples} samples: insufficient for training")

    # --- 2. Feature count ---
    if features > 200:
        if "CVN_MAX_FEATURES" in factor_env:
            logger.info("GUARDRAIL OK: %d features (explicit: n_features factor)", features)
        else:
            logger.warning("GUARDRAIL WARN: %d features, expect slow HPO", features)

    # --- 3. Memory estimate ---
    # 8 bytes per float64 value, 5 copies: raw + imputed + scaled + DMatrix + model
    estimated_gb = samples * features * 8 * 5 / 1024**3
    if estimated_gb > 24:
        # The message contains "OOM risk", the phrase the §6 test matches on.
        raise ValueError(
            f"GUARDRAIL BLOCKED: OOM risk, estimated {estimated_gb:.1f}Gi peak memory "
            f"exceeds the 24Gi threshold (limit 32Gi)"
        )

    # --- 4. Env var consistency ---
    cusum_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "unset")
    valid = ("disabled", "enabled", "relaxed_1_5", "legacy_3_0", "event", "stable", "unset")
    if cusum_mode not in valid:
        raise ValueError(f"GUARDRAIL BLOCKED: unknown CVN_CUSUM_TRAINING_MODE={cusum_mode}")

    # --- 5. Time estimate ---
    estimated_min = samples * features / 1000 * 0.01 * validated.get("n_trials", 30)
    if estimated_min > 30:
        if any(k in factor_env for k in ("CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES")):
            logger.info("GUARDRAIL OK: estimated %d min/run (explicit high-data config)", estimated_min)
        else:
            logger.warning(
                "GUARDRAIL WARN: estimated %d min/run (expected <5). "
                "Total factor: ~%dh. Consider checking config.",
                estimated_min, estimated_min * 45 / 60,
            )

    logger.info(
        "PREFLIGHT PASSED: samples=%d features=%d memory=%.1fGi est_time=%.0fmin/run",
        samples, features, estimated_gb, estimated_min,
    )
    return {
        "samples": samples,
        "features": features,
        "memory_gb": round(estimated_gb, 1),
        "estimated_min_per_run": round(estimated_min, 1),
        "preflight": "PASSED",
    }
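To make the two estimates concrete, the sketch below pulls the memory and time formulas out of preflight_check into pure functions and evaluates them for the post-mortem sample count. The function names are hypothetical, and the 267-feature figure is borrowed from the example reason string in §9, so treat the inputs as illustrative rather than measured.

```python
def estimate_memory_gib(samples: int, features: int, copies: int = 5) -> float:
    """Peak memory estimate: 8 bytes per value times the number of in-memory copies."""
    return samples * features * 8 * copies / 1024**3


def estimate_minutes_per_run(samples: int, features: int, n_trials: int = 30) -> float:
    """Heuristic run time, using the same constants as the pre-flight check."""
    return samples * features / 1000 * 0.01 * n_trials


# Post-mortem scale: 28,000 samples; 267 features is an illustrative value.
mem_gib = estimate_memory_gib(28_000, 267)
minutes = estimate_minutes_per_run(28_000, 267)
print(f"{mem_gib:.2f} GiB, {minutes:.0f} min/run")  # prints "0.28 GiB, 2243 min/run"
```

With these inputs the memory estimate stays far below 24Gi, so in this scenario it is the sample-count check (and the time warning), not the memory check, that would flag the run.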
quick_sample_count Helper
# src/commun/finetune/guardrails.py (new file)
import os


def quick_sample_count(crypto: str, strategy: str, history_months: int = 24) -> tuple[int, int]:
    """Quick count of training samples WITHOUT full training. ~20 seconds.

    Returns (n_samples, n_features) after CUSUM filtering + FE pipeline.
    """
    from commun.cache.components.cvntrade_autonomous_fe import AutonomousFE

    fe = AutonomousFE(
        crypto_symbol=crypto,
        timeframe=os.environ.get("CVN_TIMEFRAME", "30m"),
        strategy=strategy,
        history_months=history_months,
    )
    # Run FE pipeline up to split (no HPO, no training)
    result = fe.get_result(dry_run=True)  # New flag: only count, don't train
    n_samples = result.get("train_size", 0)
    n_features = result.get("n_features", 0)
    return n_samples, n_features
Note: Requires adding a dry_run=True mode to AutonomousFE.get_result() that runs the pipeline up to the split and returns counts without training. Estimated: ~20 lines.
4. Guardrail 2: Runtime Slow-Run Detection
In ablation_runner.py
# After first variant of each factor completes:
_first_variant_elapsed = time.time() - t0
if _first_variant_elapsed > 1800:  # > 30 min
    _slow_config_vars = ["CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES", "CVN_CUSUM_TRAINING_ENABLED"]
    _is_explicit = any(v in self._active_factor_env for v in _slow_config_vars)
    if _is_explicit:
        logger.info(
            "GUARDRAIL OK: first variant took %.0f min, expected (explicit high-data config)",
            _first_variant_elapsed / 60,
        )
    else:
        logger.error(
            "GUARDRAIL ALERT: first variant took %.0f min (expected <5). "
            "Possible misconfiguration. Run continues but flagged for review.",
            _first_variant_elapsed / 60,
        )
        # Flag in results for Grafana visibility
        base_result["guardrail_slow_run"] = True
        # Estimate total time and warn
        remaining_runs = total_runs - 1
        eta_hours = remaining_runs * _first_variant_elapsed / 3600
        logger.error(
            "GUARDRAIL ALERT: at this rate, factor will take %.0fh total. "
            "Consider killing and checking config.",
            eta_hours,
        )
Persistence
Add guardrail_slow_run boolean to the result dict → persisted in PostgreSQL → queryable in Grafana.
5. Guardrail 3: Env Var Separation
Current (ambiguous)
CVN_CUSUM_TRAINING_MODE = "enabled" | "disabled" | "event" | "stable" | "relaxed_1_5" | "legacy_3_0"
One env var, 6 possible values, 2 different meanings. Root cause of the 21h waste.
Proposed (clear)
| Env var | Values | Purpose |
|---|---|---|
| CVN_CUSUM_TRAINING_ENABLED | 1 (default) / 0 | Apply CUSUM to training data? |
| CVN_CUSUM_FILTER_MODE | event (default) / stable | Which candles to keep? |
FTF Factor Update
AblationFactor(
    name="cusum_training_mode",
    factor_type="training",
    category="data",
    env_vars={
        "disabled": {"CVN_CUSUM_TRAINING_ENABLED": "0"},
        "relaxed_1_5": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "1.5"},
        "legacy_3_0": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "3.0"},
    },
)
Code Change
# cvntrade_autonomous_fe.py: level 1
cusum_enabled = os.environ.get("CVN_CUSUM_TRAINING_ENABLED", "1") == "1"
if cusum_enabled:
    labels, cusum_metadata = self._apply_cusum_before_split(labels, train_ratio=train_ratio)

# _apply_cusum_before_split: level 2
mode = os.environ.get("CVN_CUSUM_FILTER_MODE", "event")  # "event" or "stable"
No more dual semantics. Each var has one meaning.
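One way to keep the two variables from drifting back into ambiguity is to parse them in a single place and reject anything outside the documented values, in the spirit of ADR-25 (no silent fallback). The sketch below is an assumption about how that could look; read_cusum_config is a hypothetical helper, and only the variable names, allowed values, and defaults come from the table above.

```python
import os

_VALID_FILTER_MODES = ("event", "stable")


def read_cusum_config(environ=None) -> tuple:
    """Return (cusum_enabled, filter_mode); raise on any undocumented value.

    `environ` defaults to os.environ and exists only to make the helper testable.
    """
    env = os.environ if environ is None else environ

    enabled_raw = env.get("CVN_CUSUM_TRAINING_ENABLED", "1")
    if enabled_raw not in ("0", "1"):
        raise ValueError(
            f"CVN_CUSUM_TRAINING_ENABLED={enabled_raw!r} is invalid (expected '0' or '1')"
        )

    mode = env.get("CVN_CUSUM_FILTER_MODE", "event")
    if mode not in _VALID_FILTER_MODES:
        raise ValueError(
            f"CVN_CUSUM_FILTER_MODE={mode!r} is invalid (expected one of {_VALID_FILTER_MODES})"
        )

    return enabled_raw == "1", mode


# Defaults: CUSUM applied, event mode.
print(read_cusum_config({}))  # prints "(True, 'event')"
```

Under this sketch a legacy value such as CVN_CUSUM_TRAINING_ENABLED=enabled fails immediately instead of being read as "not 1, therefore disabled", which is the kind of silent reinterpretation behind the second lost session.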
6. Guardrail 4: Integration Tests
# tests/unit/test_cusum_guardrails.py
import os

import pytest

# `fe`, `_make_ohlcv` and `preflight_check` are not defined in this sketch; they are
# assumed to come from fixtures/helpers and imports in the real test module.


class TestCUSUMTrainingMode:
    def test_enabled_filters_most_data(self):
        """CUSUM enabled must reduce samples to <20% of total."""
        os.environ["CVN_CUSUM_TRAINING_ENABLED"] = "1"
        os.environ["CVN_CUSUM_FILTER_MODE"] = "event"
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) < len(labels) * 0.20

    def test_disabled_keeps_all(self):
        """CUSUM disabled must keep all samples."""
        os.environ["CVN_CUSUM_TRAINING_ENABLED"] = "0"
        labels = _make_ohlcv(n=10000)
        # Verify _apply_cusum_before_split is NOT called
        # and all samples are preserved

    def test_stable_mode_keeps_most(self):
        """Stable mode keeps non-transition candles (~95%)."""
        os.environ["CVN_CUSUM_TRAINING_ENABLED"] = "1"
        os.environ["CVN_CUSUM_FILTER_MODE"] = "stable"
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) > len(labels) * 0.80


class TestPreflightGuardrails:
    def test_blocks_unexpected_high_samples(self):
        """Pre-flight must BLOCK >10K samples when not explicit."""
        validated = {"_active_factor_env_vars": {}}  # no explicit CUSUM choice
        with pytest.raises(ValueError, match="GUARDRAIL BLOCKED"):
            preflight_check(validated, {"crypto": "TESTUSDC"})

    def test_allows_explicit_high_samples(self):
        """Pre-flight must ALLOW >10K samples when factor sets CUSUM explicitly."""
        validated = {
            "_active_factor_env_vars": {"CVN_CUSUM_TRAINING_ENABLED": "0"}
        }
        result = preflight_check(validated, {"crypto": "TESTUSDC"})
        assert result["preflight"] == "PASSED"

    def test_always_blocks_too_few_samples(self):
        """Pre-flight must ALWAYS block <50 samples, even if explicit."""
        # Even explicit config can't override minimum safety
        with pytest.raises(ValueError, match="insufficient"):
            ...

    def test_always_blocks_oom_risk(self):
        """Pre-flight must ALWAYS block >24Gi estimated memory."""
        with pytest.raises(ValueError, match="OOM risk"):
            ...
7. Guardrail 5: Deploy Checklist
Add to documentation/OPERATIONS.md
## After Helm Deploy with Config Changes
**MANDATORY** — skipping causes pods to run with old config.
1. ✅ Verify new code deployed:
```bash
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...
```
2. ✅ Kill ALL running FTF pods:
```bash
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
```
3. ✅ Wait 30 seconds for termination
4. ✅ Trigger new runs from Airflow UI
5. ✅ Monitor first 5 minutes — check logs:
```bash
POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | head -1 | awk '{print $1}')
kubectl logs -n cvntrade $POD | grep "train="
```
- Expected: `train=~1000` (CUSUM enabled)
- If `train=>10000`: STOP, check CUSUM config
- If `PREFLIGHT PASSED`: config is validated
8. Guardrail 6: Grafana Slow-Run Alert
New Panel: "Slow Runs (>10 min)"
SELECT run_id, factor, variant, crypto, fold_id, cost_bps,
elapsed_s, ROUND(elapsed_s / 60.0, 1) as minutes
FROM finetune_results
WHERE elapsed_s > 600
ORDER BY elapsed_s DESC
LIMIT 20
Alert Rule
| Condition | Severity | Action |
|---|---|---|
| Any run > 10 min | WARNING | Slack #cvntrade-alerts |
| Any run > 30 min | CRITICAL | Slack + investigate immediately |
| Average run > 5 min for a factor | WARNING | Check config |
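The thresholds in this table can also be expressed as a small classifier, which may help when testing alert logic outside Grafana. This is only a sketch: the function names and the idea of evaluating the rules in Python are assumptions, while the 10 min, 30 min, and 5 min thresholds come from the table.

```python
from statistics import mean
from typing import Optional

RUN_WARNING_S = 10 * 60       # any run > 10 min
RUN_CRITICAL_S = 30 * 60      # any run > 30 min
FACTOR_AVG_WARNING_S = 5 * 60  # average run > 5 min for a factor


def run_severity(elapsed_s: float) -> Optional[str]:
    """Severity for a single run, or None if it is within bounds."""
    if elapsed_s > RUN_CRITICAL_S:
        return "CRITICAL"
    if elapsed_s > RUN_WARNING_S:
        return "WARNING"
    return None


def factor_severity(elapsed_per_run: list) -> Optional[str]:
    """Severity for a whole factor, based on the average run time."""
    if elapsed_per_run and mean(elapsed_per_run) > FACTOR_AVG_WARNING_S:
        return "WARNING"
    return None


print(run_severity(180), run_severity(900), run_severity(2400))
# prints "None WARNING CRITICAL"
```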
9. Guardrail 7: PostgreSQL Persistence (enhanced per committee)
Guardrail results persisted with full context for post-hoc analysis:
| Column | Type | Purpose |
|---|---|---|
| preflight_samples | INT | Sample count from pre-flight |
| preflight_features | INT | Feature count from pre-flight |
| preflight_memory_gb | REAL | Estimated memory |
| guardrail_slow_run | BOOLEAN | True if run exceeded threshold |
| guardrail_slow_reason | TEXT | Why slow (e.g. "cusum disabled, 28K samples, 267 features") |
| guardrail_explicit | BOOLEAN | True if unusual config was explicit FTF choice |
| guardrail_override | BOOLEAN | True if override was active |
| guardrail_override_reason | TEXT | Justification for override |
Migration: infra/migrations/006_guardrails.sql
10. Guardrail Override Mechanism (committee reco)
For emergency scenarios where guardrails must be bypassed (e.g. urgent retraining after market crash):
# Set via env var with mandatory justification
CVN_GUARDRAIL_OVERRIDE=true
CVN_GUARDRAIL_OVERRIDE_REASON="emergency retraining after flash crash — approved by @ceven 2026-04-15"
Behavior when override is active:
- Pre-flight BLOCK checks → downgraded to WARN (still logged, not blocking)
- Memory/OOM checks → STILL BLOCK (hard safety, never overridable)
- All guardrail events logged with override=true tag
- Persisted in PostgreSQL: guardrail_override=true, guardrail_override_reason=...
- Post-hoc: Grafana query shows all overridden runs for review
Audit trail: Override requires Helm deploy (env var in ConfigMap) or explicit FTF factor config → git-tracked → reviewable.
Implementation:
import os


def _is_override_active() -> bool:
    return os.environ.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() == "true"


def _get_override_reason() -> str:
    reason = os.environ.get("CVN_GUARDRAIL_OVERRIDE_REASON", "")
    if not reason:
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    return reason
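The behavior list above (BLOCK downgraded to WARN, memory checks never overridable) could be wired together roughly as follows. This is a sketch under assumptions: enforce_block and its hard flag are hypothetical names, and the two helper functions are repeated only so the snippet is self-contained.

```python
import logging
import os

logger = logging.getLogger(__name__)


def _is_override_active() -> bool:
    return os.environ.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() == "true"


def _get_override_reason() -> str:
    reason = os.environ.get("CVN_GUARDRAIL_OVERRIDE_REASON", "")
    if not reason:
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    return reason


def enforce_block(message: str, *, hard: bool = False) -> None:
    """Hypothetical helper: raise unless an active override downgrades to a warning.

    hard=True marks memory/OOM checks, which stay blocking even under override.
    """
    if not hard and _is_override_active():
        reason = _get_override_reason()  # raises if the justification is missing
        logger.warning(
            "GUARDRAIL WARN (override=true, reason=%s): %s", reason, message
        )
        return
    raise ValueError(f"GUARDRAIL BLOCKED: {message}")
```

A pre-flight check would then call enforce_block("28000 training samples") for overridable conditions and enforce_block("OOM risk", hard=True) for the memory check, so an override without a reason still fails loudly.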
11. Staged Rollout for Critical Changes (committee reco)
For changes that affect training data size, model architecture, or filter configuration:
| Stage | Scope | Duration | Gate |
|---|---|---|---|
| Pre-flight | 1 crypto, 1 fold, 1 variant | ~30s | Samples/memory/time within bounds |
| Canary | 1 crypto, all folds, all variants | ~15 min | No OOM, no guardrail alerts, Sortino > 0 |
| Full | All cryptos, all folds | 2-3h | Monitor Grafana, all metrics within CI |
When to use staged rollout:
- New FTF factor that affects sample count (e.g. cusum_training_mode)
- Change to feature selection strategy
- New model type or architecture
- Change to CUSUM parameters
When NOT needed (standard factors):
- Testing different cost scenarios
- Different cooldown values
- Different trend filter settings
12. Sensitive Value Masking (committee reco)
_active_factor_env_vars may contain values that should not appear in logs (e.g. API keys if future factors use external services).
_SENSITIVE_PATTERNS = ["KEY", "SECRET", "PASSWORD", "TOKEN"]


def _mask_env_vars(env_vars: dict) -> dict:
    """Mask sensitive values for logging."""
    masked = {}
    for k, v in env_vars.items():
        if any(p in k.upper() for p in _SENSITIVE_PATTERNS):
            masked[k] = f"{v[:3]}***" if len(v) > 3 else "***"
        else:
            masked[k] = v
    return masked


# Usage in logging:
logger.info("GUARDRAIL: active env vars: %s", _mask_env_vars(factor_env))
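13. PYTHONPATH Management (committee reco)
Pre-flight task must NOT use sys.path.insert(0, ...) hacks. Use proper PYTHONPATH:
# In the DAG definition, set PYTHONPATH via env. pod_override takes a V1Pod object,
# and the task container is named "base" under the KubernetesExecutor.
from kubernetes.client import models as k8s

_PYTHONPATH_ENV = k8s.V1EnvVar(name="PYTHONPATH", value="/opt/airflow/src")
_POD_OVERRIDE = k8s.V1Pod(
    spec=k8s.V1PodSpec(containers=[k8s.V1Container(name="base", env=[_PYTHONPATH_ENV])])
)


@task(executor_config={"pod_override": _POD_OVERRIDE})
def preflight_check(validated: dict, pair: dict) -> dict:
    # No sys.path manipulation needed
    from commun.finetune.guardrails import quick_sample_count
    ...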
Or configure via Helm extraEnv for all worker pods (already set in ConfigMap).
14. Guardrail Ablation Study (committee reco)
After deployment, measure the impact of guardrails on iteration velocity:
| Metric | Before guardrails | After guardrails | Target |
|---|---|---|---|
| Wasted sessions / month | 2-4 | 0 | 0 |
| Time to detect misconfiguration | 2-6h | <30s (pre-flight) | <1 min |
| False positive rate | N/A | < 5% | < 2% |
| Pre-flight overhead | N/A | ~30s/pod | < 60s |
| Total iteration time impact | N/A | +30s per factor | < 1% of total run time |
Review schedule: After 10 FTF sessions, analyze guardrail triggers. Adjust thresholds if false positive rate > 5%.
15. Implementation Plan: V2
Phase 1: Core (HIGH priority)
| Task | Effort |
|---|---|
| guardrails.py module (quick_sample_count + masking) | 40 lines |
| Pre-flight task in DAG (proper PYTHONPATH, no sys.path) | 40 lines |
| Factor env var tracking in run_factor_crypto | 5 lines |
| Integration tests (6 tests) | 60 lines |
| Env var separation (ENABLED + FILTER_MODE) | 25 lines |
Phase 2: Runtime + Safety (HIGH priority)
| Task | Effort |
|---|---|
| Runtime slow-run guard with context + reason | 25 lines |
| Override mechanism (env var + reason + validation) | 20 lines |
| Enhanced persistence (reason, override columns) | 15 lines |
| Migration 006 (guardrail columns) | 10 lines |
Phase 3: Operations (MEDIUM priority)
| Task | Effort |
|---|---|
| Deploy checklist in OPERATIONS.md | Doc |
| Grafana slow-run panel + alert | 1 panel |
| Staged rollout documentation + canary script | Doc + 20 lines |
| Ablation study (post-deploy analysis) | Analysis |
Total: ~260 lines + docs
16. Success Criteria: V2
- Pre-flight catches >10K samples before training (fails fast <30s)
- Pre-flight allows >10K when FTF factor explicitly sets CUSUM=disabled
- Runtime guard flags slow runs with reason + context persisted in PostgreSQL
- Override mechanism requires justification, persisted + auditable
- Sensitive env vars masked in all logs
- No sys.path hacks (proper PYTHONPATH)
- Staged rollout documented for critical changes
- 6 integration tests pass
- Grafana alert fires on slow runs
- ADR-58 enforced on all new/modified factors
- False positive rate < 5% after 10 sessions
- Zero wasted sessions from misconfiguration