FTF Run Guardrails — Pre-Flight Checks & Runtime Guards

Version: 2.1
Design date: 2026-04-15 · Implementation status last reviewed: 2026-04-23
Issues: #551 (original), #608 (mission guardrails), #630 (PTE envelope guardrails)
Status: PASSED (OK, 7.9/10); V2 incorporates 6 committee recommendations
ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable by design), ADR-58 (every factor must have a guardrail)
Committee: Session 62046fc5 (2026-04-15)

Read me first (2026-04-23): this doc describes the §1-13 design from #551. The actual implementation in src/commun/finetune/guardrails.py today differs — see §0 below for what's running. The design sections are kept as reference for the roadmap, but don't describe live behavior.


0. Implementation status (as of 2026-04-23)

Actually live in src/commun/finetune/guardrails.py (called from ablation_runner._run_training_variant + _run_runtime_variant before os.environ mutation):

| Check | Raises | Source |
| --- | --- | --- |
| CVN_THRESHOLD_METHOD=f1_binary without CVN_BINARY_CLASSIFICATION=1 | VariantGuardrailError | PR #626 δ-2 round 1 |
| CVN_HPO_OBJECTIVE=f1_binary without CVN_BINARY_CLASSIFICATION=1 | VariantGuardrailError | PR #626 δ-2 round 1 (hoisted from per-trial check) |
| CVN_SL_MULT out of bounds [0.1, 3.0] | VariantGuardrailError | PR #630 |
| CVN_TP_MULT out of bounds [0.3, 5.0] | VariantGuardrailError | PR #630 |
| TP < SL (strictly inverted economics; TP == SL accepted for 1:1 RR variant) | VariantGuardrailError | PR #630 (relaxed round 4) |
| Non-numeric / non-finite SL, TP raw | VariantGuardrailError | PR #630 |
| Overlap warning: pte_envelope + sl_multiplier | audit_factor_selection (logger.warning, not raise) | PR #630 |
| Overlap warning: pte_envelope + tp_multiplier | audit_factor_selection (logger.warning, not raise) | PR #630 |
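The raising checks in the table can be sketched as a single validator. This is a minimal illustration, not the code in guardrails.py: the function name check_variant_env, the ValueError base class, the inclusive bounds, and the choice to read CVN_BINARY_CLASSIFICATION from the same mapping are all assumptions.

```python
from __future__ import annotations

import math


class VariantGuardrailError(ValueError):
    """Stand-in for the project's exception; the ValueError base is an assumption."""


SL_BOUNDS = (0.1, 3.0)  # CVN_SL_MULT bounds from the table
TP_BOUNDS = (0.3, 5.0)  # CVN_TP_MULT bounds from the table


def _parse_mult(name: str, raw: str, bounds: tuple[float, float]) -> float:
    """Parse one multiplier; reject non-numeric, non-finite and out-of-bounds values."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise VariantGuardrailError(f"{name}={raw!r} is not numeric") from None
    if not math.isfinite(value):
        raise VariantGuardrailError(f"{name}={raw!r} is not finite")
    low, high = bounds
    if not low <= value <= high:
        raise VariantGuardrailError(f"{name}={value} outside [{low}, {high}]")
    return value


def check_variant_env(env: dict[str, str]) -> None:
    """Hypothetical validator: raise on the first violated rule, mutate nothing."""
    binary = env.get("CVN_BINARY_CLASSIFICATION") == "1"
    for key in ("CVN_THRESHOLD_METHOD", "CVN_HPO_OBJECTIVE"):
        if env.get(key) == "f1_binary" and not binary:
            raise VariantGuardrailError(
                f"{key}=f1_binary requires CVN_BINARY_CLASSIFICATION=1"
            )

    sl = tp = None
    if "CVN_SL_MULT" in env:
        sl = _parse_mult("CVN_SL_MULT", env["CVN_SL_MULT"], SL_BOUNDS)
    if "CVN_TP_MULT" in env:
        tp = _parse_mult("CVN_TP_MULT", env["CVN_TP_MULT"], TP_BOUNDS)
    # TP == SL is accepted (1:1 RR variant); only strictly inverted economics fail.
    if sl is not None and tp is not None and tp < sl:
        raise VariantGuardrailError(f"TP ({tp}) < SL ({sl}): inverted economics")
```

Because the function only reads its argument, it can run on a variant's env mapping before anything is written to os.environ.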

Runtime invariant (tested): validate_variant_env runs before the os.environ[k] = v loop in _run_training_variant (src/commun/finetune/ablation_runner.py). A raise leaves the operator's process env byte-identical to the pre-call snapshot. Locked by TestRunnerInvariant (6 parametrized tests).
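The invariant amounts to a validate-then-mutate ordering. The sketch below illustrates that ordering with a toy validator; apply_variant_env and reject_inverted are hypothetical names, and the real runner and TestRunnerInvariant may be structured differently.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping, MutableMapping


def apply_variant_env(
    variant_env: Mapping[str, str],
    validate: Callable[[Mapping[str, str]], None],
    environ: MutableMapping[str, str] = os.environ,
) -> None:
    """Validate the whole variant first, then write it to the environment."""
    validate(variant_env)  # may raise; nothing has been written yet
    for key, value in variant_env.items():
        environ[key] = value


def reject_inverted(variant_env: Mapping[str, str]) -> None:
    """Toy validator (assumption): refuse TP < SL."""
    if float(variant_env["CVN_TP_MULT"]) < float(variant_env["CVN_SL_MULT"]):
        raise ValueError("TP < SL")


# A rejected variant leaves the target mapping identical to its snapshot.
fake_environ = {"PATH": "/usr/bin", "CVN_SL_MULT": "1.0"}
snapshot = dict(fake_environ)
try:
    apply_variant_env(
        {"CVN_SL_MULT": "2.0", "CVN_TP_MULT": "1.0"}, reject_inverted, fake_environ
    )
except ValueError:
    pass
assert fake_environ == snapshot
```

Passing the target mapping as a parameter keeps the check testable without touching the real process environment.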

What's NOT yet implemented (still in the §1-13 design below, tracked for future):

| Guardrail | Design §§ | Status |
| --- | --- | --- |
| preflight_check Airflow task | §3 | NOT implemented. Sample count / memory / time estimates not verified pre-training. |
| quick_sample_count helper | §3 | Not implemented. |
| Runtime slow-run detection | §4 | Not implemented. Relies on Airflow task timeout only. |
| CUSUM env var clarity (CVN_CUSUM_TRAINING_ENABLED split) | §5 | Not done. Current flag CVN_CUSUM_TRAINING_MODE has 5 values. |
| Grafana slow-run alert | §8 | Not implemented. |
| Guardrail override mechanism (FORCE_GUARDRAIL_BYPASS) | §10 | Not implemented. |
| Staged rollout | §11 | Not implemented. |
| Sensitive value masking in logs | §12 | Not implemented. |

The 2026-04-14/15 post-mortem (§1) triggered this design. Since then, new classes of bugs emerged (f1_binary × 3-class mode fail-fast; PTE envelope bounds) and got their own guardrails via ADR-58. The §1-13 design is still the North Star; §0 is the ground truth.


1. Problem Statement

Post-mortem: 2 sessions lost

| Session | Duration | Samples | Expected | Root cause | Cost |
| --- | --- | --- | --- | --- | --- |
| 14/04 09:25–23:00 | 14h | 28,000 | ~1,000 | CUSUM default=disabled in code | ~€5 |
| 15/04 00:00–06:15 | 6h | 28,000 | ~1,000 | CUSUM mode="enabled" interpreted as "stable" | ~€3.50 |
| Total | 21h | | | | ~€8.50 |

No guardrail caught the problem. Pods trained for hours before anyone noticed the sample count was 28× expected. The pipeline happily ran 400× slower without any alert.

Why this matters

At 2-3 FTF sessions per week, each misconfigured session wastes:

  • €4-8 compute cost
  • 4-8h calendar time waiting for results
  • Iteration velocity: 1 day lost per incident


2. Design Principle: Block Defaults, Allow Explicit Choices

                 ┌─────────────────────────────────────┐
                 │   Is the unusual value set by the    │
                 │   FTF factor explicitly?             │
                 │                                     │
                 │   _active_factor_env_vars contains   │
                 │   the env var that caused the alert? │
                 └──────────┬──────────────────────────┘
                    ┌───────┴───────┐
                    │               │
                   YES              NO
                    │               │
              ┌─────┴─────┐   ┌────┴─────┐
              │  ALLOW    │   │  BLOCK   │
              │  Log info │   │  Fail    │
              │  Continue │   │  fast    │
              └───────────┘   └──────────┘

Rationale: When the FTF ablation_runner explicitly sets CVN_CUSUM_TRAINING_MODE=disabled, it's a deliberate A/B test (ADR-56). When the same value appears because of a code default bug, it's an accident that should be caught immediately.

Implementation: Factor Env Var Tracking

# In run_factor_crypto (DAG task):
for env_key, env_value in factor.env_vars[variant_name].items():
    os.environ[env_key] = env_value

# Pass active factor env vars to all downstream checks
validated["_active_factor_env_vars"] = factor.env_vars[variant_name]
# In any guardrail:
def _is_explicit_choice(env_var: str, validated: dict) -> bool:
    """True if env_var was set by the active FTF factor (not a default)."""
    return env_var in validated.get("_active_factor_env_vars", {})
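To make the decision diagram concrete, here is a minimal sketch of how a single guardrail could apply it. The function guard_unusual_value and its "ALLOW" return value are illustrative assumptions; only the _active_factor_env_vars lookup comes from the design above.

```python
import logging

logger = logging.getLogger(__name__)


def is_explicit_choice(env_var: str, validated: dict) -> bool:
    """True if env_var was set by the active FTF factor (not a default)."""
    return env_var in validated.get("_active_factor_env_vars", {})


def guard_unusual_value(env_var: str, value: str, validated: dict) -> str:
    """Apply the ALLOW/BLOCK decision to one unusual value (hypothetical helper)."""
    if is_explicit_choice(env_var, validated):
        # Deliberate A/B test: log and continue.
        logger.info("GUARDRAIL OK: %s=%s (explicit FTF choice)", env_var, value)
        return "ALLOW"
    # Value came from a default or stale config: fail fast.
    raise ValueError(
        f"GUARDRAIL BLOCKED: {env_var}={value} was not set by the active factor"
    )


explicit = {"_active_factor_env_vars": {"CVN_CUSUM_TRAINING_MODE": "disabled"}}
print(guard_unusual_value("CVN_CUSUM_TRAINING_MODE", "disabled", explicit))
```

The same value with an empty _active_factor_env_vars mapping raises instead, which is the case the post-mortem sessions fell into.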

3. Guardrail 1: Pre-Flight Validation Task

Position in DAG

validate_params → resolve_factors → build_pairs
    → preflight_check (NEW, ~30s)     ← fails fast here
    → run_factor_crypto (expensive, ~3 min/run)
    → aggregate → report → notify

Checks

| Check | Threshold | Default behavior | Explicit FTF choice |
| --- | --- | --- | --- |
| Training samples > 10,000 | 10K | BLOCK | ALLOW (log info) |
| Training samples < 50 | 50 | BLOCK (always) | BLOCK (always) |
| Features > 200 | 200 | WARN (log) | ALLOW (log info) |
| Estimated memory > 24Gi | 24Gi | BLOCK (always) | BLOCK (always) |
| Unknown env var value | whitelist | BLOCK (always) | BLOCK (always) |
| Estimated time > 30 min/run | 30 min | WARN + Slack | ALLOW (log info) |

Implementation

@task()
def preflight_check(validated: dict, pair: dict) -> dict:
    """Fast validation before expensive training. ~30 seconds."""
    import logging
    import os
    import sys

    logger = logging.getLogger("airflow.task")

    for p in ["/opt/airflow/src", "/opt/airflow"]:
        if p not in sys.path:
            sys.path.insert(0, p)

    factor_env = validated.get("_active_factor_env_vars", {})
    crypto = pair["crypto"]

    # --- 1. Sample count (quick: load labels + apply CUSUM, count rows) ---
    from commun.finetune.guardrails import quick_sample_count
    samples, features = quick_sample_count(
        crypto=crypto,
        strategy=validated["pte"],
        history_months=validated.get("history_months", 24),
    )

    if samples > 10000:
        if "CVN_CUSUM_TRAINING_MODE" in factor_env or "CVN_CUSUM_TRAINING_ENABLED" in factor_env:
            logger.info(
                "GUARDRAIL OK: %d samples (high but explicit: %s)",
                samples, {k: v for k, v in factor_env.items() if "CUSUM" in k}
            )
        else:
            raise ValueError(
                f"GUARDRAIL BLOCKED: {samples} training samples (expected <2000). "
                f"CUSUM may be misconfigured. CVN_CUSUM_TRAINING_MODE="
                f"{os.environ.get('CVN_CUSUM_TRAINING_MODE', 'unset')}"
            )

    if samples < 50:
        raise ValueError(f"GUARDRAIL BLOCKED: {samples} samples — insufficient for training")

    # --- 2. Feature count ---
    if features > 200:
        if "CVN_MAX_FEATURES" in factor_env:
            logger.info("GUARDRAIL OK: %d features (explicit: n_features factor)", features)
        else:
            logger.warning("GUARDRAIL WARN: %d features — expect slow HPO", features)

    # --- 3. Memory estimate ---
    # 5 copies: raw + imputed + scaled + DMatrix + model
    estimated_gb = samples * features * 8 * 5 / 1024**3
    if estimated_gb > 24:
        raise ValueError(
            f"GUARDRAIL BLOCKED: estimated {estimated_gb:.1f}Gi peak memory (limit 32Gi)"
        )

    # --- 4. Env var consistency ---
    cusum_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "unset")
    valid = ("disabled", "enabled", "relaxed_1_5", "legacy_3_0", "event", "stable", "unset")
    if cusum_mode not in valid:
        raise ValueError(f"GUARDRAIL BLOCKED: unknown CVN_CUSUM_TRAINING_MODE={cusum_mode}")

    # --- 5. Time estimate ---
    estimated_min = samples * features / 1000 * 0.01 * validated.get("n_trials", 30)
    if estimated_min > 30:
        if any(k in factor_env for k in ("CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES")):
            logger.info("GUARDRAIL OK: estimated %d min/run (explicit high-data config)", estimated_min)
        else:
            logger.warning(
                "GUARDRAIL WARN: estimated %d min/run (expected <5). "
                "Total factor: ~%dh. Consider checking config.",
                estimated_min, estimated_min * 45 / 60
            )

    logger.info(
        "PREFLIGHT PASSED: samples=%d features=%d memory=%.1fGi est_time=%.0fmin/run",
        samples, features, estimated_gb, estimated_min
    )

    return {
        "samples": samples,
        "features": features,
        "memory_gb": round(estimated_gb, 1),
        "estimated_min_per_run": round(estimated_min, 1),
        "preflight": "PASSED",
    }
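As a worked example, the two estimate formulas from preflight_check can be evaluated on the post-mortem numbers. The sketch below assumes 28,000 samples (§1), 267 features (the example slow-run reason string in §9), the default n_trials=30, and that the factor 8 stands for 8-byte floats; the function names are hypothetical.

```python
BYTES_PER_VALUE = 8  # assumed to mean float64
COPIES = 5           # raw + imputed + scaled + DMatrix + model


def estimate_memory_gib(samples: int, features: int) -> float:
    """Peak memory estimate in GiB, same formula as preflight_check."""
    return samples * features * BYTES_PER_VALUE * COPIES / 1024**3


def estimate_minutes_per_run(samples: int, features: int, n_trials: int = 30) -> float:
    """Time estimate in minutes per run, same formula as preflight_check."""
    return samples * features / 1000 * 0.01 * n_trials


incident_gib = estimate_memory_gib(28_000, 267)       # about 0.28 GiB
incident_min = estimate_minutes_per_run(28_000, 267)  # about 2243 min
nominal_min = estimate_minutes_per_run(1_000, 267)    # about 80 min

print(f"incident: {incident_gib:.2f} GiB, {incident_min:.0f} min/run")
print(f"nominal:  {nominal_min:.0f} min/run")
```

Under these assumptions the memory check would not have fired on the incident, while the sample-count and time checks would. The nominal case also lands above the 30 min warning threshold, which suggests the 0.01 coefficient may need calibration against measured run times before the time check is relied on.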

quick_sample_count Helper

# src/commun/finetune/guardrails.py (new file)
import os


def quick_sample_count(crypto: str, strategy: str, history_months: int = 24) -> tuple[int, int]:
    """Quick count of training samples WITHOUT full training. ~20 seconds.

    Returns (n_samples, n_features) after CUSUM filtering + FE pipeline.
    """
    from commun.cache.components.cvntrade_autonomous_fe import AutonomousFE

    fe = AutonomousFE(
        crypto_symbol=crypto,
        timeframe=os.environ.get("CVN_TIMEFRAME", "30m"),
        strategy=strategy,
        history_months=history_months,
    )

    # Run FE pipeline up to split (no HPO, no training)
    result = fe.get_result(dry_run=True)  # New flag: only count, don't train

    n_samples = result.get("train_size", 0)
    n_features = result.get("n_features", 0)

    return n_samples, n_features

Note: Requires adding a dry_run=True mode to AutonomousFE.get_result() that runs the pipeline up to the split and returns counts without training. Estimated: ~20 lines.


4. Guardrail 2: Runtime Slow-Run Detection

In ablation_runner.py

# After first variant of each factor completes:
_first_variant_elapsed = time.time() - t0

if _first_variant_elapsed > 1800:  # > 30 min
    _slow_config_vars = ["CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES", "CVN_CUSUM_TRAINING_ENABLED"]
    _is_explicit = any(v in self._active_factor_env for v in _slow_config_vars)

    if _is_explicit:
        logger.info(
            "GUARDRAIL OK: first variant took %.0f min — expected (explicit high-data config)",
            _first_variant_elapsed / 60
        )
    else:
        logger.error(
            "GUARDRAIL ALERT: first variant took %.0f min (expected <5). "
            "Possible misconfiguration. Run continues but flagged for review.",
            _first_variant_elapsed / 60
        )
        # Flag in results for Grafana visibility
        base_result["guardrail_slow_run"] = True

        # Estimate the time needed for the remaining runs and warn
        remaining_runs = total_runs - 1
        eta_hours = remaining_runs * _first_variant_elapsed / 3600
        logger.error(
            "GUARDRAIL ALERT: at this rate, the remaining %d runs will take ~%.0fh. "
            "Consider killing and checking config.",
            remaining_runs, eta_hours
        )

Persistence

Add guardrail_slow_run boolean to the result dict → persisted in PostgreSQL → queryable in Grafana.
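The slow-run decision can also be isolated as a pure function whose output is merged into the result dict, which makes it easy to unit test. This is a sketch under assumptions: evaluate_first_variant and the eta_hours_remaining key are hypothetical, while the threshold, env var names, and guardrail_slow_run / guardrail_explicit fields follow this document.

```python
from __future__ import annotations

from collections.abc import Mapping

SLOW_RUN_THRESHOLD_S = 1800  # 30 min, as in the snippet above
SLOW_CONFIG_VARS = (
    "CVN_CUSUM_TRAINING_MODE",
    "CVN_MAX_FEATURES",
    "CVN_CUSUM_TRAINING_ENABLED",
)


def evaluate_first_variant(
    elapsed_s: float, total_runs: int, active_factor_env: Mapping[str, str]
) -> dict:
    """Return the guardrail fields to merge into the persisted result dict."""
    slow = elapsed_s > SLOW_RUN_THRESHOLD_S
    explicit = any(var in active_factor_env for var in SLOW_CONFIG_VARS)
    return {
        # Flag only unexplained slowness; an explicit high-data config is expected.
        "guardrail_slow_run": slow and not explicit,
        "guardrail_explicit": slow and explicit,
        # Time left for the runs after the first one, at the observed rate.
        "eta_hours_remaining": (total_runs - 1) * elapsed_s / 3600,
    }


print(evaluate_first_variant(3600, total_runs=45, active_factor_env={}))
```

A one-hour first variant with 45 runs and no explicit factor yields a flagged result with 44 hours remaining.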


5. Guardrail 3: Env Var Separation

Current (ambiguous)

CVN_CUSUM_TRAINING_MODE = "enabled" | "disabled" | "event" | "stable" | "relaxed_1_5" | "legacy_3_0"

One env var, 6 possible values, 2 different meanings. Root cause of the 21h waste.

Proposed (clear)

| Env var | Values | Purpose |
| --- | --- | --- |
| CVN_CUSUM_TRAINING_ENABLED | 1 (default) / 0 | Apply CUSUM to training data? |
| CVN_CUSUM_FILTER_MODE | event (default) / stable | Which candles to keep? |

FTF Factor Update

AblationFactor(
    name="cusum_training_mode",
    factor_type="training",
    category="data",
    env_vars={
        "disabled": {"CVN_CUSUM_TRAINING_ENABLED": "0"},
        "relaxed_1_5": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "1.5"},
        "legacy_3_0": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "3.0"},
    },
)

Code Change

# cvntrade_autonomous_fe.py — level 1
cusum_enabled = os.environ.get("CVN_CUSUM_TRAINING_ENABLED", "1") == "1"
if cusum_enabled:
    labels, cusum_metadata = self._apply_cusum_before_split(labels, train_ratio=train_ratio)

# _apply_cusum_before_split — level 2
mode = os.environ.get("CVN_CUSUM_FILTER_MODE", "event")  # "event" or "stable"

No more dual semantics. Each var has one meaning.
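One caveat with a bare == "1" comparison is that a typo such as "true" silently reads as disabled. A stricter reader, in the spirit of ADR-25 (no silent fallback), might look like the sketch below; CusumConfig and read_cusum_config are hypothetical names, and only the two env vars, their values, and their defaults come from the table above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class CusumConfig:
    enabled: bool     # apply CUSUM to training data?
    filter_mode: str  # "event" or "stable"


def read_cusum_config(env: Mapping[str, str] = os.environ) -> CusumConfig:
    """Read both variables, rejecting any value outside the documented set."""
    raw_enabled = env.get("CVN_CUSUM_TRAINING_ENABLED", "1")
    if raw_enabled not in ("0", "1"):
        raise ValueError(f"unknown CVN_CUSUM_TRAINING_ENABLED={raw_enabled!r}")
    raw_mode = env.get("CVN_CUSUM_FILTER_MODE", "event")
    if raw_mode not in ("event", "stable"):
        raise ValueError(f"unknown CVN_CUSUM_FILTER_MODE={raw_mode!r}")
    return CusumConfig(enabled=raw_enabled == "1", filter_mode=raw_mode)


print(read_cusum_config({}))
print(read_cusum_config({"CVN_CUSUM_TRAINING_ENABLED": "0"}))
```

With this shape, the legacy value "enabled" in CVN_CUSUM_FILTER_MODE raises immediately instead of being interpreted as another mode.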


6. Guardrail 4: Integration Tests

# tests/unit/test_cusum_guardrails.py
# Assumes an `fe` fixture (AutonomousFE instance) and a `_make_ohlcv` helper.
# monkeypatch.setenv restores the environment after each test, so settings
# made in one test cannot leak into the next.
import pytest


class TestCUSUMTrainingMode:

    def test_enabled_filters_most_data(self, monkeypatch, fe):
        """CUSUM enabled must reduce samples to <20% of total."""
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "1")
        monkeypatch.setenv("CVN_CUSUM_FILTER_MODE", "event")
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) < len(labels) * 0.20

    def test_disabled_keeps_all(self, monkeypatch):
        """CUSUM disabled must keep all samples."""
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "0")
        labels = _make_ohlcv(n=10000)
        # Verify _apply_cusum_before_split is NOT called
        # and all samples are preserved

    def test_stable_mode_keeps_most(self, monkeypatch, fe):
        """Stable mode keeps non-transition candles (~95%)."""
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "1")
        monkeypatch.setenv("CVN_CUSUM_FILTER_MODE", "stable")
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) > len(labels) * 0.80

class TestPreflightGuardrails:

    def test_blocks_unexpected_high_samples(self):
        """Pre-flight must BLOCK >10K samples when not explicit."""
        validated = {"_active_factor_env_vars": {}}  # no explicit CUSUM choice
        with pytest.raises(ValueError, match="GUARDRAIL BLOCKED"):
            preflight_check(validated, {"crypto": "TESTUSDC"})

    def test_allows_explicit_high_samples(self):
        """Pre-flight must ALLOW >10K samples when factor sets CUSUM explicitly."""
        validated = {
            "_active_factor_env_vars": {"CVN_CUSUM_TRAINING_ENABLED": "0"}
        }
        result = preflight_check(validated, {"crypto": "TESTUSDC"})
        assert result["preflight"] == "PASSED"

    def test_always_blocks_too_few_samples(self):
        """Pre-flight must ALWAYS block <50 samples, even if explicit."""
        # Even explicit config can't override minimum safety
        with pytest.raises(ValueError, match="insufficient"):
            ...

    def test_always_blocks_oom_risk(self):
        """Pre-flight must ALWAYS block >24Gi estimated memory."""
        # Match text that preflight_check actually emits for the memory check
        with pytest.raises(ValueError, match="peak memory"):
            ...

7. Guardrail 5: Deploy Checklist

Add to documentation/OPERATIONS.md

## After Helm Deploy with Config Changes

**MANDATORY** — skipping causes pods to run with old config.

1. ✅ Verify new code deployed:
   ```bash
   SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
   kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...
   ```

2. ✅ Kill ALL running FTF pods:
   ```bash
   kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
   ```

3. ✅ Wait 30 seconds for termination

4. ✅ Trigger new runs from Airflow UI

5. ✅ Monitor first 5 minutes — check logs:
   ```bash
   POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | head -1 | awk '{print $1}')
   kubectl logs -n cvntrade $POD | grep "train="
   ```
   - Expected: `train=~1000` (CUSUM enabled)
   - If `train=>10000`: STOP, check CUSUM config
   - If `PREFLIGHT PASSED`: config is validated

8. Guardrail 6: Grafana Slow-Run Alert

New Panel: "Slow Runs (>10 min)"

SELECT run_id, factor, variant, crypto, fold_id, cost_bps,
       elapsed_s, ROUND(elapsed_s / 60.0, 1) as minutes
FROM finetune_results
WHERE elapsed_s > 600
ORDER BY elapsed_s DESC
LIMIT 20

Alert Rule

| Condition | Severity | Action |
| --- | --- | --- |
| Any run > 10 min | WARNING | Slack #cvntrade-alerts |
| Any run > 30 min | CRITICAL | Slack + investigate immediately |
| Average run > 5 min for a factor | WARNING | Check config |
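The alert thresholds can be expressed as two small classification functions, for example to test the rules before wiring them into Grafana. This is an illustrative sketch; run_severity and factor_severity are hypothetical names, and only the thresholds and severities come from the table.

```python
from __future__ import annotations

from statistics import mean

WARNING_S = 600               # any run > 10 min
CRITICAL_S = 1800             # any run > 30 min
FACTOR_MEAN_WARNING_S = 300   # average run > 5 min for a factor


def run_severity(elapsed_s: float) -> str | None:
    """Severity for a single run, or None if it is within bounds."""
    if elapsed_s > CRITICAL_S:
        return "CRITICAL"
    if elapsed_s > WARNING_S:
        return "WARNING"
    return None


def factor_severity(elapsed_s_per_run: list[float]) -> str | None:
    """Severity for a whole factor, based on its average run time."""
    if elapsed_s_per_run and mean(elapsed_s_per_run) > FACTOR_MEAN_WARNING_S:
        return "WARNING"
    return None


print(run_severity(2400), run_severity(700), run_severity(180))
print(factor_severity([200, 250, 500]))
```

The critical check runs first so that a 40-minute run is reported once, at the higher severity.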

9. Guardrail 7: PostgreSQL Persistence (enhanced per committee)

Guardrail results persisted with full context for post-hoc analysis:

| Column | Type | Purpose |
| --- | --- | --- |
| preflight_samples | INT | Sample count from pre-flight |
| preflight_features | INT | Feature count from pre-flight |
| preflight_memory_gb | REAL | Estimated memory |
| guardrail_slow_run | BOOLEAN | True if run exceeded threshold |
| guardrail_slow_reason | TEXT | Why slow (e.g. "cusum disabled, 28K samples, 267 features") |
| guardrail_explicit | BOOLEAN | True if unusual config was explicit FTF choice |
| guardrail_override | BOOLEAN | True if override was active |
| guardrail_override_reason | TEXT | Justification for override |

Migration: infra/migrations/006_guardrails.sql


10. Guardrail Override Mechanism (committee reco)

For emergency scenarios where guardrails must be bypassed (e.g. urgent retraining after market crash):

# Set via env var with mandatory justification
CVN_GUARDRAIL_OVERRIDE=true
CVN_GUARDRAIL_OVERRIDE_REASON="emergency retraining after flash crash — approved by @ceven 2026-04-15"

Behavior when override is active:

  • Pre-flight BLOCK checks → downgraded to WARN (still logged, not blocking)
  • Memory/OOM checks → STILL BLOCK (hard safety, never overridable)
  • All guardrail events logged with override=true tag
  • Persisted in PostgreSQL: guardrail_override=true, guardrail_override_reason=...
  • Post-hoc: Grafana query shows all overridden runs for review

Audit trail: Override requires Helm deploy (env var in ConfigMap) or explicit FTF factor config → git-tracked → reviewable.

Implementation:

import os


def _is_override_active() -> bool:
    return os.environ.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() == "true"

def _get_override_reason() -> str:
    reason = os.environ.get("CVN_GUARDRAIL_OVERRIDE_REASON", "")
    if not reason:
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    return reason
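How the override interacts with an individual check can be sketched as one function that maps a check's default action to its effective action. The name effective_action and the check identifiers are hypothetical; the downgrade rule, the memory exception, and the mandatory reason follow the behavior described above.

```python
from __future__ import annotations

from collections.abc import Mapping

# Hypothetical check identifiers; per the text only memory/OOM is never overridable.
NEVER_OVERRIDABLE = frozenset({"memory"})


def effective_action(check: str, action: str, env: Mapping[str, str]) -> str:
    """Return the action to apply once a possible override is taken into account."""
    if action != "BLOCK":
        return action  # WARN and ALLOW are unaffected by the override
    if env.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() != "true":
        return "BLOCK"
    if not env.get("CVN_GUARDRAIL_OVERRIDE_REASON", "").strip():
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    if check in NEVER_OVERRIDABLE:
        return "BLOCK"  # hard safety
    return "WARN"


override_env = {
    "CVN_GUARDRAIL_OVERRIDE": "true",
    "CVN_GUARDRAIL_OVERRIDE_REASON": "emergency retraining",
}
print(effective_action("samples_high", "BLOCK", override_env))
print(effective_action("memory", "BLOCK", override_env))
```

Checking the reason before the memory exception means an override without justification is rejected regardless of which check is being evaluated.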


11. Staged Rollout for Critical Changes (committee reco)

For changes that affect training data size, model architecture, or filter configuration:

| Stage | Scope | Duration | Gate |
| --- | --- | --- | --- |
| Pre-flight | 1 crypto, 1 fold, 1 variant | ~30s | Samples/memory/time within bounds |
| Canary | 1 crypto, all folds, all variants | ~15 min | No OOM, no guardrail alerts, Sortino > 0 |
| Full | All cryptos, all folds | 2-3h | Monitor Grafana, all metrics within CI |

When to use staged rollout:

  • New FTF factor that affects sample count (e.g. cusum_training_mode)
  • Change to feature selection strategy
  • New model type or architecture
  • Change to CUSUM parameters

When NOT needed (standard factors):

  • Testing different cost scenarios
  • Different cooldown values
  • Different trend filter settings


12. Sensitive Value Masking (committee reco)

_active_factor_env_vars may contain values that should not appear in logs (e.g. API keys if future factors use external services).

_SENSITIVE_PATTERNS = ["KEY", "SECRET", "PASSWORD", "TOKEN"]

def _mask_env_vars(env_vars: dict) -> dict:
    """Mask sensitive values for logging."""
    masked = {}
    for k, v in env_vars.items():
        if any(p in k.upper() for p in _SENSITIVE_PATTERNS):
            masked[k] = f"{v[:3]}***" if len(v) > 3 else "***"
        else:
            masked[k] = v
    return masked

# Usage in logging:
logger.info("GUARDRAIL: active env vars: %s", _mask_env_vars(factor_env))

13. PYTHONPATH Management (committee reco)

Pre-flight task must NOT use sys.path.insert(0, ...) hacks. Use proper PYTHONPATH:

# In DAG definition, set PYTHONPATH via env:
@task(executor_config={"pod_override": {"spec": {"containers": [{"env": [{"name": "PYTHONPATH", "value": "/opt/airflow/src"}]}]}}})
def preflight_check(validated: dict, pair: dict) -> dict:
    # No sys.path manipulation needed
    from commun.finetune.guardrails import quick_sample_count
    ...

Or configure via Helm extraEnv for all worker pods (already set in ConfigMap).


14. Guardrail Ablation Study (committee reco)

After deployment, measure the impact of guardrails on iteration velocity:

| Metric | Before guardrails | After guardrails | Target |
| --- | --- | --- | --- |
| Wasted sessions / month | 2-4 | 0 | 0 |
| Time to detect misconfiguration | 2-6h | <30s (pre-flight) | <1 min |
| False positive rate | N/A | < 5% | < 2% |
| Pre-flight overhead | N/A | ~30s/pod | < 60s |
| Total iteration time impact | N/A | +30s per factor | < 1% of total run time |

Review schedule: After 10 FTF sessions, analyze guardrail triggers. Adjust thresholds if false positive rate > 5%.
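The review rule can be sketched as a small computation over the logged triggers. The sketch assumes a false positive means a guardrail trigger (block or alert) on a run that review later found to be correctly configured; that definition and the function names are assumptions, not part of the design.

```python
REVIEW_THRESHOLD = 0.05  # adjust thresholds if the false positive rate exceeds 5%


def false_positive_rate(trigger_was_false_positive: list[bool]) -> float:
    """Share of guardrail triggers that were judged false positives on review."""
    if not trigger_was_false_positive:
        return 0.0  # no triggers, nothing to misfire
    return sum(trigger_was_false_positive) / len(trigger_was_false_positive)


def needs_threshold_review(trigger_was_false_positive: list[bool]) -> bool:
    """True when the observed rate is above the review threshold."""
    return false_positive_rate(trigger_was_false_positive) > REVIEW_THRESHOLD


# Example: 25 triggers over 10 sessions, 2 of them false positives.
triggers = [True, True] + [False] * 23
print(false_positive_rate(triggers), needs_threshold_review(triggers))
```

With few triggers the rate is coarse (one false positive out of ten is already 10%), so the absolute counts are worth reviewing alongside the ratio.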


15. Implementation Plan — V2

Phase 1 — Core (HIGH priority)

| Task | Effort |
| --- | --- |
| guardrails.py module (quick_sample_count + masking) | 40 lines |
| Pre-flight task in DAG (proper PYTHONPATH, no sys.path) | 40 lines |
| Factor env var tracking in run_factor_crypto | 5 lines |
| Integration tests (6 tests) | 60 lines |
| Env var separation (ENABLED + FILTER_MODE) | 25 lines |

Phase 2 — Runtime + Safety (HIGH priority)

| Task | Effort |
| --- | --- |
| Runtime slow-run guard with context + reason | 25 lines |
| Override mechanism (env var + reason + validation) | 20 lines |
| Enhanced persistence (reason, override columns) | 15 lines |
| Migration 006 (guardrail columns) | 10 lines |

Phase 3 — Operations (MEDIUM priority)

| Task | Effort |
| --- | --- |
| Deploy checklist in OPERATIONS.md | Doc |
| Grafana slow-run panel + alert | 1 panel |
| Staged rollout documentation + canary script | Doc + 20 lines |
| Ablation study (post-deploy analysis) | Analysis |

Total: ~260 lines + docs


16. Success Criteria — V2

  • Pre-flight catches >10K samples before training (fails fast <30s)
  • Pre-flight allows >10K when FTF factor explicitly sets CUSUM=disabled
  • Runtime guard flags slow runs with reason + context persisted in PostgreSQL
  • Override mechanism requires justification, persisted + auditable
  • Sensitive env vars masked in all logs
  • No sys.path hacks (proper PYTHONPATH)
  • Staged rollout documented for critical changes
  • 6 integration tests pass
  • Grafana alert fires on slow runs
  • ADR-58 enforced on all new/modified factors
  • False positive rate < 5% after 10 sessions
  • Zero wasted sessions from misconfiguration