FTF Run Guardrails: Pre-Flight Checks & Runtime Guards

Version: 2.1
Design date: 2026-04-15 · Implementation status last reviewed: 2026-04-23
Issues: #551 (original), #608 (mission guardrails), #630 (PTE envelope guardrails)
Status: PASSED (OK, 7.9/10); V2 incorporates 6 committee recommendations
ADRs: ADR-25 (no silent fallback), ADR-56 (A/B testable by design), ADR-58 (every factor must have a guardrail)
Committee: Session 62046fc5 (2026-04-15)

Read me first (2026-04-23): this doc describes the §1-13 design from #551. The actual implementation in src/commun/finetune/guardrails.py today differs — see §0 below for what's running. The design sections are kept as reference for the roadmap, but don't describe live behavior.

0. Implementation status (as of 2026-04-23)

Actually live in src/commun/finetune/guardrails.py (called from ablation_runner._run_training_variant + _run_runtime_variant before os.environ mutation):

Check	Raises	Source
`CVN_THRESHOLD_METHOD=f1_binary` without `CVN_BINARY_CLASSIFICATION=1`	`VariantGuardrailError`	PR #626 δ-2 round 1
`CVN_HPO_OBJECTIVE=f1_binary` without `CVN_BINARY_CLASSIFICATION=1`	`VariantGuardrailError`	PR #626 δ-2 round 1 (hoisted from per-trial check)
`CVN_SL_MULT` out of bounds `[0.1, 3.0]`	`VariantGuardrailError`	PR #630
`CVN_TP_MULT` out of bounds `[0.3, 5.0]`	`VariantGuardrailError`	PR #630
`TP < SL` (strictly inverted economics; `TP == SL` accepted for 1:1 RR variant)	`VariantGuardrailError`	PR #630 (relaxed round 4)
Non-numeric / non-finite SL, TP raw	`VariantGuardrailError`	PR #630
Overlap warning: `pte_envelope` + `sl_multiplier`	`audit_factor_selection` (logger.warning, not raise)	PR #630
Overlap warning: `pte_envelope` + `tp_multiplier`	`audit_factor_selection` (logger.warning, not raise)	PR #630

Runtime invariant (tested): validate_variant_env runs before the os.environ[k] = v loop in _run_training_variant (src/commun/finetune/ablation_runner.py). A raise leaves the operator's process env byte-identical to the pre-call snapshot. Locked by TestRunnerInvariant (6 parametrized tests).
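
The validate-before-mutate ordering can be illustrated with a minimal sketch. It assumes a simplified `validate_variant_env` that covers only the SL/TP rows of the table above; the real function's signature, the `_BOUNDS` layout, and the `apply_variant_env` helper are hypothetical and not taken from the repository.

```python
import math
import os


class VariantGuardrailError(ValueError):
    """Stand-in for the project's exception class."""


# Bounds copied from the table above; the dict layout is an assumption.
_BOUNDS = {"CVN_SL_MULT": (0.1, 3.0), "CVN_TP_MULT": (0.3, 5.0)}


def validate_variant_env(variant_env: dict) -> None:
    """Check a variant's env vars without touching os.environ."""
    parsed = {}
    for key, (lo, hi) in _BOUNDS.items():
        if key not in variant_env:
            continue
        try:
            value = float(variant_env[key])
        except (TypeError, ValueError) as exc:
            raise VariantGuardrailError(f"{key} is not numeric: {variant_env[key]!r}") from exc
        if not math.isfinite(value) or not lo <= value <= hi:
            raise VariantGuardrailError(f"{key}={value} outside [{lo}, {hi}]")
        parsed[key] = value
    # TP < SL is rejected; TP == SL is accepted (1:1 RR variant).
    if len(parsed) == 2 and parsed["CVN_TP_MULT"] < parsed["CVN_SL_MULT"]:
        raise VariantGuardrailError("TP < SL: inverted economics")


def apply_variant_env(variant_env: dict) -> None:
    """Validate first, then mutate os.environ (hypothetical helper)."""
    validate_variant_env(variant_env)  # any raise happens before the loop below
    for k, v in variant_env.items():
        os.environ[k] = v
```

Because every check runs before the first assignment, a rejected variant cannot leave a half-applied environment behind, which is the property the runner invariant test is described as locking.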

What's NOT yet implemented (still in the §1-13 design below, tracked for future):

Guardrail	Design §§	Status
`preflight_check` Airflow task	§3	NOT implemented. Sample count / memory / time estimates not verified pre-training.
`quick_sample_count` helper	§3	Not implemented.
Runtime slow-run detection	§4	Not implemented. Relies on Airflow task timeout only.
CUSUM env var clarity (`CVN_CUSUM_TRAINING_ENABLED` split)	§5	Not done. Current flag `CVN_CUSUM_TRAINING_MODE` has 6 values.
Grafana slow-run alert	§8	Not implemented.
Guardrail override mechanism (`CVN_GUARDRAIL_OVERRIDE`)	§10	Not implemented.
Staged rollout	§11	Not implemented.
Sensitive value masking in logs	§12	Not implemented.

The 2026-04-14/15 post-mortem (§1) triggered this design. Since then, new classes of bugs emerged (f1_binary × 3-class mode fail-fast; PTE envelope bounds) and got their own guardrails via ADR-58. The §1-13 design is still the North Star; §0 is the ground truth.

1. Problem Statement

Post-mortem: 2 sessions lost

Session	Duration	Samples	Expected	Root cause	Cost
14/04 09:25–23:00	14h	28,000	~1,000	CUSUM default=disabled in code	~€5
15/04 00:00–06:15	6h	28,000	~1,000	CUSUM mode="enabled" interpreted as "stable"	~€3.50
Total	20h				~€8.50

No guardrail caught the problem. Pods trained for hours before anyone noticed the sample count was 28× expected. The pipeline happily ran 400× slower without any alert.

Why this matters

At 2-3 FTF sessions per week, each misconfigured session wastes:

- €4-8 compute cost
- 4-8h calendar time waiting for results
- Iteration velocity: 1 day lost per incident

2. Design Principle: Block Defaults, Allow Explicit Choices

                 ┌─────────────────────────────────────┐
                 │   Is the unusual value set by the    │
                 │   FTF factor explicitly?             │
                 │                                     │
                 │   _active_factor_env_vars contains   │
                 │   the env var that caused the alert? │
                 └──────────┬──────────────────────────┘
                            │
                    ┌───────┴───────┐
                    │               │
                   YES              NO
                    │               │
              ┌─────┴─────┐   ┌────┴─────┐
              │  ALLOW    │   │  BLOCK   │
              │  Log info │   │  Fail    │
              │  Continue │   │  fast    │
              └───────────┘   └──────────┘

Rationale: When the FTF ablation_runner explicitly sets CVN_CUSUM_TRAINING_MODE=disabled, it's a deliberate A/B test (ADR-56). When the same value appears because of a code default bug, it's an accident that should be caught immediately.

Implementation: Factor Env Var Tracking

# In run_factor_crypto (DAG task):
for env_key, env_value in factor.env_vars[variant_name].items():
    os.environ[env_key] = env_value

# Pass active factor env vars to all downstream checks
validated["_active_factor_env_vars"] = factor.env_vars[variant_name]

# In any guardrail:
def _is_explicit_choice(env_var: str, validated: dict) -> bool:
    """True if env_var was set by the active FTF factor (not a default)."""
    return env_var in validated.get("_active_factor_env_vars", {})

3. Guardrail 1: Pre-Flight Validation Task

Position in DAG

validate_params → resolve_factors → build_pairs
    → preflight_check (NEW, ~30s)     ← fails fast here
    → run_factor_crypto (expensive, ~3 min/run)
    → aggregate → report → notify

Checks

Check	Threshold	Default behavior	Explicit FTF choice
Training samples > 10,000	10K	BLOCK	ALLOW (log info)
Training samples < 50	50	BLOCK (always)	BLOCK (always)
Features > 200	200	WARN (log)	ALLOW (log info)
Estimated memory > 24Gi	24Gi	BLOCK (always)	BLOCK (always)
Unknown env var value	whitelist	BLOCK (always)	BLOCK (always)
Estimated time > 30 min/run	30 min	WARN + Slack	ALLOW (log info)

Implementation

@task()
def preflight_check(validated: dict, pair: dict) -> dict:
    """Fast validation before expensive training. ~30 seconds."""
    import logging
    import os
    import sys

    logger = logging.getLogger("airflow.task")

    for p in ["/opt/airflow/src", "/opt/airflow"]:
        if p not in sys.path:
            sys.path.insert(0, p)

    factor_env = validated.get("_active_factor_env_vars", {})
    crypto = pair["crypto"]

    # --- 1. Sample count (quick: load labels + apply CUSUM, count rows) ---
    from commun.finetune.guardrails import quick_sample_count
    samples, features = quick_sample_count(
        crypto=crypto,
        strategy=validated["pte"],
        history_months=validated.get("history_months", 24),
    )

    if samples > 10000:
        if "CVN_CUSUM_TRAINING_MODE" in factor_env or "CVN_CUSUM_TRAINING_ENABLED" in factor_env:
            logger.info(
                "GUARDRAIL OK: %d samples (high but explicit: %s)",
                samples, {k: v for k, v in factor_env.items() if "CUSUM" in k}
            )
        else:
            raise ValueError(
                f"GUARDRAIL BLOCKED: {samples} training samples (expected <2000). "
                f"CUSUM may be misconfigured. CVN_CUSUM_TRAINING_MODE="
                f"{os.environ.get('CVN_CUSUM_TRAINING_MODE', 'unset')}"
            )

    if samples < 50:
        raise ValueError(f"GUARDRAIL BLOCKED: {samples} samples — insufficient for training")

    # --- 2. Feature count ---
    if features > 200:
        if "CVN_MAX_FEATURES" in factor_env:
            logger.info("GUARDRAIL OK: %d features (explicit: n_features factor)", features)
        else:
            logger.warning("GUARDRAIL WARN: %d features — expect slow HPO", features)

    # --- 3. Memory estimate ---
    # 5 copies: raw + imputed + scaled + DMatrix + model
    estimated_gb = samples * features * 8 * 5 / 1024**3
    if estimated_gb > 24:
        raise ValueError(
            f"GUARDRAIL BLOCKED: OOM risk, estimated {estimated_gb:.1f}Gi peak memory (limit 32Gi)"
        )

    # --- 4. Env var consistency ---
    cusum_mode = os.environ.get("CVN_CUSUM_TRAINING_MODE", "unset")
    valid = ("disabled", "enabled", "relaxed_1_5", "legacy_3_0", "event", "stable", "unset")
    if cusum_mode not in valid:
        raise ValueError(f"GUARDRAIL BLOCKED: unknown CVN_CUSUM_TRAINING_MODE={cusum_mode}")

    # --- 5. Time estimate ---
    estimated_min = samples * features / 1000 * 0.01 * validated.get("n_trials", 30)
    if estimated_min > 30:
        if any(k in factor_env for k in ("CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES")):
            logger.info("GUARDRAIL OK: estimated %d min/run (explicit high-data config)", estimated_min)
        else:
            logger.warning(
                "GUARDRAIL WARN: estimated %d min/run (expected <5). "
                "Total factor: ~%dh. Consider checking config.",
                estimated_min, estimated_min * 45 / 60
            )

    logger.info(
        "PREFLIGHT PASSED: samples=%d features=%d memory=%.1fGi est_time=%.0fmin/run",
        samples, features, estimated_gb, estimated_min
    )

    return {
        "samples": samples,
        "features": features,
        "memory_gb": round(estimated_gb, 1),
        "estimated_min_per_run": round(estimated_min, 1),
        "preflight": "PASSED",
    }
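
To make the two estimates concrete, the sketch below evaluates the same formulas for the post-mortem configuration (28,000 samples; the 267-feature figure is borrowed from the example in §9). The helper names are illustrative, and the coefficients are the design's placeholders rather than measured values.

```python
def estimate_memory_gib(samples: int, features: int, copies: int = 5) -> float:
    """Size of a float64 matrix times the number of in-memory copies, in GiB."""
    return samples * features * 8 * copies / 1024**3


def estimate_minutes(samples: int, features: int, n_trials: int = 30) -> float:
    """Heuristic from the pre-flight task: 0.01 min per 1,000 cells per trial."""
    return samples * features / 1000 * 0.01 * n_trials


mem = estimate_memory_gib(28_000, 267)
minutes = estimate_minutes(28_000, 267)
print(f"memory={mem:.2f}Gi time={minutes:.0f}min")  # memory=0.28Gi time=2243min
```

Under these coefficients the memory estimate stays far below 24Gi for this case, so the sample-count and time checks are the ones that would have flagged it. Both coefficients are worth calibrating against measured runs before the thresholds are enforced.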

quick_sample_count Helper

# src/commun/finetune/guardrails.py (new file)

import os


def quick_sample_count(crypto: str, strategy: str, history_months: int = 24) -> tuple[int, int]:
    """Quick count of training samples WITHOUT full training. ~20 seconds.

    Returns (n_samples, n_features) after CUSUM filtering + FE pipeline.
    """
    from commun.cache.components.cvntrade_autonomous_fe import AutonomousFE

    fe = AutonomousFE(
        crypto_symbol=crypto,
        timeframe=os.environ.get("CVN_TIMEFRAME", "30m"),
        strategy=strategy,
        history_months=history_months,
    )

    # Run FE pipeline up to split (no HPO, no training)
    result = fe.get_result(dry_run=True)  # New flag: only count, don't train

    n_samples = result.get("train_size", 0)
    n_features = result.get("n_features", 0)

    return n_samples, n_features

Note: Requires adding a dry_run=True mode to AutonomousFE.get_result() that runs the pipeline up to the split and returns counts without training. Estimated: ~20 lines.

4. Guardrail 2: Runtime Slow-Run Detection

In ablation_runner.py

# After first variant of each factor completes:
_first_variant_elapsed = time.time() - t0

if _first_variant_elapsed > 1800:  # > 30 min
    _slow_config_vars = ["CVN_CUSUM_TRAINING_MODE", "CVN_MAX_FEATURES", "CVN_CUSUM_TRAINING_ENABLED"]
    _is_explicit = any(v in self._active_factor_env for v in _slow_config_vars)

    if _is_explicit:
        logger.info(
            "GUARDRAIL OK: first variant took %.0f min — expected (explicit high-data config)",
            _first_variant_elapsed / 60
        )
    else:
        logger.error(
            "GUARDRAIL ALERT: first variant took %.0f min (expected <5). "
            "Possible misconfiguration. Run continues but flagged for review.",
            _first_variant_elapsed / 60
        )
        # Flag in results for Grafana visibility
        base_result["guardrail_slow_run"] = True

        # Estimate total factor time at the observed per-run rate and warn
        eta_hours = total_runs * _first_variant_elapsed / 3600
        logger.error(
            "GUARDRAIL ALERT: at this rate, factor will take %.0fh total. "
            "Consider killing and checking config.",
            eta_hours
        )

Persistence

Add guardrail_slow_run boolean to the result dict → persisted in PostgreSQL → queryable in Grafana.

5. Guardrail 3: Env Var Separation

Current (ambiguous)

CVN_CUSUM_TRAINING_MODE = "enabled" | "disabled" | "event" | "stable" | "relaxed_1_5" | "legacy_3_0"

One env var, 6 possible values, 2 different meanings (whether to filter at all, and which candles to keep). This ambiguity was the root cause of both lost sessions in §1.

Proposed (clear)

Env var	Values	Purpose
`CVN_CUSUM_TRAINING_ENABLED`	`1` (default) / `0`	Apply CUSUM to training data?
`CVN_CUSUM_FILTER_MODE`	`event` (default) / `stable`	Which candles to keep?

FTF Factor Update

AblationFactor(
    name="cusum_training_mode",
    factor_type="training",
    category="data",
    env_vars={
        "disabled": {"CVN_CUSUM_TRAINING_ENABLED": "0"},
        "relaxed_1_5": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "1.5"},
        "legacy_3_0": {"CVN_CUSUM_TRAINING_ENABLED": "1", "CVN_CUSUM_THRESHOLD_H": "3.0"},
    },
)

Code Change

# cvntrade_autonomous_fe.py — level 1
cusum_enabled = os.environ.get("CVN_CUSUM_TRAINING_ENABLED", "1") == "1"
if cusum_enabled:
    labels, cusum_metadata = self._apply_cusum_before_split(labels, train_ratio=train_ratio)

# _apply_cusum_before_split — level 2
mode = os.environ.get("CVN_CUSUM_FILTER_MODE", "event")  # "event" or "stable"

No more dual semantics. Each var has one meaning.
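
One way to keep the two variables unambiguous is to parse them in a single place and reject anything outside the documented values, in the spirit of ADR-25 (no silent fallback). The sketch below is an assumption about how that could look; `read_cusum_config` is a hypothetical helper, not an existing function in the codebase.

```python
import os
from typing import Mapping, Tuple

_VALID_ENABLED = {"1": True, "0": False}
_VALID_FILTER_MODES = ("event", "stable")


def read_cusum_config(env: Mapping[str, str] = os.environ) -> Tuple[bool, str]:
    """Return (enabled, filter_mode); raise on any unrecognized value."""
    raw_enabled = env.get("CVN_CUSUM_TRAINING_ENABLED", "1")
    if raw_enabled not in _VALID_ENABLED:
        raise ValueError(f"unknown CVN_CUSUM_TRAINING_ENABLED={raw_enabled!r}")
    mode = env.get("CVN_CUSUM_FILTER_MODE", "event")
    if mode not in _VALID_FILTER_MODES:
        raise ValueError(f"unknown CVN_CUSUM_FILTER_MODE={mode!r}")
    return _VALID_ENABLED[raw_enabled], mode
```

With this shape, a legacy value such as `CVN_CUSUM_TRAINING_ENABLED=enabled` fails immediately instead of being silently read as "off".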

6. Guardrail 4: Integration Tests

# tests/unit/test_cusum_guardrails.py
# `fe` (an AutonomousFE instance) and `_make_ohlcv` are assumed to be provided
# by the test module's fixtures/helpers.

import pytest


class TestCUSUMTrainingMode:

    def test_enabled_filters_most_data(self, monkeypatch):
        """CUSUM enabled must reduce samples to <20% of total."""
        # monkeypatch restores the env after each test, so settings don't leak
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "1")
        monkeypatch.setenv("CVN_CUSUM_FILTER_MODE", "event")
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) < len(labels) * 0.20

    def test_disabled_keeps_all(self, monkeypatch):
        """CUSUM disabled must keep all samples."""
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "0")
        labels = _make_ohlcv(n=10000)
        # Verify _apply_cusum_before_split is NOT called
        # and all samples are preserved

    def test_stable_mode_keeps_most(self, monkeypatch):
        """Stable mode keeps non-transition candles (~95%)."""
        monkeypatch.setenv("CVN_CUSUM_TRAINING_ENABLED", "1")
        monkeypatch.setenv("CVN_CUSUM_FILTER_MODE", "stable")
        labels = _make_ohlcv(n=10000)
        filtered, meta = fe._apply_cusum_before_split(labels)
        assert len(filtered) > len(labels) * 0.80

class TestPreflightGuardrails:

    def test_blocks_unexpected_high_samples(self):
        """Pre-flight must BLOCK >10K samples when not explicit."""
        validated = {"_active_factor_env_vars": {}}  # no explicit CUSUM choice
        with pytest.raises(ValueError, match="GUARDRAIL BLOCKED"):
            preflight_check(validated, {"crypto": "TESTUSDC"})

    def test_allows_explicit_high_samples(self):
        """Pre-flight must ALLOW >10K samples when factor sets CUSUM explicitly."""
        validated = {
            "_active_factor_env_vars": {"CVN_CUSUM_TRAINING_ENABLED": "0"}
        }
        result = preflight_check(validated, {"crypto": "TESTUSDC"})
        assert result["preflight"] == "PASSED"

    def test_always_blocks_too_few_samples(self):
        """Pre-flight must ALWAYS block <50 samples, even if explicit."""
        # Even explicit config can't override minimum safety
        with pytest.raises(ValueError, match="insufficient"):
            ...

    def test_always_blocks_oom_risk(self):
        """Pre-flight must ALWAYS block >24Gi estimated memory."""
        with pytest.raises(ValueError, match="OOM risk"):
            ...

7. Guardrail 5: Deploy Checklist

Add to `documentation/OPERATIONS.md`

## After Helm Deploy with Config Changes

**MANDATORY** — skipping causes pods to run with old config.

1. ✅ Verify new code deployed:
   ```bash
   SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
   kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...
   ```

2. ✅ Kill ALL running FTF pods:
   ```bash
   kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
   ```

3. ✅ Wait 30 seconds for termination

4. ✅ Trigger new runs from Airflow UI

5. ✅ Monitor first 5 minutes — check logs:
   ```bash
   POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | head -1 | awk '{print $1}')
   kubectl logs -n cvntrade $POD | grep "train="
   ```
   - Expected: `train=~1000` (CUSUM enabled)
   - If `train=>10000`: STOP, check CUSUM config
   - If `PREFLIGHT PASSED`: config is validated

8. Guardrail 6: Grafana Slow-Run Alert

New Panel: "Slow Runs (>10 min)"

SELECT run_id, factor, variant, crypto, fold_id, cost_bps,
       elapsed_s, ROUND(elapsed_s / 60.0, 1) as minutes
FROM finetune_results
WHERE elapsed_s > 600
ORDER BY elapsed_s DESC
LIMIT 20

Alert Rule

Condition	Severity	Action
Any run > 10 min	WARNING	Slack #cvntrade-alerts
Any run > 30 min	CRITICAL	Slack + investigate immediately
Average run > 5 min for a factor	WARNING	Check config
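
The thresholds in the table can be expressed as a small classifier, which may help when unit-testing the alert logic outside Grafana. This is a sketch only: the function names are hypothetical, and the actual alert would be defined in Grafana rather than in Python.

```python
from typing import Optional, Sequence

WARNING_S = 600       # any run > 10 min
CRITICAL_S = 1800     # any run > 30 min
FACTOR_AVG_S = 300    # average run > 5 min for a factor


def classify_run(elapsed_s: float) -> Optional[str]:
    """Map a single run's duration to the severity in the alert table."""
    if elapsed_s > CRITICAL_S:
        return "CRITICAL"
    if elapsed_s > WARNING_S:
        return "WARNING"
    return None


def factor_average_is_slow(elapsed_s: Sequence[float]) -> bool:
    """True when a factor's mean run time exceeds the 5-minute threshold."""
    if not elapsed_s:
        return False
    return sum(elapsed_s) / len(elapsed_s) > FACTOR_AVG_S
```

The comparisons are strict, matching the `elapsed_s > 600` filter in the panel query, so a run of exactly 10 minutes does not alert.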

9. Guardrail 7: PostgreSQL Persistence (enhanced per committee)

Guardrail results persisted with full context for post-hoc analysis:

Column	Type	Purpose
`preflight_samples`	INT	Sample count from pre-flight
`preflight_features`	INT	Feature count from pre-flight
`preflight_memory_gb`	REAL	Estimated memory
`guardrail_slow_run`	BOOLEAN	True if run exceeded threshold
`guardrail_slow_reason`	TEXT	Why slow (e.g. "cusum disabled, 28K samples, 267 features")
`guardrail_explicit`	BOOLEAN	True if unusual config was explicit FTF choice
`guardrail_override`	BOOLEAN	True if override was active
`guardrail_override_reason`	TEXT	Justification for override

Migration: infra/migrations/006_guardrails.sql

10. Guardrail Override Mechanism (committee reco)

For emergency scenarios where guardrails must be bypassed (e.g. urgent retraining after market crash):

# Set via env var with mandatory justification
CVN_GUARDRAIL_OVERRIDE=true
CVN_GUARDRAIL_OVERRIDE_REASON="emergency retraining after flash crash — approved by @ceven 2026-04-15"

Behavior when override is active:

- Pre-flight BLOCK checks → downgraded to WARN (still logged, not blocking)
- Memory/OOM checks → STILL BLOCK (hard safety, never overridable)
- All guardrail events logged with override=true tag
- Persisted in PostgreSQL: guardrail_override=true, guardrail_override_reason=...
- Post-hoc: Grafana query shows all overridden runs for review

Audit trail: Override requires Helm deploy (env var in ConfigMap) or explicit FTF factor config → git-tracked → reviewable.

Implementation:

def _is_override_active() -> bool:
    return os.environ.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() == "true"

def _get_override_reason() -> str:
    reason = os.environ.get("CVN_GUARDRAIL_OVERRIDE_REASON", "")
    if not reason:
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    return reason
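
A possible way to apply the override at each BLOCK site is a single enforcement helper, sketched below. `enforce_block` and its `hard` flag are hypothetical names introduced for illustration; the env var names are the ones defined in this section.

```python
import logging
import os

logger = logging.getLogger(__name__)


def enforce_block(message: str, *, hard: bool = False) -> None:
    """Raise on a BLOCK finding unless a justified override downgrades it.

    hard=True marks memory/OOM-style checks, which are never overridable.
    """
    override = os.environ.get("CVN_GUARDRAIL_OVERRIDE", "false").lower() == "true"
    if hard or not override:
        raise ValueError(f"GUARDRAIL BLOCKED: {message}")
    reason = os.environ.get("CVN_GUARDRAIL_OVERRIDE_REASON", "")
    if not reason:
        raise ValueError("GUARDRAIL OVERRIDE requires CVN_GUARDRAIL_OVERRIDE_REASON")
    # Downgraded to WARN: still logged with the override tag, but not blocking.
    logger.warning("GUARDRAIL WARN (override=true, reason=%s): %s", reason, message)
```

Routing every BLOCK through one function keeps the "memory checks are never overridable" rule in a single place instead of repeating it at each call site.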

11. Staged Rollout for Critical Changes (committee reco)

For changes that affect training data size, model architecture, or filter configuration:

Stage	Scope	Duration	Gate
Pre-flight	1 crypto, 1 fold, 1 variant	~30s	Samples/memory/time within bounds
Canary	1 crypto, all folds, all variants	~15 min	No OOM, no guardrail alerts, Sortino > 0
Full	All cryptos, all folds	2-3h	Monitor Grafana, all metrics within CI

When to use staged rollout:

- New FTF factor that affects sample count (e.g. cusum_training_mode)
- Change to feature selection strategy
- New model type or architecture
- Change to CUSUM parameters

When NOT needed (standard factors):

- Testing different cost scenarios
- Different cooldown values
- Different trend filter settings

12. Sensitive Value Masking (committee reco)

_active_factor_env_vars may contain values that should not appear in logs (e.g. API keys if future factors use external services).

_SENSITIVE_PATTERNS = ["KEY", "SECRET", "PASSWORD", "TOKEN"]

def _mask_env_vars(env_vars: dict) -> dict:
    """Mask sensitive values for logging."""
    masked = {}
    for k, v in env_vars.items():
        if any(p in k.upper() for p in _SENSITIVE_PATTERNS):
            masked[k] = f"{v[:3]}***" if len(v) > 3 else "***"
        else:
            masked[k] = v
    return masked

# Usage in logging:
logger.info("GUARDRAIL: active env vars: %s", _mask_env_vars(factor_env))

13. PYTHONPATH Management (committee reco)

Pre-flight task must NOT use sys.path.insert(0, ...) hacks. Use proper PYTHONPATH:

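# In DAG definition, set PYTHONPATH via env (pod_override takes a V1Pod; "base" is the task container):
from kubernetes.client import models as k8s

_PYTHONPATH_ENV = k8s.V1EnvVar(name="PYTHONPATH", value="/opt/airflow/src")
_PREFLIGHT_POD = k8s.V1Pod(spec=k8s.V1PodSpec(containers=[k8s.V1Container(name="base", env=[_PYTHONPATH_ENV])]))

@task(executor_config={"pod_override": _PREFLIGHT_POD})
def preflight_check(validated: dict, pair: dict) -> dict:
    # No sys.path manipulation needed
    from commun.finetune.guardrails import quick_sample_count
    ...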

Or configure via Helm extraEnv for all worker pods (already set in ConfigMap).

14. Guardrail Ablation Study (committee reco)

After deployment, measure the impact of guardrails on iteration velocity:

Metric	Before guardrails	After guardrails	Target
Wasted sessions / month	2-4	0	0
Time to detect misconfiguration	2-6h	<30s (pre-flight)	<1 min
False positive rate	N/A	< 5%	< 2%
Pre-flight overhead	N/A	~30s/pod	< 60s
Total iteration time impact	N/A	+30s per factor	< 1% of total run time

Review schedule: After 10 FTF sessions, analyze guardrail triggers. Adjust thresholds if false positive rate > 5%.

15. Implementation Plan (V2)

Phase 1: Core (HIGH priority)

Task	Effort
`guardrails.py` module (quick_sample_count + masking)	40 lines
Pre-flight task in DAG (proper PYTHONPATH, no sys.path)	40 lines
Factor env var tracking in run_factor_crypto	5 lines
Integration tests (7 tests)	60 lines
Env var separation (ENABLED + FILTER_MODE)	25 lines

Phase 2: Runtime + Safety (HIGH priority)

Task	Effort
Runtime slow-run guard with context + reason	25 lines
Override mechanism (env var + reason + validation)	20 lines
Enhanced persistence (reason, override columns)	15 lines
Migration 006 (guardrail columns)	10 lines

Phase 3: Operations (MEDIUM priority)

Task	Effort
Deploy checklist in OPERATIONS.md	Doc
Grafana slow-run panel + alert	1 panel
Staged rollout documentation + canary script	Doc + 20 lines
Ablation study (post-deploy analysis)	Analysis

Total: ~260 lines + docs