ThresholdCalibrator — Design document

Status: v2, committee-reviewed (session 7371c57d, PASSED, strong consensus)
Authors: CVNTrade research, 2026-04-20
Issue: #608

Changelog

  • v2 (2026-04-20) — post committee review. Integrates 10 recommendations and arbitrates 4 dissents: fbeta locked as Phase 2 default; dynamic mode shipped gated OFF with ±2σ circuit-breaker; A/B FTF factor added to the migration plan; post-Platt/isotonic calibration enforced; 3-class decision locked to argmax + threshold filter; ADR-15 intent clarified (static per model version, not walk-forward); kill switch, runbook, and filter-chain-interaction sections added.
  • v1 (2026-04-20) — initial draft submitted to committee.

1. Executive summary

The post-inference confidence filter compares p(BUY) to a threshold to decide whether to open a position. Today that threshold is a fixed value (CVN_THRESHOLD_BUY, default 0.5), implicitly tuned for the 3-class mode. When we switch to binary (lever #3 of P0-A), the p(BUY) distribution shifts and the inherited threshold becomes too strict: action_rate collapses to roughly 0% and the model stops opening trades.

This document proposes a ThresholdCalibrator component that:

  • automatically calibrates a threshold on the validation fold at training time (at the end of HPO)
  • exposes three operator modes (default / conservative / dynamic) to trade confidence against volume
  • persists the fitted threshold in MLflow next to the model (it moves with retraining, not with env vars)
  • is extensible to the SELL class when CVNTrade opens up to derivatives


2. Problem statement

2.1 Evidence

Run ftf_20260420_122455_d94ad0 (lever #3 binary, baseline otherwise, config 2026-04-20-p0a-lever3):

Crypto | f1_buy | AUC | action_rate | n_trades
AAVE | 0.059 | 0.540 | 0.5% | 3
LDO | 0.034 | 0.508 | 0.3% | 0
OP | 0.000 | 0.500 | 0.0% | 1

Equivalent 3-class baseline (run ftf_20260420_082137_635720):

Crypto | f1_buy | AUC | action_rate
AAVE | 0.323 | 0.678 | 36%
LDO | 0.286 | 0.629 | 38%
OP | 0.375 | 0.632 | 39%

2.2 Diagnosis

In 3-class mode, argmax(p_sell, p_hold, p_buy) is the natural decision: the BUY class wins as soon as p(BUY) exceeds both p(HOLD) and p(SELL), which typically happens around 0.33–0.4. The 0.5 threshold is rarely the bottleneck.

In binary mode, the decision is directly p(BUY) > threshold. The p(BUY) distribution spreads over [0,1] — for a well-trained classifier with an imbalanced minority class (5–15% of bars are true BUY), p(BUY) stays below 0.5 most of the time even for genuine positives. The 0.5 threshold then sits almost always above p(BUY), driving action rate to ~0.
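The collapse can be reproduced on synthetic data. The sketch below is not taken from the runs above: it assumes perfectly calibrated probabilities drawn from a Beta(1, 9) distribution (mean BUY rate 10%), and both the distribution and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: calibrated p(BUY) ~ Beta(1, 9), i.e. a 10% minority class on average.
p_buy = rng.beta(1.0, 9.0, size=100_000)
# Labels are drawn from those probabilities, so p_buy is calibrated by construction.
y_true = rng.random(p_buy.size) < p_buy

base_rate = float(y_true.mean())                      # share of true BUY bars
action_rate_legacy = float((p_buy > 0.5).mean())      # bars passing the fixed 0.5 threshold
recall_legacy = float((p_buy[y_true] > 0.5).mean())   # true BUY bars passing it

print(f"base rate        : {base_rate:.3f}")
print(f"action rate @0.5 : {action_rate_legacy:.4f}")
print(f"recall @0.5      : {recall_legacy:.4f}")
```

Under these assumptions the tail of Beta(1, 9) above 0.5 has mass 0.5⁹ ≈ 0.002, so the fixed threshold lets through well under 1% of bars even though the probabilities are perfectly calibrated.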

2.3 Requirement

A threshold mechanism that:

  1. Adapts automatically to the p(BUY) distribution of the trained classifier (no manual tuning per classification mode)
  2. Lets the operator steer the confidence vs volume tradeoff without touching code or model
  3. Stays stable inter-fold (no overfitting bias to a specific fold)
  4. Extends to the SELL class when CVNTrade adds short-selling


3. Design principles

P1 — The threshold is a model artifact, not a fixed env var

The optimal threshold depends on the p(BUY) distribution learned by a given classifier. It must therefore move with retraining, not with a config redeploy. Consequence: the fitted threshold is persisted in MLflow next to the model, loaded at inference as a feature artifact (ADR-23 compliance).

P2 — Only modes are exposed to the operator

The operator does not pick a numeric value (error-prone, easy to forget between two runs) but a mode from {default, conservative, dynamic}. The mode → effective value mapping is computed at inference from the fitted threshold plus a margin (also parameterized globally, not per-model).

P3 — Calibration is class-agnostic

The ThresholdCalibrator class knows nothing of BUY/SELL business meaning. It receives y_proba_positive and returns a threshold. One instance per class to calibrate (BUY today, SELL tomorrow) — orthogonal to is_binary_classification() and the number of classes.

P4 — Backward compatibility without magic

CVN_THRESHOLD_BUY (today's fixed value) is still honored when CVN_THRESHOLD_MODE_BUY is absent — with an explicit deprecation warning at runtime. Smooth migration, no big-bang.

P5 — Observable

Every inference logs the active mode and the effective value. Grafana shows a p(BUY) histogram with a vertical line at the effective threshold per mode.

P6 — Fail-fast + kill switch

Missing calibrator at inference → hard fail (no silent fallback, ADR-25). The only way to bypass is an explicit CVN_THRESHOLD_CALIBRATOR_DISABLE=1 kill switch, which emits a WARNING every inference and forces a hard-coded CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE (default 0.5). This exists for incident response (bad calibrator in production), not for steady state.

P7 — Static per model version (clarifies ADR-15)

The calibrator is fit once per model version on the validation fold at training time. This is the same walk-forward loop as the model itself (ADR-15 compliant): each retraining produces a new calibrator on the fresh OOS slice. We explicitly do not recalibrate the threshold live in production against rolling windows — that would be walk-forward over walk-forward, a second-order problem out of scope. Operator adjustments in production go through the mode (default / conservative), not threshold re-fitting.


4. Architecture

4.1 Central class

# src/commun/trading/threshold_calibrator.py
from dataclasses import dataclass
from typing import Literal
import numpy as np

Mode = Literal["default", "conservative", "dynamic"]
Method = Literal["fbeta", "youden", "expectancy", "target_rate"]

# Hard circuit-breaker: no mode can push the threshold outside
# fitted_default ± 2σ. Protects against runaway false positives if
# σ is overestimated on a degenerate val set, or if an operator
# sets CVN_THRESHOLD_DYNAMIC_SIGMA to an extreme value.
CIRCUIT_BREAKER_SIGMA_CAP = 2.0


@dataclass
class ThresholdCalibrator:
    class_label: str                  # "BUY" or "SELL" (for logs only, no logic)
    fitted_default: float             # fitted optimal threshold (changes only on retraining)
    proba_sigma: float                # std of p(positive) on val set — margin scale
    fit_method: Method                # method used at fit time (audit)
    n_fit: int                        # number of val samples (stability)
    fit_on_calibrated_proba: bool     # True iff val probas had post-Platt/isotonic applied

    @classmethod
    def fit(
        cls,
        y_true: np.ndarray,           # {0, 1} binary, or {0, 1} with `== class_idx` pre-applied for 3-class
        y_proba_positive: np.ndarray, # p(positive) in [0, 1], POST-Platt/isotonic (P1 enforces this)
        class_label: str = "BUY",
        method: Method = "fbeta",
        fbeta_beta: float = 1.0,
        fit_on_calibrated_proba: bool = True,
    ) -> "ThresholdCalibrator": ...

    def get(
        self,
        mode: Mode = "default",
        conservative_sigma: float = 0.5,
        dynamic_sigma: float = 0.5,
        dynamic_enabled: bool = False,   # P2: ships OFF by default (committee dissent #2)
    ) -> float:
        if mode == "default":
            return self.fitted_default
        if mode == "conservative":
            k = min(conservative_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return min(1.0, self.fitted_default + k * self.proba_sigma)
        if mode == "dynamic":
            if not dynamic_enabled:
                raise RuntimeError(
                    "dynamic mode is gated OFF by default — enable via "
                    "CVN_THRESHOLD_DYNAMIC_ENABLED=1 after reading §5.3 and the runbook"
                )
            k = min(dynamic_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return max(0.0, self.fitted_default - k * self.proba_sigma)
        raise ValueError(f"Unknown mode: {mode}")

    def to_metadata(self) -> dict: ...
    @classmethod
    def from_metadata(cls, meta: dict) -> "ThresholdCalibrator": ...

Pure, testable, no direct MLflow / sklearn dependency (sklearn is only imported inside fit).

4.2 Training flow (post-HPO)

XGBoost trainer fit() completes
_evaluate_model on X_val, y_val → y_proba_val
ThresholdCalibrator.fit(y_val, y_proba_val[:, buy_idx], method=fbeta) → calib_buy
model_artifacts["threshold_calibrator_buy"] = calib_buy.to_metadata()
MLflow log_dict(calib_buy.to_metadata(), "threshold_calibrator_buy.json")

4.3 Inference flow

Load model from MLflow → read threshold_calibrator_buy.json
calib_buy = ThresholdCalibrator.from_metadata(meta)
at each candle:
   p_buy = model.predict_proba(features)[:, buy_idx]
   threshold = calib_buy.get(mode=os.getenv("CVN_THRESHOLD_MODE_BUY", "default"))
   if p_buy > threshold: emit BUY
   log_event(threshold_used, mode=..., effective=threshold, default=calib_buy.fitted_default)

5. The 3 modes — formulas and rationale

5.1 default — threshold optimized on training

Threshold value that maximizes an objective (see §6) on the validation set. Published as "the" operational value.

Typical usage: normal production, baseline for the other modes.

5.2 conservative — upward margin

effective = min(1.0, fitted_default + min(k_conservative, 2.0) × σ)

where k_conservative = CVN_THRESHOLD_CONSERVATIVE_SIGMA (default 0.5), capped at 2.0 by the hard circuit-breaker (§4.1).

Fewer trades, higher mean confidence per trade. σ is the p(BUY) dispersion on the val set — adaptive to each model.

Typical usage: live deployment with real capital, bear regimes, after drift detection.

5.3 dynamic — downward margin (gated, research-only at launch)

effective = max(0.0, fitted_default − min(k_dynamic, 2.0) × σ)

where k_dynamic = CVN_THRESHOLD_DYNAMIC_SIGMA (default 0.5), capped at 2.0 by the hard circuit-breaker (§4.1).

More trades, lower mean confidence. Useful to explore volume, stress false-positive robustness, or when the strategy needs denser statistical samples (research).

Gated OFF at launch (committee dissent #2). To enable: CVN_THRESHOLD_DYNAMIC_ENABLED=1 and CVN_THRESHOLD_MODE_BUY=dynamic. The guard exists because the mode's operational value is unproven and can amplify false positives in volatile regimes. Gating forces an explicit operator intent, logged in the config history (ADR-59). Phase 3 validation (§13) is the prerequisite for flipping the default ON.

Typical usage (once enabled): research / P0-A lever testing, liquid markets with low fees.
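As a worked example of the three formulas, the sketch below re-implements them as a free function and evaluates them on the example values from the §8 artifact (fitted_default = 0.37, σ = 0.082). The helper name effective_threshold is illustrative and is not part of the proposed class.

```python
CIRCUIT_BREAKER_SIGMA_CAP = 2.0  # same cap as in §4.1


def effective_threshold(fitted_default, proba_sigma, mode, k=0.5, dynamic_enabled=False):
    """Illustrative mirror of the §5 formulas (hypothetical helper)."""
    k = min(k, CIRCUIT_BREAKER_SIGMA_CAP)  # hard circuit-breaker on the margin
    if mode == "default":
        return fitted_default
    if mode == "conservative":
        return min(1.0, fitted_default + k * proba_sigma)
    if mode == "dynamic":
        if not dynamic_enabled:
            raise RuntimeError("dynamic mode is gated OFF")
        return max(0.0, fitted_default - k * proba_sigma)
    raise ValueError(f"Unknown mode: {mode}")


fitted, sigma = 0.37, 0.082
default = effective_threshold(fitted, sigma, "default")
conservative = effective_threshold(fitted, sigma, "conservative")           # 0.37 + 0.5 * 0.082
dynamic = effective_threshold(fitted, sigma, "dynamic", dynamic_enabled=True)  # 0.37 - 0.5 * 0.082
capped = effective_threshold(fitted, sigma, "conservative", k=10.0)         # k clipped to 2.0

print(round(default, 3), round(conservative, 3), round(dynamic, 3), round(capped, 3))
# prints: 0.37 0.411 0.329 0.534
```

The last call shows the circuit-breaker: a requested margin of 10σ is clipped to 2σ, so the threshold cannot move further than 0.164 away from the fitted value.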

5.4 Why σ-based rather than a fixed delta

Three options considered:

  • Fixed delta +0.05: simple, but ignores the model's natural dispersion. A model with p(BUY) highly concentrated around 0.3 would see its three modes at (0.30, 0.35, 0.25), where 0.35 is already extreme. Conversely, a model with a wide p(BUY) spread gets no real effect from the mode.
  • Quantile (e.g. 75th percentile of true positives): requires keeping val-set positives/negatives around, which complicates metadata. No simple geometric interpretation for dynamic.
  • σ-based (retained): naturally scales with the model distribution. The three modes keep a consistent relative spacing regardless of the model.


6. default calibration — method to decide (Phase 1)

Four candidates to evaluate against existing runs:

6.1 fbeta — maximize F_β(buy)

threshold* = argmax_t F_β(y_val == BUY, p_buy > t)

With β = CVN_BUY_BETA (already present, default 1.0). Direct alignment with the HPO objective — consistent.

Pros: simple, consistent with HPO, no new parameter. Risk: F_β does not account for asymmetric SL/TP costs.
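A minimal sketch of the fbeta search, assuming candidate thresholds are the observed probabilities (plus 0.0) and the decision rule is p > t as in §4.3. The function name and the toy arrays are illustrative; the real fit() may use a different candidate grid.

```python
import numpy as np


def fit_fbeta_threshold(y_true, p_pos, beta=1.0):
    """Return (threshold, score) maximizing F_beta for the rule p_pos > threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    p_pos = np.asarray(p_pos, dtype=float)
    b2 = beta * beta
    best_t, best_score = 0.0, -1.0
    # Candidates: 0.0 (accept every bar with p > 0) and each observed probability, ascending.
    for t in np.concatenate(([0.0], np.unique(p_pos))):
        pred = p_pos > t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = (1 + b2) * tp + b2 * fn + fp
        score = (1 + b2) * tp / denom if denom > 0 else 0.0
        if score > best_score:  # strict '>' keeps the lowest threshold on ties
            best_t, best_score = float(t), float(score)
    return best_t, best_score


# Toy validation slice (illustrative, not run data): 3 positives out of 8 bars.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.70, 0.90]

t_f1, f1 = fit_fbeta_threshold(y_val, p_val, beta=1.0)     # recall and precision weighted equally
t_f05, f05 = fit_fbeta_threshold(y_val, p_val, beta=0.5)   # precision weighted more heavily
print(t_f1, round(f1, 3), t_f05, round(f05, 3))
```

On this toy slice β = 1 selects 0.30 (F1 = 6/7) while β = 0.5 selects 0.50: lowering β favours precision and pushes the threshold up, which is the lever CVN_BUY_BETA would expose.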

6.2 youden — maximize TPR − FPR

threshold* = argmax_t (TPR(t) − FPR(t))

Optimal point on the ROC curve. Independent of priors / proportions.

Pros: classic, stable, parameter-free. Risk: does not reflect trade economics (a TP ≠ a FP in PnL terms).

6.3 expectancy — maximize expected PnL

threshold* = argmax_t [ TPR(t) × avg_win − FPR(t) × avg_loss ]

where avg_win and avg_loss are the TP and SL trade averages from the training set (fees included). Economically aligned.

Pros: close to the real business objective. Risk: strongly depends on avg_win/loss estimation (noisy on small datasets).
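The sensitivity to avg_win / avg_loss can be seen on a toy example. This sketch assumes the objective exactly as written above, candidate thresholds taken from the observed probabilities, and a "never trade" baseline with expectancy 0; the avg_win and avg_loss values are hypothetical.

```python
import numpy as np


def fit_expectancy_threshold(y_true, p_pos, avg_win, avg_loss):
    """Return the threshold maximizing TPR(t) * avg_win - FPR(t) * avg_loss."""
    y_true = np.asarray(y_true, dtype=bool)
    p_pos = np.asarray(p_pos, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()  # assumes both classes are present
    best_t, best_val = 1.0, 0.0                   # baseline: never trade, expectancy 0
    for t in np.unique(p_pos):
        pred = p_pos > t
        tpr = np.sum(pred & y_true) / n_pos
        fpr = np.sum(pred & ~y_true) / n_neg
        val = tpr * avg_win - fpr * avg_loss
        if val > best_val:
            best_t, best_val = float(t), float(val)
    return best_t


# Toy validation slice (illustrative, not run data).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.70, 0.90]

t_mild = fit_expectancy_threshold(y_val, p_val, avg_win=3.0, avg_loss=1.5)
t_harsh = fit_expectancy_threshold(y_val, p_val, avg_win=3.0, avg_loss=12.0)
print(t_mild, t_harsh)
```

With the same probabilities, raising avg_loss from 1.5 to 12.0 moves the selected threshold from 0.30 to 0.50. A noisy avg_loss estimate therefore translates directly into a noisy threshold, which is the risk noted above.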

6.4 target_rate — fix the expected BUY rate

threshold* = percentile(p_buy, 100 − target_rate_pct)

where target_rate_pct = expected fraction of BUY bars (parameter, e.g. 10%). The threshold adjusts so that about 10% of validation bars pass (exactly 10% only when p_buy has no ties at the cut).

Pros: predictable volume, useful for fixed-frequency strategies. Risk: disconnects the threshold from actual ML signal (a quantity is forced independently of predictive quality).
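A short sketch of the percentile rule using numpy.percentile with its default linear interpolation. The arrays are synthetic, and the second case illustrates an assumption worth checking in practice: piecewise-constant probabilities (as isotonic calibration can produce) create ties that break the target rate.

```python
import numpy as np


def fit_target_rate_threshold(p_pos, target_rate_pct):
    """Threshold such that roughly target_rate_pct % of bars satisfy p_pos > threshold."""
    return float(np.percentile(p_pos, 100.0 - target_rate_pct))


# Case 1: 100 distinct probabilities 0.01, 0.02, ..., 1.00.
p_distinct = np.linspace(0.01, 1.00, 100)
t_distinct = fit_target_rate_threshold(p_distinct, 10.0)
rate_distinct = float((p_distinct > t_distinct).mean())

# Case 2: heavy ties, 95 bars at 0.2 and 5 bars at 0.6.
p_tied = np.array([0.2] * 95 + [0.6] * 5)
t_tied = fit_target_rate_threshold(p_tied, 10.0)
rate_tied = float((p_tied > t_tied).mean())

print(round(t_distinct, 3), rate_distinct)  # target met on distinct values
print(t_tied, rate_tied)                    # only 5% pass when the cut falls inside a tie
```

In the tied case the 90th percentile lands on the 0.2 plateau, so only the five bars at 0.6 pass and the realized rate is 5% rather than the requested 10%.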

6.5 Decision (v2 — committee arbitration)

fbeta is locked as the Phase 2 default. Rationale:

  • Already the HPO objective (CVN_HPO_OBJECTIVE=fbeta_buy): consistent with training, no second-order alignment drift.
  • Stable on small datasets (committee concern on expectancy).
  • Single parameter β already surfaced via CVN_BUY_BETA.

expectancy stays a first-class candidate but is gated behind a Phase 3 empirical validation that must demonstrate:

  • inter-fold stability std(threshold_fitted) < 0.05 on datasets with ≥100 trades/fold (not the current 0–30 regime),
  • avg_win / avg_loss estimates stable across folds (cv < 0.3),
  • Sortino net-15bps ≥ fbeta default on at least 2/3 exploration cryptos.

Until these criteria are met, expectancy is allowed via CVN_THRESHOLD_DEFAULT_METHOD=expectancy but emits a threshold_method_experimental WARNING.

youden and target_rate remain documented but unflagged — research options, not shipping defaults.


7. Environment variables

Var | Type | Default | Role
CVN_THRESHOLD_MODE_BUY | enum | default | BUY pilot mode; FTF factor (A/B testable, ADR-56)
CVN_THRESHOLD_MODE_SELL | enum | disabled | SELL pilot mode (disabled = spot only)
CVN_THRESHOLD_CONSERVATIVE_SIGMA | float | 0.5 | Upward margin in σ (capped at 2.0)
CVN_THRESHOLD_DYNAMIC_SIGMA | float | 0.5 | Downward margin in σ (capped at 2.0)
CVN_THRESHOLD_DYNAMIC_ENABLED | bool | 0 | Required to activate dynamic mode (§5.3)
CVN_THRESHOLD_DEFAULT_METHOD | enum | fbeta | Calibration method (fbeta locked; youden/expectancy/target_rate experimental)
CVN_THRESHOLD_TARGET_RATE_PCT | float | 10.0 | Used only when method=target_rate
CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT | float | 30.0 | Alert if live sigma(p_buy) drifts > x% from fitted.proba_sigma
CVN_THRESHOLD_CALIBRATOR_DISABLE | bool | 0 | Kill switch: bypass calibrator entirely, use DISABLE_VALUE (incident response only)
CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE | float | 0.5 | Hard threshold used when kill switch is ON

Backward compatibility

If CVN_THRESHOLD_MODE_BUY is absent and CVN_THRESHOLD_BUY is present: runtime logs a threshold_buy_deprecated_fixed_used warning and falls back to the old fixed value. Decommissioning: 1 sprint after merge, then drop CVN_THRESHOLD_BUY.
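The precedence between the kill switch, the legacy variable, and the calibrated path could be sketched as follows. The ordering (kill switch first, then legacy fallback, then calibrator) and the "1" encoding of booleans are assumptions read from §3 P4/P6 and the table above; resolve_buy_threshold and the calibrator_get callable are hypothetical names.

```python
import logging
from typing import Callable, Mapping, Optional

log = logging.getLogger("threshold_calibrator")

# Hypothetical signature: (mode, dynamic_enabled) -> effective threshold.
ThresholdGetter = Callable[[str, bool], float]


def resolve_buy_threshold(env: Mapping[str, str],
                          calibrator_get: Optional[ThresholdGetter]) -> float:
    # 1. Kill switch: bypass the calibrator entirely (incident response only).
    if env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE", "0") == "1":
        log.warning("threshold_calibrator_kill_switch_on")
        return float(env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE", "0.5"))
    mode = env.get("CVN_THRESHOLD_MODE_BUY")
    # 2. Legacy fallback: honored only when the new mode variable is absent.
    if mode is None and "CVN_THRESHOLD_BUY" in env:
        log.warning("threshold_buy_deprecated_fixed_used")
        return float(env["CVN_THRESHOLD_BUY"])
    # 3. Calibrated path: a missing calibrator is a hard failure, never a silent default.
    if calibrator_get is None:
        raise RuntimeError("threshold_calibrator_missing")
    dynamic_enabled = env.get("CVN_THRESHOLD_DYNAMIC_ENABLED", "0") == "1"
    return calibrator_get(mode or "default", dynamic_enabled)


# Stand-in for ThresholdCalibrator.get, using the §8 example values.
def fake_get(mode: str, dynamic_enabled: bool) -> float:
    return {"default": 0.37, "conservative": 0.411}[mode]


legacy = resolve_buy_threshold({"CVN_THRESHOLD_BUY": "0.5"}, None)
calibrated = resolve_buy_threshold(
    {"CVN_THRESHOLD_MODE_BUY": "conservative", "CVN_THRESHOLD_BUY": "0.5"}, fake_get
)
print(legacy, calibrated)
```

Passing the environment as a mapping rather than reading os.environ directly keeps the resolution logic pure and easy to unit-test.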


8. MLflow persistence

Artifact threshold_calibrator_buy.json at the same level as the model:

{
  "class_label": "BUY",
  "fitted_default": 0.37,
  "proba_sigma": 0.082,
  "fit_method": "fbeta",
  "fit_method_params": {"fbeta_beta": 1.0},
  "fit_on_calibrated_proba": true,
  "n_fit": 14523,
  "created_at": "2026-04-20T12:24:55Z",
  "feature_set_version": "v2-abc123",
  "calibration_method": "isotonic",
  "git_sha": "c620f23b"
}

Loaded by FeatureEngineeringAPI.from_mlflow_run() or equivalent when the model is loaded → ThresholdCalibrator.from_metadata().

8.1 Post-Platt/isotonic enforcement (P1)

fit() enforces fit_on_calibrated_proba=True. At training time the trainer must apply CVN_CALIBRATION_METHOD (isotonic/Platt) to y_proba_val before passing it to the calibrator. Raw (uncalibrated) scores are not guaranteed to match the observed BUY rate, so both the fitted threshold and the sigma-based margins lose their probabilistic interpretation.

This is enforced by:

  • a RuntimeError if fit() is called with fit_on_calibrated_proba=False outside of unit tests,
  • a schema validation at inference: the loaded artifact must carry fit_on_calibrated_proba=true, otherwise hard-fail (ADR-25).
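The inference-side schema check might look like the sketch below. It uses only the standard json module; the list of required keys is an assumption derived from the example artifact in §8, and load_calibrator_metadata is a hypothetical helper, not the actual from_metadata implementation.

```python
import json

# Assumption: the subset of §8 fields that inference cannot work without.
REQUIRED_KEYS = {
    "class_label", "fitted_default", "proba_sigma",
    "fit_method", "n_fit", "fit_on_calibrated_proba",
}


def load_calibrator_metadata(raw: str) -> dict:
    """Parse and validate a threshold_calibrator_*.json payload, failing hard on violations."""
    meta = json.loads(raw)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"calibrator artifact missing keys: {sorted(missing)}")
    if meta["fit_on_calibrated_proba"] is not True:
        raise RuntimeError("calibrator was not fit on calibrated probabilities")
    if not 0.0 <= meta["fitted_default"] <= 1.0:
        raise ValueError("fitted_default must lie in [0, 1]")
    return meta


good = json.dumps({
    "class_label": "BUY", "fitted_default": 0.37, "proba_sigma": 0.082,
    "fit_method": "fbeta", "n_fit": 14523, "fit_on_calibrated_proba": True,
})
meta = load_calibrator_metadata(good)
print(meta["fitted_default"], meta["fit_on_calibrated_proba"])
```

The `is not True` comparison is deliberate: a missing-but-truthy value such as the string "true" is rejected rather than coerced.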


9. Monitoring

9.1 Logs

event=threshold_used
class=BUY
mode=default
effective=0.370
fitted_default=0.370
sigma=0.082
adjustment=0.000
p_buy_current=0.41
triggered=true

Logged on every decision where p(positive) crosses the effective threshold of at least one of the three modes. This lets us estimate how many trades would have fired under a different mode without executing them.
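One way to produce such a record is sketched below. The field names follow the sample log above; the would_trigger field is a hypothetical addition that carries the counterfactual per-mode outcome, and build_threshold_event is an illustrative helper.

```python
def build_threshold_event(p_buy, fitted_default, sigma, mode, k=0.5, class_label="BUY"):
    """Build a threshold_used log record plus a per-mode counterfactual (illustrative)."""
    # Thresholds of all three modes, computed for logging only (no trade is emitted here).
    thresholds = {
        "default": fitted_default,
        "conservative": min(1.0, fitted_default + k * sigma),
        "dynamic": max(0.0, fitted_default - k * sigma),
    }
    effective = thresholds[mode]
    return {
        "event": "threshold_used",
        "class": class_label,
        "mode": mode,
        "effective": round(effective, 3),
        "fitted_default": round(fitted_default, 3),
        "sigma": round(sigma, 3),
        "adjustment": round(effective - fitted_default, 3),
        "p_buy_current": p_buy,
        "triggered": p_buy > effective,
        # Hypothetical extra field: which modes would have fired on this bar.
        "would_trigger": {m: p_buy > t for m, t in thresholds.items()},
    }


event = build_threshold_event(p_buy=0.41, fitted_default=0.37, sigma=0.082, mode="default")
print(event["triggered"], event["would_trigger"])
```

For the sample values (p_buy = 0.41), the bar fires under default and dynamic but not under conservative, whose threshold is 0.411.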

9.2 Grafana

New panel pipeline_threshold_calibration:

  • p(BUY) histogram OOS per crypto × fold
  • Vertical lines at the three thresholds (default / conservative / dynamic)
  • Observed action_rate per mode (table)
  • Drift: daily mean(p_buy) → alert if drift > 3σ over 7 days

9.3 Alerts

Alert | Signal | Threshold | Severity | Action
threshold_buy_deprecated_fixed_used | Legacy env var still in use | on ingress | WARN | Migrate the run's config to _MODE_BUY
threshold_calibrator_missing | Model loaded without calibrator and kill switch OFF | on inference startup | CRIT | Stop, retrain, redeploy
threshold_calibrator_kill_switch_on | CVN_THRESHOLD_CALIBRATOR_DISABLE=1 | per inference | WARN | Verify this is an authorised incident response
action_rate_deviation | action_rate current vs historical mean | > 3σ over 24h | WARN | Check drift + mode; consider conservative
proba_sigma_drift | sigma(p_buy)_live vs fitted.proba_sigma | Δ > CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT over 7d | WARN | Retrain recommended; margins may have lost meaning
dynamic_mode_enabled_in_live | MODE=dynamic on CVN_SYSTEM_STATUS=active | on config change | CRIT | Validate P3 criteria are met before
threshold_method_experimental | CVN_THRESHOLD_DEFAULT_METHOD != fbeta | on training | WARN | Expected in Phase 3 research runs only
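The proba_sigma_drift signal can be sketched as a relative deviation between the live and fitted dispersion. The population standard deviation (numpy default, ddof=0) and the absolute relative difference are assumptions; the document does not fix either choice.

```python
import numpy as np


def proba_sigma_drift_pct(p_buy_live, fitted_sigma):
    """Relative drift (in %) of the live std of p(BUY) versus the fitted proba_sigma."""
    live_sigma = float(np.std(p_buy_live))  # population std; assumption, see lead-in
    return 100.0 * abs(live_sigma - fitted_sigma) / fitted_sigma


def sigma_drift_alert(p_buy_live, fitted_sigma, max_drift_pct=30.0):
    """True when the drift exceeds CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT (default 30.0)."""
    return proba_sigma_drift_pct(p_buy_live, fitted_sigma) > max_drift_pct


fitted_sigma = 0.082                                  # example value from the §8 artifact
calm = sigma_drift_alert([0.2, 0.4], fitted_sigma)    # live std 0.1, about 22% drift
wide = sigma_drift_alert([0.1, 0.5], fitted_sigma)    # live std 0.2, about 144% drift
print(calm, wide)
```

In production the live array would cover the 7-day window named in the alert table rather than two points; the tiny inputs here only keep the arithmetic checkable by hand.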

9.4 Runbooks

Each CRIT/WARN alert maps to a runbook in documentation/runbooks/threshold_calibrator_*.md (Phase 2 deliverable):

  1. threshold_calibrator_missing → model from pre-cutover retraining or artifact write failed. Check MLflow run for the artifact; if missing, retrain. Do not flip the kill switch without a written incident ticket.
  2. proba_sigma_drift → model seeing a different p(BUY) distribution than what it was fit on (regime change, feature drift, upstream enrichment bug). Prescribed action: switch MODE=conservative if live trading, trigger retraining job within 24h, investigate root cause (feature drift panel in Grafana).
  3. action_rate_deviation → model has drifted or the wrong mode is set. Decision tree: (a) check proba_sigma_drift — if firing too, retrain; (b) else verify CVN_THRESHOLD_MODE_BUY matches the operator's intent in the config history (ADR-59).
  4. dynamic_mode_enabled_in_live → a config snapshot enabled dynamic mode on a production crypto. Immediately revert to the previous snapshot via Console Restore (#606), then review intent in the config history.

10. Migration path

Step 1 — Implementation (Phase 2 of #608)

  • Class + unit tests + training integration + inference integration
  • Backward compat active: the new parameter takes precedence when set; otherwise the legacy CVN_THRESHOLD_BUY fallback applies (§7)
  • Kill switch wired up to CVN_THRESHOLD_CALIBRATOR_DISABLE
  • Grafana panel + alerts per §9 shipped in the same PR (no "observability later")

Step 2 — Shadow validation (Phase 2 smoke)

  • Flag CVN_THRESHOLD_CALIBRATOR_ENABLED=0 in production, =1 in shadow runs
  • Shadow-compare legacy fixed threshold vs calibrated threshold on the exploration set (OP/AAVE/LDO) — same 3-class and binary modes
  • Block cutover if shadow reveals any divergence > 3σ in action_rate that is unexplained

Step 3 — A/B FTF factor (Phase 3, ADR-56)

Introduce threshold_mode_buy as a first-class FTF factor with values {default, conservative} (dynamic stays gated out). This produces a clean triptych per mode, directly comparable. The ADR-56 factor mechanism also guarantees:

  • deterministic recording of the active mode per run,
  • automatic persistence in finetune_results,
  • BH-corrected statistical testing on Sortino net-15bps + f1_buy + Brier/ECE (ADR-29 baseline comparison),
  • Console one-click restore to a prior mode (#606).

Success = default mode yields a triptych consistent with the 3-class baseline on ≥ 2/3 exploration cryptos; conservative shows the expected gradient (fewer trades, higher f1).

Step 4 — Cutover

  • Merge Phase 2, deploy image
  • All new runs automatically produce calibrators
  • Old models load via fallback (warning logs, A/B factor off)

Step 5 — Deprecation

  • 1 sprint after cutover: remove CVN_THRESHOLD_BUY from code, every production model carries a calibrator
  • 2 sprints after cutover: revisit dynamic gating based on Phase 3 data — flip to enabled-by-default only if P3 criteria (§13) are met.

10bis. Position in the 9-filter chain

The calibrator sits inside the inference stage, specifically replacing the fixed-threshold cutoff in the confidence gate. It does not replace the confidence_filter plugin — that plugin stays in place and receives the calibrator output as its threshold. The filter chain order is unchanged (CLAUDE.md §Chaîne de filtres):

CUSUM → Trend → Inference + Confidence(calibrator-driven) → Meta-Label → Regime → Cost → Kelly → Confirmation → Quality

Concretely inside stage 3 (Inference + Confidence):

model.predict_proba(features) → y_proba
(isotonic / Platt) → y_proba_calibrated           # §8.1 enforced
threshold = calibrator.get(mode=CVN_THRESHOLD_MODE_BUY,
                           dynamic_enabled=CVN_THRESHOLD_DYNAMIC_ENABLED)
confidence_filter(p_buy > threshold) → pass / block

Meta-label integration (future): if meta-labeling is enabled (CVN_USE_META_LABEL=1), the calibrator applies to the meta classifier's p(BUY), not the primary. The primary classifier uses argmax (its output is binary pass/no-pass for the meta stage, not a tradable decision). Two calibrator instances in MLflow: threshold_calibrator_buy_primary.json and threshold_calibrator_buy_meta.json, loaded in order.


11. SELL extensibility

When CVNTrade adds short-selling:

# Training
calib_sell = ThresholdCalibrator.fit(
    y_val == SELL_idx,
    y_proba_val[:, SELL_idx],   # same validation probabilities as the BUY fit (§4.2)
    class_label="SELL",
)
model_artifacts["threshold_calibrator_sell"] = calib_sell.to_metadata()

# Inference
mode_sell = os.getenv("CVN_THRESHOLD_MODE_SELL", "disabled")
if mode_sell != "disabled":   # "disabled" is not a calibrator mode: skip the SELL path
    calib_sell = ThresholdCalibrator.from_metadata(meta["threshold_calibrator_sell"])
    if p_sell > calib_sell.get(mode=mode_sell):
        emit SELL

Zero code change inside ThresholdCalibrator. The env var CVN_THRESHOLD_MODE_SELL=disabled defaults to no-op while the product is not opened up.


12. Multi-class extensibility (3-class vs binary)

The calibrator only sees y_proba_positive, so:

  • Binary: y_proba_positive = y_proba[:, 1]
  • 3-class BUY: y_proba_positive = y_proba[:, 2]
  • 3-class SELL: y_proba_positive = y_proba[:, 0]

The trainer extracts the right column depending on is_binary_classification() and the class direction, then passes the array to the calibrator.

Decision on 3-class (v2): keep argmax + threshold filter (two gates). Rationale:

  • Backward-compatible with existing 3-class production behaviour: the calibrator only adds a confidence floor.
  • A "threshold only" policy in 3-class would require a joint calibration across BUY and SELL probabilities (p_buy > t_buy and p_buy > p_sell? or p_buy > t_buy alone?), which is too much decision surface for v1.
  • The argmax gate naturally filters out ambiguous regions where p_buy ~ p_hold, which a raw threshold cannot. Keeping both gates costs nothing at inference (two cheap comparisons).

Concretely in 3-class:

if argmax(y_proba) == BUY_IDX AND y_proba[:, BUY_IDX] > calib_buy.get(mode):
    emit BUY
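A vectorized sketch of the two-gate rule, assuming the column order from this section (SELL = 0, HOLD = 1, BUY = 2) and a threshold already resolved from the calibrator. The probability rows are made up to exercise each gate.

```python
import numpy as np

SELL_IDX, HOLD_IDX, BUY_IDX = 0, 1, 2  # column order assumed from §12


def buy_signals_3class(y_proba, threshold):
    """BUY fires only if BUY is the argmax class AND p(BUY) clears the calibrated floor."""
    y_proba = np.asarray(y_proba, dtype=float)
    is_top_class = np.argmax(y_proba, axis=1) == BUY_IDX   # gate 1: argmax
    above_floor = y_proba[:, BUY_IDX] > threshold          # gate 2: calibrated threshold
    return is_top_class & above_floor


y_proba = [
    [0.20, 0.50, 0.30],  # HOLD wins the argmax: blocked by gate 1
    [0.10, 0.30, 0.60],  # BUY wins and 0.60 > 0.37: passes both gates
    [0.33, 0.32, 0.35],  # BUY wins but 0.35 <= 0.37: blocked by gate 2
    [0.50, 0.10, 0.40],  # 0.40 > 0.37 but SELL wins: blocked by gate 1
]
signals = buy_signals_3class(y_proba, threshold=0.37)
print(signals.tolist())  # prints: [False, True, False, False]
```

The third and fourth rows show why neither gate is redundant: each one blocks a case the other would let through.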


13. Validation plan (Phase 3 of #608)

13.1 Dataset

Three exploration cryptos (OP, AAVE, LDO), five folds, fifteen HPO trials.

13.2 Configs tested

  • Baseline: CVN_THRESHOLD_BUY=0.5 (legacy, reference)
  • default mode: fitted calibrator
  • conservative: default + 0.5σ
  • dynamic: default − 0.5σ

13.3 Success criteria

  • default mode yields n_trades ≥ 30 per fold (vs. 0–3 in today's binary baseline)
  • f1_buy OOS ≥ 3-class baseline (0.29) on at least 2/3 cryptos
  • conservative → default → dynamic must show a consistent gradient: rising n_trades and falling f1
  • Inter-fold stability: std(threshold_fitted) < 0.05 across the five folds
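The inter-fold stability criterion reduces to a one-line check. The sketch assumes the population standard deviation (ddof=0) over the five fold thresholds; the threshold values are hypothetical, not Phase 3 results.

```python
import numpy as np


def threshold_is_stable(fold_thresholds, max_std=0.05):
    """True when std(threshold_fitted) across folds is below the §13.3 limit."""
    return float(np.std(fold_thresholds)) < max_std


stable = threshold_is_stable([0.35, 0.37, 0.36, 0.39, 0.38])    # std about 0.014
unstable = threshold_is_stable([0.20, 0.45, 0.30, 0.55, 0.35])  # std about 0.121
print(stable, unstable)
```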

13.4 Deliverables

  • Table mode × crypto × {f1, n_trades, Sortino, Brier, ECE}
  • p(BUY) histograms with per-mode vertical lines
  • Final decision on the production default mode

14. Decisions taken (v2 — was "open questions" in v1)

All five v1 open questions are resolved post committee review. Kept here as a decision log.

  1. Fold-level vs global calibration → fold-level. Each fold's calibrator is fit on that fold's val slice. The artifact published in MLflow is the fold's fit; cross-fold aggregation happens at reporting time (std(threshold) is a triptych consistency metric). Rationale: consistent with ADR-14 (multi-fold evaluation as the unit of truth).

  2. Post-Platt/isotonic calibration → enforced calibrated (§8.1). fit_on_calibrated_proba=True is a schema invariant; inference hard-fails otherwise.

  3. Asymmetric costs (PTE changes) → re-fit guaranteed by retraining. The calibrator is a training-time artifact; any PTE change triggers retraining, which produces a new calibrator. Documented in §8 (artifact carries feature_set_version).

  4. Multi-step (meta-label) classifier → calibrator on the meta stage, two artifacts published when CVN_USE_META_LABEL=1 (§10bis).

  5. Threshold vs confidence_filter plugin → stacked, not replaced (§10bis). The plugin reads the calibrator threshold as its parameter; the chain order is unchanged.

New open questions (v2, to revisit in Phase 3)

  1. dynamic mode gate flip — under what empirical evidence do we flip CVN_THRESHOLD_DYNAMIC_ENABLED to 1 by default? Tentative criterion: two consecutive quarterly retraining cycles where dynamic outperforms default on the Sortino net-15bps triptych AND shows no action_rate_deviation alert firings in live.

  2. expectancy promotion — if Phase 3 shows it beats fbeta on ≥2/3 exploration cryptos with ≥100 trades/fold, consider promoting to default for cryptos matching those liquidity criteria (not a blanket promotion).

  3. Calibrator TTL / retraining cadence — should the artifact carry a max-age that triggers a retraining alert? Related to the static vs walk-forward debate (P7) but lives in the MLflow lifecycle, not the calibrator itself.


15. References

  • Run evidence: ftf_20260420_122455_d94ad0_ATR1.5_3.0_H5 (binary collapse)
  • Baseline evidence: ftf_20260420_082137_635720_ATR1.5_3.0_H5 (3-class 15m baseline)
  • Parent issue: #608
  • P0-A issue (trigger): #596
  • Prior art:
      • Hand, 2009, Measuring classifier performance: a coherent alternative to the area under the ROC curve
      • Lopez de Prado, 2018, Advances in Financial Machine Learning, ch. 12 (meta-labeling + threshold tuning)
      • Provost & Fawcett, 2001, Robust Classification for Imprecise Environments (optimal operating point on ROC)
  • Relevant ADRs: ADR-23 (version-pinned MLflow artifacts), ADR-30 (structured logs as a stable contract), ADR-59 (params in PostgreSQL, editable via Console)

16. Acceptance criteria for this document (Phase 1 of #608)

  • Problem documented with quantified evidence
  • Design principles stated explicitly (P1–P7, with ADR-15 clarified)
  • Three modes detailed with formulas (dynamic gated OFF by default)
  • Hard circuit-breaker at ±2σ (§4.1)
  • Class architecture + signatures
  • Four default_method candidates described, decision taken: fbeta locked, expectancy experimental (§6.5)
  • Env vars listed with defaults (including kill switch, drift threshold, dynamic enable flag)
  • MLflow persistence schema defined + post-Platt/isotonic enforcement (§8.1)
  • Training flow + inference flow diagrammed
  • Monitoring + alerts + runbooks (§9.4)
  • Filter-chain position (§10bis) + meta-label integration
  • Migration path in 5 steps including A/B FTF factor (ADR-56)
  • SELL + 3-class extensibility covered (3-class: argmax + filter)
  • Phase 3 validation plan with success criteria
  • v1 open questions resolved (5/5)
  • Committee review completed (session 7371c57d, PASSED, strong consensus)
  • Final production default_mode decision — still pending, taken after Phase 3 empirical evidence on defi_top5

This document is now the spec for the Phase 2 implementation PR.