ThresholdCalibrator — Design document¶

Status: v2, committee-reviewed (session 7371c57d, PASSED, strong consensus)
Authors: CVNTrade research, 2026-04-20
Issue: #608

Changelog¶

v2 (2026-04-20) — post committee review. Integrates 10 recommendations and arbitrates 4 dissents: fbeta locked as Phase 2 default; dynamic mode shipped gated OFF with ±2σ circuit-breaker; A/B FTF factor added to the migration plan; post-Platt/isotonic calibration enforced; 3-class decision locked to argmax + threshold filter; ADR-15 intent clarified (static per model version, not walk-forward); kill switch, runbook, and filter-chain-interaction sections added.
v1 (2026-04-20) — initial draft submitted to committee.

1. Executive summary¶

The post-inference confidence filter compares p(BUY) to a threshold to decide whether to open a position. Today that threshold is a fixed value (CVN_THRESHOLD_BUY, default 0.5), implicitly tuned for the 3-class mode. When we switch to binary (lever #3 of P0-A), the p(BUY) distribution reshuffles and the inherited threshold mechanically becomes too strict — action_rate collapses to roughly 0%, the model stops opening trades.

This document proposes a component, ThresholdCalibrator, that:
- automatically calibrates a threshold on the validation fold at training time (at the end of HPO)
- exposes three operator modes (default / conservative / dynamic) to trade confidence against volume
- persists the fitted threshold in MLflow next to the model (moves with retraining, not with env vars)
- is extensible to the SELL class when CVNTrade opens up to derivatives

2. Problem statement¶

2.1 Evidence¶

Run ftf_20260420_122455_d94ad0 (lever #3 binary, baseline otherwise, config 2026-04-20-p0a-lever3):

Crypto	f1_buy	AUC	action_rate	n_trades
AAVE	0.059	0.540	0.5%	3
LDO	0.034	0.508	0.3%	0
OP	0.000	0.500	0.0%	1

Equivalent 3-class baseline (run ftf_20260420_082137_635720):

Crypto	f1_buy	AUC	action_rate
AAVE	0.323	0.678	36%
LDO	0.286	0.629	38%
OP	0.375	0.632	39%

2.2 Diagnosis¶

In 3-class mode, argmax(p_sell, p_hold, p_buy) is the natural decision: the BUY class wins as soon as p(BUY) exceeds both p(HOLD) and p(SELL), which typically happens around 0.33–0.4. The 0.5 threshold is rarely the bottleneck.

In binary mode, the decision is directly p(BUY) > threshold. The p(BUY) distribution spreads over [0,1] — for a well-trained classifier with an imbalanced minority class (5–15% of bars are true BUY), p(BUY) stays below 0.5 most of the time even for genuine positives. The 0.5 threshold then sits almost always above p(BUY), driving action rate to ~0.

2.3 Requirement¶

A threshold mechanism that:
1. Adapts automatically to the p(BUY) distribution of the trained classifier (no manual tuning per classification mode)
2. Lets the operator steer the confidence vs volume tradeoff without touching code or model
3. Stays stable inter-fold (no overfitting bias to a specific fold)
4. Extends to the SELL class when CVNTrade adds short-selling

3. Design principles¶

P1 — The threshold is a model artifact, not a fixed env var¶

The optimal threshold depends on the p(BUY) distribution learned by a given classifier. It must therefore move with retraining, not with a config redeploy. Consequence: the fitted threshold is persisted in MLflow next to the model, loaded at inference as a feature artifact (ADR-23 compliance).

P2 — Only modes are exposed to the operator¶

The operator does not pick a numeric value (error-prone, easy to forget between two runs) but a mode from {default, conservative, dynamic}. The mode → effective value mapping is computed at inference from the fitted threshold plus a margin (also parameterized globally, not per-model).

P3 — Calibration is class-agnostic¶

The ThresholdCalibrator class knows nothing of BUY/SELL business meaning. It receives y_proba_positive and returns a threshold. One instance per class to calibrate (BUY today, SELL tomorrow) — orthogonal to is_binary_classification() and the number of classes.

P4 — Backward compatibility without magic¶

CVN_THRESHOLD_BUY (today's fixed value) is still honored when CVN_THRESHOLD_MODE_BUY is absent — with an explicit deprecation warning at runtime. Smooth migration, no big-bang.

P5 — Observable¶

Every inference logs the active mode and the effective value. Grafana shows a p(BUY) histogram with a vertical line at the effective threshold per mode.

P6 — Fail-fast + kill switch¶

Missing calibrator at inference → hard fail (no silent fallback, ADR-25). The only way to bypass is an explicit CVN_THRESHOLD_CALIBRATOR_DISABLE=1 kill switch, which emits a WARNING every inference and forces a hard-coded CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE (default 0.5). This exists for incident response (bad calibrator in production), not for steady state.

P7 — Static per model version (clarifies ADR-15)¶

The calibrator is fit once per model version on the validation fold at training time. This is the same walk-forward loop as the model itself (ADR-15 compliant): each retraining produces a new calibrator on the fresh OOS slice. We explicitly do not recalibrate the threshold live in production against rolling windows — that would be walk-forward over walk-forward, a second-order problem out of scope. Operator adjustments in production go through the mode (default / conservative), not threshold re-fitting.

4. Architecture¶

4.1 Central class¶

# src/commun/trading/threshold_calibrator.py
from dataclasses import dataclass
from typing import Literal
import numpy as np

Mode = Literal["default", "conservative", "dynamic"]
Method = Literal["fbeta", "youden", "expectancy", "target_rate"]

# Hard circuit-breaker: no mode can push the threshold outside
# fitted_default ± 2σ. Protects against runaway false positives if
# σ is overestimated on a degenerate val set, or if an operator
# sets CVN_THRESHOLD_DYNAMIC_SIGMA to an extreme value.
CIRCUIT_BREAKER_SIGMA_CAP = 2.0


@dataclass
class ThresholdCalibrator:
    class_label: str                  # "BUY" or "SELL" (for logs only, no logic)
    fitted_default: float             # fitted optimal threshold (changes only on retraining)
    proba_sigma: float                # std of p(positive) on val set — margin scale
    fit_method: Method                # method used at fit time (audit)
    n_fit: int                        # number of val samples (stability)
    fit_on_calibrated_proba: bool     # True iff val probas had post-Platt/isotonic applied

    @classmethod
    def fit(
        cls,
        y_true: np.ndarray,           # {0, 1} binary, or {0, 1} with `== class_idx` pre-applied for 3-class
        y_proba_positive: np.ndarray, # p(positive) in [0, 1], POST-Platt/isotonic (enforced, see §8.1)
        class_label: str = "BUY",
        method: Method = "fbeta",
        fbeta_beta: float = 1.0,
        fit_on_calibrated_proba: bool = True,
    ) -> "ThresholdCalibrator": ...

    def get(
        self,
        mode: Mode = "default",
        conservative_sigma: float = 0.5,
        dynamic_sigma: float = 0.5,
        dynamic_enabled: bool = False,   # P2: ships OFF by default (committee dissent #2)
    ) -> float:
        if mode == "default":
            return self.fitted_default
        if mode == "conservative":
            k = min(conservative_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return min(1.0, self.fitted_default + k * self.proba_sigma)
        if mode == "dynamic":
            if not dynamic_enabled:
                raise RuntimeError(
                    "dynamic mode is gated OFF by default — enable via "
                    "CVN_THRESHOLD_DYNAMIC_ENABLED=1 after reading §5.3 and the runbook"
                )
            k = min(dynamic_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return max(0.0, self.fitted_default - k * self.proba_sigma)
        raise ValueError(f"Unknown mode: {mode}")

    def to_metadata(self) -> dict: ...
    @classmethod
    def from_metadata(cls, meta: dict) -> "ThresholdCalibrator": ...

Pure, testable, no direct MLflow / sklearn dependency (sklearn is only imported inside fit).

4.2 Training flow (post-HPO)¶

XGBoost trainer fit() completes
   ↓
_evaluate_model on X_val, y_val → y_proba_val
   ↓
apply CVN_CALIBRATION_METHOD (isotonic / Platt) to y_proba_val   # required by §8.1
   ↓
ThresholdCalibrator.fit(y_val, y_proba_val[:, buy_idx], method="fbeta") → calib_buy
   ↓
model_artifacts["threshold_calibrator_buy"] = calib_buy.to_metadata()
   ↓
MLflow log_dict(calib_buy.to_metadata(), "threshold_calibrator_buy.json")

4.3 Inference flow¶

Load model from MLflow → read threshold_calibrator_buy.json
   ↓
calib_buy = ThresholdCalibrator.from_metadata(meta)
   ↓
at each candle:
   p_buy = model.predict_proba(features)[:, buy_idx]
   threshold = calib_buy.get(mode=os.getenv("CVN_THRESHOLD_MODE_BUY", "default"))
   if p_buy > threshold: emit BUY
   log_event(threshold_used, mode=..., effective=threshold, default=calib_buy.fitted_default)

5. The 3 modes — formulas and rationale¶

5.1 `default` — threshold optimized on training¶

Threshold value that maximizes an objective (see §6) on the validation set. Published as "the" operational value.

Typical usage: normal production, baseline for the other modes.

5.2 `conservative` — upward margin¶

effective = min(1.0, fitted_default + min(k_conservative, 2.0) × σ)

where k_conservative = CVN_THRESHOLD_CONSERVATIVE_SIGMA (default 0.5), capped at 2.0 by the hard circuit-breaker (§4.1).

Fewer trades, higher mean confidence per trade. σ is the p(BUY) dispersion on the val set — adaptive to each model.

Typical usage: live deployment with real capital, bear regimes, after drift detection.

5.3 `dynamic` — downward margin (gated, research-only at launch)¶

effective = max(0.0, fitted_default − min(k_dynamic, 2.0) × σ)

where k_dynamic = CVN_THRESHOLD_DYNAMIC_SIGMA (default 0.5), capped at 2.0 by the hard circuit-breaker (§4.1).

More trades, lower mean confidence. Useful to explore volume, stress false-positive robustness, or when the strategy needs denser statistical samples (research).

Gated OFF at launch (committee dissent #2). To enable: CVN_THRESHOLD_DYNAMIC_ENABLED=1 and CVN_THRESHOLD_MODE_BUY=dynamic. The guard exists because the mode's operational value is unproven and can amplify false positives in volatile regimes. Gating forces an explicit operator intent, logged in the config history (ADR-59). Phase 3 validation (§13) is the prerequisite for flipping the default ON.

Typical usage (once enabled): research / P0-A lever testing, liquid markets with low fees.
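The arithmetic of the three modes can be checked with a small standalone sketch. It re-implements the `get()` logic from §4.1 as a free function (the name `effective_threshold` is hypothetical, introduced only for this illustration) and plugs in the values from the example artifact in §8 (fitted_default = 0.37, σ = 0.082).

```python
CAP = 2.0  # mirrors CIRCUIT_BREAKER_SIGMA_CAP from §4.1


def effective_threshold(fitted_default, sigma, mode, k=0.5, dynamic_enabled=False):
    """Return the effective threshold for a mode, clipped to [0, 1]."""
    if mode == "default":
        return fitted_default
    k = min(k, CAP)  # circuit-breaker: never move more than 2 sigma
    if mode == "conservative":
        return min(1.0, fitted_default + k * sigma)
    if mode == "dynamic":
        if not dynamic_enabled:
            raise RuntimeError("dynamic mode is gated OFF")
        return max(0.0, fitted_default - k * sigma)
    raise ValueError(f"Unknown mode: {mode}")


fitted, sigma = 0.37, 0.082  # values from the §8 example artifact
for mode in ("default", "conservative", "dynamic"):
    value = effective_threshold(fitted, sigma, mode, dynamic_enabled=True)
    print(mode, round(value, 3))  # 0.37, then 0.411, then 0.329

# An extreme k is capped at 2.0: 0.37 + 2.0 * 0.082
print(round(effective_threshold(fitted, sigma, "conservative", k=10.0), 3))  # 0.534
```

With the default k = 0.5, the conservative and dynamic thresholds sit 0.041 above and below the fitted value; even a misconfigured k cannot move them by more than 0.164 for this model.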

5.4 Why σ-based rather than a fixed delta¶

Three options considered:
- Fixed delta +0.05: simple, but ignores the model's natural dispersion. A model with p(BUY) highly concentrated around 0.3 would see its three modes at (0.30, 0.35, 0.25), where 0.35 is already extreme. Conversely a model with wide p(BUY) spread gets no real effect from the mode.
- Quantile (e.g. 75th percentile of true positives): requires keeping val-set positives/negatives around, complicates metadata. No simple geometric interpretation for dynamic.
- σ-based (retained): naturally scales with the model distribution. The three modes keep a consistent relative spacing regardless of the model.

6. `default` calibration — method to decide (Phase 1)¶

Four candidates to evaluate against existing runs:

6.1 `fbeta` — maximize F_β(buy)¶

threshold* = argmax_t F_β(y_val == BUY, p_buy > t)

With β = CVN_BUY_BETA (already present, default 1.0). Direct alignment with the HPO objective — consistent.

Pros: simple, consistent with HPO, no new parameter. Risk: F_β does not account for asymmetric SL/TP costs.
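A minimal, dependency-free sketch of the F_β sweep is given here for illustration. The function name `fbeta_threshold`, the candidate grid (0.0 plus every observed score), and the tie-breaking rule (lowest threshold wins) are assumptions of this sketch, not decisions taken by this document; the real `fit()` may use sklearn instead.

```python
def fbeta_threshold(y_true, p_buy, beta=1.0):
    """Sweep candidate thresholds and return (best_threshold, best_score)."""
    b2 = beta * beta
    candidates = [0.0] + sorted(set(p_buy))  # assumed grid: observed scores
    best_t, best_score = 0.0, -1.0
    for t in candidates:
        # Decision rule matches §6.1: predict BUY when p_buy > t
        tp = sum(1 for y, p in zip(y_true, p_buy) if y == 1 and p > t)
        fp = sum(1 for y, p in zip(y_true, p_buy) if y == 0 and p > t)
        fn = sum(1 for y, p in zip(y_true, p_buy) if y == 1 and p <= t)
        denom = (1 + b2) * tp + b2 * fn + fp
        score = (1 + b2) * tp / denom if denom else 0.0
        if score > best_score:  # strict comparison: ties keep the lowest threshold
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation slice: three true BUY bars out of six
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
t_star, f1 = fbeta_threshold(y_val, p_val, beta=1.0)
print(t_star, round(f1, 3))  # 0.2 0.857
```

On this toy slice the best cut sits at 0.2, well below 0.5, which is the same effect §2.2 describes for the binary mode.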

6.2 `youden` — maximize TPR − FPR¶

threshold* = argmax_t (TPR(t) − FPR(t))

Optimal point on the ROC curve. Independent of priors / proportions.

Pros: classic, stable, parameter-free. Risk: does not reflect trade economics (a TP ≠ a FP in PnL terms).

6.3 `expectancy` — maximize expected PnL¶

threshold* = argmax_t [ TPR(t) × avg_win − FPR(t) × avg_loss ]

where avg_win and avg_loss are the TP and SL trade averages from the training set (fees included). Economically aligned.

Pros: close to the real business objective. Risk: strongly depends on avg_win/loss estimation (noisy on small datasets).
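The sensitivity to avg_win / avg_loss can be seen on a toy example. The sketch below implements the objective exactly as written above (TPR × avg_win − FPR × avg_loss); the function name `expectancy_threshold`, the candidate grid, and the payoff numbers are illustrative assumptions.

```python
def expectancy_threshold(y_true, p_buy, avg_win, avg_loss):
    """argmax_t [TPR(t) * avg_win - FPR(t) * avg_loss] over observed scores.

    Assumes y_true contains at least one positive and one negative.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_val = 0.0, float("-inf")
    for t in [0.0] + sorted(set(p_buy)):
        tpr = sum(1 for y, p in zip(y_true, p_buy) if y == 1 and p > t) / n_pos
        fpr = sum(1 for y, p in zip(y_true, p_buy) if y == 0 and p > t) / n_neg
        value = tpr * avg_win - fpr * avg_loss
        if value > best_val:  # ties keep the lowest threshold
            best_t, best_val = t, value
    return best_t


y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.10, 0.20, 0.35, 0.40, 0.60, 0.80]

# Wins twice as large as losses: a permissive threshold is preferred
print(expectancy_threshold(y_val, p_val, avg_win=2.0, avg_loss=1.0))  # 0.2
# Losses three times as large as wins: the threshold moves up
print(expectancy_threshold(y_val, p_val, avg_win=1.0, avg_loss=3.0))  # 0.4
```

The same scores yield two different thresholds depending only on the payoff estimates, which is why noisy avg_win / avg_loss values translate directly into an unstable fitted threshold.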

6.4 `target_rate` — fix the expected BUY rate¶

threshold* = percentile(p_buy, 100 − target_rate_pct)

where target_rate_pct = expected fraction of BUY bars (parameter, e.g. 10%). The threshold adjusts so that about 10% of validation-set bars pass (exactly 10% only when there are no ties at the cut).

Pros: predictable volume, useful for fixed-frequency strategies. Risk: disconnects the threshold from actual ML signal (a quantity is forced independently of predictive quality).
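As a rough illustration of the percentile rule, the sketch below picks the cut by rank rather than by interpolated percentile. That choice, and the function name `target_rate_threshold`, are assumptions of this example; an implementation based on `numpy.percentile` would interpolate between neighbouring scores and could return a slightly different value.

```python
def target_rate_threshold(p_buy, target_rate_pct):
    """Return t such that roughly target_rate_pct % of scores satisfy p > t."""
    ordered = sorted(p_buy)
    n = len(ordered)
    # Number of bars that should pass, clamped so the index stays valid
    n_pass = min(n - 1, max(0, round(n * target_rate_pct / 100)))
    return ordered[n - n_pass - 1]


# Twenty evenly spaced scores: 0.05, 0.10, ..., 1.00
scores = [i / 20 for i in range(1, 21)]
t = target_rate_threshold(scores, target_rate_pct=10.0)
passing = [p for p in scores if p > t]
print(t, len(passing))  # 0.9 2
```

Two of the twenty bars pass, i.e. the requested 10%, regardless of whether those two bars carry any predictive signal.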

6.5 Decision (v2 — committee arbitration)¶

fbeta is locked as the Phase 2 default. Rationale:
- Already the HPO objective (CVN_HPO_OBJECTIVE=fbeta_buy): consistent with training, no second-order alignment drift.
- Stable on small datasets (committee concern on expectancy).
- Single parameter β already surfaced via CVN_BUY_BETA.

expectancy stays a first-class candidate but is gated behind a Phase 3 empirical validation that must demonstrate:
- inter-fold stability std(threshold_fitted) < 0.05 on datasets with ≥100 trades/fold (not the current 0–30 regime),
- avg_win / avg_loss estimates stable across folds (cv < 0.3),
- Sortino net-15bps ≥ fbeta default on at least 2/3 exploration cryptos.

Until these criteria are met, expectancy is allowed via CVN_THRESHOLD_DEFAULT_METHOD=expectancy but emits a threshold_method_experimental WARNING.

youden and target_rate remain documented as research options, not shipping defaults. Like any non-fbeta method, they emit the threshold_method_experimental WARNING (§9.3).

7. Environment variables¶

Var	Type	Default	Role
`CVN_THRESHOLD_MODE_BUY`	enum	`default`	BUY pilot mode — FTF factor (A/B testable, ADR-56)
`CVN_THRESHOLD_MODE_SELL`	enum	`disabled`	SELL pilot mode (disabled = spot only)
`CVN_THRESHOLD_CONSERVATIVE_SIGMA`	float	`0.5`	Upward margin in σ (capped at 2.0)
`CVN_THRESHOLD_DYNAMIC_SIGMA`	float	`0.5`	Downward margin in σ (capped at 2.0)
`CVN_THRESHOLD_DYNAMIC_ENABLED`	bool	`0`	Required to activate `dynamic` mode (§5.3)
`CVN_THRESHOLD_DEFAULT_METHOD`	enum	`fbeta`	Calibration method (`fbeta` locked; `youden`/`expectancy`/`target_rate` experimental)
`CVN_THRESHOLD_TARGET_RATE_PCT`	float	`10.0`	Used only when method=target_rate
`CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT`	float	`30.0`	Alert if live `sigma(p_buy)` drifts > x% from `fitted.proba_sigma`
`CVN_THRESHOLD_CALIBRATOR_DISABLE`	bool	`0`	Kill switch — bypass calibrator entirely, use `DISABLE_VALUE` (incident response only)
`CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE`	float	`0.5`	Hard threshold used when kill switch is ON

Backward compatibility¶

If CVN_THRESHOLD_MODE_BUY is absent and CVN_THRESHOLD_BUY is present: runtime logs a threshold_buy_deprecated_fixed_used warning and falls back to the old fixed value. Decommissioning: 1 sprint after merge, then drop CVN_THRESHOLD_BUY.
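The resolution order implied by P4, P6 and the table above can be sketched as a single function. This is an illustration only: the function name `resolve_buy_threshold` is hypothetical, the precedence of the kill switch over the legacy fallback is an assumption (the document does not state it), and `dynamic` mode is left out for brevity.

```python
def resolve_buy_threshold(env, calibrator_meta):
    """Return (threshold, warnings).

    env: mapping of environment variables (e.g. os.environ)
    calibrator_meta: dict loaded from threshold_calibrator_buy.json, or None
    """
    warnings = []
    # 1. Kill switch (P6): bypass the calibrator entirely. Assumed to win over everything.
    if env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE") == "1":
        warnings.append("threshold_calibrator_kill_switch_on")
        value = float(env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE", "0.5"))
        return value, warnings
    # 2. Legacy fallback (P4): fixed value, with a deprecation warning
    if "CVN_THRESHOLD_MODE_BUY" not in env and "CVN_THRESHOLD_BUY" in env:
        warnings.append("threshold_buy_deprecated_fixed_used")
        return float(env["CVN_THRESHOLD_BUY"]), warnings
    # 3. Calibrated path: a missing artifact is a hard failure, no silent fallback
    if calibrator_meta is None:
        raise RuntimeError("threshold_calibrator_missing")
    mode = env.get("CVN_THRESHOLD_MODE_BUY", "default")
    base = calibrator_meta["fitted_default"]
    sigma = calibrator_meta["proba_sigma"]
    if mode == "default":
        return base, warnings
    if mode == "conservative":
        k = min(float(env.get("CVN_THRESHOLD_CONSERVATIVE_SIGMA", "0.5")), 2.0)
        return min(1.0, base + k * sigma), warnings
    raise ValueError(f"mode not handled in this sketch: {mode}")


meta = {"fitted_default": 0.37, "proba_sigma": 0.082}
print(resolve_buy_threshold({"CVN_THRESHOLD_BUY": "0.5"}, meta))
print(resolve_buy_threshold({"CVN_THRESHOLD_MODE_BUY": "conservative"}, meta))
```

Passing the environment as a plain mapping keeps the function pure and easy to unit-test, in line with the "pure, testable" goal stated in §4.1.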

8. MLflow persistence¶

Artifact threshold_calibrator_buy.json at the same level as the model:

{
  "class_label": "BUY",
  "fitted_default": 0.37,
  "proba_sigma": 0.082,
  "fit_method": "fbeta",
  "fit_method_params": {"fbeta_beta": 1.0},
  "fit_on_calibrated_proba": true,
  "n_fit": 14523,
  "created_at": "2026-04-20T12:24:55Z",
  "feature_set_version": "v2-abc123",
  "calibration_method": "isotonic",
  "git_sha": "c620f23b"
}

Loaded by FeatureEngineeringAPI.from_mlflow_run() or equivalent when the model is loaded → ThresholdCalibrator.from_metadata().

8.1 Post-Platt/isotonic enforcement (P1)¶

fit() enforces fit_on_calibrated_proba=True. At training time the trainer must apply CVN_CALIBRATION_METHOD (isotonic/Platt) to y_proba_val before passing it to the calibrator. Raw (uncalibrated) scores are not guaranteed to match observed BUY frequencies, so both the fitted threshold and the sigma-based margins lose their probabilistic interpretation.

This is enforced by:
- a RuntimeError if fit() is called with fit_on_calibrated_proba=False outside of unit tests,
- a schema validation at inference: the loaded artifact must carry fit_on_calibrated_proba=true, otherwise hard-fail (ADR-25).
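The inference-side schema check could look roughly like the sketch below. The function name `load_calibrator_metadata`, the exact set of required keys, and the range check on fitted_default are assumptions made for this illustration; only the fit_on_calibrated_proba invariant comes from this section.

```python
import json

# Assumed minimal schema, taken from the fields shown in the §8 example artifact
REQUIRED_KEYS = {
    "class_label", "fitted_default", "proba_sigma",
    "fit_method", "n_fit", "fit_on_calibrated_proba",
}


def load_calibrator_metadata(raw):
    """Parse the JSON artifact and hard-fail on schema violations."""
    meta = json.loads(raw)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise RuntimeError(f"calibrator artifact missing keys: {sorted(missing)}")
    # Invariant from §8.1: the threshold must have been fit on calibrated probabilities
    if meta["fit_on_calibrated_proba"] is not True:
        raise RuntimeError("calibrator was not fit on calibrated probabilities")
    # Extra sanity check (assumption of this sketch)
    if not 0.0 <= meta["fitted_default"] <= 1.0:
        raise RuntimeError("fitted_default outside [0, 1]")
    return meta


artifact = json.dumps({
    "class_label": "BUY", "fitted_default": 0.37, "proba_sigma": 0.082,
    "fit_method": "fbeta", "n_fit": 14523, "fit_on_calibrated_proba": True,
})
print(load_calibrator_metadata(artifact)["fitted_default"])  # 0.37
```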

9. Monitoring¶

9.1 Logs¶

event=threshold_used
class=BUY
mode=default
effective=0.370
fitted_default=0.370
sigma=0.082
adjustment=0.000
p_buy_current=0.41
triggered=true

Logged on every decision where p(positive) exceeds the effective threshold of at least one of the three modes (lets us estimate how many trades would have fired under a different mode without executing them).

9.2 Grafana¶

New panel pipeline_threshold_calibration:
- p(BUY) histogram OOS per crypto × fold
- Vertical lines at the three thresholds (default / conservative / dynamic)
- Observed action_rate per mode (table)
- Drift: daily mean(p_buy) → alert if drift > 3σ over 7 days

9.3 Alerts¶

Alert	Signal	Threshold	Severity	Action
`threshold_buy_deprecated_fixed_used`	Legacy env var still in use	on ingress	WARN	Migrate the run's config to `_MODE_BUY`
`threshold_calibrator_missing`	Model loaded without calibrator and kill switch OFF	on inference startup	CRIT	Stop, retrain, redeploy
`threshold_calibrator_kill_switch_on`	`CVN_THRESHOLD_CALIBRATOR_DISABLE=1`	per inference	WARN	Verify this is an authorised incident response
`action_rate_deviation`	`action_rate` current vs historical mean	> 3σ over 24h	WARN	Check drift + mode; consider `conservative`
`proba_sigma_drift`	`sigma(p_buy)_live` vs `fitted.proba_sigma`	Δ > `CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT` over 7d	WARN	Retrain recommended; margins may have lost meaning
`dynamic_mode_enabled_in_live`	`MODE=dynamic` on `CVN_SYSTEM_STATUS=active`	on config change	CRIT	Validate that the Phase 3 criteria (§13) are met before enabling
`threshold_method_experimental`	`CVN_THRESHOLD_DEFAULT_METHOD != fbeta`	on training	WARN	Expected in Phase 3 research runs only
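As an illustration of the proba_sigma_drift rule, the sketch below compares the dispersion of live scores with the fitted value. The helper names are hypothetical, and the use of the population standard deviation is an assumption: the document does not say which estimator produces proba_sigma.

```python
import statistics


def proba_sigma_drift_pct(live_p_buy, fitted_sigma):
    """Relative drift (in %) of the live sigma(p_buy) versus the fitted sigma."""
    live_sigma = statistics.pstdev(live_p_buy)  # population std (assumption)
    return abs(live_sigma - fitted_sigma) / fitted_sigma * 100.0


def should_alert(live_p_buy, fitted_sigma, drift_pct_limit=30.0):
    """True when drift exceeds CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT (default 30.0)."""
    return proba_sigma_drift_pct(live_p_buy, fitted_sigma) > drift_pct_limit


fitted_sigma = 0.082  # from the §8 example artifact
print(should_alert([0.2, 0.4], fitted_sigma))  # live sigma ~0.10, drift ~22% -> False
print(should_alert([0.1, 0.5], fitted_sigma))  # live sigma ~0.20, drift ~144% -> True
```

In practice the live window would be the 7-day set of scores named in the alert table rather than a two-point list.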

9.4 Runbooks¶

Each CRIT/WARN alert maps to a runbook in documentation/runbooks/threshold_calibrator_*.md (Phase 2 deliverable):

threshold_calibrator_missing → model from pre-cutover retraining or artifact write failed. Check MLflow run for the artifact; if missing, retrain. Do not flip the kill switch without a written incident ticket.
proba_sigma_drift → model seeing a different p(BUY) distribution than what it was fit on (regime change, feature drift, upstream enrichment bug). Prescribed action: switch MODE=conservative if live trading, trigger retraining job within 24h, investigate root cause (feature drift panel in Grafana).
action_rate_deviation → model has drifted or the wrong mode is set. Decision tree: (a) check proba_sigma_drift — if firing too, retrain; (b) else verify CVN_THRESHOLD_MODE_BUY matches the operator's intent in the config history (ADR-59).
dynamic_mode_enabled_in_live → a config snapshot enabled dynamic mode on a production crypto. Immediately revert to the previous snapshot via Console Restore (#606), then review intent in the config history.

10. Migration path¶

Step 1 — Implementation (Phase 2 of #608)¶

Class + unit tests + training integration + inference integration
Backward compat active: the new parameter is read first, and the legacy CVN_THRESHOLD_BUY fallback still works
Kill switch wired up to CVN_THRESHOLD_CALIBRATOR_DISABLE
Grafana panel + alerts per §9 shipped in the same PR (no "observability later")

Step 2 — Shadow validation (Phase 2 smoke)¶

Flag CVN_THRESHOLD_CALIBRATOR_ENABLED=0 in production, =1 in shadow runs
Shadow-compare legacy fixed threshold vs calibrated threshold on the exploration set (OP/AAVE/LDO), in both 3-class and binary modes
Block cutover if shadow reveals any divergence > 3σ in action_rate that is unexplained

Step 3 — A/B FTF factor (Phase 3, ADR-56)¶

Introduce threshold_mode_buy as a first-class FTF factor with values {default, conservative} (dynamic stays gated out). This produces a clean triptych per mode, directly comparable. The ADR-56 factor mechanism also guarantees:
- deterministic recording of the active mode per run,
- automatic persistence in finetune_results,
- BH-corrected statistical testing on Sortino net-15bps + f1_buy + Brier/ECE (ADR-29 baseline comparison),
- Console one-click restore to a prior mode (#606).

Success = default mode yields a triptych consistent with the 3-class baseline on ≥ 2/3 exploration cryptos; conservative shows the expected gradient (fewer trades, higher f1).

Step 4 — Cutover¶

Merge Phase 2, deploy image
All new runs automatically produce calibrators
Old models load via fallback (warning logs, A/B factor off)

Step 5 — Deprecation¶

1 sprint after cutover: remove CVN_THRESHOLD_BUY from code, every production model carries a calibrator
2 sprints after cutover: revisit dynamic gating based on Phase 3 data — flip to enabled-by-default only if P3 criteria (§13) are met.

10bis. Position in the 9-filter chain¶

The calibrator sits inside the inference stage, specifically replacing the fixed-threshold cutoff in the confidence gate. It does not replace the confidence_filter plugin — that plugin stays in place and receives the calibrator output as its threshold. The filter chain order is unchanged (CLAUDE.md §Chaîne de filtres):

CUSUM → Trend → Inference + Confidence(calibrator-driven) → Meta-Label → Regime → Cost → Kelly → Confirmation → Quality

Concretely inside stage 3 (Inference + Confidence):

model.predict_proba(features) → y_proba
       ↓
(isotonic / Platt) → y_proba_calibrated           # §8.1 enforced
       ↓
threshold = calibrator.get(mode=CVN_THRESHOLD_MODE_BUY,
                           dynamic_enabled=CVN_THRESHOLD_DYNAMIC_ENABLED)
       ↓
confidence_filter(p_buy > threshold) → pass / block

Meta-label integration (future): if meta-labeling is enabled (CVN_USE_META_LABEL=1), the calibrator applies to the meta classifier's p(BUY), not the primary. The primary classifier uses argmax (its output is binary pass/no-pass for the meta stage, not a tradable decision). Two calibrator instances in MLflow: threshold_calibrator_buy_primary.json and threshold_calibrator_buy_meta.json, loaded in order.

11. SELL extensibility¶

When CVNTrade adds short-selling:

# Training
calib_sell = ThresholdCalibrator.fit(
    y_val == SELL_idx,
    y_proba_val[:, SELL_idx],
    class_label="SELL",
)
model_artifacts["threshold_calibrator_sell"] = calib_sell.to_metadata()

# Inference
mode_sell = os.getenv("CVN_THRESHOLD_MODE_SELL", "disabled")
if mode_sell != "disabled":  # "disabled" is not a calibrator mode; get() would raise on it
    calib_sell = ThresholdCalibrator.from_metadata(meta["threshold_calibrator_sell"])
    if p_sell > calib_sell.get(mode=mode_sell):
        emit SELL

Zero code change inside ThresholdCalibrator. CVN_THRESHOLD_MODE_SELL defaults to disabled, which the caller treats as a no-op while the product is not opened up; disabled is not one of the calibrator's modes, so get() is never called with it.

12. Multi-class extensibility (3-class vs binary)¶

The calibrator only sees y_proba_positive, so:

Binary: y_proba_positive = y_proba[:, 1]
3-class BUY: y_proba_positive = y_proba[:, 2]
3-class SELL: y_proba_positive = y_proba[:, 0]

The trainer extracts the right column depending on is_binary_classification() and the class direction, then passes the array to the calibrator.

Decision on 3-class (v2): keep argmax + threshold filter (two gates). Rationale:
- Backward-compatible with existing 3-class production behaviour: the calibrator only adds a confidence floor.
- A "threshold only" policy in 3-class would require a joint calibration across BUY and SELL probabilities (p_buy > t_buy and p_buy > p_sell? or p_buy > t_buy alone?), which is too much decision surface for v1.
- The argmax gate naturally filters out ambiguous regions where p_buy ~ p_hold, which a raw threshold cannot. Keeping both gates costs nothing at inference (two cheap comparisons).

Concretely in 3-class:

if argmax(y_proba) == BUY_IDX and y_proba[BUY_IDX] > calib_buy.get(mode):   # y_proba = one row of class probabilities
    emit BUY
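A runnable version of the two-gate rule, as a sketch. The helper name `decide_buy_3class` is hypothetical, and the column order (SELL=0, HOLD=1, BUY=2) is inferred from the indices listed earlier in this section; the HOLD index in particular is an assumption.

```python
BUY_IDX = 2  # assumed column order: SELL=0, HOLD=1, BUY=2


def decide_buy_3class(proba_row, threshold):
    """Emit BUY only if BUY is the argmax AND p(BUY) clears the calibrated threshold."""
    top = max(range(len(proba_row)), key=proba_row.__getitem__)  # argmax, first index on ties
    return top == BUY_IDX and proba_row[BUY_IDX] > threshold


threshold = 0.37  # e.g. calib_buy.get("default") for the §8 example artifact
print(decide_buy_3class([0.20, 0.30, 0.50], threshold))  # True: both gates pass
print(decide_buy_3class([0.10, 0.50, 0.40], threshold))  # False: HOLD wins the argmax
print(decide_buy_3class([0.33, 0.33, 0.34], threshold))  # False: argmax is BUY but below threshold
```

The third case is the ambiguous region the rationale refers to: BUY wins the argmax by a hair, and the confidence floor blocks it.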

13. Validation plan (Phase 3 of #608)¶

13.1 Dataset¶

Three exploration cryptos (OP, AAVE, LDO), five folds, fifteen HPO trials.

13.2 Configs tested¶

Baseline: CVN_THRESHOLD_BUY=0.5 (legacy, reference)
default mode: fitted calibrator
conservative: default + 0.5σ
dynamic: default − 0.5σ

13.3 Success criteria¶

default mode yields n_trades ≥ 30 per fold (vs. 0–3 in today's binary baseline)
f1_buy OOS ≥ 3-class baseline (0.29) on at least 2/3 cryptos
conservative → dynamic must show a consistent gradient: rising n_trades and falling f1
Inter-fold stability: std(threshold_fitted) < 0.05 across the five folds

13.4 Deliverables¶

Table mode × crypto × {f1, n_trades, Sortino, Brier, ECE}
p(BUY) histograms with per-mode vertical lines
Final decision on the production default mode

14. Decisions taken (v2 — was "open questions" in v1)¶

All five v1 open questions are resolved post committee review. Kept here as a decision log.

Fold-level vs global calibration → fold-level. Each fold's calibrator is fit on that fold's val slice. The artifact published in MLflow is the fold's fit; cross-fold aggregation happens at reporting time (std(threshold) is a triptych consistency metric). Rationale: consistent with ADR-14 (multi-fold evaluation as the unit of truth).
Post-Platt/isotonic calibration → enforced calibrated (§8.1). fit_on_calibrated_proba=True is a schema invariant; inference hard-fails otherwise.
Asymmetric costs (PTE changes) → re-fit guaranteed by retraining. The calibrator is a training-time artifact; any PTE change triggers retraining, which produces a new calibrator. Documented in §8 (artifact carries feature_set_version).
Multi-step (meta-label) classifier → calibrator on the meta stage, two artifacts published when CVN_USE_META_LABEL=1 (§10bis).
Threshold vs confidence_filter plugin → stacked, not replaced (§10bis). The plugin reads the calibrator threshold as its parameter; the chain order is unchanged.

New open questions (v2, to revisit in Phase 3)¶

dynamic mode gate flip — under what empirical evidence do we flip CVN_THRESHOLD_DYNAMIC_ENABLED to 1 by default? Tentative criterion: two consecutive quarterly retraining cycles where dynamic outperforms default on the Sortino net-15bps triptych AND shows no action_rate_deviation alert firings in live.
expectancy promotion — if Phase 3 shows it beats fbeta on ≥2/3 exploration cryptos with ≥100 trades/fold, consider promoting to default for cryptos matching those liquidity criteria (not a blanket promotion).
Calibrator TTL / retraining cadence — should the artifact carry a max-age that triggers a retraining alert? Related to the static vs walk-forward debate (P7) but lives in the MLflow lifecycle, not the calibrator itself.

15. References¶

Run evidence: ftf_20260420_122455_d94ad0_ATR1.5_3.0_H5 (binary collapse)
Baseline evidence: ftf_20260420_082137_635720_ATR1.5_3.0_H5 (3-class 15m baseline)
Parent issue: #608
P0-A issue (trigger): #596
Prior art:
Hand, 2009, Measuring classifier performance: a coherent alternative to the area under the ROC curve
Lopez de Prado, 2018, Advances in Financial Machine Learning, ch. 3 (meta-labeling + threshold tuning)
Provost & Fawcett, 2001, Robust Classification for Imprecise Environments (optimal operating point on ROC)
Relevant ADRs: ADR-23 (version-pinned MLflow artifacts), ADR-30 (structured logs as a stable contract), ADR-59 (params in PostgreSQL, editable via Console)

16. Acceptance criteria for this document (Phase 1 of #608)¶

This document is now the spec for the Phase 2 implementation PR.