ThresholdCalibrator: Design document
Status: v2, committee-reviewed (session 7371c57d, PASSED, strong consensus)
Authors: CVNTrade research, 2026-04-20
Issue: #608
Changelog
- v2 (2026-04-20): post committee review. Integrates 10 recommendations
and arbitrates 4 dissents:
fbeta locked as Phase 2 default; dynamic mode shipped gated OFF with ±2σ circuit-breaker; A/B FTF factor added to the migration plan; post-Platt/isotonic calibration enforced; 3-class decision locked to argmax + threshold filter; ADR-15 intent clarified (static per model version, not walk-forward); kill switch, runbook, and filter-chain-interaction sections added.
- v1 (2026-04-20): initial draft submitted to committee.
1. Executive summary
The post-inference confidence filter compares p(BUY) to a threshold to
decide whether to open a position. Today that threshold is a fixed value
(CVN_THRESHOLD_BUY, default 0.5), implicitly tuned for the 3-class mode.
When we switch to binary (lever #3 of P0-A), the p(BUY) distribution
reshuffles and the inherited threshold mechanically becomes too strict —
action_rate collapses to roughly 0%, the model stops opening trades.
This document proposes a ThresholdCalibrator component that:
- automatically calibrates a threshold from the training set (at the end
of HPO)
- exposes three operator modes (default / conservative / dynamic)
to trade confidence against volume
- persists the fitted threshold in MLflow next to the model (moves with
retraining, not with env vars)
- is extensible to the SELL class when CVNTrade opens up to derivatives
2. Problem statement
2.1 Evidence
Run ftf_20260420_122455_d94ad0 (lever #3 binary, baseline otherwise,
config 2026-04-20-p0a-lever3):
| Crypto | f1_buy | AUC | action_rate | n_trades |
|---|---|---|---|---|
| AAVE | 0.059 | 0.540 | 0.5% | 3 |
| LDO | 0.034 | 0.508 | 0.3% | 0 |
| OP | 0.000 | 0.500 | 0.0% | 1 |
Equivalent 3-class baseline (run ftf_20260420_082137_635720):
| Crypto | f1_buy | AUC | action_rate |
|---|---|---|---|
| AAVE | 0.323 | 0.678 | 36% |
| LDO | 0.286 | 0.629 | 38% |
| OP | 0.375 | 0.632 | 39% |
2.2 Diagnosis
In 3-class mode, argmax(p_sell, p_hold, p_buy) is the natural decision:
the BUY class wins as soon as p(BUY) exceeds both p(SELL) and p(HOLD) (typically around
0.33–0.4). The 0.5 threshold is rarely the bottleneck.
In binary mode, the decision is directly p(BUY) > threshold. The
p(BUY) distribution spreads over [0,1] — for a well-trained classifier
with an imbalanced minority class (5–15% of bars are true BUY), p(BUY)
stays below 0.5 most of the time even for genuine positives. The 0.5
threshold then sits almost always above p(BUY), driving action rate to
~0.
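To make the collapse concrete, here is a small illustration with made-up scores (they are not taken from any run). `action_rate` is a hypothetical helper that mirrors the metric named in §2.1, assuming it counts bars whose p(BUY) strictly exceeds the threshold.

```python
def action_rate(p_buy, threshold):
    """Fraction of bars whose p(BUY) strictly exceeds the threshold."""
    if not p_buy:
        return 0.0
    return sum(1 for p in p_buy if p > threshold) / len(p_buy)


# Hypothetical p(BUY) scores for 10 bars of a binary model trained on an
# imbalanced target: most of the mass sits well below 0.5.
scores = [0.05, 0.08, 0.12, 0.15, 0.22, 0.28, 0.31, 0.36, 0.42, 0.47]

print(action_rate(scores, 0.5))  # legacy fixed threshold: no bar passes
print(action_rate(scores, 0.3))  # lower threshold: 4 bars out of 10 pass
```

With the legacy 0.5 cutoff nothing passes even though four bars score above 0.3, which is the same qualitative pattern as the binary run above.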
2.3 Requirement
A threshold mechanism that:
1. Adapts automatically to the p(BUY) distribution of the trained
classifier (no manual tuning per classification mode)
2. Lets the operator steer the confidence vs volume tradeoff without
touching code or model
3. Stays stable across folds (no overfitting to a specific fold)
4. Extends to the SELL class when CVNTrade adds short-selling
3. Design principles
P1: The threshold is a model artifact, not a fixed env var
The optimal threshold depends on the p(BUY) distribution learned by a
given classifier. It must therefore move with retraining, not with a
config redeploy. Consequence: the fitted threshold is persisted in
MLflow next to the model, loaded at inference as a feature artifact
(ADR-23 compliance).
P2: Only modes are exposed to the operator
The operator does not pick a numeric value (error-prone, easy to forget
between two runs) but a mode from {default, conservative, dynamic}.
The mode → effective value mapping is computed at inference from the
fitted threshold plus a margin (also parameterized globally, not
per-model).
P3: Calibration is class-agnostic
The ThresholdCalibrator class knows nothing of BUY/SELL business
meaning. It receives y_proba_positive and returns a threshold. One
instance per class to calibrate (BUY today, SELL tomorrow) —
orthogonal to is_binary_classification() and the number of classes.
P4: Backward compatibility without magic
CVN_THRESHOLD_BUY (today's fixed value) is still honored when
CVN_THRESHOLD_MODE_BUY is absent — with an explicit deprecation
warning at runtime. Smooth migration, no big-bang.
P5: Observable
Every inference logs the active mode and the effective value. Grafana
shows a p(BUY) histogram with a vertical line at the effective
threshold per mode.
P6: Fail-fast + kill switch
Missing calibrator at inference → hard fail (no silent fallback,
ADR-25). The only way to bypass is an explicit
CVN_THRESHOLD_CALIBRATOR_DISABLE=1 kill switch, which emits a
WARNING every inference and forces a hard-coded
CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE (default 0.5). This exists
for incident response (bad calibrator in production), not for steady
state.
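A minimal sketch of the P6 guard, under stated assumptions: `resolve_calibrator_or_kill_switch` is a hypothetical helper name, the env vars are read from a plain mapping, and the bool flag is encoded as the string "1". It only illustrates the ordering of the checks (kill switch first, then hard fail), not the production loader.

```python
import logging

logger = logging.getLogger("threshold_calibrator")


def resolve_calibrator_or_kill_switch(meta, env):
    """Return ("calibrator", meta) or ("kill_switch", fixed_threshold).

    `meta` is the loaded calibrator metadata dict, or None when the model
    carries no calibrator. `env` is a mapping such as os.environ.
    """
    if env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE", "0") == "1":
        value = float(env.get("CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE", "0.5"))
        # Logged on every call so the bypass is never silent.
        logger.warning("threshold_calibrator_kill_switch_on value=%.3f", value)
        return "kill_switch", value
    if meta is None:
        # No silent fallback: a missing calibrator is a hard failure.
        raise RuntimeError("threshold_calibrator_missing")
    return "calibrator", meta
```

Calling this once per inference keeps the WARNING cadence described above; a caller that caches the result would need to re-emit the warning itself.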
P7: Static per model version (clarifies ADR-15)
The calibrator is fit once per model version on the validation
fold at training time. This is the same walk-forward loop as the model
itself (ADR-15 compliant): each retraining produces a new calibrator
on the fresh OOS slice. We explicitly do not recalibrate the
threshold live in production against rolling windows — that would be
walk-forward over walk-forward, a second-order problem out of scope.
Operator adjustments in production go through the mode (default /
conservative), not threshold re-fitting.
4. Architecture
4.1 Central class
# src/commun/trading/threshold_calibrator.py
from dataclasses import dataclass
from typing import Literal

import numpy as np

Mode = Literal["default", "conservative", "dynamic"]
Method = Literal["fbeta", "youden", "expectancy", "target_rate"]

# Hard circuit-breaker: no mode can push the threshold outside
# fitted_default ± 2σ. Protects against runaway false positives if
# σ is overestimated on a degenerate val set, or if an operator
# sets CVN_THRESHOLD_DYNAMIC_SIGMA to an extreme value.
CIRCUIT_BREAKER_SIGMA_CAP = 2.0


@dataclass
class ThresholdCalibrator:
    class_label: str               # "BUY" or "SELL" (for logs only, no logic)
    fitted_default: float          # fitted optimal threshold (changes only on retraining)
    proba_sigma: float             # std of p(positive) on val set (margin scale)
    fit_method: Method             # method used at fit time (audit)
    n_fit: int                     # number of val samples (stability)
    fit_on_calibrated_proba: bool  # True iff val probas had post-Platt/isotonic applied

    @classmethod
    def fit(
        cls,
        y_true: np.ndarray,            # {0, 1} binary, or {0, 1} with `== class_idx` pre-applied for 3-class
        y_proba_positive: np.ndarray,  # p(positive) in [0, 1], POST-Platt/isotonic (P1 enforces this)
        class_label: str = "BUY",
        method: Method = "fbeta",
        fbeta_beta: float = 1.0,
        fit_on_calibrated_proba: bool = True,
    ) -> "ThresholdCalibrator": ...

    def get(
        self,
        mode: Mode = "default",
        conservative_sigma: float = 0.5,
        dynamic_sigma: float = 0.5,
        dynamic_enabled: bool = False,  # P2: ships OFF by default (committee dissent #2)
    ) -> float:
        if mode == "default":
            return self.fitted_default
        if mode == "conservative":
            k = min(conservative_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return min(1.0, self.fitted_default + k * self.proba_sigma)
        if mode == "dynamic":
            if not dynamic_enabled:
                raise RuntimeError(
                    "dynamic mode is gated OFF by default; enable via "
                    "CVN_THRESHOLD_DYNAMIC_ENABLED=1 after reading §5.3 and the runbook"
                )
            k = min(dynamic_sigma, CIRCUIT_BREAKER_SIGMA_CAP)
            return max(0.0, self.fitted_default - k * self.proba_sigma)
        raise ValueError(f"Unknown mode: {mode}")

    def to_metadata(self) -> dict: ...

    @classmethod
    def from_metadata(cls, meta: dict) -> "ThresholdCalibrator": ...
Pure, testable, no direct MLflow / sklearn dependency (sklearn is only
imported inside fit).
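The body of `fit` is left open above. As an illustration only, the following pure-Python sketch shows what the `fbeta` path could compute. It is not the project's implementation: the function names are hypothetical, the brute-force sweep is quadratic, the population standard deviation is assumed for `proba_sigma`, and ties are resolved toward the lowest threshold.

```python
from statistics import pstdev


def fbeta_at(y_true, y_proba, threshold, beta=1.0):
    """F-beta of the positive class when predicting positive iff p > threshold."""
    pairs = list(zip(y_true, y_proba))
    tp = sum(1 for y, p in pairs if y == 1 and p > threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p > threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p <= threshold)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def fit_fbeta_threshold(y_true, y_proba, beta=1.0):
    """Return (fitted_default, proba_sigma, n_fit) for one validation slice."""
    if len(y_true) != len(y_proba) or len(y_true) == 0:
        raise ValueError("y_true and y_proba must be non-empty and aligned")
    uniq = sorted(set(y_proba))
    # Candidate cuts: 0.0 (nearly everything passes) plus the midpoints
    # between consecutive distinct scores.
    candidates = [0.0] + [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    # max() keeps the first maximum, so ties resolve to the lowest threshold.
    best = max(candidates, key=lambda t: fbeta_at(y_true, y_proba, t, beta))
    return best, pstdev(y_proba), len(y_true)
```

The three returned values map onto the `fitted_default`, `proba_sigma`, and `n_fit` fields of the dataclass; a production version would more likely sweep thresholds with a vectorized precision/recall curve.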
4.2 Training flow (post-HPO)
XGBoost trainer fit() completes
↓
_evaluate_model on X_val, y_val → y_proba_val
↓
ThresholdCalibrator.fit(y_val, y_proba_val[:, buy_idx], method="fbeta") → calib_buy
↓
model_artifacts["threshold_calibrator_buy"] = calib_buy.to_metadata()
↓
MLflow log_dict(calib_buy.to_metadata(), "threshold_calibrator_buy.json")
4.3 Inference flow
Load model from MLflow → read threshold_calibrator_buy.json
↓
calib_buy = ThresholdCalibrator.from_metadata(meta)
↓
at each candle:
    p_buy = model.predict_proba(features)[:, buy_idx]
    threshold = calib_buy.get(mode=os.getenv("CVN_THRESHOLD_MODE_BUY", "default"))
    if p_buy > threshold: emit BUY
    log_event(threshold_used, mode=..., effective=threshold, default=calib_buy.fitted_default)
5. The 3 modes: formulas and rationale
5.1 default: threshold optimized on training
$$t_{\text{default}} = \arg\max_{t \in [0, 1]} J(t)$$
Threshold value that maximizes an objective J (see §6) on the validation set; this is fitted_default in §4.1. Published as "the" operational value.
Typical usage: normal production, baseline for the other modes.
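5.2 conservative: upward margin
$$t_{\text{conservative}} = \min\bigl(1,\; t_{\text{default}} + k_{\text{conservative}} \cdot \sigma\bigr)$$
where t_default is the fitted default threshold (fitted_default in §4.1) and k_conservative = CVN_THRESHOLD_CONSERVATIVE_SIGMA (default 0.5), capped at 2.0 by the hard circuit-breaker (§4.1).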
Fewer trades, higher mean confidence per trade. σ is the p(BUY)
dispersion on the val set — adaptive to each model.
Typical usage: live deployment with real capital, bear regimes, after drift detection.
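5.3 dynamic: downward margin (gated, research-only at launch)
$$t_{\text{dynamic}} = \max\bigl(0,\; t_{\text{default}} - k_{\text{dynamic}} \cdot \sigma\bigr)$$
where t_default is the fitted default threshold, σ is the p(BUY) dispersion on the val set, and k_dynamic = CVN_THRESHOLD_DYNAMIC_SIGMA (default 0.5), capped
at 2.0 by the hard circuit-breaker (§4.1).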
More trades, lower mean confidence. Useful to explore volume, stress false-positive robustness, or when the strategy needs denser statistical samples (research).
Gated OFF at launch (committee dissent #2). To enable:
CVN_THRESHOLD_DYNAMIC_ENABLED=1 and CVN_THRESHOLD_MODE_BUY=dynamic.
The guard exists because the mode's operational value is unproven and
can amplify false positives in volatile regimes. Gating forces an
explicit operator intent, logged in the config history (ADR-59). Phase
3 validation (§13) is the prerequisite for flipping the default ON.
Typical usage (once enabled): research / P0-A lever testing, liquid markets with low fees.
5.4 Why σ-based rather than a fixed delta¶
Three options considered:
- Fixed delta +0.05: simple, but ignores the model's natural
dispersion. A model with p(BUY) highly concentrated around 0.3
would see its three modes at (0.30, 0.35, 0.25) — 0.35 is already
extreme. Conversely a model with wide p(BUY) spread gets no real
effect from the mode.
- Quantile (e.g. 75th percentile of true positives): requires
keeping val-set positives/negatives around, complicates metadata. No
simple geometric interpretation for dynamic.
- σ-based (retained): naturally scales with the model distribution.
The three modes keep a consistent relative spacing regardless of the
model.
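The contrast can be sketched numerically. The snippet below mirrors the `get` arithmetic from §4.1 for two hypothetical models that share the same fitted default (0.30) but have different dispersions; the σ values 0.02 and 0.20 are invented for illustration.

```python
def mode_thresholds(fitted_default, proba_sigma,
                    k_conservative=0.5, k_dynamic=0.5, cap=2.0):
    """Effective threshold per mode, mirroring ThresholdCalibrator.get."""
    up = min(k_conservative, cap) * proba_sigma
    down = min(k_dynamic, cap) * proba_sigma
    return {
        "default": fitted_default,
        "conservative": min(1.0, fitted_default + up),
        "dynamic": max(0.0, fitted_default - down),
    }


# Hypothetical models: same fitted default, narrow vs wide p(BUY) spread.
narrow = mode_thresholds(0.30, 0.02)
wide = mode_thresholds(0.30, 0.20)

for name, modes in (("narrow", narrow), ("wide", wide)):
    print(name, {mode: round(value, 3) for mode, value in modes.items()})
```

The narrow model moves by about ±0.01 and the wide one by about ±0.10, whereas a fixed ±0.05 delta would shift both by the same amount regardless of how concentrated their scores are.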
6. default calibration: method to decide (Phase 1)
Four candidates to evaluate against existing runs:
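6.1 fbeta: maximize F_β(buy)
$$t_{\text{default}} = \arg\max_{t} F_\beta(t), \qquad F_\beta(t) = \frac{(1 + \beta^2)\, P(t)\, R(t)}{\beta^2 P(t) + R(t)}$$
where P(t) and R(t) are the BUY precision and recall on the validation set at threshold t, with β = CVN_BUY_BETA (already present, default 1.0). Direct alignment
with the HPO objective, hence consistent.
Pros: simple, consistent with HPO, no new parameter.
Risk: F_β does not account for asymmetric SL/TP costs.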
6.2 youden: maximize TPR − FPR
$$t_{\text{default}} = \arg\max_{t} \bigl[\mathrm{TPR}(t) - \mathrm{FPR}(t)\bigr]$$
Optimal point on the ROC curve. Independent of priors / proportions.
Pros: classic, stable, parameter-free.
Risk: does not reflect trade economics (a TP ≠ a FP in PnL terms).
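For reference, a brute-force sketch of the Youden criterion on a validation slice. `youden_threshold` is a hypothetical name and the candidate grid (midpoints between distinct scores) is an assumption made for this example; a production version would more likely derive TPR/FPR from an ROC-curve routine.

```python
def youden_threshold(y_true, y_proba):
    """Threshold maximizing TPR - FPR, predicting positive iff p > threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    uniq = sorted(set(y_proba))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    if not candidates:
        raise ValueError("all scores are identical; no cut point exists")

    def j_statistic(t):
        tpr = sum(1 for y, p in zip(y_true, y_proba) if y == 1 and p > t) / positives
        fpr = sum(1 for y, p in zip(y_true, y_proba) if y == 0 and p > t) / negatives
        return tpr - fpr

    return max(candidates, key=j_statistic)
```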
6.3 expectancy: maximize expected PnL
where avg_win and avg_loss are the TP and SL trade averages from the
training set (fees included). Economically aligned.
Pros: close to the real business objective.
Risk: strongly depends on avg_win/loss estimation (noisy on small
datasets).
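One plausible reading of the expectancy objective, sketched under explicit assumptions: every true positive above the threshold earns avg_win, every false positive costs avg_loss (given as a positive magnitude), and the threshold maximizes the resulting total on the validation slice. The exact objective used by the project may differ; `expectancy_threshold` is a hypothetical name.

```python
def expectancy_threshold(y_true, y_proba, avg_win, avg_loss):
    """Threshold maximizing TP * avg_win - FP * avg_loss (avg_loss > 0)."""
    uniq = sorted(set(y_proba))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    if not candidates:
        raise ValueError("all scores are identical; no cut point exists")

    def total_pnl(t):
        tp = sum(1 for y, p in zip(y_true, y_proba) if y == 1 and p > t)
        fp = sum(1 for y, p in zip(y_true, y_proba) if y == 0 and p > t)
        return tp * avg_win - fp * avg_loss

    return max(candidates, key=total_pnl)
```

On a toy slice the chosen cut moves upward when losses dominate wins, which illustrates why the method is sensitive to how well avg_win and avg_loss are estimated.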
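6.4 target_rate: fix the expected BUY rate
$$t_{\text{default}} = Q_{1 - r}\bigl(p(\text{BUY})\bigr), \qquad r = \frac{\text{target\_rate\_pct}}{100}$$
where Q_q is the q-quantile of p(BUY) on the validation set and target_rate_pct = expected fraction of BUY bars (parameter,
e.g. 10%). The threshold adjusts so that 10% of validation bars pass (up to ties in p(BUY)).
Pros: predictable volume, useful for fixed-frequency strategies.
Risk: disconnects the threshold from actual ML signal (a quantity is forced independently of predictive quality).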
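A small sketch of the quantile cut, assuming a strict `p > threshold` decision and rounding the number of passing bars down. `target_rate_threshold` is a hypothetical helper; with tied scores the realized rate can deviate from the target.

```python
import math


def target_rate_threshold(y_proba, target_rate_pct=10.0):
    """Threshold such that about target_rate_pct % of bars satisfy p > threshold."""
    if not y_proba:
        raise ValueError("y_proba must be non-empty")
    if not 0.0 < target_rate_pct < 100.0:
        raise ValueError("target_rate_pct must be in (0, 100)")
    ranked = sorted(y_proba, reverse=True)
    n_pass = math.floor(len(ranked) * target_rate_pct / 100.0)
    if n_pass == 0:
        # Too few bars for the requested rate: nothing passes under strict ">".
        return ranked[0]
    # Cut halfway between the last passing score and the first blocked one.
    return (ranked[n_pass - 1] + ranked[n_pass]) / 2
```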
6.5 Decision (v2, committee arbitration)
fbeta is locked as the Phase 2 default. Rationale:
- Already the HPO objective (CVN_HPO_OBJECTIVE=fbeta_buy) — consistent
with training, no second-order alignment drift.
- Stable on small datasets (committee concern on expectancy).
- Single parameter β already surfaced via CVN_BUY_BETA.
expectancy stays a first-class candidate but is gated behind a
Phase 3 empirical validation that must demonstrate:
- inter-fold stability std(threshold_fitted) < 0.05 on datasets with
≥100 trades/fold (not the current 0–30 regime),
- avg_win / avg_loss estimates stable across folds (cv < 0.3),
- Sortino net-15bps ≥ fbeta default on at least 2/3 exploration cryptos.
Until these criteria are met, expectancy is allowed via
CVN_THRESHOLD_DEFAULT_METHOD=expectancy but emits a
threshold_method_experimental WARNING.
youden and target_rate remain documented but unflagged — research
options, not shipping defaults.
7. Environment variables
| Var | Type | Default | Role |
|---|---|---|---|
| CVN_THRESHOLD_MODE_BUY | enum | default | BUY pilot mode; FTF factor (A/B testable, ADR-56) |
| CVN_THRESHOLD_MODE_SELL | enum | disabled | SELL pilot mode (disabled = spot only) |
| CVN_THRESHOLD_CONSERVATIVE_SIGMA | float | 0.5 | Upward margin in σ (capped at 2.0) |
| CVN_THRESHOLD_DYNAMIC_SIGMA | float | 0.5 | Downward margin in σ (capped at 2.0) |
| CVN_THRESHOLD_DYNAMIC_ENABLED | bool | 0 | Required to activate dynamic mode (§5.3) |
| CVN_THRESHOLD_DEFAULT_METHOD | enum | fbeta | Calibration method (fbeta locked; youden/expectancy/target_rate experimental) |
| CVN_THRESHOLD_TARGET_RATE_PCT | float | 10.0 | Used only when method=target_rate |
| CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT | float | 30.0 | Alert if live sigma(p_buy) drifts > x% from fitted.proba_sigma |
| CVN_THRESHOLD_CALIBRATOR_DISABLE | bool | 0 | Kill switch: bypass calibrator entirely, use DISABLE_VALUE (incident response only) |
| CVN_THRESHOLD_CALIBRATOR_DISABLE_VALUE | float | 0.5 | Hard threshold used when kill switch is ON |
Backward compatibility
If CVN_THRESHOLD_MODE_BUY is absent and CVN_THRESHOLD_BUY is present:
runtime logs a threshold_buy_deprecated_fixed_used warning and falls
back to the old fixed value. Decommissioning: 1 sprint after merge,
then drop CVN_THRESHOLD_BUY.
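The precedence rule can be sketched as follows. `resolve_buy_threshold_source` is a hypothetical helper, and the behaviour when neither variable is set (fall through to `default` mode) is an assumption based on the default listed in the table above.

```python
import logging

logger = logging.getLogger("threshold_calibrator")

VALID_MODES = {"default", "conservative", "dynamic"}


def resolve_buy_threshold_source(env):
    """Return ("mode", name) or ("legacy_fixed", value) from an env mapping."""
    mode = env.get("CVN_THRESHOLD_MODE_BUY")
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"Unknown CVN_THRESHOLD_MODE_BUY: {mode}")
        return "mode", mode
    legacy = env.get("CVN_THRESHOLD_BUY")
    if legacy is not None:
        # The new variable is absent and the legacy one is set: honor it, loudly.
        logger.warning("threshold_buy_deprecated_fixed_used value=%s", legacy)
        return "legacy_fixed", float(legacy)
    return "mode", "default"
```

When both variables are set, the mode wins and the legacy value is ignored, which matches the "absent and present" condition stated above.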
8. MLflow persistence
Artifact threshold_calibrator_buy.json at the same level as the model:
{
  "class_label": "BUY",
  "fitted_default": 0.37,
  "proba_sigma": 0.082,
  "fit_method": "fbeta",
  "fit_method_params": {"fbeta_beta": 1.0},
  "fit_on_calibrated_proba": true,
  "n_fit": 14523,
  "created_at": "2026-04-20T12:24:55Z",
  "feature_set_version": "v2-abc123",
  "calibration_method": "isotonic",
  "git_sha": "c620f23b"
}
Loaded by FeatureEngineeringAPI.from_mlflow_run() or equivalent when
the model is loaded → ThresholdCalibrator.from_metadata().
8.1 Post-Platt/isotonic enforcement (P1)
fit() enforces fit_on_calibrated_proba=True. At training time the
trainer must apply CVN_CALIBRATION_METHOD (isotonic/Platt) to
y_proba_val before passing it to the calibrator. Raw (uncalibrated)
probabilities can produce a distorted, possibly non-monotonic, p(BUY) → true-rate mapping, so
both the fitted threshold and the sigma-based margins lose their
probabilistic interpretation.
This is enforced by:
- a RuntimeError if fit() is called with
fit_on_calibrated_proba=False outside of unit tests,
- a schema validation at inference: the loaded artifact must carry
fit_on_calibrated_proba=true — otherwise hard-fail (ADR-25).
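A minimal sketch of the inference-side schema check, assuming the artifact has already been parsed into a dict. `validate_calibrator_metadata`, the required-key set, and the choice of `ValueError` are illustrative assumptions; the real loader may validate more fields or raise a different exception type.

```python
REQUIRED_KEYS = {
    "class_label",
    "fitted_default",
    "proba_sigma",
    "fit_method",
    "n_fit",
    "fit_on_calibrated_proba",
}


def validate_calibrator_metadata(meta):
    """Hard-fail on artifacts that violate the calibrator schema."""
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"calibrator artifact missing keys: {sorted(missing)}")
    if meta["fit_on_calibrated_proba"] is not True:
        raise ValueError("calibrator was not fit on calibrated probabilities")
    if not 0.0 <= meta["fitted_default"] <= 1.0:
        raise ValueError("fitted_default must lie in [0, 1]")
    if meta["proba_sigma"] < 0.0:
        raise ValueError("proba_sigma must be non-negative")
    return meta
```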
9. Monitoring
9.1 Logs
event=threshold_used
class=BUY
mode=default
effective=0.370
fitted_default=0.370
sigma=0.082
adjustment=0.000
p_buy_current=0.41
triggered=true
Logged on every decision where p(positive) crosses the effective threshold of any of the three
modes (lets us estimate how many trades would have fired under a
different mode without executing them).
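The counterfactual bookkeeping can be sketched in a few lines. `modes_crossed` is a hypothetical helper; the thresholds below are derived from the sample log line (fitted default 0.370, σ 0.082, k = 0.5 in both directions).

```python
def modes_crossed(p_positive, thresholds):
    """Names of the modes whose effective threshold p_positive strictly exceeds."""
    return [mode for mode, t in thresholds.items() if p_positive > t]


# Effective thresholds for the sample log above: 0.370 ± 0.5 * 0.082.
thresholds = {"dynamic": 0.329, "default": 0.370, "conservative": 0.411}

print(modes_crossed(0.41, thresholds))  # ['dynamic', 'default']
print(modes_crossed(0.30, thresholds))  # []
```

A decision is worth logging whenever the returned list is non-empty, even if the active mode is not in it.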
9.2 Grafana
New panel pipeline_threshold_calibration:
- p(BUY) histogram OOS per crypto × fold
- Vertical lines at the three thresholds (default / conservative / dynamic)
- Observed action_rate per mode (table)
- Drift: daily mean(p_buy) → alert if drift > 3σ over 7 days
9.3 Alerts
| Alert | Signal | Threshold | Severity | Action |
|---|---|---|---|---|
| threshold_buy_deprecated_fixed_used | Legacy env var still in use | on ingress | WARN | Migrate the run's config to _MODE_BUY |
| threshold_calibrator_missing | Model loaded without calibrator and kill switch OFF | on inference startup | CRIT | Stop, retrain, redeploy |
| threshold_calibrator_kill_switch_on | CVN_THRESHOLD_CALIBRATOR_DISABLE=1 | per inference | WARN | Verify this is an authorised incident response |
| action_rate_deviation | action_rate current vs historical mean | > 3σ over 24h | WARN | Check drift + mode; consider conservative |
| proba_sigma_drift | sigma(p_buy)_live vs fitted.proba_sigma | Δ > CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT over 7d | WARN | Retrain recommended; margins may have lost meaning |
| dynamic_mode_enabled_in_live | MODE=dynamic on CVN_SYSTEM_STATUS=active | on config change | CRIT | Validate P3 criteria are met before |
| threshold_method_experimental | CVN_THRESHOLD_DEFAULT_METHOD != fbeta | on training | WARN | Expected in Phase 3 research runs only |
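As an illustration of the proba_sigma_drift signal, the sketch below compares the live dispersion with the fitted one as a relative percentage. The function names are hypothetical, and the use of the population standard deviation and of an absolute relative deviation are assumptions; whichever estimator the trainer used for proba_sigma should be reused here.

```python
from statistics import pstdev


def proba_sigma_drift_pct(live_p_buy, fitted_sigma):
    """Relative deviation (in %) of the live p(BUY) std from the fitted one."""
    if fitted_sigma <= 0.0:
        raise ValueError("fitted_sigma must be positive")
    return abs(pstdev(live_p_buy) - fitted_sigma) / fitted_sigma * 100.0


def sigma_drift_alert(live_p_buy, fitted_sigma, max_drift_pct=30.0):
    """True when drift exceeds CVN_THRESHOLD_PROBA_SIGMA_DRIFT_PCT (default 30.0)."""
    return proba_sigma_drift_pct(live_p_buy, fitted_sigma) > max_drift_pct
```

In practice `live_p_buy` would hold the scores collected over the 7-day window named in the table.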
9.4 Runbooks
Each CRIT/WARN alert maps to a runbook in
documentation/runbooks/threshold_calibrator_*.md (Phase 2 deliverable):
- threshold_calibrator_missing → model from pre-cutover retraining or artifact write failed. Check MLflow run for the artifact; if missing, retrain. Do not flip the kill switch without a written incident ticket.
- proba_sigma_drift → model seeing a different p(BUY) distribution than what it was fit on (regime change, feature drift, upstream enrichment bug). Prescribed action: switch MODE=conservative if live trading, trigger retraining job within 24h, investigate root cause (feature drift panel in Grafana).
- action_rate_deviation → model has drifted or the wrong mode is set. Decision tree: (a) check proba_sigma_drift; if firing too, retrain; (b) else verify CVN_THRESHOLD_MODE_BUY matches the operator's intent in the config history (ADR-59).
- dynamic_mode_enabled_in_live → a config snapshot enabled dynamic mode on a production crypto. Immediately revert to the previous snapshot via Console Restore (#606), then review intent in the config history.
10. Migration path
Step 1: Implementation (Phase 2 of #608)
- Class + unit tests + training integration + inference integration
- Backward compat active: only the new parameter is read, fallback works
- Kill switch wired up to CVN_THRESHOLD_CALIBRATOR_DISABLE
- Grafana panel + alerts per §9 shipped in the same PR (no "observability later")
Step 2: Shadow validation (Phase 2 smoke)
- Flag CVN_THRESHOLD_CALIBRATOR_ENABLED=0 in production, =1 in shadow runs
- Shadow-compare legacy fixed threshold vs calibrated threshold on the exploration set (OP/AAVE/LDO), same 3-class and binary modes
- Block cutover if shadow reveals any divergence > 3σ in action_rate that is unexplained
Step 3: A/B FTF factor (Phase 3, ADR-56)
Introduce threshold_mode_buy as a first-class FTF factor with values
{default, conservative} (dynamic stays gated out). This produces a
clean triptych per mode, directly comparable. The ADR-56 factor
mechanism also guarantees:
- deterministic recording of the active mode per run,
- automatic persistence in finetune_results,
- BH-corrected statistical testing on Sortino net-15bps + f1_buy +
Brier/ECE (ADR-29 baseline comparison),
- Console one-click restore to a prior mode (#606).
Success = default mode yields a triptych consistent with the
3-class baseline on ≥ 2/3 exploration cryptos; conservative shows the
expected gradient (fewer trades, higher f1).
Step 4: Cutover
- Merge Phase 2, deploy image
- All new runs automatically produce calibrators
- Old models load via fallback (warning logs, A/B factor off)
Step 5: Deprecation
- 1 sprint after cutover: remove CVN_THRESHOLD_BUY from code, every production model carries a calibrator
- 2 sprints after cutover: revisit dynamic gating based on Phase 3 data; flip to enabled-by-default only if P3 criteria (§13) are met.
10bis. Position in the 9-filter chain¶
The calibrator sits inside the inference stage, specifically
replacing the fixed-threshold cutoff in the confidence gate. It does
not replace the confidence_filter plugin — that plugin stays in
place and receives the calibrator output as its threshold. The filter
chain order is unchanged (CLAUDE.md §Chaîne de filtres):
CUSUM → Trend → Inference + Confidence(calibrator-driven) → Meta-Label → Regime → Cost → Kelly → Confirmation → Quality
Concretely inside stage 3 (Inference + Confidence):
model.predict_proba(features) → y_proba
↓
(isotonic / Platt) → y_proba_calibrated # §8.1 enforced
↓
threshold = calibrator.get(mode=CVN_THRESHOLD_MODE_BUY,
dynamic_enabled=CVN_THRESHOLD_DYNAMIC_ENABLED)
↓
confidence_filter(p_buy > threshold) → pass / block
Meta-label integration (future): if meta-labeling is enabled
(CVN_USE_META_LABEL=1), the calibrator applies to the meta
classifier's p(BUY), not the primary. The primary classifier uses
argmax (its output is binary pass/no-pass for the meta stage, not a
tradable decision). Two calibrator instances in MLflow:
threshold_calibrator_buy_primary.json and
threshold_calibrator_buy_meta.json, loaded in order.
11. SELL extensibility¶
When CVNTrade adds short-selling:
# Training
calib_sell = ThresholdCalibrator.fit(
    y_val == SELL_idx,
    y_proba_val[:, SELL_idx],
    class_label="SELL",
)
model_artifacts["threshold_calibrator_sell"] = calib_sell.to_metadata()

# Inference
mode_sell = os.getenv("CVN_THRESHOLD_MODE_SELL", "disabled")
if mode_sell != "disabled":  # "disabled" is not a calibrator mode: skip SELL entirely
    calib_sell = ThresholdCalibrator.from_metadata(meta["threshold_calibrator_sell"])
    if p_sell > calib_sell.get(mode=mode_sell):
        emit SELL
Zero code change inside ThresholdCalibrator. The env var
CVN_THRESHOLD_MODE_SELL=disabled defaults to no-op while the product
is not opened up.
12. Multi-class extensibility (3-class vs binary)¶
The calibrator only sees y_proba_positive, so:
- Binary: y_proba_positive = y_proba[:, 1]
- 3-class BUY: y_proba_positive = y_proba[:, 2]
- 3-class SELL: y_proba_positive = y_proba[:, 0]
The trainer extracts the right column depending on
is_binary_classification() and the class direction, then passes the
array to the calibrator.
Decision on 3-class (v2): keep argmax + threshold filter (two
gates). Rationale:
- Backward-compatible with existing 3-class production behaviour — the
calibrator only adds a confidence floor.
- A "threshold only" policy in 3-class would require a joint calibration
across BUY and SELL probabilities (p_buy > t_buy and p_buy > p_sell?
or p_buy > t_buy alone?) — too much decision surface for v1.
- The argmax gate naturally filters out ambiguous regions where
p_buy ~ p_hold, which a raw threshold cannot. Keeping both gates
costs nothing at inference (two cheap comparisons).
Concretely in 3-class:
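A minimal sketch of the two-gate rule, assuming the column order listed above (SELL = 0, BUY = 2) with HOLD in column 1. `decide_buy_3class` is a hypothetical helper name, and ties in the argmax resolve to the lowest index, as `numpy.argmax` would.

```python
SELL_IDX, HOLD_IDX, BUY_IDX = 0, 1, 2  # HOLD = 1 is assumed, not stated


def decide_buy_3class(proba_row, buy_threshold):
    """Emit BUY only if BUY wins the argmax AND clears the calibrated threshold."""
    # Gate 1: argmax (the first maximum wins on ties).
    winner = max(range(len(proba_row)), key=proba_row.__getitem__)
    if winner != BUY_IDX:
        return False
    # Gate 2: confidence floor provided by the calibrator.
    return proba_row[BUY_IDX] > buy_threshold


print(decide_buy_3class([0.10, 0.50, 0.40], 0.37))  # False: HOLD wins the argmax
print(decide_buy_3class([0.20, 0.35, 0.45], 0.37))  # True: both gates pass
print(decide_buy_3class([0.30, 0.34, 0.36], 0.37))  # False: below the threshold
```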
13. Validation plan (Phase 3 of #608)
13.1 Dataset
Three exploration cryptos (OP, AAVE, LDO), five folds, fifteen HPO trials.
13.2 Configs tested
- Baseline: CVN_THRESHOLD_BUY=0.5 (legacy, reference)
- default mode: fitted calibrator
- conservative: default + 0.5σ
- dynamic: default − 0.5σ
13.3 Success criteria
- default mode yields n_trades ≥ 30 per fold (vs. 0–3 in today's binary baseline)
- f1_buy OOS ≥ 3-class baseline (0.29) on at least 2/3 cryptos
- conservative → dynamic must show a consistent gradient: rising n_trades and falling f1
- Inter-fold stability: std(threshold_fitted) < 0.05 across the five folds
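The inter-fold stability criterion is simple enough to check mechanically. The sketch below assumes the sample standard deviation (the document does not say which estimator is meant) and uses invented fold thresholds; `interfold_stable` is a hypothetical helper.

```python
from statistics import stdev


def interfold_stable(fold_thresholds, max_std=0.05):
    """True when the fitted thresholds vary by less than max_std across folds."""
    if len(fold_thresholds) < 2:
        raise ValueError("need at least two folds to measure dispersion")
    return stdev(fold_thresholds) < max_std


# Hypothetical fitted thresholds for five folds.
print(interfold_stable([0.35, 0.37, 0.36, 0.38, 0.34]))  # tight cluster
print(interfold_stable([0.20, 0.30, 0.40, 0.50, 0.60]))  # widely spread
```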
13.4 Deliverables
- Table mode × crypto × {f1, n_trades, Sortino, Brier, ECE}
- p(BUY) histograms with per-mode vertical lines
- Final decision on the production default mode
14. Decisions taken (v2; was "open questions" in v1)
All five v1 open questions are resolved post committee review. Kept here as a decision log.
- Fold-level vs global calibration → fold-level. Each fold's calibrator is fit on that fold's val slice. The artifact published in MLflow is the fold's fit; cross-fold aggregation happens at reporting time (std(threshold) is a triptych consistency metric). Rationale: consistent with ADR-14 (multi-fold evaluation as the unit of truth).
- Post-Platt/isotonic calibration → enforced calibrated (§8.1). fit_on_calibrated_proba=True is a schema invariant; inference hard-fails otherwise.
- Asymmetric costs (PTE changes) → re-fit guaranteed by retraining. The calibrator is a training-time artifact; any PTE change triggers retraining, which produces a new calibrator. Documented in §8 (artifact carries feature_set_version).
- Multi-step (meta-label) classifier → calibrator on the meta stage, two artifacts published when CVN_USE_META_LABEL=1 (§10bis).
- Threshold vs confidence_filter plugin → stacked, not replaced (§10bis). The plugin reads the calibrator threshold as its parameter; the chain order is unchanged.
New open questions (v2, to revisit in Phase 3)
- dynamic mode gate flip: under what empirical evidence do we flip CVN_THRESHOLD_DYNAMIC_ENABLED to 1 by default? Tentative criterion: two consecutive quarterly retraining cycles where dynamic outperforms default on the Sortino net-15bps triptych AND shows no action_rate_deviation alert firings in live.
- expectancy promotion: if Phase 3 shows it beats fbeta on ≥2/3 exploration cryptos with ≥100 trades/fold, consider promoting to default for cryptos matching those liquidity criteria (not a blanket promotion).
- Calibrator TTL / retraining cadence: should the artifact carry a max-age that triggers a retraining alert? Related to the static vs walk-forward debate (P7) but lives in the MLflow lifecycle, not the calibrator itself.
15. References
- Run evidence: ftf_20260420_122455_d94ad0_ATR1.5_3.0_H5 (binary collapse)
- Baseline evidence: ftf_20260420_082137_635720_ATR1.5_3.0_H5 (3-class 15m baseline)
- Parent issue: #608
- P0-A issue (trigger): #596
- Prior art:
  - Hand, 2009, Measuring classifier performance: a coherent alternative to AUC
  - Lopez de Prado, 2018, Advances in Financial Machine Learning, ch. 3 (meta-labeling)
  - Provost & Fawcett, 2001, Robust Classification for Imprecise Environments (optimal operating point on ROC)
- Relevant ADRs: ADR-23 (version-pinned MLflow artifacts), ADR-30 (structured logs as a stable contract), ADR-59 (params in PostgreSQL, editable via Console)
16. Acceptance criteria for this document (Phase 1 of #608)
- Problem documented with quantified evidence
- Design principles stated explicitly (P1–P7, with ADR-15 clarified)
- Three modes detailed with formulas (dynamic gated OFF by default)
- Hard circuit-breaker at ±2σ (§4.1)
- Class architecture + signatures
- Four default_method candidates described, decision taken: fbeta locked, expectancy experimental (§6.5)
- Env vars listed with defaults (including kill switch, drift threshold, dynamic enable flag)
- MLflow persistence schema defined + post-Platt/isotonic enforcement (§8.1)
- Training flow + inference flow diagrammed
- Monitoring + alerts + runbooks (§9.4)
- Filter-chain position (§10bis) + meta-label integration
- Migration path in 5 steps including A/B FTF factor (ADR-56)
- SELL + 3-class extensibility covered (3-class: argmax + filter)
- Phase 3 validation plan with success criteria
- v1 open questions resolved (5/5)
- Committee review completed (session 7371c57d, PASSED, strong consensus)
- Final production default_mode decision: still pending, taken after Phase 3 empirical evidence on defi_top5
This document is now the spec for the Phase 2 implementation PR.