Plan dossier - Track 5 Bug #1 : Calibration refactor on hard labels (Option C)
Date : 2026-04-28
Story : CVN-N011-EA-S07 (OP wp#85)
GH issue : #764
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P0 — production blocker (50 % of FTF sweep variants)
Predecessors :
- §17.1 (Bug #1 Track 5 sweep — DataFrame coercion, PR #752)
- §17.2 (Bug #2 Track 5 sweep — XGBoost feature_names mismatch, PR #754)
- This dossier = §17.3 will be added post-fix
1. Production failure observed
Operator launched FTF sweep on 2026-04-28 ~15:30 UTC (after PR #754 deploy). New failure mode :
```
  File "src/training/XGBoost/cvntrade_XGBoost_trainer.py", line 315
    self._apply_calibration(X_train, y_train, config.calibration)
  File ".../sklearn/calibration.py", line 319, in fit
    check_classification_targets(y)
ValueError: Unknown label type: continuous. Maybe you are trying to fit a
classifier, which expects discrete classes on a regression target with
continuous values.
```
Crash on UNIUSDC fold 3, variant eps_buy=0.3 eps_hold=0.15 cleanlab_mode=off with calibration='isotonic'.
2. Root cause analysis
2.1 The trace
- Trainer loads (X_train, y_train) from datasets["train"] ; y_train is initially in {0, 1} (hard labels).
- Trainer calls apply_label_pipeline(X_train, y_train, ...) (src/training/XGBoost/cvntrade_XGBoost_trainer.py:251) ; for non-baseline variants, y_train is overwritten with soft labels (continuous floats in [0, 1]).
- Trainer constructs dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weights) and runs xgb.train(..., 'binary:logistic') ; the XGBoost binary objective accepts soft labels natively. ✅
- Trainer calls self._apply_calibration(X_train, y_train, config.calibration) (...trainer.py:315), which calls CalibratedClassifierCV.fit(X_train, y_train) (...trainer.py:403).
- sklearn's CalibratedClassifierCV calls check_classification_targets(y) and rejects continuous targets. ❌
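The mechanism in this trace can be illustrated without the trainer. The sketch below is a dependency-free stand-in, not the project's code: smooth_labels assumes symmetric label smoothing of the form y(1 - eps) + eps/2 (consistent with the {0.15, 0.85} values this dossier cites for eps_buy=0.3), and looks_discrete only approximates the kind of target check a classifier performs, where float targets with non-integer values count as continuous. Both function names are hypothetical.

```python
def smooth_labels(y, eps):
    """Symmetric label smoothing (assumed form): hard {0, 1} -> {eps/2, 1 - eps/2}."""
    return [yi * (1.0 - eps) + eps / 2.0 for yi in y]


def looks_discrete(y):
    """Rough stand-in for a classifier target check: every value is integer-valued."""
    return all(float(v).is_integer() for v in y)


y_train_hard = [0, 1, 1, 0]
y_train_soft = smooth_labels(y_train_hard, eps=0.3)

# Hard labels pass the check; the smoothed labels no longer do, which is the
# condition under which a calibrator fitted on y_train rejects its targets.
hard_ok = looks_discrete(y_train_hard)
soft_ok = looks_discrete(y_train_soft)
print(y_train_soft, hard_ok, soft_ok)
```

The same y_train is valid input for a soft-label-aware objective and invalid input for a classifier-style consumer, which is why the crash only appears one step after xgb.train.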
2.2 The implicit contract violation
This is the third incident in 24 h on the same conceptual surface : apply_label_pipeline violates an undocumented contract with a downstream consumer.
| Incident | Surface | Downstream consumer |
|---|---|---|
| §17.1 | apply_label_pipeline X_train type | Hamilton validate_inputs |
| §17.2 | apply_label_pipeline X_train feature_names | XGBoost _validate_features |
| §17.3 (this) | apply_label_pipeline y_train discrete-vs-continuous | sklearn CalibratedClassifierCV |
The pattern is identical to §17.2 : Track 5's design assumed (correctly) that xgb.train(..., 'binary:logistic') accepts soft labels — but failed to audit OTHER downstream consumers of y_train. The trainer's _apply_calibration is one such consumer ; sklearn requires hard labels and crashes on the soft ones.
2.3 Why local tests didn't catch it
The Track 5 integration tests (tests/integration/test_track5_label_smoothing.py) and even the new TestTrainerEndToEndDataFrame from PR #754 only exercise the path up to xgb.train, NOT the _apply_calibration step that follows. The local test trainer call sequence stops at xgb.train ; the calibration codepath was uncovered.
This is exactly the gap that the PR #754 committee's recommendation #4 ("Mandate comprehensive integration testing & test parity") was meant to close : the recommendation had not yet been applied broadly, and the calibration codepath fell through.
3. Why Option C (calibrate on hold-out) over A and B
Option A : Round soft labels back to hard before calibration
- ✅ 1-line fix, immediate
- ❌ Silently loses the smoothing information
- ❌ Calibration still happens on the training set (in-sample) — this was a pre-existing methodological weakness that Bug #1 just exposed
- ❌ Doesn't address the implicit-contract pattern from §17.1+§17.2+§17.3
Option B : Capture original hard y_train before label_pipeline
```python
y_train_original = y_train.copy()
X_train, y_train, sw = apply_label_pipeline(X_train, y_train, ...)
# ... xgb.train ...
self._apply_calibration(X_train, y_train_original, ...)  # use original
```
- ✅ Semantically cleaner than A (preserves info)
- ✅ Cheap (~5 lines)
- ❌ Calibration is still in-sample (on train) : it is optimistic and does not assess OOS robustness
- ❌ Only addresses the bug superficially : it sweeps the underlying problem back under the rug (calibrating on train was already suboptimal even before the bug)
Option C : Calibrate on hold-out (X_val, y_val)
- ✅ Semantically correct : calibration should always rely on a HOLD-OUT set not seen during training (ML best practice : Platt 1999, Niculescu-Mizil & Caruana 2005)
- ✅ y_val is never touched by apply_label_pipeline, so it is always hard-labeled and the soft training labels no longer affect calibration
- ✅ Eliminates the bug class : calibration can no longer crash on soft training labels
- ⚠️ Side effect : the calibrator is fitted in-sample on val (the set we evaluate on afterwards), so ECE_VAL is slightly optimistically biased
  - Mitigation : add an OOS evaluation on X_test/y_test (already available : see test_metrics at cvntrade_XGBoost_trainer.py:676-678, already computed in non-HPO mode) ; ECE_TEST becomes the primary calibration metric and ECE_VAL becomes a sanity check
- ✅ Aligns with ADR-15 (theta calibrated OOS, an existing precedent in the project)
Operator's choice : C. Justification : it is the only option that (a) fixes the bug at its root, (b) corrects a pre-existing methodological weakness (in-sample calibration on train), (c) does not let the same bug class come back later as incident #4.
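To make the data flow of Option C concrete, here is a minimal, dependency-free sketch of hold-out calibration. It is not the trainer's implementation (which goes through sklearn's CalibratedClassifierCV) : fit_isotonic is a small pool-adjacent-violators routine written for illustration only, and every name and number in it is hypothetical. What it shows is that the calibrator only ever sees validation scores and hard validation labels, so soft training labels cannot reach it.

```python
import bisect


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (thresholds, values): a score s maps to values[i] for the first
    block whose upper score bound thresholds[i] is >= s.
    Simplification: tied scores are not pre-aggregated.
    """
    blocks = []  # each block: [sum_of_labels, count, upper_score_bound]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(model, score):
    """Map a raw score to a calibrated probability (clamped at the last block)."""
    thresholds, values = model
    i = bisect.bisect_left(thresholds, score)
    return values[min(i, len(values) - 1)]


# Raw model scores on the hold-out set and the untouched hard labels.
val_scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
y_val = [0, 1, 0, 0, 1, 1]

model = fit_isotonic(val_scores, y_val)  # y_train is never passed here
test_scores = [0.05, 0.5, 0.95]
calibrated = [calibrate(model, s) for s in test_scores]
print(calibrated)
```

Applying the fitted mapping to test_scores rather than back to val_scores mirrors the mitigation listed under Option C : the calibrator is in-sample on val, so only the test set gives an out-of-sample view of calibration quality.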
4. Implementation path
4.1 Trainer change (cvntrade_XGBoost_trainer.py:315)
Before :
```python
# Calibrate if configured
if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)
```
After :
```python
# Calibrate on the hold-out (X_val, y_val): y_val is hard-labeled and never
# touched by apply_label_pipeline. Avoids Bug #1 (incident §17.3) and fixes
# the in-sample calibration, which was methodologically suboptimal.
# y_val MUST be discrete {0, 1} (per ADR-25 fail-fast).
if config.calibration != "none":
    _assert_calibration_targets_discrete(y_val)
    self._apply_calibration(X_val, y_val, config.calibration)
```
4.2 Trainer-side fail-fast helper
Add a defensive assertion analogous to _assert_dmatrix_contract (PR #754 committee reco #2) :
```python
import numpy as np


def _assert_calibration_targets_discrete(y: np.ndarray) -> None:
    """Fail-fast contract check before CalibratedClassifierCV.fit (per ADR-25,
    no silent fallback). sklearn's check_classification_targets crashes deep
    in the call stack if y is continuous ; this assertion surfaces the
    contract violation at the boundary with a readable message pointing at
    the suspected root cause (a transform that produced soft labels)."""
    y_arr = np.asarray(y)
    unique_vals = np.unique(y_arr)
    # Accept {0, 1}-valued arrays of any dtype, or any integer-typed array.
    is_discrete = np.all(np.isin(unique_vals, [0, 1])) or np.issubdtype(y_arr.dtype, np.integer)
    if not is_discrete:
        raise ValueError(
            f"Calibration contract violation: target y must be discrete classes "
            f"({{0, 1}} for binary), got continuous values with "
            f"{len(unique_vals)} unique entries (sample: {unique_vals[:5]}...). "
            f"Likely cause: a transform between dataset extraction and "
            f"_apply_calibration produced soft labels (e.g., apply_label_pipeline "
            f"with eps_buy > 0). Calibration should be invoked on a hold-out set "
            f"(X_val, y_val) where labels are unmodified; see OPERATIONS §17.3."
        )
```
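One property of the helper above is worth spelling out : because of the `or np.issubdtype(..., np.integer)` clause, any integer-typed array passes, including values outside {0, 1} such as {0, 2}. That is reasonable if multi-class targets must stay valid. If calibration in this trainer is always binary (an assumption, not something this dossier states), a stricter variant could look like the sketch below. The name assert_binary_calibration_targets is hypothetical.

```python
import numpy as np


def assert_binary_calibration_targets(y) -> None:
    """Stricter variant (illustrative): accept only non-empty arrays whose
    values are all in {0, 1}, regardless of dtype."""
    y_arr = np.asarray(y)
    if y_arr.size == 0:
        raise ValueError("Calibration contract violation: empty target array.")
    unique_vals = np.unique(y_arr)
    if not np.all(np.isin(unique_vals, [0, 1])):
        raise ValueError(
            "Calibration contract violation: expected binary targets in {0, 1}, "
            f"got {len(unique_vals)} unique values (sample: {unique_vals[:5]})."
        )


def raises_value_error(y) -> bool:
    """Small helper for the demonstration: True if the check rejects y."""
    try:
        assert_binary_calibration_targets(y)
    except ValueError:
        return True
    return False


hard_int = np.array([0, 1, 1, 0])          # accepted
hard_float = np.array([0.0, 1.0, 1.0])     # accepted: values are still {0, 1}
soft = np.array([0.15, 0.85, 0.85])        # rejected: smoothed labels
out_of_range = np.array([0, 2, 2])         # rejected here, accepted by the integer-dtype clause
print([raises_value_error(a) for a in (hard_int, hard_float, soft, out_of_range)])
```

Whether the looser or the stricter contract is the right one depends on whether any calibration path is ever expected to see more than two classes.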
4.3 Update both calibration paths
The same change must be applied to _apply_hybrid_calibration (...trainer.py:405) which also calls CalibratedClassifierCV.fit(X_train, y_train) twice. Same fix : pass (X_val, y_val).
4.4 Integration test (regression bar)
Add to tests/integration/test_track5_label_smoothing.py (or new tests/integration/test_calibration_soft_labels.py) :
```python
class TestCalibrationOnSoftLabelsTrain:
    """Reproduces production incident §17.3 : Track 5 label smoothing
    produces soft y_train ; trainer's _apply_calibration must not crash
    even when training labels are continuous, by calibrating on hard y_val."""

    def test_calibration_does_not_crash_with_soft_y_train(self):
        # Build (X_train, y_train) with eps_buy=0.3 → continuous in {0.15, 0.85}
        # Build (X_val, y_val) untouched → discrete in {0, 1}
        # Run trainer.train(..., config with calibration='isotonic')
        # Verify : no crash, calibrated_model is callable, predict_proba in [0,1]
        ...

    def test_calibration_assertion_fires_on_continuous_val(self):
        # If y_val is somehow continuous (defensive), assertion must raise
        # with the readable message before CalibratedClassifierCV.fit
        ...
```
4.5 Documentation
- OPERATIONS.md §17.3 : incident log entry
- documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-plan.md : this dossier
- documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-pr-review.md : committee pr_review dossier (Phase 4)
- Brief comment in trainer code citing OPERATIONS §17.3
5. Acceptance criteria
- Plan dossier reviewed by Expert Committee (plan_review) : verdict PASSED
- _apply_calibration called with (X_val, y_val) instead of (X_train, y_train) (both standard and hybrid paths)
- _assert_calibration_targets_discrete invoked before each CalibratedClassifierCV.fit
- Integration test reproducing the crash (eps_buy=0.3 + calibration='isotonic') : verified to FAIL before fix and PASS after (regression bar)
- Existing calibration tests (tests/unit/test_calibration_metrics.py, tests/unit/test_oos_calibrator.py) green
- Trainer test tests/unit/test_model_trainers.py green
- ECE_VAL post-fix ≤ ECE_VAL pre-fix + 0.005 (calibration on val should be ≈ as good as calibration on train, since train is similar distribution)
- ECE_TEST (OOS) becomes the primary calibration metric reported in FTF results ; added to _evaluate_model if not already
- OPERATIONS.md §17.3 added
- Committee pr_review PASSED before merge
- Operator re-triggers FTF sweep : variant eps_buy=0.3 cl_mode=off calibration='isotonic' no longer crashes
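Several of these criteria depend on ECE_VAL and ECE_TEST. This dossier does not say how the project computes ECE, so the sketch below assumes the common equal-width binning definition with 10 bins ; the function name and the sample values are hypothetical and only illustrate what the metric measures.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins: the sample-weighted mean, over bins, of
    |mean predicted probability - observed positive rate|."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    # Per bin: [sum of predicted probabilities, sum of labels, count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    n = len(probs)
    # |sum_p/c - sum_y/c| * c/n simplifies to |sum_p - sum_y| / n per bin.
    return sum(abs(sum_p - sum_y) / n for sum_p, sum_y, count in bins if count)


# An overconfident model: predicts 0.8 everywhere, but only half are positive.
probs_test = [0.8, 0.8, 0.8, 0.8]
y_test = [1, 0, 1, 0]
ece_test = expected_calibration_error(probs_test, y_test)
print(round(ece_test, 3))
```

The labels passed to such a metric must be the hard labels of the evaluated split, which holds for val and test because apply_label_pipeline only rewrites y_train.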
6. Falsifiability criteria
The hypothesis "calibration on val + assertion improves robustness without degrading quality" is falsified if any of :
- ECE_VAL post-fix > ECE_VAL pre-fix + 0.01 → calibration on val is materially worse than calibration on train (unlikely given same data distribution but worth checking)
- ECE_TEST (OOS) > ECE_TEST pre-fix + 0.01 → OOS calibration regresses (would require investigating overfit to val)
- Brier_score on test > Brier_score pre-fix + 0.005 → calibration quality regression
- Any baseline variant (eps=0, cl=off) regresses on f1_buy or expectancy_net by ≥ 0.005 (calibration change should not impact decision threshold metrics in a healthy world)
If any falsifier triggers : plan rollback (Option B as fallback) + investigate.
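The four falsifiers are mechanical enough to be checked by a script after the re-run. The sketch below encodes the thresholds listed above ; the metric key names (ece_val, ece_test, brier_test, f1_buy, expectancy_net), the function name and the sample numbers are hypothetical, and it assumes f1_buy and expectancy_net are "higher is better" so that a regression means a drop.

```python
# Tolerances taken from the falsifiability criteria above.
MAX_INCREASE = {"ece_val": 0.01, "ece_test": 0.01, "brier_test": 0.005}
BASELINE_MAX_DROP = {"f1_buy": 0.005, "expectancy_net": 0.005}


def triggered_falsifiers(pre, post, baseline_pre, baseline_post):
    """Return the names of all falsifiers that fire (empty list = none)."""
    triggered = []
    # Calibration metrics: lower is better, so a rise beyond tolerance fires.
    for name, tol in MAX_INCREASE.items():
        if post[name] > pre[name] + tol:
            triggered.append(name)
    # Baseline decision metrics: higher is better, so a drop of at least
    # the tolerance fires.
    for name, tol in BASELINE_MAX_DROP.items():
        if baseline_pre[name] - baseline_post[name] >= tol:
            triggered.append(f"baseline:{name}")
    return triggered


# Made-up numbers: ECE_TEST regresses by 0.015 and baseline f1_buy drops by 0.01.
pre = {"ece_val": 0.030, "ece_test": 0.040, "brier_test": 0.200}
post = {"ece_val": 0.032, "ece_test": 0.055, "brier_test": 0.201}
baseline_pre = {"f1_buy": 0.60, "expectancy_net": 0.012}
baseline_post = {"f1_buy": 0.59, "expectancy_net": 0.012}

result = triggered_falsifiers(pre, post, baseline_pre, baseline_post)
print(result)
```

A non-empty result would correspond to the rollback condition stated above (Option B as fallback, then investigate).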
7. Risks
| # | Risk | Probability | Impact | Mitigation |
|---|---|---|---|---|
| 1 | ECE_VAL slightly degrades (in-sample bias removed) | medium | low | Document as expected ; ECE_TEST becomes primary metric |
| 2 | Calibration overfits to val if val is small | low | medium | Check val_size ≥ 500 in trainer ; add guardrail if not |
| 3 | Hybrid calibration path missed by tests → regression | medium | medium | Explicit test for _apply_hybrid_calibration path |
| 4 | OOS calibration metric (ECE_TEST) pre-existing values not comparable post-fix | low | low | Document baseline shift in OPERATIONS §17.3 + plan dossier |
| 5 | The fix exposes a 4th implicit contract elsewhere | medium | high | Audit all y_train/y_val consumers in trainer + post-trainer code (oos_calibrator.py, regime_trainer.py, etc.) — out of scope of S07, file follow-up Story if found |
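The mitigation for Risk #2 (val_size ≥ 500) could take the form of a small fail-fast guard called just before calibration. This is a sketch under assumptions : the function name is hypothetical, and raising (rather than warning or skipping calibration) is one possible reading of "add guardrail", chosen here to match the fail-fast stance of ADR-25.

```python
MIN_CALIBRATION_SAMPLES = 500  # threshold named in Risk #2


def check_calibration_set_size(y_val, min_samples=MIN_CALIBRATION_SAMPLES):
    """Raise if the hold-out set is too small to fit a calibrator on;
    otherwise return the number of samples."""
    n = len(y_val)
    if n < min_samples:
        raise ValueError(
            f"Calibration hold-out too small: {n} samples, need >= {min_samples}."
        )
    return n


n_ok = check_calibration_set_size([0, 1] * 250)  # exactly 500 samples: accepted

try:
    check_calibration_set_size([0, 1] * 10)      # 20 samples: rejected
    too_small_rejected = False
except ValueError:
    too_small_rejected = True

print(n_ok, too_small_rejected)
```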
8. Cost estimate
| Phase | Effort | Wall-clock |
|---|---|---|
| 1 (Plan + committee) | 2 h | 1 day (committee + recos round-trip) |
| 2 (Design) | 1 h | included in Phase 3 |
| 3 (Implem + tests + docs) | 3 h | 1 day |
| 4 (PR + CR cycles + committee pr_review + merge + deploy) | 4 h | 2 days (CR cycles 4-5 + committee + image build/deploy) |
| 5 (Validation cluster) | 30 min | dependent on operator |
| Total | ~10 h effort | ~4 days wall-clock |
9. Out of scope (tracked separately)
- Single-source dependency hygiene → CVN-N013-EA-S02-S04 (already created)
- Generalized contract enforcement at all ML pipeline boundaries → CVN-N011-EA-S01 (data contracts ADR, already created)
- Audit of other y_train/y_val consumers (Risk #5) → follow-up Story if anything found
- Fold-aware calibration (calibrate on a separate slice of val per fold) → potential follow-up if Risk #2 materializes
10. Question for the committee
Is Option C (calibrate on (X_val, y_val) + fail-fast assertion + integration test reproducing the crash) the right grain of fix for this Story, or should the scope be widened to (a) split val into val_eval / val_calib, (b) audit all other implicit contracts in the trainer at the same time, or (c) both ?
Bonus : what is the right durable mechanism (beyond this Story's scope) to prevent a 4th incident on the apply_label_pipeline surface ?
11. Linked context
- ADR-15 (theta calibrated OOS) : direct precedent for hold-out calibration
- ADR-25 (no silent fallback / fail-fast) : informs the _assert_calibration_targets_discrete design
- ADR-58 (every FTF factor must have guardrail + integration test) : informs the test pattern
- ADR-68 (Expert Committee for substantive ML PR)
- OPERATIONS §17.1 + §17.2 + §17.3 (this incident series : same surface, 3 incidents in 24 h)
- Committee 0e15acc0 (PR #754 pr_review) : reco #4 "mandate integration testing parity" predicted this exact gap (calibration codepath uncovered) ; this Story partially applies it