Plan dossier (Track 5 Bug #1) : Calibration refactor on hard labels (Option C)

Date : 2026-04-28
Story : CVN-N011-EA-S07 (OP wp#85)
GH issue : #764
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P0, production blocker (50 % of FTF sweep variants)
Predecessors :
- §17.1 (Bug #1 Track 5 sweep : DataFrame coercion, PR #752)
- §17.2 (Bug #2 Track 5 sweep : XGBoost feature_names mismatch, PR #754)
- This dossier = §17.3 (to be added post-fix)

1. Production failure observed

Operator launched FTF sweep on 2026-04-28 ~15:30 UTC (after PR #754 deploy). New failure mode :

File "src/training/XGBoost/cvntrade_XGBoost_trainer.py", line 315
  self._apply_calibration(X_train, y_train, config.calibration)
File ".../sklearn/calibration.py", line 319, in fit
  check_classification_targets(y)
ValueError: Unknown label type: continuous. Maybe you are trying to fit a
classifier, which expects discrete classes on a regression target with
continuous values.

Crash on UNIUSDC fold 3, variant eps_buy=0.3 eps_hold=0.15 cleanlab_mode=off with calibration='isotonic'.

2. Root cause analysis

2.1 The trace

Trainer loads (X_train, y_train) from datasets["train"] — y_train initially in {0, 1} (hard labels).
Trainer calls apply_label_pipeline(X_train, y_train, ...) (src/training/XGBoost/cvntrade_XGBoost_trainer.py:251) — for non-baseline variants, y_train is overwritten with soft labels (continuous floats in [0, 1]).
Trainer constructs dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weights) and runs xgb.train(..., 'binary:logistic') — XGBoost binary objective accepts soft labels natively. ✅
Trainer calls self._apply_calibration(X_train, y_train, config.calibration) (...trainer.py:315) which calls CalibratedClassifierCV.fit(X_train, y_train) (...trainer.py:403).
sklearn's CalibratedClassifierCV calls check_classification_targets(y) and rejects continuous targets. ❌
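
The last two steps of the trace can be reproduced in isolation. The sketch below is not the trainer code : it uses synthetic data, a `LogisticRegression` stand-in for the XGBoost model, and an assumed symmetric smoothing formula that happens to yield the {0.15, 0.85} values quoted for eps_buy=0.3. It only shows how sklearn classifies the two label arrays and that `CalibratedClassifierCV.fit` rejects the soft one.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_hard = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Assumed smoothing form (illustration only): eps=0.3 maps {0, 1} to {0.15, 0.85}.
eps = 0.3
y_soft = y_hard * (1 - eps) + eps / 2

# sklearn inspects the target values to decide whether they are class labels.
print("hard labels:", type_of_target(y_hard))
print("soft labels:", type_of_target(y_soft))

# LogisticRegression is a stand-in base estimator, not the project's model.
calibrator = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3)
calibrator.fit(X, y_hard)  # hard labels are accepted

try:
    calibrator.fit(X, y_soft)  # soft labels are rejected
    soft_error = None
except ValueError as exc:
    soft_error = str(exc)
print("error on soft labels:", soft_error)
```

The rejection comes from the target-type check, so it occurs regardless of which base estimator is wrapped.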

2.2 The implicit contract violation

This is the third incident in 24 h on the same conceptual surface : apply_label_pipeline violates an undocumented contract with a downstream consumer.

| Incident | Surface | Downstream consumer |
| --- | --- | --- |
| §17.1 | `apply_label_pipeline` X_train type | Hamilton `validate_inputs` |
| §17.2 | `apply_label_pipeline` X_train feature_names | XGBoost `_validate_features` |
| §17.3 (this) | `apply_label_pipeline` y_train discrete-vs-continuous | sklearn `CalibratedClassifierCV` |

The pattern is identical to §17.2 : Track 5's design assumed (correctly) that xgb.train(..., 'binary:logistic') accepts soft labels — but failed to audit OTHER downstream consumers of y_train. The trainer's _apply_calibration is one such consumer ; sklearn requires hard labels and crashes on the soft ones.

2.3 Why local tests didn't catch it

The Track 5 integration tests (tests/integration/test_track5_label_smoothing.py) and even the new TestTrainerEndToEndDataFrame from PR #754 only exercise the path up to xgb.train — NOT the _apply_calibration step that follows. The local test trainer call sequence stops at xgb.train ; the calibration codepath was uncovered.

This is exactly the gap that recommendation #4 of the PR #754 committee ("Mandate comprehensive integration testing & test parity") warned about : we had not yet applied it broadly, and the calibration codepath fell through.

3. Why Option C (calibrate on hold-out) over A and B

Option A : Round soft labels back to hard before calibration

y_calib = np.round(y_train).astype(int)
self.calibrated_model.fit(X_train, y_calib)

✅ 1-line fix, immediate
❌ Silently loses the smoothing information
❌ Calibration still happens on the training set (in-sample) — this was a pre-existing methodological weakness that Bug #1 just exposed
❌ Doesn't address the implicit-contract pattern from §17.1+§17.2+§17.3

Option B : Capture original hard `y_train` before label_pipeline

y_train_original = y_train.copy()
X_train, y_train, sw = apply_label_pipeline(X_train, y_train, ...)
# ... xgb.train ...
self._apply_calibration(X_train, y_train_original, ...)  # use original

✅ Semantically cleaner than A (preserves info)
✅ Cheap (~5 lines)
❌ Calibration is still in-sample (on train) : it is optimistic and does not assess OOS robustness
❌ Only papers over the bug : it sweeps the original problem back under the rug (calibrating on train was already suboptimal even before the bug)

Option C : Calibrate on hold-out (X_val, y_val)

self._apply_calibration(X_val, y_val, config.calibration)

✅ Semantically correct : calibration should always rely on a HOLD-OUT set not seen during training (ML best practice, see Platt 1999, Niculescu-Mizil & Caruana 2005)
✅ y_val is never touched by apply_label_pipeline → always hard-labeled → soft train labels no longer affect calibration
✅ Eliminates the bug class : calibration can no longer crash on soft train labels
⚠️ Side effect : calibration is now in-sample on val (the set we evaluate on afterwards) → ECE_VAL is slightly optimistically biased
Mitigation : add an OOS evaluation on X_test/y_test (already available : see test_metrics at cvntrade_XGBoost_trainer.py:676-678, already computed in non-HPO mode) → ECE_TEST becomes the primary calibration metric ; ECE_VAL becomes a sanity check
✅ Aligns with ADR-15 (theta calibrated OOS, an existing precedent in the project)

Operator's choice : C. Rationale : it is the only option that (a) fixes the bug at its root, (b) corrects a pre-existing methodological weakness (in-sample calibration on train), (c) does not leave the same bug class open for a 4th incident later.
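
The flow chosen in Option C can be sketched end to end on synthetic data. Everything here is illustrative and assumed : a small NumPy logistic model stands in for `xgb.train(..., 'binary:logistic')`, the smoothing formula is the symmetric one that gives {0.15, 0.85} for eps=0.3, and `IsotonicRegression` is used directly instead of the trainer's `CalibratedClassifierCV` wrapper. The point is only the data flow : soft labels reach the model, while the calibrator sees nothing but the hold-out pair, and test stays untouched for the OOS metric.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)


def make_split(n):
    """Synthetic binary data with hard labels in {0, 1}."""
    X = rng.normal(size=(n, 4))
    logits = 1.5 * X[:, 0] - X[:, 1]
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y


def fit_logistic(X, y, lr=0.5, steps=300):
    """Stand-in for the XGBoost model: gradient descent on log-loss,
    which is well defined for soft targets in [0, 1]."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w


def raw_scores(w, X):
    """Uncalibrated probabilities from the stand-in model."""
    return 1 / (1 + np.exp(-X @ w))


X_train, y_train = make_split(2000)
X_val, y_val = make_split(1000)
X_test, y_test = make_split(1000)

# Only the training labels are smoothed; val and test labels stay hard.
eps = 0.3
y_train_soft = y_train * (1 - eps) + eps / 2
w = fit_logistic(X_train, y_train_soft)

# Option C: the calibrator is fitted on the hold-out pair (X_val, y_val) only.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores(w, X_val), y_val)

# Test data was seen by neither the model nor the calibrator.
p_test = calibrator.predict(raw_scores(w, X_test))
print("calibrated test probabilities in", (p_test.min(), p_test.max()))
```

Because the calibrator is fitted on val, any calibration metric computed on val afterwards is in-sample, which is why the mitigation above promotes the test-set metric.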

4. Implementation path

4.1 Trainer change (`cvntrade_XGBoost_trainer.py:315`)

Before :

# Calibrate if requested
if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)

After :

# Calibrate on the hold-out set (X_val, y_val) : y_val is hard-labeled and never
# touched by apply_label_pipeline. Avoids Bug #1 (incident §17.3) and
# fixes the in-sample calibration, which was methodologically suboptimal.
# y_val MUST be discrete {0, 1} (per ADR-25 fail-fast).
if config.calibration != "none":
    _assert_calibration_targets_discrete(y_val)
    self._apply_calibration(X_val, y_val, config.calibration)

4.2 Trainer-side fail-fast helper

Add a defensive assertion analogous to _assert_dmatrix_contract (PR #754 committee reco #2) :

import numpy as np  # already imported in the trainer module if np is used there


def _assert_calibration_targets_discrete(y: np.ndarray) -> None:
    """Fail-fast contract check before CalibratedClassifierCV.fit (per ADR-25,
    no silent fallback). sklearn's check_classification_targets crashes deep
    in the call stack if y is continuous ; this assertion surfaces the
    contract violation at the boundary with a readable message pointing at
    the suspected root cause (a transform that produced soft labels)."""
    y_arr = np.asarray(y)
    unique_vals = np.unique(y_arr)
    is_discrete = np.all(np.isin(unique_vals, [0, 1])) or np.issubdtype(y_arr.dtype, np.integer)
    if not is_discrete:
        raise ValueError(
            f"Calibration contract violation: target y must be discrete classes "
            f"({{0, 1}} for binary), got continuous values with "
            f"{len(unique_vals)} unique entries (sample: {unique_vals[:5]}...). "
            f"Likely cause: a transform between dataset extraction and "
            f"_apply_calibration produced soft labels (e.g., apply_label_pipeline "
            f"with eps_buy > 0). Calibration should be invoked on a hold-out set "
            f"(X_val, y_val) where labels are unmodified — see OPERATIONS §17.3."
        )
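
To see which inputs the helper lets through, the predicate can be exercised on a few arrays. The function below copies the `is_discrete` expression from the helper so that the example runs on its own ; the name `is_discrete_target` and the sample arrays are hypothetical.

```python
import numpy as np


def is_discrete_target(y) -> bool:
    """Same predicate as in _assert_calibration_targets_discrete,
    copied here so the example is self-contained."""
    y_arr = np.asarray(y)
    unique_vals = np.unique(y_arr)
    return bool(
        np.all(np.isin(unique_vals, [0, 1]))
        or np.issubdtype(y_arr.dtype, np.integer)
    )


cases = {
    "hard int labels": np.array([0, 1, 1, 0]),
    "hard float labels": np.array([0.0, 1.0, 1.0]),
    "soft labels (eps=0.3)": np.array([0.15, 0.85, 0.85]),
    "integer multi-class": np.array([0, 1, 2]),
    "empty array": np.array([]),
}
results = {name: is_discrete_target(y) for name, y in cases.items()}
for name, accepted in results.items():
    print(f"{name}: {'accepted' if accepted else 'rejected'}")
```

Two cases are worth a conscious decision : integer arrays with more than two classes pass because of the dtype branch, and an empty array passes because `np.all` over no elements is true. If either should be rejected for binary calibration, the predicate would need an extra condition ; that is a design choice this sketch does not settle.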

4.3 Update both calibration paths

The same change must be applied to _apply_hybrid_calibration (...trainer.py:405) which also calls CalibratedClassifierCV.fit(X_train, y_train) twice. Same fix : pass (X_val, y_val).

4.4 Integration test (regression bar)

Add to tests/integration/test_track5_label_smoothing.py (or new tests/integration/test_calibration_soft_labels.py) :

class TestCalibrationOnSoftLabelsTrain:
    """Reproduces production incident §17.3 : Track 5 label smoothing
    produces soft y_train ; trainer's _apply_calibration must not crash
    even when training labels are continuous, by calibrating on hard y_val."""

    def test_calibration_does_not_crash_with_soft_y_train(self):
        # Build (X_train, y_train) with eps_buy=0.3 → continuous in {0.15, 0.85}
        # Build (X_val, y_val) untouched → discrete in {0, 1}
        # Run trainer.train(..., config with calibration='isotonic')
        # Verify : no crash, calibrated_model is callable, predict_proba in [0,1]
        ...

    def test_calibration_assertion_fires_on_continuous_val(self):
        # If y_val is somehow continuous (defensive), assertion must raise
        # with the readable message before CalibratedClassifierCV.fit
        ...

4.5 Documentation

OPERATIONS.md §17.3 — incident log entry
documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-plan.md — this dossier
documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-pr-review.md — committee pr_review dossier (Phase 4)
Brief comment in trainer code citing OPERATIONS §17.3

5. Acceptance criteria

6. Falsifiability criteria

The hypothesis "calibration on val + assertion improves robustness without degrading quality" is falsified if any of :

ECE_VAL post-fix > ECE_VAL pre-fix + 0.01 → calibration on val is materially worse than calibration on train (unlikely given same data distribution but worth checking)
ECE_TEST (OOS) > ECE_TEST pre-fix + 0.01 → OOS calibration regresses (would require investigating overfit to val)
Brier_score on test > Brier_score pre-fix + 0.005 → calibration quality regression
Any baseline variant (eps=0, cl=off) regresses on f1_buy or expectancy_net by ≥ 0.005 (calibration change should not impact decision threshold metrics in a healthy world)

If any falsifier triggers : plan rollback (Option B as fallback) + investigate.
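
The falsifiers above can be turned into a mechanical check. The sketch below is one possible form, under stated assumptions : the dossier does not say how ECE is computed in the project, so an equal-width 10-bin estimate on the positive-class probability is assumed, and the dictionary keys (`ece_val`, `ece_test`, `brier_test`, `f1_buy`, `expectancy_net`) as well as the function names are hypothetical. The last loop is meant to be applied to baseline variants (eps=0, cl=off) only.

```python
import numpy as np


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width binned ECE for binary targets: weighted mean over bins of
    |observed positive rate - mean predicted probability|."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges only, so indices run from 0 to n_bins - 1 (p == 1.0 lands in the last bin).
    bin_ids = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_pred[mask].mean())
    return float(ece)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def triggered_falsifiers(pre, post):
    """Return the names of the falsifiers that fire, using the thresholds listed above."""
    fired = []
    if post["ece_val"] > pre["ece_val"] + 0.01:
        fired.append("ece_val")
    if post["ece_test"] > pre["ece_test"] + 0.01:
        fired.append("ece_test")
    if post["brier_test"] > pre["brier_test"] + 0.005:
        fired.append("brier_test")
    # Baseline variants only: a drop of 0.005 or more counts as a regression.
    for metric in ("f1_buy", "expectancy_net"):
        if pre[metric] - post[metric] >= 0.005:
            fired.append(metric)
    return fired


y_demo = np.array([0, 0, 1, 1])
p_demo = np.array([0.05, 0.05, 0.95, 0.95])
print("ECE:", expected_calibration_error(y_demo, p_demo))
print("Brier:", brier_score(y_demo, p_demo))
```

An empty list from `triggered_falsifiers` means the hypothesis survives ; any entry points at the metric to investigate before the Option B rollback is considered.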

7. Risks

| # | Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- | --- |
| 1 | ECE_VAL slightly degrades (in-sample bias removed) | medium | low | Document as expected ; ECE_TEST becomes primary metric |
| 2 | Calibration overfits to val if val is small | low | medium | Check val_size ≥ 500 in trainer ; add guardrail if not |
| 3 | Hybrid calibration path missed by tests → regression | medium | medium | Explicit test for `_apply_hybrid_calibration` path |
| 4 | OOS calibration metric (ECE_TEST) pre-existing values not comparable post-fix | low | low | Document baseline shift in OPERATIONS §17.3 + plan dossier |
| 5 | The fix exposes a 4th implicit contract elsewhere | medium | high | Audit all `y_train`/`y_val` consumers in trainer + post-trainer code (`oos_calibrator.py`, `regime_trainer.py`, etc.) ; out of scope of S07, file follow-up Story if found |
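
The guardrail mentioned for Risk 2 could look like the sketch below. It is an assumption-laden illustration : the function name `check_calibration_set` is hypothetical, raising (rather than warning) is chosen here to match the fail-fast stance of ADR-25, and the single-class check is an addition not listed in the table.

```python
import numpy as np

MIN_CALIBRATION_SAMPLES = 500  # threshold quoted in Risk 2


def check_calibration_set(y_val, min_samples=MIN_CALIBRATION_SAMPLES):
    """Raise if the hold-out set is too small, or (extra assumption) if it
    contains a single class, before fitting a calibrator on it."""
    y_arr = np.asarray(y_val)
    if y_arr.size < min_samples:
        raise ValueError(
            f"Calibration set too small: {y_arr.size} rows < {min_samples}."
        )
    classes = np.unique(y_arr)
    if classes.size < 2:
        raise ValueError(
            f"Calibration set contains a single class ({classes.tolist()})."
        )


check_calibration_set(np.array([0, 1] * 300))  # 600 rows, both classes: passes
```

Such a check would sit next to `_assert_calibration_targets_discrete`, just before the calibration call.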

8. Cost estimate

| Phase | Effort | Wall-clock |
| --- | --- | --- |
| 1 (Plan + committee) | 2 h | 1 day (committee + recos round-trip) |
| 2 (Design) | 1 h | included in Phase 3 |
| 3 (Implem + tests + docs) | 3 h | 1 day |
| 4 (PR + CR cycles + committee pr_review + merge + deploy) | 4 h | 2 days (CR cycles 4-5 + committee + image build/deploy) |
| 5 (Validation cluster) | 30 min | dependent on operator |
| Total | ~10 h effort | ~4 days wall-clock |

9. Out of scope (tracked separately)

Single-source dependency hygiene → CVN-N013-EA-S02-S04 (already created)
Generalized contract enforcement at all ML pipeline boundaries → CVN-N011-EA-S01 (data contracts ADR, already created)
Audit of other y_train/y_val consumers (Risk #5) → follow-up Story if anything found
Calibration fold-aware (calibrate on a separate slice of val per-fold) → potential follow-up if Risk #2 materializes

10. Question for the committee

Is Option C (calibrate on (X_val, y_val) + fail-fast assertion + integration test reproducing the crash) the right grain of fix for this Story — or should the scope be widened to (a) split val into val_eval / val_calib, (b) audit all other implicit contracts in the trainer at the same time, or (c) both ?

Bonus : what is the right durable mechanism (beyond this Story's scope) to prevent a 4th incident on the apply_label_pipeline surface ?

11. Linked context

ADR-15 (theta calibrated OOS) : direct precedent for hold-out calibration
ADR-25 (no silent fallback / fail-fast) — informs the _assert_calibration_targets_discrete design
ADR-58 (every FTF factor must have guardrail + integration test) — informs the test pattern
ADR-68 (Expert Committee for substantive ML PR)
OPERATIONS §17.1 + §17.2 + §17.3 (this incident series — same surface, 3 incidents in 24 h)
Committee 0e15acc0 (PR #754 pr_review) — reco #4 "mandate integration testing parity" predicted this exact gap (calibration codepath uncovered) ; this Story partially applies it