
Plan dossier — Track 5 Bug #1 : Calibration refactor on hard labels (Option C)

Date : 2026-04-28
Story : CVN-N011-EA-S07 (OP wp#85)
GH issue : #764
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P0, production blocker (50 % of FTF sweep variants)
Predecessors :
  • §17.1 (Bug #1 Track 5 sweep : DataFrame coercion, PR #752)
  • §17.2 (Bug #2 Track 5 sweep : XGBoost feature_names mismatch, PR #754)
  • This dossier = §17.3, to be added post-fix


1. Production failure observed

Operator launched FTF sweep on 2026-04-28 ~15:30 UTC (after PR #754 deploy). New failure mode :

File "src/training/XGBoost/cvntrade_XGBoost_trainer.py", line 315
  self._apply_calibration(X_train, y_train, config.calibration)
File ".../sklearn/calibration.py", line 319, in fit
  check_classification_targets(y)
ValueError: Unknown label type: continuous. Maybe you are trying to fit a
classifier, which expects discrete classes on a regression target with
continuous values.

Crash on UNIUSDC fold 3, variant eps_buy=0.3 eps_hold=0.15 cleanlab_mode=off with calibration='isotonic'.

2. Root cause analysis

2.1 The trace

  1. Trainer loads (X_train, y_train) from datasets["train"] ; y_train is initially in {0, 1} (hard labels).
  2. Trainer calls apply_label_pipeline(X_train, y_train, ...) (src/training/XGBoost/cvntrade_XGBoost_trainer.py:251) — for non-baseline variants, y_train is overwritten with soft labels (continuous floats in [0, 1]).
  3. Trainer constructs dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weights) and runs xgb.train(..., 'binary:logistic') — XGBoost binary objective accepts soft labels natively. ✅
  4. Trainer calls self._apply_calibration(X_train, y_train, config.calibration) (...trainer.py:315) which calls CalibratedClassifierCV.fit(X_train, y_train) (...trainer.py:403).
  5. sklearn's CalibratedClassifierCV calls check_classification_targets(y) and rejects continuous targets. ❌
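
The last two steps of the trace can be reproduced in isolation. The sketch below is a minimal illustration, not the trainer's code: `smooth_labels` is a hypothetical stand-in for whatever `apply_label_pipeline` does internally, and the symmetric smoothing formula is an assumption chosen so that eps=0.3 yields labels in {0.15, 0.85}. It only shows how sklearn classifies hard versus soft targets.

```python
import numpy as np
from sklearn.utils.multiclass import check_classification_targets, type_of_target


def smooth_labels(y_hard, eps):
    """Hypothetical symmetric label smoothing: 0 -> eps/2, 1 -> 1 - eps/2."""
    y_hard = np.asarray(y_hard, dtype=float)
    return y_hard * (1.0 - eps) + eps / 2.0


y_hard = np.array([0, 1, 1, 0, 1])
y_soft = smooth_labels(y_hard, eps=0.3)

# sklearn infers the target type from the values, not from the caller's intent.
hard_kind = type_of_target(y_hard)
soft_kind = type_of_target(y_soft)

# check_classification_targets is the call that fails in the production trace.
try:
    check_classification_targets(y_soft)
    soft_rejected = False
except ValueError:
    soft_rejected = True

print(hard_kind, soft_kind, soft_rejected)
```

Hard labels are typed as a classification target, while the smoothed ones are typed as continuous and rejected, which is the behaviour seen at step 5.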

2.2 The implicit contract violation

This is the third incident in 24 h on the same conceptual surface : apply_label_pipeline violates an undocumented contract with a downstream consumer.

| Incident | Surface | Downstream consumer |
| --- | --- | --- |
| §17.1 | apply_label_pipeline X_train type | Hamilton validate_inputs |
| §17.2 | apply_label_pipeline X_train feature_names | XGBoost _validate_features |
| §17.3 (this) | apply_label_pipeline y_train discrete-vs-continuous | sklearn CalibratedClassifierCV |

The pattern is identical to §17.2 : Track 5's design assumed (correctly) that xgb.train(..., 'binary:logistic') accepts soft labels — but failed to audit OTHER downstream consumers of y_train. The trainer's _apply_calibration is one such consumer ; sklearn requires hard labels and crashes on the soft ones.

2.3 Why local tests didn't catch it

The Track 5 integration tests (tests/integration/test_track5_label_smoothing.py) and even the new TestTrainerEndToEndDataFrame from PR #754 only exercise the path up to xgb.train — NOT the _apply_calibration step that follows. The local test trainer call sequence stops at xgb.train ; the calibration codepath was uncovered.

This is exactly the gap that the PR #754 committee's recommendation #4 ("Mandate comprehensive integration testing & test parity") warned about : we had not yet applied it broadly, and the calibration codepath fell through.

3. Why Option C (calibrate on hold-out) over A and B

Option A — Round soft labels back to hard before calibration

y_calib = np.round(y_train).astype(int)
self.calibrated_model.fit(X_train, y_calib)
  • ✅ 1-line fix, immediate
  • ❌ Silently loses the smoothing information
  • ❌ Calibration still happens on the training set (in-sample) — this was a pre-existing methodological weakness that Bug #1 just exposed
  • ❌ Doesn't address the implicit-contract pattern from §17.1+§17.2+§17.3
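
To make the "silently loses the smoothing information" point concrete, here is a small sketch. `smooth_labels` is again a hypothetical stand-in using an assumed symmetric smoothing formula, not the project's actual pipeline.

```python
import numpy as np


def smooth_labels(y_hard, eps):
    """Hypothetical symmetric label smoothing: 0 -> eps/2, 1 -> 1 - eps/2."""
    y_hard = np.asarray(y_hard, dtype=float)
    return y_hard * (1.0 - eps) + eps / 2.0


y_hard = np.array([0, 1, 1, 0])
mild = smooth_labels(y_hard, eps=0.1)    # labels near 0.05 / 0.95
strong = smooth_labels(y_hard, eps=0.3)  # labels near 0.15 / 0.85

rounded_mild = np.round(mild).astype(int)
rounded_strong = np.round(strong).astype(int)

# Both smoothing strengths collapse to the same hard labels after rounding.
same_after_rounding = np.array_equal(rounded_mild, rounded_strong)

# np.round uses round-half-to-even, so a label of exactly 0.5 becomes 0.
half_rounded = np.round(np.array([0.5, 0.5])).astype(int)

print(same_after_rounding, half_rounded)
```

After rounding, the calibrator cannot tell the two variants apart, and any label sitting exactly at 0.5 is assigned to class 0 without warning.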

Option B — Capture original hard y_train before label_pipeline

y_train_original = y_train.copy()
X_train, y_train, sw = apply_label_pipeline(X_train, y_train, ...)
# ... xgb.train ...
self._apply_calibration(X_train, y_train_original, ...)  # use original
  • ✅ Semantically cleaner than A (preserves information)
  • ✅ Cheap (~5 lines)
  • ❌ Calibration is still in-sample (on train) : the calibration is optimistic and does not assess OOS robustness
  • ❌ Only patches the bug superficially : the original problem is swept back under the rug (calibrating on train was already suboptimal even before the bug)

Option C — Calibrate on hold-out (X_val, y_val)

self._apply_calibration(X_val, y_val, config.calibration)
  • ✅ Semantically correct : calibration should always rely on a HOLD-OUT set not seen during training (ML best practice : Platt 1999, Niculescu-Mizil & Caruana 2005)
  • ✅ y_val is never touched by apply_label_pipeline → always hard-labeled → soft train labels no longer affect calibration
  • ✅ Eliminates the bug class : calibration can no longer crash on soft train labels
  • ⚠️ Side effect : calibration is now in-sample on val (the set that is evaluated afterwards) → ECE_VAL is slightly biased towards optimism
  • Mitigation : add an OOS evaluation on X_test/y_test (already available : see test_metrics at cvntrade_XGBoost_trainer.py:676-678, already computed in non-HPO mode) → ECE_TEST becomes the primary calibration metric ; ECE_VAL becomes a sanity check
  • ✅ Aligns with ADR-15 (OOS-calibrated theta), an existing precedent in the project

Operator's choice : C. Justification : it is the only option that (a) fixes the bug at its root, (b) corrects a pre-existing methodological weakness (in-sample calibration on train), and (c) does not leave the same bug class free to resurface later as an incident #4.
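
Since ECE_VAL and ECE_TEST drive the rest of this dossier, a minimal sketch of an equal-width-bin expected calibration error may help fix ideas. This is an illustration under assumptions (10 bins, binary labels, the function name is hypothetical) and is not necessarily how the project's calibration metrics compute ECE.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed positive rate - mean predicted prob| over bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids fall in [0, n_bins - 1] and p == 1.0
    # lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of samples in bin b
    return float(ece)


# Perfectly calibrated toy predictions vs. systematically overconfident ones.
ece_perfect = expected_calibration_error([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0])
ece_overconfident = expected_calibration_error([0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0])
print(ece_perfect, ece_overconfident)
```

Under Option C the same function would be evaluated once on (X_val, y_val), where the calibrator was fitted, and once on (X_test, y_test), which the calibrator never saw; only the second number is free of in-sample bias.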

4. Implementation path

4.1 Trainer change (cvntrade_XGBoost_trainer.py:315)

Before :

# Calibrate if required
if config.calibration != "none":
    self._apply_calibration(X_train, y_train, config.calibration)

After :

# Calibrate on the hold-out set (X_val, y_val) : y_val is hard-labeled and never
# touched by apply_label_pipeline. Avoids Bug #1 (incident §17.3) and fixes the
# in-sample calibration, which was methodologically suboptimal.
# y_val MUST be discrete {0, 1} (per ADR-25 fail-fast).
if config.calibration != "none":
    _assert_calibration_targets_discrete(y_val)
    self._apply_calibration(X_val, y_val, config.calibration)

4.2 Trainer-side fail-fast helper

Add a defensive assertion analogous to _assert_dmatrix_contract (PR #754 committee reco #2) :

import numpy as np


def _assert_calibration_targets_discrete(y: np.ndarray) -> None:
    """Fail-fast contract check before CalibratedClassifierCV.fit (per ADR-25,
    no silent fallback). sklearn's check_classification_targets crashes deep
    in the call stack if y is continuous ; this assertion surfaces the
    contract violation at the boundary with a readable message pointing at
    the suspected root cause (a transform that produced soft labels)."""
    y_arr = np.asarray(y)
    unique_vals = np.unique(y_arr)
    # Accept values restricted to {0, 1} (including floats 0.0 / 1.0), or any
    # integer-typed array ; everything else is treated as continuous.
    is_discrete = np.all(np.isin(unique_vals, [0, 1])) or np.issubdtype(y_arr.dtype, np.integer)
    if not is_discrete:
        raise ValueError(
            f"Calibration contract violation: target y must be discrete classes "
            f"({{0, 1}} for binary), got continuous values with "
            f"{len(unique_vals)} unique entries (sample: {unique_vals[:5]}...). "
            f"Likely cause: a transform between dataset extraction and "
            f"_apply_calibration produced soft labels (e.g., apply_label_pipeline "
            f"with eps_buy > 0). Calibration should be invoked on a hold-out set "
            f"(X_val, y_val) where labels are unmodified — see OPERATIONS §17.3."
        )

4.3 Update both calibration paths

The same change must be applied to _apply_hybrid_calibration (...trainer.py:405) which also calls CalibratedClassifierCV.fit(X_train, y_train) twice. Same fix : pass (X_val, y_val).

4.4 Integration test (regression bar)

Add to tests/integration/test_track5_label_smoothing.py (or new tests/integration/test_calibration_soft_labels.py) :

class TestCalibrationOnSoftLabelsTrain:
    """Reproduces production incident §17.3 : Track 5 label smoothing
    produces soft y_train ; trainer's _apply_calibration must not crash
    even when training labels are continuous, by calibrating on hard y_val."""

    def test_calibration_does_not_crash_with_soft_y_train(self):
        # Build (X_train, y_train) with eps_buy=0.3 → continuous in {0.15, 0.85}
        # Build (X_val, y_val) untouched → discrete in {0, 1}
        # Run trainer.train(..., config with calibration='isotonic')
        # Verify : no crash, calibrated_model is callable, predict_proba in [0,1]
        ...

    def test_calibration_assertion_fires_on_continuous_val(self):
        # If y_val is somehow continuous (defensive), assertion must raise
        # with the readable message before CalibratedClassifierCV.fit
        ...
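
As a starting point for the second test, the sketch below exercises the assertion on its own. It is framework-agnostic and self-contained by design: the helper is a condensed copy of the §4.2 function with a shortened message, and the first test (which needs the real trainer and config) is deliberately not attempted here.

```python
import numpy as np


def _assert_calibration_targets_discrete(y):
    # Condensed copy of the §4.2 helper so this sketch runs standalone.
    y_arr = np.asarray(y)
    unique_vals = np.unique(y_arr)
    is_discrete = np.all(np.isin(unique_vals, [0, 1])) or np.issubdtype(y_arr.dtype, np.integer)
    if not is_discrete:
        raise ValueError(
            "Calibration contract violation: target y must be discrete classes"
        )


def test_assertion_passes_on_hard_val():
    # Integer and float encodings of hard labels are both accepted.
    _assert_calibration_targets_discrete(np.array([0, 1, 1, 0]))
    _assert_calibration_targets_discrete(np.array([0.0, 1.0, 1.0]))


def test_assertion_fires_on_continuous_val():
    # Soft labels such as those produced with eps_buy=0.3 must be rejected.
    y_val_soft = np.array([0.15, 0.85, 0.85, 0.15])
    try:
        _assert_calibration_targets_discrete(y_val_soft)
    except ValueError as exc:
        assert "contract violation" in str(exc)
    else:
        raise AssertionError("expected ValueError on soft labels")


test_assertion_passes_on_hard_val()
test_assertion_fires_on_continuous_val()
```

In the real suite these would import the helper from the trainer module rather than redefine it, and would typically use the test framework's own exception-assertion facility.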

4.5 Documentation

  • OPERATIONS.md §17.3 — incident log entry
  • documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-plan.md — this dossier
  • documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-pr-review.md — committee pr_review dossier (Phase 4)
  • Brief comment in trainer code citing OPERATIONS §17.3

5. Acceptance criteria

  • Plan dossier reviewed by Expert Committee (plan_review) — verdict PASSED
  • _apply_calibration called with (X_val, y_val) instead of (X_train, y_train) (both standard and hybrid paths)
  • _assert_calibration_targets_discrete invoked before each CalibratedClassifierCV.fit
  • Integration test reproducing the crash (eps_buy=0.3 + calibration='isotonic') — verified to FAIL before fix and PASS after (regression bar)
  • Existing calibration tests (tests/unit/test_calibration_metrics.py, tests/unit/test_oos_calibrator.py) green
  • Trainer test tests/unit/test_model_trainers.py green
  • ECE_VAL post-fix ≤ ECE_VAL pre-fix + 0.005 (calibration on val should be about as good as calibration on train, since train and val come from similar distributions)
  • ECE_TEST (OOS) becomes the primary calibration metric reported in FTF results — added to _evaluate_model if not already
  • OPERATIONS.md §17.3 added
  • Committee pr_review PASSED before merge
  • Operator re-triggers FTF sweep — variant eps_buy=0.3 cl_mode=off calibration='isotonic' no longer crashes

6. Falsifiability criteria

The hypothesis "calibration on val + assertion improves robustness without degrading quality" is falsified if any of the following holds :

  • ECE_VAL post-fix > ECE_VAL pre-fix + 0.01 → calibration on val is materially worse than calibration on train (unlikely given same data distribution but worth checking)
  • ECE_TEST (OOS) > ECE_TEST pre-fix + 0.01 → OOS calibration regresses (would require investigating overfit to val)
  • Brier_score on test > Brier_score pre-fix + 0.005 → calibration quality regression
  • Any baseline variant (eps=0, cl=off) regresses on f1_buy or expectancy_net by ≥ 0.005 (calibration change should not impact decision threshold metrics in a healthy world)

If any falsifier triggers : plan rollback (Option B as fallback) + investigate.
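
The falsifiers above can be checked mechanically once pre-fix and post-fix metrics are available. The sketch below encodes the thresholds from the list ; the metric key names and the dictionary layout are hypothetical and would need to match whatever the FTF results actually report.

```python
# Maximum tolerated increase (post - pre) ; thresholds taken from the list above.
MAX_INCREASE = {"ece_val": 0.01, "ece_test": 0.01, "brier_test": 0.005}
# A drop of at least this much on a baseline variant counts as a regression.
MAX_BASELINE_DROP = {"f1_buy": 0.005, "expectancy_net": 0.005}


def triggered_falsifiers(pre, post, baseline_pre, baseline_post):
    """Return the names of all falsifiers triggered by the metric deltas."""
    triggered = []
    for name, tol in MAX_INCREASE.items():
        if post[name] > pre[name] + tol:
            triggered.append(name)
    for name, tol in MAX_BASELINE_DROP.items():
        if baseline_pre[name] - baseline_post[name] >= tol:
            triggered.append(f"baseline_{name}")
    return triggered


# Made-up numbers, for illustration only.
pre = {"ece_val": 0.040, "ece_test": 0.050, "brier_test": 0.200}
post = {"ece_val": 0.045, "ece_test": 0.070, "brier_test": 0.202}
baseline_pre = {"f1_buy": 0.60, "expectancy_net": 0.10}
baseline_post = {"f1_buy": 0.60, "expectancy_net": 0.08}

result = triggered_falsifiers(pre, post, baseline_pre, baseline_post)
print(result)
```

An empty list means no falsifier fired ; any non-empty result would lead to the rollback path described above.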

7. Risks

| # | Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- | --- |
| 1 | ECE_VAL slightly degrades (in-sample bias removed) | medium | low | Document as expected ; ECE_TEST becomes primary metric |
| 2 | Calibration overfits to val if val is small | low | medium | Check val_size ≥ 500 in trainer ; add guardrail if not |
| 3 | Hybrid calibration path missed by tests → regression | medium | medium | Explicit test for _apply_hybrid_calibration path |
| 4 | OOS calibration metric (ECE_TEST) pre-existing values not comparable post-fix | low | low | Document baseline shift in OPERATIONS §17.3 + plan dossier |
| 5 | The fix exposes a 4th implicit contract elsewhere | medium | high | Audit all y_train/y_val consumers in trainer + post-trainer code (oos_calibrator.py, regime_trainer.py, etc.) ; out of scope of S07, file follow-up Story if found |
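
The guardrail mentioned for Risk #2 could look like the sketch below. The function name and message are hypothetical ; only the 500-sample threshold comes from the table, and raising (rather than warning) is an assumption made here in the spirit of ADR-25.

```python
MIN_CALIBRATION_SAMPLES = 500  # threshold from Risk #2


def _assert_min_calibration_size(y_val, min_samples=MIN_CALIBRATION_SAMPLES):
    """Fail fast if the hold-out set is too small to calibrate on reliably."""
    n = len(y_val)
    if n < min_samples:
        raise ValueError(
            f"Calibration hold-out too small: {n} samples, "
            f"expected at least {min_samples}."
        )
    return n


# A hold-out set of 600 labels passes ; one of 120 labels is rejected.
accepted = _assert_min_calibration_size([0, 1] * 300)
try:
    _assert_min_calibration_size([0, 1] * 60)
    rejected = False
except ValueError:
    rejected = True
print(accepted, rejected)
```

Whether a too-small val set should abort the variant or merely skip calibration is a decision for the committee ; the sketch only shows the fail-fast flavour.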

8. Cost estimate

| Phase | Effort | Wall-clock |
| --- | --- | --- |
| 1 (Plan + committee) | 2 h | 1 day (committee + recos round-trip) |
| 2 (Design) | 1 h | included in Phase 3 |
| 3 (Implem + tests + docs) | 3 h | 1 day |
| 4 (PR + CR cycles + committee pr_review + merge + deploy) | 4 h | 2 days (CR cycles 4-5 + committee + image build/deploy) |
| 5 (Validation cluster) | 30 min | dependent on operator |
| Total | ~10 h effort | ~4 days wall-clock |

9. Out of scope (tracked separately)

  • Single-source dependency hygiene → CVN-N013-EA-S02-S04 (already created)
  • Generalized contract enforcement at all ML pipeline boundaries → CVN-N011-EA-S01 (data contracts ADR, already created)
  • Audit of other y_train/y_val consumers (Risk #5) → follow-up Story if anything found
  • Fold-aware calibration (calibrate on a separate slice of val per fold) → potential follow-up if Risk #2 materializes

10. Question for the committee

Is Option C (calibrate on (X_val, y_val) + fail-fast assertion + integration test reproducing the crash) the right grain of fix for this Story — or should the scope be widened to (a) split val into val_eval / val_calib, (b) audit all other implicit contracts in the trainer at the same time, or (c) both ?

Bonus : what is the right durable mechanism (beyond this Story's scope) to prevent a 4th incident on the apply_label_pipeline surface ?

11. Linked context

  • ADR-15 (OOS-calibrated theta) : direct precedent for hold-out calibration
  • ADR-25 (no silent fallback / fail-fast) — informs the _assert_calibration_targets_discrete design
  • ADR-58 (every FTF factor must have guardrail + integration test) — informs the test pattern
  • ADR-68 (Expert Committee for substantive ML PR)
  • OPERATIONS §17.1 + §17.2 + §17.3 (this incident series — same surface, 3 incidents in 24 h)
  • Committee 0e15acc0 (PR #754 pr_review) — reco #4 "mandate integration testing parity" predicted this exact gap (calibration codepath uncovered) ; this Story partially applies it