Hotfix v2 Dossier: Track 5 Production Failure (XGBoost feature-name mismatch)

Date: 2026-04-28
PR: #754
Issue: #753
Story: CVN-N001-EE-S01 (Track 5, OP wp#40 In testing)
Author: Dominique (operator) + Claude
Session type: pr_review (per ADR-68: substantive code in src/training/)
Severity: CRITICAL (production FTF sweep; second failure on the same surface in <24h)
Predecessor: Hotfix v1 dossier (PR #752, merged, commit 7e1e0eeb)

1. What happened: chronology

| When | Event |
| --- | --- |
| 2026-04-28 12:25 UTC | Operator triggered FTF sweep. Incident #1: Hamilton `validate_inputs` raised on DataFrame inputs (~75% of variants failed) |
| 2026-04-28 ~14:00 UTC | Hotfix v1 (PR #752) merged after committee `pr_review` PASSED; it added a defensive DataFrame → ndarray coercion |
| 2026-04-28 ~15:30 UTC | Operator re-triggered FTF sweep. Incident #2 (this dossier) |
| 2026-04-28 ~16:00 UTC | Hotfix v2 (PR #754) opened |

2. Production failure observed (incident #2)

ValueError: data did not contain feature names, but the following fields are expected:
  open, high, low, close, BBL_8_2_0, BBM_8_2_0, BBU_8_2_0, RSI_14, MACD_12_26_9, ...

Failure path: xgb.train(params, dtrain, evals=[(dtrain, "train"), (dval, "val")], ...) raised inside XGBoost's _validate_features.

3. Root cause analysis

After hotfix v1, the coercion layer in apply_label_pipeline converts X_train from DataFrame to ndarray, which drops its feature names. The trainer never passes X_val through the pipeline, so X_val remains a DataFrame with feature names. The XGBoost call sequence becomes:

# src/training/XGBoost/cvntrade_XGBoost_trainer.py
X_train, y_train = datasets["train"]                         # DataFrame, Series
X_train, y_train, sw = apply_label_pipeline(X_train, y_train, ...)  # → ndarray (post hotfix v1)
X_val, y_val = datasets["val"]                               # DataFrame, Series — UNTOUCHED

dtrain = xgb.DMatrix(X_train, label=y_train, weight=sw)      # no feature_names
dval   = xgb.DMatrix(X_val, label=y_val)                     # has feature_names

xgb.train(params, dtrain, evals=[(dtrain, "train"), (dval, "val")], ...)  # ← BOOM

XGBoost's _validate_features cross-checks dtrain ↔ dval and raises when one has feature names and the other doesn't. The defensive coercion in hotfix v1 thus broke an implicit, undocumented contract the trainer relied on: the DMatrices passed in evals must either all carry feature names or all lack them.
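
The implicit contract described above could be made explicit before any DMatrix is built. The following is a minimal sketch, assuming the inputs are either pandas DataFrames or plain arrays; `feature_names_of` and `assert_feature_name_parity` are hypothetical names for illustration and are not part of the trainer.

```python
import pandas as pd


def feature_names_of(X):
    """Return column names for a DataFrame, or None for inputs without columns."""
    if isinstance(X, pd.DataFrame):
        return [str(c) for c in X.columns]
    return None


def assert_feature_name_parity(X_train, X_val):
    """Fail fast when only one side carries names, or when the names differ.

    Hypothetical guard: it mirrors the contract the trainer assumed, but is
    not taken from the actual codebase.
    """
    train_names = feature_names_of(X_train)
    val_names = feature_names_of(X_val)
    if train_names != val_names:
        raise TypeError(
            "feature-name mismatch between train and val inputs: "
            f"train={train_names!r}, val={val_names!r}"
        )
```

Called just before the two xgb.DMatrix constructions, a check of this kind would surface the mismatch with both sides named in the message, rather than deep inside xgb.train.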

4. Why hotfix v1's tests didn't catch it

Hotfix v1 added TestDataFrameCoercion in tests/unit/training/labels/test_label_pipeline.py. Those tests exercised apply_label_pipeline in isolation: DataFrame in, ndarray out, no downstream XGBoost call. They satisfied the Hamilton contract but never replayed the full trainer codepath.

Methodology gap: unit tests of a transform are not enough. Its integration with downstream consumers must also be tested at the boundary.

5. Hotfix v2: what changed

Preserve the input type across the round trip. Internally the transform still coerces to ndarray (Hamilton requires it), but on output it re-wraps the result in the original type with the original metadata.

Patch summary:
- apply_label_pipeline saves X_was_dataframe, X_columns, y_was_series, y_name, w_was_series, w_name BEFORE the coercion block
- A nested _restore_input_types(x_out, y_out, w_out) helper re-wraps to the original type
- The helper is invoked in both return paths: identity short-circuit AND Hamilton path
- Net contract: apply_label_pipeline is a type-preserving transform (DataFrame in → DataFrame out with the same column names; ndarray in → ndarray out)
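
The save-then-restore pattern in the patch summary can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `apply_type_preserving` and its `transform` argument are hypothetical stand-ins (the real apply_label_pipeline signature is not shown in this dossier), and re-wrapping with a fresh RangeIndex is an assumption made here because a row-dropping transform no longer lines up with the original index.

```python
import numpy as np
import pandas as pd


def apply_type_preserving(X, y, w, transform):
    """Run an ndarray-only `transform`, then mirror the caller's input types."""
    # Record types and metadata BEFORE the coercion block.
    X_was_dataframe = isinstance(X, pd.DataFrame)
    X_columns = list(X.columns) if X_was_dataframe else None
    y_was_series = isinstance(y, pd.Series)
    y_name = y.name if y_was_series else None
    w_was_series = isinstance(w, pd.Series)
    w_name = w.name if w_was_series else None

    def _restore_input_types(x_out, y_out, w_out):
        # Assumption of this sketch: outputs receive a fresh RangeIndex.
        if X_was_dataframe:
            x_out = pd.DataFrame(x_out, columns=X_columns)
        if y_was_series:
            y_out = pd.Series(y_out, name=y_name)
        if w_was_series and w_out is not None:
            w_out = pd.Series(w_out, name=w_name)
        return x_out, y_out, w_out

    # Coerce to ndarray for the inner transform.
    x_arr = np.asarray(X)
    y_arr = np.asarray(y)
    w_arr = None if w is None else np.asarray(w)
    return _restore_input_types(*transform(x_arr, y_arr, w_arr))
```

With a row-dropping `transform`, a DataFrame input comes back as a shorter DataFrame with the same columns, while ndarray inputs come back as ndarrays, which is the net contract stated in the last bullet.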

6. New tests: `TestTrainerEndToEndDataFrame`

Located in tests/integration/test_track5_label_smoothing.py (6 tests added).

The class replays the exact trainer codepath, not the transform in isolation:

def _trainer_xgb_train(self, X_train, y_train, X_val, y_val, sample_weights):
    dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weights)
    dval   = xgb.DMatrix(X_val, label=y_val)
    # xgb.train returns the fitted Booster used for prediction below
    booster = xgb.train(params, dtrain, num_boost_round=5,
                        evals=[(dtrain, "train"), (dval, "val")], verbose_eval=False)
    preds = booster.predict(dval)
    assert preds.shape == (X_val.shape[0],)
    assert (preds >= 0.0).all() and (preds <= 1.0).all()

Coverage:
1. Four parametrized variants of the label_smoothing × cleanlab matrix (baseline, mild × off, none × filter, mild × reweight): all four run xgb.train end-to-end through the trainer pattern
2. test_ndarray_inputs_still_return_ndarray: reverse direction, no surprise type promotion
3. test_dataframe_columns_preserved_through_filter_mode: column metadata survives the row-shrinking filter mode

Regression bar proven by temporary revert: with _restore_input_types removed, 5 of these 6 tests fail with the production error message. With it re-applied, all 6 pass.

7. Validation

pytest tests/unit/training/labels/ tests/integration/test_track5_label_smoothing.py tests/unit/test_ftf_guardrails.py → 187 passed
black --line-length=120 → 1 file reformatted, all green
mkdocs build --strict → green (only pre-existing INFO links)

8. Question for the committee

Is hotfix v2 a complete fix and a sufficient regression bar for this class of bug, or are there sibling failure modes still latent in apply_label_pipeline (e.g., dtype coercion, sparse matrix support, weight Series alignment, NaN handling) that should be added to the test class before merge?

Bonus: what is the right durable mechanism to prevent a third incident on this surface? (e.g., trainer-side type contract assertion; ADR change; test fixture parity rule between unit and integration; …)

9. Linked context

ADR-25 (no silent fallback / fail-fast)
ADR-61 (batch DAGs use Hamilton)
ADR-68 (Expert Committee for substantive PR)
ADR-69 (OpenProject as orchestrator): wp#40 stays In testing until the Phase 3 sweep produces results
OPERATIONS §17 (incident #1) + §17.2 (incident #2, to be added post-merge)
6 forward-looking recommendations from the previous committee (DataFrame fixture parity, production smoke gate, cache layer audit): backlog