
ADR-47 — Meta-label model trained on a separate fold (no shared data with primary)

Status: accepted (partially enforced) (backfilled 2026-04-24 from prior references; partial-enforcement note added 2026-04-28, CR PR #652 round 4)
Date: 2026-04-24 (backfill) · originally implied by v2 LdP pipeline design
Introduced by: backfill under #637 (Phase 2 of #593)
Supersedes: none


Context

The meta-label model (plugin meta_label_plugin.py) is a second ML classifier that, for each primary BUY signal, predicts the probability that THIS signal is correct. It validates the primary model's prediction before execution (cf. Lopez de Prado, Advances in Financial Machine Learning ch. 3).

The classic trap is training the meta-model on the same rows as the primary model. This leaks the primary's labels: the primary has already fit those rows, so the errors the meta learns as "signals to reject" are in-sample errors, not the ones the primary makes on unseen data. The result is an optimistically biased meta-model whose P(correct) is systematically overestimated on the validation sample.

The effect: in live trading, the meta-model accepts signals it would have rejected if exposed to OOS data; the funnel behaves like v1 but with false confidence in the meta gate.

Decision

The meta-model is trained on a fold disjoint from the primary model's fold:

  • Primary model: fold T (classical train / val / test with embargo, ADR-14 — purged + embargoed CV via PurgedKFold, reads CVN_PURGE_BARS + CVN_EMBARGO_BARS).
  • Meta model: fold T+1 (or wider rolling window), using the primary's OOS predictions as a feature and, as the binary label, whether the primary's prediction matched y_true (correct / incorrect).
  • The two folds are separated by an embargo of CVN_EMBARGO_BARS (single ADR-14 policy, no parallel knob).
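The fold layout above can be pictured with a deliberately simplified sketch: one symbol, time-ordered rows addressed by position, and a single cut between fold T and fold T+1. The function name split_primary_meta and its parameters are hypothetical; the project's actual split goes through PurgedKFold (ADR-14), which this does not reproduce.

```python
def split_primary_meta(n_rows, primary_end, embargo_bars):
    """Return (primary_idx, meta_idx) for time-ordered rows 0..n_rows-1.

    Rows [0, primary_end) form the primary fold, the next `embargo_bars`
    rows are dropped, and the remaining rows form the meta fold.
    Hypothetical helper: it only illustrates the disjoint-fold layout.
    """
    if not 0 < primary_end <= n_rows:
        raise ValueError("primary_end must lie in (0, n_rows]")
    if embargo_bars < 0:
        raise ValueError("embargo_bars must be non-negative")
    # Clamp so that an embargo longer than the remaining rows yields an empty meta fold.
    meta_start = min(primary_end + embargo_bars, n_rows)
    primary_idx = list(range(primary_end))
    meta_idx = list(range(meta_start, n_rows))
    return primary_idx, meta_idx


# Example values are arbitrary; embargo_bars stands in for CVN_EMBARGO_BARS.
primary_idx, meta_idx = split_primary_meta(n_rows=20, primary_end=10, embargo_bars=3)
```

With these example values the primary fold covers rows 0 to 9, rows 10 to 12 are embargoed, and the meta fold covers rows 13 to 19.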

At training time, the orchestrator generates the primary's OOS predictions via purged cross-validation (ADR-14) and only passes to the meta-trainer the rows whose labels the primary has NOT seen.
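A minimal sketch of the hand-off described above, assuming the primary's OOS predictions are already available as rows. The OosRow type, the build_meta_rows function, the meta_label field name and the 0.5 BUY cut-off are all illustrative assumptions, not the orchestrator's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OosRow:
    symbol: str
    timestamp: int
    proba_buy: float  # primary's out-of-sample P(BUY) for this row
    y_true: int       # 1 if a BUY would have been correct, else 0


def build_meta_rows(oos_rows, buy_threshold=0.5):
    """Turn primary OOS predictions into meta-model training rows.

    Only rows where the primary emits a BUY signal are kept. The meta
    feature is the primary's OOS probability; the meta label records
    whether that BUY signal was correct. `buy_threshold` is an assumed
    cut-off, not a documented project setting.
    """
    meta_rows = []
    for row in oos_rows:
        if row.proba_buy < buy_threshold:
            continue  # no BUY signal, so there is nothing to meta-label
        meta_rows.append({
            "symbol": row.symbol,
            "timestamp": row.timestamp,
            "pred_proba_primary": row.proba_buy,
            "meta_label": int(row.y_true == 1),
        })
    return meta_rows


oos = [
    OosRow("BTCUSDT", 1, 0.80, 1),  # BUY signal, correct
    OosRow("BTCUSDT", 2, 0.30, 1),  # no BUY signal, skipped
    OosRow("BTCUSDT", 3, 0.60, 0),  # BUY signal, incorrect
]
meta_rows = build_meta_rows(oos)
```

The sketch never touches the primary's raw features: the only thing the meta row inherits from the primary is its OOS probability.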

Invariants (enforced)

  • Invariant 2: the meta-model's training set has no row overlap with the primary's training set, enforced by the PurgedKFold index-level split used by both trainers (src/training/cv/purged_kfold.py, ADR-14). Since each row is uniquely identified by (symbol, timestamp) in the OHLCV input, index-level non-overlap implicitly prevents (symbol, timestamp) overlap. Defensive check (planned in #737 follow-up scope): an explicit (symbol, timestamp) intersection assertion at meta-feature assembly time, as defense-in-depth. It is not load-bearing for correctness today (PurgedKFold suffices), but it is a useful trap for future refactors that might bypass PurgedKFold.
  • Invariant 3: the meta-model NEVER reads the primary's raw features directly; it sees the primary only through its OOS predictions. A meta feature that uses the ground truth on the same window as the primary's training set is a bug (leakage). Enforced at code-review time via the meta-feature assembly module convention (no direct X_primary[...] access; only pred_proba_primary[...]).
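The planned defensive check for Invariant 2 could look roughly like the sketch below. The function name assert_disjoint_keys and the error message are hypothetical; the only thing taken from the ADR is the idea of asserting an empty (symbol, timestamp) intersection at meta-feature assembly time.

```python
def assert_disjoint_keys(primary_keys, meta_keys):
    """Raise if any (symbol, timestamp) key appears in both training sets.

    Hypothetical defense-in-depth helper: both arguments are iterables
    of (symbol, timestamp) tuples.
    """
    overlap = set(primary_keys) & set(meta_keys)
    if overlap:
        sample = sorted(overlap)[:5]  # keep the error message short
        raise ValueError(
            f"{len(overlap)} (symbol, timestamp) key(s) shared between "
            f"primary and meta training sets, e.g. {sample}"
        )


primary_keys = [("BTCUSDT", 1), ("BTCUSDT", 2), ("ETHUSDT", 1)]
meta_keys = [("BTCUSDT", 5), ("ETHUSDT", 6)]
assert_disjoint_keys(primary_keys, meta_keys)  # disjoint, so no exception
```

Raising rather than logging keeps the check in line with the fail-fast policy the ADR cites (ADR-25).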

Planned enforcement (#737)

  • Invariant 1 (planned): meta_label_plugin.py refuses to load a meta-model whose MLflow origin_fold metadata equals the primary model's origin_fold (fail-fast, ADR-25). Current status: NOT IMPLEMENTED. origin_fold is not yet tagged at MLflow registration time and not yet verified at load time. Compliance currently rests on the MLflow run-naming convention + Invariant 2 (row non-overlap via PurgedKFold; the explicit (symbol, timestamp) intersection assertion is itself still planned). When #737 merges, this invariant flips to enforced and moves into the Invariants (enforced) block above; the ADR status flips from accepted (partially enforced) to active.
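Since the origin_fold check is not implemented yet, the following is only a sketch of what the load-time guard might do. It works on plain dicts of tags and makes no MLflow calls; FoldLeakageError and check_origin_folds are hypothetical names, and treating a missing tag as a failure is an assumption drawn from the fail-fast policy, not something the ADR specifies.

```python
class FoldLeakageError(RuntimeError):
    """Raised when a meta-model must not be loaded (hypothetical type)."""


def check_origin_folds(primary_tags, meta_tags):
    """Refuse a meta-model trained on the same fold as the primary.

    Both arguments are plain dicts of model tags. How the tags are read
    from MLflow is deliberately left out of this sketch.
    """
    try:
        primary_fold = primary_tags["origin_fold"]
        meta_fold = meta_tags["origin_fold"]
    except KeyError as exc:
        # Assumption: an untagged model is rejected rather than trusted.
        raise FoldLeakageError("origin_fold tag missing on a model") from exc
    if primary_fold == meta_fold:
        raise FoldLeakageError(
            f"meta-model and primary share origin_fold={primary_fold!r}"
        )


check_origin_folds({"origin_fold": "T"}, {"origin_fold": "T+1"})  # passes
```

The fold labels "T" and "T+1" are placeholders; the real tag format would be fixed by #737.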

Alternatives rejected

  • Meta-model trained on the same fold as the primary: leakage — exactly what this ADR forbids.
  • Meta-model = heuristic rule (no ML): loses the ability to learn the primary's error patterns. Kept as a degraded mode if the meta-model is absent from MLflow (cf. ADR-25: fail-fast, no silent fallback; treating an absent meta-model as "plugin disabled" is nonetheless acceptable).
  • Single-fold stacking with wide purge: equivalent to what this ADR mandates, but without the explicit named-fold separation. Rejected for log readability + auditability reasons (ADR-42).

Consequences

  • Positive: the meta-model's signal genuinely reflects its OOS discriminative power. Calibration of the meta threshold (CVN_META_THRESHOLD) becomes meaningful.
  • Negative: training cost ~2× (two folds). Mitigatable via parallelization (ADR-56 FTF factor).
  • Neutral: the meta-model can now use the primary's y_pred_proba OOS as a feature, enriching its input without leakage.

Rollback

  • Disable the meta-label plugin: CVN_USE_META_LABEL=0 (already the baseline — cf. FILTER_FUNNEL §5.4).
  • Or revert the commit introducing the fold separation; ADR-47 must then be explicitly superseded, not silently abandoned.
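The first rollback path amounts to the gate becoming a pass-through. The sketch below shows one way that could behave, under several assumptions: the flag is the string "1" when enabled, a disabled plugin lets every signal through, a signal is accepted when P(correct) is greater than or equal to CVN_META_THRESHOLD, and a missing threshold is a hard error. The function meta_gate_accepts is hypothetical and is not the plugin's actual code.

```python
def meta_gate_accepts(p_correct, env):
    """Decide whether a primary BUY signal passes the meta gate.

    `env` is a mapping of environment variables (e.g. os.environ).
    Hypothetical sketch; the comparison rule and error handling are assumptions.
    """
    if env.get("CVN_USE_META_LABEL", "0") != "1":
        return True  # plugin disabled (the baseline): signal is not gated
    raw_threshold = env.get("CVN_META_THRESHOLD")
    if raw_threshold is None:
        # Fail fast instead of silently picking a default threshold.
        raise RuntimeError("CVN_META_THRESHOLD must be set when the plugin is enabled")
    return p_correct >= float(raw_threshold)


# The 0.6 threshold is an arbitrary example value.
enabled = {"CVN_USE_META_LABEL": "1", "CVN_META_THRESHOLD": "0.6"}
rolled_back = {"CVN_USE_META_LABEL": "0", "CVN_META_THRESHOLD": "0.6"}
accepted_when_enabled = meta_gate_accepts(0.4, enabled)
accepted_when_rolled_back = meta_gate_accepts(0.4, rolled_back)
```

In this example the same low-confidence signal is rejected while the plugin is enabled and accepted once CVN_USE_META_LABEL=0 is set.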

References

  • Parent need: CVN-N001 (F1 mission, #608) · meta-label is a funnel filter during economic-layer evaluation
  • Related ADRs: ADR-14 (purged + embargoed CV — single source of CVN_PURGE_BARS + CVN_EMBARGO_BARS), ADR-15 (OOS threshold), ADR-25 (no silent fallback), ADR-44 (filter contract)
  • Code: src/commun/filters/plugins/meta_label_plugin.py (load + gate at filter-chain time), src/training/MetaLabel/cvntrade_meta_label_trainer.py (meta-model training on disjoint fold), src/training/XGBoost/cvntrade_XGBoost_autonomous_trainer.py (primary OOS predictions used as meta features)
  • Docs: documentation/architecture/FILTER_FUNNEL.md §5.4 (Meta Label), documentation/architecture/TRAINING_PIPELINE.md §Anti-leakage
  • External ref: Lopez de Prado, Advances in Financial Machine Learning, ch. 3 §3.4 "Meta-Labeling"