Plan Dossier — Track 5 : Asymmetric label smoothing + cleanlab (CVN-N001-EE-S01)

Date: 2026-04-28
Issue: #712
Story: CVN-N001-EE-S01 (OP wp#40, sprint F1B-S1-QW-PhaseA)
Epic: CVN-N001-EE (#707), F1_buy boost (13-track plan)
Need: CVN-N001 (F1 mission, #608)
Author: Dominique (operator) + Claude
Status: awaiting plan_review (ADR-68 mode A : committee BEFORE implementation)
Reviewers requested: Expert Committee (5 personas + consolidator)
Operational prereqs sign-off: ✅ ADR-70 (#709 wp#55, PR #730), ✅ ADR-71 (#708 wp#56, PR #731)


0. Executive summary

One paragraph — the rest of the dossier expands on this.

CVNTrade's binary classifier plateaus at f1_buy ∈ [0.40, 0.46] across 400 Phase 2 runs (median 0.418, peak 0.541). The hypothesis under test : the model overfits noise on the minority BUY class (~20 % rate) because hard binary cross-entropy on {0, 1} labels rewards over-confident wrong predictions. Track 5 introduces two independent levers : (a) asymmetric label smoothing (y_buy = 1 − ε_buy, y_hold = ε_hold with ε_buy > ε_hold to shrink the BUY tail more than the HOLD bulk) ; (b) cleanlab-based label-noise filtering (drop or reweight samples where confident-learning identifies likely mislabelling). Both ship as independent FTF factors (ADR-58 guardrails + integration tests), parameters editable from Console (ADR-59), label preprocessing implemented as a Hamilton dataflow (ADR-61) emitting OpenTelemetry spans (ADR-62). Success = per-track gate (F1 plan §6) cleared on the FTF rerun against baseline ftf_20260427_170614_06626e_*. If neither lever fires the gate, the result still falsifies the smoothing hypothesis cleanly and unlocks Track 6 (focal loss).
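A minimal sketch of lever (a), assuming labels are encoded 0 = HOLD and 1 = BUY. The function name `asymmetric_smooth` matches the node named in §4.2, but the body is illustrative and not the shipped implementation.

```python
import numpy as np


def asymmetric_smooth(y: np.ndarray, eps_buy: float, eps_hold: float) -> np.ndarray:
    """Map hard labels {0, 1} to soft targets: BUY -> 1 - eps_buy, HOLD -> eps_hold."""
    y = np.asarray(y)
    if not np.isin(y, (0, 1)).all():
        raise ValueError("labels must be 0 (HOLD) or 1 (BUY)")
    if not 0.0 <= eps_hold <= eps_buy <= 0.5:
        raise ValueError("expected 0 <= eps_hold <= eps_buy <= 0.5")
    soft = np.where(y == 1, 1.0 - eps_buy, eps_hold)
    return soft.astype(np.float32)


y_raw = np.array([0, 1, 0, 0, 1], dtype=np.int8)
# "mild" variant values from the §4.3 variant table
y_mild = asymmetric_smooth(y_raw, eps_buy=0.15, eps_hold=0.075)
```

With both epsilons at 0.0 the function is the identity (up to the dtype cast), which is how the none variant leaves the existing trainer behaviour unchanged.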


1. Objective

1.1 Problem

f1_buy plateau on the live FTF surface :

| Reference point | f1_buy | Source |
| --- | --- | --- |
| Median (n=400 variants) | 0.418 | Phase 2 audit, 2026-04-26 |
| Peak | 0.541 | Phase 2 audit |
| Baseline naïve "always BUY" | ~0.20 | ADR-29 |
| Economic break-even (cost-net) | ~0.500 | F1 plan §4 |

The median is below break-even — current model is structurally unprofitable. Tuning inside the existing OHLCV + XGBoost + binary-triple-barrier surface has reached an asymptote (12 levers tried in v1, all hit the same wall). Track 5 exits the asymptote by changing the label structure without changing inputs, model class, or barriers.

1.2 Hypothesis (falsifiable)

H : Hard binary labels {0, 1} cause XGBoost to overfit on the minority BUY class. Replacing them with asymmetric soft labels (with ε_buy > ε_hold) and removing systematically mislabelled samples (cleanlab confident-learning) will reduce calibration error and improve f1_buy by ≥ +0.015 on the joint metric gate.

The hypothesis is falsifiable : it is contradicted if none of the following three conditions holds on the FTF rerun :

  • ECE drops by ≥ 0.02 (calibration improvement)
  • Δf1_buy ≥ +0.01 (Story-specific success, AC #712)
  • Per-track gate (F1 plan §6) cleared on at least one variant of either factor

If all three fail across all variants, the hypothesis is rejected. The Story still merges (the FTF result is a permanent record per ADR-29), the factor stays available with none as the default variant, and we proceed to Track 6 (focal loss) with a tighter prior.

1.3 Success criterion (from #712 + F1 plan §6)

The Story is considered successful if all of the following hold on a single winning variant configuration :

| Criterion | Threshold | Source |
| --- | --- | --- |
| ECE drop OR f1_buy gain | ECE reduction ≥ 0.02 OR Δ f1_buy ≥ +0.01 | #712 Story-specific |
| Per-class HOLD ECE | ECE_HOLD ≤ baseline + 0.01 (asymmetric smoothing must NOT degrade majority calibration) | committee reco 7 |
| Joint primary : f1_buy | Δ ≥ +0.015 with 95 % bootstrap CI excluding 0 | F1 plan §6 |
| Joint primary : expectancy_net | ≥ baseline (cost formula v3) | F1 plan §4 + §6 |
| Joint primary : sortino | ≥ baseline | F1 plan §6 |
| Joint primary : max_drawdown | ≤ baseline + 1 % | F1 plan §6 |
| Distribution : cryptos improving | ≥ 4 of 5 | F1 plan §6 |
| Density : BUY trades per fold | ≥ 50 | F1 plan §6 |
| Statistical : Cohen's d | ≥ 0.3 vs baseline on f1_buy | FTF tuning protocol |
| Statistical : BH-corrected p | < 0.05 vs baseline on f1_buy | FTF tuning protocol |

A variant that beats the baseline on f1_buy alone but degrades expectancy_net is automatically rejected by the joint gate (F1 plan §6 risk #2).
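The ECE thresholds in the criteria above depend on how ECE is binned, which this dossier does not specify. The sketch below assumes the common definition: 10 equal-width bins over the predicted BUY probability, with each bin's gap between observed BUY rate and mean predicted probability weighted by the bin's share of samples. The per-class ECE_HOLD variant is left out because its exact definition is not given here.

```python
import numpy as np


def expected_calibration_error(y_true, p_buy, n_bins: int = 10) -> float:
    """Binned ECE for a binary classifier (assumed definition, equal-width bins)."""
    y_true = np.asarray(y_true, dtype=float)
    p_buy = np.asarray(p_buy, dtype=float)
    # Bin index in 0..n_bins-1; p == 1.0 is folded into the last bin.
    idx = np.minimum((p_buy * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - p_buy[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the bin's share of samples
    return float(ece)


# Two samples predicted at 0.9, only one of which is a BUY: the gap is 0.4.
ece_overconfident = expected_calibration_error([0, 1], [0.9, 0.9])
```

The Story-specific criterion then compares `ece_baseline - ece_variant` against 0.02 on hard (unsmoothed) validation labels.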


2. How we prove the objective is met

2.1 Baseline anchor

Baseline run: ftf_20260427_170614_06626e_* (Phase 2 rerun, 550 results, post-#704 FS↔FE drift fix). All metrics are read from PostgreSQL finetune_results (src/commun/finetune/persistence.py:34) with WHERE run_id LIKE 'ftf_20260427_170614_06626e_%'.

For each (crypto × fold) cell, the baseline value is the none variant of every existing factor. Track 5 introduces two new factors (label_smoothing + cleanlab) ; the baseline = both at none / off (i.e., the existing trainer behavior, unchanged).

Fold construction (per ADR-14 + committee reco 2) : 5 outer folds via walk-forward time-series split with purge + embargo (no random KFold). Each fold's train / val / test windows are contiguous in time, and a 1-day embargo separates consecutive splits to prevent leakage from triple-barrier label horizons (H4). Same construction as the baseline run for direct comparability.
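The fold layout can be sketched as index ranges. This is an illustration only: the function name, the window sizes, and the single `gap_bars` parameter (purge and embargo merged into one gap) are assumptions, and the conversion of the 1-day embargo into bars depends on the bar frequency, which is not stated here. The real splitter is the one referenced by ADR-14.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fold:
    train: range
    val: range
    test: range


def walk_forward_folds(n_samples: int, n_folds: int, val_size: int,
                       test_size: int, gap_bars: int) -> List[Fold]:
    """Expanding-window walk-forward folds over row indices 0..n_samples-1.

    gap_bars rows are skipped between train/val and val/test so that labels
    whose barrier horizon crosses a boundary cannot leak into the next window.
    """
    folds = []
    for k in range(n_folds):
        test_start = n_samples - (n_folds - k) * test_size
        val_stop = test_start - gap_bars
        val_start = val_stop - val_size
        train_stop = val_start - gap_bars
        if train_stop <= 0:
            raise ValueError("not enough rows for the requested fold layout")
        folds.append(Fold(train=range(0, train_stop),
                          val=range(val_start, val_stop),
                          test=range(test_start, test_start + test_size)))
    return folds


folds = walk_forward_folds(n_samples=1000, n_folds=5, val_size=50,
                           test_size=100, gap_bars=6)
```

Every window is contiguous and strictly ordered in time, and no row index is ever shuffled, which is the property that distinguishes this construction from a random KFold.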

Cost formula v3 (per F1 plan §4 + committee reco 1) : expectancy_net = expectancy_gross − (taker_fee + spread + slippage_10bps + funding_cost), where taker_fee=0.10%, spread=0.05%, slippage=10 bps interim (until #711 dynamic slippage ships), funding=0.01% per 8h. Implementation in src/commun/audit/economic_thresholds.py. The same formula is applied to the baseline rows ; deltas are like-for-like.
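A worked instance of the formula, as a sketch. The constants come from the paragraph above; two points are assumptions not confirmed by this dossier: all terms are expressed in percent of notional per trade, and funding accrues linearly with holding time. The function name is hypothetical; the real implementation is in src/commun/audit/economic_thresholds.py.

```python
# Per-trade costs in percent of notional, taken from the cost formula v3 text.
TAKER_FEE_PCT = 0.10
SPREAD_PCT = 0.05
SLIPPAGE_PCT = 0.10          # 10 bps interim value
FUNDING_PCT_PER_8H = 0.01


def expectancy_net_pct(expectancy_gross_pct: float, holding_hours: float) -> float:
    """Gross expectancy minus the four cost terms, all in percent per trade."""
    # Assumption: funding is pro-rated linearly over the holding period.
    funding_cost = FUNDING_PCT_PER_8H * holding_hours / 8.0
    total_cost = TAKER_FEE_PCT + SPREAD_PCT + SLIPPAGE_PCT + funding_cost
    return expectancy_gross_pct - total_cost


net = expectancy_net_pct(expectancy_gross_pct=0.50, holding_hours=8.0)
```

Under these assumptions a trade held for 8 hours carries 0.26 % of total cost, so a gross expectancy of 0.50 % nets 0.24 %.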

2.2 Experimental matrix

Per ADR-56 (every factor is independent and A/B-testable), the FTF runs each new factor against baseline with one factor active at a time :

| Factor | Variants | Other factor held at |
| --- | --- | --- |
| label_smoothing | none, mild, aggressive | cleanlab=off |
| cleanlab | off, filter, reweight | label_smoothing=none |

Total : 5 unique configs (3 + 3 − 1 shared none/off baseline) × 5 cryptos × 5 folds = 125 trained models. Plus a joint confirmation run (best label_smoothing × best cleanlab) to test for super-additivity, run only if both factors individually pass the per-track gate. Joint adds ≤ 1 additional config × 25 = +25 models. Worst case 150 models.
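The counting above can be checked by enumerating the one-factor-at-a-time matrix. The crypto symbols below are placeholders, since the five symbols are not listed in this section.

```python
from itertools import product

LABEL_SMOOTHING = ("none", "mild", "aggressive")
CLEANLAB = ("off", "filter", "reweight")
CRYPTOS = tuple(f"crypto_{i}" for i in range(1, 6))  # placeholder symbols
FOLDS = range(5)


def one_factor_at_a_time():
    """Configs where at most one factor leaves its baseline variant."""
    configs = {(ls, "off") for ls in LABEL_SMOOTHING}
    configs |= {("none", cl) for cl in CLEANLAB}  # ("none", "off") is shared
    return sorted(configs)


configs = one_factor_at_a_time()
runs = list(product(configs, CRYPTOS, FOLDS))
print(len(configs), len(runs))  # 5 configs, 125 trained models
```

Combinations such as (mild, filter) are deliberately absent; they only appear in the gated joint confirmation run.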

2.3 Statistical evidence required for a "PASS" verdict

The factor is kept (won the gate) only if :

  1. Per-track gate (F1 plan §6) clears on the joint metric vector, evaluated on the winning variant.
  2. Bootstrap CI 95 % on the (factor × variant − baseline) f1_buy delta excludes 0. Computed per (crypto × fold) cell, then aggregated.
  3. Benjamini-Hochberg correction : the variant's p-value (paired Wilcoxon vs baseline) is significant after BH-correction across all variants tested in this Story (5 variants → BH cutoff = α × rank/m).
  4. Cohen's d ≥ 0.3 on the f1_buy delta distribution (small effect size minimum).
  5. At least 4 of 5 cryptos show positive Δ f1_buy individually (no portfolio-only wins).
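A NumPy-only sketch of evidence items 2 to 4, under stated assumptions: the bootstrap resamples the per-(crypto × fold) deltas directly, Cohen's d is taken in its paired form (mean of the deltas over their standard deviation), and the per-variant p-values are supplied by the caller (for example from a paired Wilcoxon test such as `scipy.stats.wilcoxon`). Function names are illustrative, not the FTF report generator's API.

```python
import numpy as np


def bootstrap_ci_mean(deltas, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI on the mean delta."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def cohens_d_paired(deltas) -> float:
    """Paired effect size: mean delta divided by the sample std of the deltas."""
    deltas = np.asarray(deltas, dtype=float)
    return float(deltas.mean() / deltas.std(ddof=1))


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank meeting its cutoff
        reject[order[:k + 1]] = True
    return reject


deltas = np.linspace(0.005, 0.03, 25)      # synthetic per-cell f1_buy deltas
ci_lo, ci_hi = bootstrap_ci_mean(deltas)
rejected = benjamini_hochberg([0.001, 0.02, 0.04, 0.30])
```

In the BH example the cutoffs for m = 4 are 0.0125, 0.025, 0.0375 and 0.05, so only the two smallest p-values survive correction.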

2.4 Reporting artefact

The PDF report from documentation/reviews/2026-04-29-track5-label-smoothing-results.md (post-rerun, separate dossier) MUST contain :

  • The 5 statistical evidence boxes from §2.3
  • A per-crypto × per-fold heatmap of f1_buy deltas
  • Calibration curves (reliability diagrams) for baseline vs winning variant
  • ECE tables (winning variant per crypto)
  • Bootstrap CI plots
  • Lockfile : the ftf_config snapshot (PostgreSQL) at run time, so the result is reproducible
  • MLflow run IDs for top-3 models per crypto
  • Operator decision : lock, keep available, or abandon

This artefact is the deliverable that closes the Story — not the code merge.


3. What we'll do

3.1 Functional decomposition

Five thin slices, ordered by dependency :

  1. Label preprocessing module (src/training/labels/) — pure functions for asymmetric smoothing + cleanlab application, no XGBoost dependency.
  2. FTF factor declarations (src/commun/finetune/ablation_matrix.py) — two new factors (label_smoothing, cleanlab), env-var-driven per the existing pattern (line 342 factor_calibration is the model).
  3. Console params registration (scripts/ftf_config_ui.py) — register the new env keys in PARAM_OPTIONS so operators edit them from Console only.
  4. Trainer integration (src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145) — call the label preprocessing pipeline before xgb.DMatrix creation, gated by env vars.
  5. Hamilton orchestration + OTel — wrap the label preprocessing as a Hamilton dataflow (ADR-61) emitting OTel spans (ADR-62) per step.

Plus the cross-cutting items :

  • Guardrail tests + integration test (ADR-58) for both new factors
  • MLOps readiness template filled (documentation/stories/CVN-N001-EE-S01/mlops_readiness.md) per ADR-70
  • Migration : no SQL migration required (all new keys live in ftf_config.base_env JSONB ; factor catalogue is Python data per ADR-56)

3.2 What we explicitly do NOT do

  • No model class change — XGBoost stays
  • No barrier change : ATR0.5_1.5_H4 stays per project policy
  • No feature change — same OHLCV + enrichment + FE pipeline
  • No CUSUM / filter chain change — out of Track 5 scope
  • No live deployment — paper-FTF only ; live deployment gated by ADR-71 + EG-S06 flatten_all
  • No 9-filter chain re-evaluation in this Story — the FTF metrics are computed inside the existing filter chain, but a separate downstream evaluation (per committee reco 3) will assess whether the winning Track 5 variant changes the funnel pass-rate distribution. That evaluation is added to the post-merge instrumentation phase (§6 Phase 5).

4. Detailed architecture

4.1 System integration overview

flowchart TB
    Console["Console UI<br/>(Streamlit, port 8501)<br/>scripts/ftf_config_ui.py"]
    PG[("PostgreSQL<br/>ftf_config<br/>(base_env JSONB)")]
    AblMatrix["ablation_matrix.py<br/>+ 2 new AblationFactors"]
    Runner["ablation_runner.py<br/>variant × crypto × fold loop"]
    PreflightFTF["FTF preflight<br/>(ADR-64)<br/>guardrail validation"]
    LabelHam["Hamilton dataflow<br/>(ADR-61)<br/>src/training/labels/"]
    Trainer["XGBoost trainer<br/>cvntrade_XGBoost_trainer.py:144"]
    OTel["OTel collector<br/>(ADR-62)"]
    Loki["Loki<br/>structured logs"]
    Persist["finetune_results table<br/>(persistence.py:34)"]
    Report["PDF report + dossier<br/>2026-04-29-track5-results.md"]

    Console -- "operator edits<br/>label_smoothing.epsilon_buy<br/>cleanlab.mode" --> PG
    PG -- "loaded at FTF run start" --> AblMatrix
    AblMatrix -- "factor declarations" --> Runner
    Runner --> PreflightFTF
    PreflightFTF -- "validate (factor, variant) compat" --> Runner
    Runner -- "iterate variants" --> LabelHam
    LabelHam -- "smoothed + cleanlab-filtered y_train" --> Trainer
    Trainer -- "f1_buy, expectancy_net, ..." --> Persist
    LabelHam -- "spans" --> OTel
    Trainer -- "spans" --> OTel
    LabelHam -- "log_event" --> Loki
    Persist --> Report

4.2 Label transformation pipeline (per-fold)

flowchart LR
    raw["y_train (raw)<br/>shape (N,)<br/>dtype int8 ∈ {0,1}"]
    smooth["asymmetric_smooth<br/>ε_hold, ε_buy"]
    soft["y_smoothed<br/>shape (N,)<br/>dtype float32"]
    clean["cleanlab.find_label_issues<br/>+ pred_probs from CV fold"]
    mask["suspect_mask<br/>shape (N,) bool"]
    apply["apply_mode<br/>filter | reweight | off"]
    final["y_final + sample_weights<br/>passed to xgb.DMatrix"]

    raw --> smooth
    smooth --> soft
    soft -.optional.-> clean
    clean --> mask
    soft --> apply
    mask --> apply
    apply --> final

    classDef new fill:#fef3c7,stroke:#d97706
    class smooth,clean,mask,apply,soft,final new

Notes :

  • asymmetric_smooth is unconditional : variant none is the identity, mild = (eps_buy=0.15, eps_hold=0.075), aggressive = (eps_buy=0.30, eps_hold=0.15). Values per committee reco 17 ; see the §4.3 variant table for the canonical declaration.
  • The cleanlab step is gated by the factor_cleanlab variant ; when off it is a no-op pass-through.
  • pred_probs for cleanlab come from the existing PurgedKFold splitter (training.cv.purged_kfold, López de Prado AFML Ch 7), reused so the inner cleanlab CV honours the same purge / embargo contract as the outer FTF folds. PurgedKFold reads CVN_PURGE_BARS and CVN_EMBARGO_BARS from the env, so this Story doesn't introduce a parallel configuration knob (CR PR #734 round 2). Default n_splits=2 per committee reco 18.
  • Soft per-fold wall-time budget : if any fold exceeds CVN_CLEANLAB_FOLD_BUDGET_S (default 60 s), the loop short-circuits and unprocessed slices keep the initial 0.5 prior ; find_label_issues then treats those samples as ambiguous and skips them rather than mis-flagging them.
  • No leaky fallback (committee pr_review session 989a6567 blocker 1) : the prior "single-fold tail train-on-self" path was removed because it produced biased pred_probs that silently corrupted the cleanlab signal. Honest partial coverage with explicit telemetry (coverage_pct event field + P2 alert at < 80 %) beats a leaky full coverage.
  • Sample weights from apply_mode=reweight are multiplied with the existing class-balancing weights (line 165 of trainer) ; order matters and is documented in code.
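The apply_mode step can be sketched as follows. Several details are assumptions, because this dossier does not pin them down: a per-sample `quality` score (lower means more suspect) is used to decide which flagged rows survive when the cap would be exceeded, the cap applies to both filter and reweight, and reweight multiplies the base weight of each retained suspect by the floor value. The helper names are hypothetical.

```python
import numpy as np


def cap_suspects(suspect_mask, quality, max_drop_pct: float) -> np.ndarray:
    """Keep at most max_drop_pct % of rows flagged, worst quality first."""
    n = len(suspect_mask)
    max_n = int(np.floor(n * max_drop_pct / 100.0))
    flagged = np.flatnonzero(suspect_mask)
    if len(flagged) > max_n:
        worst_first = flagged[np.argsort(quality[flagged], kind="stable")]
        flagged = worst_first[:max_n]
    capped = np.zeros(n, dtype=bool)
    capped[flagged] = True
    return capped


def apply_mode(X, y_soft, base_weights, suspect_mask, quality, mode: str,
               max_drop_pct: float = 5.0, reweight_floor: float = 0.5):
    """Return (X, y, weights) after applying the cleanlab mode."""
    if mode not in {"off", "filter", "reweight"}:
        raise ValueError(f"unknown cleanlab mode: {mode!r}")
    if mode == "off":
        return X, y_soft, base_weights          # no-op pass-through
    capped = cap_suspects(suspect_mask, quality, max_drop_pct)
    if mode == "filter":
        keep = ~capped
        return X[keep], y_soft[keep], base_weights[keep]
    # reweight: multiplicative composition with the class-balancing weights
    factor = np.where(capped, reweight_floor, 1.0)
    return X, y_soft, base_weights * factor


n = 100
X = np.zeros((n, 3))
y_soft = np.full(n, 0.075, dtype=np.float32)
base_w = np.full(n, 2.0)
suspect = np.zeros(n, dtype=bool)
suspect[:10] = True                              # 10 % flagged, above the 5 % cap
quality = np.arange(n) / n

Xf, yf, wf = apply_mode(X, y_soft, base_w, suspect, quality, mode="filter")
Xr, yr, wr = apply_mode(X, y_soft, base_w, suspect, quality, mode="reweight")
```

In this example ten rows are flagged but only the five lowest-quality ones are dropped or down-weighted, so the filter output has 95 rows and the reweight output keeps all 100 with five weights halved.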

4.3 FTF factor matrix (declarations)

classDiagram
    class AblationFactor {
        +str name
        +str factor_type
        +str category
        +Dict env_vars
        +str description
    }
    class factor_label_smoothing {
        +name = "label_smoothing"
        +category = "model"
        +variants none, mild, aggressive
        +env_keys CVN_LABEL_SMOOTH_EPS_BUY,
        CVN_LABEL_SMOOTH_EPS_HOLD
    }
    class factor_cleanlab {
        +name = "cleanlab"
        +category = "model"
        +variants off, filter, reweight
        +env_keys CVN_CLEANLAB_MODE,
        CVN_CLEANLAB_MAX_DROP_PCT,
        CVN_CLEANLAB_REWEIGHT_FLOOR
    }

    AblationFactor <|-- factor_label_smoothing
    AblationFactor <|-- factor_cleanlab

Variant table :

| Factor | Variant | env_vars |
| --- | --- | --- |
| label_smoothing | none | CVN_LABEL_SMOOTH_EPS_BUY=0.0, CVN_LABEL_SMOOTH_EPS_HOLD=0.0 |
| label_smoothing | mild | CVN_LABEL_SMOOTH_EPS_BUY=0.15, CVN_LABEL_SMOOTH_EPS_HOLD=0.075 (per committee reco 17, Müller 2019 calibrated for 20 % BUY rate) |
| label_smoothing | aggressive | CVN_LABEL_SMOOTH_EPS_BUY=0.30, CVN_LABEL_SMOOTH_EPS_HOLD=0.15 (per committee reco 17) |
| cleanlab | off | CVN_CLEANLAB_MODE=off |
| cleanlab | filter | CVN_CLEANLAB_MODE=filter, CVN_CLEANLAB_MAX_DROP_PCT=5 |
| cleanlab | reweight | CVN_CLEANLAB_MODE=reweight, CVN_CLEANLAB_MAX_DROP_PCT=5, CVN_CLEANLAB_REWEIGHT_FLOOR=0.5 |
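As an illustration of how one row of the variant table could be declared, the sketch below rebuilds a stand-in `AblationFactor` from the fields shown in the class diagram. The layout of `env_vars` (variant name mapped to env overrides), the `factor_type` value and the `env_for_variant` helper are assumptions; the real class lives in src/commun/finetune/ablation_matrix.py and may differ.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class AblationFactor:
    """Stand-in with the fields from the class diagram (not the real class)."""
    name: str
    factor_type: str
    category: str
    env_vars: Dict[str, Dict[str, str]]  # assumed layout: variant -> env overrides
    description: str = ""


factor_label_smoothing = AblationFactor(
    name="label_smoothing",
    factor_type="categorical",  # placeholder value, not confirmed by the dossier
    category="model",
    env_vars={
        "none": {"CVN_LABEL_SMOOTH_EPS_BUY": "0.0", "CVN_LABEL_SMOOTH_EPS_HOLD": "0.0"},
        "mild": {"CVN_LABEL_SMOOTH_EPS_BUY": "0.15", "CVN_LABEL_SMOOTH_EPS_HOLD": "0.075"},
        "aggressive": {"CVN_LABEL_SMOOTH_EPS_BUY": "0.30", "CVN_LABEL_SMOOTH_EPS_HOLD": "0.15"},
    },
    description="Asymmetric label smoothing on training labels",
)


def env_for_variant(factor: AblationFactor, variant: str,
                    base_env: Mapping[str, str]) -> Dict[str, str]:
    """Overlay a variant's env overrides on top of the base environment."""
    if variant not in factor.env_vars:
        raise KeyError(f"{factor.name} has no variant {variant!r}")
    return {**base_env, **factor.env_vars[variant]}


mild_env = env_for_variant(factor_label_smoothing, "mild",
                           {"CVN_CRYPTO_SYMBOL": "BTCUSDC"})
```

Keeping each variant as a plain mapping of env overrides is what lets the factor stay pure Python data, with no SQL migration needed.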

Guardrails (per ADR-58, in src/commun/finetune/validation.py) :

  • label_smoothing.epsilon_buy ∈ [0.0, 0.5] — silly-bounds check
  • label_smoothing.epsilon_hold ∈ [0.0, epsilon_buy] : no inverted smoothing (HOLD is never smoothed more than BUY) ; equality is allowed so that the none variant (0.0, 0.0) passes
  • cleanlab.max_drop_pct ≤ 5.0 — F1 plan §6 risk #4 (cap label drop rate at 5 %)
  • cleanlab.mode in {off, filter, reweight} — closed enum
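The four guardrails can be sketched as a single validation function over the env mapping. The function name and the error-list return convention are assumptions; the real checks live in src/commun/finetune/validation.py. The lower bound of 0 on the drop cap is taken from the Console range rather than from the guardrail list.

```python
from typing import List, Mapping

ALLOWED_CLEANLAB_MODES = {"off", "filter", "reweight"}


def validate_track5_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable guardrail violations (empty list means valid)."""
    errors: List[str] = []
    eps_buy = float(env.get("CVN_LABEL_SMOOTH_EPS_BUY", "0.0"))
    eps_hold = float(env.get("CVN_LABEL_SMOOTH_EPS_HOLD", "0.0"))
    max_drop = float(env.get("CVN_CLEANLAB_MAX_DROP_PCT", "5.0"))
    mode = env.get("CVN_CLEANLAB_MODE", "off")

    if not 0.0 <= eps_buy <= 0.5:
        errors.append(f"epsilon_buy={eps_buy} outside [0.0, 0.5]")
    if not 0.0 <= eps_hold <= eps_buy:
        errors.append(f"epsilon_hold={eps_hold} outside [0.0, epsilon_buy={eps_buy}]")
    if not 0.0 <= max_drop <= 5.0:
        errors.append(f"max_drop_pct={max_drop} outside [0.0, 5.0]")
    if mode not in ALLOWED_CLEANLAB_MODES:
        errors.append(f"cleanlab mode {mode!r} not in {sorted(ALLOWED_CLEANLAB_MODES)}")
    return errors


mild_ok = validate_track5_env({"CVN_LABEL_SMOOTH_EPS_BUY": "0.15",
                               "CVN_LABEL_SMOOTH_EPS_HOLD": "0.075"})
inverted = validate_track5_env({"CVN_LABEL_SMOOTH_EPS_BUY": "0.10",
                                "CVN_LABEL_SMOOTH_EPS_HOLD": "0.20"})
```

Returning every violation at once, rather than raising on the first, lets the preflight report all misconfigured keys in a single pass.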

Integration test (ADR-58) :

  • tests/integration/test_track5_label_smoothing.py exercises apply_label_pipeline end-to-end (in-process, no PostgreSQL, no Airflow runner) for each of the 5 unique configs from §2.2 plus the joint config. It asserts (a) no crash, (b) (X_final, y_final, sample_weights_final) shapes are mutually consistent, (c) y_final dtype is float32 or float64, (d) cleanlab filter mode drops ≤ max_drop_pct rows (strict cap, no tolerance), (e) the output is consumable by a real xgb.DMatrix → xgb.train pair (training succeeds and the produced booster predicts in [0, 1]).
  • The PostgreSQL finetune_results write path is not exercised here ; it is covered by the existing FTF persistence tests (tests/unit/finetune/test_persist_predictions.py) and verified manually in Phase 3 of §6 when the operator runs the FTF sweep on the cluster.

4.4 Hamilton dataflow

flowchart TD
    raw_y["raw_y_train: np.ndarray"]
    eps_buy["epsilon_buy: float"]
    eps_hold["epsilon_hold: float"]
    cl_mode["cleanlab_mode: str"]
    cl_max_drop["cleanlab_max_drop_pct: float"]
    cl_floor["cleanlab_reweight_floor: float"]
    X_train["X_train: np.ndarray"]
    base_w["base_sample_weights: np.ndarray"]

    smoothed_y["smoothed_y(raw_y, eps_buy, eps_hold)"]
    cv_probs["cv_pred_probs(X_train, smoothed_y)<br/>2-fold purged+embargoed CV<br/>per ADR-14"]
    suspect["suspect_mask(smoothed_y, cv_probs)"]
    final_y["final_y(smoothed_y, suspect, cl_mode, cl_max_drop)"]
    final_w["final_weights(base_w, suspect, cl_mode, cl_floor)"]

    raw_y --> smoothed_y
    eps_buy --> smoothed_y
    eps_hold --> smoothed_y
    smoothed_y --> cv_probs
    X_train --> cv_probs
    smoothed_y --> suspect
    cv_probs --> suspect
    smoothed_y --> final_y
    suspect --> final_y
    cl_mode --> final_y
    cl_max_drop --> final_y
    base_w --> final_w
    suspect --> final_w
    cl_mode --> final_w
    cl_floor --> final_w

    classDef out fill:#dcfce7,stroke:#16a34a
    class final_y,final_w out

File : src/training/labels/label_pipeline.py (Hamilton dataflow module).

Trainer-side invocation (actual implementation as of committee pr_review 989a6567 blocker 3 fix : apply_label_pipeline now uses hamilton.driver.Driver(...).execute(...) with a DictResult adapter ; the previous imperative path was rejected for bypassing Hamilton's lineage tracking) :

import os

from training.labels import apply_label_pipeline

X_train, y_train, sample_weights = apply_label_pipeline(
    X_train,
    y_train,
    base_sample_weights=sample_weights,
    crypto=os.environ.get("CVN_CRYPTO_SYMBOL", "unknown"),
    fold_id=int(os.environ.get("CVN_FOLD_ID", "-1")),
)
# env vars read inside: CVN_LABEL_SMOOTH_EPS_BUY, CVN_LABEL_SMOOTH_EPS_HOLD,
# CVN_CLEANLAB_MODE (validated fail-fast against {off, filter, reweight} per
# ADR-25 — a typo aborts the variant cleanly), CVN_CLEANLAB_MAX_DROP_PCT,
# CVN_CLEANLAB_REWEIGHT_FLOOR, CVN_CLEANLAB_CV_FOLDS (default 2),
# CVN_CLEANLAB_FOLD_BUDGET_S (default 60), CVN_PURGE_BARS + CVN_EMBARGO_BARS
# (consumed by the existing PurgedKFold splitter — single source of truth
# for purge / embargo, no new knob).

Lineage emission (per ADR-61) : Hamilton's dr.visualize_execution(...) writes an SVG to MLflow as artefact label_pipeline_lineage_<crypto>_<fold>.svg.

4.5 OpenTelemetry instrumentation (ADR-62)

sequenceDiagram
    participant Runner as ablation_runner
    participant Trainer as XGBoost trainer
    participant Pipeline as Hamilton label pipeline
    participant Tracer as OTel tracer
    participant Coll as OTel collector

    Runner->>Tracer: span "ftf.variant" (factor=label_smoothing, variant=mild, crypto=BTCUSDC, fold_id=0)
    Trainer->>Tracer: span "training.label_pipeline"
    Trainer->>Pipeline: dr.execute(...)
    Pipeline->>Tracer: span "label.smooth" (eps_buy=0.15, eps_hold=0.075)
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.cleanlab.cv_probs"
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.cleanlab.find_issues" (suspect_count=N, drop_pct=2.3)
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.apply_mode" (mode=filter)
    Tracer-->>Coll: emit
    Trainer->>Tracer: span "xgb.train" (n_samples=N')
    Tracer-->>Coll: emit
    Trainer->>Tracer: span "metrics.compute" (f1_buy, ECE, ...)
    Tracer-->>Coll: emit

Spans emitted (one per step, ADR-62 golden-field attributes : factor, variant, crypto, fold_id) :

| Span | Attributes (in addition to golden fields) | Purpose |
| --- | --- | --- |
| label.smooth | epsilon_buy, epsilon_hold | timing + parameter trace |
| label.cleanlab.cv_probs | cv_folds=2, model=xgb, purge_embargo=enabled, mem_mb, cpu_pct | wall time + resource footprint of CV pre-step (committee reco 13) |
| label.cleanlab.find_issues | suspect_count, drop_pct, buy_drop_pct (separate, reco 4), hold_drop_pct | granular drop visibility on minority class |
| label.apply_mode | mode, effective_drop_pct, effective_buy_drop_pct | confirm guardrail (drop ≤ 5 %) and minority preservation |
| training.label_pipeline | total_duration_ms | parent span over all label work |

Hamilton ↔ OTel naming consistency (committee reco 8) : Hamilton node names (smoothed_y, cv_pred_probs, suspect_mask, final_y, final_weights) appear as the second-level segment of the corresponding span name (label.smooth, label.cleanlab.cv_probs, label.cleanlab.find_issues, label.apply_mode). The cross-reference is asserted in a unit test that diffs the Hamilton node graph against the span catalogue.
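A sketch of what such a cross-reference test could check. The node-to-span mapping below is an assumption inferred from the names above: five Hamilton nodes map onto four spans, so final_y and final_weights are taken to share label.apply_mode. The real test would read the node list from the Hamilton driver instead of a hard-coded list.

```python
HAMILTON_NODES = ["smoothed_y", "cv_pred_probs", "suspect_mask", "final_y", "final_weights"]

# Assumed mapping; not confirmed by the dossier.
NODE_TO_SPAN = {
    "smoothed_y": "label.smooth",
    "cv_pred_probs": "label.cleanlab.cv_probs",
    "suspect_mask": "label.cleanlab.find_issues",
    "final_y": "label.apply_mode",
    "final_weights": "label.apply_mode",
}

SPAN_CATALOGUE = {
    "label.smooth",
    "label.cleanlab.cv_probs",
    "label.cleanlab.find_issues",
    "label.apply_mode",
}


def catalogue_diff(node_names, node_to_span, span_catalogue):
    """Return (nodes without a span, spans without a node, mapped spans not in catalogue)."""
    unmapped_nodes = sorted(set(node_names) - set(node_to_span))
    orphan_spans = sorted(set(span_catalogue) - set(node_to_span.values()))
    unknown_spans = sorted(set(node_to_span.values()) - set(span_catalogue))
    return unmapped_nodes, orphan_spans, unknown_spans


diff = catalogue_diff(HAMILTON_NODES, NODE_TO_SPAN, SPAN_CATALOGUE)
```

An all-empty diff means the graph and the span catalogue agree; adding a Hamilton node without registering its span shows up in the first list.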

Metrics (Prometheus, per ADR-62) :

  • cvntrade_label_smooth_eps_applied{factor, variant} (gauge)
  • cvntrade_cleanlab_drop_pct{crypto, fold} (gauge)
  • cvntrade_cleanlab_buy_drop_pct{crypto, fold} (gauge ; committee reco 4, minority class granularity)
  • cvntrade_cleanlab_hold_drop_pct{crypto, fold} (gauge)
  • cvntrade_label_pipeline_duration_seconds{stage} (histogram)
  • cvntrade_label_final_weights_min{crypto, fold} (gauge ; committee reco 5)
  • cvntrade_label_final_weights_max{crypto, fold} (gauge)
  • cvntrade_label_final_weights_mean{crypto, fold} (gauge)
  • cvntrade_label_final_weights_std{crypto, fold} (gauge)
  • cvntrade_label_pipeline_mem_peak_mb{crypto, fold} (gauge ; committee reco 13, K8s pressure visibility)

Grafana alerts (committee reco 19) :

  • cvntrade_cleanlab_drop_pct > 4.5 for any (crypto × fold) → P2 alert "preempt 5 % cap breach", runbook documentation/runbooks/cleanlab_over_drop.md
  • cvntrade_cleanlab_buy_drop_pct > 3.0 → P2 alert "minority decimation" (separate threshold, BUY samples are precious)

4.6 Console UI integration (ADR-59 + ADR-65)

flowchart LR
    Op[Operator]
    Console["Console UI<br/>scripts/ftf_config_ui.py"]
    Param["PARAM_OPTIONS dict<br/>(line 60-101)"]
    PG[("ftf_config.base_env<br/>JSONB")]
    Hist[("ftf_config_history<br/>audit trail")]

    Op -- "selects variant<br/>'aggressive' from dropdown" --> Console
    Console -- "reads schema" --> Param
    Console -- "UPDATE base_env<br/>SET label_smoothing.epsilon_buy=0.30" --> PG
    PG -. "trigger" .-> Hist
    Console -- "shows diff: was=0.0, now=0.30" --> Op

Registration (in scripts/ftf_config_ui.py:60-101) :

PARAM_OPTIONS = {
    # ... existing ...
    "CVN_LABEL_SMOOTH_EPS_BUY":   ("number", 0.0, 0.5, 0.05),    # (type, min, max, step)
    "CVN_LABEL_SMOOTH_EPS_HOLD":  ("number", 0.0, 0.5, 0.025),   # step 0.025 so the mild value 0.075 is reachable
    "CVN_CLEANLAB_MODE":          ("enum", ["off", "filter", "reweight"]),
    "CVN_CLEANLAB_MAX_DROP_PCT":  ("number", 0.0, 5.0, 0.5),
    "CVN_CLEANLAB_REWEIGHT_FLOOR":("number", 0.0, 1.0, 0.05),
}

The Streamlit form auto-renders the right widget per type. No code change in the UI beyond the dict ; that's the ADR-59 promise.

4.7 Observability fan-out

flowchart LR
    Trainer[Trainer + Hamilton]
    OTel[OTel collector]
    Tempo["Tempo<br/>traces"]
    Prom["Prometheus<br/>metrics"]
    Loki["Loki<br/>structured logs"]
    Graf["Grafana<br/>dashboard 'F1 Boost Track 5'"]

    Trainer -- "spans" --> OTel
    Trainer -- "metrics" --> OTel
    Trainer -- "log_event" --> Loki
    OTel --> Tempo
    OTel --> Prom
    Tempo --> Graf
    Prom --> Graf
    Loki --> Graf
    Graf -.- Op[Operator dashboard]

New Grafana panels (added in infra/grafana/dashboards/cvntrade-f1-boost.json post-implementation) :

  • Panel 1 : cvntrade_label_smooth_eps_applied heatmap by (factor, variant)
  • Panel 2 : cvntrade_cleanlab_drop_pct per crypto over the FTF run
  • Panel 3 : cvntrade_label_pipeline_duration_seconds p95 — catches cleanlab CV pre-step regressions
  • Panel 4 : Joint metric scoreboard (f1_buy delta, expectancy_net delta, sortino delta, max_dd delta) per variant from Postgres finetune_results

5. Falsifiability + rollback

5.1 Falsifiable hypotheses (Story-level)

| H | What would falsify | Action if falsified |
| --- | --- | --- |
| H1 (smoothing helps) | Both active label_smoothing variants (mild, aggressive) miss per-track gate AND ΔECE < 0.02 | Keep factor available with none default ; advance to Track 6 |
| H2 (cleanlab helps) | Both active cleanlab variants (filter, reweight) miss per-track gate AND no improvement on the calibration curves | Keep factor available with off default ; advance to Track 6 |
| H3 (joint super-additive) | Joint config (best LS × best CL) ≤ max(LS-only, CL-only) on f1_buy | Drop joint, ship best single-factor variant only |
| H4 (cleanlab over-drops) | Drop rate hits the 5 % cap on > 1 crypto | Lower max_drop_pct to 2 % or revert cleanlab to off ; document in Story closure |
| H5 (silent training degradation) | OOS distribution shifts (training accuracy ↑ but validation accuracy ↓ by > 0.05) | This is overfitting from over-smoothing ; cap epsilon_buy at 0.10 ; document |

A REJECTED Story (H1 + H2 both) is still a successful Story — it falsifies cleanly and informs the next track. Per F1 plan §6, a NO-GO at sprint S1 close triggers escalation to big-bet bundle (S06-S08).

5.2 Rollback (config-only per ADR-71 + ADR-56)

  • Per-variant rollback : set factor_label_smoothing=none and factor_cleanlab=off in ftf_config.base_env via Console — < 1 minute.
  • Per-factor rollback : remove the factor from ablation_matrix.py (revert one commit). Schema unchanged because all params live in JSONB.
  • No live deployment to roll back — Track 5 ships to FTF only, no production trading impact.

5.3 What's NOT a failure mode

  • A variant taking longer than the baseline (cleanlab adds ~30 s/fold for the CV pre-step) is expected, not a failure. Logged as cvntrade_label_pipeline_duration_seconds.
  • A variant changing the per-fold prediction distribution is expected, not a failure. The gate evaluates outcomes, not internals.
  • A variant beating the gate on 4 of 5 cryptos but failing on 1 (e.g. SHIBUSDC) is acceptable per gate criterion — the outlier crypto stays on the none variant.

6. Action plan (sequenced)

Phase 1 — Plan review (this dossier)

| # | Action | Owner | Output |
| --- | --- | --- | --- |
| 1.1 | Submit this dossier to Expert Committee plan_review | Claude | committee verdict (PASSED / OK / EXECUTION_RISK / REJECTED) |
| 1.2 | Apply blockers (if REJECTED) or actionable recommendations | Claude + operator | updated dossier |
| 1.3 | Operator sign-off : approve to proceed to Phase 2 | Operator | OP wp#40 comment |

Phase 2 — Implementation (post-committee OK)

| # | Action | File | Test |
| --- | --- | --- | --- |
| 2.1 | Create Story dossier dir | documentation/stories/CVN-N001-EE-S01/ | n/a |
| 2.2 | Add label_smoothing factor declaration | src/commun/finetune/ablation_matrix.py (after line 351) | unit |
| 2.3 | Add cleanlab factor declaration | src/commun/finetune/ablation_matrix.py | unit |
| 2.4 | Add 4 guardrails | src/commun/finetune/validation.py | tests/unit/test_ftf_guardrails.py |
| 2.5 | Register 5 keys in Console | scripts/ftf_config_ui.py (PARAM_OPTIONS dict) | manual UI smoke |
| 2.6 | Implement label preprocessing module | src/training/labels/label_pipeline.py (Hamilton) | unit + property-based : value-range checks on y_final ∈ [0, 1], sample_weights >= 0, len(y_final) == len(X_train), no NaN/Inf, monotonic shrinkage of confidence (per committee reco 12) |
| 2.7 | Add OTel spans + metrics | src/training/labels/label_pipeline.py | unit (mock collector) |
| 2.8 | Wire into XGBoost trainer | src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145 | unit + integration |
| 2.9 | Add integration test (ADR-58) | tests/integration/test_track5_label_smoothing.py | passes locally + CI |
| 2.10 | Fill MLOps readiness template | documentation/stories/CVN-N001-EE-S01/mlops_readiness.md | reviewer checks 6 sections |
| 2.11 | Add Grafana panel JSON | infra/grafana/dashboards/cvntrade-f1-boost.json | manual smoke |
| 2.12 | Update CLAUDE.md if patterns changed | CLAUDE.md | n/a |
| 2.13 | Pin cleanlab version in dependencies (committee reco 14) | pyproject.toml + poetry.lock | CI pinned-version check |
| 2.14 | Review K8s training pod resources for cleanlab CV pre-step (committee reco 11) : bump memory request if cleanlab CV peaks > 2× baseline ; document in infra/helm/cvntrade-airflow/values.yaml if changed | infra/helm/cvntrade-airflow/values.yaml (if needed) | k8s smoke pod boot |

Phase 3 — Validation (FTF run)

| # | Action | Owner | Output |
| --- | --- | --- | --- |
| 3.1 | Run FTF preflight | Operator (Airflow trigger dag_ftf__preflight) | preflight green / red |
| 3.2 | Run FTF sweep on 5 unique configs | Operator (Airflow dag_finetune__pte with factor=label_smoothing,cleanlab) | 125 rows in finetune_results |
| 3.3 | Generate stats : bootstrap CI + BH-correction + Cohen's d | Operator (existing FTF report generator) | PDF report |
| 3.3b | PDF report integrity check (committee reco 15) : non-empty file, valid PDF magic bytes, contains the 5 statistical evidence boxes from §2.3 | Operator script (scripts/check_pdf_integrity.py) | green = continue ; red = re-run report generator |
| 3.4 | Decide : per-track gate cleared? | Operator + Claude | go / no-go per variant |
| 3.5 | If go : submit joint config (best LS × best CL) | Operator | +25 rows |
| 3.6 | Write results dossier | Claude | documentation/reviews/2026-04-29-track5-label-smoothing-results.md |
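Step 3.3b names scripts/check_pdf_integrity.py without showing it. The sketch below covers only the first two checks (non-empty file, `%PDF-` magic bytes) ; the third check, presence of the five evidence boxes, needs PDF text extraction and is left out. The function name and return convention are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def check_pdf_integrity(path) -> Tuple[bool, str]:
    """Return (ok, reason) for the two cheap structural checks."""
    data = Path(path).read_bytes()
    if not data:
        return False, "empty file"
    if not data.startswith(b"%PDF-"):
        return False, "missing %PDF- magic bytes"
    return True, "ok"


# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "report.pdf"
    good.write_bytes(b"%PDF-1.7\n%%EOF\n")
    empty = Path(tmp) / "empty.pdf"
    empty.write_bytes(b"")
    not_pdf = Path(tmp) / "notes.pdf"
    not_pdf.write_bytes(b"plain text")
    good_result = check_pdf_integrity(good)
    empty_result = check_pdf_integrity(empty)
    not_pdf_result = check_pdf_integrity(not_pdf)
```

A red result on either check maps to the "re-run report generator" branch of step 3.3b.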

Phase 4 — Merge + close Story

| # | Action | Owner | Output |
| --- | --- | --- | --- |
| 4.1 | Open PR (squash, base main) | Claude | PR # |
| 4.2 | CodeRabbit review cycle (4-5 passes) | Claude (apply fixes) | clean CR |
| 4.3 | Submit committee pr_review | Claude | session id |
| 4.4 | Apply pr_review fixes | Claude | iteration |
| 4.5 | Merge | Operator | commit on main |
| 4.6 | CI Docs site green | Claude (verify) | |
| 4.7 | Close OP Story wp#40 (ADR-69 step 14) | Claude | wp#40 → Closed |
| 4.8 | If S02 (#713 focal loss) is the last in sprint S1, close sprint per OPERATIONS §16.4 | Claude | sprint closed |

Phase 5 — Post-merge instrumentation

| # | Action | Owner |
| --- | --- | --- |
| 5.1 | Operator runs FTF sweep with new factor enabled in baseline | Operator |
| 5.2 | Grafana dashboard "F1 Boost Track 5" populated | Claude (panels exist post-merge) |
| 5.3 | Lock decision per FTF tuning protocol (BH significance + Cohen's d ≥ 0.3) | Operator |
| 5.4 | If lock : update ftf_baseline.json with the winning variant as new baseline | Operator |

7. Risks + mitigations

| # | Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- | --- |
| R1 | Cleanlab CV pre-step adds > 30 s/fold ; FTF run wall-time blows budget | M | M | Span timing measured ; the default is already 2 CV folds (committee reco 18), and the CVN_CLEANLAB_FOLD_BUDGET_S budget (default 60 s) short-circuits any fold that overruns (§4.2) |
| R2 | Asymmetric smoothing degrades calibration on majority class (HOLD) | M | L | ECE measured per class ; if HOLD ECE worsens > 0.02, ship mild only |
| R3 | Cleanlab drops samples that are real BUYs (the rare ones we need most) | M | H | F1 plan §6 risk #4 cap : drop ≤ 5 % ; cleanlab reweight mode is safer than filter |
| R4 | Operator forgets to run FTF preflight → guardrail trips mid-run | L | M | Preflight is part of the standard dag_finetune__pte flow per ADR-64, hard to skip |
| R5 | Joint LS×CL config worse than either alone (anti-additive) | M | L | H3 hypothesis ; ship single-factor winner |
| R6 | The 5 % drop cap is hit on > 1 crypto → cleanlab effectively useless | L | L | H4 hypothesis ; lower drop cap or revert |
| R7 | OTel collector unreachable → spans drop ; FTF run continues | L | L | OTel emission is non-blocking per src/commun/observability/otel.py ; logged warning only |
| R8 | Hamilton dataflow lineage SVG too large for MLflow | L | L | Cap nodes shown ; fall back to JSON adjacency list |
| R9 | Cleanlab model instability : poor pred_probs from internal CV → misidentified label issues, false positives on suspect samples (committee reco 9) | M | M | Monitor suspect_count and effective_drop_pct per fold ; alert on cleanlab_drop_pct > 4.5 % (preempt 5 % cap) and on cleanlab_buy_drop_pct > 3.0 % (minority decimation) ; if instability seen on > 1 crypto, fall back to 1-fold mode or switch to cleanlab=off |
| R10 | XGBoost training time inflates : altered label distribution + sample weights makes optimization slower (committee reco 10) | L | L | Monitor xgb.train OTel span duration p95 ; if regression > 50 % vs baseline, investigate before increasing variant count |
| R11 | K8s resource pressure during cleanlab CV pre-step (memory + CPU) (committee reco 11 + 13) | M | L | Per-pod memory + CPU monitoring during the label.cleanlab.cv_probs span ; review training pod requests / limits before sweep ; fall back to 1-fold cleanlab if OOMKilled |
| R12 | Cleanlab dependency drift between dev / CI / prod (committee reco 14) | L | M | Pin cleanlab exact version in pyproject.toml + poetry.lock ; CI verifies pinned version matches lock |

8. MLOps readiness preview

The full MLOps readiness file (per ADR-70) will be at documentation/stories/CVN-N001-EE-S01/mlops_readiness.md and committed in Phase 2 step 2.10. Preview of the 6 sections :

| § | Section | Track 5 specifics |
|---|---|---|
| 1 | Production monitoring | cvntrade_label_smooth_eps_applied, cvntrade_cleanlab_drop_pct, cvntrade_cleanlab_buy_drop_pct, cvntrade_cleanlab_hold_drop_pct, cvntrade_label_pipeline_duration_seconds, cvntrade_label_final_weights_{min,max,mean,std}, cvntrade_label_pipeline_mem_peak_mb |
| 2 | Alerting & runbooks | P2 alert on cleanlab_drop_pct > 4.5% (preempt 5 % cap, reco 19) ; P2 alert on cleanlab_buy_drop_pct > 3.0% (minority decimation) ; P3 alert on label_pipeline_mem_peak_mb > 2× baseline (K8s pressure, reco 11+13) ; runbooks documentation/runbooks/cleanlab_over_drop.md, cleanlab_minority_decimation.md, cleanlab_oom_pressure.md |
| 3 | Drift detection | PSI on top-K features unchanged (Track 5 doesn't touch features) ; concept drift on f1_buy per crypto ; post-Track 5 funnel re-evaluation (committee reco 3) : measure 9-filter chain pass-rate distribution shift |
| 4 | Staged rollout | FTF-only this Story (no live) ; if ever promoted to live trading, shadow ≥ 7d on BTCUSDC (deepest order book) |
| 5 | Rollback plan | Console flip factor_label_smoothing=none, factor_cleanlab=off (< 1 min) ; no code deploy needed ; cleanlab pinned in pyproject.toml (reco 14) so rollback doesn't trigger transitive dep upgrade |
| 6 | DRI | Operator (dococeven) ; backup TBD ; sunset 2026-07-28 (90d) |
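
The drop-percentage metrics and the two P2 thresholds above can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, labels are taken to be 1 for BUY and 0 for HOLD, and each per-class percentage is assumed to use that class's own sample count as denominator (the dossier does not state the denominator).

```python
from typing import Dict, List, Sequence

# Thresholds taken from the alerting row above (values in percent).
ALERT_THRESHOLDS_PCT = {
    "cleanlab_drop_pct": 4.5,      # preempts the 5 % cap
    "cleanlab_buy_drop_pct": 3.0,  # minority decimation
}


def drop_percentages(labels: Sequence[int], dropped: Sequence[bool]) -> Dict[str, float]:
    """Percentage of samples dropped overall and per class (1 = BUY, 0 = HOLD)."""
    if len(labels) != len(dropped):
        raise ValueError("labels and dropped must have the same length")

    def pct(cls=None) -> float:
        # Assumption: the denominator is the size of the selected class.
        flags = [d for y, d in zip(labels, dropped) if cls is None or y == cls]
        return 100.0 * sum(flags) / len(flags) if flags else 0.0

    return {
        "cleanlab_drop_pct": pct(),
        "cleanlab_buy_drop_pct": pct(1),
        "cleanlab_hold_drop_pct": pct(0),
    }


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Names of the metrics that exceed their alert threshold."""
    return [name for name, limit in ALERT_THRESHOLDS_PCT.items() if metrics[name] > limit]


# 20 BUY + 80 HOLD samples; 1 BUY and 2 HOLD flagged for dropping.
labels = [1] * 20 + [0] * 80
dropped = [True] + [False] * 19 + [True, True] + [False] * 78
metrics = drop_percentages(labels, dropped)
print(metrics, firing_alerts(metrics))
```

In this toy case the overall drop rate (3 of 100) stays under the 4.5 % alert while the BUY drop rate (1 of 20) exceeds 3.0 %, which is the situation the separate minority-decimation alert is meant to catch.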

9. Open questions for committee

  1. Smoothing values (§4.3 variant table) : mild=(eps_buy=0.15, eps_hold=0.075) and aggressive=(eps_buy=0.30, eps_hold=0.15) — appropriate for a 20 % BUY rate? (Resolved by committee reco 17 ; values shown here are post-amendment.)
  2. Asymmetry direction : we apply ε_buy > ε_hold (more smoothing on minority) — this is the standard "label smoothing for imbalanced classification" recipe. Is there literature pushing the opposite direction we should consider?
  3. Cleanlab reweight floor at 0.5 : reasonable, or push to 0.25 (halve weight twice)?
  4. CV folds for cleanlab pre-step : 3 folds increase wall time but improve pred_probs quality. Are 5 folds worth the extra cost? (Resolved by committee reco 18 : 3 → 2 folds, fallback to 1 if > 60 s/fold.)
  5. Joint config (best LS × best CL) : run unconditionally or only if both single-factor variants pass the per-track gate? Defaulting to gated to save compute.
  6. Sample weights ordering : cleanlab reweight applies AFTER existing class-balancing weights. Multiplicative composition. Is there a known interaction with XGBoost gradient that makes additive composition safer?
  7. Smoothing on validation/test : we apply smoothing on training labels only ; val/test labels stay hard for honest evaluation. Confirm this is the standard practice (it is for image classification but may differ for HFT-style binary tasks).
  8. Hamilton vs imperative for this 4-step preprocessing : Hamilton is mandated by ADR-61 for batch flows. The pipeline is small (4 nodes) — any concern about over-engineering for 4 nodes vs gain in lineage emission?
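
To make questions 1, 2, 3, 6 and 7 concrete, the sketch below applies the asymmetric smoothing from the executive summary (y_buy = 1 − ε_buy, y_hold = ε_hold) and the multiplicative weight composition with a floor. It is a minimal illustration under assumptions: `smooth_labels`, `compose_weights` and `VARIANTS` are hypothetical names, the cleanlab factor is assumed to be a per-sample multiplier in (0, 1], and "floor at 0.5" is read as clipping that multiplier from below. The range check on ε is an added guard, not a documented requirement.

```python
from typing import Dict, List, Sequence, Tuple

# (eps_buy, eps_hold) tuples, post-amendment values from question 1.
VARIANTS: Dict[str, Tuple[float, float]] = {
    "mild": (0.15, 0.075),
    "aggressive": (0.30, 0.15),
}


def smooth_labels(y: Sequence[int], eps_buy: float, eps_hold: float) -> List[float]:
    """Soft training targets: BUY (1) -> 1 - eps_buy, HOLD (0) -> eps_hold.

    Apply to training labels only; validation/test labels stay hard (question 7).
    """
    if not (0.0 <= eps_hold < 0.5 and 0.0 <= eps_buy < 0.5):
        # Added guard (assumption): keep each target on its own side of 0.5.
        raise ValueError("eps values must lie in [0, 0.5)")
    return [1.0 - eps_buy if label == 1 else eps_hold for label in y]


def compose_weights(
    class_weights: Sequence[float],
    cleanlab_factors: Sequence[float],
    floor: float = 0.5,
) -> List[float]:
    """Multiplicative composition: class-balancing weight x floored cleanlab factor."""
    if len(class_weights) != len(cleanlab_factors):
        raise ValueError("weight vectors must have the same length")
    return [w * max(f, floor) for w, f in zip(class_weights, cleanlab_factors)]


y_train = [1, 0, 1, 0]
targets = smooth_labels(y_train, *VARIANTS["mild"])
weights = compose_weights([4.0, 1.0, 4.0, 1.0], [1.0, 0.2, 0.7, 1.0])
print(targets, weights)
```

With the floor at 0.5, a suspect sample never loses more than half of its class-balancing weight; lowering the floor to 0.25 (question 3) would let the second sample above fall to a quarter of its weight.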

10. Acceptance criteria (Story level)

  • Plan dossier merged with committee plan_review verdict ≥ ACCEPTED (≥ 8.0 avg, no unresolved blockers)
  • All committee actionable recommendations applied to dossier or PR
  • Phase 2 implementation complete (10 of 12 sub-steps green ; CLAUDE.md update only if patterns change)
  • FTF preflight green, FTF sweep complete (125 + optional 25 rows in finetune_results)
  • Results dossier merged at documentation/reviews/2026-04-29-track5-label-smoothing-results.md
  • Story OP wp#40 closed with PR + commit reference + MLflow run IDs
  • If gate cleared : winning variant locked in ftf_baseline.json via Console flip (operator action)
  • If gate not cleared : H1/H2 falsified entry written into Story closure note + sprint S1 retrospective triggered

11. References

11.1 Existing artefacts (built on)

  • documentation/F1_BUY_BOOST_PLAN.md — canonical plan, §5 Track 5 spec
  • documentation/epics/CVN-N001-EE-f1-buy-boost.md — Epic doc with Story tracking table
  • documentation/templates/TEMPLATE_mlops_readiness.md — to be filled per ADR-70
  • documentation/adr/0070-mlops-readiness-template-mandatory.md — gate
  • documentation/adr/0071-trading-kill-switch-invariants.md — sister DESIGN

11.2 Code surfaces

  • src/commun/finetune/ablation_matrix.py:342 (factor_calibration model pattern)
  • src/commun/finetune/ablation_runner.py:110 (_extract_extended_metrics)
  • src/commun/finetune/persistence.py:34 (finetune_results insert)
  • src/commun/finetune/preflight/base.py (FTF preflight ADR-64)
  • src/commun/finetune/preflight/hamilton_exec.py (Hamilton driver pattern)
  • src/commun/observability/otel.py:88 (OTel singleton)
  • src/training/XGBoost/cvntrade_XGBoost_trainer.py:144 (label ingestion point)
  • scripts/ftf_config_ui.py:60 (PARAM_OPTIONS registration)
  • tests/unit/test_ftf_guardrails.py (guardrail test pattern)

11.3 Existing ADRs the plan builds on

  • ADR-25 — No silent fallback (cleanlab errors must be loud)
  • ADR-29 — Naïve baseline mandatory (already in baseline)
  • ADR-30, ADR-32, ADR-33 — Structured logs (golden-field attributes on spans)
  • ADR-56 — Every change FTF-testable (factor matrix is the contract)
  • ADR-58 — Every factor has a guardrail + integration test (mandatory)
  • ADR-59 — All params in PostgreSQL ftf_config (Console-only edit)
  • ADR-61 — Hamilton for batch flows (label preprocessing)
  • ADR-62 — OpenTelemetry spans (instrumentation contract)
  • ADR-63 — Binary BUY/NOT_BUY mission mode
  • ADR-64 — FTF preflight first-class
  • ADR-65 — Airflow DAG params run-level only
  • ADR-67 — Pluggable feature selection (architectural sibling)
  • ADR-68 — Committee = default review channel (this dossier)
  • ADR-69 — OpenProject orchestrator (Story discipline)
  • ADR-70 — MLOps readiness template (must fill, Phase 2 step 2.10)
  • ADR-71 — Kill-switch invariants (live-deploy gate ; not relevant for FTF-only Track 5 but documented)

11.4 Committee sessions (precedent)

  • 9d4942cb — F1 boost plan v3 PASSED OK avg 8.96 (parent approval)
  • 3e0a3008 — sprint S0 / S01 ADR-70 PASSED OK avg 8.6 (mode B precedent for docs)
  • 4c388b4c — sprint S0 / S02 ADR-71 PASSED EXECUTION_RISK avg ~8.3 (mode B with 1 BLOCKER)
  • 8a202a18 — this Story's plan_review : PASSED / OK avg 8.7 (architect 9.0, ops 9.0, ml-eng 8.5, crypto-trader 8.5, data-sci 8.5), 0 blockers, 4 dissents (Q1 / Q4 / Q5 / temporal leakage), 19 recommendations of which 18 actionable applied to this dossier in revisions below ; reco 16 ("consult ML expert on sample-weight × XGBoost gradient interaction") considered satisfied by this very committee session
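
As a quick arithmetic check, the five persona scores quoted for session 8a202a18 can be averaged and compared with the §10 acceptance threshold (≥ 8.0 avg, no unresolved blockers). The function name below is hypothetical and the snippet only re-derives figures already stated in this dossier.

```python
from statistics import mean
from typing import Dict

# Scores as quoted for session 8a202a18.
SCORES: Dict[str, float] = {
    "architect": 9.0,
    "ops": 9.0,
    "ml-eng": 8.5,
    "crypto-trader": 8.5,
    "data-sci": 8.5,
}


def verdict_passes(scores: Dict[str, float], blockers: int, threshold: float = 8.0) -> bool:
    """Acceptance rule from §10: average at or above threshold and zero unresolved blockers."""
    return mean(scores.values()) >= threshold and blockers == 0


print(round(mean(SCORES.values()), 2), verdict_passes(SCORES, blockers=0))
```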

11.4.1 Recommendations applied (committee 8a202a18)

| # | Reco | Where applied |
|---|---|---|
| 1 | Detail cost formula v3 | §2.1 with formula + link to economic_thresholds.py |
| 2 | Clarify fold construction (walk-forward + purge + embargo, ADR-14) | §2.1 |
| 3 | Note post-Track 5 9-filter chain re-evaluation | §3.2 + §8 §3 (drift) |
| 4 | OTel metric cleanlab_buy_drop_pct | §4.5 metrics + alerts |
| 5 | OTel metrics for final_weights distribution (min/max/mean/std) | §4.5 metrics |
| 6 | Cleanlab CV folds purged + embargoed (ADR-14 temporal leakage) | §4.2 + §4.4 + Hamilton driver pseudocode |
| 7 | Per-class HOLD ECE gate (ECE_HOLD ≤ baseline + 0.01) | §1.3 success criterion |
| 8 | Hamilton node ↔ OTel span naming consistency + asserted via unit test | §4.5 |
| 9 | Risk : cleanlab model instability | §7 R9 + monitoring (recos 4, 19) |
| 10 | Risk : XGBoost training time inflation | §7 R10 |
| 11 | K8s resource scaling check | §6 Phase 2 step 2.14 + §8 §2 alerting |
| 12 | Property-based tests details for label_pipeline.py | §6 Phase 2 step 2.6 |
| 13 | Memory + CPU monitoring during cleanlab CV | §4.5 attributes + metric mem_peak_mb |
| 14 | Pin cleanlab version in dependencies | §6 Phase 2 step 2.13 + §8 §5 rollback |
| 15 | PDF report integrity check | §6 Phase 3 step 3.3b |
| 17 | Adjusted smoothing values : mild=(eps_buy=0.15, eps_hold=0.075), aggressive=(eps_buy=0.30, eps_hold=0.15) (canonical (eps_buy, eps_hold) tuple ordering) | §4.3 variant table |
| 18 | Cleanlab CV folds : 3 → 2 (fallback 1 if > 60 s/fold) | §4.2 + §4.4 + Hamilton pseudocode + §7 R1 |
| 19 | Grafana alert cleanlab_drop_pct > 4.5% | §4.5 alerts + §8 §2 |

11.4.2 Dissents recorded (not amendments)

  • Q1 smoothing values : 4 experts split, settled by reco 17 amendment
  • Q4 cleanlab CV folds : settled by reco 18 amendment (3 → 2 with fallback)
  • Q5 joint config strategy : kept gated (run only if both single-factor pass) — operator may override
  • Temporal leakage on cleanlab CV : critical, settled by reco 6 amendment
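
The temporal-leakage concern behind reco 6 can be illustrated with a purged and embargoed fold generator. This is a simplified sketch, not the ADR-14 construction (which this dossier does not reproduce): the function name is hypothetical, samples are assumed to be time-ordered, and the purge and embargo windows are assumed to be fixed row counts rather than derived from label end times.

```python
from typing import Iterator, List, Tuple


def purged_embargoed_folds(
    n_samples: int, n_folds: int = 2, purge: int = 0, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) over contiguous, time-ordered test blocks.

    Training rows within `purge` rows of either edge of the test block are
    removed, plus a further `embargo` rows after the block.
    """
    if n_folds < 2 or n_samples < n_folds:
        raise ValueError("need at least 2 folds and one sample per fold")
    bounds = [n_samples * k // n_folds for k in range(n_folds + 1)]
    for start, stop in zip(bounds[:-1], bounds[1:]):
        test = list(range(start, stop))
        # Rows before the test block, minus the purge window just ahead of it.
        before = range(0, max(0, start - purge))
        # Rows after the test block, minus the purge + embargo window behind it.
        after = range(min(n_samples, stop + purge + embargo), n_samples)
        yield list(before) + list(after), test


# Two folds (post-reco-18 default), 1-row purge and 1-row embargo on 10 samples.
for train_idx, test_idx in purged_embargoed_folds(10, n_folds=2, purge=1, embargo=1):
    print(train_idx, test_idx)
```

Every sample lands in exactly one test block, so out-of-sample pred_probs can still be produced for the full training set while rows adjacent to each test block stay out of the corresponding training split.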

11.5 Issues + sprint

  • Need CVN-N001 (#608)
  • Epic CVN-N001-EE (#707) — F1_buy boost
  • Story CVN-N001-EE-S01 (#712) — this Story
  • Operational prereqs : ✅ #708 (kill-switch DESIGN), ✅ #709 (MLOps template DESIGN)
  • Sprint F1B-S1-QW-PhaseA (OP version id 6, dates 2026-05-04 → 2026-05-17 — note : we're starting early ; sprint dates are aspirational)
  • Sibling Story (next in sprint) : CVN-N001-EE-S02 (#713) — focal loss, gated by Track 5 result

11.6 External

  • Müller, R., Kornblith, S., Hinton, G. When Does Label Smoothing Help? NeurIPS 2019 : uniform label smoothing, calibration and distillation (does not itself cover the asymmetric variant)
  • Northcutt, C., Jiang, L., Chuang, I. Confident Learning: Estimating Uncertainty in Dataset Labels. JAIR 2021 : cleanlab paper
  • López de Prado, M. Advances in Financial Machine Learning (Wiley, 2018) §3.6 : meta-labeling and label noise in HFT