Plan Dossier, Track 5 : Asymmetric label smoothing + cleanlab (CVN-N001-EE-S01)

Date: 2026-04-28
Issue: #712
Story: CVN-N001-EE-S01 (OP wp#40, sprint F1B-S1-QW-PhaseA)
Epic: CVN-N001-EE (#707), F1_buy boost (13-track plan)
Need: CVN-N001 (F1 mission, #608)
Author: Dominique (operator) + Claude
Status: awaiting plan_review (ADR-68 mode A : committee BEFORE implementation)
Reviewers requested: Expert Committee (5 personas + consolidator)
Operational prereqs sign-off: ✅ ADR-70 (#709 wp#55, PR #730), ✅ ADR-71 (#708 wp#56, PR #731)

0. Executive summary

One paragraph — the rest of the dossier expands on this.

CVNTrade's binary classifier plateaus at f1_buy ∈ [0.40, 0.46] across 400 Phase 2 runs (median 0.418, peak 0.541). The hypothesis under test : the model overfits noise on the minority BUY class (~20 % rate) because hard binary cross-entropy on {0, 1} labels pushes predictions toward full confidence, even on samples whose labels are wrong. Track 5 introduces two independent levers : (a) asymmetric label smoothing (y_buy = 1 − ε_buy, y_hold = ε_hold, with ε_buy > ε_hold so the BUY tail shrinks more than the HOLD bulk) ; (b) cleanlab-based label-noise filtering (drop or reweight samples that confident learning flags as likely mislabelled). Both ship as independent FTF factors (ADR-58 guardrails + integration tests), with parameters editable from Console (ADR-59) and label preprocessing implemented as a Hamilton dataflow (ADR-61) emitting OpenTelemetry spans (ADR-62). Success = per-track gate (F1 plan §6) cleared on the FTF rerun against baseline ftf_20260427_170614_06626e_*. If neither lever clears the gate, the result still falsifies the smoothing hypothesis cleanly and unlocks Track 6 (focal loss).

1. Objective

1.1 Problem

f1_buy plateau on the live FTF surface :

Statistic	f1_buy	Source
Median (n=400 variants)	0.418	Phase 2 audit, 2026-04-26
Peak	0.541	Phase 2 audit
Baseline naïve "always BUY"	~0.33 (precision ≈ 0.20, recall = 1.0 at a ~20 % BUY rate)	ADR-29
Economic break-even (cost-net)	~0.500	F1 plan §4

The median is below break-even — current model is structurally unprofitable. Tuning inside the existing OHLCV + XGBoost + binary-triple-barrier surface has reached an asymptote (12 levers tried in v1, all hit the same wall). Track 5 exits the asymptote by changing the label structure without changing inputs, model class, or barriers.

1.2 Hypothesis (falsifiable)

H : Hard binary labels {0, 1} cause XGBoost to overfit on the minority BUY class. Replacing them with asymmetric soft labels (with ε_buy > ε_hold) and removing systematically mislabelled samples (cleanlab confident-learning) will reduce calibration error and improve f1_buy by ≥ +0.015 on the joint metric gate.

The hypothesis is falsified if none of the following three holds on the FTF rerun :
- ECE drops by ≥ 0.02 (calibration improvement)
- Δf1_buy ≥ +0.01 (Story-specific success, AC #712)
- Per-track gate (F1 plan §6) cleared on at least one variant of either factor

If all three fail across all variants, the hypothesis is rejected. The Story still merges (the FTF result is a permanent record per ADR-29), the factor stays available with none as the default variant, and we proceed to Track 6 (focal loss) with a tighter prior.
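The ECE threshold above depends on how ECE is binned, which this dossier does not specify. A minimal sketch, assuming the common equal-width binned definition on the predicted BUY probability with 10 bins (the function name and the bin count are illustrative, not taken from the codebase) :

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: sample-weighted mean of |observed BUY rate - mean predicted p|.

    Assumption: equal-width bins on [0, 1]; the project may bin differently.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        sum_p[idx] += p
        sum_y[idx] += y
        count[idx] += 1
    n = len(probs)
    ece = 0.0
    for sp, sy, c in zip(sum_p, sum_y, count):
        if c:  # empty bins contribute nothing
            ece += (c / n) * abs(sy / c - sp / c)
    return ece
```

Under this definition, the "ECE drops by ≥ 0.02" test compares the value on the baseline predictions with the value on the variant predictions for the same (crypto × fold) cell.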

1.3 Success criterion (from #712 + F1 plan §6)

The Story is considered successful if all of the following hold on a single winning variant configuration :

Criterion	Threshold	Source
ECE drop OR f1_buy gain	ECE drop ≥ 0.02 OR Δ f1_buy ≥ +0.01	#712 Story-specific
Per-class HOLD ECE	`ECE_HOLD ≤ baseline + 0.01` (asymmetric smoothing must NOT degrade majority calibration)	committee reco 7
Joint primary — f1_buy	≥ +0.015 with 95 % bootstrap CI excluding 0	F1 plan §6
Joint primary — expectancy_net	≥ baseline (cost formula v3)	F1 plan §4 + §6
Joint primary — sortino	≥ baseline	F1 plan §6
Joint primary — max_drawdown	≤ baseline + 1 %	F1 plan §6
Distribution — cryptos improving	≥ 4 of 5	F1 plan §6
Density — BUY trades per fold	≥ 50	F1 plan §6
Statistical — Cohen's d	≥ 0.3 vs baseline on f1_buy	FTF tuning protocol
Statistical — BH-corrected p	< 0.05 vs baseline on f1_buy	FTF tuning protocol

A variant that beats f1_buy alone but degrades expectancy_net is automatically rejected by the joint gate (F1 plan §6, risk #2).
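The criteria table can be read as a single conjunction over one variant. The sketch below encodes the thresholds from the table ; the `VariantDeltas` container and its field names are hypothetical and only illustrate how the joint gate rejects a variant that wins on f1_buy but loses on expectancy_net.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class VariantDeltas:
    """Variant-minus-baseline summary (hypothetical field names)."""
    ece_drop: float                # baseline ECE minus variant ECE
    hold_ece_increase: float       # variant ECE_HOLD minus baseline ECE_HOLD
    f1_buy_gain: float
    f1_buy_ci_low: float           # lower bound of the 95 % bootstrap CI
    expectancy_net_delta: float
    sortino_delta: float
    max_drawdown_delta_pct: float  # percentage points vs baseline
    cryptos_improving: int
    min_buy_trades_per_fold: int
    cohens_d: float
    bh_adjusted_p: float


def failed_criteria(d: VariantDeltas) -> List[str]:
    """Return the names of the criteria the variant misses (empty = PASS)."""
    checks = {
        "story_specific": d.ece_drop >= 0.02 or d.f1_buy_gain >= 0.01,
        "hold_ece": d.hold_ece_increase <= 0.01,
        "f1_buy": d.f1_buy_gain >= 0.015 and d.f1_buy_ci_low > 0.0,
        "expectancy_net": d.expectancy_net_delta >= 0.0,
        "sortino": d.sortino_delta >= 0.0,
        "max_drawdown": d.max_drawdown_delta_pct <= 1.0,
        "cryptos_improving": d.cryptos_improving >= 4,
        "buy_density": d.min_buy_trades_per_fold >= 50,
        "cohens_d": d.cohens_d >= 0.3,
        "bh_p": d.bh_adjusted_p < 0.05,
    }
    return [name for name, ok in checks.items() if not ok]


# Illustrative numbers only, not results.
winner = VariantDeltas(0.03, 0.0, 0.02, 0.004, 0.1, 0.05, 0.2, 4, 60, 0.4, 0.01)
f1_only = replace(winner, expectancy_net_delta=-0.1)
```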

2. How we prove the objective is met

2.1 Baseline anchor

Baseline run: ftf_20260427_170614_06626e_* (Phase 2 rerun, 550 results, post-#704 FS↔FE drift fix). All metrics are read from PostgreSQL finetune_results (src/commun/finetune/persistence.py:34) with WHERE run_id LIKE 'ftf_20260427_170614_06626e_%'.

For each (crypto × fold) cell, the baseline value is the none variant of every existing factor. Track 5 introduces two new factors (label_smoothing + cleanlab) ; the baseline = both at none / off (i.e., the existing trainer behavior, unchanged).

Fold construction (per ADR-14 + committee reco 2) : 5 outer folds via walk-forward time-series split with purge + embargo (no random KFold). Each fold's train / val / test windows are contiguous in time, and a 1-day embargo separates consecutive splits to prevent leakage from triple-barrier label horizons (H4). Same construction as the baseline run for direct comparability.
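To make the fold construction concrete, here is a simplified sketch of a walk-forward split with a gap between the training window and the evaluation window. It is not the project splitter : it uses an expanding training window, collapses val / test into one evaluation window, and takes the gap size in bars as a parameter (the value 6 in the usage lines is purely illustrative).

```python
from typing import List, Tuple


def walk_forward_folds(
    n_bars: int, n_folds: int = 5, gap_bars: int = 0
) -> List[Tuple[range, range]]:
    """Return (train, evaluation) index ranges, contiguous and ordered in time.

    The last `gap_bars` bars before each evaluation window are left out of
    training so labels whose horizon reaches into that window cannot leak.
    """
    block = n_bars // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough bars for the requested number of folds")
    folds = []
    for k in range(1, n_folds + 1):
        eval_start = k * block
        eval_end = n_bars if k == n_folds else eval_start + block
        train_end = eval_start - gap_bars
        if train_end <= 0:
            raise ValueError("gap_bars leaves no training data in the first fold")
        folds.append((range(0, train_end), range(eval_start, eval_end)))
    return folds


folds = walk_forward_folds(n_bars=600, n_folds=5, gap_bars=6)
```

Every training range ends strictly before its evaluation range starts, which is the property a random KFold would violate.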

Cost formula v3 (per F1 plan §4 + committee reco 1) : expectancy_net = expectancy_gross − (taker_fee + spread + slippage_10bps + funding_cost), where taker_fee=0.10%, spread=0.05%, slippage=10 bps interim (until #711 dynamic slippage ships), funding=0.01% per 8h. Implementation in src/commun/audit/economic_thresholds.py. The same formula is applied to the baseline rows ; deltas are like-for-like.
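As a worked instance of the formula, the sketch below applies the four cost terms in percent of notional. Two points are assumptions not stated above : each term is charged once per trade exactly as written in the formula, and funding is charged for every started 8 h period of the holding time. The function is illustrative and is not the code in economic_thresholds.py.

```python
import math

# Cost terms from the formula above, in percent of notional.
TAKER_FEE_PCT = 0.10
SPREAD_PCT = 0.05
SLIPPAGE_PCT = 0.10          # 10 bps interim value
FUNDING_PCT_PER_8H = 0.01


def expectancy_net(expectancy_gross_pct: float, holding_hours: float) -> float:
    """Gross expectancy minus fees, spread, slippage, and funding (all in %)."""
    # Assumption: funding accrues once per started 8 h period.
    funding_periods = math.ceil(holding_hours / 8)
    funding_cost = FUNDING_PCT_PER_8H * funding_periods
    total_cost = TAKER_FEE_PCT + SPREAD_PCT + SLIPPAGE_PCT + funding_cost
    return expectancy_gross_pct - total_cost
```

With these values a trade held for 4 h carries 0.26 % of cost, so a gross expectancy of 0.50 % nets 0.24 %.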

2.2 Experimental matrix

Per ADR-56 (every factor is independent and A/B-testable), the FTF runs each new factor against baseline with one factor active at a time :

Factor	Variants	Baseline cross-factor
`label_smoothing`	`none`, `mild`, `aggressive`	`cleanlab=off`
`cleanlab`	`off`, `filter`, `reweight`	`label_smoothing=none`

Total : 5 unique configs (3 + 3 − 1 shared none/off baseline) × 5 cryptos × 5 folds = 125 trained models. Plus a joint confirmation run (best label_smoothing × best cleanlab) to test for super-additivity, run only if both factors individually pass the per-track gate. Joint adds ≤ 1 additional config × 25 = +25 models. Worst case 150 models.
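The count can be checked by enumerating the one-factor-at-a-time design directly. This is a sketch of the arithmetic only ; the list and function names are illustrative and cryptos / folds are represented by plain indices.

```python
from itertools import product

LABEL_SMOOTHING = ["none", "mild", "aggressive"]
CLEANLAB = ["off", "filter", "reweight"]
N_CRYPTOS, N_FOLDS = 5, 5


def one_factor_at_a_time():
    """Vary one factor while the other stays at its baseline value."""
    configs = [(ls, "off") for ls in LABEL_SMOOTHING]
    # ("none", "off") is already present, so skip it on the cleanlab side.
    configs += [("none", cl) for cl in CLEANLAB if cl != "off"]
    return configs


configs = one_factor_at_a_time()
runs = list(product(configs, range(N_CRYPTOS), range(N_FOLDS)))
joint_runs = N_CRYPTOS * N_FOLDS  # one extra config if the joint run happens
```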

2.3 Statistical evidence required for a "PASS" verdict

The factor is kept (won the gate) only if :

Per-track gate (F1 plan §6) clears on the joint metric vector, evaluated on the winning variant.
Bootstrap CI 95 % on the (factor × variant − baseline) f1_buy delta excludes 0. Computed per (crypto × fold) cell, then aggregated.
Benjamini-Hochberg correction : the variant's p-value (paired Wilcoxon vs baseline) is significant after BH-correction across all non-baseline variants tested in this Story (m = 4 single-factor variants, or 5 if the joint config is run ; the k-th smallest p-value is compared with α × k / m).
Cohen's d ≥ 0.3 on the f1_buy delta distribution (small effect size minimum).
At least 4 of 5 cryptos show positive Δ f1_buy individually (no portfolio-only wins).
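Three of the checks above are short enough to sketch with the standard library. This is an illustration under stated assumptions, not the existing FTF report generator : the bootstrap is a percentile bootstrap on the mean of the per-cell deltas, Cohen's d is the paired form (mean delta over the standard deviation of the deltas), and the Wilcoxon p-values are taken as given inputs.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_ci(
    deltas: Sequence[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI on the mean of the (variant - baseline) deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def cohens_d_paired(deltas: Sequence[float]) -> float:
    """Paired effect size: mean delta divided by the sample std of the deltas."""
    return statistics.fmean(deltas) / statistics.stdev(deltas)


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Step-up BH: reject the k smallest p-values, k = max rank with p <= alpha*rank/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_rank
    return rejected


# Illustrative deltas and p-values only, not results.
deltas = [0.02, 0.03, 0.01, 0.025, 0.015]
ci_low, ci_high = bootstrap_ci(deltas)
```

A CI that "excludes 0" corresponds to `ci_low > 0` here, and the BH function returns one keep / reject flag per variant in the order the p-values were passed.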

2.4 Reporting artefact

The PDF report from documentation/reviews/2026-04-29-track5-label-smoothing-results.md (post-rerun, separate dossier) MUST contain :

The 5 statistical evidence boxes from §2.3
A per-crypto × per-fold heatmap of f1_buy deltas
Calibration curves (reliability diagrams) for baseline vs winning variant
ECE tables (winning variant per crypto)
Bootstrap CI plots
Lockfile : the ftf_config snapshot (PostgreSQL) at run time, so the result is reproducible
MLflow run IDs for top-3 models per crypto
Operator decision : lock, keep available, or abandon

This artefact is the deliverable that closes the Story — not the code merge.

3. What we'll do

3.1 Functional decomposition

Five thin slices, ordered by dependency :

Label preprocessing module (src/training/labels/) — pure functions for asymmetric smoothing + cleanlab application, no XGBoost dependency.
FTF factor declarations (src/commun/finetune/ablation_matrix.py) — two new factors (label_smoothing, cleanlab), env-var-driven per the existing pattern (line 342 factor_calibration is the model).
Console params registration (scripts/ftf_config_ui.py) — register the new env keys in PARAM_OPTIONS so operators edit them from Console only.
Trainer integration (src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145) — call the label preprocessing pipeline before xgb.DMatrix creation, gated by env vars.
Hamilton orchestration + OTel — wrap the label preprocessing as a Hamilton dataflow (ADR-61) emitting OTel spans (ADR-62) per step.

Plus the cross-cutting items :
- Guardrail tests + integration test (ADR-58) for both new factors
- MLOps readiness template filled (documentation/stories/CVN-N001-EE-S01/mlops_readiness.md) per ADR-70
- Migration : no SQL migration required (all new keys live in ftf_config.base_env JSONB ; factor catalogue is Python data per ADR-56)

3.2 What we explicitly do NOT do

No model class change — XGBoost stays
No barrier change — ATR0.5_1.5_H4 stays per project policy
No feature change — same OHLCV + enrichment + FE pipeline
No CUSUM / filter chain change — out of Track 5 scope
No live deployment — paper-FTF only ; live deployment gated by ADR-71 + EG-S06 flatten_all
No 9-filter chain re-evaluation in this Story — the FTF metrics are computed inside the existing filter chain, but a separate downstream evaluation (per committee reco 3) will assess whether the winning Track 5 variant changes the funnel pass-rate distribution. That evaluation is added to the post-merge instrumentation phase (§6 Phase 5).

4. Detailed architecture

4.1 System integration overview

flowchart TB
    Console["Console UI
(Streamlit, port 8501)
scripts/ftf_config_ui.py"]
    PG[("PostgreSQL
ftf_config
(base_env JSONB)")]
    AblMatrix["ablation_matrix.py
+ 2 new AblationFactors"]
    Runner["ablation_runner.py
variant × crypto × fold loop"]
    PreflightFTF["FTF preflight
(ADR-64)
guardrail validation"]
    LabelHam["Hamilton dataflow
(ADR-61)
src/training/labels/"]
    Trainer["XGBoost trainer
cvntrade_XGBoost_trainer.py:144"]
    OTel["OTel collector
(ADR-62)"]
    Loki["Loki
structured logs"]
    Persist["finetune_results table
(persistence.py:34)"]
    Report["PDF report + dossier
2026-04-29-track5-label-smoothing-results.md"]

    Console -- "operator edits
label_smoothing.epsilon_buy
cleanlab.mode" --> PG
    PG -- "loaded at FTF run start" --> AblMatrix
    AblMatrix -- "factor declarations" --> Runner
    Runner --> PreflightFTF
    PreflightFTF -- "validate (factor, variant) compat" --> Runner
    Runner -- "iterate variants" --> LabelHam
    LabelHam -- "smoothed + cleanlab-filtered y_train" --> Trainer
    Trainer -- "f1_buy, expectancy_net, ..." --> Persist
    LabelHam -- "spans" --> OTel
    Trainer -- "spans" --> OTel
    LabelHam -- "log_event" --> Loki
    Persist --> Report

4.2 Label transformation pipeline (per-fold)

flowchart LR
    raw["y_train (raw)
shape (N,)
dtype int8 ∈ {0,1}"]
    smooth["asymmetric_smooth
ε_hold, ε_buy"]
    soft["y_smoothed
shape (N,)
dtype float32"]
    clean["cleanlab.find_label_issues
+ pred_probs from CV fold"]
    mask["suspect_mask
shape (N,) bool"]
    apply["apply_mode
filter | reweight | off"]
    final["y_final + sample_weights
passed to xgb.DMatrix"]

    raw --> smooth
    smooth --> soft
    soft -.optional.-> clean
    clean --> mask
    soft --> apply
    mask --> apply
    apply --> final

    classDef new fill:#fef3c7,stroke:#d97706
    class smooth,clean,mask,apply,soft,final new

Notes :
- asymmetric_smooth is unconditional : variant none = identity, mild = (eps_buy=0.15, eps_hold=0.075), aggressive = (eps_buy=0.30, eps_hold=0.15). Values per committee reco 17 ; see the §4.3 variant table for the canonical declaration.
- The cleanlab step is gated by the factor_cleanlab variant ; when off it is a no-op pass-through.
- pred_probs for cleanlab come from the existing PurgedKFold splitter (training.cv.purged_kfold, López de Prado AFML Ch 7), reused so the inner cleanlab CV honours the same purge / embargo contract as the outer FTF folds. PurgedKFold reads CVN_PURGE_BARS and CVN_EMBARGO_BARS from the env, so this Story doesn't introduce a parallel configuration knob (CR PR #734 round 2). Default n_splits=2 per committee reco 18.
- Soft per-fold wall-time budget : if any fold exceeds CVN_CLEANLAB_FOLD_BUDGET_S (default 60 s), the loop short-circuits and unprocessed slices keep the initial 0.5 prior. find_label_issues then treats those samples as ambiguous and skips them rather than mis-flagging them.
- No leaky fallback (committee pr_review session 989a6567 blocker 1) : the prior "single-fold tail train-on-self" path was removed because it produced biased pred_probs that silently corrupted the cleanlab signal. Honest partial coverage with explicit telemetry (coverage_pct event field + P2 alert at < 80 %) beats a leaky full coverage.
- Sample weights from apply_mode=reweight are multiplied with the existing class-balancing weights (line 165 of trainer). Order matters and is documented in code.
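The wall-time budget behaviour can be sketched as a small loop. This is an illustration, not the module's code : `fit_predict` stands in for the per-fold model fit, the clock is injectable so the example is deterministic, and it assumes the fold that overruns the budget still contributes its predictions while later folds are skipped.

```python
import time
from typing import Callable, List, Sequence, Tuple

Fold = Tuple[Sequence[int], Sequence[int]]  # (train indices, held-out indices)


def cv_pred_probs_with_budget(
    n_samples: int,
    folds: Sequence[Fold],
    fit_predict: Callable[[Sequence[int], Sequence[int]], Sequence[float]],
    budget_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[float], float]:
    """Out-of-fold BUY probabilities plus the share of samples actually covered."""
    pred_probs = [0.5] * n_samples  # uninformative prior for unprocessed slices
    covered = 0
    for train_idx, held_out_idx in folds:
        start = clock()
        fold_probs = fit_predict(train_idx, held_out_idx)
        for i, p in zip(held_out_idx, fold_probs):
            pred_probs[i] = p
        covered += len(held_out_idx)
        if clock() - start > budget_s:
            break  # short-circuit: remaining folds keep the 0.5 prior
    coverage_pct = 100.0 * covered / n_samples
    return pred_probs, coverage_pct


# Fake clock: the first fold "takes" 100 s, so the second fold is skipped.
ticks = iter([0.0, 100.0])
probs, coverage_pct = cv_pred_probs_with_budget(
    n_samples=4,
    folds=[([2, 3], [0, 1]), ([0, 1], [2, 3])],
    fit_predict=lambda train_idx, held_out_idx: [0.9] * len(held_out_idx),
    clock=lambda: next(ticks),
)
```

Here `coverage_pct` comes out at 50, below the 80 % level at which the notes above raise a P2 alert.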

4.3 FTF factor matrix (declarations)

classDiagram
    class AblationFactor {
        +str name
        +str factor_type
        +str category
        +Dict env_vars
        +str description
    }
    class factor_label_smoothing {
        +name = "label_smoothing"
        +category = "model"
        +variants none, mild, aggressive
        +env_keys CVN_LABEL_SMOOTH_EPS_BUY,
        CVN_LABEL_SMOOTH_EPS_HOLD
    }
    class factor_cleanlab {
        +name = "cleanlab"
        +category = "model"
        +variants off, filter, reweight
        +env_keys CVN_CLEANLAB_MODE,
        CVN_CLEANLAB_MAX_DROP_PCT,
        CVN_CLEANLAB_REWEIGHT_FLOOR
    }

    AblationFactor <|-- factor_label_smoothing
    AblationFactor <|-- factor_cleanlab

Variant table :

Factor	Variant	env_vars
`label_smoothing`	`none`	`CVN_LABEL_SMOOTH_EPS_BUY=0.0`, `CVN_LABEL_SMOOTH_EPS_HOLD=0.0`
`label_smoothing`	`mild`	`CVN_LABEL_SMOOTH_EPS_BUY=0.15`, `CVN_LABEL_SMOOTH_EPS_HOLD=0.075` (per committee reco 17, Müller 2019 calibrated for 20 % BUY rate)
`label_smoothing`	`aggressive`	`CVN_LABEL_SMOOTH_EPS_BUY=0.30`, `CVN_LABEL_SMOOTH_EPS_HOLD=0.15` (per committee reco 17)
`cleanlab`	`off`	`CVN_CLEANLAB_MODE=off`
`cleanlab`	`filter`	`CVN_CLEANLAB_MODE=filter`, `CVN_CLEANLAB_MAX_DROP_PCT=5`
`cleanlab`	`reweight`	`CVN_CLEANLAB_MODE=reweight`, `CVN_CLEANLAB_MAX_DROP_PCT=5`, `CVN_CLEANLAB_REWEIGHT_FLOOR=0.5`
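The label_smoothing rows translate into soft labels as follows. A minimal sketch using plain lists (the real pipeline works on arrays) ; the `VARIANTS` dict and the function body are illustrative, with the epsilon values copied from the table.

```python
from typing import List, Sequence

# (eps_buy, eps_hold) per variant, as declared in the variant table.
VARIANTS = {
    "none": (0.0, 0.0),
    "mild": (0.15, 0.075),
    "aggressive": (0.30, 0.15),
}


def asymmetric_smooth(y: Sequence[int], eps_buy: float, eps_hold: float) -> List[float]:
    """Map hard labels to soft targets: BUY (1) -> 1 - eps_buy, HOLD (0) -> eps_hold."""
    if not 0.0 <= eps_hold <= eps_buy <= 0.5:
        raise ValueError("expected 0 <= eps_hold <= eps_buy <= 0.5")
    return [1.0 - eps_buy if label == 1 else eps_hold for label in y]


soft_mild = asymmetric_smooth([0, 1, 1, 0], *VARIANTS["mild"])
soft_none = asymmetric_smooth([0, 1, 1, 0], *VARIANTS["none"])
```

With `mild`, a BUY target moves from 1.0 to 0.85 while a HOLD target moves from 0.0 to only 0.075, so the minority class is pulled further from certainty than the majority class.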

Guardrails (per ADR-58, in src/commun/finetune/validation.py) :

label_smoothing.epsilon_buy ∈ [0.0, 0.5] — silly-bounds check
label_smoothing.epsilon_hold ∈ [0.0, epsilon_buy] : asymmetry direction preserved (no inverted smoothing ; the closed bound admits equality, which the none variant 0.0 / 0.0 requires)
cleanlab.max_drop_pct ≤ 5.0 — F1 plan §6 risk #4 (cap label drop rate at 5 %)
cleanlab.mode in {off, filter, reweight} — closed enum
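The four guardrails can be expressed as one pure validation function. The sketch below is not the code in validation.py ; the function name, argument names, and message wording are hypothetical, and the lower bound of 0 on max_drop_pct is an added assumption.

```python
from typing import List

CLEANLAB_MODES = {"off", "filter", "reweight"}


def validate_track5_params(
    epsilon_buy: float, epsilon_hold: float, mode: str, max_drop_pct: float
) -> List[str]:
    """Return a list of guardrail violations (empty list = config accepted)."""
    errors = []
    if not 0.0 <= epsilon_buy <= 0.5:
        errors.append("label_smoothing.epsilon_buy must be in [0.0, 0.5]")
    if not 0.0 <= epsilon_hold <= epsilon_buy:
        errors.append("label_smoothing.epsilon_hold must be in [0.0, epsilon_buy]")
    # Assumption: a negative cap is meaningless, so it is rejected too.
    if not 0.0 <= max_drop_pct <= 5.0:
        errors.append("cleanlab.max_drop_pct must be in [0.0, 5.0]")
    if mode not in CLEANLAB_MODES:
        errors.append("cleanlab.mode must be one of off, filter, reweight")
    return errors
```

Returning every violation at once, rather than raising on the first, lets a preflight report show the operator the full list in one pass.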

Integration test (ADR-58) :
- tests/integration/test_track5_label_smoothing.py exercises apply_label_pipeline end-to-end (in-process, no PostgreSQL, no Airflow runner) for each of the 5 unique configs from §2.2 + the joint config. Asserts (a) no crash, (b) (X_final, y_final, sample_weights_final) shapes are mutually consistent, (c) y_final dtype is float32 or float64, (d) cleanlab filter mode drops ≤ max_drop_pct rows (strict cap, no tolerance), (e) the output is consumable by a real xgb.DMatrix → xgb.train pair (training succeeds and the produced booster predicts in [0, 1]).
- The PostgreSQL finetune_results write path is not exercised here ; it's covered by the existing FTF persistence tests (tests/unit/finetune/test_persist_predictions.py) and verified manually in Phase 3 of §6 when the operator runs the FTF sweep on the cluster.

4.4 Hamilton dataflow

flowchart TD
    raw_y["raw_y_train: np.ndarray"]
    eps_buy["epsilon_buy: float"]
    eps_hold["epsilon_hold: float"]
    cl_mode["cleanlab_mode: str"]
    cl_max_drop["cleanlab_max_drop_pct: float"]
    cl_floor["cleanlab_reweight_floor: float"]
    X_train["X_train: np.ndarray"]
    base_w["base_sample_weights: np.ndarray"]

    smoothed_y["smoothed_y(raw_y, eps_buy, eps_hold)"]
    cv_probs["cv_pred_probs(X_train, smoothed_y)
2-fold purged+embargoed CV
per ADR-14"]
    suspect["suspect_mask(smoothed_y, cv_probs)"]
    final_y["final_y(smoothed_y, suspect, cl_mode, cl_max_drop)"]
    final_w["final_weights(base_w, suspect, cl_mode, cl_floor)"]

    raw_y --> smoothed_y
    eps_buy --> smoothed_y
    eps_hold --> smoothed_y
    smoothed_y --> cv_probs
    X_train --> cv_probs
    smoothed_y --> suspect
    cv_probs --> suspect
    smoothed_y --> final_y
    suspect --> final_y
    cl_mode --> final_y
    cl_max_drop --> final_y
    base_w --> final_w
    suspect --> final_w
    cl_mode --> final_w
    cl_floor --> final_w

    classDef out fill:#dcfce7,stroke:#16a34a
    class final_y,final_w out

File : src/training/labels/label_pipeline.py (Hamilton dataflow module).
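The last stage of the dataflow (final_y and final_weights) applies the cleanlab mode. The sketch below folds both outputs into one plain function for readability ; it is not the Hamilton module. Two behaviours are assumptions : in filter mode, when more samples are flagged than the cap allows, only the first flagged samples in index order are dropped (a real implementation would more likely rank by label quality), and in reweight mode a flagged sample's existing weight is multiplied by the floor.

```python
import math
from typing import List, Sequence, Tuple


def apply_mode(
    y: Sequence[float],
    weights: Sequence[float],
    suspect: Sequence[bool],
    mode: str,
    max_drop_pct: float = 5.0,
    reweight_floor: float = 0.5,
) -> Tuple[List[int], List[float], List[float]]:
    """Return (kept row indices, final labels, final sample weights)."""
    if mode not in {"off", "filter", "reweight"}:
        raise ValueError(f"unknown cleanlab mode: {mode!r}")
    n = len(y)
    if mode == "off":
        return list(range(n)), list(y), list(weights)
    if mode == "reweight":
        # Multiplicative composition with the existing class-balancing weights.
        new_w = [w * reweight_floor if s else w for w, s in zip(weights, suspect)]
        return list(range(n)), list(y), new_w
    # filter: drop flagged rows, never more than the cap.
    max_drop = math.floor(n * max_drop_pct / 100.0)
    dropped = 0
    keep = []
    for i, flagged in enumerate(suspect):
        if flagged and dropped < max_drop:
            dropped += 1
        else:
            keep.append(i)
    return keep, [y[i] for i in keep], [weights[i] for i in keep]


y_demo = [0.075] * 40
w_demo = [1.0] * 40
suspect_demo = [True, True, True] + [False] * 37
keep_f, y_f, w_f = apply_mode(y_demo, w_demo, suspect_demo, "filter")
_, _, w_r = apply_mode(y_demo, w_demo, suspect_demo, "reweight")
```

The kept indices are returned so the caller can subset X_train with the same mask and keep all three outputs aligned.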

Trainer-side invocation (actual implementation as of committee pr_review 989a6567 blocker 3 fix : apply_label_pipeline now uses hamilton.driver.Driver(...).execute(...) with a DictResult adapter ; the previous imperative path was rejected for bypassing Hamilton's lineage tracking) :

import os

from training.labels import apply_label_pipeline

X_train, y_train, sample_weights = apply_label_pipeline(
    X_train,
    y_train,
    base_sample_weights=sample_weights,
    crypto=os.environ.get("CVN_CRYPTO_SYMBOL", "unknown"),
    fold_id=int(os.environ.get("CVN_FOLD_ID", "-1")),
)
# env vars read inside: CVN_LABEL_SMOOTH_EPS_BUY, CVN_LABEL_SMOOTH_EPS_HOLD,
# CVN_CLEANLAB_MODE (validated fail-fast against {off, filter, reweight} per
# ADR-25 — a typo aborts the variant cleanly), CVN_CLEANLAB_MAX_DROP_PCT,
# CVN_CLEANLAB_REWEIGHT_FLOOR, CVN_CLEANLAB_CV_FOLDS (default 2),
# CVN_CLEANLAB_FOLD_BUDGET_S (default 60), CVN_PURGE_BARS + CVN_EMBARGO_BARS
# (consumed by the existing PurgedKFold splitter — single source of truth
# for purge / embargo, no new knob).
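The fail-fast env parsing mentioned in the comments could look like the sketch below. The function name and the returned keys are hypothetical. The defaults for CV folds (2) and fold budget (60 s) come from this dossier ; the other defaults mirror the baseline rows of the variant table and are assumptions.

```python
import os
from typing import Dict, Mapping, Optional, Union

_ALLOWED_MODES = {"off", "filter", "reweight"}


def read_label_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, float, int]]:
    """Parse the Track 5 env vars, aborting on an unknown cleanlab mode."""
    env = os.environ if env is None else env
    mode = env.get("CVN_CLEANLAB_MODE", "off")
    if mode not in _ALLOWED_MODES:
        # Fail fast: a typo must abort the variant, not silently fall back.
        raise ValueError(
            f"CVN_CLEANLAB_MODE={mode!r} is not one of {sorted(_ALLOWED_MODES)}"
        )
    return {
        "epsilon_buy": float(env.get("CVN_LABEL_SMOOTH_EPS_BUY", "0.0")),
        "epsilon_hold": float(env.get("CVN_LABEL_SMOOTH_EPS_HOLD", "0.0")),
        "cleanlab_mode": mode,
        "cleanlab_max_drop_pct": float(env.get("CVN_CLEANLAB_MAX_DROP_PCT", "5")),
        "cleanlab_reweight_floor": float(env.get("CVN_CLEANLAB_REWEIGHT_FLOOR", "0.5")),
        "cv_folds": int(env.get("CVN_CLEANLAB_CV_FOLDS", "2")),
        "fold_budget_s": float(env.get("CVN_CLEANLAB_FOLD_BUDGET_S", "60")),
    }
```

Passing a plain dict instead of reading `os.environ` keeps the parser testable without touching the process environment.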

Lineage emission (per ADR-61) : Hamilton's dr.visualize_execution(...) writes an SVG to MLflow as artefact label_pipeline_lineage_<crypto>_<fold>.svg.

4.5 OpenTelemetry instrumentation (ADR-62)

sequenceDiagram
    participant Runner as ablation_runner
    participant Trainer as XGBoost trainer
    participant Pipeline as Hamilton label pipeline
    participant Tracer as OTel tracer
    participant Coll as OTel collector

    Runner->>Tracer: span "ftf.variant" (factor=label_smoothing, variant=mild, crypto=BTCUSDC, fold_id=0)
    Trainer->>Tracer: span "training.label_pipeline"
    Trainer->>Pipeline: dr.execute(...)
    Pipeline->>Tracer: span "label.smooth" (eps_buy=0.15, eps_hold=0.075)
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.cleanlab.cv_probs"
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.cleanlab.find_issues" (suspect_count=N, drop_pct=2.3)
    Tracer-->>Coll: emit
    Pipeline->>Tracer: span "label.apply_mode" (mode=filter)
    Tracer-->>Coll: emit
    Trainer->>Tracer: span "xgb.train" (n_samples=N')
    Tracer-->>Coll: emit
    Trainer->>Tracer: span "metrics.compute" (f1_buy, ECE, ...)
    Tracer-->>Coll: emit

Spans emitted (one per step, ADR-62 golden-field attributes : factor, variant, crypto, fold_id) :

Span	Attributes (in addition to golden fields)	Purpose
`label.smooth`	`epsilon_buy`, `epsilon_hold`	timing + parameter trace
`label.cleanlab.cv_probs`	`cv_folds=2`, `model=xgb`, `purge_embargo=enabled`, `mem_mb`, `cpu_pct`	wall time + resource footprint of CV pre-step (committee reco 13)
`label.cleanlab.find_issues`	`suspect_count`, `drop_pct`, `buy_drop_pct` (separate, reco 4), `hold_drop_pct`	granular drop visibility on minority class
`label.apply_mode`	`mode`, `effective_drop_pct`, `effective_buy_drop_pct`	confirm guardrail (drop ≤ 5 %) and minority preservation
`training.label_pipeline`	`total_duration_ms`	parent span over all label work

Hamilton ↔ OTel naming consistency (committee reco 8) : each Hamilton node maps to one span in the catalogue : smoothed_y → label.smooth, cv_pred_probs → label.cleanlab.cv_probs, suspect_mask → label.cleanlab.find_issues, final_y and final_weights → label.apply_mode. The cross-reference is asserted in a unit test that diffs the Hamilton node graph against the span catalogue.

Metrics (Prometheus, per ADR-62) :
- cvntrade_label_smooth_eps_applied{factor, variant} (gauge)
- cvntrade_cleanlab_drop_pct{crypto, fold} (gauge)
- cvntrade_cleanlab_buy_drop_pct{crypto, fold} (gauge ; committee reco 4, minority class granularity)
- cvntrade_cleanlab_hold_drop_pct{crypto, fold} (gauge)
- cvntrade_label_pipeline_duration_seconds{stage} (histogram)
- cvntrade_label_final_weights_min{crypto, fold} (gauge ; committee reco 5)
- cvntrade_label_final_weights_max{crypto, fold} (gauge)
- cvntrade_label_final_weights_mean{crypto, fold} (gauge)
- cvntrade_label_final_weights_std{crypto, fold} (gauge)
- cvntrade_label_pipeline_mem_peak_mb{crypto, fold} (gauge ; committee reco 13, K8s pressure visibility)

Grafana alerts (committee reco 19) :
- cvntrade_cleanlab_drop_pct > 4.5 for any (crypto × fold) → P2 alert "preempt 5 % cap breach", runbook documentation/runbooks/cleanlab_over_drop.md
- cvntrade_cleanlab_buy_drop_pct > 3.0 → P2 alert "minority decimation" (separate threshold : BUY samples are precious)

4.6 Console UI integration (ADR-59 + ADR-65)

flowchart LR
    Op[Operator]
    Console["Console UI
scripts/ftf_config_ui.py"]
    Param["PARAM_OPTIONS dict
(line 60-101)"]
    PG[("ftf_config.base_env
JSONB")]
    Hist[("ftf_config_history
audit trail")]

    Op -- "selects variant
'aggressive' from dropdown" --> Console
    Console -- "reads schema" --> Param
    Console -- "UPDATE base_env
SET label_smoothing.epsilon_buy=0.30" --> PG
    PG -. "trigger" .-> Hist
    Console -- "shows diff: was=0.0, now=0.30" --> Op

Registration (in scripts/ftf_config_ui.py:60-101) :

PARAM_OPTIONS = {
    # ... existing ...
    "CVN_LABEL_SMOOTH_EPS_BUY":   ("number", 0.0, 0.5, 0.05),    # (type, min, max, step)
    "CVN_LABEL_SMOOTH_EPS_HOLD":  ("number", 0.0, 0.5, 0.025),   # finer step so the mild value 0.075 is selectable
    "CVN_CLEANLAB_MODE":          ("enum", ["off", "filter", "reweight"]),
    "CVN_CLEANLAB_MAX_DROP_PCT":  ("number", 0.0, 5.0, 0.5),
    "CVN_CLEANLAB_REWEIGHT_FLOOR":("number", 0.0, 1.0, 0.05),
}

The Streamlit form auto-renders the right widget per type. No code change in the UI beyond the dict ; that's the ADR-59 promise.
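To illustrate how a spec tuple from this dict can drive validation as well as widget rendering, here is a hedged sketch. `validate_param` is a hypothetical helper, not a function in ftf_config_ui.py, and only three of the registered keys are reproduced.

```python
# Subset of the registration above: (type, min, max, step) or (type, choices).
PARAM_OPTIONS = {
    "CVN_LABEL_SMOOTH_EPS_BUY": ("number", 0.0, 0.5, 0.05),
    "CVN_CLEANLAB_MODE": ("enum", ["off", "filter", "reweight"]),
    "CVN_CLEANLAB_MAX_DROP_PCT": ("number", 0.0, 5.0, 0.5),
}


def validate_param(key: str, value) -> bool:
    """Check a candidate value against its PARAM_OPTIONS spec."""
    spec = PARAM_OPTIONS[key]
    if spec[0] == "enum":
        return value in spec[1]
    _, low, high, step = spec
    if not low <= value <= high:
        return False
    # The value must sit on the step grid; tolerate float rounding.
    n_steps = (value - low) / step
    return abs(n_steps - round(n_steps)) < 1e-9
```

A check of this kind would catch a value that is inside the range but unreachable from the widget, for example 0.12 with a 0.05 step.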

4.7 Observability fan-out

flowchart LR
    Trainer[Trainer + Hamilton]
    OTel[OTel collector]
    Tempo[Tempo
traces]
    Prom[Prometheus
metrics]
    Loki[Loki
structured logs]
    Graf[Grafana
dashboard 'F1 Boost Track 5']

    Trainer -- "spans" --> OTel
    Trainer -- "metrics" --> OTel
    Trainer -- "log_event" --> Loki
    OTel --> Tempo
    OTel --> Prom
    Tempo --> Graf
    Prom --> Graf
    Loki --> Graf

    Graf -.- Op[Operator dashboard]

New Grafana panels (added in infra/grafana/dashboards/cvntrade-f1-boost.json post-implementation) :

Panel 1 : cvntrade_label_smooth_eps_applied heatmap by (factor, variant)
Panel 2 : cvntrade_cleanlab_drop_pct per crypto over the FTF run
Panel 3 : cvntrade_label_pipeline_duration_seconds p95 — catches cleanlab CV pre-step regressions
Panel 4 : Joint metric scoreboard (f1_buy delta, expectancy_net delta, sortino delta, max_dd delta) per variant from Postgres finetune_results

5. Falsifiability + rollback

5.1 Falsifiable hypotheses (Story-level)

H	What would falsify	Action if falsified
H1 (smoothing helps)	Both non-baseline `label_smoothing` variants (`mild`, `aggressive`) miss per-track gate AND ΔECE < 0.02	Keep factor available with `none` default ; advance to Track 6
H2 (cleanlab helps)	Both non-baseline `cleanlab` variants (`filter`, `reweight`) miss per-track gate AND no improvement on the calibration curves	Keep factor available with `off` default ; advance to Track 6
H3 (joint super-additive)	Joint config (best LS × best CL) ≤ max(LS-only, CL-only) on f1_buy	Drop joint, ship best single-factor variant only
H4 (cleanlab over-drops)	Drop rate hits the 5 % cap on > 1 crypto	Lower `max_drop_pct` to 2 % or revert cleanlab to `off` ; document in Story closure
H5 (silent training degradation)	OOS distribution shifts (training accuracy ↑ but validation accuracy ↓ by > 0.05)	This is overfitting from over-smoothing ; cap `epsilon_buy` at 0.10 ; document

A REJECTED Story (H1 and H2 both falsified) is still a successful Story : it falsifies cleanly and informs the next track. Per F1 plan §6, a NO-GO at sprint S1 close triggers escalation to big-bet bundle (S06-S08).

5.2 Rollback (config-only per ADR-71 + ADR-56)

Per-variant rollback : set factor_label_smoothing=none and factor_cleanlab=off in ftf_config.base_env via Console — < 1 minute.
Per-factor rollback : remove the factor from ablation_matrix.py (revert one commit). Schema unchanged because all params live in JSONB.
No live deployment to roll back — Track 5 ships to FTF only, no production trading impact.

5.3 What's NOT a failure mode

A variant taking longer than the baseline (cleanlab adds ~30 s/fold for the CV pre-step) is expected, not a failure. Logged as cvntrade_label_pipeline_duration_seconds.
A variant changing the per-fold prediction distribution is expected, not a failure. The gate evaluates outcomes, not internals.
A variant beating the gate on 4 of 5 cryptos but failing on 1 (e.g. SHIBUSDC) is acceptable per gate criterion — the outlier crypto stays on the none variant.

6. Action plan (sequenced)

Phase 1 : Plan review (this dossier)

#	Action	Owner	Output
1.1	Submit this dossier to Expert Committee `plan_review`	Claude	committee verdict (PASSED / OK / EXECUTION_RISK / REJECTED)
1.2	Apply blockers (if REJECTED) or actionable recommendations	Claude + operator	updated dossier
1.3	Operator sign-off : approve to proceed to Phase 2	Operator	OP wp#40 comment

Phase 2 : Implementation (post-committee OK)

#	Action	File	Test
2.1	Create Story dossier dir	`documentation/stories/CVN-N001-EE-S01/`	n/a
2.2	Add `label_smoothing` factor declaration	`src/commun/finetune/ablation_matrix.py` (after line 351)	unit
2.3	Add `cleanlab` factor declaration	`src/commun/finetune/ablation_matrix.py`	unit
2.4	Add 4 guardrails	`src/commun/finetune/validation.py`	`tests/unit/test_ftf_guardrails.py`
2.5	Register 5 keys in Console	`scripts/ftf_config_ui.py` (PARAM_OPTIONS dict)	manual UI smoke
2.6	Implement label preprocessing module	`src/training/labels/label_pipeline.py` (Hamilton)	unit + property-based : value-range checks on `y_final ∈ [0, 1]`, `sample_weights >= 0`, `len(y_final) == len(X_train)`, no NaN/Inf, monotonic shrinkage of confidence (per committee reco 12)
2.7	Add OTel spans + metrics	`src/training/labels/label_pipeline.py`	unit (mock collector)
2.8	Wire into XGBoost trainer	`src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145`	unit + integration
2.9	Add integration test (ADR-58)	`tests/integration/test_track5_label_smoothing.py`	passes locally + CI
2.10	Fill MLOps readiness template	`documentation/stories/CVN-N001-EE-S01/mlops_readiness.md`	reviewer checks 6 sections
2.11	Add Grafana panel JSON	`infra/grafana/dashboards/cvntrade-f1-boost.json`	manual smoke
2.12	Update CLAUDE.md if patterns changed	`CLAUDE.md`	n/a
2.13	Pin cleanlab version in dependencies (committee reco 14)	`pyproject.toml` + `poetry.lock`	CI pinned-version check
2.14	Review K8s training pod resources for cleanlab CV pre-step (committee reco 11) — bump memory request if cleanlab CV peaks > 2× baseline ; document in `infra/helm/cvntrade-airflow/values.yaml` if changed	`infra/helm/cvntrade-airflow/values.yaml` (if needed)	k8s smoke pod boot

Phase 3 : Validation (FTF run)

#	Action	Owner	Output
3.1	Run FTF preflight	Operator (Airflow trigger `dag_ftf__preflight`)	preflight green / red
3.2	Run FTF sweep on 5 unique configs	Operator (Airflow `dag_finetune__pte` with `factor=label_smoothing,cleanlab`)	125 rows in `finetune_results`
3.3	Generate stats : bootstrap CI + BH-correction + Cohen's d	Operator (existing FTF report generator)	PDF report
3.3b	PDF report integrity check (committee reco 15) : non-empty file, valid PDF magic bytes, contains the 5 statistical evidence boxes from §2.3	Operator script (`scripts/check_pdf_integrity.py`)	green = continue ; red = re-run report generator
3.4	Decide : per-track gate cleared?	Operator + Claude	go / no-go per variant
3.5	If go : submit joint config (best LS × best CL)	Operator	+25 rows
3.6	Write results dossier	Claude	`documentation/reviews/2026-04-29-track5-label-smoothing-results.md`
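The first two parts of the integrity check in step 3.3b (non-empty file, valid PDF magic bytes) need only the standard library. The sketch below is illustrative and is not the content of scripts/check_pdf_integrity.py ; verifying that the five evidence boxes are present would require parsing the PDF text and is left out.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path) -> bool:
    """True if the file exists, is non-empty, and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


# Self-contained demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    stub = Path(tmp) / "report.pdf"
    stub.write_bytes(b"%PDF-1.7\n% illustrative stub, not a complete PDF\n")
    empty = Path(tmp) / "empty.pdf"
    empty.write_bytes(b"")
    stub_ok = looks_like_pdf(stub)
    empty_ok = looks_like_pdf(empty)
    missing_ok = looks_like_pdf(Path(tmp) / "missing.pdf")
```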

Phase 4 : Merge + close Story

#	Action	Owner	Output
4.1	Open PR (squash, base main)	Claude	PR #
4.2	CodeRabbit review cycle (4-5 passes)	Claude (apply fixes)	clean CR
4.3	Submit committee `pr_review`	Claude	session id
4.4	Apply pr_review fixes	Claude	iteration
4.5	Merge	Operator	commit on main
4.6	CI Docs site green	Claude (verify)	✅
4.7	Close OP Story wp#40 (ADR-69 step 14)	Claude	wp#40 → Closed
4.8	If S02 (#713 focal loss) is the last in sprint S1, close sprint per OPERATIONS §16.4	Claude	sprint closed

Phase 5 : Post-merge instrumentation

#	Action	Owner
5.1	Operator runs FTF sweep with new factor enabled in baseline	Operator
5.2	Grafana dashboard "F1 Boost Track 5" populated	Claude (panels exist post-merge)
5.3	Lock decision per FTF tuning protocol (BH significance + Cohen's d ≥ 0.3)	Operator
5.4	If lock : update `ftf_baseline.json` with the winning variant as new baseline	Operator

7. Risks + mitigations

#	Risk	Likelihood	Impact	Mitigation
R1	Cleanlab CV pre-step adds > 30 s/fold ; FTF run wall-time blows budget	M	M	Span timing measured ; the CV already defaults to 2 folds (committee reco 18) and the soft budget CVN_CLEANLAB_FOLD_BUDGET_S (default 60 s) short-circuits once a fold overruns
R2	Asymmetric smoothing degrades calibration on majority class (HOLD)	M	L	ECE measured per class ; if HOLD ECE worsens > 0.01 (the §1.3 cap), ship `mild` only
R3	Cleanlab drops samples that are real BUYs (the rare ones we need most)	M	H	F1 plan §6 risk #4 cap : drop ≤ 5 % ; cleanlab `reweight` mode is safer than `filter`
R4	Operator forgets to run FTF preflight → guardrail trips mid-run	L	M	Preflight is part of the standard `dag_finetune__pte` flow per ADR-64, hard to skip
R5	Joint LS×CL config worse than either alone (anti-additive)	M	L	H3 hypothesis ; ship single-factor winner
R6	The 5 % drop cap is hit on > 1 crypto → cleanlab effectively useless	L	L	H4 hypothesis ; lower drop cap or revert
R7	OTel collector unreachable → spans drop ; FTF run continues	L	L	OTel emission is non-blocking per `src/commun/observability/otel.py` ; logged warning only
R8	Hamilton dataflow lineage SVG too large for MLflow	L	L	Cap nodes shown ; fall back to JSON adjacency list
R9	Cleanlab model instability : poor `pred_probs` from internal CV → misidentified label issues, false positives on suspect samples (committee reco 9)	M	M	Monitor `suspect_count` and `effective_drop_pct` per fold ; alert on `cleanlab_drop_pct > 4.5%` (preempt 5 % cap) and on `cleanlab_buy_drop_pct > 3.0%` (minority decimation) ; if instability seen on > 1 crypto, fall back to 1-fold mode or switch to `cleanlab=off`
R10	XGBoost training time inflates : altered label distribution + sample weights makes optimization slower (committee reco 10)	L	L	Monitor `xgb.train` OTel span duration p95 ; if regression > 50 % vs baseline, investigate before increasing variant count
R11	K8s resource pressure during cleanlab CV pre-step (memory + CPU) (committee reco 11 + 13)	M	L	Per-pod memory + CPU monitoring during the `label.cleanlab.cv_probs` span ; review training pod requests / limits before sweep ; fall back to 1-fold cleanlab if OOMKilled
R12	Cleanlab dependency drift between dev / CI / prod (committee reco 14)	L	M	Pin `cleanlab` exact version in `pyproject.toml` + `poetry.lock` ; CI verifies pinned version matches lock

8. MLOps readiness preview

The full MLOps readiness file (per ADR-70) will be at documentation/stories/CVN-N001-EE-S01/mlops_readiness.md and committed in Phase 2 step 2.10. Preview of the 6 sections :

§	Section	Track 5 specifics
1	Production monitoring	`cvntrade_label_smooth_eps_applied`, `cvntrade_cleanlab_drop_pct`, `cvntrade_cleanlab_buy_drop_pct`, `cvntrade_cleanlab_hold_drop_pct`, `cvntrade_label_pipeline_duration_seconds`, `cvntrade_label_final_weights_{min,max,mean,std}`, `cvntrade_label_pipeline_mem_peak_mb`
2	Alerting & runbooks	P2 alert on `cleanlab_drop_pct > 4.5%` (preempt 5 % cap, reco 19) ; P2 alert on `cleanlab_buy_drop_pct > 3.0%` (minority decimation) ; P3 alert on `label_pipeline_mem_peak_mb > 2× baseline` (K8s pressure, reco 11+13) ; runbooks `documentation/runbooks/cleanlab_over_drop.md`, `cleanlab_minority_decimation.md`, `cleanlab_oom_pressure.md`
3	Drift detection	PSI on top-K features unchanged (Track 5 doesn't touch features) ; concept drift on `f1_buy` per crypto ; post-Track 5 funnel re-evaluation (committee reco 3) — measure 9-filter chain pass-rate distribution shift
4	Staged rollout	FTF-only this Story (no live) ; if ever promoted to live trading, shadow ≥ 7d on `BTCUSDC` (deepest order book)
5	Rollback plan	Console flip `factor_label_smoothing=none, factor_cleanlab=off` (< 1 min) ; no code deploy needed ; cleanlab pinned in `pyproject.toml` (reco 14) so rollback doesn't trigger transitive dep upgrade
6	DRI	Operator (`dococeven`) ; backup TBD ; sunset 2026-07-28 (90d)
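The two P2 thresholds in the alerting row can be expressed as a small pure function, which makes them easy to unit-test before they are wired into Grafana. This is a sketch under assumptions : the function names are hypothetical, the alert labels reuse the runbook file names, and the per-class percentages are assumed to use the class size as denominator (the dossier does not state the denominator).

```python
import numpy as np

# Thresholds quoted in the alerting row of the preview table.
OVERALL_DROP_ALERT_PCT = 4.5  # P2, preempts the 5 % cap
BUY_DROP_ALERT_PCT = 3.0      # P2, minority decimation


def drop_percentages(labels, dropped) -> dict:
    """Overall and per-class drop percentages for one fold.

    Assumption: BUY is encoded as 1, and per-class percentages are
    relative to the number of samples in that class.
    """
    labels = np.asarray(labels).astype(int)
    dropped = np.asarray(dropped).astype(bool)
    if labels.size == 0 or labels.shape != dropped.shape:
        raise ValueError("labels and dropped must be non-empty and aligned")
    buy = labels == 1

    def pct(mask):
        return 100.0 * dropped[mask].mean() if mask.any() else 0.0

    return {
        "cleanlab_drop_pct": 100.0 * dropped.mean(),
        "cleanlab_buy_drop_pct": pct(buy),
        "cleanlab_hold_drop_pct": pct(~buy),
    }


def fired_alerts(metrics: dict) -> list:
    """Return the alerts whose threshold is exceeded."""
    alerts = []
    if metrics["cleanlab_drop_pct"] > OVERALL_DROP_ALERT_PCT:
        alerts.append("P2:cleanlab_over_drop")
    if metrics["cleanlab_buy_drop_pct"] > BUY_DROP_ALERT_PCT:
        alerts.append("P2:cleanlab_minority_decimation")
    return alerts
```

With a 20 % BUY rate, dropping 10 of 200 BUY samples and 10 of 800 HOLD samples keeps the overall drop at 2 % but puts the BUY drop at 5 %, so only the minority-decimation alert fires. This is the case the per-class metric exists to catch.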

9. Open questions for committee

Q1. Smoothing values (§4.3 variant table) : mild=(eps_buy=0.15, eps_hold=0.075) and aggressive=(eps_buy=0.30, eps_hold=0.15). Are these appropriate for a 20 % BUY rate? (Resolved by committee reco 17 ; values shown here are post-amendment.)
Q2. Asymmetry direction : we apply ε_buy > ε_hold (more smoothing on the minority class). This is the standard "label smoothing for imbalanced classification" recipe. Is there literature pushing the opposite direction that we should consider?
Q3. Cleanlab reweight floor at 0.5 : reasonable, or push to 0.25 (halve weight twice)?
Q4. CV folds for cleanlab pre-step : 3 as drafted, which increases wall time but improves pred_probs quality. Is going to 5 folds worth the cost? (Resolved by committee reco 18 : 3 → 2 folds, fallback to 1 if > 60 s/fold.)
Q5. Joint config (best LS × best CL) : run unconditionally, or only if both single-factor variants pass the per-track gate? Defaulting to gated to save compute. (Kept gated by the committee ; operator may override.)
Q6. Sample weights ordering : the cleanlab reweight applies AFTER the existing class-balancing weights, with multiplicative composition. Is there a known interaction with the XGBoost gradient that makes additive composition safer?
Q7. Smoothing on validation/test : we apply smoothing to training labels only ; val/test labels stay hard for honest evaluation. Confirm this is the standard practice (it is for image classification but may differ for HFT-style binary tasks).
Q8. Hamilton vs imperative for this 4-step preprocessing : Hamilton is mandated by ADR-61 for batch flows. The pipeline is small (4 nodes). Any concern about over-engineering for 4 nodes versus the gain in lineage emission?
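To make the questions on smoothing values, reweight floor, and weight ordering concrete, here is a minimal NumPy sketch of the two transformations under discussion. It is not the Hamilton implementation : the function names are hypothetical, BUY is assumed to be encoded as 1, and the floor is assumed to clip the cleanlab factor (not the final weight) before it multiplies the class-balancing weight.

```python
import numpy as np

# (eps_buy, eps_hold) pairs quoted in the open questions above.
SMOOTHING_VARIANTS = {
    "none": (0.0, 0.0),
    "mild": (0.15, 0.075),
    "aggressive": (0.30, 0.15),
}


def smooth_labels(y_hard, variant: str) -> np.ndarray:
    """Map hard {0, 1} labels to soft targets.

    BUY (1) becomes 1 - eps_buy, HOLD (0) becomes eps_hold.
    """
    eps_buy, eps_hold = SMOOTHING_VARIANTS[variant]
    y_hard = np.asarray(y_hard)
    if not np.isin(y_hard, (0, 1)).all():
        raise ValueError("labels must be hard 0/1 before smoothing")
    return np.where(y_hard == 1, 1.0 - eps_buy, eps_hold)


def compose_weights(class_weights, cleanlab_factor, floor: float = 0.5) -> np.ndarray:
    """Multiply class-balancing weights by a cleanlab factor.

    Assumption: the factor is clipped to [floor, 1] so a suspect sample
    is down-weighted but never removed by reweighting alone.
    """
    factor = np.clip(np.asarray(cleanlab_factor, dtype=float), floor, 1.0)
    return np.asarray(class_weights, dtype=float) * factor


# Training labels are smoothed; validation labels are left hard.
y_train = np.array([1, 0, 0, 1, 0])
y_val = np.array([0, 1, 0])
y_train_soft = smooth_labels(y_train, "mild")
w_train = compose_weights([4.0, 1.0, 1.0, 4.0, 1.0], [1.0, 0.2, 1.0, 0.8, 1.0])
```

With the default floor, a cleanlab factor of 0.2 is clipped to 0.5, so the second sample keeps half of its class-balancing weight. Lowering the floor to 0.25 would let the same sample fall to a quarter of its weight.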

10. Acceptance criteria (Story level)

Plan dossier merged with committee plan_review verdict ≥ ACCEPTED (≥ 8.0 avg, no unresolved blockers)
All committee actionable recommendations applied to dossier or PR
Phase 2 implementation complete (10 of 12 sub-steps green ; CLAUDE.md update only if patterns change)
FTF preflight green, FTF sweep complete (125 + optional 25 rows in finetune_results)
Results dossier merged at documentation/reviews/2026-04-29-track5-label-smoothing-results.md
Story OP wp#40 closed with PR + commit reference + MLflow run IDs
If gate cleared : winning variant locked in ftf_baseline.json via Console flip (operator action)
If gate not cleared : H1/H2 falsified entry written into Story closure note + sprint S1 retrospective triggered

11. References

11.1 Existing artefacts (built on)

documentation/F1_BUY_BOOST_PLAN.md — canonical plan, §5 Track 5 spec
documentation/epics/CVN-N001-EE-f1-buy-boost.md — Epic doc with Story tracking table
documentation/templates/TEMPLATE_mlops_readiness.md — to be filled per ADR-70
documentation/adr/0070-mlops-readiness-template-mandatory.md — gate
documentation/adr/0071-trading-kill-switch-invariants.md — sister DESIGN

11.2 Code surfaces

src/commun/finetune/ablation_matrix.py:342 (factor_calibration model pattern)
src/commun/finetune/ablation_runner.py:110 (_extract_extended_metrics)
src/commun/finetune/persistence.py:34 (finetune_results insert)
src/commun/finetune/preflight/base.py (FTF preflight ADR-64)
src/commun/finetune/preflight/hamilton_exec.py (Hamilton driver pattern)
src/commun/observability/otel.py:88 (OTel singleton)
src/training/XGBoost/cvntrade_XGBoost_trainer.py:144 (label ingestion point)
scripts/ftf_config_ui.py:60 (PARAM_OPTIONS registration)
tests/unit/test_ftf_guardrails.py (guardrail test pattern)

11.3 Existing ADRs the plan builds on

ADR-25 — No silent fallback (cleanlab errors must be loud)
ADR-29 — Naïve baseline mandatory (already in baseline)
ADR-30, ADR-32, ADR-33 — Structured logs (golden-field attributes on spans)
ADR-56 — Every change FTF-testable (factor matrix is the contract)
ADR-58 — Every factor has a guardrail + integration test (mandatory)
ADR-59 — All params in PostgreSQL ftf_config (Console-only edit)
ADR-61 — Hamilton for batch flows (label preprocessing)
ADR-62 — OpenTelemetry spans (instrumentation contract)
ADR-63 — Binary BUY/NOT_BUY mission mode
ADR-64 — FTF preflight first-class
ADR-65 — Airflow DAG params run-level only
ADR-67 — Pluggable feature selection (architectural sibling)
ADR-68 — Committee = default review channel (this dossier)
ADR-69 — OpenProject orchestrator (Story discipline)
ADR-70 — MLOps readiness template (must fill, Phase 2 step 2.10)
ADR-71 — Kill-switch invariants (live-deploy gate ; not relevant for FTF-only Track 5 but documented)

11.4 Committee sessions (precedent)

9d4942cb — F1 boost plan v3 PASSED OK avg 8.96 (parent approval)
3e0a3008 — sprint S0 / S01 ADR-70 PASSED OK avg 8.6 (mode B precedent for docs)
4c388b4c — sprint S0 / S02 ADR-71 PASSED EXECUTION_RISK avg ~8.3 (mode B with 1 BLOCKER)
8a202a18 — this Story's plan_review : PASSED / OK avg 8.7 (architect 9.0, ops 9.0, ml-eng 8.5, crypto-trader 8.5, data-sci 8.5), 0 blockers, 4 dissents (Q1 / Q4 / Q5 / temporal leakage), 19 recommendations of which 18 actionable applied to this dossier in revisions below ; reco 16 ("consult ML expert on sample-weight × XGBoost gradient interaction") considered satisfied by this very committee session

11.4.1 Recommendations applied (committee `8a202a18`)

#	Reco	Where applied
1	Detail cost formula v3	§2.1 with formula + link to `economic_thresholds.py`
2	Clarify fold construction (walk-forward + purge + embargo, ADR-14)	§2.1
3	Note post-Track 5 9-filter chain re-evaluation	§3.2 + §8 §3 (drift)
4	OTel metric `cleanlab_buy_drop_pct`	§4.5 metrics + alerts
5	OTel metrics for `final_weights` distribution (min/max/mean/std)	§4.5 metrics
6	Cleanlab CV folds purged + embargoed (ADR-14 temporal leakage)	§4.2 + §4.4 + Hamilton driver pseudocode
7	Per-class HOLD ECE gate (`ECE_HOLD ≤ baseline + 0.01`)	§1.3 success criterion
8	Hamilton node ↔ OTel span naming consistency + asserted via unit test	§4.5
9	Risk : cleanlab model instability	§7 R9 + monitoring (recos 4, 19)
10	Risk : XGBoost training time inflation	§7 R10
11	K8s resource scaling check	§6 Phase 2 step 2.14 + §8 §2 alerting
12	Property-based tests details for `label_pipeline.py`	§6 Phase 2 step 2.6
13	Memory + CPU monitoring during cleanlab CV	§4.5 attributes + metric `mem_peak_mb`
14	Pin cleanlab version in dependencies	§6 Phase 2 step 2.13 + §8 §5 rollback
15	PDF report integrity check	§6 Phase 3 step 3.3b
17	Adjusted smoothing values : `mild=(eps_buy=0.15, eps_hold=0.075)`, `aggressive=(eps_buy=0.30, eps_hold=0.15)` (canonical (eps_buy, eps_hold) tuple ordering)	§4.3 variant table
18	Cleanlab CV folds : 3 → 2 (fallback 1 if > 60 s/fold)	§4.2 + §4.4 + Hamilton pseudocode + §7 R1
19	Grafana alert `cleanlab_drop_pct > 4.5%`	§4.5 alerts + §8 §2

11.4.2 Dissents recorded (not amendments)

Q1 smoothing values : 4 experts split, settled by reco 17 amendment
Q4 cleanlab CV folds : settled by reco 18 amendment (3 → 2 with fallback)
Q5 joint config strategy : kept gated (run only if both single-factor pass) — operator may override
Temporal leakage on cleanlab CV : critical, settled by reco 6 amendment
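The temporal-leakage dissent (recos 6 and 18) turns on how the cleanlab CV folds are built. The sketch below shows one way to build purged and embargoed folds over time-ordered samples, in the style of purged k-fold cross-validation. It is an illustration, not the dossier's Hamilton driver : the function name and the `purge` / `embargo` sample counts are hypothetical, and the 1-fold fallback is not covered because this section does not define it.

```python
import numpy as np


def purged_embargoed_folds(n_samples: int, n_folds: int = 2,
                           purge: int = 0, embargo: int = 0):
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Each test block is contiguous. The training set excludes the test
    block, the `purge` samples just before it (whose label windows may
    overlap the block) and the `embargo` samples just after it.
    """
    if n_folds < 2 or n_samples < n_folds:
        raise ValueError("this sketch needs n_folds >= 2 and n_samples >= n_folds")
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_folds):
        start, stop = test_idx[0], test_idx[-1] + 1
        keep = (indices < start - purge) | (indices >= stop + embargo)
        yield indices[keep], test_idx


# Two folds over 100 samples, purging 5 samples and embargoing 3.
folds = list(purged_embargoed_folds(100, n_folds=2, purge=5, embargo=3))
```

Every sample appears in exactly one test block, so out-of-fold `pred_probs` cover the whole training set, while no training index sits inside the purge or embargo window of its test block.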

11.5 Issues + sprint

Need CVN-N001 (#608)
Epic CVN-N001-EE (#707) — F1_buy boost
Story CVN-N001-EE-S01 (#712) — this Story
Operational prereqs : ✅ #708 (kill-switch DESIGN), ✅ #709 (MLOps template DESIGN)
Sprint F1B-S1-QW-PhaseA (OP version id 6, dates 2026-05-04 → 2026-05-17 — note : we're starting early ; sprint dates are aspirational)
Sibling Story (next in sprint) : CVN-N001-EE-S02 (#713) — focal loss, gated by Track 5 result

11.6 External

Müller, R., Kornblith, S., Hinton, G. When Does Label Smoothing Help? NeurIPS 2019 : effect of uniform label smoothing on calibration and distillation ; background for the asymmetric variant used here
Northcutt, C., Jiang, L., Chuang, I. Confident Learning: Estimating Uncertainty in Dataset Labels. JAIR 2021 : cleanlab paper
López de Prado, M. Advances in Financial Machine Learning (Wiley, 2018) §3.5 : meta-labeling and label noise in HFT