Plan Dossier, Track 5 : Asymmetric label smoothing + cleanlab (CVN-N001-EE-S01)
Date: 2026-04-28
Issue: #712
Story: CVN-N001-EE-S01 (OP wp#40, sprint F1B-S1-QW-PhaseA)
Epic: CVN-N001-EE (#707) — F1_buy boost (13-track plan)
Need: CVN-N001 (F1 mission, #608)
Author: Dominique (operator) + Claude
Status: awaiting plan_review (ADR-68 mode A — committee BEFORE implementation)
Reviewers requested: Expert Committee (5 personas + consolidator)
Operational prereqs sign-off: ✅ ADR-70 (#709 wp#55, PR #730), ✅ ADR-71 (#708 wp#56, PR #731)
0. Executive summary
One paragraph — the rest of the dossier expands on this.
CVNTrade's binary classifier plateaus at f1_buy ∈ [0.40, 0.46] across 400 Phase 2 runs (median 0.418, peak 0.541). The hypothesis under test : the model overfits noise on the minority BUY class (~20 % rate) because hard binary cross-entropy on {0, 1} labels pushes predictions toward full confidence, including on mislabelled samples. Track 5 introduces two independent levers : (a) asymmetric label smoothing (y_buy = 1 − ε_buy, y_hold = ε_hold, with ε_buy > ε_hold so that BUY targets are pulled further from 1 than HOLD targets are pushed from 0) ; (b) cleanlab-based label-noise filtering (drop or reweight samples that confident learning flags as likely mislabelled). Both ship as independent FTF factors (ADR-58 guardrails + integration tests), with parameters editable from Console (ADR-59) and label preprocessing implemented as a Hamilton dataflow (ADR-61) emitting OpenTelemetry spans (ADR-62). Success = per-track gate (F1 plan §6) cleared on the FTF rerun against baseline ftf_20260427_170614_06626e_*. If neither lever clears the gate, the result still falsifies the smoothing hypothesis cleanly and unlocks Track 6 (focal loss).
1. Objective
1.1 Problem
f1_buy plateau on the live FTF surface :
| Quantile | f1_buy | Source |
|---|---|---|
| Median (n=400 variants) | 0.418 | Phase 2 audit, 2026-04-26 |
| Peak | 0.541 | Phase 2 audit |
| Baseline naïve "always BUY" | ~0.20 | ADR-29 |
| Economic break-even (cost-net) | ~0.500 | F1 plan §4 |
The median is below break-even — current model is structurally unprofitable. Tuning inside the existing OHLCV + XGBoost + binary-triple-barrier surface has reached an asymptote (12 levers tried in v1, all hit the same wall). Track 5 exits the asymptote by changing the label structure without changing inputs, model class, or barriers.
1.2 Hypothesis (falsifiable)
H : Hard binary labels {0, 1} cause XGBoost to overfit on the minority BUY class. Replacing them with asymmetric soft labels (with ε_buy > ε_hold) and removing systematically mislabelled samples (cleanlab confident-learning) will reduce calibration error and improve f1_buy by ≥ +0.015 on the joint metric gate.
The hypothesis survives if at least one of the following three holds on the FTF rerun :
- ECE drops by ≥ 0.02 (calibration improvement)
- Δf1_buy ≥ +0.01 (Story-specific success, AC #712)
- Per-track gate (F1 plan §6) cleared on at least one variant of either factor
If all three fail across all variants, the hypothesis is rejected. The Story still merges (the FTF result is a permanent record per ADR-29), the factor stays available with none as the default variant, and we proceed to Track 6 (focal loss) with a tighter prior.
1.3 Success criterion (from #712 + F1 plan §6)
The Story is considered successful if all of the following hold on a single winning variant configuration :
| Criterion | Threshold | Source |
|---|---|---|
| ECE drop OR f1_buy gain | Δ ECE ≥ 0.02 OR Δ f1_buy ≥ +0.01 | #712 Story-specific |
| Per-class HOLD ECE | ECE_HOLD ≤ baseline + 0.01 (asymmetric smoothing must NOT degrade majority calibration) | committee reco 7 |
| Joint primary — f1_buy | ≥ +0.015 with 95 % bootstrap CI excluding 0 | F1 plan §6 |
| Joint primary — expectancy_net | ≥ baseline (cost formula v3) | F1 plan §4 + §6 |
| Joint primary — sortino | ≥ baseline | F1 plan §6 |
| Joint primary — max_drawdown | ≤ baseline + 1 % | F1 plan §6 |
| Distribution — cryptos improving | ≥ 4 of 5 | F1 plan §6 |
| Density — BUY trades per fold | ≥ 50 | F1 plan §6 |
| Statistical — Cohen's d | ≥ 0.3 vs baseline on f1_buy | FTF tuning protocol |
| Statistical — BH-corrected p | < 0.05 vs baseline on f1_buy | FTF tuning protocol |
A variant that beats f1_buy alone but degrades expectancy_net is automatically rejected by the joint gate (F1 plan §6 §risk #2).
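The ECE thresholds above depend on how calibration error is binned. As a minimal sketch, assuming equal-width bins and assuming that "per-class HOLD ECE" means ECE restricted to samples whose predicted class is HOLD (the dossier fixes neither definition, and both function names are hypothetical):

```python
import numpy as np


def ece(outcome, confidence, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1].

    outcome    : 1.0 where the predicted event happened, else 0.0
    confidence : predicted probability of that event
    """
    outcome = np.asarray(outcome, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    if outcome.size == 0:
        return 0.0
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_idx = np.digitize(confidence, inner_edges)  # values in 0 .. n_bins - 1
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(outcome[in_bin].mean() - confidence[in_bin].mean())
            total += in_bin.mean() * gap  # weight by the share of samples in the bin
    return float(total)


def ece_hold(y_true, p_buy, threshold=0.5, n_bins=10):
    """ASSUMPTION: HOLD ECE = ECE over samples predicted HOLD (p_buy < threshold)."""
    y_true = np.asarray(y_true)
    p_buy = np.asarray(p_buy, dtype=float)
    predicted_hold = p_buy < threshold
    return ece(y_true[predicted_hold] == 0, 1.0 - p_buy[predicted_hold], n_bins)
```

Under these assumptions the Story-specific check reads `ece(y, p_baseline) - ece(y, p_variant) >= 0.02`, and the HOLD guard reads `ece_hold(y, p_variant) <= ece_hold(y, p_baseline) + 0.01`.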
2. How we prove the objective is met
2.1 Baseline anchor
Baseline run: ftf_20260427_170614_06626e_* (Phase 2 rerun, 550 results, post-#704 FS↔FE drift fix). All metrics are read from PostgreSQL finetune_results (src/commun/finetune/persistence.py:34) with WHERE run_id LIKE 'ftf_20260427_170614_06626e_%'.
For each (crypto × fold) cell, the baseline value is the none variant of every existing factor. Track 5 introduces two new factors (label_smoothing + cleanlab) ; the baseline = both at none / off (i.e., the existing trainer behavior, unchanged).
Fold construction (per ADR-14 + committee reco 2) : 5 outer folds via walk-forward time-series split with purge + embargo (no random KFold). Each fold's train / val / test windows are contiguous in time, and a 1-day embargo separates consecutive splits to prevent leakage from triple-barrier label horizons (H4). Same construction as the baseline run for direct comparability.
Cost formula v3 (per F1 plan §4 + committee reco 1) : expectancy_net = expectancy_gross − (taker_fee + spread + slippage_10bps + funding_cost), where taker_fee=0.10%, spread=0.05%, slippage=10 bps interim (until #711 dynamic slippage ships), funding=0.01% per 8h. Implementation in src/commun/audit/economic_thresholds.py. The same formula is applied to the baseline rows ; deltas are like-for-like.
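The cost formula can be checked by hand with a small sketch. This is not the code in src/commun/audit/economic_thresholds.py ; the class and function names are hypothetical, all values are in percent of notional, and it assumes exactly one 8 h funding interval per trade (the dossier does not say whether fees apply per side or per round trip).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostV3:
    """Per-trade cost components, all in percent of notional."""
    taker_fee: float = 0.10
    spread: float = 0.05
    slippage: float = 0.10  # 10 bps interim value
    funding: float = 0.01   # ASSUMPTION: exactly one 8 h funding interval per trade

    def total(self) -> float:
        return self.taker_fee + self.spread + self.slippage + self.funding


def expectancy_net(expectancy_gross: float, cost: CostV3 = CostV3()) -> float:
    """Gross expectancy minus the summed v3 cost terms (same unit: percent)."""
    return expectancy_gross - cost.total()
```

With the default values the total cost is 0.26 %, so a gross expectancy of 0.40 % per trade nets 0.14 %.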
2.2 Experimental matrix
Per ADR-56 (every factor is independent and A/B-testable), the FTF runs each new factor against baseline with one factor active at a time :
| Factor | Variants | Baseline cross-factor |
|---|---|---|
| label_smoothing | none, mild, aggressive | cleanlab=off |
| cleanlab | off, filter, reweight | label_smoothing=none |
Total : 5 unique configs (3 + 3 − 1 shared none/off baseline) × 5 cryptos × 5 folds = 125 trained models. Plus a joint confirmation run (best label_smoothing × best cleanlab) to test for super-additivity, run only if both factors individually pass the per-track gate. Joint adds ≤ 1 additional config × 25 = +25 models. Worst case 150 models.
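The model count can be reproduced by enumerating the one-factor-at-a-time design. A minimal sketch ; only BTCUSDC and SHIBUSDC are named in this dossier, so the other three symbols are placeholders :

```python
from itertools import product

LABEL_SMOOTHING = ("none", "mild", "aggressive")
CLEANLAB = ("off", "filter", "reweight")
# Placeholders except BTCUSDC and SHIBUSDC, which appear elsewhere in the dossier.
CRYPTOS = ("BTCUSDC", "SHIBUSDC", "CRYPTO_3", "CRYPTO_4", "CRYPTO_5")
FOLDS = range(5)


def one_factor_at_a_time():
    """Vary one factor while the other stays at its baseline variant."""
    configs = {(ls, "off") for ls in LABEL_SMOOTHING}
    configs |= {("none", cl) for cl in CLEANLAB}  # ("none", "off") is shared
    return sorted(configs)


configs = one_factor_at_a_time()
jobs = list(product(configs, CRYPTOS, FOLDS))
joint_jobs = len(CRYPTOS) * len(FOLDS)  # one extra config if both factors pass
print(len(configs), len(jobs), len(jobs) + joint_jobs)
```

The set union is what removes the duplicated none/off baseline, which is why the count is 5 rather than 6.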
2.3 Statistical evidence required for a "PASS" verdict
The factor is kept (won the gate) only if :
- Per-track gate (F1 plan §6) clears on the joint metric vector, evaluated on the winning variant.
- Bootstrap CI 95 % on the (factor × variant − baseline) f1_buy delta excludes 0. Computed per (crypto × fold) cell, then aggregated.
- Benjamini-Hochberg correction : the variant's p-value (paired Wilcoxon vs baseline) is significant after BH-correction across all variants tested in this Story (5 variants → BH cutoff = α × rank/m).
- Cohen's d ≥ 0.3 on the f1_buy delta distribution (small effect size minimum).
- At least 4 of 5 cryptos show positive Δ f1_buy individually (no portfolio-only wins).
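The checks above can be sketched in NumPy. This is an illustration, not the existing FTF report generator : the function names are hypothetical, the bootstrap resamples the (crypto × fold) deltas directly, Cohen's d is taken as the paired form (mean delta over the standard deviation of deltas), and the p-values passed to the BH step are assumed to come from a paired Wilcoxon test computed elsewhere.

```python
import numpy as np


def bootstrap_ci(deltas, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI on the mean of per-cell f1_buy deltas."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, deltas.size, size=(n_boot, deltas.size))
    means = deltas[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(lo), float(hi)


def cohens_d_paired(deltas):
    """ASSUMPTION: paired effect size = mean(delta) / std(delta, ddof=1)."""
    deltas = np.asarray(deltas, dtype=float)
    return float(deltas.mean() / deltas.std(ddof=1))


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of p-values that stay significant after BH correction."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * rank / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank that clears its cutoff
        reject[order[: k + 1]] = True
    return reject
```

A variant would pass these three checks when the interval from `bootstrap_ci` excludes 0, `cohens_d_paired` is at least 0.3, and its entry in the `benjamini_hochberg` mask is True.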
2.4 Reporting artefact
The PDF report from documentation/reviews/2026-04-29-track5-label-smoothing-results.md (post-rerun, separate dossier) MUST contain :
- The 5 statistical evidence boxes from §2.3
- A per-crypto × per-fold heatmap of f1_buy deltas
- Calibration curves (reliability diagrams) for baseline vs winning variant
- ECE tables (winning variant per crypto)
- Bootstrap CI plots
- Lockfile : the ftf_config snapshot (PostgreSQL) at run time, so the result is reproducible
- MLflow run IDs for top-3 models per crypto
- Operator decision : lock, keep available, or abandon
This artefact is the deliverable that closes the Story — not the code merge.
3. What we'll do
3.1 Functional decomposition
Five thin slices, ordered by dependency :
- Label preprocessing module (src/training/labels/) : pure functions for asymmetric smoothing + cleanlab application, no XGBoost dependency.
- FTF factor declarations (src/commun/finetune/ablation_matrix.py) : two new factors (label_smoothing, cleanlab), env-var-driven per the existing pattern (factor_calibration at line 342 is the model).
- Console params registration (scripts/ftf_config_ui.py) : register the new env keys in PARAM_OPTIONS so operators edit them from Console only.
- Trainer integration (src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145) : call the label preprocessing pipeline before xgb.DMatrix creation, gated by env vars.
- Hamilton orchestration + OTel : wrap the label preprocessing as a Hamilton dataflow (ADR-61) emitting OTel spans (ADR-62) per step.
Plus the cross-cutting :
- Guardrail tests + integration test (ADR-58) for both new factors
- MLOps readiness template filled (documentation/stories/CVN-N001-EE-S01/mlops_readiness.md) per ADR-70
- Migration : no SQL migration required (all new keys live in ftf_config.base_env JSONB ; factor catalogue is Python data per ADR-56)
3.2 What we explicitly do NOT do
- No model class change : XGBoost stays
- No barrier change : ATR0.5_1.5_H4 stays per project policy
- No feature change : same OHLCV + enrichment + FE pipeline
- No CUSUM / filter chain change : out of Track 5 scope
- No live deployment : paper-FTF only ; live deployment gated by ADR-71 + EG-S06 flatten_all
- No 9-filter chain re-evaluation in this Story : the FTF metrics are computed inside the existing filter chain, but a separate downstream evaluation (per committee reco 3) will assess whether the winning Track 5 variant changes the funnel pass-rate distribution. That evaluation is added to the post-merge instrumentation phase (§6 Phase 5).
4. Detailed architecture
4.1 System integration overview
flowchart TB
Console["Console UI
(Streamlit, port 8501)
scripts/ftf_config_ui.py"]
PG[("PostgreSQL
ftf_config
(base_env JSONB)")]
AblMatrix["ablation_matrix.py
+ 2 new AblationFactors"]
Runner["ablation_runner.py
variant × crypto × fold loop"]
PreflightFTF["FTF preflight
(ADR-64)
guardrail validation"]
LabelHam["Hamilton dataflow
(ADR-61)
src/training/labels/"]
Trainer["XGBoost trainer
cvntrade_XGBoost_trainer.py:144"]
OTel["OTel collector
(ADR-62)"]
Loki["Loki
structured logs"]
Persist["finetune_results table
(persistence.py:34)"]
Report["PDF report + dossier
2026-04-29-track5-results.md"]
Console -- "operator edits
label_smoothing.epsilon_buy
cleanlab.mode" --> PG
PG -- "loaded at FTF run start" --> AblMatrix
AblMatrix -- "factor declarations" --> Runner
Runner --> PreflightFTF
PreflightFTF -- "validate (factor, variant) compat" --> Runner
Runner -- "iterate variants" --> LabelHam
LabelHam -- "smoothed + cleanlab-filtered y_train" --> Trainer
Trainer -- "f1_buy, expectancy_net, ..." --> Persist
LabelHam -- "spans" --> OTel
Trainer -- "spans" --> OTel
LabelHam -- "log_event" --> Loki
Persist --> Report
4.2 Label transformation pipeline (per-fold)
flowchart LR
raw["y_train (raw)
shape (N,)
dtype int8 ∈ {0,1}"]
smooth["asymmetric_smooth
ε_hold, ε_buy"]
soft["y_smoothed
shape (N,)
dtype float32"]
clean["cleanlab.find_label_issues
+ pred_probs from CV fold"]
mask["suspect_mask
shape (N,) bool"]
apply["apply_mode
filter | reweight | off"]
final["y_final + sample_weights
passed to xgb.DMatrix"]
raw --> smooth
smooth --> soft
soft -.optional.-> clean
clean --> mask
soft --> apply
mask --> apply
apply --> final
classDef new fill:#fef3c7,stroke:#d97706
class smooth,clean,mask,apply,soft,final new
Notes :
- asymmetric_smooth is unconditional (variant none = identity, mild = (eps_buy=0.15, eps_hold=0.075), aggressive = (eps_buy=0.30, eps_hold=0.15) — values per committee reco 17, see §4.3 variant table for the canonical declaration)
- cleanlab step is gated by factor_cleanlab variant ; when off it's a no-op pass-through
- pred_probs for cleanlab come from the existing PurgedKFold splitter (training.cv.purged_kfold, López de Prado AFML Ch 7), reused so the inner cleanlab CV honours the same purge / embargo contract as the outer FTF folds. PurgedKFold reads CVN_PURGE_BARS and CVN_EMBARGO_BARS from the env, so this Story doesn't introduce a parallel configuration knob (CR PR #734 round 2). Default n_splits=2 per committee reco 18. Soft per-fold wall-time budget : if any fold exceeds CVN_CLEANLAB_FOLD_BUDGET_S (default 60 s), the loop short-circuits and unprocessed slices keep the initial 0.5 prior — find_label_issues then treats those samples as ambiguous and skips them rather than mis-flagging them. No leaky fallback (committee pr_review session 989a6567 blocker 1) : the prior "single-fold tail train-on-self" path was removed because it produced biased pred_probs that silently corrupted the cleanlab signal. Honest partial coverage with explicit telemetry (coverage_pct event field + P2 alert at < 80 %) beats a leaky full coverage.
- Sample weights from apply_mode=reweight are multiplied with existing class-balancing weights (line 165 of trainer) — order matters, documented in code
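The smoothing step in the notes above reduces to a one-line mapping. A minimal sketch of it, using the formula from the executive summary (y_buy = 1 − ε_buy, y_hold = ε_hold) ; the function body is an illustration and not the code in src/training/labels/ :

```python
import numpy as np


def asymmetric_smooth(y, eps_buy, eps_hold):
    """Map hard labels {0, 1} to soft targets {eps_hold, 1 - eps_buy}."""
    if not 0.0 <= eps_hold <= eps_buy <= 0.5:
        raise ValueError("expected 0 <= eps_hold <= eps_buy <= 0.5")
    y = np.asarray(y)
    return np.where(y == 1, 1.0 - eps_buy, eps_hold).astype(np.float32)


y_raw = np.array([0, 1, 0, 1], dtype=np.int8)
y_none = asymmetric_smooth(y_raw, 0.0, 0.0)    # identity: [0, 1, 0, 1]
y_mild = asymmetric_smooth(y_raw, 0.15, 0.075)  # [0.075, 0.85, 0.075, 0.85]
```

With the `none` variant both epsilons are 0.0, so the mapping is the identity apart from the cast to float32, which matches the "unconditional" wording in the first note.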
4.3 FTF factor matrix (declarations)
classDiagram
class AblationFactor {
+str name
+str factor_type
+str category
+Dict env_vars
+str description
}
class factor_label_smoothing {
+name = "label_smoothing"
+category = "model"
+variants none, mild, aggressive
+env_keys CVN_LABEL_SMOOTH_EPS_BUY,
CVN_LABEL_SMOOTH_EPS_HOLD
}
class factor_cleanlab {
+name = "cleanlab"
+category = "model"
+variants off, filter, reweight
+env_keys CVN_CLEANLAB_MODE,
CVN_CLEANLAB_MAX_DROP_PCT,
CVN_CLEANLAB_REWEIGHT_FLOOR
}
AblationFactor <|-- factor_label_smoothing
AblationFactor <|-- factor_cleanlab
Variant table :
| Factor | Variant | env_vars |
|---|---|---|
| label_smoothing | none | CVN_LABEL_SMOOTH_EPS_BUY=0.0, CVN_LABEL_SMOOTH_EPS_HOLD=0.0 |
| label_smoothing | mild | CVN_LABEL_SMOOTH_EPS_BUY=0.15, CVN_LABEL_SMOOTH_EPS_HOLD=0.075 (per committee reco 17, Müller 2019 calibrated for 20 % BUY rate) |
| label_smoothing | aggressive | CVN_LABEL_SMOOTH_EPS_BUY=0.30, CVN_LABEL_SMOOTH_EPS_HOLD=0.15 (per committee reco 17) |
| cleanlab | off | CVN_CLEANLAB_MODE=off |
| cleanlab | filter | CVN_CLEANLAB_MODE=filter, CVN_CLEANLAB_MAX_DROP_PCT=5 |
| cleanlab | reweight | CVN_CLEANLAB_MODE=reweight, CVN_CLEANLAB_MAX_DROP_PCT=5, CVN_CLEANLAB_REWEIGHT_FLOOR=0.5 |
Guardrails (per ADR-58, in src/commun/finetune/validation.py) :
- label_smoothing.epsilon_buy ∈ [0.0, 0.5] : silly-bounds check
- label_smoothing.epsilon_hold ∈ [0.0, epsilon_buy] : asymmetry direction preserved (no inverted smoothing ; equality is allowed so that the none variant, with both at 0.0, passes)
- cleanlab.max_drop_pct ≤ 5.0 : F1 plan §6 risk #4 (cap label drop rate at 5 %)
- cleanlab.mode in {off, filter, reweight} : closed enum
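The four guardrails can be expressed as one pure check. A minimal sketch, assuming a function that returns the list of violations ; the name `validate_track5` and the return shape are hypothetical and do not describe the existing src/commun/finetune/validation.py API :

```python
CLEANLAB_MODES = frozenset({"off", "filter", "reweight"})


def validate_track5(eps_buy, eps_hold, cleanlab_mode, max_drop_pct):
    """Return a list of guardrail violations; an empty list means the config is valid."""
    errors = []
    if not 0.0 <= eps_buy <= 0.5:
        errors.append(f"epsilon_buy={eps_buy} outside [0.0, 0.5]")
    if not 0.0 <= eps_hold <= eps_buy:
        errors.append(f"epsilon_hold={eps_hold} outside [0.0, epsilon_buy={eps_buy}]")
    if not 0.0 <= max_drop_pct <= 5.0:
        errors.append(f"max_drop_pct={max_drop_pct} outside [0.0, 5.0]")
    if cleanlab_mode not in CLEANLAB_MODES:
        errors.append(f"cleanlab mode {cleanlab_mode!r} not in {sorted(CLEANLAB_MODES)}")
    return errors
```

Returning all violations at once, rather than raising on the first, lets a preflight step report every problem in a variant in a single pass ; that is a design choice of this sketch, not a requirement stated by ADR-58.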
Integration test (ADR-58) :
- tests/integration/test_track5_label_smoothing.py : exercises apply_label_pipeline end-to-end (in-process, no PostgreSQL, no Airflow runner) for each of the 5 unique configs from §2.2 + the joint config. Asserts (a) no crash, (b) (X_final, y_final, sample_weights_final) shapes are mutually consistent, (c) y_final dtype is float32 or float64, (d) cleanlab filter mode drops ≤ max_drop_pct rows (strict cap, no tolerance), (e) the output is consumable by a real xgb.DMatrix → xgb.train pair (training succeeds and the produced booster predicts in [0, 1]).
- The PostgreSQL finetune_results write path is not exercised here ; it's covered by the existing FTF persistence tests (tests/unit/finetune/test_persist_predictions.py) and verified manually in Phase 3 of §6 when the operator runs the FTF sweep on the cluster.
4.4 Hamilton dataflow
flowchart TD
raw_y["raw_y_train: np.ndarray"]
eps_buy["epsilon_buy: float"]
eps_hold["epsilon_hold: float"]
cl_mode["cleanlab_mode: str"]
cl_max_drop["cleanlab_max_drop_pct: float"]
cl_floor["cleanlab_reweight_floor: float"]
X_train["X_train: np.ndarray"]
base_w["base_sample_weights: np.ndarray"]
smoothed_y["smoothed_y(raw_y, eps_buy, eps_hold)"]
cv_probs["cv_pred_probs(X_train, smoothed_y)
2-fold purged+embargoed CV
per ADR-14"]
suspect["suspect_mask(smoothed_y, cv_probs)"]
final_y["final_y(smoothed_y, suspect, cl_mode, cl_max_drop)"]
final_w["final_weights(base_w, suspect, cl_mode, cl_floor)"]
raw_y --> smoothed_y
eps_buy --> smoothed_y
eps_hold --> smoothed_y
smoothed_y --> cv_probs
X_train --> cv_probs
smoothed_y --> suspect
cv_probs --> suspect
smoothed_y --> final_y
suspect --> final_y
cl_mode --> final_y
cl_max_drop --> final_y
base_w --> final_w
suspect --> final_w
cl_mode --> final_w
cl_floor --> final_w
classDef out fill:#dcfce7,stroke:#16a34a
class final_y,final_w out
File : src/training/labels/label_pipeline.py (Hamilton dataflow module).
Trainer-side invocation (actual implementation as of committee pr_review 989a6567 blocker 3 fix : apply_label_pipeline now uses hamilton.driver.Driver(...).execute(...) with a DictResult adapter ; the previous imperative path was rejected for bypassing Hamilton's lineage tracking) :
from training.labels import apply_label_pipeline
X_train, y_train, sample_weights = apply_label_pipeline(
X_train,
y_train,
base_sample_weights=sample_weights,
crypto=os.environ.get("CVN_CRYPTO_SYMBOL", "unknown"),
fold_id=int(os.environ.get("CVN_FOLD_ID", "-1")),
)
# env vars read inside: CVN_LABEL_SMOOTH_EPS_BUY, CVN_LABEL_SMOOTH_EPS_HOLD,
# CVN_CLEANLAB_MODE (validated fail-fast against {off, filter, reweight} per
# ADR-25 — a typo aborts the variant cleanly), CVN_CLEANLAB_MAX_DROP_PCT,
# CVN_CLEANLAB_REWEIGHT_FLOOR, CVN_CLEANLAB_CV_FOLDS (default 2),
# CVN_CLEANLAB_FOLD_BUDGET_S (default 60), CVN_PURGE_BARS + CVN_EMBARGO_BARS
# (consumed by the existing PurgedKFold splitter — single source of truth
# for purge / embargo, no new knob).
Lineage emission (per ADR-61) : Hamilton's dr.visualize_execution(...) writes an SVG to MLflow as artefact label_pipeline_lineage_<crypto>_<fold>.svg.
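The final_y and final_weights nodes can be pictured as one plain function. This is a sketch under stated assumptions, not the Hamilton module itself : the function name is hypothetical, `suspect_score` (higher = more suspicious) is an assumed extra input used to decide which suspects fall under the cap, and the reweight floor is assumed to act as a multiplier on the existing weight.

```python
import numpy as np

VALID_MODES = ("off", "filter", "reweight")


def apply_cleanlab_mode(X, y, base_weights, suspect_mask, suspect_score,
                        mode, max_drop_pct=5.0, reweight_floor=0.5):
    if mode not in VALID_MODES:
        raise ValueError(f"unknown cleanlab mode: {mode!r}")
    X, y = np.asarray(X), np.asarray(y)
    weights = np.array(base_weights, dtype=float)  # copy, never mutate the caller's array
    if mode == "off":
        return X, y, weights

    n = len(y)
    cap = int(n * max_drop_pct / 100.0)  # strict cap, rounded down
    candidates = np.flatnonzero(np.asarray(suspect_mask, dtype=bool))
    scores = np.asarray(suspect_score, dtype=float)[candidates]
    selected = candidates[np.argsort(-scores, kind="stable")][:cap]  # most suspicious first

    if mode == "filter":
        keep = np.ones(n, dtype=bool)
        keep[selected] = False
        return X[keep], y[keep], weights[keep]

    # reweight: multiplicative, applied on top of the class-balancing weights
    weights[selected] *= reweight_floor
    return X, y, weights
```

Rounding the cap down keeps the filter mode within the strict "≤ max_drop_pct" bound asserted by the integration test, and returning X alongside y keeps rows and labels aligned when samples are dropped.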
4.5 OpenTelemetry instrumentation (ADR-62)
sequenceDiagram
participant Runner as ablation_runner
participant Trainer as XGBoost trainer
participant Pipeline as Hamilton label pipeline
participant Tracer as OTel tracer
participant Coll as OTel collector
Runner->>Tracer: span "ftf.variant" (factor=label_smoothing, variant=mild, crypto=BTCUSDC, fold_id=0)
Trainer->>Tracer: span "training.label_pipeline"
Trainer->>Pipeline: dr.execute(...)
Pipeline->>Tracer: span "label.smooth" (eps_buy=0.15, eps_hold=0.075)
Tracer-->>Coll: emit
Pipeline->>Tracer: span "label.cleanlab.cv_probs"
Tracer-->>Coll: emit
Pipeline->>Tracer: span "label.cleanlab.find_issues" (suspect_count=N, drop_pct=2.3)
Tracer-->>Coll: emit
Pipeline->>Tracer: span "label.apply_mode" (mode=filter)
Tracer-->>Coll: emit
Trainer->>Tracer: span "xgb.train" (n_samples=N')
Tracer-->>Coll: emit
Trainer->>Tracer: span "metrics.compute" (f1_buy, ECE, ...)
Tracer-->>Coll: emit
Spans emitted (one per step, ADR-62 golden-field attributes : factor, variant, crypto, fold_id) :
| Span | Attributes (in addition to golden fields) | Purpose |
|---|---|---|
| label.smooth | epsilon_buy, epsilon_hold | timing + parameter trace |
| label.cleanlab.cv_probs | cv_folds=2, model=xgb, purge_embargo=enabled, mem_mb, cpu_pct | wall time + resource footprint of CV pre-step (committee reco 13) |
| label.cleanlab.find_issues | suspect_count, drop_pct, buy_drop_pct (separate, reco 4), hold_drop_pct | granular drop visibility on minority class |
| label.apply_mode | mode, effective_drop_pct, effective_buy_drop_pct | confirm guardrail (drop ≤ 5 %) and minority preservation |
| training.label_pipeline | total_duration_ms | parent span over all label work |
Hamilton ↔ OTel naming consistency (committee reco 8) : Hamilton node names (smoothed_y, cv_pred_probs, suspect_mask, final_y, final_weights) appear as the second-level segment of the corresponding span name (label.smooth, label.cleanlab.cv_probs, label.cleanlab.find_issues, label.apply_mode). The cross-reference is asserted in a unit test that diffs the Hamilton node graph against the span catalogue.
Metrics (Prometheus, per ADR-62) :
- cvntrade_label_smooth_eps_applied{factor, variant} (gauge)
- cvntrade_cleanlab_drop_pct{crypto, fold} (gauge)
- cvntrade_cleanlab_buy_drop_pct{crypto, fold} (gauge — committee reco 4 — minority class granularity)
- cvntrade_cleanlab_hold_drop_pct{crypto, fold} (gauge)
- cvntrade_label_pipeline_duration_seconds{stage} (histogram)
- cvntrade_label_final_weights_min{crypto, fold} (gauge — committee reco 5)
- cvntrade_label_final_weights_max{crypto, fold} (gauge)
- cvntrade_label_final_weights_mean{crypto, fold} (gauge)
- cvntrade_label_final_weights_std{crypto, fold} (gauge)
- cvntrade_label_pipeline_mem_peak_mb{crypto, fold} (gauge — committee reco 13, K8s pressure visibility)
Grafana alerts (committee reco 19) :
- cvntrade_cleanlab_drop_pct > 4.5 for any (crypto × fold) → P2 alert "preempt 5 % cap breach", runbook documentation/runbooks/cleanlab_over_drop.md
- cvntrade_cleanlab_buy_drop_pct > 3.0 → P2 alert "minority decimation" — separate threshold, BUY samples are precious
4.6 Console UI integration (ADR-59 + ADR-65)
flowchart LR
Op[Operator]
Console["Console UI
scripts/ftf_config_ui.py"]
Param["PARAM_OPTIONS dict
(line 60-101)"]
PG[("ftf_config.base_env
JSONB")]
Hist[("ftf_config_history
audit trail")]
Op -- "selects variant
'aggressive' from dropdown" --> Console
Console -- "reads schema" --> Param
Console -- "UPDATE base_env
SET label_smoothing.epsilon_buy=0.30" --> PG
PG -. "trigger" .-> Hist
Console -- "shows diff: was=0.0, now=0.30" --> Op
Registration (in scripts/ftf_config_ui.py:60-101) :
PARAM_OPTIONS = {
# ... existing ...
"CVN_LABEL_SMOOTH_EPS_BUY": ("number", 0.0, 0.5, 0.05), # (type, min, max, step)
"CVN_LABEL_SMOOTH_EPS_HOLD": ("number", 0.0, 0.5, 0.05),
"CVN_CLEANLAB_MODE": ("enum", ["off", "filter", "reweight"]),
"CVN_CLEANLAB_MAX_DROP_PCT": ("number", 0.0, 5.0, 0.5),
"CVN_CLEANLAB_REWEIGHT_FLOOR":("number", 0.0, 1.0, 0.05),
}
The Streamlit form auto-renders the right widget per type. No code change in the UI beyond the dict ; that's the ADR-59 promise.
4.7 Observability fan-out
flowchart LR
Trainer[Trainer + Hamilton]
OTel[OTel collector]
Tempo[Tempo
traces]
Prom[Prometheus
metrics]
Loki[Loki
structured logs]
Graf[Grafana
dashboard 'F1 Boost Track 5']
Trainer -- "spans" --> OTel
Trainer -- "metrics" --> OTel
Trainer -- "log_event" --> Loki
OTel --> Tempo
OTel --> Prom
Tempo --> Graf
Prom --> Graf
Loki --> Graf
Graf -.- Op[Operator dashboard]
New Grafana panels (added in infra/grafana/dashboards/cvntrade-f1-boost.json post-implementation) :
- Panel 1 : cvntrade_label_smooth_eps_applied heatmap by (factor, variant)
- Panel 2 : cvntrade_cleanlab_drop_pct per crypto over the FTF run
- Panel 3 : cvntrade_label_pipeline_duration_seconds p95 (catches cleanlab CV pre-step regressions)
- Panel 4 : Joint metric scoreboard (f1_buy delta, expectancy_net delta, sortino delta, max_dd delta) per variant from Postgres finetune_results
5. Falsifiability + rollback
5.1 Falsifiable hypotheses (Story-level)
| H | What would falsify | Action if falsified |
|---|---|---|
| H1 (smoothing helps) | All 3 label_smoothing variants miss per-track gate AND ΔECE < 0.02 | Keep factor available with none default ; advance to Track 6 |
| H2 (cleanlab helps) | All 3 cleanlab variants miss per-track gate AND no improvement on the calibration curves | Keep factor available with off default ; advance to Track 6 |
| H3 (joint super-additive) | Joint config (best LS × best CL) ≤ max(LS-only, CL-only) on f1_buy | Drop joint, ship best single-factor variant only |
| H4 (cleanlab over-drops) | Drop rate hits the 5 % cap on > 1 crypto | Lower max_drop_pct to 2 % or revert cleanlab to off ; document in Story closure |
| H5 (silent training degradation) | OOS distribution shifts (training accuracy ↑ but validation accuracy ↓ by > 0.05) | This is overfitting from over-smoothing ; cap epsilon_buy at 0.10 ; document |
A REJECTED Story (H1 + H2 both) is still a successful Story — it falsifies cleanly and informs the next track. Per F1 plan §6, a NO-GO at sprint S1 close triggers escalation to big-bet bundle (S06-S08).
5.2 Rollback (config-only per ADR-71 + ADR-56)
- Per-variant rollback : set factor_label_smoothing=none and factor_cleanlab=off in ftf_config.base_env via Console (< 1 minute).
- Per-factor rollback : remove the factor from ablation_matrix.py (revert one commit). Schema unchanged because all params live in JSONB.
- No live deployment to roll back : Track 5 ships to FTF only, no production trading impact.
5.3 What's NOT a failure mode
- A variant taking longer than the baseline (cleanlab adds ~30 s/fold for the CV pre-step) is expected, not a failure. Logged as cvntrade_label_pipeline_duration_seconds.
- A variant changing the per-fold prediction distribution is expected, not a failure. The gate evaluates outcomes, not internals.
- A variant beating the gate on 4 of 5 cryptos but failing on 1 (e.g. SHIBUSDC) is acceptable per gate criterion : the outlier crypto stays on the none variant.
6. Action plan (sequenced)
Phase 1 : Plan review (this dossier)
| # | Action | Owner | Output |
|---|---|---|---|
| 1.1 | Submit this dossier to Expert Committee plan_review | Claude | committee verdict (PASSED / OK / EXECUTION_RISK / REJECTED) |
| 1.2 | Apply blockers (if REJECTED) or actionable recommendations | Claude + operator | updated dossier |
| 1.3 | Operator sign-off : approve to proceed to Phase 2 | Operator | OP wp#40 comment |
Phase 2 : Implementation (post-committee OK)
| # | Action | File | Test |
|---|---|---|---|
| 2.1 | Create Story dossier dir | documentation/stories/CVN-N001-EE-S01/ | n/a |
| 2.2 | Add label_smoothing factor declaration | src/commun/finetune/ablation_matrix.py (after line 351) | unit |
| 2.3 | Add cleanlab factor declaration | src/commun/finetune/ablation_matrix.py | unit |
| 2.4 | Add 4 guardrails | src/commun/finetune/validation.py | tests/unit/test_ftf_guardrails.py |
| 2.5 | Register 5 keys in Console | scripts/ftf_config_ui.py (PARAM_OPTIONS dict) | manual UI smoke |
| 2.6 | Implement label preprocessing module | src/training/labels/label_pipeline.py (Hamilton) | unit + property-based : value-range checks on y_final ∈ [0, 1], sample_weights >= 0, len(y_final) == len(X_train), no NaN/Inf, monotonic shrinkage of confidence (per committee reco 12) |
| 2.7 | Add OTel spans + metrics | src/training/labels/label_pipeline.py | unit (mock collector) |
| 2.8 | Wire into XGBoost trainer | src/training/XGBoost/cvntrade_XGBoost_trainer.py:144-145 | unit + integration |
| 2.9 | Add integration test (ADR-58) | tests/integration/test_track5_label_smoothing.py | passes locally + CI |
| 2.10 | Fill MLOps readiness template | documentation/stories/CVN-N001-EE-S01/mlops_readiness.md | reviewer checks 6 sections |
| 2.11 | Add Grafana panel JSON | infra/grafana/dashboards/cvntrade-f1-boost.json | manual smoke |
| 2.12 | Update CLAUDE.md if patterns changed | CLAUDE.md | n/a |
| 2.13 | Pin cleanlab version in dependencies (committee reco 14) | pyproject.toml + poetry.lock | CI pinned-version check |
| 2.14 | Review K8s training pod resources for cleanlab CV pre-step (committee reco 11) : bump memory request if cleanlab CV peaks > 2× baseline ; document in infra/helm/cvntrade-airflow/values.yaml if changed | infra/helm/cvntrade-airflow/values.yaml (if needed) | k8s smoke pod boot |
Phase 3 : Validation (FTF run)
| # | Action | Owner | Output |
|---|---|---|---|
| 3.1 | Run FTF preflight | Operator (Airflow trigger dag_ftf__preflight) | preflight green / red |
| 3.2 | Run FTF sweep on 5 unique configs | Operator (Airflow dag_finetune__pte with factor=label_smoothing,cleanlab) | 125 rows in finetune_results |
| 3.3 | Generate stats : bootstrap CI + BH-correction + Cohen's d | Operator (existing FTF report generator) | PDF report |
| 3.3b | PDF report integrity check (committee reco 15) : non-empty file, valid PDF magic bytes, contains the 5 statistical evidence boxes from §2.3 | Operator script (scripts/check_pdf_integrity.py) | green = continue ; red = re-run report generator |
| 3.4 | Decide : per-track gate cleared? | Operator + Claude | go / no-go per variant |
| 3.5 | If go : submit joint config (best LS × best CL) | Operator | +25 rows |
| 3.6 | Write results dossier | Claude | documentation/reviews/2026-04-29-track5-label-smoothing-results.md |
Phase 4 : Merge + close Story
| # | Action | Owner | Output |
|---|---|---|---|
| 4.1 | Open PR (squash, base main) | Claude | PR # |
| 4.2 | CodeRabbit review cycle (4-5 passes) | Claude (apply fixes) | clean CR |
| 4.3 | Submit committee pr_review | Claude | session id |
| 4.4 | Apply pr_review fixes | Claude | iteration |
| 4.5 | Merge | Operator | commit on main |
| 4.6 | CI Docs site green | Claude (verify) | ✅ |
| 4.7 | Close OP Story wp#40 (ADR-69 step 14) | Claude | wp#40 → Closed |
| 4.8 | If S02 (#713 focal loss) is the last in sprint S1, close sprint per OPERATIONS §16.4 | Claude | sprint closed |
Phase 5 : Post-merge instrumentation
| # | Action | Owner |
|---|---|---|
| 5.1 | Operator runs FTF sweep with new factor enabled in baseline | Operator |
| 5.2 | Grafana dashboard "F1 Boost Track 5" populated | Claude (panels exist post-merge) |
| 5.3 | Lock decision per FTF tuning protocol (BH significance + Cohen's d ≥ 0.3) | Operator |
| 5.4 | If lock : update ftf_baseline.json with the winning variant as new baseline | Operator |
7. Risks + mitigations
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| R1 | Cleanlab CV pre-step adds > 30 s/fold ; FTF run wall-time blows budget | M | M | Span timing measured ; CV folds are already at the default of 2 (reco 18) ; any fold exceeding CVN_CLEANLAB_FOLD_BUDGET_S (default 60 s) short-circuits with partial coverage (§4.2) |
| R2 | Asymmetric smoothing degrades calibration on majority class (HOLD) | M | L | ECE measured per class ; if HOLD ECE worsens > 0.02, ship mild only |
| R3 | Cleanlab drops samples that are real BUYs (the rare ones we need most) | M | H | F1 plan §6 risk #4 cap : drop ≤ 5 % ; cleanlab reweight mode is safer than filter |
| R4 | Operator forgets to run FTF preflight → guardrail trips mid-run | L | M | Preflight is part of the standard dag_finetune__pte flow per ADR-64, hard to skip |
| R5 | Joint LS×CL config worse than either alone (anti-additive) | M | L | H3 hypothesis ; ship single-factor winner |
| R6 | The 5 % drop cap is hit on > 1 crypto → cleanlab effectively useless | L | L | H4 hypothesis ; lower drop cap or revert |
| R7 | OTel collector unreachable → spans drop ; FTF run continues | L | L | OTel emission is non-blocking per src/commun/observability/otel.py ; logged warning only |
| R8 | Hamilton dataflow lineage SVG too large for MLflow | L | L | Cap nodes shown ; fall back to JSON adjacency list |
| R9 | Cleanlab model instability : poor pred_probs from internal CV → misidentified label issues, false positives on suspect samples (committee reco 9) | M | M | Monitor suspect_count and effective_drop_pct per fold ; alert on cleanlab_drop_pct > 4.5 % (preempt 5 % cap) and on cleanlab_buy_drop_pct > 3.0 % (minority decimation) ; if instability seen on > 1 crypto, switch to cleanlab=off (the single-fold fallback was removed as leaky, §4.2) |
| R10 | XGBoost training time inflates : altered label distribution + sample weights makes optimization slower (committee reco 10) | L | L | Monitor xgb.train OTel span duration p95 ; if regression > 50 % vs baseline, investigate before increasing variant count |
| R11 | K8s resource pressure during cleanlab CV pre-step (memory + CPU) (committee reco 11 + 13) | M | L | Per-pod memory + CPU monitoring during the label.cleanlab.cv_probs span ; review training pod requests / limits before sweep ; switch to cleanlab=off if OOMKilled (no single-fold fallback, §4.2) |
| R12 | Cleanlab dependency drift between dev / CI / prod (committee reco 14) | L | M | Pin cleanlab exact version in pyproject.toml + poetry.lock ; CI verifies pinned version matches lock |
8. MLOps readiness preview
The full MLOps readiness file (per ADR-70) will be at documentation/stories/CVN-N001-EE-S01/mlops_readiness.md and committed in Phase 2 step 2.10. Preview of the 6 sections :
| § | Section | Track 5 specifics |
|---|---|---|
| 1 | Production monitoring | cvntrade_label_smooth_eps_applied, cvntrade_cleanlab_drop_pct, cvntrade_cleanlab_buy_drop_pct, cvntrade_cleanlab_hold_drop_pct, cvntrade_label_pipeline_duration_seconds, cvntrade_label_final_weights_{min,max,mean,std}, cvntrade_label_pipeline_mem_peak_mb |
| 2 | Alerting & runbooks | P2 alert on cleanlab_drop_pct > 4.5% (preempt 5 % cap, reco 19) ; P2 alert on cleanlab_buy_drop_pct > 3.0% (minority decimation) ; P3 alert on label_pipeline_mem_peak_mb > 2× baseline (K8s pressure, reco 11+13) ; runbooks documentation/runbooks/cleanlab_over_drop.md, cleanlab_minority_decimation.md, cleanlab_oom_pressure.md |
| 3 | Drift detection | PSI on top-K features unchanged (Track 5 doesn't touch features) ; concept drift on f1_buy per crypto ; post-Track 5 funnel re-evaluation (committee reco 3) — measure 9-filter chain pass-rate distribution shift |
| 4 | Staged rollout | FTF-only this Story (no live) ; if ever promoted to live trading, shadow ≥ 7d on BTCUSDC (deepest order book) |
| 5 | Rollback plan | Console flip factor_label_smoothing=none, factor_cleanlab=off (< 1 min) ; no code deploy needed ; cleanlab pinned in pyproject.toml (reco 14) so rollback doesn't trigger transitive dep upgrade |
| 6 | DRI | Operator (dococeven) ; backup TBD ; sunset 2026-07-28 (90d) |
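The alert thresholds in row 2 can be expressed as a plain evaluation function. This is a minimal sketch under stated assumptions: the `LabelPipelineStats` container and the `evaluate_alerts` function are hypothetical names, `cleanlab_buy_drop_pct` is assumed to be measured against the BUY-class count (the denominator is not defined in this section), and the real alerts would be configured in the alerting stack rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelPipelineStats:
    """Hypothetical per-run summary of the label pipeline."""
    n_total: int
    n_buy: int
    n_dropped_total: int
    n_dropped_buy: int
    mem_peak_mb: float
    baseline_mem_peak_mb: float


def evaluate_alerts(stats):
    """Return a list of (severity, alert_name) tuples for the thresholds in §8 row 2."""
    alerts = []
    drop_pct = 100.0 * stats.n_dropped_total / stats.n_total
    # Assumption: BUY drop rate is relative to the BUY-class count.
    buy_drop_pct = 100.0 * stats.n_dropped_buy / stats.n_buy

    if drop_pct > 4.5:  # preempts the 5 % drop cap
        alerts.append(("P2", "cleanlab_over_drop"))
    if buy_drop_pct > 3.0:  # minority decimation
        alerts.append(("P2", "cleanlab_minority_decimation"))
    if stats.mem_peak_mb > 2.0 * stats.baseline_mem_peak_mb:  # K8s pressure
        alerts.append(("P3", "cleanlab_oom_pressure"))
    return alerts
```

The alert names mirror the runbook file names listed in row 2, so each firing alert maps to one runbook.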
9. Open questions for committee
- Smoothing values (§4.3 variant table) : `mild=(eps_buy=0.15, eps_hold=0.075)` and `aggressive=(eps_buy=0.30, eps_hold=0.15)`. Are these appropriate for a 20 % BUY rate? (Resolved by committee reco 17 ; values shown here are post-amendment.)
- Asymmetry direction : we apply `ε_buy > ε_hold` (more smoothing on minority), which is the standard "label smoothing for imbalanced classification" recipe. Is there literature pushing the opposite direction that we should consider?
- Cleanlab `reweight` floor at 0.5 : reasonable, or push to 0.25 (halve weight twice)?
- CV folds for cleanlab pre-step : 3, which increases wall time but improves `pred_probs` quality. Are 5 folds worth the cost?
- Joint config (best LS × best CL) : run unconditionally or only if both single-factor variants pass the per-track gate? Defaulting to gated to save compute.
- Sample weights ordering : cleanlab reweight applies AFTER existing class-balancing weights, with multiplicative composition. Is there a known interaction with the XGBoost gradient that makes additive composition safer?
- Smoothing on validation/test : we apply smoothing on training labels only ; val/test labels stay hard for honest evaluation. Confirm this is the standard practice (it is for image classification but may differ for HFT-style binary tasks).
- Hamilton vs imperative for this 4-step preprocessing : Hamilton is mandated by ADR-61 for batch flows. The pipeline is small (4 nodes) : any concern about over-engineering versus the gain in lineage emission?
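To make the smoothing and weight-composition questions concrete, here is a minimal sketch. Assumptions, none of which are confirmed by this dossier section: the helper names are hypothetical, labels are encoded as 1 = BUY and 0 = HOLD, and the cleanlab reweight factor is a per-sample value in [0, 1] that is clipped at the floor. It smooths training labels only and composes weights multiplicatively, as the questions above describe.

```python
def smooth_train_labels(labels, eps_buy, eps_hold):
    """Asymmetric smoothing: BUY (1) becomes 1 - eps_buy, HOLD (0) becomes eps_hold.

    Intended for training labels only; validation/test labels stay hard.
    """
    if not (0.0 <= eps_hold <= eps_buy < 0.5):
        raise ValueError("expected 0 <= eps_hold <= eps_buy < 0.5")
    return [1.0 - eps_buy if y == 1 else eps_hold for y in labels]


def compose_weights(class_weights, cleanlab_factors, floor=0.5):
    """Multiply class-balancing weights by cleanlab factors clipped at `floor`.

    Assumption: cleanlab_factors are per-sample values in [0, 1].
    """
    if len(class_weights) != len(cleanlab_factors):
        raise ValueError("weight vectors must have the same length")
    return [cw * max(floor, f) for cw, f in zip(class_weights, cleanlab_factors)]
```

With the `mild` values, `smooth_train_labels([1, 0], eps_buy=0.15, eps_hold=0.075)` maps BUY to roughly 0.85 and HOLD to 0.075. With the 0.5 floor, a suspect sample can lose at most half of its class-balancing weight ; a floor of 0.25 would allow it to lose three quarters.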
10. Acceptance criteria (Story level)
- Plan dossier merged with committee `plan_review` verdict ≥ ACCEPTED (≥ 8.0 avg, no unresolved blockers)
- All committee actionable recommendations applied to dossier or PR
- Phase 2 implementation complete (10 of 12 sub-steps green ; CLAUDE.md update only if patterns change)
- FTF preflight green, FTF sweep complete (125 + optional 25 rows in `finetune_results`)
- Results dossier merged at `documentation/reviews/2026-04-29-track5-label-smoothing-results.md`
- Story OP wp#40 closed with PR + commit reference + MLflow run IDs
- If gate cleared : winning variant locked in `ftf_baseline.json` via Console flip (operator action)
- If gate not cleared : H1/H2 falsified entry written into Story closure note + sprint S1 retrospective triggered
11. References
11.1 Existing artefacts (built on)
- `documentation/F1_BUY_BOOST_PLAN.md` : canonical plan, §5 Track 5 spec
- `documentation/epics/CVN-N001-EE-f1-buy-boost.md` : Epic doc with Story tracking table
- `documentation/templates/TEMPLATE_mlops_readiness.md` : to be filled per ADR-70
- `documentation/adr/0070-mlops-readiness-template-mandatory.md` : gate
- `documentation/adr/0071-trading-kill-switch-invariants.md` : sister DESIGN
11.2 Code surfaces
- `src/commun/finetune/ablation_matrix.py:342` (factor_calibration model pattern)
- `src/commun/finetune/ablation_runner.py:110` (`_extract_extended_metrics`)
- `src/commun/finetune/persistence.py:34` (`finetune_results` insert)
- `src/commun/finetune/preflight/base.py` (FTF preflight ADR-64)
- `src/commun/finetune/preflight/hamilton_exec.py` (Hamilton driver pattern)
- `src/commun/observability/otel.py:88` (OTel singleton)
- `src/training/XGBoost/cvntrade_XGBoost_trainer.py:144` (label ingestion point)
- `scripts/ftf_config_ui.py:60` (PARAM_OPTIONS registration)
- `tests/unit/test_ftf_guardrails.py` (guardrail test pattern)
11.3 Existing ADRs the plan builds on
- ADR-25 — No silent fallback (cleanlab errors must be loud)
- ADR-29 — Naïve baseline mandatory (already in baseline)
- ADR-30, ADR-32, ADR-33 — Structured logs (golden-field attributes on spans)
- ADR-56 — Every change FTF-testable (factor matrix is the contract)
- ADR-58 — Every factor has a guardrail + integration test (mandatory)
- ADR-59 — All params in PostgreSQL ftf_config (Console-only edit)
- ADR-61 — Hamilton for batch flows (label preprocessing)
- ADR-62 — OpenTelemetry spans (instrumentation contract)
- ADR-63 — Binary BUY/NOT_BUY mission mode
- ADR-64 — FTF preflight first-class
- ADR-65 — Airflow DAG params run-level only
- ADR-67 — Pluggable feature selection (architectural sibling)
- ADR-68 — Committee = default review channel (this dossier)
- ADR-69 — OpenProject orchestrator (Story discipline)
- ADR-70 — MLOps readiness template (must fill, Phase 2 step 2.10)
- ADR-71 — Kill-switch invariants (live-deploy gate ; not relevant for FTF-only Track 5 but documented)
11.4 Committee sessions (precedent)
- `9d4942cb` : F1 boost plan v3 PASSED OK avg 8.96 (parent approval)
- `3e0a3008` : sprint S0 / S01 ADR-70 PASSED OK avg 8.6 (mode B precedent for docs)
- `4c388b4c` : sprint S0 / S02 ADR-71 PASSED EXECUTION_RISK avg ~8.3 (mode B with 1 BLOCKER)
- `8a202a18` : this Story's plan_review : PASSED / OK avg 8.7 (architect 9.0, ops 9.0, ml-eng 8.5, crypto-trader 8.5, data-sci 8.5), 0 blockers, 4 dissents (Q1 / Q4 / Q5 / temporal leakage), 19 recommendations of which 18 actionable applied to this dossier in revisions below ; reco 16 ("consult ML expert on sample-weight × XGBoost gradient interaction") considered satisfied by this very committee session
11.4.1 Recommendations applied (committee 8a202a18)
| # | Reco | Where applied |
|---|---|---|
| 1 | Detail cost formula v3 | §2.1 with formula + link to economic_thresholds.py |
| 2 | Clarify fold construction (walk-forward + purge + embargo, ADR-14) | §2.1 |
| 3 | Note post-Track 5 9-filter chain re-evaluation | §3.2 + §8 §3 (drift) |
| 4 | OTel metric `cleanlab_buy_drop_pct` | §4.5 metrics + alerts |
| 5 | OTel metrics for `final_weights` distribution (min/max/mean/std) | §4.5 metrics |
| 6 | Cleanlab CV folds purged + embargoed (ADR-14 temporal leakage) | §4.2 + §4.4 + Hamilton driver pseudocode |
| 7 | Per-class HOLD ECE gate (`ECE_HOLD ≤ baseline + 0.01`) | §1.3 success criterion |
| 8 | Hamilton node ↔ OTel span naming consistency + asserted via unit test | §4.5 |
| 9 | Risk : cleanlab model instability | §7 R9 + monitoring (recos 4, 19) |
| 10 | Risk : XGBoost training time inflation | §7 R10 |
| 11 | K8s resource scaling check | §6 Phase 2 step 2.14 + §8 §2 alerting |
| 12 | Property-based tests details for `label_pipeline.py` | §6 Phase 2 step 2.6 |
| 13 | Memory + CPU monitoring during cleanlab CV | §4.5 attributes + metric mem_peak_mb |
| 14 | Pin cleanlab version in dependencies | §6 Phase 2 step 2.13 + §8 §5 rollback |
| 15 | PDF report integrity check | §6 Phase 3 step 3.3b |
| 17 | Adjusted smoothing values : `mild=(eps_buy=0.15, eps_hold=0.075)`, `aggressive=(eps_buy=0.30, eps_hold=0.15)` (canonical `(eps_buy, eps_hold)` tuple ordering) | §4.3 variant table |
| 18 | Cleanlab CV folds : 3 → 2 (fallback 1 if > 60 s/fold) | §4.2 + §4.4 + Hamilton pseudocode + §7 R1 |
| 19 | Grafana alert `cleanlab_drop_pct > 4.5%` | §4.5 alerts + §8 §2 |
11.4.2 Dissents recorded (not amendments)
- Q1 smoothing values : 4 experts split, settled by reco 17 amendment
- Q4 cleanlab CV folds : settled by reco 18 amendment (3 → 2 with fallback)
- Q5 joint config strategy : kept gated (run only if both single-factor pass) — operator may override
- Temporal leakage on cleanlab CV : critical, settled by reco 6 amendment
11.5 Issues + sprint
- Need `CVN-N001` (#608)
- Epic `CVN-N001-EE` (#707) : F1_buy boost
- Story `CVN-N001-EE-S01` (#712) : this Story
- Operational prereqs : ✅ #708 (kill-switch DESIGN), ✅ #709 (MLOps template DESIGN)
- Sprint `F1B-S1-QW-PhaseA` (OP version id 6, dates 2026-05-04 → 2026-05-17 ; note : we're starting early ; sprint dates are aspirational)
- Sibling Story (next in sprint) : `CVN-N001-EE-S02` (#713) : focal loss, gated by Track 5 result
11.6 External
- Müller, R., Kornblith, S., Hinton, G. When Does Label Smoothing Help? NeurIPS 2019 : effect of label smoothing on calibration and distillation (background ; the paper studies uniform smoothing, not the asymmetric variant used here)
- Northcutt, C., Jiang, L., Chuang, I. Confident Learning: Estimating Uncertainty in Dataset Labels. JAIR 2021 : cleanlab paper
- López de Prado, M. Advances in Financial Machine Learning §3.5 : meta-labeling and label noise in HFT