PR review dossier — CVN-N001-EI-S12 increment 2 : finetune_baselines per (crypto, fold, cost), in-runner¶
- PR: #1117 —
feat(CVN-N001-EI-S12): finetune_baselines per (crypto,fold,cost), in-runner (increment 2) - Story: CVN-N001-EI-S12 (economic observability substrate), OpenProject wp / GH issue
- Branch:
feat/CVN-N001-EI-S12-increment2-baselines - Session type: pr_review (ADR-68) — touches
src/commun/finetune/ - Date: 2026-06-07
- Companion: RCA
documentation/reviews/2026-06-06-cvn-n001-ei-s12-inc2-btc-features-8gi-oom-rca.md(the cluster dry-run that validated this PR and incidentally exposed thebtc_featuressizing bug, fixed + deployed separately in #1124).
1. Goal¶
Increment 2 of the S12 economic-observability substrate (Option B). It populates finetune_baselines with a naïve-baseline comparator per (crypto, fold, cost_bps) so every model row in finetune_results has a same-basis benchmark (ADR-29 — naïve baseline mandatory; ADR-27 — intra-crypto comparison only). Increment 1 (#1115, merged) added the model-side keys (model_hyperparams, feature_hash); increment 2 adds the baseline side.
2. Why per (crypto, fold, cost) — the load-bearing design decision¶
The §0bis probe of finetune_results found cost_bps is an intra-run ablation sweep dimension: ablation variants override CVN_TRADE_FEE_BPS so a single run sweeps multiple costs {10, 15, 30}. A baseline is only a valid comparator at the same cost basis as the model row it benchmarks. So the key must be (crypto, fold, cost_bps), not (crypto, fold) — otherwise a 10-bps model row would be compared to a baseline computed at a different cost. Migration 014 adds cost_bps to finetune_baselines + the uq constraint.
3. Architecture — same-basis-by-construction, in-runner¶
Baselines are computed inside AblationRunner, reusing the exact feature_store (fs), fold window, and cost the model row used — rather than re-derived in a separate pass (which would re-anchor folds via datetime.now()-based _generate_folds → clock-skew → a different test window than the model saw). Key elements:
_ensure_baseline(crypto, fold, fs, cost_bps)(ablation_runner.py): memoised by(crypto, fold, cost)via_baseline_keys_done— computes the 3 baselines (buy_and_hold,random,naive) once per key onfs[["close"]]over the fold'stest_start/test_end, at the model's exact cost, and persists. Marks the key done only if all 3 persist (CR r1 fix — was hiding non-exceptional persist failures).- Cost threading —
compute_trade_cost(0, 0, cost_bps, slippage, 0)gives the round-trip cost;buy_and_hold/random_baselinesubtract it per trade (naive= 0 trades, unaffected). _check_baseline_completeness()— emitsevent=baselines_incompleteif any expected key never persisted (no-silent-partial, ADR-25). Called after eachpersist_resultsite and atrun_allend.baselines.py—random_baselineis seeded + averaged (for seed in range(n_seeds=100): rng=np.random.default_rng(seed)+np.mean), streaming (one transient array per seed, scalar accumulators → KBs footprint, verified in code).round_trip_costparam added tobuy_and_hold+random_baseline.persistence.py—persist_baselineINSERT gets thecost_bpscolumn + value.ablation_report.py— removed a deadpersist_baseline({})loop (baselines now persist in-runner).
4. Migration 014 (applied to prod PG, verbatim, checksum-verified)¶
ADD COLUMN cost_bps INT NOT NULL DEFAULT 0 then DROP DEFAULT (table empty → backfills nothing, then fail-loud if a future caller omits cost) + CHECK(cost_bps >= 0) + uq_finetune_baseline = (run_id, strategy, baseline_type, crypto, fold_id, cost_bps). strategy kept in the uq (defensive: random is strategy-independent today, but keeping it makes a silent ON-CONFLICT collapse impossible if a baseline ever becomes strategy-dependent; run_id already implies strategy → zero downside).
5. Validation¶
5.1 Unit (green)¶
test_finetune_baselines.py::TestCostBasis— cost netting + monotonicity.test_baselines_increment2.py—persist_baselinecost reaches the INSERT;_ensure_baselinememo (once per key), distinct-cost = distinct key, None-cost skipped, completeness flags missing.
5.2 Cluster dry-run (the load-bearing gate — committee #257 mandate)¶
A pre-merge dry-run ran the branch image on both prod pod profiles at real cgroup limits:
- A — n_features @ HEAVY 24Gi: SUCCESS, no OOM. Baselines persisted (cost_bps-keyed), keys populated. Since increment-2 is factor-independent (_ensure_baseline reuses fs[["close"]]), A's in-pod validation transfers to all factors.
- B — btc_features @ STANDARD 8Gi: OOMKilled — but the root cause was a pre-existing sizing mis-classification (btc loads a 2nd asset via the BTC overlay → working-set 7.68 Gi, razor-margin vs 8 Gi), not increment-2 (whose footprint is KBs, verified in code §3). Fixed + deployed independently in #1124 (btc_features → HEAVY_FACTORS); post-deploy btc re-ran @ 24Gi to SUCCESS, no OOM, measured working-set 7.68 Gi (≈ 2× n_features → 2-asset root cause confirmed as a number). Full RCA in the companion dossier.
- DB end-state: finetune_baselines populated (3 types, cost_bps-keyed) across folds; finetune_results rows all keyed (model_hyperparams non-NULL, distinct feature_hash) — the increment-1 re-probe is green too.
6. Risks / open questions for the committee¶
- Baseline correctness — is the close-only
random_baseline(fixedhold_barsexit, 100 seeds averaged, cost-netted) an appropriate naïve comparator under ADR-29, or should the naïve baseline be the sole ADR-29 comparator and random/B&H be auxiliary? - Cost-basis — is per-(crypto, fold, cost) the right grain, or is there a sweep dimension we missed (e.g. slippage) that also changes the valid basis?
- Same-basis-by-construction vs a documented re-derivation — is reusing the in-runner fs/window/cost the right call, and is the completeness check sufficient to guarantee no silently-missing baseline?
strategyin the uq — kept defensively; any downside?- Anything that should block merge.
7. Out of scope (tracked separately)¶
- Increment 1b (
optuna_trial_id) — deferred (YAGNI). l2_featuressizing — NOT reclassed by analogy (no overlay in code); its own profile to be verified.- 3rd ~12–16 Gi pod profile (btc at 32% of 24 Gi) — optimization issue.
- Increment 4 (cost-validity) — next S12 increment.
Question for the committee¶
Is the increment-2 finetune_baselines implementation correct and safe to merge — in particular the same-cost-basis, in-runner baseline construction (ADR-29/27/29), the per-(crypto, fold, cost) memoisation, the no-silent-partial completeness check, the cost-netting in random/buy_and_hold, and memory safety — given the cluster dry-run validation (A success in-pod, factor-independent; B's OOM traced to a separate sizing bug already fixed)?