Skip to content

PR review dossier — CVN-N001-EI-S12 increment 2 : finetune_baselines per (crypto, fold, cost), in-runner

  • PR: #1117 — feat(CVN-N001-EI-S12): finetune_baselines per (crypto,fold,cost), in-runner (increment 2)
  • Story: CVN-N001-EI-S12 (economic observability substrate), OpenProject wp / GH issue
  • Branch: feat/CVN-N001-EI-S12-increment2-baselines
  • Session type: pr_review (ADR-68) — touches src/commun/finetune/
  • Date: 2026-06-07
  • Companion: RCA documentation/reviews/2026-06-06-cvn-n001-ei-s12-inc2-btc-features-8gi-oom-rca.md (the cluster dry-run that validated this PR and incidentally exposed the btc_features sizing bug, fixed + deployed separately in #1124).

1. Goal

Increment 2 of the S12 economic-observability substrate (Option B). It populates finetune_baselines with a naïve-baseline comparator per (crypto, fold, cost_bps) so every model row in finetune_results has a same-basis benchmark (ADR-29 — naïve baseline mandatory; ADR-27 — intra-crypto comparison only). Increment 1 (#1115, merged) added the model-side keys (model_hyperparams, feature_hash); increment 2 adds the baseline side.

2. Why per (crypto, fold, cost) — the load-bearing design decision

The §0bis probe of finetune_results found cost_bps is an intra-run ablation sweep dimension: ablation variants override CVN_TRADE_FEE_BPS so a single run sweeps multiple costs {10, 15, 30}. A baseline is only a valid comparator at the same cost basis as the model row it benchmarks. So the key must be (crypto, fold, cost_bps), not (crypto, fold) — otherwise a 10-bps model row would be compared to a baseline computed at a different cost. Migration 014 adds cost_bps to finetune_baselines + the uq constraint.

3. Architecture — same-basis-by-construction, in-runner

Baselines are computed inside AblationRunner, reusing the exact feature_store (fs), fold window, and cost the model row used — rather than re-derived in a separate pass (which would re-anchor folds via datetime.now()-based _generate_folds → clock-skew → a different test window than the model saw). Key elements:

  • _ensure_baseline(crypto, fold, fs, cost_bps) (ablation_runner.py): memoised by (crypto, fold, cost) via _baseline_keys_done — computes the 3 baselines (buy_and_hold, random, naive) once per key on fs[["close"]] over the fold's test_start/test_end, at the model's exact cost, and persists. Marks the key done only if all 3 persist (CR r1 fix — was hiding non-exceptional persist failures).
  • Cost threadingcompute_trade_cost(0, 0, cost_bps, slippage, 0) gives the round-trip cost; buy_and_hold / random_baseline subtract it per trade (naive = 0 trades, unaffected).
  • _check_baseline_completeness() — emits event=baselines_incomplete if any expected key never persisted (no-silent-partial, ADR-25). Called after each persist_result site and at run_all end.
  • baselines.pyrandom_baseline is seeded + averaged (for seed in range(n_seeds=100): rng=np.random.default_rng(seed) + np.mean), streaming (one transient array per seed, scalar accumulators → KBs footprint, verified in code). round_trip_cost param added to buy_and_hold + random_baseline.
  • persistence.pypersist_baseline INSERT gets the cost_bps column + value.
  • ablation_report.py — removed a dead persist_baseline({}) loop (baselines now persist in-runner).

4. Migration 014 (applied to prod PG, verbatim, checksum-verified)

ADD COLUMN cost_bps INT NOT NULL DEFAULT 0 then DROP DEFAULT (table empty → backfills nothing, then fail-loud if a future caller omits cost) + CHECK(cost_bps >= 0) + uq_finetune_baseline = (run_id, strategy, baseline_type, crypto, fold_id, cost_bps). strategy kept in the uq (defensive: random is strategy-independent today, but keeping it makes a silent ON-CONFLICT collapse impossible if a baseline ever becomes strategy-dependent; run_id already implies strategy → zero downside).

5. Validation

5.1 Unit (green)

  • test_finetune_baselines.py::TestCostBasis — cost netting + monotonicity.
  • test_baselines_increment2.pypersist_baseline cost reaches the INSERT; _ensure_baseline memo (once per key), distinct-cost = distinct key, None-cost skipped, completeness flags missing.

5.2 Cluster dry-run (the load-bearing gate — committee #257 mandate)

A pre-merge dry-run ran the branch image on both prod pod profiles at real cgroup limits: - A — n_features @ HEAVY 24Gi: SUCCESS, no OOM. Baselines persisted (cost_bps-keyed), keys populated. Since increment-2 is factor-independent (_ensure_baseline reuses fs[["close"]]), A's in-pod validation transfers to all factors. - B — btc_features @ STANDARD 8Gi: OOMKilled — but the root cause was a pre-existing sizing mis-classification (btc loads a 2nd asset via the BTC overlay → working-set 7.68 Gi, razor-margin vs 8 Gi), not increment-2 (whose footprint is KBs, verified in code §3). Fixed + deployed independently in #1124 (btc_features → HEAVY_FACTORS); post-deploy btc re-ran @ 24Gi to SUCCESS, no OOM, measured working-set 7.68 Gi (≈ 2× n_features → 2-asset root cause confirmed as a number). Full RCA in the companion dossier. - DB end-state: finetune_baselines populated (3 types, cost_bps-keyed) across folds; finetune_results rows all keyed (model_hyperparams non-NULL, distinct feature_hash) — the increment-1 re-probe is green too.

6. Risks / open questions for the committee

  1. Baseline correctness — is the close-only random_baseline (fixed hold_bars exit, 100 seeds averaged, cost-netted) an appropriate naïve comparator under ADR-29, or should the naïve baseline be the sole ADR-29 comparator and random/B&H be auxiliary?
  2. Cost-basis — is per-(crypto, fold, cost) the right grain, or is there a sweep dimension we missed (e.g. slippage) that also changes the valid basis?
  3. Same-basis-by-construction vs a documented re-derivation — is reusing the in-runner fs/window/cost the right call, and is the completeness check sufficient to guarantee no silently-missing baseline?
  4. strategy in the uq — kept defensively; any downside?
  5. Anything that should block merge.

7. Out of scope (tracked separately)

  • Increment 1b (optuna_trial_id) — deferred (YAGNI).
  • l2_features sizing — NOT reclassed by analogy (no overlay in code); its own profile to be verified.
  • 3rd ~12–16 Gi pod profile (btc at 32% of 24 Gi) — optimization issue.
  • Increment 4 (cost-validity) — next S12 increment.

Question for the committee

Is the increment-2 finetune_baselines implementation correct and safe to merge — in particular the same-cost-basis, in-runner baseline construction (ADR-29/27/29), the per-(crypto, fold, cost) memoisation, the no-silent-partial completeness check, the cost-netting in random/buy_and_hold, and memory safety — given the cluster dry-run validation (A success in-pod, factor-independent; B's OOM traced to a separate sizing bug already fixed)?