Planning Dossier - CVN-N001-EK-S02: D2 - Quantitative pre-study (Phase 1)
Story: CVN-N001-EK-S02 · parent Epic CVN-N001-EK
OpenProject: wp#272 · GitHub issue: #1173
Type: implementation plan (ADR-68), for committee plan_review.
Date: 2026-06-09 · Revision: r4 (restructured to the ADR-0101 canonical plan: problématisation → user stories → hypotheses → state of the art → DoD → consolidation; technical contracts renumbered as §7–§18 "Approach")
Story exit gate: requires methodology committee review PASSED before transition to Specified. INFEASIBLE is an allowed, successful outcome.
Revision journal.
- r1 (PR #1175): right governance boundary (analysis-only).
- r2 (PR #1176): hardened engineering/statistical contracts (REWORK, 4.2/5), covering the reference-capacity rule, cost evidence tiers, corrected power wording + simulation contract, exploratory anti-snooping boundary, null-gate selection rule, expanded budgets, INFEASIBLE next-action table, "signed derivation", Threats.
- r3 (PR #1177): sharp P1 edits (PASS-with-edits, 4.75/5), covering: Tier-C completes S02 but blocks S03; conservative capacity tie-breaker; Tier-B bounded; mapping monotonicity; null-invalidity criteria.
- r4 (this revision): restructured to the ADR-0101 canonical plan shape by adding Problématisation (§1), User stories (§2), Hypotheses (§3), State of the art (§4), Definition of Done (§5), Consolidation (§6). The committee-reviewed technical contracts are unchanged in substance; they are renumbered as §7–§18 (Approach / design) with internal cross-references updated.
Executive summary
S02 derives the quantitative values required to lock the Phase-0 charter (ADR-0102).
It is analysis-only and may legitimately end in INFEASIBLE. It cannot prove signal, choose a
model, authorize trading, or launch runs. It separates durable rules (already in the ADR) from the
instance values it now derives, each backed by a signed, versioned derivation or a typed
INFEASIBLE record. INFEASIBLE is a successful S02 outcome, but it blocks the S03 charter lock unless
and until the typed cause is remediated.
Part I: Framing (ADR-0101)
1. Problem statement (Problématisation)
The question, in one sentence. Before trying to prove that our entry signal "works", can we state, defensibly and in writing ahead of time, how much it would have to be worth to be genuinely tradable, and whether our data even allow us to measure it?
Why now. The previous pilot came back inconclusive: the ranges were compatible with a loss as well as a gain, at every cost level. The natural temptation would be to rerun "one more trial" until a favourable result turns up. That is precisely the trap: by multiplying trials with no rule fixed in advance, one always ends up finding a pleasing number that means nothing. This Story (S02) refuses that trap: it sets the threshold values and the measurement rules first, before any test of the signal.
Illusions to avoid. Four traps have already cost the project:
- Confusing a good exam grade with money. A model can have excellent statistical metrics without generating any net gain once fees are paid.
- An "average" cost that makes everything profitable. If the assumed transaction cost is too optimistic or a "default", any signal looks tradable. The cost must be real, measured at a specific order size.
- Choosing the rules after seeing the results. Deciding the order size, the comparison point (the "null") or the metric after looking at what is convenient is self-deception.
- Concluding "profitable" on too few observations. A few lucky trades are not proof.
What is actually measured. S02 derives, from already existing data (no new heavy computation,
no launch): (a) the real cost of a round trip at a reference order size explicitly labelled
"research, non-deployment"; (b) the economic-expectancy threshold above which costs are covered
(E_econ_min); (c) the corresponding minimum predictive effect (E_pred_min); (d) the available
power, i.e. whether our data are even numerous enough to detect such an effect; (e) an honest
comparison point (the null) that the signal will have to beat.
Honesty of the verdict. S02 is allowed to conclude "infeasible", and that is a success, not a failure. It is better to say "our data do not allow a decision" than to manufacture false certainty. Infeasibility is also typed (cost, capacity, power, mapping, null) so that it indicates exactly which action removes the blocker.
Why it matters. These numbers are not academic: they are the precondition for locking the protocol (S03) and then testing the signal (S04). Without them, everything that follows is trial and error. S02 turns "we try and see" into "we know what must be reached, and we know whether it is measurable".
2. Needs: user stories
- US1: As scientific owner, I want the economic thresholds derived before any test of the signal, so as to avoid the garden of forking paths. → §7, §10
- US2: As risk owner, I want a P90 cost traced to a reference capacity with its evidence level (tier), so that a fragile proxy is never mistaken for a lockable cost. → §9
- US3: As methodologist, I want a reproducible power simulation, so as to know whether the substrate can detect the economically relevant effect. → §11
- US4: As operator, I want a typed INFEASIBLE verdict, so as to know exactly which remediation to start (instrument the cost, widen the universe, change the metric, etc.). → §15
- US5: As reviewer, I want existing exploratory results not to contaminate the calibration choices, so as to preserve the falsifiability of the protocol. → §8.1
- US6: As risk owner, I want a Tier-C cost never to be able to lock S03, so as to prevent false tradability. → §9.2
- US7: As scientific owner, I want a settled definition of the Sortino (R1), so that no "below the floor" conclusion is invoked without a comparable basis. → §13
3. Hypotheses (EN)
S02 is calibration, not signal-proof, so its hypotheses concern feasibility / derivability, not the signal.
Each carries the null tested and the test method; the verdict on a non-rejected null is a typed
INFEASIBLE (§15).
- H1: capacity derivable. A conservative, defensible research-reference order size can be derived from existing defi_top5 liquidity without inspecting outcome performance. Null: no such rule exists without outcome inspection. Test: the pre-specified liquidity rule + tie-breaker (§9.1), rejected alternatives recorded; failure → INFEASIBLE-capacity.
- H2: cost lockable. Existing trade/cost logs support a Tier A/B (lockable) P90 round-trip cost at the reference size. Null: only Tier C/D is supported. Test: evidence tiering + Tier-B error bounding (§9.2); Tier C/D → non-lockable / INFEASIBLE-cost-data.
- H3: mapping monotonic/stable. A documented predictive→economic mapping exists and is monotonic/stable with gross/net expectancy over the action-policy range. Null: no monotonic/stable mapping. Test: §10; if unshown, the predictive metric (e.g. f1_buy) cannot be the primary gate → INFEASIBLE-mapping.
- H4: substrate powered. The current universe/folds are adequately powered: MDE_available ≤ E_pred_min. Null: MDE_available > E_pred_min (underpowered). Test: purged/embargoed block-bootstrap power simulation (§11); if N_min infeasible → INFEASIBLE-power.
- H5: valid null exists. A conservative, valid primary null-gate is constructible (preserving base rate, temporal dependence, purge/embargo, trade-count / action-policy). Null: no defensible null. Test: candidate comparison + invalidity criteria (§12); else INFEASIBLE-null.
Working assumptions surfaced (not tested in S02): defi_top5 as the control group · ATR-H4 triple-barrier
labels · GBT model class. These are inherited from the charter / ADR-0102 thesis and are tested in S04, not
here.
4. State of the art (EN, references)
Each invariant is grounded in established practice and mapped to the hypotheses:
- Leakage-aware validation — purged & embargoed cross-validation for overlapping, horizon-dependent labels [1]; grounds H4's dependency model and H5's null (§11, §12).
- Backtest overfitting under multiple trials — Probability of Backtest Overfitting / CSCV [2]; motivates the budget/FDR discipline and the anti-snooping boundary (§8.1, §14).
- Multiple-testing control — False Discovery Rate [3]; grounds the FDR budget (§14).
- Researcher degrees of freedom — the garden of forking paths [4]; grounds §8.1 + pre-registering all thresholds (H1–H5 are fixed before measurement).
- Pre-specified experimental design — design before measurement [5]; grounds the "values before test" stance (§1).
- Statistical power & MDE — power analysis and the minimum detectable effect [6]; grounds H4 (§11).
- Resampling under dependence — moving-block bootstrap [7] and the stationary bootstrap [8]; ground the block-bootstrap power contract (§11.1).
- Transaction cost & market impact — optimal execution / impact at size [9]; grounds the capacity-dependence of P90 cost (H2, §9).
- Downside-risk performance — the Sortino ratio and downside deviation [10]; grounds the R1 note (§13).
References:
- [1] López de Prado, Advances in Financial Machine Learning, 2018.
- [2] Bailey, Borwein, López de Prado & Zhu, "The Probability of Backtest Overfitting", J. Computational Finance, 2014.
- [3] Benjamini & Hochberg, "Controlling the False Discovery Rate", JRSS-B, 1995.
- [4] Gelman & Loken, "The garden of forking paths", 2013.
- [5] NIST/SEMATECH, e-Handbook of Statistical Methods (Design of Experiments).
- [6] Cohen, Statistical Power Analysis for the Behavioral Sciences, 1988.
- [7] Künsch, "The jackknife and the bootstrap for general stationary observations", Ann. Statist., 1989.
- [8] Politis & Romano, "The Stationary Bootstrap", JASA, 1994.
- [9] Almgren & Chriss, "Optimal execution of portfolio transactions", J. Risk, 2000.
- [10] Sortino & Price, "Performance measurement in a downside risk framework", J. Investing, 1994.
5. Definition of Done
S02 is complete only if:
- reference capacity derived by the §9.1 rule (non-deployment) or INFEASIBLE-capacity;
- P90 cost measured at Tier A/B (lockable) or Tier-C provisional (non-lockable, labelled) or INFEASIBLE-cost-data;
- E_econ_min and E_pred_min derived (distinct) or INFEASIBLE-mapping;
- MDE_available, and N_min if underpowered, simulated under the §11.1 contract (block design, deps, purge/embargo, reps, seed, sensitivity, reproducibility metadata) or INFEASIBLE-power;
- primary null-gate justified (candidates compared per §12) or INFEASIBLE-null;
- Sortino R1 resolved (definition note) or explicitly non-actionable; not used as a gate;
- budgets proposed with the full §14 content;
- every value carries a signed derivation (§16);
- no predictive run / training / Airflow launch performed (analysis-only attested).
Operational readiness (process/governance Story, ADR-0101 Inv 4): each derivation is reproducible from its
signed-provenance record (§16); the verdict (proceed-to-S03 / typed INFEASIBLE) routes to a recorded next
action (§15); rollback = revert the derivations (no runtime impact); owner handoff = the unlocked charter +
this dossier.
6. Consolidation & traceability
No dangling thread: every problem (§1) maps to a hypothesis (§3), a user story (§2), an approach section (§7–§18), and the literature (§4):
| Problem (§1) | Hypothesis | US | Approach section | Literature |
|---|---|---|---|---|
| metric ≠ money | H3 | US1 | §10 mapping (monotonicity) | [1][2] |
| optimistic/placeholder cost | H2 | US2, US6 | §9.2 evidence tiers | [9] |
| rules chosen after results | H1 + all | US5 | §8.1 anti-snooping · §9.1 capacity rule | [4][5] |
| conclude on too few obs | H4 | US3 | §11 power / MDE | [6][7][8] |
| honesty of the verdict | all | US4 | §15 typed INFEASIBLE | - |
| Sortino comparability | - | US7 | §13 Sortino R1 | [10] |
Decision routing. Each hypothesis whose null is not rejected → the corresponding typed INFEASIBLE
(§15) with its required artifact and next action; all nulls rejected (all values derived) → proceed to S03
(charter lock), subject to the Tier-A/B lockability gate (§9.2). Coherence check: the approach (§7–§18)
delivers exactly the values the problématisation (§1) requires and tests exactly the hypotheses (§3) the user
stories (§2) demand — no section without a thread, no thread without a section.
Part II: Approach / design (the how)
7. Decision boundary
S02 is analysis / calibration only. It derives values; it proves nothing about the signal.
Hard constraint (operator, 2026-06-09): no Airflow launcher, no training, no Phase-2 predictive run, no model selection. If any step requires a new training run, an Airflow launch, a cluster job, or a predictive Phase-2 run, S02 STOPs and returns to the operator — it does not launch autonomously. No charter lock (S03); no trading authority.
8. Inputs: allowed / forbidden
Allowed (read-only, existing): existing OHLCV cache · existing labels (ATR-H4 triple-barrier) · existing trade / cost logs · existing exploratory results (e.g. the cost-sensitivity report) · derivation notebooks / scripts (read-only over the above).
Forbidden: any new training · any predictive sweep / Phase-2 run · any cluster / Airflow launch · OOS threshold optimisation · meta-labeling · any model comparison that picks a winner.
8.1 Exploratory-results / anti-snooping boundary
Existing exploratory results may be used only to identify known failure modes and data-quality
threats. They may not be used to choose the reference capacity, primary metric, null-gate, universe,
label/horizon, action policy, FDR budget, or any value that would improve the chance of a future PROMOTE.
If an exploratory finding influences a tuple coordinate, that influence must be recorded as prior
rationale, and the corresponding tuple must be registered and budgeted (ADR-0102 Invariant 5).
9. Cost & capacity derivation (the first blocker)
E_econ_min is meaningless without a real P90 round-trip cost at a defined target capacity. History shows
cost was never pinned (the cost-sensitivity report swept {30,40,50,70} bps for lack of a real round-trip
cost; the §0bis economic probe found a keyed-but-wrong-cost placeholder trap). No abstract / default /
placeholder cost is permitted.
9.1 Reference-capacity rule (pre-specified, before cost)
Target capacity = a research reference capacity, explicitly non-deployment — it fixes a target order size before any cost estimation, and is not a deployment authorization. The reference order size must be derived by a documented liquidity rule, before cost estimation, not chosen after seeing cost outcomes. The rule MUST specify:
- sampling window;
- venue(s);
- asset-level liquidity metric (e.g. ADV, top-of-book depth, median hourly volume);
- sizing mode: per-asset equal-notional · liquidity-scaled · worst-asset-constrained;
- participation cap;
- minimum tradable notional;
- treatment of missing / stale liquidity;
- whether P90 cost is computed per asset then aggregated, or directly at portfolio level.
Tie-breaker. If several defensible reference-capacity rules exist, choose the most conservative rule that still satisfies the minimum-tradable-notional constraint. The selected rule and the rejected alternatives must be recorded — the capacity must not be chosen to make the cost favourable.
If no rule can be defended without inspecting outcome performance → INFEASIBLE (reason: capacity).
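The tie-breaker above can be sketched as a small selection function. This is a minimal illustration, not the plan's implementation: `CapacityRule`, `conservatism_rank` and `select_reference_capacity` are hypothetical names, and the sketch assumes each candidate rule has been given a conservatism rank before any cost estimation (how that rank is assigned is left to the rule author).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CapacityRule:
    name: str
    reference_notional: float       # order size the rule yields
    conservatism_rank: int          # assumption: higher = more conservative, fixed before cost estimation
    needs_outcome_inspection: bool  # True if the rule cannot be defended without looking at performance


def select_reference_capacity(
    candidates: Sequence[CapacityRule], min_tradable_notional: float
) -> Tuple[Optional[CapacityRule], List[CapacityRule]]:
    """Return (selected rule, rejected alternatives).

    The selected rule is None when no candidate is defensible,
    which corresponds to INFEASIBLE (reason: capacity).
    """
    defensible = [
        c for c in candidates
        if not c.needs_outcome_inspection
        and c.reference_notional >= min_tradable_notional
    ]
    if not defensible:
        return None, list(candidates)
    chosen = max(defensible, key=lambda c: c.conservatism_rank)
    rejected = [c for c in candidates if c is not chosen]
    return chosen, rejected


rules = [
    CapacityRule("equal-notional", 5000.0, 1, False),
    CapacityRule("worst-asset-constrained", 1500.0, 3, False),
    CapacityRule("below-minimum", 200.0, 5, False),
    CapacityRule("outcome-tuned", 3000.0, 4, True),
]
chosen, rejected = select_reference_capacity(rules, min_tradable_notional=1000.0)
```

Note that cost never appears as an input: the function cannot pick a capacity because it makes the cost favourable, and the rejected alternatives are returned so they can be recorded.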
9.2 P90 cost: evidence tiers
P90 cost is not assumed measured; it is graded by the evidence the existing data actually supports:
| Tier | Basis | Lockable? |
|---|---|---|
| A | directly observed fills at comparable asset / venue / size / regime | Yes |
| B | observed fills adjusted by a pre-specified liquidity model | Yes |
| C | order-book / spread proxy with a conservative stress multiplier | No — provisional, analysis-only bound, labelled non-lockable |
| D | unsupported placeholder | No → INFEASIBLE (reason: cost-data) |
Only Tier A or B may produce a lockable P90 cost (for S03). The derivation states which tier, with sample sizes, asset/venue/size/regime coverage, maker/taker mix, spread+impact+latency treatment, and tail slippage. Bad cost measurement is the fastest path from rigorous protocol to fake tradability — so the tier is explicit and auditable.
Tier B is bounded. Tier B requires a pre-specified adjustment model with documented inputs, calibration window, residual / error analysis, and a conservative stress factor. If the adjustment model cannot bound its error conservatively, Tier B is downgraded to Tier C (a sophisticated-but-fragile proxy is not lockable).
Tier C consequence for S03. A Tier-C provisional cost may complete S02 as an analysis artifact, but it cannot feed an S03 charter lock. If S02 ends with Tier C, the S03 path is blocked until Tier A/B evidence is produced, or the risk owner explicitly approves a non-lockable exploratory charter state (recorded). "S02 done, Tier C labelled" never authorises a lock.
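The tier logic can be sketched as follows. This is an illustrative sketch under stated assumptions: the names (`Tier`, `p90_nearest_rank`, `effective_tier`, `cost_outcome`) are hypothetical, and the nearest-rank percentile is only one of several possible P90 definitions; the plan does not fix which one the derivation uses.

```python
from enum import Enum
from typing import Sequence


class Tier(Enum):
    A = "A"  # directly observed fills
    B = "B"  # observed fills + pre-specified liquidity model
    C = "C"  # order-book / spread proxy with stress multiplier
    D = "D"  # unsupported placeholder


def p90_nearest_rank(costs_bps: Sequence[float]) -> float:
    """90th percentile by the nearest-rank method (assumed definition)."""
    if not costs_bps:
        raise ValueError("no cost observations")
    ordered = sorted(costs_bps)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def effective_tier(claimed: Tier, error_bounded: bool) -> Tier:
    """Tier B is downgraded to Tier C when its adjustment error is not bounded."""
    if claimed is Tier.B and not error_bounded:
        return Tier.C
    return claimed


def cost_outcome(tier: Tier) -> str:
    if tier in (Tier.A, Tier.B):
        return "lockable"
    if tier is Tier.C:
        return "provisional-non-lockable"  # may complete S02, blocks the S03 lock
    return "INFEASIBLE:cost-data"


observed = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
p90 = p90_nearest_rank(observed)
outcome = cost_outcome(effective_tier(Tier.B, error_bounded=False))
```

With ten observations the nearest-rank P90 is the ninth ordered value, and an unbounded Tier-B model ends up provisional rather than lockable.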
10. E_econ_min / E_pred_min mapping
- E_econ_min = economic break-even from the §9 (Tier A/B) conservative P90 cost + reference capacity.
- A documented predictive→economic mapping turns a predictive effect into per-trade economic expectancy → E_pred_min = the minimum predictive effect that meets E_econ_min. The two stay distinct (ADR-0102 Invariant 8); a predictive lift is not an economic edge.
- Stability / monotonicity. The mapping MUST document whether the predictive metric is monotonic with gross/net expectancy over the observed or simulated action-policy range. If monotonicity or stability cannot be shown, that predictive metric (e.g. f1_buy) cannot be the primary gate metric; the project history shows exactly this AUC/f1→tradability non-transfer.
- If no defensible mapping can be constructed from existing data → INFEASIBLE (reason: mapping).
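A minimal sketch of the derivation chain, assuming the mapping is available as a table of (predictive metric value, gross per-trade expectancy in bps) points and that break-even means gross expectancy at least equal to the P90 round-trip cost. Both assumptions and all function names are illustrative; the real mapping basis is whatever the derivation documents.

```python
from typing import List, Optional, Sequence, Tuple

MappingPoint = Tuple[float, float]  # (predictive metric value, gross expectancy in bps)


def e_econ_min_bps(p90_round_trip_cost_bps: float) -> float:
    """Assumed break-even: gross expectancy per trade must cover the P90 cost."""
    return p90_round_trip_cost_bps


def is_monotonic(points: Sequence[MappingPoint]) -> bool:
    """True if expectancy never decreases as the predictive metric increases."""
    ordered: List[MappingPoint] = sorted(points)
    return all(b[1] >= a[1] for a, b in zip(ordered, ordered[1:]))


def derive_e_pred_min(points: Sequence[MappingPoint], e_econ_min: float) -> Optional[float]:
    """Smallest metric value whose expectancy meets e_econ_min.

    Returns None when the mapping is empty, non-monotonic, or never reaches
    the threshold in the observed range; the metric then cannot serve as the
    primary gate on this evidence.
    """
    if not points or not is_monotonic(points):
        return None
    for metric, expectancy in sorted(points):
        if expectancy >= e_econ_min:
            return metric
    return None


mapping = [(0.50, -12.0), (0.55, 5.0), (0.60, 18.0), (0.65, 31.0)]
e_pred_min = derive_e_pred_min(mapping, e_econ_min_bps(18.0))
```

Keeping `e_econ_min_bps` and `derive_e_pred_min` as separate functions mirrors the requirement that the economic and predictive thresholds stay distinct.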
11. Power simulation → MDE / N_min
Power feasibility rule (corrected):
- Compute MDE_available (the minimum detectable effect at the current universe/folds) under the purged/embargoed dependency model.
- Derive E_pred_min from E_econ_min (§10).
- If MDE_available > E_pred_min, the current substrate is underpowered for the economically relevant effect.
- Estimate N_min = the smallest feasible universe / fold / sample configuration for which MDE(N_min) ≤ E_pred_min.
- If N_min is not operationally feasible → INFEASIBLE (reason: power) → widen universe / folds.

N_trades_min ≥ 30 is a minimum sanity floor, not a power guarantee: the MDE simulation decides feasibility (MDE wins over the heuristic).
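The feasibility rule can be read as a short decision function. The sketch below assumes the simulation has already produced an MDE for each candidate sample configuration, summarised here by a single size `n`; `power_feasibility`, `mde_by_n` and `max_feasible_n` are hypothetical names, and the status labels are illustrative.

```python
from typing import Dict, Optional, Tuple


def power_feasibility(
    mde_available: float,
    e_pred_min: float,
    mde_by_n: Dict[int, float],
    max_feasible_n: int,
) -> Tuple[str, Optional[int]]:
    """Return (status, N_min) following the power feasibility rule."""
    if mde_available <= e_pred_min:
        return "powered", None
    # Underpowered: look for the smallest configuration that would be adequate.
    adequate = [n for n, mde in mde_by_n.items() if mde <= e_pred_min]
    n_min = min(adequate) if adequate else None
    if n_min is None or n_min > max_feasible_n:
        return "INFEASIBLE:power", n_min
    return "underpowered-n-min-feasible", n_min


simulated = {1000: 0.08, 2000: 0.06, 4000: 0.045, 8000: 0.03}
status, n_min = power_feasibility(0.08, 0.05, simulated, max_feasible_n=5000)
```

The trade-count floor of 30 is deliberately absent from the function: it is a sanity check applied elsewhere, and only the MDE comparison decides feasibility.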
11.1 Power-simulation contract (reproducible)
The simulation MUST specify: block-length selection · whether blocks are by time / asset / fold / cluster · whether cross-asset dependence is preserved · whether resampling preserves label base rate · whether fold boundaries are respected · how purge/embargo are applied inside bootstrap samples · the statistic powered · how CIs / rejection criteria are computed · number of replications · seed policy · sensitivity of the result to block length. The power report carries full reproducibility metadata (§16).
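To make a few of the contract items concrete (block length, number of replications, seed policy, sensitivity to block length), here is a minimal moving-block bootstrap in the spirit of [7], written against the standard library only. It is a sketch under strong simplifying assumptions: it resamples a single return series by time blocks and does not implement purge/embargo, fold boundaries, cross-asset dependence or base-rate preservation, all of which the real simulation must specify. Function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def moving_block_resample(
    series: Sequence[float], block_length: int, rng: random.Random
) -> List[float]:
    """Concatenate randomly chosen contiguous blocks until len(series) is reached."""
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    last_start = n - block_length
    out: List[float] = []
    while len(out) < n:
        start = rng.randint(0, last_start)  # both bounds inclusive
        out.extend(series[start:start + block_length])
    return out[:n]


def bootstrap_mean_interval(
    series: Sequence[float], block_length: int, replications: int,
    seed: int, alpha: float = 0.05,
) -> Tuple[float, float]:
    """Percentile interval for the mean; the seed makes the run reproducible."""
    rng = random.Random(seed)
    means = sorted(
        sum(sample) / len(sample)
        for sample in (
            moving_block_resample(series, block_length, rng)
            for _ in range(replications)
        )
    )
    lower = means[int((alpha / 2) * (replications - 1))]
    upper = means[int((1 - alpha / 2) * (replications - 1))]
    return lower, upper


def block_length_sensitivity(
    series: Sequence[float], block_lengths: Sequence[int],
    replications: int, seed: int,
) -> Dict[int, Tuple[float, float]]:
    """Same statistic under several block lengths, for the sensitivity report."""
    return {
        b: bootstrap_mean_interval(series, b, replications, seed)
        for b in block_lengths
    }


returns = [0.01, -0.02, 0.015, 0.0, 0.03, -0.01, 0.005, 0.02, -0.015, 0.01]
report = block_length_sensitivity(returns, [2, 3, 5], replications=200, seed=42)
```

Passing an explicit `random.Random(seed)` instead of using the module-level generator keeps each replication stream isolated, which is one simple way to satisfy a seed policy.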
12. Null-gate candidate selection
Do not lock "random entry" as the default null-gate — a random-entry baseline is often too weak as the primary null. Random entry is a diagnostic null candidate, not the primary.
Primary null-gate selection rule. Among implementable candidate nulls, choose the most conservative
null that: preserves label base rate by asset/fold · preserves temporal dependence at the horizon scale ·
respects purge/embargo · preserves trade-count / action-policy constraints · is reproducible · does not leak
future information. If candidates disagree, conservatism wins unless the more conservative candidate is
demonstrably invalid. A more conservative null may be rejected only if it violates the registered
action-policy constraints, destroys the label base-rate structure, breaks purge/embargo comparability, or
produces an economically non-comparable trade-count / payoff distribution — and the rejection rationale must
be recorded (so "this null is too hard, therefore invalid" is not a valid rejection). If no defensible null
can be constructed (split/label issue) → INFEASIBLE (reason: null).
A tuple cannot PROMOTE by beating only diagnostic nulls (restated from ADR-0102 because S02 selects the candidate).
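The selection rule can be sketched as a walk down the candidates from most to least conservative, accepting a rejection only when it cites one of the four invalidity criteria listed above. The names (`NullCandidate`, `conservatism_rank`, the criterion strings) are hypothetical, and the sketch assumes the conservatism ordering has been fixed in advance.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

# The only accepted grounds for rejecting a more conservative null (from the rule above).
ALLOWED_INVALIDITY = {
    "violates-action-policy",
    "destroys-base-rate-structure",
    "breaks-purge-embargo-comparability",
    "non-comparable-trade-count-or-payoff",
}


@dataclass(frozen=True)
class NullCandidate:
    name: str
    conservatism_rank: int            # assumption: higher = harder to beat
    diagnostic_only: bool             # e.g. random entry
    invalidity_reason: Optional[str]  # None if the candidate is valid


def select_primary_null(
    candidates: Sequence[NullCandidate],
) -> Tuple[Optional[NullCandidate], Dict[str, str]]:
    """Return (primary null, recorded rejections); None means INFEASIBLE (reason: null)."""
    rejections: Dict[str, str] = {}
    for cand in sorted(candidates, key=lambda c: c.conservatism_rank, reverse=True):
        if cand.diagnostic_only:
            continue  # diagnostic nulls are never the primary gate
        if cand.invalidity_reason is None:
            return cand, rejections
        if cand.invalidity_reason not in ALLOWED_INVALIDITY:
            raise ValueError(
                f"{cand.name}: '{cand.invalidity_reason}' is not a valid rejection ground"
            )
        rejections[cand.name] = cand.invalidity_reason
    return None, rejections


candidates = [
    NullCandidate("random-entry", 1, True, None),
    NullCandidate("base-rate-preserving-shuffle", 2, False, None),
    NullCandidate("block-permutation", 3, False, "breaks-purge-embargo-comparability"),
]
primary, rejections = select_primary_null(candidates)
```

A rejection such as "too hard to beat" raises an error instead of being recorded, which is the code-level counterpart of the rule that difficulty alone never invalidates a null.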
13. Sortino R1 definition
R1 (Sortino) is an S02 deliverable, resolved before any actionable "sub-floor" statement.
- Research convention (default): Sortino on per-period strategy returns · MAR = 0 unless a risk policy defines otherwise · annualisation only if the return periodicity is fixed and documented. Do not freeze the convention without first checking how existing reports compute Sortino (comparability of the ~1.0 floor).
- Role boundary: Sortino R1 resolves metric comparability only. It is not an S02 gate and confers no sub-floor or deployment claim. Output = a Sortino definition note (input series · periodicity · MAR · annualisation factor · treatment of zero/downside observations · comparability of the ~1.0 floor) + a decision on whether prior Sortino observations are comparable / non-comparable / non-actionable.
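As an illustration of what the definition note has to pin down, the sketch below computes a Sortino ratio under one possible convention: per-period returns, MAR = 0 by default, downside deviation taken over all observations (not only the negative ones), and optional square-root-of-time annualisation. These choices are assumptions made for the example, not the resolved R1 definition, and existing reports may compute the ratio differently.

```python
import math
from typing import Optional, Sequence


def sortino_ratio(
    returns: Sequence[float],
    mar: float = 0.0,
    periods_per_year: Optional[int] = None,
) -> Optional[float]:
    """Mean excess return over MAR divided by downside deviation.

    Returns None when there is no downside observation, so the undefined
    case is explicit rather than silently mapped to infinity or zero.
    """
    if not returns:
        raise ValueError("no returns")
    excess = [r - mar for r in returns]
    downside_sq = [min(e, 0.0) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0.0:
        return None
    ratio = (sum(excess) / len(excess)) / downside_dev
    if periods_per_year is not None:
        ratio *= math.sqrt(periods_per_year)  # only valid for a fixed, documented periodicity
    return ratio


per_period = sortino_ratio([0.02, -0.01, 0.03, -0.02])
annualised = sortino_ratio([0.02, -0.01, 0.03, -0.02], periods_per_year=365)
```

Changing any one of the inputs (MAR, denominator convention, annualisation factor) moves the result, which is why prior observations against the ~1.0 floor are only comparable once the convention used to produce them is known.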
14. Budgets (proposed, not locked)
A budget proposal (ADR-0102 Invariant 6) MUST include: family definition · max number of registered tuples ·
max ONE-ITERATION attempts · FDR method · alpha / FDR level · allocation rule across tuples/stages · what
consumes budget · what does not · final-holdout access policy · stop rule when budget is exhausted. These
are proposals for the S03 lock, not locked values.
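For the "FDR method" and "alpha / FDR level" items, the sketch below shows the Benjamini-Hochberg step-up procedure of reference [3]. The plan does not state that this is the method the budget will propose; it is shown only as the standard procedure the cited paper defines, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level q.

    Step-up rule: find the largest rank k such that p_(k) <= k * q / m,
    then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], q=0.05)
```

Because the thresholds depend on the number of hypotheses `m`, the family definition and the maximum number of registered tuples have to be fixed before the procedure is run, which is why both appear in the budget proposal.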
15. INFEASIBLE taxonomy: typed reasons on the single verdict
ADR-0102 reconciliation. ADR-0102 Invariant 4 defines a single INFEASIBLE state (no DEFERRED-INFEASIBLE, no PARK). The types below are reason annotations on that one verdict, not new verdict states, so this plan does not contradict the ADR.
| reason | Trigger | Blocks S03 | Substrate work | Charter revision | Kills thesis | Required artifact |
|---|---|---|---|---|---|---|
| cost-data | no defensible (Tier A/B) P90 cost | yes | - | - | no | cost-evidence note (tiers attempted) |
| capacity | no defensible reference order size | yes | - | re-define/justify capacity | no | capacity-rule note |
| power | MDE_available > E_pred_min & N_min infeasible | yes | widen universe / folds | - | no | power report + N_min estimate |
| mapping | no defensible predictive→economic mapping | yes | - | change primary metric / mapping basis | no | mapping-attempt note |
| null | no defensible null (split/label) | yes | - | fix split / labels | no | null-candidate comparison |
Not allowed for any reason: weaken E_pred_min, change the metric, or use proxy/Tier-C/Tier-D cost to
proceed to an S03 lock. A typed INFEASIBLE is a successful S02 outcome.
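One way to keep "a single verdict with a typed reason" honest in code is to model the reason as an enum attached to one verdict record, with the table above as a lookup. The class and field names below are hypothetical; the row contents are copied from the taxonomy table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class InfeasibleReason(Enum):
    COST_DATA = "cost-data"
    CAPACITY = "capacity"
    POWER = "power"
    MAPPING = "mapping"
    NULL = "null"


# reason -> (substrate work, charter revision, required artifact)
NEXT_ACTION: Dict[InfeasibleReason, Tuple[Optional[str], Optional[str], str]] = {
    InfeasibleReason.COST_DATA: (None, None, "cost-evidence note (tiers attempted)"),
    InfeasibleReason.CAPACITY: (None, "re-define/justify capacity", "capacity-rule note"),
    InfeasibleReason.POWER: ("widen universe / folds", None, "power report + N_min estimate"),
    InfeasibleReason.MAPPING: (None, "change primary metric / mapping basis", "mapping-attempt note"),
    InfeasibleReason.NULL: (None, "fix split / labels", "null-candidate comparison"),
}


@dataclass(frozen=True)
class Verdict:
    state: str                       # always "INFEASIBLE" here: one state, many reasons
    reason: InfeasibleReason
    blocks_s03: bool
    kills_thesis: bool
    substrate_work: Optional[str]
    charter_revision: Optional[str]
    required_artifact: str


def infeasible(reason: InfeasibleReason) -> Verdict:
    substrate, charter, artifact = NEXT_ACTION[reason]
    # Every typed reason blocks S03 and none kills the thesis, per the table.
    return Verdict("INFEASIBLE", reason, True, False, substrate, charter, artifact)


verdict = infeasible(InfeasibleReason.POWER)
```

Because `state` is a single constant and the reason is a separate field, adding a new reason never creates a new verdict state.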
16. Deliverables
- The filled charter values (still UNLOCKED): P90 cost (with tier) · E_econ_min · mapping · E_pred_min · MDE / N_min · primary null-gate (justified) · proposed budgets · Sortino R1 note, each with a signed derivation, or the corresponding typed INFEASIBLE record + its required artifact.
"Signed derivation" means: immutable artifact path or MLflow run id · git commit SHA · input dataset versions / hashes · code version · parameters · author · review status · generated timestamp · a reproducible command or notebook execution record. "Signed" without these is just prose.
MLflow is provenance-only here: an artifact / provenance tracker for analysis notebooks — not evidence of a training or predictive run. An "MLflow run id" in S02 never denotes a model-training run.
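The "signed derivation" field list can be checked mechanically. The record type below is a hypothetical sketch: field names are invented for the example and simply mirror the list above; it does not describe an existing schema or any MLflow API.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class SignedDerivation:
    artifact_ref: str                      # immutable artifact path or MLflow run id
    git_commit_sha: str
    input_dataset_hashes: Dict[str, str]   # dataset name -> version / hash
    code_version: str
    parameters: Dict[str, object]          # may legitimately be empty
    author: str
    review_status: str
    generated_at: str                      # timestamp, format left to the project
    reproduce_command: str                 # command or notebook execution record


def missing_provenance(record: SignedDerivation) -> List[str]:
    """List the fields that are absent or blank; an empty list means 'signed'."""
    missing = [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]
    if not record.input_dataset_hashes and "input_dataset_hashes" not in missing:
        missing.append("input_dataset_hashes")
    return missing


record = SignedDerivation(
    artifact_ref="example-artifact-ref",
    git_commit_sha="example-sha",
    input_dataset_hashes={"example-dataset": "example-hash"},
    code_version="example-version",
    parameters={},
    author="",
    review_status="pending",
    generated_at="example-timestamp",
    reproduce_command="example-command",
)
gaps = missing_provenance(record)
```

All values in the example record are placeholders. A derivation whose `missing_provenance` list is non-empty is, in the document's words, just prose.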
17. Threats to validity
- cost logs not representative of the target order size;
- P90 unstable due to small sample;
- liquidity-regime shift;
- block bootstrap misspecified / cross-asset dependence underestimated;
- null too weak (diagnostic) or too strong (invalid);
- E_pred_min mapping fragile;
- prior exploratory results contaminate calibration choices (§8.1);
- Sortino comparability unresolved;
- the non-deployment reference capacity misread as deployment capacity;
- Tier-C provisional cost accidentally treated as lockable;
- reference-capacity rule selected to minimise estimated cost rather than conservatively represent deployable liquidity.
Each is addressed by the corresponding contract above; residual ones are recorded with the derivation.
18. Review gates
- S02 exit: committee review PASSED (this dossier first, then the derivations). INFEASIBLE is an allowed, successful outcome. No charter lock (S03); no run (S04).
Committee questions
- Is a conservative research-reference order size from defi_top5 liquidity (non-deployment, §9.1 rule) an acceptable capacity basis, or must S02 block on a real AUM/target before any cost?
- What evidence tier is required for P90 cost to be lockable in S03 (§9.2): is Tier B the floor, or only Tier A?
- Does the plan adequately prevent existing exploratory results from influencing S02 calibration choices (§8.1)?
- Is the purged/embargoed block-bootstrap MDE (§11.1 contract) the right power method given analysis-only over existing data?
- Is the primary-null selection rule (§12, not defaulting to random entry) correctly the conservative choice, and is "cannot PROMOTE on diagnostic nulls" correctly restated here?
- Does encoding the INFEASIBLE types as reasons on the single verdict (§15) correctly honour ADR-0102 Invariant 4?
- Tier-C outcome semantics (§9.2): if S02 produces only Tier-C cost evidence, should the Story be complete but S03 blocked (plan's position, when Tier C is clearly labelled non-lockable), or must S02 itself return INFEASIBLE-cost-data?