CVN-N015-EA-S02 — Requirements expression for the test stack foundation Epic

Date : 2026-05-06
Story : CVN-N015-EA-S02 (OP wp#117)
GH issue : #837
Parent Epic : CVN-N015-EA — Test stack foundation (OP wp#107 / GH #827)
Depends on : CVN-N015-EA-S01 (test strategy — OP wp#116 / GH #836 — currently in PR #852)
Operator decisions locked : 2026-05-06 — A=ok (initially 6 use cases U1-U6 ; amended same-day to 7 use cases by adding U7 = Story workflow gate enforcer per the strategic invariant in S01 §11 — see §2 amendment note), B=ok (CI budgets p95 ≤ 2 min fast / ≤ 10 min integration / ≤ 30 min nightly), C=time_machine
Status : proposed (committee plan_review pending)

0. Intent + scope

S02 refines the strategic taxonomy from S01 into the concrete functional + non-functional requirements of the foundation Epic (CVN-N015-EA only — pytest, Testcontainers, factories, flaky-test detector). The other Epics (EB-EI) refine S01 into their own subsets via their own requirements Stories.

S02 is the contract that S03 (architecture) builds against — without it, S03 risks designing against an under-specified target and re-deriving requirements at architecture time.

Out of scope (covered by other Stories) : the test-type catalogue itself (S01), the specific service contracts (S03), the implementation choices like fixture file layout (S04+), the data quality / ML behaviour / system-E2E / contract specifics (Epics EB-EI requirements Stories).

1. Operator decisions — locked 2026-05-06

#	Decision	Locked value	Rationale
A	Stakeholders + use cases scope	Initially 6 kept ; amended same-day to 7 : developer pre-commit (U1), PR reviewer (U2), oncall reading test failure (U3), FTF launcher pre-sweep (U4), MLflow promotion gate (U5), audit reviewer post-incident (U6), U7 added = Story workflow gate enforcer (per S01 §11 strategic invariant — ties U1-U6 together with enforcement at every ADR-81 transition)	Each maps to a distinct lifecycle stage where test feedback drives a decision ; trimming any leaves a stage without a usability bar. Operator confirmed the original 6, then amended to 7 same-day after the S01 §11 Story-phase × test-artefacts integration matrix surfaced the missing enforcement use case.
B	CI wall-clock budgets	fast tier p95 ≤ 2 min ; integration tier p95 ≤ 10 min ; nightly tier p95 ≤ 30 min	Aligned with strategy doc §5 performance budgets. 2 min fast is the "PR feedback loop stays interactive" bar (per UX research on attention switching). 10 min integration covers Testcontainers cold-start + the 5 services contracted in F2 (Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver) ; if downstream Epics EB-EI surface a need for a 6th service the budget gets re-evaluated via amendment Story, not pre-allocated here. 30 min nightly is the budget that lets us absorb data-quality + ML-behaviour suites without crossing the next-morning standup.
C	Time-freezing library	`time_machine`	C-extension patching (function pointers swapped at the C level) covers `time.time()` and pandas `Timestamp.now()` consistently (freezegun has known divergences here that have already bitten the FTF tests in the legacy suite). 5-10× faster than freezegun on setup/teardown — material at our CI budget B. Usage is close to freezegun's (`time_machine.travel` as decorator or context manager in place of `freeze_time`), which makes future migration cheap if the call ever flips. Roll-our-own ruled out (no exotic clock semantics in CVNTrade ; would be pure NIH).
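
Decision B states its budgets as p95 wall-clock bounds per tier. The sketch below shows one way such a bound could be checked from a list of recent run durations. The tier keys, the helper names `p95` and `within_budget`, and the nearest-rank percentile method are illustrative assumptions; S01 and S03 may settle on a different definition.

```python
# Budgets from decision B, in seconds (p95 wall-clock per CI tier).
BUDGETS_S = {"fast": 2 * 60, "integration": 10 * 60, "nightly": 30 * 60}


def p95(durations_s: list[float]) -> float:
    """Nearest-rank 95th percentile (assumed method, not fixed by this dossier)."""
    if not durations_s:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_s)
    # 1-based nearest rank = ceil(0.95 * n), done in integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(tier: str, durations_s: list[float]) -> bool:
    """True when the tier's p95 duration does not exceed its budget."""
    return p95(durations_s) <= BUDGETS_S[tier]
```

With 20 recorded fast-tier runs, a single slow outlier leaves the p95 untouched, while two slow runs out of 20 move it; that is the behaviour a p95 bound (rather than a max bound) is meant to give.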

2. Stakeholders + 7 canonical use cases (U1-U7)

Each use case is one person × one moment where test feedback determines a decision. The Story's requirements MUST trace back to ≥ 1 use case + a numerical bar.

#	Stakeholder	Moment	What they need from the suite	Numerical bar
U1	Developer	pre-commit (`make test-fast`)	"Did my last 30 min of work break the contracts I touched ?"	p95 wall-clock ≤ 30 s on developer laptop ; output ≤ 1 page when green ; failure points to ≤ 3 candidate files
U2	PR reviewer	reading the PR's CI summary	"Did the code change pass the gates the strategy says it must ?"	every gate listed in strategy §4 maps 1:1 to a CI check name with a clear PASSED/FAILED ; reviewer can decide on the gate without opening the run log
U3	Oncall	parsing a CI failure on `main` post-merge	"What broke + what runbook applies ?"	failure log includes (a) the failing assertion (b) the test docstring (c) the runbook link if the test belongs to a runbook'd subsystem ; resolution path under 10 min from failure email to acknowledged Slack
U4	FTF launcher	pre-sweep validation	"Will this 4-hour sweep crash on hour 1 because of a guardrail miss / config drift ?"	guardrail unit tests (per ADR-58) cover every variant in the matrix ; integration smoke against Testcontainers' `ftf_config` PG image runs in ≤ 60 s ; pre-sweep gate fails fast before submitting to Airflow
U5	MLflow promotion gate	post-FTF "should we promote this model ?"	"Does this candidate model pass the contract + ML-behaviour tests on the held-out window ?"	contract + ML-behaviour suites runnable against any MLflow run id in ≤ 10 min ; output is a single PASSED/FAILED + per-test detail in MLflow run tags
U6	Audit reviewer	post-incident root-cause	"Was this code path tested at the time of the incident, and what did the test assert ?"	every test has a stable docstring + git-blame trail ; `pytest --collect-only --quiet` is committed as snapshot per release tag so audit can reconstruct the test surface at any point in time
U7	Story workflow gate enforcer	every ADR-81 transition (`In specification → Specified` ; `In progress → Developed` ; `Developed → In testing` ; `Tested → Closed`)	"Are the test artefacts required to enter this Story phase present + verified ?"	per-Story `documentation/stories/<cvn_id>/tests/` folder populated with the required artefact at each gate (strategy.md → test code → test_run_.md → manifest.yaml) ; CI guardrail G5 fails the PR if a transition is attempted without its artefacts ; committee `plan_review` + `pr_review` issue an explicit verdict on the test artefacts (S01 §11.4)

Why all 7 (extends operator decision A — added 2026-05-06 per the strategic invariant in S01 §11) : U7 is the use case that ties the rest together — without it, U1-U6 are just developer-side ergonomics with no enforcement. The OP-Story-phase × test-artefacts integration matrix (S01 §11.1) is the canonical reference U7 enforces against. Trimming U7 = "tests are nice to have", which is exactly the discipline the Epic was created to break.
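
The U7 gate can be pictured as a lookup from an ADR-81 transition to the artefact that must already exist in the Story's `tests/` folder. The sketch below is a minimal illustration under stated assumptions: only the `manifest.yaml` rule at `Tested → Closed` is spelled out in this dossier, the other three pairings follow the order listed in the U7 row, and the glob patterns `test_*.py` and `test_run_*.md` as well as the function name `missing_artefacts` are hypothetical.

```python
from pathlib import Path

# Assumed mapping: ADR-81 transition -> glob that must match at least one file
# in documentation/stories/<cvn_id>/tests/. Only the manifest.yaml entry is
# stated explicitly in this dossier; the rest is illustrative.
REQUIRED_ARTEFACTS = {
    ("In specification", "Specified"): "strategy.md",
    ("In progress", "Developed"): "test_*.py",
    ("Developed", "In testing"): "test_run_*.md",
    ("Tested", "Closed"): "manifest.yaml",
}


def missing_artefacts(docs_root: Path, cvn_id: str, src: str, dst: str) -> list[str]:
    """Return the artefact patterns that are absent for this transition."""
    pattern = REQUIRED_ARTEFACTS.get((src, dst))
    if pattern is None:
        return []  # this transition carries no test-artefact gate in the sketch
    tests_dir = docs_root / "stories" / cvn_id / "tests"
    if not tests_dir.is_dir():
        return [pattern]  # folder not created yet, so the artefact is missing
    return [] if any(tests_dir.glob(pattern)) else [pattern]
```

A CI job in the spirit of G5 could call this with `docs_root=Path("documentation")` and fail the PR whenever the returned list is non-empty.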

3. Functional needs — what the foundation MUST support

#	Need	Driven by	Acceptance signal
F1	Test-type coverage : `unit`, `property`, `contract`, `cache`, `integration`, `DAG smoke` (the 6 from strategy §2 layer = base + middle that the foundation Epic owns)	strategy §2 + §3 cadence matrix	each type has a `pytest -m <type>` selector that returns a non-empty set ; CI matrix has a job per type
F2	Service virtualisation : Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver — runnable as Testcontainers ephemeral containers	U4 (FTF pre-sweep) + U5 (promotion gate)	`pytest tests/integration/services/` spins up + tears down the 5 services in CI under the integration budget B
F3	Data shape coverage : OHLCV (15min × 5 cryptos × 30 day window), model artefacts (xgb/lgb/cb pickles + MLflow metadata), FTF results rows (`finetune_runs` + `finetune_results` PG schema), signals (BUY/SELL/HOLD + confidence + filter trace)	F1 + F2 ; every data shape must have a factory	every shape has a `factories.py` builder under `tests/factories/` returning realistic-but-deterministic data ; factories accept `seed` for reproducibility
F4	Fixture / factory hierarchy : project-level `conftest.py` for cross-cutting fixtures (PG container, MLflow tracking server) ; per-package `conftest.py` for narrow fixtures (single OHLCV window) ; factories produce data, fixtures wire it into the system under test	U1 + U2 (PR reviewer reads the test)	fixtures discoverable via `pytest --fixtures` with one-line description ; no fixture > 50 LoC ; no fixture has > 3 parameters
F5	Flaky-test detector : retry layer + flake-rate dashboard ; CI flags any test that flips PASSED/FAILED across the same SHA on rerun	U2 (PR reviewer) + U3 (oncall)	`pytest --flaky-detect` (or `pytest-rerunfailures` config) re-runs FAILED tests once ; CI exposes flake rate per test on a Grafana panel ; tests with flake rate > 5 % over the last 7 days get an automatic GH issue
F6	Time-freezing primitive : `time_machine` integration (per decision C) ; pytest fixture `frozen_time(when)` ; works with `pd.Timestamp.now()`, `datetime.now()`, `time.time()`, `asyncio.get_event_loop().time()`	U4 (FTF determinism) + U5 (model artefacts time-deterministic)	`frozen_time` fixture in project `conftest.py` ; assertions on `pd.Timestamp.now()` + `datetime.now()` agree to the second when frozen ; `tests/README.md` (shipped with S04 plugin) documents the tz-aware pattern with ≥ 1 example each for naive, UTC-aware, and local-tz-aware usage (e.g., `pd.Timestamp('2024-01-01', tz='UTC')` vs naive) including explicit normalisation guidance ; committee `pr_review` on S04 PR MUST verify the examples + that tests demonstrating each tz mode exist
F7	Story-phase test integration (per S01 §11) : every ADR-81 transition has a test-artefact gate. `@pytest.mark.story("<cvn_id>")` discoverable via the S04 pytest plugin's `--story` CLI flag (per S03 architecture — pytest plugin design specified there) ; per-Story `documentation/stories/<cvn_id>/tests/` folder created at `In specification` and populated through transitions ; CI guardrail G5 (extends G1-G4) blocks a transition attempt without its artefacts ; committee `plan_review` + `pr_review` issue an explicit `tests:` sub-verdict (S01 §11.4)	U7 + S01 §11 strategic invariant + ADR-81 + ADR-68	`pytest --collect-only --story <cvn_id>` returns ≥ 1 test per acceptance criterion ; G5 CI job FAILS if `documentation/stories/<cvn_id>/tests/manifest.yaml` is absent at `Tested → Closed` ; committee `pr_review` verdict body includes the 4 sub-verdicts from S01 §11.4 (any `INSUFFICIENT` blocks merge)
F8	Test cases + datasets versioned + provenance-tracked (per S01 §11.3 vocabulary) : `tests/cases/<cvn_id>/` for data-driven scenarios (YAML or `pytest.parametrize`) ; `tests/datasets/<cvn_id>/` for content-addressed test data (DVC for > 10 MB, in-repo for ≤ 10 MB) ; every `(test_sha, dataset_sha)` pair pins to a deterministic outcome	U6 + S01 §11 strategic invariant + ADR-23 (version-pinned features)	every test that uses a dataset cites it via a fixture that reads from `tests/datasets/<cvn_id>/<name>.parquet` (or DVC pointer) ; CI fails if a test-data file is read from outside `tests/datasets/` (catches operator pulling from `data/` accidentally) ; the per-Story manifest.yaml lists `(test_sha, dataset_sha, run_id)` for each acceptance criterion
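
F3 asks for factories that return realistic-but-deterministic data and accept a `seed`. The sketch below illustrates that property with the standard library only. It is a stand-in, not the factory S04 will ship: the name `make_ohlcv_rows`, the list-of-dicts return type (instead of a `pd.DataFrame`), the fixed 2024-01-01 UTC origin and the price-walk parameters are all assumptions made for the example.

```python
import random
from datetime import datetime, timedelta, timezone


def make_ohlcv_rows(seed: int = 0, n_bars: int = 4, start_price: float = 100.0) -> list[dict]:
    """Build a deterministic 15-minute OHLCV series from a private RNG."""
    rng = random.Random(seed)  # local RNG: no global state shared across tests
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed, tz-aware origin
    rows, price = [], start_price
    for i in range(n_bars):
        open_ = price
        close = open_ * (1 + rng.uniform(-0.01, 0.01))
        high = max(open_, close) * (1 + rng.uniform(0, 0.005))
        low = min(open_, close) * (1 - rng.uniform(0, 0.005))
        rows.append({
            "ts": t0 + timedelta(minutes=15 * i),
            "open": open_, "high": high, "low": low, "close": close,
            "volume": rng.uniform(1.0, 10.0),
        })
        price = close  # next bar opens where this one closed
    return rows
```

Using a private `random.Random(seed)` rather than the module-level RNG keeps the factory independent of test ordering, which also helps with the parallelism contract in NF4.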

4. Non-functional needs — quantitative budgets

#	Need	Bound	Driven by	Owner of the metric
NF1	Fast tier wall-clock	p95 ≤ 2 min in CI ; p95 ≤ 30 s on developer laptop	decision B + U1	CI Grafana panel `cvntrade-test-tier-latency` (TBD by S03)
NF2	Integration tier wall-clock	p95 ≤ 10 min in CI	decision B + U4/U5	same panel
NF3	Nightly tier wall-clock	p95 ≤ 30 min	decision B	same panel
NF4	Parallelism contract	Each test MUST be safe under `pytest -n auto` (no shared mutable state across tests). Postgres / Redis / MLflow containers MUST be either shared with read-only assertions OR per-worker spawned.	F4 + decision B (parallelism is the only way to hit the wall-clock budgets)	enforced by an `xdist`-mode CI job that runs the full suite under `-n auto` ; failure breaks merge
NF5	Determinism	Every test that involves a model train, a `np.random` call, a UUID, or a `datetime.now()` MUST seed explicitly. Re-running the same test under the same SHA produces the same artefact bit-for-bit.	U6 (audit) + U4 (FTF reproducibility)	`pytest-randomly` configured + a `seed_all` fixture in project `conftest.py` that sets `random.seed`, `np.random.seed`, `os.environ['PYTHONHASHSEED']`, and torch / xgboost seeds in one call
NF6	Reproducibility from clean clone	Any developer must be able to run `git clone … && cd champollion && make test-fast` in < 5 min total (clone + `.venv_airflow` create + dep install + first test run)	U1 + onboarding new contributor	timed in CI's "fresh-clone" smoke job ; failure breaks merge
NF7	Open-source only	No paid SaaS in the test stack. (Testcontainers is OSS ; pytest is OSS ; time_machine is OSS ; everything must stay this way.)	constraint from issue body + cost discipline	dependency licence check in CI (`pip-licenses --fail-on=Commercial`)
NF8	Python 3.12 fixed	All test infra targets Python 3.12. No conditional logic for 3.11 or 3.13.	constraint from issue body	CI matrix has Python 3.12 only ; pytest skips conditioned on `sys.version_info` are forbidden by a lint rule
NF9	macOS spawn-mode aware	Tests touching multiprocessing MUST work under macOS spawn mode (workers re-setup `sys.path` + `dotenv` per CLAUDE.md memory).	constraint from issue body + CLAUDE.md memory	CI matrix runs the multiprocessing-touching tests on a macOS runner once per nightly cycle ; if the macOS run diverges from Linux, the test is flagged as macOS-incompatible until fixed
NF10	Reproducibility from `(git_sha, dataset_sha, deps_lock_sha)`	Any historic test run MUST be reproducible within ε-tolerance by checking out the recorded `git_sha` + materialising the recorded `dataset_sha` (DVC pull or in-repo file at that SHA) + restoring the recorded `deps_lock_sha` (S03 picks the lock-file format — likely `requirements.lock` or `uv.lock`). Tolerance bands are explicit in the manifest schema : exact for categoricals (PASSED/FAILED, predicted class), ε ≤ 1e-6 for float metrics (f1_buy, sortino, return %), ε ≤ 0.1 % for cumulative wealth simulations. "Bit-for-bit" is not promised — it would require pinning Linux kernel versions + glibc + BLAS implementations, which is out of scope for v1. Strict dependency pinning (the lock-file SHA in the manifest) absorbs the dominant non-determinism source.	F8 + NF5 + U6 (audit)	for any closed Story, `make reproduce CVN_ID=<cvn_id>` checks out the manifest's recorded SHAs, materialises the datasets, restores the deps lock, runs `pytest --story <cvn_id>` (per S03 plugin), and asserts the recorded outcomes match within their declared ε bands. CI runs this on a random closed Story per nightly cycle (canary) ; failure auto-files a GH issue tagged `audit-regression`.
NF11	Test artefacts immutability after `Closed`	Once a Story transitions to `Closed`, its `documentation/stories/<cvn_id>/tests/manifest.yaml` is immutable. Any future commit modifying a `Closed` Story's manifest is rejected by a CI guardrail (G6).	U6 (audit) + U7 (Story workflow gate enforcer) + ADR-81 closure semantics	git pre-commit hook OR CI gate (G6) compares the staged file against the `manifest_pinned_at_close` SHA (recorded in OP wp at closure time) ; mismatch fails the commit/PR. Amendments require an explicit `amendment Story` per S01's invariant (cannot be done in-place).
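
NF10 defines three tolerance bands for comparing a recorded outcome with its reproduction. The sketch below shows how such a comparison could look. It assumes that the 1e-6 bound on float metrics is an absolute tolerance and that the 0.1 % bound on cumulative wealth is relative; the manifest schema is the place where that would actually be decided, and the `kind` labels and function name are hypothetical.

```python
import math

# Tolerance bands from NF10. Absolute vs relative interpretation is an
# assumption of this sketch, not something the dossier fixes.
FLOAT_ABS_TOL = 1e-6
WEALTH_REL_TOL = 1e-3  # 0.1 %


def outcome_matches(kind: str, recorded, reproduced) -> bool:
    """Compare one recorded outcome with its reproduction, per NF10 band."""
    if kind == "categorical":  # PASSED/FAILED, predicted class: exact match
        return recorded == reproduced
    if kind == "float_metric":  # f1_buy, sortino, return %
        return math.isclose(recorded, reproduced, rel_tol=0.0, abs_tol=FLOAT_ABS_TOL)
    if kind == "cumulative_wealth":
        return math.isclose(recorded, reproduced, rel_tol=WEALTH_REL_TOL, abs_tol=0.0)
    raise ValueError(f"unknown outcome kind: {kind}")
```

A `make reproduce` run could apply this check to every `(test_sha, dataset_sha, run_id)` entry in the manifest and report the first mismatch.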

5. Constraints — explicit boundary conditions

Open-source only (NF7) — no DataDog test agent, no Sentry test transport, no commercial pytest plugins. Self-hosted observability via Grafana + Loki + Prometheus.
Python 3.12 fixed (NF8) — chosen because 3.12 is the production Airflow / Console runtime (memory : .venv_airflow).
macOS spawn-mode aware (NF9) — operator's primary dev box is macOS ; CI Linux runners alone don't catch spawn-mode bugs.
Single-operator reality — no cross-team test-ownership negotiation (per strategy §7).
Existing Testcontainers Story #757 supersedes any conflicting design here — when S03 architecture lands, it consumes #757's container conventions.

6. Glossary — fixture / factory / mock / stub / spy

Avoids the pytest ecosystem's frequent debates by pinning canonical definitions for this Epic.

Term	Canonical definition (this Epic)	Example
fixture	A pytest construct that produces an object + handles its lifecycle (setup / teardown). Discoverable via `pytest --fixtures`. SHOULD wire factory output into the system under test ; SHOULD NOT contain non-trivial business logic.	`pg_engine` fixture spinning up a Testcontainers Postgres + yielding a SQLAlchemy engine + tearing down the container
factory	A pure function (or `factory_boy`-style class) that produces a deterministic data instance. Accepts `seed` for reproducibility. Lives under `tests/factories/`.	`make_ohlcv_window(seed=0, n_bars=300, freq='15min', crypto='BTCUSDC')` returning a `pd.DataFrame`
mock	A test double that records calls AND can be configured to return preset values. Use for incoming dependencies (the SUT calls something we want to control). Use sparingly — prefer real implementations + factories for in-process collaborators.	`mock_mlflow_client.log_metric.assert_called_with('f1_buy', 0.42)`
stub	A simpler test double : returns a preset value, does NOT record calls. Use when you need a dependency to "say yes" without verifying the interaction.	`stub_kill_switch.is_engaged = lambda: False`
spy	Wraps a real implementation, records calls, lets the real call through. Use when you want to assert "X was called" without changing the SUT's behaviour.	`spy = mocker.spy(filter_chain, 'execute') ; …; assert spy.call_count == 1`

The "system vs integration" debate is closed by strategy §10 ; not re-opened here. S02 inherits those definitions.
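
The three test-double definitions in the glossary can be contrasted side by side. The sketch below uses only `unittest.mock` from the standard library, where `Mock(wraps=...)` plays the role that `mocker.spy` plays in the glossary example. `FilterChain` and `StubKillSwitch` are hypothetical classes written for this illustration, not CVNTrade code.

```python
from unittest.mock import Mock


class FilterChain:
    """Hypothetical collaborator, used only for this illustration."""

    def execute(self, signal: str) -> str:
        return signal.upper()


class StubKillSwitch:
    """stub: returns a preset value; the test never inspects how it was called."""

    def is_engaged(self) -> bool:
        return False


# mock: records calls and can be configured with preset return values.
mock_mlflow_client = Mock()
mock_mlflow_client.log_metric("f1_buy", 0.42)
mock_mlflow_client.log_metric.assert_called_with("f1_buy", 0.42)

# spy: wraps the real object, records calls, lets the real call through.
real_chain = FilterChain()
spy_chain = Mock(wraps=real_chain)
result = spy_chain.execute("buy")  # runs FilterChain.execute for real
```

The practical difference shows in what the test asserts afterwards: nothing about calls for the stub, the call arguments for the mock, and both the real return value and the call count for the spy.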

7. Acceptance criteria

#	Criterion	Evidence
1	Committee `plan_review` PASSED (or operator waiver in PR body)	session JSON link in PR body
2	Dossier merged at `documentation/reviews/2026-05-06-cvn-n015-ea-s02-requirements-plan.md`	this file
3	Every requirement (F1-F8 + NF1-NF11) traces back to ≥ 1 use case (U1-U7)	column "Driven by" in §3 + §4
4	Every requirement has a numerical acceptance bar	column "Acceptance signal" / "Bound" in §3 + §4
5	All 3 operator decisions A/B/C documented with rationale	§1 of this dossier
6	Glossary covers fixture / factory / mock / stub / spy with one example each	§6
7	wp#117 OP transition `New → In specification → Specified`	OP audit comment trail
8	F7 (Story-phase test integration) + F8 (test cases + datasets versioning) requirements explicitly trace to S01 §11 strategic invariant + future ADR-0087 / ADR-0088	references in §3 F7 + F8 rows + S01 cross-link
9	U7 (Story workflow gate enforcer) explicitly maps to ADR-81 transitions × test artefacts	§2 U7 row + cross-link to S01 §11.1 canonical matrix
10	Committee `plan_review` + `pr_review` test-verdict scope (S01 §11.4) is captured as a requirement	§3 F7 acceptance signal explicitly references S01 §11.4

8. Out of scope (explicit)

Choosing the specific factory library (e.g., factory_boy vs hand-rolled) — that's an S03 architecture decision against the F4 requirement.
Specifying the Testcontainers image versions — S03 picks the versions ; S02 only requires "the 5 services exist as containers".
Setting up the CI pipeline — S04 implementation Story.
Authoring the actual fixtures + factories — S04+ implementation work.
Cross-Epic concerns (data quality, ML behaviour, system-E2E, etc.) — those Epics' own requirements Stories handle their own scope.
Visual / UI testing — out of foundation Epic ; covered by EA-adjacent Story if/when Console deserves screenshot-diff coverage.
Mutation testing (mutmut, cosmic-ray) — interesting but premature ; revisit after the foundation lands.

9. Risks

Risk	Likelihood	Impact	Mitigation
Testcontainers cold-start dominates the 10 min integration budget	medium	medium	Pre-pull container images in CI cache (NF6 fresh-clone bound is separate from steady-state CI) ; use `--reuse` flag where safe ; container-per-test-class instead of per-test where state is read-only
Flaky-test detector's 5 % flake rate threshold (F5) is too strict on day 1	medium	low	Bootstrap with 10 % threshold + ratchet down monthly ; per-test allow-list for known-flaky external deps until they're moved to mocks
`time_machine` C-level patching surfaces a subtle bug on macOS spawn-mode workers	low	medium	NF9 macOS nightly job catches it within 24h ; if it surfaces, fall back to `freezegun` for those specific tests with an explicit `xfail_on_spawn` marker until upstream fix
Single-operator means no second pair-of-eyes on the requirements during normal flow	high	low	Committee `plan_review` IS the second pair-of-eyes for this Story ; operator-discipline rule "every requirements Story MUST go through committee" enforced by ADR-68
Scope creep — operator or committee asks to add EB-EI cross-cutting requirements here	medium	medium	§8 out-of-scope is exhaustive ; any cross-cutting requirement gets a dedicated Story under the relevant Epic, NOT a §3 / §4 row addition
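
The F5 threshold and the bootstrap value proposed in the risk table both depend on how "flake rate" is computed, which this dossier does not define. The sketch below assumes one plausible definition: for each test, the share of commit SHAs on which reruns produced both PASSED and FAILED. The function names and the tuple layout of `runs` are hypothetical.

```python
from collections import defaultdict


def flake_rates(runs: list[tuple[str, str, str]]) -> dict[str, float]:
    """runs holds (test_id, git_sha, outcome) with outcome 'PASSED' or 'FAILED'.

    Assumed definition: share of a test's SHAs on which it showed both outcomes.
    """
    outcomes: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for test_id, sha, outcome in runs:
        outcomes[test_id][sha].add(outcome)
    return {
        test_id: sum(len(seen) > 1 for seen in by_sha.values()) / len(by_sha)
        for test_id, by_sha in outcomes.items()
    }


def flaky_tests(runs: list[tuple[str, str, str]], threshold: float = 0.05) -> list[str]:
    """Tests whose flake rate exceeds the threshold (F5 uses 5 %)."""
    return sorted(t for t, rate in flake_rates(runs).items() if rate > threshold)
```

Passing `threshold=0.10` would model the bootstrap setting from the risk table; a consistently failing test has a flake rate of zero under this definition and is left to the normal failure path.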

10. Sequencing + dependencies

S01 (test strategy, wp#116, PR #852) ────┐
                                          ├──► S02 (this Story, wp#117)
                                          │     plan_review committee
                                          │     PR open + CR + merge
                                          │     wp#117 New → In spec → Specified → In progress → Developed → In testing → Tested → Closed
                                          │
                                          └──► (other Epics EB-EI requirements Stories)
                                                each refines its own subset of S01

S02 PR can land BEFORE S01 PR merges (S02 references S01 strategy via the PR branch path).
S02 closure requires S01 to have merged (the trace requires the strategy doc to exist on main).

S02 ──► S03 (foundation Epic architecture)
       designs against S02 requirements
S03 ──► S04+ (implementation Stories per service / fixture layer)

Single-WIP rule (per ADR-69) : S02 sits in the Specified buffer once committee plan_review PASSES, waiting for capacity in In progress. Currently In progress Stories : wp#103 (S14 Track 1 leakage), wp#46 (S07 Track 2), wp#45 (S06 Track 11). S02 doesn't compete for capacity until one of those closes.

11. References

OP wp#117 / GH #837 : this Story's tracking
S01 dossier : `documentation/reviews/2026-05-05-cvn-n015-ea-s01-test-strategy-plan.md` (lands on PR #852 merge — backtick-quoted path here instead of markdown link to avoid mkdocs strict cross-PR resolution failure)
S01 strategy doc : `documentation/strategy/CVN-N015-test-strategy.md` (lands on PR #852 merge)
S01 ADR : `documentation/adr/0083-test-taxonomy-and-gate-hierarchy.md` (lands on PR #852 merge)
Testcontainers integration Story : #757 — F2 requirement is shaped by its design
Existing pytest factories Story : #586 — F4 inherits its scope
Flaky-test detector Story : #756 — F5 is its requirements expression
ADR-58 (FTF guardrails) — drives U4 (FTF launcher pre-sweep)
ADR-68 (committee = default review channel) — drives §7 criterion 1
ADR-77 (MkDocs SSoT) — this dossier is the SSoT for S02 requirements
ADR-81 (8-state Story workflow) — drives §7 criterion 7

12. Plan-review questions for committee

Use case U6 (audit reviewer) numerical bar : a `pytest --collect-only` snapshot per release tag is the proposed evidence. Is that sufficient to reconstruct the test surface during an incident review six months later, or do we need a richer artefact (e.g., mapping test → covered code lines via `coverage.py` per release) ?
NF4 parallelism contract : we mandate `-n auto` safety as a merge gate. Is this realistic from day 1 (Testcontainers spin-up cost may dominate) or should we phase it in (allow `-n 1` initially with a 1-month migration window) ?
NF6 fresh-clone < 5 min : aggressive bound. The current `.venv_airflow` install is ~3 min wall-clock on a fast laptop, which leaves a tight margin. Acceptable as a stretch goal with an alert at 4 min, or relax to < 7 min ?
Glossary §6 mock vs spy : the line is famously fuzzy. Is the "use mock for incoming, spy for outgoing-while-keeping-real-behaviour" rule clear enough for the next contributor, or do we want a short decision tree (if you need to verify a call AND let the real impl run → spy ; otherwise mock) ?
Scope of "service virtualisation" F2 : we list 5 services (Postgres, Redis, MinIO, MLflow, Airflow). Are we missing any (e.g., the Console Streamlit app, Loki, Grafana), or does the test stack consciously stop at the 5 ?
Decision C (`time_machine`) lock-in risk : if the lib goes unmaintained in 18 months, the migration cost back to `freezegun` is a few hours per test. Is that risk profile acceptable vs picking the more entrenched `freezegun` from day 1 ?