CVN-N015-EA-S02 — Requirements expression for the test stack foundation Epic
Date : 2026-05-06
Story : CVN-N015-EA-S02 (OP wp#117)
GH issue : #837
Parent Epic : CVN-N015-EA — Test stack foundation (OP wp#107 / GH #827)
Depends on : CVN-N015-EA-S01 (test strategy — OP wp#116 / GH #836 — currently in PR #852)
Operator decisions locked : 2026-05-06 — A=ok (initially 6 use cases U1-U6 ; amended same-day to 7 use cases by adding U7 = Story workflow gate enforcer per the strategic invariant in S01 §11 — see §2 amendment note), B=ok (CI budgets p95 ≤ 2 min fast / ≤ 10 min integration / ≤ 30 min nightly), C=time_machine
Status : proposed (committee plan_review pending)
0. Intent + scope
S02 refines the strategic taxonomy from S01 into the concrete functional + non-functional requirements of the foundation Epic (CVN-N015-EA only — pytest, Testcontainers, factories, flaky-test detector). The other Epics (EB-EI) refine S01 into their own subsets via their own requirements Stories.
S02 is the contract that S03 (architecture) builds against — without it, S03 risks designing against an under-specified target and re-deriving requirements at architecture time.
Out of scope (covered by other Stories) : the test-type catalogue itself (S01), the specific service contracts (S03), the implementation choices like fixture file layout (S04+), the data quality / ML behaviour / system-E2E / contract specifics (Epics EB-EI requirements Stories).
1. Operator decisions — locked 2026-05-06
| # | Decision | Locked value | Rationale |
|---|---|---|---|
| A | Stakeholders + use cases scope | Initially 6 kept ; amended same-day to 7 : developer pre-commit (U1), PR reviewer (U2), oncall reading test failure (U3), FTF launcher pre-sweep (U4), MLflow promotion gate (U5), audit reviewer post-incident (U6), U7 added = Story workflow gate enforcer (per S01 §11 strategic invariant — ties U1-U6 together with enforcement at every ADR-81 transition) | Each maps to a distinct lifecycle stage where test feedback drives a decision ; trimming any leaves a stage without a usability bar. Operator confirmed the original 6, then amended to 7 same-day after the S01 §11 Story-phase × test-artefacts integration matrix surfaced the missing enforcement use case. |
| B | CI wall-clock budgets | fast tier p95 ≤ 2 min ; integration tier p95 ≤ 10 min ; nightly tier p95 ≤ 30 min | Aligned with strategy doc §5 performance budgets. 2 min fast is the "PR feedback loop stays interactive" bar (per UX research on attention switching). 10 min integration covers Testcontainers cold-start + the 5 services contracted in F2 (Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver) ; if downstream Epics EB-EI surface a need for a 6th service the budget gets re-evaluated via amendment Story, not pre-allocated here. 30 min nightly is the budget that lets us absorb data-quality + ML-behaviour suites without crossing the next-morning standup. |
| C | Time-freezing library | `time_machine` | C-level ctypes patching covers `time.time()` and pandas `Timestamp.now()` consistently (freezegun has known divergences here that have already bitten the FTF tests in the legacy suite). 5-10× faster than freezegun on setup/teardown — material at our CI budget B. Drop-in syntax compat with freezegun makes future migration cheap if the call ever flips. Roll-our-own ruled out (no exotic clock semantics in CVNTrade ; would be pure NIH). |
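
The p95 budgets locked in decision B can be checked mechanically once per-tier wall-clock samples exist. A minimal sketch, assuming samples are expressed in seconds ; `BUDGETS_S`, `p95` and `within_budget` are hypothetical names, and the inclusive percentile method is an assumption (this dossier does not fix one) :

```python
from statistics import quantiles

# Budgets from decision B, converted to seconds.
BUDGETS_S = {"fast": 2 * 60, "integration": 10 * 60, "nightly": 30 * 60}


def p95(samples):
    """95th percentile of wall-clock samples (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def within_budget(tier, samples):
    """True when the tier's p95 wall-clock stays inside its locked budget."""
    return p95(samples) <= BUDGETS_S[tier]


if __name__ == "__main__":
    fast_runs = [90.0] * 19 + [500.0]  # one slow outlier among 20 runs
    print("fast tier p95:", p95(fast_runs), "ok:", within_budget("fast", fast_runs))
```

A single slow outlier in 20 runs does not break the budget under this method, whereas a tier where half the runs are slow does, which is the behaviour a p95 bound is meant to express.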
2. Stakeholders + 7 canonical use cases (U1-U7)
Each use case is one person × one moment where test feedback determines a decision. The Story's requirements MUST trace back to ≥ 1 use case + a numerical bar.
| # | Stakeholder | Moment | What they need from the suite | Numerical bar |
|---|---|---|---|---|
| U1 | Developer | pre-commit (`make test-fast`) | "Did my last 30 min of work break the contracts I touched ?" | p95 wall-clock ≤ 30 s on developer laptop ; output ≤ 1 page when green ; failure points to ≤ 3 candidate files |
| U2 | PR reviewer | reading the PR's CI summary | "Did the code change pass the gates the strategy says it must ?" | every gate listed in strategy §4 maps 1:1 to a CI check name with a clear PASSED/FAILED ; reviewer can decide on the gate without opening the run log |
| U3 | Oncall | parsing a CI failure on `main` post-merge | "What broke + what runbook applies ?" | failure log includes (a) the failing assertion (b) the test docstring (c) the runbook link if the test belongs to a runbook'd subsystem ; resolution path under 10 min from failure email to acknowledged Slack |
| U4 | FTF launcher | pre-sweep validation | "Will this 4-hour sweep crash on hour 1 because of a guardrail miss / config drift ?" | guardrail unit tests (per ADR-58) cover every variant in the matrix ; integration smoke against Testcontainers' ftf_config PG image runs in ≤ 60 s ; pre-sweep gate fails fast before submitting to Airflow |
| U5 | MLflow promotion gate | post-FTF "should we promote this model ?" | "Does this candidate model pass the contract + ML-behaviour tests on the held-out window ?" | contract + ML-behaviour suites runnable against any MLflow run id in ≤ 10 min ; output is a single PASSED/FAILED + per-test detail in MLflow run tags |
| U6 | Audit reviewer | post-incident root-cause | "Was this code path tested at the time of the incident, and what did the test assert ?" | every test has a stable docstring + git-blame trail ; pytest --collect-only --quiet is committed as snapshot per release tag so audit can reconstruct the test surface at any point in time |
| U7 | Story workflow gate enforcer | every ADR-81 transition (In specification → Specified ; In progress → Developed ; Developed → In testing ; Tested → Closed) | "Are the test artefacts required to enter this Story phase present + verified ?" | per-Story `documentation/stories/<cvn_id>/tests/` folder populated with the required artefact at each gate (strategy.md → test code → test_run_plan_review + pr_review issue an explicit verdict on the test artefacts (S01 §11.4) |
Why all 7 (extends operator decision A — added 2026-05-06 per the strategic invariant in S01 §11) : U7 is the use case that ties the rest together — without it, U1-U6 are just developer-side ergonomics with no enforcement. The OP-Story-phase × test-artefacts integration matrix (S01 §11.1) is the canonical reference U7 enforces against. Trimming U7 = "tests are nice to have", which is exactly the discipline the Epic was created to break.
3. Functional needs — what the foundation MUST support
| # | Need | Driven by | Acceptance signal |
|---|---|---|---|
| F1 | Test-type coverage : unit, property, contract, cache, integration, DAG smoke (the 6 from strategy §2 layer = base + middle that the foundation Epic owns) | strategy §2 + §3 cadence matrix | each type has a `pytest -m <type>` selector that returns a non-empty set ; CI matrix has a job per type |
| F2 | Service virtualisation : Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver — runnable as Testcontainers ephemeral containers | U4 (FTF pre-sweep) + U5 (promotion gate) | pytest tests/integration/services/ spins up + tears down the 5 services in CI under the integration budget B |
| F3 | Data shape coverage : OHLCV (15min × 5 cryptos × 30 day window), model artefacts (xgb/lgb/cb pickles + MLflow metadata), FTF results rows (`finetune_runs` + `finetune_results` PG schema), signals (BUY/SELL/HOLD + confidence + filter trace) | F1 + F2 ; every data shape must have a factory | every shape has a `factories.py` builder under `tests/factories/` returning realistic-but-deterministic data ; factories accept `seed` for reproducibility |
| F4 | Fixture / factory hierarchy : project-level `conftest.py` for cross-cutting fixtures (PG container, MLflow tracking server) ; per-package `conftest.py` for narrow fixtures (single OHLCV window) ; factories produce data, fixtures wire it into the system under test | U1 + U2 (PR reviewer reads the test) | fixtures discoverable via `pytest --fixtures` with one-line description ; no fixture > 50 LoC ; no fixture has > 3 parameters |
| F5 | Flaky-test detector : retry layer + flake-rate dashboard ; CI flags any test that flips PASSED/FAILED across the same SHA on rerun | U2 (PR reviewer) + U3 (oncall) | pytest --flaky-detect (or pytest-rerunfailures config) re-runs FAILED tests once ; CI exposes flake rate per test on a Grafana panel ; tests with flake rate > 5 % over the last 7 days get an automatic GH issue |
| F6 | Time-freezing primitive : `time_machine` integration (per decision C) ; pytest fixture `frozen_time(when)` ; works with `pd.Timestamp.now()`, `datetime.now()`, `time.time()`, `asyncio.get_event_loop().time()` | U4 (FTF determinism) + U5 (model artefacts time-deterministic) | `frozen_time` fixture in project `conftest.py` ; assertions on `pd.Timestamp.now()` + `datetime.now()` agree to the second when frozen ; `tests/README.md` (shipped with S04 plugin) documents the tz-aware pattern with ≥ 1 example each for naive, UTC-aware, and local-tz-aware usage (e.g., `pd.Timestamp('2024-01-01', tz='UTC')` vs naive) including explicit normalisation guidance ; committee pr_review on S04 PR MUST verify the examples + that tests demonstrating each tz mode exist |
| F7 | Story-phase test integration (per S01 §11) : every ADR-81 transition has a test-artefact gate. `@pytest.mark.story("<cvn_id>")` discoverable via the S04 pytest plugin's `--story` CLI flag (per S03 architecture — pytest plugin design specified there) ; per-Story `documentation/stories/<cvn_id>/tests/` folder created at In specification and populated through transitions ; CI guardrail G5 (extends G1-G4) blocks a transition attempt without its artefacts ; committee plan_review + pr_review issue an explicit `tests:` sub-verdict (S01 §11.4) | U7 + S01 §11 strategic invariant + ADR-81 + ADR-68 | `pytest --collect-only --story <cvn_id>` returns ≥ 1 test per acceptance criterion ; G5 CI job FAILS if `documentation/stories/<cvn_id>/tests/manifest.yaml` is absent at Tested → Closed ; committee pr_review verdict body includes the 4 sub-verdicts from S01 §11.4 (any INSUFFICIENT blocks merge) |
| F8 | Test cases + datasets versioned + provenance-tracked (per S01 §11.3 vocabulary) : `tests/cases/<cvn_id>/` for data-driven scenarios (YAML or `pytest.mark.parametrize`) ; `tests/datasets/<cvn_id>/` for content-addressed test data (DVC for > 10 MB, in-repo for ≤ 10 MB) ; every (test_sha, dataset_sha) pair pins to a deterministic outcome | U6 + S01 §11 strategic invariant + ADR-23 (version-pinned features) | every test that uses a dataset cites it via a fixture that reads from `tests/datasets/<cvn_id>/<name>.parquet` (or DVC pointer) ; CI fails if a test-data file is read from outside `tests/datasets/` (catches operator pulling from `data/` accidentally) ; the per-Story `manifest.yaml` lists (test_sha, dataset_sha, run_id) for each acceptance criterion |
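
F3 asks for factories that return realistic-but-deterministic data and accept `seed`. The sketch below illustrates that contract with the standard library only ; it is not the S04 implementation. `make_ohlcv_rows` is a hypothetical name, the list-of-dicts return type is a simplification (the real factory may well return a DataFrame), and the price model and the 2024-01-01 UTC origin are arbitrary assumptions :

```python
import random
from datetime import datetime, timedelta, timezone


def make_ohlcv_rows(seed=0, n_bars=300, freq_minutes=15, crypto="BTCUSDC"):
    """Return n_bars deterministic OHLCV rows (hypothetical stdlib stand-in)."""
    rng = random.Random(seed)  # private RNG: global random state is untouched
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed, UTC-aware origin
    price = 100.0
    rows = []
    for i in range(n_bars):
        open_ = price
        close = open_ * (1 + rng.uniform(-0.01, 0.01))
        # high / low always bracket open and close, so rows stay well-formed.
        high = max(open_, close) * (1 + rng.uniform(0.0, 0.005))
        low = min(open_, close) * (1 - rng.uniform(0.0, 0.005))
        rows.append({
            "ts": start + timedelta(minutes=freq_minutes * i),
            "crypto": crypto,
            "open": open_, "high": high, "low": low, "close": close,
            "volume": rng.uniform(1.0, 10.0),
        })
        price = close  # next bar opens where this one closed
    return rows
```

Because the factory owns its `random.Random(seed)` instance, two calls with the same seed return equal rows regardless of test ordering, which is what the parallelism and determinism requirements in §4 rely on.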
4. Non-functional needs — quantitative budgets
| # | Need | Bound | Driven by | Owner of the metric |
|---|---|---|---|---|
| NF1 | Fast tier wall-clock | p95 ≤ 2 min in CI ; p95 ≤ 30 s on developer laptop | decision B + U1 | CI Grafana panel cvntrade-test-tier-latency (TBD by S03) |
| NF2 | Integration tier wall-clock | p95 ≤ 10 min in CI | decision B + U4/U5 | same panel |
| NF3 | Nightly tier wall-clock | p95 ≤ 30 min | decision B | same panel |
| NF4 | Parallelism contract | Each test MUST be safe under `pytest -n auto` (no shared mutable state across tests). Postgres / Redis / MLflow containers MUST be either shared with read-only assertions OR per-worker spawned. | F4 + decision B (parallelism is the only way to hit the wall-clock budgets) | enforced by an xdist-mode CI job that runs the full suite under `-n auto` ; failure breaks merge |
| NF5 | Determinism | Every test that involves a model train, a `np.random` call, a UUID, or a `datetime.now()` MUST seed explicitly. Re-running the same test under the same SHA produces the same artefact bit-for-bit. | U6 (audit) + U4 (FTF reproducibility) | pytest-randomly configured + a `seed_all` fixture in project `conftest.py` that sets `random.seed`, `np.random.seed`, `os.environ['PYTHONHASHSEED']`, and torch / xgboost seeds in one call |
| NF6 | Reproducibility from clean clone | Any developer must be able to run `git clone … && cd champollion && make test-fast` in < 5 min total (clone + `.venv_airflow` create + dep install + first test run) | U1 + onboarding new contributor | timed in CI's "fresh-clone" smoke job ; failure breaks merge |
| NF7 | Open-source only | No paid SaaS in the test stack. (Testcontainers is OSS ; pytest is OSS ; time_machine is OSS ; everything must stay this way.) | constraint from issue body + cost discipline | dependency licence check in CI (pip-licenses --fail-on=Commercial) |
| NF8 | Python 3.12 fixed | All test infra targets Python 3.12. No conditional logic for 3.11 or 3.13. | constraint from issue body | CI matrix has Python 3.12 only ; pytest skips that branch on sys.version_info are forbidden by lint rule |
| NF9 | macOS spawn-mode aware | Tests touching `multiprocessing` MUST work under macOS spawn mode (workers re-setup `sys.path` + dotenv per CLAUDE.md memory). | constraint from issue body + CLAUDE.md memory | CI matrix runs the multiprocessing-touching tests on a macOS runner once per nightly cycle ; if the macOS run diverges from Linux, the test is flagged as macOS-incompatible until fixed |
| NF10 | Reproducibility from (git_sha, dataset_sha, deps_lock_sha) | Any historic test run MUST be reproducible within ε-tolerance by checking out the recorded `git_sha` + materialising the recorded `dataset_sha` (DVC pull or in-repo file at that SHA) + restoring the recorded `deps_lock_sha` (S03 picks the lock-file format — likely `requirements.lock` or `uv.lock`). Tolerance bands are explicit in the manifest schema : exact for categoricals (PASSED/FAILED, predicted class), ε ≤ 1e-6 for float metrics (f1_buy, sortino, return %), ε ≤ 0.1 % for cumulative wealth simulations. "Bit-for-bit" is not promised — it would require pinning Linux kernel versions + glibc + BLAS implementations, which is out of scope for v1. Strict dependency pinning (the lock-file SHA in the manifest) absorbs the dominant non-determinism source. | F8 + NF5 + U6 (audit) | for any closed Story, `make reproduce CVN_ID=<cvn_id>` checks out the manifest's recorded SHAs, materialises the datasets, restores the deps lock, runs `pytest --story <cvn_id>` (per S03 plugin), and asserts the recorded outcomes match within their declared ε bands. CI runs this on a random closed Story per nightly cycle (canary) ; failure auto-files a GH issue tagged `audit-regression`. |
| NF11 | Test artefacts immutability after Closed | Once a Story transitions to Closed, its `documentation/stories/<cvn_id>/tests/manifest.yaml` is immutable. Any future commit modifying a Closed Story's manifest is rejected by a CI guardrail (G6). | U6 (audit) + U7 (Story workflow gate enforcer) + ADR-81 closure semantics | git pre-commit hook OR CI gate (G6) compares the staged file against the `manifest_pinned_at_close` SHA (recorded in OP wp at closure time) ; mismatch fails the commit/PR. Amendments require an explicit amendment Story per S01's invariant (cannot be done in-place). |
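
The NF10 tolerance bands lend themselves to a small comparison helper. A minimal sketch, assuming the 1e-6 bound is an absolute tolerance and the 0.1 % bound is a relative one (the dossier does not say which) ; the band names and the `{name: (band, value)}` record shape are hypothetical, since the manifest schema is left to S03 :

```python
import math


def _exact(recorded, replayed):
    return recorded == replayed


def _float_metric(recorded, replayed):
    # Assumption: the 1e-6 bound is read as an absolute tolerance.
    return math.isclose(recorded, replayed, rel_tol=0.0, abs_tol=1e-6)


def _cumulative_wealth(recorded, replayed):
    # Assumption: the 0.1 % bound is read as a relative tolerance.
    return math.isclose(recorded, replayed, rel_tol=1e-3, abs_tol=0.0)


BANDS = {
    "exact": _exact,
    "float_metric": _float_metric,
    "cumulative_wealth": _cumulative_wealth,
}


def mismatches(recorded, replayed):
    """Return the sorted names whose replayed value is missing or out of band.

    recorded: {name: (band, value)} ; replayed: {name: value}.
    """
    bad = []
    for name, (band, value) in recorded.items():
        if name not in replayed or not BANDS[band](value, replayed[name]):
            bad.append(name)
    return sorted(bad)
```

A replay job in the spirit of `make reproduce` would fail when `mismatches(...)` is non-empty and could attach the returned names to the filed issue.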
5. Constraints — explicit boundary conditions
- Open-source only (NF7) — no DataDog test agent, no Sentry test transport, no commercial pytest plugins. Self-hosted observability via Grafana + Loki + Prometheus.
- Python 3.12 fixed (NF8) — chosen because 3.12 is the production Airflow / Console runtime (memory : `.venv_airflow`).
- macOS spawn-mode aware (NF9) — operator's primary dev box is macOS ; CI Linux runners alone don't catch spawn-mode bugs.
- Single-operator reality — no cross-team test-ownership negotiation (per strategy §7).
- Existing Testcontainers Story #757 supersedes any conflicting design here — when S03 architecture lands, it consumes #757's container conventions.
6. Glossary — fixture / factory / mock / stub / spy
Avoids the pytest ecosystem's frequent debates by pinning canonical definitions for this Epic.
| Term | Canonical definition (this Epic) | Example |
|---|---|---|
| fixture | A pytest construct that produces an object + handles its lifecycle (setup / teardown). Discoverable via `pytest --fixtures`. SHOULD wire factory output into the system under test ; SHOULD NOT contain non-trivial business logic. | `pg_engine` fixture spinning up a Testcontainers Postgres + yielding a SQLAlchemy engine + tearing down the container |
| factory | A pure function (or factory_boy-style class) that produces a deterministic data instance. Accepts `seed` for reproducibility. Lives under `tests/factories/`. | `make_ohlcv_window(seed=0, n_bars=300, freq='15min', crypto='BTCUSDC')` returning a `pd.DataFrame` |
| mock | A test double that records calls AND can be configured to return preset values. Use for incoming dependencies (the SUT calls something we want to control). Use sparingly — prefer real implementations + factories for in-process collaborators. | mock_mlflow_client.log_metric.assert_called_with('f1_buy', 0.42) |
| stub | A simpler test double : returns a preset value, does NOT record calls. Use when you need a dependency to "say yes" without verifying the interaction. | stub_kill_switch.is_engaged = lambda: False |
| spy | Wraps a real implementation, records calls, lets the real call through. Use when you want to assert "X was called" without changing the SUT's behaviour. | spy = mocker.spy(filter_chain, 'execute') ; …; assert spy.call_count == 1 |
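
The three test doubles defined above can be contrasted in a few lines. A minimal sketch using only `unittest.mock` from the standard library ; `FilterChain` and `run_pipeline` are toy, hypothetical stand-ins for a system under test, and `Mock(wraps=...)` is used here as a stdlib substitute for the `mocker.spy` helper shown in the table :

```python
from types import SimpleNamespace
from unittest.mock import Mock


class FilterChain:
    """Toy collaborator (hypothetical) with real behaviour worth keeping."""

    def execute(self, signal):
        return signal.upper()


def run_pipeline(kill_switch, chain, mlflow_client, signal):
    """Toy system under test (hypothetical)."""
    if kill_switch.is_engaged():
        return None
    result = chain.execute(signal)
    mlflow_client.log_metric("f1_buy", 0.42)
    return result


# stub: returns a preset value and records nothing.
stub_kill_switch = SimpleNamespace(is_engaged=lambda: False)

# mock: records calls and returns whatever it is configured to return.
mock_mlflow_client = Mock()

# spy: records calls but forwards them to the real implementation.
spy_chain = Mock(wraps=FilterChain())

result = run_pipeline(stub_kill_switch, spy_chain, mock_mlflow_client, "buy")

assert result == "BUY"  # the real FilterChain.execute ran
assert spy_chain.execute.call_count == 1  # ...and the call was recorded
mock_mlflow_client.log_metric.assert_called_with("f1_buy", 0.42)
```

The spy is the only double here whose return value comes from production code, which is why the `result == "BUY"` assertion is meaningful.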
The "system vs integration" debate is closed by strategy §10 ; not re-opened here. S02 inherits those definitions.
7. Acceptance criteria
| # | Criterion | Evidence |
|---|---|---|
| 1 | Committee plan_review PASSED (or operator waiver in PR body) | session JSON link in PR body |
| 2 | Dossier merged at `documentation/reviews/2026-05-06-cvn-n015-ea-s02-requirements-plan.md` | this file |
| 3 | Every requirement (F1-F8 + NF1-NF11) traces back to ≥ 1 use case (U1-U7) | column "Driven by" in §3 + §4 |
| 4 | Every requirement has a numerical acceptance bar | column "Acceptance signal" / "Bound" in §3 + §4 |
| 5 | All 3 operator decisions A/B/C documented with rationale | §1 of this dossier |
| 6 | Glossary covers fixture / factory / mock / stub / spy with one example each | §6 |
| 7 | wp#117 OP transition New → In specification → Specified | OP audit comment trail |
| 8 | F7 (Story-phase test integration) + F8 (test cases + datasets versioning) requirements explicitly trace to S01 §11 strategic invariant + future ADR-0087 / ADR-0088 | references in §3 F7 + F8 rows + S01 cross-link |
| 9 | U7 (Story workflow gate enforcer) explicitly maps to ADR-81 transitions × test artefacts | §2 U7 row + cross-link to S01 §11.1 canonical matrix |
| 10 | Committee plan_review + pr_review test-verdict scope (S01 §11.4) is captured as a requirement | §3 F7 acceptance signal explicitly references S01 §11.4 |
8. Out of scope (explicit)
- Choosing the specific factory library (e.g., `factory_boy` vs hand-rolled) — that's an S03 architecture decision against the F4 requirement.
- Specifying the Testcontainers image versions — S03 picks the versions ; S02 only requires "the 5 services exist as containers".
- Setting up the CI pipeline — S04 implementation Story.
- Authoring the actual fixtures + factories — S04+ implementation work.
- Cross-Epic concerns (data quality, ML behaviour, system-E2E, etc.) — those Epics' own requirements Stories handle their own scope.
- Visual / UI testing — out of foundation Epic ; covered by EA-adjacent Story if/when Console deserves screenshot-diff coverage.
- Mutation testing (`mutmut`, `cosmic-ray`) — interesting but premature ; revisit after the foundation lands.
9. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Testcontainers cold-start dominates the 10 min integration budget | medium | medium | Pre-pull container images in CI cache (NF6 fresh-clone bound is separate from steady-state CI) ; use --reuse flag where safe ; container-per-test-class instead of per-test where state is read-only |
| Flaky-test detector's 5 % flake rate threshold (F5) is too strict on day 1 | medium | low | Bootstrap with 10 % threshold + ratchet down monthly ; per-test allow-list for known-flaky external deps until they're moved to mocks |
| `time_machine` C-level patching surfaces a subtle bug on macOS spawn-mode workers | low | medium | NF9 macOS nightly job catches it within 24h ; if it surfaces, fall back to freezegun for those specific tests with an explicit `xfail_on_spawn` marker until upstream fix |
| Single-operator means no second pair-of-eyes on the requirements during normal flow | high | low | Committee plan_review IS the second pair-of-eyes for this Story ; operator-discipline rule "every requirements Story MUST go through committee" enforced by ADR-68 |
| Scope creep — operator or committee asks to add EB-EI cross-cutting requirements here | medium | medium | §8 out-of-scope is exhaustive ; any cross-cutting requirement gets a dedicated Story under the relevant Epic, NOT a §3 / §4 row addition |
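
The flake-rate threshold discussed in the risk table (5 % per F5, bootstrapped at 10 %) presupposes a definition of "flake rate". A minimal sketch under one possible definition, which is an assumption and not stated in this dossier : the share of SHAs on which a test produced both PASSED and FAILED, out of all SHAs it ran on. `flake_rates` and `flagged` are hypothetical names ; the caller is expected to pre-filter runs to the 7-day window :

```python
from collections import defaultdict


def flake_rates(runs):
    """Map test_id -> share of SHAs where the test both passed and failed.

    runs: iterable of (test_id, sha, outcome), outcome in {"PASSED", "FAILED"}.
    """
    seen = defaultdict(lambda: defaultdict(set))  # test_id -> sha -> outcomes
    for test_id, sha, outcome in runs:
        seen[test_id][sha].add(outcome)
    rates = {}
    for test_id, by_sha in seen.items():
        flipped = sum(
            1 for outcomes in by_sha.values() if {"PASSED", "FAILED"} <= outcomes
        )
        rates[test_id] = flipped / len(by_sha)
    return rates


def flagged(runs, threshold=0.05):
    """Sorted test ids whose flake rate is strictly above the threshold."""
    return sorted(t for t, rate in flake_rates(runs).items() if rate > threshold)
```

Ratcheting the threshold down, as the mitigation suggests, then amounts to changing the `threshold` argument passed by the CI job rather than changing the detector itself.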
10. Sequencing + dependencies
S01 (test strategy, wp#116, PR #852) ────┐
├──► S02 (this Story, wp#117)
│ plan_review committee
│ PR open + CR + merge
│ wp#117 New → In spec → Specified → In progress → Developed → Closed
│
└──► (other Epics EB-EI requirements Stories)
each refines its own subset of S01
S02 PR can land BEFORE S01 PR merges (S02 references S01 strategy via the PR branch path).
S02 closure requires S01 to have merged (the trace requires the strategy doc to exist on main).
S02 ──► S03 (foundation Epic architecture)
designs against S02 requirements
S03 ──► S04+ (implementation Stories per service / fixture layer)
Single-WIP rule (per ADR-69) : S02 sits in the Specified buffer once committee plan_review PASSES, waiting for capacity in In progress. Currently In progress Stories : wp#103 (S14 Track 1 leakage), wp#46 (S07 Track 2), wp#45 (S06 Track 11). S02 doesn't compete for capacity until one of those closes.
11. References
- OP wp#117 / GH #837 : this Story's tracking
- S01 dossier : `documentation/reviews/2026-05-05-cvn-n015-ea-s01-test-strategy-plan.md` (lands on PR #852 merge — backtick-quoted path here instead of markdown link to avoid mkdocs strict cross-PR resolution failure)
- S01 strategy doc : `documentation/strategy/CVN-N015-test-strategy.md` (lands on PR #852 merge)
- S01 ADR : `documentation/adr/0083-test-taxonomy-and-gate-hierarchy.md` (lands on PR #852 merge)
- Testcontainers integration Story : #757 — F2 requirement is shaped by its design
- Existing pytest factories Story : #586 — F4 inherits its scope
- Flaky-test detector Story : #756 — F5 is its requirements expression
- ADR-58 (FTF guardrails) — drives U4 (FTF launcher pre-sweep)
- ADR-68 (committee = default review channel) — drives §7 criterion 1
- ADR-77 (MkDocs SSoT) — this dossier is the SSoT for S02 requirements
- ADR-81 (8-state Story workflow) — drives §7 criterion 7
12. Plan-review questions for committee
- Use case U6 (audit reviewer) numerical bar : `pytest --collect-only` snapshot per release tag is the proposed evidence. Is that sufficient to reconstruct the test surface during a 6-month-after incident review, or do we need a richer artefact (e.g., mapping test → covered code lines via `coverage.py` per release) ?
- NF4 parallelism contract : we mandate `-n auto` safety as a merge gate. Is this realistic from day 1 (Testcontainers spin-up cost may dominate) or should we phase it in (allow `-n 1` initially with a 1-month migration window) ?
- NF6 fresh-clone < 5 min : aggressive bound. The current `.venv_airflow` install is ~3 min wall-clock on a fast laptop ; tight margin. Acceptable as a stretch goal with an alert at 4 min, or relax to < 7 min ?
- Glossary §6 mock vs spy : the line is famously fuzzy. Is the "use mock for incoming, spy for outgoing-while-keeping-real-behaviour" rule clear enough for the next contributor, or do we want a short decision tree (`if you need to verify a call AND let the real impl run → spy ; otherwise mock`) ?
- Scope of "service virtualisation" F2 : we list 5 services (Postgres, Redis, MinIO, MLflow, Airflow). Are we missing any (e.g., the Console Streamlit app, Loki, Grafana), or does the test stack consciously stop at the 5 ?
- Decision C (`time_machine`) lock-in risk : if the lib goes unmaintained in 18 months, the migration cost back to `freezegun` is a few hours per test. Is that risk profile acceptable vs picking the more entrenched `freezegun` from day 1 ?