
CVN-N015-EA-S02 — Requirements expression for the test stack foundation Epic

Date : 2026-05-06
Story : CVN-N015-EA-S02 (OP wp#117)
GH issue : #837
Parent Epic : CVN-N015-EA, Test stack foundation (OP wp#107 / GH #827)
Depends on : CVN-N015-EA-S01 (test strategy ; OP wp#116 / GH #836 ; currently in PR #852)
Operator decisions locked : 2026-05-06
  • A=ok (initially 6 use cases U1-U6 ; amended same-day to 7 use cases by adding U7 = Story workflow gate enforcer per the strategic invariant in S01 §11 ; see §2 amendment note)
  • B=ok (CI budgets p95 ≤ 2 min fast / ≤ 10 min integration / ≤ 30 min nightly)
  • C=time_machine
Status : proposed (committee plan_review pending)


0. Intent + scope

S02 refines the strategic taxonomy from S01 into the concrete functional + non-functional requirements of the foundation Epic (CVN-N015-EA only — pytest, Testcontainers, factories, flaky-test detector). The other Epics (EB-EI) refine S01 into their own subsets via their own requirements Stories.

S02 is the contract that S03 (architecture) builds against — without it, S03 risks designing against an under-specified target and re-deriving requirements at architecture time.

Out of scope (covered by other Stories) : the test-type catalogue itself (S01), the specific service contracts (S03), the implementation choices like fixture file layout (S04+), the data quality / ML behaviour / system-E2E / contract specifics (Epics EB-EI requirements Stories).


1. Operator decisions — locked 2026-05-06

| # | Decision | Locked value | Rationale |
| --- | --- | --- | --- |
| A | Stakeholders + use cases scope | Initially 6 kept ; amended same-day to 7 : developer pre-commit (U1), PR reviewer (U2), oncall reading test failure (U3), FTF launcher pre-sweep (U4), MLflow promotion gate (U5), audit reviewer post-incident (U6), U7 added = Story workflow gate enforcer (per S01 §11 strategic invariant ; ties U1-U6 together with enforcement at every ADR-81 transition) | Each maps to a distinct lifecycle stage where test feedback drives a decision ; trimming any leaves a stage without a usability bar. Operator confirmed the original 6, then amended to 7 same-day after the S01 §11 Story-phase × test-artefacts integration matrix surfaced the missing enforcement use case. |
| B | CI wall-clock budgets | fast tier p95 ≤ 2 min ; integration tier p95 ≤ 10 min ; nightly tier p95 ≤ 30 min | Aligned with strategy doc §5 performance budgets. 2 min fast is the "PR feedback loop stays interactive" bar (per UX research on attention switching). 10 min integration covers Testcontainers cold-start + the 5 services contracted in F2 (Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver) ; if downstream Epics EB-EI surface a need for a 6th service the budget gets re-evaluated via amendment Story, not pre-allocated here. 30 min nightly is the budget that lets us absorb data-quality + ML-behaviour suites without crossing the next-morning standup. |
| C | Time-freezing library | time_machine | C-level patching (time_machine ships a C extension) covers time.time() and pandas Timestamp.now() consistently (freezegun has known divergences here that have already bitten the FTF tests in the legacy suite). 5-10× faster than freezegun on setup/teardown, which is material at our CI budget B. API close to freezegun's (time_machine.travel in place of freezegun.freeze_time), which keeps a future migration cheap if the call ever flips. Roll-our-own ruled out (no exotic clock semantics in CVNTrade ; would be pure NIH). |
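Decision B's budgets can be checked mechanically once per-run wall-clock durations are collected. The sketch below is a minimal illustration, not the S03 design : the nearest-rank percentile definition, the names `TIER_BUDGET_S`, `p95` and `within_budget`, and the sample durations are all assumptions introduced here.

```python
# Budgets from decision B, in seconds (p95 wall-clock per CI tier).
TIER_BUDGET_S = {"fast": 2 * 60, "integration": 10 * 60, "nightly": 30 * 60}


def p95(durations_s):
    """Nearest-rank 95th percentile (ASSUMED definition; the dossier does not pin one)."""
    if not durations_s:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_s)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(tier, durations_s):
    """True when the tier's observed p95 is at or under its budget."""
    return p95(durations_s) <= TIER_BUDGET_S[tier]


# Hypothetical sample: 20 fast-tier runs with a single slow outlier.
fast_runs = [95.0] * 19 + [400.0]
print(p95(fast_runs), within_budget("fast", fast_runs))
```

With a p95 bar, one slow run in twenty does not break the budget, whereas a max-based bar would. That is the practical reason to state the budgets as percentiles.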

2. Stakeholders + 7 canonical use cases (U1-U7)

Each use case is one person × one moment where test feedback determines a decision. The Story's requirements MUST trace back to ≥ 1 use case + a numerical bar.

| # | Stakeholder | Moment | What they need from the suite | Numerical bar |
| --- | --- | --- | --- | --- |
| U1 | Developer | pre-commit (make test-fast) | "Did my last 30 min of work break the contracts I touched ?" | p95 wall-clock ≤ 30 s on developer laptop ; output ≤ 1 page when green ; failure points to ≤ 3 candidate files |
| U2 | PR reviewer | reading the PR's CI summary | "Did the code change pass the gates the strategy says it must ?" | every gate listed in strategy §4 maps 1:1 to a CI check name with a clear PASSED/FAILED ; reviewer can decide on the gate without opening the run log |
| U3 | Oncall | parsing a CI failure on main post-merge | "What broke + what runbook applies ?" | failure log includes (a) the failing assertion (b) the test docstring (c) the runbook link if the test belongs to a runbook'd subsystem ; resolution path under 10 min from failure email to acknowledged Slack |
| U4 | FTF launcher | pre-sweep validation | "Will this 4-hour sweep crash on hour 1 because of a guardrail miss / config drift ?" | guardrail unit tests (per ADR-58) cover every variant in the matrix ; integration smoke against Testcontainers' ftf_config PG image runs in ≤ 60 s ; pre-sweep gate fails fast before submitting to Airflow |
| U5 | MLflow promotion gate | post-FTF "should we promote this model ?" | "Does this candidate model pass the contract + ML-behaviour tests on the held-out window ?" | contract + ML-behaviour suites runnable against any MLflow run id in ≤ 10 min ; output is a single PASSED/FAILED + per-test detail in MLflow run tags |
| U6 | Audit reviewer | post-incident root-cause | "Was this code path tested at the time of the incident, and what did the test assert ?" | every test has a stable docstring + git-blame trail ; pytest --collect-only --quiet is committed as snapshot per release tag so audit can reconstruct the test surface at any point in time |
| U7 | Story workflow gate enforcer | every ADR-81 transition (In specification → Specified ; In progress → Developed ; Developed → In testing ; Tested → Closed) | "Are the test artefacts required to enter this Story phase present + verified ?" | per-Story documentation/stories/<cvn_id>/tests/ folder populated with the required artefact at each gate (strategy.md → test code → test_run_.md → manifest.yaml) ; CI guardrail G5 fails the PR if a transition is attempted without its artefacts ; committee plan_review + pr_review issue an explicit verdict on the test artefacts (S01 §11.4) |

Why all 7 (extends operator decision A — added 2026-05-06 per the strategic invariant in S01 §11) : U7 is the use case that ties the rest together — without it, U1-U6 are just developer-side ergonomics with no enforcement. The OP-Story-phase × test-artefacts integration matrix (S01 §11.1) is the canonical reference U7 enforces against. Trimming U7 = "tests are nice to have", which is exactly the discipline the Epic was created to break.
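The U7 bar amounts to a lookup from "transition being attempted" to "artefacts that must already exist". A minimal sketch of such a G5-style check follows. It assumes one artefact per gate, paired in the order the U7 row lists them ; only "manifest.yaml at Tested → Closed" is stated explicitly in this dossier. The glob patterns, the function name `missing_artefacts` and the idea that test code sits in the same folder are illustrative assumptions, not the S01 §11.1 matrix.

```python
from pathlib import Path
import tempfile

# ASSUMPTION: one artefact per ADR-81 gate, in the order listed in the U7 row.
# The glob patterns are illustrative; S01 §11.1 is the canonical reference.
REQUIRED_ARTEFACTS = {
    ("In specification", "Specified"): ["strategy.md"],
    ("In progress", "Developed"): ["test_*.py"],
    ("Developed", "In testing"): ["test_run_*.md"],
    ("Tested", "Closed"): ["manifest.yaml"],
}


def missing_artefacts(repo_root, cvn_id, transition):
    """Return the artefact patterns with no match in the Story's tests/ folder."""
    tests_dir = Path(repo_root) / "documentation" / "stories" / cvn_id / "tests"
    patterns = REQUIRED_ARTEFACTS[transition]
    if not tests_dir.is_dir():
        return list(patterns)
    return [p for p in patterns if not any(tests_dir.glob(p))]


# Demo on a throwaway tree: only strategy.md exists.
with tempfile.TemporaryDirectory() as root:
    story_tests = Path(root) / "documentation" / "stories" / "CVN-N015-EA-S02" / "tests"
    story_tests.mkdir(parents=True)
    (story_tests / "strategy.md").write_text("# strategy\n")
    to_specified = missing_artefacts(root, "CVN-N015-EA-S02", ("In specification", "Specified"))
    to_closed = missing_artefacts(root, "CVN-N015-EA-S02", ("Tested", "Closed"))

print(to_specified, to_closed)
```

A CI job built on this would exit non-zero whenever the returned list is non-empty for the transition the PR attempts.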


3. Functional needs — what the foundation MUST support

| # | Need | Driven by | Acceptance signal |
| --- | --- | --- | --- |
| F1 | Test-type coverage : unit, property, contract, cache, integration, DAG smoke (the 6 from strategy §2 layer = base + middle that the foundation Epic owns) | strategy §2 + §3 cadence matrix | each type has a pytest -m <type> selector that returns a non-empty set ; CI matrix has a job per type |
| F2 | Service virtualisation : Postgres, Redis, MinIO, MLflow, Airflow scheduler+webserver, runnable as Testcontainers ephemeral containers | U4 (FTF pre-sweep) + U5 (promotion gate) | pytest tests/integration/services/ spins up + tears down the 5 services in CI under the integration budget B |
| F3 | Data shape coverage : OHLCV (15min × 5 cryptos × 30 day window), model artefacts (xgb/lgb/cb pickles + MLflow metadata), FTF results rows (finetune_runs + finetune_results PG schema), signals (BUY/SELL/HOLD + confidence + filter trace) | F1 + F2 ; every data shape must have a factory | every shape has a factories.py builder under tests/factories/ returning realistic-but-deterministic data ; factories accept seed for reproducibility |
| F4 | Fixture / factory hierarchy : project-level conftest.py for cross-cutting fixtures (PG container, MLflow tracking server) ; per-package conftest.py for narrow fixtures (single OHLCV window) ; factories produce data, fixtures wire it into the system under test | U1 + U2 (PR reviewer reads the test) | fixtures discoverable via pytest --fixtures with one-line description ; no fixture > 50 LoC ; no fixture has > 3 parameters |
| F5 | Flaky-test detector : retry layer + flake-rate dashboard ; CI flags any test that flips PASSED/FAILED across the same SHA on rerun | U2 (PR reviewer) + U3 (oncall) | pytest --flaky-detect (or pytest-rerunfailures config) re-runs FAILED tests once ; CI exposes flake rate per test on a Grafana panel ; tests with flake rate > 5 % over the last 7 days get an automatic GH issue |
| F6 | Time-freezing primitive : time_machine integration (per decision C) ; pytest fixture frozen_time(when) ; works with pd.Timestamp.now(), datetime.now(), time.time(), asyncio.get_event_loop().time() | U4 (FTF determinism) + U5 (model artefacts time-deterministic) | frozen_time fixture in project conftest.py ; assertions on pd.Timestamp.now() + datetime.now() agree to the second when frozen ; tests/README.md (shipped with S04 plugin) documents the tz-aware pattern with ≥ 1 example each for naive, UTC-aware, and local-tz-aware usage (e.g., pd.Timestamp('2024-01-01', tz='UTC') vs naive) including explicit normalisation guidance ; committee pr_review on S04 PR MUST verify the examples + that tests demonstrating each tz mode exist |
| F7 | Story-phase test integration (per S01 §11) : every ADR-81 transition has a test-artefact gate. @pytest.mark.story("<cvn_id>") discoverable via the S04 pytest plugin's --story CLI flag (per S03 architecture ; pytest plugin design specified there) ; per-Story documentation/stories/<cvn_id>/tests/ folder created at In specification and populated through transitions ; CI guardrail G5 (extends G1-G4) blocks a transition attempt without its artefacts ; committee plan_review + pr_review issue an explicit tests: sub-verdict (S01 §11.4) | U7 + S01 §11 strategic invariant + ADR-81 + ADR-68 | pytest --collect-only --story <cvn_id> returns ≥ 1 test per acceptance criterion ; G5 CI job FAILS if documentation/stories/<cvn_id>/tests/manifest.yaml is absent at Tested → Closed ; committee pr_review verdict body includes the 4 sub-verdicts from S01 §11.4 (any INSUFFICIENT blocks merge) |
| F8 | Test cases + datasets versioned + provenance-tracked (per S01 §11.3 vocabulary) : tests/cases/<cvn_id>/ for data-driven scenarios (YAML or pytest.parametrize) ; tests/datasets/<cvn_id>/ for content-addressed test data (DVC for > 10 MB, in-repo for ≤ 10 MB) ; every (test_sha, dataset_sha) pair pins to a deterministic outcome | U6 + S01 §11 strategic invariant + ADR-23 (version-pinned features) | every test that uses a dataset cites it via a fixture that reads from tests/datasets/<cvn_id>/<name>.parquet (or DVC pointer) ; CI fails if a test-data file is read from outside tests/datasets/ (catches operator pulling from data/ accidentally) ; the per-Story manifest.yaml lists (test_sha, dataset_sha, run_id) for each acceptance criterion |
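F3's "realistic-but-deterministic, accepts seed" bar is easiest to read from a concrete builder. The following is a minimal stdlib-only sketch : it returns a list of dicts as a stand-in for the pd.DataFrame a real factory would produce, and the price model, the fixed start date and the `freq_minutes` parameter are illustrative assumptions rather than the S04 implementation.

```python
import random
from datetime import datetime, timedelta, timezone


def make_ohlcv_window(seed=0, n_bars=300, freq_minutes=15, crypto="BTCUSDC"):
    """Return n_bars deterministic OHLCV rows (list of dicts, standing in for a DataFrame)."""
    rng = random.Random(seed)  # local RNG: the global random state is never touched
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed anchor, never datetime.now()
    price = 100.0
    rows = []
    for i in range(n_bars):
        open_ = price
        close = open_ * (1 + rng.uniform(-0.01, 0.01))
        high = max(open_, close) * (1 + rng.uniform(0, 0.005))
        low = min(open_, close) * (1 - rng.uniform(0, 0.005))
        rows.append({
            "ts": start + timedelta(minutes=freq_minutes * i),
            "crypto": crypto,
            "open": open_, "high": high, "low": low, "close": close,
            "volume": rng.uniform(1.0, 50.0),
        })
        price = close  # next bar opens where this one closed
    return rows


window = make_ohlcv_window(seed=0, n_bars=4)
print(len(window), window[0]["ts"].isoformat(), window[1]["ts"].isoformat())
```

Using a per-call `random.Random(seed)` instead of seeding the global generator keeps the factory free of shared mutable state, which is what the NF4 parallelism contract asks of every test.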

4. Non-functional needs — quantitative budgets

| # | Need | Bound | Driven by | Owner of the metric |
| --- | --- | --- | --- | --- |
| NF1 | Fast tier wall-clock | p95 ≤ 2 min in CI ; p95 ≤ 30 s on developer laptop | decision B + U1 | CI Grafana panel cvntrade-test-tier-latency (TBD by S03) |
| NF2 | Integration tier wall-clock | p95 ≤ 10 min in CI | decision B + U4/U5 | same panel |
| NF3 | Nightly tier wall-clock | p95 ≤ 30 min | decision B | same panel |
| NF4 | Parallelism contract | Each test MUST be safe under pytest -n auto (no shared mutable state across tests). Postgres / Redis / MLflow containers MUST be either shared with read-only assertions OR per-worker spawned. | F4 + decision B (parallelism is the only way to hit the wall-clock budgets) | enforced by an xdist-mode CI job that runs the full suite under -n auto ; failure breaks merge |
| NF5 | Determinism | Every test that involves a model train, a np.random call, a UUID, or a datetime.now() MUST seed explicitly. Re-running the same test under the same SHA in the same environment produces the same artefact bit-for-bit (cross-environment reproduction is governed by the NF10 ε bands). | U6 (audit) + U4 (FTF reproducibility) | pytest-randomly configured + a seed_all fixture in project conftest.py that sets random.seed, np.random.seed, and torch / xgboost seeds in one call ; PYTHONHASHSEED is exported before the interpreter starts (setting os.environ['PYTHONHASHSEED'] inside a running process only affects child processes) |
| NF6 | Reproducibility from clean clone | Any developer must be able to run git clone … && cd champollion && make test-fast in < 5 min total (clone + .venv_airflow create + dep install + first test run) | U1 + onboarding new contributor | timed in CI's "fresh-clone" smoke job ; failure breaks merge |
| NF7 | Open-source only | No paid SaaS in the test stack. (Testcontainers is OSS ; pytest is OSS ; time_machine is OSS ; everything must stay this way.) | constraint from issue body + cost discipline | dependency licence check in CI (pip-licenses --fail-on=Commercial) |
| NF8 | Python 3.12 fixed | All test infra targets Python 3.12. No conditional logic for 3.11 or 3.13. | constraint from issue body | CI matrix has Python 3.12 only ; pytest skips that branch on sys.version_info are forbidden by lint rule |
| NF9 | macOS spawn-mode aware | Tests touching multiprocessing MUST work under macOS spawn mode (workers re-setup sys.path + dotenv per CLAUDE.md memory). | constraint from issue body + CLAUDE.md memory | CI matrix runs the multiprocessing-touching tests on a macOS runner once per nightly cycle ; if the macOS run diverges from Linux, the test is flagged as macOS-incompatible until fixed |
| NF10 | Reproducibility from (git_sha, dataset_sha, deps_lock_sha) | Any historic test run MUST be reproducible within ε-tolerance by checking out the recorded git_sha + materialising the recorded dataset_sha (DVC pull or in-repo file at that SHA) + restoring the recorded deps_lock_sha (S03 picks the lock-file format, likely requirements.lock or uv.lock). Tolerance bands are explicit in the manifest schema : exact for categoricals (PASSED/FAILED, predicted class), ε ≤ 1e-6 for float metrics (f1_buy, sortino, return %), ε ≤ 0.1 % for cumulative wealth simulations. "Bit-for-bit" is not promised : it would require pinning Linux kernel versions + glibc + BLAS implementations, which is out of scope for v1. Strict dependency pinning (the lock-file SHA in the manifest) absorbs the dominant non-determinism source. | F8 + NF5 + U6 (audit) | for any closed Story, make reproduce CVN_ID=<cvn_id> checks out the manifest's recorded SHAs, materialises the datasets, restores the deps lock, runs pytest --story <cvn_id> (per S03 plugin), and asserts the recorded outcomes match within their declared ε bands. CI runs this on a random closed Story per nightly cycle (canary) ; failure auto-files a GH issue tagged audit-regression. |
| NF11 | Test artefacts immutability after Closed | Once a Story transitions to Closed, its documentation/stories/<cvn_id>/tests/manifest.yaml is immutable. Any future commit modifying a Closed Story's manifest is rejected by a CI guardrail (G6). | U6 (audit) + U7 (Story workflow gate enforcer) + ADR-81 closure semantics | git pre-commit hook OR CI gate (G6) compares the staged file against the manifest_pinned_at_close SHA (recorded in OP wp at closure time) ; mismatch fails the commit/PR. Amendments require an explicit amendment Story per S01's invariant (cannot be done in-place). |
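The NF10 tolerance bands translate into a small comparison routine. The sketch below is one possible reading, under stated assumptions : the 1e-6 band is treated as an absolute tolerance and the 0.1 % band as a relative one (the dossier gives the magnitudes but not the comparison mode), and the outcome kinds, the function names and the sample values are hypothetical.

```python
import math


def matches(kind, recorded, reproduced):
    """Compare one recorded outcome with its reproduction under the NF10 bands.

    ASSUMPTION: float metrics use an absolute band, cumulative wealth a relative one.
    """
    if kind == "categorical":
        return recorded == reproduced
    if kind == "float_metric":
        return math.isclose(recorded, reproduced, rel_tol=0.0, abs_tol=1e-6)
    if kind == "cumulative_wealth":
        return math.isclose(recorded, reproduced, rel_tol=1e-3, abs_tol=0.0)
    raise ValueError(f"unknown outcome kind: {kind}")


def regressions(recorded, reproduced):
    """Names of outcomes that are missing or fall outside their declared band."""
    return [
        name
        for name, (kind, value) in recorded.items()
        if name not in reproduced or not matches(kind, value, reproduced[name])
    ]


# Hypothetical manifest entries and a hypothetical re-run.
recorded = {
    "status": ("categorical", "PASSED"),
    "f1_buy": ("float_metric", 0.42),
    "final_wealth": ("cumulative_wealth", 10_000.0),
}
reproduced = {"status": "PASSED", "f1_buy": 0.4200004, "final_wealth": 10_020.0}

print(regressions(recorded, reproduced))  # ['final_wealth']
```

In this sample the f1_buy drift of 4e-7 stays inside its band, while the 0.2 % wealth drift exceeds the 0.1 % band and would be the trigger for the audit-regression issue.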

5. Constraints — explicit boundary conditions

  • Open-source only (NF7) — no DataDog test agent, no Sentry test transport, no commercial pytest plugins. Self-hosted observability via Grafana + Loki + Prometheus.
  • Python 3.12 fixed (NF8) — chosen because 3.12 is the production Airflow / Console runtime (memory : .venv_airflow).
  • macOS spawn-mode aware (NF9) — operator's primary dev box is macOS ; CI Linux runners alone don't catch spawn-mode bugs.
  • Single-operator reality — no cross-team test-ownership negotiation (per strategy §7).
  • Existing Testcontainers Story #757 supersedes any conflicting design here — when S03 architecture lands, it consumes #757's container conventions.

6. Glossary — fixture / factory / mock / stub / spy

This glossary pins canonical definitions for this Epic, to sidestep the pytest ecosystem's recurring terminology debates.

| Term | Canonical definition (this Epic) | Example |
| --- | --- | --- |
| fixture | A pytest construct that produces an object + handles its lifecycle (setup / teardown). Discoverable via pytest --fixtures. SHOULD wire factory output into the system under test ; SHOULD NOT contain non-trivial business logic. | pg_engine fixture spinning up a Testcontainers Postgres + yielding a SQLAlchemy engine + tearing down the container |
| factory | A pure function (or factory_boy-style class) that produces a deterministic data instance. Accepts seed for reproducibility. Lives under tests/factories/. | make_ohlcv_window(seed=0, n_bars=300, freq='15min', crypto='BTCUSDC') returning a pd.DataFrame |
| mock | A test double that records calls AND can be configured to return preset values. Use for incoming dependencies (the SUT calls something we want to control). Use sparingly ; prefer real implementations + factories for in-process collaborators. | mock_mlflow_client.log_metric.assert_called_with('f1_buy', 0.42) |
| stub | A simpler test double : returns a preset value, does NOT record calls. Use when you need a dependency to "say yes" without verifying the interaction. | stub_kill_switch.is_engaged = lambda: False |
| spy | Wraps a real implementation, records calls, lets the real call through. Use when you want to assert "X was called" without changing the SUT's behaviour. | spy = mocker.spy(filter_chain, 'execute') ; …; assert spy.call_count == 1 |

The "system vs integration" debate is closed by strategy §10 ; not re-opened here. S02 inherits those definitions.
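The three test doubles from the glossary can be seen side by side in one short test. This is a minimal sketch using only the standard library : `unittest.mock.Mock(wraps=...)` stands in for the `mocker.spy` helper named in the table, and the `FilterChain`, `StubKillSwitch` and `publish` names are hypothetical objects defined purely for the example.

```python
from unittest.mock import Mock


class FilterChain:
    """Hypothetical collaborator, defined here only for the example."""

    def execute(self, signal):
        return signal.upper()


class StubKillSwitch:
    """stub: returns a preset answer and records nothing."""

    def is_engaged(self):
        return False


def publish(signal, chain, kill_switch, mlflow_client):
    """Hypothetical SUT: filters a signal and logs a metric unless the switch is engaged."""
    if kill_switch.is_engaged():
        return None
    result = chain.execute(signal)
    mlflow_client.log_metric("f1_buy", 0.42)
    return result


# mock: records calls and can be asserted on afterwards.
mock_mlflow_client = Mock()

# spy: wraps the real object, records calls, and lets the real call through.
spy_chain = Mock(wraps=FilterChain())

result = publish("buy", spy_chain, StubKillSwitch(), mock_mlflow_client)

mock_mlflow_client.log_metric.assert_called_with("f1_buy", 0.42)  # interaction verified
assert spy_chain.execute.call_count == 1  # call recorded
assert result == "BUY"  # real FilterChain.execute still ran
```

The last assertion is what separates the spy from the mock : the return value comes from the real `FilterChain.execute`, not from a preset.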


7. Acceptance criteria

| # | Criterion | Evidence |
| --- | --- | --- |
| 1 | Committee plan_review PASSED (or operator waiver in PR body) | session JSON link in PR body |
| 2 | Dossier merged at documentation/reviews/2026-05-06-cvn-n015-ea-s02-requirements-plan.md | this file |
| 3 | Every requirement (F1-F8 + NF1-NF11) traces back to ≥ 1 use case (U1-U7) | column "Driven by" in §3 + §4 |
| 4 | Every requirement has a numerical acceptance bar | column "Acceptance signal" / "Bound" in §3 + §4 |
| 5 | All 3 operator decisions A/B/C documented with rationale | §1 of this dossier |
| 6 | Glossary covers fixture / factory / mock / stub / spy with one example each | §6 |
| 7 | wp#117 OP transition New → In specification → Specified | OP audit comment trail |
| 8 | F7 (Story-phase test integration) + F8 (test cases + datasets versioning) requirements explicitly trace to S01 §11 strategic invariant + future ADR-0087 / ADR-0088 | references in §3 F7 + F8 rows + S01 cross-link |
| 9 | U7 (Story workflow gate enforcer) explicitly maps to ADR-81 transitions × test artefacts | §2 U7 row + cross-link to S01 §11.1 canonical matrix |
| 10 | Committee plan_review + pr_review test-verdict scope (S01 §11.4) is captured as a requirement | §3 F7 acceptance signal explicitly references S01 §11.4 |

8. Out of scope (explicit)

  • Choosing the specific factory library (e.g., factory_boy vs hand-rolled) — that's an S03 architecture decision against the F4 requirement.
  • Specifying the Testcontainers image versions — S03 picks the versions ; S02 only requires "the 5 services exist as containers".
  • Setting up the CI pipeline — S04 implementation Story.
  • Authoring the actual fixtures + factories — S04+ implementation work.
  • Cross-Epic concerns (data quality, ML behaviour, system-E2E, etc.) — those Epics' own requirements Stories handle their own scope.
  • Visual / UI testing — out of foundation Epic ; covered by EA-adjacent Story if/when Console deserves screenshot-diff coverage.
  • Mutation testing (mutmut, cosmic-ray) — interesting but premature ; revisit after the foundation lands.

9. Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Testcontainers cold-start dominates the 10 min integration budget | medium | medium | Pre-pull container images in CI cache (NF6 fresh-clone bound is separate from steady-state CI) ; use --reuse flag where safe ; container-per-test-class instead of per-test where state is read-only |
| Flaky-test detector's 5 % flake rate threshold (F5) is too strict on day 1 | medium | low | Bootstrap with 10 % threshold + ratchet down monthly ; per-test allow-list for known-flaky external deps until they're moved to mocks |
| time_machine C-level patching surfaces a subtle bug on macOS spawn-mode workers | low | medium | NF9 macOS nightly job catches it within 24h ; if it surfaces, fall back to freezegun for those specific tests with an explicit xfail_on_spawn marker until upstream fix |
| Single-operator means no second pair-of-eyes on the requirements during normal flow | high | low | Committee plan_review IS the second pair-of-eyes for this Story ; operator-discipline rule "every requirements Story MUST go through committee" enforced by ADR-68 |
| Scope creep : operator or committee asks to add EB-EI cross-cutting requirements here | medium | medium | §8 out-of-scope is exhaustive ; any cross-cutting requirement gets a dedicated Story under the relevant Epic, NOT a §3 / §4 row addition |
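The flake-threshold risk is easier to reason about with the flake rate written down. The sketch below assumes one possible definition, since F5 does not pin one : a test's flake rate is the share of its SHAs on which both PASSED and FAILED were observed. The function names and the sample runs are hypothetical, and filtering to the last 7 days is left to the caller.

```python
from collections import defaultdict


def flake_rates(runs):
    """runs: iterable of (test_id, sha, outcome), outcome in {"PASSED", "FAILED"}.

    ASSUMED definition: share of a test's SHAs on which both outcomes were seen.
    """
    seen = defaultdict(set)  # (test_id, sha) -> set of outcomes observed
    for test_id, sha, outcome in runs:
        seen[(test_id, sha)].add(outcome)

    total = defaultdict(int)
    flaky = defaultdict(int)
    for (test_id, _sha), outcomes in seen.items():
        total[test_id] += 1
        flaky[test_id] += int(len(outcomes) > 1)  # flipped on the same SHA
    return {test_id: flaky[test_id] / total[test_id] for test_id in total}


def over_threshold(rates, threshold):
    """Test ids whose flake rate is strictly above the threshold."""
    return sorted(t for t, rate in rates.items() if rate > threshold)


# Hypothetical history: test_a flips on 1 of 10 SHAs, test_b never flips.
runs = [("test_a", f"sha{i}", "PASSED") for i in range(10)]
runs.append(("test_a", "sha0", "FAILED"))
runs += [("test_b", "sha0", "PASSED"), ("test_b", "sha1", "PASSED")]

rates = flake_rates(runs)
print(over_threshold(rates, 0.10), over_threshold(rates, 0.05))  # [] ['test_a']
```

The same history passes the 10 % bootstrap threshold and fails the 5 % target, which is the gap the monthly ratchet in the mitigation is meant to close gradually.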

10. Sequencing + dependencies

S01 (test strategy, wp#116, PR #852) ────┐
                                          ├──► S02 (this Story, wp#117)
                                          │     plan_review committee
                                          │     PR open + CR + merge
                                          │     wp#117 New → In spec → Specified → In progress → Developed → In testing → Tested → Closed
                                          └──► (other Epics EB-EI requirements Stories)
                                                each refines its own subset of S01

S02 PR can land BEFORE S01 PR merges (S02 references S01 strategy via the PR branch path).
S02 closure requires S01 to have merged (the trace requires the strategy doc to exist on main).

S02 ──► S03 (foundation Epic architecture)
       designs against S02 requirements
S03 ──► S04+ (implementation Stories per service / fixture layer)

Single-WIP rule (per ADR-69) : S02 sits in the Specified buffer once committee plan_review PASSES, waiting for capacity in In progress. Currently In progress Stories : wp#103 (S14 Track 1 leakage), wp#46 (S07 Track 2), wp#45 (S06 Track 11). S02 doesn't compete for capacity until one of those closes.


11. References

  • OP wp#117 / GH #837 : this Story's tracking
  • S01 dossier : documentation/reviews/2026-05-05-cvn-n015-ea-s01-test-strategy-plan.md (lands on PR #852 merge — backtick-quoted path here instead of markdown link to avoid mkdocs strict cross-PR resolution failure)
  • S01 strategy doc : documentation/strategy/CVN-N015-test-strategy.md (lands on PR #852 merge)
  • S01 ADR : documentation/adr/0083-test-taxonomy-and-gate-hierarchy.md (lands on PR #852 merge)
  • Testcontainers integration Story : #757 — F2 requirement is shaped by its design
  • Existing pytest factories Story : #586 — F4 inherits its scope
  • Flaky-test detector Story : #756 — F5 is its requirements expression
  • ADR-58 (FTF guardrails) — drives U4 (FTF launcher pre-sweep)
  • ADR-68 (committee = default review channel) — drives §7 criterion 1
  • ADR-77 (MkDocs SSoT) — this dossier is the SSoT for S02 requirements
  • ADR-81 (8-state Story workflow) — drives §7 criterion 7

12. Plan-review questions for committee

  1. Use case U6 (audit reviewer) numerical bar : pytest --collect-only snapshot per release tag is the proposed evidence. Is that sufficient to reconstruct the test surface during a 6-month-after incident review, or do we need a richer artefact (e.g., mapping test → covered code lines via coverage.py per release) ?
  2. NF4 parallelism contract : we mandate -n auto safety as a merge gate. Is this realistic from day 1 (Testcontainers spin-up cost may dominate) or should we phase it in (allow -n 1 initially with a 1-month migration window) ?
  3. NF6 fresh-clone < 5 min : aggressive bound. The current .venv_airflow install is ~3 min wall-clock on a fast laptop ; tight margin. Acceptable as a stretch goal with an alert at 4 min, or relax to < 7 min ?
  4. Glossary §6 mock vs spy : the line is famously fuzzy. Is the "use mock for incoming, spy for outgoing-while-keeping-real-behaviour" rule clear enough for the next contributor, or do we want a short decision tree (if you need to verify a call AND let the real impl run → spy ; otherwise mock) ?
  5. Scope of "service virtualisation" F2 : we list 5 services (Postgres, Redis, MinIO, MLflow, Airflow). Are we missing any (e.g., the Console Streamlit app, Loki, Grafana), or does the test stack consciously stop at the 5 ?
  6. Decision C (time_machine) lock-in risk : if the lib goes unmaintained in 18 months, the migration cost back to freezegun is a few hours per test. Is that risk profile acceptable vs picking the more entrenched freezegun from day 1 ?