CVN-N015 : Test strategy

Date : 2026-05-05
Story : CVN-N015-EA-S01 (OP wp#116)
GH issue : #836
Operator decisions : wp#116 comment 701 (2026-05-05) : A=b, B=b, C=b, D=b, E=c, F=a
Companion ADR : ADR-0083 (status accepted ; same gate, see below)
Status : accepted (committee plan_review session 53d76f0f PASSED 2026-05-05 ; ratification per OP wp#116 operator decisions A=b, B=b, C=b, D=b, E=c, F=a ; strategy + ADR-0083 are ratified together by the same gate)

This document is the canonical reference for what counts as a "test" in CVNTrade, when each kind runs, who owns it, and what it blocks. Every Story under CVN-N015-E* MUST reference this doc to scope its own deliverables ; downstream Stories that disagree with this strategy MUST raise an amendment Story rather than re-derive a contradictory taxonomy locally.

1. Context

CVNTrade ships ML-based crypto trading on a single-operator setup. Test discipline today is informal : pytest runs the unit + integration suites ; a few smoke checks live in DAG launchers ; data quality is mostly visual via Grafana. CVN-N015 (#592) is the umbrella that formalises this into a layered validation pyramid.

Without a strategy first, every Epic in CVN-N015 (EA-EI) risks re-deriving the test taxonomy and contradicting its siblings (e.g., Epic EE "ML behaviour suite" calling data quality what Epic ED calls contract). This Story is the head Story of EA precisely to prevent that drift.

The strategy adopts an honest scope : 12 of the 16 test types commonly catalogued in industry. Four are dropped : chaos (no SRE team), exploratory (already happens informally), load (throughput and saturation are not realistic failure modes at the current workload ; see §9), regression (not a separate type ; covered by the unit type with a regression marker).

2. Test taxonomy : 12 types in scope

#	Type	Scope (one-liner)	Lifecycle stage	Layer (#592 pyramid)
1	unit	Pure-function logic, no I/O, no fixtures heavier than in-memory pandas frames. `pytest -m unit`.	every PR (fast tier)	base
2	property	Hypothesis-style invariants on FE / labels / cache keys (e.g., "shift(N) preserves NaN count").	every PR (fast tier)	base
3	contract	Per-API contract (FE inputs, MLflow artefact schema, ETL output schema). Fails fast on schema drift.	every PR (fast tier)	middle
4	cache	L1/L2/L3 cache key correctness + invalidation (per ADR-04). Tests cache hit/miss + key extension semantics.	every PR (medium tier)	middle
5	integration	Multi-component flows : `EnrichmentAPI` → `FeatureEngineeringAPI` → `InferenceAPI` ; cross-process kernel parity.	every PR (medium tier)	middle
6	DAG smoke	`dag.test()` per DAG : verifies the import + task discovery + 1-step execution under stub data. (Per Epic EF)	every PR (medium tier)	middle
7	data quality	Great Expectations OSS suites on OHLCV + L2 + KPI events. (Per Epic ED)	every PR + nightly (per-ingest is implicit in nightly)	middle
8	ML behaviour	Evidently OSS + Giskard : drift / leakage / fairness / robustness checks on candidate models. (Per Epic EE)	per FTF sweep + post-promotion shadow	middle
9	performance	p95 / p99 budgets per code path : `inference_p99 < 200ms`, `enrichment_p95 < 50ms`, `FTF cell p95 < 60s`, `train DAG run < 4h`.	every PR touching the code path + nightly drift check	middle
10	system-E2E	Paper-trading kernel + kill-switch + risk gates end-to-end with synthetic Binance feed. (Per Epic EG)	nightly + pre-deploy	top
11	UAT	Hybrid : Markdown scripted scenarios in `tests/uat/scenarios/*.md` for backend flows + Playwright recordings for Console UI flows. Run by operator before any manual prod action.	per Story closure (operator-driven)	top
12	post-deploy smoke	k8s liveness + 1-prediction-call validation + Grafana panel populated. Runs immediately after every Helm upgrade.	post-deploy	top
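The `pytest -m unit` selection in the table presupposes that each type is registered as a pytest marker. The sketch below shows one way a `conftest.py` could do that. The marker names are assumptions that this document does not fix: multi-word types are written with underscores, and only the PR-time types plus `regression` are registered. `pytest_configure` and `config.addinivalue_line` are standard pytest hooks.

```python
# conftest.py (sketch): register one marker per PR-time test type so that
# `pytest --strict-markers` rejects typos such as `@pytest.mark.untit`.

# Assumed marker names: a marker must be a valid Python identifier, hence
# `dag_smoke` and `data_quality` instead of the table's spaced labels.
TEST_TYPE_MARKERS = {
    "unit": "pure-function logic, no I/O",
    "property": "invariants on FE / labels / cache keys",
    "contract": "schema checks at API boundaries",
    "cache": "cache key correctness + invalidation",
    "integration": "multi-component in-process flows",
    "dag_smoke": "import + task discovery + 1-step run per DAG",
    "data_quality": "content checks on OHLCV + L2 + KPI events",
    "performance": "p95 / p99 latency budgets",
    "regression": "unit test added to catch a specific past bug",
}


def pytest_configure(config):
    """Standard pytest hook: declare the markers on the config object."""
    for name, description in TEST_TYPE_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, the fast tier can be selected with a quoted marker expression such as `pytest -m "unit or property or contract"`.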

3. Cadence matrix

Type	Push	PR (touching code)	Nightly	Per-Story	Pre-deploy	Post-deploy
unit	–	✅ fast tier	–	–	–	–
property	–	✅ fast tier	–	–	–	–
contract	–	✅ fast tier	–	–	–	–
cache	–	✅ medium tier	–	–	–	–
integration	–	✅ medium tier	–	–	–	–
DAG smoke	–	✅ medium tier	–	–	–	–
data quality	–	✅ medium tier	✅	–	–	–
ML behaviour	–	–	–	✅ per FTF sweep	–	✅ shadow stage
performance	–	✅ on touched code paths	✅ drift check	–	–	–
system-E2E	–	–	✅	–	✅	–
UAT	–	–	–	✅ operator	–	–
post-deploy smoke	–	–	–	–	–	✅

Operator decision B : fast tier runs on PRs touching code (NOT every push) ; no nightly safety net for fast tier (main is protected by PR gates). Medium tier runs on PRs that touch the corresponding subsystem (cache changes, enrichment changes, ETL changes).
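As an illustration of operator decision B, the sketch below maps the files changed by a PR to the suites that should run. The path globs and the suite-to-subsystem mapping are hypothetical (the repository layout is not described in this document) ; only the rule itself comes from the text above : fast tier on any code-touching PR, medium tier only for the touched subsystem.

```python
from fnmatch import fnmatch

FAST_TIER = ("unit", "property", "contract")

# Hypothetical layout: which path globs count as "code", and which
# subsystem globs trigger which medium-tier suite.
CODE_GLOBS = ("src/*", "dags/*", "tests/*")
MEDIUM_TIER_TRIGGERS = {
    "cache": ("src/cache/*",),
    "integration": ("src/enrichment/*", "src/features/*", "src/inference/*"),
    "dag_smoke": ("dags/*",),
    "data_quality": ("src/etl/*",),
}


def _touches(changed_files, globs):
    """True if any changed file matches any of the globs."""
    return any(fnmatch(path, glob) for path in changed_files for glob in globs)


def select_suites(changed_files):
    """Return the suites to run for a PR, fast tier first."""
    if not _touches(changed_files, CODE_GLOBS):
        return []  # docs-only PR: nothing runs
    suites = list(FAST_TIER)
    for suite, globs in MEDIUM_TIER_TRIGGERS.items():
        if _touches(changed_files, globs):
            suites.append(suite)
    return suites
```

For example, `select_suites(["src/cache/keys.py"])` yields the three fast-tier suites plus `cache`, while a PR that only edits `docs/index.md` yields an empty list.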

4. Gate hierarchy : what blocks what (operator decision C, tiered)

Test type	Blocks merge ?	Blocks deploy ?	Blocks LOCK ?	Blocks Story closure ?
unit	YES	(cascade)	(cascade)	(cascade)
property	YES	(cascade)	(cascade)	(cascade)
contract	YES	(cascade)	(cascade)	(cascade)
cache	warn only	–	–	–
integration	YES	(cascade)	(cascade)	(cascade)
DAG smoke	YES	(cascade)	(cascade)	(cascade)
data quality	warn only	YES	(cascade)	(cascade)
ML behaviour	–	–	YES (FTF gate)	(cascade)
performance	warn only on PR	–	YES (LOCK gate)	(cascade)
system-E2E	–	YES	–	(cascade)
UAT	–	–	–	YES (operator sign-off)
post-deploy smoke	–	rollback trigger	–	(cascade)

Reading rule : "(cascade)" means a YES at an upstream stage transitively blocks downstream stages without needing a re-run. E.g., a unit-test failure blocks merge, which prevents deploy, which prevents LOCK, which prevents Story closure.
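The reading rule can be made concrete by encoding the table row by row. The sketch below is illustrative only : the identifiers (`GATES`, `blocked_stages`, the underscore type names) are assumptions, and the cell values are transcribed from the table rather than derived, so rows such as system-E2E keep their "–" in the LOCK column.

```python
STAGES = ("merge", "deploy", "LOCK", "closure")

# One tuple per table row, in STAGES order. None stands for "–".
# "warn" and "rollback" are recorded but do not count as blocking.
GATES = {
    "unit": ("YES", "cascade", "cascade", "cascade"),
    "property": ("YES", "cascade", "cascade", "cascade"),
    "contract": ("YES", "cascade", "cascade", "cascade"),
    "cache": ("warn", None, None, None),
    "integration": ("YES", "cascade", "cascade", "cascade"),
    "dag_smoke": ("YES", "cascade", "cascade", "cascade"),
    "data_quality": ("warn", "YES", "cascade", "cascade"),
    "ml_behaviour": (None, None, "YES", "cascade"),
    "performance": ("warn", None, "YES", "cascade"),
    "system_e2e": (None, "YES", None, "cascade"),
    "uat": (None, None, None, "YES"),
    "post_deploy_smoke": (None, "rollback", None, "cascade"),
}


def blocked_stages(failed_type):
    """Stages that a failure of `failed_type` blocks, directly or by cascade."""
    row = GATES[failed_type]
    return [stage for stage, cell in zip(STAGES, row) if cell in ("YES", "cascade")]
```

Here `blocked_stages("unit")` returns all four stages, whereas `blocked_stages("cache")` returns an empty list because a cache failure only warns.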

Why tiered (vs strict) : under a strict model every test blocks its corresponding state transition, which produces a "test-induced traffic jam". The tiered model respects the cost of each test type (a 4-hour FTF sweep cannot block every PR ; a 50 ms unit test must).

5. Performance budgets : canonical (operator decision D)

Code path	Budget	Test type that enforces
`InferenceAPI.predict()` (per call)	p99 < 200 ms	`performance` (PR-time)
`EnrichmentAPI.enrich_streaming()` (per candle)	p95 < 50 ms	`performance` (PR-time)
FTF sweep per cell (single train+eval)	p95 < 60 s	`performance` (nightly drift)
Train DAG run end-to-end (per crypto × strategy)	p99 < 4 h	`performance` (nightly drift)
`compute_btc_features()` (1000-row window)	p95 < 200 ms	`performance` (PR-time)
`compute_l2_features()` (1000-row window)	p95 < 200 ms	`performance` (PR-time)
Post-deploy smoke end-to-end	p99 < 30 s	`post-deploy smoke`

Budgets are first-pass approximations ; refined via dedicated budget Story when nightly drift triggers fire repeatedly (defer-by-default, refine-on-signal).
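A budget check reduces to computing a percentile over measured latencies and comparing it with the table. The sketch below uses the nearest-rank percentile, which is an assumption (the document does not name a percentile method), and the dictionary keys are hypothetical labels for the PR-time code paths.

```python
# (percentile, limit in ms) per code path; values copied from the table,
# key names are illustrative labels only.
BUDGETS_MS = {
    "inference_predict": (99, 200.0),
    "enrich_streaming": (95, 50.0),
    "compute_btc_features": (95, 200.0),
    "compute_l2_features": (95, 200.0),
}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct % of all samples are less than or equal to it (pct is an int)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(code_path, samples_ms):
    """True if the measured percentile is strictly below the budget."""
    pct, limit_ms = BUDGETS_MS[code_path]
    return percentile(samples_ms, pct) < limit_ms
```

With 100 samples, one 500 ms outlier does not break a p95 budget, but two such outliers break a p99 budget, which is why the tighter percentile is reserved for the inference path.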

6. UAT format : hybrid (operator decision E)

Backend flows : tests/uat/scenarios/*.md Markdown checklists. Each scenario has ## Setup / ## Steps / ## Expected output / ## Pass criteria / ## Last validated. Operator runs before Story closure ; checks the box in Last validated with date + commit SHA.
Console UI flows : Playwright recordings under tests/uat/playwright/. Operator runs the recorded session against console.cvntrade.eu (the live single-environment Console — no separate staging tier exists in v1) ; pass criterion = no thrown errors + screenshot diff < 5 % vs baseline. Read-only invariant : Playwright UAT scenarios MUST be view-only (read endpoints + listing pages) — any test that flips state, mutates ftf_config, triggers a launcher, or sends a kill-switch action MUST move to the Markdown backend-flows scenarios (executed manually with explicit operator confirmation per step). Follow-up Story will add a dedicated staging Console once a second engineer joins the project.

UAT is operator-driven (not CI-automated) per the single-operator reality. The recorded artefacts (Markdown + Playwright) serve as the documented contract for what the operator validated.
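Although UAT itself is not CI-automated, the shape of a backend scenario file can still be checked mechanically. The sketch below lists which of the five required `##` sections are absent from a scenario ; the function name and the idea of linting scenarios at all are assumptions, not part of the ratified strategy.

```python
import re

REQUIRED_SECTIONS = (
    "Setup",
    "Steps",
    "Expected output",
    "Pass criteria",
    "Last validated",
)


def missing_sections(markdown_text):
    """Return the required level-2 headings absent from a scenario file."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]
```

A scenario is structurally complete when `missing_sections(text)` returns an empty list ; deeper headings (`###`) are ignored by the pattern.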

7. Test-type ownership : single DRI (operator decision F)

Type	DRI	Backup
ALL 12 TYPES	`@dococeven`	`@dococeven`

This table honestly reflects a single-operator project : no fictitious ownership stubs, the operator IS the team. When the team grows, this table is the natural place to introduce per-type ownership (likely : data quality + ML behaviour to an ML engineer ; performance + system-E2E to an infrastructure engineer).

8. Mapping existing GH issues to test-type buckets

GH issue	Title (short)	Test type
#592	Layered validation pyramid	(architecture, parent of all)
#586	Pytest fixtures + factories	unit, property, integration
#757	Testcontainers integration setup	integration
#756	Flaky-test detector	unit, property, integration
#633	Drift detection contracts	ML behaviour, contract
#426	Backtest parity invariant	integration, property
#614	Batch ↔ streaming parity certificate	property, contract

9. Out-of-scope (explicit, with rationale)

Type	Why dropped
chaos	No SRE team to operate the chaos schedules + interpret the dashboards. Re-introduce when team grows.
exploratory	Informal exploration happens during operator's daily workflow ; codifying it adds process without value.
load	Load testing measures sustained throughput + resource saturation under concurrent traffic, distinct from the per-call `performance` p95/p99 latency budgets. Dropped because the single-operator setup processes ~13 cryptos × 1 prediction per 15 min = ~52 predictions/h, a rate far below what a single worker can sustain when each call stays within the 200 ms p99 inference budget, whatever the plausible concurrency. Re-introduce as a first-class type when the workload reaches a regime where queue depth or pool exhaustion become realistic failure modes (e.g., live-trading multi-strategy multi-exchange fan-out).
regression	Not a test type per se ; every unit test IS a regression test for the bug it was added to catch. Tracked via `pytest -m regression` marker on the existing unit suite.
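The arithmetic behind the `load` row can be checked in a few lines. The sketch assumes the worst case in which every call takes the full 200 ms p99 budget and calls run strictly one at a time ; the function name is illustrative.

```python
def serial_utilisation(cryptos, interval_min, latency_s):
    """Return (predictions per hour, fraction of one worker's hour spent
    predicting) when calls run one at a time at `latency_s` each."""
    predictions_per_hour = cryptos * (60 / interval_min)
    busy_fraction = predictions_per_hour * latency_s / 3600
    return predictions_per_hour, busy_fraction


rate, utilisation = serial_utilisation(cryptos=13, interval_min=15, latency_s=0.200)
print(f"{rate:.0f} predictions/h, {utilisation:.2%} of one worker")
# prints: 52 predictions/h, 0.29% of one worker
```

Even under that pessimistic assumption, the hourly workload occupies about 10 seconds of a single worker, so queueing and pool exhaustion are not plausible failure modes yet.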

10. Glossary (canonical definitions ; avoids "system vs integration" debates)

Term	Definition
fast tier	Tests that complete in < 5 s per file ; run on every code-touching PR with `pytest -m "unit or property or contract"`.
medium tier	Tests that complete in < 60 s per file ; run on PRs that touch the corresponding subsystem.
slow tier	Tests that take > 60 s ; run nightly OR per Story closure OR pre-deploy.
integration test	Exercises > 1 component but stays in-process (no docker, no k8s). Example : `EnrichmentAPI` → `FeatureEngineeringAPI` chain with synthetic data.
system-E2E test	Exercises > 1 process AND ≥ 1 container/k8s pod. Example : paper-trading kernel + Postgres + Redis + Loki via Testcontainers.
contract test	Validates the data shape (columns, types, nullability) AT the boundary between 2 components. Does NOT validate behaviour.
data quality test	Validates the data CONTENT (ranges, distributions, business rules) on production-like data.
ML behaviour test	Validates the model's RESPONSE to specific inputs : drift detection, fairness, perturbation robustness. Does NOT validate accuracy on a held-out set (that's the FTF sweep).
performance test	Validates p95/p99 latency vs the canonical budget table (§5). Throughput / saturation budgets are out of scope in v1 — see §9 `load` row for when they get re-introduced.
UAT	Operator runs a scripted scenario (Markdown OR Playwright recording) and confirms the observed behaviour matches the documented expectation. Not CI-automated.
post-deploy smoke	k8s liveness + 1 inference call + Grafana panel populated, run within 60s of every Helm upgrade.

11. Story-phase × test-artefacts integration matrix (the test factory ↔ OP workflow)

Strategic invariant (foundational for the whole CVN-N015 stack) : tests are NOT a "phase that follows code" ; they are first-class artefacts of every Story state transition per ADR-81. Every gate the operator passes through carries a test deliverable ; every committee session validates that deliverable's presence + quality ; every closure immutably pins the (test_sha, dataset_sha, run_id) triples that proved each acceptance criterion.

The test stack is "test-as-code with full reproducibility" : every test result is reproducible from (git_sha, dataset_sha) alone — no manual setup, no operator-specific environment, no "works on my machine". The CI is the only acceptable source of authoritative test results ; local runs are for development feedback only.

11.1 Canonical matrix : gate-by-gate

ADR-81 transition	Test artefact required	Committee verdict required	Stewarding tool	Gate enforcement
`New → In specification`	none	–	–	–
`In specification → Specified`	"Test strategy" subsection in plan dossier — per type, with numerical bars + planned dataset families	`plan_review` : committee MUST explicitly verdict the test strategy (not implicitly absorb into the dossier verdict) — `PASSED` requires the test strategy is sufficient for the Story's risk profile	committee `plan_review` validates	committee verdict
`Specified → In progress`	none (start coding)	–	–	–
`In progress → Developed`	All test code written + green locally + discoverable via `@pytest.mark.story("<cvn_id>")` + datasets versioned (DVC for > 10 MB, content-addressed in-repo for ≤ 10 MB)	none (developer-side discipline gate)	`pytest --collect-only --story <cvn_id>` returns ≥ 1 test per acceptance criterion (filter via the `--story` CLI flag added by S04's pytest plugin — pytest's native `-m` filter operates on marker names not arguments, see S03 §6.1)	guardrail CI G5 (Story workflow guardrails — extends G1-G4)
`Developed → In testing`	Full CI pipeline green (fast + medium tiers + relevant nightly suites) on PR head SHA + test artefacts ship with the PR (test code + cases + datasets + draft manifest)	`pr_review` : committee MUST explicitly verdict that (a) test code covers each acceptance criterion, (b) test cases include adversarial/edge coverage proportional to risk, (c) datasets are versioned + reproducible, (d) draft manifest accurately maps tests → criteria — `PASSED` requires all 4 sub-verdicts	CI job artefacts archived + committee session JSON	CI status check + PR template box + committee verdict
`In testing → Tested`	UAT operator-validated (per §6) + test-run report committed	none (operator-driven UAT ; committee already verdicted at `pr_review`)	`documentation/stories/<cvn_id>/tests/test_run_<sha>.md`	OP audit comment + commit
`Tested → Closed`	Test manifest committed pinning `(test_sha, dataset_sha, run_id)` for each acceptance criterion ; immutable thereafter	optional `closure_review` for high-risk Stories (operator decides at `Specified` whether closure needs a third committee gate ; default = no — the `pr_review` test-artefact verdict carries through)	`documentation/stories/<cvn_id>/tests/manifest.yaml`	OP audit comment + commit + immutability check (S04+ tooling)
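The `In progress → Developed` row relies on a `--story` CLI flag delivered by S04's pytest plugin, which is not specified here. The sketch below is one hypothetical way such a filter could be written in a `conftest.py` ; it is not the S04 implementation. The pytest calls used (`parser.addoption`, `config.getoption`, `item.iter_markers`, `config.hook.pytest_deselected`) are standard pytest APIs.

```python
# conftest.py (sketch): hypothetical `--story <cvn_id>` filter that keeps only
# tests decorated with @pytest.mark.story("<cvn_id>").


def pytest_addoption(parser):
    """Standard pytest hook: declare the custom command-line flag."""
    parser.addoption(
        "--story",
        action="store",
        default=None,
        help="only collect tests marked @pytest.mark.story(<cvn_id>)",
    )


def _story_ids(item):
    """All Story ids attached to a test item through `story` markers."""
    return {arg for marker in item.iter_markers(name="story") for arg in marker.args}


def pytest_collection_modifyitems(config, items):
    """Standard pytest hook: deselect items that do not belong to the Story."""
    wanted = config.getoption("--story")
    if wanted is None:
        return  # flag not given: leave the collection untouched
    selected = [item for item in items if wanted in _story_ids(item)]
    deselected = [item for item in items if wanted not in _story_ids(item)]
    if deselected:
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```

The filter matches on marker arguments, which is exactly what the native `-m` expression cannot do. Checking "≥ 1 test per acceptance criterion" would need an additional criterion identifier on each test and is left out of this sketch.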

11.2 Per-Story test artefacts directory

Every Story owns a folder under documentation/stories/<cvn_id>/tests/ with a fixed structure :

documentation/stories/<cvn_id>/tests/
├── strategy.md                  # the Story's "test strategy" subsection — extracted from plan dossier at Specified
├── test_run_<sha>.md            # one file per CI run that gated a transition (Developed/In testing/Tested)
├── manifest.yaml                # committed at Tested → Closed ; immutable from this point
└── datasets/                    # dataset hashes + DVC pointers (S03 architecture decides exact scheme)

The S04+ implementation Stories instantiate this folder and the matrix's gate enforcement.
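Until the S04+ tooling exists, the fixed structure can be verified with a short script. The sketch below is an assumption about how such a check might look : the function name and the `closed` flag are hypothetical, and `manifest.yaml` is only required once the Story has passed `Tested → Closed`.

```python
from pathlib import Path


def missing_artefacts(tests_dir: Path, closed: bool = False):
    """List the expected entries absent from a Story's tests/ folder."""
    missing = []
    if not (tests_dir / "strategy.md").is_file():
        missing.append("strategy.md")
    # At least one run report named test_run_<sha>.md must exist.
    if not any(tests_dir.glob("test_run_*.md")):
        missing.append("test_run_<sha>.md")
    if not (tests_dir / "datasets").is_dir():
        missing.append("datasets/")
    # The manifest is committed only at Tested -> Closed.
    if closed and not (tests_dir / "manifest.yaml").is_file():
        missing.append("manifest.yaml")
    return missing
```

A call such as `missing_artefacts(Path("documentation/stories/<cvn_id>/tests"), closed=True)` returns an empty list for a correctly closed Story.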

11.3 Vocabulary (canonical ; referenced by S02 + S03 + ADR-0087 + ADR-0088)

Test artefact : any file produced or consumed by a Story that touches tests (test code, test case, test dataset, test run report, test manifest)
Test case : a data-driven scenario (input × expected-output pair) versioned under tests/cases/<cvn_id>/
Test dataset : a reproducibility-grade data fixture, content-addressed by hash under tests/datasets/ (DVC for > 10 MB, in-repo for ≤ 10 MB)
Test run report : the output of a single CI run on a specific (git_sha, dataset_sha) pair, archived under documentation/stories/<cvn_id>/tests/
Test manifest : the per-Story YAML pinning the (test_sha, dataset_sha, run_id) triples that proved each acceptance criterion, committed at Tested → Closed and immutable from that point
Story-phase gate : an ADR-81 transition that requires specific test artefacts to be produced before the transition is allowed
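The vocabulary above can be sketched as data structures. The example assumes SHA-256 as the content hash (the document only says "content-addressed by hash") and uses illustrative names (`Proof`, `unproven_criteria`) ; the exact manifest schema is left to S03 and ADR-0088.

```python
import hashlib
from dataclasses import dataclass


def dataset_sha(payload: bytes) -> str:
    """Content address of a dataset: hex SHA-256 of its bytes (assumed scheme)."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)  # frozen mirrors the manifest's immutability in memory
class Proof:
    """One manifest entry: the triple that proved an acceptance criterion."""
    criterion: str
    test_sha: str
    dataset_sha: str
    run_id: str


def unproven_criteria(acceptance_criteria, proofs):
    """Acceptance criteria that no manifest entry pins yet."""
    proven = {proof.criterion for proof in proofs}
    return [c for c in acceptance_criteria if c not in proven]
```

A manifest is ready for `Tested → Closed` when `unproven_criteria(...)` returns an empty list ; any change to a dataset's bytes changes its `dataset_sha`, which is what makes a pinned triple verifiable later.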

11.4 Committee scope extension : tests are an explicit verdict item

This is a scope extension to ADR-68 (committee = default review channel) : both plan_review and pr_review MUST issue an explicit verdict on the test artefacts as a first-class review item, not implicitly absorbed into the broader code/dossier verdict. The expert-test role within the committee owns this verdict (when team grows ; until then, the existing 5-expert panel covers it as a mandatory checklist item).

Concretely, the pr_review verdict body MUST contain a tests: section with 4 sub-verdicts :

tests:
  coverage_per_acceptance_criterion: PASSED | INSUFFICIENT
  adversarial_edge_coverage: PASSED | INSUFFICIENT
  datasets_versioned_reproducible: PASSED | INSUFFICIENT
  manifest_maps_tests_to_criteria: PASSED | INSUFFICIENT

Any INSUFFICIENT blocks the merge regardless of the rest of the verdict. This is not a soft signal — it's a hard gate.

The plan_review verdict body MUST contain a tests_strategy: field asserting whether the planned test strategy is sufficient for the Story's risk profile. INSUFFICIENT blocks In specification → Specified.
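Both hard gates can be expressed as small predicates over the verdict body. The sketch assumes the verdict has already been parsed into a dictionary and treats a missing field as blocking, which is an interpretation of "PASSED requires all 4 sub-verdicts" rather than a stated rule ; the function names are illustrative.

```python
PR_REVIEW_SUB_VERDICTS = (
    "coverage_per_acceptance_criterion",
    "adversarial_edge_coverage",
    "datasets_versioned_reproducible",
    "manifest_maps_tests_to_criteria",
)


def pr_review_blocks_merge(tests_section):
    """True if any of the 4 sub-verdicts is missing or not PASSED."""
    return any(tests_section.get(key) != "PASSED" for key in PR_REVIEW_SUB_VERDICTS)


def plan_review_blocks_specified(verdict):
    """True if `tests_strategy` is missing or not PASSED, which blocks
    the `In specification -> Specified` transition."""
    return verdict.get("tests_strategy") != "PASSED"
```

A single `INSUFFICIENT` is enough for `pr_review_blocks_merge` to return `True`, regardless of the other three values.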

ADR-0087 (drafted in S03) ratifies this contract.

11.5 Future ADRs that lock this contract

This section is the strategic invariant. The mechanical contracts land in :
- ADR-0087 : Story-phase test integration ; tests are first-class artefacts of every ADR-81 transition + explicit committee verdict at every gate (drafted in S03, wp#118)
- ADR-0088 : Test cases + datasets versioned + provenance-tracked (test-as-code with reproducibility guarantee) (drafted in S03, wp#118)

Until ADR-0087 + ADR-0088 land, this §11 is the authoritative invariant. The 8-state workflow runbook (process/STORY_WORKFLOW.md) will be extended with the per-gate test-artefact checklist by S03 closure.

12. Open follow-ups

ADR-0083 (this strategy's companion) is accepted together with this strategy by the same gate (committee plan_review 53d76f0f PASSED + operator decisions). A future Epic EI Story may amend it once downstream Stories in Epics EA-EI surface scope they want to revise (as an amendment Story per the strategy's invariant, not a re-ratification).
pytest -m regression marker convention codified in EA-S02 fixture/factory Story.
Performance budget table refined per nightly drift signal, NOT pre-emptively.
Per-type ownership table updated when team grows beyond single operator.

13. References

OP wp#116 / GH #836 : this Story's tracking
F1 plan §6 (sequencing, gates) : the ML-side authority for FTF gates referenced by ML behaviour row
ADR-04 (explicit cache policy) : authority for cache test type contract
ADR-14 (mandatory multi-fold) : authority for what ML behaviour validates
ADR-23 (features version-pinned, fail-fast) : authority for contract test type
ADR-26 (Grafana as single entry point) : post-deploy smoke validates that the Grafana panel is populated
ADR-30 (structured logs, stable interface) : contract includes the log event schema
ADR-31 / ADR-32 (logging, structured events) : contract includes log format compliance
ADR-58 (FTF guardrails) : authority for what unit covers on FTF factor matrices
ADR-59 (PG ftf_config) : contract validates the ftf_config schema
ADR-77 (MkDocs SSoT) : this strategy doc + ADR-0083 are the SSoT for test taxonomy
ADR-79 (FTF Story closure 8-step) : ML behaviour row references this for FTF gate semantics
ADR-81 (8-state Story workflow) : gate hierarchy table aligns with the 8 state transitions
ADR-82 (committee → OP Meeting) : strategy_review committee session logged per this contract
Issue #592 : architectural pyramid (parent issue ; this strategy is its textual companion)