CVN-N015 — Test strategy
Date : 2026-05-05
Story : CVN-N015-EA-S01 (OP wp#116)
GH issue : #836
Operator decisions : wp#116 comment 701 (2026-05-05) — A=b, B=b, C=b, D=b, E=c, F=a
Companion ADR : ADR-0083 (status accepted — same gate, see below)
Status : accepted (committee plan_review session 53d76f0f PASSED 2026-05-05 ; ratification per OP wp#116 operator decisions A=b, B=b, C=b, D=b, E=c, F=a — strategy + ADR-0083 are ratified together by the same gate)
This document is the canonical reference for what counts as a "test" in CVNTrade, when each kind runs, who owns it, and what it blocks. Every Story under CVN-N015-E* MUST reference this doc to scope its own deliverables ; downstream Stories that disagree with this strategy MUST raise an amendment Story rather than re-derive a contradictory taxonomy locally.
1. Context
CVNTrade ships ML-based crypto trading on a single-operator setup. Test discipline today is informal : pytest runs the unit + integration suites ; a few smoke checks live in DAG launchers ; data quality is mostly visual via Grafana. CVN-N015 (#592) is the umbrella that formalises this into a layered validation pyramid.
Without a strategy first, every Epic in CVN-N015 (EA-EI) risks re-deriving the test taxonomy and contradicting its siblings (e.g., Epic EE "ML behaviour suite" calling data quality what Epic ED calls contract). This Story is the head Story of EA precisely to prevent that drift.
The strategy adopts an honest scope : 12 of the 16 test types commonly catalogued in industry. Four are dropped : chaos (no SRE team), exploratory (already informal), load (per-call latency is covered by the performance budgets on the relevant code paths ; throughput is out of scope in v1) and regression (redundant as a separate type ; covered by the unit type with a regression marker). §9 gives the rationale for each.
2. Test taxonomy — 12 types in scope
| # | Type | Scope (one-liner) | Lifecycle stage | Layer (#592 pyramid) |
|---|---|---|---|---|
| 1 | unit | Pure-function logic, no I/O, no fixtures heavier than in-memory pandas frames. `pytest -m unit`. | every PR (fast tier) | base |
| 2 | property | Hypothesis-style invariants on FE / labels / cache keys (e.g., "shift(N) preserves NaN count"). | every PR (fast tier) | base |
| 3 | contract | Per-API contract (FE inputs, MLflow artefact schema, ETL output schema). Fails fast on schema drift. | every PR (fast tier) | middle |
| 4 | cache | L1/L2/L3 cache key correctness + invalidation (per ADR-04). Tests cache hit/miss + key extension semantics. | every PR (medium tier) | middle |
| 5 | integration | Multi-component flows : EnrichmentAPI → FeatureEngineeringAPI → InferenceAPI ; cross-process kernel parity. | every PR (medium tier) | middle |
| 6 | DAG smoke | `dag.test()` per DAG : verifies the import + task discovery + 1-step execution under stub data. (Per Epic EF) | every PR (medium tier) | middle |
| 7 | data quality | Great Expectations OSS suites on OHLCV + L2 + KPI events. (Per Epic ED) | every PR + nightly (per-ingest is implicit in nightly) | middle |
| 8 | ML behaviour | Evidently OSS + Giskard : drift / leakage / fairness / robustness checks on candidate models. (Per Epic EE) | per FTF sweep + post-promotion shadow | middle |
| 9 | performance | p95 / p99 budgets per code path : inference_p99 < 200ms, enrichment_p95 < 50ms, FTF cell p95 < 60s, train DAG run < 4h. | every PR touching the code path + nightly drift check | middle |
| 10 | system-E2E | Paper-trading kernel + kill-switch + risk gates end-to-end with synthetic Binance feed. (Per Epic EG) | nightly + pre-deploy | top |
| 11 | UAT | Hybrid : Markdown scripted scenarios in `tests/uat/scenarios/*.md` for backend flows + Playwright recordings for Console UI flows. Run by operator before any manual prod action. | per Story closure (operator-driven) | top |
| 12 | post-deploy smoke | k8s liveness + 1-prediction-call validation + Grafana panel populated. Runs immediately after every Helm upgrade. | post-deploy | top |
3. Cadence matrix
| Type | Push | PR (touching code) | Nightly | Per-Story | Pre-deploy | Post-deploy |
|---|---|---|---|---|---|---|
| unit | – | ✅ fast tier | – | – | – | – |
| property | – | ✅ fast tier | – | – | – | – |
| contract | – | ✅ fast tier | – | – | – | – |
| cache | – | ✅ medium tier | – | – | – | – |
| integration | – | ✅ medium tier | – | – | – | – |
| DAG smoke | – | ✅ medium tier | – | – | – | – |
| data quality | – | ✅ medium tier | ✅ | – | – | – |
| ML behaviour | – | – | – | ✅ per FTF sweep | – | ✅ shadow stage |
| performance | – | ✅ on touched code paths | ✅ drift check | – | – | – |
| system-E2E | – | – | ✅ | – | ✅ | – |
| UAT | – | – | – | ✅ operator | – | – |
| post-deploy smoke | – | – | – | – | – | ✅ |
Operator decision B : fast tier runs on PRs touching code (NOT every push) ; no nightly safety net for fast tier (main is protected by PR gates). Medium tier runs on PRs that touch the corresponding subsystem (cache changes, enrichment changes, ETL changes).
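The selection rule in operator decision B can be sketched as a small function that maps a PR's changed files to the suites it should run. This is a minimal illustration, not the project's CI logic : the path globs and the `select_suites` name are hypothetical, since this document does not describe the repository layout.

```python
from fnmatch import fnmatch

# Hypothetical globs: which paths count as "code" and which subsystem
# each medium-tier type is attached to. Not taken from the real repository.
CODE_GLOBS = ["src/*", "tests/*"]
MEDIUM_TIER_TRIGGERS = {
    "cache": ["src/cache/*"],
    "integration": ["src/enrichment/*", "src/inference/*"],
    "data quality": ["src/etl/*"],
}


def select_suites(changed_files):
    """Return the test types to run for a PR, given its changed file paths."""
    suites = []
    # Fast tier: runs as soon as the PR touches any code path.
    if any(fnmatch(f, g) for f in changed_files for g in CODE_GLOBS):
        suites += ["unit", "property", "contract"]
    # Medium tier: runs only when the corresponding subsystem is touched.
    for test_type, globs in MEDIUM_TIER_TRIGGERS.items():
        if any(fnmatch(f, g) for f in changed_files for g in globs):
            suites.append(test_type)
    return suites
```

Under these assumptions a documentation-only PR selects nothing, while a PR touching `src/cache/` selects the fast tier plus the cache suite.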
4. Gate hierarchy — what blocks what (operator decision C — tiered)
| Test type | Blocks merge ? | Blocks deploy ? | Blocks LOCK ? | Blocks Story closure ? |
|---|---|---|---|---|
| unit | YES | (cascade) | (cascade) | (cascade) |
| property | YES | (cascade) | (cascade) | (cascade) |
| contract | YES | (cascade) | (cascade) | (cascade) |
| cache | warn only | – | – | – |
| integration | YES | (cascade) | (cascade) | (cascade) |
| DAG smoke | YES | (cascade) | (cascade) | (cascade) |
| data quality | warn only | YES | (cascade) | (cascade) |
| ML behaviour | – | – | YES (FTF gate) | (cascade) |
| performance | warn only on PR | – | YES (LOCK gate) | (cascade) |
| system-E2E | – | YES | – | (cascade) |
| UAT | – | – | – | YES (operator sign-off) |
| post-deploy smoke | – | rollback trigger | – | (cascade) |
Reading rule : "(cascade)" means a YES at an upstream stage transitively blocks downstream stages without needing a re-run. E.g., a unit-test failure blocks merge, which prevents deploy, which prevents LOCK, which prevents Story closure.
Why tiered (vs strict) : under a strict scheme every test blocks its corresponding state transition, which produces a "test-induced traffic jam". Tiered respects the cost of each test type (a 4-hour FTF sweep cannot block every PR ; a 50ms unit test must).
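The reading rule can be made concrete by encoding the gate table row by row and asking which stages a given failure blocks. The sketch below is illustrative only : the `GATES` encoding and the `blocked_stages` helper are hypothetical names, and it assumes that "warn only" and "rollback trigger" cells do not count as blocking.

```python
STAGES = ["merge", "deploy", "LOCK", "Story closure"]

# One tuple per row of the gate table, in STAGES order.
# "YES" and "cascade" block; "warn", "rollback" and "-" do not (assumption).
GATES = {
    "unit": ("YES", "cascade", "cascade", "cascade"),
    "property": ("YES", "cascade", "cascade", "cascade"),
    "contract": ("YES", "cascade", "cascade", "cascade"),
    "cache": ("warn", "-", "-", "-"),
    "integration": ("YES", "cascade", "cascade", "cascade"),
    "DAG smoke": ("YES", "cascade", "cascade", "cascade"),
    "data quality": ("warn", "YES", "cascade", "cascade"),
    "ML behaviour": ("-", "-", "YES", "cascade"),
    "performance": ("warn", "-", "YES", "cascade"),
    "system-E2E": ("-", "YES", "-", "cascade"),
    "UAT": ("-", "-", "-", "YES"),
    "post-deploy smoke": ("-", "rollback", "-", "cascade"),
}


def blocked_stages(failing_types):
    """Return the stages blocked by the given failing test types, in stage order."""
    blocked = set()
    for test_type in failing_types:
        for stage, cell in zip(STAGES, GATES[test_type]):
            if cell in ("YES", "cascade"):
                blocked.add(stage)
    return [stage for stage in STAGES if stage in blocked]
```

For example, a failing unit test blocks all four stages, while a failing cache test blocks none and only raises a warning.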
5. Performance budgets — canonical (operator decision D)
| Code path | Budget | Test type that enforces |
|---|---|---|
| `InferenceAPI.predict()` (per call) | p99 < 200 ms | performance (PR-time) |
| `EnrichmentAPI.enrich_streaming()` (per candle) | p95 < 50 ms | performance (PR-time) |
| FTF sweep per cell (single train+eval) | p95 < 60 s | performance (nightly drift) |
| Train DAG run end-to-end (per crypto × strategy) | p99 < 4 h | performance (nightly drift) |
| `compute_btc_features()` (1000-row window) | p95 < 200 ms | performance (PR-time) |
| `compute_l2_features()` (1000-row window) | p95 < 200 ms | performance (PR-time) |
| Post-deploy smoke end-to-end | p99 < 30 s | post-deploy smoke |
Budgets are first-pass approximations ; refined via dedicated budget Story when nightly drift triggers fire repeatedly (defer-by-default, refine-on-signal).
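A budget check of this kind reduces to computing a percentile over measured latencies and comparing it with the table. The sketch below shows one way to do that under stated assumptions : it uses the nearest-rank percentile definition (this document does not say which definition the performance suite uses), and `BUDGETS_MS`, `percentile` and `within_budget` are hypothetical names.

```python
# (percentile, limit in ms) for two rows of the budget table.
BUDGETS_MS = {
    "InferenceAPI.predict": (99, 200.0),
    "EnrichmentAPI.enrich_streaming": (95, 50.0),
}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def within_budget(code_path, samples_ms):
    """True when the measured percentile is strictly below the budget."""
    pct, limit_ms = BUDGETS_MS[code_path]
    return percentile(samples_ms, pct) < limit_ms
```

With 100 samples, a single 500 ms outlier still passes the p99 inference budget, whereas two such outliers fail it.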
6. UAT format — hybrid (operator decision E)
- Backend flows : `tests/uat/scenarios/*.md` Markdown checklists. Each scenario has `## Setup` / `## Steps` / `## Expected output` / `## Pass criteria` / `## Last validated`. Operator runs before Story closure ; checks the box in `Last validated` with date + commit SHA.
- Console UI flows : Playwright recordings under `tests/uat/playwright/`. Operator runs the recorded session against `console.cvntrade.eu` (the live single-environment Console ; no separate staging tier exists in v1) ; pass criterion = no thrown errors + screenshot diff < 5 % vs baseline. Read-only invariant : Playwright UAT scenarios MUST be view-only (read endpoints + listing pages) ; any test that flips state, mutates `ftf_config`, triggers a launcher, or sends a kill-switch action MUST move to the Markdown backend-flows scenarios (executed manually with explicit operator confirmation per step). Follow-up Story will add a dedicated staging Console once a second engineer joins the project.
UAT is operator-driven (not CI-automated) per the single-operator reality. The recorded artefacts (Markdown + Playwright) serve as the documented contract for what the operator validated.
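Because the Markdown scenarios follow a fixed section layout, a lint step can flag scenarios that are missing a section before the operator runs them. The following is a minimal sketch of such a check ; the `missing_sections` helper is hypothetical and not part of any existing tooling described here.

```python
import re

# Section headings every backend-flow scenario is expected to carry (per this section).
REQUIRED_SECTIONS = ["Setup", "Steps", "Expected output", "Pass criteria", "Last validated"]


def missing_sections(markdown_text):
    """Return the required level-2 headings that are absent from a scenario file."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown_text, re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in found]
```

An empty result means the scenario is structurally complete ; it says nothing about whether the scenario content is adequate, which remains the operator's judgement.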
7. Test-type ownership — single DRI (operator decision F)
| Type | DRI | Backup |
|---|---|---|
| ALL 12 TYPES | @dococeven | @dococeven |
Honest about the single-operator reality : no fictitious ownership stubs ; the operator IS the team. When the second engineer joins, this table is the natural place to introduce per-type ownership (likely : data quality + ML behaviour to the new ML person ; performance + system-E2E to the new infra person).
8. Mapping existing GH issues to test-type buckets
| GH issue | Title (short) | Test type |
|---|---|---|
| #592 | Layered validation pyramid | (architecture, parent of all) |
| #586 | Pytest fixtures + factories | unit, property, integration |
| #757 | Testcontainers integration setup | integration |
| #756 | Flaky-test detector | unit, property, integration |
| #633 | Drift detection contracts | ML behaviour, contract |
| #426 | Backtest parity invariant | integration, property |
| #614 | Batch ↔ streaming parity certificate | property, contract |
9. Out-of-scope (explicit, with rationale)
| Type | Why dropped |
|---|---|
| chaos | No SRE team to operate the chaos schedules + interpret the dashboards. Re-introduce when team grows. |
| exploratory | Informal exploration happens during operator's daily workflow ; codifying it adds process without value. |
| load | Load testing measures sustained throughput + resource saturation under concurrent traffic, distinct from the per-call performance p95/p99 latency budgets. Dropped because the single-operator setup processes ~13 cryptos × 1 prediction per 15 min = ~52 predictions/h : far too low for concurrency to matter, even if every call took the full 200 ms p99 inference budget. Re-introduce as a first-class type when the workload reaches a regime where queue depth or pool exhaustion become realistic failure modes (e.g., live-trading multi-strategy multi-exchange fan-out). |
| regression | Not a test type per se ; every unit test IS a regression test for the bug it was added to catch. Tracked via pytest -m regression marker on the existing unit suite. |
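The arithmetic behind the load row can be checked in a few lines. The figures (13 cryptos, one prediction per 15 minutes, 200 ms p99 budget) come from this document ; treating every call as taking the full p99 budget is a deliberately pessimistic assumption made here for illustration.

```python
CRYPTOS = 13
PREDICTIONS_PER_CRYPTO_PER_HOUR = 60 // 15  # one prediction per 15-minute candle
P99_BUDGET_S = 0.200  # inference budget from the performance table

predictions_per_hour = CRYPTOS * PREDICTIONS_PER_CRYPTO_PER_HOUR
# Pessimistic bound: every call takes the full p99 budget.
worst_case_busy_s = predictions_per_hour * P99_BUDGET_S
utilisation = worst_case_busy_s / 3600

print(predictions_per_hour, round(worst_case_busy_s, 1), f"{utilisation:.2%}")
```

Even under that pessimistic bound, inference occupies well under 1 % of each hour, which is why saturation is not a realistic failure mode at this workload.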
10. Glossary (canonical definitions — avoids "system vs integration" debates)
| Term | Definition |
|---|---|
| fast tier | Tests that complete in < 5 s per file ; run on every code-touching PR with `pytest -m "unit or property or contract"`. |
| medium tier | Tests that complete in < 60 s per file ; run on PRs that touch the corresponding subsystem. |
| slow tier | Tests that take > 60 s ; run nightly OR per Story closure OR pre-deploy. |
| integration test | Exercises more than one component but stays in-process (no docker, no k8s). Example : EnrichmentAPI → FeatureEngineeringAPI chain with synthetic data. |
| system-E2E test | Exercises more than one process AND at least one container/k8s pod. Example : paper-trading kernel + Postgres + Redis + Loki via Testcontainers. |
| contract test | Validates the data shape (columns, types, nullability) AT the boundary between 2 components. Does NOT validate behaviour. |
| data quality test | Validates the data CONTENT (ranges, distributions, business rules) on production-like data. |
| ML behaviour test | Validates the model's RESPONSE to specific inputs : drift detection, fairness, perturbation robustness. Does NOT validate accuracy on a held-out set (that's the FTF sweep). |
| performance test | Validates p95/p99 latency vs the canonical budget table (§5). Throughput / saturation budgets are out of scope in v1 — see §9 load row for when they get re-introduced. |
| UAT | Operator runs a scripted scenario (Markdown OR Playwright recording) and confirms the observed behaviour matches the documented expectation. Not CI-automated. |
| post-deploy smoke | k8s liveness + 1 inference call + Grafana panel populated, run within 60s of every Helm upgrade. |
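The tier definitions are per-file time limits, so a CI step could flag files that have outgrown their tier. The sketch below assumes per-file durations are already available as a mapping ; `TIER_LIMITS_S` and `over_limit` are hypothetical names, and a file sitting exactly on the limit is treated as over it.

```python
# Per-file limits in seconds, from the fast tier and medium tier glossary rows.
TIER_LIMITS_S = {"fast": 5.0, "medium": 60.0}


def over_limit(durations_by_file, tier):
    """Return, sorted, the test files whose duration reaches or exceeds the tier limit."""
    limit = TIER_LIMITS_S[tier]
    return sorted(
        path for path, seconds in durations_by_file.items() if seconds >= limit
    )
```

A non-empty result is a signal to either speed the file up or move it to the next tier.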
11. Story-phase × test-artefacts integration matrix (the test factory ↔ OP workflow)
Strategic invariant (foundational for the whole CVN-N015 stack) : tests are NOT a "phase that follows code" — they are first-class artefacts of every Story state transition per ADR-81. Every gate the operator passes through carries a test deliverable ; every committee session validates that deliverable's presence + quality ; every closure pins the (test_sha, dataset_sha, run_id) triples that proved each acceptance criterion immutably.
The test stack is "test-as-code with full reproducibility" : every test result is reproducible from (git_sha, dataset_sha) alone — no manual setup, no operator-specific environment, no "works on my machine". The CI is the only acceptable source of authoritative test results ; local runs are for development feedback only.
11.1 Canonical matrix — gate-by-gate
| ADR-81 transition | Test artefact required | Committee verdict required | Stewarding tool | Gate enforcement |
|---|---|---|---|---|
| `New → In specification` | none | – | – | – |
| `In specification → Specified` | "Test strategy" subsection in plan dossier : per type, with numerical bars + planned dataset families | `plan_review` : committee MUST explicitly verdict the test strategy (not implicitly absorb into the dossier verdict) ; PASSED requires the test strategy is sufficient for the Story's risk profile | committee `plan_review` validates | committee verdict |
| `Specified → In progress` | none (start coding) | – | – | – |
| `In progress → Developed` | All test code written + green locally + discoverable via `@pytest.mark.story("<cvn_id>")` + datasets versioned (DVC for > 10 MB, content-addressed in-repo for ≤ 10 MB) | none (developer-side discipline gate) | `pytest --collect-only --story <cvn_id>` returns ≥ 1 test per acceptance criterion (filter via the `--story` CLI flag added by S04's pytest plugin ; pytest's native `-m` filter operates on marker names, not arguments, see S03 §6.1) | guardrail CI G5 (Story workflow guardrails, extends G1-G4) |
| `Developed → In testing` | Full CI pipeline green (fast + medium tiers + relevant nightly suites) on PR head SHA + test artefacts ship with the PR (test code + cases + datasets + draft manifest) | `pr_review` : committee MUST explicitly verdict that (a) test code covers each acceptance criterion, (b) test cases include adversarial/edge coverage proportional to risk, (c) datasets are versioned + reproducible, (d) draft manifest accurately maps tests → criteria ; PASSED requires all 4 sub-verdicts | CI job artefacts archived + committee session JSON | CI status check + PR template box + committee verdict |
| `In testing → Tested` | UAT operator-validated (per §6) + test-run report committed | none (operator-driven UAT ; committee already verdicted at `pr_review`) | `documentation/stories/<cvn_id>/tests/test_run_<sha>.md` | OP audit comment + commit |
| `Tested → Closed` | Test manifest committed pinning `(test_sha, dataset_sha, run_id)` for each acceptance criterion ; immutable thereafter | optional `closure_review` for high-risk Stories (operator decides at Specified whether closure needs a third committee gate ; default = no, the `pr_review` test-artefact verdict carries through) | `documentation/stories/<cvn_id>/tests/manifest.yaml` | OP audit comment + commit + immutability check (S04+ tooling) |
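The `In progress → Developed` gate asks for at least one collected test per acceptance criterion. How tests declare which criterion they prove is left to S03/S04 ; purely as an illustration, the sketch below assumes the collection step yields a mapping from test node id to the criterion ids it claims, and reports the criteria left uncovered. The `uncovered_criteria` helper and the criterion id format are hypothetical.

```python
def uncovered_criteria(acceptance_criteria, collected_tests):
    """Return acceptance criteria that no collected test claims to prove.

    collected_tests maps a test node id to the list of criterion ids it covers
    (hypothetical shape; the real plugin output is defined elsewhere).
    """
    covered = {
        criterion
        for criteria in collected_tests.values()
        for criterion in criteria
    }
    return [c for c in acceptance_criteria if c not in covered]
```

A guardrail built on this idea would fail the gate whenever the returned list is non-empty.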
11.2 Per-Story test artefacts directory
Every Story owns a folder under documentation/stories/<cvn_id>/tests/ with a fixed structure :
documentation/stories/<cvn_id>/tests/
├── strategy.md # the Story's "test strategy" subsection — extracted from plan dossier at Specified
├── test_run_<sha>.md # one file per CI run that gated a transition (Developed/In testing/Tested)
├── manifest.yaml # committed at Tested → Closed ; immutable from this point
└── datasets/ # dataset hashes + DVC pointers (S03 architecture decides exact scheme)
The S04+ implementation Stories instantiate this folder and the matrix's gate enforcement.
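To illustrate what "pinning" means for `manifest.yaml`, the sketch below validates an already-parsed manifest : every acceptance criterion must carry at least one complete `(test_sha, dataset_sha, run_id)` triple. The manifest layout (a top-level `criteria` mapping) and the `manifest_errors` helper are assumptions made for this example ; the actual schema is decided by the S03/S04 Stories.

```python
REQUIRED_KEYS = ("test_sha", "dataset_sha", "run_id")


def manifest_errors(manifest, acceptance_criteria):
    """Return human-readable problems found in a parsed manifest (empty list = valid)."""
    errors = []
    pins = manifest.get("criteria", {})  # hypothetical layout: criterion id -> list of triples
    for criterion in acceptance_criteria:
        entries = pins.get(criterion, [])
        if not entries:
            errors.append(f"{criterion}: no pinned run")
        for index, entry in enumerate(entries):
            missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
            if missing:
                errors.append(f"{criterion}[{index}]: missing {', '.join(missing)}")
    return errors
```

The check only covers completeness ; immutability after `Tested → Closed` would need a separate comparison against the committed version.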
11.3 Vocabulary (canonical — referenced by S02 + S03 + ADR-0087 + ADR-0088)
- Test artefact : any file produced or consumed by a Story that touches tests (test code, test case, test dataset, test run report, test manifest)
- Test case : a data-driven scenario (input × expected-output pair) versioned under `tests/cases/<cvn_id>/`
- Test dataset : a reproducibility-grade data fixture, content-addressed by hash under `tests/datasets/` (DVC for > 10 MB, in-repo for ≤ 10 MB)
- Test run report : the output of a single CI run on a specific `(git_sha, dataset_sha)` pair, archived under `documentation/stories/<cvn_id>/tests/`
- Test manifest : the per-Story YAML pinning the `(test_sha, dataset_sha, run_id)` triples that proved each acceptance criterion, committed at `Tested → Closed` and immutable from that point
- Story-phase gate : an ADR-81 transition that requires specific test artefacts to be produced before the transition is allowed
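"Content-addressed by hash" and the 10 MB split can be illustrated as follows. This is a sketch under assumptions : SHA-256 is used as the hash and "10 MB" is read as 10 MiB, neither of which is fixed by this document (the exact scheme is an S03 decision). The helper names are hypothetical.

```python
import hashlib

# Assumption: "10 MB" interpreted as 10 MiB.
DVC_THRESHOLD_BYTES = 10 * 1024 * 1024


def dataset_sha(data: bytes) -> str:
    """Content address of a dataset: hex SHA-256 of its bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def storage_backend(size_bytes: int) -> str:
    """Datasets strictly larger than the threshold go to DVC, the rest stay in-repo."""
    return "dvc" if size_bytes > DVC_THRESHOLD_BYTES else "in-repo"
```

Two byte-identical fixtures always receive the same address, which is what makes a `(git_sha, dataset_sha)` pair sufficient to identify a run's inputs.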
11.4 Committee scope extension — tests are an explicit verdict item
This is a scope extension to ADR-68 (committee = default review channel) : both plan_review and pr_review MUST issue an explicit verdict on the test artefacts as a first-class review item, not implicitly absorbed into the broader code/dossier verdict. The expert-test role within the committee owns this verdict (when team grows ; until then, the existing 5-expert panel covers it as a mandatory checklist item).
Concretely, the pr_review verdict body MUST contain a tests: section with 4 sub-verdicts :
tests:
  coverage_per_acceptance_criterion: PASSED | INSUFFICIENT
  adversarial_edge_coverage: PASSED | INSUFFICIENT
  datasets_versioned_reproducible: PASSED | INSUFFICIENT
  manifest_maps_tests_to_criteria: PASSED | INSUFFICIENT
Any INSUFFICIENT blocks the merge regardless of the rest of the verdict. This is not a soft signal — it's a hard gate.
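The hard-gate rule can be expressed as a check over a parsed `pr_review` verdict. The sketch below assumes the verdict body has already been loaded into a dictionary ; the `merge_allowed` helper is hypothetical, and it additionally treats a missing `tests:` section or a missing sub-verdict as blocking, which is an interpretation rather than something stated above.

```python
SUB_VERDICTS = (
    "coverage_per_acceptance_criterion",
    "adversarial_edge_coverage",
    "datasets_versioned_reproducible",
    "manifest_maps_tests_to_criteria",
)


def merge_allowed(verdict):
    """True only when the tests: section exists and all four sub-verdicts are PASSED."""
    tests = verdict.get("tests")
    if not isinstance(tests, dict):
        return False  # assumption: an absent tests: section blocks the merge
    return all(tests.get(name) == "PASSED" for name in SUB_VERDICTS)
```

A single `INSUFFICIENT` is enough to return `False`, regardless of any other field in the verdict.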
The plan_review verdict body MUST contain a tests_strategy: field asserting whether the planned test strategy is sufficient for the Story's risk profile. INSUFFICIENT blocks In specification → Specified.
ADR-0087 (drafted in S03) ratifies this contract.
11.5 Future ADRs that lock this contract
This section is the strategic invariant. The mechanical contracts land in :
- ADR-0087 — Story-phase test integration : tests are first-class artefacts of every ADR-81 transition + explicit committee verdict at every gate (drafted in S03 — wp#118)
- ADR-0088 — Test cases + datasets versioned + provenance-tracked (test-as-code with reproducibility guarantee) (drafted in S03 — wp#118)
Until ADR-0087 + ADR-0088 land, this §11 is the authoritative invariant. The 8-state workflow runbook (process/STORY_WORKFLOW.md) will be extended with the per-gate test-artefact checklist by S03 closure.
12. Open follow-ups
- ADR-0083 (this strategy's companion) is `accepted` together with this strategy by the same gate (committee `plan_review` `53d76f0f` PASSED + operator decisions). A future Epic EI Story may amend it once downstream Stories EA-EI surface scope they want to revise (amendment Story per the strategy's invariant, not a re-ratification).
- `pytest -m regression` marker convention codified in EA-S02 fixture/factory Story.
- Performance budget table refined per nightly drift signal, NOT pre-emptively.
- Per-type ownership table updated when team grows beyond single operator.
13. References
- OP wp#116 / GH #836 : this Story's tracking
- F1 plan §6 (sequencing, gates) : the ML-side authority for FTF gates referenced by `ML behaviour` row
- ADR-04 (explicit cache policy) : authority for `cache` test type contract
- ADR-14 (mandatory multi-fold) : authority for what `ML behaviour` validates
- ADR-23 (features version-pinned, fail-fast) : authority for `contract` test type
- ADR-26 (Grafana as single entry point) : `post-deploy smoke` validates the Grafana panel populated
- ADR-30 (structured logs, stable interface) : `contract` includes the log event schema
- ADR-31 / ADR-32 (logging, structured events) : `contract` includes log format compliance
- ADR-58 (FTF guardrails) : authority for what `unit` covers on FTF factor matrices
- ADR-59 (PG ftf_config) : `contract` validates the ftf_config schema
- ADR-77 (MkDocs SSoT) : this strategy doc + ADR-0083 are the SSoT for test taxonomy
- ADR-79 (FTF Story closure 8-step) : `ML behaviour` row references this for FTF gate semantics
- ADR-81 (8-state Story workflow) : `gate hierarchy` table aligns with the 8 state transitions
- ADR-82 (committee → OP Meeting) : strategy_review committee session logged per this contract
- Issue #592 : architectural pyramid (parent issue ; this strategy is its textual companion)