CVN-N015 — Test strategy

Date : 2026-05-05
Story : CVN-N015-EA-S01 (OP wp#116)
GH issue : #836
Operator decisions : A=b, B=b, C=b, D=b, E=c, F=a (wp#116 comment 701, 2026-05-05)
Companion ADR : ADR-0083 (status accepted ; same gate, see below)
Status : accepted (committee plan_review session 53d76f0f PASSED 2026-05-05 ; ratification per OP wp#116 operator decisions A=b, B=b, C=b, D=b, E=c, F=a ; strategy + ADR-0083 are ratified together by the same gate)

This document is the canonical reference for what counts as a "test" in CVNTrade, when each kind runs, who owns it, and what it blocks. Every Story under CVN-N015-E* MUST reference this doc to scope its own deliverables ; downstream Stories that disagree with this strategy MUST raise an amendment Story rather than re-derive a contradictory taxonomy locally.


1. Context

CVNTrade ships ML-based crypto trading on a single-operator setup. Test discipline today is informal : pytest runs the unit + integration suites ; a few smoke checks live in DAG launchers ; data quality is mostly visual via Grafana. CVN-N015 (#592) is the umbrella that formalises this into a layered validation pyramid.

Without a strategy first, every Epic in CVN-N015 (EA-EI) risks re-deriving the test taxonomy and contradicting its siblings (e.g., Epic EE "ML behaviour suite" calling data quality what Epic ED calls contract). This Story is the head Story of EA precisely to prevent that drift.

The strategy adopts an honest scope : 12 of the 16 test types commonly catalogued in industry. The four dropped types are chaos (no SRE team), exploratory (already informal), load (throughput and saturation are not realistic failure modes at the current workload ; this is distinct from the per-call performance budgets), and regression (not a separate type ; covered by the unit type with a regression marker). §9 gives the full rationale for each.

2. Test taxonomy — 12 types in scope

| # | Type | Scope (one-liner) | Lifecycle stage | Layer (#592 pyramid) |
|---|---|---|---|---|
| 1 | unit | Pure-function logic, no I/O, no fixtures heavier than in-memory pandas frames. pytest -m unit. | every PR (fast tier) | base |
| 2 | property | Hypothesis-style invariants on FE / labels / cache keys (e.g., "shift(N) preserves NaN count"). | every PR (fast tier) | base |
| 3 | contract | Per-API contract (FE inputs, MLflow artefact schema, ETL output schema). Fails fast on schema drift. | every PR (fast tier) | middle |
| 4 | cache | L1/L2/L3 cache key correctness + invalidation (per ADR-04). Tests cache hit/miss + key extension semantics. | every PR (medium tier) | middle |
| 5 | integration | Multi-component flows : EnrichmentAPI → FeatureEngineeringAPI → InferenceAPI ; cross-process kernel parity. | every PR (medium tier) | middle |
| 6 | DAG smoke | dag.test() per DAG : verifies the import + task discovery + 1-step execution under stub data. (Per Epic EF) | every PR (medium tier) | middle |
| 7 | data quality | Great Expectations OSS suites on OHLCV + L2 + KPI events. (Per Epic ED) | every PR + nightly (per-ingest is implicit in nightly) | middle |
| 8 | ML behaviour | Evidently OSS + Giskard : drift / leakage / fairness / robustness checks on candidate models. (Per Epic EE) | per FTF sweep + post-promotion shadow | middle |
| 9 | performance | p95 / p99 budgets per code path : inference_p99 < 200ms, enrichment_p95 < 50ms, FTF cell p95 < 60s, train DAG run < 4h. | every PR touching the code path + nightly drift check | middle |
| 10 | system-E2E | Paper-trading kernel + kill-switch + risk gates end-to-end with synthetic Binance feed. (Per Epic EG) | nightly + pre-deploy | top |
| 11 | UAT | Hybrid : Markdown scripted scenarios in tests/uat/scenarios/*.md for backend flows + Playwright recordings for Console UI flows. Run by operator before any manual prod action. | per Story closure (operator-driven) | top |
| 12 | post-deploy smoke | k8s liveness + 1-prediction-call validation + Grafana panel populated. Runs immediately after every Helm upgrade. | post-deploy | top |

3. Cadence matrix

Type Push PR (touching code) Nightly Per-Story Pre-deploy Post-deploy
unit ✅ fast tier
property ✅ fast tier
contract ✅ fast tier
cache ✅ medium tier
integration ✅ medium tier
DAG smoke ✅ medium tier
data quality ✅ medium tier
ML behaviour ✅ per FTF sweep ✅ shadow stage
performance ✅ on touched code paths ✅ drift check
system-E2E
UAT ✅ operator
post-deploy smoke

Operator decision B : fast tier runs on PRs touching code (NOT every push) ; no nightly safety net for fast tier (main is protected by PR gates). Medium tier runs on PRs that touch the corresponding subsystem (cache changes, enrichment changes, ETL changes).
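
The selection rule in operator decision B can be sketched as a small function that turns a PR's changed files into a pytest marker expression. This is a minimal illustration, not the project's CI logic: the path globs, the `.py` suffix test, and every marker name other than unit, property and contract are hypothetical placeholders.

```python
from fnmatch import fnmatch

# Fast-tier markers; these three names appear in the glossary's pytest -m expression.
FAST_TIER = ("unit", "property", "contract")

# HYPOTHETICAL mapping of path globs to medium-tier markers.
# Neither the paths nor these marker names are confirmed by the strategy.
MEDIUM_TIER_TRIGGERS = {
    "src/cache/*": "cache",
    "src/enrichment/*": "integration",
    "dags/*": "dag_smoke",
    "src/etl/*": "data_quality",
}

CODE_SUFFIXES = (".py",)  # assumption: "touching code" means a Python file changed


def select_markers(changed_files):
    """Return the markers to run for a PR, or [] when no code file changed."""
    code_files = [f for f in changed_files if f.endswith(CODE_SUFFIXES)]
    if not code_files:
        return []  # docs-only PR: the fast tier is not triggered
    selected = list(FAST_TIER)
    for pattern, marker in MEDIUM_TIER_TRIGGERS.items():
        touched = any(fnmatch(f, pattern) for f in code_files)
        if touched and marker not in selected:
            selected.append(marker)
    return selected


def marker_expression(markers):
    """Join markers into an expression usable as: pytest -m "<expression>"."""
    return " or ".join(markers)
```

A PR that only edits documentation yields an empty selection, while a PR touching the hypothetical `src/cache/` tree adds the cache marker on top of the fast tier.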

4. Gate hierarchy — what blocks what (operator decision C — tiered)

Test type Blocks merge ? Blocks deploy ? Blocks LOCK ? Blocks Story closure ?
unit YES (cascade) (cascade) (cascade)
property YES (cascade) (cascade) (cascade)
contract YES (cascade) (cascade) (cascade)
cache warn only
integration YES (cascade) (cascade) (cascade)
DAG smoke YES (cascade) (cascade) (cascade)
data quality warn only YES (cascade) (cascade)
ML behaviour YES (FTF gate) (cascade)
performance warn only on PR YES (LOCK gate) (cascade)
system-E2E YES (cascade)
UAT YES (operator sign-off)
post-deploy smoke rollback trigger (cascade)

Reading rule : "(cascade)" means a YES at an upstream stage transitively blocks downstream stages without needing a re-run. E.g., a unit-test failure blocks merge, which prevents deploy, which prevents LOCK, which prevents Story closure.

Why tiered (vs strict) : strict = every test blocks its corresponding state transition = "test-induced traffic jam". Tiered respects the cost of each test type (a 4-hour FTF sweep cannot block every PR ; a 50ms unit test must).
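
The cascade reading rule can be expressed as a lookup of the first stage at which a failing type is a hard block, with every later stage blocked transitively. The sketch below covers only a subset of the table, and the column placement for the sparse rows (performance at LOCK, UAT at Story closure) is an interpretation of the table rather than a confirmed mapping.

```python
# Stage order taken from the reading rule: merge -> deploy -> LOCK -> Story closure.
STAGES = ("merge", "deploy", "LOCK", "Story closure")

# First stage where a failure is a hard block (subset of the gate table;
# the performance and UAT placements are assumptions about the sparse rows).
FIRST_HARD_BLOCK = {
    "unit": "merge",
    "property": "merge",
    "contract": "merge",
    "integration": "merge",
    "DAG smoke": "merge",
    "data quality": "deploy",
    "performance": "LOCK",
    "UAT": "Story closure",
}

# Types whose failure only raises a warning at merge time.
WARN_ONLY_AT_MERGE = {"cache", "data quality", "performance"}


def gate_report(failed_types):
    """Return the blocked stages and merge-time warnings for a set of failures."""
    blocked_from = len(STAGES)  # sentinel: nothing blocked yet
    warnings = []
    for test_type in failed_types:
        if test_type in WARN_ONLY_AT_MERGE:
            warnings.append(test_type)
        stage = FIRST_HARD_BLOCK.get(test_type)
        if stage is not None:
            blocked_from = min(blocked_from, STAGES.index(stage))
    # Cascade: every stage from the earliest hard block onwards is blocked.
    return {"blocked": list(STAGES[blocked_from:]), "warnings": warnings}
```

With this model a unit failure blocks all four stages, a cache failure blocks none and only warns, and a data quality failure warns at merge but blocks from deploy onwards.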

5. Performance budgets — canonical (operator decision D)

| Code path | Budget | Test type that enforces |
|---|---|---|
| InferenceAPI.predict() (per call) | p99 < 200 ms | performance (PR-time) |
| EnrichmentAPI.enrich_streaming() (per candle) | p95 < 50 ms | performance (PR-time) |
| FTF sweep per cell (single train+eval) | p95 < 60 s | performance (nightly drift) |
| Train DAG run end-to-end (per crypto × strategy) | p99 < 4 h | performance (nightly drift) |
| compute_btc_features() (1000-row window) | p95 < 200 ms | performance (PR-time) |
| compute_l2_features() (1000-row window) | p95 < 200 ms | performance (PR-time) |
| Post-deploy smoke end-to-end | p99 < 30 s | post-deploy smoke |

Budgets are first-pass approximations ; refined via dedicated budget Story when nightly drift triggers fire repeatedly (defer-by-default, refine-on-signal).
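
A PR-time budget check could look like the sketch below, which computes a percentile over recorded latencies and compares it with the table. It assumes the nearest-rank percentile definition and millisecond samples; the strategy does not specify either, and the dictionary keys are illustrative labels for the PR-time rows only.

```python
import math

# (percentile, limit in ms) for the PR-time rows of the budget table.
BUDGETS_MS = {
    "InferenceAPI.predict": (99, 200.0),
    "EnrichmentAPI.enrich_streaming": (95, 50.0),
    "compute_btc_features": (95, 200.0),
    "compute_l2_features": (95, 200.0),
}


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed definition, not one the strategy mandates)."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def within_budget(code_path, samples_ms):
    """True when the measured percentile is strictly below the budget."""
    pct, limit_ms = BUDGETS_MS[code_path]
    return percentile(samples_ms, pct) < limit_ms
```

The comparison is strict because the budgets are written as "p95 < 50 ms" rather than "≤". With few samples the nearest-rank p99 is simply the maximum, so a real check would need enough calls per run for the percentile to be meaningful.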

6. UAT format — hybrid (operator decision E)

  • Backend flows : tests/uat/scenarios/*.md Markdown checklists. Each scenario has ## Setup / ## Steps / ## Expected output / ## Pass criteria / ## Last validated. Operator runs before Story closure ; checks the box in Last validated with date + commit SHA.
  • Console UI flows : Playwright recordings under tests/uat/playwright/. Operator runs the recorded session against console.cvntrade.eu (the live single-environment Console — no separate staging tier exists in v1) ; pass criterion = no thrown errors + screenshot diff < 5 % vs baseline. Read-only invariant : Playwright UAT scenarios MUST be view-only (read endpoints + listing pages) — any test that flips state, mutates ftf_config, triggers a launcher, or sends a kill-switch action MUST move to the Markdown backend-flows scenarios (executed manually with explicit operator confirmation per step). Follow-up Story will add a dedicated staging Console once a second engineer joins the project.

UAT is operator-driven (not CI-automated) per the single-operator reality. The recorded artefacts (Markdown + Playwright) serve as the documented contract for what the operator validated.
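
Since each backend scenario must carry the same five sections, a small lint could flag incomplete scenario files before the operator runs them. The helper below is a hypothetical sketch, not part of the described tooling; it only checks that the level-2 headings are present, not their content.

```python
import re

# Section headings required by the backend-flow scenario format.
REQUIRED_SECTIONS = (
    "Setup",
    "Steps",
    "Expected output",
    "Pass criteria",
    "Last validated",
)

# Matches level-2 Markdown headings only ("## Title"), not "###" or deeper.
_H2 = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text):
    """Return the required sections absent from a scenario, in canonical order."""
    found = {match.group(1) for match in _H2.finditer(markdown_text)}
    return [section for section in REQUIRED_SECTIONS if section not in found]
```

An empty list means the scenario is structurally complete; anything else names the headings still to be written.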

7. Test-type ownership — single DRI (operator decision F)

Type DRI Backup
ALL 12 TYPES @dococeven @dococeven

This table is honest about the single-operator reality : no fictive ownership stubs ; the operator IS the team. When a second engineer joins, this table is the natural place to introduce per-type ownership (likely : data quality + ML behaviour to the new ML person ; performance + system-E2E to the new infra person).

8. Mapping existing GH issues to test-type buckets

| GH issue | Title (short) | Test type |
|---|---|---|
| #592 | Layered validation pyramid | (architecture, parent of all) |
| #586 | Pytest fixtures + factories | unit, property, integration |
| #757 | Testcontainers integration setup | integration |
| #756 | Flaky-test detector | unit, property, integration |
| #633 | Drift detection contracts | ML behaviour, contract |
| #426 | Backtest parity invariant | integration, property |
| #614 | Batch ↔ streaming parity certificate | property, contract |

9. Out-of-scope (explicit, with rationale)

| Type | Why dropped |
|---|---|
| chaos | No SRE team to operate the chaos schedules + interpret the dashboards. Re-introduce when team grows. |
| exploratory | Informal exploration happens during operator's daily workflow ; codifying it adds process without value. |
| load | Load testing measures sustained throughput + resource saturation under concurrent traffic, distinct from the per-call performance p95/p99 latency budgets. Dropped because the single-operator setup processes ~13 cryptos × 1 prediction per 15 min = ~52 predictions/h. Even if every call took the full 200 ms p99 inference budget, that is about 10 s of inference per hour, so saturation is not a realistic failure mode at any plausible concurrency. Re-introduce as a first-class type when the workload reaches a regime where queue depth or pool exhaustion become realistic failure modes (e.g., live-trading multi-strategy multi-exchange fan-out). |
| regression | Not a test type per se ; every unit test IS a regression test for the bug it was added to catch. Tracked via pytest -m regression marker on the existing unit suite. |

10. Glossary (canonical definitions — avoids "system vs integration" debates)

| Term | Definition |
|---|---|
| fast tier | Tests that complete in < 5 s per file ; run on every code-touching PR with pytest -m "unit or property or contract". |
| medium tier | Tests that complete in < 60 s per file ; run on PRs that touch the corresponding subsystem. |
| slow tier | Tests that take > 60 s ; run nightly OR per Story closure OR pre-deploy. |
| integration test | Exercises > 1 component but stays in-process (no docker, no k8s). Example : EnrichmentAPI → FeatureEngineeringAPI chain with synthetic data. |
| system-E2E test | Exercises > 1 process AND ≥ 1 container/k8s pod. Example : paper-trading kernel + Postgres + Redis + Loki via Testcontainers. |
| contract test | Validates the data shape (columns, types, nullability) AT the boundary between 2 components. Does NOT validate behaviour. |
| data quality test | Validates the data CONTENT (ranges, distributions, business rules) on production-like data. |
| ML behaviour test | Validates the model's RESPONSE to specific inputs : drift detection, fairness, perturbation robustness. Does NOT validate accuracy on a held-out set (that's the FTF sweep). |
| performance test | Validates p95/p99 latency vs the canonical budget table (§5). Throughput / saturation budgets are out of scope in v1 ; see the §9 load row for when they get re-introduced. |
| UAT | Operator runs a scripted scenario (Markdown OR Playwright recording) and confirms the observed behaviour matches the documented expectation. Not CI-automated. |
| post-deploy smoke | k8s liveness + 1 inference call + Grafana panel populated, run within 60 s of every Helm upgrade. |

11. Story-phase × test-artefacts integration matrix (the test factory ↔ OP workflow)

Strategic invariant (foundational for the whole CVN-N015 stack) : tests are NOT a "phase that follows code" — they are first-class artefacts of every Story state transition per ADR-81. Every gate the operator passes through carries a test deliverable ; every committee session validates that deliverable's presence + quality ; every closure pins the (test_sha, dataset_sha, run_id) triples that proved each acceptance criterion immutably.

The test stack is "test-as-code with full reproducibility" : every test result is reproducible from (git_sha, dataset_sha) alone — no manual setup, no operator-specific environment, no "works on my machine". The CI is the only acceptable source of authoritative test results ; local runs are for development feedback only.

11.1 Canonical matrix — gate-by-gate

ADR-81 transition Test artefact required Committee verdict required Stewarding tool Gate enforcement
New → In specification none
In specification → Specified "Test strategy" subsection in plan dossier — per type, with numerical bars + planned dataset families plan_review : committee MUST explicitly verdict the test strategy (not implicitly absorb into the dossier verdict) — PASSED requires the test strategy is sufficient for the Story's risk profile committee plan_review validates committee verdict
Specified → In progress none (start coding)
In progress → Developed All test code written + green locally + discoverable via @pytest.mark.story("<cvn_id>") + datasets versioned (DVC for > 10 MB, content-addressed in-repo for ≤ 10 MB) none (developer-side discipline gate) pytest --collect-only --story <cvn_id> returns ≥ 1 test per acceptance criterion (filter via the --story CLI flag added by S04's pytest plugin — pytest's native -m filter operates on marker names not arguments, see S03 §6.1) guardrail CI G5 (Story workflow guardrails — extends G1-G4)
Developed → In testing Full CI pipeline green (fast + medium tiers + relevant nightly suites) on PR head SHA + test artefacts ship with the PR (test code + cases + datasets + draft manifest) pr_review : committee MUST explicitly verdict that (a) test code covers each acceptance criterion, (b) test cases include adversarial/edge coverage proportional to risk, (c) datasets are versioned + reproducible, (d) draft manifest accurately maps tests → criteria — PASSED requires all 4 sub-verdicts CI job artefacts archived + committee session JSON CI status check + PR template box + committee verdict
In testing → Tested UAT operator-validated (per §6) + test-run report committed none (operator-driven UAT ; committee already verdicted at pr_review) documentation/stories/<cvn_id>/tests/test_run_<sha>.md OP audit comment + commit
Tested → Closed Test manifest committed pinning (test_sha, dataset_sha, run_id) for each acceptance criterion ; immutable thereafter optional closure_review for high-risk Stories (operator decides at Specified whether closure needs a third committee gate ; default = no — the pr_review test-artefact verdict carries through) documentation/stories/<cvn_id>/tests/manifest.yaml OP audit comment + commit + immutability check (S04+ tooling)
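
The In progress → Developed row relies on a --story CLI flag supplied by S04's pytest plugin. A minimal sketch of how such a flag can be built on standard pytest hooks is shown below. It is a hypothetical conftest.py, not S04's implementation, and it only filters by Story id; the "≥ 1 test per acceptance criterion" check is not covered.

```python
# conftest.py (hypothetical sketch of a --story filter; not the S04 plugin)


def pytest_addoption(parser):
    # Adds: pytest --story <cvn_id>
    parser.addoption(
        "--story",
        action="store",
        default=None,
        help="only keep tests marked with @pytest.mark.story('<cvn_id>')",
    )


def pytest_configure(config):
    # Register the marker so --strict-markers does not reject it.
    config.addinivalue_line("markers", "story(cvn_id): link a test to a Story id")


def story_ids(item):
    """Collect every Story id attached to a test item via the story marker."""
    return {mark.args[0] for mark in item.iter_markers(name="story") if mark.args}


def pytest_collection_modifyitems(config, items):
    wanted = config.getoption("--story")
    if wanted is None:
        return  # flag not given: leave the collection untouched
    selected = [item for item in items if wanted in story_ids(item)]
    deselected = [item for item in items if wanted not in story_ids(item)]
    if deselected:
        # Report the dropped items so the summary shows them as deselected.
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```

Filtering on the marker argument is done in a hook because pytest's -m option matches marker names, not their arguments, which is the limitation the table row points out.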

11.2 Per-Story test artefacts directory

Every Story owns a folder under documentation/stories/<cvn_id>/tests/ with a fixed structure :

documentation/stories/<cvn_id>/tests/
├── strategy.md                  # the Story's "test strategy" subsection — extracted from plan dossier at Specified
├── test_run_<sha>.md            # one file per CI run that gated a transition (Developed/In testing/Tested)
├── manifest.yaml                # committed at Tested → Closed ; immutable from this point
└── datasets/                    # dataset hashes + DVC pointers (S03 architecture decides exact scheme)

The S04+ implementation Stories instantiate this folder and the matrix's gate enforcement.

11.3 Vocabulary (canonical — referenced by S02 + S03 + ADR-0087 + ADR-0088)

  • Test artefact : any file produced or consumed by a Story that touches tests (test code, test case, test dataset, test run report, test manifest)
  • Test case : a data-driven scenario (input × expected-output pair) versioned under tests/cases/<cvn_id>/
  • Test dataset : a reproducibility-grade data fixture, content-addressed by hash under tests/datasets/ (DVC for > 10 MB, in-repo for ≤ 10 MB)
  • Test run report : the output of a single CI run on a specific (git_sha, dataset_sha) pair, archived under documentation/stories/<cvn_id>/tests/
  • Test manifest : the per-Story YAML pinning the (test_sha, dataset_sha, run_id) triples that proved each acceptance criterion, committed at Tested → Closed and immutable from that point
  • Story-phase gate : an ADR-81 transition that requires specific test artefacts to be produced before the transition is allowed
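
To make the vocabulary concrete, the sketch below models a manifest entry as an immutable (test_sha, dataset_sha, run_id) triple and shows one way to content-address a dataset. SHA-256 and the in-memory manifest shape (criterion id mapped to a list of proofs) are assumptions; the actual scheme is left to S03 and the manifest.yaml format is not defined here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a pinned triple cannot be mutated after creation
class Proof:
    test_sha: str
    dataset_sha: str
    run_id: str


def dataset_sha(content: bytes) -> str:
    """Content address of a dataset (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(content).hexdigest()


def unproven_criteria(acceptance_criteria, manifest):
    """Return the acceptance criteria that have no pinned proof in the manifest.

    `manifest` is a hypothetical in-memory form: {criterion_id: [Proof, ...]}.
    """
    return [c for c in acceptance_criteria if not manifest.get(c)]
```

A Tested → Closed check could refuse the transition while `unproven_criteria` returns anything, which matches the requirement that every acceptance criterion is pinned to at least one triple.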

11.4 Committee scope extension — tests are an explicit verdict item

This is a scope extension to ADR-68 (committee = default review channel) : both plan_review and pr_review MUST issue an explicit verdict on the test artefacts as a first-class review item, not implicitly absorbed into the broader code/dossier verdict. The expert-test role within the committee owns this verdict (when team grows ; until then, the existing 5-expert panel covers it as a mandatory checklist item).

Concretely, the pr_review verdict body MUST contain a tests: section with 4 sub-verdicts :

tests:
  coverage_per_acceptance_criterion: PASSED | INSUFFICIENT
  adversarial_edge_coverage: PASSED | INSUFFICIENT
  datasets_versioned_reproducible: PASSED | INSUFFICIENT
  manifest_maps_tests_to_criteria: PASSED | INSUFFICIENT

Any INSUFFICIENT blocks the merge regardless of the rest of the verdict. This is not a soft signal — it's a hard gate.
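
The hard-gate rule can be sketched as a function over the parsed verdict body. The dictionary input stands in for the committee session JSON, whose real schema is not shown here, and treating a missing sub-verdict as blocking is an assumption derived from the "MUST contain" wording.

```python
# The four sub-verdict keys required in the pr_review tests: section.
SUB_VERDICTS = (
    "coverage_per_acceptance_criterion",
    "adversarial_edge_coverage",
    "datasets_versioned_reproducible",
    "manifest_maps_tests_to_criteria",
)


def merge_blockers(verdict_body):
    """Return the sub-verdicts that block the merge (empty list = gate open).

    A sub-verdict blocks when it is INSUFFICIENT or, by assumption, absent.
    """
    tests = verdict_body.get("tests") or {}
    return [key for key in SUB_VERDICTS if tests.get(key) != "PASSED"]
```

Returning the offending keys rather than a boolean lets the gate report which of the four items the committee found lacking.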

The plan_review verdict body MUST contain a tests_strategy: field asserting whether the planned test strategy is sufficient for the Story's risk profile. INSUFFICIENT blocks In specification → Specified.

ADR-0087 (drafted in S03) ratifies this contract.

11.5 Future ADRs that lock this contract

This section is the strategic invariant. The mechanical contracts land in :

  • ADR-0087, Story-phase test integration : tests are first-class artefacts of every ADR-81 transition + explicit committee verdict at every gate (drafted in S03, wp#118)
  • ADR-0088, Test cases + datasets versioned + provenance-tracked (test-as-code with reproducibility guarantee) (drafted in S03, wp#118)

Until ADR-0087 + ADR-0088 land, this §11 is the authoritative invariant. The 8-state workflow runbook (process/STORY_WORKFLOW.md) will be extended with the per-gate test-artefact checklist by S03 closure.

12. Open follow-ups

  • ADR-0083 (this strategy's companion) is accepted together with this strategy by the same gate (committee plan_review 53d76f0f PASSED + operator decisions). A future Epic EI Story may amend it once downstream Stories EA-EI surface scope they want to revise (amendment Story per the strategy's invariant — not a re-ratification).
  • pytest -m regression marker convention codified in EA-S02 fixture/factory Story.
  • Performance budget table refined per nightly drift signal, NOT pre-emptively.
  • Per-type ownership table updated when team grows beyond single operator.

13. References

  • OP wp#116 / GH #836 : this Story's tracking
  • F1 plan §6 (sequencing, gates) : the ML-side authority for FTF gates referenced by ML behaviour row
  • ADR-04 (cache policy explicite) : authority for cache test type contract
  • ADR-14 (multi-fold obligatoire) : authority for what ML behaviour validates
  • ADR-23 (features version-pinned, fail-fast) : authority for contract test type
  • ADR-26 (Grafana point d'entrée unique) : post-deploy smoke validates the Grafana panel populated
  • ADR-30 (logs structurés interface stable) : contract includes the log event schema
  • ADR-31 / ADR-32 (logging, structured events) : contract includes log format compliance
  • ADR-58 (FTF guardrails) : authority for what unit covers on FTF factor matrices
  • ADR-59 (PG ftf_config) : contract validates the ftf_config schema
  • ADR-77 (MkDocs SSoT) : this strategy doc + ADR-0083 are the SSoT for test taxonomy
  • ADR-79 (FTF Story closure 8-step) : ML behaviour row references this for FTF gate semantics
  • ADR-81 (8-state Story workflow) : gate hierarchy table aligns with the 8 state transitions
  • ADR-82 (committee → OP Meeting) : strategy_review committee session logged per this contract
  • Issue #592 : architectural pyramid (parent issue ; this strategy is its textual companion)