MLOps readiness — CVN-N014-ED-S02 (s43 cross-pod transport)

ADR-70. Scope : the transport plumbing for s43's predictions (diagnostic DAG). This is NOT a model-artefact or training change: s43's science and metrics stay in CVN-N001-EI-S05. This Story moves the per-pod prediction arrays from a pod-local /tmp directory to the shared S3 store, so that a reference to the stored data is passed between pods rather than the data itself (pass-by-reference).
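
The pass-by-reference idea can be sketched as follows. This is a minimal illustration, not the code in s43_io.py: the InMemoryStore class, the persist_predictions / load_predictions helpers, the <pod_id>.npz file name, and the opaque bytes payload are all assumptions made for the example. Only the s43-predictions/<run_id>/ prefix comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the shared S3 store (not the real client)."""

    blobs: dict = field(default_factory=dict)

    def put(self, key: str, payload: bytes) -> None:
        self.blobs[key] = payload

    def get(self, key: str) -> bytes:
        return self.blobs[key]


def persist_predictions(store, run_id: str, pod_id: str, payload: bytes) -> str:
    """Write one pod's serialized predictions and return only the key."""
    # The file name "<pod_id>.npz" is an assumption; the prefix is from the doc.
    key = f"s43-predictions/{run_id}/{pod_id}.npz"
    store.put(key, payload)
    return key


def load_predictions(store, key: str) -> bytes:
    """Resolve a reference produced by persist_predictions, from any pod."""
    return store.get(key)


store = InMemoryStore()
# The producing pod hands over a short string, never the arrays themselves.
ref = persist_predictions(store, "run-001", "pod-a", b"serialized-arrays")
# A consuming pod resolves the reference against the same shared store.
restored = load_predictions(store, ref)
```

Because only the key crosses the pod boundary, the consumer no longer depends on the producer's local filesystem.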

Surface touched

  • dags/dag_diagnostic__s43.py : the diagnostic DAG (orchestration).
  • src/commun/finetune/diagnostic/hamilton/s43_io.py : the I/O layer.
  • No model training, inference, filter, or label logic is touched: diagnostic transport only.

Monitoring / SLO

  • Signals (Loki, via scripts/loki_query.py) : s43_pred_persist_failed, s43_pred_load_failed, s43_acquire_outcome (key count), s43_gate_persist_failed, and the gate's cohort_count vs expected.
  • Key health metric : cohort_count == expected on every run. A drop indicates a transport regression (the bug this Story fixes). The committee's plan_review recommends adding alerts on S3 write/read latency, error rate, and cohort consistency (follow-up, see plan §recos).
  • Threshold/alert : any s43_pred_*_failed at severity=error, or cohort_count < expected on a run that should be complete → operator attention.
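
The threshold rule above could be expressed as a small predicate. This is a sketch under assumptions: the event shape (dicts with "signal" and "severity" keys), the function name needs_operator_attention, and the run_complete flag are hypothetical, and the real signals are read from Loki rather than passed in as a list.

```python
from fnmatch import fnmatchcase


def needs_operator_attention(events, cohort_count, expected, run_complete=True):
    """Return True when the run matches the alert rule described above."""
    # Rule 1: any s43_pred_*_failed signal logged at severity=error.
    pred_failure = any(
        fnmatchcase(event.get("signal", ""), "s43_pred_*_failed")
        and event.get("severity") == "error"
        for event in events
    )
    # Rule 2: a short cohort only counts on a run that should be complete.
    short_cohort = run_complete and cohort_count < expected
    return pred_failure or short_cohort


healthy = needs_operator_attention(
    [{"signal": "s43_acquire_outcome", "severity": "info"}], 4, 4
)
load_failed = needs_operator_attention(
    [{"signal": "s43_pred_load_failed", "severity": "error"}], 4, 4
)
short_run = needs_operator_attention([], 3, 4)
still_running = needs_operator_attention([], 3, 4, run_complete=False)
```

As written, the pattern follows the stated rule literally, so s43_gate_persist_failed does not match it; whether that signal should also page the operator is left to the follow-up alerts.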

Rollback

Transport is per-run and stateless (no model artefact, no migration). To roll back, revert the code change and re-trigger the DAG run; predictions regenerate under the run-isolated S3 prefix. No data is lost, since the prior /tmp path held nothing cross-pod anyway. See runbook.

Data lifecycle

  • S3 prefix s43-predictions/<run_id>/ : per-run .npz blobs, short-lived. The lifecycle rule in infra/scaleway/s3-lifecycle-s43-predictions.json sets 7-day retention. This item is gateable (DoD §6): it closes with either the rule applied plus operator proof, or a dated waiver with an owner and a deadline.
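
For orientation, a 7-day expiration rule in the standard S3 lifecycle-configuration shape looks roughly like the structure below. This is a sketch, not the contents of infra/scaleway/s3-lifecycle-s43-predictions.json, which this document does not show; the rule ID is hypothetical and the prefix scope is assumed to be the whole s43-predictions/ tree.

```python
import json

# Assumed shape: the S3 bucket lifecycle configuration document.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-s43-predictions",  # hypothetical rule name
            "Status": "Enabled",
            # Applies to every run's blobs under the shared prefix.
            "Filter": {"Prefix": "s43-predictions/"},
            # Objects are expired 7 days after creation.
            "Expiration": {"Days": 7},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

Operator proof for the gate could then be a read-back of the bucket's lifecycle configuration showing the same prefix and day count.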

Residual risk

  • The acceptance gate (cross-pod cluster smoke) runs only post-merge. Until then, the fix is proven by unit tests (in-memory shared store) and code review, not by a real multi-pod run. The cluster smoke is the Tested evidence.
  • cvntrade_s3_manager security/retry hardening (encryption at rest, least-privilege IAM, backoff) is a committee plan_review recommendation: tracked, but not blocking this transport fix.
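
A unit test of the kind mentioned in the first bullet might look like the sketch below. It is illustrative only: SharedStore, cohort_count, and the one-blob-per-pod layout are assumptions, and the real gate may derive its cohort count differently.

```python
class SharedStore:
    """Hypothetical in-memory stand-in for the shared S3 store."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, payload):
        self._blobs[key] = payload

    def list_keys(self, prefix):
        return sorted(k for k in self._blobs if k.startswith(prefix))


def cohort_count(store, run_id):
    """Count blobs under one run's prefix (assumes one blob per pod)."""
    # The trailing slash keeps run-001 from matching run-0010.
    return len(store.list_keys(f"s43-predictions/{run_id}/"))


def test_gate_sees_every_pod_and_runs_stay_isolated():
    store = SharedStore()
    # Three simulated pods write to the same store for the same run.
    for pod in ("pod-a", "pod-b", "pod-c"):
        store.put(f"s43-predictions/run-001/{pod}.npz", b"payload")
    # A different run must not leak into the first run's cohort.
    store.put("s43-predictions/run-002/pod-a.npz", b"payload")

    assert cohort_count(store, "run-001") == 3
    assert cohort_count(store, "run-002") == 1


test_gate_sees_every_pod_and_runs_stay_isolated()
```

A test like this checks the logic against a store that is shared by construction, which is why the post-merge cluster smoke is still needed to confirm behaviour across real pods.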

DRI

Dev owner : this Story's implementer. Operational owner : the diagnostic-pipeline operator (s43 runs are manual, schedule=None).