MLOps readiness: CVN-N014-ED-S02 (s43 cross-pod transport)
ADR-70. Scope: the transport plumbing of s43's predictions (diagnostic DAG). This is NOT a model artefact / training change; s43's science + metrics stay in CVN-N001-EI-S05. This Story moves per-pod prediction arrays from a pod-local /tmp to the shared S3 store (pass-by-reference).
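The pass-by-reference idea can be sketched as follows. This is a minimal illustration, not the Story's actual code: `InMemoryStore`, `persist_predictions`, `load_predictions`, and the `<pod_name>.npz` key layout are hypothetical names chosen for the example. Only the `s43-predictions/<run_id>/` prefix and the `.npz` format come from this document.

```python
import io

import numpy as np

PREFIX = "s43-predictions"  # prefix named in this document


class InMemoryStore:
    """Hypothetical stand-in for the shared S3 store (a dict of key -> bytes)."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


def persist_predictions(store, run_id, pod_name, preds):
    """Producer pod: serialise the array to .npz bytes, upload it, return the key."""
    buf = io.BytesIO()
    np.savez_compressed(buf, preds=preds)
    key = f"{PREFIX}/{run_id}/{pod_name}.npz"  # assumed key layout
    store.put(key, buf.getvalue())
    return key  # only this short string crosses the pod boundary


def load_predictions(store, key):
    """Consumer pod: resolve the reference back into an array."""
    with np.load(io.BytesIO(store.get(key))) as npz:
        return npz["preds"]
```

The producer returns a key rather than the array, so a downstream pod never depends on the producer's local filesystem. It needs only the key and access to the shared store.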
Surface touched
dags/dag_diagnostic__s43.py (the diagnostic DAG, orchestration), src/commun/finetune/diagnostic/hamilton/s43_io.py (the I/O layer). No model-training, inference, filter, or label logic; diagnostic transport only.
Monitoring / SLO
- Signals (Loki, via scripts/loki_query.py): s43_pred_persist_failed, s43_pred_load_failed, s43_acquire_outcome (keys count), s43_gate_persist_failed, the gate's cohort_count vs expected.
- Key health metric: cohort_count == expected per run. A drop = transport regression (the bug this Story fixes). Committee plan_review reco: add S3 write/read latency + error-rate + cohort-consistency alerts (follow-up, see plan §recos).
- Threshold/alert: any s43_pred_*_failed at severity=error, or cohort_count < expected on a run that should be complete → operator attention.
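The threshold rule above can be written as a small predicate. This is a sketch under assumptions: the `RunSignals` container and the `needs_operator_attention` helper are hypothetical, and how the events are actually pulled from Loki is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class RunSignals:
    """Hypothetical per-run summary assembled from the log signals."""

    cohort_count: int
    expected: int
    run_complete: bool
    events: list = field(default_factory=list)  # (event_name, severity) pairs


def needs_operator_attention(sig):
    """Return the reasons a run should be flagged (empty list = healthy)."""
    reasons = []
    for name, severity in sig.events:
        # Matches the s43_pred_*_failed pattern at severity=error.
        is_pred_failure = name.startswith("s43_pred_") and name.endswith("_failed")
        if is_pred_failure and severity == "error":
            reasons.append(f"{name} at severity=error")
    # A short cohort only counts once the run should be complete.
    if sig.run_complete and sig.cohort_count < sig.expected:
        reasons.append(f"cohort_count {sig.cohort_count} < expected {sig.expected}")
    return reasons
```

An in-progress run with a partial cohort is deliberately not flagged, which mirrors the "on a run that should be complete" qualifier.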
Rollback
Transport is per-run and stateless (no model artefact, no migration). Revert the code change +
re-trigger the DAG run → predictions regenerate under the run-isolated S3 prefix. No data loss
(prior /tmp path held nothing cross-pod anyway). See runbook.
Data lifecycle
- S3 prefix s43-predictions/<run_id>/: per-run .npz blobs, short-lived. Lifecycle rule infra/scaleway/s3-lifecycle-s43-predictions.json (7-day retention). Gateable (DoD §6): apply + operator proof, or a dated waiver with owner + deadline.
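For orientation, a 7-day expiration rule in the generic S3 lifecycle-configuration shape might look like the output of the sketch below. The actual contents of infra/scaleway/s3-lifecycle-s43-predictions.json are not reproduced in this document. The rule ID and the `build_lifecycle_config` helper are hypothetical, and the fields accepted by the target provider should be checked before applying anything.

```python
import json


def build_lifecycle_config(prefix="s43-predictions/", retention_days=7):
    """Build an S3-style lifecycle configuration that expires objects under a prefix."""
    return {
        "Rules": [
            {
                "ID": "expire-s43-predictions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": retention_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the JSON document that would be handed to the bucket lifecycle API.
    print(json.dumps(build_lifecycle_config(), indent=2))
```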
Residual risk
- The acceptance gate (cross-pod cluster smoke) runs only post-merge. Until then, the fix is proven by unit tests (in-memory shared store) + code review, not by a real multi-pod run. The cluster smoke is the Tested evidence.
- cvntrade_s3_manager security/retry hardening (encryption-at-rest, least-priv IAM, backoff) is a committee plan_review recommendation: tracked, not blocking this transport fix.
DRI
Dev owner: this Story's implementer. Operational owner: the diagnostic-pipeline operator (s43
runs are manual, schedule=None).