Runbook: s43 cross-pod prediction transport (S3)
Story: CVN-N014-ED-S02 · Architecture: design · ADR: 0100. Per ADR-26, start at Grafana; this runbook hands you the exact commands.
What it is
s43's acquisition pods persist per-(crypto, fold, family) prediction .npz blobs to S3
(s3://cvntrade-artifacts/s43-predictions/<run_id>/…); the gate pod loads the expected-cohort
keys and pools them. XCom carries only the JSON key. Query logs via scripts/loki_query.py
(non-lossy stdout stream), not a hand-rolled port-forward.
Failure scenarios
A. INCONCLUSIVE_TOOLING / cohort incomplete
python scripts/loki_query.py --since 1h \
--query '{namespace="cvntrade"} |~ "(?i)s43_pred_(persist|load)_failed|cohort|INCONCLUSIVE_TOOLING"'
- Look for s43_acquire_outcome / s43_pred_persist_failed. An absent blob is tolerated (the crypto is
simply missing): the verdict degrades to INCONCLUSIVE_TOOLING, never a partial pool. Re-trigger
the failed acq tasks (clear + rerun); the gate re-reads the cohort.
- All cryptos absent = a transport/config problem (NOT 5 independent failures). Check S3
connectivity + the prefix (scenario C).
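The triage rule above (some blobs absent vs. all blobs absent) can be sketched as a small helper. This is an illustration only, not the gate's actual code: the function name, the status labels and the opaque example keys are all hypothetical.

```python
from typing import Iterable


def triage_cohort(expected_keys: Iterable[str], present_keys: Iterable[str]) -> dict:
    """Classify a cohort from its expected keys and the keys actually found in S3.

    Hypothetical helper: status labels are illustrative, not pipeline verdicts.
    """
    expected = set(expected_keys)
    if not expected:
        raise ValueError("expected cohort is empty")
    missing = sorted(expected - set(present_keys))
    if not missing:
        status = "COMPLETE"      # every expected blob exists: safe to pool
    elif len(missing) == len(expected):
        status = "ALL_ABSENT"    # transport/config problem, see scenario C
    else:
        status = "PARTIAL"       # clear + rerun the failed acq tasks
    # Pooling is all-or-nothing: a partial cohort is never pooled.
    return {"status": status, "missing": missing, "poolable": not missing}


# Opaque placeholder keys; real keys live under s43-predictions/<run_id>/…
print(triage_cohort(["k1", "k2", "k3"], ["k1", "k3"]))
```

Both PARTIAL and ALL_ABSENT correspond to an INCONCLUSIVE_TOOLING verdict; the distinction only tells you where to look first.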
B. ValueError: s43 npz schema mismatch / missing array / shape mismatch (fail-loud)
A blob that is present but malformed makes the gate raise (by design, ADR-25: corruption must
surface). Do NOT treat it as a missing pod. Likely causes: a SCHEMA_VERSION bump without a matching
producer, or a partial/truncated write. Inspect the blob, fix the producer, clear + rerun.
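To make the three error families concrete, here is a minimal fail-loud validator. It is a sketch under assumptions: the expected version value, the array names (y_true, y_pred) and the "equal length" shape rule are hypothetical and are not taken from the s43 code. It accepts any mapping of name to sequence, so it can be tried without NumPy.

```python
from typing import Mapping, Sequence

EXPECTED_SCHEMA_VERSION = 1              # hypothetical value
REQUIRED_ARRAYS = ("y_true", "y_pred")   # hypothetical array names


def validate_blob(schema_version: int, arrays: Mapping[str, Sequence]) -> None:
    """Raise ValueError on any malformation; return None when the blob is sane."""
    if schema_version != EXPECTED_SCHEMA_VERSION:
        raise ValueError(
            f"s43 npz schema mismatch: got {schema_version}, "
            f"expected {EXPECTED_SCHEMA_VERSION}"
        )
    for name in REQUIRED_ARRAYS:
        if name not in arrays:
            raise ValueError(f"s43 npz missing array: {name}")
    # Assumed shape rule: all required arrays share the same length.
    lengths = {name: len(arrays[name]) for name in REQUIRED_ARRAYS}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"s43 npz shape mismatch: {lengths}")


# A well-formed blob passes silently; a malformed one raises.
validate_blob(1, {"y_true": [0, 1], "y_pred": [0.2, 0.9]})
```

The point of raising rather than returning a flag is the one the runbook makes: a corrupt blob must not be mistaken for an absent one.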
C. S3 connectivity / wrong prefix
POD=$(kubectl -n cvntrade get pods -l component=scheduler --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "
from commun.s3.cvntrade_s3_manager import CVNTrade_S3Manager
s=CVNTrade_S3Manager(); print('objects:', len(s.list_files('s43-predictions/')))"
If the listing fails, suspect the S3 connection (aws_default / creds / bucket). Verify the S3 conn
(cf. runbook_xcom_backend_flip.md pre-flight). The s43 prefix is run-isolated:
s43-predictions/<run_id>/…, and the gate uses the SAME run's prefix as the acq pods (_s43_prefix),
so a prefix mismatch would only arise from a code change.
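When the listing succeeds but the gate still finds nothing, grouping the listed keys by run can show whether blobs landed under a different run's prefix. The sketch below assumes the listing yields full key strings starting with s43-predictions/ (the return type of list_files is not confirmed by this runbook); the function name and the example keys are hypothetical.

```python
from collections import Counter
from typing import Iterable

ROOT = "s43-predictions/"


def blobs_per_run(keys: Iterable[str]) -> Counter:
    """Count blobs per <run_id> segment for keys shaped s43-predictions/<run_id>/…"""
    counts: Counter = Counter()
    for key in keys:
        if not key.startswith(ROOT):
            continue  # not an s43 prediction blob
        run_id, sep, _rest = key[len(ROOT):].partition("/")
        if sep and run_id:  # skip keys with no run segment
            counts[run_id] += 1
    return counts


# Made-up keys for illustration only.
listing = [
    "s43-predictions/run_a/blob1.npz",
    "s43-predictions/run_a/blob2.npz",
    "s43-predictions/run_b/blob1.npz",
    "xcom/unrelated.json",
]
print(blobs_per_run(listing))
```

A run_id with zero entries while other runs are populated points at that run's acquisition tasks; an empty result overall points back at connectivity or the bucket.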
D. S3 footprint growth
Lifecycle rule infra/scaleway/s3-lifecycle-s43-predictions.json (7-day retention on
s43-predictions/) bounds it. ⚠️ Scaleway has one lifecycle config per bucket → merge with the
existing xcom/ + l2_events/ rules before put-bucket-lifecycle-configuration. Verify:
aws --endpoint-url https://s3.fr-par.scw.cloud s3api get-bucket-lifecycle-configuration --bucket cvntrade-artifacts
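Because applying a lifecycle configuration replaces the bucket's whole rule set, the merge step matters. A minimal sketch of that merge follows, using the standard S3 lifecycle JSON shape ({"Rules": [...]}). The rule IDs and the retention values for xcom/ and l2_events/ are hypothetical placeholders; only the 7-day s43-predictions/ rule comes from this runbook.

```python
import json


def merge_lifecycle_rules(*configs: dict) -> dict:
    """Concatenate the Rules of several lifecycle configs, refusing duplicate IDs."""
    merged, seen = [], set()
    for config in configs:
        for rule in config.get("Rules", []):
            rule_id = rule.get("ID")
            if rule_id is not None:
                if rule_id in seen:
                    raise ValueError(f"duplicate lifecycle rule ID: {rule_id}")
                seen.add(rule_id)
            merged.append(rule)
    return {"Rules": merged}


# Existing rules, e.g. as returned by get-bucket-lifecycle-configuration
# (IDs and day counts here are placeholders, not the real values).
existing = {"Rules": [
    {"ID": "xcom", "Filter": {"Prefix": "xcom/"}, "Status": "Enabled",
     "Expiration": {"Days": 30}},
    {"ID": "l2-events", "Filter": {"Prefix": "l2_events/"}, "Status": "Enabled",
     "Expiration": {"Days": 30}},
]}
s43 = {"Rules": [
    {"ID": "s43-predictions", "Filter": {"Prefix": "s43-predictions/"},
     "Status": "Enabled", "Expiration": {"Days": 7}},
]}
print(json.dumps(merge_lifecycle_rules(existing, s43), indent=2))
```

The printed document is what would be passed to put-bucket-lifecycle-configuration; rerun the get command afterwards to confirm all three prefixes are still covered.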
Rollback
The transport is per-run and stateless. No live rollback needed beyond reverting the code change
(the prior /tmp path was already broken cross-pod). Re-trigger the DAG run after any fix —
predictions regenerate under the run-isolated prefix.