
Runbook — s43 cross-pod prediction transport (S3)

Story: CVN-N014-ED-S02 · Architecture: design · ADR: 0100. Per ADR-26, start at Grafana; this runbook hands you the exact commands.

What it is

s43's acquisition pods persist per-(crypto, fold, family) prediction .npz blobs to S3 (s3://cvntrade-artifacts/s43-predictions/<run_id>/…); the gate pod loads the expected-cohort keys and pools them. XCom carries only the JSON key. Query logs via scripts/loki_query.py (non-lossy stdout stream), not a hand-rolled port-forward.

Failure scenarios

A. INCONCLUSIVE_TOOLING / cohort incomplete

python scripts/loki_query.py --since 1h \
  --query '{namespace="cvntrade"} |~ "(?i)s43_pred_(persist|load)_failed|cohort|INCONCLUSIVE_TOOLING"'
- Some cryptos absent = those acq pods failed or hit K=0 (no draws survived). Check each pod's s43_acquire_outcome / s43_pred_persist_failed. An absent blob is tolerated (the crypto is simply missing): the verdict degrades to INCONCLUSIVE_TOOLING, never a partial pool. Re-trigger the failed acq tasks (clear + rerun); the gate re-reads the cohort.
- All cryptos absent = a transport/config problem (NOT 5 independent failures). Check S3 connectivity + the prefix (scenario C).
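The triage rule above (some absent vs. all absent, and never pooling a partial cohort) can be sketched as a small pure function. This is an illustration only, not the gate's actual code: `classify_cohort`, the status labels, and the key strings are hypothetical names introduced here.

```python
from typing import Iterable


def classify_cohort(expected_keys: Iterable[str], present_keys: Iterable[str]) -> dict:
    """Classify a cohort from its expected keys and the keys actually found.

    Hypothetical helper: the status labels are illustrative, not log strings.
    """
    expected = set(expected_keys)
    if not expected:
        raise ValueError("expected cohort is empty")
    present = expected & set(present_keys)  # ignore keys outside the cohort
    missing = sorted(expected - present)
    if not missing:
        status = "COMPLETE"
    elif not present:
        status = "ALL_ABSENT"   # points at transport/config, see scenario C
    else:
        status = "SOME_ABSENT"  # individual acq pods failed or hit K=0
    # Pooling is only allowed on a complete cohort, never on a partial one.
    return {"status": status, "missing": missing, "poolable": not missing}
```

Feeding it the expected keys and the keys returned by an S3 listing gives a quick answer to "is this scenario A or scenario C?" before reading individual pod logs.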

B. ValueError: s43 npz schema mismatch / missing array / shape mismatch (fail-loud)

A present blob is malformed → the gate raises (by design, ADR-25 — corruption must surface). Do NOT treat it as a missing pod. Likely causes: a SCHEMA_VERSION bump without a matching producer, or a partial/truncated write. Inspect the blob, fix the producer, clear + rerun.
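For the "inspect the blob" step, a minimal sketch that uses only the standard library, so it should run in any pod even without NumPy. It relies on two facts about the format: an .npz file is a zip archive of .npy members, and each .npy member starts with a small header describing dtype and shape. `inspect_npz` and `read_npy_header` are hypothetical helper names, not part of the project. The sketch does not check SCHEMA_VERSION, since this runbook does not say where that value is stored.

```python
import ast
import struct
import zipfile

NPY_MAGIC = b"\x93NUMPY"


def read_npy_header(raw: bytes) -> dict:
    """Parse the header dict ('descr', 'fortran_order', 'shape') of one .npy payload."""
    if raw[:6] != NPY_MAGIC:
        raise ValueError("member is not an .npy payload")
    major = raw[6]
    if major == 1:
        (hlen,) = struct.unpack("<H", raw[8:10])   # v1: 2-byte header length
        start = 10
    else:
        (hlen,) = struct.unpack("<I", raw[8:12])   # v2+: 4-byte header length
        start = 12
    header = raw[start:start + hlen].decode("latin1")
    return ast.literal_eval(header)


def inspect_npz(path: str) -> dict:
    """Return {array_name: (shape, dtype_descr)} for a downloaded blob.

    Raises zipfile.BadZipFile on a truncated archive and ValueError on a
    member whose CRC does not match.
    """
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # reads every member and verifies its CRC
        if bad is not None:
            raise ValueError(f"corrupt member: {bad}")
        out = {}
        for name in zf.namelist():
            meta = read_npy_header(zf.read(name))
            array_name = name[:-4] if name.endswith(".npy") else name
            out[array_name] = (meta["shape"], meta["descr"])
        return out
```

A `BadZipFile` here suggests a partial/truncated write; a clean listing with unexpected names or shapes suggests a producer that does not match the schema the gate expects.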

C. S3 connectivity / wrong prefix

POD=$(kubectl -n cvntrade get pods -l component=scheduler --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "
from commun.s3.cvntrade_s3_manager import CVNTrade_S3Manager
s=CVNTrade_S3Manager(); print('objects:', len(s.list_files('s43-predictions/')))"
- 0 objects after a run that should have written = the acq pods didn't reach S3 (conn aws_default / creds / bucket). Verify the S3 conn (cf. runbook_xcom_backend_flip.md pre-flight).
- The s43 prefix is run-isolated: s43-predictions/<run_id>/… The gate uses the SAME run's prefix as the acq pods (_s43_prefix), so a prefix mismatch would only arise from a code change.
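Because the prefix is run-isolated, a raw object count can hide the fact that only older runs wrote anything. The sketch below groups listed keys by their `<run_id>` segment. It assumes the listing yields full key strings starting with `s43-predictions/` (the runbook only shows that `list_files` returns something with a length); `objects_per_run` is a hypothetical helper.

```python
from collections import Counter

PREFIX = "s43-predictions/"


def objects_per_run(keys) -> dict:
    """Count objects per run_id, where run_id is the path segment right after PREFIX."""
    counts = Counter()
    for key in keys:
        if not key.startswith(PREFIX):
            continue  # not an s43 prediction object
        run_id, sep, _rest = key[len(PREFIX):].partition("/")
        if sep and run_id:  # skip keys that have no run_id segment
            counts[run_id] += 1
    return dict(counts)
```

If the run under investigation is missing from the result while earlier runs are present, connectivity worked at some point and the problem is specific to that run's acq pods.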

D. S3 footprint growth

Lifecycle rule infra/scaleway/s3-lifecycle-s43-predictions.json (7-day retention on s43-predictions/) bounds it. ⚠️ Scaleway has one lifecycle config per bucket → merge with the existing xcom/ + l2_events/ rules before put-bucket-lifecycle-configuration, otherwise the put replaces them. Verify:

aws --endpoint-url https://s3.fr-par.scw.cloud s3api get-bucket-lifecycle-configuration --bucket cvntrade-artifacts
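The merge called for above can be sketched in a few lines. This assumes both the output of get-bucket-lifecycle-configuration and the rule file use the standard S3 shape `{"Rules": [...]}`; `merge_lifecycle` is a hypothetical helper, and the rule file's actual contents are not shown in this runbook.

```python
import json


def merge_lifecycle(existing: dict, addition: dict) -> dict:
    """Merge two lifecycle configs; on an ID clash the rule from `addition` wins."""

    def rule_key(rule: dict) -> str:
        # "ID" is optional in S3 lifecycle rules; fall back to the filter.
        return rule.get("ID") or json.dumps(rule.get("Filter", {}), sort_keys=True)

    merged = {rule_key(r): r for r in existing.get("Rules", [])}
    for rule in addition.get("Rules", []):
        merged[rule_key(rule)] = rule
    return {"Rules": list(merged.values())}
```

Dump the result with `json.dumps(..., indent=2)`, review it, and only then pass it to put-bucket-lifecycle-configuration. Re-running the merge against an already merged config leaves it unchanged, so the step is safe to repeat.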

Rollback

The transport is per-run and stateless. No live rollback needed beyond reverting the code change (the prior /tmp path was already broken cross-pod). Re-trigger the DAG run after any fix — predictions regenerate under the run-isolated prefix.