MLOps readiness: CVN-N014-ED-S01, global object-storage XCom backend flip
Story: CVN-N014-ED-S01 — GH #1105 · OP wp#242
Owner: @dococeven (DRI for production behaviour of this change)
Filled on: 2026-06-05
Reviewed by committee: plan_review session 8d7efca0 (PASSED, 0 blockers) → Meeting #249
Scope note: this Story ships tooling + the standard (ADR-0100) + runbook, all read-only/no-prod-change on the branch. The flip itself (values-prod.yaml, all-DAGs blast radius) is staged separately and gated on the items below + a kill-switch rehearsal.
1. Production monitoring (MUST)
XCom backend health post-flip (committee recommendation), via Grafana → Prometheus/Loki:
- Offload rate — count of XCom values > threshold written to S3 (expected ≈ 0 at flip, near-inert).
- XCom size distribution — inline vs offloaded, to confirm the threshold keeps control-flow in the DB.
- Error rate — task failures keyed on xcom_push/xcom_pull; S3 write/read errors for offloaded values.
- Offload latency — S3 write/read time on the XCom path (watch for hot-path regressions).
- Baseline: pre-flight scripts/xcom_flip_preflight.py + scripts/xcom_roundtrip_probe.py (control-flow GREEN).
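The offload-rate and size-distribution metrics above can be sketched as a small offline check. This is only an illustration: `THRESHOLD_BYTES`, `serialized_size` and `offload_summary` are hypothetical names, plain `json.dumps` stands in for the backend's own encoder, and the "greater than threshold" boundary follows the wording of this document rather than a verified backend behaviour.

```python
import json
from typing import Any, Dict, Iterable

# Assumption: hypothetical threshold in bytes; the real value lives in the deployed config.
THRESHOLD_BYTES = 1_000_000


def serialized_size(value: Any) -> int:
    """Size in bytes of the JSON-encoded value (stand-in for the backend's encoder)."""
    return len(json.dumps(value).encode("utf-8"))


def offload_summary(values: Iterable[Any], threshold: int = THRESHOLD_BYTES) -> Dict[str, float]:
    """Split XCom values into inline vs offloaded and report the offload rate."""
    sizes = [serialized_size(v) for v in values]
    total = len(sizes)
    offloaded = sum(1 for s in sizes if s > threshold)
    return {
        "total": total,
        "inline": total - offloaded,
        "offloaded": offloaded,
        "offload_rate": offloaded / total if total else 0.0,
        "max_bytes": max(sizes, default=0),
    }
```

Run against a sample of current XCom payloads, an `offload_rate` of 0.0 would match the "near-inert at flip" expectation, and `max_bytes` shows how much headroom remains below the threshold.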
2. Alerting & runbooks (MUST)
- Alert: any xcom-keyed task-failure spike, or S3-path error on XCom → P2 (escalate to P1 if control-flow regresses).
- Runbook: runbook_xcom_backend_flip.md (detection, kill-switch = revert + re-trigger, rehearsal-before-cut-over gate).
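The severity rule in the alert bullet can be written down as a tiny decision function. This is a sketch of the rule as stated here, not the actual alert definition; `alert_severity` and its three boolean inputs are hypothetical and would in practice be derived from the Prometheus/Loki signals.

```python
from typing import Optional


def alert_severity(
    xcom_failure_spike: bool,
    s3_path_error: bool,
    control_flow_regressed: bool,
) -> Optional[str]:
    """Return "P1", "P2", or None when neither trigger condition holds."""
    if not (xcom_failure_spike or s3_path_error):
        # No XCom-keyed failure spike and no S3-path error: nothing to page on.
        return None
    # Either trigger raises a P2; a control-flow regression escalates it to P1.
    return "P1" if control_flow_regressed else "P2"
```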
3. Rollback (MUST)
- Kill-switch: revert AIRFLOW__CORE__XCOM_BACKEND → BaseXCom via Helm (helm rollback or revert values-prod.yaml).
- Critical nuance: a bare revert is unsafe once any XCom is offloaded (DB holds only the path string) → revert + re-trigger in-flight runs (XComs are per-run, self-heal on re-run). Sub-threshold XComs need no action (plain JSON).
- Rehearsed before cut-over (mandatory gate, runbook §Rehearsal).
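To illustrate the "revert + re-trigger" nuance: after a revert, a run is only at risk if one of its stored XCom values is a path string pointing into object storage. The sketch below picks out such runs from a mapping of run IDs to stored values. `OFFLOAD_PREFIX`, `is_offloaded_reference` and `runs_needing_retrigger` are hypothetical, and the prefix test is an assumed way of recognising an offloaded reference, not the backend's own check.

```python
from typing import Any, Dict, List

# Assumption: hypothetical storage prefix; the real one comes from the deployed config.
OFFLOAD_PREFIX = "s3://example-bucket/xcom/"


def is_offloaded_reference(db_value: Any, prefix: str = OFFLOAD_PREFIX) -> bool:
    """True when the DB holds only a path string into object storage."""
    return isinstance(db_value, str) and db_value.startswith(prefix)


def runs_needing_retrigger(
    in_flight: Dict[str, List[Any]], prefix: str = OFFLOAD_PREFIX
) -> List[str]:
    """Run IDs with at least one offloaded XCom; the rest hold plain JSON and need no action."""
    return sorted(
        run_id
        for run_id, values in in_flight.items()
        if any(is_offloaded_reference(v, prefix) for v in values)
    )
```

With the default backend restored, a path string would be handed to downstream tasks as if it were the payload, which is why the flagged runs are re-triggered rather than left to continue.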
4. Data dependencies & failure modes
- S3 object store (existing project substrate) + an Airflow S3 connection, proven by the pre-flight (--functional-s3) before flip.
- Dotted-section config footgun: AIRFLOW__COMMON_IO__... can be silently dropped → pre-flight proves the effective in-pod config (airflow config get-value), not the file.
- Serialization: backend transports XComEncoder-JSON only; numpy/pandas fail at any size (verified). Deferred to S02 (pass-by-reference default Y). No silent fallback (ADR-25): unreachable S3 fails the task loudly.
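The "prove the effective config, not the file" step can be sketched as a comparison between an expected value and what `airflow config get-value <section> <option>` prints inside the pod. This is not the content of scripts/xcom_flip_preflight.py; `run_cli`, `effective_value` and `config_is_effective` are hypothetical helpers, and taking the last non-empty output line is an assumption made to tolerate warnings printed before the value.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_cli(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on non-zero exit."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def effective_value(section: str, option: str, run: Runner = run_cli) -> str:
    """Ask the Airflow CLI for the value the running process actually resolves."""
    out = run(["airflow", "config", "get-value", section, option])
    # Assumption: warnings may precede the value, so keep the last non-empty line.
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def config_is_effective(
    section: str, option: str, expected: str, run: Runner = run_cli
) -> bool:
    """True only when the in-pod value equals what the values file intended."""
    return effective_value(section, option, run) == expected
```

Passing the runner as a parameter keeps the check testable without an Airflow install; in a pod it would be called with the dotted section name (for example "common.io") and whichever option the environment variable is meant to set.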
5. Residual risk
- Current DAG fleet is JSON-only + sub-threshold (verified) → flip near-inert, residual risk low.
- Real risk is the first above-threshold payload (post-S02) — covered by the threshold ramp (prospective), the dynamic round-trip on the s43 shape, and the revert+re-trigger kill-switch.