MLOps readiness — CVN-N014-ED-S01 — global object-storage XCom backend flip

Story: CVN-N014-ED-S01 · GH #1105 · OP wp#242

Owner: @dococeven (DRI for production behaviour of this change)

Filled on: 2026-06-05

Reviewed by committee: plan_review session 8d7efca0 (PASSED, 0 blockers) → Meeting #249

Scope note: this Story ships tooling + the standard (ADR-0100) + runbook, all read-only/no-prod-change on the branch. The flip itself (values-prod.yaml, all-DAGs blast radius) is staged separately and gated on the items below + a kill-switch rehearsal.

1. Production monitoring (MUST)

XCom backend health post-flip (committee recommendation), via Grafana → Prometheus/Loki:

  • Offload rate: count of XCom values > threshold written to S3 (expected ≈ 0 at flip, near-inert).
  • XCom size distribution: inline vs offloaded, to confirm the threshold keeps control-flow in the DB.
  • Error rate: task failures keyed on xcom_push/xcom_pull; S3 write/read errors for offloaded values.
  • Offload latency: S3 write/read time on the XCom path (watch for hot-path regressions).
  • Baseline: pre-flight scripts/xcom_flip_preflight.py + scripts/xcom_roundtrip_probe.py (control-flow GREEN).
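A minimal sketch of the inline-vs-offloaded split behind the size-distribution metric. It assumes size is measured on the UTF-8 JSON encoding and that a value is offloaded when it is strictly above the threshold, following the wording above; the real backend's encoder and boundary behaviour may differ. classify_xcom_sizes and threshold_bytes are hypothetical names, not part of the pre-flight scripts.

```python
import json


def classify_xcom_sizes(values, threshold_bytes):
    """Count how many values would stay inline vs. be offloaded.

    Assumption: size = length of the UTF-8 JSON encoding, and a value is
    offloaded only when that size is strictly greater than threshold_bytes.
    """
    counts = {"inline": 0, "offloaded": 0}
    for value in values:
        size = len(json.dumps(value).encode("utf-8"))
        bucket = "offloaded" if size > threshold_bytes else "inline"
        counts[bucket] += 1
    return counts
```

Run against a sample of recent XCom payloads, an all-inline result is what "near-inert at flip" would look like.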

2. Alerting & runbooks (MUST)

  • Alert: any xcom-keyed task-failure spike, or S3-path error on XCom → P2 (escalate to P1 if control-flow regresses).
  • Runbook: runbook_xcom_backend_flip.md — detection, kill-switch = revert + re-trigger, rehearsal-before-cut-over gate.

3. Rollback (MUST)

  • Kill-switch: revert AIRFLOW__CORE__XCOM_BACKEND to BaseXCom via Helm (helm rollback or revert values-prod.yaml).
  • Critical nuance: a bare revert is unsafe once any XCom is offloaded. The DB then holds only the path string, so after the revert the default backend hands that string to downstream tasks instead of the payload → revert + re-trigger in-flight runs (XComs are per-run, self-heal on re-run). Sub-threshold XComs need no action (plain JSON).
  • Rehearsed before cut-over (mandatory gate, runbook §Rehearsal).
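The revert + re-trigger rule can be sketched as a filter over stored XCom rows. This is an illustration only: it assumes offloaded entries are recognisable as strings starting with the configured object-storage prefix, and the row shape (run_id, value), the function names, and the example prefix are all hypothetical. The runbook stays the authority on how in-flight runs are actually identified.

```python
def needs_retrigger(stored_value, offload_prefix):
    """True if the stored value looks like an offloaded-path reference.

    Assumption: offloaded XComs are stored as a string that starts with
    the object-storage prefix; anything else is treated as plain JSON.
    """
    return isinstance(stored_value, str) and stored_value.startswith(offload_prefix)


def runs_to_retrigger(xcom_rows, offload_prefix):
    """Return the sorted run ids that hold at least one offloaded XCom."""
    affected = {
        row["run_id"]
        for row in xcom_rows
        if needs_retrigger(row["value"], offload_prefix)
    }
    return sorted(affected)


# Hypothetical rows: only r1 and r3 hold a path string under the prefix.
rows = [
    {"run_id": "r1", "value": "s3://example-bucket/xcom/a"},
    {"run_id": "r2", "value": {"k": 1}},
    {"run_id": "r1", "value": "ok"},
    {"run_id": "r3", "value": "s3://example-bucket/xcom/b"},
]
affected_runs = runs_to_retrigger(rows, "s3://example-bucket/xcom/")
```

Runs outside the returned list hold only sub-threshold values and, per the rule above, need no action after the revert.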

4. Data dependencies & failure modes

  • S3 object store (existing project substrate) + an Airflow S3 connection — proven by the pre-flight (--functional-s3) before flip.
  • Dotted-section config footgun: AIRFLOW__COMMON_IO__... can be silently dropped → pre-flight proves the effective in-pod config (airflow config get-value), not the file.
  • Serialization: backend transports XComEncoder-JSON only; numpy/pandas fail at any size (verified). Deferred to S02 (pass-by-reference default Y). No silent fallback (ADR-25): unreachable S3 fails the task loudly.
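A hedged sketch of the "prove the effective config, not the file" check. It shells out to airflow config get-value inside the pod and compares the result with the expected value. The function names and the injectable run parameter are assumptions made for testability, not the interface of scripts/xcom_flip_preflight.py, and the section and key are passed in by the caller rather than hard-coded here.

```python
import subprocess


def get_effective_value(section, key, run=subprocess.run):
    """Ask the Airflow CLI for the value it actually resolved."""
    result = run(
        ["airflow", "config", "get-value", section, key],
        capture_output=True,
        text=True,
        check=True,  # a missing section/key makes the CLI exit non-zero
    )
    return result.stdout.strip()


def assert_effective(section, key, expected, run=subprocess.run):
    """Fail loudly when the in-pod value differs from the intended one."""
    actual = get_effective_value(section, key, run=run)
    if actual != expected:
        raise RuntimeError(
            f"[{section}] {key}: expected {expected!r}, effective {actual!r}"
        )
    return actual
```

Raising instead of logging keeps the check in line with the no-silent-fallback stance: a dropped environment variable stops the pre-flight rather than passing unnoticed.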

5. Residual risk

  • Current DAG fleet is JSON-only + sub-threshold (verified) → flip near-inert, residual risk low.
  • Real risk is the first above-threshold payload (post-S02) — covered by the threshold ramp (prospective), the dynamic round-trip on the s43 shape, and the revert+re-trigger kill-switch.