Runbook — XCom object-storage backend flip (kill-switch + rollback)¶

Story: CVN-N014-ED-S01 · Epic: CVN-N014-ED · ADR: 0100 Scope: the global flip of AIRFLOW__CORE__XCOM_BACKEND from BaseXCom to XComObjectStorageBackend, and how to back it out safely.

Per ADR-26, start at Grafana. This runbook hands you the exact commands; it never asks you to open three tools first.

TL;DR¶

Kill-switch = revert the Helm config and re-trigger any in-flight offloaded runs. A bare revert is not safe once any XCom has been offloaded (> threshold): the metadata DB then holds only the object-store path string, and BaseXCom hands that string to the downstream task verbatim instead of the data → downstream breaks.
Status: the flip is LIVE in prod (cut over 2026-06-06, threshold = 1_600_000 ≈ p99.9 of the live distribution). Near-inert: 99.9% of XCom stays inline (byte-identical to BaseXCom); only the few values >1.6 MB offload. Kill-switch is load-bearing for any offloaded run.
Dotted-section encoding gotcha (the RCA of the 1st cut-over — see cutover_decision.md §RCA): the common.io section MUST be encoded AIRFLOW__COMMON_IO__… (underscore — Airflow does section.replace('.', '_')), NOT AIRFLOW__COMMON.IO__… (literal dot). A dotted var is present in the env but invisible to conf.get → backend init crashes. Verify resolution (conf.get), never presence (env | grep).

Pre-flight (before the flip)¶

Run the hardened pre-flight in the real worker/scheduler context — it proves the provider, the S3 connection, and that the dotted-section common.io config actually TOOK (env-var-export ≠ Airflow-read):

python scripts/xcom_flip_preflight.py --conn-id <s3_conn> \
  --expect-backend airflow.providers.common.io.xcom.backend.XComObjectStorageBackend \
  --expect-path 's3://<s3_conn>@<bucket>/xcom' --expect-threshold <bytes> --functional-s3

All checks must be OK (exit 0). A FAIL on config_threshold/config_path = the env var did not resolve → do not trust the flip.

Post-flip verification (in-vivo system test)¶

After the cut-over deploy succeeds, two layers must be checked: (A) the scheduler resolves the config, then (B) a real DAG run exercises XCom through a step-pod (the production path — the scheduler test alone does NOT cover the KPO step-pod, which gets the config via extraEnvFrom).

A. Scheduler resolution (hardened gate — resolution, not presence)¶

POD=$(kubectl -n cvntrade get pods -l component=scheduler --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "from airflow.configuration import conf; \
  print(conf.get('common.io','xcom_objectstorage_path')); \
  print(conf.getint('common.io','xcom_objectstorage_threshold')); \
  from airflow.models.xcom import XCom; print(XCom.__name__)"
# expect: the s3 path, the threshold int, and 'XComObjectStorageBackend'.
# If conf.get raises AirflowConfigException or XCom won't import -> KILL-SWITCH (do NOT call it a 'CLI quirk').

B. Real DAG run (the step-pod path)¶

Pick a DAG that pushes XCom. diagnostic__s42 is the reference: its discriminate_cell.return_value dicts span the threshold (p99.9 ≈ 1.5 MB) → tests both inline (near-inert) and S3 offload. Any short XCom-pushing DAG works for a lighter smoke.
Trigger via the Airflow UI (manual — no direct python, ADR-18): DAGs → diagnostic__s42 → Trigger DAG w/ config → set params (crypto_group=defi_top5, …) → Trigger. Note the run_id.
Watch: Airflow grid — no failed task (especially XCom-consuming ones). In parallel, query Loki via the project tooling (scripts/loki_query.py / the loki-query skill — it manages the port-forward itself and uses the non-lossy stdout stream; do NOT hand-roll a port-forward + curl):
```
python scripts/loki_query.py --since 15m \
  --query '{namespace="cvntrade"} |~ "(?i)xcom|AirflowConfigException|XComObjectStorage"'
# expect: nothing (no error/exception/traceback line).
```

Read the trajectory back through the new backend (proves read path):

python scripts/airflow_xcom_pull.py --dag-id diagnostic__s42 --run-id '<run_id>' --limit 5
# expect: the return_value dicts read correctly (inline OR resolved from S3).

Confirm offload (if any cell > 1.6 MB):

kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "
from airflow.io.path import ObjectStoragePath
base=ObjectStoragePath('s3://aws_default@cvntrade-artifacts/xcom')
objs=[str(p) for p in base.rglob('*') if 'diagnostic__s42' in str(p)]
print('offloaded objects for this DAG:', len(objs)); [print(' ',o) for o in objs[:5]]"
# 0 = everything stayed inline (near-inert confirmed); >0 = prod offload confirmed.

Pass / fail¶

Observation	Verdict
Run `success`, 0 XCom-failed task, trajectory readable, 0 Loki error	✅ flip validated in-vivo (step-pod path)
Any `AirflowConfigException` / XCom-failed task / unreadable XCom	🔴 KILL-SWITCH (revert + re-trigger, below)

Detect — is a back-out needed?¶

Signal	Where
XCom write/read errors, tasks failing on `xcom_pull`	Grafana → Airflow task-failure panel; Loki `{namespace="cvntrade"} \\|~ "xcom"`
Downstream task receives a `s3://…` string instead of data	task logs (the offloaded-then-reverted symptom)
Object-store path effective vs expected	`python scripts/xcom_flip_preflight.py --expect-path … --expect-threshold …`
Round-trip regression on control-flow	`python scripts/xcom_roundtrip_probe.py` (control-flow must stay GREEN)

Kill-switch — revert + re-trigger¶

Revert the backend in Helm (infra/helm/airflow/values-prod.yaml) → AIRFLOW__CORE__XCOM_BACKEND=airflow.models.xcom.BaseXCom (or remove the 4 common.io env vars), redeploy:

# via the normal deploy path (Helm = SSoT, #378) — push values-prod.yaml, CI runs helm upgrade
# OR emergency: helm rollback <release> <previous-revision> -n cvntrade

Confirm the revert took (in-pod, not the file):

kubectl -n cvntrade exec <scheduler-pod> -c scheduler -- airflow config get-value core xcom_backend
# expect: airflow.models.xcom.BaseXCom

Re-trigger in-flight runs that wrote offloaded XComs. XComs are per-run, so a re-run regenerates them under BaseXCom (self-heals) — there is no mid-run continuity to preserve:
Identify runs in-flight during the flip window (Airflow UI / airflow dags list-runs).
Clear + re-run the affected DAG-runs (airflow tasks clear / UI "Clear") so upstream tasks re-emit XComs under BaseXCom.
Sub-threshold XComs need no action — they were stored as plain JSON, readable by BaseXCom directly.
Verify: control-flow round-trip green (xcom_roundtrip_probe.py), no xcom-keyed task failures in Grafana/Loki.

Rehearsal (mandatory gate before the prod cut-over)¶

The revert+re-trigger sequence MUST be rehearsed before the production cut-over (committee recommendation, plan_review 8d7efca0). Rehearsal = on a non-prod context (local docker-compose or a throwaway run): flip → write an above-threshold XCom → revert backend → observe the downstream-gets-path-string break → apply re-trigger → observe self-heal. Document the observed downtime. The cut-over is gated on a successful rehearsal.

Notes¶

Orphaned objects: a revert leaves the offloaded objects in S3 (now unreferenced). The S3 lifecycle rule on xcom/ (ADR-0100) expires them; an orphan-sweeper, if needed, is tracked in CVN-N014-ED-S03.
Fail-loud: if the object store is unreachable at runtime, an above-threshold XCom write fails and the task fails (ADR-25, no silent fallback) — this is intended; investigate S3/connection, do not add a fallback.