Runbook — XCom object-storage backend flip (kill-switch + rollback)¶
Story: CVN-N014-ED-S01 · Epic: CVN-N014-ED · ADR: 0100
Scope: the global flip of AIRFLOW__CORE__XCOM_BACKEND from BaseXCom to XComObjectStorageBackend, and how to back it out safely.
Per ADR-26, start at Grafana. This runbook hands you the exact commands; it never asks you to open three tools first.
TL;DR¶
- Kill-switch = revert the Helm config and re-trigger any in-flight offloaded runs. A bare revert is not safe once any XCom has been offloaded (
> threshold): the metadata DB then holds only the object-store path string, andBaseXComhands that string to the downstream task verbatim instead of the data → downstream breaks. - Status: the flip is LIVE in prod (cut over 2026-06-06,
threshold = 1_600_000≈ p99.9 of the live distribution). Near-inert: 99.9% of XCom stays inline (byte-identical to BaseXCom); only the few values >1.6 MB offload. Kill-switch is load-bearing for any offloaded run. - Dotted-section encoding gotcha (the RCA of the 1st cut-over — see cutover_decision.md §RCA): the
common.iosection MUST be encodedAIRFLOW__COMMON_IO__…(underscore — Airflow doessection.replace('.', '_')), NOTAIRFLOW__COMMON.IO__…(literal dot). A dotted var is present in the env but invisible toconf.get→ backend init crashes. Verify resolution (conf.get), never presence (env | grep).
Pre-flight (before the flip)¶
Run the hardened pre-flight in the real worker/scheduler context — it proves the provider, the S3 connection, and that the dotted-section common.io config actually TOOK (env-var-export ≠ Airflow-read):
python scripts/xcom_flip_preflight.py --conn-id <s3_conn> \
--expect-backend airflow.providers.common.io.xcom.backend.XComObjectStorageBackend \
--expect-path 's3://<s3_conn>@<bucket>/xcom' --expect-threshold <bytes> --functional-s3
All checks must be OK (exit 0). A FAIL on config_threshold/config_path = the env var did not resolve → do not trust the flip.
Post-flip verification (in-vivo system test)¶
After the cut-over deploy succeeds, two layers must be checked: (A) the scheduler resolves the config, then (B) a real DAG run exercises XCom through a step-pod (the production path — the scheduler test alone does NOT cover the KPO step-pod, which gets the config via extraEnvFrom).
A. Scheduler resolution (hardened gate — resolution, not presence)¶
POD=$(kubectl -n cvntrade get pods -l component=scheduler --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "from airflow.configuration import conf; \
print(conf.get('common.io','xcom_objectstorage_path')); \
print(conf.getint('common.io','xcom_objectstorage_threshold')); \
from airflow.models.xcom import XCom; print(XCom.__name__)"
# expect: the s3 path, the threshold int, and 'XComObjectStorageBackend'.
# If conf.get raises AirflowConfigException or XCom won't import -> KILL-SWITCH (do NOT call it a 'CLI quirk').
B. Real DAG run (the step-pod path)¶
- Pick a DAG that pushes XCom.
diagnostic__s42is the reference: itsdiscriminate_cell.return_valuedicts span the threshold (p99.9 ≈ 1.5 MB) → tests both inline (near-inert) and S3 offload. Any short XCom-pushing DAG works for a lighter smoke. - Trigger via the Airflow UI (manual — no direct
python, ADR-18): DAGs →diagnostic__s42→ Trigger DAG w/ config → set params (crypto_group=defi_top5, …) → Trigger. Note therun_id. - Watch: Airflow grid — no
failedtask (especially XCom-consuming ones). In parallel, query Loki via the project tooling (scripts/loki_query.py/ theloki-queryskill — it manages the port-forward itself and uses the non-lossy stdout stream; do NOT hand-roll aport-forward+curl): - Read the trajectory back through the new backend (proves read path):
- Confirm offload (if any cell > 1.6 MB):
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c " from airflow.io.path import ObjectStoragePath base=ObjectStoragePath('s3://aws_default@cvntrade-artifacts/xcom') objs=[str(p) for p in base.rglob('*') if 'diagnostic__s42' in str(p)] print('offloaded objects for this DAG:', len(objs)); [print(' ',o) for o in objs[:5]]" # 0 = everything stayed inline (near-inert confirmed); >0 = prod offload confirmed.
Pass / fail¶
| Observation | Verdict |
|---|---|
Run success, 0 XCom-failed task, trajectory readable, 0 Loki error |
✅ flip validated in-vivo (step-pod path) |
Any AirflowConfigException / XCom-failed task / unreadable XCom |
🔴 KILL-SWITCH (revert + re-trigger, below) |
Detect — is a back-out needed?¶
| Signal | Where |
|---|---|
XCom write/read errors, tasks failing on xcom_pull |
Grafana → Airflow task-failure panel; Loki {namespace="cvntrade"} \|~ "xcom" |
Downstream task receives a s3://… string instead of data |
task logs (the offloaded-then-reverted symptom) |
| Object-store path effective vs expected | python scripts/xcom_flip_preflight.py --expect-path … --expect-threshold … |
| Round-trip regression on control-flow | python scripts/xcom_roundtrip_probe.py (control-flow must stay GREEN) |
Kill-switch — revert + re-trigger¶
- Revert the backend in Helm (
infra/helm/airflow/values-prod.yaml) →AIRFLOW__CORE__XCOM_BACKEND=airflow.models.xcom.BaseXCom(or remove the 4common.ioenv vars), redeploy: - Confirm the revert took (in-pod, not the file):
- Re-trigger in-flight runs that wrote offloaded XComs. XComs are per-run, so a re-run regenerates them under
BaseXCom(self-heals) — there is no mid-run continuity to preserve: - Identify runs in-flight during the flip window (Airflow UI /
airflow dags list-runs). - Clear + re-run the affected DAG-runs (
airflow tasks clear/ UI "Clear") so upstream tasks re-emit XComs underBaseXCom. - Sub-threshold XComs need no action — they were stored as plain JSON, readable by
BaseXComdirectly. - Verify: control-flow round-trip green (
xcom_roundtrip_probe.py), noxcom-keyed task failures in Grafana/Loki.
Rehearsal (mandatory gate before the prod cut-over)¶
The revert+re-trigger sequence MUST be rehearsed before the production cut-over (committee recommendation, plan_review 8d7efca0). Rehearsal = on a non-prod context (local docker-compose or a throwaway run): flip → write an above-threshold XCom → revert backend → observe the downstream-gets-path-string break → apply re-trigger → observe self-heal. Document the observed downtime. The cut-over is gated on a successful rehearsal.
Notes¶
- Orphaned objects: a revert leaves the offloaded objects in S3 (now unreferenced). The S3 lifecycle rule on
xcom/(ADR-0100) expires them; an orphan-sweeper, if needed, is tracked in CVN-N014-ED-S03. - Fail-loud: if the object store is unreachable at runtime, an above-threshold XCom write fails and the task fails (ADR-25, no silent fallback) — this is intended; investigate S3/connection, do not add a fallback.