Runbook — XCom object-storage backend flip (kill-switch + rollback)

Story: CVN-N014-ED-S01 · Epic: CVN-N014-ED · ADR: 0100

Scope: the global flip of AIRFLOW__CORE__XCOM_BACKEND from BaseXCom to XComObjectStorageBackend, and how to back it out safely.

Per ADR-26, start at Grafana. This runbook hands you the exact commands; it never asks you to open three tools first.

TL;DR

  • Kill-switch = revert the Helm config and re-trigger any in-flight offloaded runs. A bare revert is not safe once any XCom has been offloaded (> threshold): the metadata DB then holds only the object-store path string, and BaseXCom hands that string to the downstream task verbatim instead of the data → downstream breaks.
  • Status: the flip is LIVE in prod (cut over 2026-06-06, threshold = 1_600_000 ≈ p99.9 of the live distribution). Near-inert: 99.9% of XCom stays inline (byte-identical to BaseXCom); only the few values >1.6 MB offload. Kill-switch is load-bearing for any offloaded run.
  • Dotted-section encoding gotcha (the RCA of the 1st cut-over — see cutover_decision.md §RCA): the common.io section MUST be encoded AIRFLOW__COMMON_IO__… (underscore — Airflow does section.replace('.', '_')), NOT AIRFLOW__COMMON.IO__… (literal dot). A dotted var is present in the env but invisible to conf.get → backend init crashes. Verify resolution (conf.get), never presence (env | grep).
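The encoding gotcha can be illustrated with a minimal sketch. It only mirrors the section.replace('.', '_') rule quoted above and is not Airflow's own implementation; env_var_name, resolve and is_present are hypothetical helper names, and the env dicts are made-up examples.

```python
from typing import Dict, Optional


def env_var_name(section: str, key: str) -> str:
    """Name a conf.get-style lookup would search for (sketch of the dot -> underscore rule)."""
    return f"AIRFLOW__{section.replace('.', '_').upper()}__{key.upper()}"


def resolve(env: Dict[str, str], section: str, key: str) -> Optional[str]:
    """Resolution check: return the value only if it sits under the derived name."""
    return env.get(env_var_name(section, key))


def is_present(env: Dict[str, str], fragment: str) -> bool:
    """Presence check: roughly what `env | grep` proves, and no more."""
    return any(fragment in name for name in env)


# Hypothetical environments: same value, different section encoding.
good_env = {"AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD": "1600000"}
bad_env = {"AIRFLOW__COMMON.IO__XCOM_OBJECTSTORAGE_THRESHOLD": "1600000"}

for label, env in (("underscore", good_env), ("literal dot", bad_env)):
    print(
        label,
        "present:", is_present(env, "XCOM_OBJECTSTORAGE_THRESHOLD"),
        "resolved:", resolve(env, "common.io", "xcom_objectstorage_threshold"),
    )
```

Both environments pass the presence check, but only the underscore form resolves, which is why the gates in this runbook test conf.get rather than the environment.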

Pre-flight (before the flip)

Run the hardened pre-flight in the real worker/scheduler context — it proves the provider, the S3 connection, and that the dotted-section common.io config actually TOOK (env-var-export ≠ Airflow-read):

python scripts/xcom_flip_preflight.py --conn-id <s3_conn> \
  --expect-backend airflow.providers.common.io.xcom.backend.XComObjectStorageBackend \
  --expect-path 's3://<s3_conn>@<bucket>/xcom' --expect-threshold <bytes> --functional-s3

All checks must be OK (exit 0). A FAIL on config_threshold/config_path = the env var did not resolve → do not trust the flip.

Post-flip verification (in-vivo system test)

After the cut-over deploy succeeds, two layers must be checked: (A) the scheduler resolves the config, then (B) a real DAG run exercises XCom through a step-pod (the production path — the scheduler test alone does NOT cover the KPO step-pod, which gets the config via extraEnvFrom).

A. Scheduler resolution (hardened gate — resolution, not presence)

POD=$(kubectl -n cvntrade get pods -l component=scheduler --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "from airflow.configuration import conf; \
  print(conf.get('common.io','xcom_objectstorage_path')); \
  print(conf.getint('common.io','xcom_objectstorage_threshold')); \
  from airflow.models.xcom import XCom; print(XCom.__name__)"
# expect: the s3 path, the threshold int, and 'XComObjectStorageBackend'.
# If conf.get raises AirflowConfigException or XCom won't import -> KILL-SWITCH (do NOT call it a 'CLI quirk').

B. Real DAG run (the step-pod path)

  1. Pick a DAG that pushes XCom. diagnostic__s42 is the reference: its discriminate_cell.return_value dicts span the threshold (p99.9 ≈ 1.5 MB) → tests both inline (near-inert) and S3 offload. Any short XCom-pushing DAG works for a lighter smoke.
  2. Trigger via the Airflow UI (manual; no direct python, ADR-18): DAGs → diagnostic__s42 → Trigger DAG w/ config → set params (crypto_group=defi_top5, …) → Trigger. Note the run_id.
  3. Watch: Airflow grid — no failed task (especially XCom-consuming ones). In parallel, query Loki via the project tooling (scripts/loki_query.py / the loki-query skill — it manages the port-forward itself and uses the non-lossy stdout stream; do NOT hand-roll a port-forward + curl):
    python scripts/loki_query.py --since 15m \
      --query '{namespace="cvntrade"} |~ "(?i)xcom|AirflowConfigException|XComObjectStorage"'
    # expect: nothing (no error/exception/traceback line).
    
  4. Read the trajectory back through the new backend (proves read path):
    python scripts/airflow_xcom_pull.py --dag-id diagnostic__s42 --run-id '<run_id>' --limit 5
    # expect: the return_value dicts read correctly (inline OR resolved from S3).
    
  5. Confirm offload (if any cell > 1.6 MB):
    kubectl -n cvntrade exec "$POD" -c scheduler -- python -c "
    from airflow.io.path import ObjectStoragePath
    base=ObjectStoragePath('s3://aws_default@cvntrade-artifacts/xcom')
    objs=[str(p) for p in base.rglob('*') if 'diagnostic__s42' in str(p)]
    print('offloaded objects for this DAG:', len(objs)); [print(' ',o) for o in objs[:5]]"
    # 0 = everything stayed inline (near-inert confirmed); >0 = prod offload confirmed.
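To estimate ahead of step 5 whether a given return_value is likely to offload, a rough local check can compare its encoded size with the threshold. This is a sketch under stated assumptions: size is taken as the length of the UTF-8 JSON encoding, "offload" means strictly greater than the threshold (following this runbook's wording), and compression is not modelled. The provider's exact serialization and boundary behaviour are not verified here; encoded_size and would_offload are hypothetical helper names.

```python
import json

THRESHOLD_BYTES = 1_600_000  # threshold value stated in this runbook


def encoded_size(value) -> int:
    """Size in bytes of the value's JSON encoding (assumed proxy for the stored size)."""
    return len(json.dumps(value).encode("utf-8"))


def would_offload(value, threshold: int = THRESHOLD_BYTES) -> bool:
    """True if the value is expected to go to the object store rather than stay inline."""
    if threshold < 0:
        # Assumption: a negative threshold disables offloading.
        return False
    return encoded_size(value) > threshold


small = {"a": 1}                 # typical tiny XCom
large = "x" * 2_000_000          # synthetic value well above the threshold

print(encoded_size(small), would_offload(small))
print(encoded_size(large), would_offload(large))
```

A result of False for every cell is consistent with the "0 = everything stayed inline" outcome above; it does not replace the in-pod listing.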
    

Pass / fail

| Observation | Verdict |
| --- | --- |
| Run success, 0 XCom-failed task, trajectory readable, 0 Loki error | ✅ flip validated in-vivo (step-pod path) |
| Any AirflowConfigException / XCom-failed task / unreadable XCom | 🔴 KILL-SWITCH (revert + re-trigger, below) |

Detect — is a back-out needed?

| Signal | Where |
| --- | --- |
| XCom write/read errors, tasks failing on xcom_pull | Grafana → Airflow task-failure panel; Loki {namespace="cvntrade"} \|~ "xcom" |
| Downstream task receives an s3://… string instead of data | task logs (the offloaded-then-reverted symptom) |
| Object-store path effective vs expected | python scripts/xcom_flip_preflight.py --expect-path … --expect-threshold … |
| Round-trip regression on control-flow | python scripts/xcom_roundtrip_probe.py (control-flow must stay GREEN) |

Kill-switch — revert + re-trigger

  1. Revert the backend in Helm (infra/helm/airflow/values-prod.yaml) → AIRFLOW__CORE__XCOM_BACKEND=airflow.models.xcom.BaseXCom (or remove the 4 common.io env vars), redeploy:
    # via the normal deploy path (Helm = SSoT, #378) — push values-prod.yaml, CI runs helm upgrade
    # OR emergency: helm rollback <release> <previous-revision> -n cvntrade
    
  2. Confirm the revert took (in-pod, not the file):
    kubectl -n cvntrade exec <scheduler-pod> -c scheduler -- airflow config get-value core xcom_backend
    # expect: airflow.models.xcom.BaseXCom
    
  3. Re-trigger in-flight runs that wrote offloaded XComs. XComs are per-run, so a re-run regenerates them under BaseXCom (self-heals); there is no mid-run continuity to preserve:
    • Identify runs in-flight during the flip window (Airflow UI / airflow dags list-runs).
    • Clear + re-run the affected DAG-runs (airflow tasks clear / UI "Clear") so upstream tasks re-emit XComs under BaseXCom.
    • Sub-threshold XComs need no action: they were stored as plain JSON, readable by BaseXCom directly.
  4. Verify: control-flow round-trip green (xcom_roundtrip_probe.py), no xcom-keyed task failures in Grafana/Loki.
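Identifying the runs to clear comes down to an interval-overlap check against the flip window. The sketch below assumes the window runs from the moment the object-storage backend went live to the moment the revert was confirmed, and it deliberately over-selects (a run that finished inside the window is still listed). Run, overlaps and runs_to_clear are hypothetical names, and the run data is invented for illustration; in practice the start/end times would come from the Airflow UI or airflow dags list-runs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Run:
    run_id: str
    start: datetime
    end: Optional[datetime]  # None = still running


def overlaps(run: Run, window_start: datetime, window_end: datetime) -> bool:
    """True if the run was active at any point inside [window_start, window_end]."""
    run_end = run.end or datetime.max  # a still-running run has no upper bound yet
    return run.start < window_end and run_end > window_start


def runs_to_clear(runs: List[Run], window_start: datetime, window_end: datetime) -> List[str]:
    """Run ids that may hold offloaded XComs and are candidates for clear + re-run."""
    return [r.run_id for r in runs if overlaps(r, window_start, window_end)]


# Invented example: backend live 10:00, revert confirmed 10:30.
flip_live = datetime(2026, 6, 6, 10, 0)
revert_confirmed = datetime(2026, 6, 6, 10, 30)
example_runs = [
    Run("run_a", datetime(2026, 6, 6, 9, 0), datetime(2026, 6, 6, 9, 30)),    # ended before flip
    Run("run_b", datetime(2026, 6, 6, 9, 50), None),                          # still running
    Run("run_c", datetime(2026, 6, 6, 10, 40), None),                         # started after revert
    Run("run_d", datetime(2026, 6, 6, 10, 5), datetime(2026, 6, 6, 10, 10)),  # finished inside window
]
candidates = runs_to_clear(example_runs, flip_live, revert_confirmed)
print(candidates)
```

Over-selecting is the cheaper error here: clearing a run that only wrote sub-threshold XComs costs a re-run, while missing one leaves a downstream task reading a path string.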

Rehearsal (mandatory gate before the prod cut-over)

The revert+re-trigger sequence MUST be rehearsed before the production cut-over (committee recommendation, plan_review 8d7efca0). Rehearsal = on a non-prod context (local docker-compose or a throwaway run): flip → write an above-threshold XCom → revert backend → observe the downstream-gets-path-string break → apply re-trigger → observe self-heal. Document the observed downtime. The cut-over is gated on a successful rehearsal.
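The rehearsal sequence can be dry-run on paper with a toy in-memory model before touching a real non-prod context. This is only a sketch of the failure mode described in this runbook, not the behaviour of the real backends: the dicts stand in for S3 and the metadata DB, the threshold is deliberately tiny, and push, pull, rehearse and the s3://hypothetical-bucket path are all invented for illustration.

```python
import json

THRESHOLD = 100      # tiny illustrative threshold, NOT the prod value
object_store = {}    # stands in for S3
metadata_db = {}     # stands in for the XCom table


def push(key, value, backend):
    """Store inline, or offload and keep only the path string in the metadata DB."""
    raw = json.dumps(value)
    if backend == "object_storage" and len(raw.encode("utf-8")) > THRESHOLD:
        path = f"s3://hypothetical-bucket/xcom/{key}"
        object_store[path] = raw
        metadata_db[key] = json.dumps(path)
    else:
        metadata_db[key] = raw


def pull(key, backend):
    """The object-storage reader follows the path; the base reader returns what is stored."""
    stored = json.loads(metadata_db[key])
    if backend == "object_storage" and isinstance(stored, str) and stored in object_store:
        return json.loads(object_store[stored])
    return stored


def rehearse():
    big = {"cells": list(range(200))}            # above the toy threshold
    push("run1.big", big, "object_storage")      # flip, then write an above-threshold XCom
    before_revert = pull("run1.big", "object_storage")
    after_revert = pull("run1.big", "base")      # revert: downstream gets the path string
    push("run1.big", big, "base")                # re-trigger: upstream re-emits inline
    after_retrigger = pull("run1.big", "base")   # self-heal
    return {
        "read_ok_before_revert": before_revert == big,
        "broken_after_revert": after_revert,
        "healed_after_retrigger": after_retrigger == big,
    }


observed = rehearse()
print(observed)
```

The real rehearsal should show the same three observations in order: a clean read under the new backend, a path string after the revert, and a clean read again after the re-trigger.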

Notes

  • Orphaned objects: a revert leaves the offloaded objects in S3 (now unreferenced). The S3 lifecycle rule on xcom/ (ADR-0100) expires them; an orphan-sweeper, if needed, is tracked in CVN-N014-ED-S03.
  • Fail-loud: if the object store is unreachable at runtime, an above-threshold XCom write fails and the task fails (ADR-25, no silent fallback) — this is intended; investigate S3/connection, do not add a fallback.