ADR-0100 — Inter-task data transport: object-storage XCom backend, threshold-driven, no bespoke per-task storage

Status: proposed (CVN-N014-ED-S01 ; plan_review 8d7efca0 PASSED, consensus strong, 0 blocker)

Context-of-record: the CVN-N001-EI-S05 / s43 incident. Predictions were passed between pods via np.savez to a pod-local /tmp that is not shared, which produced an empty cohort and an INCONCLUSIVE_TOOLING outcome. Root cause: the project has no standard for inter-task transport. Each DAG improvises its intermediate storage, with a configurable target that can be wrong and that unit tests (which use tmp_path) never exercise. See Epic CVN-N014-ED.

Context

Airflow XCom is persisted in the metadata DB: designed for small messages (return values, flags), bounded in size (~48 KB on Postgres → silent truncation beyond) and in serializable types. Large arrays/DataFrames in XCom are a documented anti-pattern. The sound practice is pass-by-reference: small messages → XCom; large data → external store, push the reference (URI), read it downstream.
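A minimal sketch of the pass-by-reference pattern described above, under stated assumptions: InMemoryObjectStore, produce and consume are hypothetical names introduced for illustration, and the in-memory store only stands in for S3 so the example runs without credentials. The point is the shape of the exchange: the large payload goes to the store, and only a small reference travels through XCom.

```python
import json
import uuid


class InMemoryObjectStore:
    """Hypothetical stand-in for S3; not a real client."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._objects = {}  # key -> bytes

    def put(self, key, data):
        self._objects[key] = data
        return f"s3://{self.bucket}/{key}"

    def get(self, uri):
        prefix = f"s3://{self.bucket}/"
        if not uri.startswith(prefix):
            raise ValueError(f"URI outside this store: {uri}")
        return self._objects[uri[len(prefix):]]


def produce(store, predictions):
    """Upstream task: write the payload, return only a small reference."""
    key = f"predictions/{uuid.uuid4().hex}.json"
    uri = store.put(key, json.dumps(predictions).encode("utf-8"))
    # This dict is what would travel through XCom: a pointer plus a row count.
    return {"uri": uri, "n_rows": len(predictions)}


def consume(store, ref):
    """Downstream task: resolve the reference and fail loudly on a mismatch."""
    data = json.loads(store.get(ref["uri"]).decode("utf-8"))
    if len(data) != ref["n_rows"]:
        raise RuntimeError("row count mismatch between reference and payload")
    return data
```

Carrying n_rows alongside the URI is one design choice among several; it lets the consumer detect an empty or truncated payload instead of silently continuing.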

The XComObjectStorageBackend (apache-airflow-providers-common-io) makes this a centralized policy rather than per-task code: a task does return value / xcom_pull, and the backend offloads values above a size threshold to object storage (S3 — already the project substrate), keeping only the URI in the DB. The "wrong-target" bug class becomes structurally impossible for the payloads the backend can serialize.

Verified mechanics (in-pod, common-io 1.4.2 — CVN-N014-ED-S01 v4) — these bound the standard and are load-bearing:

  1. serialize_value runs json.dumps(value, cls=XComEncoder) unconditionally, before the threshold check. XComEncoder does not serialize numpy (TypeError on a bare/unicode ndarray and on a nested dict-of-arrays). So the backend transports JSON-serializable values, not arbitrary large objects: numpy/pandas fail at any size.
  2. Below threshold the backend returns the same XComEncoder JSON bytes BaseXCom would store → behavior is byte-identical (a flip on a JSON-only, sub-threshold XCom population is near-inert).
  3. Above threshold the DB stores the object-store path string; the value lives in S3. A revert to BaseXCom then hands downstream the path string, not the data → rollback is revert + re-trigger, not a bare revert.
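The three mechanics above can be modelled in a few lines. This is a simplified sketch of the order of operations only, not the provider's code: serialize_like_backend, FakeArray and the offload callback are hypothetical names, and plain json.dumps stands in for json.dumps(value, cls=XComEncoder).

```python
import json


class FakeArray:
    """Stand-in for an ndarray: an object the JSON encoder does not know."""


def serialize_like_backend(value, threshold_bytes, offload):
    """Model the ordering: encode first, compare size second, offload last."""
    # Step 1: JSON-encode unconditionally. A non-serializable value raises
    # TypeError here, whatever its size and whatever the threshold.
    payload = json.dumps(value).encode("utf-8")
    # Step 2: only now compare the encoded size with the threshold.
    if len(payload) < threshold_bytes:
        return payload  # stays inline in the metadata DB, unchanged
    # Step 3: write the bytes to the store; the DB keeps only the path string.
    path = offload(payload)
    return json.dumps(path).encode("utf-8")
```

Read this way, raising the threshold can never rescue a numpy payload: the TypeError in step 1 fires before the comparison in step 2 is reached.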

enable_xcom_pickling=False + donot_pickle=True are effective in-pod (env does not override), so the current XCom universe is JSON-only and enumerable.

Decision

Adopt the object-storage XCom backend as the project standard for inter-task data transport, threshold-driven, with no bespoke per-task storage — under the following disciplines:

  1. Global flip, threshold-gated incrementality. AIRFLOW__CORE__XCOM_BACKEND is instance-global (all DAGs at once). The migration is not per-DAG opt-in; the threshold provides incrementality — set above the control-flow p99.9 so existing small XComs stay inline in the DB, and only large payloads offload. Rejected: a custom per-dag_id allowlist backend (tests the wiring, not the shipped backend).
  2. Serialization is JSON-by-XComEncoder, not "arbitrary object". A value transported by XCom MUST be XComEncoder-serializable. numpy/pandas require a registered serializer (airflow.serialization serde) or explicit pass-by-reference (XCom = pointer, data in store). Pass-by-reference is the preferred default for numpy/large ML objects (CVN-N014-ED-S02).
  3. No bespoke per-task intermediate storage. A task does return value / xcom_pull; it MUST NOT hand-roll its own S3/FS write for inter-task transport. (Reviewer gate, CVN-N014-ED-S04.)
  4. Object storage, not a shared RWX filesystem. S3 is already the project substrate; a shared mutable RWX FS (concurrency, SPOF, manual GC) is rejected as against the stateless-pods architecture, unless a real POSIX need is opened separately.
  5. Flip is gated on proof, not hope. A hardened pre-flight (scripts/xcom_flip_preflight.py) MUST prove in the real worker/scheduler context: provider present, S3 connection usable, and that the dotted-section common.io config actually took (airflow config get-value, not the file — guards the silent env-var drop). A serialization audit (scripts/xcom_serialization_audit.py static + scripts/xcom_roundtrip_probe.py dynamic in-pod) MUST show control-flow round-trips green before flip.
  6. Rollback = revert + re-trigger, rehearsed. The kill-switch reverts XCOM_BACKEND to BaseXCom and re-triggers in-flight runs that wrote offloaded XComs (per-run, self-heals on re-run). The sequence MUST be rehearsed before the prod cut-over. Runbook: runbook_xcom_backend_flip.md.
  7. Bounded object-store growth. An S3 lifecycle rule expires the xcom/ prefix; the backend purges objects on XCom clear. Orphan-sweeping (if needed) is CVN-N014-ED-S03.
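As an illustration of the effective-config check in item 5, the sketch below asks the Airflow CLI what it actually resolved rather than reading the file. It is not the content of scripts/xcom_flip_preflight.py, which this record does not reproduce. Assumptions: the function names are hypothetical, the option names xcom_objectstorage_path and xcom_objectstorage_threshold and the backend class path are taken from the common-io provider documentation, and a threshold of -1 is treated as "still on the default, everything stays in the DB".

```python
import subprocess

EXPECTED_BACKEND = (
    "airflow.providers.common.io.xcom.backend.XComObjectStorageBackend"
)


def airflow_config_value(section, option):
    """Return the value the running Airflow resolved for section/option."""
    result = subprocess.run(
        ["airflow", "config", "get-value", section, option],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"cannot read {section}.{option}: {result.stderr}")
    return result.stdout.strip()


def check_effective_config(get_value=airflow_config_value):
    """Return a list of problems; an empty list means the config took."""
    problems = []
    backend = get_value("core", "xcom_backend")
    if backend != EXPECTED_BACKEND:
        problems.append(f"xcom_backend is {backend!r}")
    if not get_value("common.io", "xcom_objectstorage_path"):
        problems.append("xcom_objectstorage_path is empty")
    raw = get_value("common.io", "xcom_objectstorage_threshold")
    try:
        threshold = int(raw)
    except ValueError:
        problems.append(f"threshold is not an integer: {raw!r}")
    else:
        # Assumption: -1 is the provider default (never offload), so seeing
        # it after a flip suggests the setting was silently dropped.
        if threshold < 0:
            problems.append(f"threshold is {threshold}; offload is disabled")
    return problems
```

The get_value parameter is injectable so the logic can be exercised without Airflow installed; in the pod it would be left at its default and run in both the worker and scheduler contexts.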

Invariants

  • Inv 1 — JSON-serializable or pass-by-reference. Any XCom value is XComEncoder-serializable; numpy/pandas go by registered serializer or explicit reference. Never assume the backend "natively" carries arrays.
  • Inv 2 — no bespoke per-task transport storage. Inter-task data moves via return/xcom_pull (+ the backend) or an explicit, reviewed pass-by-reference — never a hand-rolled per-task store target.
  • Inv 3 — flip is proof-gated. No flip without a green pre-flight (provider + S3 conn + effective config proven in-pod) and a green control-flow round-trip.
  • Inv 4 — rollback is revert + re-trigger, rehearsed. A bare backend revert is a defect once offload traffic exists; the runbook sequence is rehearsed before cut-over.
  • Inv 5 — threshold protects location, not serialization. Calibrate the threshold (p99.9 control-flow) to keep control-flow inline; never treat it as a serialization gate (the json.dumps precedes it).
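One way to calibrate the threshold of Inv 5 is to measure the encoded size of a sample of existing control-flow XComs and take the 99.9th percentile. The sketch below is an assumption-laden illustration: percentile_nearest_rank and suggest_threshold are hypothetical helpers, plain json.dumps approximates the XComEncoder output, and the headroom factor is an arbitrary safety margin not prescribed by this record.

```python
import json
import math


def percentile_nearest_rank(sizes, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below."""
    if not sizes:
        raise ValueError("need at least one sample")
    ordered = sorted(sizes)
    # Round before ceil so float noise (e.g. 999.0000000001) cannot bump the rank.
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def suggest_threshold(values, pct=99.9, headroom=2.0):
    """Suggest a byte threshold above the pct-th percentile of encoded sizes."""
    sizes = [len(json.dumps(v).encode("utf-8")) for v in values]
    return math.ceil(percentile_nearest_rank(sizes, pct) * headroom)
```

With 999 small values and one large outlier, the suggested threshold sits just above the small values, so the outlier offloads while control-flow XComs stay inline. The function measures sizes only; a value that fails json.dumps raises before any threshold is produced, consistent with the threshold not being a serialization gate.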

Relation to other ADRs

  • ADR-25 (no silent fallback) — an unreachable object store fails the task loudly; no fallback path.
  • ADR-26 — the flip/rollback runbook starts at Grafana.
  • Scoped by the Epic CVN-N014-ED; S02 (numpy routing X′/X″/Y, default Y) and S04 (author guideline + review gate) build on this standard.