Cut-over decision dossier — CVN-N014-ED-S01 (object-storage XCom flip)

Status: CUT OVER + VALIDATED 2026-06-06. The flip is LIVE in prod (XComObjectStorageBackend, threshold 1.6 MB) — verified in-pod (scheduler conf.get resolves) and in-vivo (real diagnostic__s42 run SUCCESS, 0 xcom errors, step-pod trajectory readable, near-inert). 1st attempt was broken by the dotted-section encoding (see §RCA), fixed-forward. S3 lifecycle is deferred to CVN-N014-ED-S03 (wp#244) — not load-bearing while offload traffic = 0. Refs: plan · ADR-0100 · runbook.

1. Threshold calibration (measured, not assumed)

In-pod SELECT over the live xcom table (7 175 rows), byte sizes:

| p50 | p95 | p99 | p99.9 | max | rows >48KB | rows >100KB | rows >1MB |
|---|---|---|---|---|---|---|---|
| 15 B | 73 KB | 307 KB | 1.52 MB | 1.99 MB | 545 | 277 | 11 |

Proposed threshold = 1600000 (1.6 MB, just above p99.9). Rationale: keeps 99.9 % of current XCom inline in the DB (byte-identical to BaseXCom → near-inert); only the ~7 largest historical rows (>1.6 MB) would offload — and all large XComs are JSON decision-dicts from diagnostic DAGs (s40/s41/s42), XComEncoder-safe (the audit found 0 numpy in XCom). Lowering the threshold (e.g. 100 KB → 277 rows) would offload hundreds of rows immediately = not near-inert. Optional ramp: start 2000000 (>max → zero offload) then lower to 1600000.
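The arithmetic behind "just above p99.9" can be sketched as follows. This is a minimal illustration, not the query that was run in-pod: the sizes are synthetic (only the row count, 7 175, comes from the measurement above), the nearest-rank percentile definition is an assumption, and so is the choice to count a row as offloaded only when it is strictly larger than the threshold.

```python
def nearest_rank(sorted_sizes, per_mille):
    """Nearest-rank percentile; per_mille=999 means p99.9 (integers avoid float rounding)."""
    n = len(sorted_sizes)
    rank = max(1, -(-n * per_mille // 1000))  # ceil(n * per_mille / 1000)
    return sorted_sizes[rank - 1]


def calibrate(sizes, per_mille=999):
    """Return (threshold candidate, number of rows strictly above it)."""
    ordered = sorted(sizes)
    threshold = nearest_rank(ordered, per_mille)
    above = sum(1 for s in ordered if s > threshold)
    return threshold, above


# Synthetic stand-in for the xcom table: 7 175 distinct sizes (1..7175 bytes).
synthetic_sizes = range(1, 7176)
print(calibrate(synthetic_sizes))  # (7168, 7)
```

With 7 175 rows, a p99.9 cut leaves 7 rows above it, which is where the "~7 largest historical rows" figure comes from.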

2. Pre-flight (read-only) — GREEN

| Check | Result |
|---|---|
| provider apache-airflow-providers-common-io | 1.4.2 present in-pod |
| S3 connection aws_default | ✅ exists (Scaleway s3.fr-par.scw.cloud, region fr-par, path-style) |
| functional write→read→delete on s3://aws_default@cvntrade-artifacts/xcom/_preflight/ | PASS (self-cleaned) |

Safety gate (operator-mandated): GREEN → cut-over diff is promotable. (If this had failed, no promotable diff — hard stop.) Bucket = cvntrade-artifacts (existing MLflow substrate).
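The functional write→read→delete check has roughly this shape. It is a sketch under assumptions, not the script that was run: `preflight_roundtrip` and the sentinel naming are hypothetical, and the demo uses a local temporary directory. In-pod, `base` would be an object-storage path object exposing the same `pathlib`-style methods (for example Airflow's `ObjectStoragePath`, which is an assumption here and is not exercised below).

```python
import tempfile
import uuid
from pathlib import Path


def preflight_roundtrip(base, payload=b'{"preflight": true}'):
    """Write a sentinel under `base`, read it back, then delete it.

    Returns True only if the bytes read back equal the bytes written.
    """
    sentinel = base / f"_preflight_{uuid.uuid4().hex}"
    sentinel.write_bytes(payload)
    try:
        return sentinel.read_bytes() == payload
    finally:
        sentinel.unlink()  # self-clean even if the read fails


# Local demo: a temp directory stands in for the xcom/_preflight/ prefix.
with tempfile.TemporaryDirectory() as tmp:
    ok = preflight_roundtrip(Path(tmp))
    leftovers = list(Path(tmp).iterdir())
```

A PASS here proves storage access only. It says nothing about whether Airflow resolves the backend configuration, which is the gap the RCA describes.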

3. The Helm diff (prepared, NOT applied)

infra/helm/airflow/values-prod.yaml → extraConfigMaps.cvntrade-env-config.data, 4 env vars:

AIRFLOW__CORE__XCOM_BACKEND: "airflow.providers.common.io.xcom.backend.XComObjectStorageBackend"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH: "s3://aws_default@cvntrade-artifacts/xcom"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD: "1600000"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_COMPRESSION: ""

Dotted-section footgun (it BIT the 1st cut-over — see §RCA): the common.io section MUST be encoded COMMON_IO (underscore), because Airflow's _env_var_name does section.replace('.', '_'). A literal COMMON.IO is present in the pod env but invisible to conf.get → backend init raises AirflowConfigException → all XCom writes crash. Post-deploy MUST verify the config resolves via airflow config get-value common.io xcom_objectstorage_path (NOT just env | grep) — see §6.
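The mapping can be reproduced in a few lines to show why the dotted name is never looked up. This is a standalone re-implementation of the rule described above, not Airflow's code: `env_var_name` and `resolves` are illustrative names, and the real `conf.get` also consults airflow.cfg and built-in defaults, which this sketch ignores.

```python
def env_var_name(section, key):
    """Env var name for (section, key): dots in the section become underscores."""
    return f"AIRFLOW__{section.replace('.', '_').upper()}__{key.upper()}"


def resolves(env, section, key):
    """True only if the variable that would be looked up is actually set."""
    return env_var_name(section, key) in env


path = "s3://aws_default@cvntrade-artifacts/xcom"
broken_env = {"AIRFLOW__COMMON.IO__XCOM_OBJECTSTORAGE_PATH": path}  # 1st cut-over
fixed_env = {"AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH": path}   # fix-forward

print(env_var_name("common.io", "xcom_objectstorage_path"))
# AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH
print(resolves(broken_env, "common.io", "xcom_objectstorage_path"))  # False
print(resolves(fixed_env, "common.io", "xcom_objectstorage_path"))   # True
```

Both environments pass an `env | grep XCOM_OBJECTSTORAGE` check, which is why presence alone is not a usable gate.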

4. S3 lifecycle (prepared)

infra/scaleway/s3-lifecycle-xcom.json — 7-day retention on the xcom/ prefix (mirrors l2_events/). ⚠️ Scaleway has one lifecycle config per bucket → merge with the existing l2-events-7d-retention rule before applying (see the README). Apply just before/after cut-over.
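Because the bucket holds a single lifecycle configuration, applying the xcom rule alone would replace the l2_events rule. A merge step could look like the sketch below. The rule shape follows the usual S3 lifecycle JSON layout (`Rules`, `ID`, `Status`, `Filter.Prefix`, `Expiration.Days`). The actual contents of s3-lifecycle-xcom.json are not shown in this dossier, so the shape is an assumption, and the rule ID `xcom-7d-retention` and both helper functions are hypothetical.

```python
import json


def expire_after(rule_id, prefix, days):
    """Build one expiration rule in the usual S3 lifecycle shape (assumed here)."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": days},
    }


def merge_lifecycle(existing, new_rules):
    """Return one config with the existing rules plus `new_rules`, replacing by ID."""
    merged = {rule["ID"]: rule for rule in existing.get("Rules", [])}
    for rule in new_rules:
        merged[rule["ID"]] = rule
    return {"Rules": list(merged.values())}


existing = {"Rules": [expire_after("l2-events-7d-retention", "l2_events/", 7)]}
xcom_rule = expire_after("xcom-7d-retention", "xcom/", 7)  # hypothetical ID
merged = merge_lifecycle(existing, [xcom_rule])
print(json.dumps(merged, indent=2))
```

Merging by ID makes the step idempotent: running it twice yields the same two rules, so the merged file can be regenerated safely before each apply.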

5. Kill-switch rehearsal — procedure (run on local docker-compose before cut-over)

The rehearsal needs an offloaded XCom (a >1.6 MB value), which only exists after a flip → run it on the local docker-compose stack, not prod:

1. Flip local: set the 4 env vars in airflow_docker/docker-compose.yaml + airflow.cfg.
2. Trigger a task that returns a >1.6 MB value → confirm it offloads (DB holds the path string;
   the object lands under s3://…/xcom/<dag>/<run>/<task>/<uuid>).
3. REVERT: set AIRFLOW__CORE__XCOM_BACKEND back to airflow.models.xcom.BaseXCom, restart.
4. OBSERVE the break: the downstream task reading that XCom now gets the path STRING, not the data.
5. RE-TRIGGER: clear + re-run the affected DAG-run → upstream re-emits the XCom under BaseXCom
   (per-run, self-heals). Confirm downstream gets the data again.
6. Document the observed downtime.

This proves the revert + re-trigger kill-switch (a bare revert is unsafe once offloaded — ADR-0100 Inv 4). Mandatory gate before the prod cut-over.

✅ REHEARSED on K8S (2026-06-06), controlled in-pod, no global flip. Local docker-compose was unavailable, so the mechanic was exercised in the scheduler pod against real S3 + real Airflow code paths without changing AIRFLOW__CORE__XCOM_BACKEND and without touching any DAG or DB row (fully isolated, self-cleaned): (1) a 2.03 MB JSON value offloaded to s3://aws_default@cvntrade-artifacts/xcom/_rehearsal/ → DB would hold only the path string; (2) BaseXCom (revert) read the row → returned the path STRING, not the dict → BREAK confirmed; (3) the S3 object round-tripped intact; (4) re-emit under BaseXCom restored the data → HEAL confirmed; (5) sentinel deleted. Verdict: revert-alone breaks; revert + re-trigger heals — kill-switch validated. A full DAG-run clear+rerun is standard Airflow (the novel path — the path-string break — is what was rehearsed).
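The break and the heal can be modelled without Airflow at all. The toy below is an assumption-laden stand-in, not the provider's code: `ToyStore`, the `toy-bucket` path, and the four functions are hypothetical, and the boundary rule (values below the threshold stay inline) is assumed.

```python
import json

THRESHOLD = 1_600_000  # bytes, the threshold used in this dossier


class ToyStore:
    """Stand-in for the xcom table (db) and the bucket (objects)."""

    def __init__(self):
        self.db = {}
        self.objects = {}


def write_offloading(store, key, value):
    """Offloading backend: large values go to the bucket, the DB keeps the path."""
    raw = json.dumps(value)
    if len(raw.encode()) < THRESHOLD:
        store.db[key] = raw
    else:
        path = f"s3://toy-bucket/xcom/{key}"  # hypothetical path
        store.objects[path] = raw
        store.db[key] = json.dumps(path)


def read_offloading(store, key):
    """Offloading backend: follow the path if the DB row points at an object."""
    data = json.loads(store.db[key])
    if isinstance(data, str) and data in store.objects:
        return json.loads(store.objects[data])
    return data


def write_base(store, key, value):
    """Base backend: always store the value inline."""
    store.db[key] = json.dumps(value)


def read_base(store, key):
    """Base backend: decode the DB row as-is, with no bucket lookup."""
    return json.loads(store.db[key])


store = ToyStore()
big = {"cells": "x" * 2_000_000}          # serialises to just over 2 MB
write_offloading(store, "run1", big)
after_flip = read_offloading(store, "run1")   # the dict
after_revert = read_base(store, "run1")       # BREAK: the path string
write_base(store, "run1", big)                # re-trigger re-emits inline
after_retrigger = read_base(store, "run1")    # HEAL: the dict again
```

The model makes the asymmetry visible: the revert changes only the reader, while the row written before the revert still holds a path, so only a re-emit under the base backend restores the data.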

6. Cut-over sequence (operator go required at the ✋ step)

  1. (pre) §5 rehearsal — DONE ✅ (in-pod on K8S, controlled, 2026-06-06; revert+re-trigger validated).
  2. ✋ operator go → merge feat/CVN-N014-ED-S01-prod-flip to main → CI deploy (Helm upgrade).
  3. Verify config RESOLVES in-pod (the hardened footgun gate — env | grep is NOT enough, the 1st cut-over had the var present in env but unresolved):
    kubectl -n cvntrade exec <scheduler-pod> -c scheduler -- python -c "from airflow.configuration import conf; \
      print(conf.get('common.io','xcom_objectstorage_path')); \
      print(conf.getint('common.io','xcom_objectstorage_threshold')); \
      from airflow.models.xcom import XCom; print(XCom.__name__)"
    
    Expect: the path + threshold values + XComObjectStorageBackend. If conf.get raises AirflowConfigException or XCom won't import → kill-switch (revert) immediately — the dotted-section env var did not resolve. Do NOT rationalise a CLI error as a "quirk".
  4. S3 lifecycle rule — DEFERRED to CVN-N014-ED-S03 (wp#244) ✅ decision: not load-bearing while offload traffic = 0 (near-inert); applied opportunistically with the orphan-sweeper.
  5. Post-flip verification — DONE ✅ (2026-06-06): real prod run diagnostic__s42 (manual__xcom-flip-validation-20260606T111730) SUCCESS under the live backend. loki_query.py = 0 xcom/AirflowConfigException errors; fresh trajectory (discriminate_cell.return_value, UNIUSDC fold-3) read via airflow_xcom_pull.py (step-pod write+read path proven); rglob xcom/diagnostic__s42 = 0 (all cells inline → near-inert). Offload path proven in-pod (3.22 MB → S3 → read-back).
  6. Kill-switch armed (runbook) — git revert a0b49778 + re-trigger.

Decision summary (for the operator)

  • Threshold: 1.6 MB (p99.9) — near-inert, ~7 historical rows offload, all JSON-safe.
  • conn/path: aws_default / s3://aws_default@cvntrade-artifacts/xcom — pre-flight functional PASS.
  • Diff: ready on feat/CVN-N014-ED-S01-prod-flip (not merged).
  • Lifecycle: xcom/ 7-day, ready (merge with l2-events rule).
  • Kill-switch: revert + re-trigger — rehearsed on K8S ✅ (controlled in-pod, 2026-06-06): revert-alone breaks (path string), revert + re-trigger heals.
  • Post-flip: representative re-run + in-pod config verification.

Cut over on the operator's explicit go (2026-06-06), fixed-forward after the RCA, validated in-vivo. S01 closed; lifecycle deferred to CVN-N014-ED-S03 (wp#244).


RCA — 1st cut-over failed (dotted-section env-var encoding) — 2026-06-06

Timeline: PR #1118 merged → "Deploy to Scaleway K8s" SUCCESS → in-pod verification → flip found broken (latent, no DAG run in flight). Fix-forward chosen (operator). PR #1118 fix fix/CVN-N014-ED-S01-xcom-env-encoding.

Symptom: with AIRFLOW__CORE__XCOM_BACKEND set to XComObjectStorageBackend, importing the resolved airflow.models.xcom.XCom raised AirflowConfigException: section/key [common.io/xcom_objectstorage_path] not found in config. Since serialize_value reads _get_threshold() before the size check, every XCom write (sub- and above-threshold) would have crashed.

Root cause: the common.io config section was encoded with a literal dot (AIRFLOW__COMMON.IO__…). Airflow's _env_var_name does section.replace('.', '_'), so it only looks for AIRFLOW__COMMON_IO__… (underscore). The dotted var was present in the pod env but invisible to conf.get → backend init failed.

Fix (empirically verified in-pod before re-deploy): encode the section as COMMON_IO (underscore). With it: conf.get('common.io','xcom_objectstorage_path') resolves, XCom resolves to XComObjectStorageBackend, sub-threshold round-trip passes.

Why the gates missed it, and the hardening applied:

  1. Pre-flight checked presence, not resolution. It proved the env var existed + S3 worked, never that conf.get('common.io',…) resolved. → Pre-flight/post-deploy gate now calls conf.get (§6.3), not env | grep.
  2. The signal was rationalised. airflow config get-value common.io xcom_objectstorage_path garbled/failed post-deploy; that was the footgun, read as a "CLI quirk". → §6.3 now states explicitly: a config get-value/conf.get failure = STOP/revert, never a quirk.
  3. The rehearsal bypassed the mapping. It imported the backend directly + offloaded manually (no _get_threshold() read), and pre-flip the config was unset → it never exercised env-var→config resolution. → A faithful rehearsal must read the config through conf, with the env set exactly as deployed.
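A static check on the env-var names themselves could catch this class of error before deploy. The sketch below is a hypothetical addition, not part of the hardening listed above, and it complements rather than replaces the in-pod conf.get gate. `unresolvable_airflow_vars` is an illustrative name, and the rule it applies (a dot in the section part can never match) follows from the root cause described in this RCA.

```python
def unresolvable_airflow_vars(names):
    """Return AIRFLOW__SECTION__KEY names whose section part still contains a dot."""
    prefix = "AIRFLOW__"
    flagged = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not an Airflow config override
        section, sep, _key = name[len(prefix):].partition("__")
        if sep and "." in section:
            flagged.append(name)
    return flagged


deployed_names = [
    "AIRFLOW__CORE__XCOM_BACKEND",
    "AIRFLOW__COMMON.IO__XCOM_OBJECTSTORAGE_PATH",       # as in the 1st cut-over
    "AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD",
    "SOME_OTHER.VAR",
]
print(unresolvable_airflow_vars(deployed_names))
# ['AIRFLOW__COMMON.IO__XCOM_OBJECTSTORAGE_PATH']
```

Such a check could run over the keys of the Helm values block in CI. It only flags names that cannot resolve, so a clean result still needs the conf.get verification in-pod.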

Blast radius: zero. DAGs are schedule=None (ADR-18); no run executed between the broken deploy and the fix. No data lost, no XCom corrupted (the backend never ran).