ADR-92: DAG versioning and operator-visible build provenance
Status: accepted (Phase-2 complete 2026-05-21) — committee plan_review cee60cdd PASSED/OK strong (OP Meeting #163, ADR-82). Phase 1 (2026-05-20) shipped the helper + sync-time stamp + G6 guardrail (WARN-only) + the S22 diagnostic DAGs ; Phase 2 (2026-05-21, CVN-N014-EB-S01) completes the retrofit of all remaining operator-triggerable DAGs under dags/ and flips G6 to hard-block (fail-the-build). Carve-out : pure-KubernetesPodOperator DAGs (no Python @task, hence no logger) are exempt from the first-task dag_loaded_event surface — their build provenance is carried by the doc_md banner + build tag (both pre-trigger) and the pod's own runtime-image SHA ; G6 enforces the first-task event only on @task-based DAGs.
Driver: CVN-N014-EB-S01 / GH issue #1005
Trigger: CVN-N001-EE-S22A5 (OP wp#185) double-crash on a stale DAG, 2026-05-19.
Related: ADR-25 (no silent fallback) · ADR-30 (logs as stable interface) · ADR-31 (logging discipline) · ADR-68/82 (committee + meeting traceability).
Context
The CVNTrade platform deploys DAGs via a two-hop propagation chain :
- The DAG source files live in dococeven/champollion (dags/*.py).
- On every push to main, an auto-sync workflow copies them into dococeven/cvntrade-airflow-dags (one commit per push, message chore: auto-sync DAGs from champollion@<sha>).
- The Airflow git-sync sidecar polls that repo and the DAG parser refreshes.
- Independently, the Deploy to Scaleway K8s workflow builds + rolls out a new image containing the worker-side modules (e.g. src/commun/finetune/diagnostic/) that the DAG imports.
Both legs take time (~7 min for the DAG sync, ~5-6 min for the image build/rollout) and complete asynchronously. Before the policy in this ADR, the Airflow UI exposed no surface allowing the operator to verify which build of the DAG and underlying code was actually loaded by the running pod.
On 2026-05-19, this gap produced a double-crash on the S22A5 diagnostic : a known bug was fixed and merged (PR #1002 at 22:29:41Z), the operator re-triggered the DAG 3 minutes later (22:32:49Z), and the pod ran the pre-fix DAG and image because neither propagation leg had completed. The pod crashed identically to the first run, with no signal in the UI distinguishing "stale build" from "fix didn't work".
The operator mandated, the same day :
Every DAG must be versioned, and the version must be visible to the operator for verification before triggering a run.
This ADR formalises that invariant.
Decision
Every DAG file under dags/ MUST surface its champollion build SHA on three operator-visible surfaces, computed from a single source of truth stamped at sync time.
Single source of truth
The auto-sync workflow on cvntrade-airflow-dags MUST write dags/.dag_build_info.json at every sync, atomically (tmp + rename), with the contract :
{
"champollion_sha": "<40-char full SHA of the champollion commit producing this sync>",
"champollion_sha_short": "<7-char short form>",
"synced_at_utc": "<ISO-8601 timestamp with Z suffix>"
}
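The atomic write (tmp + rename) can be sketched as follows. This is a minimal illustration that assumes the sync step can run Python; the function name write_build_info and its parameters are hypothetical, and only the file name and the three fields come from the contract above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_build_info(dags_dir: Path, champollion_sha: str) -> Path:
    """Write .dag_build_info.json atomically (hypothetical helper name)."""
    payload = {
        "champollion_sha": champollion_sha,
        "champollion_sha_short": champollion_sha[:7],
        "synced_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    target = dags_dir / ".dag_build_info.json"
    # The temp file lives in the same directory so the final rename stays on
    # one filesystem; os.replace is only atomic in that case.
    fd, tmp_name = tempfile.mkstemp(
        dir=dags_dir, prefix=".dag_build_info.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Do not leave a partial temp file behind if anything fails.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

A reader that opens the file during a sync therefore sees either the previous stamp or the new one, never a half-written file.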
If the file is absent (e.g. local development before any sync), the read helper MUST return the sentinel {"champollion_sha": "unknown", "champollion_sha_short": "unknown", "synced_at_utc": "unknown", "status": "unknown"} and MUST NOT raise. This honours the "unknown is acceptable, silent crash is not" sub-rule.
Three operator-visible surfaces (all mandatory)
| Surface | Where the operator sees it | When |
|---|---|---|
| doc_md banner | Airflow UI DAG page (visible on click, before any trigger) | Before trigger |
| DAG tag | Airflow UI DAG list (build=<short-sha> tag visible without clicking) | Before trigger |
| First-task log event | Loki / Airflow task log (event=dag_loaded dag_id=... champollion_sha=... synced_at_utc=...) | During run |
The doc_md banner SHALL be emitted by a shared helper dags/_common.py:dag_version_banner() that prepends a single line of the form :
The first-task log event SHALL be emitted by a shared helper dags/_common.py:dag_loaded_event(logger, dag_id, **extra) that calls the project's log_event with the structured fields dag_id, champollion_sha, champollion_sha_short, synced_at_utc. The event name is the closed-catalogue identifier dag_loaded (per ADR-33).
The DAG tag SHALL be added by passing the short SHA through the existing make_tags(...) helper, e.g. tags=make_tags("diagnostic", "s22", "finetune", "compute", build=dag_build_info()["champollion_sha_short"]).
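The first-task event can be illustrated with a sketch that runs without Airflow. Both log_event and dag_build_info below are stand-ins written for this example: the project's real log_event is not reproduced in this document, and the key=value rendering is an assumption modelled on the log line shown in the table above.

```python
import logging


def log_event(logger: logging.Logger, event: str, **fields: object) -> None:
    """Stand-in for the project's log_event: one key=value line per event."""
    parts = [f"event={event}"] + [f"{key}={value}" for key, value in fields.items()]
    logger.info(" ".join(parts))


def dag_build_info() -> dict:
    # Stand-in that returns the unknown sentinel; the real helper reads the stamp.
    return {
        "champollion_sha": "unknown",
        "champollion_sha_short": "unknown",
        "synced_at_utc": "unknown",
    }


def dag_loaded_event(logger: logging.Logger, dag_id: str, **extra: object) -> None:
    """Emit the dag_loaded event with the build fields ahead of any extras."""
    info = dag_build_info()
    base = {
        "dag_id": dag_id,
        "champollion_sha": info["champollion_sha"],
        "champollion_sha_short": info["champollion_sha_short"],
        "synced_at_utc": info["synced_at_utc"],
    }
    fields = {**base, **extra}
    # Re-apply the base values so an extra cannot overwrite the provenance fields.
    fields.update(base)
    log_event(logger, "dag_loaded", **fields)
```

In a DAG, the first @task would call dag_loaded_event(logger, dag_id) before doing any work, so the build SHA is the first thing an operator sees in the task log.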
Enforcement: CI guardrail G6
A new Story-workflow guardrail G6 — DAG version surfaced MUST be added to .github/workflows/pr-workflow-guardrails.yml. It fails any PR that introduces or modifies a file matching dags/[!_]*.py containing @dag( if either :
- The doc_md does not include a reference to dag_version_banner() (regex/AST check), OR
- The first decorated task does not call dag_loaded_event(.
Without this gate, the policy will drift the moment a new DAG ships without the boilerplate.
Local-dev compatibility
dags/.dag_build_info.json SHALL be .gitignore-d at the repository root. The helper's unknown sentinel + the banner's literal unknown display make local DAG loads + unit tests work without the stamp. The G6 guardrail only checks for the call to the helper, not the return value, so it does not interfere with local-dev.
Consequences
Positive
- The 2026-05-19 incident class becomes structurally impossible to mis-diagnose : the operator can verify, before triggering, which DAG build is loaded. The propagation latency itself is not eliminated (that is a separate problem) but the blast radius of mis-timed triggers shrinks from "indistinguishable crash" to "operator sees the banner reads the old SHA and waits".
- Audit-trail completeness improves : the first task of every @task-based run now logs the build SHA, so post-hoc reconstruction (e.g. cold-eyes reviews) can be anchored to a git commit.
- Consistent surface across all 32 DAGs : one helper, one contract.
Negative
- Mild boilerplate per DAG : every DAG must call the helper. Mitigated by the one-liner footprint and the make_tags integration.
- Stamp absence renders unknown : in local dev, the banner shows unknown, which is visible but harmless. A trade-off for never-raise behaviour.
- G6 introduces another guardrail to maintain. Justified given the drift risk.
Negligible / accepted
- The sync-workflow change is one extra step ; the JSON file is < 200 bytes ; the dag_build_info() helper is cached with functools.lru_cache so per-task overhead is < 1 µs.
Invariants (machine-checkable contract)
- Inv-1 (source of truth) : every commit on cvntrade-airflow-dags from the auto-sync workflow MUST include a fresh dags/.dag_build_info.json with the three required fields.
- Inv-2 (surface coverage) : every file matching dags/[!_]*.py containing @dag( MUST embed dag_version_banner() in its doc_md AND call dag_loaded_event( in its first decorated task (enforced by guardrail G6).
- Inv-3 (no raise) : dag_build_info() MUST NOT raise on missing or malformed JSON ; it MUST return the unknown sentinel and dag_version_banner() MUST render unknown literally.
- Inv-4 (catalogue) : the event name dag_loaded MUST be in the closed catalogue per ADR-33.
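Inv-1 lends itself to a small validator that a CI job could run against the stamp. The sketch below is one possible reading of the contract, with several assumptions: validate_build_info is a hypothetical name, the SHA is assumed to be lowercase hex, and the short form is assumed to be the first 7 characters of the full SHA.

```python
import re
from datetime import datetime

_SHA40 = re.compile(r"[0-9a-f]{40}")


def validate_build_info(data: dict) -> list[str]:
    """Return a list of contract violations for a parsed stamp (empty means valid)."""
    errors = []
    sha = data.get("champollion_sha")
    short = data.get("champollion_sha_short")
    stamp = data.get("synced_at_utc")

    sha_ok = isinstance(sha, str) and _SHA40.fullmatch(sha) is not None
    if not sha_ok:
        errors.append("champollion_sha must be a 40-char hex SHA")

    # Assumption: the short form is the 7-char prefix of the full SHA.
    if not (isinstance(short, str) and len(short) == 7 and sha_ok and sha.startswith(short)):
        errors.append("champollion_sha_short must be the 7-char prefix of champollion_sha")

    if not isinstance(stamp, str) or not stamp.endswith("Z"):
        errors.append("synced_at_utc must be an ISO-8601 string with a Z suffix")
    else:
        try:
            # Swap the Z for an explicit offset so fromisoformat accepts it
            # on Python versions older than 3.11.
            datetime.fromisoformat(stamp[:-1] + "+00:00")
        except ValueError:
            errors.append("synced_at_utc is not a valid ISO-8601 timestamp")
    return errors
```

Note that the unknown sentinel deliberately fails this validation: it is acceptable at read time in local development, but it should never appear in a commit produced by the auto-sync workflow.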
Notes
This is the discipline complement to ADR-25 (no silent fallback) and feedback_no_python_crash_visible — same family of "the operator must not be misled by silent state". The fix here is visibility, not correctness of the underlying DAG, and as such it is non-invasive : it changes no behaviour, only what the operator can see.