Plan dossier — CVN-N014-EB-S01 DAG versioning policy

Date : 2026-05-20

Story : CVN-N014-EB-S01 (wp#198)

Issue : #1005

Parent : CVN-N014-EB (Operator-visible runtime invariants, wp#197, #1006) · CVN-N014 (Continuous improvement, wp#88)

Session type : plan_review

Trigger : CVN-N001-EE-S22A5 (OP wp#185) double-crash on a stale DAG, 2026-05-19. PR #1002 merged at 22:29:41Z ; the operator re-triggered the run at 22:32:49Z (3 min later, before the DAG-repo sync at 22:38:55Z and the K8s deploy at ~22:39:02Z) ; the pod ran the pre-fix DAG and image and crashed identically. No surface in the Airflow UI allowed the operator to verify which build was loaded.


1. Decisional question (project policy)

Operator-mandated 2026-05-20 :

Every DAG must be versioned, and the version must be visible to the operator for verification before triggering a run.

This Story implements that policy across the 32 existing DAGs and freezes it with a CI guardrail + ADR-92.

2. Design — single source, three operator-visible surfaces

Source of truth : the champollion commit SHA that produced the DAG, captured at sync time by the auto-sync workflow on cvntrade-airflow-dags. The sync commit message already carries it (chore: auto-sync DAGs from champollion@<sha>) — we make that machine-readable via a stamped JSON file.

| Layer | What changes | Why |
|---|---|---|
| Sync workflow (cvntrade-airflow-dags) | write dags/.dag_build_info.json = {"champollion_sha": "<full-sha>", "champollion_sha_short": "<7-char>", "synced_at_utc": "<ISO-8601 with Z>"} at every sync, atomically (tmp + rename), single source. | The file is what every DAG reads at module-load time. |
| dags/_common.py (champollion repo) | new @functools.lru_cache(maxsize=1) def dag_build_info() -> dict[str, str] reads the JSON ; new dag_version_banner() -> str returns **DAG build:** champollion@<short> · synced <UTC> for doc_md ; new dag_loaded_event(logger, dag_id, **extra) emits event=dag_loaded dag_id=... champollion_sha=... synced_at_utc=... (uses log_event from commun.logs). All three return champollion_sha_short="unknown" synced_at_utc="unknown" status="unknown" when the JSON is absent (local dev / first-time run) and never raise. | Single helper, three callers, all DAGs uniform. |
| dags/*.py (32 files) | every DAG doc_md starts with dag_version_banner() ; every first task line is dag_loaded_event(logger, dag.dag_id) ; make_tags(..., build=dag_build_info()["champollion_sha_short"]) for the DAG list. | Three operator-visible surfaces : before trigger (UI page doc_md + DAG-list tag) + during run (first log line). |
| .github/workflows/pr-workflow-guardrails.yml | new step G6 : for every changed file under dags/*.py containing @dag(, regex-assert that doc_md includes a dag_version_banner() call AND that the first decorated task emits dag_loaded_event(. Annotate-and-fail on miss. | Without G6 the policy drifts the moment we ship a new DAG. |
| documentation/adr/0092-... | new ADR : DAG versioning & operator-visible build provenance (invariant + 3-surface contract + sync-stamp contract). | Formal invariant per ADR-69 ; future agents won't relearn this. |
| CLAUDE.md | one-line addition to the "Patterns critiques" section pointing at ADR-92. | Onboarding visibility. |
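
The dags/_common.py row could look roughly like the sketch below. It is a minimal illustration, not the merged helper. Several things are assumptions: the stamp path relative to the working directory, the status="ok" value on success, the read_build_info and format_banner helper names, and the plain logging call that stands in for log_event from commun.logs (whose signature is not shown in this dossier).

```python
import functools
import json
import logging
from pathlib import Path

# Assumption: resolved relative to the DAGs root; the real helper may differ.
STAMP_PATH = Path("dags/.dag_build_info.json")

# Sentinel returned when the stamp is absent or unreadable (local dev, first run).
UNKNOWN = {
    "champollion_sha": "unknown",
    "champollion_sha_short": "unknown",
    "synced_at_utc": "unknown",
    "status": "unknown",
}


def read_build_info(path: Path) -> dict[str, str]:
    """Return the stamp content, or the unknown sentinel. Never raises."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            raise ValueError("stamp is not a JSON object")
    except (OSError, ValueError):  # missing file, bad encoding, malformed JSON
        return dict(UNKNOWN)
    # "status": "ok" is a hypothetical success marker, not part of the stamp schema.
    return {**UNKNOWN, **{k: str(v) for k, v in data.items()}, "status": "ok"}


@functools.lru_cache(maxsize=1)
def dag_build_info() -> dict[str, str]:
    """Read the stamp once per process (DAG module-load time)."""
    return read_build_info(STAMP_PATH)


def format_banner(info: dict[str, str]) -> str:
    """Build the doc_md banner line from a build-info dict."""
    return (
        f"**DAG build:** champollion@{info['champollion_sha_short']}"
        f" · synced {info['synced_at_utc']}"
    )


def dag_version_banner() -> str:
    return format_banner(dag_build_info())


def dag_loaded_event(logger: logging.Logger, dag_id: str, **extra: str) -> None:
    """Emit one key=value line; stand-in for log_event from commun.logs."""
    info = dag_build_info()
    fields = {
        "event": "dag_loaded",
        "dag_id": dag_id,
        "champollion_sha": info["champollion_sha"],
        "synced_at_utc": info["synced_at_utc"],
        **extra,
    }
    logger.info(" ".join(f"{key}={value}" for key, value in fields.items()))
```

Splitting the uncached read_build_info(path) from the cached dag_build_info() keeps the absent-file and malformed-file cases easy to unit-test without clearing the cache between tests.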

Why this design :

  • Operator can verify the build before triggering (Airflow UI page renders doc_md ; tag visible in the DAG list — no clicking required for a sanity glance).
  • Operator can verify during the run (grep the task log for event=dag_loaded).
  • One helper, one stamp, zero per-DAG SHA-substitution magic.
  • The stamp is computed at sync time in the DAGs repo, not at DAG-parse time in Airflow — so the SHA is the real propagated build, not a "what champollion main was when this pod woke up" (subtle but important).
  • G6 is the discipline guarantee — without a CI gate, a new DAG will forget the boilerplate within one mission.
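
The sync-time stamp with an atomic tmp + rename write can be sketched as below. This is an illustration under assumptions: the real sync workflow may well be a shell step rather than Python, and the write_build_stamp name, the temp-file naming, and the 0644 permission are hypothetical. Only the file location and the three-key schema come from the design above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_build_stamp(dags_dir: Path, champollion_sha: str) -> Path:
    """Write dags/.dag_build_info.json atomically and return its path."""
    stamp = {
        "champollion_sha": champollion_sha,
        "champollion_sha_short": champollion_sha[:7],
        # ISO-8601 in UTC with a trailing Z, captured at sync time.
        "synced_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    target = dags_dir / ".dag_build_info.json"
    # The temp file lives in the same directory so os.replace never
    # crosses a filesystem boundary and stays atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=dags_dir, prefix=".dag_build_info.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(stamp, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_name, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_name, target)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Because the rename is the last step, a failed sync leaves the previous stamp in place, so the banner keeps reporting the last build that actually propagated.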

3. Scope / non-goals

In scope : the helper, the 32 DAG retrofits, the sync-workflow JSON stamp, G6, ADR-92, CLAUDE.md one-liner.

Out of scope (this Story) :

  • Reducing propagation latency. The chain (champollion CI → DAG-repo sync → Airflow git-sync sidecar → DAG parser refresh + K8s image rollout) typically takes ~10 min ; reducing that is a separate Story under a different Epic. The policy here is about visibility, not latency.
  • Versioning anything beyond the DAG (e.g. the K8s ConfigMap layer, Helm release, image registry). Placeholder Stories CVN-N014-EB-S02/S03 cover follow-ups.
  • A retroactive audit of historical DAG runs (no point : the policy is forward-only).

4. Risks + mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Sync workflow fails to write the JSON → all DAGs render champollion_sha=unknown | Low | Medium (operator sees unknown and knows to wait ; not a regression vs today, where they see nothing) | Helper returns status="unknown" explicitly + banner says unknown. Sync workflow uses tmp+rename so partial writes are impossible. |
| Stamp file is present but stale (last sync's SHA, not current) | Medium | Low (still strictly better than today : operator sees the actual loaded build, which IS the last-synced) | Acceptable : "synced_at_utc" tells the operator when the loaded build was synced. That is precisely what we need them to see. |
| G6 false-positive (matches a DAG that legitimately doesn't need versioning ; none expected, but e.g. _common.py itself) | Low | Trivial (CI noise) | G6 only fires on files matching dags/[!_]*.py (skip helpers) AND containing @dag(. |
| Backwards-compat with the local-dev path (no sync, no JSON file) | Medium | Trivial | Helper returns unknown ; banner says unknown ; local mkdocs serve and unit tests still work. |
| ADR-92 conflicts with an existing ADR | Very low | Low | Verified against the corpus : closest are ADR-30 (logs as stable interface) and ADR-31 (logging discipline), both complementary, not overlapping. |

5. Test plan

  • Unit (tests/unit/test_dag_build_info.py) :
      • dag_build_info() returns the JSON content when the file exists.
      • dag_build_info() returns the documented unknown sentinel when the file is absent (no raise).
      • dag_build_info() returns the documented unknown sentinel when the file is malformed (no raise).
      • dag_version_banner() returns the expected **DAG build:** champollion@<short> · synced <UTC> format ; unknown propagates honestly.
      • dag_loaded_event(...) calls log_event with champollion_sha=... and synced_at_utc=... (mock the logger).
  • DAG-load smoke (tests/unit/test_dag_versioning_smoke.py) : every file dags/[!_]*.py containing @dag( is parsed by ast.parse and contains at least one dag_version_banner() reference + one dag_loaded_event( reference. This is the same check G6 enforces, locked in as a unit test.
  • G6 dry-run : run the new guardrail logic against a synthetic DAG missing the boilerplate ; expect failure with a clear annotation.
  • Manual operator sanity : after the PR merges + propagates, verify on the Airflow UI page for diagnostic__s22_a5 that the doc_md shows the expected champollion@<short> · synced <UTC> ; verify the first task log line emits event=dag_loaded.
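
The smoke test and the G6 dry-run share one piece of logic, which might be sketched as follows. This is a simplified illustration: it only checks that both calls appear somewhere in the file, which is weaker than the stated G6 rule (banner inside doc_md, event in the first decorated task). The function names are hypothetical.

```python
import ast
from pathlib import PurePosixPath

REQUIRED_CALLS = ("dag_version_banner", "dag_loaded_event")


def is_checked_path(path: str) -> bool:
    """True for dags/<name>.py where <name> does not start with '_'.

    PurePosixPath.match anchors a relative pattern on the right, so
    "repo/dags/foo.py" also matches; "*" does not cross "/".
    """
    return PurePosixPath(path).match("dags/[!_]*.py")


def called_names(source: str) -> set[str]:
    """Collect the names of every function called in the source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):  # dag_version_banner()
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # _common.dag_version_banner()
                names.add(func.attr)
    return names


def missing_boilerplate(path: str, source: str) -> list[str]:
    """Return the required calls a DAG file lacks; [] if exempt or compliant."""
    if not is_checked_path(path) or "@dag(" not in source:
        return []
    found = called_names(source)
    return [name for name in REQUIRED_CALLS if name not in found]
```

For a synthetic DAG with no boilerplate, missing_boilerplate("dags/example.py", source) returns both required names, which is the failure case the G6 dry-run expects ; dags/_common.py is skipped by the path filter.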

6. Success criteria (binary)

  • Helper merged + 32 DAGs retrofitted + sync workflow updated + G6 active on main + ADR-92 live on docs.cvntrade.eu/adr/0092-... + CLAUDE.md updated.
  • Operator runs diagnostic__s22_a5 (post-sync + post-deploy) and confirms — from the Airflow UI alone — both the banner and the dag_loaded log line carrying the correct SHA. Recorded on OP wp.

7. Review question

(a) Is the three-surface contract (doc_md banner + first-task event=dag_loaded + DAG-list tag) the right set of operator-visible surfaces, neither over-engineered nor under-specified, given the trigger incident ?

(b) Is the sync-time JSON stamp the right source of truth (vs DAG-parse-time computation or per-DAG SHA substitution), and is the dags/.dag_build_info.json location + {champollion_sha, champollion_sha_short, synced_at_utc} schema correct ?

(c) Is the G6 CI guardrail (regex/AST check on dags/[!_]*.py for the banner + dag_loaded_event call) the right enforcement, or should it be runtime-only (e.g. an Airflow lint hook) instead ?

(d) Is the ADR-92 placement (under CVN-N014-EB rather than as a global infra ADR under a different Need) correct, given the policy is operationally-driven rather than architectural-strategic ?