Plan dossier: CVN-N014-EB-S01 DAG versioning policy
Date : 2026-05-20 · Story : CVN-N014-EB-S01 (wp#198) · Issue : #1005
Parent : CVN-N014-EB (Operator-visible runtime invariants, wp#197, #1006) · CVN-N014 (Continuous improvement, wp#88)
Session type : plan_review
Trigger : CVN-N001-EE-S22A5 (OP wp#185) double-crash on a stale DAG, 2026-05-19. PR #1002 merged at 22:29:41Z ; the operator re-triggered the run at 22:32:49Z, about 3 min later and before both the DAG-repo sync (22:38:55Z) and the K8s deploy (~22:39:02Z) ; the pod therefore ran the pre-fix DAG + image and crashed identically. No surface in the Airflow UI allowed the operator to verify which build was loaded.
1. Decisional question (project policy)
Operator-mandated 2026-05-20 :
Every DAG must be versioned, and the version must be visible to the operator for verification before triggering a run.
This Story implements that policy across the 32 existing DAGs and freezes it with a CI guardrail + ADR-92.
2. Design: single source, three operator-visible surfaces
Source of truth : the champollion commit SHA that produced the DAG, captured at sync time by the auto-sync workflow on cvntrade-airflow-dags. The sync commit message already carries it (chore: auto-sync DAGs from champollion@<sha>) — we make that machine-readable via a stamped JSON file.
| Layer | What changes | Why |
|---|---|---|
| Sync workflow (`cvntrade-airflow-dags`) | Write `dags/.dag_build_info.json` = `{"champollion_sha": "<full-sha>", "champollion_sha_short": "<7-char>", "synced_at_utc": "<ISO-8601 with Z>"}` at every sync, atomically (tmp + rename), single source. | The file is what every DAG reads at module-load time. |
| `dags/_common.py` (champollion repo) | New `@functools.lru_cache(maxsize=1) def dag_build_info() -> dict[str, str]` reads the JSON ; new `dag_version_banner() -> str` returns `**DAG build:** champollion@<short> · synced <UTC>` for `doc_md` ; new `dag_loaded_event(logger, dag_id, **extra)` emits `event=dag_loaded dag_id=... champollion_sha=... synced_at_utc=...` (uses `log_event` from `commun.logs`). All three fall back to `champollion_sha_short="unknown" synced_at_utc="unknown" status="unknown"` when the JSON is absent (local dev / first-time run) and never raise. | Single helper, three callers, all DAGs uniform. |
| `dags/*.py` (32 files) | Every DAG `doc_md` starts with `dag_version_banner()` ; the first line of every DAG's first task is `dag_loaded_event(logger, dag.dag_id)` ; `make_tags(..., build=dag_build_info()["champollion_sha_short"])` for the DAG list. | Three operator-visible surfaces : before trigger (UI page `doc_md` + DAG-list tag) + during run (first log line). |
| `.github/workflows/pr-workflow-guardrails.yml` | New step G6 : for every changed file under `dags/*.py` containing `@dag(`, regex-assert that `doc_md` includes a `dag_version_banner()` call AND that the first decorated task emits `dag_loaded_event(`. Annotate-and-fail on miss. | Without G6 the policy drifts the moment we ship a new DAG. |
| `documentation/adr/0092-...` | New ADR : DAG versioning & operator-visible build provenance (invariant + 3-surface contract + sync-stamp contract). | Formal invariant per ADR-69 ; future agents won't relearn this. |
| `CLAUDE.md` | One-line addition to the "Patterns critiques" section pointing at ADR-92. | Onboarding visibility. |
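
As an illustration of the `dags/_common.py` row above, the helper could look roughly like the following. This is a minimal sketch under stated assumptions, not the merged implementation: `BUILD_INFO_PATH`, `REQUIRED_KEYS`, `UNKNOWN`, `read_build_info` and `format_banner` are hypothetical names introduced here, and `logger.info` stands in for `log_event` from `commun.logs`, whose signature this dossier does not describe.

```python
import functools
import json
import logging
from pathlib import Path

# Hypothetical location; a real helper would likely resolve it relative to dags/.
BUILD_INFO_PATH = Path("dags") / ".dag_build_info.json"
REQUIRED_KEYS = ("champollion_sha", "champollion_sha_short", "synced_at_utc")
UNKNOWN = {**{key: "unknown" for key in REQUIRED_KEYS}, "status": "unknown"}


def read_build_info(path: Path) -> dict[str, str]:
    """Return the stamp content, or the 'unknown' sentinel; never raise."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # absent, unreadable, or malformed JSON
        return dict(UNKNOWN)
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return dict(UNKNOWN)
    return {key: str(value) for key, value in data.items()}


@functools.lru_cache(maxsize=1)
def dag_build_info() -> dict[str, str]:
    """Cached read: the stamp file is parsed at most once per process."""
    return read_build_info(BUILD_INFO_PATH)


def format_banner(info: dict[str, str]) -> str:
    """Render the doc_md banner from a build-info mapping."""
    return (
        f"**DAG build:** champollion@{info['champollion_sha_short']}"
        f" · synced {info['synced_at_utc']}"
    )


def dag_version_banner() -> str:
    return format_banner(dag_build_info())


def dag_loaded_event(logger: logging.Logger, dag_id: str, **extra: str) -> None:
    # Stand-in for log_event (commun.logs): emits one key=value log line.
    info = dag_build_info()
    fields = {
        "dag_id": dag_id,
        "champollion_sha": info["champollion_sha"],
        "synced_at_utc": info["synced_at_utc"],
        **extra,
    }
    logger.info("event=dag_loaded " + " ".join(f"{k}={v}" for k, v in fields.items()))
```

A DAG would then prepend `dag_version_banner()` to its `doc_md` and call `dag_loaded_event(logger, dag.dag_id)` as the first statement of its first task. Splitting the uncached `read_build_info` from the cached `dag_build_info` is a design choice made here so that the absent-file and malformed-file cases can be unit-tested without clearing the cache.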
Why this design :
- Operator can verify the build before triggering (Airflow UI page renders `doc_md`; tag visible in the DAG list, so no clicking is required for a sanity glance).
- Operator can verify during the run (grep the task log for `event=dag_loaded`).
- One helper, one stamp, zero per-DAG SHA-substitution magic.
- The stamp is computed at sync time in the DAGs repo, not at DAG-parse time in Airflow — so the SHA is the real propagated build, not a "what champollion main was when this pod woke up" (subtle but important).
- G6 is the discipline guarantee — without a CI gate, a new DAG will forget the boilerplate within one mission.
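
The sync-time stamp with its tmp + rename write could be sketched as below. The actual sync workflow may well do this in shell; Python is used here only for illustration. `write_build_stamp` is a hypothetical name, and the timestamp format is one reading of "ISO-8601 with Z".

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_build_stamp(target: Path, champollion_sha: str) -> dict[str, str]:
    """Write the build stamp atomically and return what was written."""
    stamp = {
        "champollion_sha": champollion_sha,
        "champollion_sha_short": champollion_sha[:7],
        "synced_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Write to a temp file in the SAME directory, then rename over the target.
    # os.replace is atomic on POSIX when both paths share a filesystem, so a
    # reader sees either the old stamp or the new one, never a partial file.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=".dag_build_info.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(stamp, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return stamp
```

Because the SHA is passed in by the sync job rather than looked up at DAG-parse time, the stamp records the build that was actually propagated, which is the property the design relies on.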
3. Scope / non-goals
In scope : the helper, the 32 DAG retrofits, the sync-workflow JSON stamp, G6, ADR-92, CLAUDE.md one-liner.
Out of scope (this Story) :
- Reducing propagation latency. The chain (champollion CI → DAG-repo sync → Airflow git-sync sidecar → DAG parser refresh + K8s image rollout) typically takes ~10 min ; reducing that is a separate Story under a different Epic. The policy here is about visibility, not latency.
- Versioning anything beyond the DAG (e.g. the K8s ConfigMap layer, Helm release, image registry) — placeholder Stories CVN-N014-EB-S02/S03 for follow-ups.
- A retroactive audit of historical DAG runs (no point — the policy is forward-only).
4. Risks + mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Sync workflow fails to write the JSON → all DAGs render `champollion_sha=unknown` | Low | Medium (operator sees `unknown` and knows to wait ; not a regression vs today, where they see nothing) | Helper returns `status="unknown"` explicitly + banner says `unknown`. Sync workflow uses tmp+rename so partial writes are impossible. |
| Stamp file is present but stale (last sync's SHA, not current) | Medium | Low (still strictly better than today — operator sees the actual loaded build, which IS the last-synced) | Acceptable : "synced_at_utc" tells the operator when the loaded build was synced. That is precisely what we need them to see. |
| G6 false-positive (matches a DAG that legitimately doesn't need versioning ; none expected, but e.g. `_common.py` itself) | Low | Trivial (CI noise) | G6 only fires on files matching `dags/[!_]*.py` (skip helpers) AND containing `@dag(`. |
| Backwards-compat with the local-dev path (no sync, no JSON file) | Medium | Trivial | Helper returns unknown ; banner says unknown ; local mkdocs serve and unit tests still work. |
| ADR-92 conflicts with an existing ADR | Very low | Low | Verified against the corpus : closest are ADR-30 (logs as stable interface) and ADR-31 (logging discipline) — both complementary, not overlapping. |
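
The G6 scoping rule from the false-positive row can be sketched as a small predicate. This is an illustration only: `in_g6_scope` is a hypothetical name, and treating `dags/[!_]*.py` as "top-level files of `dags/` only" is an assumption about how the glob is meant to be read.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def in_g6_scope(path: str, source: str) -> bool:
    """True when a changed file should be checked by the guardrail."""
    p = PurePosixPath(path)
    # Assumption: only files directly under dags/ are in scope.
    is_top_level_dag_file = len(p.parts) == 2 and p.parts[0] == "dags"
    return (
        is_top_level_dag_file
        # "[!_]" rejects helper modules whose name starts with an underscore.
        and fnmatchcase(p.name, "[!_]*.py")
        # Plain substring test, mirroring the "containing @dag(" rule.
        and "@dag(" in source
    )
```

Under these assumptions `dags/_common.py` is skipped even if it mentions `@dag(`, and a helper module without a `@dag(` decorator is skipped regardless of its name.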
5. Test plan
- Unit `tests/unit/test_dag_build_info.py`:
    - `dag_build_info()` returns the JSON content when the file exists.
    - `dag_build_info()` returns the documented `unknown` sentinel when the file is absent (no raise).
    - `dag_build_info()` returns the documented `unknown` sentinel when the file is malformed (no raise).
    - `dag_version_banner()` returns the expected `**DAG build:** champollion@<short> · synced <UTC>` format ; `unknown` propagates honestly.
    - `dag_loaded_event(...)` calls `log_event` with `champollion_sha=...` and `synced_at_utc=...` (mock the logger).
- DAG-load smoke `tests/unit/test_dag_versioning_smoke.py`: every file `dags/[!_]*.py` containing `@dag(` is parsed by `ast.parse` and contains at least one `dag_version_banner()` reference + one `dag_loaded_event(` reference. This is the same check G6 enforces, which unit-locks the contract.
- G6 dry-run : run the new guardrail logic against a synthetic DAG missing the boilerplate ; expect failure with a clear annotation.
- Manual operator sanity : after the PR merges + propagates, verify on the Airflow UI page for `diagnostic__s22_a5` that the `doc_md` shows the expected `champollion@<short> · synced <UTC>` ; verify the first task log line emits `event=dag_loaded`.
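
The DAG-load smoke check could be approximated with `ast` roughly as follows. This is a sketch, not the project's test: `called_names` and `missing_boilerplate` are hypothetical names, and the check only verifies that both calls are present somewhere in the file, not that the banner comes first in `doc_md` or that the event is the first statement of the first task.

```python
import ast

REQUIRED_CALLS = frozenset({"dag_version_banner", "dag_loaded_event"})


def called_names(source: str) -> set[str]:
    """Collect the names of all functions called anywhere in the source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):  # plain call: f(...)
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # qualified call: mod.f(...)
                names.add(func.attr)
    return names


def missing_boilerplate(source: str) -> set[str]:
    """Return the required calls that the DAG source does not contain."""
    return set(REQUIRED_CALLS) - called_names(source)
```

A test would read each in-scope DAG file, pass its text to `missing_boilerplate`, and fail with the file name and the returned set when it is non-empty. Because the source is parsed rather than imported, the check does not need Airflow installed.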
6. Success criteria (binary)
- Helper merged + 32 DAGs retrofitted + sync workflow updated + G6 active on `main` + ADR-92 live on `docs.cvntrade.eu/adr/0092-...` + CLAUDE.md updated.
- Operator runs `diagnostic__s22_a5` (post-sync + post-deploy) and confirms, from the Airflow UI alone, both the banner and the `dag_loaded` log line carrying the correct SHA. Recorded on OP wp.
7. Review question
- (a) Is the three-surface contract (`doc_md` banner + first-task `event=dag_loaded` + DAG-list tag) the right set of operator-visible surfaces, neither over-engineered nor under-specified, given the trigger incident?
- (b) Is the sync-time JSON stamp the right source of truth (vs DAG-parse-time computation or per-DAG SHA substitution), and is the `dags/.dag_build_info.json` location + `{champollion_sha, champollion_sha_short, synced_at_utc}` schema correct?
- (c) Is the G6 CI guardrail (regex/AST check on `dags/[!_]*.py` for the banner + `dag_loaded_event` call) the right enforcement, or should it be runtime-only (e.g. an Airflow lint hook) instead?
- (d) Is the ADR-92 placement (under CVN-N014-EB rather than as a global infra ADR under a different Need) correct, given the policy is operationally-driven rather than architectural-strategic?