Plan dossier - CVN-N014-EC-S16: dev-productivity tooling (SCRIPTS)
Story: CVN-N014-EC-S16 — OP wp#240 · GH #1103 · Epic CVN-N014-EC (wp#205, GH #1027)
Type: implementation dossier (ADR-68), submitted for committee review. Tooling tier of the agentic dev-workflow epic.
Date: 2026-06-05
PR: #1102 — chore/airflow-xcom-reader (all checks GREEN, mergeable=MERGEABLE, CodeRabbit pass; not merged).
Status: implementation complete and green on the PR branch; OP wp#240 New. Per operator decision (2026-06-05), this Story takes the full canonical documentation set (ADR-0101, shipped by CVN-N014-EC-S17) — not the scripts-only carve-out: hub · plan (this) · architecture · runbook · test strategy. Dogfood: S16 is the first pre-existing tooling Story retrofitted to the universal standard.
Part I: Dossier (problem framing → consolidation)
Chapter 1: Problem framing (jargon-free)
In one sentence: two daily frictions in the dev workflow had no tooling (reading a DAG's result locally, and running two "developers" in parallel on the repo), and everyone worked around them by hand.
- Reading a diagnostic DAG's result from a workstation. The DAGs `s40`/`s41`/`s42` return their trajectory (AUC per draw, deltas, verdicts) via XCom, a small locker internal to Airflow. But that locker lives in a database on a private network (VPC) that is unreachable from a laptop. Loki (the logs) carries only the status, not the trajectory. Result: there was no path at all to read what a run had returned; people guessed, or re-ran.
- Running a second Claude "developer". Working on two fronts in the same repo was done by hand: a second `git clone` duplicates everything, drifts on fetch, and loses all local secrets (venv, `.env`, settings). No procedure, hence errors.
The fix: two tools, a read-only XCom reader (which queries the database from inside the cluster, without exposing it) and a "second developer" bootstrap (a git worktree plus symlinks to the local environment). No more hand-rolled workarounds, no more wrong targets.
Chapter 2: User stories implemented
| # | As a… | I want… | so that… |
|---|---|---|---|
| US-1 | diagnostic maintainer | to read the XCom trajectory of an s42 run from my workstation | I can analyse the result without re-running or guessing (which Loki does not provide) |
| US-2 | operator | this reader to be read-only by construction (never a write to the database) | I can use it without risk on the prod database |
| US-3 | dev | to launch a second Claude workspace on another branch with one command | I can work on two fronts without re-cloning or reconfiguring |
| US-4 | dev | the second workspace to share the first one's venv/secrets/settings/memory | it is a true working clone, not a shell to reconfigure |
Chapter 3: Hypotheses
| # | Hypothesis | Test method | Result |
|---|---|---|---|
| H1 | An in-pod `SELECT` (via `kubectl exec`) reads XCom without exposing the DB and without any write. | A `_REMOTE_PROBE` issuing only `SELECT` plus a read-only guard (no DDL/DML verb), locked by unit tests; verified in-pod. | Confirmed: read-only-by-construction guard + 23 unit tests (post-committee Q2). |
| H2 | A git worktree + symlinks makes a second workspace a faithful clone of the first (venv/`.env`/`.claude`/memory). | `new_worktree.sh` dogfooded: this very dev2 worktree runs on it. | Confirmed: the dev2 session ran the whole work on the provisioned worktree. |
| H3 | The shared `git stash` stack is a real worktree foot-gun (not theoretical). | Operator review (Q4) flagged it; it materialised live this session (`git stash -u` swallowed the dev2 venv symlink). | Confirmed: mitigated by guidance + per-worktree identity (post-committee Q4). |
Chapter 4: State of the art
Airflow XCom is the cross-task message channel stored in the metadata DB; reading it programmatically usually means going through the REST API or the DB, both impractical here (private VPC, value-return disabled). The idiomatic safe path is an in-cluster read (a pod already holds the connection and has reachability), mirroring the project's loki-query pattern.
Git worktrees are the standard way to check out multiple branches of one repository without a second clone (shared object store, separate working trees); the "second-developer"/parallel-agent pattern layers environment symlinking on top.
References: Apache Airflow XCom + REST API docs; git worktree documentation; internal loki-query (CVN-N014-EC-S02) as the read-only-CLI precedent.
Chapter 5: Definition of Done
- `scripts/airflow_xcom_pull.py`: read-only XCom reader (in-pod `SELECT`, read-only guard) + 23 unit tests.
- `scripts/new_worktree.sh`: second-developer bootstrap (worktree + env symlinks + per-worktree identity).
- Process docs live: `AIRFLOW_XCOM_READER.md` + `PARALLEL_DEV_WORKTREES.md`.
- Committee blockers addressed (read-only guard Q2; git-stash mitigation + per-worktree identity Q4).
- Full canonical doc set (ADR-0101): hub · plan · architecture · runbook · test strategy → `check_story_docs` CONFORMANT.
- PR #1102 merged + docs live (the `→ Developed` gate).
- Operational readiness (process-equivalent, Inv 4): usage + gotchas in the runbook; rollback = revert the scripts (no runtime impact).
Chapter 6: Consolidation & traceability
| Problem thread | Hypothesis | US | Deliverable |
|---|---|---|---|
| can't read XCom locally | H1 | US-1/US-2 | airflow_xcom_pull.py + AIRFLOW_XCOM_READER.md |
| can't run a 2nd dev | H2/H3 | US-3/US-4 | new_worktree.sh + PARALLEL_DEV_WORKTREES.md |
Decision routing: read-only-safe + dogfooded → ship. No dangling thread.
Part II: Technical plan (self-contained design)
Problem
Two recurring frictions in the diagnostic / multi-dev workflow, neither solvable from a laptop today:
- Diagnostic XCom is unreadable locally. DAGs `s40`/`s41`/`s42` return their per-cell trajectory (`auc_by_seed`, `deltas_per_point`, verdict dicts) via XCom, not Loki: the `s42_*` log events carry status, not the AUC-vs-point trajectory the HP-swap step-3 analysis needs (`hp_swap_deliverable_plan.md` §9). Loki cannot answer "is the optimum still at point X across folds"; XCom can. But the Airflow metadata DB lives on a private Scaleway VPC IP unreachable from a laptop, so there was no read path at all.
- No clean way to run a second Claude "developer". Parallel work on the repo was being done ad hoc; a second `git clone` duplicates `.git`, drifts on fetch, and loses every gitignored local secret (venv, `.env`, `.claude` settings, shared memory).
Approach
Two deterministic, operator-usable SCRIPTS (not skills: no agent judgment; the SCRIPT/SKILL split mirrors S02 loki-query). A follow-up SKILL wrapping the XCom reader (mirror of the loki-query skill over `loki_query.py`) is tracked separately and out of scope here.
- `scripts/airflow_xcom_pull.py`: read-only XCom reader. Ships a tiny embedded probe and runs it in-pod via `kubectl exec` (the scheduler pod already holds the conn string and can reach the DB). The only SQL issued is `SELECT`; credentials never leave the cluster; fail-fast on any error (ADR-25), never invents an empty result. Mirrors `loki_query.py`'s shape and ergonomics.
- `scripts/new_worktree.sh`: second-developer bootstrap. Creates a git worktree (shared `.git`, no second clone) on a dedicated branch and symlinks the gitignored local environment (`.venv_airflow`, `.env`, `.claude/settings*.json`) plus the keyed auto-memory dir, so the second session is a faithful clone of the first.
- `CLAUDE.md` trim: replaced the inline ADR-1..92 enumeration with the load-bearing invariants + a pointer to `documentation/ADR.md` (the SSoT, ADR-77). No behavioural change; reduces drift between the inline list and the source of truth.
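The in-pod transport described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped script: the names `build_exec_command` and `run_probe`, the placeholder `PROBE_SOURCE`, and the choice to pass the request as a JSON argument are all hypothetical. Only the `kubectl exec` transport, the 60s timeout and the fail-fast behaviour come from this dossier.

```python
import json
import subprocess

# Hypothetical stand-in for the embedded probe; the real probe source is not
# reproduced in this dossier. It simply echoes the request it receives.
PROBE_SOURCE = "import sys, json; print(json.dumps({'echo': json.loads(sys.argv[1])}))"


def build_exec_command(pod, namespace, container, request):
    """Build the argv that runs the probe inside the given pod/container."""
    return [
        "kubectl", "exec", pod,
        "-n", namespace,
        "-c", container,
        "--",
        "python", "-c", PROBE_SOURCE, json.dumps(request),
    ]


def run_probe(pod, namespace, container, request, timeout_s=60):
    """Run the probe in-pod and return its raw stdout.

    Fails fast: a non-zero exit raises RuntimeError, and a timeout surfaces
    as subprocess.TimeoutExpired. Nothing is ever turned into an empty result.
    """
    cmd = build_exec_command(pod, namespace, container, request)
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        raise RuntimeError(
            f"kubectl exec failed (rc={proc.returncode}): {proc.stderr.strip()}"
        )
    return proc.stdout
```

In this shape only the JSON request and the probe's stdout cross the cluster boundary; the connection string stays in the pod environment, which is the property the approach relies on.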
Files
- `scripts/airflow_xcom_pull.py` (new)
- `scripts/new_worktree.sh` (new)
- `CLAUDE.md` (trim: ADR catalogue → invariants + SSoT pointer)
- `documentation/process/AIRFLOW_XCOM_READER.md` (new: usage note, mirrors `LOKI_QUERY.md`)
- `documentation/process/PARALLEL_DEV_WORKTREES.md` (new: usage note + worktree gotchas)
- `mkdocs.yml` (nav: two entries under Operations)
- `documentation/reviews/2026-06-05-cvn-n014-ec-s16-dev-productivity-tooling-plan.md` (this dossier)
Risks & mitigations
- A write path slips into the "read-only" reader → the only SQL string is a `SELECT`; the in-pod probe takes no DDL/DML branch; documented + asserted in the usage note as read-only by construction.
- Credential leakage to the laptop → the conn string is read from the pod env and never printed; the script transports only the JSON request + JSON result over `kubectl exec` stdout.
- Silent empty result masking a real error → fail-fast (ADR-25): no Running pod / non-zero exec / missing conn / non-JSON output each raise with a clear message + non-zero exit. An empty match (vs error) prints a "widen with `--list-runs`" hint, distinct from an error.
- Worktree foot-guns (same branch in two worktrees, shared `git stash` stack, stale local `main`, concurrent cluster runs) → enumerated as "Iron rules" in `PARALLEL_DEV_WORKTREES.md`; serialise-cluster-runs called out explicitly (shared `max_active_runs=1` + `ftf_config`).
- `kubectl exec` timeout / pod churn → 60s timeout on the exec; pod resolved by `component=scheduler` label with `--pod` override; `--container` overridable.
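As an illustration of the first mitigation, a read-only guard could look like the sketch below. It is not the guard that ships in `airflow_xcom_pull.py`: the function names and the exact verb list are assumptions chosen for the example.

```python
import re

# Illustrative verb list (an assumption): statements that write or change schema.
_WRITE_VERBS = (
    "INSERT", "UPDATE", "DELETE", "MERGE", "DROP",
    "ALTER", "CREATE", "TRUNCATE", "GRANT", "REVOKE",
)
_WRITE_RE = re.compile(r"\b(" + "|".join(_WRITE_VERBS) + r")\b", re.IGNORECASE)


def find_violation(sql):
    """Return a reason string if `sql` is not a single read-only SELECT, else None."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return "multiple statements are not allowed"
    if not stripped.upper().startswith("SELECT"):
        return "only SELECT statements are allowed"
    match = _WRITE_RE.search(stripped)
    if match:
        return f"write verb not allowed: {match.group(1).upper()}"
    return None


def assert_read_only(sql):
    """Raise ValueError on any violation; return the SQL unchanged otherwise."""
    reason = find_violation(sql)
    if reason is not None:
        raise ValueError(reason)
    return sql
```

The check is deliberately conservative: it matches whole words only, so a column such as `updated_at` passes, but a string literal containing a write verb would be rejected. For a reader that issues one fixed `SELECT`, a false rejection is cheaper than a missed write.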
Success criteria
- PR #1102 checks green + CodeRabbit clean (✅ met).
- Docs live on docs.cvntrade.eu (two Operations pages) after merge: the `→ Developed` gate.
- Dogfood: `airflow_xcom_pull.py` pulls a real `diagnostic__s42` XCom trajectory; `new_worktree.sh` already dogfooded (this very session runs in the `champollion-dev2` worktree it bootstraps).
- Standard PR / CR / committee cycle completed before merge.
Self-contained design (for committee, no code-reading required)
1. airflow_xcom_pull.py: CLI surface
| flag | default | meaning |
|---|---|---|
| `--dag-id` | (required) | DAG id, e.g. `diagnostic__s42` |
| `--run-id` | (none) | DAG run_id (required unless `--list-runs`) |
| `--task-id` / `--map-index` | (none) | restrict to one task / one mapped cell |
| `--key` | `return_value` | XCom key |
| `--list-runs` | off | list recent runs of `--dag-id` and exit |
| `--limit` | `20` | max rows |
| `--namespace` / `--selector` / `--container` / `--pod` | `cvntrade` / `component=scheduler` / `scheduler` / (none) | pod targeting |
| `--json` | off | raw JSON dump |
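For readers who prefer code to tables, the CLI surface above maps onto an `argparse` definition roughly like the one below. The flag names and defaults are taken from the table; the helper names `build_parser` and `parse_args`, the `int` type on `--map-index`, and the help strings are assumptions made for this sketch.

```python
import argparse


def build_parser():
    """Declare the flags listed in the CLI surface table."""
    parser = argparse.ArgumentParser(
        prog="airflow_xcom_pull.py",
        description="Read-only XCom reader (illustrative sketch)",
    )
    parser.add_argument("--dag-id", required=True, help="DAG id, e.g. diagnostic__s42")
    parser.add_argument("--run-id", help="DAG run_id (required unless --list-runs)")
    parser.add_argument("--task-id", help="restrict to one task")
    # Assumption: the mapped-cell index is parsed as an integer.
    parser.add_argument("--map-index", type=int, help="restrict to one mapped cell")
    parser.add_argument("--key", default="return_value", help="XCom key")
    parser.add_argument("--list-runs", action="store_true", help="list recent runs and exit")
    parser.add_argument("--limit", type=int, default=20, help="max rows")
    parser.add_argument("--namespace", default="cvntrade")
    parser.add_argument("--selector", default="component=scheduler")
    parser.add_argument("--container", default="scheduler")
    parser.add_argument("--pod", help="explicit pod name (overrides --selector)")
    parser.add_argument("--json", action="store_true", help="raw JSON dump")
    return parser


def parse_args(argv):
    """Parse argv and enforce the one cross-flag rule from the table."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.list_runs and not args.run_id:
        parser.error("--run-id is required unless --list-runs is given")
    return args
```

For example, `parse_args(["--dag-id", "diagnostic__s42", "--list-runs"])` succeeds without a run id, while omitting both `--run-id` and `--list-runs` exits with a usage error.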
The load-bearing decision — in-pod SELECT over kubectl exec (not a local DB connection, not a write):
- The metadata DB is on a private VPC IP → no local connection possible. The scheduler pod can reach it and already has the conn string.
- The embedded probe (sqlalchemy + json, both in the Airflow image) issues exactly one SELECT against xcom/dag_run, JSON-decodes values when possible (pickled/non-JSON surfaced verbatim — never guessed), and prints JSON to stdout.
- Read-only by construction: no DDL/DML path exists in the probe. The only side effect is the transient kubectl exec the script owns.
Error model: single fail-fast surface returning rc=1 with a clear message (ADR-25, no silent fallback) — distinct hint on an empty match vs. a real error.
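The decoding rule and the error model can be sketched together. Everything named here is hypothetical (`XComReadError`, `decode_value`, `parse_probe_output`, `main`), as are two behaviours the dossier does not specify: that the probe emits a JSON list of rows, and that an empty match exits with 0.

```python
import json
import sys


class XComReadError(RuntimeError):
    """Hypothetical error type standing in for the single fail-fast surface."""


def decode_value(raw):
    """JSON-decode a stored value when possible; otherwise return it verbatim."""
    text = raw
    if isinstance(raw, (bytes, bytearray)):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw  # e.g. a pickled payload: surfaced as-is, never guessed
    if isinstance(text, str):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            return raw
    return raw


def parse_probe_output(stdout):
    """Parse probe stdout; non-JSON output is an error, not an empty result."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise XComReadError(f"probe returned non-JSON output: {exc}") from exc
    if not isinstance(payload, list):  # assumption: rows arrive as a JSON list
        raise XComReadError("unexpected probe payload (expected a list of rows)")
    return payload


def main(stdout):
    """Return the exit code: 1 on any error, 0 otherwise."""
    try:
        rows = parse_probe_output(stdout)
    except XComReadError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    if not rows:
        # Empty match is a hint, not an error (exit code 0 is an assumption).
        print("no matching XCom rows; widen with --list-runs")
        return 0
    for row in rows:
        print(json.dumps(row))
    return 0
```

Keeping the empty-match branch separate from the exception path is what stops a transport failure from being mistaken for "no data".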
2. new_worktree.sh: what it wires
new_worktree.sh <branch> [dir] [base]
├─ git worktree (shared .git, new branch from <base> or attach existing)
├─ symlinks: .venv_airflow, .env, .claude/settings.json + settings.local.json
└─ symlink: ~/.claude/projects/<worktree>/memory → primary's memory (shared handoffs)
A shared `.git` means a push on one side is instantly visible on the other; symlinks mean a single source of truth for secrets + permissions + the cross-session memory channel. Tracked artefacts (skills, hooks, ADRs, `CLAUDE.md`) follow the branch for free.
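The shipped tool is a shell script; purely to illustrate the wiring, the same steps can be sketched in Python. The function names are hypothetical, the memory-directory symlink is left out, and silently skipping paths that do not exist in the primary checkout is an assumption of this sketch rather than documented behaviour.

```python
from pathlib import Path

# Gitignored paths shared from the primary checkout (from the wiring diagram).
SHARED_PATHS = (
    ".venv_airflow",
    ".env",
    ".claude/settings.json",
    ".claude/settings.local.json",
)


def worktree_add_command(branch, directory, base=None):
    """Build the git argv: new branch from `base`, or attach an existing branch."""
    if base is not None:
        return ["git", "worktree", "add", "-b", branch, directory, base]
    return ["git", "worktree", "add", directory, branch]


def link_shared_env(primary, worktree):
    """Symlink each existing shared path from `primary` into `worktree`.

    Returns the list of links created; paths already present are left alone.
    """
    primary, worktree = Path(primary), Path(worktree)
    created = []
    for rel in SHARED_PATHS:
        target = primary / rel
        link = worktree / rel
        if not target.exists() or link.exists() or link.is_symlink():
            continue  # nothing to share, or already wired
        link.parent.mkdir(parents=True, exist_ok=True)
        link.symlink_to(target.resolve())
        created.append(link)
    return created
```

Because the second call finds the links already in place, re-running the wiring is a no-op.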
Validation done
- `airflow_xcom_pull.py`: read-only `SELECT`-only probe; fail-fast paths exercised by construction review. Cluster dogfood (pull a real `s42` trajectory) is the acceptance step before close.
- `new_worktree.sh`: dogfooded live; this session runs in the `champollion-dev2` worktree the script provisions (shared venv/env/settings/memory all resolved).
- Docs build: `mkdocs build --strict` must be green with the two new Operations pages (run before push).
Questions for the committee
- Q1: Is the in-pod `kubectl exec` + embedded `SELECT` the right pattern for a read-only XCom reader, given the DB is on a private VPC IP? Any safer/cleaner path (e.g. an Airflow REST/CLI surface) we're missing?
- Q2: Is "read-only by construction" sufficiently guaranteed (single `SELECT` string, no DDL/DML branch), or should we add an explicit guard / test asserting no write verb can be issued?
- Q3: Is the SCRIPT (not SKILL) split correct for both deliverables, consistent with the S02 loki-query precedent? (A SKILL wrapping the reader is tracked separately.)
- Q4: Any blind spots in `new_worktree.sh`: the shared `git stash` stack, `.git/config` identity sharing, or the cluster-run serialisation risk between two live devs?
- Q5: Is the `CLAUDE.md` ADR-catalogue trim (inline list → invariants + SSoT pointer) aligned with ADR-77 (docs/ADR.md as SSoT), or does removing the inline enumeration lose operator value?