
Plan dossier — CVN-N014-EC-S16 : dev-productivity tooling (SCRIPTS)

Story: CVN-N014-EC-S16 · OP wp#240 · GH #1103 · Epic CVN-N014-EC (wp#205, GH #1027)
Type: implementation dossier (ADR-68), submitted for committee review. Tooling tier of the agentic dev-workflow epic.
Date: 2026-06-05
PR: #1102, chore/airflow-xcom-reader (all checks GREEN, mergeable=MERGEABLE, CodeRabbit pass; not merged).
Status: implementation complete and green on the PR branch; OP wp#240 New. Per operator decision (2026-06-05), this Story takes the full canonical documentation set (ADR-0101, shipped by CVN-N014-EC-S17), not the scripts-only carve-out: hub · plan (this) · architecture · runbook · test strategy.
Dogfood: S16 is the first pre-existing tooling Story retrofitted to the universal standard.

Part I — Dossier (problem framing → consolidation)

Chapter 1 — Problem framing (jargon-free)

In one sentence: two daily frictions in the dev workflow had no tooling (reading a DAG's result locally, and having two "developers" work in parallel on the repo), and everyone worked around them by hand.

  1. Reading a diagnostic DAG's result from one's workstation. The s40/s41/s42 DAGs return their trajectory (AUC per draw, deltas, verdicts) via XCom, a small locker internal to Airflow. But that locker lives in a database on a private network (VPC) that cannot be reached from a laptop. Loki (the logs) carries only the status, not the trajectory. Result: there was no path at all for reading what a run had returned; people guessed, or reran it.
  2. Running a second Claude "developer". Working on two fronts in the same repo was done by hand: a second git clone duplicates everything, drifts on fetch, and loses all the local secrets (venv, .env, settings). No procedure, hence errors.

The fix: two tools. A read-only XCom reader (which queries the database from inside the cluster, without exposing it) and a "second developer" bootstrap (a git worktree plus symlinks to the local environment). No more makeshift workarounds, no more wrong targets.

Chapter 2 — User stories implemented

| # | As a… | I want… | so that… |
| --- | --- | --- | --- |
| US-1 | diagnostic maintainer | to read the XCom trajectory of an s42 run from my workstation | I can analyse the result without rerunning or guessing (which Loki does not provide) |
| US-2 | operator | this reader to be read-only by construction (never a write to the database) | I can use it without risk on the prod database |
| US-3 | dev | to launch a second Claude workspace on another branch in one command | I can work on two fronts without recloning or reconfiguring |
| US-4 | dev | the second workspace to share the first one's venv/secrets/settings/memory | it is a true working clone, not an empty shell to reconfigure |

Chapter 3 — Hypotheses (EN)

| # | Hypothesis | Test method | Result |
| --- | --- | --- | --- |
| H1 | An in-pod SELECT (via kubectl exec) reads XCom without exposing the DB and without any write. | A _REMOTE_PROBE issuing only SELECT + a read-only guard (no DDL/DML verb), locked by unit tests; verified in-pod. | Confirmed: read-only-by-construction guard + 23 unit tests (post-committee Q2). |
| H2 | A git worktree + symlinks makes a second workspace a faithful clone of the first (venv/.env/.claude/memory). | new_worktree.sh dogfooded: this very dev2 worktree runs on it. | Confirmed: dev2 session ran the whole work on the provisioned worktree. |
| H3 | The shared git stash stack is a real worktree foot-gun (not theoretical). | Operator review (Q4) flagged it; realised live this session (git stash -u swallowed the dev2 venv symlink). | Confirmed: mitigated by guidance + per-worktree identity (post-committee Q4). |
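
To make H1's "read-only guard (no DDL/DML verb)" concrete, here is a minimal sketch of what such a guard could look like. It is not the guard shipped in airflow_xcom_pull.py: the function name assert_read_only, the deny-list WRITE_VERBS and the single-statement rule are all assumptions made for illustration.

```python
import re

# Hypothetical deny-list: the dossier only says "no DDL/DML verb",
# so the exact verbs checked by the real guard are an assumption here.
WRITE_VERBS = (
    "INSERT", "UPDATE", "DELETE", "MERGE", "CREATE",
    "ALTER", "DROP", "TRUNCATE", "GRANT", "REVOKE",
)
_WRITE_VERB_RE = re.compile(r"\b(" + "|".join(WRITE_VERBS) + r")\b", re.IGNORECASE)
_SELECT_RE = re.compile(r"^\s*SELECT\b", re.IGNORECASE)


def assert_read_only(sql: str) -> str:
    """Return the normalised statement, or raise ValueError if it could write."""
    statement = sql.strip().rstrip(";").strip()
    # A semicolon left after stripping the trailing one means several statements.
    if ";" in statement:
        raise ValueError("refusing multi-statement SQL")
    if not _SELECT_RE.match(statement):
        raise ValueError("refusing SQL that does not start with SELECT")
    hit = _WRITE_VERB_RE.search(statement)
    if hit:
        raise ValueError(f"refusing SQL containing write verb {hit.group(1).upper()}")
    return statement
```

The check is deliberately conservative: a SELECT whose string literal happens to contain a write verb is rejected too, which is acceptable when the only caller is a fixed, embedded query.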

Chapter 4 — State of the art

Airflow XCom is the metadata-DB message channel; reading it programmatically usually means the REST API or the DB — both impractical here (private VPC, value-return disabled). The idiomatic safe path is an in-cluster read (a pod already holds the connection + reachability), mirroring the project's loki-query pattern. Git worktrees are the standard way to check out multiple branches of one repository without a second clone (shared object store, separate working trees); the "second-developer"/parallel-agent pattern layers environment symlinking on top. References: Apache Airflow XCom + REST API docs; git worktree documentation; internal loki-query (CVN-N014-EC-S02) as the read-only-CLI precedent.

Chapter 5 — Definition of Done

  • scripts/airflow_xcom_pull.py — read-only XCom reader (in-pod SELECT, read-only guard) + 23 unit tests.
  • scripts/new_worktree.sh — second-developer bootstrap (worktree + env symlinks + per-worktree identity).
  • Process docs live: AIRFLOW_XCOM_READER.md + PARALLEL_DEV_WORKTREES.md.
  • Committee blockers addressed (read-only guard Q2; git-stash mitigation + per-worktree identity Q4).
  • Full canonical doc set (ADR-0101): hub · plan · architecture · runbook · test strategy → check_story_docs CONFORMANT.
  • PR #1102 merged + docs live (the → Developed gate).
  • Operational readiness (process-equivalent, Inv 4): usage + gotchas in the runbook; rollback = revert the scripts (no runtime impact).

Chapter 6 — Consolidation & traceability

| Problem thread | Hypothesis | US | Deliverable |
| --- | --- | --- | --- |
| can't read XCom locally | H1 | US-1/US-2 | airflow_xcom_pull.py + AIRFLOW_XCOM_READER.md |
| can't run a 2nd dev | H2/H3 | US-3/US-4 | new_worktree.sh + PARALLEL_DEV_WORKTREES.md |

Decision routing: read-only-safe + dogfooded → ship. No dangling thread.


Part II — Technical plan (self-contained design)

Problem

Two recurring frictions in the diagnostic / multi-dev workflow, neither solvable from a laptop today:

  1. Diagnostic XCom is unreadable locally. DAGs s40/s41/s42 return their per-cell trajectory (auc_by_seed, deltas_per_point, verdict dicts) via XCom, not Loki — the s42_* log events carry status, not the AUC-vs-point trajectory the HP-swap step-3 analysis needs (hp_swap_deliverable_plan.md §9). Loki cannot answer "is the optimum still at point X across folds"; XCom can. But the Airflow metadata DB lives on a private Scaleway VPC IP unreachable from a laptop — there was no read path at all.
  2. No clean way to run a second Claude "developer". Parallel work on the repo was being done ad-hoc; a second git clone duplicates .git, drifts on fetch, and loses every gitignored local secret (venv, .env, .claude settings, shared memory).

Approach

Two deterministic, operator-usable SCRIPTS (not skills: no agent judgment; the SCRIPT/SKILL split mirrors S02 loki-query), plus a documentation trim (item 3). A follow-up SKILL wrapping the XCom reader (mirror of the loki-query skill over loki_query.py) is tracked separately and out of scope here.

  1. scripts/airflow_xcom_pull.py — read-only XCom reader. Ships a tiny embedded probe and runs it in-pod via kubectl exec (the scheduler pod already holds the conn string and can reach the DB). Only SQL issued is SELECT; credentials never leave the cluster; fail-fast on any error (ADR-25), never invents an empty result. Mirrors loki_query.py's shape and ergonomics.
  2. scripts/new_worktree.sh — second-developer bootstrap. Creates a git worktree (shared .git, no second clone) on a dedicated branch and symlinks the gitignored local environment (.venv_airflow, .env, .claude/settings*.json) plus the keyed auto-memory dir, so the second session is a faithful clone of the first.
  3. CLAUDE.md trim — replaced the inline ADR-1..92 enumeration with the load-bearing invariants + a pointer to documentation/ADR.md (the SSoT, ADR-77). No behavioural change; reduces drift between the inline list and the source of truth.
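
The transport described in item 1 (a probe run in-pod over kubectl exec) can be sketched as follows. This is an illustration under stated assumptions, not the shipped script: PROBE_SOURCE is a placeholder for the real embedded probe, the helper names build_exec_argv and run_probe are hypothetical, and passing the request on stdin to `python -c` inside the pod is one plausible wiring among several.

```python
import json
import subprocess

# Placeholder for the embedded probe; the real probe source is not
# reproduced in this dossier. This stub reads a JSON request on stdin
# and prints an empty JSON result.
PROBE_SOURCE = (
    "import json, sys; request = json.load(sys.stdin); "
    "print(json.dumps({'rows': []}))"
)


def build_exec_argv(pod, namespace="cvntrade", container="scheduler"):
    """Build the kubectl exec command line; nothing is executed here."""
    return [
        "kubectl", "exec", "-i",      # -i keeps stdin open for the JSON request
        "-n", namespace, pod,
        "-c", container,
        "--", "python", "-c", PROBE_SOURCE,
    ]


def run_probe(request, pod, namespace="cvntrade", container="scheduler", timeout=60):
    """Send the JSON request to the in-pod probe and return the raw result."""
    return subprocess.run(
        build_exec_argv(pod, namespace, container),
        input=json.dumps(request),    # only the request crosses into the pod
        capture_output=True,
        text=True,
        timeout=timeout,              # 60s, as stated under Risks & mitigations
        check=False,                  # the caller decides how to fail
    )
```

Keeping the argv construction separate from the subprocess call makes the command line unit-testable without a cluster.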

Files

  • scripts/airflow_xcom_pull.py (new)
  • scripts/new_worktree.sh (new)
  • CLAUDE.md (trim — ADR catalogue → invariants + SSoT pointer)
  • documentation/process/AIRFLOW_XCOM_READER.md (new — usage note, mirrors LOKI_QUERY.md)
  • documentation/process/PARALLEL_DEV_WORKTREES.md (new — usage note + worktree gotchas)
  • mkdocs.yml (nav — two entries under Operations)
  • documentation/reviews/2026-06-05-cvn-n014-ec-s16-dev-productivity-tooling-plan.md (this dossier)

Risks & mitigations

  • A write path slips into the "read-only" reader → only SQL string is a SELECT; the in-pod probe takes no DDL/DML branch; documented + asserted in the usage note as read-only by construction.
  • Credential leakage to the laptop → conn string is read from the pod env and never printed; the script transports only the JSON request + JSON result over kubectl exec stdout.
  • Silent empty result masking a real error → fail-fast (ADR-25): no Running pod / non-zero exec / missing conn / non-JSON output each raise with a clear message + non-zero exit. An empty match (vs error) prints a "widen with --list-runs" hint, distinct from an error.
  • Worktree foot-guns (same branch in two worktrees, shared git stash stack, stale local main, concurrent cluster runs) → enumerated as "Iron rules" in PARALLEL_DEV_WORKTREES.md; serialise-cluster-runs called out explicitly (shared max_active_runs=1 + ftf_config).
  • kubectl exec timeout / pod churn → 60s timeout on the exec; pod resolved by component=scheduler label with --pod override; --container overridable.
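
The fail-fast mitigation above distinguishes a real error from an empty match. A minimal sketch of that distinction follows. The {"rows": [...]} payload shape, the XComReadError class, the function names and the choice of rc=0 for an empty match are assumptions; the dossier only fixes rc=1 for errors and a "widen with --list-runs" hint for an empty match.

```python
import json
import sys


class XComReadError(RuntimeError):
    """Raised for any condition that must not be reported as an empty result."""


def interpret_exec_result(returncode, stdout, stderr):
    """Turn the raw kubectl exec outcome into rows, or raise XComReadError."""
    if returncode != 0:
        raise XComReadError(f"kubectl exec failed (rc={returncode}): {stderr.strip()}")
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise XComReadError("probe returned non-JSON output") from exc
    # Hypothetical payload shape: {"rows": [...]}.
    if not isinstance(payload, dict) or not isinstance(payload.get("rows"), list):
        raise XComReadError("probe output is missing the 'rows' list")
    return payload["rows"]


def report(returncode, stdout, stderr):
    """Return the process exit code: 1 on error, 0 otherwise (assumed)."""
    try:
        rows = interpret_exec_result(returncode, stdout, stderr)
    except XComReadError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    if not rows:
        # An empty match is a valid answer, so it gets a hint, not an error.
        print("no XCom rows matched; widen with --list-runs", file=sys.stderr)
        return 0
    print(json.dumps(rows, indent=2))
    return 0
```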

Success criteria

  • PR #1102 checks green + CodeRabbit clean (✅ met).
  • Docs live on docs.cvntrade.eu (two Operations pages) after merge — the → Developed gate.
  • Dogfood: airflow_xcom_pull.py pulls a real diagnostic__s42 XCom trajectory; new_worktree.sh already dogfooded (this very session runs in the champollion-dev2 worktree it bootstraps).
  • Standard PR / CR / committee cycle completed before merge.

Self-contained design (for committee — no code-reading required)

1. airflow_xcom_pull.py — CLI surface

| flag | default | meaning |
| --- | --- | --- |
| --dag-id | — | (required) DAG id, e.g. diagnostic__s42 |
| --run-id | | DAG run_id (required unless --list-runs) |
| --task-id / --map-index | | restrict to one task / one mapped cell |
| --key | return_value | XCom key |
| --list-runs | off | list recent runs of --dag-id and exit |
| --limit | 20 | max rows |
| --namespace / --selector / --container / --pod | cvntrade / component=scheduler / scheduler / — | pod targeting |
| --json | off | raw JSON dump |
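
For readers who prefer code to tables, the CLI surface above maps onto argparse roughly as follows. Flag names and defaults are taken from the table; the integer types for --map-index and --limit, and the use of parser.error for the "--run-id required unless --list-runs" rule, are assumptions rather than a description of the real script.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="airflow_xcom_pull.py",
        description="Read-only XCom reader (sketch of the CLI surface).",
    )
    parser.add_argument("--dag-id", required=True, help="DAG id, e.g. diagnostic__s42")
    parser.add_argument("--run-id", help="DAG run_id (required unless --list-runs)")
    parser.add_argument("--task-id", help="restrict to one task")
    parser.add_argument("--map-index", type=int, help="restrict to one mapped cell")
    parser.add_argument("--key", default="return_value", help="XCom key")
    parser.add_argument("--list-runs", action="store_true",
                        help="list recent runs of --dag-id and exit")
    parser.add_argument("--limit", type=int, default=20, help="max rows")
    # Pod targeting
    parser.add_argument("--namespace", default="cvntrade")
    parser.add_argument("--selector", default="component=scheduler")
    parser.add_argument("--container", default="scheduler")
    parser.add_argument("--pod", help="explicit pod name, bypassing the selector")
    parser.add_argument("--json", action="store_true", help="raw JSON dump")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Cross-flag rule from the table: --run-id is mandatory unless listing runs.
    if not args.list_runs and not args.run_id:
        parser.error("--run-id is required unless --list-runs is given")
    return args
```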

The load-bearing decision: in-pod SELECT over kubectl exec (not a local DB connection, not a write):

  • The metadata DB is on a private VPC IP → no local connection possible. The scheduler pod can reach it and already has the conn string.
  • The embedded probe (sqlalchemy + json, both in the Airflow image) issues exactly one SELECT against xcom/dag_run, JSON-decodes values when possible (pickled/non-JSON surfaced verbatim, never guessed), and prints JSON to stdout.
  • Read-only by construction: no DDL/DML path exists in the probe. The only side effect is the transient kubectl exec the script owns.
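
The "JSON-decode when possible, otherwise surface verbatim" rule stated above can be illustrated with a small helper. This is a sketch, not the probe's actual code: the function name decode_xcom_value, the {"decoded", "value"} result shape and the use of repr() for undecodable bytes are assumptions.

```python
import json


def decode_xcom_value(raw):
    """Decode a raw XCom cell as JSON if possible; never guess otherwise."""
    if raw is None:
        return {"decoded": False, "value": None}
    if isinstance(raw, (bytes, bytearray, memoryview)):
        data = bytes(raw)
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            # Typically a pickled payload: show the bytes as-is, do not unpickle.
            return {"decoded": False, "value": repr(data)}
    else:
        text = str(raw)
    try:
        return {"decoded": True, "value": json.loads(text)}
    except json.JSONDecodeError:
        # Valid text but not JSON: surface it verbatim.
        return {"decoded": False, "value": text}
```

Flagging each value as decoded or not lets the caller see at a glance which cells were parsed and which were passed through untouched.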

Error model: single fail-fast surface returning rc=1 with a clear message (ADR-25, no silent fallback) — distinct hint on an empty match vs. a real error.

2. new_worktree.sh — what it wires

new_worktree.sh <branch> [dir] [base]
  ├─ git worktree (shared .git, new branch from <base> or attach existing)
  ├─ symlinks: .venv_airflow, .env, .claude/settings.json + settings.local.json
  └─ symlink: ~/.claude/projects/<worktree>/memory → primary's memory (shared handoffs)
Load-bearing decision — worktree + symlinks, not a clone: one .git means a push on one side is instantly visible on the other; symlinks mean a single source of truth for secrets + permissions + the cross-session memory channel. Tracked artefacts (skills, hooks, ADRs, CLAUDE.md) follow the branch for free.
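
The wiring above can be sketched in a few lines. new_worktree.sh is a shell script; the Python below is only an illustration of the same idea in the language used for the other examples in this dossier. The helper names, the skip-if-missing and never-overwrite behaviour, and the way the `git worktree add` command line is chosen are assumptions, and the memory-directory symlink is left out.

```python
from pathlib import Path

# Gitignored paths shared from the primary checkout (taken from the tree above).
SHARED_PATHS = (
    ".venv_airflow",
    ".env",
    ".claude/settings.json",
    ".claude/settings.local.json",
)


def worktree_add_argv(branch, directory, base=None):
    """Command line for `git worktree add`; nothing is executed here."""
    if base is not None:
        # Create a new branch from <base> in the new worktree.
        return ["git", "worktree", "add", "-b", branch, str(directory), base]
    # Attach an existing branch.
    return ["git", "worktree", "add", str(directory), branch]


def link_shared_env(primary: Path, worktree: Path):
    """Symlink the shared local environment into the worktree."""
    linked = []
    for rel in SHARED_PATHS:
        src = primary / rel
        dst = worktree / rel
        if not src.exists():
            continue  # nothing to share for this entry (assumed behaviour)
        if dst.is_symlink() or dst.exists():
            continue  # never overwrite what is already in the worktree
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.symlink_to(src, target_is_directory=src.is_dir())
        linked.append(dst)
    return linked
```

Because existing entries are skipped, calling link_shared_env a second time is a no-op.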

Validation done

  • airflow_xcom_pull.py: read-only SELECT-only probe; fail-fast paths exercised by construction review. Cluster dogfood (pull a real s42 trajectory) is the acceptance step before close.
  • new_worktree.sh: dogfooded live — this session runs in the champollion-dev2 worktree the script provisions (shared venv/env/settings/memory all resolved).
  • Docs build: mkdocs build --strict must be green with the two new Operations pages (run before push).

Questions for the committee

  • Q1 — Is the in-pod kubectl exec + embedded SELECT the right pattern for a read-only XCom reader, given the DB is on a private VPC IP? Any safer/cleaner path (e.g. an Airflow REST/CLI surface) we're missing?
  • Q2 — Is "read-only by construction" sufficiently guaranteed (single SELECT string, no DDL/DML branch), or should we add an explicit guard / test asserting no write verb can be issued?
  • Q3 — Is the SCRIPT (not SKILL) split correct for both deliverables, consistent with the S02 loki-query precedent? (A SKILL wrapping the reader is tracked separately.)
  • Q4 — Any blind spots in new_worktree.sh: the shared git stash stack, .git/config identity sharing, or the cluster-run serialisation risk between two live devs?
  • Q5 — Is the CLAUDE.md ADR-catalogue trim (inline list → invariants + SSoT pointer) aligned with ADR-77 (docs/ADR.md as SSoT), or does removing the inline enumeration lose operator value?

Committee launch

python scripts/expert_committee.py \
  --artifact documentation/reviews/2026-06-05-cvn-n014-ec-s16-dev-productivity-tooling-plan.md \
  --question "<English question covering Q1–Q5>" \
  --session-type pr_review --issue "#1103"