Skip to content

ADR-0091 — Console/ftf_config keys a DAG resolves MUST be guaranteed-seeded by an automated fail-loud deploy mechanism

Status: active Date: 2026-05-17 Introduced by: CVN-N001-EF (MAYDAY remediation) / PR #986 + PR #988 (issues #985, #987) Supersedes: none — sharpens ADR-59 and ADR-90


Context

ADR-59 mandates that all pipeline parameters live in PostgreSQL ftf_config (Console-only). ADR-90 sharpens that for training hyperparameters (CVN_HPO_<MODEL>_<TF>_<PARAM>) and makes the canonical resolver fail-fast on a missing key (no silent fallback — ADR-25).

Merging the code that reads those keys does not populate the database. On 2026-05-16/17 this gap cost a full operator day : ADR-90's ~481 canonical hyperparameter keys had never been seeded into the production ftf_config.base_env, so diagnostic__s22_a1 (and structurally every FTF/diagnostic DAG) hard-failed on the resolver fail-fast (RuntimeError: Hyperparam CVN_HPO_LGB_5M_LEARNING_RATE not in Console). The "fix" had been a fragile manual operator chore (scripts/seed_hyperparams_console.py --apply) that no environment-bring-up checklist enforced and that was never run in prod.

PR #986 shipped an automated Helm post-install,post-upgrade hook that seeds the Console on every deploy. Its very first prod run also surfaced — loudly, at deploy time, not silently at DAG-run time — a wrong-secret env-wiring bug (the hook reached localhost:5432), which the fail-loud property correctly turned into a blocked deploy rather than a silently half-seeded Console (PR #988 fixed the wiring). That incident is itself the proof of why fail-loud is non-negotiable.

The forcing function : a DAG must never be one un-run manual script away from a hard failure on a fresh or freshly-deployed environment.

Decision

Any PostgreSQL ftf_config / Console key that a DAG or pipeline resolves at runtime MUST be guaranteed present in every environment by an automated, deploy-time mechanism — a Helm post-install,post-upgrade hook Job (or an equivalent gated deploy step) — that is:

  1. Automated : runs on every deploy with zero manual operator action. No DAG may depend on a manual seed step as a prerequisite.
  2. Idempotent : re-running on an already-seeded Console is a no-op.
  3. Insert-missing-only : it MUST NOT overwrite existing keys (no --force-overwrite in the automated path) — operator Console edits always win (ADR-59 authority preserved).
  4. Fail-loud : if the seed cannot complete, the deploy MUST fail (e.g. helm upgrade --wait on the hook). A silently half-seeded Console is forbidden (ADR-25).

The ADR-90 resolver fail-fast remains the last-line guard, NOT the seeding mechanism : a correctly-deployed environment must never reach it for a missing canonical key.

Invariants

  • INV-1: every canonical Console/ftf_config key class a DAG resolves is present in the automated seed payload (scripts/seed_hyperparams_console.py::build_seed_payload), pinned by a strict-count unit test (no loose >= threshold — a silent-regression hole).
  • INV-2: the seed executes automatically on every deploy via a fail-loud, idempotent, insert-missing-only mechanism ; a seed failure fails the deploy (never a half-seeded Console).
  • INV-3: the automated seed path never passes --force-overwrite (operator Console edits are never clobbered — ADR-59).
  • INV-4: no runbook, onboarding doc, or Story instructs an operator to run the seed manually as a DAG prerequisite.
  • INV-5: the seed/bootstrap script itself must not crash nor silent-fail — clean actionable single-line error (no raw traceback), a final structured event=seed_summary ... status=<OK|FAILED> always emitted, non-zero exit on real failure, and --dry-run runnable fully offline (cross-ref memory feedback_all_scripts_run_without_crash_or_silent_fail).

Alternatives rejected

  • Resolver self-bootstrap on first miss: re-introduces code-sourced config at runtime, muddying ADR-59 Console authority and ADR-90's Console-only contract.
  • Manual operator seed step (status quo pre-#986): the exact failure that cost a day ; relies on operator memory across environments = drift (cross-ref memory feedback_no_discipline_workflows).
  • Deploy-CI kubectl step instead of a Helm hook: weaker guarantee — a manual helm upgrade bypasses CI. The chosen mechanism is the Helm post-upgrade hook in the local cvntrade-runtime chart, gated by helm upgrade --wait.
  • Relaxing the ADR-90 resolver fail-fast to a silent default: violates ADR-25 ; would convert a loud bootstrap gap into silent wrong-config training.

Consequences

  • Positive: no environment can run a Console-dependent DAG against an unseeded Console ; wiring/credential bugs surface at deploy time (loud, blocked) instead of at DAG-run time (a lost day) ; operator never has a manual seed chore.
  • Negative: every future Console-dependent DAG must register its key class in the seed payload + strict-count test ; every deploy carries the seed hook (sub-second, idempotent).
  • Neutral: the ADR-90 resolver fail-fast stays as defence-in-depth (last-line guard).

Rollback

Emergency only : set the chart toggle ftfSeedHook.enabled=false (values, Helm = SSoT #378). This reverts to the manual-prerequisite failure mode and MUST be paired with a documented one-off manual scripts/seed_hyperparams_console.py --apply until the hook is re-enabled. Disabling the hook without that pairing re-opens exactly the #985 gap and is itself an INV-2 violation.

References

  • Sharpens : ADR-0059, ADR-0090. Defence-in-depth with ADR-0025 (no silent fallback).
  • Incident / implementation : Epic CVN-N001-EF, issues #985 / #987, PR #986 (auto-seed hook + crash-proof script) and PR #988 (hook DB-creds via superset-env, mirroring infra/k8s/console.yaml).
  • Plan dossier : documentation/reviews/2026-05-17-auto-seed-ftf-hyperparams-before-dag-plan.md (committee plan_review session 156408e2 PASSED) ; pr_review sessions ce33a169, b095f325 PASSED.
  • Memory rules : feedback_all_scripts_run_without_crash_or_silent_fail, feedback_no_discipline_workflows.