ADR-0091: Console/ftf_config keys a DAG resolves MUST be guaranteed-seeded by an automated fail-loud deploy mechanism
Status: active
Date: 2026-05-17
Introduced by: CVN-N001-EF (MAYDAY remediation) / PR #986 + PR #988 (issues #985, #987)
Supersedes: none — sharpens ADR-59 and ADR-90
Context
ADR-59 mandates that all pipeline parameters live in PostgreSQL ftf_config (Console-only). ADR-90 sharpens that for training hyperparameters (CVN_HPO_<MODEL>_<TF>_<PARAM>) and makes the canonical resolver fail-fast on a missing key (no silent fallback — ADR-25).
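The fail-fast contract can be illustrated with a minimal sketch. It assumes the Console can be viewed as a plain key/value mapping; the function names `hpo_key` and `resolve_hyperparam` are hypothetical and are not the project's canonical resolver. Only the key pattern and the error wording come from this ADR.

```python
from __future__ import annotations


def hpo_key(model: str, timeframe: str, param: str) -> str:
    """Build a key following the CVN_HPO_<MODEL>_<TF>_<PARAM> pattern."""
    return f"CVN_HPO_{model}_{timeframe}_{param}".upper()


def resolve_hyperparam(
    console: dict[str, str], model: str, timeframe: str, param: str
) -> str:
    """Return the Console value, or raise: there is no code-side default."""
    key = hpo_key(model, timeframe, param)
    try:
        return console[key]
    except KeyError:
        # Fail fast instead of silently falling back (ADR-25).
        raise RuntimeError(f"Hyperparam {key} not in Console") from None
```

With an empty mapping, `resolve_hyperparam({}, "lgb", "5m", "learning_rate")` raises a `RuntimeError` naming the missing key, which is the failure mode described in the incident.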
Merging the code that reads those keys does not populate the database. On 2026-05-16/17 this gap cost a full operator day: ADR-90's ~481 canonical hyperparameter keys had never been seeded into the production ftf_config.base_env, so diagnostic__s22_a1 (and structurally every FTF/diagnostic DAG) hard-failed on the resolver fail-fast (RuntimeError: Hyperparam CVN_HPO_LGB_5M_LEARNING_RATE not in Console). The only "fix" was a fragile manual operator chore (scripts/seed_hyperparams_console.py --apply) that no environment bring-up checklist enforced and that had never been run in prod.
PR #986 shipped an automated Helm post-install,post-upgrade hook that seeds the Console on every deploy. Its very first prod run surfaced a wrong-secret env-wiring bug (the hook reached localhost:5432) loudly at deploy time rather than silently at DAG-run time. The fail-loud property correctly turned that bug into a blocked deploy instead of a silently half-seeded Console (PR #988 fixed the wiring). That incident is itself the proof that fail-loud is non-negotiable.
The forcing function: a DAG must never be one un-run manual script away from a hard failure on a fresh or freshly-deployed environment.
Decision
Any PostgreSQL ftf_config / Console key that a DAG or pipeline resolves at runtime MUST be guaranteed present in every environment by an automated, deploy-time mechanism — a Helm post-install,post-upgrade hook Job (or an equivalent gated deploy step) — that is:
- Automated: runs on every deploy with zero manual operator action. No DAG may depend on a manual seed step as a prerequisite.
- Idempotent: re-running on an already-seeded Console is a no-op.
- Insert-missing-only: it MUST NOT overwrite existing keys (no --force-overwrite in the automated path); operator Console edits always win (ADR-59 authority preserved).
- Fail-loud: if the seed cannot complete, the deploy MUST fail (e.g. helm upgrade --wait on the hook). A silently half-seeded Console is forbidden (ADR-25).
The ADR-90 resolver fail-fast remains the last-line guard, NOT the seeding mechanism: a correctly-deployed environment must never reach it for a missing canonical key.
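The idempotent, insert-missing-only behaviour can be sketched as follows. This is an illustration under stated assumptions, not the project's seed script: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs offline, and the table layout (`ftf_config` with `cfg_key` / `cfg_value` columns) and the function name `seed_missing` are hypothetical. The `ON CONFLICT ... DO NOTHING` clause is also valid PostgreSQL syntax.

```python
from __future__ import annotations

import sqlite3


def seed_missing(conn: sqlite3.Connection, payload: dict[str, str]) -> int:
    """Insert only the keys that are absent; return how many rows were added."""
    count_sql = "SELECT COUNT(*) FROM ftf_config"
    before = conn.execute(count_sql).fetchone()[0]
    # One transaction: commit on success, roll back on any exception, so a
    # failed seed never leaves a half-seeded table behind.
    with conn:
        conn.executemany(
            "INSERT INTO ftf_config (cfg_key, cfg_value) VALUES (?, ?) "
            "ON CONFLICT(cfg_key) DO NOTHING",  # existing rows are left untouched
            list(payload.items()),
        )
    after = conn.execute(count_sql).fetchone()[0]
    return after - before
```

Running `seed_missing` twice with the same payload inserts rows on the first call and returns 0 on the second, and a value an operator has already edited is left as it is.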
Invariants
- INV-1: every canonical Console/ftf_config key class a DAG resolves is present in the automated seed payload (scripts/seed_hyperparams_console.py::build_seed_payload), pinned by a strict-count unit test (no loose >= threshold, which would be a silent-regression hole).
- INV-2: the seed executes automatically on every deploy via a fail-loud, idempotent, insert-missing-only mechanism; a seed failure fails the deploy (never a half-seeded Console).
- INV-3: the automated seed path never passes --force-overwrite (operator Console edits are never clobbered, per ADR-59).
- INV-4: no runbook, onboarding doc, or Story instructs an operator to run the seed manually as a DAG prerequisite.
- INV-5: the seed/bootstrap script itself must neither crash nor fail silently: it prints a clean, actionable single-line error (no raw traceback), always emits a final structured event=seed_summary ... status=<OK|FAILED> line, exits non-zero on real failure, and keeps --dry-run runnable fully offline (cross-ref memory feedback_all_scripts_run_without_crash_or_silent_fail).
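The INV-5 contract can be sketched as a small wrapper around the seeding step. This is an assumed shape, not the real script: the name `run_seed` and the `inserted` and `dry_run` fields in the summary line are invented for illustration, since this ADR only fixes the `event=seed_summary` prefix and the `status=<OK|FAILED>` suffix.

```python
import sys
from typing import Callable


def run_seed(seed: Callable[[], int], dry_run: bool = False) -> int:
    """Run `seed`, always emit one summary line, and return an exit code."""
    status, inserted = "OK", 0
    try:
        if not dry_run:  # a dry run never touches the database
            inserted = seed()
    except Exception as exc:  # deliberately broad: no raw traceback may escape
        status = "FAILED"
        # Collapse the message to a single actionable line.
        reason = " ".join(f"{type(exc).__name__}: {exc}".split())
        print(f"ERROR seed failed: {reason}", file=sys.stderr)
    # The summary is emitted on every path, success or failure.
    print(f"event=seed_summary inserted={inserted} dry_run={dry_run} status={status}")
    return 0 if status == "OK" else 1
```

A script entry point would pass the result to `sys.exit`, so that a failed seed gives the hook Job a non-zero exit code and the deploy fails.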
Alternatives rejected
- Resolver self-bootstrap on first miss: re-introduces code-sourced config at runtime, muddying ADR-59 Console authority and ADR-90's Console-only contract.
- Manual operator seed step (status quo pre-#986): the exact failure that cost a day; relies on operator memory across environments, which means drift (cross-ref memory feedback_no_discipline_workflows).
- Deploy-CI kubectl step instead of a Helm hook: weaker guarantee, since a manual helm upgrade bypasses CI. The chosen mechanism is the Helm post-upgrade hook in the local cvntrade-runtime chart, gated by helm upgrade --wait.
- Relaxing the ADR-90 resolver fail-fast to a silent default: violates ADR-25; would convert a loud bootstrap gap into silent wrong-config training.
Consequences
- Positive: no environment can run a Console-dependent DAG against an unseeded Console; wiring/credential bugs surface at deploy time (loud, blocked) instead of at DAG-run time (a lost day); the operator never has a manual seed chore.
- Negative: every future Console-dependent DAG must register its key class in the seed payload and the strict-count test; every deploy carries the seed hook (sub-second, idempotent).
- Neutral: the ADR-90 resolver fail-fast stays as defence-in-depth (last-line guard).
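Registering a key class and pinning it with a strict count might look like the sketch below. The payload builder here is a toy stand-in for scripts/seed_hyperparams_console.py::build_seed_payload: the `1H` timeframe, the `NUM_LEAVES` parameter, the placeholder values, and the constant `EXPECTED_KEY_COUNT` are all hypothetical, and the toy count of 4 bears no relation to the real payload size.

```python
from __future__ import annotations

from itertools import product

MODELS = ("LGB",)
TIMEFRAMES = ("5M", "1H")               # "1H" is a hypothetical example
PARAMS = ("LEARNING_RATE", "NUM_LEAVES")  # "NUM_LEAVES" is hypothetical

# Updated by hand whenever a key class is added or removed.
EXPECTED_KEY_COUNT = 4


def build_seed_payload() -> dict[str, str]:
    """Toy stand-in: one key per (model, timeframe, param) combination."""
    return {
        f"CVN_HPO_{m}_{tf}_{p}": "placeholder"
        for m, tf, p in product(MODELS, TIMEFRAMES, PARAMS)
    }


def test_seed_payload_strict_count() -> None:
    payload = build_seed_payload()
    # Strict equality: with a loose >= threshold, dropping one key class while
    # adding another would still pass, hiding the regression.
    assert len(payload) == EXPECTED_KEY_COUNT
```

The design point is that any change to the payload forces a deliberate edit of the expected count, so a key class cannot disappear unnoticed.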
Rollback
Emergency only: set the chart toggle ftfSeedHook.enabled=false (values, Helm = SSoT #378). This reverts to the manual-prerequisite failure mode and MUST be paired with a documented one-off manual scripts/seed_hyperparams_console.py --apply until the hook is re-enabled. Disabling the hook without that pairing re-opens exactly the #985 gap and is itself an INV-2 violation.
References
- Sharpens: ADR-0059, ADR-0090. Defence-in-depth with ADR-0025 (no silent fallback).
- Incident / implementation: Epic CVN-N001-EF, issues #985 / #987, PR #986 (auto-seed hook + crash-proof script) and PR #988 (hook DB-creds via superset-env, mirroring infra/k8s/console.yaml).
- Plan dossier: documentation/reviews/2026-05-17-auto-seed-ftf-hyperparams-before-dag-plan.md (committee plan_review session 156408e2 PASSED); pr_review sessions ce33a169, b095f325 PASSED.
- Memory rules: feedback_all_scripts_run_without_crash_or_silent_fail, feedback_no_discipline_workflows.