Skip to content

ADR-0090 — All training hyperparameters live in PG ftf_config (Console-only), no in-code defaults beyond explicit fallback with WARN

Status: active Date: 2026-05-11 Introduced by: CVN-N001-EE-S17 (validation post-ADR-89, post-harness baseline regression incident 2026-05-11) Supersedes: none ; sharpens ADR-59 (all params in PG ftf_config — was too generic, did not enforce against in-code defaults) Related: ADR-25 (no silent fallback), ADR-30/32/38 (structured logs), ADR-56 (FTF factor + env var gating), ADR-59 (params in PG ftf_config), ADR-89 (training harness as plugin registry — introduced the violation this ADR closes)


Context

On 2026-05-11 the operator observed f1_buy regression from ~0.42 (pre-harness May 1-3) to ~0.22 (post-harness May 9-11) on defi_top5 5m / ATR0.5_1.5_H4. Diagnostic root-cause analysis (committee session a860565e, OP Meeting #129) identified 18 hyperparameter divergences silently introduced during the harness migration (PRs #891 #896 #899 #901, ADR-89) :

  • LightGBM defaults : max_depth -1 (vs legacy 6), learning_rate 0.05 (vs 0.1), reg_lambda 0.0 (vs 1.0), subsample 0.9 (vs 0.8), colsample_bytree 0.9 (vs 0.8)
  • XGBoost defaults : timeframe-agnostic (vs legacy timeframe-aware GRID_DEFAULT_HP['5m'/'15m'/'1h'/...] ranges with learning_rate 0.07-0.13, max_depth 7-9 per timeframe)
  • CatBoost defaults : learning_rate 0.05 (vs legacy 0.1)
  • All 3 models : early_stopping_rounds hardcoded to 50 (LGB+CB) or None (XGB) ignoring the Console env CVN_EARLY_STOPPING_ROUNDS=150
  • HPO search spaces too wide : learning_rate (0.01, 0.3, log=True) for LGB vs legacy timeframe-narrow (0.05, 0.15)

The validation experiment for this regression (2026-05-11) attempted to patch the broken defaults in-code (LGB/XGB/CB DAG files) and deploy via Option Z (helm upgrade to an experimental image tag). Two flaws of that approach surfaced immediately :

  1. Defaults are dead code when HPO is active — Optuna's suggest_* calls override every patched default. With CVN_DEFAULT_N_TRIALS=10, every training selects HPO-suggested values from the (too-wide) search space. The in-code defaults are read only when no HPO trial provides a value — i.e., effectively never under normal FTF operation. The operator observed live Loki event=training_started learning_rate=0.016149 even after the patched-defaults image was deployed.
  2. Every parameter change requires a redeploy — the validation experiment alone produced 2 commits, 2 CI image builds (~10 min each), 2 helm upgrades, and consumed >1 hour of operator time before any training result was available. For an ML system whose entire value proposition is rapid hypothesis iteration, this is structurally untenable.

The deeper issue, restated by the operator on 2026-05-11 : "Règle 1 changement de paramètre ne devrait jamais nécessiter un nouveau déploiement. Au fur et à mesure du projet, tout doit être dans PG et modifiable par la console — on doit à terme pouvoir faire de la gestion des paramètres dynamiques."

ADR-59 had already stated "All params in PostgreSQL" but did not enforce it for training hyperparameters, did not specify the fallback contract, and did not provide a CI gate. ADR-25 (no silent fallback) was nominally applicable but was bypassed by dict.get(key, default) patterns which silently use defaults when env keys are absent.

Decision

Every numeric hyperparameter used during model training (defaults AND HPO search ranges) is read at runtime from ftf_config.base_env via an explicit env var, and the canonical resolver fails fast on absence with a Loki-queryable WARN event when a code-side fallback is applied. There is no silent default.

The five enforcement clauses below are non-negotiable.

Clause 1 — Canonical env var naming convention

All training hyperparams use the prefix CVN_HPO_<MODEL>_<TIMEFRAME>_<PARAM> where : - MODEL{XGB, LGB, CB} (uppercase) - TIMEFRAME{1M, 5M, 15M, 30M, 1H} (uppercase, no separator) - PARAM is the parameter name in uppercase with underscores (e.g., LEARNING_RATE, MAX_DEPTH, REG_LAMBDA, N_ESTIMATORS)

HPO ranges use the suffix _RANGE_MIN / _RANGE_MAX / _RANGE_SCALE (the last is "linear" or "log").

Examples : - CVN_HPO_LGB_5M_LEARNING_RATE = "0.1" (default for HPO trial=0 / no-HPO mode) - CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_MIN = "0.05" - CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_MAX = "0.15" - CVN_HPO_LGB_5M_LEARNING_RATE_RANGE_SCALE = "linear" - CVN_HPO_XGB_15M_MAX_DEPTH = "8" - CVN_HPO_XGB_15M_MAX_DEPTH_RANGE_MIN = "6" - CVN_HPO_XGB_15M_MAX_DEPTH_RANGE_MAX = "10"

Cross-timeframe constants that legitimately don't vary (e.g. random_state=42) use the timeframe-less form CVN_HPO_<MODEL>_<PARAM> — but every such case MUST be documented in the relevant DAG module docstring.

Clause 2 — Canonical resolver with fail-fast + WARN fallback

A single helper commun.finetune.hyperparams.resolve(model_type, timeframe, param_name) is the sole authoritative reader. Its contract :

def resolve(model_type: str, timeframe: str, param_name: str, fallback: Any | None = None) -> Any:
    """Read a hyperparameter from the env (ftf_config-mediated).

    Behavior :
        - Construct the canonical env key (Clause 1).
        - If env key is set : parse to the expected type, return.
        - If env key is NOT set AND fallback is None :
            raise RuntimeError(f"HP {key} not in Console — set via ftf_config.base_env per ADR-90")
        - If env key is NOT set AND fallback is not None :
            emit `log_event WARN event=hpo_fallback_applied model=<m> timeframe=<tf> param=<p> fallback=<v>
                              reason=env_key_missing key=<key>`
            return fallback

    NEVER use dict.get(key, default) on hpo_params for hyperparams in scope.
    NEVER read os.environ[KEY] directly without going through resolve().
    """

The fallback parameter exists as a transition-window concession : the first PR cannot externalize all 200+ env vars atomically, so a fallback path is necessary while Console seeding catches up. Each fallback occurrence emits a structured warn event, queryable in Grafana to identify which params still need Console seeding. Once Console seeding is complete (within 1 sprint of the externalization PR landing), the fallback parameter MUST be dropped from every callsite and a follow-up ADR amendment may upgrade Clause 2 to fallback parameter forbidden.

Clause 3 — CI grep gate (mechanical enforcement)

A CI step in .github/workflows/ci.yml runs the following grep against src/training/ + src/commun/finetune/. Any match fails the build :

# Numeric literals on the right side of hyperparam-looking assignments
grep -rn -E '(learning_rate|max_depth|reg_alpha|reg_lambda|subsample|colsample|min_child|n_estimators|num_leaves|gamma|l2_leaf_reg|depth|iterations|early_stopping_rounds)[^=]*=\s*[0-9]' \
  src/training/ src/commun/finetune/

# Direct os.environ reads (must go through resolve())
grep -rn -E 'os\.(environ|getenv).*CVN_HPO_' src/training/ src/commun/finetune/ \
  | grep -v 'hyperparams\.py'  # the resolver itself

Test code (tests/) is excluded from the grep — tests legitimately use literal values for fixture inputs.

The CI step name is exactly Story workflow guardrails (G5) to match the G1-G4 naming pattern of 0087-story-phase-test-integration.md. Failure messages reference this ADR and link to the resolver docstring.

Clause 4 — No silent fallback (ADR-25 restated for HPs)

The ADR-25 ban on silent fallbacks applies bit-for-bit to hyperparameter resolution : - dict.get(key, default) patterns on hpo_params for hyperparams in scope = ADR-90 violation. - os.environ.get(KEY, default) for CVN_HPO_* keys outside the resolver = ADR-90 violation. - A fallback path WITHOUT an emitted WARN event = ADR-90 violation. - A try/except that swallows a hyperparameter resolution failure = ADR-90 violation.

Operator audit query for fallback usage (Loki) :

{namespace="cvntrade"} |~ "event=hpo_fallback_applied" | line_format "{{.param}} on {{.timeframe}} fallback={{.fallback}}"

If this query returns non-zero entries 7 days post-externalization-PR-merge, the operator MUST open a Story to complete Console seeding for the missing keys.

Clause 5 — Console seeding script + audit panel

The externalization PR (CVN-N001-EE-S17) ships :

  1. A one-shot seeding script scripts/seed_hyperparams_console.py that bulk-inserts the ~200 canonical default values (legacy values when available, sane sources documented otherwise) into ftf_config.base_env. The operator runs this once post-deploy. The script is idempotent : re-running with existing keys is a no-op (no clobber).
  2. A Grafana panel cvntrade-hp-coverage showing count(CVN_HPO_* keys in ftf_config) / expected_total = X%. Threshold WARN at <100% (means Console is missing keys → fallback path active for some HPs). The threshold flips to CRIT 7 days post-merge if still <100%.
  3. A Console UI section Hyperparameters (separate tab, ugly-but-functional UI per operator 2026-05-11 decision : "cela va alourdir son UI, I know, mais le sprint d'optimisation de la console sera traité +tard") listing the ~200 keys grouped by <MODEL>_<TIMEFRAME> with current value + last-modified-at metadata. A "reset to legacy default" button per row.

Consequences

Positive

  • Hypothesis iteration cycle drops from ~30 min (rebuild + helm upgrade) to ~10 seconds (Console edit + DAG re-trigger). This single change reclaims the ML iteration velocity the harness migration silently destroyed.
  • The Console becomes the canonical record of training behavior — git history shows code structure, ftf_config history shows what trained the live champion. Both are versioned, both are auditable.
  • CI grep gate makes the rule mechanically enforced, removing reviewer cognitive load on every PR.
  • event=hpo_fallback_applied Loki query gives the operator a continuous fallback-debt audit — they always know how complete the Console coverage is.
  • Per-timeframe XGB hyperparams are restored (legacy had it, harness regressed) — now configurable via CVN_HPO_XGB_5M_* vs CVN_HPO_XGB_15M_* etc.
  • Future hyperparameter additions (new models, new HPO dimensions, regularization variants) follow the same path — no architectural divergence.

Negative / accepted trade-offs

  • Console UI gets visually heavier (~200 keys added). Operator explicitly accepted on 2026-05-11. A follow-up "Console UX" sprint will deduplicate / collapse / group these but is NOT in scope of this ADR.
  • Externalization PR is large — ~200 env var resolutions, helper module, seeding script, parity tests vs legacy values, Grafana panel, CI gate. Estimated ~1-2 days of focused work + committee plan_review v3 + pr_review per ADR-68.
  • Operator must run the seeding script post-deploy — manual one-shot, idempotent, documented in the Story closure ritual. ADR-22 (DAGs paused, manual trigger only) precedent.
  • Fallback path (Clause 2) is technically a transition concession — must be removed in a follow-up amendment once seeding is complete. The 7-day audit gate (Clause 4) is the trigger.

Migration plan (gradual per operator's "au fur et à mesure du projet" principle)

  • PR-1 (CVN-N001-EE-S17) — Resolver helper + LGB+XGB+CB hyperparams externalized (the harness-regression scope identified 2026-05-11). Console seeding. CI gate Story workflow guardrails (G5) shipped initially in warn-only mode (logs the violation in Action logs but does not fail the build) for the first sprint. The gate flips to fail-the-build mode on next PR.
  • PR-2 (CVN-N001-EE-S18) — Externalize remaining training params : threshold sweep, calibration choice, regime weighting alpha. Drop fallback parameter from all callsites where Clause 5 §1 reports 100% Console coverage.
  • Beyond — Apply Clause 1 naming + Clause 2 resolver to feature engineering hyperparams (rolling window sizes, etc.) as separate Stories.

Alternatives considered (and rejected)

  1. Keep defaults in code with stricter docstring + lint — rejected. Doesn't fix the redeploy-to-change loop, just makes the documentation prettier. Doesn't satisfy the operator's Règle 1.
  2. JSON blob env vars (CVN_HPO_LGB_5M_BUNDLE = '{"learning_rate": 0.1, ...}') — rejected. Lower diff-readability in Console UI ; harder to write a CI grep gate ; harder to write a Grafana coverage panel.
  3. Per-PTE or per-crypto hyperparams — rejected for this ADR. No empirical evidence that hyperparams differ by PTE or crypto on the current feature stack ; only timeframe-aware is supported by legacy code + ML literature. A future ADR may extend the naming convention if needed.
  4. Auto-tuned HPO ranges derived from data distribution — out of scope for this ADR ; could be a future Story under the FTF protocol.
  5. Use Apache Airflow Variables instead of ftf_config — rejected. Airflow Variables don't have versioning, audit, or Console UI editing. ftf_config.base_env already has all three via the existing Console.

References

  • ADR-25 (no silent fallback)
  • ADR-30 / 32 / 38 (structured logs, golden fields)
  • ADR-56 (every change gated by CVN_* env var + FTF factor)
  • ADR-59 (params in PG ftf_config — generic predecessor of this ADR)
  • ADR-67 (pluggable feature-selection registry — same enforcement style)
  • ADR-68 (committee plan_review canonical channel)
  • ADR-71 (kill switch single source — same Console-as-source-of-truth pattern)
  • ADR-87 (Story phase test integration — G5 gate)
  • ADR-89 (training harness as plugin registry — the migration this ADR retrofits)
  • Diagnostic dossier documentation/reviews/2026-05-11-cvn-n001-ee-s16-harness-baseline-validation-experiment.md — the empirical evidence of the regression that motivated this ADR
  • Committee session a860565e — operator's full statement of Règle 1