Plan dossier : CVN-N001-EE-S17 : Full externalization of training hyperparameters
Date : 2026-05-11
Story : CVN-N001-EE-S17 (to be created post-committee, OP wp to be allocated)
Author : Dominique (operator) + Claude
Session type : plan_review — implementation plan for the full externalization mandated by ADR-90
Related ADR : ADR-90 (just written today, this dossier is its companion implementation plan)
Status : awaiting committee verdict
1. Why this Story exists
Two empirical events converged on 2026-05-11 :
- The harness migration regression : `f1_buy` dropped from 0.42 to 0.22 on `defi_top5` 5m post-PRs #891/#896/#899/#901. Diagnosed root cause : 18 hyperparameter divergences silently introduced. Documented in `documentation/reviews/2026-05-11-cvn-n001-ee-s16-harness-baseline-validation-experiment.md`.
- The validation experiment itself failed : patching defaults in code + helm upgrade (Option Z) was overridden by HPO Optuna's `suggest_*` calls. Live Loki showed `learning_rate=0.016149` even after the patched-defaults image was deployed. The patches were dead code.
Together these motivate ADR-90 : every hyperparameter (defaults + HPO ranges) MUST live in PG ftf_config (Console-editable), no in-code defaults, fail-fast or WARN-fallback only.
This Story implements ADR-90's first PR (the LGB+XGB+CB scope identified by the diagnostic).
2. Scope : what gets externalized
2.1 Defaults (per-model × per-timeframe)
| model | timeframes | per-TF default count | total defaults |
|---|---|---|---|
| XGB | 1m, 5m, 15m, 30m, 1h | 9 (max_depth, learning_rate, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda) | 45 |
| LGB | 1m, 5m, 15m, 30m, 1h | 9 (num_leaves, max_depth, learning_rate, n_estimators, min_child_samples, subsample, colsample_bytree, reg_alpha, reg_lambda) | 45 |
| CB | 1m, 5m, 15m, 30m, 1h | 4 (depth, learning_rate, iterations, l2_leaf_reg) | 20 |
| Subtotal defaults | | | 110 |
2.2 HPO ranges (per-model × per-timeframe)
Each HPO-suggestable param has 3 keys : _RANGE_MIN, _RANGE_MAX, _RANGE_SCALE (linear or log).
| model | timeframes | HPO range count per TF | total HPO range keys |
|---|---|---|---|
| XGB | 1m, 5m, 15m, 30m, 1h | 9 params × 3 keys = 27 | 135 |
| LGB | 1m, 5m, 15m, 30m, 1h | 9 params × 3 keys = 27 | 135 |
| CB | 1m, 5m, 15m, 30m, 1h | 4 params × 3 keys = 12 | 60 |
| Subtotal HPO ranges | | | 330 |
2.3 Cross-cutting hyperparams (model-agnostic, TF-agnostic)
Already in Console (verified 2026-05-11) :
- CVN_EARLY_STOPPING_ROUNDS — 1 key (kept as-is, just reading-side fix)
- CVN_HPO_OBJECTIVE — 1 key
- CVN_THRESHOLD_METHOD — 1 key
No new keys here ; the resolver helper just centralizes the existing reads.
2.4 TOTAL : ~440 env vars to seed in Console
This is the overhead the operator has explicitly accepted (per the 2026-05-11 statement "this will make its UI heavier, I know").
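As a quick cross-check of the arithmetic, the sketch below enumerates the keys implied by the tables in 2.1 and 2.2, using the `CVN_HPO_<MODEL>_<TF>_<PARAM>` layout of §3. The names `PARAMS`, `TIMEFRAMES` and `expected_keys` are illustrative assumptions, not part of the planned module.

```python
from itertools import product

# Parameter lists copied from the tables in 2.1 (illustrative constant names).
PARAMS = {
    "XGB": ["max_depth", "learning_rate", "n_estimators", "subsample", "colsample_bytree",
            "min_child_weight", "gamma", "reg_alpha", "reg_lambda"],
    "LGB": ["num_leaves", "max_depth", "learning_rate", "n_estimators", "min_child_samples",
            "subsample", "colsample_bytree", "reg_alpha", "reg_lambda"],
    "CB": ["depth", "learning_rate", "iterations", "l2_leaf_reg"],
}
TIMEFRAMES = ["1M", "5M", "15M", "30M", "1H"]
RANGE_SUFFIXES = ["_RANGE_MIN", "_RANGE_MAX", "_RANGE_SCALE"]


def expected_keys() -> tuple[list[str], list[str]]:
    """Return (default_keys, range_keys) for every model x timeframe x param."""
    defaults: list[str] = []
    ranges: list[str] = []
    for model, params in PARAMS.items():
        for tf, param in product(TIMEFRAMES, params):
            base = f"CVN_HPO_{model}_{tf}_{param.upper()}"
            defaults.append(base)
            ranges.extend(base + suffix for suffix in RANGE_SUFFIXES)
    return defaults, ranges


defaults, ranges = expected_keys()
print(len(defaults), len(ranges), len(defaults) + len(ranges))  # 110 330 440
```

The same enumeration could serve as the expected-key list for the seeding script's dry-run output.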
3. Naming convention (per ADR-90 Clause 1)
```
CVN_HPO_<MODEL>_<TF>_<PARAM>              # default
CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_MIN    # HPO suggest min
CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_MAX    # HPO suggest max
CVN_HPO_<MODEL>_<TF>_<PARAM>_RANGE_SCALE  # "linear" or "log"

MODEL ∈ {XGB, LGB, CB}           (uppercase, no separator)
TF    ∈ {1M, 5M, 15M, 30M, 1H}   (uppercase, no separator)
PARAM : uppercase, underscores   (e.g. LEARNING_RATE, MAX_DEPTH, REG_LAMBDA)
```
Examples :
- CVN_HPO_XGB_5M_LEARNING_RATE = "0.07"
- CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_MIN = "0.05"
- CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_MAX = "0.15"
- CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_SCALE = "linear"
Type enforcement : the resolver parses to int, float, or str based on the param name (LEARNING_RATE → float, MAX_DEPTH → int, etc.). Naming convention table is the source of truth, kept in commun/finetune/hyperparams.py.
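A minimal sketch of what name-based type enforcement could look like. The table contents (which params are `int` and which are `float`) and the helper names `expected_type` / `parse` are assumptions made for illustration; the authoritative table is the one planned for `commun/finetune/hyperparams.py`.

```python
# Hypothetical type table: PARAM name -> Python type (assumed split, to be confirmed).
_PARAM_TYPES: dict[str, type] = {
    "MAX_DEPTH": int, "NUM_LEAVES": int, "N_ESTIMATORS": int,
    "MIN_CHILD_SAMPLES": int, "DEPTH": int, "ITERATIONS": int,
    "LEARNING_RATE": float, "SUBSAMPLE": float, "COLSAMPLE_BYTREE": float,
    "MIN_CHILD_WEIGHT": float, "GAMMA": float, "REG_ALPHA": float,
    "REG_LAMBDA": float, "L2_LEAF_REG": float,
}


def expected_type(param_name: str) -> type:
    """Look up the registered type; unknown names fail loudly instead of defaulting."""
    try:
        return _PARAM_TYPES[param_name.upper()]
    except KeyError:
        raise KeyError(f"no type registered for hyperparameter {param_name!r}") from None


def parse(raw: str, expected: type):
    """Convert a Console string to the expected type; bad input raises ValueError."""
    text = raw.strip()
    if expected is int:
        return int(text)    # int("6.5") raises ValueError rather than truncating
    if expected is float:
        return float(text)
    return text


print(parse("0.07", expected_type("LEARNING_RATE")), parse("6", expected_type("max_depth")))
```

Rejecting `"6.5"` for an integer param, rather than truncating it, keeps a Console typo from silently changing the trained model.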
4. Implementation deliverables
4.1 Helper module commun/finetune/hyperparams.py
```python
# Pseudo-code, finalized in implementation
import os
from typing import Any

# _parse, _expected_type and log_event are helpers assumed to be defined
# elsewhere (this module / the project logging layer); not shown here.


def resolve(model_type: str, timeframe: str, param_name: str, fallback: Any | None = None) -> Any:
    """ADR-90 canonical resolver. Read docstring there."""
    key = f"CVN_HPO_{model_type.upper()}_{timeframe.upper()}_{param_name.upper()}"
    raw = os.environ.get(key)
    if raw is not None:
        # Console value present: parse it to the type registered for this param.
        return _parse(raw, _expected_type(param_name))
    if fallback is None:
        # No Console value and no explicit fallback: fail fast.
        raise RuntimeError(
            f"HP {key} not in Console — set via ftf_config.base_env per ADR-90. "
            f"See documentation/adr/0090-training-hyperparameters-in-pg-console-only.md"
        )
    # Explicit fallback: allowed, but logged so the coverage panel can count it.
    log_event(
        level="WARN",
        event="hpo_fallback_applied",
        model=model_type, timeframe=timeframe, param=param_name,
        fallback=fallback, key=key, reason="env_key_missing",
    )
    return fallback


def resolve_hpo_range(model_type: str, timeframe: str, param_name: str) -> tuple[Any, Any, str]:
    """Returns (min, max, scale). All three keys must be present or RuntimeError."""
    ...
```
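The body of `resolve_hpo_range` is left open above. One possible shape is sketched below, under stated assumptions: the optional `env` argument (added so the function can be exercised without touching `os.environ`), the `_INT_PARAMS` set, and the sanity checks on scale and bounds are all illustrative additions, not decisions recorded in ADR-90.

```python
import os
from typing import Mapping, Optional, Union

Number = Union[int, float]

# Hypothetical: params whose range bounds should be parsed as int.
_INT_PARAMS = {"MAX_DEPTH", "NUM_LEAVES", "N_ESTIMATORS", "MIN_CHILD_SAMPLES", "DEPTH", "ITERATIONS"}


def resolve_hpo_range(model_type: str, timeframe: str, param_name: str,
                      env: Optional[Mapping[str, str]] = None) -> tuple[Number, Number, str]:
    """Return (min, max, scale); raise RuntimeError if any of the three keys is missing."""
    source = os.environ if env is None else env
    base = f"CVN_HPO_{model_type.upper()}_{timeframe.upper()}_{param_name.upper()}"
    raw = {part: source.get(f"{base}_RANGE_{part}") for part in ("MIN", "MAX", "SCALE")}
    missing = [f"{base}_RANGE_{part}" for part, value in raw.items() if value is None]
    if missing:
        raise RuntimeError(f"HPO range keys not in Console: {', '.join(missing)}")

    cast = int if param_name.upper() in _INT_PARAMS else float
    low, high = cast(raw["MIN"].strip()), cast(raw["MAX"].strip())
    scale = raw["SCALE"].strip().lower()
    if scale not in ("linear", "log"):
        raise ValueError(f"{base}_RANGE_SCALE must be 'linear' or 'log', got {scale!r}")
    if low >= high:
        raise ValueError(f"{base}: RANGE_MIN ({low}) must be below RANGE_MAX ({high})")
    if scale == "log" and low <= 0:
        raise ValueError(f"{base}: log scale requires RANGE_MIN > 0")
    return low, high, scale


console = {
    "CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_MIN": "0.05",
    "CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_MAX": "0.15",
    "CVN_HPO_XGB_5M_LEARNING_RATE_RANGE_SCALE": "linear",
}
print(resolve_hpo_range("XGB", "5m", "LEARNING_RATE", env=console))  # (0.05, 0.15, 'linear')
```

Validating the bounds at resolve time means a mistyped Console range fails before Optuna is invoked, with the offending key named in the message.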
4.2 Patch lightgbm_dag.py / xgboost_dag.py / catboost_dag.py
Each *_native_params function and each _hpo_space function reads via resolve() / resolve_hpo_range() instead of hpo_params.get(..., default) / trial.suggest_*(min, max).
Example XGB (illustrative, full diff in PR) :
```python
# BEFORE (ADR-90 violation)
def xgb_native_params(hpo_params, xgb_binary):
    return {
        "max_depth": hpo_params.get("max_depth", 6),            # ← in-code default
        "learning_rate": hpo_params.get("learning_rate", 0.1),  # ← in-code default
        # ...
    }


def _hpo_space(trial):
    return {
        "max_depth": trial.suggest_int("max_depth", 5, 12),                          # ← in-code range
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),  # ← in-code range
        # ...
    }


# AFTER (ADR-90 compliant)
def xgb_native_params(hpo_params, xgb_binary, timeframe):
    return {
        "max_depth": hpo_params.get("max_depth", resolve("XGB", timeframe, "MAX_DEPTH", fallback=6)),
        "learning_rate": hpo_params.get("learning_rate", resolve("XGB", timeframe, "LEARNING_RATE", fallback=0.1)),
        # ...
    }


def _hpo_space(trial, timeframe):
    mn, mx, sc = resolve_hpo_range("XGB", timeframe, "MAX_DEPTH")
    max_depth = trial.suggest_int("max_depth", mn, mx)
    mn, mx, sc = resolve_hpo_range("XGB", timeframe, "LEARNING_RATE")
    lr = trial.suggest_float("learning_rate", mn, mx, log=(sc == "log"))
    ...  # remaining params follow the same pattern
```
The timeframe parameter is threaded down from the FTF runner via the existing Datasets dataclass or HPOParams (decision made during implementation).
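One detail of the AFTER snippet may deserve attention during implementation: Python evaluates the second argument of `dict.get` before the call, so `resolve(...)` runs even when Optuna already supplied the value. A missing Console key would then emit a fallback WARN (or raise, if no fallback is passed) for a value that is never used. A lazy variant is sketched below; `pick` and `fake_resolve` are hypothetical names, and `fake_resolve` merely stands in for the real resolver.

```python
from typing import Any, Callable, Mapping


def pick(hpo_params: Mapping[str, Any], name: str, resolve_default: Callable[[], Any]) -> Any:
    """Return the HPO-suggested value when present; otherwise call resolve_default()."""
    if name in hpo_params:
        return hpo_params[name]
    return resolve_default()


# Stand-in for resolve(); records each call so the laziness is observable.
calls: list[str] = []


def fake_resolve(model: str, timeframe: str, param: str) -> int:
    calls.append(f"{model}_{timeframe}_{param}")
    return 6


suggested = pick({"max_depth": 9}, "max_depth", lambda: fake_resolve("XGB", "5M", "MAX_DEPTH"))
defaulted = pick({}, "max_depth", lambda: fake_resolve("XGB", "5M", "MAX_DEPTH"))
print(suggested, defaulted, calls)  # 9 6 ['XGB_5M_MAX_DEPTH']
```

With this shape the resolver is only consulted, and only logs, when the default is actually needed.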
4.3 Console seeding script scripts/seed_hyperparams_console.py
Idempotent CLI :
```bash
python scripts/seed_hyperparams_console.py \
    --console-host console.cvntrade.eu \
    --dry-run   # prints the 440 keys + values without writing

python scripts/seed_hyperparams_console.py \
    --console-host console.cvntrade.eu \
    --apply     # writes the keys ; existing keys with same value = no-op ;
                # existing keys with different value = SKIP + WARN log line
                # to allow operator to inspect before overwriting
```
Source values bundled in the script as Python dicts, derived from :
- Legacy cvntrade_XGBoost_config.py::GRID_DEFAULT_HP[<TF>] (per-TF) and XGBoostHyperConfig._apply_timeframe_specific_params (per-TF HPO ranges) → XGB defaults + HPO ranges
- Legacy cvntrade_LightGBM_config.py::GRID_DEFAULT_HP_LGB (TF-agnostic, replicated × 5 TFs) → LGB defaults
- Legacy cvntrade_CatBoost_config.py::CatBoostConfig dataclass → CB defaults
- LGB + CB HPO ranges : derived from existing harness _hpo_space (the harness ranges, kept verbatim because no legacy HPO ranges exist for LGB/CB — they used GridSearch not Optuna pre-harness)
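The `--apply` semantics described above (create, no-op on identical value, skip on conflict) can be expressed as a pure planning step that both `--dry-run` and `--apply` share. This is a sketch only: `plan_seed` and the action names are hypothetical, and how the script reads and writes `ftf_config` is not covered here.

```python
def plan_seed(existing: dict[str, str], desired: dict[str, str]) -> dict[str, list[str]]:
    """Classify each desired key without writing anything."""
    plan: dict[str, list[str]] = {"create": [], "noop": [], "skip_conflict": []}
    for key, value in sorted(desired.items()):
        if key not in existing:
            plan["create"].append(key)          # key absent: would be written
        elif existing[key] == value:
            plan["noop"].append(key)            # same value: nothing to do
        else:
            plan["skip_conflict"].append(key)   # different value: SKIP + WARN, operator decides
    return plan


existing = {"CVN_HPO_XGB_5M_LEARNING_RATE": "0.07", "CVN_HPO_XGB_5M_MAX_DEPTH": "8"}
desired = {
    "CVN_HPO_XGB_5M_LEARNING_RATE": "0.07",
    "CVN_HPO_XGB_5M_MAX_DEPTH": "6",
    "CVN_HPO_XGB_5M_GAMMA": "0.0",
}
print(plan_seed(existing, desired))
```

Because the plan never overwrites, running it twice against the same Console state yields the same result, which is what makes the script idempotent.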
4.4 CI grep gate Story workflow guardrails (G5)
New step in .github/workflows/ci.yml :
```yaml
- name: Story workflow guardrails (G5) — ADR-90 hyperparams in code
  if: always()
  shell: bash
  run: |
    set -e
    VIOLATIONS=$(grep -rn -E '(learning_rate|max_depth|reg_alpha|reg_lambda|subsample|colsample|min_child|n_estimators|num_leaves|gamma|l2_leaf_reg|depth|iterations|early_stopping_rounds)[^=]*=\s*[0-9]' \
      src/training/ src/commun/finetune/ \
      --include='*.py' \
      --exclude-dir=__pycache__ \
      | grep -v 'hyperparams\.py' \
      || true)
    if [ -n "$VIOLATIONS" ]; then
      echo "::error title=Guardrail G5 — ADR-90 violation::Hyperparameter literals found in source. ADR-90 mandates Console-only via commun/finetune/hyperparams.py::resolve()"
      echo "$VIOLATIONS"
      exit 1
    fi
    echo "::notice title=Guardrail G5::No ADR-90 violations found"
```
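Initially shipped in warn-only mode (logs but does not fail) to catch any in-flight PRs. Per committee reco #3 (§10), the warn-only window is capped at 3 days post-merge and the flip to fail-the-build mode is part of S17 closure (v1 planned a full first sprint, with the flip in PR-2). The step above is shown in its fail-the-build form (`exit 1`).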
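Before enforcing the gate, the pattern can be sanity-checked against representative source lines. The sketch below uses Python's `re` as a stand-in for `grep -E`, on the assumption that the two agree for this pattern (GNU grep accepts `\s` as an extension). The sample lines are adapted from the snippets in 4.2.

```python
import re

# Same pattern as the G5 step, split across lines for readability.
G5_PATTERN = re.compile(
    r"(learning_rate|max_depth|reg_alpha|reg_lambda|subsample|colsample|min_child"
    r"|n_estimators|num_leaves|gamma|l2_leaf_reg|depth|iterations|early_stopping_rounds)"
    r"[^=]*=\s*[0-9]"
)

SAMPLES = {
    "assignment": "max_depth = 6",
    "comment": "# learning_rate=0.05 was the legacy value",
    "dict_get_default": '"max_depth": p.get("max_depth", 6),',
    "suggest_call": '"learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),',
    "resolve_with_fallback": '"max_depth": p.get("max_depth", resolve("XGB", timeframe, "MAX_DEPTH", fallback=6)),',
}


def flagged(line: str) -> bool:
    """True when the G5 pattern matches somewhere in the line."""
    return G5_PATTERN.search(line) is not None


for name, line in SAMPLES.items():
    print(f"{name:24s} flagged={flagged(line)}")
# assignment               flagged=True
# comment                  flagged=True
# dict_get_default         flagged=False
# suggest_call             flagged=False
# resolve_with_fallback    flagged=True
```

Under that assumption, the pattern requires an `=` followed by a digit. Positional literals such as `p.get("max_depth", 6)` or `trial.suggest_int("max_depth", 5, 12)` would therefore pass unflagged, while a `fallback=6` keyword in resolver-based code would be flagged. Both cases may be worth confirming against real `grep` output during the warn-only window.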
4.5 Parity tests (mandatory per ADR-25 + ADR-58)
tests/unit/training_harness/test_hyperparams_resolver_parity.py :
- Verify resolve("XGB", "5m", "LEARNING_RATE") returns 0.07 when CVN_HPO_XGB_5M_LEARNING_RATE=0.07 is set (matches legacy)
- Verify the resolver raises RuntimeError with the canonical message when key missing AND no fallback
- Verify the resolver emits event=hpo_fallback_applied WARN log when fallback is used
- Verify the seeding script's bundled values match the legacy values byte-for-byte (compare against git show e75418ca^:src/training/... — the deletion commit)
tests/unit/training_harness/test_hpo_space_uses_resolver.py :
- Mock resolve_hpo_range and verify _hpo_space calls it for each param (no in-code numeric literals reach trial.suggest_*)
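The second test file could take roughly the following shape. This is a sketch under assumptions: the resolver is injected as an argument rather than patched, `FakeTrial` only mimics the two Optuna `Trial` methods used here, and `hpo_space` is a trimmed stand-in for the real `_hpo_space`.

```python
class FakeTrial:
    """Records suggest_* calls instead of sampling (stand-in for optuna.trial.Trial)."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def suggest_int(self, name, low, high):
        self.calls.append(("int", name, low, high, False))
        return low

    def suggest_float(self, name, low, high, *, log=False):
        self.calls.append(("float", name, low, high, log))
        return low


def hpo_space(trial, timeframe, resolve_range):
    """Trimmed stand-in for _hpo_space: every bound comes from resolve_range."""
    mn, mx, _ = resolve_range("XGB", timeframe, "MAX_DEPTH")
    max_depth = trial.suggest_int("max_depth", mn, mx)
    mn, mx, sc = resolve_range("XGB", timeframe, "LEARNING_RATE")
    lr = trial.suggest_float("learning_rate", mn, mx, log=(sc == "log"))
    return {"max_depth": max_depth, "learning_rate": lr}


def test_hpo_space_uses_resolver() -> None:
    ranges = {"MAX_DEPTH": (3, 9, "linear"), "LEARNING_RATE": (0.05, 0.15, "log")}
    seen: list[tuple] = []

    def fake_resolve_range(model, timeframe, param):
        seen.append((model, timeframe, param))
        return ranges[param]

    trial = FakeTrial()
    hpo_space(trial, "5m", fake_resolve_range)

    # Every param was looked up, and only resolver-provided bounds reached the trial.
    assert seen == [("XGB", "5m", "MAX_DEPTH"), ("XGB", "5m", "LEARNING_RATE")]
    assert trial.calls == [
        ("int", "max_depth", 3, 9, False),
        ("float", "learning_rate", 0.05, 0.15, True),
    ]


test_hpo_space_uses_resolver()
```

Choosing resolver values that differ from the legacy in-code ranges (here 3 to 9 instead of 5 to 12) makes the test fail if a literal slips back in.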
4.6 Grafana panel cvntrade-hp-coverage¶
Loki query :
```
sum by () (count_over_time({namespace="cvntrade"} |~ "event=hpo_fallback_applied" [7d]))
/ sum by () (count_over_time({namespace="cvntrade"} |~ "event=training_started" [7d]))
```
Dashboard rule : threshold WARN at 1%, CRIT at 5% (v2, per committee reco #1 in §10 ; v1 had WARN at 5% = some HPs fallback-resolved, CRIT at 30% = Console seeding broken).
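The panel's ratio and threshold logic can be sketched as below, with the thresholds passed in rather than hard-coded; `fallback_status` is a hypothetical helper, not part of the dashboard definition. One property of the query is worth keeping in mind: it divides fallback events by training starts, and a single run can emit one fallback event per missing key, so the ratio is not bounded by 100%.

```python
def fallback_status(fallback_events: int, training_runs: int,
                    warn: float, crit: float) -> tuple[float, str]:
    """Return (events per training run, status) for the given thresholds."""
    if training_runs <= 0:
        raise ValueError("training_runs must be positive")
    if not 0 <= warn < crit:
        raise ValueError("expected 0 <= warn < crit")
    rate = fallback_events / training_runs
    if rate >= crit:
        return rate, "CRIT"
    if rate >= warn:
        return rate, "WARN"
    return rate, "OK"


# 3 fallback events over 100 training runs, under both threshold sets discussed in this dossier.
print(fallback_status(3, 100, warn=0.01, crit=0.05)[1])  # WARN
print(fallback_status(3, 100, warn=0.05, crit=0.30)[1])  # OK
# 5 runs, each missing 9 keys: 45 events, ratio 9.0.
print(fallback_status(45, 5, warn=0.01, crit=0.05))      # (9.0, 'CRIT')
```

If a bounded "share of runs with at least one fallback" is preferred, the numerator would need to count distinct runs rather than events.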
4.7 mlops_readiness file¶
Per ADR-70, this Story touches src/training/ + src/commun/finetune/ → mlops_readiness mandatory. File location : documentation/stories/CVN-N001-EE-S17/mlops_readiness.md. Content adapts the template :
- §1 monitoring : event=hpo_fallback_applied Loki query as the "% Console coverage" health metric
- §2 alerting : CRIT alert when fallback rate > 5% (v2, per committee reco #1 ; was 30% in v1)
- §3 drift : N/A (no model drift surface — this is a code refactor)
- §4 rollout : staged in 2 PRs (S17 = LGB+XGB+CB ; S18 = threshold/calibration/etc.) ; W1 canary on defi_top5 5m
- §5 rollback : revert helm tag = revert resolver = back to pre-S17 state ; Console seeding script idempotent so seeded keys harmless
- §6 DRI : @dococeven, sunset 2026-08-11
5. Risk analysis
| Risk | Severity | Mitigation |
|---|---|---|
| Console seeding script applies wrong legacy values (e.g. for LGB which had no per-TF defaults, the script replicates 1 value × 5 TFs ; what if 1m needs a different value?) | Medium | Script is idempotent and SKIP-on-conflict ; operator reviews dry-run output before --apply ; parity tests gate the seeded values |
| Threading `timeframe` through the harness DAGs requires touching every node signature | Medium | `timeframe` is already in Datasets dataclass via Datasets.feature_version / similar metadata ; minimal additional wiring |
| 440 keys overload `ftf_config.base_env` JSONB column (size limit) | Low | PG JSONB has multi-MB capacity ; 440 short string values ≪ 1 MB |
| The Console UI doesn't paginate well at 440+ keys → operator can't find what they need | High UX | Accepted explicitly by operator on 2026-05-11 ; future "Console UX" sprint addresses |
| The CI grep gate G5 has false positives (e.g. matches a comment that mentions `learning_rate=0.05`) | Low | The pattern targets `=` followed by a digit and does not itself exclude lines starting with `#`, so such comments can match ; the gate runs in warn-only mode first so that false positives surface before enforcement. |
| Legacy values that we re-seed are themselves out of date (e.g., the pre-harness baseline f1=0.42 may have been reachable but not optimal) | Medium | The seeding restores the pre-regression baseline ; subsequent FTF sweeps are then free to find better values via HPO with the corrected ranges |
| Parity test compares against `git show e75418ca^` but the legacy file may have been edited intermediate to that commit | Low | Cross-check with git log on the legacy files ; pick the LAST commit before deletion as the reference snapshot |
| Operator forgets to run the seeding script post-deploy → resolver fail-fasts → ALL trainings crash | High | Seeding is part of the Story closure ritual (ADR-79 §5) ; CI deploy-time check : kubectl exec -- python -c 'from commun.finetune.hyperparams import audit_console_coverage; audit_console_coverage()' ; refuses to mark deploy green if coverage < 100% |
| The 7-day fallback audit gate (Clause 4 of ADR-90) is missed and fallback path becomes permanent | Medium | OP Story comment template includes the 7-day check as a TODO ; calendar reminder ; Grafana panel CRIT threshold flips at d+7 |
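The deploy-time check named in the table, `audit_console_coverage`, is not specified further in this dossier. A minimal sketch of what it might do follows; the explicit `expected_keys` and `env` arguments are assumptions added so the example is self-contained (the call shown in the table takes no arguments).

```python
import os
from typing import Iterable, Mapping, Optional


def audit_console_coverage(expected_keys: Iterable[str],
                           env: Optional[Mapping[str, str]] = None) -> float:
    """Return 1.0 when every expected key is set; raise RuntimeError otherwise."""
    source = os.environ if env is None else env
    expected = list(expected_keys)
    # Treat an empty string like a missing key: it cannot be parsed as a number.
    missing = [key for key in expected if source.get(key) in (None, "")]
    coverage = 1.0 - len(missing) / len(expected) if expected else 1.0
    if missing:
        raise RuntimeError(
            f"Console coverage {coverage:.1%}: {len(missing)} key(s) missing, e.g. {missing[:3]}"
        )
    return coverage


keys = ["CVN_HPO_XGB_5M_MAX_DEPTH", "CVN_HPO_XGB_5M_LEARNING_RATE"]
print(audit_console_coverage(keys, env={"CVN_HPO_XGB_5M_MAX_DEPTH": "6",
                                        "CVN_HPO_XGB_5M_LEARNING_RATE": "0.07"}))  # 1.0
```

Raising, rather than returning a number below 1.0, gives the `kubectl exec` call a non-zero exit code that the deploy job can act on.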
6. Definition of done
- Committee plan_review verdict PASSED on this dossier + ADR-90
- Resolver helper `commun/finetune/hyperparams.py` implemented + unit tests
- LGB+XGB+CB DAG files patched : every numeric hyperparam routes through `resolve()` / `resolve_hpo_range()`
- `scripts/seed_hyperparams_console.py` implemented + dry-run + apply modes + idempotent
- Parity tests vs legacy values (`tests/unit/training_harness/test_hyperparams_resolver_parity.py`) green
- CI grep gate `Story workflow guardrails (G5)` shipped in warn-only mode (PR-1) → fails-the-build mode (PR-2)
- mlops_readiness file `documentation/stories/CVN-N001-EE-S17/mlops_readiness.md` complete
- PR opened, CR rounds clean, committee pr_review PASSED
- PR merged + deploy CI green
- Operator runs the seeding script post-deploy, Console coverage = 100%
- Grafana panel `cvntrade-hp-coverage` live + reads 100% Console coverage
- Validation FTF run : trigger `finetune__pte` / `factor=model_type` / `crypto_group=defi_top5` / `power_mode=standard` → confirm `f1_buy_val ≥ 0.40` on ≥ 4/5 cryptos for ALL 3 model types (the original validation goal of CVN-N001-EE-S16)
7. Plan B / fail-back
If the externalization PR fails post-merge (e.g., a critical resolver bug crashes every training in prod) :
- Revert the merge commit (single-commit revert)
- Helm upgrade back to pre-S17 SHA
- Console-seeded keys are harmless (the pre-S17 code ignores them and falls back to in-code defaults)
- Open a new investigation Story for the resolver bug ; do NOT block on it because the pre-S17 state, while harness-broken, IS the current production state today
8. Out of scope (deferred to follow-up Stories)
- PR-2 / CVN-N001-EE-S18 : Externalize threshold sweep params, calibration choice, regime weighting alpha, focal-loss params (the rest of the training-config surface)
- CVN-N001-EE-S19 : Console UX optimization sprint — search, group-by-model, presets ("legacy 5m", "Track 11 v2"), bulk-edit
- CVN-N0XX : Per-PTE / per-crypto hyperparam dimensions IF empirical evidence justifies (TBD ; current ADR-90 explicitly rejects this as out-of-scope)
- CVN-N0XX : Auto-tuned HPO ranges based on data drift detection (long-term FTF protocol evolution)
10. Committee plan_review v1 : addressed (2026-05-11)
Session ea2e71ff — verdict PASSED / OK / strong consensus (5 experts, all in favour). Reason : "The implementation plan is sound, comprehensively addressing critical hyperparameter divergence issues and establishing a robust, observable, and mechanically enforced framework."
Recommendations addressed in v2
| # | Reco | Status |
|---|---|---|
| 1 | Adjust Grafana CRIT threshold from 30% to 5-10% (more aggressive — 30% means major config issue) | ✅ Updated §4.6 : WARN at 1%, CRIT at 5% (was 30%) |
| 2 | Refine LGB/CB seeding : explicitly flag the replicated TF-agnostic values as `TODO: differentiate by TF` in the script output | ✅ Updated §4.3 : seeding script logs `[TODO_PER_TF]` next to each LGB/CB key during `--dry-run` and `--apply` ; documented in seeding-output template |
| 3 | Expedite CI grep gate : warn-only window must be days (not full sprint) | ✅ Updated §4.4 : warn-only window capped at 3 days post-merge ; flip to fail-the-build is part of S17 closure (not a separate Story) |
| 4 | Explicitly define config injection mechanism : PG → env vars in K8s pods (full lifecycle) | ✅ Added §11 below |
| 5 | Detail AuthN/AuthZ for Console UI + ftf_config access | ✅ Added §12 below |
| 6 | Prioritize Console UX sprint (CVN-N001-EE-S19) to mitigate 440-key UX friction | ✅ Acknowledged ; S19 priority bumped (operator to schedule in next sprint planning) |
Dissents : operator decisions
- TF-agnostic vs TF-aware granularity : 2 experts argued for hybrid (TF-agnostic for invariants like `random_state`). Operator decision : stick with TF-aware naming everywhere in PR-1 to keep the resolver mechanically simple ; invariant values like `random_state=42` use `CVN_HPO_<MODEL>_<TF>_RANDOM_STATE=42` replicated × 5 TFs. Documented in §3 already. Acceptable redundancy ; future ADR amendment may collapse if needed.
- Strictness of fail-fast during migration : 3 experts wanted gentler migration (fallback acceptable for 1 sprint), 2 wanted strict fail-fast from PR-1. Operator decision : fail-fast with explicit fallback parameter from PR-1 (matches ADR-90 Clause 2). The fallback path is what the WARN log catches ; aggressive Grafana CRIT (reco #1) means missing keys surface fast.
11. Config injection mechanism (committee reco #4)
The full lifecycle from ftf_config.base_env to the harness Python code at runtime :
```
ftf_config (PG JSONB column, id=1)
   │
   │  (read at DAG parse time per ADR-65, via commun.finetune.dag_config.load_finetune_pte_defaults)
   ↓
finetune__pte DAG task `validate_params`
   │
   │  ──────────────► os.environ injection : `os.environ[key] = value`
   │                  for every k,v in BASE_ENV
   │                  (one-shot before the K8sPodOperator spawns the worker pod)
   ↓
worker pod (Airflow KubernetesExecutor task) inherits the full os.environ
   │
   │  ──────────────► commun.finetune.hyperparams.resolve()
   │                  reads os.environ[CVN_HPO_<MODEL>_<TF>_<PARAM>]
   ↓
LGB / XGB / CB DAG node uses the resolved value in lgb.train / xgb.train / model.fit
```
Critical assumption : the K8sPodOperator inherits the scheduler pod's os.environ. Verified in production : the existing CVN_EARLY_STOPPING_ROUNDS=150 Console value reaches the training pod (confirmed via event=training_started Loki sampling pre-2026-05-09 ; broken after the harness migration because the code stopped reading the env var, not because the env var stopped being injected).
Fail-fast on env-not-injected : the resolver's RuntimeError doubles as a guard against env-injection failure. If the Console value is set but os.environ.get(key) returns None, the resolver fails immediately with the canonical message ; the operator inspects kubectl exec ... env | grep CVN_HPO_ to confirm the injection path.
Lifecycle of a value change :
1. Operator edits value in Console UI
2. Console writes to ftf_config.base_env (PG)
3. Operator re-triggers the FTF DAG via Airflow UI (manual, ADR-22)
4. validate_params task reads ftf_config.base_env, injects into os.environ
5. K8sPodOperator spawns worker pod with the env
6. Training task reads via resolve() and uses the new value
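Steps 4 to 6 can be sketched as follows, using a plain dict in place of `os.environ` so the example has no side effects. `inject_base_env` is a hypothetical name; the real injection lives in the `validate_params` task, and how the environment reaches the worker pod is outside this sketch.

```python
import os
from typing import Mapping, MutableMapping, Optional


def inject_base_env(base_env: Mapping[str, object],
                    environ: Optional[MutableMapping[str, str]] = None) -> int:
    """Copy every base_env entry into environ as a string; return the number of keys written."""
    target = os.environ if environ is None else environ
    for key, value in base_env.items():
        target[key] = str(value)  # os.environ only accepts str values
    return len(base_env)


# base_env as it might come back from the JSONB column (values may be numbers).
base_env = {"CVN_HPO_XGB_5M_LEARNING_RATE": 0.07, "CVN_EARLY_STOPPING_ROUNDS": 150}
fake_environ: dict[str, str] = {}
written = inject_base_env(base_env, fake_environ)

# What the resolver would then read back and parse on the training side.
raw = fake_environ.get("CVN_HPO_XGB_5M_LEARNING_RATE")
print(written, raw, float(raw))  # 2 0.07 0.07
```

Stringifying at injection time keeps the resolver's contract simple: it always receives text and applies the type table itself.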
Wall time from edit to first training-with-new-value : <30 seconds (Console save + 1 click DAG trigger + scheduler queues task). Compare to pre-S17 : 10 min CI build + 5 min helm upgrade + 1 click DAG trigger = ~15 min. 30× speedup is the headline operator benefit.
12. AuthN/AuthZ for Console UI + ftf_config access (committee reco #5)
Current state (pre-S17) :
- Console UI : single-user, single-secret (operator-only). No role-based access. Authentication via shared cvntrade-env-secrets-bound HTTP basic auth.
- ftf_config direct PG access : same credentials as the Airflow scheduler (read+write via mlflow PG user). Anyone with kubectl exec rights into the scheduler pod can psql -c "UPDATE ftf_config ...".
S17 changes : NONE in scope — the Console UI access model is unchanged ; the PG access model is unchanged. The 440 new keys inherit the existing security posture, which is :
- Console is operator-only (single-user). No SSO. No audit log of who edited what.
- PG ftf_config.updated_by column captures the writer's identity (manual writes set it ; the script seed_hyperparams_console.py will set updated_by='seed_script@<git_sha>' for traceability).
- kubectl RBAC governs who can shell into pods to bypass the Console.
Future ADR-91 candidate : multi-user Console with role separation (read-only auditors, write-allowed operators), session-scoped audit log of every key change, MFA. NOT in S17. Documented as a known gap.
13. Open questions for the committee (v1, ANSWERED in §10)
The original §9 open questions are answered in the v2 committee verdict + this revision. Re-listed for reference but no longer require a response :
- Is the `timeframe`-aware naming convention (`CVN_HPO_<MODEL>_<TF>_<PARAM>`) the right granularity, OR should we ALSO support timeframe-agnostic keys (for params truly invariant across TFs, e.g. `random_state=42`) to reduce Console clutter?
- Should the resolver fail-fast `RuntimeError` block the WHOLE FTF DAG (current design) OR fall back to the harness in-code values silently for the first sprint then flip to fail-fast (gentler migration)? ADR-90 currently mandates fail-fast ; this is the operator's explicit decision ; flag if disagreement.
- The 440-key seeding script bundles legacy values. For LGB and CB which had NO per-TF differentiation in legacy, we replicate 1 value × 5 TFs. Is this the right default, or should the seeding script flag "TODO : differentiate by TF" for LGB/CB so the operator can tune later?
- The CI grep gate G5 is initially warn-only. Is the 1-sprint warn-only window acceptable, or should it ship in fails-the-build mode from PR-1?
- The Grafana panel CRIT threshold for fallback rate is set to 30%. Is this appropriate, or too lenient/strict?