MLOps readiness : CVN-N001-EE-S17 : Training hyperparameters externalization to PG ftf_config (ADR-90)
Story : CVN-N001-EE-S17 — OP work-package and GitHub issue MUST exist before this PR is opened per the canonical flow (ADR-69, ADR-81, documentation/process/STORY_WORKFLOW.md §5 : Pick Story → Issue → Plan → … → PR). This PR (#904) is exceptionally tracked back-to-front : the operator opens the OP wp + GH issue immediately and back-fills the references in this section (OP wp#NNN, Closes #NNN) before the merge gate.
Owner : @dococeven (DRI for production behaviour of this change)
Filled on : 2026-05-11
Reviewed by committee :
- Plan dossier plan_review v1 (2026-05-11) — session ea2e71ff PASSED / OK / strong consensus (5 experts, 6 recos addressed in dossier v2)
Scope context : this Story is a code refactor that externalizes 455 training hyperparameter keys (110 HPO-tunable defaults + 330 RANGE_{MIN,MAX,SCALE} triples + 15 EARLY_STOPPING_ROUNDS defaults) from in-code defaults to the PG ftf_config.base_env Console-managed surface. It does NOT change any model architecture, feature, or label. The mathematical training behaviour is byte-for-byte identical to the legacy values (parity tests at tests/unit/training_harness/test_hyperparams_seeding_parity.py lock the 125 default values against the pre-#899 git history). The structural change : zero hyperparameter values can live in Python source anymore (CI grep gate G5 enforces).
1. Production monitoring (MUST)
| Status | Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|---|
| ✅ LIVE | event=hpo_fallback_applied model=... timeframe=... param=... key=... fallback=... reason=env_key_missing | counter | commun.finetune.hyperparams.resolve | Grafana cvntrade-hp-coverage (this Story) | rate > 1 % of event=training_started → P3 (Console seeding incomplete) ; rate > 5 % → P2 (CRIT) | @dococeven |
| ✅ LIVE | event=training_started carries learning_rate=... max_depth=... (existing log_emit node) | gauge bundle | training.harness.nodes.log_emit.emit_training_started | Grafana cvntrade-training-harness (existing) | for the audit pattern : any learning_rate < 0.01 OR max_depth = -1 → P2 (HPO range too wide) | @dococeven |
| ✅ LIVE | event=training_complete carries best_iteration / training_time_sec / f1_buy_val (existing) | gauge bundle | same | same | best_iteration < 50 OR training_time_sec p95 > 2× pre-S17 baseline → P2 (hyperparam regression) | @dococeven |
| ⏸️ PLANNED | harness.hp_coverage_pct{} (% Console seeded keys / 455 expected) | gauge (Prometheus via OTel) | commun.finetune.hyperparams.audit_console_coverage polled at FTF DAG start | same Grafana panel | < 100 % → WARN ; < 95 % → CRIT (committee reco #1 ; raised from 70 % in dossier v1) | @dococeven |
| ⏸️ PLANNED | harness.hp_fallback_rate_24h{} (count of hpo_fallback_applied / 24 h) | counter rate | Loki aggregation | same Grafana panel | > 0 7 days post-merge → P3 (operator MUST seed remaining keys) ; > 0 14 days post-merge → P2 | @dococeven |
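The coverage and fallback-rate thresholds in the table can be sketched as two small classifiers. This is an illustrative sketch only: the function names are hypothetical and are not the API of commun.finetune.hyperparams.audit_console_coverage ; only the 455-key total and the WARN / CRIT / P3 / P2 thresholds come from this document.

```python
from typing import Optional

# 110 HPO-tunable defaults + 330 RANGE_{MIN,MAX,SCALE} keys + 15 EARLY_STOPPING_ROUNDS defaults
EXPECTED_KEYS = 110 + 330 + 15  # 455


def coverage_pct(seeded_keys: int, expected_keys: int = EXPECTED_KEYS) -> float:
    """Percentage of expected Console keys that are actually seeded."""
    return 100.0 * seeded_keys / expected_keys


def coverage_severity(pct: float) -> str:
    """Map a coverage percentage to the thresholds of the table (hypothetical helper)."""
    if pct < 95.0:
        return "CRIT"
    if pct < 100.0:
        return "WARN"
    return "OK"


def fallback_rate_severity(fallback_events: int, training_started_events: int) -> Optional[str]:
    """Rate of hpo_fallback_applied relative to training_started: > 1 % is P3, > 5 % is P2."""
    if training_started_events <= 0:
        return None  # no trainings observed, nothing to compare against
    rate = fallback_events / training_started_events
    if rate > 0.05:
        return "P2"
    if rate > 0.01:
        return "P3"
    return None
```

With these thresholds a single missing key out of 455 already yields WARN, and CRIT is reached once 23 or more keys are missing.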
Required minima coverage :
- ✅ prediction-rate metric : signals.buy_proba (existing, unchanged — no inference path touched).
- ✅ outcome metric : f1_buy_val per fold per variant per crypto (existing in finetune_results PG + event=training_complete Loki).
- ✅ health metric : event=hpo_fallback_applied rate is the canonical Console-coverage audit signal — directly informs whether the externalization mechanism itself is healthy.
The metric event=hpo_fallback_applied is the non-negotiable addition this Story makes : it converts the implicit "in-code default got used silently" failure mode (which hid the harness regression for 3 days) into an explicit, queryable observable.
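A minimal sketch of the fail-fast / WARN-fallback behaviour described here, under stated assumptions: the function name resolve_sketch, its signature and the float cast are hypothetical and do not reproduce the real commun.finetune.hyperparams.resolve ; only the CVN_HPO_ prefix and the log fields are taken from this document.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("hyperparams_sketch")  # hypothetical logger name


def resolve_sketch(
    key: str,
    *,
    model: str,
    timeframe: str,
    param: str,
    fallback: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Read one hyperparameter from the environment (assumed to be fed by ftf_config.base_env)."""
    source = os.environ if env is None else env
    raw = source.get(key)
    if raw is not None:
        return float(raw)
    if fallback is None:
        # Fail-fast: no silent in-code default once the transition window is closed.
        raise KeyError(f"{key} is not seeded and no fallback was provided")
    # WARN-fallback: make the use of a default explicit and queryable.
    logger.warning(
        "event=hpo_fallback_applied model=%s timeframe=%s param=%s key=%s fallback=%s "
        "reason=env_key_missing",
        model, timeframe, param, key, fallback,
    )
    return float(fallback)
```

The point of the design is that the fallback branch can never run without leaving a log line, which is what makes the Console-coverage audit possible.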
2. Alerting & runbooks (MUST)
| Alert | Trigger | Runbook | Severity | Notification channel |
|---|---|---|---|---|
| hp_fallback_rate_high | Loki count of event=hpo_fallback_applied over 6 h > 10 occurrences (covering steady-state FTF traffic) | documentation/runbooks/runbook_hp_console_seeding.md (P3 follow-up ; first version filed at PR merge as part of the closure ritual) | P2 | Slack #ops-cvn |
| hp_console_coverage_low | audit_console_coverage reports < 95 % seeded keys (CRIT threshold per committee reco #1) | same runbook | P1 (page on-call ; the FTF will fail-fast on the missing keys at next training) | Pagerduty |
| hp_fallback_persistent | hpo_fallback_applied event observed 7 days after PR merge (the operator's transition window per ADR-90 Clause 4) | same runbook | P3 → P2 escalation at d+14 | Slack #ops-cvn |
Required minima :
- ✅ 1 P1 alert : hp_console_coverage_low — covers the silent-failure mode that would surface as "FTF crashes on every trigger" post-deploy if the operator forgot to run the seeding script.
- ✅ Runbook : documentation/runbooks/runbook_hp_console_seeding.md is planned for merge day (the seeding script + this readiness file are sufficient for the operator to recover ; the runbook formalizes the steps with a diagnosis tree). SKIP — JUSTIFICATION : the seeding script itself is the runbook ; its --dry-run mode is the diagnosis tool. A separate runbook adds detail (e.g., what to do if PG is unreachable, who has write rights) that is filed as a P3 follow-up.
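The hp_fallback_persistent escalation (P3 from day 7, P2 from day 14) can be expressed as a small rule. This is a hedged sketch: the function name is hypothetical and the inclusive day boundaries are an assumption, since the alert definition does not say whether day 7 itself counts.

```python
from typing import Optional

P3_AFTER_DAYS = 7   # end of the operator's transition window (ADR-90 Clause 4)
P2_AFTER_DAYS = 14  # escalation point "d+14"


def fallback_persistence_severity(days_since_merge: int, fallback_events: int) -> Optional[str]:
    """Severity for fallback events still observed N days after the PR merge."""
    if fallback_events <= 0:
        return None  # Console fully seeded, nothing to escalate
    if days_since_merge >= P2_AFTER_DAYS:
        return "P2"
    if days_since_merge >= P3_AFTER_DAYS:
        return "P3"
    return None  # still inside the transition window
```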
3. Drift detection (MUST)
This Story is a code refactor — there is no model, feature, or label change. No new drift surface to monitor.
The internal contract drift is structurally prevented :
- commun.finetune.hyperparams.resolve is the single authoritative reader (audit query : grep -rn "os.environ.*CVN_HPO_" src/training/ src/commun/finetune/ | grep -v hyperparams.py must return zero matches outside the resolver itself).
- The CI grep gate G5 (.github/workflows/pr-workflow-guardrails.yml, job pr-workflow-guardrails → step G5 — ADR-90 hyperparameters externalization) catches any numeric hyperparam literal that sneaks back into src/training/harness/ or src/commun/finetune/, in both assignment form (learning_rate=0.1) and dict-literal form ("learning_rate": 0.1). Legacy src/training/{XGBoost,LightGBM,CatBoost,MetaLabel,TwoStage}/ modules are out of S17 scope — they are scheduled for ADR-90 cutover in follow-up Stories.
- The parity tests test_hyperparams_seeding_parity.py pin the bundled legacy values against known good references ; a future PR that mutates the seeding defaults without an explicit reason fails the test.
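To illustrate what a gate like G5 looks for, here is a minimal sketch of a detector covering both the assignment form and the dict-literal form. The regexes and the two-name parameter list are assumptions made for illustration ; the actual pattern used in pr-workflow-guardrails.yml is not shown in this document.

```python
import re
from typing import List, Tuple

# Illustrative subset only; the real gate presumably covers the full param-name list.
PARAM_NAMES = ("learning_rate", "max_depth")

_NAMES = "|".join(re.escape(name) for name in PARAM_NAMES)
_NUMBER = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

# learning_rate=0.1
ASSIGNMENT_FORM = re.compile(rf"\b(?:{_NAMES})\s*=\s*{_NUMBER}")
# "learning_rate": 0.1
DICT_LITERAL_FORM = re.compile(rf"""["'](?:{_NAMES})["']\s*:\s*{_NUMBER}""")


def find_hyperparam_literals(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line holding a numeric hyperparam literal."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if ASSIGNMENT_FORM.search(line) or DICT_LITERAL_FORM.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

Lines that merely name a parameter, such as a call into the resolver or an attribute read, do not match because neither pattern is satisfied without a numeric literal on the right-hand side.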
| Drift type | Detection method | Threshold | Action |
|---|---|---|---|
| Hyperparam regression in code | CI grep gate G5 | any literal numeric matching the param-name regex | WARN-only until the 3-day flip window closes (see §G5 below) : emits a CI annotation but does NOT block merge. Once WARN_ONLY=1 is removed in pr-workflow-guardrails.yml (target : merge SHA + 3d), the action becomes "CI fails, block merge". |
| Seeding script value drift | test_hyperparams_seeding_parity.py parametrized × 110 default keys + 110 range triples | any mismatch vs the pinned legacy reference | CI fails, block merge |
| Resolver behaviour drift | test_hyperparams_resolver.py 24 contract assertions | any drift from fail-fast / WARN-fallback semantics | CI fails, block merge |
| Console value tampering | ftf_config_history table (PG, existing) | any change to ftf_config.base_env is appended to the history table with updated_by + change_reason + diff | n/a (audit-only) |
| Optuna picks bad hyperparams | (existing) event=training_complete Loki query | best_iteration < 50 OR learning_rate < legacy_min → triggers narrowing of HPO range via Console (committee dissent #4) | operator action via Console |
4. Staged rollout (MUST)
This Story is a code refactor with a parity test guarantee : the seeded values match legacy byte-for-byte. The rollout is therefore operator-driven (Console seeding is the gate, not a traffic split).
| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | N/A (refactor, no model behaviour change) | n/a | event=training_complete Loki shows the SAME hyperparams (post-seeding) as a pre-S17 baseline FTF run on the same crypto/strategy/fold | n/a |
| Canary (W1) | 1 FTF run on defi_top5 5m ATR0.5_1.5_H4 (the original CVN-N001-EE-S16 regression scope) | 1 sweep (~2 h wall) | f1_buy_val ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types (xgboost / lightgbm / catboost). This is the validation that motivated the Story. | f1_buy_val < 0.30 on any model type (regression vs the in-code patches that DIDN'T work) |
| Full (W2) | All future FTF runs (no traffic limit ; the refactor is fully deployed) | indefinite | n/a (the refactor IS the change) | per ADR-26 oncall procedure |
Required minima :
- ✅ Shadow stage N/A justification : the parity tests + the Console seeding ensure the SAME numerical values reach xgb.train / lgb.train / model.fit ; there is no "champion" to shadow because the model is unchanged. SKIP — JUSTIFICATION : refactor with parity test guarantee.
- ✅ Canary on a single batch first : 5 cryptos × 1 FTF run is the W1.
- ✅ Named canary cryptos : BTCUSDC, ETHUSDC, SOLUSDC, AAVEUSDC, UNIUSDC — the same defi_top5 that exposed the regression on 2026-05-11. Apples-to-apples comparison.
- ✅ Operator sign-off : the seeding script --apply step is the manual gate ; the FTF DAG trigger is the second manual gate (ADR-22).
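The W1 canary criteria can be checked mechanically once the per-crypto f1_buy_val values are collected. The sketch below is an assumption-laden illustration: the function name and the intermediate "hold" verdict are hypothetical, and the rollback trigger is read as "any crypto under any model type falls below 0.30", which the table does not spell out.

```python
from typing import Mapping

PASS_F1 = 0.40            # pass threshold per crypto
ROLLBACK_F1 = 0.30        # rollback threshold
MIN_PASSING_CRYPTOS = 4   # out of the 5 defi_top5 cryptos
MODEL_TYPES = ("xgboost", "lightgbm", "catboost")


def canary_verdict(f1_by_model: Mapping[str, Mapping[str, float]]) -> str:
    """Return 'rollback', 'hold' or 'pass' for one canary sweep.

    f1_by_model maps model type -> {crypto symbol -> f1_buy_val}.
    """
    # Rollback trigger takes precedence over everything else.
    for model in MODEL_TYPES:
        if any(f1 < ROLLBACK_F1 for f1 in f1_by_model[model].values()):
            return "rollback"
    # Pass requires >= 4/5 cryptos at or above 0.40 for ALL model types.
    for model in MODEL_TYPES:
        passing = sum(f1 >= PASS_F1 for f1 in f1_by_model[model].values())
        if passing < MIN_PASSING_CRYPTOS:
            return "hold"  # hypothetical state: neither pass nor rollback
    return "pass"
```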
5. Rollback plan (MUST)
| Mechanism | Description | Revert SLA |
|---|---|---|
| Revert merge commit (primary) | Revert the S17 merge SHA. The pre-S17 code has its own (broken) in-code defaults ; broken-but-working is preferred over broken-after-rollback. | < 30 min (PR revert + CI deploy) |
| Helm tag pin (secondary, no code change) | Set cvntradeImageTag to the pre-S17 SHA. Console values are inert for pre-S17 code (it doesn't read CVN_HPO_*). | < 5 min |
| Console rollback (data layer) | If a Console seeding error broke values, the canonical restore mechanism is the Console UI : Config → ftf_config → History → Restore version <previous_version_id> writing back through the same audit path. ⏸️ STATUS : NOT IMPLEMENTED today. The Console exposes a read-only history view ; the "Restore version N" button is scoped under the Configuration Control Plane Epic (OP wp#149 : typed/versioned/audited/scoped/rollbackable config plane, planned as the immediate successor to this Story per operator roadmap). Until that Epic ships, the only rollback path at the data layer is the break-glass PG procedure documented below. Operators MUST treat this as a known operational gap. | n/a (NOT IMPLEMENTED) |
| Break-glass PG restore (data layer, ONLY current option) | INSERT INTO ftf_config SELECT FROM ftf_config_history WHERE id=<previous_version_id> issued directly against PG. Full step-by-step procedure (4 scenarios A/B/C/D + dry-run drill template + post-mortem template) lives in documentation/runbooks/break-glass-hyperparams.md. Gated on : (a) @dococeven written approval in the incident channel before execution, (b) change_reason='break-glass-<incident-id>' recorded in the next Console-driven write within 1h, (c) post-mortem entry filed against the incident within 24h. Acknowledges the ADR-59 invariant violation as a controlled exception during the Epic CCP transition window ; the runbook itself sunsets when Epic CCP wp#149 ships the Console UI restore button. | < 10 min |
Required minima :
- ✅ Revertable WITHOUT a code deploy : helm tag pin (mechanism 2) does this in < 5 min. The merge revert is the more thorough fallback if the helm rollback alone proves insufficient.
- ✅ Tested revert : the helm tag pin pattern was exercised on 2026-05-11 during the validation experiment of CVN-N001-EE-S16 (helm upgrade to experimental SHA, then revert to main SHA — both directions worked, scheduler rolled out, code verified via kubectl exec ... grep).
- ✅ Specific config value : cvntradeImageTag=<pre-S17 main SHA>. Documented at runbook level.
The pre-S17 code IS broken (f1=0.22) but its breakage is documented and contained ; rolling back to it does NOT introduce new failure modes.
6. Owner & DRI (MUST)
- DRI : @dococeven (single human, accountable for this change for the next 90 days)
- Backup DRI : @dococeven (solo operator on cvntrade ; the seeding script + this readiness + the dossier are sufficient for any future handoff)
- Decision authority : @dococeven can flip the helm tag / Console keys without committee
- Sunset milestone : tied to the Configuration Control Plane Epic (OP wp#149) shipping the typed/versioned/audited/rollbackable config plane. By that milestone :
  - The fallback parameter in commun.finetune.hyperparams.resolve(...) callsites is removed (ADR-90 Clause 2 transition window closed)
  - The event=hpo_fallback_applied Loki rate is 0
  - The Console UI restore-from-history is live (closes the rollback gap noted in §5)
  - The break-glass PG path documented in §5 is moved from "current option" to "emergency-only" per the Epic's permissioned model
  - PR-2 (CVN-N001-EE-S18) externalizes the remaining training params (threshold sweep, calibration, regime weighting)
6.bis Calendar floor : multi-stage backstop (per CR session 628a49ba reco #8 + #5)
The single fixed date carried two failure modes : (a) silent extension if Epic CCP slips, (b) no early-warning mechanism that lets the operator course-correct before the hard cliff. The committee crypto-trader + architect explicitly called for a tighter floor ; data-scientist + ops called for an explicit slippage contingency. Both concerns are folded into a single staged backstop :
| Checkpoint | Date (calendar days from merge SHA 0b0a9c7f on 2026-05-12T00:00:00Z UTC) | Trigger | Action |
|---|---|---|---|
| T+30 | 2026-06-11T00:00:00Z | Epic CCP wp#149 still status New OR no children opened | Operator dedicates the next sprint slot to opening Epic CCP children + plan_review the first deliverable (Console UI restore-from-history). Log decision in wp#153 + Epic wp#149. |
| T+60 | 2026-07-11T00:00:00Z | Epic CCP first child not yet In progress | Yellow flag : 30-day risk runway left. Operator schedules a self-review of the trade-off : does S17's break-glass-only rollback still feel acceptable, or do we accept committee blocker reframing and pause new harness work until the gap closes ? Outcome logged as a comment on wp#153. |
| T+90 (calendar floor) | 2026-08-10T00:00:00Z | Epic CCP not shipped (no Closed first child OR Console UI restore not LIVE on console.cvntrade.eu) | Hard escalation : the ADR-90 transition window has expired. Operator opens a [gate-failure] issue, runs a self-pr_review committee with explicit re-evaluation of the rollback gap, and either : (i) ships Epic CCP children with a 30-day waiver extension + risk acceptance recorded on wp#153, (ii) reverts the harness DAGs back to in-code defaults (PR #904 revert) + closes the gap that way, or (iii) parks new harness factor sweeps until Epic CCP ships. Silent extension is NOT an admissible outcome. |
| T+120 (waiver cap) | 2026-09-09T00:00:00Z | Already on a T+90 waiver extension AND still no Epic CCP shipped | Mandatory revert of PR #904 (option ii). The waiver is single-use ; a second extension is not allowed. |
Date calculation rule : T+N is computed as 2026-05-12T00:00:00Z + N × 24h, i.e. full calendar days from midnight UTC on the day of merge. The trigger at T+N fires at the first UTC instant of that day, so a check run at 2026-08-10T00:00:01Z counts as at or after T+90. The operator's local timezone is irrelevant for the trigger ; logs + check evidence MUST cite UTC timestamps.
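The date rule is easy to verify with the standard library. The sketch below recomputes the four checkpoints from the merge day and evaluates whether a check is due ; the helper names are hypothetical, and `now` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

# Midnight UTC on the day of merge, as stated in the checkpoint table.
MERGE_DAY_UTC = datetime(2026, 5, 12, tzinfo=timezone.utc)


def checkpoint(n_days: int) -> datetime:
    """T+N = merge day at 00:00:00Z + N x 24 h."""
    return MERGE_DAY_UTC + timedelta(hours=24 * n_days)


def is_due(n_days: int, now: datetime) -> bool:
    """True once `now` (timezone-aware) is at or after the T+N instant."""
    return now.astimezone(timezone.utc) >= checkpoint(n_days)


if __name__ == "__main__":
    for n in (30, 60, 90, 120):
        print(f"T+{n}: {checkpoint(n).isoformat()}")
```

Because the comparison is done in UTC, an operator running the check from any local timezone gets the same answer for the same instant.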
Why staged, not a single 60-day date : a 60-day cliff (committee reco #8 alternative) compresses the Epic CCP delivery to a window that's likely unrealistic (Epic CCP scope = typed schemas + versioning + audit + scoped resolution + snapshots + approval flow + OpenFeature integration ; see OP wp#149 description, ~1 sprint per child × ≥4 children minimum). A staged backstop gives Epic CCP a realistic delivery runway (~90 days) while injecting two earlier check-in points that prevent silent extension and produce auditable evidence at each gate. Operator decision authority preserved (single human, no committee dependency on the checkpoints themselves, only on the T+90 hard-escalation).
Slippage triggers — machine-checkable today :
- gh issue list --search "label:gate-failure CVN-N001-EE-S17" → must remain empty before T+90 is reached ; at T+90 the operator opens the [gate-failure] issue per the hard-escalation path above (i.e. the gate flips from "empty == healthy" to "1+ open issue == expected" at the T+90 instant, 2026-08-10T00:00:00Z).
- OP wp#149 status (REST query : GET /api/v3/work_packages/149, inspect _embedded.status.name) — must move past New by T+30.
- OP wp#149 children (REST query : GET /api/v3/work_packages?filters=%5B%7B%22parent%22%3A%7B%22operator%22%3A%22%3D%22%2C%22values%22%3A%5B%22149%22%5D%7D%7D%5D, inspect each element's _embedded.status.name) — at least one child must show status.name past In progress (i.e. Developed, In testing, Tested, or Closed) by T+60. The filter syntax is validated against the live openproject.cvntrade.eu API at S17 time (used by scripts/openproject_import_gh.py).
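The URL-encoded filter in the children query is just a JSON array passed through percent-encoding, which the following sketch reproduces. The helper names are hypothetical, and the response shape (a collection with `_embedded.elements`, each carrying `_embedded.status.name`) is an assumption based on the description above rather than a verified contract.

```python
import json
from typing import Any, Dict
from urllib.parse import quote

# Statuses counted as "past In progress" in the T+60 trigger.
PAST_IN_PROGRESS = {"Developed", "In testing", "Tested", "Closed"}


def children_query_path(parent_id: int) -> str:
    """Build the work-package query path that filters on a parent id."""
    filters = [{"parent": {"operator": "=", "values": [str(parent_id)]}}]
    # Compact separators so the encoded string carries no spaces.
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return f"/api/v3/work_packages?filters={encoded}"


def any_child_past_in_progress(collection: Dict[str, Any]) -> bool:
    """True if at least one child work package has moved past 'In progress'."""
    elements = collection.get("_embedded", {}).get("elements", [])
    return any(
        element.get("_embedded", {}).get("status", {}).get("name") in PAST_IN_PROGRESS
        for element in elements
    )
```

Fetching the path (with whatever HTTP client and credentials the operator already uses) and feeding the parsed JSON to any_child_past_in_progress gives a yes/no answer for the T+60 check.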
Console UI restore signal — manual today, automated post-Epic CCP :
- The Console UI restore-from-history feature is the deliverable of Epic CCP wp#149 (an HTTP endpoint specification is part of that Epic's scope, not S17's). Until that endpoint exists, the T+90 check is manual : operator (authenticated with the admin or config_write role — same role that gates Console writes today) opens https://console.cvntrade.eu/config and visually confirms the ftf_config → History page exposes a Restore version N button. Fallback if Console UI is inaccessible (5xx, network error, auth failure on the right role) : treat the check as "feature not LIVE" and follow the T+90 hard-escalation path. Do NOT lower the threshold by trying an alternate auth route ; the inaccessibility itself is evidence the Console transition isn't complete.
- A formal HTTP API specification (endpoint path, method, request payload schema, auth, expected status codes + response bodies) for the restore action is tracked as part of Epic CCP wp#149 — see that work package for the design spec. Once shipped, the manual check above is replaced by a single curl health probe.
The operator can wire the machine-checkable triggers into a single Airflow DAG that pages at each checkpoint (deferred — out of S17 scope, not load-bearing for the contingency).
Sign-off checklist (gate before PR merge)
- §1 monitoring : event=hpo_fallback_applied + Grafana coverage panel defined ; event=training_started + event=training_complete already wired by S16.
- §2 alerting : hp_console_coverage_low P1 alert defined ; runbook deferred to merge-day with justified SKIP.
- §3 drift : CI grep gate G5 + parity tests + resolver behaviour tests = 3 mechanical drift safeguards. Concept drift N/A documented.
- §4 rollout : refactor with parity test guarantee, Shadow stage justified-SKIP ; W1 canary on defi_top5 5m ATR0.5_1.5_H4 (the original CVN-N001-EE-S16 regression scope).
- §5 rollback : merge revert (primary) + helm tag pin (secondary, < 5 min) + ftf_config_history (Console data layer).
- §6 DRI : @dococeven named, sunset milestone-anchored (Epic CCP wp#149) ; §6.bis multi-stage backstop with T+30 / T+60 / T+90 / T+120 checkpoints (addresses CR session 628a49ba recos #5 + #8).
- Dependency declarations (CVN-N011-EA-S11 wp#92, F15) : NO new Python imports added (commun.finetune.hyperparams uses only stdlib + commun.logs.cvntrade_log_manager already pinned). H1 deptry gate green expected on PR open.
- Committee pr_review PASSED on the implementation (post PR open, before merge, per ADR-68)
- Story OP comment links the committee session id (Meetings #131 for plan, #TBD for PR review)