MLOps readiness — CVN-N001-EE-S17 — Training hyperparameters externalization to PG ftf_config (ADR-90)

Story : CVN-N001-EE-S17. The OP work-package and GitHub issue MUST exist before this PR is opened per the canonical flow (ADR-69, ADR-81, documentation/process/STORY_WORKFLOW.md §5 : Pick Story → Issue → Plan → … → PR). This PR (#904) is exceptionally tracked back-to-front : the operator opens the OP wp + GH issue immediately and back-fills the references in this section (OP wp#NNN, Closes #NNN) before the merge gate.

Owner : @dococeven (DRI for production behaviour of this change)

Filled on : 2026-05-11

Reviewed by committee :
  • Plan dossier plan_review v1 (2026-05-11), session ea2e71ff : PASSED / OK / strong consensus (5 experts, 6 recos addressed in dossier v2)

Scope context : this Story is a code refactor that externalizes 455 training hyperparameter keys (110 HPO-tunable defaults + 330 RANGE_{MIN,MAX,SCALE} keys, i.e. 110 triples, + 15 EARLY_STOPPING_ROUNDS defaults) from in-code defaults to the PG ftf_config.base_env Console-managed surface. It does NOT change any model architecture, feature, or label. The seeded values are byte-for-byte identical to the legacy values, so the training behaviour is unchanged (parity tests at tests/unit/training_harness/test_hyperparams_seeding_parity.py lock the 125 default values against the pre-#899 git history). The structural change : no hyperparameter value may live in Python source any more (enforced by CI grep gate G5).


1. Production monitoring (MUST)

| Status | Metric | Type | Source | Dashboard | Threshold (warn / crit) | Owner |
|---|---|---|---|---|---|---|
| ✅ LIVE | event=hpo_fallback_applied model=... timeframe=... param=... key=... fallback=... reason=env_key_missing | counter | commun.finetune.hyperparams.resolve | Grafana cvntrade-hp-coverage (this Story) | rate > 1 % of event=training_started → P3 (Console seeding incomplete) ; rate > 5 % → P2 (CRIT) | @dococeven |
| ✅ LIVE | event=training_started carries learning_rate=... max_depth=... (existing log_emit node) | gauge bundle | training.harness.nodes.log_emit.emit_training_started | Grafana cvntrade-training-harness (existing) | for the audit pattern : any learning_rate < 0.01 OR max_depth = -1 → P2 (HPO range too wide) | @dococeven |
| ✅ LIVE | event=training_complete carries best_iteration / training_time_sec / f1_buy_val (existing) | gauge bundle | same | same | best_iteration < 50 OR training_time_sec p95 > 2× pre-S17 baseline → P2 (hyperparam regression) | @dococeven |
| ⏸️ PLANNED | harness.hp_coverage_pct{} (% Console seeded keys / 455 expected) | gauge (Prometheus via OTel) | commun.finetune.hyperparams.audit_console_coverage polled at FTF DAG start | same Grafana panel | < 100 % → WARN ; < 95 % → CRIT (committee reco #1 : raised from 70 % in dossier v1) | @dococeven |
| ⏸️ PLANNED | harness.hp_fallback_rate_24h{} (count of hpo_fallback_applied / 24 h) | counter rate | Loki aggregation | same Grafana panel | > 0 at 7 days post-merge → P3 (operator MUST seed remaining keys) ; > 0 at 14 days post-merge → P2 | @dococeven |

Required minima coverage :
  • ✅ prediction-rate metric : signals.buy_proba (existing, unchanged ; no inference path touched).
  • ✅ outcome metric : f1_buy_val per fold per variant per crypto (existing in finetune_results PG + event=training_complete Loki).
  • ✅ health metric : event=hpo_fallback_applied rate is the canonical Console-coverage audit signal ; it directly informs whether the externalization mechanism itself is healthy.

The metric event=hpo_fallback_applied is the non-negotiable addition this Story makes : it converts the implicit "in-code default got used silently" failure mode (which hid the harness regression for 3 days) into an explicit, queryable observable.
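The fail-fast / WARN-fallback behaviour behind this event can be sketched as follows. This is a minimal illustration, not the actual commun.finetune.hyperparams.resolve implementation : the key layout built by env_key, the cast and env parameters, and the logger name are assumptions ; only the CVN_HPO_ prefix and the log fields come from this document.

```python
import logging
import os

log = logging.getLogger("hyperparams_sketch")  # hypothetical logger name

_MISSING = object()  # sentinel: distinguishes "no fallback given" from fallback=None


def env_key(model, timeframe, param):
    # Hypothetical key layout; the document only states the CVN_HPO_ prefix.
    return f"CVN_HPO_{model}_{timeframe}_{param}".upper()


def resolve(model, timeframe, param, cast=float, fallback=_MISSING, env=None):
    """Return the seeded value, or the fallback together with an explicit event."""
    env = os.environ if env is None else env
    key = env_key(model, timeframe, param)
    raw = env.get(key)
    if raw is not None:
        return cast(raw)
    if fallback is _MISSING:
        # Fail-fast: there is no silent in-code default to fall back on.
        raise KeyError(f"hyperparameter key {key} is not seeded")
    # WARN-fallback: the value is used, but the event makes it queryable.
    log.warning(
        "event=hpo_fallback_applied model=%s timeframe=%s param=%s key=%s "
        "fallback=%s reason=env_key_missing",
        model, timeframe, param, key, fallback,
    )
    return fallback
```

Passing an explicit env mapping keeps the sketch testable without touching the process environment ; a callsite without a fallback argument raises instead of training on an unseeded value.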

2. Alerting & runbooks (MUST)

| Alert | Trigger | Runbook | Severity | Notification channel |
|---|---|---|---|---|
| hp_fallback_rate_high | Loki count of event=hpo_fallback_applied over 6 h > 10 occurrences (covering steady-state FTF traffic) | documentation/runbooks/runbook_hp_console_seeding.md (P3 follow-up ; first version filed at PR merge as part of the closure ritual) | P2 | Slack #ops-cvn |
| hp_console_coverage_low | audit_console_coverage reports < 95 % seeded keys (CRIT threshold per committee reco #1) | same runbook | P1 (page on-call : the FTF will fail-fast on the missing keys at next training) | PagerDuty |
| hp_fallback_persistent | hpo_fallback_applied event observed 7 days after PR merge (the operator's transition window per ADR-90 Clause 4) | same runbook | P3 → P2 escalation at d+14 | Slack #ops-cvn |

Required minima :
  • ✅ 1 P1 alert : hp_console_coverage_low covers the silent-failure mode that would surface as "FTF crashes on every trigger" post-deploy if the operator forgot to run the seeding script.
  • ✅ Runbook : documentation/runbooks/runbook_hp_console_seeding.md is planned for merge day (the seeding script + this readiness file are sufficient for the operator to recover ; the runbook formalizes the steps with a diagnosis tree). SKIP - JUSTIFICATION : the seeding script itself is the runbook ; its --dry-run mode is the diagnosis tool. A separate runbook adds detail (e.g., what to do if PG is unreachable, who has write rights) that is filed as a P3 follow-up.
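The coverage thresholds behind harness.hp_coverage_pct and hp_console_coverage_low (< 100 % → WARN, < 95 % → CRIT) can be sketched as below. The function names and the "OK" level are illustrative assumptions ; the planned audit_console_coverage may be shaped differently.

```python
EXPECTED_KEY_COUNT = 455  # 110 defaults + 330 range keys + 15 early-stopping defaults


def coverage_pct(seeded_keys, expected_keys):
    """Percentage of expected keys that are present in the seeded set."""
    expected = set(expected_keys)
    if not expected:
        raise ValueError("expected_keys must not be empty")
    return 100.0 * len(expected & set(seeded_keys)) / len(expected)


def coverage_level(pct):
    """Map a coverage percentage to the alert level named in the tables above."""
    if pct < 95.0:
        return "CRIT"  # hp_console_coverage_low fires (P1)
    if pct < 100.0:
        return "WARN"
    return "OK"  # hypothetical label for full coverage
```

With 455 expected keys, 23 or more missing keys fall below 95 % and would raise the P1 alert ; between 1 and 22 missing keys stay at WARN.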

3. Drift detection (MUST)

This Story is a code refactor — there is no model, feature, or label change. No new drift surface to monitor.

The internal contract drift is structurally prevented :
  • commun.finetune.hyperparams.resolve is the single authoritative reader (audit query : grep -rn "os.environ.*CVN_HPO_" src/training/ src/commun/finetune/ | grep -v hyperparams.py must return zero matches outside the resolver itself).
  • The CI grep gate G5 (.github/workflows/pr-workflow-guardrails.yml, job pr-workflow-guardrails → step G5, ADR-90 hyperparameters externalization) catches any numeric hyperparam literal that sneaks back into src/training/harness/ or src/commun/finetune/, in both assignment form (learning_rate=0.1) and dict-literal form ("learning_rate": 0.1). Legacy src/training/{XGBoost,LightGBM,CatBoost,MetaLabel,TwoStage}/ modules are out of S17 scope ; they are scheduled for ADR-90 cutover in follow-up Stories.
  • The parity tests test_hyperparams_seeding_parity.py pin the bundled legacy values against known good references ; a future PR that mutates the seeding defaults without an explicit reason fails the test.
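The two literal forms that G5 targets can be illustrated with a small Python matcher. This is a sketch under assumptions : the real gate is a grep step in pr-workflow-guardrails.yml whose param-name regex is not reproduced here, so PARAM_NAMES below is a hypothetical subset.

```python
import re

# Hypothetical subset of param names; the real gate's regex is defined in CI.
PARAM_NAMES = ("learning_rate", "max_depth", "n_estimators", "subsample")
_NAME = "|".join(PARAM_NAMES)
_NUM = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

# Assignment form, e.g. learning_rate=0.1 (a comparison such as == is not matched).
ASSIGNMENT = re.compile(rf"\b({_NAME})\s*=\s*{_NUM}")
# Dict-literal form, e.g. "learning_rate": 0.1
DICT_LITERAL = re.compile(rf"""["']({_NAME})["']\s*:\s*{_NUM}""")


def find_literals(source):
    """Return (line_number, param_name) for each line carrying a numeric literal."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern in (ASSIGNMENT, DICT_LITERAL):
            match = pattern.search(line)
            if match:
                hits.append((lineno, match.group(1)))
                break  # one hit per line is enough for a gate
    return hits
```

Like a grep, the matcher is line-based and does not parse Python, so a literal inside a comment or docstring would also be reported.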

| Drift type | Detection method | Threshold | Action |
|---|---|---|---|
| Hyperparam regression in code | CI grep gate G5 | any literal numeric matching the param-name regex | WARN-only until the 3-day flip window closes (see §G5 below) : emits a CI annotation but does NOT block merge. Once WARN_ONLY=1 is removed in pr-workflow-guardrails.yml (target : merge SHA + 3d), the action becomes "CI fails, block merge". |
| Seeding script value drift | test_hyperparams_seeding_parity.py parametrized × 110 default keys + 110 range triples | any mismatch vs the pinned legacy reference | CI fails, block merge |
| Resolver behaviour drift | test_hyperparams_resolver.py 24 contract assertions | any drift from fail-fast / WARN-fallback semantics | CI fails, block merge |
| Console value tampering | ftf_config_history table (PG, existing) | any change to ftf_config.base_env is appended to the history table with updated_by + change_reason + diff | n/a (audit-only) |
| Optuna picks bad hyperparams (existing) | event=training_complete Loki query | best_iteration < 50 OR learning_rate < legacy_min → triggers narrowing of HPO range via Console (committee dissent #4) | operator action via Console |

4. Staged rollout (MUST)

This Story is a code refactor with a parity test guarantee : the seeded values match legacy byte-for-byte. The rollout is therefore operator-driven (Console seeding is the gate, not a traffic split).

| Stage | Traffic % | Duration | Pass criteria | Rollback trigger |
|---|---|---|---|---|
| Shadow | N/A (refactor, no model behaviour change) | | event=training_complete Loki shows the SAME hyperparams (post-seeding) as a pre-S17 baseline FTF run on the same crypto/strategy/fold | n/a |
| Canary (W1) | 1 FTF run on defi_top5 5m ATR0.5_1.5_H4 (the original CVN-N001-EE-S16 regression scope) | 1 sweep (~2 h wall) | f1_buy_val ≥ 0.40 on ≥ 4/5 cryptos for ALL 3 model types (xgboost / lightgbm / catboost). This is the validation that motivated the Story. | f1_buy_val < 0.30 on any model type (regression vs the in-code patches that DIDN'T work) |
| Full (W2) | All future FTF runs (no traffic limit ; the refactor is fully deployed) | indefinite | n/a (the refactor IS the change) | per ADR-26 oncall procedure |

Required minima :
  • ✅ Shadow stage N/A justification : the parity tests + the Console seeding ensure the SAME numerical values reach xgb.train / lgb.train / model.fit ; there is no "champion" to shadow because the model is unchanged. SKIP - JUSTIFICATION : refactor with parity test guarantee.
  • ✅ Canary on a single batch first : 5 cryptos × 1 FTF run is the W1.
  • ✅ Named canary cryptos : BTCUSDC, ETHUSDC, SOLUSDC, AAVEUSDC, UNIUSDC, the same defi_top5 that exposed the regression on 2026-05-11. Apples-to-apples comparison.
  • ✅ Operator sign-off : the seeding script --apply step is the manual gate ; the FTF DAG trigger is the second manual gate (ADR-22).
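The W1 canary criteria can be expressed as a small decision function. This is a sketch under two assumptions that the text does not settle : the rollback trigger is read as "any single crypto below 0.30 for a model type", and the in-between outcome is given a hypothetical "hold" verdict.

```python
MODEL_TYPES = ("xgboost", "lightgbm", "catboost")
PASS_F1 = 0.40           # pass criterion per crypto
ROLLBACK_F1 = 0.30       # rollback trigger
MIN_PASSING_CRYPTOS = 4  # out of the 5 defi_top5 cryptos


def canary_verdict(f1_buy_val):
    """f1_buy_val maps model_type -> {crypto: f1}. Returns 'rollback', 'pass' or 'hold'."""
    # Rollback is checked first: one bad model type outweighs any passing ones.
    for model in MODEL_TYPES:
        if any(f1 < ROLLBACK_F1 for f1 in f1_buy_val[model].values()):
            return "rollback"
    # Pass requires every model type to reach the threshold on enough cryptos.
    for model in MODEL_TYPES:
        passing = sum(f1 >= PASS_F1 for f1 in f1_buy_val[model].values())
        if passing < MIN_PASSING_CRYPTOS:
            return "hold"  # hypothetical: neither pass nor rollback, operator decides
    return "pass"
```

A missing model type raises KeyError on purpose, since the criterion requires all three model types to be evaluated.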

5. Rollback plan (MUST)

| Mechanism | Description | Revert SLA |
|---|---|---|
| Revert merge commit (primary) | Revert the S17 merge SHA. The pre-S17 code has its own (broken) in-code defaults ; a known, documented breakage is preferred over a new failure introduced by the rollback. | < 30 min (PR revert + CI deploy) |
| Helm tag pin (secondary, no code change) | Set cvntradeImageTag to the pre-S17 SHA. Console values are inert for pre-S17 code (it doesn't read CVN_HPO_*). | < 5 min |
| Console rollback (data layer) | If a Console seeding error broke values, the canonical restore mechanism is the Console UI : Config → ftf_config → History → Restore version <previous_version_id>, writing back through the same audit path. ⏸️ STATUS : NOT IMPLEMENTED today. The Console exposes a read-only history view ; the "Restore version N" button is scoped under the Configuration Control Plane Epic (OP wp#149 : typed/versioned/audited/scoped/rollbackable config plane, planned as the immediate successor to this Story per operator roadmap). Until that Epic ships, the only rollback path at the data layer is the break-glass PG procedure documented below. Operators MUST treat this as a known operational gap. | n/a (NOT IMPLEMENTED) |
| Break-glass PG restore (data layer, ONLY current option) | INSERT INTO ftf_config SELECT FROM ftf_config_history WHERE id=<previous_version_id> issued directly against PG. Full step-by-step procedure (4 scenarios A/B/C/D + dry-run drill template + post-mortem template) lives in documentation/runbooks/break-glass-hyperparams.md. Gated on : (a) @dococeven written approval in the incident channel before execution, (b) change_reason='break-glass-<incident-id>' recorded in the next Console-driven write within 1h, (c) post-mortem entry filed against the incident within 24h. Acknowledges the ADR-59 invariant violation as a controlled exception during the Epic CCP transition window ; the runbook itself sunsets when Epic CCP wp#149 ships the Console UI restore button. | < 10 min |

Required minima :
  • ✅ Revertable WITHOUT a code deploy : helm tag pin (mechanism 2) does this in < 5 min. The merge revert is the more thorough fallback if the helm rollback alone proves insufficient.
  • ✅ Tested revert : the helm tag pin pattern was exercised on 2026-05-11 during the validation experiment of CVN-N001-EE-S16 (helm upgrade to experimental SHA, then revert to main SHA ; both directions worked, scheduler rolled out, code verified via kubectl exec ... grep).
  • ✅ Specific config value : cvntradeImageTag=<pre-S17 main SHA>. Documented at runbook level.

The pre-S17 code IS broken (f1=0.22) but its breakage is documented and contained ; rolling back to it does NOT introduce new failure modes.

6. Owner & DRI (MUST)

  • DRI : @dococeven (single human, accountable for this change for the next 90 days)
  • Backup DRI : @dococeven (solo operator on cvntrade ; the seeding script + this readiness + the dossier are sufficient for any future handoff)
  • Decision authority : @dococeven can flip the helm tag / Console keys without committee
  • Sunset milestone : tied to the Configuration Control Plane Epic (OP wp#149) shipping the typed/versioned/audited/rollbackable config plane. By that milestone :
      • The fallback parameter in commun.finetune.hyperparams.resolve(...) callsites is removed (ADR-90 Clause 2 transition window closed)
      • The event=hpo_fallback_applied Loki rate is 0
      • The Console UI restore-from-history is live (closes the rollback gap noted in §5)
      • The break-glass PG path documented in §5 is moved from "current option" to "emergency-only" per the Epic's permissioned model
      • PR-2 (CVN-N001-EE-S18) externalizes the remaining training params (threshold sweep, calibration, regime weighting)

6.bis Calendar floor — multi-stage backstop (per CR session 628a49ba reco #8 + #5)

The single fixed date carried two failure modes : (a) silent extension if Epic CCP slips, (b) no early-warning mechanism that lets the operator course-correct before the hard cliff. The committee crypto-trader + architect explicitly called for a tighter floor ; data-scientist + ops called for an explicit slippage contingency. Both concerns are folded into a single staged backstop :

| Checkpoint | Date (calendar days from merge SHA 0b0a9c7f on 2026-05-12T00:00:00Z) | Trigger | Action |
|---|---|---|---|
| T+30 | 2026-06-11T00:00:00Z | Epic CCP wp#149 still status New OR no children opened | Operator dedicates the next sprint slot to opening Epic CCP children + plan_review the first deliverable (Console UI restore-from-history). Log decision in wp#153 + Epic wp#149. |
| T+60 | 2026-07-11T00:00:00Z | Epic CCP first child not yet In progress | Yellow flag : 30-day risk runway left. Operator schedules a self-review of the trade-off : does S17's break-glass-only rollback still feel acceptable, or do we accept committee blocker reframing and pause new harness work until the gap closes ? Outcome logged as a comment on wp#153. |
| T+90 (calendar floor) | 2026-08-10T00:00:00Z | Epic CCP not shipped (no Closed first child OR Console UI restore not LIVE on console.cvntrade.eu) | Hard escalation : the ADR-90 transition window has expired. Operator opens a [gate-failure] issue, runs a self-pr_review committee with explicit re-evaluation of the rollback gap, and either : (i) ships Epic CCP children with a 30-day waiver extension + risk acceptance recorded on wp#153, (ii) reverts the harness DAGs back to in-code defaults (PR #904 revert) + closes the gap that way, or (iii) parks new harness factor sweeps until Epic CCP ships. Silent extension is NOT an admissible outcome. |
| T+120 (waiver cap) | 2026-09-09T00:00:00Z | Already on a T+90 waiver extension AND still no Epic CCP shipped | Mandatory revert of PR #904 (option ii). The waiver is single-use ; a second extension is not allowed. |

Date calculation rule : T+N is computed as 2026-05-12T00:00:00Z + N × 24h, i.e. full calendar days from midnight UTC on the day of merge. The trigger at T+N fires at the first UTC instant of that day (a check run at 2026-08-10T00:00:01Z counts as at or after T+90, so the T+90 evaluation applies). The operator's local timezone is irrelevant for the trigger ; logs + check evidence MUST cite UTC timestamps.
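The date rule can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only ; the function names are illustrative and no such helper is claimed to exist in the repository.

```python
from datetime import datetime, timedelta, timezone

MERGE_DAY_UTC = datetime(2026, 5, 12, tzinfo=timezone.utc)  # midnight UTC, merge day
CHECKPOINT_DAYS = (30, 60, 90, 120)


def checkpoint(n_days):
    """T+N = merge-day midnight UTC + N x 24h."""
    return MERGE_DAY_UTC + timedelta(hours=24 * n_days)


def reached(n_days, now):
    """True once `now` is at or after the first UTC instant of T+N."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware; evidence must cite UTC")
    return now >= checkpoint(n_days)


for n in CHECKPOINT_DAYS:
    print(f"T+{n}: {checkpoint(n).strftime('%Y-%m-%dT%H:%M:%SZ')}")
# T+30: 2026-06-11T00:00:00Z
# T+60: 2026-07-11T00:00:00Z
# T+90: 2026-08-10T00:00:00Z
# T+120: 2026-09-09T00:00:00Z
```

Rejecting naive datetimes keeps a local-timezone timestamp from being compared against the UTC checkpoints by accident.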

Why staged, not a single 60-day date : a 60-day cliff (committee reco #8 alternative) compresses the Epic CCP delivery to a window that's likely unrealistic (Epic CCP scope = typed schemas + versioning + audit + scoped resolution + snapshots + approval flow + OpenFeature integration ; see OP wp#149 description, ~1 sprint per child × ≥4 children minimum). A staged backstop gives Epic CCP a realistic delivery runway (~90 days) while injecting two earlier check-in points that prevent silent extension and produce auditable evidence at each gate. Operator decision authority is preserved (single human, no committee dependency on the checkpoints themselves, only on the T+90 hard-escalation).

Slippage triggers (machine-checkable today) :
  • gh issue list --search "label:gate-failure CVN-N001-EE-S17" → must remain empty before T+90 is reached ; at T+90 the operator opens the [gate-failure] issue per the hard-escalation path above (i.e. the gate flips from "empty == healthy" to "1+ open issue == expected" at T+90, 2026-08-10T00:00:00Z).
  • OP wp#149 status (REST query : GET /api/v3/work_packages/149, inspect _embedded.status.name) : must move past New by T+30.
  • OP wp#149 children (REST query : GET /api/v3/work_packages?filters=%5B%7B%22parent%22%3A%7B%22operator%22%3A%22%3D%22%2C%22values%22%3A%5B%22149%22%5D%7D%7D%5D, inspect each element's _embedded.status.name) : at least one child must show status.name past In progress (i.e. Developed, In testing, Tested, or Closed) by T+60. The filter syntax is validated against the live openproject.cvntrade.eu API at S17 time (used by scripts/openproject_import_gh.py).
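The percent-encoded filter in the children query is easier to maintain when generated from its JSON form. The sketch below builds the same query string and extracts status names along the path named above ; the function names are illustrative, and the response shape is taken from this document rather than verified against the API here.

```python
import json
from urllib.parse import quote


def children_query_path(parent_id):
    """Build the work-package children query with a percent-encoded JSON filter."""
    filters = [{"parent": {"operator": "=", "values": [str(parent_id)]}}]
    compact = json.dumps(filters, separators=(",", ":"))  # no spaces before encoding
    return "/api/v3/work_packages?filters=" + quote(compact, safe="")


def child_status_names(collection):
    """Collect each element's _embedded.status.name from a parsed JSON response."""
    elements = collection["_embedded"]["elements"]
    return [element["_embedded"]["status"]["name"] for element in elements]


print(children_query_path(149))
```

Decoding the printed filter gives [{"parent":{"operator":"=","values":["149"]}}], which makes the intent of the query reviewable in a PR diff.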

Console UI restore signal (manual today, automated post-Epic CCP) :
  • The Console UI restore-from-history feature is the deliverable of Epic CCP wp#149 (an HTTP endpoint specification is part of that Epic's scope, not S17's). Until that endpoint exists, the T+90 check is manual : the operator (authenticated with the admin or config_write role, the same role that gates Console writes today) opens https://console.cvntrade.eu/config and visually confirms that the ftf_config → History page exposes a Restore version N button. Fallback if the Console UI is inaccessible (5xx, network error, auth failure on the right role) : treat the check as "feature not LIVE" and follow the T+90 hard-escalation path. Do NOT lower the threshold by trying an alternate auth route ; the inaccessibility itself is evidence the Console transition isn't complete.
  • A formal HTTP API specification (endpoint path, method, request payload schema, auth, expected status codes + response bodies) for the restore action is tracked as part of Epic CCP wp#149 ; see that work package for the design spec. Once shipped, the manual check above is replaced by a single curl health probe.

The operator can wire the machine-checkable triggers into a single Airflow DAG that pages at each checkpoint (deferred — out of S17 scope, not load-bearing for the contingency).


Sign-off checklist (gate before PR merge)

  • §1 monitoring : event=hpo_fallback_applied + Grafana coverage panel defined ; event=training_started + event=training_complete already wired by S16.
  • §2 alerting : hp_console_coverage_low P1 alert defined ; runbook deferred to merge-day with justified SKIP.
  • §3 drift : CI grep gate G5 + parity tests + resolver behaviour tests = 3 mechanical drift safeguards. Concept drift N/A documented.
  • §4 rollout : refactor with parity test guarantee, Shadow stage justified-SKIP ; W1 canary on defi_top5 5m ATR0.5_1.5_H4 (the original CVN-N001-EE-S16 regression scope).
  • §5 rollback : merge revert (primary, < 30 min) + helm tag pin (secondary, no code deploy, < 5 min) + break-glass PG restore from ftf_config_history (data layer ; Console UI restore not yet implemented).
  • §6 DRI : @dococeven named, sunset milestone-anchored (Epic CCP wp#149) ; §6.bis multi-stage backstop with T+30 / T+60 / T+90 / T+120 checkpoints (addresses CR session 628a49ba recos #5 + #8).
  • Dependency declarations (CVN-N011-EA-S11 wp#92, F15) : NO new Python imports added (commun.finetune.hyperparams uses only stdlib + commun.logs.cvntrade_log_manager already pinned). H1 deptry gate green expected on PR open.
  • Committee pr_review PASSED on the implementation (post PR open, before merge — per ADR-68)
  • Story OP comment links the committee session id (Meetings #131 for plan, #TBD for PR review)