Runbook : Break-glass hyperparameter rollback (ADR-90 transition window)
Severity : P1 if FTF / training is breached ; P2 if Console UI is the only thing down
Owner : @dococeven (sole approval authority — no committee approval needed during transition window)
Story : CVN-N001-EE-S17 (GH #905, OP wp#153) · pr_review verdict 628a49ba PASSED OK
Linked code : scripts/seed_hyperparams_console.py · src/commun/finetune/hyperparams.py
Supersedes : the rollback paragraph in documentation/stories/CVN-N001-EE-S17/mlops_readiness.md §5 (this runbook is the operational long-form ; the readiness file is the contract).
When to invoke this runbook
ONLY in one of these three scenarios :
- ftf_config.base_env got corrupted : a --apply ran with a wrong payload, or a manual SQL was issued against the table by mistake, or a force-overwrite clobbered operator-tuned values. The harness DAGs now produce event=hpo_fallback_applied floods or fail-fast on missing keys.
- Console UI is down AND ftf_config.base_env needs to be edited urgently (e.g., to widen an HPO range that's clipping the optimum). The canonical edit path is unavailable, and the transition window from PR #904 to Epic CCP wp#149 has us on break-glass-only data-layer rollback.
- Operator typo : a Console UI edit committed a value that's about to break the next FTF (e.g., learning_rate: 0.0 instead of 0.07). The Console UI history view is read-only today (no Restore button until Epic CCP ships). Break-glass restore from ftf_config_history is the only recovery path.
NOT a break-glass scenario :
- Routine value tuning → use Console UI write
- Wholesale ftf_config reset → use python scripts/seed_hyperparams_console.py --apply --force-overwrite (canonical seed path, no break-glass needed)
- Binary regression on the training code → helm rollback to the pre-S17 SHA. The break-glass PG path is for the data layer, not the binary layer.
Gating ritual : MUST happen BEFORE issuing any SQL
The break-glass PG path violates the ADR-59 "Console-only" invariant. It is permitted only during the ADR-90 transition window AND only with the following three gates :
- Written approval in the incident channel : @dococeven posts a message in #ops-cvn (or the active incident channel) of the form :
event=break_glass_approval incident_id=<NNN> story=CVN-N001-EE-S17 rationale='<one sentence>' approver=@dococeven approval_ts=<UTC iso>
  The incident_id is the Slack thread permalink OR a GH issue number. The approval comment MUST predate any SQL issuance by at least 2 minutes (operator pause-and-think gate).
- change_reason carried into the next Console-driven write within 1h : after the break-glass SQL completes, the operator MUST issue a Console UI write (any value, including a no-op version bump) within 60 minutes carrying change_reason='break-glass-<incident_id>'. This reconciles the audit trail : ftf_config_history will show the break-glass row + the next operator write with the linking incident_id, restoring the ADR-59 audit invariant ex-post.
- Post-mortem entry : within 24h of the incident, the operator files a comment on OP wp#153 with :
  - the rendered event=break_glass_approval line
  - the SQL that was executed (verbatim)
  - the change_reason linking write timestamp + the ftf_config_history.id rows involved
  - what the next operator response will be (Console UI fix, follow-up Story, ADR addendum)
If any of these three gates is skipped, the incident is non-compliant with ADR-68 + ADR-82 and MUST be raised at the next pr_review committee.
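The approval line format and the 2-minute pause gate lend themselves to a mechanical check before any SQL is piped. The following is a minimal Python sketch under stated assumptions : the helper names (render_approval_line, pause_gate_satisfied) are hypothetical and not part of the repo ; only the line format and the 2-minute threshold come from this runbook.

```python
from datetime import datetime, timedelta, timezone

# 2-minute pause-and-think gate, as stated in the gating ritual.
PAUSE_GATE = timedelta(minutes=2)


def render_approval_line(incident_id: str, rationale: str, approval_ts: datetime) -> str:
    """Render the event=break_glass_approval line (hypothetical helper)."""
    ts = approval_ts.astimezone(timezone.utc).isoformat()
    return (
        f"event=break_glass_approval incident_id={incident_id} "
        f"story=CVN-N001-EE-S17 rationale='{rationale}' "
        f"approver=@dococeven approval_ts={ts}"
    )


def pause_gate_satisfied(approval_ts: datetime, now: datetime) -> bool:
    """True once the approval predates `now` by at least 2 minutes."""
    return now - approval_ts >= PAUSE_GATE
```

An operator could call pause_gate_satisfied(approval_ts, datetime.now(timezone.utc)) immediately before piping the SQL and stop if it returns False.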
The 4 break-glass procedures
⚠️ SQL placeholder substitution : security rules (READ BEFORE any procedure)
Every procedure below uses SQL templates with <NEW_VALUE>, <MODEL>, <TF>, <PARAM>, <INCIDENT_ID>, <TARGET_ID> placeholders that the operator substitutes manually before piping the command to psql. Substituting raw values inline opens an injection surface (a value containing a ' breaks the quoting ; a malicious value could exfiltrate other rows).
Mandatory rules :
1. Numeric values (<NEW_VALUE> like 0.07, 150, <TARGET_ID> like 42) : substitute as-is. PostgreSQL parses a bare numeric literal as a number ; where the template wraps the placeholder in single quotes (as Procedure B does for <NEW_VALUE>), the value is stored as text. Zero injection risk if the operator visually confirms the value is purely numeric (^-?[0-9]+(\.[0-9]+)?$).
2. String values containing ' : DOUBLE every single quote (PG-canonical escape) : O'Reilly → O''Reilly. Then wrap in single quotes : 'O''Reilly'. If unsure whether escaping is needed, prefer psql -v variable binding (see §SQL safe substitution below) instead of inline substitution.
3. <INCIDENT_ID> values : MUST be alphanumeric plus dashes and underscores only (^[a-zA-Z0-9_-]+$). The runbook expects the operator to use a GH issue number or a Slack thread short-id (a full Slack permalink does not match this pattern). A value containing ', ;, --, /* is rejected by the operator's own visual check before piping ; if you see those characters, fix the incident_id first.
4. <MODEL>, <TF>, <PARAM> : these are enum-constrained (MODEL ∈ {XGB, LGB, CB}, TF ∈ {1M, 5M, 15M, 30M, 1H}, PARAM matches ^[A-Z_]+$). Any value outside these patterns is a typo, not an injection ; abort the substitution.
SQL-safe substitution pattern (recommended for <NEW_VALUE> strings) : use psql -v instead of inline interpolation. psql does not expand variables inside a -c command string, so the SQL is fed on stdin through a quoted heredoc :
PGPASSWORD="$PG_PASS" psql -h "$PG_HOST" -U "$PG_USER" -d champollion \
-v incident_id="<INCIDENT_ID>" \
-v param_key="CVN_HPO_<MODEL>_<TF>_<PARAM>" \
-v new_value="<NEW_VALUE>" <<'SQL'
-- :'name' expands to the variable's value quoted as a SQL literal.
UPDATE ftf_config
SET base_env = jsonb_set(base_env, ARRAY[:'param_key'], to_jsonb(:'new_value'::text), false)
WHERE id = 1 AND base_env ? :'param_key';
SQL
psql quotes the bound variables safely. Use this form when the <NEW_VALUE> is user-controlled (e.g., a value pasted from a chat message). For purely numeric / enum values from the trusted catalog, inline substitution is acceptable but still subject to rules 1-4.
If any of rules 1-4 cannot be confirmed visually, abort the break-glass execution and escalate to a second operator pair-review before issuing the SQL. The 2-minute pause-and-think gate from the §gating ritual exists precisely for this.
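Rules 1-4 can also be pre-checked in code before the visual confirmation. The sketch below is a hypothetical helper, not part of the repo ; the regexes and enum sets are copied from the rules above, and quote_literal only covers the quote-doubling step of rule 2 (it assumes PostgreSQL's default standard_conforming_strings=on, and psql variable binding remains the preferred path for untrusted strings).

```python
import re

# Patterns and enums copied from rules 1, 3 and 4 of this runbook.
NUMERIC_RE = re.compile(r"-?[0-9]+(\.[0-9]+)?")
INCIDENT_RE = re.compile(r"[a-zA-Z0-9_-]+")
PARAM_RE = re.compile(r"[A-Z_]+")
MODELS = {"XGB", "LGB", "CB"}
TIMEFRAMES = {"1M", "5M", "15M", "30M", "1H"}


def is_numeric(value: str) -> bool:
    """Rule 1 : the value is purely numeric."""
    return NUMERIC_RE.fullmatch(value) is not None


def quote_literal(value: str) -> str:
    """Rule 2 : double every single quote, then wrap in single quotes."""
    return "'" + value.replace("'", "''") + "'"


def check_placeholders(model: str, tf: str, param: str, incident_id: str) -> list[str]:
    """Rules 3-4 : return a list of problems ; an empty list means OK."""
    problems = []
    if model not in MODELS:
        problems.append(f"MODEL {model!r} not in {sorted(MODELS)}")
    if tf not in TIMEFRAMES:
        problems.append(f"TF {tf!r} not in {sorted(TIMEFRAMES)}")
    if PARAM_RE.fullmatch(param) is None:
        problems.append(f"PARAM {param!r} does not match ^[A-Z_]+$")
    if INCIDENT_RE.fullmatch(incident_id) is None:
        problems.append(f"INCIDENT_ID {incident_id!r} has forbidden characters")
    return problems
```

A non-empty result from check_placeholders maps to the "abort and escalate to pair-review" branch above ; it does not replace the operator's visual check.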
Procedure A : Restore from ftf_config_history (most common)
When : a Console UI edit (or a --apply with a wrong payload) committed bad values and the operator needs to revert to a previous version.
Step 1 — Identify the target version_id :
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
SELECT id, version, changed_by, change_reason, changed_at
FROM ftf_config_history
ORDER BY changed_at DESC
LIMIT 20;
"
'
Pick the most recent id BEFORE the bad write. Verify the change_reason shows what you expect ; sanity-check the base_env snapshot if needed :
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
SELECT base_env->>'\''CVN_HPO_LGB_5M_LEARNING_RATE'\'' AS sample_key,
base_env->>'\''CVN_HPO_XGB_5M_N_ESTIMATORS'\'' AS sample_key_2
FROM ftf_config_history WHERE id = <TARGET_ID>;
"
'
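The "most recent id BEFORE the bad write" selection can be expressed as a small function over the rows returned by the Step 1 query. This is a sketch only : pick_restore_target is a hypothetical name, and it assumes the rows were loaded into dicts carrying the id and changed_at columns from that query.

```python
from datetime import datetime


def pick_restore_target(rows: list[dict], bad_id: int):
    """Return the id of the most recent history row strictly before the bad write.

    rows   : dicts with 'id' and 'changed_at' (columns of the Step 1 query).
    bad_id : ftf_config_history.id of the write that introduced the bad values.
    Returns None when no earlier row exists.
    """
    bad = next((r for r in rows if r["id"] == bad_id), None)
    if bad is None:
        raise ValueError(f"id={bad_id} is not in the supplied rows")
    earlier = [r for r in rows if r["changed_at"] < bad["changed_at"]]
    if not earlier:
        return None
    return max(earlier, key=lambda r: r["changed_at"])["id"]
```

The result is only a candidate <TARGET_ID> ; the change_reason and base_env sanity checks above still apply.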
Step 2 — Do the gating ritual (see above).
Step 3 — Issue the restore SQL :
The transaction below issues the restore + records the audit history. A DO $$ block in front asserts that <TARGET_ID> actually exists in ftf_config_history before any INSERT. Without it, a wrong-id substitution would make both INSERT ... SELECT statements match zero rows : the restore and the audit row would silently no-op while the transaction still commits, giving the operator no error signal.
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
BEGIN;
-- Precondition : <TARGET_ID> MUST exist in ftf_config_history.
-- Without this guard, a wrong-id substitution would silently no-op both
-- INSERTs while still completing the transaction — producing a missing
-- audit row with no error signal to the operator.
DO \$\$
DECLARE n INT;
BEGIN
SELECT COUNT(*) INTO n FROM ftf_config_history WHERE id = <TARGET_ID>;
IF n = 0 THEN
RAISE EXCEPTION '\''Procedure A aborted : ftf_config_history.id=<TARGET_ID> does not exist. Verify the id from Step 1 query, then re-run.'\'';
END IF;
END\$\$;
INSERT INTO ftf_config (id, base_env, version, locked_by, updated_at, updated_by)
SELECT 1, base_env, version || '\''_restored_<INCIDENT_ID>'\'', locked_by, NOW(), '\''break-glass-<INCIDENT_ID>'\''
FROM ftf_config_history WHERE id = <TARGET_ID>
ON CONFLICT (id) DO UPDATE SET
base_env = EXCLUDED.base_env,
version = EXCLUDED.version,
updated_at = NOW(),
updated_by = EXCLUDED.updated_by;
INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
SELECT base_env, version || '\''_restored_<INCIDENT_ID>'\'', locked_by, '\''break-glass-<INCIDENT_ID>'\'',
'\''break-glass restore from id=<TARGET_ID>, incident=<INCIDENT_ID>'\'', <TARGET_ID>,
jsonb_build_object('\''restored_from'\'', <TARGET_ID>),
'\''[\"break-glass\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
FROM ftf_config_history WHERE id = <TARGET_ID>;
COMMIT;
"
'
If the assertion fires, the entire transaction rolls back (no partial state, no misleading history row). The operator fixes the <TARGET_ID> from the Step 1 query output and retries.
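Why the existence guard matters can be reproduced locally. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL (an assumption made for portability ; table names and columns are simplified and hypothetical). It shows only the core behaviour : an INSERT ... SELECT whose source row does not exist writes zero rows and raises no error unless a guard checks first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, base_env TEXT);
    CREATE TABLE config (id INTEGER PRIMARY KEY, base_env TEXT);
    INSERT INTO history (id, base_env) VALUES (41, '{"LR": "0.07"}');
    INSERT INTO config (id, base_env) VALUES (1, '{"LR": "0.0"}');
""")


def restore(target_id: int, guarded: bool) -> int:
    """Copy history[target_id] into config(id=1) ; return the rows written."""
    if guarded:
        found = conn.execute(
            "SELECT COUNT(*) FROM history WHERE id = ?", (target_id,)
        ).fetchone()[0]
        if found == 0:
            # Mirrors the RAISE EXCEPTION in the DO block.
            raise LookupError(f"history.id={target_id} does not exist")
    cur = conn.execute(
        "INSERT OR REPLACE INTO config (id, base_env) "
        "SELECT 1, base_env FROM history WHERE id = ?",
        (target_id,),
    )
    return cur.rowcount


def current_base_env() -> str:
    return conn.execute("SELECT base_env FROM config WHERE id = 1").fetchone()[0]
```

Calling restore(999, guarded=False) returns without error and leaves config untouched, which is the silent no-op the DO block is there to prevent ; restore(999, guarded=True) raises instead.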
Step 4 — Verify the restore took effect :
# 1. ftf_config row shows the restored snapshot
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
SELECT version, updated_at, updated_by FROM ftf_config WHERE id=1;
"
'
# 2. A new ftf_config_history row exists with the break-glass tag
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
SELECT id, version, changed_by, change_reason FROM ftf_config_history WHERE tags ? '\''break-glass'\'' ORDER BY changed_at DESC LIMIT 1;
"
'
# 3. The seed script's dry-run reports 0 conflicts vs the canonical legacy payload
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
CVN_DB_HOST=$POSTGRES_HOST CVN_DB_USER=$POSTGRES_USER CVN_DB_PASS=$POSTGRES_PASSWORD CVN_DB_NAME=champollion \
python /opt/airflow/scripts/seed_hyperparams_console.py --dry-run
' 2>&1 | grep -E "Total|NEW|EQUAL|CONFLICT"
If step 4.3 reports CONFLICT > 0 you restored to a version that diverged from the legacy bundle — that may be the intent (e.g., restoring an operator-tuned snapshot) or a sign you picked the wrong id. Pause and verify.
Step 5 — Reconciling Console UI write within 1h :
Operator opens https://console.cvntrade.eu → Config → ftf_config → any single key edit (e.g., bump a RANDOM_STATE) with change_reason='break-glass-<incident_id>'. This writes through the canonical Console path and restores the ADR-59 invariant ex-post.
Procedure B : Console UI is down, urgent value edit needed
When : Console UI returns 5xx OR the Streamlit pod is crashlooping AND an HPO range is clipping the optimum on a sweep that must restart in the next hour.
Step 1 — Verify Console is the only thing down :
# Pod-level readiness : explicit Ready condition, not phase/status string match.
# A pod can be "Running" (phase) but have Ready=False (e.g. failing healthcheck) —
# the phase match would miss this. The jsonpath query returns "True" or "False" per pod.
kubectl get pods -n cvntrade -l app=console -o jsonpath='{range .items[*]}{.metadata.name}{"="}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# HTTP-level readiness : exit code 0 ↔ HTTP status below 400 (curl -f fails on 4xx/5xx), non-zero ↔ DOWN.
curl -sf https://console.cvntrade.eu/ -o /dev/null && echo HTTP_OK || echo HTTP_DOWN
The break-glass path is justified ONLY if both checks fail (no pod with Ready=True AND HTTP_DOWN). If pods are Ready but HTTP_DOWN → ingress / DNS issue, escalate to platform without break-glass. If HTTP_OK but a specific page is broken → Console UI bug, file as a normal incident — do NOT break-glass.
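The triage rule above can be written down as a small decision function. This is a sketch with hypothetical names (parse_ready, triage) ; it assumes the jsonpath output format name=True/False shown in Step 1. The runbook does not cover the case where HTTP is OK while no pod reports Ready, so that case is treated here like any other HTTP_OK result (an assumption).

```python
def parse_ready(jsonpath_output: str) -> dict[str, bool]:
    """Parse 'pod-name=True' lines produced by the kubectl jsonpath query."""
    ready = {}
    for line in jsonpath_output.splitlines():
        if "=" in line:
            name, _, status = line.rpartition("=")
            ready[name] = status.strip() == "True"
    return ready


def triage(jsonpath_output: str, http_ok: bool) -> str:
    """Map the two Step 1 checks to the action named in the runbook."""
    any_ready = any(parse_ready(jsonpath_output).values())
    if http_ok:
        # Console reachable : page-level bug, normal incident, no break-glass.
        return "normal-incident"
    if any_ready:
        # Pods Ready but HTTP down : ingress / DNS, escalate to platform.
        return "escalate-platform"
    # No pod Ready AND HTTP down : the only justified break-glass case.
    return "break-glass-justified"
```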
Step 2 — Gating ritual + ensure the operator can self-attest to the urgency. If the edit can wait until Console is back up (typically < 4h MTTR for Streamlit issues), prefer waiting.
Step 3 — Issue the targeted UPDATE :
The transaction below also asserts that the targeted UPDATE actually changed a row before recording the audit history. Without this, a <MODEL>/<TF>/<PARAM> typo combined with the base_env ? WHERE clause silently no-ops the UPDATE while still inserting a break-glass history row (lying audit trail again).
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
BEGIN;
UPDATE ftf_config
SET base_env = jsonb_set(
base_env,
'\''{CVN_HPO_<MODEL>_<TF>_<PARAM>}'\'',
to_jsonb('\''<NEW_VALUE>'\''::text),
false -- create_missing=false : refuse to create a new key on PARAM typo
),
version = version || '\''_breakglass_<INCIDENT_ID>'\'',
updated_at = NOW(),
updated_by = '\''break-glass-<INCIDENT_ID>'\''
WHERE id = 1
AND base_env ? '\''CVN_HPO_<MODEL>_<TF>_<PARAM>'\''; -- defence-in-depth : refuse the UPDATE if the key does NOT exist
-- Hard assertion : the UPDATE must have hit exactly the id=1 row that has
-- the break-glass updated_by marker AND the new version suffix. If 0 rows
-- match, the UPDATE no-op'd (typo on MODEL/TF/PARAM) — rollback to avoid
-- writing a misleading history row.
DO \$\$
DECLARE n INT;
BEGIN
SELECT COUNT(*) INTO n FROM ftf_config
WHERE id = 1
AND updated_by = '\''break-glass-<INCIDENT_ID>'\''
AND version LIKE '\''%_breakglass_<INCIDENT_ID>'\'';
IF n = 0 THEN
RAISE EXCEPTION '\''Procedure B aborted : UPDATE matched 0 rows (CVN_HPO_<MODEL>_<TF>_<PARAM> not found in base_env or id=1 missing). Verify the key spelling, then re-run.'\'';
END IF;
END\$\$;
INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
SELECT base_env, version, locked_by, '\''break-glass-<INCIDENT_ID>'\'',
'\''urgent value edit during Console UI outage, incident=<INCIDENT_ID>'\'', NULL,
jsonb_build_object('\''key'\'', '\''CVN_HPO_<MODEL>_<TF>_<PARAM>'\'', '\''to'\'', '\''<NEW_VALUE>'\''),
'\''[\"break-glass\", \"console-outage\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
FROM ftf_config WHERE id=1;
COMMIT;
"
'
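The combined effect of create_missing=false, the base_env ? guard and the assertion is "update an existing key or abort, never create one". A plain-Python analogue on a dict is sketched below ; build_key and set_existing are hypothetical names, and the value is coerced to a string to mirror the ::text cast in the template.

```python
def build_key(model: str, tf: str, param: str) -> str:
    """Assemble the CVN_HPO_<MODEL>_<TF>_<PARAM> key used by the templates."""
    return f"CVN_HPO_{model}_{tf}_{param}"


def set_existing(base_env: dict, key: str, new_value) -> dict:
    """Return a copy of base_env with key updated ; refuse to create a new key."""
    if key not in base_env:
        # Equivalent of the Procedure B abort on a MODEL/TF/PARAM typo.
        raise KeyError(f"{key} not found in base_env")
    updated = dict(base_env)
    updated[key] = str(new_value)  # stored as text, like to_jsonb(...::text)
    return updated
```

Note that jsonb_set itself does not raise on a missing key when create_missing is false (it returns the target unchanged) ; the abort in Procedure B comes from the WHERE clause plus the DO block, which this sketch folds into a single KeyError.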
Step 4 — Verify (as Procedure A step 4, but check only the targeted key).
Step 5 — Reconciling Console write (as Procedure A step 5) — wait for Console UI to come back up, then issue the no-op Console write with the linking change_reason.
Procedure C : Recovery from seed_hyperparams_console.py --apply --force-overwrite accident
When : a --force-overwrite clobbered operator-tuned values across the board.
Step 1 — Identify the pre-clobber ftf_config_history.id (the row whose change_reason was NOT 'ADR-90 seeding (CVN-N001-EE-S17 PR-1)').
Step 2-5 — Same as Procedure A using that id.
Procedure D : Total recovery (ftf_config row missing entirely)
When : the row id=1 was DELETEd by mistake. The harness DAGs now fail at _decode_base_env with row is None.
Step 1a — Verify id=1 is actually missing (precondition check) :
A bad diagnosis ("oh the row must be gone") followed by Procedure D would silently re-INSERT history rows over an intact ftf_config (because ON CONFLICT DO NOTHING skips the INSERT, then the subsequent history INSERT runs against the pre-existing row's data — producing a misleading audit trail). The runbook refuses Procedure D unless id=1 is verifiably absent.
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -tAc "
SELECT CASE WHEN EXISTS (SELECT 1 FROM ftf_config WHERE id=1)
THEN '\''ROW_EXISTS_USE_PROCEDURE_A_NOT_D'\''
ELSE '\''ROW_MISSING_OK_PROCEED'\''
END;
"
'
If the output is ROW_EXISTS_USE_PROCEDURE_A_NOT_D → STOP. Pivot to Procedure A (restore from ftf_config_history while id=1 already exists). Do NOT proceed with Step 1b.
Step 1b — Re-bootstrap the row from the most recent history snapshot :
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
BEGIN;
-- 1) Re-bootstrap ftf_config (id=1) from the most recent history snapshot.
-- Uses ON CONFLICT (id) DO NOTHING as defence-in-depth ; the precondition
-- check in Step 1a already verified id=1 is missing. If a concurrent
-- process inserted id=1 between Step 1a and this INSERT, the next
-- assertion will catch it and roll back.
INSERT INTO ftf_config (id, base_env, version, locked_by, updated_at, updated_by)
SELECT 1, base_env, version || '\''_rebootstrap_<INCIDENT_ID>'\'', locked_by, NOW(), '\''break-glass-<INCIDENT_ID>'\''
FROM ftf_config_history ORDER BY changed_at DESC LIMIT 1
ON CONFLICT (id) DO NOTHING;
-- 2) Hard assertion : if 0 rows were inserted (race condition vs Step 1a),
-- rollback the entire transaction. Audit trail would otherwise lie.
DO \$\$
DECLARE n_inserted INT;
BEGIN
SELECT COUNT(*) INTO n_inserted
FROM ftf_config
WHERE id = 1 AND updated_by = '\''break-glass-<INCIDENT_ID>'\'';
IF n_inserted = 0 THEN
RAISE EXCEPTION '\''rebootstrap aborted : ftf_config(id=1) was already present (concurrent INSERT after Step 1a check, or operator mis-diagnosis). Use Procedure A instead.'\'';
END IF;
END\$\$;
-- 3) Record the rebootstrap in ftf_config_history (audit trail invariant —
-- Procedure A.4.2 verification would fail otherwise because no
-- break-glass-tagged row would exist for this incident).
INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
SELECT base_env, version, locked_by, '\''break-glass-<INCIDENT_ID>'\'',
'\''rebootstrap after missing ftf_config.id=1, incident=<INCIDENT_ID>'\'', NULL,
jsonb_build_object('\''rebootstrap'\'', true),
'\''[\"break-glass\", \"rebootstrap\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
FROM ftf_config WHERE id=1;
COMMIT;
"
'
If the assertion fires (RAISE EXCEPTION rebootstrap aborted ...), the transaction rolls back atomically (no row in ftf_config_history is created with the wrong context). The operator pivots to Procedure A as instructed.
Step 2 — Verify + reconcile (as Procedure A steps 4-5).
If ftf_config_history is ALSO empty (catastrophic loss), the operator falls back to python scripts/seed_hyperparams_console.py --apply which re-bootstraps from the bundled legacy values. This is NOT a break-glass — it's the canonical seed path described in documentation/stories/CVN-N001-EE-S17/mlops_readiness.md §5.
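The branching around Procedure D reduces to two facts : whether ftf_config id=1 exists, and whether ftf_config_history holds any row. A minimal sketch of that decision follows ; choose_recovery_path is a hypothetical name, and the inputs are assumed to come from the Step 1a query and a row count on ftf_config_history.

```python
def choose_recovery_path(config_row_exists: bool, history_rows: int) -> str:
    """Pick the recovery path described in Procedure D."""
    if config_row_exists:
        # Step 1a returned ROW_EXISTS_USE_PROCEDURE_A_NOT_D.
        return "procedure-a"
    if history_rows > 0:
        # id=1 missing, history available : re-bootstrap (Step 1b).
        return "procedure-d"
    # Both empty : canonical seed path, not a break-glass.
    return "seed-apply"
```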
Dry-run drill template
Operator should self-exercise this runbook every 30 days during the ADR-90 transition window to keep the procedure muscle-memory fresh and surface any cluster credential / role drift.
Drill scenario : "Procedure A — restore from a 3-version-back snapshot" against a staging copy of ftf_config_history (use a sentinel id=999 row).
Drill SLO : end-to-end from incident detected → restore verified < 10 min. If the drill exceeds 15 min, the runbook needs tightening (file a doc issue).
Drill log : commit the drill outcome to documentation/missions/break-glass-drills/YYYY-MM-DD-drill.md (1 paragraph : scenario, wall clock per step, surprises).
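The drill bookkeeping (log path, SLO verdict, next due date) is simple enough to script. The helper names below are hypothetical ; the 10-minute SLO, the 15-minute threshold, the 30-day cadence and the log path come from this section. The runbook does not say what happens between 10 and 15 minutes, so the sketch labels that band "slo-missed" as an assumption.

```python
from datetime import date, timedelta


def drill_log_path(day: date) -> str:
    """Path of the drill log for a given day (YYYY-MM-DD-drill.md)."""
    return f"documentation/missions/break-glass-drills/{day.isoformat()}-drill.md"


def drill_verdict(minutes: float) -> str:
    """Classify the end-to-end drill duration."""
    if minutes < 10:
        return "within-slo"
    if minutes > 15:
        return "file-doc-issue"  # runbook needs tightening
    return "slo-missed"  # 10 to 15 min : not specified by the runbook (assumption)


def next_drill_due(last_drill: date) -> date:
    """Drills are expected every 30 days during the transition window."""
    return last_drill + timedelta(days=30)
```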
Post-mortem entry template
After any real break-glass execution, file a comment on OP wp#153 (and any incident-specific OP wp) using this template :
## Break-glass post-mortem — incident <INCIDENT_ID>
**When** : <UTC timestamp>
**Procedure** : A / B / C / D (per `documentation/runbooks/break-glass-hyperparams.md`)
**Approver** : @dococeven
**Rationale** : <one sentence — what was broken, why Console UI wasn't usable>
### Evidence trail
- Approval message : <slack permalink OR GH issue>
- SQL executed : <inline code block, verbatim>
- ftf_config_history.id created : <NN>
- Reconciling Console write : <ftf_config_history.id of the linking write, timestamp>
### What broke / what worked
- <fact 1, no judgement>
- <fact 2>
### Follow-ups
- <action 1 — e.g., "file a Story to fix the Console UI bug that necessitated this">
- <action 2 — e.g., "extend the break-glass runbook for scenario X">
### ADR-90 transition window impact
Is this incident a signal that we need to **accelerate Epic CCP wp#149 delivery** ? <yes/no, brief justification>.
The post-mortem MUST be filed within 24 hours of the break-glass execution. Failure to file = non-compliant with ADR-68 + ADR-82.
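The two follow-up deadlines (reconciling Console write within 60 minutes, post-mortem within 24 hours) can be tracked from the moment the break-glass SQL completes. The sketch below uses hypothetical names and assumes both windows are measured from the same SQL completion timestamp.

```python
from datetime import datetime, timedelta, timezone

RECONCILE_WINDOW = timedelta(minutes=60)   # reconciling Console write
POSTMORTEM_WINDOW = timedelta(hours=24)    # post-mortem comment on OP wp#153


def followup_deadlines(sql_completed_at: datetime) -> dict[str, datetime]:
    """Deadlines for the two post-execution gates."""
    return {
        "reconciling_console_write": sql_completed_at + RECONCILE_WINDOW,
        "post_mortem": sql_completed_at + POSTMORTEM_WINDOW,
    }


def overdue_items(sql_completed_at: datetime, now: datetime, done=frozenset()) -> list[str]:
    """Names of the gates that are past their deadline and not yet done."""
    deadlines = followup_deadlines(sql_completed_at)
    return sorted(
        name for name, due in deadlines.items()
        if name not in done and now > due
    )
```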
Sunset
This runbook becomes OBSOLETE when Epic CCP wp#149 ships :
- The Console UI restore-from-history button replaces Procedure A.
- The Console UI typed schema + approval flow prevents the typo class of incident Procedure A addresses.
- The Console UI write API can be exercised via kubectl exec even when the Streamlit pod is down (separate API surface), addressing Procedure B.
- Procedure D becomes a non-event because the typed schema + atomic snapshot semantics prevent the id=1 row from being DELETEd.
At that point, this runbook moves to an archive section in documentation/runbooks/archive/ with a closing comment "Superseded by Epic CCP wp#149 — Console UI restore + approval flow", and the link from mlops_readiness.md §5 is removed.
Per the staged backstop in mlops_readiness.md §6.bis, this sunset is expected by T+90 (2026-08-11) at the latest ; if Epic CCP hasn't shipped by then, the runbook stays load-bearing and the operator follows the §6.bis escalation path.