Runbook - Break-glass hyperparameter rollback (ADR-90 transition window)

Severity : P1 if FTF / training is breached ; P2 if Console UI is the only thing down
Owner : @dococeven (sole approval authority ; no committee approval needed during the transition window)
Story : CVN-N001-EE-S17 (GH #905, OP wp#153) · pr_review verdict 628a49ba PASSED OK
Linked code : scripts/seed_hyperparams_console.py · src/commun/finetune/hyperparams.py
Supersedes : the rollback paragraph in documentation/stories/CVN-N001-EE-S17/mlops_readiness.md §5 (this runbook is the operational long-form ; the readiness file is the contract).

When to invoke this runbook

ONLY in one of these three scenarios :

ftf_config.base_env got corrupted : a --apply ran with a wrong payload, a manual SQL statement was issued against the table by mistake, or a force-overwrite clobbered operator-tuned values. The harness DAGs now either flood event=hpo_fallback_applied or fail fast on missing keys.
Console UI is down AND ftf_config.base_env needs an urgent edit (e.g., to widen an HPO range that is clipping the optimum). The canonical edit path is unavailable, and during the transition window from PR #904 to Epic CCP wp#149 the only data-layer rollback path is break-glass.
Operator typo : a Console UI edit committed a value that is about to break the next FTF (e.g., learning_rate: 0.0 instead of 0.07). The Console UI history view is read-only today (no Restore button until Epic CCP ships), so a break-glass restore from ftf_config_history is the only recovery path.

NOT a break-glass scenario :

Routine value tuning → use Console UI write
Wholesale ftf_config reset → use python scripts/seed_hyperparams_console.py --apply --force-overwrite (canonical seed path, no break-glass needed)
Binary regression on the training code → helm rollback to the pre-S17 SHA. The break-glass PG path is for the data layer, not the binary layer.

Gating ritual - MUST happen BEFORE issuing any SQL

The break-glass PG path violates the ADR-59 "Console-only" invariant. It is permitted only during the ADR-90 transition window AND only with the following three gates :

Written approval in the incident channel : @dococeven posts a message in #ops-cvn (or the active incident channel) of the form :

event=break_glass_approval incident_id=<NNN> story=CVN-N001-EE-S17 rationale='<one sentence>' approver=@dococeven approval_ts=<UTC iso>

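The incident_id is a Slack thread reference OR a GH issue number. Because it is later substituted as <INCIDENT_ID> in SQL, use the thread short-id rather than the full permalink URL, whose / and : characters fail the <INCIDENT_ID> character rule. The approval message MUST predate any SQL issuance by at least 2 minutes (operator pause-and-think gate).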

change_reason carried into the next Console-driven write within 1h : after the break-glass SQL completes, the operator MUST issue a Console UI write (any value — including a no-op version bump) within 60 minutes carrying change_reason='break-glass-<incident_id>'. This reconciles the audit trail : ftf_config_history will show the break-glass row + the next operator write with the linking incident_id, restoring the ADR-59 audit invariant ex-post.
Post-mortem entry : within 24h of the incident, the operator files a comment on OP wp#153 with :
the rendered event=break_glass_approval line
the SQL that was executed (verbatim)
the change_reason linking write timestamp + the ftf_config_history.id rows involved
what the next operator response will be (Console UI fix, follow-up Story, ADR addendum)

If any of these three gates is skipped, the incident is non-compliant with ADR-68 + ADR-82 and MUST be raised at the next pr_review committee.
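The timing side of the three gates can be checked mechanically. The following is a minimal sketch, not part of the shipped tooling: the function names and the idea of collecting the four timestamps are assumptions made for illustration, while the thresholds (2 minutes, 60 minutes, 24 hours) and the approval line format come from the text above. The 24-hour window is measured from the SQL execution here.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the gating ritual above.
PAUSE_BEFORE_SQL = timedelta(minutes=2)
RECONCILE_WINDOW = timedelta(minutes=60)
POST_MORTEM_WINDOW = timedelta(hours=24)


def render_approval(incident_id: str, rationale: str, approval_ts: datetime) -> str:
    """Render the approval line in the form the approver posts (hypothetical helper)."""
    ts = approval_ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f"event=break_glass_approval incident_id={incident_id} "
        f"story=CVN-N001-EE-S17 rationale='{rationale}' "
        f"approver=@dococeven approval_ts={ts}"
    )


def gate_violations(approval_ts, sql_ts, reconcile_ts, post_mortem_ts) -> list[str]:
    """Return the gates that were missed; an empty list means all three were met."""
    missed = []
    if sql_ts - approval_ts < PAUSE_BEFORE_SQL:
        missed.append("approval must predate SQL by at least 2 minutes")
    if reconcile_ts is None or reconcile_ts - sql_ts > RECONCILE_WINDOW:
        missed.append("reconciling Console write must land within 60 minutes")
    if post_mortem_ts is None or post_mortem_ts - sql_ts > POST_MORTEM_WINDOW:
        missed.append("post-mortem must be filed within 24 hours")
    return missed
```

A non-empty result is what would be raised at the next pr_review committee.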

The 4 break-glass procedures

⚠️ SQL placeholder substitution - security rules (READ BEFORE any procedure)

Every procedure below uses SQL templates with <NEW_VALUE>, <MODEL>, <TF>, <PARAM>, <INCIDENT_ID>, <TARGET_ID> placeholders that the operator substitutes manually before piping the command to psql. Substituting raw values inline opens an injection surface (a value containing a ' breaks the quoting ; a malicious value could exfiltrate other rows).

Mandatory rules :

1. Numeric values (<NEW_VALUE> like 0.07, 150, <TARGET_ID> like 42) : substitute as-is, after visually confirming the value is purely numeric (^-?[0-9]+(\.[0-9]+)?$). Such a value contains no quote, semicolon, or comment characters, so it cannot alter the statement whether it lands inside or outside a quoted literal.
2. String values containing ' : DOUBLE every single quote (PG-canonical escape) : O'Reilly → O''Reilly. Then wrap in single quotes : 'O''Reilly'. If unsure whether escaping is needed, prefer psql -v variable binding (see §SQL safe substitution below) instead of inline substitution.
3. <INCIDENT_ID> values : MUST contain only alphanumerics, underscores, and dashes (^[a-zA-Z0-9_-]+$). The runbook expects the operator to use a GH issue number or a Slack thread short-id. A value containing ', ;, --, or /* must be caught by the operator's own visual check before piping ; if you see those characters, fix the incident_id first. The regex alone does not exclude a double dash, so check for it separately.
4. <MODEL>, <TF>, <PARAM> : these are enum-constrained (MODEL ∈ {XGB, LGB, CB}, TF ∈ {1M, 5M, 15M, 30M, 1H}, PARAM matches ^[A-Z_]+$). Any value outside these patterns is a typo, not an injection ; abort the substitution.
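The visual checks in the mandatory rules can be backed by a small pre-flight validator. This is a sketch under assumptions: the function names are hypothetical and nothing like this is confirmed to exist in scripts/ or src/commun/. The regexes and enum sets are copied from the rules above, and the extra double-dash check mirrors the visual rule for <INCIDENT_ID>.

```python
import re

# Patterns and enums copied from the mandatory rules above.
NUMERIC_RE = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")
INCIDENT_RE = re.compile(r"^[a-zA-Z0-9_-]+$")
PARAM_RE = re.compile(r"^[A-Z_]+$")
MODELS = {"XGB", "LGB", "CB"}
TIMEFRAMES = {"1M", "5M", "15M", "30M", "1H"}


def check_numeric(value: str) -> str:
    """Accept only purely numeric tokens; fullmatch also rejects a trailing newline."""
    if not NUMERIC_RE.fullmatch(value):
        raise ValueError(f"not purely numeric: {value!r}")
    return value


def quote_sql_string(value: str) -> str:
    """Double every single quote, then wrap the result in single quotes."""
    return "'" + value.replace("'", "''") + "'"


def check_incident_id(value: str) -> str:
    """Allow alphanumerics, underscores and single dashes; reject a double dash explicitly."""
    if not INCIDENT_RE.fullmatch(value) or "--" in value:
        raise ValueError(f"unsafe incident_id: {value!r}")
    return value


def build_param_key(model: str, tf: str, param: str) -> str:
    """Build CVN_HPO_<MODEL>_<TF>_<PARAM>, aborting on anything outside the enums."""
    if model not in MODELS or tf not in TIMEFRAMES or not PARAM_RE.fullmatch(param):
        raise ValueError(f"typo in key parts: {model!r}, {tf!r}, {param!r}")
    return f"CVN_HPO_{model}_{tf}_{param}"
```

Running these before substitution does not replace the pause-and-think gate; it only turns a visual check into a hard failure.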

SQL-safe substitution pattern (recommended for <NEW_VALUE> strings) — use psql -v instead of inline interpolation :

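PGPASSWORD="$PG_PASS" psql -h "$PG_HOST" -U "$PG_USER" -d champollion \
  -v ON_ERROR_STOP=1 \
  -v incident_id="<INCIDENT_ID>" \
  -v param_key="CVN_HPO_<MODEL>_<TF>_<PARAM>" \
  -v new_value="<NEW_VALUE>" <<'SQL'
-- psql interpolates :'var' only in script input (stdin or -f), never in a -c string,
-- so the statement is fed through a quoted heredoc instead of -c.
-- incident_id is bound for the audit columns ; this minimal UPDATE does not reference it.
UPDATE ftf_config
   SET base_env = jsonb_set(base_env, ARRAY[:'param_key'], to_jsonb(:'new_value'::text), false)
 WHERE id = 1 AND base_env ? :'param_key';
SQL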

psql quotes the bound variables safely. Use this form when the <NEW_VALUE> is user-controlled (e.g., a value pasted from a chat message). For purely numeric / enum values from the trusted catalog, inline substitution is acceptable but still subject to rules 1-4.
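When the command is driven from a script rather than typed, the same binding can be expressed without any shell quoting at all. The sketch below is an assumption-laden illustration (the helper names are hypothetical): it builds the psql argument vector as a list and passes the SQL on stdin, because psql expands :'var' references only in script input (stdin or -f), not in -c strings. PGPASSWORD is expected to be present in the environment.

```python
import subprocess

# Statement with psql variable references; psql quotes :'name' as a SQL literal.
UPDATE_SQL = """\
UPDATE ftf_config
   SET base_env = jsonb_set(base_env, ARRAY[:'param_key'], to_jsonb(:'new_value'::text), false)
 WHERE id = 1 AND base_env ? :'param_key';
"""


def psql_argv(host: str, user: str, param_key: str, new_value: str) -> list[str]:
    """Build the argument vector; each -v value is one argv element, so no shell escaping."""
    return [
        "psql", "-h", host, "-U", user, "-d", "champollion",
        "-v", "ON_ERROR_STOP=1",
        "-v", f"param_key={param_key}",
        "-v", f"new_value={new_value}",
    ]


def run_update(host: str, user: str, param_key: str, new_value: str) -> None:
    """Feed the SQL on stdin so that psql performs the variable interpolation."""
    subprocess.run(psql_argv(host, user, param_key, new_value),
                   input=UPDATE_SQL, text=True, check=True)
```

A value such as O'Reilly travels untouched as a single argument and is quoted by psql itself, so no manual quote-doubling is needed.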

If any of rules 1-4 cannot be confirmed visually, abort the break-glass execution and escalate to a second operator pair-review before issuing the SQL. The 2-minute pause-and-think gate from the §gating ritual exists precisely for this.

Procedure A - Restore from `ftf_config_history` (most common)

When : a Console UI edit (or a --apply with a wrong payload) committed bad values and the operator needs to revert to a previous version.

Step 1 - Identify the target ftf_config_history.id (<TARGET_ID>) :

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    SELECT id, version, changed_by, change_reason, changed_at
    FROM ftf_config_history
    ORDER BY changed_at DESC
    LIMIT 20;
  "
'

Pick the most recent id BEFORE the bad write. Verify the change_reason shows what you expect ; sanity-check the base_env snapshot if needed :

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    SELECT base_env->>'\''CVN_HPO_LGB_5M_LEARNING_RATE'\'' AS sample_key,
           base_env->>'\''CVN_HPO_XGB_5M_N_ESTIMATORS'\'' AS sample_key_2
    FROM ftf_config_history WHERE id = <TARGET_ID>;
  "
'
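The selection rule from Step 1 ("the most recent id BEFORE the bad write") can be sketched as follows. The HistoryRow type and the function name are hypothetical; the fields mirror the columns returned by the Step 1 query, and the operator is assumed to already know the id of the bad write.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HistoryRow:
    id: int
    version: str
    change_reason: str
    changed_at: datetime


def pick_restore_target(rows: list[HistoryRow], bad_write_id: int) -> HistoryRow:
    """Return the latest history row written strictly before the bad write."""
    bad = next((r for r in rows if r.id == bad_write_id), None)
    if bad is None:
        raise LookupError(f"history id {bad_write_id} not found")
    earlier = [r for r in rows if r.changed_at < bad.changed_at]
    if not earlier:
        raise LookupError("no snapshot predates the bad write")
    return max(earlier, key=lambda r: r.changed_at)
```

The returned row's id is the <TARGET_ID> ; the change_reason still has to be read by a human before proceeding.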

Step 2 — Do the gating ritual (see above).

Step 3 — Issue the restore SQL :

The transaction below issues the restore and records the audit history. A DO $$ block in front asserts that <TARGET_ID> actually exists in ftf_config_history before any INSERT. Without it, a wrong id would make both INSERT ... SELECT statements match zero rows : the transaction would commit without restoring anything and without writing any audit row, and the operator would get no error signal.

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    BEGIN;
    -- Precondition : <TARGET_ID> MUST exist in ftf_config_history.
    -- Without this guard, a wrong-id substitution would silently no-op both
    -- INSERTs while still completing the transaction — producing a missing
    -- audit row with no error signal to the operator.
    DO \$\$
    DECLARE n INT;
    BEGIN
      SELECT COUNT(*) INTO n FROM ftf_config_history WHERE id = <TARGET_ID>;
      IF n = 0 THEN
        RAISE EXCEPTION '\''Procedure A aborted : ftf_config_history.id=<TARGET_ID> does not exist. Verify the id from Step 1 query, then re-run.'\'';
      END IF;
    END\$\$;

    INSERT INTO ftf_config (id, base_env, version, locked_by, updated_at, updated_by)
    SELECT 1, base_env, version || '\''_restored_<INCIDENT_ID>'\'', locked_by, NOW(), '\''break-glass-<INCIDENT_ID>'\''
    FROM ftf_config_history WHERE id = <TARGET_ID>
    ON CONFLICT (id) DO UPDATE SET
        base_env = EXCLUDED.base_env,
        version = EXCLUDED.version,
        updated_at = NOW(),
        updated_by = EXCLUDED.updated_by;

    INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
    SELECT base_env, version || '\''_restored_<INCIDENT_ID>'\'', locked_by, '\''break-glass-<INCIDENT_ID>'\'',
           '\''break-glass restore from id=<TARGET_ID>, incident=<INCIDENT_ID>'\'', <TARGET_ID>,
           jsonb_build_object('\''restored_from'\'', <TARGET_ID>),
           '\''[\"break-glass\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
    FROM ftf_config_history WHERE id = <TARGET_ID>;
    COMMIT;
  "
'

If the assertion fires, the entire transaction rolls back (no partial state, no misleading history row). The operator fixes the <TARGET_ID> from the Step 1 query output and retries.
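The guard-then-write pattern can be rehearsed locally without touching the cluster. The sketch below swaps PostgreSQL for an in-memory SQLite database and uses a deliberately reduced, hypothetical schema (three columns instead of the real ones), so it only illustrates the control flow: check that the target exists, then restore and record history in one transaction, and leave nothing behind if the check fails.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy copy of the two tables (reduced schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ftf_config (id INTEGER PRIMARY KEY, base_env TEXT, version TEXT);
        CREATE TABLE ftf_config_history (
            id INTEGER PRIMARY KEY, base_env TEXT, version TEXT, parent_id INTEGER);
        INSERT INTO ftf_config VALUES (1, '{"lr": "0.0"}', 'v2');
        INSERT INTO ftf_config_history VALUES (41, '{"lr": "0.07"}', 'v1', NULL);
        INSERT INTO ftf_config_history VALUES (42, '{"lr": "0.0"}', 'v2', 41);
    """)
    return conn


def restore(conn: sqlite3.Connection, target_id: int, incident_id: str) -> None:
    """Restore ftf_config(id=1) from a history row, or fail without writing anything."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        row = conn.execute(
            "SELECT base_env, version FROM ftf_config_history WHERE id = ?",
            (target_id,),
        ).fetchone()
        if row is None:  # same role as the DO block precondition
            raise LookupError(f"ftf_config_history.id={target_id} does not exist")
        base_env, version = row
        new_version = f"{version}_restored_{incident_id}"
        conn.execute(
            "INSERT OR REPLACE INTO ftf_config (id, base_env, version) VALUES (1, ?, ?)",
            (base_env, new_version),
        )
        conn.execute(
            "INSERT INTO ftf_config_history (base_env, version, parent_id) VALUES (?, ?, ?)",
            (base_env, new_version, target_id),
        )
```

Calling restore(conn, 999, "GH-905") raises and leaves both tables untouched, which is the behaviour the real transaction relies on.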

Step 4 — Verify the restore took effect :

# 1. ftf_config row shows the restored snapshot
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    SELECT version, updated_at, updated_by FROM ftf_config WHERE id=1;
  "
'

# 2. A new ftf_config_history row exists with the break-glass tag
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    SELECT id, version, changed_by, change_reason FROM ftf_config_history WHERE tags ? '\''break-glass'\'' ORDER BY changed_at DESC LIMIT 1;
  "
'

# 3. The seed script's dry-run reports 0 conflicts vs the canonical legacy payload
kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  CVN_DB_HOST=$POSTGRES_HOST CVN_DB_USER=$POSTGRES_USER CVN_DB_PASS=$POSTGRES_PASSWORD CVN_DB_NAME=champollion \
    python /opt/airflow/scripts/seed_hyperparams_console.py --dry-run
' 2>&1 | grep -E "Total|NEW|EQUAL|CONFLICT"

If step 4.3 reports CONFLICT > 0 you restored to a version that diverged from the legacy bundle — that may be the intent (e.g., restoring an operator-tuned snapshot) or a sign you picked the wrong id. Pause and verify.

Step 5 — Reconciling Console UI write within 1h :

Operator opens https://console.cvntrade.eu → Config → ftf_config → any single key edit (e.g., bump a RANDOM_STATE) with change_reason='break-glass-<incident_id>'. This writes through the canonical Console path and restores the ADR-59 invariant ex-post.

Procedure B - Console UI is down, urgent value edit needed

When : (Console UI returns 5xx OR the Streamlit pod is crashlooping) AND an HPO range is clipping the optimum on a sweep that must restart within the next hour.

Step 1 — Verify Console is the only thing down :

# Pod-level readiness : explicit Ready condition, not phase/status string match.
# A pod can be "Running" (phase) but have Ready=False (e.g. failing healthcheck) —
# the phase match would miss this. The jsonpath query returns "True" or "False" per pod.
kubectl get pods -n cvntrade -l app=console -o jsonpath='{range .items[*]}{.metadata.name}{"="}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

# HTTP-level readiness : curl -f exits 0 for any HTTP status below 400, non-zero for 4xx/5xx or a connection failure.
curl -sf https://console.cvntrade.eu/ -o /dev/null && echo HTTP_OK || echo HTTP_DOWN

The break-glass path is justified ONLY if both checks fail (no pod with Ready=True AND HTTP_DOWN). If pods are Ready but HTTP_DOWN → ingress / DNS issue, escalate to platform without break-glass. If HTTP_OK but a specific page is broken → Console UI bug, file as a normal incident — do NOT break-glass.
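The decision table above can be written down as a small classifier. This is an illustrative sketch: the function names and return labels are hypothetical, and the input format is the "pod-name=True|False" lines emitted by the jsonpath query in Step 1.

```python
def parse_ready(jsonpath_output: str) -> dict[str, bool]:
    """Parse 'pod-name=True' / 'pod-name=False' lines into a name -> Ready mapping."""
    ready = {}
    for line in jsonpath_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.rpartition("=")
        # A pod with no Ready condition prints an empty status and counts as not ready.
        ready[name] = status.strip() == "True"
    return ready


def classify_outage(ready: dict[str, bool], http_ok: bool) -> str:
    """Map the two Step 1 checks to the action described in the runbook."""
    if http_ok:
        return "normal-incident"        # Console answers: file a normal incident
    if any(ready.values()):
        return "escalate-platform"      # pods Ready but HTTP down: ingress / DNS
    return "break-glass-justified"      # no Ready pod AND HTTP down
```

Only the last outcome opens the break-glass path ; the other two stay in the normal incident flow.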

Step 2 — Gating ritual + ensure the operator can self-attest to the urgency. If the edit can wait until Console is back up (typically < 4h MTTR for Streamlit issues), prefer waiting.

Step 3 — Issue the targeted UPDATE :

The transaction below also asserts that the targeted UPDATE actually changed a row before recording the audit history. Without this, a <MODEL>/<TF>/<PARAM> typo combined with the base_env ? WHERE clause silently no-ops the UPDATE while still inserting a break-glass history row (lying audit trail again).

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    BEGIN;
    UPDATE ftf_config
    SET base_env = jsonb_set(
        base_env,
        '\''{CVN_HPO_<MODEL>_<TF>_<PARAM>}'\'',
        to_jsonb('\''<NEW_VALUE>'\''::text),
        false  -- create_missing=false : refuse to create a new key on PARAM typo
    ),
        version = version || '\''_breakglass_<INCIDENT_ID>'\'',
        updated_at = NOW(),
        updated_by = '\''break-glass-<INCIDENT_ID>'\''
    WHERE id = 1
      AND base_env ? '\''CVN_HPO_<MODEL>_<TF>_<PARAM>'\'';  -- defence-in-depth : refuse the UPDATE if the key does NOT exist

    -- Hard assertion : the UPDATE must have hit exactly the id=1 row that has
    -- the break-glass updated_by marker AND the new version suffix. If 0 rows
    -- match, the UPDATE was a no-op (typo on MODEL/TF/PARAM) : roll back to avoid
    -- writing a misleading history row. Keep these comments free of apostrophes,
    -- which would terminate the enclosing bash -c single-quoted string.
    DO \$\$
    DECLARE n INT;
    BEGIN
      SELECT COUNT(*) INTO n FROM ftf_config
       WHERE id = 1
         AND updated_by = '\''break-glass-<INCIDENT_ID>'\''
         AND version LIKE '\''%_breakglass_<INCIDENT_ID>'\'';
      IF n = 0 THEN
        RAISE EXCEPTION '\''Procedure B aborted : UPDATE matched 0 rows (CVN_HPO_<MODEL>_<TF>_<PARAM> not found in base_env or id=1 missing). Verify the key spelling, then re-run.'\'';
      END IF;
    END\$\$;

    INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
    SELECT base_env, version, locked_by, '\''break-glass-<INCIDENT_ID>'\'',
           '\''urgent value edit during Console UI outage, incident=<INCIDENT_ID>'\'', NULL,
           jsonb_build_object('\''key'\'', '\''CVN_HPO_<MODEL>_<TF>_<PARAM>'\'', '\''to'\'', '\''<NEW_VALUE>'\''),
           '\''[\"break-glass\", \"console-outage\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
    FROM ftf_config WHERE id=1;
    COMMIT;
  "
'
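Why Procedure B needs both create_missing=false and a hard assertion is easier to see on a plain dictionary. The model below is a simplification with hypothetical function names: it mimics, for a top-level key only, how jsonb_set with create_missing=false returns its input unchanged when the key is absent, and how the assertion turns that silent no-op into an error.

```python
def jsonb_set_no_create(base_env: dict, key: str, value: str) -> dict:
    """Mimic jsonb_set(..., create_missing => false) for a top-level key."""
    if key not in base_env:
        return dict(base_env)  # unchanged copy, and no error is raised
    return {**base_env, key: value}


def targeted_update(base_env: dict, key: str, value: str) -> dict:
    """Mimic the UPDATE plus its assertion: a missing key aborts instead of no-op."""
    if key not in base_env:  # the base_env ? key clause would match 0 rows
        raise KeyError(f"{key} not found in base_env; UPDATE would match 0 rows")
    return jsonb_set_no_create(base_env, key, value)
```

With a typo such as LEARNING_RAET, the first function silently returns the old configuration, while the second one fails loudly before any history row is written.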

Step 4 — Verify (as Procedure A step 4, but check only the targeted key).

Step 5 — Reconciling Console write (as Procedure A step 5) — wait for Console UI to come back up, then issue the no-op Console write with the linking change_reason.

Procedure C - Recovery from `seed_hyperparams_console.py --apply --force-overwrite` accident

When : a --force-overwrite clobbered operator-tuned values across the board.

Step 1 - Identify the pre-clobber ftf_config_history.id : the most recent row written before the force-overwrite, i.e. the latest one whose change_reason is NOT 'ADR-90 seeding (CVN-N001-EE-S17 PR-1)'.

Steps 2-5 - Same as Procedure A using that id.

Procedure D - Total recovery (`ftf_config` row missing entirely)

When : the row id=1 was DELETEd by mistake. The harness DAGs now fail at _decode_base_env with row is None.

Step 1a — Verify id=1 is actually missing (precondition check) :

A bad diagnosis ("oh the row must be gone") followed by Procedure D would record a rebootstrap against an intact ftf_config : ON CONFLICT DO NOTHING skips the INSERT, then the subsequent history INSERT copies the pre-existing row's data, producing a misleading audit trail. The runbook refuses Procedure D unless id=1 is verifiably absent.

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -tAc "
    SELECT CASE WHEN EXISTS (SELECT 1 FROM ftf_config WHERE id=1)
                THEN '\''ROW_EXISTS_USE_PROCEDURE_A_NOT_D'\''
                ELSE '\''ROW_MISSING_OK_PROCEED'\''
           END;
  "
'

If the output is ROW_EXISTS_USE_PROCEDURE_A_NOT_D → STOP. Pivot to Procedure A (restore from ftf_config_history while id=1 already exists). Do NOT proceed with Step 1b.

Step 1b — Re-bootstrap the row from the most recent history snapshot :

kubectl exec -n cvntrade airflow-scheduler-<pod> -c scheduler -- bash -c '
  PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d champollion -c "
    BEGIN;
    -- 1) Re-bootstrap ftf_config (id=1) from the most recent history snapshot.
    --    Uses ON CONFLICT (id) DO NOTHING as defence-in-depth ; the precondition
    --    check in Step 1a already verified id=1 is missing. If a concurrent
    --    process inserted id=1 between Step 1a and this INSERT, the next
    --    assertion will catch it and roll back.
    INSERT INTO ftf_config (id, base_env, version, locked_by, updated_at, updated_by)
    SELECT 1, base_env, version || '\''_rebootstrap_<INCIDENT_ID>'\'', locked_by, NOW(), '\''break-glass-<INCIDENT_ID>'\''
    FROM ftf_config_history ORDER BY changed_at DESC LIMIT 1
    ON CONFLICT (id) DO NOTHING;

    -- 2) Hard assertion : if 0 rows were inserted (race condition vs Step 1a),
    --    rollback the entire transaction. Audit trail would otherwise lie.
    DO \$\$
    DECLARE n_inserted INT;
    BEGIN
      SELECT COUNT(*) INTO n_inserted
      FROM ftf_config
      WHERE id = 1 AND updated_by = '\''break-glass-<INCIDENT_ID>'\'';
      IF n_inserted = 0 THEN
        RAISE EXCEPTION '\''rebootstrap aborted : ftf_config(id=1) was already present (concurrent INSERT after Step 1a check, or operator mis-diagnosis). Use Procedure A instead.'\'';
      END IF;
    END\$\$;

    -- 3) Record the rebootstrap in ftf_config_history (audit trail invariant —
    --    Procedure A.4.2 verification would fail otherwise because no
    --    break-glass-tagged row would exist for this incident).
    INSERT INTO ftf_config_history (base_env, version, locked_by, changed_by, change_reason, parent_id, diff, tags)
    SELECT base_env, version, locked_by, '\''break-glass-<INCIDENT_ID>'\'',
           '\''rebootstrap after missing ftf_config.id=1, incident=<INCIDENT_ID>'\'', NULL,
           jsonb_build_object('\''rebootstrap'\'', true),
           '\''[\"break-glass\", \"rebootstrap\", \"adr-90\", \"<INCIDENT_ID>\"]'\''::jsonb
    FROM ftf_config WHERE id=1;
    COMMIT;
  "
'

If the assertion fires (RAISE EXCEPTION rebootstrap aborted ...), the transaction rolls back atomically (no row in ftf_config_history is created with the wrong context). The operator pivots to Procedure A as instructed.

Step 2 — Verify + reconcile (as Procedure A steps 4-5).

If ftf_config_history is ALSO empty (catastrophic loss), the operator falls back to python scripts/seed_hyperparams_console.py --apply which re-bootstraps from the bundled legacy values. This is NOT a break-glass — it's the canonical seed path described in documentation/stories/CVN-N001-EE-S17/mlops_readiness.md §5.
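The branching between Procedure A, Procedure D, and the canonical seed path reduces to two facts about the database. A minimal sketch, with a hypothetical function name and return labels:

```python
def choose_recovery(config_row_exists: bool, history_rows: int) -> str:
    """Pick the recovery path from the Step 1a result and the history row count."""
    if config_row_exists:
        return "procedure-A"   # id=1 is present: restore from history instead
    if history_rows > 0:
        return "procedure-D"   # id=1 missing, history available: re-bootstrap
    return "seed-apply"        # both empty: canonical seed path, not a break-glass
```

The first input is the output of the Step 1a query ; the second is assumed to come from a plain COUNT(*) on ftf_config_history.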

Dry-run drill template

Operator should self-exercise this runbook every 30 days during the ADR-90 transition window to keep the procedure muscle-memory fresh and surface any cluster credential / role drift.

Drill scenario : "Procedure A — restore from a 3-version-back snapshot" against a staging copy of ftf_config_history (use a sentinel id=999 row).

Drill SLO : end-to-end from incident detected → restore verified < 10 min. If the drill exceeds 15 min, the runbook needs tightening (file a doc issue).

Drill log : commit the drill outcome to documentation/missions/break-glass-drills/YYYY-MM-DD-drill.md (1 paragraph : scenario, wall clock per step, surprises).
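Scoring a drill against the SLO and naming its log file can be automated. The sketch below uses hypothetical helper names ; the 10-minute SLO, the 15-minute threshold, and the log path pattern come from the text above. The runbook does not say what a drill between 10 and 15 minutes means, so the "slo-missed" label for that band is an assumption.

```python
from datetime import date, timedelta

DRILL_SLO = timedelta(minutes=10)          # target: restore verified in under 10 min
TIGHTEN_THRESHOLD = timedelta(minutes=15)  # beyond this, file a doc issue


def drill_verdict(step_durations: list[timedelta]) -> str:
    """Sum the per-step wall clock and compare it with the two thresholds."""
    total = sum(step_durations, timedelta())
    if total < DRILL_SLO:
        return "within-slo"
    if total > TIGHTEN_THRESHOLD:
        return "file-doc-issue"
    return "slo-missed"  # assumption: missed the SLO but no doc issue required


def drill_log_path(day: date) -> str:
    """Build the drill log path using the YYYY-MM-DD naming from the runbook."""
    return f"documentation/missions/break-glass-drills/{day.isoformat()}-drill.md"
```

The per-step durations are the same wall-clock figures the drill log paragraph asks for.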

Post-mortem entry template

After any real break-glass execution, file a comment on OP wp#153 (and any incident-specific OP wp) using this template :

## Break-glass post-mortem — incident <INCIDENT_ID>

**When** : <UTC timestamp>
**Procedure** : A / B / C / D (per `documentation/runbooks/break-glass-hyperparams.md`)
**Approver** : @dococeven
**Rationale** : <one sentence — what was broken, why Console UI wasn't usable>

### Evidence trail

- Approval message : <slack permalink OR GH issue>
- SQL executed : <inline code block, verbatim>
- ftf_config_history.id created : <NN>
- Reconciling Console write : <ftf_config_history.id of the linking write, timestamp>

### What broke / what worked

- <fact 1, no judgement>
- <fact 2>

### Follow-ups

- <action 1 — e.g., "file a Story to fix the Console UI bug that necessitated this">
- <action 2 — e.g., "extend the break-glass runbook for scenario X">

### ADR-90 transition window impact

Is this incident a signal that we need to **accelerate Epic CCP wp#149 delivery** ? <yes/no, brief justification>.

The post-mortem MUST be filed within 24 hours of the break-glass execution. Failure to file = non-compliant with ADR-68 + ADR-82.

Sunset

This runbook becomes OBSOLETE when Epic CCP wp#149 ships :

The Console UI restore-from-history button replaces Procedure A.
The Console UI typed schema + approval flow prevents the typo class of incident Procedure A addresses.
The Console UI write API can be exercised via kubectl exec even when the Streamlit pod is down (separate API surface), addressing Procedure B.
Procedure D becomes a non-event because the typed schema + atomic snapshot semantics prevent the id=1 row from being DELETEd.

At that point, this runbook moves to an archive section in documentation/runbooks/archive/ with a closing comment "Superseded by Epic CCP wp#149 — Console UI restore + approval flow", and the link from mlops_readiness.md §5 is removed.

Per the staged backstop in mlops_readiness.md §6.bis, this sunset is expected by T+90 (2026-08-11) at the latest ; if Epic CCP has not shipped by then, the runbook stays load-bearing and the operator follows the §6.bis escalation path.