Design — System-wide trading kill-switch (CVN-N001-EF-S02)

story_id: CVN-N001-EF-S02
epic_id: CVN-N001-EF (#729)
need_id: CVN-N001 (F1 mission)
Authors: Dominique (operator) + Claude
Reviewers: Expert Committee (plan_review mode B)
Status: draft
Created: 2026-04-28
Last updated: 2026-04-28

Changelog

| Date | Change |
| --- | --- |
| 2026-04-28 | Initial draft for committee review |

1. Problem restated

Per F1_buy boost committee finding (session 8db2529d, Ops persona) :

Implement a System-Wide Kill-Switch: Develop and integrate an explicit, easily accessible mechanism to immediately halt all trading activity or revert the system to a predefined safe state in an emergency.

Today the project has partial kill-switch primitives :

Gap : the state is per-process in-memory. Multiple paper / live engine instances each hold their own _killed flag. Restart loses state. Activation requires triggering one specific Airflow DAG. There is no single source of truth, no Console button, no env-var safety, no Grafana auto-trigger, no audit table beyond Loki logs.

2. Goals (from #708, refined)

  1. Single source of truth : kill-switch state lives in PostgreSQL ; every trading decision reads it (no per-process cache surviving > 500ms).
  2. Multi-channel activation : Console UI button, CLI command, K8s env var (boot-time), Grafana alert webhook, existing Airflow DAG.
  3. Reversible : disengagement uses the same channels (except env var, which requires a pod restart).
  4. Auditable : every flip emits structured log + writes to history table + OpenTelemetry span.
  5. Fail-safe on connectivity loss : if PG is unreachable for N consecutive polls (N = 3 in §4.2), behave as if engaged (do NOT trade in the dark).
  6. Halt latency < 1 s : from operator click to pre-trade rejection.
  7. Operator-only disengagement : auto-engage exists (drawdown / circuit breaker / connectivity) ; auto-disengage does NOT exist (prevents flapping).

3. Non-goals

  • Not a per-strategy / per-crypto pause — that lives in the FTF factor toggles (ADR-56) or in ftf_config.cryptos.
  • Not a graceful drain by itself — in-flight orders are NOT cancelled by the kill-switch transition. However, per committee session 4c388b4c BLOCKER, a companion flatten_all Story is now MANDATORY before any live trading — it ships in the implementation Epic CVN-N001-EG (see §7) and provides a separate channel (Console button + CLI) to cancel open positions on demand. Until flatten_all ships, this engine is allowed only in paper mode.
  • Not a configurable partial halt : there is no separate "pause new trades but allow exits" mode. The switch has a single behavior : it halts new BUY signals + new allocations, while exits (TP/SL/timeout) continue to run on existing positions because that is the safer behavior.
  • Not a per-account / per-API-key gate — that's exchange-side rate limiting / risk controls, out of our scope.

4. Architecture

4.1 Storage (PostgreSQL)

New table kill_switch_state (single-row, like ftf_config) :

-- infra/migrations/0XX_kill_switch.sql
CREATE TABLE IF NOT EXISTS kill_switch_state (
    id           INT PRIMARY KEY DEFAULT 1 CHECK (id = 1),  -- single row
    engaged      BOOLEAN NOT NULL DEFAULT FALSE,
    engaged_at   TIMESTAMPTZ,
    engaged_by   TEXT,                    -- operator handle, "system:drawdown", "system:circuit_breaker", etc.
    reason       TEXT,
    lock_version INT NOT NULL DEFAULT 0,  -- optimistic concurrency
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
INSERT INTO kill_switch_state (id, engaged) VALUES (1, FALSE) ON CONFLICT DO NOTHING;

CREATE TABLE IF NOT EXISTS kill_switch_history (
    id          BIGSERIAL PRIMARY KEY,
    transition  TEXT NOT NULL CHECK (transition IN ('engage', 'disengage')),
    actor       TEXT NOT NULL,           -- operator handle or "system:..."
    reason      TEXT,
    channel     TEXT NOT NULL,           -- 'console_ui' | 'cli' | 'env_var' | 'grafana_webhook' | 'airflow_dag' | 'auto_drawdown' | 'auto_circuit_breaker' | 'auto_pg_unreachable'
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
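-- IF NOT EXISTS keeps the migration re-runnable, matching the CREATE TABLE statements above
CREATE INDEX IF NOT EXISTS kill_switch_history_occurred_at_idx ON kill_switch_history (occurred_at DESC);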

Why a dedicated table and not ftf_config.base_env.kill_switch.* :

  • ftf_config carries pipeline parameters (ADR-59 scope) ; mixing safety state into the same blob risks an operator editing the wrong key.
  • A separate table allows row-level audit trigger (kill_switch_history) without polluting ftf_config_history.
  • The 1-row check + lock_version makes concurrent writes from multiple channels safe.

4.2 Read path — KillSwitchClient

New module src/commun/safety/kill_switch_client.py exposing :

class KillSwitchClient:
    def is_engaged(self) -> bool: ...           # synchronous, < 10ms (in-memory)
    def state(self) -> KillSwitchState: ...      # full row (cached)
    def start_polling(self, interval_s: float = 0.5) -> None: ...
    def stop_polling(self) -> None: ...
  • Background thread polls SELECT engaged, engaged_at, engaged_by, reason FROM kill_switch_state WHERE id=1 every 500 ms (configurable).
  • In-memory cache holds last successful read.
  • Fail-safe: if N=3 consecutive polls fail (PG unreachable, exception, timeout > 200ms), the in-memory state flips to engaged=True, engaged_by="system:pg_unreachable". A kill_switch_history row is written when PG comes back (deferred audit). Per ADR-25, a CRITICAL log fires on every fail-safe activation — no silent fallback.
  • KillSwitchClient is a process-global singleton initialised in the trading kernel boot path.
  • Disengagement propagates the same way : once the PG row flips back to engaged=False, the next poll picks it up within one polling interval (500 ms by default), so the worst-case delay from disengage to resumed trading stays under 1 s.
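The polling and fail-safe behavior described in the bullets above can be sketched as follows. This is a minimal illustration, not the planned implementation: the `fetch` callable is a hypothetical stand-in for the SELECT against PostgreSQL and is assumed to raise on error or timeout, the "engaged until the first successful read" default is an assumption in the spirit of I7, and the CRITICAL log and deferred audit row are omitted.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KillSwitchState:
    engaged: bool
    engaged_by: Optional[str] = None
    reason: Optional[str] = None


class KillSwitchClient:
    """Caches the last polled state and fails safe after repeated poll errors."""

    def __init__(self, fetch: Callable[[], KillSwitchState], max_failures: int = 3):
        self._fetch = fetch              # hypothetical: runs the SELECT, raises on failure
        self._max_failures = max_failures
        self._failures = 0
        self._lock = threading.Lock()
        # Assumption: treat the switch as engaged until one read has succeeded.
        self._state = KillSwitchState(True, "system:boot", "no successful read yet")
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def poll_once(self) -> None:
        try:
            fresh = self._fetch()
        except Exception:
            with self._lock:
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Fail-safe: behave as engaged while PG is unreachable.
                    self._state = KillSwitchState(True, "system:pg_unreachable", "fail-safe")
            return
        with self._lock:
            self._failures = 0           # any success resets the failure streak
            self._state = fresh

    def is_engaged(self) -> bool:
        with self._lock:
            return self._state.engaged

    def state(self) -> KillSwitchState:
        with self._lock:
            return self._state

    def start_polling(self, interval_s: float = 0.5) -> None:
        def loop() -> None:
            # Event.wait returns False on timeout, True once stop is requested.
            while not self._stop.wait(interval_s):
                self.poll_once()

        self._stop.clear()
        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop_polling(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Injecting `fetch` keeps the fail-safe logic testable without a database; a real client would wrap the psycopg2 query with the 200 ms timeout in that callable.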

4.3 Kernel integration

  • _main_loop() at paper_trading_engine.py:1667 replaces self._risk_manager.is_killed() with self._kill_switch_client.is_engaged().
  • RiskManager.check_pre_trade() at line 89 also reads from the client (defense in depth — caught at both layers).
  • When engaged : skip _generate_and_execute_signals() (line 1693) ; continue check_exits() so existing positions can hit TP/SL.
  • Boot-time check : if CVN_KILL_SWITCH=engaged env var is set OR the PG state is engaged, the engine logs a WARNING and starts in halted mode (does NOT exit ; the operator may disengage at runtime).
  • Friendly env-var error (per committee reco 3): if CVN_KILL_SWITCH is set to any value other than engaged or empty (e.g. disengaged, off, 0), the engine FAILS FAST at boot with an explicit message : "CVN_KILL_SWITCH supports only 'engaged' (force-engage at boot) or unset. Disengagement always goes through PostgreSQL — see ADR-71 I8.". This addresses falsifiable hypothesis H5 (operator confusion).
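The boot-time env-var rule above could look like the sketch below. The function name `force_engaged_at_boot` is hypothetical, the message text is adapted from the wording above, and exact, case-sensitive matching of `engaged` is an assumption the design does not spell out.

```python
import os
from typing import Mapping

ENV_ERROR = (
    "CVN_KILL_SWITCH supports only 'engaged' (force-engage at boot) or unset. "
    "Disengagement always goes through PostgreSQL (see ADR-71 I8)."
)


def force_engaged_at_boot(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the env var forces halted mode; fail fast on any other value."""
    raw = env.get("CVN_KILL_SWITCH", "")
    if raw == "":
        return False          # unset or empty: defer to the PG state
    if raw == "engaged":
        return True           # engage-only bypass (I8)
    # Anything else (disengaged, off, 0, ...) is rejected rather than guessed at.
    raise SystemExit(ENV_ERROR)
```

The kernel would then start in halted mode when this returns True or when the synchronous PG read reports engaged.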

4.4 Activation channels (write paths)

All channels call a single internal helper engage_kill_switch(actor, reason, channel) that :

  1. Performs the optimistic-concurrency UPDATE on kill_switch_state (with lock_version check).
  2. Inserts a row in kill_switch_history in the same transaction.
  3. Emits structured log : event=kill_switch_engaged actor=<...> reason=<...> channel=<...> per ADR-32/33.
  4. Emits OpenTelemetry span kill_switch.engage per ADR-62.
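A minimal sketch of the first three steps above, to show how the lock_version check and the history insert share one transaction. To stay runnable without a database server it uses the standard-library sqlite3 module as a stand-in for PostgreSQL + psycopg2, with a trimmed schema (timestamp columns omitted). The `make_demo_db` helper, the "already engaged is a no-op" behavior, and the `print` standing in for the structured log are assumptions; the OpenTelemetry span is left out.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the two tables from 4.1 (trimmed, SQLite types)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE kill_switch_state (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            engaged INTEGER NOT NULL DEFAULT 0,
            engaged_by TEXT, reason TEXT,
            lock_version INTEGER NOT NULL DEFAULT 0);
        INSERT INTO kill_switch_state (id) VALUES (1);
        CREATE TABLE kill_switch_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            transition TEXT NOT NULL, actor TEXT NOT NULL,
            reason TEXT, channel TEXT NOT NULL);
    """)
    return conn


def engage_kill_switch(conn: sqlite3.Connection, actor: str, reason: str, channel: str) -> bool:
    """Return True if this call flipped the switch, False if it was already engaged."""
    with conn:  # one transaction: commit on success, rollback on exception
        engaged, version = conn.execute(
            "SELECT engaged, lock_version FROM kill_switch_state WHERE id = 1"
        ).fetchone()
        if engaged:
            return False  # assumption: re-engaging is a no-op, no extra history row
        cur = conn.execute(
            "UPDATE kill_switch_state SET engaged = 1, engaged_by = ?, reason = ?, "
            "lock_version = lock_version + 1 WHERE id = 1 AND lock_version = ?",
            (actor, reason, version),
        )
        if cur.rowcount != 1:
            # Another channel changed the row between our read and our write.
            raise RuntimeError("concurrent kill-switch update; re-read and retry")
        conn.execute(
            "INSERT INTO kill_switch_history (transition, actor, reason, channel) "
            "VALUES ('engage', ?, ?, ?)",
            (actor, reason, channel),
        )
    # Stand-in for the structured log line of step 3.
    print(f"event=kill_switch_engaged actor={actor} reason={reason} channel={channel}")
    return True
```

The `WHERE ... AND lock_version = ?` clause is what makes the write optimistic: a stale writer updates zero rows and is forced to re-read instead of overwriting a newer decision.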

Channel inventory :

| Channel | Surface | Backend | Auth | Use case |
| --- | --- | --- | --- | --- |
| Console UI | red "EMERGENCY KILL" button at scripts/ftf_config_ui.py top bar (port 8501) | direct psycopg2 UPDATE | Streamlit session | operator at desk |
| CLI | make kill-switch ON REASON="..." / OFF (Makefile target → scripts/kill_switch_cli.py) | direct psycopg2 UPDATE | local shell + OPENPROJECT_API_KEY-style env (KILL_SWITCH_ACTOR=$USER) | operator on phone via SSH |
| K8s env var | CVN_KILL_SWITCH=engaged in pod spec, picked up at engine boot | bypasses PG ; engine refuses to start trading | K8s RBAC | restart all pods in safe mode (mass incident) |
| Grafana webhook | Grafana alert → AlertManager → POST /api/v1/kill-switch on a tiny FastAPI sidecar services/kill_switch_webhook/main.py | direct psycopg2 UPDATE | shared secret in Authorization header | auto-trigger on expectancy_net < 0 over 24h, pnl_drawdown > X%, etc. |
| Airflow DAG | existing dag_pte__7_killswitch.py refactored to call engage_kill_switch(...) instead of RiskManager.activate_kill_switch(...) | direct psycopg2 UPDATE | Airflow Connection | scheduled or manual trigger |
| Auto (system) | drawdown / circuit breaker / pg_unreachable | direct psycopg2 UPDATE (except pg_unreachable, which writes deferred) | n/a | safety-net |
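For the Grafana webhook row, the shared-secret check could be done roughly as below. The design only says the secret travels in the Authorization header; the `Bearer <secret>` scheme and the function name `is_authorized` are assumptions made for this sketch.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], shared_secret: str) -> bool:
    """Check an 'Authorization: Bearer <secret>' header against the shared secret."""
    if not authorization_header or not shared_secret:
        return False  # a missing header or an unconfigured secret never authorizes
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched via timing.
    return hmac.compare_digest(token.encode(), shared_secret.encode())
```

Refusing requests when the secret is empty keeps a misconfigured sidecar from accepting every caller.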

4.5 Disengagement (disengage_kill_switch)

Same shape as engagement. Only operator-initiated channels (Console UI, CLI, Airflow DAG) can disengage. Grafana webhook, env var, and auto channels CANNOT disengage (that would create flapping). One exception : the fail-safe system:pg_unreachable is the client's in-memory override (§4.2), not a change to the PG row, so it clears automatically when PG comes back (with a kill_switch_history row recording the auto-clear).
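The channel restriction above can be enforced with a small guard that disengage_kill_switch would call before touching PG. This is a sketch under assumptions: the function name `check_disengage_allowed` is hypothetical, the channel strings are the ones listed in the kill_switch_history.channel comment in 4.1, and the fail-safe auto-clear is assumed to bypass this guard because it never writes a disengage to kill_switch_state.

```python
# Channels allowed to disengage, per 4.5 (names taken from the 4.1 channel comment).
DISENGAGE_CHANNELS = frozenset({"console_ui", "cli", "airflow_dag"})


def check_disengage_allowed(actor: str, channel: str) -> None:
    """Raise PermissionError unless a human actor disengages via an operator channel."""
    if channel not in DISENGAGE_CHANNELS:
        raise PermissionError(f"channel {channel!r} may engage but never disengage")
    if not actor or actor.startswith("system:"):
        # Operator-only disengage: automated actors are rejected on every channel.
        raise PermissionError("disengagement requires a named human actor")
```

An allow-list is used instead of a deny-list so that any channel added later cannot disengage until someone decides it should.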

5. Invariants (codified in ADR-71)

These MUST hold forever (or until ADR-71 is superseded) :

  • I1 — Single source of truth : the only authoritative kill-switch state is the kill_switch_state row in PostgreSQL. No code path may set a process-local killed flag without polling PG.
  • I2 — Operator-only disengage : auto channels can engage but never disengage. Disengagement requires a human actor named in engaged_by / kill_switch_history.actor.
  • I3 — Fail-safe on connectivity loss : if PG is unreachable for 3 consecutive polls (1.5 s at the default 500 ms interval), the client behaves as engaged, emits a CRITICAL log, and writes a deferred history row when PG returns. Per ADR-25, no silent fallback.
  • I4 — No graceful drain : engaging the switch does NOT cancel in-flight orders. It halts new BUY signals + new allocations only ; exit checks (TP/SL) continue to run.
  • I5 — Audit per flip : every transition (engage / disengage / fail-safe / fail-safe-clear) writes a row in kill_switch_history AND emits structured log per ADR-32/33 AND emits OTel span per ADR-62.
  • I6 — Halt latency < 1 s : from successful engagement transaction to next pre-trade reject is bounded by polling_interval_s (default 500 ms) + one PG round-trip. Implementation MUST measure this via OTel span and emit a metric.
  • I7 — Boot-time honors PG state : on engine start, the kernel reads kill_switch_state once synchronously (before accepting any trade) ; only then does it start the polling thread. A True state at boot means the engine starts in halted mode without ever sending an order.
  • I8 — Env-var bypass is engage-only : CVN_KILL_SWITCH=engaged env var forces engaged at boot regardless of PG. There is no CVN_KILL_SWITCH=disengaged ; disengagement always goes through PG (otherwise the env var would override operator decisions silently).
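The timing figures in I3 and I6 reduce to simple arithmetic, sketched below in integer milliseconds. The polling interval and failure count come from §4.2; using the 200 ms poll timeout as the ceiling for "one PG round-trip" is an assumption, and the function names are hypothetical.

```python
POLL_INTERVAL_MS = 500   # default polling interval (4.2)
POLL_TIMEOUT_MS = 200    # per-poll timeout from 4.2, assumed here as the round-trip ceiling
FAIL_SAFE_POLLS = 3      # consecutive failed polls before the fail-safe engages (I3)
HALT_BUDGET_MS = 1000    # I6 target: halt latency < 1 s


def worst_case_halt_latency_ms(poll_interval_ms: int = POLL_INTERVAL_MS,
                               round_trip_ms: int = POLL_TIMEOUT_MS) -> int:
    """I6 bound: the engage commit lands just after a poll, plus one round-trip."""
    return poll_interval_ms + round_trip_ms


def fail_safe_delay_ms(poll_interval_ms: int = POLL_INTERVAL_MS,
                       max_failures: int = FAIL_SAFE_POLLS) -> int:
    """I3 delay: number of failed polls times the polling interval."""
    return poll_interval_ms * max_failures


def fits_halt_budget(poll_interval_ms: int, round_trip_ms: int,
                     budget_ms: int = HALT_BUDGET_MS) -> bool:
    """True if a candidate polling configuration still satisfies I6."""
    return worst_case_halt_latency_ms(poll_interval_ms, round_trip_ms) < budget_ms
```

With the defaults this gives a 700 ms worst-case halt latency and a 1500 ms fail-safe delay; raising the polling interval to 1000 ms would break the I6 budget on its own.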

6. Observability

  • Grafana panel "Kill switch state" : gauge (0/1) + timestamp of last flip + actor + channel.
  • Loki query for audit feed : {job="kill_switch"} | json | line_format "{{.actor}} {{.channel}} {{.transition}}".
  • Prometheus metric cvntrade_kill_switch_engaged{} (gauge 0/1) scraped from the FastAPI sidecar's /metrics.
  • Prometheus metric cvntrade_kill_switch_pg_unreachable_total{} (counter) emitted by KillSwitchClient on every fail-safe activation. Alert : rate(cvntrade_kill_switch_pg_unreachable_total[1h]) > 0.1 (per committee reco 6 — validates H1 over time).
  • PostgreSQL load monitoring (per committee reco 4) : add pg_stat_statements query for the kill_switch_state SELECT p99 latency. Alert if > 50 ms or if QPS × engine count overwhelms PG.
  • SLI : kill_switch_halt_latency_seconds p99 < 1 s (alert when violated).
  • Console UI Quick Actions panel (per committee reco 8) : top-bar widget with current state indicator + one-click engage button + CLI cheat sheet + clear disengage confirmation workflow (2-step : type "DISENGAGE" + reason text + confirm).
  • MLOps readiness §1 for any future Story modifying this surface MUST include a metric for the channel distribution (which channel triggers most).

7. Implementation plan (out of this Story's scope — future Epic)

This Story ships the DESIGN + ADR-71 only. Implementation lives in a future Epic CVN-N001-EG "Kill-switch implementation", broken into Stories following ADR-69 :

  • EG-S01 : PG schema + KillSwitchClient + kernel integration + boot-time honor (the core)
  • EG-S02 : CLI + Makefile target + Console UI Quick Actions panel (per committee reco 8)
  • EG-S03 : Grafana webhook sidecar + AlertManager wiring
  • EG-S04 : Airflow DAG refactor + auto fail-safe path
  • EG-S05 : Observability (panel + SLI + metrics including pg_unreachable_total)
  • EG-S06 : flatten_all companion (cancel open positions on demand) — MANDATORY before live deployment per committee session 4c388b4c BLOCKER (separate channel from kill-switch ; Console button + CLI ; integrates with exchange API)
  • EG-S07 : Chaos engineering harness for I3 (PG blips) + I7 (K8s pod kills) validation (per committee reco 9 ; runs as part of pre-live gate)
  • Cross-cutting : operator runbook in documentation/runbooks/kill_switch_<channel>.md for each channel (per committee reco 5) + ADR-70 amendment to add kill-switch-specific MLOps checks (per committee reco 7)

Each EG Story must fill TEMPLATE_mlops_readiness.md per ADR-70.

8. Failure mode coverage

Mapping each invariant to the failure modes it prevents :

| Invariant | Prevented failure mode |
| --- | --- |
| I1 single PG source | "We restarted the pod and the kill-switch was forgotten" |
| I2 operator-only disengage | Auto-disengage flapping during transient PG hiccup |
| I3 fail-safe on PG loss | Trading silently in the dark when PG goes down |
| I4 no graceful drain | Operator confusion about "did the switch cancel my orders?" (no, by design) |
| I5 audit per flip | "Who killed the engine at 3am and why?" (answer in kill_switch_history) |
| I6 latency < 1 s | "I clicked but trades kept happening for 10 s" |
| I7 boot honors PG | Pod restart in middle of engaged window starts trading again |
| I8 env-var engage-only | An operator setting CVN_KILL_SWITCH=disengaged to bypass another operator's engage decision silently |

9. Alternatives considered (all rejected)

  • Single in-memory flag, broadcast via Redis pub/sub : faster (~10 ms latency) but introduces a second source of truth ; if Redis loses the message during reconnect, state diverges. PG poll is slower but simpler and survives any infra blip.
  • Filesystem-flag (e.g., /tmp/kill_switch) : doesn't survive pod restart, doesn't work across pods.
  • K8s ConfigMap watch : per-namespace blast radius is fine but ConfigMap eventual consistency under load can take seconds, and there's no audit trail equivalent to kill_switch_history.
  • Exchange-side risk controls only : doesn't cover paper-trading or the in-house signal generation ; also doesn't help when the bug is in our position-sizing code, not the exchange.
  • Cancel all in-flight orders on engage : tempting but introduces a partial-state failure mode (some cancelled, some not) and adds exchange API dependency to the safety path. Cancellation is left to the operator via the exchange UI, and to the separate flatten_all companion (EG-S06, §7) once it ships.

10. Open questions for committee

The committee plan_review should explicitly answer :

  1. Storage : separate kill_switch_state table vs nesting under ftf_config.base_env.kill_switch.* — is the dedicated table justified?
  2. Polling interval : 500 ms default. Too aggressive (PG load) or too slow (1 s p99 halt latency feels too long for a kill switch)?
  3. Fail-safe threshold : 3 consecutive failures (1.5 s). Should it be tighter (e.g., 1 failure = engage immediately)?
  4. Env-var asymmetry : engaged works but disengaged doesn't. Are there scenarios where this hurts (e.g., a botched PG state from a previous engage that you cannot un-stick because the Console is down)?
  5. Channel completeness : 6 channels listed. Missing a channel (e.g., a phone-app webhook, an exchange-side webhook)?
  6. Auto-engage triggers : drawdown + circuit breaker + connectivity loss are listed. Should "P1 alert from Grafana on expectancy_net < 0 over 24h" be an auto channel, or stay manual to avoid over-aggressive engagement?
  7. Disengage authorization : currently any operator can disengage if they have Console / CLI access. Should there be a 2-person rule for disengagement (analogous to nuclear-launch keys)?
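  8. In-flight orders policy : the design explicitly does NOT cancel in-flight orders. Is "operator handles via exchange UI" acceptable, or should there be a follow-up Story for flatten_all? (Already answered by committee session 4c388b4c : flatten_all is MANDATORY before live trading, see §3 and EG-S06.)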

11. Acceptance criteria (Story level)

  • Design doc merged at documentation/design/CVN-N001-EF-S02-kill-switch-design.md
  • ADR-71 file merged codifying I1-I8
  • ADR index + CLAUDE.md ADR cluster updated
  • Committee verdict ACCEPTED or ACCEPTED-WITH-CHANGES (≥ 8.0 avg, no blockers)
  • Open questions §10 answered by committee or recorded as deferred-to-implementation
  • Story OP wp#56 closed with PR + commit reference
  • Sprint F1B-S0-Prereqs closeable (S01 + S02 both done)
  • Live deployment gate : engine MUST NOT trade live until companion flatten_all Story (EG-S06) ships per committee session 4c388b4c BLOCKER. Paper trading not gated.

12. References

  • Need : CVN-N001 (F1 mission)
  • Epic : CVN-N001-EF (#729) — F1 mission operational prereqs
  • Sibling Story : CVN-N001-EF-S01 (#709) — MLOps readiness template (merged in PR #730)
  • ADRs touched / built upon :
      • ADR-25 — No silent fallback (the basis of I3 fail-safe)
      • ADR-32, ADR-33 — log_event format + closed event catalogue (I5 audit)
      • ADR-39 — Runtime standalone, API façade (kernel boundary)
      • ADR-40 — Paper and live share the same kernel (one switch covers both)
      • ADR-56 — Pipeline change must be FTF-testable (kill-switch is NOT FTF-testable, intentionally — it's a safety layer outside the A/B surface)
      • ADR-59 — All pipeline params in PostgreSQL (the data plane this design builds on, but with a dedicated table)
      • ADR-62 — Unified observability via OpenTelemetry (I5 + I6 spans / metrics)
      • ADR-68 — Committee = default review channel (this dossier)
      • ADR-69 — OpenProject orchestrator (Story discipline)
      • ADR-70 — MLOps readiness template (future EG Stories must fill it)
  • Code surfaces referenced :
      • src/paper_trading/cvntrade_paper_trading_engine.py — kernel
      • src/paper_trading/safety/cvntrade_risk_manager.py — existing in-memory primitive
      • scripts/ftf_config_ui.py — Console UI host for the new button
      • infra/migrations/ — schema migrations land here
  • Issues : #708 (this Story), #729 (Epic), #608 (Need), session 8db2529d (committee finding source)