Design — System-wide trading kill-switch (`CVN-N001-EF-S02`)

story_id: CVN-N001-EF-S02
epic_id: CVN-N001-EF (#729)
need_id: CVN-N001 (F1 mission)
Authors: Dominique (operator) + Claude
Reviewers: Expert Committee (plan_review mode B)
Status: draft
Created: 2026-04-28
Last updated: 2026-04-28

Changelog

Date	Change
2026-04-28	Initial draft for committee review

1. Problem restated

Per F1_buy boost committee finding (session 8db2529d, Ops persona) :

Implement a System-Wide Kill-Switch: Develop and integrate an explicit, easily accessible mechanism to immediately halt all trading activity or revert the system to a predefined safe state in an emergency.

Today the project has partial kill-switch primitives :

RiskManager.activate_kill_switch(reason) at src/paper_trading/safety/cvntrade_risk_manager.py:201 — sets in-memory _killed=True
Kernel checks _killed at src/paper_trading/cvntrade_paper_trading_engine.py:1667 (every ~1s tick)
Pre-trade check rejects when killed at line 2225
Auto-activation on global drawdown breach (line 175)
One activation channel : Airflow DAG dag_pte__7_killswitch.py

Gap : the state is per-process in-memory. Multiple paper / live engine instances each hold their own _killed flag. Restart loses state. Activation requires triggering one specific Airflow DAG. There is no single source of truth, no Console button, no env-var safety, no Grafana auto-trigger, no audit table beyond Loki logs.

2. Goals (from #708, refined)

Single source of truth : kill-switch state lives in PostgreSQL ; every trading decision reads it (no per-process cache surviving > 500ms).
Multi-channel activation : Console UI button, CLI command, K8s env var (boot-time), Grafana alert webhook, existing Airflow DAG.
Reversible : disengagement uses the same channels (except env var, which requires a pod restart).
Auditable : every flip emits structured log + writes to history table + OpenTelemetry span.
Fail-safe on connectivity loss : if PG is unreachable for > N consecutive polls, behave as if engaged (do NOT trade in the dark).
Halt latency < 1 s : from operator click to pre-trade rejection.
Operator-only disengagement : auto-engage exists (drawdown / circuit breaker / connectivity) ; auto-disengage does NOT exist (prevents flapping).

3. Non-goals

Not a per-strategy / per-crypto pause — that lives in the FTF factor toggles (ADR-56) or in ftf_config.cryptos.
Not a graceful drain by itself — in-flight orders are NOT cancelled by the kill-switch transition. However, per committee session 4c388b4c BLOCKER, a companion flatten_all Story is now MANDATORY before any live trading — it ships in the implementation Epic CVN-N001-EG (see §7) and provides a separate channel (Console button + CLI) to cancel open positions on demand. Until flatten_all ships, this engine is allowed only in paper mode.
Not a configurable partial halt : the switch has exactly one behavior. It halts new BUY signals + new allocations, while exits (TP/SL/timeout) continue to run on existing positions because that is the safer behavior. There is no separate mode to choose between.
Not a per-account / per-API-key gate — that's exchange-side rate limiting / risk controls, out of our scope.

4. Architecture

4.1 Storage (PostgreSQL)

New table kill_switch_state (single-row, like ftf_config) :

-- infra/migrations/0XX_kill_switch.sql
CREATE TABLE IF NOT EXISTS kill_switch_state (
    id           INT PRIMARY KEY DEFAULT 1 CHECK (id = 1),  -- single row
    engaged      BOOLEAN NOT NULL DEFAULT FALSE,
    engaged_at   TIMESTAMPTZ,
    engaged_by   TEXT,                    -- operator handle, "system:drawdown", "system:circuit_breaker", etc.
    reason       TEXT,
    lock_version INT NOT NULL DEFAULT 0,  -- optimistic concurrency
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
INSERT INTO kill_switch_state (id, engaged) VALUES (1, FALSE) ON CONFLICT DO NOTHING;

CREATE TABLE IF NOT EXISTS kill_switch_history (
    id          BIGSERIAL PRIMARY KEY,
    transition  TEXT NOT NULL CHECK (transition IN ('engage', 'disengage')),
    actor       TEXT NOT NULL,           -- operator handle or "system:..."
    reason      TEXT,
    channel     TEXT NOT NULL,           -- 'console_ui' | 'cli' | 'env_var' | 'grafana_webhook' | 'airflow_dag' | 'auto_drawdown' | 'auto_circuit_breaker' | 'auto_pg_unreachable'
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS kill_switch_history_occurred_at_idx ON kill_switch_history (occurred_at DESC);

Why a dedicated table and not ftf_config.base_env.kill_switch.* :

- ftf_config carries pipeline parameters (ADR-59 scope) ; mixing safety state into the same blob risks an operator editing the wrong key.
- A separate table allows row-level audit trigger (kill_switch_history) without polluting ftf_config_history.
- The 1-row check + lock_version makes concurrent writes from multiple channels safe.

4.2 Read path — `KillSwitchClient`

New module src/commun/safety/kill_switch_client.py exposing :

class KillSwitchClient:
    def is_engaged(self) -> bool: ...           # synchronous, < 10ms (in-memory)
    def state(self) -> KillSwitchState: ...      # full row (cached)
    def start_polling(self, interval_s: float = 0.5) -> None: ...
    def stop_polling(self) -> None: ...

Background thread polls SELECT engaged, engaged_at, engaged_by, reason FROM kill_switch_state WHERE id=1 every 500 ms (configurable).
In-memory cache holds last successful read.
Fail-safe: if N=3 consecutive polls fail (PG unreachable, exception, timeout > 200ms), the in-memory state flips to engaged=True, engaged_by="system:pg_unreachable". A kill_switch_history row is written when PG comes back (deferred audit). Per ADR-25, a CRITICAL log fires on every fail-safe activation — no silent fallback.
KillSwitchClient is a process-global singleton initialised in the trading kernel boot path.
Disengagement propagates the same way : once the PG row flips back to engaged=False, the next poll picks it up within 500 ms, so the worst-case disengage-to-resume latency stays under 1 s.
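The polling and fail-safe behavior described above can be sketched as follows. This is a minimal illustration, not the planned implementation: the class name `PollingKillSwitchClient`, the injected `fetch` callable (standing in for the SELECT on kill_switch_state), and the choice to start in the engaged state until the first successful read are all assumptions made for the example.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KillSwitchState:
    engaged: bool
    engaged_by: Optional[str] = None
    reason: Optional[str] = None


class PollingKillSwitchClient:
    """Caches the last successful read and fails safe after repeated poll errors."""

    def __init__(self, fetch: Callable[[], KillSwitchState], max_failures: int = 3):
        self._fetch = fetch  # hypothetical: wraps the SELECT on kill_switch_state
        self._max_failures = max_failures
        self._failures = 0
        self._lock = threading.Lock()
        # Assumption: treat the switch as engaged until the first successful read.
        self._state = KillSwitchState(True, "system:boot", "no successful read yet")
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def poll_once(self) -> None:
        try:
            fresh = self._fetch()
        except Exception:
            with self._lock:
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Fail-safe: behave as engaged when PG cannot be read.
                    self._state = KillSwitchState(
                        True, "system:pg_unreachable", "consecutive poll failures"
                    )
            return
        with self._lock:
            self._failures = 0
            self._state = fresh

    def is_engaged(self) -> bool:
        with self._lock:
            return self._state.engaged

    def start_polling(self, interval_s: float = 0.5) -> None:
        def run() -> None:
            # Event.wait returns True once stop_polling() sets the event.
            while not self._stop.wait(interval_s):
                self.poll_once()

        self._thread = threading.Thread(target=run, daemon=True)
        self._thread.start()

    def stop_polling(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Injecting `fetch` keeps the fail-safe logic testable without a database; the CRITICAL log and the deferred history row are left out of the sketch.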

4.3 Kernel integration

In _main_loop() at cvntrade_paper_trading_engine.py:1667, self._kill_switch_client.is_engaged() replaces self._risk_manager.is_killed().
RiskManager.check_pre_trade() at line 89 also reads from the client (defense in depth — caught at both layers).
When engaged : skip _generate_and_execute_signals() (line 1693) ; continue check_exits() so existing positions can hit TP/SL.
Boot-time check : if CVN_KILL_SWITCH=engaged env var is set OR the PG state is engaged, the engine logs a WARNING and starts in halted mode (does NOT exit ; the operator may disengage at runtime).
Friendly env-var error (per committee reco 3): if CVN_KILL_SWITCH is set to any value other than engaged or empty (e.g. disengaged, off, 0), the engine FAILS FAST at boot with an explicit message : "CVN_KILL_SWITCH supports only 'engaged' (force-engage at boot) or unset. Disengagement always goes through PostgreSQL — see ADR-71 I8.". This addresses falsifiable hypothesis H5 (operator confusion).
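The env-var rule above (only `engaged` or unset is accepted, anything else fails fast) could look roughly like this. The function name `env_forces_engaged` is hypothetical, and the error text is abbreviated here; the real message would be the one quoted in the bullet above.

```python
import os
from typing import Mapping

ENV_VAR = "CVN_KILL_SWITCH"


def env_forces_engaged(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the env var force-engages the switch at boot.

    Raises ValueError for any value other than 'engaged' or empty/unset,
    so a typo such as 'off' or '0' stops the engine before it can trade.
    """
    raw = environ.get(ENV_VAR, "")
    if raw == "":
        return False
    if raw == "engaged":
        return True
    raise ValueError(
        f"{ENV_VAR} supports only 'engaged' or unset (got {raw!r}); "
        "disengagement always goes through PostgreSQL, see ADR-71 I8."
    )
```

The comparison is deliberately strict (no trimming or case folding), which is itself an assumption: a looser match would accept values the design says must be rejected.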

4.4 Activation channels (write paths)

All channels call a single internal helper engage_kill_switch(actor, reason, channel) that :

1. Performs the optimistic-concurrency UPDATE on kill_switch_state (with lock_version check).
2. Inserts a row in kill_switch_history in the same transaction.
3. Emits structured log : event=kill_switch_engaged actor=<...> reason=<...> channel=<...> per ADR-32/33.
4. Emits OpenTelemetry span kill_switch.engage per ADR-62.
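Steps 1 and 2 of the helper (the lock_version-guarded UPDATE and the history INSERT in one transaction) can be sketched as below. To keep the example runnable without a server it uses the standard-library `sqlite3` as a stand-in for PostgreSQL, with a simplified schema; with psycopg2 the placeholders would be `%s` instead of `?`. Returning False when the switch is already engaged is an assumption, and the logging and OpenTelemetry steps are omitted.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for the two kill-switch tables (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE kill_switch_state (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            engaged INTEGER NOT NULL DEFAULT 0,
            engaged_at TEXT, engaged_by TEXT, reason TEXT,
            lock_version INTEGER NOT NULL DEFAULT 0,
            updated_at TEXT
        );
        INSERT INTO kill_switch_state (id) VALUES (1);
        CREATE TABLE kill_switch_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            transition TEXT NOT NULL, actor TEXT NOT NULL,
            reason TEXT, channel TEXT NOT NULL,
            occurred_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        """
    )
    return conn


def engage_kill_switch(conn: sqlite3.Connection, actor: str, reason: str, channel: str) -> bool:
    """Return True if this call flipped the switch, False otherwise."""
    with conn:  # commits on success, rolls back if an exception escapes
        engaged, version = conn.execute(
            "SELECT engaged, lock_version FROM kill_switch_state WHERE id = 1"
        ).fetchone()
        if engaged:
            return False  # assumption: engaging twice is a no-op
        cur = conn.execute(
            "UPDATE kill_switch_state SET engaged = 1, engaged_by = ?, reason = ?, "
            "engaged_at = CURRENT_TIMESTAMP, updated_at = CURRENT_TIMESTAMP, "
            "lock_version = lock_version + 1 "
            "WHERE id = 1 AND lock_version = ?",
            (actor, reason, version),
        )
        if cur.rowcount != 1:
            return False  # another writer bumped lock_version first
        conn.execute(
            "INSERT INTO kill_switch_history (transition, actor, reason, channel) "
            "VALUES ('engage', ?, ?, ?)",
            (actor, reason, channel),
        )
        return True
```

Because the UPDATE only matches the lock_version that was read, two channels racing to engage cannot both write a history row for the same flip.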

Channel inventory :

Channel	Surface	Backend	Auth	Use case
Console UI	red "EMERGENCY KILL" button at `scripts/ftf_config_ui.py` top bar (port 8501)	direct psycopg2 UPDATE	Streamlit session	operator at desk
CLI	`make kill-switch ON REASON="..."` / `OFF` (Makefile target → `scripts/kill_switch_cli.py`)	direct psycopg2 UPDATE	local shell + `OPENPROJECT_API_KEY`-style env (`KILL_SWITCH_ACTOR=$USER`)	operator on phone via SSH
K8s env var	`CVN_KILL_SWITCH=engaged` in pod spec, picked up at engine boot	bypasses PG ; engine refuses to start trading	K8s RBAC	restart all pods in safe mode (mass incident)
Grafana webhook	Grafana alert → AlertManager → POST `/api/v1/kill-switch` on a tiny FastAPI sidecar `services/kill_switch_webhook/main.py`	direct psycopg2 UPDATE	shared secret in `Authorization` header	auto-trigger on `expectancy_net < 0` over 24h, `pnl_drawdown > X%`, etc.
Airflow DAG	existing `dag_pte__7_killswitch.py` refactored to call `engage_kill_switch(...)` instead of `RiskManager.activate_kill_switch(...)`	direct psycopg2 UPDATE	Airflow Connection	scheduled or manual trigger
Auto (system)	drawdown / circuit breaker / pg_unreachable	direct psycopg2 UPDATE (except pg_unreachable, which writes deferred)	n/a	safety-net
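For the Grafana webhook row, the shared-secret check on the `Authorization` header might be sketched like this. The `Bearer <secret>` header format and the function name `is_authorized` are assumptions (the table only says the secret travels in the `Authorization` header); `hmac.compare_digest` is used so the comparison does not leak timing information.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], shared_secret: str) -> bool:
    """Check a 'Bearer <secret>' header against the configured shared secret."""
    if not authorization_header or not shared_secret:
        return False  # missing header or unconfigured secret: reject
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison of the presented token and the expected secret.
    return hmac.compare_digest(token.encode(), shared_secret.encode())
```

Rejecting requests when the secret is empty avoids an unconfigured sidecar accepting every caller.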

4.5 Disengagement (`disengage_kill_switch`)

Same shape as engagement. Only operator-initiated channels (Console UI, CLI, Airflow DAG) can disengage. Grafana webhook, env var, and auto channels CANNOT disengage (that would create flapping). The one exception is the fail-safe system:pg_unreachable state, which exists only in the client's in-memory cache (§4.2) : it clears automatically when PG comes back, with a kill_switch_history row recording the auto-clear.
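The channel restriction on disengagement can be expressed as a small guard. The channel strings are taken from the kill_switch_history.channel comment in §4.1; the function name `check_can_disengage` and the rule that any actor starting with `system:` is rejected are illustrative assumptions.

```python
# Channels allowed to disengage: operator-initiated only.
DISENGAGE_CHANNELS = frozenset({"console_ui", "cli", "airflow_dag"})


def check_can_disengage(channel: str, actor: str) -> None:
    """Raise PermissionError unless a human operator uses an allowed channel."""
    if channel not in DISENGAGE_CHANNELS:
        raise PermissionError(f"channel {channel!r} cannot disengage the kill-switch")
    if actor.startswith("system:"):
        # Auto actors may engage but never disengage.
        raise PermissionError(f"actor {actor!r} is not a human operator")
```

Calling this guard at the top of disengage_kill_switch would keep the policy in one place instead of repeating it in each channel.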

5. Invariants (codified in ADR-71)

These MUST hold forever (or until ADR-71 is superseded) :

I1 — Single source of truth : the only authoritative kill-switch state is the kill_switch_state row in PostgreSQL. No code path may set a process-local killed flag without polling PG.
I2 — Operator-only disengage : auto channels can engage but never disengage. Disengagement requires a human actor named in engaged_by / kill_switch_history.actor.
I3 — Fail-safe on connectivity loss : if 3 consecutive polls fail (1.5 s at the default 500 ms interval), the client behaves as engaged, emits a CRITICAL log, and writes a deferred history row when PG returns. Per ADR-25, no silent fallback.
I4 — No graceful drain : engaging the switch does NOT cancel in-flight orders. It halts new BUY signals + new allocations only ; exit checks (TP/SL) continue to run.
I5 — Audit per flip : every transition (engage / disengage / fail-safe / fail-safe-clear) writes a row in kill_switch_history AND emits structured log per ADR-32/33 AND emits OTel span per ADR-62.
I6 — Halt latency < 1 s : from successful engagement transaction to next pre-trade reject is bounded by polling_interval_s (default 500 ms) + one PG round-trip. Implementation MUST measure this via OTel span and emit a metric.
I7 — Boot-time honors PG state : on engine start, the kernel reads kill_switch_state once synchronously (before accepting any trade) ; only then does it start the polling thread. An engaged=True state at boot means the engine starts in halted mode without ever sending an order.
I8 — Env-var bypass is engage-only : CVN_KILL_SWITCH=engaged env var forces engaged at boot regardless of PG. There is no CVN_KILL_SWITCH=disengaged ; disengagement always goes through PG (otherwise the env var would override operator decisions silently).
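As a rough check of the timing numbers behind I3 and I6, the two budgets can be computed from the defaults. The 200 ms round-trip figure is an assumption borrowed from the poll timeout in §4.2 (a slower poll counts as a failure); the function names are hypothetical.

```python
def worst_case_halt_latency_s(interval_s: float = 0.5, pg_round_trip_s: float = 0.2) -> float:
    """I6 bound: one full polling interval plus one PG round-trip."""
    return interval_s + pg_round_trip_s


def fail_safe_delay_s(interval_s: float = 0.5, max_failures: int = 3) -> float:
    """I3 delay: time for max_failures consecutive polls to fail."""
    return interval_s * max_failures


# With the defaults the halt bound stays under the 1 s target of I6.
assert worst_case_halt_latency_s() < 1.0
```

Raising the polling interval above roughly 800 ms would break the 1 s target under these assumptions, which is one way to frame the polling-interval open question.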

6. Observability

Grafana panel "Kill switch state" : gauge (0/1) + timestamp of last flip + actor + channel.
Loki query for audit feed : {job="kill_switch"} | json | line_format "{{.actor}} {{.channel}} {{.transition}}".
Prometheus metric cvntrade_kill_switch_engaged{} (gauge 0/1) scraped from the FastAPI sidecar's /metrics.
Prometheus metric cvntrade_kill_switch_pg_unreachable_total{} (counter) emitted by KillSwitchClient on every fail-safe activation. Alert : rate(cvntrade_kill_switch_pg_unreachable_total[1h]) > 0.1 (per committee reco 6 — validates H1 over time).
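PostgreSQL load monitoring (per committee reco 4) : track the latency of the kill_switch_state SELECT. pg_stat_statements exposes mean and max execution time per statement but no percentiles, so a p99 has to come from a client-side histogram (e.g. emitted by KillSwitchClient). Alert if p99 > 50 ms or if QPS × engine count overwhelms PG.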
SLI : kill_switch_halt_latency_seconds p99 < 1 s (alert when violated).
Console UI Quick Actions panel (per committee reco 8) : top-bar widget with current state indicator + one-click engage button + CLI cheat sheet + clear disengage confirmation workflow (2-step : type "DISENGAGE" + reason text + confirm).
MLOps readiness §1 for any future Story modifying this surface MUST include a metric for the channel distribution (which channel triggers most).
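The halt-latency SLI above (p99 < 1 s) can be evaluated offline from a list of measured latencies. This sketch uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here; in production the value would more likely come from a Prometheus histogram. Function names are hypothetical.

```python
from typing import Sequence


def p99_nearest_rank(samples: Sequence[float]) -> float:
    """Return the 99th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # ceil(0.99 * n) computed with integer arithmetic to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def halt_latency_sli_met(samples_s: Sequence[float], threshold_s: float = 1.0) -> bool:
    """True when the p99 halt latency is below the threshold (default 1 s)."""
    return p99_nearest_rank(samples_s) < threshold_s
```

With few samples the nearest-rank p99 is simply the maximum, so a single slow halt fails the check until enough measurements accumulate.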

7. Implementation plan (out of this Story's scope — future Epic)

This Story ships the DESIGN + ADR-71 only. Implementation lives in a future Epic CVN-N001-EG "Kill-switch implementation", broken into Stories following ADR-69 :

EG-S01 : PG schema + KillSwitchClient + kernel integration + boot-time honor (the core)
EG-S02 : CLI + Makefile target + Console UI Quick Actions panel (per committee reco 8)
EG-S03 : Grafana webhook sidecar + AlertManager wiring
EG-S04 : Airflow DAG refactor + auto fail-safe path
EG-S05 : Observability (panel + SLI + metrics including pg_unreachable_total)
EG-S06 : flatten_all companion (cancel open positions on demand) — MANDATORY before live deployment per committee session 4c388b4c BLOCKER (separate channel from kill-switch ; Console button + CLI ; integrates with exchange API)
EG-S07 : Chaos engineering harness for I3 (PG blips) + I7 (K8s pod kills) validation (per committee reco 9 ; runs as part of pre-live gate)
Cross-cutting : operator runbook in documentation/runbooks/kill_switch_<channel>.md for each channel (per committee reco 5) + ADR-70 amendment to add kill-switch-specific MLOps checks (per committee reco 7)

Each EG Story must fill TEMPLATE_mlops_readiness.md per ADR-70.

8. Failure mode coverage

Mapping each invariant to the failure modes it prevents :

Invariant	Prevented failure mode
I1 single PG source	"We restarted the pod and the kill-switch was forgotten"
I2 operator-only disengage	Auto-disengage flapping during transient PG hiccup
I3 fail-safe on PG loss	Trading silently in the dark when PG goes down
I4 no graceful drain	Operator confusion about "did the switch cancel my orders?" — no, by design
I5 audit per flip	"Who killed the engine at 3am and why?" — answer in `kill_switch_history`
I6 latency < 1 s	"I clicked but trades kept happening for 10 s"
I7 boot honors PG	Pod restart in middle of engaged window starts trading again
I8 env-var engage-only	An operator setting `CVN_KILL_SWITCH=disengaged` to bypass another operator's engage decision silently

9. Alternatives considered (all rejected)

Single in-memory flag, broadcast via Redis pub/sub : faster (~10 ms latency) but introduces a second source of truth ; if Redis loses the message during reconnect, state diverges. PG poll is slower but simpler and survives any infra blip.
Filesystem-flag (e.g., /tmp/kill_switch) : doesn't survive pod restart, doesn't work across pods.
K8s ConfigMap watch : per-namespace blast radius is fine but ConfigMap eventual consistency under load can take seconds, and there's no audit trail equivalent to kill_switch_history.
Exchange-side risk controls only : doesn't cover paper-trading or the in-house signal generation ; also doesn't help when the bug is in our position-sizing code, not the exchange.
Cancel all in-flight orders on engage : tempting but introduces a partial-state failure mode (some cancelled, some not) and adds exchange API dependency to the safety path. The operator handles open orders via the exchange UI or, once it ships, via the separate flatten_all channel (EG-S06).

10. Open questions for committee

The committee plan_review should explicitly answer :

Storage : separate kill_switch_state table vs nesting under ftf_config.base_env.kill_switch.* — is the dedicated table justified?
Polling interval : 500 ms default. Too aggressive (PG load) or too slow (1 s p99 halt latency feels too long for a kill switch)?
Fail-safe threshold : 3 consecutive failures (1.5 s). Should it be tighter (e.g., 1 failure = engage immediately)?
Env-var asymmetry : engaged works but disengaged doesn't. Are there scenarios where this hurts (e.g., a botched PG state from a previous engage that you cannot un-stick because the Console is down)?
Channel completeness : 6 channels listed. Missing a channel (e.g., a phone-app webhook, an exchange-side webhook)?
Auto-engage triggers : drawdown + circuit breaker + connectivity loss are listed. Should "P1 alert from Grafana on expectancy_net < 0 over 24h" be an auto channel, or stay manual to avoid over-aggressive engagement?
Disengage authorization : currently any operator can disengage if they have Console / CLI access. Should there be a 2-person rule for disengagement (analogous to nuclear-launch keys)?
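In-flight orders policy : the design explicitly does NOT cancel in-flight orders. Is "operator handles via exchange UI" acceptable, or should there be a follow-up Story for flatten_all? (Answered by committee session 4c388b4c BLOCKER : the flatten_all Story EG-S06 is mandatory before live trading, see §3 and §7.)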

11. Acceptance criteria (Story level)

Design doc merged at documentation/design/CVN-N001-EF-S02-kill-switch-design.md
ADR-71 file merged codifying I1-I8
ADR index + CLAUDE.md ADR cluster updated
Committee verdict of ACCEPTED or ACCEPTED-WITH-CHANGES (avg ≥ 8.0, no blockers)
Open questions §10 answered by committee or recorded as deferred-to-implementation
Story OP wp#56 closed with PR + commit reference
Sprint F1B-S0-Prereqs closeable (S01 + S02 both done)
Live deployment gate : engine MUST NOT trade live until companion flatten_all Story (EG-S06) ships per committee session 4c388b4c BLOCKER. Paper trading not gated.

12. References

Need : CVN-N001 (F1 mission)
Epic : CVN-N001-EF (#729) — F1 mission operational prereqs
Sibling Story : CVN-N001-EF-S01 (#709) — MLOps readiness template (merged in PR #730)
ADRs touched / built upon :
ADR-25 — No silent fallback (the basis of I3 fail-safe)
ADR-32, ADR-33 — log_event format + closed event catalogue (I5 audit)
ADR-39 — Runtime standalone, API façade (kernel boundary)
ADR-40 — Paper and live share the same kernel (one switch covers both)
ADR-56 — Pipeline change must be FTF-testable (kill-switch is NOT FTF-testable, intentionally — it's a safety layer outside the A/B surface)
ADR-59 — All pipeline params in PostgreSQL (the data plane this design builds on, but with a dedicated table)
ADR-62 — Unified observability via OpenTelemetry (I5 + I6 spans / metrics)
ADR-68 — Committee = default review channel (this dossier)
ADR-69 — OpenProject orchestrator (Story discipline)
ADR-70 — MLOps readiness template (future EG Stories must fill it)
Code surfaces referenced :
src/paper_trading/cvntrade_paper_trading_engine.py — kernel
src/paper_trading/safety/cvntrade_risk_manager.py — existing in-memory primitive
scripts/ftf_config_ui.py — Console UI host for the new button
infra/migrations/ — schema migrations land here
Issues : #708 (this Story), #729 (Epic), #608 (Need), session 8db2529d (committee finding source)
