Design: System-wide trading kill-switch (CVN-N001-EF-S02)
story_id: CVN-N001-EF-S02
epic_id: CVN-N001-EF (#729)
need_id: CVN-N001 (F1 mission)
Authors: Dominique (operator) + Claude
Reviewers: Expert Committee (plan_review mode B)
Status: draft
Created: 2026-04-28
Last updated: 2026-04-28
Changelog
| Date | Change |
|---|---|
| 2026-04-28 | Initial draft for committee review |
1. Problem restated
Per F1_buy boost committee finding (session 8db2529d, Ops persona) :
Implement a System-Wide Kill-Switch: Develop and integrate an explicit, easily accessible mechanism to immediately halt all trading activity or revert the system to a predefined safe state in an emergency.
Today the project has partial kill-switch primitives :
- RiskManager.activate_kill_switch(reason) at src/paper_trading/safety/cvntrade_risk_manager.py:201 sets the in-memory flag _killed=True
- The kernel checks _killed at src/paper_trading/cvntrade_paper_trading_engine.py:1667 (every ~1s tick)
- Pre-trade check rejects when killed at line 2225
- Auto-activation on global drawdown breach (line 175)
- One activation channel : Airflow DAG dag_pte__7_killswitch.py
Gap : the state is per-process in-memory. Multiple paper / live engine instances each hold their own _killed flag. Restart loses state. Activation requires triggering one specific Airflow DAG. There is no single source of truth, no Console button, no env-var safety, no Grafana auto-trigger, no audit table beyond Loki logs.
2. Goals (from #708, refined)
- Single source of truth : kill-switch state lives in PostgreSQL ; every trading decision reads it (no per-process cache surviving > 500ms).
- Multi-channel activation : Console UI button, CLI command, K8s env var (boot-time), Grafana alert webhook, existing Airflow DAG.
- Reversible : disengagement uses the same channels (except env var, which requires a pod restart).
- Auditable : every flip emits structured log + writes to history table + OpenTelemetry span.
- Fail-safe on connectivity loss : if PG is unreachable for > N consecutive polls, behave as if engaged (do NOT trade in the dark).
- Halt latency < 1 s : from operator click to pre-trade rejection.
- Operator-only disengagement : auto-engage exists (drawdown / circuit breaker / connectivity) ; auto-disengage does NOT exist (prevents flapping).
3. Non-goals
- Not a per-strategy / per-crypto pause : that lives in the FTF factor toggles (ADR-56) or in ftf_config.cryptos.
- Not a graceful drain by itself : in-flight orders are NOT cancelled by the kill-switch transition. However, per committee session 4c388b4c BLOCKER, a companion flatten_all Story is now MANDATORY before any live trading. It ships in the implementation Epic CVN-N001-EG (see §7) and provides a separate channel (Console button + CLI) to cancel open positions on demand. Until flatten_all ships, this engine is allowed only in paper mode.
- Not a configurable partial halt : the switch has exactly one behavior and no selectable modes. Engaging it halts new BUY signals + new allocations, while exits (TP/SL/timeout) continue to run on existing positions because that is the safer behavior.
- Not a per-account / per-API-key gate — that's exchange-side rate limiting / risk controls, out of our scope.
4. Architecture
4.1 Storage (PostgreSQL)
New table kill_switch_state (single-row, like ftf_config) :
```sql
-- infra/migrations/0XX_kill_switch.sql
CREATE TABLE IF NOT EXISTS kill_switch_state (
    id           INT PRIMARY KEY DEFAULT 1 CHECK (id = 1),  -- single row
    engaged      BOOLEAN NOT NULL DEFAULT FALSE,
    engaged_at   TIMESTAMPTZ,
    engaged_by   TEXT,  -- operator handle, "system:drawdown", "system:circuit_breaker", etc.
    reason       TEXT,
    lock_version INT NOT NULL DEFAULT 0,  -- optimistic concurrency
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
INSERT INTO kill_switch_state (id, engaged) VALUES (1, FALSE) ON CONFLICT DO NOTHING;

CREATE TABLE IF NOT EXISTS kill_switch_history (
    id          BIGSERIAL PRIMARY KEY,
    transition  TEXT NOT NULL CHECK (transition IN ('engage', 'disengage')),
    actor       TEXT NOT NULL,  -- operator handle or "system:..."
    reason      TEXT,
    channel     TEXT NOT NULL,  -- 'console_ui' | 'cli' | 'env_var' | 'grafana_webhook' | 'airflow_dag' | 'auto_drawdown' | 'auto_circuit_breaker' | 'auto_pg_unreachable'
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- IF NOT EXISTS keeps the migration re-runnable, like the CREATE TABLE statements above.
CREATE INDEX IF NOT EXISTS kill_switch_history_occurred_at_idx ON kill_switch_history (occurred_at DESC);
```
Why a dedicated table and not ftf_config.base_env.kill_switch.* :
- ftf_config carries pipeline parameters (ADR-59 scope) ; mixing safety state into the same blob risks an operator editing the wrong key.
- A separate table allows row-level audit trigger (kill_switch_history) without polluting ftf_config_history.
- The 1-row check + lock_version makes concurrent writes from multiple channels safe.
4.2 Read path: KillSwitchClient
New module src/commun/safety/kill_switch_client.py exposing :
```python
class KillSwitchClient:
    def is_engaged(self) -> bool: ...  # synchronous, < 10ms (in-memory)
    def state(self) -> "KillSwitchState": ...  # full row (cached)
    def start_polling(self, interval_s: float = 0.5) -> None: ...
    def stop_polling(self) -> None: ...
```
- Background thread polls SELECT engaged, engaged_at, engaged_by, reason FROM kill_switch_state WHERE id=1 every 500 ms (configurable).
- In-memory cache holds last successful read.
- Fail-safe : if N=3 consecutive polls fail (PG unreachable, exception, timeout > 200ms), the in-memory state flips to engaged=True, engaged_by="system:pg_unreachable". A kill_switch_history row is written when PG comes back (deferred audit). Per ADR-25, a CRITICAL log fires on every fail-safe activation, with no silent fallback.
- KillSwitchClient is a process-global singleton initialised in the trading kernel boot path.
- Deactivation propagates the same way : once PG flips back to engaged=False, the next poll picks it up within 500 ms, so the worst-case disengage-to-resume delay stays under 1 s.
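The polling and fail-safe behavior described above can be sketched as follows. This is a minimal illustration, not the planned implementation: the injected `fetch_state` callable (standing in for the PG SELECT with its 200 ms timeout), the `poll_once` method, and the choice to report engaged until the first successful read are all assumptions made for the sketch.

```python
import logging
import threading
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("kill_switch")


@dataclass(frozen=True)
class KillSwitchState:
    engaged: bool
    engaged_by: Optional[str] = None
    reason: Optional[str] = None


class KillSwitchClient:
    """Caches the last successful read; fails safe after repeated poll errors."""

    def __init__(self, fetch_state: Callable[[], KillSwitchState], max_failures: int = 3):
        # fetch_state is a hypothetical injection point for the PG query.
        self._fetch_state = fetch_state
        self._max_failures = max_failures
        self._failures = 0
        # Assumption: report engaged until the first successful read.
        self._state = KillSwitchState(True, "system:boot", "no successful read yet")
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def poll_once(self) -> None:
        try:
            fresh = self._fetch_state()
        except Exception:
            with self._lock:
                self._failures += 1
                if self._failures == self._max_failures:
                    # Fail-safe activation: loud log, never a silent fallback.
                    logger.critical("event=kill_switch_fail_safe reason=pg_unreachable")
                if self._failures >= self._max_failures:
                    self._state = KillSwitchState(True, "system:pg_unreachable", "fail-safe")
            return
        with self._lock:
            self._failures = 0
            self._state = fresh

    def is_engaged(self) -> bool:
        with self._lock:
            return self._state.engaged

    def state(self) -> KillSwitchState:
        with self._lock:
            return self._state

    def start_polling(self, interval_s: float = 0.5) -> None:
        def loop() -> None:
            while not self._stop.is_set():
                self.poll_once()
                self._stop.wait(interval_s)

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop_polling(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Keeping `poll_once` separate from the thread loop makes the fail-safe threshold testable without sleeping. The deferred history write on PG recovery is left out of this sketch.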
4.3 Kernel integration
- _main_loop() at paper_trading_engine.py:1667 replaces self._risk_manager.is_killed() with self._kill_switch_client.is_engaged().
- RiskManager.check_pre_trade() at line 89 also reads from the client (defense in depth : caught at both layers).
- When engaged : skip _generate_and_execute_signals() (line 1693) ; continue check_exits() so existing positions can hit TP/SL.
- Boot-time check : if the CVN_KILL_SWITCH=engaged env var is set OR the PG state is engaged, the engine logs a WARNING and starts in halted mode (does NOT exit ; the operator may disengage at runtime).
- Friendly env-var error (per committee reco 3) : if CVN_KILL_SWITCH is set to any value other than engaged or empty (e.g. disengaged, off, 0), the engine FAILS FAST at boot with an explicit message : "CVN_KILL_SWITCH supports only 'engaged' (force-engage at boot) or unset. Disengagement always goes through PostgreSQL; see ADR-71 I8.". This addresses falsifiable hypothesis H5 (operator confusion).
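The boot-time rules above (engage on env var OR PG state, fail fast on any other env value) might look like the sketch below. The function names `env_forces_engage` and `starts_halted` are hypothetical, and the caller is assumed to have already read the PG row synchronously.

```python
from typing import Optional

ENV_VAR = "CVN_KILL_SWITCH"
ERROR_MESSAGE = (
    "CVN_KILL_SWITCH supports only 'engaged' (force-engage at boot) or unset. "
    "Disengagement always goes through PostgreSQL; see ADR-71 I8."
)


def env_forces_engage(raw: Optional[str]) -> bool:
    """Return True when the env var forces engagement; reject anything unexpected."""
    if raw is None or raw == "":
        return False
    if raw == "engaged":
        return True
    # Values such as 'disengaged', 'off' or '0' are refused rather than ignored.
    raise ValueError(ERROR_MESSAGE)


def starts_halted(raw_env: Optional[str], pg_engaged: bool) -> bool:
    """The engine boots halted if either the env var or the PG row says engaged."""
    # The env var is validated first, so a bad value fails fast even when PG is engaged.
    return env_forces_engage(raw_env) or pg_engaged
```

A caller would typically pass `os.environ.get(ENV_VAR)` as `raw_env` and let the `ValueError` abort the boot.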
4.4 Activation channels (write paths)
All channels call a single internal helper engage_kill_switch(actor, reason, channel) that :
1. Performs the optimistic-concurrency UPDATE on kill_switch_state (with lock_version check).
2. Inserts a row in kill_switch_history in the same transaction.
3. Emits structured log : event=kill_switch_engaged actor=<...> reason=<...> channel=<...> per ADR-32/33.
4. Emits OpenTelemetry span kill_switch.engage per ADR-62.
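Steps 1 to 3 can be sketched as below. To stay runnable without a PostgreSQL server, the example uses the standard-library `sqlite3` module with a simplified copy of the schema as a stand-in; with psycopg2 the placeholders would be `%s` instead of `?`. The `expected_version` parameter and the `make_demo_db` helper are assumptions of this sketch, and the OpenTelemetry span (step 4) is omitted.

```python
import logging
import sqlite3

logger = logging.getLogger("kill_switch")


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the two PG tables (simplified types)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE kill_switch_state (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            engaged INTEGER NOT NULL DEFAULT 0,
            engaged_at TEXT, engaged_by TEXT, reason TEXT,
            lock_version INTEGER NOT NULL DEFAULT 0,
            updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        INSERT INTO kill_switch_state (id, engaged) VALUES (1, 0);
        CREATE TABLE kill_switch_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            transition TEXT NOT NULL, actor TEXT NOT NULL,
            reason TEXT, channel TEXT NOT NULL,
            occurred_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        """
    )
    return conn


def engage_kill_switch(conn, actor: str, reason: str, channel: str, expected_version: int) -> bool:
    """Return True if this call flipped the switch, False on a lock_version conflict."""
    with conn:  # one transaction: commit on success, rollback on exception
        cur = conn.execute(
            "UPDATE kill_switch_state "
            "SET engaged = 1, engaged_at = CURRENT_TIMESTAMP, engaged_by = ?, reason = ?, "
            "lock_version = lock_version + 1, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = 1 AND lock_version = ?",
            (actor, reason, expected_version),
        )
        if cur.rowcount != 1:
            return False  # another channel wrote first; caller re-reads and retries
        conn.execute(
            "INSERT INTO kill_switch_history (transition, actor, reason, channel) "
            "VALUES ('engage', ?, ?, ?)",
            (actor, reason, channel),
        )
    logger.info("event=kill_switch_engaged actor=%s reason=%s channel=%s", actor, reason, channel)
    return True
```

Because the history insert shares the transaction with the state update, a failed insert rolls back the flip, so state and audit trail cannot diverge.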
Channel inventory :
| Channel | Surface | Backend | Auth | Use case |
|---|---|---|---|---|
| Console UI | red "EMERGENCY KILL" button at scripts/ftf_config_ui.py top bar (port 8501) | direct psycopg2 UPDATE | Streamlit session | operator at desk |
| CLI | make kill-switch ON REASON="..." / OFF (Makefile target → scripts/kill_switch_cli.py) | direct psycopg2 UPDATE | local shell + OPENPROJECT_API_KEY-style env (KILL_SWITCH_ACTOR=$USER) | operator on phone via SSH |
| K8s env var | CVN_KILL_SWITCH=engaged in pod spec, picked up at engine boot | bypasses PG ; engine refuses to start trading | K8s RBAC | restart all pods in safe mode (mass incident) |
| Grafana webhook | Grafana alert → AlertManager → POST /api/v1/kill-switch on a tiny FastAPI sidecar services/kill_switch_webhook/main.py | direct psycopg2 UPDATE | shared secret in Authorization header | auto-trigger on expectancy_net < 0 over 24h, pnl_drawdown > X%, etc. |
| Airflow DAG | existing dag_pte__7_killswitch.py refactored to call engage_kill_switch(...) instead of RiskManager.activate_kill_switch(...) | direct psycopg2 UPDATE | Airflow Connection | scheduled or manual trigger |
| Auto (system) | drawdown / circuit breaker / pg_unreachable | direct psycopg2 UPDATE (except pg_unreachable, which writes deferred) | n/a | safety-net |
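For the Grafana webhook row, the shared-secret check could be sketched as below. The `Bearer` scheme and the `is_authorized` name are assumptions; the design only states that a shared secret travels in the Authorization header.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], shared_secret: str) -> bool:
    """Validate 'Authorization: Bearer <token>' against the shared secret."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking the secret through comparison timing.
    return hmac.compare_digest(token.encode("utf-8"), shared_secret.encode("utf-8"))
```

The sidecar would call this before invoking the engagement helper and answer with an HTTP 401 when it returns False.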
4.5 Disengagement (disengage_kill_switch)
Same shape as engagement. Only operator-initiated channels (Console UI, CLI, Airflow DAG) can disengage. Grafana webhook, env var, and auto channels CANNOT disengage (that would create flapping). The one exception is the in-memory fail-safe system:pg_unreachable, which clears automatically when PG comes back (with a kill_switch_history row recording the auto-clear) ; the client then reflects whatever the PG row says.
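The channel restriction can be expressed as a small guard that a disengage helper would run before touching PG. This is a sketch: the channel names reuse the values from the kill_switch_history comment, while `assert_can_disengage` and the "system:" actor check are illustrative assumptions. The automatic clearing of the pg_unreachable fail-safe happens inside the client and does not pass through this guard.

```python
DISENGAGE_CHANNELS = frozenset({"console_ui", "cli", "airflow_dag"})


def assert_can_disengage(channel: str, actor: str) -> None:
    """Raise PermissionError unless a human actor uses an operator channel."""
    if channel not in DISENGAGE_CHANNELS:
        raise PermissionError(f"channel {channel!r} may engage but never disengage")
    if not actor or actor.startswith("system:"):
        raise PermissionError("disengagement requires a named human actor")
```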
5. Invariants (codified in ADR-71)
These MUST hold forever (or until ADR-71 is superseded) :
- I1 (Single source of truth) : the only authoritative kill-switch state is the kill_switch_state row in PostgreSQL. No code path may set a process-local killed flag without polling PG.
- I2 (Operator-only disengage) : auto channels can engage but never disengage. Disengagement requires a human actor named in engaged_by / kill_switch_history.actor.
- I3 (Fail-safe on connectivity loss) : if PG is unreachable for 3 consecutive polls (1.5 s at the default 500 ms interval), the client behaves as engaged, emits a CRITICAL log, and writes a deferred history row when PG returns. Per ADR-25, no silent fallback.
- I4 (No graceful drain) : engaging the switch does NOT cancel in-flight orders. It halts new BUY signals + new allocations only ; exit checks (TP/SL) continue to run.
- I5 (Audit per flip) : every transition (engage / disengage / fail-safe / fail-safe-clear) writes a row in kill_switch_history AND emits structured log per ADR-32/33 AND emits OTel span per ADR-62.
- I6 (Halt latency < 1 s) : the delay from successful engagement transaction to next pre-trade reject is bounded by polling_interval_s (default 500 ms) + one PG round-trip. Implementation MUST measure this via OTel span and emit a metric.
- I7 (Boot-time honors PG state) : on engine start, the kernel reads kill_switch_state once synchronously (before accepting any trade) ; only then does it start the polling thread. A True state at boot means the engine starts in halted mode without ever sending an order.
- I8 (Env-var bypass is engage-only) : the CVN_KILL_SWITCH=engaged env var forces engaged at boot regardless of PG. There is no CVN_KILL_SWITCH=disengaged ; disengagement always goes through PG (otherwise the env var would override operator decisions silently).
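The timing bounds in I3 and I6 reduce to simple arithmetic, worked through below. The 50 ms PG round-trip is an illustrative assumption, not a measured value, and the function names are hypothetical.

```python
def halt_latency_bound_s(polling_interval_s: float = 0.5, pg_round_trip_s: float = 0.05) -> float:
    """I6: worst case is one full polling interval plus one PG round-trip."""
    return polling_interval_s + pg_round_trip_s


def fail_safe_delay_s(polling_interval_s: float = 0.5, max_failures: int = 3) -> float:
    """I3: time for max_failures consecutive failed polls to accumulate."""
    return polling_interval_s * max_failures


def meets_i6(polling_interval_s: float, pg_round_trip_s: float, limit_s: float = 1.0) -> bool:
    """True when the halt-latency bound stays under the 1 s target."""
    return halt_latency_bound_s(polling_interval_s, pg_round_trip_s) < limit_s
```

With the defaults, the halt bound is 0.5 + 0.05 = 0.55 s and the fail-safe delay is 0.5 × 3 = 1.5 s. Raising the polling interval to 1 s would break I6 regardless of PG speed.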
6. Observability
- Grafana panel "Kill switch state" : gauge (0/1) + timestamp of last flip + actor + channel.
- Loki query for audit feed : {job="kill_switch"} | json | line_format "{{.actor}} {{.channel}} {{.transition}}".
- Prometheus metric cvntrade_kill_switch_engaged{} (gauge 0/1) scraped from the FastAPI sidecar's /metrics.
- Prometheus metric cvntrade_kill_switch_pg_unreachable_total{} (counter) emitted by KillSwitchClient on every fail-safe activation. Alert : rate(cvntrade_kill_switch_pg_unreachable_total[1h]) > 0.1 (per committee reco 6 ; validates H1 over time).
- PostgreSQL load monitoring (per committee reco 4) : add a pg_stat_statements query for the kill_switch_state SELECT p99 latency. Alert if > 50 ms or if QPS × engine count overwhelms PG.
- SLI : kill_switch_halt_latency_seconds p99 < 1 s (alert when violated).
- Console UI Quick Actions panel (per committee reco 8) : top-bar widget with current state indicator + one-click engage button + CLI cheat sheet + clear disengage confirmation workflow (2-step : type "DISENGAGE" + reason text + confirm).
- MLOps readiness §1 for any future Story modifying this surface MUST include a metric for the channel distribution (which channel triggers most).
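As an offline illustration of the halt-latency SLI, the p99 check can be reproduced over a list of measured latencies. In production the percentile would come from the metrics backend. The `p99` and `sli_violated` helpers are hypothetical names for this sketch.

```python
import statistics
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """99th percentile; quantiles(n=100) returns 99 cut points, index 98 is p99."""
    return statistics.quantiles(samples, n=100)[98]


def sli_violated(latencies_s: Sequence[float], threshold_s: float = 1.0) -> bool:
    """True when the p99 halt latency reaches or exceeds the threshold."""
    return p99(latencies_s) >= threshold_s
```

`statistics.quantiles` needs at least two samples, so a caller should skip the check until enough flips have been observed.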
7. Implementation plan (out of this Story's scope, future Epic)
This Story ships the DESIGN + ADR-71 only. Implementation lives in a future Epic CVN-N001-EG "Kill-switch implementation", broken into Stories following ADR-69 :
- EG-S01 : PG schema + KillSwitchClient + kernel integration + boot-time honor (the core)
- EG-S02 : CLI + Makefile target + Console UI Quick Actions panel (per committee reco 8)
- EG-S03 : Grafana webhook sidecar + AlertManager wiring
- EG-S04 : Airflow DAG refactor + auto fail-safe path
- EG-S05 : Observability (panel + SLI + metrics including pg_unreachable_total)
- EG-S06 : flatten_all companion (cancel open positions on demand), MANDATORY before live deployment per committee session 4c388b4c BLOCKER (separate channel from kill-switch ; Console button + CLI ; integrates with exchange API)
- EG-S07 : Chaos engineering harness for I3 (PG blips) + I7 (K8s pod kills) validation (per committee reco 9 ; runs as part of pre-live gate)
- Cross-cutting : operator runbook in documentation/runbooks/kill_switch_<channel>.md for each channel (per committee reco 5) + ADR-70 amendment to add kill-switch-specific MLOps checks (per committee reco 7)
Each EG Story must fill TEMPLATE_mlops_readiness.md per ADR-70.
8. Failure mode coverage
Mapping each invariant to the failure modes it prevents :
| Invariant | Prevented failure mode |
|---|---|
| I1 single PG source | "We restarted the pod and the kill-switch was forgotten" |
| I2 operator-only disengage | Auto-disengage flapping during transient PG hiccup |
| I3 fail-safe on PG loss | Trading silently in the dark when PG goes down |
| I4 no graceful drain | Operator confusion about "did the switch cancel my orders?" — no, by design |
| I5 audit per flip | "Who killed the engine at 3am and why?" — answer in kill_switch_history |
| I6 latency < 1 s | "I clicked but trades kept happening for 10 s" |
| I7 boot honors PG | Pod restart in middle of engaged window starts trading again |
| I8 env-var engage-only | An operator setting CVN_KILL_SWITCH=disengaged to bypass another operator's engage decision silently |
9. Alternatives considered (full list rejected)
- Single in-memory flag, broadcast via Redis pub/sub : faster (~10 ms latency) but introduces a second source of truth ; if Redis loses the message during reconnect, state diverges. PG poll is slower but simpler and survives any infra blip.
- Filesystem-flag (e.g., /tmp/kill_switch) : doesn't survive pod restart, doesn't work across pods.
- K8s ConfigMap watch : per-namespace blast radius is fine but ConfigMap eventual consistency under load can take seconds, and there's no audit trail equivalent to kill_switch_history.
- Exchange-side risk controls only : doesn't cover paper-trading or the in-house signal generation ; also doesn't help when the bug is in our position-sizing code, not the exchange.
- Cancel all in-flight orders on engage : tempting but introduces a partial-state failure mode (some cancelled, some not) and adds exchange API dependency to the safety path. The operator handles this via the exchange UI until the separate flatten_all companion (EG-S06, see §7) ships.
10. Open questions for committee
The committee plan_review should explicitly answer :
- Storage : separate kill_switch_state table vs nesting under ftf_config.base_env.kill_switch.* ; is the dedicated table justified?
- Polling interval : 500 ms default. Too aggressive (PG load) or too slow (1 s p99 halt latency feels too long for a kill switch)?
- Fail-safe threshold : 3 consecutive failures (1.5 s). Should it be tighter (e.g., 1 failure = engage immediately)?
- Env-var asymmetry : engaged works but disengaged doesn't. Are there scenarios where this hurts (e.g., a botched PG state from a previous engage that you cannot un-stick because the Console is down)?
- Channel completeness : 6 channels listed. Missing a channel (e.g., a phone-app webhook, an exchange-side webhook)?
- Auto-engage triggers : drawdown + circuit breaker + connectivity loss are listed. Should "P1 alert from Grafana on expectancy_net < 0 over 24h" be an auto channel, or stay manual to avoid over-aggressive engagement?
- Disengage authorization : currently any operator can disengage if they have Console / CLI access. Should there be a 2-person rule for disengagement (analogous to nuclear-launch keys)?
- In-flight orders policy : the design explicitly does NOT cancel in-flight orders. Is "operator handles via exchange UI" acceptable, or should there be a follow-up Story for flatten_all?
11. Acceptance criteria (Story level)
- Design doc merged at documentation/design/CVN-N001-EF-S02-kill-switch-design.md
- ADR-71 file merged codifying I1-I8
- ADR index + CLAUDE.md ADR cluster updated
- Committee verdict ACCEPTED or ACCEPTED-WITH-CHANGES (≥ 8.0 avg, no blockers)
- Open questions §10 answered by committee or recorded as deferred-to-implementation
- Story OP wp#56 closed with PR + commit reference
- Sprint F1B-S0-Prereqs closeable (S01 + S02 both done)
- Live deployment gate : engine MUST NOT trade live until companion flatten_all Story (EG-S06) ships per committee session 4c388b4c BLOCKER. Paper trading is not gated.
12. References
- Need : CVN-N001 (F1 mission)
- Epic : CVN-N001-EF (#729), F1 mission operational prereqs
- Sibling Story : CVN-N001-EF-S01 (#709), MLOps readiness template (merged in PR #730)
- ADRs touched / built upon :
  - ADR-25 : No silent fallback (the basis of I3 fail-safe)
  - ADR-32, ADR-33 : log_event format + closed event catalogue (I5 audit)
  - ADR-39 : Runtime standalone, API façade (kernel boundary)
  - ADR-40 : Paper and live share the same kernel (one switch covers both)
  - ADR-56 : Pipeline change must be FTF-testable (kill-switch is NOT FTF-testable, intentionally ; it's a safety layer outside the A/B surface)
  - ADR-59 : All pipeline params in PostgreSQL (the data plane this design builds on, but with a dedicated table)
  - ADR-62 : Unified observability via OpenTelemetry (I5 + I6 spans / metrics)
  - ADR-68 : Committee = default review channel (this dossier)
  - ADR-69 : OpenProject orchestrator (Story discipline)
  - ADR-70 : MLOps readiness template (future EG Stories must fill it)
- Code surfaces referenced :
  - src/paper_trading/cvntrade_paper_trading_engine.py : kernel
  - src/paper_trading/safety/cvntrade_risk_manager.py : existing in-memory primitive
  - scripts/ftf_config_ui.py : Console UI host for the new button
  - infra/migrations/ : schema migrations land here
- Issues : #708 (this Story), #729 (Epic), #608 (Need), session 8db2529d (committee finding source)