ADR-0071 — Trading kill-switch invariants (single PG source, operator-only disengage, fail-safe)

Status: active
Date: 2026-04-28
Introduced by: CVN-N001-EF-S02 (#708), F1_buy boost committee session 8db2529d
Supersedes: none


Context

The F1_buy boost plan committee (session 8db2529d, Ops persona) flagged the absence of an explicit, multi-channel, auditable kill-switch as a blocker for any new ML track entering production. Today the project has only a per-process in-memory _killed flag on the RiskManager (src/paper_trading/safety/cvntrade_risk_manager.py:201), activated by a single Airflow DAG. Each paper or live engine instance holds its own copy of that flag, and a pod restart loses the engaged state. There is no Console button, no env-var safety, no Grafana auto-trigger, and no audit table beyond Loki logs.

The full design lives in documentation/design/CVN-N001-EF-S02-kill-switch-design.md. This ADR codifies the binding invariants — the rules that any future implementation Story (Epic CVN-N001-EG) MUST satisfy.

Decision

The trading kill-switch is governed by a single PostgreSQL row with operator-only disengagement, fail-safe semantics on connectivity loss, and full audit trail. Every trading decision in every engine instance reads its state from PostgreSQL (no per-process cache surviving > 500 ms).

The state lives in a dedicated kill_switch_state table (single-row, optimistic-concurrency). Every transition writes an immutable row in kill_switch_history. Activation channels are : Console UI button, CLI, K8s env var (boot-time only), Grafana alert webhook, Airflow DAG, system auto-engage (drawdown / circuit breaker / PG unreachable). Disengagement channels are restricted to Console UI, CLI, Airflow DAG.
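As a rough illustration of the single-row, optimistic-concurrency pattern and of the asymmetric channel lists, the sketch below models both tables in memory. It is not the design's schema: the version counter, the channel labels, StaleVersionError and the transition() signature are assumptions made for this example, and a real implementation would perform the check and both writes in one PostgreSQL transaction. The fail-safe auto-clear exception of I2 is deliberately not modelled here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Hypothetical channel labels for the channels listed in the Decision section.
ENGAGE_CHANNELS = {"console", "cli", "env_var", "grafana_webhook", "airflow_dag", "system"}
DISENGAGE_CHANNELS = {"console", "cli", "airflow_dag"}


class StaleVersionError(Exception):
    """Raised when another writer changed the state row first (assumed name)."""


@dataclass
class KillSwitchStore:
    """In-memory stand-in for kill_switch_state (one row) and kill_switch_history."""

    engaged: bool = False
    engaged_by: Optional[str] = None
    version: int = 0  # assumed optimistic-concurrency counter
    history: List[Dict[str, object]] = field(default_factory=list)

    def transition(self, *, engage: bool, actor: str, reason: str,
                   channel: str, expected_version: int) -> int:
        # Channel asymmetry: more channels may engage than may disengage.
        allowed = ENGAGE_CHANNELS if engage else DISENGAGE_CHANNELS
        if channel not in allowed:
            verb = "engage" if engage else "disengage"
            raise PermissionError(f"channel {channel!r} may not {verb}")
        # Optimistic concurrency: reject the write if the row moved on.
        if expected_version != self.version:
            raise StaleVersionError(
                f"expected version {expected_version}, found {self.version}"
            )
        self.engaged = engage
        self.engaged_by = actor if engage else None
        self.version += 1
        # History rows are only ever appended, never updated.
        self.history.append({
            "transition": "engage" if engage else "disengage",
            "actor": actor,
            "reason": reason,
            "channel": channel,
            "occurred_at": datetime.now(timezone.utc),
        })
        return self.version
```

A caller reads the current version, attempts the transition, and on StaleVersionError re-reads before deciding whether the write is still needed.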

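Engaging the switch halts new BUY signals and new allocations only. Exit checks (TP/SL/timeout) continue to run. In-flight orders are NOT cancelled by the switch itself ; closing or cancelling them on demand is the job of the companion flatten_all Story that I4 requires before live trading.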
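The halt scope described above can be sketched as a pre-trade gate. This is a minimal illustration under assumed names: OrderIntent, HALTED_INTENTS and is_allowed() are hypothetical and do not come from the kernel code.

```python
from enum import Enum


class OrderIntent(Enum):
    """Hypothetical classification of what the kernel is about to do."""

    BUY = "buy"
    ALLOCATION = "allocation"
    EXIT_TAKE_PROFIT = "exit_take_profit"
    EXIT_STOP_LOSS = "exit_stop_loss"
    EXIT_TIMEOUT = "exit_timeout"


# Only entries are halted; exits keep running so positions stay protected.
HALTED_INTENTS = frozenset({OrderIntent.BUY, OrderIntent.ALLOCATION})


def is_allowed(intent: OrderIntent, engaged: bool) -> bool:
    """Return True when the action may proceed under the given switch state."""
    return not (engaged and intent in HALTED_INTENTS)
```

Keeping the halted set explicit and small makes the scope reviewable at a glance: anything not listed in it is unaffected by an engagement.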

Invariants

These rules MUST hold forever (or until this ADR is superseded). Each is verifiable by reading code, schema, or runtime traces.

  • I1 — Single source of truth in PostgreSQL : the only authoritative kill-switch state is the kill_switch_state row. No code path may set a process-local killed flag without the value originating from a PG read. The existing RiskManager._killed attribute remains as a defense-in-depth check, but is mirrored from the KillSwitchClient cache, not set independently.
  • I2 — Operator-only disengagement : auto channels (drawdown, circuit breaker, PG unreachable, Grafana webhook) MAY engage but MUST NOT disengage. Disengagement requires a human actor named in engaged_by and kill_switch_history.actor. Allowed disengage channels : Console UI, CLI, Airflow DAG. The fail-safe system:pg_unreachable engagement is the one exception ; it auto-clears when PG returns, with an audit row recording the auto-clear.
  • I3 — Fail-safe on connectivity loss : if PG is unreachable for > 3 consecutive polls (default 1.5 s wall-clock), the in-memory cached state flips to engaged=True, engaged_by="system:pg_unreachable". A CRITICAL log fires per ADR-25 (no silent fallback). When PG returns, a deferred history row records the fail-safe and its clear.
  • I4 — No graceful drain ; companion flatten_all Story is mandatory before live trading : engaging the switch does NOT cancel in-flight orders by itself. It halts new BUY signals + new allocations only. Exit checks (TP/SL/timeout) continue to run on existing positions because halting them would leave positions exposed without stop-loss enforcement. Live deployment gate: any deployment of this engine to live trading is BLOCKED until a companion flatten_all Story (cancels open positions on demand, separate channel from kill-switch) ships and is wired to the Console / CLI. Paper trading is not gated by this requirement. Per committee session 4c388b4c (PASSED / EXECUTION_RISK), this gate replaces the original "operator handles via exchange UI" approach.
  • I5 — Audit per flip : every transition (engage / disengage / fail-safe-engage / fail-safe-clear) writes a row in kill_switch_history AND emits a structured log per ADR-32/33 (event=kill_switch_<transition>) AND emits an OpenTelemetry span per ADR-62 (kill_switch.<transition>). Loki + the history table are redundant by design — Loki is queryable, the table is the system of record.
  • I6 — Halt latency < 1 second : from the successful engagement transaction to the next pre-trade reject, total latency is bounded by polling_interval_s (default 500 ms) + one PG round-trip. The implementation MUST emit a metric cvntrade_kill_switch_halt_latency_seconds and the value MUST be alerted at p99 > 1 s.
  • I7 — Boot honors PG state : on engine boot, the kernel reads kill_switch_state once synchronously BEFORE accepting any trade. Only then does it start the polling thread. A True state at boot means the engine starts in halted mode and sends no new BUY order or allocation until the operator disengages at runtime. This is what makes the switch survive pod restarts.
  • I8 — Env-var bypass is engage-only : CVN_KILL_SWITCH=engaged env var (set in K8s pod spec) forces engaged at boot regardless of PG state. There is NO CVN_KILL_SWITCH=disengaged value. Disengagement always goes through PG ; otherwise an env-var override would silently undo an operator's engage decision and create an unaudited bypass.
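The fail-safe behaviour of I3, together with the auto-clear exception in I2, can be sketched as a small polling state machine. Everything here is an assumption made for illustration: the KillSwitchCache name and fields, the injected fetch callable, the use of ConnectionError to signal an unreachable PG, and the "system:boot" placeholder actor. The sketch trips on the third consecutive failed poll (3 × 500 ms = 1.5 s at the default interval), which is one reading of the threshold stated in I3. Flushing the deferred audit entries to kill_switch_history is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

FAIL_SAFE_ACTOR = "system:pg_unreachable"


@dataclass
class KillSwitchCache:
    """Per-process cache refreshed by one poll_once() call per polling interval."""

    # Hypothetical reader: returns (engaged, engaged_by) from the state row,
    # or raises ConnectionError when PostgreSQL cannot be reached.
    fetch: Callable[[], Tuple[bool, str]]
    max_failures: int = 3
    # Assumption: treat the engine as halted until the first successful read.
    engaged: bool = True
    engaged_by: str = "system:boot"
    failures: int = 0
    fail_safe_active: bool = False
    # Audit events that could not be written while PG was down.
    deferred_audit: List[str] = field(default_factory=list)

    def poll_once(self) -> None:
        try:
            engaged, engaged_by = self.fetch()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures and not self.fail_safe_active:
                # Fail-safe: flip the cached state to engaged.
                self.fail_safe_active = True
                self.engaged = True
                self.engaged_by = FAIL_SAFE_ACTOR
                self.deferred_audit.append("fail_safe_engage")
            return
        # PG answered: reset the failure counter.
        self.failures = 0
        if self.fail_safe_active:
            # The only auto-clear allowed: the fail-safe lifts when PG returns.
            self.fail_safe_active = False
            self.deferred_audit.append("fail_safe_clear")
        # The PG row is authoritative; an operator engagement stays engaged.
        self.engaged = engaged
        self.engaged_by = engaged_by
```

Because the cache always adopts the PG row after a successful read, clearing the fail-safe cannot undo an operator engagement: if the row says engaged, the engine stays halted.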

Alternatives rejected

  • In-memory flag broadcast via Redis pub/sub : faster (~10 ms latency vs ~500 ms PG poll) but introduces a second source of truth. If Redis loses a message during reconnect, state diverges across pods. The PG-poll latency is acceptable given the < 1 s SLI ; the simpler invariant ("only PG matters") is worth the 490 ms cost.
  • Filesystem flag (e.g., /tmp/kill_switch) : doesn't survive pod restart, doesn't propagate across pods, no audit trail. Considered for prototyping speed only ; rejected on correctness.
  • Kubernetes ConfigMap watch : per-namespace blast radius is fine but ConfigMap eventual consistency under load can take seconds, and writing to a ConfigMap requires K8s RBAC for every channel (Console, CLI, webhook), increasing the surface. PG already has a credential surface (psycopg2 connection string) and write-permission model.
  • Exchange-side risk controls only : doesn't cover paper-trading or in-house signal generation. Also doesn't help when the bug is in our position-sizing code, not the exchange. Exchange controls are complementary, not a substitute.
  • Cancel all in-flight orders on engage : tempting but introduces a partial-state failure mode (some cancelled, some not, some failed-to-cancel) and adds an exchange API dependency to the safety path. The safety path must be as simple and as local as possible. A separate flatten_all Story (out of scope here) handles this when needed.
  • Auto-disengage when underlying trigger clears : creates flapping during transient incidents. The cost of one human disengagement is bounded ; the cost of an auto-flap during a partial outage is unbounded.

Consequences

  • Positive : a single psql query (UPDATE kill_switch_state SET engaged=TRUE WHERE id=1) halts every paper / live engine in the cluster in under 1 s. No special tooling needed in an emergency at 3 am with a flaky laptop.
  • Positive : pod restarts respect prior engagement (I7) — no more "we restarted to fix a memory leak and accidentally resumed trading".
  • Positive : audit trail is queryable (SELECT * FROM kill_switch_history WHERE occurred_at > now() - interval '24h' ORDER BY occurred_at DESC) — answers "who engaged what, when, and why" without grep'ing logs.
  • Positive : fail-safe (I3) means a PG outage halts trading rather than letting it run unobserved. The trade-off is false-positive halts during transient PG hiccups ; the alternative (silent unobserved trading) is worse.
  • Negative : every trading decision involves a (cached) PG-derived check. The 500 ms polling is tunable but introduces baseline I/O.
  • Negative : the env-var asymmetry (I8) is non-obvious ; an operator may waste minutes trying CVN_KILL_SWITCH=disengaged before realising it doesn't exist. Mitigated by documenting prominently in the design + CLAUDE.md + a friendly error message in the engine boot path.
  • Negative : adding a new channel (e.g., a phone-app webhook) requires implementing the same engage_kill_switch(actor, reason, channel) helper and an audit-table channel name. Friction is intentional — every channel is a potential bypass.
  • Neutral : the in-flight-orders policy (I4) means an engagement during open positions leaves them subject to TP/SL only. The operator must consciously close them via the exchange UI or, once the flatten_all Story ships, trigger flatten_all. Documented prominently.
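The env-var asymmetry (I8) and the friendly boot-path error mentioned above could look roughly like the sketch below. The function name resolve_boot_state, its parameters and the "env:CVN_KILL_SWITCH" actor label are hypothetical. Raising on an unsupported value is one possible choice; an implementation might instead log the message and fall back to the PG state.

```python
from typing import Mapping, Tuple

ENV_VAR = "CVN_KILL_SWITCH"


def resolve_boot_state(env: Mapping[str, str], pg_engaged: bool,
                       pg_engaged_by: str) -> Tuple[bool, str]:
    """Combine the env var with the state already read from PG at boot.

    The env var can only force engagement; it can never disengage.
    """
    raw = env.get(ENV_VAR)
    if not raw:
        # Variable unset or empty: the PG row alone decides.
        return pg_engaged, pg_engaged_by
    if raw == "engaged":
        # Forced halt regardless of what PG says (hypothetical actor label).
        return True, "env:CVN_KILL_SWITCH"
    # Any other value, including "disengaged", is rejected with guidance.
    raise ValueError(
        f"{ENV_VAR}={raw!r} is not supported: the only accepted value is "
        "'engaged'. Disengagement always goes through PostgreSQL "
        "(Console UI, CLI or Airflow DAG)."
    )
```

At boot, the engine would call it with os.environ and the row obtained from the synchronous read required by I7, for example resolve_boot_state(os.environ, row_engaged, row_engaged_by).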

Rollback

This ADR is process + schema. Rollback = drop the kill_switch_state and kill_switch_history tables, revert the KillSwitchClient integration in the kernel, restore the RiskManager._killed-only path. The Airflow DAG continues to work since it currently calls RiskManager.activate_kill_switch() directly.

If any invariant proves systematically wrong (e.g., I3 fail-safe causes > 5 % spurious halts per month over a 3-month sample), revisit that specific invariant via amendment ; a full rollback is not expected.

References

  • ADR-25 — No silent fallback (basis of I3)
  • ADR-32, ADR-33 — log_event format + closed event catalogue (I5)
  • ADR-39 — Runtime standalone, API façade (kernel boundary)
  • ADR-40 — Paper and live share the same kernel (one switch covers both)
  • ADR-56 — Pipeline change must be FTF-testable (kill-switch is intentionally NOT FTF-testable — safety layer outside A/B surface)
  • ADR-59 — All pipeline params in PostgreSQL (data plane pattern this design extends)
  • ADR-62 — Unified observability via OpenTelemetry (I5, I6)
  • ADR-68 — Expert Committee = default review channel (the plan_review that approved this ADR)
  • ADR-69 — OpenProject orchestrator (Story-pull discipline this Story follows)
  • ADR-70 — MLOps readiness template (future EG implementation Stories must fill it ; per committee reco 7, future ADR-70 amendment may add kill-switch-specific checks: latency SLI validation, audit replay tests, fail-safe simulations, channel redundancy tests)
  • Design : documentation/design/CVN-N001-EF-S02-kill-switch-design.md
  • Code surfaces : src/paper_trading/cvntrade_paper_trading_engine.py, src/paper_trading/safety/cvntrade_risk_manager.py
  • Future Epic : CVN-N001-EG (kill-switch implementation, 7 Stories outlined in design §7 — including mandatory flatten_all for live gate per committee session 4c388b4c and chaos engineering for I3/I7 validation)
  • Issues : #708 (this Story), #729 (Epic CVN-N001-EF), #608 (Need CVN-N001)
  • Committee sessions : 8db2529d (Ops finding source), 4c388b4c (this ADR's plan_review, PASSED / EXECUTION_RISK avg ~8.3, 1 blocker addressed via I4 amendment + 9 recommendations applied to design §6/§7)