ADR-0096: Every CPU-bound compute pod MUST cap ALL its parallelism (threads + processes) to its cgroup allocation
Status: active (operator-designed 2026-06-02 ; committee plan_review 4f0c4589 PASSED strong, 0 blocker, 5×9/9 — Meeting #242 ; process-parallelism vector incl.)
Context-of-record: RCA 2026-06-02-cvn-n001-ei-s04-s42-run-hang-rca.md (CVN-N001-EI-S04 diagnostic__s42 livelock).
Context
The diagnostic__s42 full run livelocked for ~16 h and produced zero fits (confirmed by Prometheus: ~3.2 cores/pod burnt, 0 progress, identical across all 5 pods). Root cause: no thread-pool cap. A CPU-bound process that does not cap its thread-pools spawns n_physical_cores threads (the node's, not the pod's); the cgroup throttles execution to limits.cpu; the OpenMP/BLAS barriers busy-wait and almost never clear within the throttled budget → livelock.
This is systemic and library-spanning: it hits LightGBM/XGBoost/CatBoost OpenMP and numpy/scipy/sklearn BLAS (OpenBLAS/MKL), each of which spawns its own threads (default = all cores). For numpy-heavy workloads (e.g. the s43 B=2000-resample bootstrap) BLAS threads are the dominant contributor, not the model's OpenMP threads, so a model-param-only cap (num_threads) is insufficient.
The same failure class (oversubscription → cgroup throttle → stall) recurs through process-level parallelism: sklearn n_jobs, joblib, and the parallelised bootstrap use loky (the default process backend), whose workers read os.cpu_count() = the node's cores (not the cgroup). n_jobs=-1 therefore starts node_cores worker processes, each of which re-reads the env caps and caps ITS threads to cgroup_limit, giving node_cores × cgroup_limit threads: an explosion worse than the original bug. The s43 B=2000 bootstrap is a natural joblib.Parallel(n_jobs=-1) candidate, so it would re-livelock through loky despite full thread-cap compliance.
So the ADR must cap all parallelism (thread AND process), not just thread-pools. An RCA fixes the incident; this ADR fixes the failure mode so it cannot recur through another library, another vector, or another diagnostic.
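To make the two oversubscription vectors concrete, the sketch below counts the compute threads that compete for one pod's CPU budget. The 16-core node and the 4-CPU limit are illustrative values only (they match the contention example used later in this ADR), and `runnable_threads` is a hypothetical helper, not part of any library.

```python
def runnable_threads(n_processes: int, threads_per_process: int) -> int:
    """Total compute threads competing for the pod's CPU budget."""
    return n_processes * threads_per_process


NODE_CORES = 16   # what os.cpu_count() reports inside the pod (illustrative)
CGROUP_LIMIT = 4  # limits.cpu of the pod (illustrative)

# Vector 1: one process, no thread cap -> pools sized to the node.
uncapped_threads = runnable_threads(1, NODE_CORES)

# Vector 2: thread caps in place, but n_jobs=-1 -> one worker per node core,
# each worker capping its own pools to the cgroup limit.
uncapped_processes = runnable_threads(NODE_CORES, CGROUP_LIMIT)

# Compliant: n_jobs=1 and a thread cap equal to the limit.
compliant = runnable_threads(1, CGROUP_LIMIT)

for label, n in [
    ("no thread cap", uncapped_threads),
    ("thread cap + n_jobs=-1", uncapped_processes),
    ("thread cap + n_jobs=1", compliant),
]:
    ratio = n / CGROUP_LIMIT
    print(f"{label}: {n} threads on a {CGROUP_LIMIT}-CPU budget ({ratio:.0f}x)")
```

With these numbers the process vector demands 64 threads against a 4-CPU budget, four times worse than the uncapped-thread case that caused the incident.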
Decision
Every pod that does CPU-bound compute (model fits or numpy-heavy numeric work — diagnostics s4x, the training/FTF harness) MUST cap all its parallelism — thread-pools AND process-pools — to its cgroup CPU allocation, env-level, before any Python import, enforced structurally (not per-diagnostic).
1. Env-level thread caps at the pod entrypoint (covers the whole process, OpenMP and BLAS): `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, `NUMEXPR_NUM_THREADS` and `LOKY_MAX_CPU_COUNT` are all set to the cgroup CPU limit.
2. `OMP_WAIT_POLICY=passive`: threads sleep at barriers instead of busy-waiting, so even a mis-set cap degrades to a visible slowdown, not an invisible livelock. This turns a 16 h catastrophe into "slow but progressing".
3. Fail-loud startup assertion (instance of ADR-25): if the caps are unset OR > cgroup limit, the pod refuses to start with `event=pod_thread_caps_unsafe severity=error`. No run with unbounded threads.
4. Belt-and-suspenders code-level: model-param `num_threads = cap` + optionally `threadpoolctl.threadpool_limits()` for already-imported libs. Secondary to (1), never primary.
5. Process-parallelism joint constraint (the second leg of the same failure class): `n_processes × threads_per_process ≤ cgroup_limit` MUST hold. Safe default = `n_jobs=1` (no process parallelism; the in-process thread cap uses the whole budget). If process parallelism is genuinely wanted, it MUST be paired so the product stays within the limit (e.g. `n_jobs=limit, thread_cap=1`). `LOKY_MAX_CPU_COUNT` (clause 1) is the structural backstop, but the code MUST NOT pass `n_jobs=-1` on a compute pod.
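The cgroup limit is read once via the Kubernetes Downward API (source of truth = the pod spec, cgroup-v1/v2 agnostic) and reaches the entrypoint as the CVN_POD_CPU_LIMIT environment variable (resourceFieldRef: limits.cpu).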
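A minimal sketch of how an entrypoint could derive every cap from that single value, assuming the pod spec already exposes the limit as an integer in `CVN_POD_CPU_LIMIT`. The ADR does not prescribe whether the entrypoint is a shell script or a launcher; the stdlib-only Python launcher, `build_cap_env` and `exec_with_caps` below are hypothetical illustrations.

```python
import os

# Variables named by this ADR; each receives the same cgroup-derived value.
CAP_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "LOKY_MAX_CPU_COUNT",
)


def build_cap_env(base_env):
    """Return a copy of base_env with every cap set from CVN_POD_CPU_LIMIT."""
    raw = base_env.get("CVN_POD_CPU_LIMIT")
    if not raw:
        # The limit is unknowable: refuse rather than guess.
        raise RuntimeError("cpu_limit_undeclared")
    limit = int(raw)
    if limit < 1:
        raise RuntimeError(f"invalid CVN_POD_CPU_LIMIT: {raw!r}")
    env = dict(base_env)
    for name in CAP_VARS:
        env[name] = str(limit)
    env["OMP_WAIT_POLICY"] = "passive"
    return env


def exec_with_caps(command):
    """Replace this process with `command`, run under the capped environment.

    Only the standard library has been imported at this point, so the numeric
    libraries loaded by `command` see the caps from their very first import.
    """
    os.execvpe(command[0], command, build_cap_env(os.environ))


# Hypothetical use as the image entrypoint:
#   exec_with_caps(["python", "-m", "my_diagnostic"])
```

Setting the variables in a launcher and then exec'ing the workload keeps the caps out of diagnostic code entirely, which is the structural property the decision asks for.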
Invariants
- Invariant 1: caps set from the cgroup limit, structurally + UNCONDITIONALLY. The shared entrypoint (base image / KPO template) sets `OMP/OPENBLAS/MKL/NUMEXPR_NUM_THREADS` + `LOKY_MAX_CPU_COUNT` + `OMP_WAIT_POLICY=passive` from `$CVN_POD_CPU_LIMIT` (Downward API) before any Python import, on every pod inheriting the entrypoint, with NO conditional flag. A 4-thread cap on an I/O-only pod is harmless (it never spawns more); a conditional "am I compute-bound?" flag would be a point-of-failure-by-omission (the rot this ADR rejects). The scope "compute pods" describes where it matters, not where it applies: it applies everywhere, it matters on compute pods.
- Invariant 2: fail-loud on unsafe OR unknowable caps. Caps unset, OR caps > cgroup limit, OR `CVN_POD_CPU_LIMIT` itself unset (the Downward-API env was not declared → the limit is unknowable) → pod refuses to start with `event=pod_thread_caps_unsafe severity=error reason=<caps_unset|caps_exceed_limit|cpu_limit_undeclared>` (ADR-25). "I don't know my limit" is as unsafe as "I exceed it".
- Invariant 3: testable, two levels (a unit test does not run in the pod → no runtime cgroup access). Unit asserts the caps are explicitly set from config/env (not None, not default, not hardcoded to the node core count) in the fit path; smoke/integration asserts the effective value ≤ the observed cgroup limit on a real pod. (Mirrors the S03/Q1.g `call_count==0` spy that breaks if re-violated.)
- Invariant 4: `OMP_WAIT_POLICY=passive` is always set on compute pods (degradation = slowdown, never livelock).
- Invariant 5: process parallelism bounded. `LOKY_MAX_CPU_COUNT = cgroup limit` is set; no compute-pod code passes `n_jobs=-1`; `n_jobs × thread_cap ≤ cgroup_limit`. Unit-testable: assert the fit/bootstrap path does not request `n_jobs=-1` and that the joint product is bounded.
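Invariants 2 and 5 could be checked along the following lines. This is a sketch under assumptions: the function names are hypothetical, the limit is assumed to be an integer string, and only the reason codes and the event line are taken from this ADR.

```python
import sys

CAP_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "LOKY_MAX_CPU_COUNT",
)


def unsafe_reason(env):
    """Return the Invariant 2 reason code, or None when the caps are safe."""
    raw_limit = env.get("CVN_POD_CPU_LIMIT")
    if not raw_limit:
        return "cpu_limit_undeclared"
    limit = int(raw_limit)
    caps = [env.get(name) for name in CAP_VARS]
    if any(not cap for cap in caps):  # missing or empty counts as unset
        return "caps_unset"
    if any(int(cap) > limit for cap in caps):
        return "caps_exceed_limit"
    return None


def joint_product_ok(n_jobs, thread_cap, limit):
    """Invariant 5: refuse n_jobs=-1 and any product above the limit."""
    if n_jobs < 1:  # -1 and the other negative joblib shorthands are refused
        return False
    return n_jobs * thread_cap <= limit


def assert_caps_or_exit(env):
    """Fail loud: emit the structured event and stop the pod from starting."""
    reason = unsafe_reason(env)
    if reason is not None:
        print(
            f"event=pod_thread_caps_unsafe severity=error reason={reason}",
            file=sys.stderr,
        )
        raise SystemExit(1)
```

Because `unsafe_reason` and `joint_product_ok` take plain values rather than reading the live cgroup, they fit the unit level of Invariant 3; the smoke level still needs a real pod.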
Alternatives rejected
- Model-param `num_threads` only: rejected. It caps LightGBM's OpenMP but not numpy/sklearn BLAS threads, which dominate numpy-heavy workloads (s43 bootstrap). Would leave the same livelock open via BLAS → near-identical 3rd RCA.
- Per-diagnostic thread-pinning: rejected. "Remember to pin in every probe" rots in six months (ADR-0093/0094 lesson). Must be structural at the pod entrypoint.
- A "pin your threads" guideline without enforcement: rejected. A slogan, not a control. The fail-loud assertion + the shared entrypoint are the load-bearing parts.
Consequences
- New compute diagnostics (s43+) inherit the protection with zero diagnostic-specific code; A3 of the RCA becomes "inherit the pod cap", not "pin in each probe".
- Residual node-level contention when `n_pods × cap > n_node_cores` (e.g. 5 × 4 = 20 > 16): a slowdown ~25%, not a livelock (cgroup guarantees the per-pod budget, intra-pod barriers clear). Acceptable for one-shot verdicts; use pod anti-affinity if industrialised. This trade-off is to be decided explicitly, not left implicit.
- Minor startup-assertion overhead.
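One way to read the ~25% figure is as the ratio of demanded threads to node cores. The sketch below assumes fully CPU-bound pods that share the node evenly, which is a simplification, and `contention_slowdown` is a hypothetical helper.

```python
def contention_slowdown(n_pods: int, cap: int, n_node_cores: int) -> float:
    """Fractional slowdown when capped pods together demand more than the node has."""
    demand = n_pods * cap
    return max(0.0, demand / n_node_cores - 1.0)


print(contention_slowdown(5, 4, 16))  # 0.25: 20 threads on 16 cores
print(contention_slowdown(3, 4, 16))  # 0.0: 12 threads fit on 16 cores
```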
Decisions (post-plan_review 4f0c4589 PASSED)
Where the shared entrypoint lives → base Docker image (committee reco #2 : maximum structurality — all pods inherit, nothing to wire per-DAG ; KPO template = fallback for image-overriding pods). cgroup-read = Downward API, scope = compute pods (unconditional apply), fail-loud incl. undeclared limit — settled above.
Committee non-blocking recommendations (implementation-time) : (1) wire threadpoolctl.threadpool_info() into the startup assertion or a CI gate (confirm the effective BLAS cap, not just the env var) ; (3) lint/guard against n_jobs=-1 on compute pods (enforce Invariant 5) ; (4) Prometheus panel "fits completed vs CPU burnt" to detect a residual-contention slowdown (ties RCA A5) ; (5) pod anti-affinity for industrialised workloads (ties Consequences).
Validation notes (one-shot, not runtime invariants)
- Fractional CPU limits → floor, not ceil: `resourceFieldRef: limits.cpu` (default divisor) rounds up (3.5 → 4) → a 4-thread cap on a 3.5-core budget slightly over-subscribes (slowdown under `passive`, not livelock; acceptable). For exactness, read millicores (`divisor: 1m` → 3500) and floor-divide by 1000 → 3. Either document the round-up as accepted, or assume integer limits (the usual case).
- Confirm the actual BLAS backend once: the four env vars cover the usual Linux ML stack, but the effective cap depends on which BLAS numpy is linked against, OpenBLAS (common) vs MKL (conda/Intel). Run `threadpoolctl.threadpool_info()` once at startup (or in CI) to confirm which BLAS responds and that the cap reaches its target; otherwise we cap `MKL_NUM_THREADS` while numpy is linked to OpenBLAS and the cap misses.
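Both validation notes can be sketched as small pure functions. The helper names are hypothetical, and the one-thread minimum for sub-core limits is an assumption this ADR does not state. `pool_info` is expected in the shape returned by `threadpoolctl.threadpool_info()`: a list of dicts carrying at least `user_api`, `internal_api` and `num_threads`.

```python
def floor_cpu_limit(millicores: int) -> int:
    """Floor a millicore limit (Downward API, divisor 1m) to whole threads.

    Assumption: a sub-core pod still gets one thread rather than zero.
    """
    return max(1, millicores // 1000)


def blas_backends(pool_info):
    """Names of the BLAS implementations that actually loaded."""
    return sorted({p["internal_api"] for p in pool_info if p["user_api"] == "blas"})


def pools_over_limit(pool_info, limit):
    """Thread pools whose effective size exceeds the limit."""
    return [p for p in pool_info if p["num_threads"] > limit]


# On a real pod (requires the third-party threadpoolctl package):
#   from threadpoolctl import threadpool_info
#   info = threadpool_info()
#   print(blas_backends(info), pools_over_limit(info, floor_cpu_limit(3500)))

print(floor_cpu_limit(3500))  # 3
```

Checking `num_threads` from the live pools, rather than the env vars, is what confirms that the cap reached the BLAS that numpy is actually linked against.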
Rollback
Revert the entrypoint cap + assertion ; pods fall back to unbounded threads (the pre-incident state). No data/logic impact (env + a startup check only).
References
- RCA: `reviews/2026-06-02-cvn-n001-ei-s04-s42-run-hang-rca.md` (action items A1-A6)
- Related ADRs: ADR-25 (fail-loud structured; the startup assertion is an instance), ADR-0093 (cluster dry-run gate; backstop: a single-cell smoke already surfaces the intra-pod cap violation), ADR-0094 (invariant-source design), ADR-0095 (diagnostic-story template; its A3/thread-cap invariant defers to this ADR), ADR-92 (DAG provenance).
- Family: ADR-0093 (process) · ADR-0094 (invariant source) · ADR-0095 (diagnostic docs) · ADR-0096 (compute resources) — all "systemic failure-mode prevention, not just incident".