ADR-0096 — Every CPU-bound compute pod MUST cap ALL its parallelism (threads + processes) to its cgroup allocation

Status: active (operator-designed 2026-06-02 ; committee plan_review 4f0c4589 PASSED strong, 0 blocker, 5×9/9 — Meeting #242 ; process-parallelism vector incl.)

Context-of-record: RCA 2026-06-02-cvn-n001-ei-s04-s42-run-hang-rca.md (CVN-N001-EI-S04 diagnostic__s42 livelock).

Context

The diagnostic__s42 full run livelocked for ~16 h producing zero fits (confirmed by Prometheus: ~3.2 cores/pod burnt, 0 progress, identical across all 5 pods). Root cause: no thread-pool cap. A CPU-bound process that does not cap its thread-pools spawns n_physical_cores threads (the node's, not the pod's); the cgroup throttles execution to limits.cpu; the OpenMP/BLAS barriers busy-wait and almost never clear within the throttled budget → livelock.

This is systemic and library-spanning: it hits LightGBM/XGBoost/CatBoost OpenMP and numpy/scipy/sklearn BLAS (OpenBLAS/MKL), which spawn their own threads (default = all cores). For numpy-heavy workloads (e.g. the s43 B=2000-resample bootstrap) BLAS threads are the dominant contributor, not the model OpenMP. So a model-param-only cap (num_threads) is insufficient.

The same failure class (oversubscription → cgroup throttle → stall) recurs through process-level parallelism: sklearn n_jobs, joblib, and the parallelised bootstrap use loky (default process backend), whose workers read os.cpu_count() = the node's cores (not the cgroup). n_jobs=-1 → node_cores worker processes, each re-reading the env caps and capping ITS threads to cgroup_limit → node_cores × cgroup_limit threads = an explosion worse than the original bug. The s43 B=2000 bootstrap is a natural joblib.Parallel(n_jobs=-1) candidate → it would re-livelock through loky despite full thread-cap compliance. So the ADR must cap all parallelism (thread AND process), not just thread-pools.

An RCA fixes the incident; this ADR fixes the failure mode so it cannot recur through another library, another vector, or another diagnostic.
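
The arithmetic behind the two legs of the failure class can be made concrete with a small sketch. The function name and the figures (a 16-core node, a 4-CPU pod limit) are illustrative assumptions, not values taken from the RCA.

```python
def oversubscription(n_processes: int, threads_per_process: int, cgroup_limit: int) -> float:
    """Ratio of runnable threads to the CPU budget the cgroup enforces."""
    return (n_processes * threads_per_process) / cgroup_limit


NODE_CORES = 16    # hypothetical node size
CGROUP_LIMIT = 4   # hypothetical limits.cpu of the pod

# Leg 1: one process, no thread cap. The library sizes its pool to the node.
uncapped_threads = oversubscription(1, NODE_CORES, CGROUP_LIMIT)

# Leg 2: thread caps are set, but n_jobs=-1 resolves to the node's core count
# and every worker caps its own pool to the cgroup limit.
uncapped_processes = oversubscription(NODE_CORES, CGROUP_LIMIT, CGROUP_LIMIT)

# Compliant: a single process whose thread cap equals the limit.
compliant = oversubscription(1, CGROUP_LIMIT, CGROUP_LIMIT)

print(uncapped_threads, uncapped_processes, compliant)
```

Under these assumed figures the process leg oversubscribes the budget four times more than the original thread leg, which is why a thread-only cap does not close the failure mode.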

Decision

Every pod that does CPU-bound compute (model fits or numpy-heavy numeric work — diagnostics s4x, the training/FTF harness) MUST cap all its parallelism — thread-pools AND process-pools — to its cgroup CPU allocation, env-level, before any Python import, enforced structurally (not per-diagnostic).

  1. Env-level thread caps at the pod entrypoint (covers the whole process — OpenMP and BLAS):
    OMP_NUM_THREADS        = <cgroup cpu limit>
    OPENBLAS_NUM_THREADS   = <cgroup cpu limit>
    MKL_NUM_THREADS        = <cgroup cpu limit>
    NUMEXPR_NUM_THREADS    = <cgroup cpu limit>
    LOKY_MAX_CPU_COUNT     = <cgroup cpu limit>   # caps loky/joblib process count to the cgroup, not the node
    OMP_WAIT_POLICY        = passive
    
  2. OMP_WAIT_POLICY=passive: threads sleep at barriers instead of busy-waiting → even a mis-set cap degrades to a visible slowdown, not an invisible livelock. Turns a 16 h catastrophe into "slow but progressing".
  3. Fail-loud startup assertion (instance of ADR-25): if the caps are unset OR > cgroup limit, the pod refuses to start → event=pod_thread_caps_unsafe severity=error. No run with unbounded threads.
  4. Belt-and-suspenders code-level: model-param num_threads = cap + optionally threadpoolctl.threadpool_limits() for already-imported libs. Secondary to (1), never primary.
  5. Process-parallelism joint constraint (the second leg of the same failure class): n_processes × threads_per_process ≤ cgroup_limit MUST hold. Safe default = n_jobs=1 (no process parallelism ; the in-process thread cap uses the whole budget). If process parallelism is genuinely wanted, it MUST be paired so the product stays within the limit (e.g. n_jobs=limit, thread_cap=1). LOKY_MAX_CPU_COUNT (clause 1) is the structural backstop, but the code MUST NOT pass n_jobs=-1 on a compute pod.
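
The joint constraint of clause 5 can be checked before any pool is created. The sketch below is one possible guard, not the project's implementation; `check_parallelism` and its error messages are hypothetical names introduced for illustration.

```python
def check_parallelism(n_jobs: int, thread_cap: int, cgroup_limit: int) -> None:
    """Raise ValueError unless n_jobs x thread_cap fits within cgroup_limit."""
    if n_jobs < 1:
        # joblib resolves negative n_jobs relative to the detected CPU count,
        # which is exactly the request clause 5 forbids on a compute pod.
        raise ValueError(f"n_jobs={n_jobs}: pass an explicit positive process count")
    if thread_cap < 1:
        raise ValueError(f"thread_cap={thread_cap}: must be at least 1")
    if n_jobs * thread_cap > cgroup_limit:
        raise ValueError(
            f"n_jobs x thread_cap = {n_jobs * thread_cap} exceeds cgroup limit {cgroup_limit}"
        )


# Safe default: no process parallelism, threads use the whole budget.
check_parallelism(n_jobs=1, thread_cap=4, cgroup_limit=4)

# Paired alternative: one single-threaded worker per allocated CPU.
check_parallelism(n_jobs=4, thread_cap=1, cgroup_limit=4)
```

Calling the guard with `n_jobs=-1`, or with `n_jobs=2, thread_cap=4` under a limit of 4, raises before any worker is spawned.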

The cgroup limit is read once via the Kubernetes Downward API (source of truth = the pod spec, cgroup-v1/v2 agnostic):

env:
  - name: CVN_POD_CPU_LIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
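
How an entrypoint might turn CVN_POD_CPU_LIMIT into the caps of clause 1 is sketched below. This is an assumption-laden illustration: the ADR places the entrypoint in the base image but does not say how it is written, and a shell script would serve equally well. `capped_env` and `launch` are hypothetical helpers. The launcher imports only the standard library and then replaces itself with the workload, so the caps exist before the workload's interpreter imports anything.

```python
import os
from typing import Mapping, Sequence

CAP_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "LOKY_MAX_CPU_COUNT",
)


def capped_env(base_env: Mapping[str, str]) -> dict:
    """Return a copy of base_env with every cap set from CVN_POD_CPU_LIMIT."""
    raw = base_env.get("CVN_POD_CPU_LIMIT")
    if not raw:
        raise RuntimeError("cpu_limit_undeclared")
    limit = int(raw)  # the default divisor yields whole cores
    if limit < 1:
        raise RuntimeError(f"invalid CPU limit: {raw}")
    env = dict(base_env)
    for name in CAP_VARS:
        env[name] = str(limit)  # unconditional: overwrite any inherited value
    env["OMP_WAIT_POLICY"] = "passive"
    return env


def launch(argv: Sequence[str]) -> None:
    """Replace this process with the workload, running under the capped env."""
    os.execvpe(argv[0], list(argv), capped_env(os.environ))
```

A call such as `launch(["python", "-m", "some_diagnostic"])` would start the workload under the caps; the module name here is a placeholder.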

Invariants

  • Invariant 1 — caps set from the cgroup limit, structurally + UNCONDITIONALLY: the shared entrypoint (base image / KPO template) sets OMP/OPENBLAS/MKL/NUMEXPR_NUM_THREADS + LOKY_MAX_CPU_COUNT + OMP_WAIT_POLICY=passive from $CVN_POD_CPU_LIMIT (Downward API) before any Python import, on every pod inheriting the entrypoint, with NO conditional flag. A 4-thread cap on an I/O-only pod is harmless (it never spawns more); a conditional "am I compute-bound?" flag would be a point-of-failure-by-omission (the rot this ADR rejects). The scope compute pods describes where it matters, not where it applies — it applies everywhere, it matters on compute pods.
  • Invariant 2 — fail-loud on unsafe OR unknowable caps: caps unset, OR caps > cgroup limit, OR CVN_POD_CPU_LIMIT itself unset (the Downward-API env was not declared → the limit is unknowable) → pod refuses to start with event=pod_thread_caps_unsafe severity=error reason=<caps_unset|caps_exceed_limit|cpu_limit_undeclared> (ADR-25). "I don't know my limit" is as unsafe as "I exceed it".
  • Invariant 3 — testable, two levels (a unit test does not run in the pod → no runtime cgroup access): unit asserts the caps are explicitly set from config/env (not None, not default, not hardcoded to the node core count) in the fit path ; smoke/integration asserts the effective value ≤ the observed cgroup limit on a real pod. (Mirrors the S03/Q1.g call_count==0 spy that breaks if re-violated.)
  • Invariant 4 — OMP_WAIT_POLICY=passive is always set on compute pods (degradation = slowdown, never livelock).
  • Invariant 5 — process parallelism bounded: LOKY_MAX_CPU_COUNT = cgroup limit is set ; no compute-pod code passes n_jobs=-1 ; n_jobs × thread_cap ≤ cgroup_limit. Unit-testable: assert the fit/bootstrap path does not request n_jobs=-1 and that the joint product is bounded.
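
Invariant 2 maps naturally onto a pure function that is unit-testable without a pod, which also serves the first level of Invariant 3. The sketch below is one possible shape; `unsafe_reason`, `assert_caps_or_exit` and the exit code are assumptions, while the event name and the three reason codes are the ones listed above. OMP_WAIT_POLICY is not checked here because the ADR defines no reason code for it.

```python
import sys
from typing import Mapping, Optional

CAP_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "LOKY_MAX_CPU_COUNT",
)


def unsafe_reason(env: Mapping[str, str]) -> Optional[str]:
    """Return the Invariant 2 reason code, or None when the caps are safe."""
    raw_limit = env.get("CVN_POD_CPU_LIMIT")
    if not raw_limit:
        return "cpu_limit_undeclared"
    limit = int(raw_limit)  # a non-integer value raises, which also blocks startup
    caps = [env.get(name) for name in CAP_VARS]
    if any(not cap for cap in caps):
        return "caps_unset"
    if any(int(cap) > limit for cap in caps):
        return "caps_exceed_limit"
    return None


def assert_caps_or_exit(env: Mapping[str, str]) -> None:
    """Log the structured event and refuse to start when the caps are unsafe."""
    reason = unsafe_reason(env)
    if reason is not None:
        print(
            f"event=pod_thread_caps_unsafe severity=error reason={reason}",
            file=sys.stderr,
        )
        sys.exit(1)  # assumed exit code
```

Keeping the decision in `unsafe_reason` and the side effects in `assert_caps_or_exit` lets a unit test cover every reason code with plain dictionaries.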

Alternatives rejected

  • Model-param num_threads only — rejected: caps LightGBM's OpenMP but not numpy/sklearn BLAS threads, which dominate numpy-heavy workloads (s43 bootstrap). Would leave the same livelock open via BLAS → near-identical 3rd RCA.
  • Per-diagnostic thread-pinning — rejected: "remember to pin in every probe" rots in six months (ADR-0093/0094 lesson). Must be structural at the pod entrypoint.
  • A "pin your threads" guideline without enforcement — rejected: a slogan, not a control. The fail-loud assertion + the shared entrypoint are the load-bearing parts.

Consequences

  • New compute diagnostics (s43+) inherit the protection with zero diagnostic-specific code; A3 of the RCA becomes "inherit the pod cap", not "pin in each probe".
  • Residual node-level contention when n_pods × cap > n_node_cores (e.g. 5 × 4 = 20 > 16): a slowdown of ~25%, not a livelock (the cgroup guarantees the per-pod budget, so intra-pod barriers clear). Acceptable for one-shot verdicts ; use pod anti-affinity if industrialised. This trade-off must be decided explicitly, not left implicit.
  • Minor startup-assertion overhead.

Decisions (post-plan_review 4f0c4589 PASSED)

Where the shared entrypoint lives → base Docker image (committee reco #2 : maximum structurality, all pods inherit, nothing to wire per-DAG ; KPO template = fallback for image-overriding pods). cgroup-read = Downward API, scope = compute pods (unconditional apply), fail-loud incl. undeclared limit: all settled above.

Committee non-blocking recommendations (implementation-time) : (1) wire threadpoolctl.threadpool_info() into the startup assertion or a CI gate (confirm the effective BLAS cap, not just the env var) ; (3) lint/guard against n_jobs=-1 on compute pods (enforce Invariant 5) ; (4) Prometheus panel "fits completed vs CPU burnt" to detect a residual-contention slowdown (ties RCA A5) ; (5) pod anti-affinity for industrialised workloads (ties Consequences).
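
Recommendation (1) could be approached as below. threadpoolctl.threadpool_info() returns a list of dictionaries, one per loaded thread pool, with keys that include user_api, internal_api and num_threads. The helpers `pools_over_limit` and `linked_blas` are hypothetical, and the sample data is hand-written to mimic that shape rather than captured from a real pod.

```python
def pools_over_limit(pool_info, limit):
    """Return (user_api, internal_api, num_threads) for each pool above limit."""
    return [
        (pool.get("user_api"), pool.get("internal_api"), pool["num_threads"])
        for pool in pool_info
        if pool["num_threads"] > limit
    ]


def linked_blas(pool_info):
    """Return the sorted names of the BLAS implementations that reported a pool."""
    return sorted(
        {pool["internal_api"] for pool in pool_info if pool.get("user_api") == "blas"}
    )


# Hand-written sample in the shape threadpool_info() produces.
sample = [
    {"user_api": "blas", "internal_api": "openblas", "num_threads": 4},
    {"user_api": "openmp", "internal_api": "openmp", "num_threads": 16},
]
offenders = pools_over_limit(sample, limit=4)
backends = linked_blas(sample)
```

On a real pod the input would come from `threadpoolctl.threadpool_info()`, called after numpy and the model libraries are imported, since only libraries already loaded in the process are reported.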

Validation notes (one-shot, not runtime invariants)

  • Fractional CPU limits → floor, not ceil: resourceFieldRef: limits.cpu (default divisor) rounds up (3.5 → 4) → a 4-thread cap on a 3.5-core budget slightly over-subscribes (slowdown under passive, not livelock — acceptable). For exactness, read millicores (divisor: 1m → 3500) and floor-divide by 1000 → 3. Either document the round-up as accepted, or assume integer limits (the usual case).
  • Confirm the actual BLAS backend once: the four env vars cover the usual Linux ML stack, but the effective cap depends on which BLAS numpy is linked against — OpenBLAS (common) vs MKL (conda/Intel). Run threadpoolctl.threadpool_info() once at startup (or in CI) to confirm which BLAS responds and that the cap reaches its target — else we cap MKL_NUM_THREADS while numpy is linked to OpenBLAS and the cap misses.
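
The floor-versus-ceil point for fractional limits reduces to one integer division. The sketch assumes the limit arrives as a millicore string (divisor: 1m); `thread_cap_from_millicores` is a hypothetical helper, and the minimum of 1 thread for sub-core limits is an added assumption that the ADR does not specify.

```python
def thread_cap_from_millicores(millicores: int) -> int:
    """Floor a millicore limit to whole threads, never returning less than 1."""
    return max(1, millicores // 1000)


def rounded_up_cores(millicores: int) -> int:
    """Whole-core value obtained when the limit is rounded up instead."""
    return -(-millicores // 1000)  # integer ceiling division


limit_millicores = int("3500")  # value as it would arrive with divisor: 1m
floor_cap = thread_cap_from_millicores(limit_millicores)
ceil_cap = rounded_up_cores(limit_millicores)
```

For a 3.5-core limit the floored cap is 3 while the rounded-up value is 4, which is the slight over-subscription described in the first note.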

Rollback

Revert the entrypoint cap + assertion ; pods fall back to unbounded threads (the pre-incident state). No data/logic impact (env + a startup check only).

References

  • RCA: reviews/2026-06-02-cvn-n001-ei-s04-s42-run-hang-rca.md (action items A1-A6)
  • Related ADRs: ADR-25 (fail-loud structured — the startup assertion is an instance), ADR-0093 (cluster dry-run gate — backstop: a single-cell smoke already surfaces the intra-pod cap violation), ADR-0094 (invariant-source design), ADR-0095 (diagnostic-story template — its A3/thread-cap invariant defers to this ADR), ADR-92 (DAG provenance).
  • Family: ADR-0093 (process) · ADR-0094 (invariant source) · ADR-0095 (diagnostic docs) · ADR-0096 (compute resources) — all "systemic failure-mode prevention, not just incident".