Plan dossier — CVN-N011-EA-S10 : gRPC fork deadlock blocks every cleanlab FTF sweep

Date : 2026-04-29
Story : CVN-N011-EA-S10 (OP wp#91), Pipeline Contract Hardening Epic
GH issue : #774
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P1, blocks every Track 5 cleanlab gate decision (Track 6 unaffected per the diagnostic refinement below)

Review history

  • v1 (committee 7bf612b7) — PASSED OK, consensus strong (5/5 experts), median score 8.0/10. 0 blockers. 11 recommendations triaged below in §11.

1. Context

On 2026-04-29 the operator triggered the cleanlab FTF sweep twice — once before PR #769 (CVN-N011-EA-S08 class-aware cap) was merged, and once after the image redeploy at 09:15 UTC. Both runs deadlocked silently after ~4 successful HPO trials per pod.

Hours later, the operator launched a Track 6 focal-loss sweep on the same defi_top5 group, same HPO + MLflow + Optuna stack — and that sweep ran cleanly with 0 deadlocks observed across 5 pods × 50 trials. This refines the original Story hypothesis (the deadlock is environmental on the whole HPO+MLflow+Optuna combo) into a much narrower surface : the deadlock requires the cleanlab path.

2. Updated diagnostic

2.1 What we observed (cleanlab sweep, 2026-04-29 09:46-10:10 UTC)

5 pods, all stuck identically by 09:48 UTC (last log lines between 09:46:15 and 09:47:25) :

Pod | Crypto | Last log (UTC) | Last meaningful event
4taz4os7 | OPUSDC | 09:47:25 | cleanlab_cv_probs (HPO trial ~6)
7wsg5tvp | ARBUSDC | 09:46:51 | cleanlab_cv_probs (HPO trial ~3)
fzkykxa7 | AAVEUSDC | 09:46:56 | cleanlab_cv_probs (HPO trial ~3)
p3mltmkj | UNIUSDC | 09:46:42 | cleanlab_cv_probs (HPO trial ~2)
u6kmwt1l | LDOUSDC | 09:46:15 | fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers (then silence)

LDOUSDC is the smoking gun — explicit gRPC fork warning right before silence. The other 4 hung at the same point but their last log was a successful pipeline iteration so the warning was buried in stderr.

The per-class cap from PR #769 is working correctly in every successful trial : buy_drop_pct=4.99 ≤ cap_pct=5.0. The deadlock is unrelated to S08.

2.2 What we observed (focal_loss sweep, 2026-04-29 11:08-11:30 UTC)

5 pods, same defi_top5 group, same HPO + MLflow + Optuna stack, 0 fork warnings, all trials progressing at ~4.5 sec/trial, healthy hpo_heartbeat events every ~25 sec. Sweep was killed by an unrelated bug (sympy missing — see CVN-N011-EA-S11 post-mortem and PR #775 hotfix), but the runtime stack itself was clearly not deadlocking.

2.3 Refined root cause hypothesis

The deadlock is triggered by cleanlab.filter.find_label_issues which spawns a multiprocessing.Pool directly (verified at cleanlab/filter.py:11) — not via joblib. cleanlab defaults to n_jobs=cpu_count() and uses the OS default start method (fork on Linux + Python<3.14, see filter.py:359). The fork happens inside the parent process where the MLflow client gRPC threads (started by mlflow.autolog) are alive. The forked child inherits a half-locked gRPC state and hangs forever on the first gRPC call (which the MLflow client tries to make at trial completion).

This is the canonical "fork after threads" anti-pattern documented in gRPC's fork support page and Python multiprocessing docs.

Why focal_loss didn't trigger : the focal-loss label_pipeline runs with cleanlab_mode='off', so find_label_issues is never called, so no multiprocessing.Pool fork inside the trial body.

2.4 Why the deadlock survives across pods

Once a pod's main process is in a deadlocked state, the pod reports Running to K8s (the process is alive, just hung), passes the liveness probe (no probe defined for this workload), and consumes RAM until the cluster auto-scaler or the operator kills it. There is no automatic recovery.

3. Hypotheses

Hypothesis 1 — MULTIPROCESSING_START_METHOD=spawn (system-wide)

Force every multiprocessing.Process and joblib worker to use spawn instead of fork. The child re-imports the module fresh, no inherited threads, no inherited gRPC state.

  • Cost : add import multiprocessing; multiprocessing.set_start_method('spawn', force=True) near src/commun/finetune/__init__.py (or in the pod entrypoint). Plus joblib config: os.environ['LOKY_START_METHOD'] = 'spawn'.
  • Coverage : universal — fixes the fork-after-threads anti-pattern for all subprocess spawning, not just cleanlab. Future Stories that introduce other fork users get the fix for free.
  • Risk : 1-2 sec startup overhead per joblib worker (re-import). Pickle-ability of every callback passed to workers (most are fine ; custom closures in lambda form will break). Need to audit src/training/XGBoost/xgboost_hpo.py and src/commun/finetune/ablation_runner.py for non-picklable callbacks.
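The pickling audit mentioned in the Risk bullet could start from a helper like the one below. This is a sketch under assumptions : is_picklable and find_unpicklable are hypothetical names, and the check uses the standard-library pickle, which is what multiprocessing relies on under spawn. joblib's loky backend serializes with cloudpickle, which accepts more callables, so the check is conservative for that path.

```python
import pickle
from typing import Any, Dict, List


def is_picklable(obj: Any) -> bool:
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # Lambdas and locally defined closures fail here because pickle stores
        # functions by qualified name and cannot look these names up again.
        return False
    return True


def find_unpicklable(callbacks: Dict[str, Any]) -> List[str]:
    """Return the names of the callbacks that would break under spawn."""
    return sorted(name for name, cb in callbacks.items() if not is_picklable(cb))
```

For example, find_unpicklable({"builtin": len, "inline": lambda x: x}) reports only "inline", which matches the expectation that closures in lambda form are the ones at risk.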

Hypothesis 2 — GRPC_ENABLE_FORK_SUPPORT=1 (gRPC-native)

gRPC has a documented (opt-in) fork support mode that drains and re-initializes the channel on fork. Setting GRPC_ENABLE_FORK_SUPPORT=1 and GRPC_POLL_STRATEGY=poll env vars before importing gRPC tells the library to handle forks safely.

  • Cost : 2 env vars in the pod spec (Helm chart). No code change.
  • Coverage : every gRPC-using library (MLflow, OpenTelemetry, Tempo, Loki client) gets fork-safety automatically.
  • Risk : performance hit on the gRPC channel (latency spike post-fork as the channel re-establishes). MLflow logging frequency is low (per-trial), so the latency is amortized. The poll strategy is slightly less efficient than epoll on heavy gRPC traffic, but our use is light.
  • Caveat : the env vars must be set before Python imports gRPC for the first time. In K8s that's straightforward via env: ; locally it requires a wrapper (or conftest.py for tests).
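For the local wrapper mentioned in the Caveat, one possible shape is sketched below. The helper name apply_grpc_fork_env is hypothetical ; the only facts taken from this dossier are the two variable names and the requirement that they are set before gRPC is first imported. The helper refuses to act if grpc is already loaded, since setting the variables at that point would give a false sense of safety.

```python
import os
import sys
from typing import Mapping, MutableMapping, Optional

GRPC_FORK_ENV = {
    "GRPC_ENABLE_FORK_SUPPORT": "1",
    "GRPC_POLL_STRATEGY": "poll",
}


def apply_grpc_fork_env(
    environ: Optional[MutableMapping[str, str]] = None,
    loaded_modules: Optional[Mapping[str, object]] = None,
) -> bool:
    """Set the gRPC fork-support variables; return False if grpc is already imported."""
    environ = os.environ if environ is None else environ
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules

    if "grpc" in loaded_modules:
        # Too late: the library has already read its configuration.
        return False

    for key, value in GRPC_FORK_ENV.items():
        # setdefault keeps any value the operator or the pod spec already provided.
        environ.setdefault(key, value)
    return True
```

Called at the very top of a local entrypoint or conftest.py, before any import that pulls in MLflow, this mirrors what the K8s env: block does for pods.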

Hypothesis 3 — Disable MLflow autolog inside Optuna trials (most invasive)

The MLflow client gRPC threads are spawned by mlflow.autolog. Wrap each Optuna trial in mlflow.autolog(disable=True) for the trial duration ; log results explicitly post-trial.

  • Cost : ~30 lines in src/training/XGBoost/xgboost_hpo.py to wrap the objective passed to Optuna's study.optimize(objective, ...).
  • Coverage : narrow — only fixes the MLflow / gRPC / fork triangle. If another fork-unsafe library (e.g. OpenTelemetry exporter) ever lands in the trial body, the deadlock returns.
  • Risk : we lose the per-trial autolog telemetry inside MLflow (acceptable — we already persist trial metrics in finetune_results). Slightly invasive code change in HPO entry point.

Hypothesis 4 (NEW, considered post-diagnostic) — Force n_jobs=1 on the cleanlab call

cleanlab.filter.find_label_issues is the fork trigger and accepts an n_jobs parameter. With n_jobs=1 cleanlab takes the in-process branch (use_global_vars=True per filter.py:365) — no multiprocessing.Pool, no fork.

  • Cost : 1-argument change in src/training/labels/label_pipeline.py::suspect_mask — pass n_jobs=1 to find_label_issues(...).
  • Coverage : narrow — only fixes the cleanlab path (which is what the diagnostic narrows the bug to). Other fork users would still deadlock.
  • Risk : single-process execution gives up multi-core parallelism inside find_label_issues. For cleanlab's typical input size in our pipeline (~3-30k samples per fold) the call takes ~1-2 sec at n_jobs=cpu_count() and is expected to land in the 2-5 sec range at n_jobs=1 (cleanlab itself notes that n_jobs=1 is preferable for many input sizes, see filter.py:262-266). A benchmark is needed to confirm there is no order-of-magnitude regression.
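The benchmark called for in the H4 Risk bullet can be as small as the sketch below. The helper names wall_time and within_slowdown_budget are hypothetical, and the 2.0 ratio is the threshold quoted in §9 and §11 rather than a measured value. Both callables are expected to wrap the same find_label_issues input, once with the multi-process default and once with n_jobs=1, in a process without live gRPC threads.

```python
import time
from typing import Callable


def wall_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Best-of-N wall-clock duration of fn, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def within_slowdown_budget(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    max_ratio: float = 2.0,
    repeats: int = 3,
) -> bool:
    """True if candidate takes at most max_ratio times as long as baseline."""
    return wall_time(candidate, repeats) <= max_ratio * wall_time(baseline, repeats)
```

Taking the best of several repeats rather than the mean reduces the influence of noisy CI runners, which matters when the assertion gates a merge.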

4. Recommendation

Adopt Hypothesis 4 (single-process cleanlab : pass n_jobs=1 to find_label_issues) as the primary fix, plus Hypothesis 2 (GRPC_ENABLE_FORK_SUPPORT=1 env vars) as a defence-in-depth backstop.

Implementation note (post plan_review) : initial drafting assumed cleanlab uses joblib (loky) for parallelism. Inspection of cleanlab.filter source showed cleanlab uses native multiprocessing.Pool, not joblib — so a joblib.parallel_backend('threading') wrapper would be a no-op. The right primitive is n_jobs=1 (in-process, no Pool spawned at all). The intent (eliminate fork from cleanlab path) is unchanged, only the mechanism is corrected.

Rationale :

  1. Diagnostic narrowing matters : the focal_loss success proves the deadlock is cleanlab-specific. H4 targets the actual trigger surface (cleanlab's internal multiprocessing.Pool) without touching the rest of the platform.
  2. H4 is the smallest blast radius : a 1-argument change (n_jobs=1) in the call site within label_pipeline.suspect_mask, rolled back trivially if benchmarks show a slowdown.
  3. H2 as backstop : even if H4 unblocks the immediate cleanlab path, the next fork-using library to land would re-create the same class of bug. GRPC_ENABLE_FORK_SUPPORT=1 (set together with GRPC_POLL_STRATEGY=poll) is a two-variable env change that covers the entire gRPC surface : cheap insurance.
  4. H1 is too invasive for too little extra coverage : spawn system-wide breaks pickling assumptions across the whole codebase. The audit cost is high and we have no current evidence that any non-cleanlab fork user exists.
  5. H3 is wrong-shaped : disabling MLflow autolog scoped to trials reduces observability without fixing the root cause (the next fork user reintroduces the bug).

5. Implementation path

5.1 Primary fix — enforce n_jobs=1 in cleanlab call

  1. In src/training/labels/label_pipeline.py, locate the cleanlab.filter.find_label_issues call inside the suspect_mask Hamilton node (around line 341).
  2. Pass n_jobs=1 explicitly, so that cleanlab takes its in-process branch : no multiprocessing.Pool is created and no fork occurs.
  3. Add a comment referencing this Story + the gRPC fork incident.
  4. Emit a structured event event=cleanlab_find_issues_start with fields n_jobs=1, backend=single_process, reason=cvn_n011_ea_s10_fork_safety on first call (per ADR-32) so Loki can confirm the contract holds.
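Steps 2 to 4 could be combined in a thin wrapper along these lines. This is a sketch, not the actual suspect_mask code : the wrapper name, the logger name and the JSON log format are assumptions, and find_label_issues is passed in as a parameter so the sketch runs without cleanlab installed. At the real call site it would be cleanlab.filter.find_label_issues, whose labels, pred_probs and n_jobs parameters are the ones used here.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("label_pipeline")  # assumed logger name
_event_emitted = False


def call_find_label_issues_fork_safe(
    find_label_issues: Callable[..., Any], labels: Any, pred_probs: Any, **kwargs: Any
) -> Any:
    """Call find_label_issues with n_jobs pinned to 1 (CVN-N011-EA-S10 fork safety)."""
    global _event_emitted

    # Reject any attempt to re-enable the multiprocessing.Pool path.
    requested = kwargs.pop("n_jobs", 1)
    if requested != 1:
        raise ValueError(f"n_jobs must stay 1 for fork safety, got {requested!r}")

    if not _event_emitted:
        # Structured event, emitted on first call only (field values from §5.1).
        logger.info(json.dumps({
            "event": "cleanlab_find_issues_start",
            "n_jobs": 1,
            "backend": "single_process",
            "reason": "cvn_n011_ea_s10_fork_safety",
        }))
        _event_emitted = True

    return find_label_issues(labels=labels, pred_probs=pred_probs, n_jobs=1, **kwargs)
```

Raising on a non-1 value, rather than silently overriding it, keeps a future caller from believing that parallelism was honoured.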

5.2 Hypothesis 2 — gRPC fork support via env vars

  1. Edit infra/helm/airflow/values-prod.yaml — add GRPC_ENABLE_FORK_SUPPORT=1 and GRPC_POLL_STRATEGY=poll to the worker pod env.
  2. Mirror the env vars in airflow_docker/docker-compose.yaml so local dev gets the same behaviour.
  3. Update documentation/architecture/FTF_SCALING.md with a "fork safety" section pointing to the gRPC official guide and explaining the rationale.

5.3 Reproducer test (per ADR-58 — every fix needs a regression bar)

Add tests/integration/test_grpc_fork_deadlock_regression.py :

  • Fixture spins up an MLflow autolog session (start the gRPC threads explicitly).
  • Calls cleanlab.filter.find_label_issues on a small synthetic dataset (~300 samples) with the explicit contract n_jobs=1.
  • Asserts the call completes within a strict timeout (e.g. 30 sec) AND that the contract value is preserved (the test fails if n_jobs is removed from the call site).
  • Pre-fix : the test deadlocks (CI fails on timeout). Post-fix : the call returns within 1-2 sec.
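Because a deadlocked call never returns, the timeout has to be enforced from outside the hung process. One way to do that is sketched below with a child interpreter ; completes_within is a hypothetical helper name, and the code string would contain the MLflow autolog setup plus the find_label_issues call described above.

```python
import subprocess
import sys


def completes_within(code: str, timeout_s: float) -> bool:
    """Run code in a fresh interpreter; return False if it exceeds timeout_s."""
    try:
        subprocess.run(
            [sys.executable, "-c", code],
            check=True,  # a crash raises CalledProcessError instead of passing as success
            timeout=timeout_s,
            # No pipes: forked workers left behind by a deadlocked child could
            # otherwise keep them open and stall the cleanup after the timeout.
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False
    return True
```

A hang and a crash are kept distinct on purpose : only the timeout maps to False, so an import error in the reproducer surfaces as an exception rather than being mistaken for the deadlock.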

5.4 Production validation

After deploy :

  1. Re-trigger finetune__pte with factor=cleanlab, crypto_group=defi_top5, power_mode=standard.
  2. Operator monitors the 5 pods for ~5 min.
  3. Expected : ~125 rows in finetune_results after ~30 min, all 5 pods complete cleanly, and kubectl logs ... | grep -c fork_posix.cc:75 prints 0 (no matching lines).
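The log check in step 3 is easy to get backwards in a script, because grep exits with status 1 (not 0) when nothing matches. A small sketch that counts occurrences explicitly, assuming the pod logs have already been collected as lines of text (count_fork_warnings is a hypothetical helper) :

```python
from typing import Iterable

FORK_WARNING_MARKER = "fork_posix.cc:75"


def count_fork_warnings(log_lines: Iterable[str]) -> int:
    """Number of log lines that contain the gRPC fork warning marker."""
    return sum(1 for line in log_lines if FORK_WARNING_MARKER in line)
```

The validation passes when the count is 0 for every pod in the sweep.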

5.5 Runbook

Create documentation/runbooks/runbook_hpo_fork_deadlock.md (P1) per the MLOps readiness template — alert trigger : finetune_pod_idle_seconds > 300 AND finetune_rows_persisted_per_min == 0. The runbook references this Story for the root-cause analysis.

6. Acceptance criteria

  • Plan dossier reviewed via committee plan_review ; verdict PASSED or blockers addressed in v2
  • Reproducer test added (tests/integration/test_grpc_fork_deadlock_regression.py), pre-fix it fails on timeout, post-fix it passes
  • Hypothesis 4 implementation (n_jobs=1 on the find_label_issues call in label_pipeline.suspect_mask) merged
  • Hypothesis 2 implementation (GRPC_ENABLE_FORK_SUPPORT=1 + GRPC_POLL_STRATEGY=poll in Helm values + docker-compose mirror) merged
  • On-cluster validation : cleanlab FTF sweep on defi_top5 completes with ≥ 50 BUY trades / fold + 75-125 rows persisted in finetune_results, no fork_posix.cc:75 warnings in pod logs
  • Runbook runbook_hpo_fork_deadlock.md (P1) added per MLOps readiness §2 + indexed in documentation/runbooks/index.md
  • documentation/architecture/FTF_SCALING.md updated with the fork-safety section
  • OPERATIONS.md §17 incident log entries §17.4 (sympy regression, wp#92) and §17.5 (this gRPC fork deadlock, wp#91) added

7. Out of scope

  • General MULTIPROCESSING_START_METHOD=spawn migration (deferred — reconsider only if a non-cleanlab fork user shows up later)
  • Post-build dockerized smoke test for new module-load-time deps (covered by CVN-N011-EA-S11)
  • Changes to cleanlab itself (upstream — out of our control)

8. Falsifiability + rollback

  • Falsifiability : if the post-fix cleanlab sweep still deadlocks on the cluster (5 pods stuck the same way), the assumption that cleanlab's multiprocessing.Pool fork is the trigger is wrong ; we revert immediately and escalate to H1 (system-wide spawn) under a follow-up Story.
  • Rollback : revert label_pipeline.py change (3 lines) and remove the env vars from values-prod.yaml. Both are local, atomic, and reversible without data loss.

9. Risks

Risk | Likelihood | Impact | Mitigation
n_jobs=1 slows the cleanlab call (no multi-core parallelism) | Low | Medium | Per-call cost is small at our input sizes (see H4 in §3). Benchmark in the integration test ; if > 2x slower than the multi-process default, escalate to H1
GRPC_ENABLE_FORK_SUPPORT=1 introduces gRPC instability for MLflow / OTel | Low | Medium | Documented in gRPC release notes as production-grade since gRPC 1.50 (we use 1.60+). Run staging soak test first
Reproducer test is flaky on CI (timing-dependent) | Medium | Low | Use a generous 30 sec timeout + retry logic ; the deadlock is deterministic when triggered, the pass is deterministic when fixed
Another library starts forking in the trial body before H1 lands | Low | High | H2 (gRPC fork support) is the catch-all backstop ; this risk is exactly why we keep it as defence-in-depth

10. Cross-references

  • Sister Story under same Epic, same victim sweep : CVN-N011-EA-S11 — post-mortem of the missing-dep regression that hid this bug for an extra ~2h
  • Hotfix that landed concurrently : PR #775 (sympy in requirements)
  • Closed sister Story (cleanlab cap) : CVN-N011-EA-S08 — proves the per-class cap is working and not the cause of this deadlock
  • Code sites :
      • Cleanlab call (the trigger) : src/training/labels/label_pipeline.py::suspect_mask (Hamilton node that calls cleanlab.filter.find_label_issues, around line 341)
      • HPO entry : src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py (Optuna study + trial spawning)
      • MLflow autolog : src/commun/mlflow/client.py
      • FTF runner : src/commun/finetune/ablation_runner.py:531 (where the runtime spawns the per-variant training)
  • Architecture docs to update : documentation/architecture/FTF_SCALING.md (fork safety section)
  • External references :
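      • gRPC fork support guide (H2 reference)
      • cleanlab.filter.find_label_issues API reference, n_jobs parameter (H4 reference ; the initial draft pointed to joblib parallel backends, which cleanlab does not use)
      • Python multiprocessing start methods (H1 reference)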

11. Committee recommendations triage (post-PASSED)

# | Recommendation | Raised by | Disposition + where applied
1 | Enhanced liveness probe (hpo_heartbeat detection) | Architect | Defer to a separate observability Story under CVN-N010 : ADR-26/62 territory, this Story stays scoped on the deadlock fix
2 | Performance benchmark for H4 with explicit thresholds in CI | Architect | Apply in §5.3 : extend the reproducer test to record n_jobs=cpu_count() vs n_jobs=1 wall time on the same dataset and assert the n_jobs=1 time <= 2.0 × the n_jobs=cpu_count() time
3 | Staging soak test for H2 (GRPC_ENABLE_FORK_SUPPORT=1) | Ops | Apply : staging deploy with --set cvntradeImageTag=<SHA> first, run a 50-trial focal_loss sweep, verify no MLflow / OTel regressions, then promote to prod
4 | Audit cleanlab callbacks for thread-safety / pickling | Data Scientist | Apply : call site inspection in PR description ; cleanlab's find_label_issues only takes immutable arrays + a classifier instance, no closures, low risk but document
5 | Kill switch + formalize staged rollout | Ops | Apply (partial) : add a section in the new runbook §4.1 with the kubectl delete pod -l job=finetune-pte command + the Airflow UI "Mark Success" path. ADR-71 trading kill-switch is unrelated to this batch workload
6 | Enhanced gRPC observability post-H2 | Architect | Defer to CVN-N010-EA : same scope as #1 ; gRPC client metrics are not currently exposed in our OTel pipeline
7 | Runbook ownership (DRI assignment) | Ops | Apply : runbook frontmatter Owner: dococeven, escalation path documented in §5 of the runbook
8 | Structured log event for joblib backend | ML Engineer | Apply : already in §5.1 as event=cleanlab_find_issues_start with field backend=single_process, per ADR-32 / ADR-38
9 | Empirical GIL validation via py-spy / cProfile | ML Engineer | Apply : add a make profile-grpc-fix RUN_ID=... target that runs the reproducer with py-spy record and attaches the SVG flamegraph to the PR description
10 | Stress scenarios in reproducer (high n_jobs, memory pressure) | Data Scientist | Apply : extend §5.3 with a n_jobs=8 + n_samples=2000 parametrized variant
11 | Document threading limitations in docs + code | ML Engineer | Apply : already in §5.2 (FTF_SCALING.md update) ; add a code comment at the find_label_issues call site explaining why n_jobs=1 is enforced

Net effect on §5 implementation path : 8 recos applied (mostly already in plan, made explicit) + 2 deferred to existing observability Need + 1 partial. No blockers, no scope expansion that would warrant a v2.


Question for the committee

Validate Hypothesis 4 (scoped joblib threading backend on the cleanlab CV call) + Hypothesis 2 (GRPC_ENABLE_FORK_SUPPORT=1 env var) as the fix. The diagnostic from the focal_loss sweep proves the deadlock is cleanlab-specific (not a general HPO+MLflow+Optuna issue), so a narrow targeted fix + a defence-in-depth env var is preferable to a system-wide spawn migration. Are there hidden assumptions about the joblib threading backend (GIL, callback pickling, OS-level resource sharing) that would make this fix unsafe for a 5-pod × 50-trial production sweep ?