Plan dossier — CVN-N011-EA-S10 : gRPC fork deadlock blocks every cleanlab FTF sweep
Date : 2026-04-29
Story : CVN-N011-EA-S10 (OP wp#91) — Pipeline Contract Hardening Epic
GH issue : #774
Author : Dominique (operator) + Claude
Session type : plan_review (per ADR-68)
Severity : P1 — blocks every Track 5 cleanlab gate decision (Track 6 unaffected per the diagnostic refinement below)
Review history
- v1 (committee `7bf612b7`) — PASSED OK, consensus strong (5/5 experts), median score 8.0/10. 0 blockers. 11 recommendations triaged below in §11.
1. Context
On 2026-04-29 the operator triggered the cleanlab FTF sweep twice — once before PR #769 (CVN-N011-EA-S08 class-aware cap) was merged, and once after the image redeploy at 09:15 UTC. Both runs deadlocked silently after ~4 successful HPO trials per pod.
Hours later, the operator launched a Track 6 focal-loss sweep on the same defi_top5 group, same HPO + MLflow + Optuna stack — and that sweep ran cleanly with 0 deadlocks observed across 5 pods × 50 trials. This refines the original Story hypothesis (the deadlock is environmental on the whole HPO+MLflow+Optuna combo) into a much narrower surface : the deadlock requires the cleanlab path.
2. Updated diagnostic
2.1 What we observed (cleanlab sweep, 2026-04-29 09:46-10:10 UTC)
5 pods, all stuck identically by 09:48 UTC (last log lines between 09:46:15 and 09:47:25) :
| Pod | Crypto | Last log | Last meaningful event |
|---|---|---|---|
| `4taz4os7` | OPUSDC | 09:47:25 | cleanlab_cv_probs (HPO trial ~6) |
| `7wsg5tvp` | ARBUSDC | 09:46:51 | cleanlab_cv_probs (HPO trial ~3) |
| `fzkykxa7` | AAVEUSDC | 09:46:56 | cleanlab_cv_probs (HPO trial ~3) |
| `p3mltmkj` | UNIUSDC | 09:46:42 | cleanlab_cv_probs (HPO trial ~2) |
| `u6kmwt1l` | LDOUSDC | 09:46:15 | `fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers`, then silence |
LDOUSDC is the smoking gun — explicit gRPC fork warning right before silence. The other 4 hung at the same point but their last log was a successful pipeline iteration so the warning was buried in stderr.
The per-class cap from PR #769 is working correctly in every successful trial : buy_drop_pct=4.99 ≤ cap_pct=5.0. The deadlock is unrelated to S08.
2.2 What we observed (focal_loss sweep, 2026-04-29 11:08-11:30 UTC)
5 pods, same defi_top5 group, same HPO + MLflow + Optuna stack, 0 fork warnings, all trials progressing at ~4.5 sec/trial, healthy hpo_heartbeat events every ~25 sec. Sweep was killed by an unrelated bug (sympy missing — see CVN-N011-EA-S11 post-mortem and PR #775 hotfix), but the runtime stack itself was clearly not deadlocking.
2.3 Refined root cause hypothesis
The deadlock is triggered by cleanlab.filter.find_label_issues which spawns a multiprocessing.Pool directly (verified at cleanlab/filter.py:11) — not via joblib. cleanlab defaults to n_jobs=cpu_count() and uses the OS default start method (fork on Linux + Python<3.14, see filter.py:359). The fork happens inside the parent process where the MLflow client gRPC threads (started by mlflow.autolog) are alive. The forked child inherits a half-locked gRPC state and hangs forever on the first gRPC call (which the MLflow client tries to make at trial completion).
This is the canonical "fork after threads" anti-pattern documented in gRPC's fork support page and Python multiprocessing docs.
Why focal_loss didn't trigger : the focal-loss label_pipeline runs with cleanlab_mode='off', so find_label_issues is never called, so no multiprocessing.Pool fork inside the trial body.
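The mechanism can be illustrated without gRPC or cleanlab. The sketch below is a stand-in, not the actual gRPC internals: a plain `threading.Lock` plays the role of the library's internal state, a helper thread holds it while the process forks, and the child finds the lock permanently held because the thread that would release it was not copied by `fork()`. The function name `child_sees_lock_held` is hypothetical, and the code assumes a POSIX platform where `os.fork` is available.

```python
import os
import threading
import time


def child_sees_lock_held(hold_seconds: float = 0.5) -> bool:
    """Fork while a helper thread holds a lock; report whether the child finds it stuck."""
    lock = threading.Lock()
    lock_taken = threading.Event()

    def holder() -> None:
        with lock:
            lock_taken.set()
            time.sleep(hold_seconds)  # keep the lock held across the fork

    thread = threading.Thread(target=holder)
    thread.start()
    lock_taken.wait()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, so nobody will release the lock here.
        try:
            acquired = lock.acquire(timeout=0.2)
            os.write(write_fd, b"0" if acquired else b"1")
        finally:
            os._exit(0)  # skip interpreter cleanup in the forked child

    # Parent: read the child's verdict, then reap the child and the helper thread.
    os.close(write_fd)
    stuck = os.read(read_fd, 1) == b"1"
    os.close(read_fd)
    os.waitpid(pid, 0)
    thread.join()
    return stuck
```

In the real incident there is no timeout on the blocked call, which is why the pods hang silently instead of failing.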
2.4 Why the deadlock survives across pods
Once a pod's main process is deadlocked, the pod still reports Running to K8s (the process is alive, just hung). No liveness probe is defined for this workload, so nothing ever marks the pod unhealthy, and it keeps holding its RAM until the cluster auto-scaler or the operator kills it. There is no automatic recovery.
3. Hypotheses
Hypothesis 1 — MULTIPROCESSING_START_METHOD=spawn (system-wide)
Force every multiprocessing.Process and joblib worker to use spawn instead of fork. The child re-imports the module fresh, no inherited threads, no inherited gRPC state.
- Cost : add `import multiprocessing; multiprocessing.set_start_method('spawn', force=True)` near `src/commun/finetune/__init__.py` (or in the pod entrypoint). Plus joblib config: `os.environ['LOKY_START_METHOD'] = 'spawn'`.
- Coverage : universal — fixes the fork-after-threads anti-pattern for all subprocess spawning, not just cleanlab. Future Stories that introduce other fork users get the fix for free.
- Risk : 1-2 sec startup overhead per joblib worker (re-import). Pickle-ability of every callback passed to workers (most are fine ; custom closures in lambda form will break). Need to audit `src/training/XGBoost/xgboost_hpo.py` and `src/commun/finetune/ablation_runner.py` for non-picklable callbacks.
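The pickling audit mentioned in the Risk bullet could start from a helper like the one below. This is a minimal sketch: `is_picklable` and the `callbacks` mapping are hypothetical names, and the two entries are illustrative stand-ins rather than callbacks taken from the codebase. Under `spawn`, anything handed to a worker has to survive `pickle.dumps`, which a lambda does not.

```python
import math
import pickle
from typing import Any


def is_picklable(obj: Any) -> bool:
    """Return True if obj can be serialized for a spawn-started worker."""
    try:
        pickle.dumps(obj)
    except Exception:  # PicklingError, AttributeError, TypeError depending on the object
        return False
    return True


# Illustrative stand-ins for callbacks passed to worker processes.
callbacks = {
    "module_level_function": math.sqrt,   # pickled by reference, safe under spawn
    "inline_lambda": lambda x: x * 2,     # has no importable name, breaks under spawn
}

unsafe = sorted(name for name, cb in callbacks.items() if not is_picklable(cb))
```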
Hypothesis 2 — GRPC_ENABLE_FORK_SUPPORT=1 (gRPC-native)
gRPC has a documented (opt-in) fork support mode that drains and re-initializes the channel on fork. Setting GRPC_ENABLE_FORK_SUPPORT=1 and GRPC_POLL_STRATEGY=poll env vars before importing gRPC tells the library to handle forks safely.
- Cost : 2 env vars in the pod spec (Helm chart). No code change.
- Coverage : every gRPC-using library (MLflow, OpenTelemetry, Tempo, Loki client) gets fork-safety automatically.
- Risk : performance hit on the gRPC channel (latency spike post-fork as the channel re-establishes). MLflow logging frequency is low (per-trial), so the latency is amortized. The `poll` strategy is slightly less efficient than `epoll` on heavy gRPC traffic, but our use is light.
- Caveat : the env vars must be set before Python imports gRPC for the first time. In K8s that's straightforward via `env:` ; locally it requires a wrapper (or `conftest.py` for tests).
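For the local case in the Caveat, a wrapper along these lines could run at the very top of an entrypoint or `conftest.py`. It is a sketch under assumptions: the function name `enable_grpc_fork_support` is hypothetical, and the `environ` and `loaded_modules` parameters exist only so the helper can be exercised without touching the real process state. It returns `False` when `grpc` is already imported, since the variables would then be read too late for the current process.

```python
import os
import sys
from typing import Any, Mapping, MutableMapping


def enable_grpc_fork_support(
    environ: MutableMapping[str, str] = os.environ,
    loaded_modules: Mapping[str, Any] = sys.modules,
) -> bool:
    """Set the gRPC fork-support variables; return False if grpc was imported first."""
    grpc_already_imported = "grpc" in loaded_modules
    # setdefault keeps any value already provided by the pod spec or the shell.
    environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "1")
    environ.setdefault("GRPC_POLL_STRATEGY", "poll")
    return not grpc_already_imported
```

A caller could treat a `False` return as a configuration error in tests, so that an import-order regression is caught early.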
Hypothesis 3 — Disable MLflow autolog inside Optuna trials (most invasive)
The MLflow client gRPC threads are spawned by mlflow.autolog. Wrap each Optuna trial in mlflow.autolog(disable=True) for the trial duration ; log results explicitly post-trial.
- Cost : ~30 lines in `src/training/XGBoost/xgboost_hpo.py` to wrap the Optuna `study.optimize(objective, ...)` body.
- Coverage : narrow — only fixes the MLflow / gRPC / fork triangle. If another fork-unsafe library (e.g. OpenTelemetry exporter) ever lands in the trial body, the deadlock returns.
- Risk : we lose the per-trial autolog telemetry inside MLflow (acceptable — we already persist trial metrics in `finetune_results`). Slightly invasive code change in HPO entry point.
Hypothesis 4 (NEW, considered post-diagnostic) — Force n_jobs=1 on the cleanlab call
cleanlab.filter.find_label_issues is the fork trigger and accepts an n_jobs parameter. With n_jobs=1 cleanlab takes the in-process branch (use_global_vars=True per filter.py:365) — no multiprocessing.Pool, no fork.
- Cost : 1-argument change in `src/training/labels/label_pipeline.py::suspect_mask` — pass `n_jobs=1` to `find_label_issues(...)`.
- Coverage : narrow — only fixes the cleanlab path (which is what the diagnostic narrows the bug to). Other fork users would still deadlock.
- Risk : single-process execution loses the multi-core parallelism. For cleanlab's typical input size in our pipeline (~3-30k samples per fold) the call duration is ~1-2 sec at `n_jobs=cpu_count()` and is expected to land in the 2-5 sec range at `n_jobs=1` (cleanlab itself notes that `n_jobs=1` is preferred for many input sizes — see `filter.py:262-266`). Need to benchmark to confirm no order-of-magnitude regression.
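The benchmark called for in the Risk bullet could be built on two small helpers like these. Both names (`median_wall_time`, `within_slowdown_budget`) are hypothetical, and the default `max_ratio=10.0` simply encodes the "no order-of-magnitude regression" wording above; a tighter budget can be passed explicitly.

```python
import statistics
import time
from typing import Callable


def median_wall_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock duration of fn() in seconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def within_slowdown_budget(
    baseline_s: float, candidate_s: float, max_ratio: float = 10.0
) -> bool:
    """True if the candidate is at most max_ratio times slower than the baseline."""
    return candidate_s <= max_ratio * baseline_s
```

With the figures quoted above, a 1.5 sec multi-process baseline and a 4 sec single-process run would pass the default budget.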
4. Recommended approach
Adopt Hypothesis 4 (single-process cleanlab : pass n_jobs=1 to find_label_issues) as the primary fix, plus Hypothesis 2 (GRPC_ENABLE_FORK_SUPPORT=1 env vars) as a defence-in-depth backstop.
Implementation note (post plan_review) : initial drafting assumed cleanlab uses joblib (`loky`) for parallelism. Inspection of `cleanlab.filter` source showed cleanlab uses native `multiprocessing.Pool`, not joblib — so a `joblib.parallel_backend('threading')` wrapper would be a no-op. The right primitive is `n_jobs=1` (in-process, no Pool spawned at all). The intent (eliminate fork from cleanlab path) is unchanged, only the mechanism is corrected.
Rationale :
- Diagnostic narrowing matters : the focal_loss success proves the deadlock is cleanlab-specific. H4 targets the actual trigger surface (cleanlab's internal `multiprocessing.Pool`) without touching the rest of the platform.
- H4 is the smallest blast radius : a 1-argument change (`n_jobs=1`) in the call site within `label_pipeline.suspect_mask`, rolled back trivially if benchmarks show a slowdown.
- H2 as backstop : even if H4 unblocks the immediate cleanlab path, the next fork-using library to land would re-create the same class of bug. `GRPC_ENABLE_FORK_SUPPORT=1` (set together with `GRPC_POLL_STRATEGY=poll`) covers the entire gRPC surface — cheap insurance.
- H1 is too invasive for too little extra coverage : `spawn` system-wide breaks pickling assumptions across the whole codebase. The audit cost is high and we have no current evidence that any non-cleanlab fork user exists.
- H3 is wrong-shaped : disabling MLflow autolog scoped to trials reduces observability without fixing the root cause (the next fork user reintroduces the bug).
5. Implementation path
5.1 Primary fix — enforce n_jobs=1 in cleanlab call
- In `src/training/labels/label_pipeline.py`, locate the `cleanlab.filter.find_label_issues` call inside the `suspect_mask` Hamilton node (around line 341).
- Pass `n_jobs=1` explicitly, so that cleanlab spawns no `multiprocessing.Pool` and never forks.
- Add a comment referencing this Story + the gRPC fork incident.
- Emit a structured event `event=cleanlab_find_issues_start` with fields `n_jobs=1, backend=single_process, reason=cvn_n011_ea_s10_fork_safety` on first call (per ADR-32) so Loki can confirm the contract holds.
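The steps above might look roughly like the following at the call site. This is a sketch, not the actual `suspect_mask` code: the wrapper name `find_label_issues_single_process`, the logger name, the JSON log shape and the injectable `finder` parameter are all assumptions made for illustration (the injection lets the contract be checked with a stub when cleanlab is not installed). Only `cleanlab.filter.find_label_issues` and its `n_jobs` argument come from the library itself.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("label_pipeline")  # hypothetical logger name
_event_emitted = False


def find_label_issues_single_process(
    labels: Any,
    pred_probs: Any,
    finder: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Call cleanlab's find_label_issues with n_jobs=1 (CVN-N011-EA-S10 fork safety)."""
    global _event_emitted
    if finder is None:
        # Lazy import so a stub can be injected in tests.
        from cleanlab.filter import find_label_issues as finder

    # Refuse any attempt to re-enable the multiprocessing.Pool path.
    if kwargs.pop("n_jobs", 1) != 1:
        raise ValueError("n_jobs must stay at 1: forking here deadlocks gRPC clients")

    if not _event_emitted:  # structured event on first call only
        logger.info(json.dumps({
            "event": "cleanlab_find_issues_start",
            "n_jobs": 1,
            "backend": "single_process",
            "reason": "cvn_n011_ea_s10_fork_safety",
        }))
        _event_emitted = True

    return finder(labels, pred_probs, n_jobs=1, **kwargs)
```

Rejecting a conflicting `n_jobs` rather than silently overriding it keeps the contract visible to whoever changes the call site later.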
5.2 Hypothesis 2 — gRPC fork support via env vars
- Edit `infra/helm/airflow/values-prod.yaml` — add `GRPC_ENABLE_FORK_SUPPORT=1` and `GRPC_POLL_STRATEGY=poll` to the worker pod env.
- Mirror the env vars in `airflow_docker/docker-compose.yaml` so local dev gets the same behaviour.
- Update `documentation/architecture/FTF_SCALING.md` with a "fork safety" section pointing to the gRPC official guide and explaining the rationale.
5.3 Reproducer test (per ADR-58 — every fix needs a regression bar)
Add tests/integration/test_grpc_fork_deadlock_regression.py :
- Fixture spins up an MLflow autolog session (start the gRPC threads explicitly).
- Calls `cleanlab.filter.find_label_issues` on a small synthetic dataset (~300 samples) with the explicit contract `n_jobs=1`, and asserts that the contract value is preserved (the test fails if `n_jobs` is removed from the call site).
- Asserts the call completes within a strict timeout (e.g. 30 sec).
- Pre-fix : the test deadlocks (CI fails on timeout). Post-fix : the call returns within 1-2 sec.
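A deadlocked call cannot be interrupted from inside the same interpreter, so the strict timeout is easiest to enforce from outside. The sketch below shows one possible harness: it runs the scenario as a separate Python process and lets `subprocess.run` kill it when the deadline passes. The helper name `script_completes_within` is hypothetical, and the script passed in (MLflow fixture plus the cleanlab call) is left to the real test.

```python
import subprocess
import sys


def script_completes_within(script: str, timeout_s: float) -> bool:
    """Run a Python snippet in a fresh interpreter; False on timeout or non-zero exit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            timeout=timeout_s,       # the child is killed when this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Running the scenario in a fresh interpreter also keeps any hung state out of the test runner itself, which helps with the CI flakiness concern.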
5.4 Production validation
After deploy :
- Re-trigger `finetune__pte` with `factor=cleanlab`, `crypto_group=defi_top5`, `power_mode=standard`.
- Operator monitors the 5 pods for ~5 min.
- Expected : ~125 rows in `finetune_results` after ~30 min, all 5 pods complete cleanly, `kubectl logs ... | grep fork_posix.cc:75` finds 0 matching lines.
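When the pod logs are collected into a file or a string rather than piped through `grep`, the same check can be written as below. The helper name `count_fork_warnings` is hypothetical; the marker string is the one observed in the LDOUSDC pod log.

```python
FORK_WARNING_MARKER = "fork_posix.cc:75"


def count_fork_warnings(log_text: str) -> int:
    """Count log lines that contain the gRPC fork warning marker."""
    return sum(1 for line in log_text.splitlines() if FORK_WARNING_MARKER in line)
```

The validation passes on this criterion when the count is 0 for each of the 5 pods.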
5.5 Runbook
Create documentation/runbooks/runbook_hpo_fork_deadlock.md (P1) per the MLOps readiness template — alert trigger : finetune_pod_idle_seconds > 300 AND finetune_rows_persisted_per_min == 0. The runbook references this Story for the root-cause analysis.
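The alert trigger reads as a simple conjunction. As a sketch, with the hypothetical function name `should_alert` and the two metric values passed in as plain numbers (how they are scraped is outside this example):

```python
def should_alert(
    idle_seconds: float,
    rows_persisted_per_min: float,
    idle_threshold_s: float = 300.0,
) -> bool:
    """finetune_pod_idle_seconds > 300 AND finetune_rows_persisted_per_min == 0."""
    return idle_seconds > idle_threshold_s and rows_persisted_per_min == 0
```

Requiring both conditions avoids paging on a pod that is quiet in its logs but still persisting rows.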
6. Acceptance criteria
- Plan dossier reviewed via committee `plan_review` ; verdict `PASSED` or blockers addressed in v2
- Reproducer test added (`tests/integration/test_grpc_fork_deadlock_regression.py`), pre-fix it fails on timeout, post-fix it passes
- Hypothesis 4 implementation (`n_jobs=1` passed to `find_label_issues` in `label_pipeline.suspect_mask`) merged
- Hypothesis 2 implementation (`GRPC_ENABLE_FORK_SUPPORT=1` + `GRPC_POLL_STRATEGY=poll` in Helm values + docker-compose mirror) merged
- On-cluster validation : cleanlab FTF sweep on `defi_top5` completes with ≥ 50 BUY trades / fold + 75-125 rows persisted in `finetune_results`, no `fork_posix.cc:75` warnings in pod logs
- Runbook `runbook_hpo_fork_deadlock.md` (P1) added per MLOps readiness §2 + indexed in `documentation/runbooks/index.md`
- `documentation/architecture/FTF_SCALING.md` updated with the fork-safety section
- OPERATIONS.md §17 incident log entries §17.4 (sympy regression — wp#92) and §17.5 (this gRPC fork deadlock — wp#91) added
7. Out of scope
- General `MULTIPROCESSING_START_METHOD=spawn` migration (deferred — reconsider only if a non-cleanlab fork user shows up later)
- Post-build dockerized smoke test for new module-load-time deps (covered by `CVN-N011-EA-S11`)
- Changes to cleanlab itself (upstream — out of our control)
8. Falsifiability + rollback
- Falsifiability : if the post-fix cleanlab sweep still deadlocks on the cluster (5 pods stuck the same way), the assumption that `n_jobs=1` removes the fork from the cleanlab path is wrong ; we revert immediately and escalate to H1 (system-wide spawn) under a follow-up Story.
- Rollback : revert the `label_pipeline.py` change (3 lines) and remove the env vars from `values-prod.yaml`. Both are local, atomic, and reversible without data loss.
9. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| `n_jobs=1` slows the cleanlab call (single process, no multi-core parallelism) | Low | Medium | Expected 2-5 sec per call vs 1-2 sec at our input sizes (see Hypothesis 4). Benchmark in the integration test ; if > 2x slower than the multi-process default, escalate to H1 |
| `GRPC_ENABLE_FORK_SUPPORT=1` introduces gRPC instability for MLflow / OTel | Low | Medium | Documented in gRPC release notes as production-grade since gRPC 1.50 (we use 1.60+). Run staging soak test first |
| Reproducer test is flaky on CI (timing-dependent) | Medium | Low | Use a generous 30 sec timeout + retry logic ; the deadlock is deterministic when triggered, the pass is deterministic when fixed |
| Another library starts forking in the trial body before H1 lands | Low | High | H2 (gRPC fork support) is the catch-all backstop ; this risk is exactly why we keep it as defence-in-depth |
10. Cross-references
- Sister Story under same Epic, same victim sweep : `CVN-N011-EA-S11` — post-mortem of the missing-dep regression that hid this bug for an extra ~2h
- Hotfix that landed concurrently : PR #775 (sympy in requirements)
- Closed sister Story (cleanlab cap) : `CVN-N011-EA-S08` — proves the per-class cap is working and not the cause of this deadlock
- Code sites :
  - Cleanlab call (the trigger) : `src/training/labels/label_pipeline.py::suspect_mask` (Hamilton node that calls `cleanlab.filter.find_label_issues`, around line 341)
  - HPO entry : `src/training/XGBoost/cvntrade_XGBoost_hyperoptimizer.py` (Optuna study + trial spawning)
  - MLflow autolog : `src/commun/mlflow/client.py`
  - FTF runner : `src/commun/finetune/ablation_runner.py:531` (where the runtime spawns the per-variant training)
- Architecture docs to update : `documentation/architecture/FTF_SCALING.md` (fork safety section)
- External references :
  - gRPC fork support guide — H2 reference
  - joblib parallel backends — H4 reference
  - Python multiprocessing start methods — H1 reference
11. Committee recommendations triage (post-PASSED)
| # | Recommendation | Disposition + where applied |
|---|---|---|
| 1 | Enhanced liveness probe (hpo_heartbeat detection) — Architect | Defer to a separate observability Story under CVN-N010 — ADR-26/62 territory, this Story stays scoped on the deadlock fix |
| 2 | Performance benchmark for H4 with explicit thresholds in CI — Architect | Apply in §5.3 — extend the reproducer test to record `n_jobs=cpu_count()` vs `n_jobs=1` wall time on the same dataset and assert the `n_jobs=1` time <= 2.0 × the multi-process time |
| 3 | Staging soak test for H2 (`GRPC_ENABLE_FORK_SUPPORT=1`) — Ops | Apply — staging deploy with `--set cvntradeImageTag=<SHA>` first, run a 50-trial focal_loss sweep, verify no MLflow / OTel regressions, then promote to prod |
| 4 | Audit cleanlab callbacks for thread-safety / pickling — Data Scientist | Apply — call site inspection in PR description ; cleanlab's find_label_issues only takes immutable arrays + a classifier instance, no closures, low risk but document |
| 5 | Kill switch + formalize staged rollout — Ops | Apply (partial) — add a section in the new runbook §4.1 with the kubectl delete pod -l job=finetune-pte command + the Airflow UI "Mark Success" path. ADR-71 trading kill-switch is unrelated to this batch workload |
| 6 | Enhanced gRPC observability post-H2 — Architect | Defer to CVN-N010-EA — same scope as #1 ; gRPC client metrics are not currently exposed in our OTel pipeline |
| 7 | Runbook ownership (DRI assignment) — Ops | Apply — runbook frontmatter Owner: dococeven, escalation path documented in §5 of the runbook |
| 8 | Structured log event for joblib backend — ML Engineer | Apply — already in §5.1 as `event=cleanlab_find_issues_start` with fields `n_jobs=1, backend=single_process` per ADR-32 / ADR-38 |
| 9 | Empirical GIL validation via py-spy / cProfile — ML Engineer | Apply — add a make profile-grpc-fix RUN_ID=... target that runs the reproducer with py-spy record and attaches the SVG flamegraph to the PR description |
| 10 | Stress scenarios in reproducer (high n_jobs, memory pressure) — Data Scientist | Apply — extend §5.3 with a `n_jobs=8` + `n_samples=2000` parametrized variant |
| 11 | Document threading limitations in docs + code — ML Engineer | Apply — already in §5.2 (FTF_SCALING.md update) ; add a code comment at the `find_label_issues` call site explaining why `n_jobs=1` is enforced |
Net effect on §5 implementation path : 8 recos applied (mostly already in plan, made explicit) + 2 deferred to existing observability Need + 1 partial. No blockers, no scope expansion that would warrant a v2.
Question for the committee
Validate Hypothesis 4 (scoped joblib `threading` backend on the cleanlab CV call) + Hypothesis 2 (`GRPC_ENABLE_FORK_SUPPORT=1` env var) as the fix. The diagnostic from the focal_loss sweep proves the deadlock is cleanlab-specific (not a general HPO+MLflow+Optuna issue), so a narrow targeted fix + a defence-in-depth env var is preferable to a system-wide `spawn` migration. Are there hidden assumptions about the joblib `threading` backend (GIL, callback pickling, OS-level resource sharing) that would make this fix unsafe for a 5-pod × 50-trial production sweep ?