
Runbook — HPO / FTF sweep pods stuck (gRPC fork deadlock)

Alert : hpo_pod_stuck (finetune_pod_idle_seconds > 300 AND finetune_rows_persisted_per_min == 0)

Severity : P1 (page on-call DRI immediately)

Impact : an FTF sweep that should run for ~30 min is silently stuck. Compute is consumed but no rows land in finetune_results. Airflow will still mark the run SUCCESS if the heartbeat task is alive, yet the dataset is empty, so every downstream gate decision (lock / keep available / abandon) is impossible. The cluster autoscaler keeps the pods alive at full RAM until they are manually killed.

Owner : dococeven (DRI Pipeline Contract Hardening)

Affected dashboards :

  • MLOps Overview panel "Active finetune pods"
  • Cluster Health panel "Pod RAM by namespace" (look for high steady RAM + no log progress)

Root-cause Story : CVN-N011-EA-S10


Immediate human actions (0–5 min)

1. Confirm the deadlock pattern

# List all active finetune pods
kubectl get pods -n cvntrade | grep finetune

# Pick one, check the last log timestamp (compare to current UTC time)
POD=finetune-pte-run-factor-crypto-standard-XXXXXX
kubectl logs -n cvntrade $POD --tail=2 --timestamps

If the last log timestamp is > 5 minutes old AND the pod status is Running AND no progress events have fired, this is the deadlock pattern.
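The staleness test can also be scripted. The sketch below is a minimal illustration, assuming every line printed by kubectl logs --timestamps starts with an RFC 3339 UTC timestamp ending in Z. The helper names parse_log_timestamp and is_stale are hypothetical and not part of the pipeline.

```python
from datetime import datetime, timezone

STALE_AFTER_SECONDS = 300  # the 5-minute threshold used in step 1


def parse_log_timestamp(line: str) -> datetime:
    """Parse the leading timestamp of a `kubectl logs --timestamps` line.

    Assumes the form `2026-04-29T12:00:00.123456789Z <message>`.
    """
    stamp = line.split(" ", 1)[0]
    if not stamp.endswith("Z"):
        raise ValueError(f"expected a UTC timestamp ending in Z, got {stamp!r}")
    seconds, _, fraction = stamp[:-1].partition(".")
    parsed = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S")
    # kubectl prints nanoseconds; datetime only keeps microseconds.
    micros = int((fraction + "000000")[:6]) if fraction else 0
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def is_stale(last_log_line: str, now: datetime) -> bool:
    """True when the last log line is older than the threshold."""
    age = (now - parse_log_timestamp(last_log_line)).total_seconds()
    return age > STALE_AFTER_SECONDS
```

Pass the last line of the --tail=2 output together with datetime.now(timezone.utc). A stale result only covers the timestamp condition; the pod status and progress events still need to be checked.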

2. Confirm the fork warning is the trigger

# Search for the gRPC fork warning across ALL active finetune pods
for p in $(kubectl get pods -n cvntrade -o name | grep finetune); do
  echo "=== $p ==="
  kubectl logs -n cvntrade $p 2>&1 | grep -c "fork_posix.cc:75\|skipping fork() handlers"
done

If at least one pod reports a count ≥ 1 (a line matching fork_posix.cc:75 or "skipping fork() handlers"), this is the same root cause as CVN-N011-EA-S10.
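For logs that were already saved to disk, the same check can be done offline. This is a small sketch with the same line-counting semantics as grep -c; the function names and the logs_by_pod mapping (pod name to log text) are illustrative assumptions.

```python
FORK_WARNING_MARKERS = ("fork_posix.cc:75", "skipping fork() handlers")


def count_fork_warnings(log_text: str) -> int:
    """Count log lines containing either marker (one hit per line, like `grep -c`)."""
    return sum(
        1
        for line in log_text.splitlines()
        if any(marker in line for marker in FORK_WARNING_MARKERS)
    )


def pods_with_fork_warning(logs_by_pod: dict[str, str]) -> list[str]:
    """Return, sorted by name, the pods whose logs contain at least one fork warning."""
    return sorted(
        pod for pod, text in logs_by_pod.items() if count_fork_warnings(text) >= 1
    )
```

A line that contains both markers is counted once, which matches what the shell loop prints.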

3. PRIMARY ACTION — kill the stuck pods (< 1 min revert SLA)

# Kill every finetune pod of the DAG, stuck or not : the label selector below
# cannot tell them apart (Airflow will mark the DAG run as failed).
# Airflow's KubernetesExecutor applies a `dag_id` label automatically — that's
# the canonical selector for fan-out pods spawned by `dag_finetune__pte`.
# Note : `run_id` is NOT a Kubernetes label on Airflow worker pods (it lives
# in the metadata DB + structured logs only). Filter by run by inspecting
# pod creation timestamps (`kubectl get pods --sort-by=.metadata.creationTimestamp`)
# or by querying the run_id in `finetune_runs` PG table if needed.
kubectl delete pod -n cvntrade -l dag_id=finetune__pte --force

# Targeted by-name fallback (use the names from step 1 if you want to kill
# only one specific pod — e.g. you've identified the smoking-gun pod and
# want to keep the others alive for diagnostic) :
kubectl delete pod -n cvntrade finetune-pte-run-factor-crypto-standard-XXXXXX --force

The pods will not auto-restart (Airflow KubernetesExecutor doesn't retry on operator-killed pods unless the DAG sets retries > 0).
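Because run_id is not a pod label, narrowing the kill to one run means filtering on creation time. The sketch below assumes the operator has the run's start and end times (for example from the Airflow UI) and the output of kubectl get pods -n cvntrade -l dag_id=finetune__pte -o json; the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def pods_created_between(pod_list_json: str, start: datetime, end: datetime) -> list[str]:
    """Names of pods whose creationTimestamp falls within [start, end].

    `pod_list_json` is the raw output of `kubectl get pods ... -o json`.
    """
    names = []
    for item in json.loads(pod_list_json)["items"]:
        meta = item["metadata"]
        # Kubernetes serialises creationTimestamp as RFC 3339 UTC with second precision.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if start <= created <= end:
            names.append(meta["name"])
    return names
```

The returned names can then be passed to the targeted by-name delete instead of the label selector.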

4. Mark the DAG run as failed in Airflow UI

Navigate to airflow.cvntrade.eu → DAG finetune__pte → latest run → Mark as Failed. This unblocks the max_active_runs=1 constraint (per ADR-22).


Diagnosis (5–30 min)

Why this happens

cleanlab.filter.find_label_issues defaults to n_jobs=cpu_count() and spawns multiprocessing.Pool with the OS default start method (fork on Linux + Python<3.14). When the parent process has live MLflow autolog gRPC threads (every HPO trial), the forked child inherits a half-locked gRPC state and hangs on first gRPC call.

Once stuck, the pod still reports Running to K8s (the process is alive, just hung). No liveness probe is defined for this workload, so nothing fails and restarts it, and it holds its RAM until manually killed. There is no automatic recovery.
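The two ways of avoiding fork() in a threaded parent can be sketched generically. This is not the label_pipeline.py code: square and run_without_fork are hypothetical stand-ins that only show the pattern of staying in-process when n_jobs=1 and using an explicit spawn context otherwise.

```python
import multiprocessing


def square(x: int) -> int:
    """Stand-in for the per-item work that would run in a worker."""
    return x * x


def run_without_fork(values: list[int], n_jobs: int = 1) -> list[int]:
    """Map `square` over `values` without ever calling fork().

    n_jobs=1 stays in the current process, so no child is created at all.
    n_jobs>1 uses a spawn-based pool: each child starts a fresh interpreter
    and inherits no threads or locks from the parent.
    """
    if n_jobs == 1:
        return [square(v) for v in values]
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=n_jobs) as pool:
        return pool.map(square, values)
```

With spawn, the calling script has to guard its entry point with if __name__ == "__main__": because children re-import the main module. The in-process path has no such requirement, which is one reason n_jobs=1 is the simpler fix.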

Verify the fix is in place

Important : label_pipeline.py does NOT run in airflow-scheduler — it runs in the worker pods spawned dynamically by the KubernetesExecutor. Verify the fix on a recent / active worker pod (use $POD from step 1) :

# Pick any recent finetune worker pod (alive or in the last 1h)
POD=$(kubectl get pods -n cvntrade -l dag_id=finetune__pte \
  --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
echo "Verifying on worker pod: $POD"

# Confirm n_jobs=1 is in the deployed image's source
kubectl exec -n cvntrade -it $POD -- grep -n "n_jobs=1" /opt/airflow/src/training/labels/label_pipeline.py

# Confirm GRPC_ENABLE_FORK_SUPPORT is set in the worker pod env
kubectl exec -n cvntrade -it $POD -- printenv GRPC_ENABLE_FORK_SUPPORT GRPC_POLL_STRATEGY

# Cross-check : the image tag deployed on workers (must match the post-S10 SHA)
kubectl get pod -n cvntrade $POD -o jsonpath='{.spec.containers[0].image}'

Expected :

  • n_jobs=1, line found in suspect_mask
  • GRPC_ENABLE_FORK_SUPPORT=1 and GRPC_POLL_STRATEGY=poll
  • Image tag = the SHA at which S10 was merged or later (check git log on main)

If no recent worker pod exists (cluster idle), trigger a quick dag_finetune__pte run on a single crypto, let it spawn a pod, then exec into that pod for the verification.

If any of these is missing or wrong, the deployed image predates the S10 fix : escalate to the operator and verify that the deploy job ran on the latest main SHA.
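The source and environment checks can be combined into one pass over captured output. The sketch below assumes the operator has saved the text of label_pipeline.py and the pod environment as a dict; missing_fix_markers is a hypothetical helper, and the image tag comparison is left out because the post-S10 SHA is not stated here.

```python
EXPECTED_ENV = {"GRPC_ENABLE_FORK_SUPPORT": "1", "GRPC_POLL_STRATEGY": "poll"}


def missing_fix_markers(source_text: str, env: dict[str, str]) -> list[str]:
    """List which S10 fix markers are absent; an empty list means all are present."""
    problems = []
    # Plain substring match, like the grep above: it would also match n_jobs=10.
    if "n_jobs=1" not in source_text:
        problems.append("n_jobs=1 not found in label_pipeline.py")
    for name, expected in EXPECTED_ENV.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    return problems
```

Any non-empty result corresponds to the escalation case described above.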


Remediation paths

Path A — full rollback to prior factor (if S10 fix is broken)

  1. Console UI → https://console.cvntrade.eu/config → flip the active factor away from cleanlab
  2. Re-trigger dag_finetune__pte with factor=focal_loss (or label_smoothing, both validated to NOT trigger this deadlock)
  3. File a P1 issue against CVN-N011-EA-S10 referencing this incident

Path B — partial mitigation : disable cleanlab branch, run other factors

If the operator wants to continue the FTF program while S10 is investigated :

-- psql on prod cvntrade-pg, ftf_config table
UPDATE ftf_config SET base_env = jsonb_set(base_env, '{CVN_CLEANLAB_MODE}', '"off"', true) WHERE id = 1;

This pins cleanlab off in the base config. Variants of OTHER factors continue working.
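To preview what the UPDATE does to base_env before running it, the effect of jsonb_set with create_if_missing set to true can be mirrored on a plain dict. This is an illustrative sketch; pin_cleanlab_off is a hypothetical name and the other keys in the example are made up.

```python
import copy


def pin_cleanlab_off(base_env: dict) -> dict:
    """Return a copy of base_env with CVN_CLEANLAB_MODE set to "off".

    Mirrors jsonb_set(base_env, '{CVN_CLEANLAB_MODE}', '"off"', true):
    the key is overwritten if present and created if absent, and every
    other key is left untouched.
    """
    updated = copy.deepcopy(base_env)
    updated["CVN_CLEANLAB_MODE"] = "off"
    return updated
```

For example, pin_cleanlab_off({"CVN_CLEANLAB_MODE": "prune", "OTHER_KEY": "x"}) keeps OTHER_KEY and only changes the cleanlab mode.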

Path C — investigate without rollback (read-only)

Take a thread dump from one stuck pod for the post-mortem without mutating the pod runtime. The container under investigation MUST stay untouched (no pip install, no extra processes spawned) so that the post-mortem reflects the actual production state : installing tools mid-incident has historically masked root causes by altering memory layout / thread state.

Option C.1 — kubectl debug ephemeral container (recommended ; requires ephemeral-container support, which is stable and always enabled from Kubernetes 1.25 ; verify the server version with kubectl version | grep Server, since the --short flag has been removed from recent kubectl releases) :

# Attach an ephemeral debug container with py-spy already baked in to the
# stuck pod's PID namespace. The original container is untouched.
# Image tag is pinned (not `:latest`) so a future upstream update can't
# break this incident procedure unannounced — bump the tag deliberately
# here when you want a newer py-spy.
kubectl debug -n cvntrade $POD -it --image=ghcr.io/benfred/py-spy:0.4.2 --target=base -- py-spy dump --pid 1

Option C.2 — py-spy already present check (fallback if kubectl debug is unavailable) :

# Step 1 — read-only check that py-spy is present in the image. If absent,
# the operator gets an explicit error (not a silent skip) and switches to
# Option C.1 above.
# `command` is a shell builtin, not an executable, so it has to run through sh -c.
kubectl exec -n cvntrade -it $POD -- sh -c 'command -v py-spy' \
  || { echo "ERROR: py-spy not found in the worker image — use Option C.1 (kubectl debug) instead"; exit 1; }

# Step 2 — only run the dump if step 1 succeeded.
kubectl exec -n cvntrade -it $POD -- py-spy dump --pid 1

Capture the output. The expected stack on a stuck pod : a Python frame waiting on a gRPC channel future inside mlflow.tracking._tracking_service.


Escalation criteria

Trigger | Action
Same alert fires within 24h of S10 fix deploy | Page operator immediately ; the fix is incomplete or regressed
Fork warning found in non-cleanlab factor (e.g. focal_loss, label_smoothing) | New victim : open a sister Story under CVN-N011-EA. The defence-in-depth GRPC_ENABLE_FORK_SUPPORT=1 env var should have caught this ; investigate why it didn't
Pod RAM > 8 GiB and growing | Approaching cluster autoscaler limit ; kill pods immediately to free compute for other workloads

Post-incident checklist

  • OPERATIONS.md §17 incident log entry added (incident_id = 2026-MM-DD-grpc-fork-N)
  • If a new victim was identified : open a follow-up Story under CVN-N011-EA
  • If the S10 fix itself broke : revert the merge commit, page operator, file P0 issue
  • finetune_results rows from the failed run are tagged run_status='failed' (cleanup task) so they don't pollute aggregations
  • Slack post-mortem to #cvntrade-ops with the timeline (alert → kill → root cause → remediation)
  • If escalation criterion 2 fires : add the new victim to the reproducer test under tests/integration/test_grpc_fork_deadlock_regression.py

  • Plan dossier : 2026-04-29-grpc-fork-deadlock-plan.md — full diagnostic + decision rationale
  • Sister Story (post-mortem of the cluster of bugs) : CVN-N011-EA-S11 — process gap analysis
  • Architecture : FTF_SCALING.md — fork safety section
  • Reproducer test : tests/integration/test_grpc_fork_deadlock_regression.py — CI gate that catches a regression of this fix