FTF Scaling Recommendation — Parallel Execution Strategy¶

Date: 2026-04-14
Issue: #499 (Tuning Protocol)
Objective: Reduce FTF factor execution from ~13h/factor to <2-3h/factor

1. Current State¶

Infrastructure¶

Component	Spec	Cost
Permanent pool	1× PRO2-S (8 vCPU, 32GB RAM)	€60/month
Compute pool	0-4× PRO2-M (16 vCPU, 64GB each), autoscaler ON	€0 (idle)

The compute pool already exists but has never been activated (size=0, autoscale min=0, max=4).

Resource Budget on PRO2-S¶

	CPU	RAM
Node allocatable	7.8 cores	29Gi
Platform pods (Airflow, MLflow, API, etc.)	3.05 cores req	8.2Gi
Available for FTF	4.75 cores	21Gi
Current FTF pod	2 cores req (bursts to 4)	4Gi req (uses 3Gi)

Performance Baseline (observed)¶

Metric	Value
Time per run (HPO 30 trials + backtest)	~3.5 min
Calibration factor (225 runs, 1 pod)	~13h
Phase 1a single factor (300 runs, 1 pod)	~17h
Full Phase 0 + 1a (1425 runs, sequential)	~83h
CPU utilization during FTF	100% of 4 cores (saturated)
RAM utilization during FTF	3Gi / 8Gi (37%)

Bottleneck: CPU is saturated. RAM is abundant. Only 1 pod runs at a time.
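The baseline figures follow from the observed ~3.5 min per run when every run executes sequentially in one pod. A minimal sketch of that arithmetic (the function name is illustrative, not taken from the codebase):

```python
MINUTES_PER_RUN = 3.5  # observed: HPO 30 trials + backtest


def sequential_hours(n_runs: int, minutes_per_run: float = MINUTES_PER_RUN) -> float:
    """Wall-clock hours when runs execute one after another in a single pod."""
    return n_runs * minutes_per_run / 60


for label, n_runs in [
    ("Calibration factor", 225),
    ("Phase 1a single factor", 300),
    ("Full Phase 0 + 1a", 1425),
]:
    print(f"{label}: {sequential_hours(n_runs):.1f} h")
```

The results are about 13.1 h, 17.5 h and 83.1 h, consistent with the rounded values in the table.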

2. Workload Profile¶

Per Factor¶

Factor	Variants	Cryptos	Folds	Costs	Total Runs
calibration	3	5	5	3	225
timeframe	4	5	5	3	300
fold_size	4	5	5	3	300
n_features	4	5	5	3	300
classification_mode	3	5	5	3	225
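Each "Total Runs" value is the product of the four dimensions in its row. The sketch below enumerates that grid with index placeholders; the real variant, crypto, fold and cost identifiers are not listed in this document, so plain integers stand in for them.

```python
from itertools import product

# Number of variants per factor, copied from the table above.
VARIANTS_PER_FACTOR = {
    "calibration": 3,
    "timeframe": 4,
    "fold_size": 4,
    "n_features": 4,
    "classification_mode": 3,
}
N_CRYPTOS, N_FOLDS, N_COSTS = 5, 5, 3


def run_grid(n_variants: int) -> list:
    """Enumerate every (variant, crypto, fold, cost) index combination for one factor."""
    return list(
        product(range(n_variants), range(N_CRYPTOS), range(N_FOLDS), range(N_COSTS))
    )


runs_per_factor = {name: len(run_grid(v)) for name, v in VARIANTS_PER_FACTOR.items()}
print(runs_per_factor)
```

Because the crypto index is one axis of the product, slicing the grid by crypto yields five equally sized, disjoint sets of runs per factor.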

Parallelization Opportunity¶

Within each factor, runs for different cryptos are fully independent (no shared state, no data dependency). Each crypto can run in its own pod.

With 5 cryptos → 5 parallel pods = theoretical 5× speedup.
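The theoretical speedup depends on the largest chunk of cryptos any single pod has to process. A small sketch, assuming every crypto costs the same and pods do not contend for CPU (both are simplifications; the helper names are illustrative):

```python
def chunk_round_robin(items: list, n_pods: int) -> list:
    """Distribute items over n_pods chunks as evenly as possible."""
    return [items[i::n_pods] for i in range(n_pods)]


def ideal_speedup(n_items: int, n_pods: int) -> float:
    """Speedup over a single pod, bounded by the largest chunk."""
    chunks = chunk_round_robin(list(range(n_items)), n_pods)
    largest = max(len(chunk) for chunk in chunks)
    return n_items / largest


for pods in (1, 2, 5):
    print(f"{pods} pod(s): {ideal_speedup(5, pods):.2f}x")
```

With fewer pods than cryptos the speedup is set by the fullest pod, which is why only one pod per crypto reaches the full 5×.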

Iteration Cadence¶

The tuning protocol has 10 phases with multiple factors each. Each committee review may require re-runs with adjusted parameters. Expected iteration count: 20-40 factor runs over the next few weeks.

3. Scaling Options¶

Option A — Stay on PRO2-S (no cost change)¶

Fan out the cryptos into 2 chunks (3+2), each running in its own pod, so that 2 pods run in parallel.

Metric	Current	Option A
Pods per factor	1	2
Time per factor	~13-17h	~7-9h
Full Phase 0+1a	~83h	~45h
Monthly cost delta	—	€0

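Constraint: Only 2 pods fit. 2 × 2 CPU requested = 4 cores, plus 3.05 cores requested by platform pods = 7.05 cores, which is below the 7.8 allocatable. A third pod would raise the total to 9.05 cores and could not be scheduled.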
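The fit check behind this constraint can be sketched as follows. It considers CPU requests only, taking the figures from the Resource Budget table; the function name is illustrative.

```python
import math


def max_pods(allocatable_cores: float, platform_cores: float, cores_per_pod: float) -> int:
    """Number of FTF pods whose CPU requests fit next to the platform pods."""
    free = allocatable_cores - platform_cores
    return max(0, math.floor(free / cores_per_pod))


# PRO2-S: 7.8 allocatable, 3.05 requested by platform pods, 2 cores per FTF pod.
print(max_pods(7.8, 3.05, 2.0))
```

The 4.75 free cores hold two 2-core requests, with 0.75 cores left over.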

Option B — Activate compute pool (PRO2-M autoscale)¶

Route FTF worker pods to the existing compute pool via nodeSelector. The autoscaler provisions a PRO2-M node on demand and terminates it when idle (~10 min cooldown).

5 pods × 1 crypto each, all on the PRO2-M node (16 CPU, 64GB).

Metric	Current	Option B
Pods per factor	1	5
CPU per pod	4 cores (limit)	3 cores each (15 total)
Time per factor	~13-17h	~2.5-3.5h
Full Phase 0+1a	~83h	~17h
Monthly cost (idle)	€60	€60 (compute pool = €0 when idle)
Cost per factor run	€0	~€0.50 (3h × €0.1644/h)
Cost full Phase 0+1a	€0	~€3

Key advantage: Pay only for compute hours. Node auto-terminates after runs complete.
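The per-run cost figures in the table are node hours multiplied by the hourly rate. A minimal sketch, assuming a single PRO2-M node and ignoring provisioning time and the idle cooldown before scale-down (whether those minutes are billed is not stated here):

```python
PRO2_M_EUR_PER_HOUR = 0.1644  # rate quoted in the table above


def run_cost_eur(hours: float, nodes: int = 1, rate: float = PRO2_M_EUR_PER_HOUR) -> float:
    """Compute cost of keeping `nodes` compute nodes up for `hours`."""
    return hours * nodes * rate


print(f"One factor run (3 h): EUR {run_cost_eur(3):.2f}")
print(f"Full Phase 0 + 1a (17 h): EUR {run_cost_eur(17):.2f}")
```

This gives roughly €0.49 and €2.79, matching the rounded ~€0.50 and ~€3.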

Option C — PRO2-M permanent + compute burst¶

Replace PRO2-S permanent with PRO2-M permanent. Use compute pool for additional burst capacity.

Metric	Current	Option C
Permanent node	PRO2-S (8 CPU)	PRO2-M (16 CPU)
Pods per factor (no burst)	1	5
Time per factor	~13-17h	~2.5-3.5h
Monthly cost	€60	€120

4. Recommendation¶

Option B — Activate compute pool (best cost/performance ratio)

Why¶

Compute pool already exists — no infrastructure changes needed
Pay-per-use — PRO2-M costs €0.1644/h, only billed during FTF runs
5× parallelism — 5 pods (1/crypto) on 16 CPU node
Auto scale-down — node terminates ~10 min after last pod finishes
No impact on platform — FTF pods run on separate node, platform stays on PRO2-S
Future-proof — compute pool max=4 nodes, can burst to 64 CPU if needed

Implementation (3 changes) — DONE (PR #525, merged)¶

Helm (values-prod.yaml): nodeSelector: k8s.scaleway.com/pool-name: compute ✓ deployed
DAG (dag_finetune__pte.py): Fan-out per crypto (5 parallel pods) ✓ deployed
No Scaleway changes — compute pool was already configured ✓
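To illustrate how the first two changes combine, the sketch below builds one pod description per crypto, each pinned to the compute pool through the node label from the Helm change. It is not the code from dag_finetune__pte.py: the dictionary layout, the pod naming scheme, the command-line arguments and the crypto identifiers are all hypothetical. Only the label key and the 3-core request come from this document.

```python
POOL_LABEL = "k8s.scaleway.com/pool-name"  # node label key used in values-prod.yaml


def build_pod_specs(factor: str, cryptos: list, pool: str = "compute",
                    cpu_request: str = "3") -> list:
    """Return one pod description per crypto, all routed to the given node pool."""
    return [
        {
            # Hypothetical naming scheme; Kubernetes names must be lowercase with dashes.
            "name": f"ftf-{factor}-{crypto}".lower().replace("_", "-"),
            "node_selector": {POOL_LABEL: pool},
            "cpu_request": cpu_request,
            # Hypothetical CLI arguments for the worker entry point.
            "args": ["--factor", factor, "--crypto", crypto],
        }
        for crypto in cryptos
    ]


# Placeholder crypto identifiers; the real symbols are not listed in this document.
specs = build_pod_specs("calibration", ["crypto_1", "crypto_2", "crypto_3", "crypto_4", "crypto_5"])
for spec in specs:
    print(spec["name"], spec["node_selector"])
```

In an Airflow DAG, each of these descriptions would typically become one task, so that a failure on one crypto can be retried without re-running the other four.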

Expected Performance¶

Workload	Current	After
1 factor (calibration, 225 runs)	~13h	~2.5h
1 factor (4 variants, 300 runs)	~17h	~3.5h
Phase 0 + 1a (1425 runs)	~83h	~17h
Cost per full Phase 0+1a	€0	~€3

Risk Assessment¶

Risk	Mitigation
Compute node fails to provision	Autoscaler retry + Airflow task retry (retries=1)
5 pods overload DB writes	Persistence has 3× retry + DLQ — tested at higher volumes
Node stays up (cost leak)	Autoscaler scale-to-zero after 10 min idle — already configured
Platform affected	FTF pods isolated on compute node via nodeSelector

5. Cost Projection¶

Scenario	Monthly Platform	FTF Compute	Total
Current (no change)	€60	€0	€60
Option B (recommended)	€60	~€5-15	~€65-75
Option C (PRO2-M permanent)	€120	€0	€120

FTF compute cost assumes 2-4 factor runs per week during active tuning phase.
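As a rough cross-check of the FTF compute column, the sketch below multiplies the stated cadence by the ~€0.50 per factor run estimated for Option B. Averaging a month as 52/12 weeks is an assumption made here, not something the projection specifies.

```python
COST_PER_FACTOR_RUN_EUR = 0.50  # Option B estimate: ~3 h on one PRO2-M node
WEEKS_PER_MONTH = 52 / 12       # assumption: average month length


def monthly_compute_eur(runs_per_week: float,
                        cost_per_run: float = COST_PER_FACTOR_RUN_EUR) -> float:
    """Projected monthly compute spend for a given weekly number of factor runs."""
    return runs_per_week * WEEKS_PER_MONTH * cost_per_run


for runs in (2, 4):
    print(f"{runs} runs/week: EUR {monthly_compute_eur(runs):.2f} per month")
```

This simple product lands at the lower end of the ~€5-15 range, so the upper end presumably leaves headroom for longer runs, re-runs or node idle time that the formula does not model.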

6. Implementation Checklist¶

Add nodeSelector to worker config in Helm values
Refactor DAG: fan-out per crypto (5 pods) instead of per factor (1 pod)
Verify compute pool autoscaler triggers on pending pods
Test with 1 factor run → confirm 5 pods spawn on PRO2-M
Validate auto scale-down after completion
Update OPERATIONS.md with new execution model

7. Fork safety (CVN-N011-EA-S10, 2026-04-29)¶

FTF sweep pods are particularly exposed to the fork-after-threads anti-pattern: every HPO trial triggers MLflow autolog (which spawns gRPC client threads in the parent process), and the trial body may then call libraries that internally fork() (cleanlab's multiprocessing.Pool, joblib loky workers, sklearn parallel CV, etc.). The forked child inherits a half-locked gRPC state and hangs on its first gRPC call. The pod then stays in the Running state indefinitely, consuming RAM while persisting zero rows.

Two-layer defence in place (see runbook runbook_hpo_fork_deadlock.md):

Targeted contract — cleanlab.filter.find_label_issues is invoked with n_jobs=1 (in-process, no fork). Enforced by the regression test tests/integration/test_grpc_fork_deadlock_regression.py.
Defence in depth — GRPC_ENABLE_FORK_SUPPORT=1 + GRPC_POLL_STRATEGY=poll env vars in infra/helm/airflow/values-prod.yaml (mirrored in airflow_docker/docker-compose.yaml for local parity). gRPC drains and re-initializes its channels on fork — covers any future library that adds fork-based parallelism.
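One way to make the second layer visible at runtime is a startup guard that reports when the expected environment variables are missing or have a different value. The sketch below is hypothetical and not part of the deployed code; only the two variable names and values come from the Helm configuration described above.

```python
import os

# Values expected from infra/helm/airflow/values-prod.yaml.
REQUIRED_GRPC_ENV = {
    "GRPC_ENABLE_FORK_SUPPORT": "1",
    "GRPC_POLL_STRATEGY": "poll",
}


def missing_fork_guards(env=None) -> dict:
    """Return the required variables that are absent or set to another value."""
    env = os.environ if env is None else env
    return {key: value for key, value in REQUIRED_GRPC_ENV.items() if env.get(key) != value}


problems = missing_fork_guards()
if problems:
    print(f"gRPC fork guards not active, expected: {problems}")
```

Such a check would need to run before the first gRPC channel is created, since the variables are meant to be in place when the gRPC library initializes.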

Rule of thumb when adding new factors or sweep code: if your code path inside an HPO trial calls anything that spawns processes (joblib, multiprocessing, sklearn n_jobs > 1, ray, dask), either set n_jobs=1 explicitly or add a regression test to tests/integration/test_grpc_fork_deadlock_regression.py that asserts the call completes within a strict timeout under simulated MLflow autolog pressure.
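The "completes within a strict timeout" part of such a test can be sketched with the standard library alone. The helper below is hypothetical and does not reproduce the existing regression test: it runs a snippet in a fresh interpreter and treats a timeout as a hang. Simulating MLflow autolog pressure would be the job of the snippet itself and is not shown.

```python
import subprocess
import sys


def completes_within(snippet: str, timeout_s: float) -> bool:
    """Run `snippet` in a fresh interpreter; False if it hangs or exits non-zero."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,  # the child is killed when the timeout expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# A well-behaved snippet passes; a simulated hang is reported as a failure.
print(completes_within("print('ok')", timeout_s=30))
print(completes_within("import time; time.sleep(30)", timeout_s=1))
```

Running the code under test in a separate process means a deadlocked child cannot block the test runner itself.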

GIL caveat: n_jobs=1 is safe for cleanlab because it spends most of its time in numpy / sklearn C extensions (GIL released). If you wrap a different code path in a with joblib.parallel_backend('threading') block, validate the GIL-released assumption with py-spy record + flamegraph (see plan dossier 2026-04-29-grpc-fork-deadlock-plan.md §11 recommendation #9).