FTF Scaling Recommendation: Parallel Execution Strategy
Date: 2026-04-14
Issue: #499 (Tuning Protocol)
Objective: Reduce FTF factor execution from ~13h/factor to under 2-3h/factor
1. Current State
Infrastructure
| Component | Spec | Cost |
|---|---|---|
| Permanent pool | 1× PRO2-S (8 vCPU, 32GB RAM) | €60/month |
| Compute pool | 0-4× PRO2-M (16 vCPU, 64GB each), autoscaler ON | €0 (idle) |
The compute pool already exists but has never been activated (size=0, autoscale min=0, max=4).
Resource Budget on PRO2-S
| | CPU | RAM |
|---|---|---|
| Node allocatable | 7.8 cores | 29Gi |
| Platform pods (Airflow, MLflow, API, etc.) | 3.05 cores req | 8.2Gi |
| Available for FTF | 4.75 cores | 21Gi |
| Current FTF pod | 2 cores req (bursts to 4) | 4Gi req (uses 3Gi) |
Performance Baseline (observed)
| Metric | Value |
|---|---|
| Time per run (HPO 30 trials + backtest) | ~3.5 min |
| Calibration factor (225 runs, 1 pod) | ~13h |
| Phase 1a single factor (300 runs, 1 pod) | ~17h |
| Full Phase 0 + 1a (1425 runs, sequential) | ~83h |
| CPU utilization during FTF | 100% of 4 cores (saturated) |
| RAM utilization during FTF | 3Gi / 8Gi (37%) |
Bottleneck: CPU is saturated. RAM is abundant. Only 1 pod runs at a time.
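The baseline durations follow directly from the per-run time. A minimal sketch of that arithmetic, assuming runs execute strictly one after another at a constant ~3.5 min each (the function name `sequential_hours` is illustrative, not taken from the codebase):

```python
MINUTES_PER_RUN = 3.5  # observed: HPO 30 trials + backtest


def sequential_hours(n_runs: int, minutes_per_run: float = MINUTES_PER_RUN) -> float:
    """Wall-clock hours when all runs execute sequentially in a single pod."""
    return n_runs * minutes_per_run / 60


for label, n_runs in [
    ("Calibration factor", 225),
    ("Phase 1a single factor", 300),
    ("Full Phase 0 + 1a", 1425),
]:
    print(f"{label}: {sequential_hours(n_runs):.1f} h")
```

This gives roughly 13.1 h, 17.5 h and 83.1 h, in line with the observed ~13h, ~17h and ~83h.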
2. Workload Profile
Per Factor
| Factor | Variants | Cryptos | Folds | Costs | Total Runs |
|---|---|---|---|---|---|
| calibration | 3 | 5 | 5 | 3 | 225 |
| timeframe | 4 | 5 | 5 | 3 | 300 |
| fold_size | 4 | 5 | 5 | 3 | 300 |
| n_features | 4 | 5 | 5 | 3 | 300 |
| classification_mode | 3 | 5 | 5 | 3 | 225 |
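The "Total Runs" column is the product of the four grid dimensions. A short sketch that recomputes it, and also derives each crypto's share of the runs, using only the values from the table (the helper names are illustrative):

```python
from math import prod

# (variants, cryptos, folds, costs) per factor, copied from the table above
FACTORS = {
    "calibration": (3, 5, 5, 3),
    "timeframe": (4, 5, 5, 3),
    "fold_size": (4, 5, 5, 3),
    "n_features": (4, 5, 5, 3),
    "classification_mode": (3, 5, 5, 3),
}


def total_runs(dims: tuple) -> int:
    """Size of the full grid for one factor."""
    return prod(dims)


def runs_per_crypto(dims: tuple) -> int:
    """Runs that belong to a single crypto: variants x folds x costs."""
    variants, _cryptos, folds, costs = dims
    return variants * folds * costs


for name, dims in FACTORS.items():
    print(f"{name}: {total_runs(dims)} runs, {runs_per_crypto(dims)} per crypto")
```

Each crypto accounts for 45 runs in a 3-variant factor and 60 runs in a 4-variant factor.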
Parallelization Opportunity
Within each factor, runs for different cryptos are fully independent (no shared state, no data dependency). Each crypto can run in its own pod.
With 5 cryptos → 5 parallel pods = theoretical 5× speedup.
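To make the "theoretical" qualifier concrete, the sketch below splits the cryptos across pods and computes the ideal speedup, which is bounded by the largest chunk. It assumes every crypto costs the same and that pods do not compete for CPU; the symbols `C1`..`C5` are placeholders, since the actual crypto list is not given here.

```python
from math import ceil


def chunk_round_robin(items: list, n_chunks: int) -> list:
    """Distribute items over n_chunks as evenly as possible."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def ideal_speedup(n_cryptos: int, n_pods: int) -> float:
    """Upper bound on speedup: the pod with the most cryptos finishes last."""
    return n_cryptos / ceil(n_cryptos / n_pods)


cryptos = ["C1", "C2", "C3", "C4", "C5"]  # placeholder symbols (assumption)

print(chunk_round_robin(cryptos, 2))  # a 3+2 split
print(ideal_speedup(5, 2))            # 5/3, not 2, because one pod holds 3 cryptos
print(ideal_speedup(5, 5))            # 5.0 with one crypto per pod
```

With an uneven split the speedup is set by the slowest pod, which is why one pod per crypto is the only layout that reaches the full 5×.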
Iteration Cadence
The tuning protocol has 10 phases with multiple factors each. Each committee review may require re-runs with adjusted parameters. Expected iteration count: 20-40 factor runs over the coming weeks.
3. Scaling Options
Option A: Stay on PRO2-S (no cost change)
Fan out the cryptos into 2 chunks (3+2), each running in its own pod, so that 2 pods run in parallel.
| Metric | Current | Option A |
|---|---|---|
| Pods per factor | 1 | 2 |
| Time per factor | ~13-17h | ~7-9h |
| Full Phase 0+1a | ~83h | ~45h |
| Monthly cost delta | — | €0 |
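Constraint: Only 2 pods fit by CPU request. 2 × 2 cores = 4 cores, plus 3.05 cores of platform requests, gives 7.05 of the 7.8 allocatable cores; a third pod would bring the total to 9.05 cores.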
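The fit check above can be expressed as a small calculation. This is a sketch that considers resource requests only, the way the scheduler does, and takes the "Available for FTF" row of the resource budget as input; the function name is illustrative.

```python
from math import floor


def max_pods_by_requests(free_cpu: float, free_ram_gi: float,
                         pod_cpu: float, pod_ram_gi: float) -> int:
    """Pods that fit when both CPU and RAM requests must be satisfied."""
    return min(floor(free_cpu / pod_cpu), floor(free_ram_gi / pod_ram_gi))


# PRO2-S: 4.75 cores and 21Gi available for FTF; pod requests 2 cores and 4Gi
print(max_pods_by_requests(free_cpu=4.75, free_ram_gi=21, pod_cpu=2, pod_ram_gi=4))
```

RAM alone would allow 5 pods, so the CPU request is the binding constraint.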
Option B: Activate compute pool (PRO2-M autoscale)
Route FTF worker pods to the existing compute pool via nodeSelector. The autoscaler provisions a PRO2-M node on demand and terminates it when idle (~10 min cooldown).
5 pods × 1 crypto each, all on the PRO2-M node (16 CPU, 64GB).
| Metric | Current | Option B |
|---|---|---|
| Pods per factor | 1 | 5 |
| CPU per pod | 4 cores (limit) | 3 cores each (15 total) |
| Time per factor | ~13-17h | ~2.5-3.5h |
| Full Phase 0+1a | ~83h | ~17h |
| Monthly cost (idle) | €60 | €60 (compute pool = €0 when idle) |
| Cost per factor run | €0 | ~€0.50 (3h × €0.1644/h) |
| Cost full Phase 0+1a | €0 | ~€3 |
Key advantage: Pay only for compute hours. Node auto-terminates after runs complete.
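The per-run cost figures can be reproduced as follows. This sketch assumes billing is proportional to node uptime at the quoted hourly rate and adds the ~10 min idle cooldown before scale-down; provisioning time and billing granularity are ignored, and the function name is illustrative.

```python
PRO2_M_EUR_PER_HOUR = 0.1644  # hourly rate quoted in the table above
COOLDOWN_HOURS = 10 / 60      # autoscaler idle cooldown before scale-down


def run_cost_eur(run_hours: float, nodes: int = 1,
                 cooldown_hours: float = COOLDOWN_HOURS) -> float:
    """Compute-pool cost of one run, including the idle cooldown."""
    return nodes * (run_hours + cooldown_hours) * PRO2_M_EUR_PER_HOUR


print(f"one factor (3h):    EUR {run_cost_eur(3):.2f}")
print(f"Phase 0 + 1a (17h): EUR {run_cost_eur(17):.2f}")
```

Under these assumptions a 3h factor run costs about €0.52 and a 17h Phase 0+1a pass about €2.82, consistent with the rounded ~€0.50 and ~€3 above.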
Option C: PRO2-M permanent + compute burst
Replace PRO2-S permanent with PRO2-M permanent. Use compute pool for additional burst capacity.
| Metric | Current | Option C |
|---|---|---|
| Permanent node | PRO2-S (8 CPU) | PRO2-M (16 CPU) |
| Pods per factor (no burst) | 1 | 5 |
| Time per factor | ~13-17h | ~2.5-3.5h |
| Monthly cost | €60 | €120 |
4. Recommendation
Option B: Activate compute pool (best cost/performance ratio)
Why
- Compute pool already exists — no infrastructure changes needed
- Pay-per-use — PRO2-M costs €0.1644/h, only billed during FTF runs
- 5× parallelism — 5 pods (1/crypto) on 16 CPU node
- Auto scale-down — node terminates ~10 min after last pod finishes
- No impact on platform — FTF pods run on separate node, platform stays on PRO2-S
- Future-proof — compute pool max=4 nodes, can burst to 64 CPU if needed
Implementation (3 changes): DONE (PR #525, merged)
- Helm (`values-prod.yaml`): `nodeSelector: k8s.scaleway.com/pool-name: compute` ✓ deployed
- DAG (`dag_finetune__pte.py`): Fan-out per crypto (5 parallel pods) ✓ deployed
- No Scaleway changes: compute pool was already configured ✓
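The fan-out described above can be sketched as one pod specification per crypto, each pinned to the compute pool. Only the `k8s.scaleway.com/pool-name: compute` label comes from this document. The dictionary layout, the function name, the `FTF_FACTOR` / `FTF_CRYPTO` variables, the placeholder crypto symbols and the use of 3 cores as a request are all assumptions made for illustration; the real `dag_finetune__pte.py` may be structured differently.

```python
CRYPTOS = ["C1", "C2", "C3", "C4", "C5"]  # placeholder symbols (assumption)

COMPUTE_POOL_SELECTOR = {"k8s.scaleway.com/pool-name": "compute"}  # from values-prod.yaml


def worker_pod_spec(factor: str, crypto: str) -> dict:
    """Hypothetical per-crypto pod description for one factor run."""
    return {
        # Kubernetes names must be lowercase and may not contain underscores
        "name": f"ftf-{factor}-{crypto}".lower().replace("_", "-"),
        "node_selector": dict(COMPUTE_POOL_SELECTOR),
        "env": {"FTF_FACTOR": factor, "FTF_CRYPTO": crypto},  # hypothetical variables
        "cpu_request": "3",  # assumption: 3 cores per pod, 15 in total on a 16 vCPU node
    }


pods = [worker_pod_spec("fold_size", crypto) for crypto in CRYPTOS]
for pod in pods:
    print(pod["name"], pod["node_selector"])
```

Because every pod carries the same node selector, all five land on the compute pool, and the pending pods are what prompt the autoscaler to provision a node.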
Expected Performance
| Workload | Current | After |
|---|---|---|
| 1 factor (calibration, 225 runs) | ~13h | ~2.5h |
| 1 factor (4 variants, 300 runs) | ~17h | ~3.5h |
| Phase 0 + 1a (1425 runs) | ~83h | ~17h |
| Cost per full Phase 0+1a | €0 | ~€3 |
Risk Assessment
| Risk | Mitigation |
|---|---|
| Compute node fails to provision | Autoscaler retry + Airflow task retry (retries=1) |
| 5 pods overload DB writes | Persistence has 3× retry + DLQ — tested at higher volumes |
| Node stays up (cost leak) | Autoscaler scale-to-zero after 10 min idle — already configured |
| Platform affected | FTF pods isolated on compute node via nodeSelector |
5. Cost Projection
| Scenario | Monthly Platform | FTF Compute | Total |
|---|---|---|---|
| Current (no change) | €60 | €0 | €60 |
| Option B (recommended) | €60 | ~€5-15 | ~€65-75 |
| Option C (PRO2-M permanent) | €120 | €0 | €120 |
FTF compute cost assumes 2-4 factor runs per week during the active tuning phase.
6. Implementation Checklist
- Add `nodeSelector` to worker config in Helm values
- Refactor DAG: fan-out per crypto (5 pods) instead of per factor (1 pod)
- Verify compute pool autoscaler triggers on pending pods
- Test with 1 factor run → confirm 5 pods spawn on PRO2-M
- Validate auto scale-down after completion
- Update OPERATIONS.md with new execution model
7. Fork safety (CVN-N011-EA-S10, 2026-04-29)
FTF sweep pods are particularly exposed to the fork-after-threads anti-pattern: every HPO trial wakes up MLflow autolog (which spawns gRPC client threads in the parent process), and the trial body may then call libraries that internally `fork()` (cleanlab's `multiprocessing.Pool`, joblib loky workers, sklearn parallel CV, etc.). The forked child inherits a half-locked gRPC state and hangs on its first gRPC call; the pod stays Running forever, consuming RAM and persisting zero rows.
Two-layer defence in place (see runbook `runbook_hpo_fork_deadlock.md`):
- Targeted contract: `cleanlab.filter.find_label_issues` is invoked with `n_jobs=1` (in-process, no fork). Enforced by the regression test `tests/integration/test_grpc_fork_deadlock_regression.py`.
- Defence in depth: `GRPC_ENABLE_FORK_SUPPORT=1` + `GRPC_POLL_STRATEGY=poll` env vars in `infra/helm/airflow/values-prod.yaml` (mirrored in `airflow_docker/docker-compose.yaml` for local parity). gRPC drains and re-initializes its channels on fork, which covers any future library that adds fork-based parallelism.
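Since the second layer depends on two environment variables being present in every pod, a fail-fast check at process start is one way to catch a misconfigured deployment. The sketch below is an illustration only and is not part of the runbook; the function name is hypothetical, while the variable names and values are the ones listed above.

```python
import os

REQUIRED_GRPC_ENV = {
    "GRPC_ENABLE_FORK_SUPPORT": "1",
    "GRPC_POLL_STRATEGY": "poll",
}


def missing_fork_safety_env(environ=None) -> dict:
    """Return {name: expected_value} for each variable that is unset or different."""
    env = os.environ if environ is None else environ
    return {
        name: expected
        for name, expected in REQUIRED_GRPC_ENV.items()
        if env.get(name) != expected
    }


def assert_fork_safety_env(environ=None) -> None:
    """Raise before any gRPC-using library is imported if the env is incomplete."""
    missing = missing_fork_safety_env(environ)
    if missing:
        raise RuntimeError(f"gRPC fork-safety env vars not set as expected: {missing}")
```

Calling `assert_fork_safety_env()` before importing MLflow would turn a silent hang into an immediate, visible failure.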
Rule of thumb when adding new factors or sweep code: if your code path inside an HPO trial calls anything that spawns processes (joblib, multiprocessing, sklearn `n_jobs > 1`, ray, dask), either set `n_jobs=1` explicitly or add a regression test to `tests/integration/test_grpc_fork_deadlock_regression.py` that asserts the call completes within a strict timeout under simulated MLflow autolog pressure.
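A "completes within a strict timeout" assertion needs the code under test to run in a separate process, because a deadlocked child cannot be interrupted from inside the test process. The sketch below shows only that timeout mechanism using the standard library. It does not simulate MLflow autolog pressure, and it is not the content of the existing regression test; the helper name is hypothetical.

```python
import subprocess
import sys


def completes_within(snippet: str, timeout_s: float) -> bool:
    """Run a Python snippet in a fresh interpreter.

    Returns False if it exceeds the timeout (the child is killed) or exits non-zero.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def test_trial_body_does_not_hang():
    # Placeholder snippet: a real test would enable autolog and then run
    # the process-spawning call that is being guarded.
    assert completes_within("print('trial body finished')", timeout_s=30)
```

A hanging snippet such as `import time; time.sleep(60)` makes `completes_within(..., timeout_s=1)` return `False` instead of blocking the test suite.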
GIL caveat: `n_jobs=1` is safe for cleanlab because it spends most of its time in numpy / sklearn C extensions (GIL released). If you wrap a different code path in `with joblib.parallel_backend('threading')`, validate the GIL-released assumption with `py-spy record` + flamegraph (see plan dossier `2026-04-29-grpc-fork-deadlock-plan.md` §11 reco #9).