FTF Scaling Recommendation — Parallel Execution Strategy

Date: 2026-04-14
Issue: #499 (Tuning Protocol)
Objective: Reduce FTF factor execution from ~13h/factor to <2-3h/factor


1. Current State

Infrastructure

| Component | Spec | Cost |
| --- | --- | --- |
| Permanent pool | 1× PRO2-S (8 vCPU, 32GB RAM) | €60/month |
| Compute pool | 0-4× PRO2-M (16 vCPU, 64GB each), autoscaler ON | €0 (idle) |

The compute pool already exists but has never been activated (size=0, autoscale min=0, max=4).

Resource Budget on PRO2-S

| Resource | CPU | RAM |
| --- | --- | --- |
| Node allocatable | 7.8 cores | 29Gi |
| Platform pods (Airflow, MLflow, API, etc.) | 3.05 cores req | 8.2Gi |
| Available for FTF | 4.75 cores | 21Gi |
| Current FTF pod | 2 cores req (bursts to 4) | 4Gi req (uses 3Gi) |

Performance Baseline (observed)

| Metric | Value |
| --- | --- |
| Time per run (HPO 30 trials + backtest) | ~3.5 min |
| Calibration factor (225 runs, 1 pod) | ~13h |
| Phase 1a single factor (300 runs, 1 pod) | ~17h |
| Full Phase 0 + 1a (1425 runs, sequential) | ~83h |
| CPU utilization during FTF | 100% of 4 cores (saturated) |
| RAM utilization during FTF | 3Gi / 8Gi (37%) |

Bottleneck: CPU is saturated. RAM is abundant. Only 1 pod runs at a time.
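
The baseline figures follow directly from the observed time per run. A minimal sketch of that arithmetic, assuming runs execute strictly one after another in a single pod (the function name is illustrative, not taken from the codebase):

```python
# Observed wall-clock time per run (HPO 30 trials + backtest), from the baseline table.
MINUTES_PER_RUN = 3.5


def sequential_hours(n_runs: int, minutes_per_run: float = MINUTES_PER_RUN) -> float:
    """Wall-clock hours when all runs execute back to back in one pod."""
    return n_runs * minutes_per_run / 60


for label, n_runs in [
    ("Calibration factor", 225),
    ("Phase 1a single factor", 300),
    ("Full Phase 0 + 1a", 1425),
]:
    print(f"{label}: {sequential_hours(n_runs):.1f} h")
```

This gives roughly 13 h, 17.5 h and 83 h, in line with the observed values.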


2. Workload Profile

Per Factor

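| Factor | Variants | Cryptos | Folds | Costs | Total Runs |
| --- | --- | --- | --- | --- | --- |
| calibration | 3 | 5 | 5 | 3 | 225 |
| timeframe | 4 | 5 | 5 | 3 | 300 |
| fold_size | 4 | 5 | 5 | 3 | 300 |
| n_features | 4 | 5 | 5 | 3 | 300 |
| classification_mode | 3 | 5 | 5 | 3 | 225 |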
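
The Total Runs column is the product of the four dimensions. As an illustration, the grid for one factor can be enumerated like this (index tuples only; the real variant, fold and cost definitions are not shown in this document, so integer indices stand in for them):

```python
from itertools import product

# Variant counts per factor, copied from the table above.
VARIANTS = {
    "calibration": 3,
    "timeframe": 4,
    "fold_size": 4,
    "n_features": 4,
    "classification_mode": 3,
}
N_CRYPTOS, N_FOLDS, N_COSTS = 5, 5, 3


def run_grid(factor: str) -> list[tuple[int, int, int, int]]:
    """Enumerate (variant, crypto, fold, cost) index tuples for one factor."""
    return list(
        product(
            range(VARIANTS[factor]),
            range(N_CRYPTOS),
            range(N_FOLDS),
            range(N_COSTS),
        )
    )


for factor in VARIANTS:
    print(factor, len(run_grid(factor)))
```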

Parallelization Opportunity

Within each factor, runs for different cryptos are fully independent (no shared state, no data dependency). Each crypto can run in its own pod.

With 5 cryptos → 5 parallel pods = theoretical 5× speedup.
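
A rough sketch of the per-crypto split and the resulting ideal wall-clock time. The crypto names are placeholders, and the estimate assumes the ~3.5 min per run observed on the current pod still holds inside each parallel pod, which is not guaranteed if the CPU allocation per pod changes:

```python
from collections import defaultdict
from itertools import product

CRYPTOS = ["C1", "C2", "C3", "C4", "C5"]  # placeholder names, not the real symbols


def partition_by_crypto(n_variants: int, n_folds: int = 5, n_costs: int = 3) -> dict:
    """Group one factor's runs into independent shards, one shard per crypto."""
    shards = defaultdict(list)
    for crypto, variant, fold, cost in product(
        CRYPTOS, range(n_variants), range(n_folds), range(n_costs)
    ):
        shards[crypto].append((variant, fold, cost))
    return dict(shards)


def ideal_wall_clock_hours(shards: dict, minutes_per_run: float = 3.5) -> float:
    """With one pod per shard, the largest shard determines the wall-clock time."""
    return max(len(runs) for runs in shards.values()) * minutes_per_run / 60


shards = partition_by_crypto(n_variants=3)  # calibration: 3 variants
print({crypto: len(runs) for crypto, runs in shards.items()})
print(f"{ideal_wall_clock_hours(shards):.2f} h")
```

For the calibration factor this yields 45 runs per shard, so about 2.6 h instead of about 13 h.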

Iteration Cadence

The tuning protocol has 10 phases with multiple factors each. Each committee review may require re-runs with adjusted parameters. Expected iteration count: 20-40 factor runs over the coming weeks.


3. Scaling Options

Option A — Stay on PRO2-S (no cost change)

Fan out the cryptos into 2 chunks (3+2) that run in 2 parallel pods.

| Metric | Current | Option A |
| --- | --- | --- |
| Pods per factor | 1 | 2 |
| Time per factor | ~13-17h | ~7-9h |
| Full Phase 0+1a | ~83h | ~45h |
| Monthly cost delta | - | €0 |

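Constraint: Only 2 pods fit. Two pods request 2 × 2 = 4 cores; added to the 3.05 cores requested by platform pods, that is 7.05 of the 7.8 allocatable cores. A third pod would raise the total to 9.05 cores, which exceeds the node.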
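
The same check can be written as a small helper. This is only a sketch of the CPU-request arithmetic, using the figures from the Resource Budget table; it ignores RAM and any scheduler overhead:

```python
import math


def max_pods(
    allocatable_cores: float,
    platform_request_cores: float,
    pod_request_cores: float,
) -> int:
    """Number of FTF pods whose CPU requests fit next to the platform pods."""
    free = allocatable_cores - platform_request_cores
    return max(0, math.floor(free / pod_request_cores))


# PRO2-S: 7.8 allocatable cores, 3.05 requested by platform pods, 2 per FTF pod.
print(max_pods(7.8, 3.05, 2.0))
```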

Option B — Activate compute pool (PRO2-M autoscale)

Route FTF worker pods to the existing compute pool via nodeSelector. The autoscaler provisions a PRO2-M node on demand and terminates it when idle (~10 min cooldown).

5 pods × 1 crypto each, all on the PRO2-M node (16 CPU, 64GB).

| Metric | Current | Option B |
| --- | --- | --- |
| Pods per factor | 1 | 5 |
| CPU per pod | 4 cores (limit) | 3 cores each (15 total) |
| Time per factor | ~13-17h | ~2.5-3.5h |
| Full Phase 0+1a | ~83h | ~17h |
| Monthly cost (idle) | €60 | €60 (compute pool = €0 when idle) |
| Cost per factor run | €0 | ~€0.50 (3h × €0.1644/h) |
| Cost full Phase 0+1a | €0 | ~€3 |

Key advantage: Pay only for compute hours. Node auto-terminates after runs complete.
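
The per-run cost figures can be reproduced with a short calculation. It assumes billing is proportional to node uptime at the quoted hourly rate; the optional cooldown parameter is an addition here to account for the idle window before the autoscaler removes the node:

```python
PRO2_M_EUR_PER_HOUR = 0.1644  # hourly rate quoted in this document


def node_cost_eur(
    run_hours: float,
    cooldown_minutes: float = 0.0,
    rate_eur_per_hour: float = PRO2_M_EUR_PER_HOUR,
) -> float:
    """Cost of one PRO2-M node kept up for a run plus an optional idle cooldown."""
    billed_hours = run_hours + cooldown_minutes / 60
    return billed_hours * rate_eur_per_hour


print(f"factor run (3h):        EUR {node_cost_eur(3):.2f}")
print(f"Phase 0 + 1a (17h):     EUR {node_cost_eur(17):.2f}")
print(f"factor run + cooldown:  EUR {node_cost_eur(3, cooldown_minutes=10):.2f}")
```

Including the ~10 min cooldown adds only a few cents per run.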

Option C — PRO2-M permanent + compute burst

Replace PRO2-S permanent with PRO2-M permanent. Use compute pool for additional burst capacity.

| Metric | Current | Option C |
| --- | --- | --- |
| Permanent node | PRO2-S (8 CPU) | PRO2-M (16 CPU) |
| Pods per factor (no burst) | 1 | 5 |
| Time per factor | ~13-17h | ~2.5-3.5h |
| Monthly cost | €60 | €120 |

4. Recommendation

Option B — Activate compute pool (best cost/performance ratio)

Why

  1. Compute pool already exists: no Scaleway infrastructure changes needed
  2. Pay-per-use: PRO2-M costs €0.1644/h, only billed during FTF runs
  3. 5× parallelism: 5 pods (1/crypto) on a 16 CPU node
  4. Auto scale-down: node terminates ~10 min after the last pod finishes
  5. No impact on platform: FTF pods run on a separate node, platform stays on PRO2-S
  6. Future-proof: compute pool max=4 nodes, can burst to 64 CPU if needed

Implementation (3 changes) — DONE (PR #525, merged)

  1. Helm (values-prod.yaml): nodeSelector: k8s.scaleway.com/pool-name: compute ✓ deployed
  2. DAG (dag_finetune__pte.py): Fan-out per crypto (5 parallel pods) ✓ deployed
  3. No Scaleway changes — compute pool was already configured ✓
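
To make the combined effect of changes 1 and 2 concrete, here is a sketch of what the per-crypto pods could look like, written as plain Python dicts rather than the actual Airflow/Helm code (which is not reproduced in this document). The nodeSelector label and the 3-core request come from this document; the image name, environment variable names, memory request and crypto names are hypothetical placeholders:

```python
COMPUTE_POOL_SELECTOR = {"k8s.scaleway.com/pool-name": "compute"}  # from values-prod.yaml


def ftf_pod_spec(factor: str, crypto: str, image: str = "ftf-worker:latest") -> dict:
    """Build a pod manifest for one (factor, crypto) shard. Image is a placeholder."""
    name = f"ftf-{factor}-{crypto}".lower().replace("_", "-")  # DNS-safe pod name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(COMPUTE_POOL_SELECTOR),
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "ftf",
                    "image": image,
                    # Hypothetical env var names for passing the shard to the worker.
                    "env": [
                        {"name": "FTF_FACTOR", "value": factor},
                        {"name": "FTF_CRYPTO", "value": crypto},
                    ],
                    # 3 cores per pod as in Option B; memory request is assumed.
                    "resources": {"requests": {"cpu": "3", "memory": "4Gi"}},
                }
            ],
        },
    }


def fan_out(factor: str, cryptos: list[str]) -> list[dict]:
    """One pod per crypto, all pinned to the compute pool."""
    return [ftf_pod_spec(factor, crypto) for crypto in cryptos]


pods = fan_out("calibration", ["C1", "C2", "C3", "C4", "C5"])
print([pod["metadata"]["name"] for pod in pods])
```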

Expected Performance

| Workload | Current | After |
| --- | --- | --- |
| 1 factor (calibration, 225 runs) | ~13h | ~2.5h |
| 1 factor (4 variants, 300 runs) | ~17h | ~3.5h |
| Phase 0 + 1a (1425 runs) | ~83h | ~17h |
| Cost per full Phase 0+1a | €0 | ~€3 |

Risk Assessment

| Risk | Mitigation |
| --- | --- |
| Compute node fails to provision | Autoscaler retry + Airflow task retry (retries=1) |
| 5 pods overload DB writes | Persistence has 3× retry + DLQ; tested at higher volumes |
| Node stays up (cost leak) | Autoscaler scale-to-zero after 10 min idle; already configured |
| Platform affected | FTF pods isolated on compute node via nodeSelector |

5. Cost Projection

| Scenario | Monthly Platform | FTF Compute | Total |
| --- | --- | --- | --- |
| Current (no change) | €60 | €0 | €60 |
| Option B (recommended) | €60 | ~€5-15 | ~€65-75 |
| Option C (PRO2-M permanent) | €120 | €0 | €120 |

FTF compute cost assumes 2-4 factor runs per week during active tuning phase.


6. Implementation Checklist

  • Add nodeSelector to worker config in Helm values
  • Refactor DAG: fan-out per crypto (5 pods) instead of per factor (1 pod)
  • Verify compute pool autoscaler triggers on pending pods
  • Test with 1 factor run → confirm 5 pods spawn on PRO2-M
  • Validate auto scale-down after completion
  • Update OPERATIONS.md with new execution model

7. Fork safety (CVN-N011-EA-S10, 2026-04-29)

FTF sweep pods are particularly exposed to the fork-after-threads anti-pattern: every HPO trial wakes up MLflow autolog (which spawns gRPC client threads in the parent process), and the trial body may then call libraries that internally fork() (cleanlab's multiprocessing.Pool, joblib loky workers, sklearn parallel CV, etc.). The forked child inherits a half-locked gRPC state and hangs on its first gRPC call: the pod stays Running forever, consuming RAM and persisting zero rows.

Two-layer defence in place (see runbook runbook_hpo_fork_deadlock.md):

  1. Targeted contract: cleanlab.filter.find_label_issues is invoked with n_jobs=1 (in-process, no fork). Enforced by the regression test tests/integration/test_grpc_fork_deadlock_regression.py.
  2. Defence in depth: GRPC_ENABLE_FORK_SUPPORT=1 + GRPC_POLL_STRATEGY=poll env vars in infra/helm/airflow/values-prod.yaml (mirrored in airflow_docker/docker-compose.yaml for local parity). gRPC drains and re-initializes its channels on fork, which covers any future library that adds fork-based parallelism.
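
As an illustration of the two layers, a worker could verify at startup that the gRPC variables are present and force n_jobs=1 on the cleanlab call. The helper names below are hypothetical and not part of the codebase; only the variable names, their values and the n_jobs=1 contract come from this section:

```python
import os

# Values set in infra/helm/airflow/values-prod.yaml according to this section.
REQUIRED_GRPC_ENV = {
    "GRPC_ENABLE_FORK_SUPPORT": "1",
    "GRPC_POLL_STRATEGY": "poll",
}


def missing_fork_safety_env(environ=None) -> dict:
    """Return the expected variables that are absent or set to another value."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in REQUIRED_GRPC_ENV.items() if env.get(k) != v}


def in_process_kwargs(**overrides) -> dict:
    """Keyword arguments for cleanlab.filter.find_label_issues with n_jobs forced to 1."""
    kwargs = dict(overrides)
    kwargs["n_jobs"] = 1  # in-process execution, no fork
    return kwargs


problems = missing_fork_safety_env()
if problems:
    print(f"fork-safety env vars missing or different: {sorted(problems)}")
```

A call site would then look like find_label_issues(labels, pred_probs, **in_process_kwargs()), so that a caller cannot accidentally re-enable multiprocessing.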

Rule of thumb when adding new factors / sweep code: if your code path inside an HPO trial calls anything that spawns processes (joblib, multiprocessing, sklearn n_jobs > 1, ray, dask), either set n_jobs=1 explicitly OR add a regression test under tests/integration/test_grpc_fork_deadlock_regression.py that asserts the call completes within a strict timeout under simulated MLflow autolog pressure.
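
The timeout part of such a regression test could be built on a helper like the one below. This is a generic sketch, not the content of the existing test file: it only detects that a call failed to finish in time and does not simulate MLflow autolog pressure, which the real test would have to set up around it.

```python
import threading
import time


def completes_within(fn, timeout_s: float, *args, **kwargs) -> bool:
    """Run fn in a daemon thread; True if it returns within timeout_s seconds.

    Exceptions raised by fn are re-raised in the caller. A call that hangs
    is left running in the daemon thread and reported as False.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # propagate any failure to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return False
    if "error" in outcome:
        raise outcome["error"]
    return True


# A fast call passes; a call that outlives the budget is flagged.
print(completes_within(sum, 1.0, range(1000)))
print(completes_within(time.sleep, 0.1, 1.0))
```

In a test, the assertion would be assert completes_within(call_under_test, timeout_s=...), with the timeout chosen well above the normal duration of the call.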

GIL caveat: n_jobs=1 is safe for cleanlab because it spends most of its time in numpy / sklearn C extensions (GIL released). If you wrap a different code path in a with joblib.parallel_backend('threading') block, validate the GIL-released assumption with py-spy record + flamegraph (see plan dossier 2026-04-29-grpc-fork-deadlock-plan.md §11 reco #9).