Scaleway Compute Sizing: Definitive Architecture
Version: 3.0
Date: 2026-04-15
Issue: #553
Status: V3 after 2nd committee review (V1: 5.4/10, V2: 6.4/10)
Budget: EUR 200/month maximum
Guarantees:

- RAM: > 40% margin on all scenarios (hard guarantee)
- CPU: > 15% margin on peak burst with max_active_runs=2 (hard guarantee)
- Zero OOM: 4-layer prevention
- Zero permanent failures: retries=1 catches transient issues
1. Problem Statement
Over the past 2 days, the compute sizing was changed 5 times, causing OOM kills, CPU throttling, stuck pods, and 4 interrupted FTF sessions. Root cause: no definitive sizing plan — each issue was patched reactively.
This document defines a sizing architecture that handles ALL current and planned workloads with comfortable margins, within budget.
2. Platform Architecture
┌─────────────────────────────────────────────────────────────────┐
│ PERMANENT POOL — PRO2-S (8 CPU, 32Gi) │
│ Always on. EUR 60/month. │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Airflow │ │ MLflow │ │ CVNTrade │ │ Grafana │ │
│ │ sched+ │ │ │ │ API+FE+ │ │ + Redis │ │
│ │ web+trig │ │ │ │ runtime │ │ + ZenML │ │
│ │ 950m/1.8G│ │ 250m/512M│ │ 1.6/5.3G │ │ 350m/450M│ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Total: 3.1 CPU / 8.2Gi used. Margin: 61% CPU, 74% RAM. │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ COMPUTE POOL — 0 to 6 × PRO2-M (16 CPU, 64Gi each) │
│ Autoscale. EUR 0 when idle. EUR 0.16/h per node when active. │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ PRO2-M #1 PRO2-M #2 PRO2-M #3 ... │ │
│ │ 16 CPU, 64Gi 16 CPU, 64Gi 16 CPU, 64Gi │ │
│ │ │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │ FTF │ │ FTF │ │ FTF │ │ FTF │ │ Disc│ │ WFRB│ │ │
│ │ │pod 1│ │pod 2│ │pod 3│ │pod 4│ │overy│ │ │ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Scale-up: ~3 min (pod Pending → node provisioned → pod Running) │
│ Scale-down: ~10 min after last pod terminates │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
│ │
│ PostgreSQL (Scaleway Managed) ~EUR 20/month │
│ S3 (MLflow artifacts) ~EUR 5/month │
└─────────────────────────────────────────────────────────────────┘
3. Workload Profiles
3.1 FTF Normal (CUSUM enabled)
The standard FTF ablation run. Tests factors like calibration, timeframe, fold_size, etc.
| Parameter | Value |
|---|---|
| Trigger | Manual (Airflow UI or CLI) |
| Cryptos | 5 (defi_top5) |
| Pods per factor | 5 (1/crypto) |
| Parallel factors | 2 (max_active_runs=2) |
| Peak pods | 10 |
| Training samples | ~1,000-2,000 (CUSUM event mode) |
| Features | 50-80 (after feature cap fix #552) |
| HPO trials | 15 |
| CPU per pod | 2 req / 4 limit (observed: 3-4 cores) |
| RAM per pod | 2Gi req / 8Gi limit (observed: 1-2Gi) |
| Duration per factor | ~2.5h |
| Pod profile | standard |
3.2 FTF Heavy (CUSUM disabled/relaxed)
Tests the cusum_training_mode factor with CUSUM disabled or relaxed, which yields far more training samples than the normal run.
| Parameter | Value |
|---|---|
| Training samples | 5,000-20,000 (CUSUM disabled/relaxed) |
| Features | 50-272 (depends on feature cap) |
| HPO trials | 15 |
| CPU per pod | 2 req / 4 limit (observed: 4 cores, CPU-bound) |
| RAM per pod | 12Gi req / 24Gi limit (observed: 10-15Gi) |
| Duration per factor | 8-10h |
| Parallel factors | 1 only (no concurrent factors during heavy runs) |
| Peak pods | 5 |
| Pod profile | heavy |
Also applies to: n_features=full variant (272 features); btc_features (any
variant — its BTC cross-asset overlay loads a 2nd asset's full feature store,
~2× single-asset memory; OOMKilled at 8Gi in the CVN-N001-EI-S12 dry-run, #1123).
3.3 Discovery
Model screening across crypto universe. Identifies promising candidates.
| Parameter | Value |
|---|---|
| Pods | 5 (1/crypto) |
| CPU per pod | 2 req / 4 limit |
| RAM per pod | 2Gi req / 8Gi limit (observed: 3-4Gi) |
| Duration | ~1h |
| Pod profile | standard |
3.4 Testing / WFRB (Walk-Forward Rebalance Backtest)
Post-training validation. Backtests the trained model on walk-forward folds.
| Parameter | Value |
|---|---|
| Pods | 5 (1/crypto) |
| CPU per pod | 1 req / 2 limit |
| RAM per pod | 2Gi req / 4Gi limit (observed: 2-3Gi) |
| Duration | ~30 min |
| Pod profile | standard |
4. Pod Profiles
Standard Profile
Used for: FTF normal, Discovery, WFRB. Ratio limit/request: 2× CPU, 4× RAM — generous burst headroom.
Heavy Profile
Used for: FTF cusum_training_mode, FTF n_features=full, FTF btc_features. Ratio limit/request: 2× CPU, 2× RAM.
Profile Auto-Detection
The DAG selects the profile automatically based on the factor (factor-level:
the whole factor routes to one profile — there is no per-variant routing). The
single source of truth is commun.finetune.power_mode.is_heavy_factor, used by
the DAG for pod_override and by forecast_resources for the profile name:

    # commun/finetune/power_mode.py: single source of truth
    HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})

    def is_heavy_factor(factor_name: str) -> bool:
        """Return True when the whole factor must run on the heavy pod profile."""
        return factor_name in HEAVY_FACTORS

When heavy is detected:

1. Pod uses heavy resource profile
2. max_active_runs is effectively 1 for this factor (no concurrent heavy runs)
3. Minimum 2 compute nodes guaranteed
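
The routing described above can be sketched as a small lookup from factor name to resource profile. This is an illustrative sketch only: `PodProfile`, `select_profile`, and `to_k8s_resources` are hypothetical names, not the project's actual DAG code. The request/limit values are the ones listed in sections 3.1 and 3.2.

```python
from dataclasses import dataclass

# Mirrors HEAVY_FACTORS as quoted from commun/finetune/power_mode.py.
HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})


@dataclass(frozen=True)
class PodProfile:
    """Hypothetical container for one profile's requests and limits."""
    name: str
    cpu_request: str
    cpu_limit: str
    memory_request: str
    memory_limit: str


# Values from the workload profile tables (standard: 3.1, heavy: 3.2).
STANDARD = PodProfile("standard", "2", "4", "2Gi", "8Gi")
HEAVY = PodProfile("heavy", "2", "4", "12Gi", "24Gi")


def is_heavy_factor(factor_name: str) -> bool:
    return factor_name in HEAVY_FACTORS


def select_profile(factor_name: str) -> PodProfile:
    """Route the whole factor to one profile (no per-variant routing)."""
    return HEAVY if is_heavy_factor(factor_name) else STANDARD


def to_k8s_resources(profile: PodProfile) -> dict:
    """Render the profile in the shape of a Kubernetes resources stanza."""
    return {
        "requests": {"cpu": profile.cpu_request, "memory": profile.memory_request},
        "limits": {"cpu": profile.cpu_limit, "memory": profile.memory_limit},
    }
```

For example, `select_profile("btc_features")` yields the heavy profile, while an unlisted factor such as `timeframe` falls back to standard.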
5. Node Allocation & Margins
Scheduling Rules
| Rule | Mechanism |
|---|---|
| Standard pods schedule anywhere | nodeSelector: compute pool |
| Heavy pods require 2 nodes minimum | RAM requests: 12Gi × 5 pods = 60Gi, which consumes all of one node's allocatable RAM (~60Gi of 64Gi) and leaves no headroom |
| No heavy + normal concurrent | Heavy factor runs alone (max_active_runs enforcement) |
| Platform pods never on compute | nodeSelector: permanent pool |
Margin Analysis
All margins calculated on BURST usage (CPU limit, RAM observed peak), not requests. No aspirational numbers — only what we've measured.
max_active_runs=2 for FTF normal (V3 decision — 3 caused stuck pods in practice).
| Scenario | Pods | CPU burst | RAM peak | Nodes | CPU avail | RAM avail | CPU margin | RAM margin |
|---|---|---|---|---|---|---|---|---|
| FTF normal (2 factors) | 10 | 40 | 20Gi | 3 | 48 | 192Gi | 17% ✓ | 90% ✓ |
| FTF heavy (1 factor) | 5 | 20 | 60Gi | 2 | 32 | 128Gi | 38% ✓ | 53% ✓ |
| Discovery | 5 | 20 | 20Gi | 2 | 32 | 128Gi | 38% ✓ | 84% ✓ |
| WFRB | 5 | 10 | 10Gi | 1 | 16 | 64Gi | 38% ✓ | 84% ✓ |
| FTF normal + Discovery | 15 | 60 | 40Gi | 4 | 64 | 256Gi | 6% ⚠️ | 84% ✓ |
| FTF heavy + WFRB | 10 | 30 | 70Gi | 2 | 32 | 128Gi | 6% ⚠️ | 45% ✓ |
| Worst case (all) | 20 | 70 | 80Gi | 5 | 80 | 320Gi | 13% ✓ | 75% ✓ |
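
The margin columns follow from simple arithmetic: margin = 1 − used / available, with available capacity being the node count times the PRO2-M size. The sketch below reproduces that calculation. Deriving the node count as the ceiling of CPU burst over 16 cores is an assumption on my part; it happens to match every row of the table, but the document does not state it as the rule.

```python
import math

NODE_CPU = 16      # PRO2-M cores
NODE_RAM_GI = 64   # PRO2-M memory, nominal


def nodes_for_cpu_burst(cpu_burst: float) -> int:
    """Assumed rule: enough nodes to cover the CPU burst (limit × pods)."""
    return math.ceil(cpu_burst / NODE_CPU)


def margin_pct(used: float, available: float) -> float:
    """Free capacity as a percentage of available capacity."""
    return 100.0 * (1.0 - used / available)


def scenario_margins(cpu_burst: float, ram_peak_gi: float):
    """Return (nodes, cpu_margin_pct, ram_margin_pct) for one scenario."""
    nodes = nodes_for_cpu_burst(cpu_burst)
    cpu_margin = margin_pct(cpu_burst, nodes * NODE_CPU)
    ram_margin = margin_pct(ram_peak_gi, nodes * NODE_RAM_GI)
    return nodes, cpu_margin, ram_margin


# (CPU burst, RAM peak in Gi) taken from the table rows.
SCENARIOS = {
    "FTF normal (2 factors)": (40, 20),
    "FTF heavy (1 factor)": (20, 60),
    "FTF normal + Discovery": (60, 40),
}

for name, (cpu, ram) in SCENARIOS.items():
    nodes, cpu_m, ram_m = scenario_margins(cpu, ram)
    print(f"{name}: {nodes} nodes, CPU margin {cpu_m:.1f}%, RAM margin {ram_m:.1f}%")
```

Re-running this whenever a pod limit or max_active_runs changes gives a quick check that the guarantees in the header still hold.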
Margin Guarantees: Honest Assessment
| Dimension | Guarantee | Justification |
|---|---|---|
| RAM (all scenarios) | > 40% | Hard guarantee. Sizing verified on observed peak, not requests. |
| CPU (FTF alone) | > 15% | 10 pods × 4 cores = 40 on 48 available (3 nodes). Proven in practice. |
| CPU (concurrent workloads) | Best-effort (6-13%) | FTF + Discovery concurrent = 60 cores on 64. Minor throttling possible during peak overlap. Runs complete without failure, ~5-10% slower at worst. |
| CPU throttling impact | < 10% duration increase | Throttling only occurs when ALL pods are CPU-bound simultaneously. I/O and data loading phases naturally stagger CPU demand. |
What we do NOT promise: Zero throttling when FTF + Discovery run simultaneously. This is a cost trade-off: eliminating it would require holding 5 nodes for the duration of every concurrent run.
What we DO promise: No run fails, no OOM, no stuck pods. Throttling adds at most 10% to run duration in the concurrent scenario.
Concurrency Rules (V2 reco #7)
| Rule | Enforcement | Rationale |
|---|---|---|
| FTF normal: max 2 concurrent factors | max_active_runs=2 in DAG | 10 pods × 4 CPU = 40 burst on 3 nodes (48 cores) = 17% margin |
| FTF heavy: max 1 concurrent factor | DAG auto-detects heavy → serializes | 5 pods × 4 CPU = 20 on 2 nodes (32 cores) = 38% margin |
| FTF heavy + normal: NOT concurrent | Heavy blocks normal until complete | Prevents resource contention |
| FTF + Discovery: concurrent OK | 15 pods total, 4 nodes | CPU margin 6% — acceptable, runs complete |
| FTF + WFRB: concurrent OK | 15 pods total, 3-4 nodes | WFRB is lighter (2 CPU/pod) |
Worst case: FTF normal (10 pods) + Discovery (5 pods) + WFRB (5 pods) = 20 pods. Requires 5 nodes. Pool max=6, 1 spare. This is realistic — a user could trigger all three.
Cannot happen: FTF heavy + FTF normal concurrent (DAG-enforced). FTF heavy + Discovery is possible but unlikely (different workflows).
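
As an illustration of the rules above, the following sketch checks whether a set of workloads may overlap and how many compute nodes their combined CPU burst needs. `Workload` and `check_concurrency` are hypothetical names; the real enforcement lives in the DAG (max_active_runs and heavy-factor serialization). The node count again assumes the ceiling of CPU burst over 16 cores.

```python
import math
from enum import Enum
from typing import Iterable, Tuple


class Workload(Enum):
    FTF_NORMAL = "ftf_normal"
    FTF_HEAVY = "ftf_heavy"
    DISCOVERY = "discovery"
    WFRB = "wfrb"


# Peak pods and per-pod CPU limit, from the workload profiles in section 3.
PEAK_PODS = {Workload.FTF_NORMAL: 10, Workload.FTF_HEAVY: 5,
             Workload.DISCOVERY: 5, Workload.WFRB: 5}
CPU_LIMIT = {Workload.FTF_NORMAL: 4, Workload.FTF_HEAVY: 4,
             Workload.DISCOVERY: 4, Workload.WFRB: 2}

# DAG-enforced exclusion: heavy and normal FTF never overlap.
FORBIDDEN = [frozenset({Workload.FTF_HEAVY, Workload.FTF_NORMAL})]

NODE_CPU = 16        # PRO2-M cores
POOL_MAX_NODES = 6   # autoscaler max size


def check_concurrency(active: Iterable[Workload]) -> Tuple[bool, int]:
    """Return (allowed, nodes_needed) for simultaneously active workloads."""
    active_set = frozenset(active)
    if any(pair <= active_set for pair in FORBIDDEN):
        return False, 0
    cpu_burst = sum(PEAK_PODS[w] * CPU_LIMIT[w] for w in active_set)
    nodes = math.ceil(cpu_burst / NODE_CPU)
    return nodes <= POOL_MAX_NODES, nodes
```

With these inputs, the worst case of FTF normal + Discovery + WFRB is allowed and needs 5 nodes, one below the pool maximum.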
6. OOM Prevention: 4 Layers
| Layer | Mechanism | Catches |
|---|---|---|
| 1. Pod limit | RAM limit = 2-4× request | Prevents single pod from consuming all node RAM |
| 2. Node margin | > 40% RAM free (except CPU-bound scenarios) | Absorbs burst without eviction |
| 3. Pre-flight guardrail | Estimate memory before training, BLOCK if > 70% node capacity | Catches misconfiguration before expensive training (#551) |
| 4. Profile auto-detection | DAG selects heavy profile for high-data variants | Prevents standard pods from running heavy workloads |
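
Layer 3 can be pictured as a simple gate that runs before training starts. The sketch below is a minimal version under stated assumptions: the memory estimate is supplied by the caller (how #551 actually estimates it is not described here), and `PreflightBlocked` and `preflight_check` are hypothetical names. Only the 70% threshold and the 64Gi node size come from this document.

```python
class PreflightBlocked(RuntimeError):
    """Raised when the estimated peak memory exceeds the allowed budget."""


def preflight_check(estimated_peak_gi: float,
                    node_capacity_gi: float = 64.0,
                    max_fraction: float = 0.70) -> float:
    """Block the run if the estimate exceeds max_fraction of node capacity.

    Returns the remaining headroom in Gi when the run is allowed.
    """
    budget_gi = node_capacity_gi * max_fraction
    if estimated_peak_gi > budget_gi:
        raise PreflightBlocked(
            f"estimated {estimated_peak_gi:.1f}Gi exceeds "
            f"{max_fraction:.0%} of node capacity ({budget_gi:.1f}Gi)"
        )
    return budget_gi - estimated_peak_gi
```

Failing fast here costs seconds, whereas an OOMKill several hours into a heavy run costs the whole session.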
OOM Scenarios and Mitigations
| Scenario | Risk | Mitigation |
|---|---|---|
| cusum_training_mode=disabled | 10-15Gi per pod | Heavy profile (12Gi req, 24Gi limit) + 2 nodes |
| n_features=full (272 features) | 5-8Gi per pod | Heavy profile + feature cap fix (#552) |
| Memory leak during HPO | Slow growth across successive HPO trials | Pod limit (8/24Gi) + Optuna trial isolation |
| Concurrent heavy + normal | 80Gi+ total | Never concurrent — heavy runs alone |
| Node memory fragmentation | Random OOM despite margin | retries=1 catches transient failures |
7. Retry Policy
| Parameter | Value | Rationale |
|---|---|---|
| `retries` | 1 | Catch transient failures (GC spike, node preemption) |
| `retry_delay` | 60s | Let node recover |
| `retry_exponential_backoff` | True | Delay doubles per retry (60s → 120s); with retries=1 only the first 60s delay is ever used |
| `max_retry_delay` | 300s | Cap at 5 min |
| Pod restart policy | Never | Airflow manages lifecycle |
Design intent: With proper margins, retries should NEVER trigger. If a task uses its retry, it's a signal that sizing is wrong — investigate before the next session.
Monitoring: Grafana panel for retry count. Alert if retries > 0 in any session.
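
Expressed as Airflow task arguments, the policy above might look like the following sketch. The four keys are standard Airflow operator arguments; the `DEFAULT_ARGS` name and the `nominal_retry_delay` helper are illustrative assumptions. The helper models a plain doubling schedule, whereas Airflow's own backoff computation may differ in detail, so treat the printed values as nominal.

```python
from datetime import timedelta

# Would typically be passed as default_args when constructing the DAG.
DEFAULT_ARGS = {
    "retries": 1,
    "retry_delay": timedelta(seconds=60),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(seconds=300),
}


def nominal_retry_delay(attempt: int, args: dict = DEFAULT_ARGS) -> timedelta:
    """Idealised delay before retry number `attempt` (1-based).

    Doubles the base delay per attempt and caps it at max_retry_delay.
    """
    base = args["retry_delay"]
    if not args["retry_exponential_backoff"]:
        return base
    return min(base * (2 ** (attempt - 1)), args["max_retry_delay"])


for attempt in range(1, DEFAULT_ARGS["retries"] + 1):
    print(f"retry {attempt}: wait about {nominal_retry_delay(attempt)}")
```

With retries=1 the loop prints a single 60-second wait; the backoff and cap only come into play if the retry count is raised later.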
8. Autoscaler Configuration
| Parameter | Value |
|---|---|
| Pool name | compute |
| Node type | PRO2-M (16 CPU, 64Gi) |
| Min size | 0 |
| Max size | 6 |
| Scale-up trigger | Pod Pending (insufficient resources) |
| Scale-up time | ~3 min (Scaleway SLA: < 5 min) |
| Scale-down trigger | Node idle (no pods) for 10 min |
| Scale-down protection | Pods running → node NOT drained |
Scale-Up Behavior
t=0: FTF triggered → 10 pods created → Pending
t=0: Autoscaler detects Pending pods
t=1m: Scaleway API: provision PRO2-M node #1
t=2m: Scaleway API: provision PRO2-M node #2 (if needed)
t=3m: Nodes Ready → pods scheduled → Running
t=3m: If still Pending pods → provision node #3
Scale-Down Behavior
t=0: Last FTF pod completes → Succeeded
t=10m: Autoscaler: node #3 idle → drain + terminate
t=10m: Autoscaler: node #2 idle → drain + terminate
t=10m: Autoscaler: node #1 idle → drain + terminate
t=10m: Compute pool size = 0. Cost = EUR 0.
9. Cost Model
Unit Costs
| Resource | Cost |
|---|---|
| PRO2-S (permanent, 24/7) | EUR 60/month |
| PRO2-M (per node per hour) | EUR 0.1644/h |
| PostgreSQL managed | EUR 20/month |
| S3 storage | EUR 5/month |
Monthly Projections
Normal Month (3 FTF sessions/week, daily discovery)
| Component | Calculation | Cost |
|---|---|---|
| Permanent | 1 × PRO2-S × 730h | EUR 60 |
| FTF normal | 12 sessions × 4h × 3 nodes × EUR 0.16 | EUR 23 |
| FTF heavy | 2 sessions × 10h × 2 nodes × EUR 0.16 | EUR 6 |
| Discovery | 20 sessions × 1h × 2 nodes × EUR 0.16 | EUR 6 |
| WFRB | 20 sessions × 0.5h × 1 node × EUR 0.16 | EUR 2 |
| DB + S3 | fixed | EUR 25 |
| Total | | EUR 122 |
Note: FTF normal takes 4h per session with max_active_runs=2 (vs 2.5h with 3). Trade-off: reliability over speed.
Heavy Month (5 FTF sessions/week, cusum experiments, daily everything)
| Component | Calculation | Cost |
|---|---|---|
| Permanent | 1 × PRO2-S × 730h | EUR 60 |
| FTF normal | 20 sessions × 4h × 3 nodes × EUR 0.16 | EUR 38 |
| FTF heavy | 4 sessions × 10h × 2 nodes × EUR 0.16 | EUR 13 |
| Discovery | 30 sessions × 1h × 2 nodes × EUR 0.16 | EUR 10 |
| WFRB | 30 sessions × 0.5h × 1 node × EUR 0.16 | EUR 2 |
| DB + S3 | fixed | EUR 25 |
| Total | | EUR 148 |
Worst Month (intensive research, multiple cusum experiments)
| Component | Calculation | Cost |
|---|---|---|
| Permanent | 1 × PRO2-S × 730h | EUR 60 |
| FTF normal | 25 sessions × 4h × 3 nodes × EUR 0.16 | EUR 48 |
| FTF heavy | 8 sessions × 10h × 2 nodes × EUR 0.16 | EUR 26 |
| Discovery + WFRB | 40 × 1.5h × 2 nodes × EUR 0.16 | EUR 19 |
| DB + S3 | fixed | EUR 25 |
| Total | | EUR 178 |
All scenarios under EUR 200 budget. Worst month margin: EUR 22 (11%).
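
The projections above can be recomputed with a few lines of arithmetic, which makes it easy to test other session counts against the budget. This is a sketch using the rounded EUR 0.16 node-hour rate from the tables; the function and variable names are illustrative.

```python
NODE_HOUR_EUR = 0.16          # rounded PRO2-M rate used in the tables
FIXED_EUR = 60 + 20 + 5       # PRO2-S + managed PostgreSQL + S3
BUDGET_EUR = 200


def compute_cost(sessions: float, hours: float, nodes: int,
                 rate: float = NODE_HOUR_EUR) -> float:
    """Cost of one workload line: sessions × hours × nodes × hourly rate."""
    return sessions * hours * nodes * rate


def monthly_total(lines, rate: float = NODE_HOUR_EUR) -> float:
    """Fixed costs plus all autoscaled compute lines."""
    return FIXED_EUR + sum(compute_cost(s, h, n, rate) for s, h, n in lines)


# (sessions, hours per session, nodes) per workload, from the three tables.
NORMAL = [(12, 4, 3), (2, 10, 2), (20, 1, 2), (20, 0.5, 1)]
HEAVY = [(20, 4, 3), (4, 10, 2), (30, 1, 2), (30, 0.5, 1)]
WORST = [(25, 4, 3), (8, 10, 2), (40, 1.5, 2)]

for label, lines in (("normal", NORMAL), ("heavy", HEAVY), ("worst", WORST)):
    total = monthly_total(lines)
    print(f"{label}: EUR {total:.0f} (budget headroom EUR {BUDGET_EUR - total:.0f})")
```

Passing `rate=0.1644` (the unrounded unit cost) raises the worst month to roughly EUR 180, still inside the budget.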
10. Observability & Alerting (V2 reco #2)
| Metric | Source | Alert threshold | Action |
|---|---|---|---|
| Pod CPU actual vs limit | Prometheus: `container_cpu_usage / container_resource_limits` | > 90% sustained 10 min | Possible throttling; check if more nodes are needed |
| Pod RAM actual vs limit | Prometheus: `container_memory_working_set / container_resource_limits` | > 80% | OOM risk; investigate |
| Node CPU utilization | Prometheus: `instance:node_cpu_utilisation` | > 85% | Scale up or reduce parallelism |
| Node RAM utilization | Prometheus: `1 - MemAvailable/MemTotal` | > 75% | Scale up |
| Pending pods | kube-state-metrics: `kube_pod_status_phase{phase=Pending}` | > 0 for 5 min | Autoscaler may be stuck |
| Autoscaler scale-up time | Event annotations | > 5 min | Investigate Scaleway API |
| Run duration | PostgreSQL: `elapsed_s` | > 600s per run (10 min) | Possible misconfiguration |
| Retry count | Airflow task instance | > 0 | Sizing may be wrong; investigate |
| External dependencies | Probes: MLflow, PostgreSQL, Redis, S3 | Any down | Alert immediately |
All metrics visible on Grafana Infrastructure Monitoring dashboard (#529).
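
To make the ratio-based thresholds concrete, here is a minimal sketch of how they could be evaluated over a series of samples. In practice these checks would live in Prometheus or Grafana alert rules; the metric keys and function names below are hypothetical, and only the threshold values come from the table.

```python
from typing import Sequence

# Alert thresholds as utilisation ratios, taken from the table above.
THRESHOLDS = {
    "pod_cpu_vs_limit": 0.90,
    "pod_ram_vs_limit": 0.80,
    "node_cpu": 0.85,
    "node_ram": 0.75,
}


def breached(metric: str, ratio: float) -> bool:
    """True when a single sample exceeds the metric's threshold."""
    return ratio > THRESHOLDS[metric]


def sustained_breach(metric: str, samples: Sequence[float], window: int) -> bool:
    """True when the last `window` samples all exceed the threshold.

    With one sample per minute, window=10 models "sustained 10 min".
    """
    if len(samples) < window:
        return False
    return all(breached(metric, s) for s in samples[-window:])
```

The sustained variant matters for pod CPU, where short bursts to the limit are expected and only a 10-minute plateau suggests throttling.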
11. Security (V2 reco #3)
| Layer | Mechanism | Status |
|---|---|---|
| RBAC | Scaleway IAM roles: admin + CI service account. Namespace-level RBAC in K8s. | Configured |
| Network policies | Compute pods → PostgreSQL, MLflow, Redis only. No external egress except Binance API. | To implement |
| Secrets management | Kubernetes Secrets (to migrate from ConfigMap, #527). No secrets in git. | In progress |
| Image scanning | CI builds with pinned base image. No `latest` tag (ADR-3). | Configured |
| Runtime security | Pod security standards: non-root, read-only root FS where possible. | Partial |
| Trust boundaries | Permanent pool (platform) isolated from compute pool (workloads) via nodeSelector. | Configured |
12. Growth Strategy (V2 reco #4)
| Milestone | Compute need | Architecture |
|---|---|---|
| Current (5 cryptos, FTF) | 4 PRO2-M peak | Pool max=6 |
| defi_full (17 cryptos) | ~14 PRO2-M peak (17 pods × 4 CPU) | Pool max=16 or switch to PRO2-L (32 CPU) |
| Live trading (paper + live) | +2 PRO2-M permanent | Dedicated trading pool (always-on, low-latency) |
| Multi-strategy | ×N of FTF | Separate pools per strategy or time-share |
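Decision point: When defi_full is needed, evaluate PRO2-L (32 CPU, EUR 0.33/h) vs more PRO2-M. At the prices quoted in this document the per-core cost is nearly identical (EUR 0.33/32 vs EUR 0.1644/16, both about EUR 0.0103 per core-hour), so unit price alone does not favour either option.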
Budget projection for growth:

- defi_full FTF: ~EUR 250/month → exceeds EUR 200 budget → need budget revision
- Live trading: ~EUR 100/month additional → separate budget line
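
For the PRO2-L vs PRO2-M decision, a quick per-core-hour comparison from the unit prices quoted in this document can serve as a starting point. This is only a sketch: the prices should be re-checked against current Scaleway pricing at the decision point, and the dictionary layout is illustrative.

```python
# Hourly prices and core counts as quoted in this document.
PRO2_M = {"cpu": 16, "eur_per_hour": 0.1644}
PRO2_L = {"cpu": 32, "eur_per_hour": 0.33}


def eur_per_core_hour(node: dict) -> float:
    """Hourly price divided by core count."""
    return node["eur_per_hour"] / node["cpu"]


m = eur_per_core_hour(PRO2_M)
l = eur_per_core_hour(PRO2_L)
print(f"PRO2-M: EUR {m:.5f} per core-hour")
print(f"PRO2-L: EUR {l:.5f} per core-hour")
print(f"relative difference: {100 * (l - m) / m:.2f}%")
```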
13. Rollback Procedures (V2 reco #6)
| Scenario | Steps | RTO |
|---|---|---|
| Bad Helm values (wrong sizing) | `git revert` → CI/CD auto-deploy → kill pods → retrigger | 10 min |
| Autoscaler stuck | Manual: `scw k8s pool update <id> size=N` → force node creation | 5 min |
| Node failure during run | Airflow retry (retries=1) → pod rescheduled on healthy node | 2 min |
| All compute nodes down | Workloads paused. Platform unaffected (permanent pool). Wait for autoscaler recovery. | 5-10 min |
| Cost overrun detected | Pause all FTF DAGs → scale pool to 0 → investigate | Immediate |
14. Implementation
| Task | Effort | Priority |
|---|---|---|
| Update Helm: standard profile (cpu=2/4, ram=2/8Gi) | 2 lines | HIGH |
| Heavy profile in DAG (conditional resources) | 15 lines | HIGH |
| Auto-detect heavy workload | 10 lines | HIGH |
| retries=1 + exponential backoff | 3 lines | HIGH |
| Pool max=6 (already done) | Done | — |
| Grafana: cost tracking + retry panel | 2 panels | MEDIUM |
| OPERATIONS.md: sizing rationale | Doc | MEDIUM |
15. Acceptance Criteria
- FTF normal (2 factors in parallel) < 2h30 per factor
- FTF heavy (cusum_off) completes without OOM on 2 nodes
- > 40% RAM margin on ALL scenarios (verified by analysis)
- Discovery + FTF concurrent without scheduling issues
- Monthly cost < EUR 200 in all projections
- Scale-up < 5 min, scale-down < 15 min
- Zero OOMKill in 10 consecutive sessions
- Zero permanent task failures (retry resolves any transient)
- Heavy profile auto-detected — no manual config needed