Scaleway Compute Sizing: Definitive Architecture

Version: 3.0
Date: 2026-04-15
Issue: #553
Status: V3 after 2nd committee review (V1: 5.4/10, V2: 6.4/10)
Budget: EUR 200/month maximum

Guarantees:
- RAM: > 40% margin on all scenarios (hard guarantee)
- CPU: > 15% margin on peak burst with max_active_runs=2 (hard guarantee)
- Zero OOM: 4-layer prevention
- Zero permanent failures: retries=1 catches transient issues

1. Problem Statement

Over the past 2 days, the compute sizing was changed 5 times, causing OOM kills, CPU throttling, stuck pods, and 4 interrupted FTF sessions. Root cause: no definitive sizing plan — each issue was patched reactively.

This document defines a sizing architecture that handles ALL current and planned workloads with comfortable margins, within budget.

2. Platform Architecture

┌─────────────────────────────────────────────────────────────────┐
│  PERMANENT POOL — PRO2-S (8 CPU, 32Gi)                          │
│  Always on. EUR 60/month.                                        │
│                                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐           │
│  │ Airflow  │ │ MLflow   │ │ CVNTrade │ │ Grafana  │           │
│  │ sched+   │ │          │ │ API+FE+  │ │ + Redis  │           │
│  │ web+trig │ │          │ │ runtime  │ │ + ZenML  │           │
│  │ 950m/1.8G│ │ 250m/512M│ │ 1.6/5.3G │ │ 350m/450M│           │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘           │
│                                                                   │
│  Total: 3.1 CPU / 8.2Gi used.  Margin: 61% CPU, 74% RAM.       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  COMPUTE POOL — 0 to 6 × PRO2-M (16 CPU, 64Gi each)            │
│  Autoscale. EUR 0 when idle. EUR 0.16/h per node when active.   │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  PRO2-M #1          PRO2-M #2          PRO2-M #3 ...      │ │
│  │  16 CPU, 64Gi       16 CPU, 64Gi       16 CPU, 64Gi       │ │
│  │                                                             │ │
│  │  ┌─────┐ ┌─────┐   ┌─────┐ ┌─────┐   ┌─────┐ ┌─────┐   │ │
│  │  │ FTF │ │ FTF │   │ FTF │ │ FTF │   │ Disc│ │ WFRB│   │ │
│  │  │pod 1│ │pod 2│   │pod 3│ │pod 4│   │overy│ │     │   │ │
│  │  └─────┘ └─────┘   └─────┘ └─────┘   └─────┘ └─────┘   │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  Scale-up: ~3 min (pod Pending → node provisioned → pod Running) │
│  Scale-down: ~10 min after last pod terminates                    │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  EXTERNAL SERVICES                                                │
│                                                                   │
│  PostgreSQL (Scaleway Managed)  ~EUR 20/month                    │
│  S3 (MLflow artifacts)          ~EUR 5/month                     │
└─────────────────────────────────────────────────────────────────┘

3. Workload Profiles

3.1 FTF Normal (CUSUM enabled)

The standard FTF ablation run. Tests factors like calibration, timeframe, fold_size, etc.

Parameter	Value
Trigger	Manual (Airflow UI or CLI)
Cryptos	5 (defi_top5)
Pods per factor	5 (1/crypto)
Parallel factors	2 (max_active_runs=2)
Peak pods	10
Training samples	~1,000-2,000 (CUSUM event mode)
Features	50-80 (after feature cap fix #552)
HPO trials	15
CPU per pod	2 req / 4 limit (observed: 3-4 cores)
RAM per pod	2Gi req / 8Gi limit (observed: 1-2Gi)
Duration per factor	~2.5h
Pod profile	standard

3.2 FTF Heavy (CUSUM disabled/relaxed)

Tests the cusum_training_mode factor with disabled or relaxed CUSUM. Much more data.

Parameter	Value
Training samples	5,000-20,000 (CUSUM disabled/relaxed)
Features	50-272 (depends on feature cap)
HPO trials	15
CPU per pod	2 req / 4 limit (observed: 4 cores, CPU-bound)
RAM per pod	12Gi req / 24Gi limit (observed: 10-15Gi)
Duration per factor	8-10h
Parallel factors	1 only (no concurrent factors during heavy runs)
Peak pods	5
Pod profile	heavy

Also applies to: n_features=full variant (272 features); btc_features (any variant — its BTC cross-asset overlay loads a 2nd asset's full feature store, ~2× single-asset memory; OOMKilled at 8Gi in the CVN-N001-EI-S12 dry-run, #1123).

3.3 Discovery

Model screening across crypto universe. Identifies promising candidates.

Parameter	Value
Pods	5 (1/crypto)
CPU per pod	2 req / 4 limit
RAM per pod	2Gi req / 8Gi limit (observed: 3-4Gi)
Duration	~1h
Pod profile	standard

3.4 Testing / WFRB (Walk-Forward Rebalance Backtest)

Post-training validation. Backtests the trained model on walk-forward folds.

Parameter	Value
Pods	5 (1/crypto)
CPU per pod	1 req / 2 limit
RAM per pod	2Gi req / 4Gi limit (observed: 2-3Gi)
Duration	~30 min
Pod profile	standard

4. Pod Profiles

Standard Profile

resources:
  requests:
    cpu: "2"
    memory: "2Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

Used for: FTF normal, Discovery, WFRB. Ratio limit/request: 2× CPU, 4× RAM — generous burst headroom.

Heavy Profile

resources:
  requests:
    cpu: "2"
    memory: "12Gi"
  limits:
    cpu: "4"
    memory: "24Gi"

Used for: FTF cusum_training_mode, FTF n_features=full, FTF btc_features. Ratio limit/request: 2× CPU, 2× RAM.
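The limit/request ratios quoted for the two profiles can be checked with a few lines of Python. This is a minimal sketch: the `PROFILES` dictionary and the `burst_ratio` helper are illustrative names, not part of the codebase, and only whole-`Gi` memory strings are handled.

```python
# Illustrative only: PROFILES and burst_ratio are hypothetical names.
PROFILES = {
    "standard": {"requests": {"cpu": "2", "memory": "2Gi"},
                 "limits": {"cpu": "4", "memory": "8Gi"}},
    "heavy": {"requests": {"cpu": "2", "memory": "12Gi"},
              "limits": {"cpu": "4", "memory": "24Gi"}},
}


def _gi(value: str) -> float:
    """Parse a memory string such as '8Gi' into a number of Gi."""
    if not value.endswith("Gi"):
        raise ValueError(f"only Gi units are handled in this sketch: {value!r}")
    return float(value[:-2])


def burst_ratio(profile: dict) -> tuple[float, float]:
    """Return (CPU limit/request, RAM limit/request) for one profile."""
    cpu = float(profile["limits"]["cpu"]) / float(profile["requests"]["cpu"])
    ram = _gi(profile["limits"]["memory"]) / _gi(profile["requests"]["memory"])
    return cpu, ram


for name, profile in PROFILES.items():
    cpu, ram = burst_ratio(profile)
    print(f"{name}: {cpu:g}x CPU, {ram:g}x RAM")
```

The heavy profile trades burst headroom for a much larger guaranteed request, which is what forces heavy runs onto a second node.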

Profile Auto-Detection

The DAG selects the profile automatically based on the factor (factor-level: the whole factor routes to one profile — there is no per-variant routing). The single source of truth is commun.finetune.power_mode.is_heavy_factor, used by the DAG for pod_override and by forecast_resources for the profile name:

# commun/finetune/power_mode.py — single source of truth
HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})

def is_heavy_factor(factor_name: str) -> bool:
    return factor_name in HEAVY_FACTORS

When heavy is detected:
1. Pod uses heavy resource profile
2. max_active_runs is effectively 1 for this factor (no concurrent heavy runs)
3. Minimum 2 compute nodes guaranteed
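How a caller might turn `is_heavy_factor` into these three consequences can be sketched as follows. `HEAVY_FACTORS` and `is_heavy_factor` are copied from the excerpt above; `RunPlan` and `plan_for_factor` are hypothetical names introduced for illustration, and the real DAG wiring (pod_override, forecast_resources) is not shown in this document.

```python
from dataclasses import dataclass

# Copied from the power_mode.py excerpt.
HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})


def is_heavy_factor(factor_name: str) -> bool:
    return factor_name in HEAVY_FACTORS


@dataclass(frozen=True)
class RunPlan:
    """Hypothetical container for the consequences of profile detection."""
    profile: str            # "standard" or "heavy"
    max_active_runs: int    # effective concurrency for this factor
    min_compute_nodes: int  # nodes that must exist before the run starts


def plan_for_factor(factor_name: str) -> RunPlan:
    """Map a factor name to its profile, concurrency and minimum node count."""
    if is_heavy_factor(factor_name):
        return RunPlan(profile="heavy", max_active_runs=1, min_compute_nodes=2)
    # Standard factors: 2 concurrent factors. A minimum of 0 nodes means the
    # autoscaler starts from an empty pool (an assumption of this sketch).
    return RunPlan(profile="standard", max_active_runs=2, min_compute_nodes=0)
```

Because routing is factor-level, `plan_for_factor("n_features")` returns the heavy plan for every variant of that factor, not only `n_features=full`.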

5. Node Allocation & Margins

Scheduling Rules

Rule	Mechanism
Standard pods schedule on any compute node	nodeSelector: compute pool
Heavy pods require 2 nodes minimum	RAM request 12Gi × 5 pods = 60Gi, which does not fit on 1 node once system reservations are deducted (64Gi nominal, ~60Gi allocatable)
No heavy + normal concurrent	Heavy factor runs alone (max_active_runs enforcement)
Platform pods never on compute	nodeSelector: permanent pool
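The two-node requirement for heavy runs follows from how many pods fit on one node by request. The sketch below assumes allocatable capacity slightly below the nominal size: the table quotes roughly 60Gi, and the 59Gi and 15.5 CPU defaults are illustrative placeholders, not measured values. Daemonsets and other scheduler details are ignored.

```python
import math


def pods_per_node(cpu_req: float, ram_req_gi: float,
                  alloc_cpu: float, alloc_ram_gi: float) -> int:
    """How many identical pods fit on one node, judged by requests only."""
    return min(int(alloc_cpu // cpu_req), int(alloc_ram_gi // ram_req_gi))


def nodes_needed(pods: int, cpu_req: float, ram_req_gi: float,
                 alloc_cpu: float = 15.5, alloc_ram_gi: float = 59.0) -> int:
    """Nodes required for `pods` identical pods (illustrative defaults)."""
    per_node = pods_per_node(cpu_req, ram_req_gi, alloc_cpu, alloc_ram_gi)
    if per_node == 0:
        raise ValueError("a single pod does not fit on one node")
    return math.ceil(pods / per_node)


# Heavy profile: 5 pods requesting 2 CPU / 12Gi each.
print(nodes_needed(pods=5, cpu_req=2, ram_req_gi=12))
```

With these assumptions RAM is the binding resource: only 4 heavy pods fit per node, so the 5th pod forces a second node.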

Margin Analysis

All margins calculated on BURST usage (CPU limit, RAM observed peak), not requests. No aspirational numbers — only what we've measured.

max_active_runs=2 for FTF normal (V3 decision — 3 caused stuck pods in practice).

Scenario	Pods	CPU burst	RAM peak	Nodes	CPU avail	RAM avail	CPU margin	RAM margin
FTF normal (2 factors)	10	40	20Gi	3	48	192Gi	17% ✓	90% ✓
FTF heavy (1 factor)	5	20	60Gi	2	32	128Gi	38% ✓	53% ✓
Discovery	5	20	20Gi	2	32	128Gi	38% ✓	84% ✓
WFRB	5	10	10Gi	1	16	64Gi	38% ✓	84% ✓
FTF normal + Discovery	15	60	40Gi	4	64	256Gi	6% ⚠️	84% ✓
FTF heavy + WFRB	10	30	70Gi	2	32	128Gi	6% ⚠️	45% ✓
Worst case (all)	20	70	80Gi	5	80	320Gi	13% ✓	75% ✓
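Each margin in the table is `1 - usage / capacity`, with capacity taken as the nominal PRO2-M size (16 CPU, 64Gi) times the node count. The sketch below recomputes three rows from the table's own inputs; the `margin` helper and the `SCENARIOS` layout are illustrative.

```python
NODE_CPU, NODE_RAM_GI = 16, 64  # PRO2-M nominal size


def margin(used: float, nodes: int, per_node: float) -> int:
    """Free share of capacity, as a whole percentage."""
    return round(100 * (1 - used / (nodes * per_node)))


# (scenario, CPU burst, RAM peak in Gi, nodes), copied from the table.
SCENARIOS = [
    ("FTF normal (2 factors)", 40, 20, 3),
    ("FTF heavy (1 factor)", 20, 60, 2),
    ("FTF normal + Discovery", 60, 40, 4),
]

for name, cpu, ram, nodes in SCENARIOS:
    print(f"{name}: CPU {margin(cpu, nodes, NODE_CPU)}%, "
          f"RAM {margin(ram, nodes, NODE_RAM_GI)}%")
```

The same helper can be reused to test a proposed change, for example what a 5th node does to the FTF normal + Discovery CPU margin.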

Margin Guarantees: Honest Assessment

Dimension	Guarantee	Justification
RAM (all scenarios)	> 40%	Hard guarantee. Sizing verified on observed peak, not requests.
CPU (FTF alone)	> 15%	10 pods × 4 cores = 40 on 48 available (3 nodes). Proven in practice.
CPU (concurrent workloads)	Best-effort (6-13%)	FTF + Discovery concurrent = 60 cores on 64. Minor throttling possible during peak overlap. Runs complete without failure, ~5-10% slower at worst.
CPU throttling impact	< 10% duration increase	Throttling only occurs when ALL pods are CPU-bound simultaneously. I/O and data loading phases naturally stagger CPU demand.

What we do NOT promise: Zero throttling when FTF + Discovery run simultaneously. This is a cost trade-off: eliminating it would require a 5th node for the duration of the concurrent run (60 cores on 80, a 25% margin).

What we DO promise: No run fails, no OOM, no stuck pods. Throttling adds at most 10% to run duration in the concurrent scenario.

Concurrency Rules (V2 reco #7)

Rule	Enforcement	Rationale
FTF normal: max 2 concurrent factors	max_active_runs=2 in DAG	10 pods × 4 CPU = 40 burst on 3 nodes (48 cores) = 17% margin
FTF heavy: max 1 concurrent factor	DAG auto-detects heavy → serializes	5 pods × 4 CPU = 20 on 2 nodes (32 cores) = 38% margin
FTF heavy + normal: NOT concurrent	Heavy blocks normal until complete	Prevents resource contention
FTF + Discovery: concurrent OK	15 pods total, 4 nodes	CPU margin 6% — acceptable, runs complete
FTF + WFRB: concurrent OK	15 pods total, 3-4 nodes	WFRB is lighter (2 CPU/pod)

Worst case: FTF normal (10 pods) + Discovery (5 pods) + WFRB (5 pods) = 20 pods. Requires 5 nodes. Pool max=6, 1 spare. This is realistic — a user could trigger all three.

Cannot happen: FTF heavy + FTF normal concurrent (DAG-enforced). FTF heavy + Discovery is possible but unlikely (different workflows).
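The concurrency rules reduce to a small predicate. This is a minimal sketch: the workload labels and `may_run_together` are hypothetical, and in the real system enforcement lives in the DAG (`max_active_runs` and heavy-factor serialization), not in a standalone function.

```python
# Hypothetical workload labels for illustration.
FTF_NORMAL, FTF_HEAVY = "ftf_normal", "ftf_heavy"
DISCOVERY, WFRB = "discovery", "wfrb"


def may_run_together(a: str, b: str) -> bool:
    """Return True if workloads a and b are allowed to overlap."""
    pair = {a, b}
    # A heavy factor runs alone with respect to other FTF factors:
    # no heavy + heavy and no heavy + normal.
    if FTF_HEAVY in pair and pair <= {FTF_HEAVY, FTF_NORMAL}:
        return False
    # Everything else (FTF + Discovery, FTF + WFRB, ...) may overlap.
    return True
```

Note that FTF heavy + Discovery returns `True`: the rules above treat it as possible but unlikely, not as forbidden.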

6. OOM Prevention: 4 Layers

Layer	Mechanism	Catches
1. Pod limit	RAM limit = 2-4× request	Prevents single pod from consuming all node RAM
2. Node margin	> 40% RAM free in every scenario (lowest: 45%, FTF heavy + WFRB)	Absorbs burst without eviction
3. Pre-flight guardrail	Estimate memory before training, BLOCK if > 70% node capacity	Catches misconfiguration before expensive training (#551)
4. Profile auto-detection	DAG selects heavy profile for high-data variants	Prevents standard pods from running heavy workloads
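Layer 3 can be pictured as a simple gate in front of training. This is a minimal sketch, assuming the memory estimate is produced elsewhere (the estimator itself is not described here) and that "node capacity" means the nominal 64Gi of a PRO2-M. `PreflightError` and `preflight_check` are hypothetical names.

```python
class PreflightError(RuntimeError):
    """Raised when the estimated footprint exceeds the allowed share."""


def preflight_check(estimated_gi: float, node_capacity_gi: float = 64.0,
                    max_share: float = 0.70) -> float:
    """Block the run if the estimate exceeds max_share of node capacity.

    Returns the share of node capacity that the estimate represents.
    """
    share = estimated_gi / node_capacity_gi
    if share > max_share:
        raise PreflightError(
            f"estimated {estimated_gi:.1f}Gi is {share:.0%} of node capacity "
            f"(limit {max_share:.0%}); refusing to start training"
        )
    return share
```

Raising an exception before the training step starts means the task fails fast, rather than after hours of compute.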

OOM Scenarios and Mitigations

Scenario	Risk	Mitigation
cusum_training_mode=disabled	10-15Gi per pod	Heavy profile (12Gi req, 24Gi limit) + 2 nodes
n_features=full (272 features)	5-8Gi per pod	Heavy profile + feature cap fix (#552)
Memory leak during HPO	Slow growth over the 15 HPO trials	Pod limit (8/24Gi) + Optuna trial isolation
Concurrent heavy + normal	80Gi+ total	Never concurrent — heavy runs alone
Node memory fragmentation	Random OOM despite margin	retries=1 catches transient failures

7. Retry Policy

Parameter	Value	Rationale
`retries`	1	Catch transient failures (GC spike, node preemption)
`retry_delay`	60s	Let node recover
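`retry_exponential_backoff`	True	Nominal delay doubles per retry (60s → 120s); the 120s step applies to a 2nd retry, which only exists if `retries` is raised above 1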
`max_retry_delay`	300s	Cap at 5 min
Pod restart policy	Never	Airflow manages lifecycle
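In Airflow these settings are usually passed through `default_args`. The sketch below uses the `BaseOperator` argument names from the table. `nominal_retry_delay` is a hypothetical helper that models only the doubling and the cap; Airflow's own computation also adds jitter.

```python
from datetime import timedelta

# Values from the retry policy table.
default_args = {
    "retries": 1,
    "retry_delay": timedelta(seconds=60),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(seconds=300),
}


def nominal_retry_delay(retry_number: int) -> timedelta:
    """Nominal wait before the given retry (1-based), ignoring jitter."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    delay = default_args["retry_delay"]
    if default_args["retry_exponential_backoff"]:
        delay = delay * 2 ** (retry_number - 1)
    # Never wait longer than the configured cap.
    return min(delay, default_args["max_retry_delay"])
```

With `retries=1` only the first value (60s) is ever used; the later steps and the 300s cap matter only if `retries` is raised.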

Design intent: With proper margins, retries should NEVER trigger. If a task uses its retry, it's a signal that sizing is wrong — investigate before the next session.

Monitoring: Grafana panel for retry count. Alert if retries > 0 in any session.

8. Autoscaler Configuration

Parameter	Value
Pool name	compute
Node type	PRO2-M (16 CPU, 64Gi)
Min size	0
Max size	6
Scale-up trigger	Pod Pending (insufficient resources)
Scale-up time	~3 min (Scaleway SLA: < 5 min)
Scale-down trigger	Node idle (no pods) for 10 min
Scale-down protection	Pods running → node NOT drained

Scale-Up Behavior

t=0:    FTF triggered → 10 pods created → Pending
t=0:    Autoscaler detects Pending pods
t=1m:   Scaleway API: provision PRO2-M node #1
t=2m:   Scaleway API: provision PRO2-M node #2 (if needed)
t=3m:   Nodes Ready → pods scheduled → Running
t=3m:   If still Pending pods → provision node #3

Scale-Down Behavior

t=0:    Last FTF pod completes → Succeeded
t=10m:  Autoscaler: node #3 idle → drain + terminate
t=10m:  Autoscaler: node #2 idle → drain + terminate
t=10m:  Autoscaler: node #1 idle → drain + terminate
t=10m:  Compute pool size = 0. Cost = EUR 0.
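The scale-down rule in the timeline can be sketched as a filter over nodes. This mirrors the rule as described here (idle for 10 minutes, never drained while pods are running), not the actual cluster autoscaler logic, which considers more signals. `Node` and `nodes_to_remove` are hypothetical names.

```python
from dataclasses import dataclass

SCALE_DOWN_IDLE_MIN = 10  # from the autoscaler configuration table


@dataclass
class Node:
    name: str
    running_pods: int
    idle_minutes: float  # time since the last pod on this node finished


def nodes_to_remove(nodes: list[Node]) -> list[str]:
    """Names of nodes eligible for drain + terminate."""
    return [
        n.name for n in nodes
        if n.running_pods == 0 and n.idle_minutes >= SCALE_DOWN_IDLE_MIN
    ]
```

A node with running pods is never returned, whatever its `idle_minutes` value, which corresponds to the scale-down protection row.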

9. Cost Model

Unit Costs

Resource	Cost
PRO2-S (permanent, 24/7)	EUR 60/month
PRO2-M (per node per hour)	EUR 0.1644/h
PostgreSQL managed	EUR 20/month
S3 storage	EUR 5/month

Monthly Projections

Normal Month (3 FTF sessions/week, daily discovery)

Component	Calculation	Cost
Permanent	1 × PRO2-S × 730h	EUR 60
FTF normal	12 sessions × 4h × 3 nodes × EUR 0.16	EUR 23
FTF heavy	2 sessions × 10h × 2 nodes × EUR 0.16	EUR 6
Discovery	20 sessions × 1h × 2 nodes × EUR 0.16	EUR 6
WFRB	20 sessions × 0.5h × 1 node × EUR 0.16	EUR 2
DB + S3	fixed	EUR 25
Total		EUR 122

Note: FTF normal takes 4h per session with max_active_runs=2 (vs 2.5h with 3). Trade-off: reliability over speed.

Heavy Month (5 FTF sessions/week, cusum experiments, daily everything)

Component	Calculation	Cost
Permanent		EUR 60
FTF normal	20 sessions × 4h × 3 nodes × EUR 0.16	EUR 38
FTF heavy	4 sessions × 10h × 2 nodes × EUR 0.16	EUR 13
Discovery	30 sessions × 1h × 2 nodes × EUR 0.16	EUR 10
WFRB	30 sessions × 0.5h × 1 node × EUR 0.16	EUR 2
DB + S3		EUR 25
Total		EUR 148

Worst Month (intensive research, multiple cusum experiments)

Component	Calculation	Cost
Permanent		EUR 60
FTF normal	25 sessions × 4h × 3 nodes × EUR 0.16	EUR 48
FTF heavy	8 sessions × 10h × 2 nodes × EUR 0.16	EUR 26
Discovery + WFRB	40 × 1.5h × 2 nodes × EUR 0.16	EUR 19
DB + S3		EUR 25
Total		EUR 178

All scenarios under EUR 200 budget. Worst month margin: EUR 22 (11%).
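The three projections can be reproduced with a short script. This is a sketch that follows the tables' own convention: the hourly rate is rounded to EUR 0.16 and each line is rounded to a whole euro before summing. `monthly_cost` and the tuple layout are illustrative.

```python
PERMANENT_EUR = 60    # PRO2-S, always on
DB_S3_EUR = 25        # PostgreSQL + S3
NODE_HOUR_EUR = 0.16  # PRO2-M rate as rounded in the tables (list price 0.1644)


def monthly_cost(workloads: list[tuple[float, float, int]]) -> int:
    """Total monthly cost in EUR.

    Each workload is (sessions, hours per session, nodes). Each line is
    rounded to a whole euro before summing, as in the tables.
    """
    compute = sum(round(s * h * n * NODE_HOUR_EUR) for s, h, n in workloads)
    return PERMANENT_EUR + compute + DB_S3_EUR


normal = [(12, 4, 3), (2, 10, 2), (20, 1, 2), (20, 0.5, 1)]
heavy = [(20, 4, 3), (4, 10, 2), (30, 1, 2), (30, 0.5, 1)]
worst = [(25, 4, 3), (8, 10, 2), (40, 1.5, 2)]

for label, month in (("normal", normal), ("heavy", heavy), ("worst", worst)):
    print(label, monthly_cost(month))
```

Adding a row to one of the lists is a quick way to see how close a new recurring workload would bring the month to the EUR 200 ceiling.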

10. Observability & Alerting (V2 reco #2)

Metric	Source	Alert threshold	Action
Pod CPU actual vs limit	Prometheus: `rate(container_cpu_usage_seconds_total[5m]) / kube_pod_container_resource_limits{resource="cpu"}`	> 90% sustained 10 min	Possible throttling; check if more nodes needed
Pod RAM actual vs limit	Prometheus: `container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"}`	> 80%	OOM risk; investigate
Node CPU utilization	Prometheus: `instance:node_cpu_utilisation`	> 85%	Scale up or reduce parallelism
Node RAM utilization	Prometheus: `1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`	> 75%	Scale up
Pending pods	kube-state-metrics: `kube_pod_status_phase{phase="Pending"}`	> 0 for 5 min	Autoscaler may be stuck
Autoscaler scale-up time	Event annotations	> 5 min	Investigate Scaleway API
Run duration	PostgreSQL: `elapsed_s`	> 600s per run (10 min)	Possible misconfiguration
Retry count	Airflow task instance	> 0	Sizing may be wrong — investigate
External dependencies	Probes: MLflow, PostgreSQL, Redis, S3	Any down	Alert immediately

All metrics visible on Grafana Infrastructure Monitoring dashboard (#529).
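Several thresholds in the table are "sustained" conditions rather than instantaneous ones. In Prometheus this would normally be expressed with a `for:` duration on the alert rule; the Python below only illustrates the logic, assuming one sample per minute. The function name and the sample data are made up for the example.

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = samples[-window:]
    # Require a full window so a short history never fires the alert.
    return len(recent) == window and all(v > threshold for v in recent)


# Pod CPU as a fraction of its limit, one sample per minute (made-up data).
cpu_ratio = [0.70, 0.95, 0.96, 0.97, 0.93, 0.94, 0.99, 0.92, 0.91, 0.95, 0.96]
print(sustained_breach(cpu_ratio, threshold=0.90, window=10))
```

A single dip below the threshold inside the window resets the condition, which keeps short CPU spikes from paging anyone.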

11. Security (V2 reco #3)

Layer	Mechanism	Status
RBAC	Scaleway IAM roles: admin + CI service account. Namespace-level RBAC in K8s.	Configured
Network policies	Compute pods → PostgreSQL, MLflow, Redis only. No external egress except Binance API.	To implement
Secrets management	Kubernetes Secrets (to migrate from ConfigMap, #527). No secrets in git.	In progress
Image scanning	CI builds with pinned base image. No `latest` tag (ADR-3).	Configured
Runtime security	Pod security standards: non-root, read-only root FS where possible.	Partial
Trust boundaries	Permanent pool (platform) isolated from compute pool (workloads) via nodeSelector.	Configured

12. Growth Strategy (V2 reco #4)

Milestone	Compute need	Architecture
Current (5 cryptos, FTF)	4 PRO2-M peak	Pool max=6
defi_full (17 cryptos)	~14 PRO2-M peak (17 pods × 4 CPU)	Pool max=16 or switch to PRO2-L (32 CPU)
Live trading (paper + live)	+2 PRO2-M permanent	Dedicated trading pool (always-on, low-latency)
Multi-strategy	×N of FTF	Separate pools per strategy or time-share

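Decision point: When defi_full is needed, evaluate PRO2-L (32 CPU, EUR 0.33/h) vs more PRO2-M. Per core the two are priced almost identically (EUR 0.33/32 ≈ EUR 0.0103/h vs EUR 0.1644/16 ≈ EUR 0.0103/h), so the choice rests on node count and bin-packing rather than unit price.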

Budget projection for growth:
- defi_full FTF: ~EUR 250/month → exceeds EUR 200 budget → need budget revision
- Live trading: ~EUR 100/month additional → separate budget line

13. Rollback Procedures (V2 reco #6)

Scenario	Steps	RTO
Bad Helm values (wrong sizing)	`git revert` → CI/CD auto-deploy → kill pods → retrigger	10 min
Autoscaler stuck	Manual: `scw k8s pool update <id> size=N` → force node creation	5 min
Node failure during run	Airflow retry (retries=1) → pod rescheduled on healthy node	2 min
All compute nodes down	Workloads paused. Platform unaffected (permanent pool). Wait for autoscaler recovery.	5-10 min
Cost overrun detected	Pause all FTF DAGs → scale pool to 0 → investigate	Immediate

14. Implementation

Task	Effort	Priority
Update Helm: standard profile (cpu=2/4, ram=2/8Gi)	2 lines	HIGH
Heavy profile in DAG (conditional resources)	15 lines	HIGH
Auto-detect heavy workload	10 lines	HIGH
retries=1 + exponential backoff	3 lines	HIGH
Pool max=6 (already done)	Done	—
Grafana: cost tracking + retry panel	2 panels	MEDIUM
OPERATIONS.md: sizing rationale	Doc	MEDIUM

15. Acceptance Criteria

FTF normal (2 factors in parallel) < 2h30 per factor
FTF heavy (cusum_off) completes without OOM on 2 nodes
> 40% RAM margin on ALL scenarios (verified by analysis)
Discovery + FTF concurrent without scheduling issues
Monthly cost < EUR 200 in all projections
Scale-up < 5 min, scale-down < 15 min
Zero OOMKill in 10 consecutive sessions
Zero permanent task failures (retry resolves any transient)
Heavy profile auto-detected — no manual config needed