Scaleway Compute Sizing — Definitive Architecture

Version: 3.0
Date: 2026-04-15
Issue: #553
Status: V3 after 2nd committee review (V1: 5.4/10, V2: 6.4/10)
Budget: EUR 200/month maximum

Guarantees:
- RAM: > 40% margin on all scenarios (hard guarantee)
- CPU: > 15% margin on peak burst with max_active_runs=2, FTF running alone (hard guarantee); best-effort when workloads overlap (see section 5)
- Zero OOM: 4-layer prevention
- Zero permanent failures: retries=1 catches transient issues


1. Problem Statement

Over the past 2 days, the compute sizing was changed 5 times, causing OOM kills, CPU throttling, stuck pods, and 4 interrupted FTF sessions. Root cause: no definitive sizing plan — each issue was patched reactively.

This document defines a sizing architecture that handles ALL current and planned workloads with comfortable margins, within budget.


2. Platform Architecture

┌─────────────────────────────────────────────────────────────────┐
│  PERMANENT POOL — PRO2-S (8 CPU, 32Gi)                          │
│  Always on. EUR 60/month.                                        │
│                                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐           │
│  │ Airflow  │ │ MLflow   │ │ CVNTrade │ │ Grafana  │           │
│  │ sched+   │ │          │ │ API+FE+  │ │ + Redis  │           │
│  │ web+trig │ │          │ │ runtime  │ │ + ZenML  │           │
│  │ 950m/1.8G│ │ 250m/512M│ │ 1.6/5.3G │ │ 350m/450M│           │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘           │
│                                                                   │
│  Total: 3.1 CPU / 8.2Gi used.  Margin: 61% CPU, 74% RAM.       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  COMPUTE POOL — 0 to 6 × PRO2-M (16 CPU, 64Gi each)            │
│  Autoscale. EUR 0 when idle. EUR 0.16/h per node when active.   │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  PRO2-M #1          PRO2-M #2          PRO2-M #3 ...      │ │
│  │  16 CPU, 64Gi       16 CPU, 64Gi       16 CPU, 64Gi       │ │
│  │                                                             │ │
│  │  ┌─────┐ ┌─────┐   ┌─────┐ ┌─────┐   ┌─────┐ ┌─────┐   │ │
│  │  │ FTF │ │ FTF │   │ FTF │ │ FTF │   │ Disc│ │ WFRB│   │ │
│  │  │pod 1│ │pod 2│   │pod 3│ │pod 4│   │overy│ │     │   │ │
│  │  └─────┘ └─────┘   └─────┘ └─────┘   └─────┘ └─────┘   │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  Scale-up: ~3 min (pod Pending → node provisioned → pod Running) │
│  Scale-down: ~10 min after last pod terminates                    │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  EXTERNAL SERVICES                                                │
│                                                                   │
│  PostgreSQL (Scaleway Managed)  ~EUR 20/month                    │
│  S3 (MLflow artifacts)          ~EUR 5/month                     │
└─────────────────────────────────────────────────────────────────┘

3. Workload Profiles

3.1 FTF Normal (CUSUM enabled)

The standard FTF ablation run. Tests factors like calibration, timeframe, fold_size, etc.

| Parameter | Value |
| --- | --- |
| Trigger | Manual (Airflow UI or CLI) |
| Cryptos | 5 (defi_top5) |
| Pods per factor | 5 (1/crypto) |
| Parallel factors | 2 (max_active_runs=2) |
| Peak pods | 10 |
| Training samples | ~1,000-2,000 (CUSUM event mode) |
| Features | 50-80 (after feature cap fix #552) |
| HPO trials | 15 |
| CPU per pod | 2 req / 4 limit (observed: 3-4 cores) |
| RAM per pod | 2Gi req / 8Gi limit (observed: 1-2Gi) |
| Duration per factor | ~2.5h |
| Pod profile | standard |

3.2 FTF Heavy (CUSUM disabled/relaxed)

Tests the cusum_training_mode factor with CUSUM disabled or relaxed, which produces far more training samples than CUSUM event mode.

| Parameter | Value |
| --- | --- |
| Training samples | 5,000-20,000 (CUSUM disabled/relaxed) |
| Features | 50-272 (depends on feature cap) |
| HPO trials | 15 |
| CPU per pod | 2 req / 4 limit (observed: 4 cores, CPU-bound) |
| RAM per pod | 12Gi req / 24Gi limit (observed: 10-15Gi) |
| Duration per factor | 8-10h |
| Parallel factors | 1 only (no concurrent factors during heavy runs) |
| Peak pods | 5 |
| Pod profile | heavy |

Also applies to: n_features=full variant (272 features); btc_features (any variant — its BTC cross-asset overlay loads a 2nd asset's full feature store, ~2× single-asset memory; OOMKilled at 8Gi in the CVN-N001-EI-S12 dry-run, #1123).

3.3 Discovery

Model screening across crypto universe. Identifies promising candidates.

| Parameter | Value |
| --- | --- |
| Pods | 5 (1/crypto) |
| CPU per pod | 2 req / 4 limit |
| RAM per pod | 2Gi req / 8Gi limit (observed: 3-4Gi) |
| Duration | ~1h |
| Pod profile | standard |

3.4 Testing / WFRB (Walk-Forward Rebalance Backtest)

Post-training validation. Backtests the trained model on walk-forward folds.

| Parameter | Value |
| --- | --- |
| Pods | 5 (1/crypto) |
| CPU per pod | 1 req / 2 limit |
| RAM per pod | 2Gi req / 4Gi limit (observed: 2-3Gi) |
| Duration | ~30 min |
| Pod profile | standard |

4. Pod Profiles

Standard Profile

resources:
  requests:
    cpu: "2"
    memory: "2Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

Used for: FTF normal, Discovery, WFRB. Ratio limit/request: 2× CPU, 4× RAM — generous burst headroom.

Heavy Profile

resources:
  requests:
    cpu: "2"
    memory: "12Gi"
  limits:
    cpu: "4"
    memory: "24Gi"

Used for: FTF cusum_training_mode, FTF n_features=full, FTF btc_features. Ratio limit/request: 2× CPU, 2× RAM.

Profile Auto-Detection

The DAG selects the profile automatically based on the factor (factor-level: the whole factor routes to one profile — there is no per-variant routing). The single source of truth is commun.finetune.power_mode.is_heavy_factor, used by the DAG for pod_override and by forecast_resources for the profile name:

# commun/finetune/power_mode.py — single source of truth
HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})

def is_heavy_factor(factor_name: str) -> bool:
    return factor_name in HEAVY_FACTORS

When heavy is detected:
1. Pod uses heavy resource profile
2. max_active_runs is effectively 1 for this factor (no concurrent heavy runs)
3. Minimum 2 compute nodes guaranteed
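A minimal sketch of how the DAG could turn that decision into pod resources. `PROFILES` and `resources_for` are hypothetical names used only for illustration; `HEAVY_FACTORS`, `is_heavy_factor`, and the resource values are taken from this document.

```python
# Illustrative only: PROFILES and resources_for are hypothetical helpers,
# not the actual DAG code. Values mirror the Standard and Heavy profiles above.
HEAVY_FACTORS = frozenset({"cusum_training_mode", "n_features", "btc_features"})

PROFILES = {
    "standard": {
        "requests": {"cpu": "2", "memory": "2Gi"},
        "limits": {"cpu": "4", "memory": "8Gi"},
    },
    "heavy": {
        "requests": {"cpu": "2", "memory": "12Gi"},
        "limits": {"cpu": "4", "memory": "24Gi"},
    },
}


def is_heavy_factor(factor_name: str) -> bool:
    return factor_name in HEAVY_FACTORS


def resources_for(factor_name: str) -> dict:
    """Return the profile name plus its requests/limits for a factor."""
    profile = "heavy" if is_heavy_factor(factor_name) else "standard"
    return {"profile": profile, **PROFILES[profile]}
```

Because routing is factor-level, `resources_for("n_features")` returns the heavy profile for every variant of that factor, not only `n_features=full`.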


5. Node Allocation & Margins

Scheduling Rules

| Rule | Mechanism |
| --- | --- |
| Standard pods schedule anywhere | nodeSelector: compute pool |
| Heavy pods require 2 nodes minimum | RAM requests: 12Gi × 5 pods = 60Gi, which fills the allocatable RAM of one node (~60Gi of 64Gi) |
| No heavy + normal concurrent | Heavy factor runs alone (max_active_runs enforcement) |
| Platform pods never on compute | nodeSelector: permanent pool |

Margin Analysis

All margins calculated on BURST usage (CPU limit, RAM observed peak), not requests. No aspirational numbers — only what we've measured.

max_active_runs=2 for FTF normal (V3 decision — 3 caused stuck pods in practice).

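| Scenario | Pods | CPU burst (cores) | RAM peak | Nodes | CPU avail (cores) | RAM avail | CPU margin | RAM margin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FTF normal (2 factors) | 10 | 40 | 20Gi | 3 | 48 | 192Gi | 17% | 90% |
| FTF heavy (1 factor) | 5 | 20 | 60Gi | 2 | 32 | 128Gi | 38% | 53% |
| Discovery | 5 | 20 | 20Gi | 2 | 32 | 128Gi | 38% | 84% |
| WFRB | 5 | 10 | 10Gi | 1 | 16 | 64Gi | 38% | 84% |
| FTF normal + Discovery | 15 | 60 | 40Gi | 4 | 64 | 256Gi | 6% ⚠️ | 84% |
| FTF heavy + WFRB | 10 | 30 | 70Gi | 2 | 32 | 128Gi | 6% ⚠️ | 45% |
| Worst case (all) | 20 | 70 | 80Gi | 5 | 80 | 320Gi | 13% | 75% |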
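The margins in this table follow margin = (available − burst) / available. The sketch below recomputes them from the CPU burst and RAM peak columns. The node-count rule `ceil(CPU burst / 16)` is an assumption inferred from the table (it reproduces every row) and is not stated in this document; `size` and `pct_half_up` are hypothetical helper names.

```python
import math

NODE_CPU = 16      # PRO2-M cores per node
NODE_RAM_GI = 64   # PRO2-M RAM per node, in Gi


def pct_half_up(fraction: float) -> int:
    """Convert a fraction to a whole percentage, rounding halves up."""
    return math.floor(fraction * 100 + 0.5)


def size(cpu_burst: int, ram_peak_gi: int) -> dict:
    """Node count and burst margins for one scenario.

    Assumption: nodes are sized on CPU burst (sum of CPU limits).
    """
    nodes = math.ceil(cpu_burst / NODE_CPU)
    cpu_avail = nodes * NODE_CPU
    ram_avail = nodes * NODE_RAM_GI
    return {
        "nodes": nodes,
        "cpu_margin": pct_half_up((cpu_avail - cpu_burst) / cpu_avail),
        "ram_margin": pct_half_up((ram_avail - ram_peak_gi) / ram_avail),
    }


print(size(40, 20))  # FTF normal: {'nodes': 3, 'cpu_margin': 17, 'ram_margin': 90}
print(size(60, 40))  # FTF normal + Discovery: {'nodes': 4, 'cpu_margin': 6, 'ram_margin': 84}
```

Halves are rounded up explicitly because Python's built-in `round` uses banker's rounding, which would turn the worst-case 12.5% CPU margin into 12% instead of the 13% shown in the table.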

Margin Guarantees — Honest Assessment

| Dimension | Guarantee | Justification |
| --- | --- | --- |
| RAM (all scenarios) | > 40% | Hard guarantee. Sizing verified on observed peak, not requests. |
| CPU (FTF alone) | > 15% | 10 pods × 4 cores = 40 on 48 available (3 nodes). Proven in practice. |
| CPU (concurrent workloads) | Best-effort (6-13%) | FTF + Discovery concurrent = 60 cores on 64. Minor throttling possible during peak overlap. Runs complete without failure, ~5-10% slower at worst. |
| CPU throttling impact | < 10% duration increase | Throttling only occurs when ALL pods are CPU-bound simultaneously. I/O and data loading phases naturally stagger CPU demand. |

What we do NOT promise: Zero throttling when FTF + Discovery run simultaneously. This is a cost trade-off: eliminating it would require a fifth node (80 cores for 60 of burst, a 25% margin) for the duration of concurrent runs.

What we DO promise: No run fails, no OOM, no stuck pods. Throttling adds at most 10% to run duration in the concurrent scenario.

Concurrency Rules (V2 reco #7)

| Rule | Enforcement | Rationale |
| --- | --- | --- |
| FTF normal: max 2 concurrent factors | max_active_runs=2 in DAG | 10 pods × 4 CPU = 40 burst on 3 nodes (48 cores) = 17% margin |
| FTF heavy: max 1 concurrent factor | DAG auto-detects heavy → serializes | 5 pods × 4 CPU = 20 on 2 nodes (32 cores) = 38% margin |
| FTF heavy + normal: NOT concurrent | Heavy blocks normal until complete | Prevents resource contention |
| FTF + Discovery: concurrent OK | 15 pods total, 4 nodes | CPU margin 6% (acceptable, runs complete) |
| FTF + WFRB: concurrent OK | 15 pods total, 3-4 nodes | WFRB is lighter (2 CPU/pod) |

Worst case: FTF normal (10 pods) + Discovery (5 pods) + WFRB (5 pods) = 20 pods. Requires 5 nodes. Pool max=6, 1 spare. This is realistic — a user could trigger all three.

Cannot happen: FTF heavy + FTF normal concurrent (DAG-enforced). FTF heavy + Discovery is possible but unlikely (different workflows).


6. OOM Prevention — 4 Layers

| Layer | Mechanism | Catches |
| --- | --- | --- |
| 1. Pod limit | RAM limit = 2-4× request | Prevents single pod from consuming all node RAM |
| 2. Node margin | > 40% RAM free (except CPU-bound scenarios) | Absorbs burst without eviction |
| 3. Pre-flight guardrail | Estimate memory before training, BLOCK if > 70% node capacity | Catches misconfiguration before expensive training (#551) |
| 4. Profile auto-detection | DAG selects heavy profile for high-data variants | Prevents standard pods from running heavy workloads |
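Layer 3 can be pictured as a simple threshold check. The sketch below is hypothetical: `preflight_check` and `PreflightBlocked` are illustrative names, the 64Gi default is the PRO2-M node size quoted in this document, and the memory estimate itself is taken as an input because the estimation method is not described here.

```python
class PreflightBlocked(RuntimeError):
    """Raised when the estimated memory exceeds the allowed share of a node."""


def preflight_check(
    estimated_gi: float,
    node_capacity_gi: float = 64.0,  # PRO2-M RAM
    threshold: float = 0.70,         # BLOCK above 70% of node capacity
) -> float:
    """Return the remaining headroom in Gi, or raise if the run must be blocked."""
    budget_gi = threshold * node_capacity_gi
    if estimated_gi > budget_gi:
        raise PreflightBlocked(
            f"estimated {estimated_gi:.1f}Gi exceeds {budget_gi:.1f}Gi "
            f"({threshold:.0%} of {node_capacity_gi:.0f}Gi)"
        )
    return budget_gi - estimated_gi
```

Under these assumptions the block threshold is 44.8Gi, so a 15Gi heavy pod passes while a misconfigured 50Gi estimate is rejected before any training starts.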

OOM Scenarios and Mitigations

| Scenario | Risk | Mitigation |
| --- | --- | --- |
| cusum_training_mode=disabled | 10-15Gi per pod | Heavy profile (12Gi req, 24Gi limit) + 2 nodes |
| n_features=full (272 features) | 5-8Gi per pod | Heavy profile + feature cap fix (#552) |
| Memory leak during HPO | Slow growth over 30 trials | Pod limit (8/24Gi) + Optuna trial isolation |
| Concurrent heavy + normal | 80Gi+ total | Never concurrent; heavy runs alone |
| Node memory fragmentation | Random OOM despite margin | retries=1 catches transient failures |

7. Retry Policy

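| Parameter | Value | Rationale |
| --- | --- | --- |
| retries | 1 | Catch transient failures (GC spike, node preemption) |
| retry_delay | 60s | Let node recover |
| retry_exponential_backoff | True | Delay doubles per attempt (60s → 120s); only takes effect if retries is raised above 1 |
| max_retry_delay | 300s | Cap at 5 min |
| Pod restart policy | Never | Airflow manages lifecycle |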
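As a sketch, the Airflow-side settings in this table could be expressed as a `default_args` dictionary. The four keys are standard task-level arguments in Airflow 2.x; how the real DAG wires them in is an assumption. The pod restart policy is set on the pod spec, not here.

```python
from datetime import timedelta

# Task-level retry settings mirroring the table above.
# Assumption: passed to the DAG as default_args so every task inherits them.
default_args = {
    "retries": 1,
    "retry_delay": timedelta(seconds=60),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(seconds=300),
}
```

A DAG would typically receive this as `DAG(..., default_args=default_args)`, which applies the same policy to every task unless a task overrides it.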

Design intent: With proper margins, retries should NEVER trigger. If a task uses its retry, it's a signal that sizing is wrong — investigate before the next session.

Monitoring: Grafana panel for retry count. Alert if retries > 0 in any session.


8. Autoscaler Configuration

| Parameter | Value |
| --- | --- |
| Pool name | compute |
| Node type | PRO2-M (16 CPU, 64Gi) |
| Min size | 0 |
| Max size | 6 |
| Scale-up trigger | Pod Pending (insufficient resources) |
| Scale-up time | ~3 min (Scaleway SLA: < 5 min) |
| Scale-down trigger | Node idle (no pods) for 10 min |
| Scale-down protection | Pods running → node NOT drained |

Scale-Up Behavior

t=0:    FTF triggered → 10 pods created → Pending
t=0:    Autoscaler detects Pending pods
t=1m:   Scaleway API: provision PRO2-M node #1
t=2m:   Scaleway API: provision PRO2-M node #2 (if needed)
t=3m:   Nodes Ready → pods scheduled → Running
t=3m:   If still Pending pods → provision node #3

Scale-Down Behavior

t=0:    Last FTF pod completes → Succeeded
t=10m:  Autoscaler: node #3 idle → drain + terminate
t=10m:  Autoscaler: node #2 idle → drain + terminate
t=10m:  Autoscaler: node #1 idle → drain + terminate
t=10m:  Compute pool size = 0. Cost = EUR 0.

9. Cost Model

Unit Costs

| Resource | Cost |
| --- | --- |
| PRO2-S (permanent, 24/7) | EUR 60/month |
| PRO2-M (per node per hour) | EUR 0.1644/h |
| PostgreSQL managed | EUR 20/month |
| S3 storage | EUR 5/month |

Monthly Projections

Normal Month (3 FTF sessions/week, daily discovery)

| Component | Calculation | Cost |
| --- | --- | --- |
| Permanent | 1 × PRO2-S × 730h | EUR 60 |
| FTF normal | 12 sessions × 4h × 3 nodes × EUR 0.16 | EUR 23 |
| FTF heavy | 2 sessions × 10h × 2 nodes × EUR 0.16 | EUR 6 |
| Discovery | 20 sessions × 1h × 2 nodes × EUR 0.16 | EUR 6 |
| WFRB | 20 sessions × 0.5h × 1 node × EUR 0.16 | EUR 2 |
| DB + S3 | fixed | EUR 25 |
| Total | | EUR 122 |

Note: FTF normal takes 4h per session with max_active_runs=2 (vs 2.5h with 3). Trade-off: reliability over speed.

Heavy Month (5 FTF sessions/week, cusum experiments, daily everything)

| Component | Calculation | Cost |
| --- | --- | --- |
| Permanent | | EUR 60 |
| FTF normal | 20 sessions × 4h × 3 nodes × EUR 0.16 | EUR 38 |
| FTF heavy | 4 sessions × 10h × 2 nodes × EUR 0.16 | EUR 13 |
| Discovery | 30 sessions × 1h × 2 nodes × EUR 0.16 | EUR 10 |
| WFRB | 30 sessions × 0.5h × 1 node × EUR 0.16 | EUR 2 |
| DB + S3 | | EUR 25 |
| Total | | EUR 148 |

Worst Month (intensive research, multiple cusum experiments)

| Component | Calculation | Cost |
| --- | --- | --- |
| Permanent | | EUR 60 |
| FTF normal | 25 sessions × 4h × 3 nodes × EUR 0.16 | EUR 48 |
| FTF heavy | 8 sessions × 10h × 2 nodes × EUR 0.16 | EUR 26 |
| Discovery + WFRB | 40 sessions × 1.5h × 2 nodes × EUR 0.16 | EUR 19 |
| DB + S3 | | EUR 25 |
| Total | | EUR 178 |

All scenarios under EUR 200 budget. Worst month margin: EUR 22 (11%).
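The three projections can be reproduced with a few lines of arithmetic. This is a sketch using the rounded EUR 0.16/h rate from the tables; `monthly_cost` and the scenario lists are hypothetical names, and each tuple is (sessions, hours per session, nodes).

```python
NODE_HOURLY_EUR = 0.16   # rounded PRO2-M rate used in the projections
FIXED_EUR = 60 + 25      # permanent PRO2-S + managed PostgreSQL and S3


def monthly_cost(workloads: list[tuple[float, float, int]]) -> int:
    """Total monthly cost in whole EUR for (sessions, hours, nodes) workloads."""
    compute = sum(s * h * n * NODE_HOURLY_EUR for s, h, n in workloads)
    return round(FIXED_EUR + compute)


NORMAL = [(12, 4, 3), (2, 10, 2), (20, 1, 2), (20, 0.5, 1)]
HEAVY = [(20, 4, 3), (4, 10, 2), (30, 1, 2), (30, 0.5, 1)]
WORST = [(25, 4, 3), (8, 10, 2), (40, 1.5, 2)]

for name, workloads in [("normal", NORMAL), ("heavy", HEAVY), ("worst", WORST)]:
    total = monthly_cost(workloads)
    print(f"{name}: EUR {total}, budget margin EUR {200 - total}")
# normal: EUR 122, budget margin EUR 78
# heavy: EUR 148, budget margin EUR 52
# worst: EUR 178, budget margin EUR 22
```

The tables round each line before summing while this sketch rounds once at the end; for these three scenarios both approaches give the same totals.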


10. Observability & Alerting (V2 reco #2)

| Metric | Source | Alert threshold | Action |
| --- | --- | --- | --- |
| Pod CPU actual vs limit | Prometheus: rate(container_cpu_usage_seconds_total) / kube_pod_container_resource_limits{resource="cpu"} | > 90% sustained 10 min | Possible throttling; check if more nodes needed |
| Pod RAM actual vs limit | Prometheus: container_memory_working_set_bytes / kube_pod_container_resource_limits{resource="memory"} | > 80% | OOM risk; investigate |
| Node CPU utilization | Prometheus: instance:node_cpu_utilisation | > 85% | Scale up or reduce parallelism |
| Node RAM utilization | Prometheus: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | > 75% | Scale up |
| Pending pods | kube-state-metrics: kube_pod_status_phase{phase="Pending"} | > 0 for 5 min | Autoscaler may be stuck |
| Autoscaler scale-up time | Event annotations | > 5 min | Investigate Scaleway API |
| Run duration | PostgreSQL: elapsed_s | > 600s per run (10 min) | Possible misconfiguration |
| Retry count | Airflow task instance | > 0 | Sizing may be wrong; investigate |
| External dependencies | Probes: MLflow, PostgreSQL, Redis, S3 | Any down | Alert immediately |

All metrics visible on Grafana Infrastructure Monitoring dashboard (#529).


11. Security (V2 reco #3)

| Layer | Mechanism | Status |
| --- | --- | --- |
| RBAC | Scaleway IAM roles: admin + CI service account. Namespace-level RBAC in K8s. | Configured |
| Network policies | Compute pods → PostgreSQL, MLflow, Redis only. No external egress except Binance API. | To implement |
| Secrets management | Kubernetes Secrets (to migrate from ConfigMap, #527). No secrets in git. | In progress |
| Image scanning | CI builds with pinned base image. No latest tag (ADR-3). | Configured |
| Runtime security | Pod security standards: non-root, read-only root FS where possible. | Partial |
| Trust boundaries | Permanent pool (platform) isolated from compute pool (workloads) via nodeSelector. | Configured |

12. Growth Strategy (V2 reco #4)

| Milestone | Compute need | Architecture |
| --- | --- | --- |
| Current (5 cryptos, FTF) | 4 PRO2-M peak | Pool max=6 |
| defi_full (17 cryptos) | ~14 PRO2-M peak (17 pods × 4 CPU) | Pool max=16 or switch to PRO2-L (32 CPU) |
| Live trading (paper + live) | +2 PRO2-M permanent | Dedicated trading pool (always-on, low-latency) |
| Multi-strategy | ×N of FTF | Separate pools per strategy or time-share |

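Decision point: When defi_full is needed, evaluate PRO2-L (32 CPU, EUR 0.33/h) vs more PRO2-M. At the prices quoted in this document the two cost about the same per core (EUR 0.33 / 32 ≈ EUR 0.0103 vs EUR 0.1644 / 16 ≈ EUR 0.0103 per core-hour), so the choice should rest on pod packing and node count, not on per-core price.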

Budget projection for growth:
- defi_full FTF: ~EUR 250/month → exceeds EUR 200 budget → need budget revision
- Live trading: ~EUR 100/month additional → separate budget line


13. Rollback Procedures (V2 reco #6)

| Scenario | Steps | RTO |
| --- | --- | --- |
| Bad Helm values (wrong sizing) | git revert → CI/CD auto-deploy → kill pods → retrigger | 10 min |
| Autoscaler stuck | Manual: scw k8s pool update <id> size=N → force node creation | 5 min |
| Node failure during run | Airflow retry (retries=1) → pod rescheduled on healthy node | 2 min |
| All compute nodes down | Workloads paused. Platform unaffected (permanent pool). Wait for autoscaler recovery. | 5-10 min |
| Cost overrun detected | Pause all FTF DAGs → scale pool to 0 → investigate | Immediate |

14. Implementation

| Task | Effort | Priority |
| --- | --- | --- |
| Update Helm: standard profile (cpu=2/4, ram=2/8Gi) | 2 lines | HIGH |
| Heavy profile in DAG (conditional resources) | 15 lines | HIGH |
| Auto-detect heavy workload | 10 lines | HIGH |
| retries=1 + exponential backoff | 3 lines | HIGH |
| Pool max=6 (already done) | | Done |
| Grafana: cost tracking + retry panel | 2 panels | MEDIUM |
| OPERATIONS.md: sizing rationale | Doc | MEDIUM |

15. Acceptance Criteria

  • FTF normal (2 factors in parallel, max_active_runs=2) < 2h30 per factor
  • FTF heavy (cusum_off) completes without OOM on 2 nodes
  • > 40% RAM margin on ALL scenarios (verified by analysis)
  • Discovery + FTF concurrent without scheduling issues
  • Monthly cost < EUR 200 in all projections
  • Scale-up < 5 min, scale-down < 15 min
  • Zero OOMKill in 10 consecutive sessions
  • Zero permanent task failures (retry resolves any transient)
  • Heavy profile auto-detected — no manual config needed