Infrastructure Dashboard: Production-Grade Design V2

Issue: #529 | Date: 2026-04-14 | Status: V2, amended after committee review (V1 rejected: ML_USELESS)

Design Principles

Glance (5 sec): Is everything OK right now? Green/red status bar.
Scan (30 sec): Was everything OK in the last 12 hours? Timeseries with anomaly zones.
Investigate (2 min): Are resources efficient? Where are the bottlenecks? Drill-down panels.
Act: Alert log with diagnostic context. Threshold-based escalation (Slack, SMS).

Dashboard Layout: 11 Sections, ~35 Panels

Section 1: Health Status Bar (y=0, h=3)

Purpose: One row of stats. Green = OK, yellow = warning, red = action needed. Read left-to-right in 5 seconds.

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ CLUSTER  │ PERM     │ COMPUTE  │ FTF PODS │ ERROR    │ LATENCY  │ AVAIL    │ COST     │
│ STATUS   │ NODE     │ NODE     │ ACTIVE   │ RATE     │ p99      │ 12h %    │ EUR/h    │
│ [icon]   │ 1 [stat] │ 1 [stat] │ 5 [stat] │ 0% [st]  │ 45ms[st] │ 99.9%[st]│ 0.16[st] │
│ green bg │ green bg │ orange bg│ green bg │ green bg │ green bg │ green bg │ orange bg│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

Panel	Control	Query	Thresholds	SLA
Cluster Status	stat + icon	`min(up{job=~".*kube.*"})`	1=green, 0=red	Availability >99.9%
Permanent Node	stat	`count(kube_node_info{node=~".*permanent.*"})`	0=red, 1=green	Always 1
Compute Node	stat	`count(kube_node_info{node=~".*compute.*"})`	0=blue, 1+=orange	0 when idle
FTF Pods	stat	`sum(kube_pod_container_status_running{pod=~"finetune.*"})`	0=blue, 5=green	5 during FTF
Error Rate	stat	`sum(max_over_time(kube_pod_container_status_terminated_reason{reason!="Completed"}[1h]))`	0=green, >0=red	0 errors
Latency p99	stat	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` (seconds; display in ms)	<100ms=green, 100-500ms=yellow, >500ms=red	<100ms
Availability 12h	stat	`avg_over_time(up{job=~".*kube.*"}[12h]) * 100`	>99.9=green, >99=yellow, <99=red	>99.9%
Cost EUR/h	stat	`count(kube_node_info{node=~".*compute.*"}) * 0.1644`	0=green, >0=orange	Minimize
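
The threshold columns above all follow the same step pattern: a base colour plus ascending cut-offs. The following is a minimal Python sketch of that mapping, assuming Grafana-style semantics in which a step applies from its cut-off upward. The function name `threshold_colour`, the step lists, and the handling of exact boundary values are illustrative assumptions, not part of the design.

```python
from typing import Sequence, Tuple


def threshold_colour(value: float, base: str, steps: Sequence[Tuple[float, str]]) -> str:
    """Return the colour of the highest step whose cut-off is <= value.

    `steps` must be sorted by ascending cut-off; `base` applies below the first step.
    """
    colour = base
    for cutoff, step_colour in steps:
        if value >= cutoff:
            colour = step_colour
    return colour


# Latency p99 in ms: green below 100, yellow from 100, red from 500
# (treating the exact boundaries this way is an assumption).
LATENCY_STEPS = [(100, "yellow"), (500, "red")]

# Compute Node count: blue at 0, orange from 1 node upward.
COMPUTE_STEPS = [(1, "orange")]

print(threshold_colour(45, "green", LATENCY_STEPS))
print(threshold_colour(1, "blue", COMPUTE_STEPS))
```

Keeping the cut-offs in one data structure per panel makes it easy to check the mock-up colours against the threshold table before building the dashboard JSON.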

Section 2: 12-Hour Timeline (y=3, h=8)

Purpose: Was everything OK? Spot anomalies visually.

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ CPU UTILIZATION — ALL NODES (timeseries)         │ MEMORY UTILIZATION — ALL NODES (timeseries)     │
│                                                  │                                                  │
│ ████████████████████████████░░░░░░ 75%           │ ██████████░░░░░░░░░░░░░░░░░░░░ 35%              │
│ ─────────────────────── capacity line (red dash) │ ─────────────────────── capacity line (red dash) │
│ Permanent (blue fill) + Compute (orange fill)    │ Same color coding                                │
│ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90%          │ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90%          │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ REQUEST THROUGHPUT (timeseries, stacked)          │ ERROR TIMELINE (timeseries, red bars)            │
│                                                  │                                                  │
│ Requests/sec by service (API, Airflow, MLflow)   │ OOM kills, restarts, failed pods                 │
│ Stacked area — shows load distribution           │ Each bar = 1 event. Zero baseline = healthy      │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘

Panel	Control	Query	Thresholds
CPU All Nodes	timeseries, stacked area	`sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))` + capacity overlay	>70% yellow, >90% red
Memory All Nodes	timeseries, stacked area	`(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`	>70% yellow, >90% red
Request Throughput	timeseries, stacked	`sum by (pod) (rate(container_network_receive_bytes_total[5m]))` (bytes/s, used as a proxy for request load)	—
Error Timeline	timeseries, bars	`increase(kube_pod_container_status_restarts_total[5m])` + OOM + terminated	>0 = red bar

Section 3: Resource Efficiency (y=11, h=8)

Purpose: Are resources well-used? Identify waste vs saturation.

┌────────────────┬────────────────┬────────────────┬────────────────────────────────────────────────┐
│ CPU            │ MEMORY         │ DISK           │ TOP 5 CPU CONSUMERS (bar chart, horizontal)    │
│ EFFICIENCY     │ EFFICIENCY     │ USAGE          │                                                │
│ [gauge]        │ [gauge]        │ [gauge]        │ pod-1 ████████████████ 3.2 cores                │
│ 82% [green]    │ 45% [red]      │ 12% [green]    │ pod-2 ███████████████ 3.0 cores                │
│ target >80%    │ target >60%    │ alert >85%     │ scheduler █████ 0.5 cores                       │
└────────────────┴────────────────┴────────────────┴────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ CPU BALANCE (pie chart)                        │ MEMORY BALANCE (pie chart)                          │
│                                                │                                                     │
│ ● FTF pods (orange) — 78%                      │ ● FTF pods (orange) — 45%                          │
│ ● Airflow (blue) — 12%                         │ ● Airflow (blue) — 25%                             │
│ ● MLflow (green) — 3%                          │ ● MLflow (green) — 15%                             │
│ ● Other (grey) — 7%                            │ ● Other (grey) — 15%                               │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

Panel	Control	Query	Thresholds
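CPU Efficiency	gauge	`sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum(kube_pod_container_resource_requests{resource="cpu"}) * 100`	<50% red (waste), >80% green
Memory Efficiency	gauge	`sum(container_memory_working_set_bytes{container!=""}) / sum(kube_pod_container_resource_requests{resource="memory"}) * 100`	<50% red, >60% green
Disk Usage	gauge	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`	>85% red
Top 5 CPU	bar chart horizontal	`topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))`	—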
CPU Balance	pie chart	`sum by (category) (cpu_usage)` — regex on pod name	—
Memory Balance	pie chart	`sum by (category) (mem_working_set)`	—

Section 4: FTF Parallel Execution (y=19, h=10)

Purpose: Is parallelism working? Are pods balanced? Any stragglers?

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF CPU PER POD (timeseries)                     │ CPU THROTTLING (timeseries, threshold zones)     │
│                                                  │                                                  │
│ ── pod-1 (3.2 cores)                             │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 42% ← yellow zone          │
│ ── pod-2 (2.9 cores)                             │ Zones: green <30%, yellow 30-50%, red >50%       │
│ ── pod-3 (3.1 cores)                             │                                                  │
│ - - - limit (4 cores) [red dashed]               │ High throttling = CPU limit too low              │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF MEMORY PER POD (timeseries)                  │ FTF DISK I/O (timeseries)                        │
│                                                  │                                                  │
│ ── pod-1 (1.1 Gi)                                │ Write KB/s — spikes at model checkpoint          │
│ ── pod-2 (1.0 Gi)                                │ Read KB/s — spikes at data load                  │
│ - - - limit (8 Gi) [red dashed]                  │                                                  │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘

Panel	Control	Query	Thresholds
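FTF CPU per Pod	timeseries + red dashed limit	`sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"finetune.*"}[5m]))`	limit=4 cores
CPU Throttling	timeseries + threshold zones	`rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) * 100`	<30% green, 30-50% yellow, >50% red
FTF Memory per Pod	timeseries + red dashed limit	`container_memory_working_set_bytes{pod=~"finetune.*"} / 1024^3`	limit=8Gi
FTF Disk I/O	timeseries	`rate(container_fs_writes_bytes_total[5m]) / 1024` (write KB/s) and `rate(container_fs_reads_bytes_total[5m]) / 1024` (read KB/s)	—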

Section 5: Autoscaler & Latency (y=29, h=6)

Purpose: How does autoscale behave? What's the cold-start impact?

┌─────────────────────────────────────────────────┬──────────┬──────────┬──────────────────────────┐
│ NODE LIFECYCLE TIMELINE (state-timeline)          │ UPTIME   │ WAKE-UP  │ SCALE EVENTS             │
│                                                  │          │ TIME     │                          │
│ permanent ██████████████████████████████████████  │ 2.4h     │ 2m47s    │ 08:22 compute UP         │
│ compute   ░░░░░░░░░░░░████████████████████████   │ [stat]   │ [stat]   │ 08:22 5 pods scheduled   │
│                                                  │ orange   │ green    │ [table, last 20]         │
│ Green=Ready, Grey=absent                         │          │ <5m=grn  │                          │
└─────────────────────────────────────────────────┴──────────┴──────────┴──────────────────────────┘

Section 6: Swap & System Health (y=35, h=6)

Purpose: System-level health. Swap = memory pressure = early warning.

┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ SWAP USAGE PER NODE (timeseries)               │ OPEN FILE DESCRIPTORS (timeseries)                  │
│                                                │                                                     │
│ permanent: 0 MB ✓                              │ Per node — high FD = connection leak                │
│ compute: 0 MB ✓                                │ Alert > 80% of limit                               │
│ ANY swap > 0 = memory pressure → alert         │                                                     │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ NETWORK THROUGHPUT (timeseries, stacked)        │ PVC USAGE (gauge per volume)                        │
│                                                │                                                     │
│ rx/tx per pod, KB/s                            │ Green <70%, Yellow 70-85%, Red >85%                │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

Section 7: Process Sizing (y=41, h=6)

Purpose: Are pods correctly sized? Under-provisioned = throttled. Over-provisioned = waste.

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ POD SIZING TABLE                                                                                  │
│                                                                                                    │
│ Pod              │ CPU req │ CPU lim │ CPU used │ CPU %  │ MEM req │ MEM lim │ MEM used │ MEM %   │
│ finetune-crypto-1│ 2       │ 4       │ 3.2      │ 80% ██ │ 4Gi     │ 8Gi     │ 1.1Gi    │ 14% ░░  │
│ finetune-crypto-2│ 2       │ 4       │ 2.9      │ 73% ██ │ 4Gi     │ 8Gi     │ 1.0Gi    │ 13% ░░  │
│ airflow-sched    │ 0.5     │ 1       │ 0.4      │ 40% ░░ │ 1Gi     │ 2Gi     │ 0.6Gi    │ 30% ░░  │
│ mlflow           │ 0.25    │ 0.5     │ 0.01     │  2% ░░ │ 512Mi   │ 1Gi     │ 726Mi    │ 71% ██  │
│                                                                                                    │
│ Color: CPU% >90% = red (throttled), <20% = blue (over-provisioned)                                │
│        MEM% >80% = red (OOM risk), <20% = blue (over-provisioned)                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
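
The percentage columns in the sizing table can be derived as follows. This is a minimal Python sketch, assuming percentages are taken against the limit (which is what the MEM % values in the mock-up imply) and that quantities use Kubernetes-style binary suffixes. The helper names and the `"neutral"` label for the uncoloured middle band are illustrative assumptions.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> float:
    """Convert a binary quantity such as '726Mi' or '1.1Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # no suffix: treat as a plain byte count


def percent_of_limit(used: float, limit: float) -> float:
    """Usage as a percentage of the configured limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit * 100


def sizing_colour(pct: float, red_above: float, blue_below: float = 20.0) -> str:
    """Red above the risk threshold, blue when over-provisioned, else neutral."""
    if pct > red_above:
        return "red"
    if pct < blue_below:
        return "blue"
    return "neutral"


# Rows taken from the mock-up: mlflow memory and finetune-crypto-1 CPU.
mlflow_mem = percent_of_limit(parse_memory("726Mi"), parse_memory("1Gi"))
crypto1_cpu = percent_of_limit(3.2, 4)
print(round(mlflow_mem), sizing_colour(mlflow_mem, red_above=80))
print(round(crypto1_cpu), sizing_colour(crypto1_cpu, red_above=90))
```

Computing against the limit rather than the request keeps the red zone aligned with the failure modes named in the legend: throttling and OOM kills both happen at the limit.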

Section 8: Application Health (y=47, h=8)

Purpose: Application-level observability. DB, certs, DNS, HTTP errors.

┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ DB CONNECTIONS (timeseries + limit line)        │ HTTP ERROR RATES BY SERVICE (timeseries)            │
│                                                │                                                     │
│ Active connections vs pool max                 │ 4xx (yellow) + 5xx (red) per service               │
│ ── active (15)                                 │ API, Airflow webserver, MLflow                      │
│ - - - pool max (50) [red dashed]               │ Zero = healthy                                     │
│ Alert > 80% pool                               │                                                     │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ RESPONSE LATENCY p50/p95/p99 (timeseries)      │ CERTIFICATE EXPIRY (stat, days)                     │
│                                                │                                                     │
│ By service — green <100ms, red >500ms          │ TLS cert days remaining                             │
│ Identifies slow services                       │ >30d = green, 7-30d = yellow, <7d = red            │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ DNS RESOLUTION TIME (timeseries)                                                                    │
│                                                                                                      │
│ Latency for external APIs: Binance, Langfuse, Scaleway S3                                           │
│ Green <50ms, Yellow 50-200ms, Red >200ms                                                            │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘

Panel	Control	Query	SLA
DB Connections	timeseries + limit	`sum(pg_stat_activity_count)` against the pool max (50), with `pg_settings_max_connections` as the server-side ceiling	<80% pool
HTTP Error Rates	timeseries	`sum by (service) (rate(http_requests_total{status=~"[45].."}[5m]))`	0 errors
Response Latency	timeseries (p50/p95/p99)	`histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))`	p99 <100ms
Certificate Expiry	stat	`(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400`	>30 days
DNS Resolution	timeseries	`dns_lookup_duration_seconds` or `probe_dns_lookup_time_seconds`	<50ms

Section 9: Airflow Health (y=55, h=8)

Purpose: Is the orchestrator healthy? Queue bottlenecks? Failed tasks?

┌──────────┬──────────┬──────────┬────────────────────────────────────────────────────────────────┐
│ SCHED    │ DAG PARSE│ QUEUE    │ FAILED TASKS (24h) — table                                    │
│ HEARTBEAT│ TIME     │ DEPTH    │                                                                │
│ [stat]   │ [stat]   │ [stat]   │ Time        │ DAG              │ Task       │ Error             │
│ 12s      │ 2.4s     │ 0        │ 14:22       │ finetune__pte    │ run_crypto │ OOM               │
│ <30s grn │ <5s grn  │ 0=grn    │ 13:05       │ model__train     │ train      │ timeout            │
│ >60s red │ >10s red │ >5 red   │             │                  │            │                    │
└──────────┴──────────┴──────────┴────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ TASK STATES OVER TIME (timeseries, stacked)     │ WORKER UTILIZATION (gauge)                          │
│                                                │                                                     │
│ ■ running (green)                              │ Active slots / total slots                          │
│ ■ queued (yellow)                              │ Target: 60-90%                                      │
│ ■ failed (red)                                 │ <30% = over-provisioned                             │
│ ■ success (blue, low opacity)                  │ >95% = saturated                                    │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

Panel	Control	Query	SLA
Scheduler Heartbeat	stat	`airflow_scheduler_heartbeat` age	<30s green, >60s red
DAG Parse Time	stat	`airflow_dag_processing_total_parse_time`	<5s green, >10s red
Queue Depth	stat	`airflow_pool_queued_slots`	0=green, >5=red
Failed Tasks 24h	table	`airflow_ti_failures` or PostgreSQL query on task_instance	0 target
Task States	timeseries, stacked	`airflow_pool_running/queued/open_slots`	—
Worker Utilization	gauge	`running / (running + open) * 100`	60-90% target

Note: Requires enabling Airflow StatsD exporter or Prometheus endpoint (AIRFLOW__METRICS__STATSD_ON=True).
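
The Worker Utilization gauge and its bands can be expressed as a short calculation. This is a hedged Python sketch: the function names are hypothetical, and the `"outside target"` label for values between the named bands is an assumption, since the design only names the target range and the two extremes.

```python
def worker_utilization(running: int, open_slots: int) -> float:
    """running / (running + open) * 100, or 0.0 when the pool has no slots."""
    total = running + open_slots
    if total == 0:
        return 0.0
    return running / total * 100


def utilization_band(pct: float) -> str:
    """Classify utilization using the bands from the Worker Utilization panel."""
    if pct < 30:
        return "over-provisioned"
    if pct > 95:
        return "saturated"
    if 60 <= pct <= 90:
        return "on target"
    return "outside target"  # assumed label for 30-60% and 90-95%


pct = worker_utilization(running=3, open_slots=1)
print(pct, utilization_band(pct))
```

Guarding the zero-slot case avoids a division error when the pool metrics are missing, for example before the Airflow metrics endpoint is enabled.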

Section 10: Performance Drift Indicators (y=63, h=8)

Purpose: Is the ML pipeline degrading over time? Early warning before results go bad.

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ F1 MACRO TREND (timeseries, 30d)                 │ SORTINO TREND (timeseries, 30d)                  │
│                                                  │                                                  │
│ Rolling average f1_macro per run                 │ Rolling average Sortino per crypto per run        │
│ ── current run                                   │ ── UNIUSDC                                       │
│ - - 7d moving average                            │ ── OPUSDC                                        │
│ ▓▓▓ Alert zone: >10% drop from 7d avg           │ ── AAVEUSDC (always low — candidate exclusion)    │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ ACTION RATE DRIFT (timeseries)                   │ CUSUM FILTER RATE (timeseries)                    │
│                                                  │                                                  │
│ % of candles where model predicts BUY            │ % candles filtered by CUSUM pre-inference         │
│ Increase = model becoming permissive             │ Stable ~95% = normal                             │
│ Decrease = model becoming conservative           │ Drop = CUSUM relaxing → more signals             │
│ Alert on >20% change from baseline               │ Spike = market regime shift                      │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ TRAINING SAMPLE TREND (timeseries)                                                                │
│                                                                                                    │
│ n_train_samples per run — should be stable (~27K for 12m window)                                  │
│ Drop = data pipeline issue, missing OHLCV data, Binance API problem                               │
│ Alert on >15% drop from baseline                                                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

Panel	Control	Query (PostgreSQL)	Alert
F1 Macro Trend	timeseries	`SELECT run_id, AVG(f1_macro) FROM finetune_results GROUP BY run_id ORDER BY run_id`	>10% drop from 7d avg
Sortino Trend	timeseries, per crypto	`SELECT run_id, crypto, AVG(sortino) FROM finetune_results WHERE cost_bps=15 GROUP BY run_id, crypto ORDER BY run_id, crypto`	—
Action Rate Drift	timeseries	`SELECT run_id, AVG(action_rate) FROM finetune_results GROUP BY run_id ORDER BY run_id`	>20% change
CUSUM Filter Rate	timeseries	`SELECT run_id, AVG(cusum_block_rate) FROM finetune_results GROUP BY run_id ORDER BY run_id`	—
Training Sample Trend	timeseries	`SELECT run_id, AVG(n_train_samples) FROM finetune_results GROUP BY run_id ORDER BY run_id`	>15% drop

Data source: PostgreSQL (Grafana datasource P5C4B7CDEC9D3684F already configured).
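
The drop-based alerts in this section (F1 more than 10% below the 7-day average, training samples more than 15% below baseline) share one calculation. The sketch below is a minimal Python illustration, assuming the caller supplies the per-run values that fall inside the baseline window. The function names and the sample numbers are made up for illustration and are not results from the pipeline.

```python
from statistics import mean
from typing import List, Optional


def drop_from_baseline(history: List[float], current: float) -> Optional[float]:
    """Relative drop of `current` versus the mean of `history`, in percent.

    Positive means a drop. Returns None when no usable baseline exists.
    """
    if not history:
        return None
    baseline = mean(history)
    if baseline <= 0:
        return None
    return (baseline - current) / baseline * 100


def is_drift(history: List[float], current: float, max_drop_pct: float) -> bool:
    """True when the drop from baseline exceeds the alert threshold."""
    drop = drop_from_baseline(history, current)
    return drop is not None and drop > max_drop_pct


# Illustrative values only.
f1_last_7_days = [0.50] * 7
print(is_drift(f1_last_7_days, current=0.44, max_drop_pct=10))

samples_baseline = [27000] * 7
print(is_drift(samples_baseline, current=24000, max_drop_pct=15))
```

Returning None for an empty or non-positive baseline keeps the first runs after deployment from raising false alerts.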

Section 11: Alert Log & Diagnostics (y=71, h=8)

Purpose: What went wrong? When? What to do about it?

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ALERT HISTORY (table, sorted by time desc)                                                        │
│                                                                                                    │
│ Time              │ Severity │ Alert                    │ Target       │ Value  │ Runbook          │
│ 2026-04-14 08:22  │ INFO     │ Compute node UP          │ compute-194  │ 1      │ —                │
│ 2026-04-14 08:20  │ WARNING  │ Pods pending (autoscale) │ finetune-*   │ 5      │ AUTOSCALER_STUCK │
│ 2026-04-13 22:30  │ CRITICAL │ OOMKilled                │ finetune-abc │ 1      │ OOM              │
│                                                                                                    │
│ Click runbook link → opens documentation/RUNBOOKS/{name}.md                                       │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────┬──────────────────────────────────────────────────────────┐
│ ACTIVE ALERTS (stat)                     │ ALERTS LAST 24H (stat)                                    │
│                                          │                                                           │
│ 0 [green background]                     │ 3 [yellow background]                                     │
└──────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

Alerting Rules (PrometheusRule CRD)

Critical (SMS + Slack), SLA: <5 min response

Rule	Condition	For	SLA	Runbook
Node Down	`absent(kube_node_info)`	2m	Availability >99.9%	NODE_DOWN
OOM Kill	OOMKilled event	0m	Zero OOM in prod	OOM
Pod CrashLoop	restarts >3 in 15m	5m	Service continuity	CRASHLOOP
Disk Full >90%	PVC usage	5m	Data integrity	DISK_FULL
DB Connection Saturated	active > 80% pool	5m	Query latency <100ms	DB_SATURATED
Scheduler Down	heartbeat > 120s	2m	Orchestration continuity	SCHEDULER_DOWN

Warning (Slack only), SLA: <30 min response

Rule	Condition	For	SLA	Runbook
CPU Saturated >90%	node CPU	10m	CPU headroom >10%	CPU_SATURATED
Memory Pressure (swap >0)	swap used >0	5m	Zero swap	MEMORY_PRESSURE
FTF Pod Stuck Pending	waiting >10m	10m	Autoscale <5 min	AUTOSCALER_STUCK
Compute Node Idle >30m	0 FTF pods	30m	Cost efficiency	COMPUTE_IDLE
CPU Throttling >50%	cfs throttled ratio	10m	<30% throttling	CPU_THROTTLING
Scheduler Heartbeat >60s	heartbeat age	2m	Scheduler health	SCHEDULER_DOWN
F1 Drift >10%	f1_macro drop	1h	Model stability	MODEL_DRIFT
Certificate Expiry <30d	days to expiry <30	24h	Zero expired certs	CERT_EXPIRY
HTTP 5xx Spike	5xx rate > 1/min	5m	Zero 5xx	HTTP_ERRORS

Info (annotation only)

Rule	Condition	Runbook
Compute Scale-Up	node appears	—
Compute Scale-Down	node disappears	—
FTF Run Start	FTF pods appear	—
FTF Run Complete	FTF pods disappear	—
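
Each row of the rule tables maps onto one entry of a PrometheusRule resource. The following Python sketch shows one way to assemble such a manifest from table rows. The helper names, the alert name `HTTP5xxSpike`, the resource name, and the exact PromQL expression are assumptions made for illustration; only the field layout (`alert`, `expr`, `for`, `labels`, `annotations`) follows the PrometheusRule schema.

```python
import json
from typing import Dict, List


def alert_rule(name: str, expr: str, for_: str, severity: str, runbook: str) -> Dict:
    """Build one alerting rule entry with a severity label and runbook link."""
    return {
        "alert": name,
        "expr": expr,
        "for": for_,
        "labels": {"severity": severity},
        "annotations": {"runbook": f"documentation/RUNBOOKS/{runbook}.md"},
    }


def prometheus_rule(resource_name: str, group: str, rules: List[Dict]) -> Dict:
    """Wrap rules in a PrometheusRule manifest (monitoring.coreos.com/v1)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": resource_name},
        "spec": {"groups": [{"name": group, "rules": rules}]},
    }


# One warning-level row; the expression is an assumed rendering of "5xx rate > 1/min".
rules = [
    alert_rule(
        name="HTTP5xxSpike",
        expr='sum(rate(http_requests_total{status=~"5.."}[5m])) * 60 > 1',
        for_="5m",
        severity="warning",
        runbook="HTTP_ERRORS",
    )
]
manifest = prometheus_rule("infra-dashboard-alerts", "warning", rules)
print(json.dumps(manifest, indent=2))
```

The output is JSON, which is also valid YAML, so it can be saved to a file and applied like any other manifest. The `severity` label is what Alertmanager routing would match on to choose between Slack and SMS.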

Escalation Chain

INFO     → Grafana annotation (visible on timeline)
WARNING  → Slack #cvntrade-alerts (immediate)
CRITICAL → Slack #cvntrade-alerts + SMS to on-call (immediate)

No response in 15 min → escalate to next on-call
No response in 30 min → PagerDuty incident

Integration: Alertmanager → Slack webhook + Twilio SMS gateway.
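
The escalation chain can be read as a function of severity and time without acknowledgement. The sketch below is a hypothetical Python model of it, not the Alertmanager configuration itself. It assumes that the 15-minute and 30-minute escalations apply to WARNING and CRITICAL alerts but not to INFO annotations, which the design does not state explicitly; the target strings are illustrative labels.

```python
from typing import List

# Immediate notification targets per severity, as listed in the escalation chain.
CHANNELS = {
    "INFO": ["grafana-annotation"],
    "WARNING": ["slack:#cvntrade-alerts"],
    "CRITICAL": ["slack:#cvntrade-alerts", "sms:on-call"],
}


def notification_targets(severity: str, minutes_unacknowledged: float = 0) -> List[str]:
    """Return every target that should have been notified by now."""
    targets = list(CHANNELS[severity])
    if severity == "INFO":
        return targets  # assumption: annotations never escalate
    if minutes_unacknowledged >= 15:
        targets.append("next-on-call")
    if minutes_unacknowledged >= 30:
        targets.append("pagerduty-incident")
    return targets


print(notification_targets("CRITICAL"))
print(notification_targets("WARNING", minutes_unacknowledged=31))
```

Writing the chain down this way makes the open question visible: whether a WARNING that goes unanswered should ever reach PagerDuty is a policy decision to settle before configuring the receivers.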

Runbooks (documentation/RUNBOOKS/)

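Each alert links to a runbook. The ten below are outlined; CRASHLOOP, MEMORY_PRESSURE, COMPUTE_IDLE, and CPU_THROTTLING are referenced by the alerting rules but have no outline yet: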

Runbook	Sections
`NODE_DOWN.md`	Detection, check `kubectl get nodes`, check Scaleway console, manual node restart
`OOM.md`	Detection, identify pod, check memory limit, increase limit or optimize code
`DISK_FULL.md`	Detection, identify PVC, cleanup old data, extend volume
`AUTOSCALER_STUCK.md`	Detection, check Scaleway API, check pod resource requests, manual scale
`CPU_SATURATED.md`	Detection, identify top consumers, reduce parallelism or upgrade node
`SCHEDULER_DOWN.md`	Detection, check scheduler logs, restart scheduler pod
`MODEL_DRIFT.md`	Detection, compare recent F1/Sortino, check data pipeline, retrain
`CERT_EXPIRY.md`	Detection, renew cert, update secret
`DB_SATURATED.md`	Detection, check slow queries, increase pool, optimize queries
`HTTP_ERRORS.md`	Detection, check service logs, restart pod, check upstream

Runbook template:

# RUNBOOK: {ALERT_NAME}

## Detection
What triggered: {description}
Where to look: {Grafana panel, log command}

## Diagnosis (first 3 checks)
1. {check 1}
2. {check 2}
3. {check 3}

## Resolution
{step-by-step fix}

## Escalation
If unresolved in 15 min: contact {name/role}
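
With ten or more runbooks sharing this template, generating the skeleton files keeps them consistent. The following Python sketch fills the template for one alert. The function name, the field names, and the sample values (in particular the third check and the contact placeholder) are illustrative assumptions and would be replaced with real content per runbook.

```python
from typing import Sequence

RUNBOOK_TEMPLATE = """# RUNBOOK: {alert_name}

## Detection
What triggered: {description}
Where to look: {where}

## Diagnosis (first 3 checks)
1. {check_1}
2. {check_2}
3. {check_3}

## Resolution
{resolution}

## Escalation
If unresolved in 15 min: contact {contact}
"""


def render_runbook(alert_name: str, description: str, where: str,
                   checks: Sequence[str], resolution: str, contact: str) -> str:
    """Fill the runbook template; exactly three diagnosis checks are required."""
    if len(checks) != 3:
        raise ValueError("the template expects exactly 3 checks")
    return RUNBOOK_TEMPLATE.format(
        alert_name=alert_name,
        description=description,
        where=where,
        check_1=checks[0],
        check_2=checks[1],
        check_3=checks[2],
        resolution=resolution,
        contact=contact,
    )


# Placeholder content for the OOM runbook.
text = render_runbook(
    alert_name="OOM",
    description="A container was OOMKilled",
    where="Error Timeline panel",
    checks=["Identify the pod", "Check its memory limit", "<third check to be defined>"],
    resolution="Increase the limit or optimize the code",
    contact="<on-call role>",
)
print(text)
```

Rejecting anything other than three checks enforces the "first 3 checks" convention, so every runbook opens with the same short triage list.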

Implementation Plan

Phase	Scope	Effort
1	Sections 1-4 (status bar, timeline, efficiency, FTF)	~3h
2	Sections 5-7 (autoscaler, system health, sizing)	~2h
3	Section 8 (application health — DB, HTTP, certs, DNS)	~3h
4	Section 9 (Airflow health — requires StatsD/Prometheus endpoint)	~2h
5	Section 10 (performance drift — PostgreSQL queries)	~2h
6	Section 11 + PrometheusRules + Alertmanager config	~3h
7	Runbooks (10 docs)	~3h
8	Slack webhook + Twilio SMS integration + end-to-end test	~2h
Total		~20h

Prerequisites

Dependency	Status	Action
Prometheus + kube-state-metrics	Deployed	—
PostgreSQL datasource in Grafana	Configured	—
Airflow StatsD/Prometheus metrics	Not enabled	Set `AIRFLOW__METRICS__STATSD_ON=True`
cert-manager	Not deployed	Deploy for cert expiry tracking, or manual check
Alertmanager	Deployed	Configure receivers (Slack, SMS)
Slack webhook	Not configured	Create webhook for #cvntrade-alerts
Twilio SMS	Not configured	Create account + configure in Alertmanager