Infrastructure Dashboard — Production-Grade Design V2
Issue: #529 Date: 2026-04-14 Status: V2 — amended after committee review (REJECTED V1 — ML_USELESS)
Design Principles
- Glance (5 sec): Is everything OK right now? Green/red status bar.
- Scan (30 sec): Was everything OK in the last 12 hours? Timeseries with anomaly zones.
- Investigate (2 min): Are resources efficient? Where are the bottlenecks? Drill-down panels.
- Act: Alert log with diagnostic context. Threshold-based escalation (Slack, SMS).
Dashboard Layout — 11 Sections, ~35 Panels
Section 1: Health Status Bar (y=0, h=3)
Purpose: One row of stats. Green = OK, yellow = warning, red = action needed. Read left-to-right in 5 seconds.
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ CLUSTER │ PERM │ COMPUTE │ FTF PODS │ ERROR │ LATENCY │ AVAIL │ COST │
│ STATUS │ NODE │ NODE │ ACTIVE │ RATE │ p99 │ 12h % │ EUR/h │
│ [icon] │ 1 [stat] │ 1 [stat] │ 5 [stat] │ 0% [st] │ 45ms[st] │ 99.9%[st]│ 0.16[st]│
│ green bg │ green bg │ orange bg│ green bg │ green bg │ green bg │ green bg │ green bg │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
| Panel | Control | Query | Thresholds | SLA |
|---|---|---|---|---|
| Cluster Status | stat + icon | `min(up{job=~".*kube.*"})` | 1=green, 0=red | Availability >99.9% |
| Permanent Node | stat | `count(kube_node_info{node=~".*permanent.*"}) or vector(0)` | 0=red, 1=green | Always 1 |
| Compute Node | stat | `count(kube_node_info{node=~".*compute.*"}) or vector(0)` | 0=blue, 1+=orange | 0 when idle |
| FTF Pods | stat | `sum(kube_pod_container_status_running{pod=~"finetune.*"}) or vector(0)` | 0=blue, 5=green | 5 during FTF |
| Error Rate | stat | `sum(rate(kube_pod_container_status_terminated_reason{reason!="Completed"}[1h]))` | 0=green, >0=red | 0 errors |
| Latency p99 | stat | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` | <100ms=green, <500ms=yellow, >500ms=red | <100ms |
| Availability 12h | stat | `avg_over_time(up{job=~".*kube.*"}[12h]) * 100` | >99.9=green, >99=yellow, <99=red | >99.9% |
| Cost EUR/h | stat | `count(kube_node_info{node=~".*compute.*"}) * 0.1644 or vector(0)` | 0=green, >0=orange | Minimize |
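As a rough illustration of how one row of the table above could be turned into Grafana panel JSON, the sketch below builds a stat panel as a plain dict. The helper name `stat_panel`, the 3-column width (a 24-column grid split across 8 panels), and the `x=15` position for the sixth panel are assumptions for illustration; the query and threshold values come from the Latency p99 row.

```python
import json


def stat_panel(title, expr, steps, x, y=0, w=3, h=3):
    """Build a minimal Grafana stat panel dict (hypothetical helper).

    steps is a list of (lower_bound, color) pairs. The first bound must be
    None, which Grafana treats as the base step.
    """
    return {
        "type": "stat",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "thresholds": {
                    "mode": "absolute",
                    "steps": [{"color": c, "value": v} for v, c in steps],
                }
            }
        },
        # Color the whole tile, matching the "green bg" mock-up.
        "options": {"colorMode": "background"},
    }


# Latency p99: <100ms green, <500ms yellow, otherwise red (query returns seconds).
latency = stat_panel(
    "Latency p99",
    "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
    [(None, "green"), (0.1, "yellow"), (0.5, "red")],
    x=15,
)
print(json.dumps(latency, indent=2))
```

The same helper could be called once per row, incrementing `x` by 3 each time, to lay out the full status bar at y=0, h=3.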
Section 2: 12-Hour Timeline (y=3, h=8)
Purpose: Was everything OK? Spot anomalies visually.
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ CPU UTILIZATION — ALL NODES (timeseries) │ MEMORY UTILIZATION — ALL NODES (timeseries) │
│ │ │
│ ████████████████████████████░░░░░░ 75% │ ██████████░░░░░░░░░░░░░░░░░░░░ 35% │
│ ─────────────────────── capacity line (red dash) │ ─────────────────────── capacity line (red dash) │
│ Permanent (blue fill) + Compute (orange fill) │ Same color coding │
│ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90% │ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90% │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ REQUEST THROUGHPUT (timeseries, stacked) │ ERROR TIMELINE (timeseries, red bars) │
│ │ │
│ Requests/sec by service (API, Airflow, MLflow) │ OOM kills, restarts, failed pods │
│ Stacked area — shows load distribution │ Each bar = 1 event. Zero baseline = healthy │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| CPU All Nodes | timeseries, stacked area | `sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))` + capacity overlay | >70% yellow, >90% red |
| Memory All Nodes | timeseries, stacked area | `(MemTotal - MemAvailable) / MemTotal * 100` | >70% yellow, >90% red |
| Request Throughput | timeseries, stacked | `sum by (pod) (rate(container_network_receive_bytes_total[5m]))` | — |
| Error Timeline | timeseries, bars | `increase(restarts[5m])` + OOM + terminated | >0 = red bar |
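The memory formula and the yellow/red zones from the table above can be sketched in a few lines. The function names are hypothetical, and the inputs stand in for whatever the MemTotal and MemAvailable shorthand resolves to in the real metric set.

```python
def memory_utilization_pct(mem_total_bytes, mem_available_bytes):
    """(MemTotal - MemAvailable) / MemTotal * 100, as in the table."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


def utilization_zone(pct, warn=70.0, crit=90.0):
    """Map a utilization percentage to the timeline zones (>70% yellow, >90% red)."""
    if pct > crit:
        return "red"
    if pct > warn:
        return "yellow"
    return "normal"


# 16 units total, 4 available -> 75% used, which lands in the yellow zone.
pct = memory_utilization_pct(16, 4)
print(pct, utilization_zone(pct))
```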
Section 3: Resource Efficiency (y=11, h=8)
Purpose: Are resources well-used? Identify waste vs saturation.
┌────────────────┬────────────────┬────────────────┬────────────────────────────────────────────────┐
│ CPU │ MEMORY │ DISK │ TOP 5 CPU CONSUMERS (bar chart, horizontal) │
│ EFFICIENCY │ EFFICIENCY │ USAGE │ │
│ [gauge] │ [gauge] │ [gauge] │ pod-1 ████████████████ 3.2 cores │
│ 82% [green] │ 45% [yellow] │ 12% [green] │ pod-2 ███████████████ 3.0 cores │
│ target >80% │ target >60% │ alert >85% │ scheduler █████ 0.5 cores │
└────────────────┴────────────────┴────────────────┴────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ CPU BALANCE (pie chart) │ MEMORY BALANCE (pie chart) │
│ │ │
│ ● FTF pods (orange) — 78% │ ● FTF pods (orange) — 45% │
│ ● Airflow (blue) — 12% │ ● Airflow (blue) — 25% │
│ ● MLflow (green) — 3% │ ● MLflow (green) — 15% │
│ ● Other (grey) — 7% │ ● Other (grey) — 15% │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| CPU Efficiency | gauge | `sum(cpu_usage) / sum(cpu_requests) * 100` | <50% red (waste), >80% green |
| Memory Efficiency | gauge | `sum(mem_working_set) / sum(mem_requests) * 100` | <50% red, >60% green |
| Disk Usage | gauge | `pvc_used / pvc_capacity * 100` | >85% red |
| Top 5 CPU | bar chart horizontal | `topk(5, sum by (pod) (rate(cpu_usage[5m])))` | — |
| CPU Balance | pie chart | `sum by (category) (cpu_usage)` (category from regex on pod name) | — |
| Memory Balance | pie chart | `sum by (category) (mem_working_set)` | — |
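The balance pies group pods into categories by a regex on the pod name. A minimal sketch of that grouping follows; the patterns, the sample pod names, and the usage figures are illustrative assumptions, not values taken from the cluster.

```python
import re
from collections import defaultdict

# Hypothetical name patterns; adjust to the real pod naming scheme.
CATEGORY_PATTERNS = [
    ("FTF pods", re.compile(r"^finetune")),
    ("Airflow", re.compile(r"^airflow")),
    ("MLflow", re.compile(r"^mlflow")),
]


def categorize(pod_name):
    """Return the first matching category, or "Other" when nothing matches."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(pod_name):
            return category
    return "Other"


def balance(usage_by_pod):
    """Sum usage per category and express each as a percentage of the total."""
    totals = defaultdict(float)
    for pod, usage in usage_by_pod.items():
        totals[categorize(pod)] += usage
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: round(val / grand_total * 100, 1) for cat, val in totals.items()}


# Illustrative CPU usage in cores per pod.
shares = balance({
    "finetune-crypto-1": 4.0,
    "finetune-crypto-2": 4.0,
    "airflow-sched": 1.0,
    "mlflow": 0.5,
    "coredns": 0.5,
})
print(shares)
```

In Grafana itself the same grouping would more likely be done with `label_replace` or a value mapping; the Python version only shows the intended logic.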
Section 4: FTF Parallel Execution (y=19, h=10)
Purpose: Is parallelism working? Are pods balanced? Any stragglers?
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF CPU PER POD (timeseries) │ CPU THROTTLING (timeseries, threshold zones) │
│ │ │
│ ── pod-1 (3.2 cores) │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 42% ← yellow zone │
│ ── pod-2 (2.9 cores) │ Zones: green <30%, yellow 30-50%, red >50% │
│ ── pod-3 (3.1 cores) │ │
│ - - - limit (4 cores) [red dashed] │ High throttling = CPU limit too low │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF MEMORY PER POD (timeseries) │ FTF DISK I/O (timeseries) │
│ │ │
│ ── pod-1 (1.1 Gi) │ Write KB/s — spikes at model checkpoint │
│ ── pod-2 (1.0 Gi) │ Read KB/s — spikes at data load │
│ - - - limit (8 Gi) [red dashed] │ │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| FTF CPU per Pod | timeseries + red dashed limit | `sum by (pod) (rate(cpu{pod=~"finetune.*"}[5m]))` | limit=4 cores |
| CPU Throttling | timeseries + threshold zones | `cfs_throttled / cfs_total * 100` | <30% green, 30-50% yellow, >50% red |
| FTF Memory per Pod | timeseries + red dashed limit | `mem_working_set{pod=~"finetune.*"} / 1024^3` | limit=8Gi |
| FTF Disk I/O | timeseries | `rate(fs_writes_bytes_total[5m]) / 1024` | — |
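The throttling ratio and its zones can be checked by hand with a small sketch. It assumes the `cfs_throttled` and `cfs_total` shorthand refers to throttled and total CFS period counts over the same window; the function names are hypothetical.

```python
def throttling_pct(throttled_periods, total_periods):
    """Share of CFS periods in which the container was throttled, in percent."""
    if total_periods == 0:
        return 0.0  # no scheduling periods observed, so nothing was throttled
    return 100.0 * throttled_periods / total_periods


def throttling_zone(pct):
    """Zones from the table: <30% green, 30-50% yellow, >50% red."""
    if pct > 50:
        return "red"
    if pct >= 30:
        return "yellow"
    return "green"


# 42 throttled periods out of 100 matches the 42% yellow-zone mock-up.
pct = throttling_pct(42, 100)
print(pct, throttling_zone(pct))
```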
Section 5: Autoscaler & Latency (y=29, h=6)
Purpose: How does the autoscaler behave? What's the cold-start impact?
┌─────────────────────────────────────────────────┬──────────┬──────────┬──────────────────────────┐
│ NODE LIFECYCLE TIMELINE (state-timeline) │ UPTIME │ WAKE-UP │ SCALE EVENTS │
│ │ │ TIME │ │
│ permanent ██████████████████████████████████████ │ 2.4h │ 2m47s │ 08:22 compute UP │
│ compute ░░░░░░░░░░░░████████████████████████ │ [stat] │ [stat] │ 08:22 5 pods scheduled │
│ │ orange │ green │ [table, last 20] │
│ Green=Ready, Grey=absent │ │ <5m=grn │ │
└─────────────────────────────────────────────────┴──────────┴──────────┴──────────────────────────┘
Section 6: Swap & System Health (y=35, h=6)
Purpose: System-level health. Swap = memory pressure = early warning.
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ SWAP USAGE PER NODE (timeseries) │ OPEN FILE DESCRIPTORS (timeseries) │
│ │ │
│ permanent: 0 MB ✓ │ Per node — high FD = connection leak │
│ compute: 0 MB ✓ │ Alert > 80% of limit │
│ ANY swap > 0 = memory pressure → alert │ │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ NETWORK THROUGHPUT (timeseries, stacked) │ PVC USAGE (gauge per volume) │
│ │ │
│ rx/tx per pod, KB/s │ Green <70%, Yellow 70-85%, Red >85% │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
Section 7: Process Sizing (y=41, h=6)
Purpose: Are pods correctly sized? Under-provisioned = throttled. Over-provisioned = waste.
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ POD SIZING TABLE │
│ │
│ Pod │ CPU req │ CPU lim │ CPU used │ CPU % │ MEM req │ MEM lim │ MEM used │ MEM % │
│ finetune-crypto-1│ 2 │ 4 │ 3.2 │ 80% ██ │ 4Gi │ 8Gi │ 1.1Gi │ 14% ░░ │
│ finetune-crypto-2│ 2 │ 4 │ 2.9 │ 73% ██ │ 4Gi │ 8Gi │ 1.0Gi │ 13% ░░ │
│ airflow-sched │ 0.5 │ 1 │ 0.4 │ 40% ░░ │ 1Gi │ 2Gi │ 0.6Gi │ 30% ░░ │
│ mlflow │ 0.25 │ 0.5 │ 0.01 │ 2% ░░ │ 512Mi │ 1Gi │ 726Mi │ 71% ██ │
│ │
│ Color: CPU% >90% = red (throttled), <20% = blue (over-provisioned) │
│ MEM% >80% = red (OOM risk), <20% = blue (over-provisioned) │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
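The color rules in the sizing table can be expressed as a small classifier. This sketch assumes both percentages are computed against the limit (not the request), which is what the "throttled" and "OOM risk" labels suggest; the function names are hypothetical.

```python
def pct_of_limit(used, limit):
    """Usage as a percentage of the configured limit (same unit for both)."""
    return 100.0 * used / limit


def sizing_color(pct, high, low=20.0):
    """red above `high`, blue below `low`, otherwise no highlight."""
    if pct > high:
        return "red"
    if pct < low:
        return "blue"
    return "none"


CPU_HIGH = 90.0  # CPU% >90% = red (throttled)
MEM_HIGH = 80.0  # MEM% >80% = red (OOM risk)

# finetune-crypto-1 from the mock-up: 3.2 of 4 cores, 1.1 of 8 Gi.
cpu_pct = pct_of_limit(3.2, 4)
mem_pct = pct_of_limit(1.1, 8)
print(round(cpu_pct), sizing_color(cpu_pct, CPU_HIGH))  # CPU within range
print(round(mem_pct), sizing_color(mem_pct, MEM_HIGH))  # memory over-provisioned
```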
Section 8: Application Health (y=47, h=8)
Purpose: Application-level observability. DB, certs, DNS, HTTP errors.
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ DB CONNECTIONS (timeseries + limit line) │ HTTP ERROR RATES BY SERVICE (timeseries) │
│ │ │
│ Active connections vs pool max │ 4xx (yellow) + 5xx (red) per service │
│ ── active (15) │ API, Airflow webserver, MLflow │
│ - - - pool max (50) [red dashed] │ Zero = healthy │
│ Alert > 80% pool │ │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ RESPONSE LATENCY p50/p95/p99 (timeseries) │ CERTIFICATE EXPIRY (stat, days) │
│ │ │
│ By service — green <100ms, red >500ms │ TLS cert days remaining │
│ Identifies slow services │ >30d = green, 7-30d = yellow, <7d = red │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ DNS RESOLUTION TIME (timeseries) │
│ │
│ Latency for external APIs: Binance, Langfuse, Scaleway S3 │
│ Green <50ms, Yellow 50-200ms, Red >200ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘
| Panel | Control | Query | SLA |
|---|---|---|---|
| DB Connections | timeseries + limit | `pg_stat_activity_count` or `sum(kube_pod_container_resource_requests{resource="connections"})` | <80% pool |
| HTTP Error Rates | timeseries | `sum by (service) (rate(http_requests_total{status=~"[45].."}[5m]))` | 0 errors |
| Response Latency | timeseries (p50/p95/p99) | `histogram_quantile(0.99, rate(http_duration_bucket[5m]))` | p99 <100ms |
| Certificate Expiry | stat | `(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400` | >30 days |
| DNS Resolution | timeseries | `dns_lookup_duration_seconds` or `probe_dns_lookup_time_seconds` | <50ms |
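The certificate panel shows days remaining, which is the difference between the expiry timestamp and the current time, divided by 86400 seconds per day. The subtraction has to happen before the division. A minimal sketch with hypothetical function names, using the color bands from the mock-up:

```python
import time

SECONDS_PER_DAY = 86400


def days_remaining(expiry_ts, now=None):
    """Days until a certificate expires; both arguments are Unix timestamps."""
    if now is None:
        now = time.time()
    # Subtract first, then convert seconds to days.
    return (expiry_ts - now) / SECONDS_PER_DAY


def cert_color(days):
    """>30d green, 7-30d yellow, <7d red."""
    if days > 30:
        return "green"
    if days >= 7:
        return "yellow"
    return "red"


# A certificate expiring 10 days from the reference time lands in the yellow band.
remaining = days_remaining(expiry_ts=10 * SECONDS_PER_DAY, now=0)
print(remaining, cert_color(remaining))
```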
Section 9: Airflow Health (y=55, h=8)
Purpose: Is the orchestrator healthy? Queue bottlenecks? Failed tasks?
┌──────────┬──────────┬──────────┬────────────────────────────────────────────────────────────────┐
│ SCHED │ DAG PARSE│ QUEUE │ FAILED TASKS (24h) — table │
│ HEARTBEAT│ TIME │ DEPTH │ │
│ [stat] │ [stat] │ [stat] │ Time │ DAG │ Task │ Error │
│ 12s │ 2.4s │ 0 │ 14:22 │ finetune__pte │ run_crypto │ OOM │
│ <30s grn │ <5s grn │ 0=grn │ 13:05 │ model__train │ train │ timeout │
│ >60s red │ >10s red │ >5 red │ │ │ │ │
└──────────┴──────────┴──────────┴────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ TASK STATES OVER TIME (timeseries, stacked) │ WORKER UTILIZATION (gauge) │
│ │ │
│ ■ running (green) │ Active slots / total slots │
│ ■ queued (yellow) │ Target: 60-90% │
│ ■ failed (red) │ <30% = over-provisioned │
│ ■ success (blue, low opacity) │ >95% = saturated │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
| Panel | Control | Query | SLA |
|---|---|---|---|
| Scheduler Heartbeat | stat | `airflow_scheduler_heartbeat` age | <30s green, >60s red |
| DAG Parse Time | stat | `airflow_dag_processing_total_parse_time` | <5s green, >10s red |
| Queue Depth | stat | `airflow_pool_queued_slots` | 0=green, >5=red |
| Failed Tasks 24h | table | `airflow_ti_failures` or PostgreSQL query on `task_instance` | 0 target |
| Task States | timeseries, stacked | `airflow_pool_running_slots`, `airflow_pool_queued_slots`, `airflow_pool_open_slots` | — |
| Worker Utilization | gauge | `running / (running + open) * 100` | 60-90% target |
Note: Requires enabling Airflow StatsD exporter or Prometheus endpoint (AIRFLOW__METRICS__STATSD_ON=True).
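The worker utilization gauge is a simple ratio of running slots to total slots. The sketch below reproduces that arithmetic and the bands from the mock-up; the function names and the "outside target" label for the unnamed gaps (30-60% and 90-95%) are assumptions.

```python
def worker_utilization_pct(running_slots, open_slots):
    """running / (running + open) * 100, as in the table."""
    total = running_slots + open_slots
    if total == 0:
        return 0.0  # no slots configured or reported
    return 100.0 * running_slots / total


def utilization_band(pct):
    """Target 60-90%; <30% over-provisioned; >95% saturated."""
    if pct > 95:
        return "saturated"
    if pct < 30:
        return "over-provisioned"
    if 60 <= pct <= 90:
        return "on target"
    return "outside target"


# 12 running and 4 open slots -> 75%, inside the 60-90% target band.
pct = worker_utilization_pct(12, 4)
print(pct, utilization_band(pct))
```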
Section 10: Performance Drift Indicators (y=63, h=8)
Purpose: Is the ML pipeline degrading over time? Early warning before results go bad.
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ F1 MACRO TREND (timeseries, 30d) │ SORTINO TREND (timeseries, 30d) │
│ │ │
│ Rolling average f1_macro per run │ Rolling average Sortino per crypto per run │
│ ── current run │ ── UNIUSDC │
│ - - 7d moving average │ ── OPUSDC │
│ ▓▓▓ Alert zone: >10% drop from 7d avg │ ── AAVEUSDC (always low — candidate exclusion) │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ ACTION RATE DRIFT (timeseries) │ CUSUM FILTER RATE (timeseries) │
│ │ │
│ % of candles where model predicts BUY │ % candles filtered by CUSUM pre-inference │
│ Increase = model becoming permissive │ Stable ~95% = normal │
│ Decrease = model becoming conservative │ Drop = CUSUM relaxing → more signals │
│ Alert on >20% change from baseline │ Spike = market regime shift │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ TRAINING SAMPLE TREND (timeseries) │
│ │
│ n_train_samples per run — should be stable (~27K for 12m window) │
│ Drop = data pipeline issue, missing OHLCV data, Binance API problem │
│ Alert on >15% drop from baseline │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
| Panel | Control | Query (PostgreSQL) | Alert |
|---|---|---|---|
| F1 Macro Trend | timeseries | `SELECT run_id, AVG(f1_macro) FROM finetune_results GROUP BY run_id ORDER BY run_id` | >10% drop from 7d avg |
| Sortino Trend | timeseries, per crypto | `SELECT run_id, crypto, AVG(sortino) FROM finetune_results WHERE cost_bps=15 GROUP BY run_id, crypto` | — |
| Action Rate Drift | timeseries | `SELECT run_id, AVG(action_rate) FROM finetune_results GROUP BY run_id` | >20% change |
| CUSUM Filter Rate | timeseries | `SELECT run_id, AVG(cusum_block_rate) FROM finetune_results GROUP BY run_id ORDER BY run_id` | — |
| Training Sample Trend | timeseries | `SELECT run_id, AVG(n_train_samples) FROM finetune_results GROUP BY run_id` | >15% drop |
Data source: PostgreSQL (Grafana datasource P5C4B7CDEC9D3684F already configured).
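The ">10% drop from 7d avg" rule can be sketched as a comparison of the latest run against the mean of the preceding seven days. The sketch assumes each run carries a timestamp alongside its averaged `f1_macro` (the queries above only expose `run_id`), and the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def drift_alert(history, current_time, current_value, window_days=7, max_drop=0.10):
    """Return True when current_value is more than max_drop below the window mean.

    history is a list of (timestamp, value) pairs for earlier runs.
    """
    cutoff = current_time - timedelta(days=window_days)
    window = [v for t, v in history if cutoff <= t < current_time]
    if not window:
        return False  # no baseline yet, so no alert
    baseline = sum(window) / len(window)
    if baseline <= 0:
        return False  # a relative drop is undefined for a non-positive baseline
    drop = (baseline - current_value) / baseline
    return drop > max_drop


now = datetime(2026, 4, 14)
# One illustrative run per day over the previous week, all at f1_macro = 0.5.
history = [(now - timedelta(days=d), 0.5) for d in range(1, 8)]

print(drift_alert(history, now, 0.44))  # 12% below baseline
print(drift_alert(history, now, 0.47))  # 6% below baseline
```

The same function would cover the 15% training-sample rule by passing `max_drop=0.15`; the action-rate rule alerts on change in either direction and would need an absolute value.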
Section 11: Alert Log & Diagnostics (y=71, h=8)
Purpose: What went wrong? When? What to do about it?
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ALERT HISTORY (table, sorted by time desc) │
│ │
│ Time │ Severity │ Alert │ Target │ Value │ Runbook │
│ 2026-04-14 08:22 │ INFO │ Compute node UP │ compute-194 │ 1 │ — │
│ 2026-04-14 08:20 │ WARNING │ Pods pending (autoscale) │ finetune-* │ 5 │ AUTOSCALER_STUCK │
│ 2026-04-13 22:30 │ CRITICAL │ OOMKilled │ finetune-abc │ 1 │ OOM │
│ │
│ Click runbook link → opens documentation/RUNBOOKS/{name}.md │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────┬──────────────────────────────────────────────────────────┐
│ ACTIVE ALERTS (stat) │ ALERTS LAST 24H (stat) │
│ │ │
│ 0 [green background] │ 3 [yellow background] │
└──────────────────────────────────────────┴──────────────────────────────────────────────────────────┘
Alerting Rules (PrometheusRule CRD)
Critical (SMS + Slack) — SLA: <5 min response
| Rule | Condition | For | SLA | Runbook |
|---|---|---|---|---|
| Node Down | `count(kube_node_info) < 1` | 2m | Availability >99.9% | NODE_DOWN |
| OOM Kill | OOMKilled event | 0m | Zero OOM in prod | OOM |
| Pod CrashLoop | restarts >3 in 15m | 5m | Service continuity | CRASHLOOP |
| Disk Full >90% | PVC usage | 5m | Data integrity | DISK_FULL |
| DB Connection Saturated | active > 80% pool | 5m | Query latency <100ms | DB_SATURATED |
| Scheduler Down | heartbeat > 120s | 2m | Orchestration continuity | SCHEDULER_DOWN |
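To show how a row of the critical table might map onto a PrometheusRule object, the sketch below builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The resource name `infra-critical`, the alert name `NodeDown`, and the `runbook_url` annotation key are assumptions; the expression, duration, and runbook come from the Node Down row.

```python
import json


def prometheus_rule(name, alert, expr, for_, severity, runbook):
    """Build a single-rule PrometheusRule manifest (hypothetical helper)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name},
        "spec": {
            "groups": [
                {
                    "name": f"{name}.rules",
                    "rules": [
                        {
                            "alert": alert,
                            "expr": expr,
                            "for": for_,
                            # Severity label lets Alertmanager route the alert.
                            "labels": {"severity": severity},
                            "annotations": {
                                "runbook_url": f"documentation/RUNBOOKS/{runbook}.md"
                            },
                        }
                    ],
                }
            ]
        },
    }


node_down = prometheus_rule(
    name="infra-critical",
    alert="NodeDown",
    expr="count(kube_node_info) < 1",  # copied from the table
    for_="2m",
    severity="critical",
    runbook="NODE_DOWN",
)
print(json.dumps(node_down, indent=2))
```

One caveat worth verifying when implementing: in PromQL, `count()` over zero matching series returns no sample at all, so a comparison such as `< 1` may never fire; an `absent()`-based expression is a common alternative.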
Warning (Slack only) — SLA: <30 min response
| Rule | Condition | For | SLA | Runbook |
|---|---|---|---|---|
| CPU Saturated >90% | node CPU | 10m | CPU headroom >10% | CPU_SATURATED |
| Memory Pressure (swap >0) | swap used >0 | 5m | Zero swap | MEMORY_PRESSURE |
| FTF Pod Stuck Pending | waiting >10m | 10m | Autoscale <5 min | AUTOSCALER_STUCK |
| Compute Node Idle >30m | 0 FTF pods | 30m | Cost efficiency | COMPUTE_IDLE |
| CPU Throttling >50% | cfs throttled ratio | 10m | <30% throttling | CPU_THROTTLING |
| Scheduler Heartbeat >60s | heartbeat age | 2m | Scheduler health | SCHEDULER_DOWN |
| F1 Drift >10% | f1_macro drop | 1h | Model stability | MODEL_DRIFT |
| Certificate Expiry <30d | cert days remaining | 24h | Zero expired certs | CERT_EXPIRY |
| HTTP 5xx Spike | 5xx rate > 1/min | 5m | Zero 5xx | HTTP_ERRORS |
Info (annotation only)
| Rule | Condition | Runbook |
|---|---|---|
| Compute Scale-Up | node appears | — |
| Compute Scale-Down | node disappears | — |
| FTF Run Start | FTF pods appear | — |
| FTF Run Complete | FTF pods disappear | — |
Escalation Chain
INFO → Grafana annotation (visible on timeline)
WARNING → Slack #cvntrade-alerts (immediate)
CRITICAL → Slack #cvntrade-alerts + SMS to on-call (immediate)
No response in 15 min → escalate to next on-call
No response in 30 min → PagerDuty incident
Integration: Alertmanager → Slack webhook + Twilio SMS gateway.
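The chain above amounts to a severity-to-channel lookup plus a time-based escalation step. A minimal sketch of that logic follows; the channel identifiers and function names are hypothetical, and it assumes the 15 and 30 minute steps apply to unacknowledged critical alerts. In practice this routing would live in the Alertmanager configuration rather than in application code.

```python
# Severity -> notification channels, following the escalation chain.
ROUTES = {
    "INFO": ["grafana-annotation"],
    "WARNING": ["slack:#cvntrade-alerts"],
    "CRITICAL": ["slack:#cvntrade-alerts", "sms:on-call"],
}


def channels_for(severity):
    """Return the channels an alert of this severity is sent to."""
    return ROUTES[severity.upper()]


def escalation_step(minutes_unacknowledged):
    """Who should hold the alert after this many minutes without a response."""
    if minutes_unacknowledged >= 30:
        return "pagerduty-incident"
    if minutes_unacknowledged >= 15:
        return "next-on-call"
    return "primary-on-call"


print(channels_for("critical"))
print(escalation_step(20))
```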
Runbooks (documentation/RUNBOOKS/)
Each alert links to a runbook:
| Runbook | Sections |
|---|---|
| NODE_DOWN.md | Detection, check `kubectl get nodes`, check Scaleway console, manual node restart |
| OOM.md | Detection, identify pod, check memory limit, increase limit or optimize code |
| DISK_FULL.md | Detection, identify PVC, cleanup old data, extend volume |
| AUTOSCALER_STUCK.md | Detection, check Scaleway API, check pod resource requests, manual scale |
| CPU_SATURATED.md | Detection, identify top consumers, reduce parallelism or upgrade node |
| SCHEDULER_DOWN.md | Detection, check scheduler logs, restart scheduler pod |
| MODEL_DRIFT.md | Detection, compare recent F1/Sortino, check data pipeline, retrain |
| CERT_EXPIRY.md | Detection, renew cert, update secret |
| DB_SATURATED.md | Detection, check slow queries, increase pool, optimize queries |
| HTTP_ERRORS.md | Detection, check service logs, restart pod, check upstream |
Runbook template:
# RUNBOOK: {ALERT_NAME}
## Detection
What triggered: {description}
Where to look: {Grafana panel, log command}
## Diagnosis (first 3 checks)
1. {check 1}
2. {check 2}
3. {check 3}
## Resolution
{step-by-step fix}
## Escalation
If unresolved in 15 min: contact {name/role}
Implementation Plan
| Phase | Scope | Effort |
|---|---|---|
| 1 | Sections 1-4 (status bar, timeline, efficiency, FTF) | ~3h |
| 2 | Sections 5-7 (autoscaler, system health, sizing) | ~2h |
| 3 | Section 8 (application health — DB, HTTP, certs, DNS) | ~3h |
| 4 | Section 9 (Airflow health — requires StatsD/Prometheus endpoint) | ~2h |
| 5 | Section 10 (performance drift — PostgreSQL queries) | ~2h |
| 6 | Section 11 + PrometheusRules + Alertmanager config | ~3h |
| 7 | Runbooks (10 docs) | ~3h |
| 8 | Slack webhook + Twilio SMS integration + end-to-end test | ~2h |
| Total | | ~20h |
Prerequisites
| Dependency | Status | Action |
|---|---|---|
| Prometheus + kube-state-metrics | Deployed | — |
| PostgreSQL datasource in Grafana | Configured | — |
| Airflow StatsD/Prometheus metrics | Not enabled | Set AIRFLOW__METRICS__STATSD_ON=True |
| cert-manager | Not deployed | Deploy for cert expiry tracking, or manual check |
| Alertmanager | Deployed | Configure receivers (Slack, SMS) |
| Slack webhook | Not configured | Create webhook for #cvntrade-alerts |
| Twilio SMS | Not configured | Create account + configure in Alertmanager |
Success Criteria
- 5-second glance: status bar tells if platform is healthy
- 30-second scan: 12h timeline shows anomalies with threshold zones
- Bottleneck ID: top 5 CPU + balance pies show distribution
- FTF parallelism: per-pod CPU/throttling shows balance and stragglers
- Autoscaler: node lifecycle + wake-up time visible
- Swap: any usage = visible alert
- Process sizing: over/under-provisioned pods color-coded
- Application health: DB connections, HTTP errors, cert expiry, DNS latency
- Airflow health: scheduler heartbeat, queue depth, failed tasks
- Performance drift: F1/Sortino/action rate trends with anomaly detection
- Alert log: all events with severity + runbook link
- Every alert linked to SLA target
- Escalation: critical→SMS+Slack, warning→Slack, info→annotation
- 10 runbooks with diagnosis + resolution steps
- Dashboard loads <3s with 12h range