
Infrastructure Dashboard — Production-Grade Design V2

Issue: #529
Date: 2026-04-14
Status: V2, amended after committee review (REJECTED V1: ML_USELESS)


Design Principles

  1. Glance (5 sec): Is everything OK right now? Green/red status bar.
  2. Scan (30 sec): Was everything OK in the last 12 hours? Timeseries with anomaly zones.
  3. Investigate (2 min): Are resources efficient? Where are the bottlenecks? Drill-down panels.
  4. Act: Alert log with diagnostic context. Threshold-based escalation (Slack, SMS).

Dashboard Layout — 11 Sections, ~35 Panels


Section 1: Health Status Bar (y=0, h=3)

Purpose: One row of stats. Green = OK, yellow = warning, red = action needed. Read left-to-right in 5 seconds.

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ CLUSTER  │ PERM     │ COMPUTE  │ FTF PODS │ ERROR    │ LATENCY  │ AVAIL    │ COST     │
│ STATUS   │ NODE     │ NODE     │ ACTIVE   │ RATE     │ p99      │ 12h %    │ EUR/h    │
│ [icon]   │ 1 [stat] │ 1 [stat] │ 5 [stat] │ 0% [st]  │ 45ms[st] │ 99.9%[st]│ 0.16[st] │
│ green bg │ green bg │ orange bg│ green bg │ green bg │ green bg │ green bg │ orange bg│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
| Panel | Control | Query | Thresholds | SLA |
|---|---|---|---|---|
| Cluster Status | stat + icon | min(up{job=~".*kube.*"}) | 1=green, 0=red | Availability >99.9% |
| Permanent Node | stat | count(kube_node_info{node=~".*permanent.*"}) or vector(0) | 0=red, 1=green | Always 1 |
| Compute Node | stat | count(kube_node_info{node=~".*compute.*"}) or vector(0) | 0=blue, 1+=orange | 0 when idle |
| FTF Pods | stat | sum(kube_pod_container_status_running{pod=~"finetune.*"}) or vector(0) | 0=blue, 5=green | 5 during FTF |
| Error Rate | stat | sum(rate(kube_pod_container_status_terminated_reason{reason!="Completed"}[1h])) | 0=green, >0=red | 0 errors |
| Latency p99 | stat | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | <100ms=green, <500ms=yellow, >500ms=red | <100ms |
| Availability 12h | stat | avg_over_time(up{job=~".*kube.*"}[12h]) * 100 | >99.9=green, >99=yellow, <99=red | >99.9% |
| Cost EUR/h | stat | (count(kube_node_info{node=~".*compute.*"}) or vector(0)) * 0.1644 | 0=green, >0=orange | Minimize |
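
The threshold columns above amount to small value-to-color mappings. The following is a minimal sketch of three of them in Python, not the Grafana configuration itself. The function names are hypothetical, and the handling of exact boundary values (for example a p99 of exactly 500 ms) is an assumption, since the table does not specify it.

```python
def latency_color(p99_ms: float) -> str:
    """Latency p99 thresholds: <100 ms green, <500 ms yellow, otherwise red."""
    if p99_ms < 100:
        return "green"
    if p99_ms < 500:
        return "yellow"
    return "red"  # assumption: exactly 500 ms is treated as red


def availability_color(pct: float) -> str:
    """Availability thresholds: >99.9 green, >99 yellow, otherwise red."""
    if pct > 99.9:
        return "green"
    if pct > 99:
        return "yellow"
    return "red"


def compute_cost_eur_per_hour(compute_nodes: int, rate_eur: float = 0.1644) -> float:
    """Cost panel: compute node count times the hourly rate used in the query."""
    return compute_nodes * rate_eur


if __name__ == "__main__":
    print(latency_color(45), availability_color(99.95), compute_cost_eur_per_hour(1))
```

With one compute node running, the cost function returns the 0.1644 EUR/h rate from the query, which the mockup displays rounded to 0.16.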

Section 2: 12-Hour Timeline (y=3, h=8)

Purpose: Was everything OK? Spot anomalies visually.

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ CPU UTILIZATION — ALL NODES (timeseries)         │ MEMORY UTILIZATION — ALL NODES (timeseries)     │
│                                                  │                                                  │
│ ████████████████████████████░░░░░░ 75%           │ ██████████░░░░░░░░░░░░░░░░░░░░ 35%              │
│ ─────────────────────── capacity line (red dash) │ ─────────────────────── capacity line (red dash) │
│ Permanent (blue fill) + Compute (orange fill)    │ Same color coding                                │
│ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90%          │ ▓▓▓ Yellow zone >70% ▓▓▓ Red zone >90%          │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ REQUEST THROUGHPUT (timeseries, stacked)          │ ERROR TIMELINE (timeseries, red bars)            │
│                                                  │                                                  │
│ Requests/sec by service (API, Airflow, MLflow)   │ OOM kills, restarts, failed pods                 │
│ Stacked area — shows load distribution           │ Each bar = 1 event. Zero baseline = healthy      │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| CPU All Nodes | timeseries, stacked area | sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) + capacity overlay | >70% yellow, >90% red |
| Memory All Nodes | timeseries, stacked area | (MemTotal - MemAvailable) / MemTotal * 100 | >70% yellow, >90% red |
| Request Throughput | timeseries, stacked | sum by (pod) (rate(container_network_receive_bytes_total[5m])) | |
| Error Timeline | timeseries, bars | increase(restarts[5m]) + OOM + terminated | >0 = red bar |
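
The memory formula and the shared yellow/red zones can be checked by hand with a short sketch. The Python below is illustrative only: the function names are hypothetical, and the label "normal" for values at or below 70% is an assumption, because the design only names the yellow and red zones.

```python
def memory_used_pct(mem_total: float, mem_available: float) -> float:
    """(MemTotal - MemAvailable) / MemTotal * 100, as in the Memory panel."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    return (mem_total - mem_available) / mem_total * 100


def utilization_zone(pct: float) -> str:
    """Threshold zones shared by the CPU and memory timeline panels."""
    if pct > 90:
        return "red"
    if pct > 70:
        return "yellow"
    return "normal"  # assumption: no zone is drawn at or below 70%


if __name__ == "__main__":
    pct = memory_used_pct(mem_total=100.0, mem_available=65.0)
    print(round(pct), utilization_zone(pct))
```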

Section 3: Resource Efficiency (y=11, h=8)

Purpose: Are resources well-used? Identify waste vs saturation.

┌────────────────┬────────────────┬────────────────┬────────────────────────────────────────────────┐
│ CPU            │ MEMORY         │ DISK           │ TOP 5 CPU CONSUMERS (bar chart, horizontal)    │
│ EFFICIENCY     │ EFFICIENCY     │ USAGE          │                                                │
│ [gauge]        │ [gauge]        │ [gauge]        │ pod-1 ████████████████ 3.2 cores                │
│ 82% [green]    │ 45% [red]      │ 12% [green]    │ pod-2 ███████████████ 3.0 cores                │
│ target >80%    │ target >60%    │ alert >85%     │ scheduler █████ 0.5 cores                       │
└────────────────┴────────────────┴────────────────┴────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ CPU BALANCE (pie chart)                        │ MEMORY BALANCE (pie chart)                          │
│                                                │                                                     │
│ ● FTF pods (orange) — 78%                      │ ● FTF pods (orange) — 45%                          │
│ ● Airflow (blue) — 12%                         │ ● Airflow (blue) — 25%                             │
│ ● MLflow (green) — 3%                          │ ● MLflow (green) — 15%                             │
│ ● Other (grey) — 7%                            │ ● Other (grey) — 15%                               │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| CPU Efficiency | gauge | sum(cpu_usage) / sum(cpu_requests) * 100 | <50% red (waste), >80% green |
| Memory Efficiency | gauge | sum(mem_working_set) / sum(mem_requests) * 100 | <50% red, >60% green |
| Disk Usage | gauge | pvc_used / pvc_capacity * 100 | >85% red |
| Top 5 CPU | bar chart, horizontal | topk(5, sum by (pod) (rate(cpu_usage[5m]))) | |
| CPU Balance | pie chart | sum by (category) (cpu_usage), category derived by regex on pod name | |
| Memory Balance | pie chart | sum by (category) (mem_working_set) | |
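
The balance pies depend on mapping pod names to categories by regex, and the gauges on a usage-to-request ratio. Here is a minimal sketch of both in Python. The regex patterns are assumptions inferred from the pod names in this document (finetune, airflow, mlflow), and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical patterns; the real pod naming scheme may differ.
CATEGORY_PATTERNS = [
    ("FTF", re.compile(r"^finetune")),
    ("Airflow", re.compile(r"^airflow")),
    ("MLflow", re.compile(r"^mlflow")),
]


def categorize(pod: str) -> str:
    """Return the first category whose pattern matches the pod name."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(pod):
            return name
    return "Other"


def balance(usage_by_pod: dict) -> dict:
    """Aggregate per-pod usage into per-category percentage shares."""
    totals = defaultdict(float)
    for pod, usage in usage_by_pod.items():
        totals[categorize(pod)] += usage
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: round(100 * value / grand_total, 1) for cat, value in totals.items()}


def efficiency(used: float, requested: float) -> float:
    """Usage as a percentage of requests, as in the efficiency gauges."""
    return 100 * used / requested if requested else 0.0


if __name__ == "__main__":
    cpu = {"finetune-crypto-1": 3.2, "finetune-crypto-2": 2.9,
           "airflow-sched": 0.4, "mlflow": 0.01, "coredns": 0.1}
    print(balance(cpu))
```

The same `balance` function serves both pies: pass CPU cores for the CPU pie and working-set bytes for the memory pie.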

Section 4: FTF Parallel Execution (y=19, h=10)

Purpose: Is parallelism working? Are pods balanced? Any stragglers?

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF CPU PER POD (timeseries)                     │ CPU THROTTLING (timeseries, threshold zones)     │
│                                                  │                                                  │
│ ── pod-1 (3.2 cores)                             │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 42% ← yellow zone          │
│ ── pod-2 (2.9 cores)                             │ Zones: green <30%, yellow 30-50%, red >50%       │
│ ── pod-3 (3.1 cores)                             │                                                  │
│ - - - limit (4 cores) [red dashed]               │ High throttling = CPU limit too low              │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ FTF MEMORY PER POD (timeseries)                  │ FTF DISK I/O (timeseries)                        │
│                                                  │                                                  │
│ ── pod-1 (1.1 Gi)                                │ Write KB/s — spikes at model checkpoint          │
│ ── pod-2 (1.0 Gi)                                │ Read KB/s — spikes at data load                  │
│ - - - limit (8 Gi) [red dashed]                  │                                                  │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
| Panel | Control | Query | Thresholds |
|---|---|---|---|
| FTF CPU per Pod | timeseries + red dashed limit | sum by (pod) (rate(cpu{pod=~"finetune.*"}[5m])) | limit=4 cores |
| CPU Throttling | timeseries + threshold zones | cfs_throttled / cfs_total * 100 | <30% green, 30-50% yellow, >50% red |
| FTF Memory per Pod | timeseries + red dashed limit | mem_working_set{pod=~"finetune.*"} / 1024^3 | limit=8Gi |
| FTF Disk I/O | timeseries | rate(fs_writes_bytes_total[5m]) / 1024 | |
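
The throttling panel is a ratio of throttled CFS periods to total CFS periods, mapped onto three zones. The sketch below shows that arithmetic in Python under the assumption that `cfs_throttled` and `cfs_total` are period counts over the same window (in cAdvisor these would typically come from `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`). The function names are hypothetical.

```python
def throttling_pct(throttled_periods: float, total_periods: float) -> float:
    """Share of CFS periods in which the container was throttled, in percent."""
    if total_periods <= 0:
        return 0.0  # no scheduling periods observed, so nothing was throttled
    return 100.0 * throttled_periods / total_periods


def throttling_zone(pct: float) -> str:
    """Zones from the panel: green <30%, yellow 30-50%, red >50%."""
    if pct < 30:
        return "green"
    if pct <= 50:
        return "yellow"
    return "red"


if __name__ == "__main__":
    pct = throttling_pct(throttled_periods=42, total_periods=100)
    print(pct, throttling_zone(pct))
```

A sustained value in the red zone suggests the 4-core limit is too low for the FTF pods, which is the case the CPU Throttling warning rule is meant to catch.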

Section 5: Autoscaler & Latency (y=29, h=6)

Purpose: How does the autoscaler behave? What is the cold-start impact?

┌─────────────────────────────────────────────────┬──────────┬──────────┬──────────────────────────┐
│ NODE LIFECYCLE TIMELINE (state-timeline)          │ UPTIME   │ WAKE-UP  │ SCALE EVENTS             │
│                                                  │          │ TIME     │                          │
│ permanent ██████████████████████████████████████  │ 2.4h     │ 2m47s    │ 08:22 compute UP         │
│ compute   ░░░░░░░░░░░░████████████████████████   │ [stat]   │ [stat]   │ 08:22 5 pods scheduled   │
│                                                  │ orange   │ green    │ [table, last 20]         │
│ Green=Ready, Grey=absent                         │          │ <5m=grn  │                          │
└─────────────────────────────────────────────────┴──────────┴──────────┴──────────────────────────┘

Section 6: Swap & System Health (y=35, h=6)

Purpose: System-level health. Swap = memory pressure = early warning.

┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ SWAP USAGE PER NODE (timeseries)               │ OPEN FILE DESCRIPTORS (timeseries)                  │
│                                                │                                                     │
│ permanent: 0 MB ✓                              │ Per node — high FD = connection leak                │
│ compute: 0 MB ✓                                │ Alert > 80% of limit                               │
│ ANY swap > 0 = memory pressure → alert         │                                                     │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ NETWORK THROUGHPUT (timeseries, stacked)        │ PVC USAGE (gauge per volume)                        │
│                                                │                                                     │
│ rx/tx per pod, KB/s                            │ Green <70%, Yellow 70-85%, Red >85%                │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

Section 7: Process Sizing (y=41, h=6)

Purpose: Are pods correctly sized? Under-provisioned = throttled. Over-provisioned = waste.

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ POD SIZING TABLE                                                                                  │
│                                                                                                    │
│ Pod              │ CPU req │ CPU lim │ CPU used │ CPU %  │ MEM req │ MEM lim │ MEM used │ MEM %   │
│ finetune-crypto-1│ 2       │ 4       │ 3.2      │ 80% ██ │ 4Gi     │ 8Gi     │ 1.1Gi    │ 14% ░░  │
│ finetune-crypto-2│ 2       │ 4       │ 2.9      │ 73% ██ │ 4Gi     │ 8Gi     │ 1.0Gi    │ 13% ░░  │
│ airflow-sched    │ 0.5     │ 1       │ 0.4      │ 40% ░░ │ 1Gi     │ 2Gi     │ 0.6Gi    │ 30% ░░  │
│ mlflow           │ 0.25    │ 0.5     │ 0.01     │  2% ░░ │ 512Mi   │ 1Gi     │ 726Mi    │ 71% ██  │
│                                                                                                    │
│ Color: CPU% >90% = red (throttled), <20% = blue (over-provisioned)                                │
│        MEM% >80% = red (OOM risk), <20% = blue (over-provisioned)                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
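
The percentage columns in the sizing table can be reproduced with a few lines of arithmetic. This Python sketch assumes the percentages are taken relative to the limit (which matches the MEM % column), parses only the Ki/Mi/Gi suffixes, and rounds half up. All function names are hypothetical.

```python
import math

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_bytes(quantity: str) -> float:
    """Parse a subset of Kubernetes memory quantities (Ki, Mi, Gi) into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # plain byte count


def pct_of_limit(used: float, limit: float) -> int:
    """Usage as a whole-number percentage of the limit, rounded half up."""
    return math.floor(100 * used / limit + 0.5)


def cpu_flag(pct: float) -> str:
    """CPU% >90 = red (throttled), <20 = blue (over-provisioned)."""
    if pct > 90:
        return "red"
    if pct < 20:
        return "blue"
    return "ok"


def mem_flag(pct: float) -> str:
    """MEM% >80 = red (OOM risk), <20 = blue (over-provisioned)."""
    if pct > 80:
        return "red"
    if pct < 20:
        return "blue"
    return "ok"


if __name__ == "__main__":
    mem = pct_of_limit(parse_bytes("726Mi"), parse_bytes("1Gi"))
    print(mem, mem_flag(mem))
```

For the mlflow row, 726Mi used against a 1Gi limit comes out at 71%, which stays below the 80% OOM-risk threshold.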

Section 8: Application Health (y=47, h=8)

Purpose: Application-level observability. DB, certs, DNS, HTTP errors.

┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ DB CONNECTIONS (timeseries + limit line)        │ HTTP ERROR RATES BY SERVICE (timeseries)            │
│                                                │                                                     │
│ Active connections vs pool max                 │ 4xx (yellow) + 5xx (red) per service               │
│ ── active (15)                                 │ API, Airflow webserver, MLflow                      │
│ - - - pool max (50) [red dashed]               │ Zero = healthy                                     │
│ Alert > 80% pool                               │                                                     │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ RESPONSE LATENCY p50/p95/p99 (timeseries)      │ CERTIFICATE EXPIRY (stat, days)                     │
│                                                │                                                     │
│ By service — green <100ms, red >500ms          │ TLS cert days remaining                             │
│ Identifies slow services                       │ >30d = green, 7-30d = yellow, <7d = red            │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ DNS RESOLUTION TIME (timeseries)                                                                    │
│                                                                                                      │
│ Latency for external APIs: Binance, Langfuse, Scaleway S3                                           │
│ Green <50ms, Yellow 50-200ms, Red >200ms                                                            │
└────────────────────────────────────────────────────────────────────────────────────────────────────┘
| Panel | Control | Query | SLA |
|---|---|---|---|
| DB Connections | timeseries + limit | pg_stat_activity_count or sum(kube_pod_container_resource_requests{resource="connections"}) | <80% pool |
| HTTP Error Rates | timeseries | sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) | 0 errors |
| Response Latency | timeseries (p50/p95/p99) | histogram_quantile(0.99, rate(http_duration_bucket[5m])) | p99 <100ms |
| Certificate Expiry | stat | (cert_manager_certificate_expiration_timestamp_seconds - time()) / 86400 | >30 days |
| DNS Resolution | timeseries | dns_lookup_duration_seconds or probe_dns_lookup_time_seconds | <50ms |
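
For the certificate panel, the days-remaining value is the difference between the expiry timestamp and the current time, divided by 86400. The subtraction has to happen before the division, otherwise only the current time is scaled. A minimal Python sketch follows, with hypothetical function names and the color bands taken from the mockup.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_remaining(expiry_ts: float, now: Optional[float] = None) -> float:
    """Days until expiry: (expiry - now) / 86400, with the difference taken first."""
    if now is None:
        now = time.time()
    return (expiry_ts - now) / SECONDS_PER_DAY


def cert_color(days: float) -> str:
    """>30d = green, 7-30d = yellow, <7d = red."""
    if days > 30:
        return "green"
    if days >= 7:
        return "yellow"
    return "red"


if __name__ == "__main__":
    now = 1_700_000_000  # arbitrary fixed timestamp for a reproducible example
    expiry = now + 45 * SECONDS_PER_DAY
    print(days_remaining(expiry, now=now), cert_color(days_remaining(expiry, now=now)))
```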

Section 9: Airflow Health (y=55, h=8)

Purpose: Is the orchestrator healthy? Queue bottlenecks? Failed tasks?

┌──────────┬──────────┬──────────┬────────────────────────────────────────────────────────────────┐
│ SCHED    │ DAG PARSE│ QUEUE    │ FAILED TASKS (24h) — table                                    │
│ HEARTBEAT│ TIME     │ DEPTH    │                                                                │
│ [stat]   │ [stat]   │ [stat]   │ Time        │ DAG              │ Task       │ Error             │
│ 12s      │ 2.4s     │ 0        │ 14:22       │ finetune__pte    │ run_crypto │ OOM               │
│ <30s grn │ <5s grn  │ 0=grn    │ 13:05       │ model__train     │ train      │ timeout            │
│ >60s red │ >10s red │ >5 red   │             │                  │            │                    │
└──────────┴──────────┴──────────┴────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ TASK STATES OVER TIME (timeseries, stacked)     │ WORKER UTILIZATION (gauge)                          │
│                                                │                                                     │
│ ■ running (green)                              │ Active slots / total slots                          │
│ ■ queued (yellow)                              │ Target: 60-90%                                      │
│ ■ failed (red)                                 │ <30% = over-provisioned                             │
│ ■ success (blue, low opacity)                  │ >95% = saturated                                    │
└────────────────────────────────────────────────┴────────────────────────────────────────────────────┘
| Panel | Control | Query | SLA |
|---|---|---|---|
| Scheduler Heartbeat | stat | airflow_scheduler_heartbeat age | <30s green, >60s red |
| DAG Parse Time | stat | airflow_dag_processing_total_parse_time | <5s green, >10s red |
| Queue Depth | stat | airflow_pool_queued_slots | 0=green, >5=red |
| Failed Tasks 24h | table | airflow_ti_failures or PostgreSQL query on task_instance | 0 target |
| Task States | timeseries, stacked | airflow_pool_running/queued/open_slots | |
| Worker Utilization | gauge | running / (running + open) * 100 | 60-90% target |

Note: Requires enabling Airflow StatsD exporter or Prometheus endpoint (AIRFLOW__METRICS__STATSD_ON=True).
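
The Worker Utilization gauge is running slots over total slots, with a target band and two warning bands. Below is a small Python sketch of that calculation. The function names and the "acceptable" label for values outside the named bands are assumptions; the design names only the target, over-provisioned, and saturated ranges.

```python
def worker_utilization(running: int, open_slots: int) -> float:
    """running / (running + open) * 100, as in the gauge query."""
    total = running + open_slots
    return 100.0 * running / total if total else 0.0


def utilization_state(pct: float) -> str:
    """Map utilization to the bands named on the gauge."""
    if pct < 30:
        return "over-provisioned"
    if pct > 95:
        return "saturated"
    if 60 <= pct <= 90:
        return "on target"
    return "acceptable"  # assumption: the design does not name these ranges


if __name__ == "__main__":
    pct = worker_utilization(running=6, open_slots=2)
    print(pct, utilization_state(pct))
```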


Section 10: Performance Drift Indicators (y=63, h=8)

Purpose: Is the ML pipeline degrading over time? Early warning before results go bad.

┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ F1 MACRO TREND (timeseries, 30d)                 │ SORTINO TREND (timeseries, 30d)                  │
│                                                  │                                                  │
│ Rolling average f1_macro per run                 │ Rolling average Sortino per crypto per run        │
│ ── current run                                   │ ── UNIUSDC                                       │
│ - - 7d moving average                            │ ── OPUSDC                                        │
│ ▓▓▓ Alert zone: >10% drop from 7d avg           │ ── AAVEUSDC (always low — candidate exclusion)    │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ ACTION RATE DRIFT (timeseries)                   │ CUSUM FILTER RATE (timeseries)                    │
│                                                  │                                                  │
│ % of candles where model predicts BUY            │ % candles filtered by CUSUM pre-inference         │
│ Increase = model becoming permissive             │ Stable ~95% = normal                             │
│ Decrease = model becoming conservative           │ Drop = CUSUM relaxing → more signals             │
│ Alert on >20% change from baseline               │ Spike = market regime shift                      │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ TRAINING SAMPLE TREND (timeseries)                                                                │
│                                                                                                    │
│ n_train_samples per run — should be stable (~27K for 12m window)                                  │
│ Drop = data pipeline issue, missing OHLCV data, Binance API problem                               │
│ Alert on >15% drop from baseline                                                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
| Panel | Control | Query (PostgreSQL) | Alert |
|---|---|---|---|
| F1 Macro Trend | timeseries | SELECT run_id, AVG(f1_macro) FROM finetune_results GROUP BY run_id ORDER BY run_id | >10% drop from 7d avg |
| Sortino Trend | timeseries, per crypto | SELECT run_id, crypto, AVG(sortino) FROM finetune_results WHERE cost_bps=15 GROUP BY run_id, crypto | |
| Action Rate Drift | timeseries | SELECT run_id, AVG(action_rate) FROM finetune_results GROUP BY run_id | >20% change |
| CUSUM Filter Rate | timeseries | SELECT run_id, AVG(cusum_block_rate) FROM finetune_results GROUP BY run_id ORDER BY run_id | |
| Training Sample Trend | timeseries | SELECT run_id, AVG(n_train_samples) FROM finetune_results GROUP BY run_id | >15% drop |

Data source: PostgreSQL (Grafana datasource P5C4B7CDEC9D3684F already configured).
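
The drop-based alerts compare the latest run against a baseline average. The sketch below shows one way to express that check in Python, assuming the baseline is the plain mean of the runs in the preceding window (the design says "7d avg" but does not define how it is computed). The function names and sample values are hypothetical.

```python
from statistics import mean


def drop_from_baseline(history: list, current: float) -> float:
    """Relative drop of `current` versus the mean of `history`, in percent.

    Positive values mean the metric fell below its baseline.
    """
    baseline = mean(history)
    if baseline == 0:
        return 0.0
    return 100.0 * (baseline - current) / baseline


def is_drift(history: list, current: float, threshold_pct: float = 10.0) -> bool:
    """True when the drop from baseline exceeds the threshold (10% for F1)."""
    return drop_from_baseline(history, current) > threshold_pct


if __name__ == "__main__":
    f1_last_7_days = [0.60, 0.61, 0.59, 0.60, 0.62, 0.60, 0.58]  # made-up values
    print(is_drift(f1_last_7_days, current=0.50))
```

The same function covers the training-sample alert with `threshold_pct=15.0`. The action-rate alert is two-sided (">20% change"), so it would compare the absolute value of the relative change instead.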


Section 11: Alert Log & Diagnostics (y=71, h=8)

Purpose: What went wrong? When? What to do about it?

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ALERT HISTORY (table, sorted by time desc)                                                        │
│                                                                                                    │
│ Time              │ Severity │ Alert                    │ Target       │ Value  │ Runbook          │
│ 2026-04-14 08:22  │ INFO     │ Compute node UP          │ compute-194  │ 1      │ —                │
│ 2026-04-14 08:20  │ WARNING  │ Pods pending (autoscale) │ finetune-*   │ 5      │ AUTOSCALER_STUCK │
│ 2026-04-13 22:30  │ CRITICAL │ OOMKilled                │ finetune-abc │ 1      │ OOM              │
│                                                                                                    │
│ Click runbook link → opens documentation/RUNBOOKS/{name}.md                                       │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────┬──────────────────────────────────────────────────────────┐
│ ACTIVE ALERTS (stat)                     │ ALERTS LAST 24H (stat)                                    │
│                                          │                                                           │
│ 0 [green background]                     │ 3 [yellow background]                                     │
└──────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

Alerting Rules (PrometheusRule CRD)

Critical (SMS + Slack) — SLA: <5 min response

| Rule | Condition | For | SLA | Runbook |
|---|---|---|---|---|
| Node Down | (count(kube_node_info) or vector(0)) < 1 | 2m | Availability >99.9% | NODE_DOWN |
| OOM Kill | OOMKilled event | 0m | Zero OOM in prod | OOM |
| Pod CrashLoop | restarts >3 in 15m | 5m | Service continuity | CRASHLOOP |
| Disk Full | >90% PVC usage | 5m | Data integrity | DISK_FULL |
| DB Connection Saturated | active > 80% pool | 5m | Query latency <100ms | DB_SATURATED |
| Scheduler Down | heartbeat > 120s | 2m | Orchestration continuity | SCHEDULER_DOWN |

Warning (Slack only) — SLA: <30 min response

| Rule | Condition | For | SLA | Runbook |
|---|---|---|---|---|
| CPU Saturated | >90% node CPU | 10m | CPU headroom >10% | CPU_SATURATED |
| Memory Pressure (swap >0) | swap used >0 | 5m | Zero swap | MEMORY_PRESSURE |
| FTF Pod Stuck Pending | waiting >10m | 10m | Autoscale <5 min | AUTOSCALER_STUCK |
| Compute Node Idle >30m | 0 FTF pods | 30m | Cost efficiency | COMPUTE_IDLE |
| CPU Throttling | >50% cfs throttled ratio | 10m | <30% throttling | CPU_THROTTLING |
| Scheduler Heartbeat | >60s heartbeat age | 2m | Scheduler health | SCHEDULER_DOWN |
| F1 Drift | >10% f1_macro drop | 1h | Model stability | MODEL_DRIFT |
| Certificate Expiry | <30d remaining | 24h | Zero expired certs | CERT_EXPIRY |
| HTTP 5xx Spike | 5xx rate > 1/min | 5m | Zero 5xx | HTTP_ERRORS |

Info (annotation only)

| Rule | Condition | Runbook |
|---|---|---|
| Compute Scale-Up | node appears | |
| Compute Scale-Down | node disappears | |
| FTF Run Start | FTF pods appear | |
| FTF Run Complete | FTF pods disappear | |

Escalation Chain

INFO     → Grafana annotation (visible on timeline)
WARNING  → Slack #cvntrade-alerts (immediate)
CRITICAL → Slack #cvntrade-alerts + SMS to on-call (immediate)

No response in 15 min → escalate to next on-call
No response in 30 min → PagerDuty incident

Integration: Alertmanager → Slack webhook + Twilio SMS gateway.
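
The escalation chain can be summarized as a severity-to-channel map plus a time-based step. The Python sketch below is only an illustration of that routing logic, not Alertmanager configuration. The channel identifiers are hypothetical, and it assumes the 15 and 30 minute timers count from the first notification.

```python
# Hypothetical channel identifiers standing in for the real receivers.
ROUTES = {
    "INFO": ["grafana-annotation"],
    "WARNING": ["slack"],
    "CRITICAL": ["slack", "sms"],
}


def channels_for(severity: str) -> list:
    """Notification channels for a severity level, per the escalation chain."""
    return ROUTES[severity.upper()]


def escalation_step(minutes_unacknowledged: float) -> str:
    """Who is responsible after a given time without a response."""
    if minutes_unacknowledged >= 30:
        return "pagerduty-incident"
    if minutes_unacknowledged >= 15:
        return "next-on-call"
    return "current-on-call"


if __name__ == "__main__":
    print(channels_for("CRITICAL"), escalation_step(20))
```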


Runbooks (documentation/RUNBOOKS/)

Each alert links to a runbook:

| Runbook | Sections |
|---|---|
| NODE_DOWN.md | Detection, check kubectl get nodes, check Scaleway console, manual node restart |
| OOM.md | Detection, identify pod, check memory limit, increase limit or optimize code |
| DISK_FULL.md | Detection, identify PVC, cleanup old data, extend volume |
| AUTOSCALER_STUCK.md | Detection, check Scaleway API, check pod resource requests, manual scale |
| CPU_SATURATED.md | Detection, identify top consumers, reduce parallelism or upgrade node |
| SCHEDULER_DOWN.md | Detection, check scheduler logs, restart scheduler pod |
| MODEL_DRIFT.md | Detection, compare recent F1/Sortino, check data pipeline, retrain |
| CERT_EXPIRY.md | Detection, renew cert, update secret |
| DB_SATURATED.md | Detection, check slow queries, increase pool, optimize queries |
| HTTP_ERRORS.md | Detection, check service logs, restart pod, check upstream |
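
Because each alert is meant to link to a runbook, it can help to check the runbook names used in the alerting rules against this list. The Python sketch below does that with the names copied from this document. The function name is hypothetical, and a real check would more likely read the rule files and the documentation/RUNBOOKS/ directory.

```python
# Runbook names referenced in the Critical and Warning rule tables.
REFERENCED = [
    "NODE_DOWN", "OOM", "CRASHLOOP", "DISK_FULL", "DB_SATURATED",
    "SCHEDULER_DOWN", "CPU_SATURATED", "MEMORY_PRESSURE", "AUTOSCALER_STUCK",
    "COMPUTE_IDLE", "CPU_THROTTLING", "MODEL_DRIFT", "CERT_EXPIRY", "HTTP_ERRORS",
]

# Runbooks listed in the table above.
LISTED = [
    "NODE_DOWN", "OOM", "DISK_FULL", "AUTOSCALER_STUCK", "CPU_SATURATED",
    "SCHEDULER_DOWN", "MODEL_DRIFT", "CERT_EXPIRY", "DB_SATURATED", "HTTP_ERRORS",
]


def missing_runbooks(referenced: list, listed: list) -> list:
    """Runbook names that an alert points to but that are not in the list."""
    return sorted(set(referenced) - set(listed))


if __name__ == "__main__":
    for name in missing_runbooks(REFERENCED, LISTED):
        print(f"no runbook listed for {name}")
```

Run against the names in this document, the check reports CRASHLOOP, MEMORY_PRESSURE, COMPUTE_IDLE, and CPU_THROTTLING as referenced by rules but absent from the list.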

Runbook template:

# RUNBOOK: {ALERT_NAME}

## Detection
What triggered: {description}
Where to look: {Grafana panel, log command}

## Diagnosis (first 3 checks)
1. {check 1}
2. {check 2}
3. {check 3}

## Resolution
{step-by-step fix}

## Escalation
If unresolved in 15 min: contact {name/role}


Implementation Plan

| Phase | Scope | Effort |
|---|---|---|
| 1 | Sections 1-4 (status bar, timeline, efficiency, FTF) | ~3h |
| 2 | Sections 5-7 (autoscaler, system health, sizing) | ~2h |
| 3 | Section 8 (application health: DB, HTTP, certs, DNS) | ~3h |
| 4 | Section 9 (Airflow health, requires StatsD/Prometheus endpoint) | ~2h |
| 5 | Section 10 (performance drift, PostgreSQL queries) | ~2h |
| 6 | Section 11 + PrometheusRules + Alertmanager config | ~3h |
| 7 | Runbooks (10 docs) | ~3h |
| 8 | Slack webhook + Twilio SMS integration + end-to-end test | ~2h |
| Total | | ~20h |

Prerequisites

| Dependency | Status | Action |
|---|---|---|
| Prometheus + kube-state-metrics | Deployed | |
| PostgreSQL datasource in Grafana | Configured | |
| Airflow StatsD/Prometheus metrics | Not enabled | Set AIRFLOW__METRICS__STATSD_ON=True |
| cert-manager | Not deployed | Deploy for cert expiry tracking, or manual check |
| Alertmanager | Deployed | Configure receivers (Slack, SMS) |
| Slack webhook | Not configured | Create webhook for #cvntrade-alerts |
| Twilio SMS | Not configured | Create account + configure in Alertmanager |

Success Criteria

  • 5-second glance: status bar tells if platform is healthy
  • 30-second scan: 12h timeline shows anomalies with threshold zones
  • Bottleneck ID: top 5 CPU + balance pies show distribution
  • FTF parallelism: per-pod CPU/throttling shows balance and stragglers
  • Autoscaler: node lifecycle + wake-up time visible
  • Swap: any usage = visible alert
  • Process sizing: over/under-provisioned pods color-coded
  • Application health: DB connections, HTTP errors, cert expiry, DNS latency
  • Airflow health: scheduler heartbeat, queue depth, failed tasks
  • Performance drift: F1/Sortino/action rate trends with anomaly detection
  • Alert log: all events with severity + runbook link
  • Every alert linked to SLA target
  • Escalation: critical→SMS+Slack, warning→Slack, info→annotation
  • 10 runbooks with diagnosis + resolution steps
  • Dashboard loads <3s with 12h range