Skip to content

Runbook — OTel Collector down

Alert: structured events no longer land in Loki / Grafana dashboards silent.

Impact: shadow parity dashboard empty, signal funnel dashboard empty, any log_event() emitted from Airflow task pods invisible to Loki. Python logger side (Airflow S3 task log) still works — no data loss, but no real-time observability.

Affected dashboards: - Pipeline Runner — Shadow Parity - any future dashboard consuming {exporter="OTLP"} streams

Immediate human actions (0–5 min)

  1. Check pods:
kubectl -n cvntrade-observability get pods -l app.kubernetes.io/name=opentelemetry-collector

Expected: 2 replicas 1/1 Running. If CrashLoopBackOff, look at config error:

kubectl -n cvntrade-observability logs -l app.kubernetes.io/name=opentelemetry-collector --tail=40 | grep -iE "error|invalid"
  1. Check the collector's own telemetry metrics:
kubectl -n cvntrade-observability port-forward svc/otel-collector-opentelemetry-collector 8888:8888
curl -sS http://localhost:8888/metrics | grep -E "otelcol_exporter_(sent|send_failed)_log_records"
  • sent_log_records{exporter="loki"} growing → OK
  • send_failed_log_records{exporter="loki"} growing → Loki rejecting; see Loki runbook

  • If running but not receiving, check the OTLP endpoint from an emitter pod:

SCHED=$(kubectl -n cvntrade get pods -l component=scheduler -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec $SCHED -c scheduler -- python3 -c "
import socket
host = 'otel-collector-opentelemetry-collector.cvntrade-observability.svc.cluster.local'
s = socket.create_connection((host, 4317), timeout=3); s.close(); print('reachable')
"

Escalation path

  • 0–5 min: on-call operator executes immediate actions above
  • 5–15 min: if not resolved, page the platform team lead
  • 15+ min: committee review — consider the kill-switch (below) to restore emitter-side stability at the cost of Loki visibility

Kill-switch

Disable OTLP emission. CVN_OTEL_ENABLED is read once at module import (see src/commun/observability/otel.py:_ENABLED) so flipping the env var requires a rollout restart for running pods to pick it up. Short-lived task pods spawned after the env change inherit the new value automatically.

# 1. Set the env var on each service's Deployment (patches the spec):
kubectl -n cvntrade set env deploy/airflow-scheduler CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/airflow-webserver CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/cvntrade-runtime CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/cvntrade-api CVN_OTEL_ENABLED=0

# 2. Rollout restart so running pods reload with the new env:
kubectl -n cvntrade rollout restart deploy/airflow-scheduler
kubectl -n cvntrade rollout restart deploy/airflow-webserver
kubectl -n cvntrade rollout restart deploy/cvntrade-runtime
kubectl -n cvntrade rollout restart deploy/cvntrade-api

Effect after restart: emit_event() is a no-op; Python logger path (→ S3) unchanged. Dashboards stay empty until the collector is restored, but no log record is lost.

Diagnostic steps

Grafana dashboard shows "No data" but SDK emits

  1. Confirm events reach the collector:
otelcol_receiver_accepted_log_records{receiver="otlp"} — should grow
  1. Confirm exporter processes them:
otelcol_processor_accepted_log_records{processor="memory_limiter"}
otelcol_exporter_sent_log_records{exporter="loki"}
  1. Confirm they land in Loki with the expected stream label:
kubectl -n cvntrade-observability port-forward pod/loki-0 3100:3100
curl -sS 'http://localhost:3100/loki/api/v1/label/run_id/values?start=…&end=…'

If run_id missing → the attributes processor failed to inject the loki.attribute.labels hint. Re-deploy the collector chart.

Emit failures from code side

Look at scheduler pod logs for: - event=otel_init_failed (SDK couldn't initialize — usually DNS/connectivity) - event=otel_emit_failed (runtime failure, emitter flipped to disabled)

Both degrade to logger-only (S3 task log still populated).

Recovery procedures

Rolling restart

kubectl -n cvntrade-observability rollout restart deploy/otel-collector-opentelemetry-collector

Preserves 2-replica availability via PDB.

Full redeploy (config change)

helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  -n cvntrade-observability -f infra/helm/otel-collector/values.yaml

Common config traps

  • split_queries_by_interval must live under limits_config, not query_range (Loki 2.9 validation fails on the latter).
  • loki.attribute.labels hint must be a LOG RECORD attribute (via attributes processor), NOT a resource attribute. Resource-level hint is silently ignored.
  • The debug exporter must NOT be in the logs pipeline in prod — Promtail scrapes collector stdout and would duplicate events.

Post-incident

  1. Append an entry to mlflow_promotion_audit if a model promotion was blocked on observability during the outage.
  2. Review the root cause and update this runbook with the new signature.
  3. If the outage lasted > 15 min, add an ADR amendment noting the degradation mode and any new guardrail.
  • ADR-62 — OTel implementation status
  • Issue #589 — OTel Collector + SDK rollout
  • Issue #567 — parent observability stack (Loki, Tempo, Sentry)
  • Issue #577 — Loki deployment
  • infra/helm/otel-collector/README.md — chart-specific detail