Runbook — OTel Collector down¶
Alert: structured events no longer land in Loki; Grafana dashboards are silent.
Impact: the shadow parity and signal funnel dashboards are empty, and any
log_event() emitted from Airflow task pods is invisible to Loki. The Python
logger side (Airflow S3 task log) still works, so there is no data loss, only
a loss of real-time observability.
Affected dashboards:
- Pipeline Runner — Shadow Parity
- any future dashboard consuming {exporter="OTLP"} streams
Immediate human actions (0–5 min)¶
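- Check the collector pods (namespace cvntrade-observability, label
  app.kubernetes.io/name=opentelemetry-collector).
  Expected: 2 replicas, each 1/1 Running. If a pod is in CrashLoopBackOff,
  look for a config error in its logs: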
kubectl -n cvntrade-observability logs -l app.kubernetes.io/name=opentelemetry-collector --tail=40 | grep -iE "error|invalid"
- Check the collector's own telemetry metrics:
kubectl -n cvntrade-observability port-forward svc/otel-collector-opentelemetry-collector 8888:8888
curl -sS http://localhost:8888/metrics | grep -E "otelcol_exporter_(sent|send_failed)_log_records"
  - sent_log_records{exporter="loki"} growing → OK
  - send_failed_log_records{exporter="loki"} growing → Loki rejecting; see Loki runbook
- If running but not receiving, check the OTLP endpoint from an emitter pod:
SCHED=$(kubectl -n cvntrade get pods -l component=scheduler -o jsonpath='{.items[0].metadata.name}')
kubectl -n cvntrade exec $SCHED -c scheduler -- python3 -c "
import socket
host = 'otel-collector-opentelemetry-collector.cvntrade-observability.svc.cluster.local'
s = socket.create_connection((host, 4317), timeout=3); s.close(); print('reachable')
"
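The counter check on the collector's own telemetry metrics can be scripted so that two scrapes are compared rather than eyeballed. The following is a minimal sketch, assuming the collector serves the standard Prometheus text format on port 8888 as shown above. `sum_counter` and `diagnose` are hypothetical helper names, and accepting an optional `_total` suffix on the metric name is an assumption made to cover different collector versions.

```python
import re

# One Prometheus text-format sample: metric_name{label="v",...} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')


def sum_counter(metrics_text, base_name, exporter="loki"):
    """Sum samples of base_name (with or without a _total suffix) for one exporter."""
    total = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # "# HELP" / "# TYPE" comment or blank line
        name, labels, value = match.groups()
        if name not in (base_name, base_name + "_total"):
            continue
        if f'exporter="{exporter}"' in (labels or ""):
            total += float(value)
    return total


def diagnose(before_text, after_text, exporter="loki"):
    """Compare two /metrics scrapes taken a few seconds apart."""
    sent = "otelcol_exporter_sent_log_records"
    failed = "otelcol_exporter_send_failed_log_records"
    sent_delta = sum_counter(after_text, sent, exporter) - sum_counter(before_text, sent, exporter)
    failed_delta = sum_counter(after_text, failed, exporter) - sum_counter(before_text, failed, exporter)
    if failed_delta > 0:
        return "loki_rejecting"  # Loki is refusing records: see the Loki runbook
    if sent_delta > 0:
        return "ok"
    return "not_receiving"  # nothing exported: check the OTLP endpoint from an emitter pod
```

To use it, save the output of the `curl` command twice, a few seconds apart, and pass both strings to `diagnose`. The three return values map onto the three branches of the checklist above.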
Escalation path¶
- 0–5 min: on-call operator executes immediate actions above
- 5–15 min: if not resolved, page the platform team lead
- 15+ min: committee review — consider the kill-switch (below) to restore emitter-side stability at the cost of Loki visibility
Kill-switch¶
Disable OTLP emission. CVN_OTEL_ENABLED is read once at module import
(see src/commun/observability/otel.py:_ENABLED) so flipping the env var
requires a rollout restart for running pods to pick it up. Short-lived task
pods spawned after the env change inherit the new value automatically.
# 1. Set the env var on each service's Deployment (patches the spec):
kubectl -n cvntrade set env deploy/airflow-scheduler CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/airflow-webserver CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/cvntrade-runtime CVN_OTEL_ENABLED=0
kubectl -n cvntrade set env deploy/cvntrade-api CVN_OTEL_ENABLED=0
# 2. Rollout restart so running pods reload with the new env:
kubectl -n cvntrade rollout restart deploy/airflow-scheduler
kubectl -n cvntrade rollout restart deploy/airflow-webserver
kubectl -n cvntrade rollout restart deploy/cvntrade-runtime
kubectl -n cvntrade rollout restart deploy/cvntrade-api
Effect after restart: emit_event() is a no-op; Python logger path (→ S3)
unchanged. Dashboards stay empty until the collector is restored, but no log
record is lost.
Diagnostic steps¶
Grafana dashboard shows "No data" but SDK emits¶
- Confirm events reach the collector:
- Confirm exporter processes them:
otelcol_processor_accepted_log_records{processor="memory_limiter"}
otelcol_exporter_sent_log_records{exporter="loki"}
- Confirm they land in Loki with the expected stream label:
kubectl -n cvntrade-observability port-forward pod/loki-0 3100:3100
curl -sS 'http://localhost:3100/loki/api/v1/label/run_id/values?start=…&end=…'
If run_id missing → the attributes processor failed to inject the
loki.attribute.labels hint. Re-deploy the collector chart.
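Loki's label-values endpoint accepts `start` and `end` as Unix epoch timestamps in nanoseconds. The following small sketch builds such a URL for a recent window. `label_values_url`, the 15-minute default lookback, and the `base` default (matching the port-forward above) are illustrative choices, not part of this runbook.

```python
import time
from urllib.parse import urlencode


def label_values_url(label, lookback_s=900, now_s=None, base="http://localhost:3100"):
    """Build a Loki label-values URL covering the last lookback_s seconds."""
    now_s = time.time() if now_s is None else now_s
    end_ns = int(now_s * 1_000_000_000)
    start_ns = end_ns - int(lookback_s * 1_000_000_000)
    query = urlencode({"start": start_ns, "end": end_ns})
    return f"{base}/loki/api/v1/label/{label}/values?{query}"
```

Printing `label_values_url("run_id")` gives a URL that can be passed to `curl -sS` while the port-forward is active.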
Emit failures from code side¶
Look at scheduler pod logs for:
- event=otel_init_failed (SDK couldn't initialize — usually DNS/connectivity)
- event=otel_emit_failed (runtime failure, emitter flipped to disabled)
Both degrade to logger-only (S3 task log still populated).
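To tell the two failure modes apart quickly, the scheduler log can be tallied by event name. This is a minimal sketch, assuming the events appear literally as `event=<name>` in each log line; the exact log layout is not specified here, and `count_otel_failures` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches only the two failure events named above.
EVENT_RE = re.compile(r"\bevent=(otel_init_failed|otel_emit_failed)\b")


def count_otel_failures(log_text):
    """Count occurrences of the two OTel failure events in raw log text."""
    return Counter(match.group(1) for match in EVENT_RE.finditer(log_text))
```

Feeding it the text of the scheduler pod logs returns a `Counter` keyed by event name. A non-zero `otel_init_failed` count points at DNS or connectivity, while `otel_emit_failed` indicates the emitter disabled itself at runtime.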
Recovery procedures¶
Rolling restart¶
Preserves 2-replica availability via PDB.
Full redeploy (config change)¶
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
-n cvntrade-observability -f infra/helm/otel-collector/values.yaml
Common config traps¶
- split_queries_by_interval must live under limits_config, not query_range (Loki 2.9 validation fails on the latter).
- The loki.attribute.labels hint must be a LOG RECORD attribute (via the attributes processor), NOT a resource attribute. A resource-level hint is silently ignored.
- The debug exporter must NOT be in the logs pipeline in prod, because Promtail scrapes collector stdout and would duplicate events.
Post-incident¶
- Append an entry to mlflow_promotion_audit if a model promotion was blocked on observability during the outage.
- Review the root cause and update this runbook with the new signature.
- If the outage lasted > 15 min, add an ADR amendment noting the degradation mode and any new guardrail.
Related¶
- ADR-62 — OTel implementation status
- Issue #589 — OTel Collector + SDK rollout
- Issue #567 — parent observability stack (Loki, Tempo, Sentry)
- Issue #577 — Loki deployment
- infra/helm/otel-collector/README.md (chart-specific detail)