Skip to content

0062 unified observability via opentelemetry

ADR-62 — Unified Observability via OpenTelemetry

Status: Partially implemented (2026-04-18) — logs shipped; traces deferred to #567 (Tempo).

Implementation status: - ✅ OTel Collector deployed in cvntrade-observability (#590, helm chart infra/helm/otel-collector/, 2 replicas + PDB + anti-affinity, processors: memory_limiter + batch + resource + attributes). - ✅ Python SDK integration (#591) — commun.observability.otel.emit_event() + log_event() dual-sink (Python logger → S3 for Airflow UI, OTLP → Collector → Loki for dashboards). Thread-safe lazy init, secret redaction (ADR-62 catalog: password|secret|token|key|credential), emit failure flips emitter off with structured warn. - ✅ Loki label promotion — collector's attributes processor injects loki.attribute.labels: event,run_id,crypto hint on every record, so these log-record attributes become stream labels queryable without pipe stages (required for Grafana variable dropdowns). - ✅ Dashboard validated end-to-end (pipeline_shadow_parity.json) — FTF shadow events reach Loki via OTel and render in the parity matrix (5 control points × Monolith / Pipeline / Δ). - ⏳ Spans / traces — deferred to issue #567 (Tempo deployment); Phase 4 will add the OTLP traces pipeline to the collector and instrument CandlePipeline.process() + Hamilton driver.

Context: The platform emits structured logs (ADR-32 event=key=value) and scrapes Prometheus metrics, but has no distributed tracing. On an FTF run that chains training → backtest → persist across 5 cryptos × 5 folds, there is no way to see which step takes how long, which step blocks, or how a single candle's processing decomposes across pipeline steps. Issue #567 identified the gap. Without a standard API, each component would invent its own tracing convention (timings in dicts, custom span classes, ad-hoc logging), creating drift and making it impossible to correlate traces with logs and metrics.

Decision: All pipeline steps — in both CandlePipeline (ADR-60) and Hamilton DAGs (ADR-61) — MUST emit OpenTelemetry spans using the standard opentelemetry-api. The SDK backend is swappable: noop (default, zero overhead), Tempo (for distributed tracing), or stdout (for local debug). No component defines its own tracing primitive.

For structured events (complement to spans), log_event() in commun.logs.cvntrade_log_manager dual-sinks to the Python logger (for the Airflow UI via S3) AND to OTLP logs via commun.observability.otel.emit_event() (for Loki dashboards). The OTLP path is the ONLY one that reaches Loki from Airflow KubernetesExecutor task pods, whose stdout is captured by FileTaskHandler and never seen by Promtail. Operators MUST use log_event() — do not call emit_event() directly.

Invariants:

  • Standard API only: Use opentelemetry.trace.get_tracer() and tracer.start_as_current_span(). Do NOT invent custom timing decorators, custom span classes, or ad-hoc dicts of timings. Legacy step_timings dicts may coexist during Phase 2/3 migration, but new code MUST use OTel.
  • Pluggable backend: The OTel SDK is configured once at process startup (runtime, DAG pod, CLI). Default is the noop tracer — zero cost when no backend is deployed. When Tempo is deployed (#567), the SDK exports via OTLP.
  • Span naming convention: Spans use lowercase snake_case matching the step name. Nested spans for sub-operations (enrichment.compute_rsi, filter_chain.trend). This matches the event= convention in ADR-32.
  • Attributes = golden fields: Span attributes use the same golden field names as structured logs (crypto, fold_id, factor, variant). This enables correlation: a trace in Tempo links to logs in Loki via crypto=UNIUSDC.
  • No PII, no secrets in spans: Same filter as ADR-59 config dump — no values matching /PASSWORD|SECRET|TOKEN|KEY|CREDENTIAL/i as attributes.
  • Sampling: Production uses head sampling at 10% (configurable). Development uses 100%. Sampling decision is centralized in the SDK config, not per-span.
  • Error recording: Exceptions in spans MUST call span.record_exception() and span.set_status(StatusCode.ERROR). Do NOT log the same exception separately — the span carries the context.

Scope:

Component Spans emitted
CandlePipeline (hot path) One root span per candle, one child span per step
Hamilton DAG (batch) One root span per dr.execute(), child spans per node (via Hamilton adapter)
Airflow tasks Task-level spans via Airflow's OTel plugin (existing standard)
Runtime service (#417) HTTP request spans via FastAPI instrumentation

Alternatives rejected: - Keep custom timing dicts: no standard correlation with logs/metrics, no distributed tracing. - Jaeger-specific client: OpenTelemetry is the vendor-neutral standard; Jaeger is one possible backend. - Datadog APM: vendor lock-in, paid, redundant with our self-hosted stack. - Always-on 100% sampling: volume explosion in production, especially on 10K-candle backtests.

Files (to be created Phase 5 of #568): src/commun/observability/tracing.py (SDK config + initialization), integration points in CandlePipeline.process(), Hamilton driver adapter, FastAPI instrumentation.