loki-query — diagnostic Loki CLI (CVN-N014-EC-S02)

scripts/loki_query.py wraps the recurring Loki diagnostic ritual: port-forward svc/loki, then run a LogQL query_range query filtered by namespace, event, and time window.

Why it exists (the GOTCHA)

The OTel→Loki path drops events (the cluster constantly logs Failed to export … to otel-collector), so a label match like {event="sNN_cell_outcome"} is lossy: on the S23 group run it returned only 1 of 5 verdicts. The reliable path is a line filter on the airflow-stdout stream: {namespace="cvntrade"} |~ "event=…" returned all 5 of 5 verdicts plus the group verdict. This script defaults to the line-filter path; the lossy label path is opt-in via --label-mode (with a warning).
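
As a rough illustration of the two paths, the sketch below builds both LogQL strings. The helper name build_logql is hypothetical and not taken from scripts/loki_query.py, and it assumes event names are plain identifiers with no regex metacharacters.

```python
def build_logql(event: str, namespace: str = "cvntrade", label_mode: bool = False) -> str:
    """Return a LogQL expression for one event name (hypothetical helper)."""
    if label_mode:
        # Lossy path: only matches entries where the OTel exporter
        # actually delivered the event label to Loki.
        return f'{{event="{event}"}}'
    # Reliable path: select the namespace stream, then regex-filter
    # the raw line body for "event=<name>".
    return f'{{namespace="{namespace}"}} |~ "event={event}"'


print(build_logql("s28_cell_outcome"))
print(build_logql("s28_cell_outcome", label_mode=True))
```

The line-filter form costs more at query time because Loki scans line bodies, but it does not depend on the exporter having attached the label.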

Usage

# auto port-forward svc/loki (cvntrade-observability), reliable line-filter path
python scripts/loki_query.py --event s28_cell_outcome --since 48h

# against an already-running endpoint (no port-forward)
python scripts/loki_query.py --event training_complete --loki-url http://localhost:3100

# arbitrary LogQL (escape hatch)
python scripts/loki_query.py --query '{namespace="cvntrade"} |~ "event=dag_loaded"' --since 6h

flag             default                  meaning
--event          (none)                   event name to line-filter (e.g. s28_cell_outcome, training_complete)
--namespace      cvntrade                 log stream namespace label
--since          24h                      lookback: 30m / 2h / 7d / 90s
--limit          2000                     max log lines
--query          (none)                   raw LogQL override (ignores --event/--namespace/--label-mode)
--label-mode     off                      lossy OTel label path {event=…}; not recommended
--loki-url       (none)                   query this URL directly (skip kubectl port-forward)
--k8s-namespace  cvntrade-observability   k8s namespace of svc/loki
--org-id         (none)                   X-Scope-OrgID tenant header; only needed if this Loki has auth_enabled
--max-bytes      5_000_000                warn (don't fail) if output exceeds this size
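
One way the --since values could be turned into a query window is sketched below. The function names parse_since and query_window_ns are hypothetical; the only assumptions taken from outside this page are that Loki's query_range accepts start and end as nanosecond Unix epochs, and that the script supports just the s/m/h/d suffixes shown in the table.

```python
import re
import time
from typing import Optional, Tuple

# Seconds per supported unit suffix (assumed set: s, m, h, d).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_since(value: str) -> int:
    """Convert a lookback such as '30m' or '7d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported --since value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def query_window_ns(since: str, now_s: Optional[int] = None) -> Tuple[int, int]:
    """Return (start_ns, end_ns) for a lookback ending at now (UTC epoch)."""
    end_s = int(time.time()) if now_s is None else int(now_s)
    start_s = end_s - parse_since(since)
    # Integer math avoids float rounding at nanosecond scale.
    return start_s * 10**9, end_s * 10**9
```

Because the window is computed from the Unix epoch, the result does not depend on the local timezone of the machine running the script.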

Output is newest-first, UTC-timestamped (YYYY-MM-DDTHH:MM:SSZ prefix from the Loki entry time), de-duplicated by body, with Loki self-echo / querier noise (caller=, querier, Failed to export) stripped. A WARN is printed if output exceeds --max-bytes or if --limit was hit (results truncated, oldest dropped).
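
The post-processing described above can be sketched roughly as follows. This is an illustration, not the script's actual code: format_entries is a hypothetical name, the input is assumed to be (timestamp_ns, body) pairs as found in the values arrays of a Loki query_range response, and keeping the newest copy of a duplicate body is an assumption.

```python
from datetime import datetime, timezone

# Substrings that mark Loki self-echo / querier noise (from the list above).
NOISE_MARKERS = ("caller=", "querier", "Failed to export")


def format_entries(entries):
    """Return newest-first, UTC-stamped, de-duplicated, noise-free lines."""
    seen = set()
    lines = []
    # Loki timestamps are nanosecond epoch strings; sort newest first.
    for ts_ns, body in sorted(entries, key=lambda e: int(e[0]), reverse=True):
        if any(marker in body for marker in NOISE_MARKERS):
            continue  # drop self-echo / querier noise
        if body in seen:
            continue  # de-duplicate by body, keeping the newest occurrence
        seen.add(body)
        stamp = datetime.fromtimestamp(int(ts_ns) // 10**9, tz=timezone.utc)
        lines.append(f"{stamp.strftime('%Y-%m-%dT%H:%M:%SZ')} {body}")
    return lines


sample = [
    ("1700000000000000000", "event=a ok"),
    ("1700000060000000000", "event=b ok"),
    ("1700000030000000000", "event=a ok"),
    ("1700000090000000000", "level=info caller=x querier"),
]
for line in format_entries(sample):
    print(line)
```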

Infrastructure assumptions (committee plan_review 9183d8b6)

  • Auth: in-cluster Loki runs auth_enabled: false (single-tenant), so --org-id is normally unnecessary. If a Loki does have auth, a query returns HTTP 401/403 and the script fails fast with an explicit hint to pass --org-id <tenant>.
  • Time: --since is relative to now in UTC. The window is computed with epoch arithmetic rather than date -d, so it behaves the same on Linux and macOS; output timestamps are UTC with a Z suffix. No local-timezone ambiguity.
  • Tenancy: single-tenant by default; --org-id future-proofs multi-tenant Loki.
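
A minimal sketch of how the auth and tenancy assumptions might translate into a request, using only the standard library. The names build_request and fetch are hypothetical, and the error message wording is illustrative; /loki/api/v1/query_range and the X-Scope-OrgID header are part of Loki's HTTP API.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional


def build_request(loki_url: str, query: str, start_ns: int, end_ns: int,
                  limit: int = 2000, org_id: Optional[str] = None) -> urllib.request.Request:
    """Build a query_range GET request, adding the tenant header only if given."""
    params = urllib.parse.urlencode({
        "query": query,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    req = urllib.request.Request(f"{loki_url.rstrip('/')}/loki/api/v1/query_range?{params}")
    if org_id:
        req.add_header("X-Scope-OrgID", org_id)
    return req


def fetch(req: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Run the request; fail fast with a hint when Loki rejects it for auth."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403):
            raise SystemExit(
                f"Loki returned HTTP {exc.code}; this instance may have "
                "auth enabled, try --org-id <tenant>"
            ) from exc
        raise
```

Leaving the header off by default matches the single-tenant case, so the common path needs no extra flags.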

Reference: memory reference_loki_query · plan dossier ../reviews/2026-05-25-cvn-n014-ec-s02-loki-query-plan.md.