
Design — Unified Observability : 3-layer model (technical + execution + drift) (CVN-N010)

need_id: CVN-N010 (proposed ; to allocate via importer after committee plan_review PASSED)
Authors: Dominique (operator) + Claude
Reviewers: Expert Committee (plan_review pending)
Status: draft, awaiting committee plan_review
Created: 2026-04-28
Last updated: 2026-04-28

Changelog

| Date | Change |
|---|---|
| 2026-04-28 | Initial draft for committee submission |

1. Problem restated

Current CVNTrade observability covers the technical layer (Loki for logs, Prometheus for metrics, Tempo for traces, all fed by OpenTelemetry per ADR-62) but misses two layers that are load-bearing for the question "is this pipeline doing the right thing?":

  • Execution / business KPI layer: per-run metrics (run_id, module_name, dag_node, success_rate, data_quality_score, business_score, predictions_captured_rate, …) that don't fit Prometheus's data model (high cardinality from run_id, structured metadata, business semantics).
  • Drift layer: distribution-shift detection across data, features, predictions, business KPIs, and even technical metrics (latency_drift, error_rate_drift). Today this is implicit (humans spotting anomalies in Grafana panels) ; we have no engine, no store, no automated alerting.

Concrete failure modes the missing layers would have caught earlier in the last quarter :

| Incident | Layer that would have caught it |
|---|---|
| Variance feature-selection structurally dead post-StandardScaler (#706, ~2 weeks unnoticed) | Drift on feature_selection_variance_dispersion would have flagged immediately |
| Predictions hook silent skip 100 % on 1 fold (#700/#701) | Execution KPI predictions_captured_rate per (variant × crypto × fold) would have fired |
| Phase 2 rerun silent failure (4-hour cost) | Execution KPI on FTF run lifecycle (variants completed / total expected) would have alerted |

The gap is structural, not a tooling oversight : technical observability and business-fitness observability serve different questions and need different tools.
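To make the second incident concrete, here is a minimal sketch of how a per-(variant, crypto, fold) capture rate could expose a silent skip. The record layout, the function names and the `min_rate` threshold are assumptions made for illustration ; they are not part of the existing codebase.

```python
from collections import defaultdict


def predictions_captured_rate(records):
    """Return {(variant, crypto, fold): captured / expected} per group.

    `records` is an iterable of dicts with the hypothetical keys
    variant, crypto, fold, expected, captured.
    """
    expected = defaultdict(int)
    captured = defaultdict(int)
    for r in records:
        key = (r["variant"], r["crypto"], r["fold"])
        expected[key] += r["expected"]
        captured[key] += r["captured"]
    # Groups with nothing expected are skipped rather than reported as 0 or 1.
    return {k: captured[k] / expected[k] for k in expected if expected[k] > 0}


def silent_skips(rates, min_rate=1.0):
    """Groups whose capture rate is below min_rate (illustrative threshold)."""
    return sorted(k for k, rate in rates.items() if rate < min_rate)
```

A fold where the hook skipped every prediction shows up with a rate of 0.0 and is returned by `silent_skips`, which is the signal an execution-layer alert would key on.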

Need doc: TBD — documentation/needs/CVN-N010-unified-observability.md (to be created post-committee approval).

2. Goals

  1. Single cognitive model : every observable in the system belongs to exactly one of three layers — technical, execution, drift. No overlap, no ambiguity.
  2. Each layer = its own store with appropriate cardinality, retention, and query language.
  3. KPI Store for execution layer : timeseries-friendly, accepts high-cardinality metadata (run_id, module, dag_node), SQL-queryable for ad-hoc analysis.
  4. Drift Store for drift layer : universal — accepts data drift, model drift, perf drift, latency drift, business KPI drift. Not ML-specific.
  5. Evidently as the drift engine for data + model drift (the well-trodden path) ; PromQL + custom analyzers for technical / business drift (we already have the inputs).
  6. Grafana as single entry point : extends ADR-26 — operator sees the 3 layers in one place.
  7. Emission contract : every microservice / DAG / Hamilton node knows where to write each kind of telemetry. No router, no implicit fan-out.

3. Non-goals

  • Not replacing Prometheus / Loki / Tempo — they handle the technical layer correctly. This design adds two new layers ; it does not refactor the existing one.
  • Not pure ML observability — we want broader (any pipeline drift), not narrower. Evidently is the data/model engine, but the Drift Store accepts non-ML drift entries too.
  • Not real-time prediction monitoring — we operate offline / near-real-time. Hot-path inference monitoring is a separate concern (ADR-71 kill-switch + future Story).
  • Not a vendor lock-in — Evidently is open-source ; the Drift Store is plain PostgreSQL ; nothing forces us to keep Evidently if a custom analyzer outperforms it.

4. Architecture

4.1 System context (left → right user/business flow ; observability extracted to the right)

flowchart LR
    %% LEFT: business flow
    User[👤 User / Operator]
    UI[🖥️ UI / Console]
    API[🌐 API Gateway]
    Svc[⚙️ Business microservice]
    Airflow[🧭 Airflow orchestration]
    Hamilton[🧩 Hamilton dataflow]
    Data[💾 Data Sources / Sinks]

    %% RIGHT: observability layers (3)
    subgraph TECH["1. Technical observability (existing)"]
        OTel[🔎 OpenTelemetry]
        Tempo[🧵 Tempo<br/>traces]
        Prom[📈 Prometheus<br/>metrics]
        Loki[📜 Loki<br/>logs]
    end

    subgraph EXEC["2. Execution observability (NEW)"]
        KPI[📊 KPI Store<br/>TimescaleDB schema]
    end

    subgraph DRIFT["3. Drift observability (NEW)"]
        Evidently[📉 Evidently.ai<br/>drift engine]
        Custom[🔧 Custom analyzers<br/>PromQL + jobs]
        DriftStore[🧠 Drift Store<br/>PostgreSQL schema]
    end

    Graf[📊 Grafana<br/>single entry per ADR-26]

    %% main flow
    User --> UI --> API --> Svc --> Airflow --> Hamilton --> Data

    %% technical telemetry
    API --> OTel
    Svc --> OTel
    Airflow --> OTel
    Hamilton --> OTel
    OTel --> Tempo
    OTel --> Prom
    API --> Loki
    Svc --> Loki
    Airflow --> Loki
    Hamilton --> Loki

    %% execution KPIs
    API --> KPI
    Svc --> KPI
    Airflow --> KPI
    Hamilton --> KPI
    Data --> KPI

    %% drift inputs
    Data --> Evidently
    Hamilton --> Evidently
    KPI --> Evidently
    Prom --> Custom
    KPI --> Custom
    Evidently --> DriftStore
    Custom --> DriftStore

    %% visualization
    Tempo --> Graf
    Prom --> Graf
    Loki --> Graf
    KPI --> Graf
    DriftStore --> Graf

4.2 The 3 layers

| Layer | Question it answers | Store | Engine | Cardinality | Retention | Examples |
|---|---|---|---|---|---|---|
| 1. Technical | "what's happening / how" | Tempo + Prometheus + Loki | OpenTelemetry | low–medium | 30d (Loki), 30d (Prom), 7d (Tempo) | latency p99, CPU, span tree, error logs |
| 2. Execution | "is this run doing what we asked" | KPI Store (TimescaleDB extension on existing PG) | direct emit (emit_kpi(...)) | medium–high (run_id, module, fold_id) | 90d hot + S3 cold | run_id, success_rate, data_quality_score, business_score, predictions_captured_rate, n_trades_per_fold |
| 3. Drift | "is the system degrading vs reference" | Drift Store (dedicated PG schema, separate from KPI) | Evidently (data/model) + custom analyzers (technical/business) | low | 365d (drift events are rare and heavy) | data_drift_psi, feature_drift, prediction_drift, latency_drift, error_rate_drift, business_kpi_drift |

Why 3 layers, not 1 or 2 :

  • One store would force trade-offs : Prometheus loses on cardinality, PG loses on PromQL ergonomics, ClickHouse adds a vendor.
  • Two layers (technical + business) would conflate execution and drift. Drift is a derived observable computed from the others ; mixing it with raw KPI emission creates write-amplification and circular-dependency surface.
  • Three layers map cleanly to three questions, three storage profiles, three emission patterns.

4.3 Emission contracts

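Each layer has its own emission module : one helper for Layers 1 and 2, and two for Layer 3 (one for Evidently reports, one for custom-analyzer events). No service or Hamilton node bypasses them.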

# Layer 1 — Technical (existing — ADR-62)
from commun.observability.otel import emit_event
emit_event("training_fold_complete", crypto="BTCUSDC", fold_id=0, duration_ms=12_345)

# Layer 2 — Execution KPI (new — this design)
from commun.observability.kpi import emit_kpi
emit_kpi(
    module="hamilton.feature_engineering",
    run_id=run_id,
    dag_node="compute_features",
    success_rate=0.97,
    data_quality_score=0.94,
    n_input=1000,
    n_output=985,
    duration_ms=2_340,
)

# Layer 3 — Drift (new — this design)
from commun.observability.drift import push_drift_report
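from evidently import Report
from evidently.presets import DataDriftPreset

# Assumes the Evidently >= 0.7 API : run() takes the current data first and returns the evaluation result.
report = Report([DataDriftPreset()]).run(current_data=current_data, reference_data=reference_data)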
push_drift_report(
    report,
    run_id=run_id,
    module="hamilton.feature_engineering",
    drift_type="data_drift",
)

# OR for non-ML drift (custom analyzer)
from commun.observability.drift import push_drift_event
push_drift_event(
    drift_type="latency_drift",
    module="api.predictions",
    metric="p99_latency_ms",
    reference_value=120,
    current_value=540,
    significance=0.001,
)
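For the non-ML path, a custom analyzer ultimately has to turn two samples into the arguments of `push_drift_event()`. The following is a minimal sketch under stated assumptions : the function names and the `max_ratio` threshold are hypothetical, and `significance` is deliberately left out because a simple ratio test yields neither a p-value nor a PSI score.

```python
import statistics


def p99(samples):
    """99th percentile via statistics.quantiles (needs at least 2 samples)."""
    return statistics.quantiles(samples, n=100)[98]


def latency_drift_event(module, reference_ms, current_ms, max_ratio=2.0):
    """Return keyword arguments for push_drift_event(), or None if no drift.

    Drift is declared when the current p99 is at least max_ratio times the
    reference p99. The threshold is illustrative, not a recommended value.
    """
    ref = p99(reference_ms)
    cur = p99(current_ms)
    if ref <= 0 or cur / ref < max_ratio:
        return None
    return {
        "drift_type": "latency_drift",
        "module": module,
        "metric": "p99_latency_ms",
        "reference_value": ref,
        "current_value": cur,
    }
```

An analyzer job would call `push_drift_event(**event)` only when the function returns a dict, so quiet periods write nothing to the Drift Store.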

4.4 Storage choices

KPI Store

Option Pros Cons Reco
Prometheus already deployed, PromQL familiar not designed for high-cardinality metadata (run_id explodes labels) NO — cardinality explosion
TimescaleDB extension on existing managed PG already deployed PG (Scaleway), SQL ergonomics, hypertables compress old data adds an extension to managed PG (verify Scaleway supports it) YES (default)
Dedicated ClickHouse best analytics on huge event volumes new infra dep, ops overhead NO unless volume > 100k events/day
Plain PG without TimescaleDB zero new dep manual partitioning + retention fallback if TimescaleDB unavailable

Drift Store

Option Pros Cons Reco
Same TimescaleDB as KPI one less store mixes concerns ; drift is sparse + heavy, KPI is dense + light NO
Dedicated PG schema (no TimescaleDB extension needed — drift events are sparse) clean separation, no infra cost one more schema YES (default)
Evidently Cloud (managed UI) turnkey external SaaS, vendor lock-in, $$$ NO
WhyLabs also managed same lock-in concerns NO

4.5 Container view (Structurizr DSL extension to architecture/workspace.dsl)

Add the following to documentation/architecture/workspace.dsl, inside the existing cvntrade softwareSystem block and after its current containers. This is a proposed extension only ; it is merged after committee plan_review and ADR ratification.

// CVN-N010 extension : 3-layer observability containers.
// Inserts INSIDE the `cvntrade` softwareSystem block, after the existing
// containers (frontend, configConsole, ..., docs, grafana).
//
// New containers (this design proposes) :
kpiStore = container "KPI Store" "Execution + business KPIs per run/module/dag_node. TimescaleDB extension on the existing managed PG (CVN-N010-EA)." "PostgreSQL + TimescaleDB" "database"
driftStore = container "Drift Store" "Drift events from Evidently + custom analyzers. Dedicated PG schema, sparse events, 365d retention (CVN-N010-EB)." "PostgreSQL" "database"
evidently = container "Evidently" "Drift engine for data + model drift. Runs in Airflow batch jobs and as a sidecar in select services. Outputs drift reports → Drift Store (CVN-N010-EB)." "Python, Evidently OSS"

// Relationships : services / DAGs / Hamilton emit KPIs to the KPI Store
api -> kpiStore "emit_kpi() — execution metrics per request / batch"
runtime -> kpiStore "emit_kpi() — paper / live trade outcomes per session"
airflow -> kpiStore "emit_kpi() — DAG run completion + per-task success rate"
configApi -> kpiStore "emit_kpi() — config-change audit trail"

// Relationships : drift sources feed Evidently + the Drift Store
postgres -> evidently "Reference + current datasets via SQL"
evidently -> driftStore "Drift reports (data, feature, prediction, target)"
kpiStore -> driftStore "Custom analyzers detect KPI drift via SQL jobs"

// Relationships : Grafana reads the 3 layers
grafana -> kpiStore "PostgreSQL datasource — KPI panels"
grafana -> driftStore "PostgreSQL datasource — drift panels + alerts"

The corresponding Containers view (just a views block addition) :

// In the `views` block of workspace.dsl
container cvntrade "ObservabilityLayers" {
    include kpiStore
    include driftStore
    include evidently
    include grafana
    include api
    include runtime
    include airflow
    include configApi
    // include hamilton         // enable only once a Hamilton container is defined in the model
    autoLayout lr
    title "CVN-N010 — 3-layer observability"
    description "Execution KPI Store + Drift Store + Evidently extending the existing technical observability."
}

4.6 Sample drift schema (Drift Store)

Per the proposed dedicated PG schema drift :

CREATE SCHEMA IF NOT EXISTS drift;

CREATE TABLE IF NOT EXISTS drift.events (
    id              BIGSERIAL PRIMARY KEY,
    run_id          TEXT,                                            -- NULL when scheduled batch (no run scope)
    module          TEXT NOT NULL,                                   -- e.g. 'hamilton.feature_engineering', 'api.predictions'
    drift_type      TEXT NOT NULL CHECK (drift_type IN (
                        'data_drift', 'feature_drift', 'prediction_drift',
                        'target_drift', 'data_quality',
                        'latency_drift', 'error_rate_drift', 'volume_drift',
                        'business_kpi_drift', 'model_confidence_drift'
                    )),
    metric          TEXT NOT NULL,                                   -- e.g. 'psi_X1', 'p99_latency_ms'
    reference_value DOUBLE PRECISION,
    current_value   DOUBLE PRECISION,
    significance    DOUBLE PRECISION,                                -- p-value or PSI score
    severity        TEXT NOT NULL CHECK (severity IN ('info', 'warn', 'crit')),
    payload         JSONB,                                           -- full Evidently report or analyzer output
    detected_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- IF NOT EXISTS keeps the index DDL as idempotent as the table DDL above.
CREATE INDEX IF NOT EXISTS drift_events_module_detected_idx ON drift.events (module, detected_at DESC);
CREATE INDEX IF NOT EXISTS drift_events_run_id_idx ON drift.events (run_id) WHERE run_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS drift_events_severity_idx ON drift.events (severity, detected_at DESC) WHERE severity IN ('warn', 'crit');
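Since the `significance` column may hold a PSI score, a minimal pure-Python sketch of the Population Stability Index is shown here for orientation. It is not the Evidently implementation : the equal-width binning over the reference range, the `bins` default and the `eps` floor for empty bins are all assumptions of this sketch.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range ; current values outside
    that range are counted in the first or last bin. Empty bins are floored
    at eps so the logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference : everything lands in bin 0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of exactly 0 ; the score grows as mass moves between bins, which is what a row with `metric = 'psi_X1'` would record in `current_value` or `significance`.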

KPI Store schema (TimescaleDB hypertable) :

CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE SCHEMA IF NOT EXISTS kpi;

CREATE TABLE IF NOT EXISTS kpi.events (
    time            TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    run_id          TEXT,
    module          TEXT NOT NULL,
    dag_node        TEXT,
    metric_name     TEXT NOT NULL,
    metric_value    DOUBLE PRECISION,
    tags            JSONB,                                           -- whitelisted keys (see §6 risk R1)
    duration_ms     INTEGER,
    success         BOOLEAN
);
SELECT create_hypertable('kpi.events', 'time', if_not_exists => TRUE);

CREATE INDEX kpi_events_run_module_time_idx ON kpi.events (run_id, module, time DESC);
CREATE INDEX kpi_events_metric_time_idx ON kpi.events (metric_name, time DESC);

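-- 90-day hot retention. add_retention_policy drops chunks older than the interval,
-- so the S3 cold-storage export (separate Story) must run before chunks expire.
SELECT add_retention_policy('kpi.events', INTERVAL '90 days', if_not_exists => TRUE);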
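The `emit_kpi()` call in §4.3 passes several metrics as keyword arguments, while `kpi.events` stores one `metric_name` / `metric_value` pair per row. One way to bridge the two, sketched here under assumptions (the function name `kpi_rows` and the choice to copy `duration_ms` and `success` onto every row are hypothetical), is to expand each call into one row per numeric metric :

```python
from datetime import datetime, timezone


def kpi_rows(module, run_id=None, dag_node=None, duration_ms=None,
             success=None, tags=None, **metrics):
    """Expand one emit_kpi()-style call into one dict per numeric metric.

    The dict keys mirror the kpi.events columns ; all rows of a call share
    the same timestamp so they can be re-grouped at query time.
    """
    now = datetime.now(timezone.utc)
    rows = []
    for name, value in metrics.items():
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"metric {name!r} must be numeric, got {type(value).__name__}")
        rows.append({
            "time": now,
            "run_id": run_id,
            "module": module,
            "dag_node": dag_node,
            "metric_name": name,
            "metric_value": float(value),
            "tags": dict(tags or {}),
            "duration_ms": duration_ms,
            "success": success,
        })
    return rows
```

The long format keeps the table schema stable when a new metric appears : only a new `metric_name` value is written, with no migration.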

4.7 Security model (per committee 4b508801 reco 6)

Both new stores (kpi.events, drift.events) sit on the existing managed Scaleway PostgreSQL ; their security profile inherits the existing platform stance (network policies, TLS in transit, encrypted-at-rest), with three additions specific to the new emission paths :

| Concern | Mechanism | Where defined | Story |
|---|---|---|---|
| Secrets management | Connection strings injected into pods via Kubernetes Secrets (existing cvntrade-pg-credentials secret extended with new role users cvntrade_kpi_writer, cvntrade_drift_writer). No connection string in code, env, or git. | infra/helm/cvntrade-airflow/values-prod.yaml + new K8s Secret resource | EA-S01 |
| Least-privilege DB role | Each helper has its own DB role with the minimum GRANT : cvntrade_kpi_writer has INSERT on kpi.events only ; cvntrade_drift_writer has INSERT on drift.events only ; SELECT on both reserved to cvntrade_grafana_reader (read-only). No DELETE / UPDATE / DDL. | New SQL migration in EA-S01 (infra/migrations/0XX_kpi_drift_roles.sql) | EA-S01 (KPI roles) + EB-S01 (Drift roles) |
| Network policy | New PG users accessed only from cvntrade-runtime namespace (Airflow DAG pods + service pods) ; no direct Console UI write path (Console reads via Grafana, not direct PG). NetworkPolicy enforcement via existing Kapsule cluster policies. | infra/helm/cvntrade-runtime/templates/networkpolicy.yaml extended | EA-S01 |
| Helper API surface | emit_kpi() + push_drift_*() import the connection string lazily and verify its presence at first call ; missing → RuntimeError per ADR-25 (no silent fallback to noop). Connection pooling via psycopg2.pool.ThreadedConnectionPool with bounded size (avoid leaking connections in long-running training jobs). | src/commun/observability/kpi.py + src/commun/observability/drift.py | EA-S01 + EB-S01 |
| Audit / who-wrote-what | Every INSERT carries a tags['caller_module'] set automatically from inspect.stack() of the helper call site. Allows tracing back which Python module emitted a given KPI / drift event without trusting caller-provided metadata. | Helper internals | EA-S01 |
| PII safety | KPI Store + Drift Store schemas contain NO PII fields by design (run_id, module, numeric metrics only). The tag whitelist (per ADR-73) explicitly forbids tag keys matching email, wallet_address, api_key, *_secret patterns. | Helper validation + ADR-73 invariant | EA-S01 |

The emission helpers MUST NOT log the connection string, the PG password, or any tag value matching the PII deny-list. Such values are redacted from log output, and each redaction is itself reported with a WARNING log so that it is never silent (per ADR-25).
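The tag whitelist and the PII deny-list can be enforced in a single validation step at the helper boundary. The sketch below is illustrative only : the whitelist contents, the function name `validate_tags` and the use of glob matching for the deny patterns are assumptions, not the content of kpi_tag_whitelist.py.

```python
import fnmatch

# Hypothetical whitelist contents, for illustration only.
ALLOWED_TAG_KEYS = {"crypto", "fold_id", "variant"}

# Deny patterns taken from the PII safety row ; matched as case-sensitive globs.
PII_KEY_PATTERNS = ("email", "wallet_address", "api_key", "*_secret")


def validate_tags(tags, allowed=ALLOWED_TAG_KEYS):
    """Return a copy of tags, or raise ValueError on the first bad key.

    The PII check runs first so a forbidden key is reported as such even
    if someone later adds it to the whitelist by mistake.
    """
    for key in tags:
        if any(fnmatch.fnmatchcase(key, pattern) for pattern in PII_KEY_PATTERNS):
            raise ValueError(f"tag key {key!r} matches the PII deny-list")
        if key not in allowed:
            raise ValueError(f"tag key {key!r} is not in the whitelist")
    return dict(tags)
```

Raising instead of dropping the offending key follows the fail-fast stance of ADR-25 : the caller learns about the rejected tag at emit time rather than by noticing a missing dimension in Grafana.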

5. Implementation phases (proposed)

If approved, the design splits into 5 Stories under 2 Epics (a third Epic, CVN-N010-EC, is reserved for the future hot-path drift Story, see §8) :

| Epic | Phase | Story | Scope |
|---|---|---|---|
| CVN-N010-EA (KPI Store) | Phase 1 | EA-S01 | HARD BLOCKER ENTRY : TimescaleDB extension support verification on Scaleway managed PG (with realistic-load benchmark : pg_stat_statements p99 < 50 ms on the kpi.events SELECT path). If unsupported, a documented plain-PG fallback plan (manual partitioning + retention cron + ops overhead estimate) MUST be approved by operator BEFORE the rest of EA-S01 proceeds (committee 4b508801 reco 1). After verification : kpi.events schema + Python emit_kpi() helper + 5 unit tests + JSONB-vs-S3 100 KB threshold validated end-to-end with a synthetic > 100 KB report (committee reco 9). |
| CVN-N010-EA (KPI Store) | Phase 2 | EA-S02 | Hamilton hooks (auto-emit per node : duration, success, n_in/n_out) + integration test with the existing label_pipeline (Track 5) |
| CVN-N010-EA (KPI Store) | Phase 3 | EA-S03 | Grafana KPI dashboard (PostgreSQL datasource) + 3 starter panels (run-success-rate, per-module duration, business_score by crypto) |
| CVN-N010-EB (Drift) | Phase 4 | EB-S01 | drift.events schema + push_drift_event() + push_drift_report() helpers + Evidently dependency pin + Airflow DAG dag_drift__daily + Evidently validation phase : benchmark vs NannyML on CVNTrade crypto data characteristics (15 m candles, high volatility, binary/3-class), with documented decision (keep Evidently vs swap to NannyML vs develop crypto-specific custom analyzers) per committee 4b508801 reco 2. |
| CVN-N010-EB (Drift) | Phase 5 | EB-S02 | Custom drift analyzers (PromQL → drift event for latency / error rate ; SQL job → KPI drift) + 3 Grafana drift panels + P1 alert wiring + runbooks for warn and crit drift events in documentation/runbooks/drift_*.md (detection, escalation, initial response, integration with ADR-71 kill-switch for crit severity per committee reco 7) + info-severity S3 archive cron verified end-to-end (committee reco 8 / R3). |

Each Story gates on per-Story MLOps readiness template (ADR-70) + committee pr_review for substantive code. Sprint cadence per ADR-69.
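The Phase 2 auto-emit behaviour (duration, success, n_in / n_out per node) can be pictured with a plain decorator. This is a sketch, not the Hamilton integration : it does not use Hamilton's hook API, the name `auto_emit` is hypothetical, and `emit` stands in for `emit_kpi`. It also assumes the wrapped node takes and returns sized collections.

```python
import functools
import time


def auto_emit(module, dag_node, emit):
    """Wrap a node so each call reports duration, success and row counts.

    `emit` is any callable accepting keyword arguments (a stand-in for
    emit_kpi). It is called exactly once per node call, even on failure.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(rows):
            start = time.perf_counter()
            fields = {"module": module, "dag_node": dag_node, "n_input": len(rows)}
            try:
                out = fn(rows)
            except Exception:
                fields["success"] = False
                raise  # the failure still propagates ; it is only recorded
            else:
                fields["success"] = True
                fields["n_output"] = len(out)
                return out
            finally:
                fields["duration_ms"] = int((time.perf_counter() - start) * 1000)
                emit(**fields)
        return wrapper
    return decorator
```

Because the emission sits in the `finally` branch, a node that raises still produces a KPI row with `success=False`, which is the case the "silent failure" incidents in §1 lacked.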

6. Risks + mitigations

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| R1 | KPI cardinality explosion (run_id × module × tag combinations) eats TimescaleDB | M | M | Whitelist of allowed tag keys at emit_kpi() API ; reject unknown tags fail-fast (ADR-25) |
| R2 | Evidently latency in hot path (inference) | L | H | Batch-only in this design ; never inline in inference. Hot-path drift is a separate Story (post-EB-S02) |
| R3 | Drift Store growth (high-volume drift events drown signal) | L | M | 365d retention + severity-based alert filtering ; cron archives info-severity events to S3 monthly. Implemented in EB-S02 per committee 4b508801 reco 8 ; verified end-to-end before EB-S02 closure. |
| R4 | Operator alert fatigue from drift Grafana panels | M | M | Drift events have a severity enum ; only warn + crit page ; info only on dashboards |
| R5 | TimescaleDB extension unavailable on managed Scaleway PG | L | M | EA-S01 starts with a verification step ; fallback to plain PG with manual partition cron if needed |
| R6 | Evidently version drift breaks reports between runs | M | L | Pin Evidently exact version (same pattern as cleanlab pin per ADR-70 §5) |
| R7 | KPI emission becomes mandatory boilerplate that slows velocity | M | L | Hamilton hooks (Phase 2) auto-emit for every node ; explicit emit_kpi() reserved for business-specific metrics outside Hamilton |
| R8 | Drift Store schema rigidity blocks new drift types | L | L | drift_type enum extensible via schema migration ; payload JSONB accepts arbitrary Evidently output |
| R9 | Grafana datasource explosion (5+ datasources to manage) | L | L | All 5 (Tempo, Prom, Loki, KPI, Drift) are already supported by Grafana ; provisioned via existing infra/grafana/provisioning/ |
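The R4 mitigation amounts to a small routing rule on the severity enum. A minimal sketch follows ; the destination names "dashboard" and "pager" and the function name `route` are hypothetical, and the real alert wiring is an EB-S02 deliverable.

```python
SEVERITIES = ("info", "warn", "crit")
PAGING_SEVERITIES = {"warn", "crit"}


def route(event):
    """Return the destinations for one drift event.

    Every event is visible on dashboards ; only warn and crit also page.
    Unknown severities fail fast instead of being silently treated as info.
    """
    severity = event["severity"]
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity {severity!r}")
    destinations = ["dashboard"]
    if severity in PAGING_SEVERITIES:
        destinations.append("pager")
    return destinations
```

Keeping info events off the pager is what lets the Drift Store record low-grade signals without adding to operator alert load.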

7. Falsifiable hypotheses

  • H1 (3-layer split is valuable) : within 6 months of EB-S02 merge, ≥ 1 incident is detected first by Drift Store (not by Loki / Prom) → 3 layers are not redundant.
  • H2 (KPI Store earns its keep) : within 3 months, ≥ 1 ad-hoc operator query against kpi.events produces a non-trivial finding that wouldn't have been visible in Prometheus → KPI Store is queryable, not just a write-only black box.
  • H3 (Evidently fits) : drift report wall-time stays under 60 s for the largest typical reference vs current dataset (10k rows × 50 features) → Evidently is fast enough for batch use.
  • H4 (operator adoption) : within 90 days, the operator opens the KPI Grafana dashboard ≥ 5 times / week unprompted → it's actually useful, not write-only.

If any of H1-H4 is falsified at the relevant horizon, this design is amendment-eligible (revisit storage choice or layer decomposition).

8. ADRs that this design will produce (if accepted)

Per committee 4b508801 reco 5, ratification is approved on the strength of this design alone (no working prototype required) ; implementation validation happens during EA-S01 + subsequent Stories.

| ADR | Topic | Specific content per committee 4b508801 recos |
|---|---|---|
| ADR-72 | Three-layer observability model (technical / execution / drift) : codifies the layer separation, the question each answers, the no-overlap invariant | + Hot-path inference drift forward-looking section (reco 3) : real-time drift detection in live crypto trading is the v2 frontier ; high-level architectural considerations include (a) lightweight in-memory PSI on the inference path, (b) integration with ADR-71 kill-switch (drift signal can auto-engage the switch above threshold), (c) decouple from batch evaluation latency. Out of v1 scope ; tracked as a future Story under CVN-N010-EC. |
| ADR-73 | KPI Store on TimescaleDB extension (storage choice + emission contract emit_kpi() + tag whitelist + security model) | + Tag whitelist amendment process (reco 4) : the whitelist lives at src/commun/observability/kpi_tag_whitelist.py as a Python set ; adding a tag = open PR with the new key + 2-line justification + at least one usage in emit_kpi() ; CR + 1 maintainer approval = lightweight (no committee). Removing a tag = ADR amendment (heavier, since operators may have built dashboards on it). + Security model section (reco 6) : least-privilege role cvntrade_kpi_writer, K8s Secret-injected connection string, PII deny-list at helper level (see §4.7). |
| ADR-74 | Evidently as drift engine + Drift Store schema (PG schema, 10 drift types, severity enum, 365d retention) + security model | + Validation phase mandate (reco 2) : EB-S01 includes a benchmark vs NannyML on CVNTrade crypto data (15 m candles, binary/3-class) ; if Evidently underperforms, ADR-74 is amended with a swap-engine clause. + info-severity S3 archive cron (reco 8) : implemented in EB-S02 to prevent signal drowning in drift.events. + Security model section (reco 6) : least-privilege role cvntrade_drift_writer, S3 hybrid-storage audit trail (see §4.7). + Hybrid storage invariant (reco 9) : payload < 100 KB inline JSONB ; payload ≥ 100 KB stored in S3 with payload_url pointer in JSONB ; threshold validated end-to-end in EA-S01. |
| ADR-75 | Grafana as single entry for the 3 layers : extends ADR-26 (does NOT supersede ; just adds the 2 new datasources to the existing single-entry-point invariant) | unchanged |
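The hybrid storage invariant in ADR-74 (inline below 100 KB, S3 pointer otherwise) reduces to one size check before the INSERT. The sketch below rests on assumptions : "100 KB" is read as 100 × 1024 bytes of UTF-8 JSON, and `upload` is a hypothetical callable that stores the bytes and returns their URL. No real S3 client is involved.

```python
import json

INLINE_LIMIT_BYTES = 100 * 1024  # assumption : 100 KB means 102 400 bytes


def payload_for_insert(report, upload):
    """Return the value to store in drift.events.payload.

    Reports whose JSON encoding is under the limit are stored inline ;
    larger ones are handed to `upload` and replaced by a payload_url pointer.
    """
    raw = json.dumps(report).encode("utf-8")
    if len(raw) < INLINE_LIMIT_BYTES:
        return report
    return {"payload_url": upload(raw)}
```

The synthetic > 100 KB report required by EA-S01 would exercise the second branch ; passing a fake `upload` keeps that check runnable without cloud credentials.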

9. Open questions for committee

  1. 3-layer cognitive split : technical / execution / drift — right cognitive model, or should we collapse execution + drift into a single "business observability" layer ?
  2. TimescaleDB extension on the existing managed Scaleway PG — viable, or do we need a dedicated PG instance / a different timeseries DB (VictoriaMetrics, ClickHouse) ?
  3. Evidently as drift engine — fits our 15m candles + binary classification + crypto data ? Any known alternative we should benchmark first (NannyML, WhyLogs, custom) ?
  4. KPI Store cardinality control — strict tag whitelist (operator pain when adding a new tag) vs free-form (cardinality explosion risk) ? Reco : whitelist with documented amendment process.
  5. Grafana datasource topology — 5 datasources (Tempo, Prom, Loki, KPI, Drift) vs unified via PG-only (KPI + Drift on same datasource, different schemas) ? Reco : 5 datasources, but the 2 new ones share the same PG connection string.
  6. ADR-26 supersession — does ADR-75 supersede ADR-26, or extend it ? ADR-26 says "Grafana as single entry point" ; this design adds 2 new datasources but Grafana stays the entry point. Reco : extend, not supersede.
  7. Phase 1 scope — KPI helper alone (EA-S01) or KPI helper + Hamilton hooks (EA-S01 + EA-S02) bundled together ? Reco : separate, ship EA-S01 first to prove the schema + helper before wiring Hamilton everywhere.
  8. Hot-path inference drift — explicitly out of scope for v1 (batch-only Evidently). Should the design include a forward-looking "v2 hot-path drift" section, or defer entirely ?
  9. Evidently artifact storage — drift reports can be ~MB each ; store raw report in drift.events.payload JSONB, or in S3 with a pointer in payload ? Reco : payload < 100 KB inline ; > 100 KB to S3.
  10. Acceptance criteria for ADR-72 to be ratified — does committee require a working EA-S01 prototype before ratifying the ADR, or is the design alone sufficient for ratification ?

10. Acceptance criteria (Story level — for the Need / Epic when allocated)

This design is DONE when :

  • Committee plan_review PASSED (≥ ACCEPTED, ≥ 8.0 avg, no blockers)
  • Need CVN-N010 allocated via scripts/openproject_import_gh.py per ADR-69
  • Epic CVN-N010-EA (KPI Store) + CVN-N010-EB (Drift) created
  • At least the 5 Stories EA-S01 → EB-S02 created in OP, sequenced
  • ADR-72 (3-layer model) drafted and submitted (gate before EA-S01 starts)
  • documentation/architecture/workspace.dsl extended with the 3 new containers + 1 new view (post EA-S01 merge)

11. References

11.1 Existing ADRs the design builds on / extends

  • ADR-26 — Grafana as single entry point (extended by ADR-75, not superseded)
  • ADR-25 — No silent fallback (drift triggers must be loud — fail-fast on unknown KPI tags)
  • ADR-30 / 32 / 33 — Structured logs (technical layer, unchanged)
  • ADR-58 — FTF guardrails (drift on guardrail violations could feed the Drift Store)
  • ADR-59 — All pipeline params in PostgreSQL ftf_config (KPI store sits next to it on the same managed PG)
  • ADR-61 — Hamilton for batch flows (auto-emit hooks at node level — Phase 2)
  • ADR-62 — Unified OTel observability (technical layer, unchanged)
  • ADR-67 — Pluggable feature selection (drift on feature selector outputs — would catch #706 type incidents)
  • ADR-68 — Committee = default review channel (this design submitted as plan_review)
  • ADR-69 — OpenProject is the project orchestrator (this design will become a Need + Epic + Stories)
  • ADR-70 — MLOps readiness template (drift detection is one of the 6 mandatory sections — drift store enables it programmatically)
  • ADR-71 — Trading kill-switch invariants (drift events on kill-switch transitions feed Drift Store)

11.2 Existing artefacts referenced

  • documentation/architecture/workspace.dsl — Structurizr workspace (extension point shown in §4.5)
  • src/commun/observability/otel.py — existing OTel emit_event helper (Layer 1)
  • infra/grafana/dashboards/ — existing dashboards (datasources extended, panels added)
  • Issues : #608 (parent F1 mission), #707 (F1 boost epic that will benefit from drift detection)

11.3 External

11.4 Committee sessions

  • 4b508801 : plan_review round 1 (first submission of this design). PASSED, avg 8.46, strong consensus, 0 blockers, 2 minor dissents (ADR-72 ratification gating, hot-path drift forward-looking section), 9 recommendations, all forward-looking. All 9 recos were applied in this revision before re-submission for round 2 :
      • reco 1 : TimescaleDB hard blocker entry in EA-S01 (§5)
      • reco 2 : Evidently validation phase mandate in EB-S01 (§5)
      • reco 3 : hot-path drift forward-looking section in ADR-72 (§8)
      • reco 4 : KPI tag whitelist amendment process in ADR-73 (§8)
      • reco 5 : ADR-72 ratified on design alone (no prototype), accepted (§8 prelude)
      • reco 6 : security model section §4.7 (new) + ADR-73 / ADR-74 sections (§8)
      • reco 7 : runbooks for warn / crit drift events as EB-S02 deliverable (§5)
      • reco 8 : info-severity S3 archive cron made explicit in EB-S02 (§6 R3)
      • reco 9 : hybrid storage 100 KB threshold validated in EA-S01 (§5)
  • <TBD> : plan_review round 2 (this revision). Expected verdict ≥ ACCEPTED or ACCEPTED-WITH-CHANGES, no blockers (round 1 raised none ; all 9 recos are applied).