Design — Unified Observability : 3-layer model (technical + execution + drift) (`CVN-N010`)

need_id: CVN-N010 (proposed — to allocate via importer after committee plan_review PASSED)
Authors: Dominique (operator) + Claude
Reviewers: Expert Committee (plan_review pending)
Status: draft — awaiting committee plan_review
Created: 2026-04-28
Last updated: 2026-04-28

Changelog

Date	Change
2026-04-28	Initial draft for committee submission

1. Problem restated

Current CVNTrade observability covers the technical layer (Loki for logs, Prometheus for metrics, Tempo for traces, all fed by OpenTelemetry per ADR-62) but misses two layers that are load-bearing for "is this pipeline doing the right thing ?" :

Execution / business KPI layer — per-run metrics (run_id, module_name, dag_node, success_rate, data_quality_score, business_score, predictions_captured_rate, …) that don't fit Prometheus's data model (high cardinality from run_id, structured metadata, business semantics).
Drift layer — distribution-shift detection across data, features, predictions, business KPIs, and even technical metrics (latency_drift, error_rate_drift). Today this is implicit (humans spotting weirdness in Grafana panels) ; we have no engine, no store, no automated alerting.

Concrete failure modes the missing layers would have caught earlier in the last quarter :

Incident	Layer that would have caught it
Variance feature-selection structurally dead post-StandardScaler (#706, ~2 weeks unnoticed)	Drift on `feature_selection_variance_dispersion` would have flagged immediately
Predictions hook silent skip 100 % on 1 fold (#700/#701)	Execution KPI `predictions_captured_rate` per (variant × crypto × fold) would have fired
Phase 2 rerun silent failure (4-hour cost)	Execution KPI on FTF run lifecycle (variants completed / total expected) would have alerted

The gap is structural, not a tooling oversight : technical observability and business-fitness observability serve different questions and need different tools.

Need doc: TBD — documentation/needs/CVN-N010-unified-observability.md (to be created post-committee approval).

2. Goals

Single cognitive model : every observable in the system belongs to exactly one of three layers — technical, execution, drift. No overlap, no ambiguity.
Each layer = its own store with appropriate cardinality, retention, and query language.
KPI Store for execution layer : timeseries-friendly, accepts high-cardinality metadata (run_id, module, dag_node), SQL-queryable for ad-hoc analysis.
Drift Store for drift layer : universal — accepts data drift, model drift, perf drift, latency drift, business KPI drift. Not ML-specific.
Evidently as the drift engine for data + model drift (the well-trodden path) ; PromQL + custom analyzers for technical / business drift (we already have the inputs).
Grafana as single entry point : extends ADR-26 — operator sees the 3 layers in one place.
Emission contract : every microservice / DAG / Hamilton node knows where to write each kind of telemetry. No router, no implicit fan-out.

3. Non-goals

Not replacing Prometheus / Loki / Tempo — they handle the technical layer correctly. This design adds two new layers ; it does not refactor the existing one.
Not pure ML observability — we want broader (any pipeline drift), not narrower. Evidently is the data/model engine, but the Drift Store accepts non-ML drift entries too.
Not real-time prediction monitoring — we operate offline / near-real-time. Hot-path inference monitoring is a separate concern (ADR-71 kill-switch + future Story).
Not a vendor lock-in — Evidently is open-source ; the Drift Store is plain PostgreSQL ; nothing forces us to keep Evidently if a custom analyzer outperforms it.

4. Architecture

4.1 System context (left → right user/business flow ; observability extracted to the right)

flowchart LR
    %% LEFT: business flow
    User[👤 User / Operator]
    UI[🖥️ UI / Console]
    API[🌐 API Gateway]
    Svc[⚙️ Business microservice]
    Airflow[🧭 Airflow orchestration]
    Hamilton[🧩 Hamilton dataflow]
    Data[💾 Data Sources / Sinks]

    %% RIGHT: observability layers (3)
    subgraph TECH["1. Technical observability (existing)"]
        OTel[🔎 OpenTelemetry]
        Tempo[🧵 Tempo<br/>traces]
        Prom[📈 Prometheus<br/>metrics]
        Loki[📜 Loki<br/>logs]
    end

    subgraph EXEC["2. Execution observability (NEW)"]
        KPI[📊 KPI Store<br/>TimescaleDB schema]
    end

    subgraph DRIFT["3. Drift observability (NEW)"]
        Evidently[📉 Evidently.ai<br/>drift engine]
        Custom[🔧 Custom analyzers<br/>PromQL + jobs]
        DriftStore[🧠 Drift Store<br/>PostgreSQL schema]
    end

    Graf[📊 Grafana<br/>single entry per ADR-26]

    %% main flow
    User --> UI --> API --> Svc --> Airflow --> Hamilton --> Data

    %% technical telemetry
    API --> OTel
    Svc --> OTel
    Airflow --> OTel
    Hamilton --> OTel
    OTel --> Tempo
    OTel --> Prom
    API --> Loki
    Svc --> Loki
    Airflow --> Loki
    Hamilton --> Loki

    %% execution KPIs
    API --> KPI
    Svc --> KPI
    Airflow --> KPI
    Hamilton --> KPI
    Data --> KPI

    %% drift inputs
    Data --> Evidently
    Hamilton --> Evidently
    KPI --> Evidently
    Prom --> Custom
    KPI --> Custom
    Evidently --> DriftStore
    Custom --> DriftStore

    %% visualization
    Tempo --> Graf
    Prom --> Graf
    Loki --> Graf
    KPI --> Graf
    DriftStore --> Graf

4.2 The 3 layers

Layer	Question it answers	Store	Engine	Cardinality	Retention	Examples
1 — Technical	"what's happening / how"	Tempo + Prometheus + Loki	OpenTelemetry	low–medium	30d (Loki), 30d (Prom), 7d (Tempo)	latency p99, CPU, span tree, error logs
2 — Execution	"is this run doing what we asked"	KPI Store (TimescaleDB extension on existing PG)	direct emit (`emit_kpi(...)`)	medium–high (run_id, module, fold_id)	90d hot + S3 cold	`run_id`, `success_rate`, `data_quality_score`, `business_score`, `predictions_captured_rate`, `n_trades_per_fold`
3 — Drift	"is the system degrading vs reference"	Drift Store (dedicated PG schema, separate from KPI)	Evidently (data/model) + custom analyzers (technical/business)	low	365d (drift events are rare and heavy)	`data_drift_psi`, `feature_drift`, `prediction_drift`, `latency_drift`, `error_rate_drift`, `business_kpi_drift`

Why 3 layers, not 1 or 2 :

One store would force trade-offs : Prometheus loses on cardinality, PG loses on PromQL ergonomics, ClickHouse adds a vendor.
Two layers (technical + business) would conflate execution and drift. Drift is a derived observable computed from the others ; mixing it with raw KPI emission creates write-amplification and circular-dependency surface.
Three layers map cleanly to three questions, three storage profiles, three emission patterns.

4.3 Emission contracts

Each layer has one emission helper. No service or Hamilton node bypasses them.

# Layer 1 — Technical (existing — ADR-62)
from commun.observability.otel import emit_event
emit_event("training_fold_complete", crypto="BTCUSDC", fold_id=0, duration_ms=12_345)

# Layer 2 — Execution KPI (new — this design)
from commun.observability.kpi import emit_kpi
emit_kpi(
    module="hamilton.feature_engineering",
    run_id=run_id,
    dag_node="compute_features",
    success_rate=0.97,
    data_quality_score=0.94,
    n_input=1000,
    n_output=985,
    duration_ms=2_340,
)

# Layer 3 — Drift (new — this design)
from commun.observability.drift import push_drift_report
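from evidently import Report
from evidently.presets import DataDriftPreset  # Evidently >= 0.7 module layout ; adjust to the pinned version (R6)

# Keyword arguments keep reference and current from being swapped :
# in the >= 0.7 API, run() takes current_data first.
report = Report([DataDriftPreset()]).run(current_data=current_data, reference_data=reference_data)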
push_drift_report(
    report,
    run_id=run_id,
    module="hamilton.feature_engineering",
    drift_type="data_drift",
)

# OR for non-ML drift (custom analyzer)
from commun.observability.drift import push_drift_event
push_drift_event(
    drift_type="latency_drift",
    module="api.predictions",
    metric="p99_latency_ms",
    reference_value=120,
    current_value=540,
    significance=0.001,
)
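
The `emit_kpi(...)` call above passes several metrics as keyword arguments, while `kpi.events` (§4.6) stores one `metric_name` / `metric_value` pair per row. The design does not spell out that mapping, so the following is only a sketch under stated assumptions: `build_kpi_rows` and `ALLOWED_TAG_KEYS` are hypothetical names, and the "one row per numeric keyword" rule is an assumption, not the ratified contract.

```python
from datetime import datetime, timezone

# Hypothetical whitelist; section 8 places the real one in kpi_tag_whitelist.py.
ALLOWED_TAG_KEYS = {"crypto", "fold_id", "variant"}


def build_kpi_rows(module, run_id, dag_node=None, tags=None, duration_ms=None, **metrics):
    """Turn one emit_kpi()-style call into one kpi.events row per numeric metric."""
    tags = dict(tags or {})
    unknown = set(tags) - ALLOWED_TAG_KEYS
    if unknown:
        # Fail fast on unknown tag keys instead of silently dropping them (R1, ADR-25).
        raise ValueError(f"unknown KPI tag keys: {sorted(unknown)}")

    now = datetime.now(timezone.utc)
    rows = []
    for name, value in metrics.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"metric {name!r} must be numeric, got {type(value).__name__}")
        rows.append({
            "time": now,
            "run_id": run_id,
            "module": module,
            "dag_node": dag_node,
            "metric_name": name,
            "metric_value": float(value),
            "tags": tags,
            "duration_ms": duration_ms,
        })
    return rows
```

Under this assumption, a call with `success_rate=0.97, n_input=1000` yields two rows sharing the same `run_id`, `module`, and timestamp, which keeps per-run queries a simple filter on `run_id`.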

4.4 Storage choices

KPI Store

Option	Pros	Cons	Reco
Prometheus	already deployed, PromQL familiar	not designed for high-cardinality metadata (run_id explodes labels)	NO — cardinality explosion
TimescaleDB extension on existing managed PG	already deployed PG (Scaleway), SQL ergonomics, hypertables compress old data	adds an extension to managed PG (verify Scaleway supports it)	YES (default)
Dedicated ClickHouse	best analytics on huge event volumes	new infra dep, ops overhead	NO unless volume > 100k events/day
Plain PG without TimescaleDB	zero new dep	manual partitioning + retention	fallback if TimescaleDB unavailable

Drift Store

Option	Pros	Cons	Reco
Same TimescaleDB as KPI	one less store	mixes concerns ; drift is sparse + heavy, KPI is dense + light	NO
Dedicated PG schema (no TimescaleDB extension needed — drift events are sparse)	clean separation, no infra cost	one more schema	YES (default)
Evidently Cloud (managed UI)	turnkey	external SaaS, vendor lock-in, $$$	NO
WhyLabs	also managed	same lock-in concerns	NO

4.5 Container view (Structurizr DSL extension to `architecture/workspace.dsl`)

Add the following to documentation/architecture/workspace.dsl, inside the existing cvntrade softwareSystem block and after its current containers (proposed extension only ; merged after committee plan_review and ADR ratification).

// CVN-N010 extension : 3-layer observability containers.
// Inserts INSIDE the `cvntrade` softwareSystem block, after the existing
// containers (frontend, configConsole, ..., docs, grafana).
//
// New containers (this design proposes) :
kpiStore = container "KPI Store" "Execution + business KPIs per run/module/dag_node. TimescaleDB extension on the existing managed PG (CVN-N010-EA)." "PostgreSQL + TimescaleDB" "database"
driftStore = container "Drift Store" "Drift events from Evidently + custom analyzers. Dedicated PG schema, sparse events, 365d retention (CVN-N010-EB)." "PostgreSQL" "database"
evidently = container "Evidently" "Drift engine for data + model drift. Runs in Airflow batch jobs and as a sidecar in select services. Outputs drift reports → Drift Store (CVN-N010-EB)." "Python, Evidently OSS"

// Relationships : services / DAGs / Hamilton emit KPIs to the KPI Store
api -> kpiStore "emit_kpi() — execution metrics per request / batch"
runtime -> kpiStore "emit_kpi() — paper / live trade outcomes per session"
airflow -> kpiStore "emit_kpi() — DAG run completion + per-task success rate"
configApi -> kpiStore "emit_kpi() — config-change audit trail"

// Relationships : drift sources feed Evidently + the Drift Store
postgres -> evidently "Reference + current datasets via SQL"
evidently -> driftStore "Drift reports (data, feature, prediction, target)"
kpiStore -> driftStore "Custom analyzers detect KPI drift via SQL jobs"

// Relationships : Grafana reads the 3 layers
grafana -> kpiStore "PostgreSQL datasource — KPI panels"
grafana -> driftStore "PostgreSQL datasource — drift panels + alerts"

The corresponding Containers view (just a views block addition) :

// In the `views` block of workspace.dsl
container cvntrade "ObservabilityLayers" {
    include kpiStore
    include driftStore
    include evidently
    include grafana
    include api
    include runtime
    include airflow
    include configApi
    // include hamilton   (uncomment once a Hamilton container exists ; an undefined identifier fails DSL parsing)
    autoLayout lr
    title "CVN-N010 — 3-layer observability"
    description "Execution KPI Store + Drift Store + Evidently extending the existing technical observability."
}

4.6 Sample drift schema (Drift Store)

Per the proposed dedicated PG schema drift :

CREATE SCHEMA IF NOT EXISTS drift;

CREATE TABLE IF NOT EXISTS drift.events (
    id              BIGSERIAL PRIMARY KEY,
    run_id          TEXT,                                            -- NULL when scheduled batch (no run scope)
    module          TEXT NOT NULL,                                   -- e.g. 'hamilton.feature_engineering', 'api.predictions'
    drift_type      TEXT NOT NULL CHECK (drift_type IN (
                        'data_drift', 'feature_drift', 'prediction_drift',
                        'target_drift', 'data_quality',
                        'latency_drift', 'error_rate_drift', 'volume_drift',
                        'business_kpi_drift', 'model_confidence_drift'
                    )),
    metric          TEXT NOT NULL,                                   -- e.g. 'psi_X1', 'p99_latency_ms'
    reference_value DOUBLE PRECISION,
    current_value   DOUBLE PRECISION,
    significance    DOUBLE PRECISION,                                -- p-value or PSI score
    severity        TEXT NOT NULL CHECK (severity IN ('info', 'warn', 'crit')),
    payload         JSONB,                                           -- full Evidently report or analyzer output
    detected_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- IF NOT EXISTS keeps the migration re-runnable, matching the CREATE TABLE above.
CREATE INDEX IF NOT EXISTS drift_events_module_detected_idx ON drift.events (module, detected_at DESC);
CREATE INDEX IF NOT EXISTS drift_events_run_id_idx ON drift.events (run_id) WHERE run_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS drift_events_severity_idx ON drift.events (severity, detected_at DESC) WHERE severity IN ('warn', 'crit');
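
The `significance` column above may hold a PSI score. As an illustration of what a custom analyzer could compute before inserting a row, here is a minimal Population Stability Index sketch in plain Python. It is not Evidently's implementation: the equal-width binning, the bin count, and the `eps` floor are assumptions chosen for brevity.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant data

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give exactly 0 ; the score grows as mass moves between bins. Which PSI value maps to `warn` or `crit` is left to the analyzer configuration and is not fixed by this design.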

KPI Store schema (TimescaleDB hypertable) :

CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE SCHEMA IF NOT EXISTS kpi;

CREATE TABLE IF NOT EXISTS kpi.events (
    time            TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    run_id          TEXT,
    module          TEXT NOT NULL,
    dag_node        TEXT,
    metric_name     TEXT NOT NULL,
    metric_value    DOUBLE PRECISION,
    tags            JSONB,                                           -- whitelisted keys (see §6 risk R1)
    duration_ms     INTEGER,
    success         BOOLEAN
);
SELECT create_hypertable('kpi.events', 'time', if_not_exists => TRUE);

CREATE INDEX IF NOT EXISTS kpi_events_run_module_time_idx ON kpi.events (run_id, module, time DESC);
CREATE INDEX IF NOT EXISTS kpi_events_metric_time_idx ON kpi.events (metric_name, time DESC);

-- 90-day hot retention. add_retention_policy drops expired chunks, so the S3 cold-storage
-- export (separate Story) must run before chunks reach 90 days.
SELECT add_retention_policy('kpi.events', INTERVAL '90 days', if_not_exists => TRUE);

4.7 Security model (per committee `4b508801` reco 6)

Both new stores (kpi.events, drift.events) sit on the existing managed Scaleway PostgreSQL ; their security profile inherits the existing platform stance (network policies, TLS in transit, encrypted-at-rest), with three additions specific to the new emission paths :

Concern	Mechanism	Where defined	Story
Secrets management	Connection strings injected into pods via Kubernetes Secrets (existing `cvntrade-pg-credentials` secret extended with new role users `cvntrade_kpi_writer`, `cvntrade_drift_writer`). No connection string in code, env, or git.	`infra/helm/cvntrade-airflow/values-prod.yaml` + new K8s `Secret` resource	EA-S01
Least-privilege DB role	Each helper has its own DB role with the minimum GRANT : `cvntrade_kpi_writer` has INSERT on `kpi.events` only ; `cvntrade_drift_writer` has INSERT on `drift.events` only ; SELECT on both reserved to `cvntrade_grafana_reader` (read-only). No DELETE / UPDATE / DDL.	New SQL migration in EA-S01 (`infra/migrations/0XX_kpi_drift_roles.sql`)	EA-S01 (KPI roles) + EB-S01 (Drift roles)
Network policy	New PG users accessed only from `cvntrade-runtime` namespace (Airflow DAG pods + service pods) ; no direct Console UI write path (Console reads via Grafana, not direct PG). NetworkPolicy enforcement via existing Kapsule cluster policies.	`infra/helm/cvntrade-runtime/templates/networkpolicy.yaml` extended	EA-S01
Helper API surface	`emit_kpi()` + `push_drift_*()` import the connection string lazily and verify its presence at first call ; missing → `RuntimeError` per ADR-25 (no silent fallback to noop). Connection pooling via `psycopg2.pool.ThreadedConnectionPool` with bounded size (avoid leaking connections in long-running training jobs).	`src/commun/observability/kpi.py` + `src/commun/observability/drift.py`	EA-S01 + EB-S01
Audit / who-wrote-what	Every INSERT carries a `tags['caller_module']` set automatically from `inspect.stack()` of the helper call site. Allows tracing back which Python module emitted a given KPI / drift event without trusting caller-provided metadata.	Helper internals	EA-S01
PII safety	KPI Store + Drift Store schemas contain NO PII fields by design (`run_id`, `module`, numeric metrics only). The tag whitelist (per ADR-73) explicitly forbids tag keys matching `email`, `wallet_address`, `api_key`, `*_secret` patterns.	Helper validation + ADR-73 invariant	EA-S01

The emission helpers MUST NOT log the connection string, the PG password, or the value of any tag whose key matches the PII deny-list ; such values are redacted in log output and a WARNING is emitted, so the redaction itself is never silent (ADR-25).
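
The PII deny-list and log redaction described above could look roughly like the sketch below. The patterns are the ones named in the table ; treating them as case-insensitive glob patterns matched against the whole tag key is an assumption, and `is_pii_key` / `redact_tags_for_log` are hypothetical helper names.

```python
import fnmatch

# Patterns from the security table; glob semantics are an assumption.
PII_KEY_PATTERNS = ("email", "wallet_address", "api_key", "*_secret")


def is_pii_key(key):
    """True when a tag key matches any deny-list pattern (case-insensitive)."""
    lowered = key.lower()
    return any(fnmatch.fnmatchcase(lowered, pattern) for pattern in PII_KEY_PATTERNS)


def redact_tags_for_log(tags):
    """Copy of `tags` that is safe to log: deny-listed values are masked."""
    return {k: ("<redacted>" if is_pii_key(k) else v) for k, v in tags.items()}
```

A helper would typically call `is_pii_key` during tag validation and `redact_tags_for_log` only when building log messages, so the original values never reach a log line.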

5. Implementation phases (proposed)

If approved, the design splits into 5 Stories under 2 Epics :

Epic	Phase	Story	Scope
`CVN-N010-EA` (KPI Store)	Phase 1	EA-S01	HARD BLOCKER ENTRY : TimescaleDB extension support verification on Scaleway managed PG (with realistic-load benchmark — `pg_stat_statements` p99 < 50 ms on the `kpi.events` SELECT path). If unsupported, a documented plain-PG fallback plan (manual partitioning + retention cron + ops overhead estimate) MUST be approved by operator BEFORE the rest of EA-S01 proceeds (committee `4b508801` reco 1). After verification : `kpi.events` schema + Python `emit_kpi()` helper + 5 unit tests + JSONB-vs-S3 100 KB threshold validated end-to-end with a synthetic > 100 KB report (committee reco 9).
	Phase 2	EA-S02	Hamilton hooks (auto-emit per node : duration, success, n_in/n_out) + integration test with the existing label_pipeline (Track 5)
	Phase 3	EA-S03	Grafana KPI dashboard (PostgreSQL datasource) + 3 starter panels (run-success-rate, per-module duration, business_score by crypto)
`CVN-N010-EB` (Drift)	Phase 4	EB-S01	`drift.events` schema + `push_drift_event()` + `push_drift_report()` helpers + Evidently dependency pin + Airflow DAG `dag_drift__daily` + Evidently validation phase : benchmark vs NannyML on CVNTrade crypto data characteristics (15 m candles, high volatility, binary/3-class), with documented decision (`keep Evidently` vs `swap to NannyML` vs `develop crypto-specific custom analyzers`) per committee `4b508801` reco 2.
	Phase 5	EB-S02	Custom drift analyzers (PromQL → drift event for latency / error rate ; SQL job → KPI drift) + 3 Grafana drift panels + P1 alert wiring + runbooks for `warn` and `crit` drift events in `documentation/runbooks/drift_.md` (detection, escalation, initial response, integration with ADR-71 kill-switch for `crit` severity per committee reco 7) + info-severity S3 archive cron* verified end-to-end (committee reco 8 / R3).

Each Story gates on per-Story MLOps readiness template (ADR-70) + committee pr_review for substantive code. Sprint cadence per ADR-69.
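
To make the EA-S02 auto-emit idea concrete (duration, success, n_in / n_out per node), here is a minimal decorator sketch. It deliberately does not use Hamilton's own hook mechanism, which is where the real integration would live ; `auto_kpi` and the `emit` callback signature are hypothetical, and the wrapped function is assumed to take and return sized collections.

```python
import functools
import time


def auto_kpi(module, emit):
    """Wrap a node function so every call reports duration, success and row counts."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(rows, *args, **kwargs):
            start = time.perf_counter()
            success, n_out = False, 0
            try:
                result = fn(rows, *args, **kwargs)
                success, n_out = True, len(result)
                return result
            finally:
                # Runs on success and on failure, so failed nodes are never silent.
                emit(
                    module=module,
                    dag_node=fn.__name__,
                    success=success,
                    n_input=len(rows),
                    n_output=n_out,
                    duration_ms=int((time.perf_counter() - start) * 1000),
                )
        return wrapper
    return decorate
```

Because the emission sits in a `finally` block, an exception inside the node still produces a record with `success=False` before the exception propagates, which is the property the silent-skip incidents in §1 lacked.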

6. Risks + mitigations

#	Risk	Likelihood	Impact	Mitigation
R1	KPI cardinality explosion (run_id × module × tag combinations) eats TimescaleDB	M	M	Whitelist of allowed tag keys at `emit_kpi()` API ; reject unknown tags fail-fast (ADR-25)
R2	Evidently latency in hot path (inference)	L	H	Batch-only in this design ; never inline in inference. Hot-path drift is a separate Story (post-EB-S02)
R3	Drift Store growth (high-volume drift events drown signal)	L	M	365d retention + severity-based alert filtering ; cron archives `info`-severity events to S3 monthly. Implemented in EB-S02 per committee `4b508801` reco 8 ; verified end-to-end before EB-S02 closure.
R4	Operator alert fatigue from drift Grafana panels	M	M	Drift events have a `severity` enum ; only `warn` + `crit` page ; `info` only on dashboards
R5	TimescaleDB extension unavailable on managed Scaleway PG	L	M	EA-S01 starts with a verification step ; fallback to plain PG with manual partition cron if needed
R6	Evidently version drift breaks reports between runs	M	L	Pin Evidently exact version (same pattern as cleanlab pin per ADR-70 §5)
R7	KPI emission becomes mandatory boilerplate that slows velocity	M	L	Hamilton hooks (Phase 2) auto-emit for every node ; explicit `emit_kpi()` reserved for business-specific metrics outside Hamilton
R8	Drift Store schema rigidity blocks new drift types	L	L	`drift_type` enum extensible via schema migration ; `payload JSONB` accepts arbitrary Evidently output
R9	Grafana datasource explosion (5+ datasources to manage)	L	L	All 5 (Tempo, Prom, Loki, KPI, Drift) are already supported by Grafana ; provisioned via existing `infra/grafana/provisioning/`
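
The R4 mitigation (only `warn` + `crit` page, `info` stays on dashboards) can be sketched for a ratio-based technical drift such as `latency_drift`. The thresholds below are placeholders for illustration, not values proposed by this design, and both function names are hypothetical.

```python
def classify_severity(reference_value, current_value, warn_ratio=1.5, crit_ratio=3.0):
    """Map a current/reference ratio to the severity enum of drift.events."""
    if reference_value <= 0:
        raise ValueError("reference_value must be positive for a ratio-based check")
    ratio = current_value / reference_value
    if ratio >= crit_ratio:
        return "crit"
    if ratio >= warn_ratio:
        return "warn"
    return "info"


def should_page(severity):
    """Only warn and crit events page the operator; info stays on dashboards."""
    return severity in {"warn", "crit"}
```

With the §4.3 example values (reference 120 ms, current 540 ms) the ratio is 4.5, which these placeholder thresholds classify as `crit`.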

7. Falsifiable hypotheses

H1 (3-layer split is valuable) : within 6 months of EB-S02 merge, ≥ 1 incident is detected first by Drift Store (not by Loki / Prom) → 3 layers are not redundant.
H2 (KPI Store earns its keep) : within 3 months, ≥ 1 ad-hoc operator query against kpi.events produces a non-trivial finding that wouldn't have been visible in Prometheus → KPI Store is queryable, not just a write-only black box.
H3 (Evidently fits) : drift report wall-time stays under 60 s for the largest typical reference vs current dataset (10k × 50 features) → Evidently is fast enough for batch use.
H4 (operator adoption) : within 90 days, the operator opens the KPI Grafana dashboard ≥ 5 times / week unprompted → it's actually useful, not write-only.

If any of H1-H4 is falsified at the relevant horizon, this design is amendment-eligible (revisit storage choice or layer decomposition).

8. ADRs that this design will produce (if accepted)

Per committee 4b508801 reco 5, ratification is approved on the strength of this design alone (no working prototype required) ; implementation validation happens during EA-S01 + subsequent Stories.

ADR	Topic	Specific content per committee `4b508801` recos
ADR-72	Three-layer observability model (technical / execution / drift) — codifies the layer separation, the question each answers, the no-overlap invariant	+ Hot-path inference drift forward-looking section (reco 3) : real-time drift detection in live crypto trading is the v2 frontier ; high-level architectural considerations include (a) lightweight in-memory PSI on the inference path, (b) integration with ADR-71 kill-switch (drift signal can auto-engage the switch above threshold), (c) decouple from batch evaluation latency. Out of v1 scope ; tracked as a future Story under CVN-N010-EC.
ADR-73	KPI Store on TimescaleDB extension (storage choice + emission contract `emit_kpi()` + tag whitelist + security model)	+ Tag whitelist amendment process (reco 4) : the whitelist lives at `src/commun/observability/kpi_tag_whitelist.py` as a Python set ; adding a tag = open PR with the new key + 2-line justification + at least one usage in `emit_kpi()` ; CR + 1 maintainer approval = lightweight (no committee). Removing a tag = ADR amendment (heavier — operators may have built dashboards on it). + Security model section (reco 6) : least-privilege role `cvntrade_kpi_writer`, K8s Secret-injected connection string, PII deny-list at helper level (see §4.7).
ADR-74	Evidently as drift engine + Drift Store schema (PG schema, 10 drift types, severity enum, 365d retention) + security model	+ Validation phase mandate (reco 2) : EB-S01 includes a benchmark vs NannyML on CVNTrade crypto data (15 m candles, binary/3-class) ; if Evidently underperforms, ADR-74 is amended with a swap-engine clause. + info-severity S3 archive cron (reco 8) : implemented in EB-S02 to prevent signal drowning in `drift.events`. + Security model section (reco 6) : least-privilege role `cvntrade_drift_writer`, S3 hybrid-storage audit trail (see §4.7). + Hybrid storage invariant (reco 9) : payload < 100 KB inline JSONB ; payload ≥ 100 KB stored in S3 with `payload_url` pointer in JSONB ; threshold validated end-to-end in EA-S01.
ADR-75	Grafana as single entry for the 3 layers — extends ADR-26 (does NOT supersede ; just adds the 2 new datasources to the existing single-entry-point invariant)	unchanged
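
The hybrid storage invariant listed under ADR-74 (inline JSONB below 100 KB, S3 pointer otherwise) reduces to a size check on the serialized report. A minimal sketch follows, assuming 100 KB means 100 × 1024 bytes of UTF-8 JSON ; `build_payload` and the `upload` callback are hypothetical, and only the `payload_url` key comes from this design.

```python
import json

INLINE_LIMIT_BYTES = 100 * 1024  # assumption: "100 KB" read as 102,400 bytes


def build_payload(report, upload):
    """Return the value to store in drift.events.payload.

    `report` is a JSON-serializable dict; `upload` receives the serialized
    bytes and returns the object URL (for example an S3 URI).
    """
    raw = json.dumps(report, separators=(",", ":")).encode("utf-8")
    if len(raw) < INLINE_LIMIT_BYTES:
        return report  # small enough to live inline in JSONB
    return {"payload_url": upload(raw), "size_bytes": len(raw)}
```

Keeping the decision in one function gives the end-to-end validation a single place to exercise with a synthetic report above the threshold.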

9. Open questions for committee

3-layer cognitive split : technical / execution / drift — right cognitive model, or should we collapse execution + drift into a single "business observability" layer ?
TimescaleDB extension on the existing managed Scaleway PG — viable, or do we need a dedicated PG instance / a different timeseries DB (VictoriaMetrics, ClickHouse) ?
Evidently as drift engine — fits our 15m candles + binary classification + crypto data ? Any known alternative we should benchmark first (NannyML, WhyLogs, custom) ?
KPI Store cardinality control — strict tag whitelist (operator pain when adding a new tag) vs free-form (cardinality explosion risk) ? Reco : whitelist with documented amendment process.
Grafana datasource topology — 5 datasources (Tempo, Prom, Loki, KPI, Drift) vs unified via PG-only (KPI + Drift on same datasource, different schemas) ? Reco : 5 datasources, but the 2 new ones share the same PG connection string.
ADR-26 supersession — does ADR-75 supersede ADR-26, or extend it ? ADR-26 says "Grafana as single entry point" ; this design adds 2 new datasources but Grafana stays the entry point. Reco : extend, not supersede.
Phase 1 scope — KPI helper alone (EA-S01) or KPI helper + Hamilton hooks (EA-S01 + EA-S02) bundled together ? Reco : separate, ship EA-S01 first to prove the schema + helper before wiring Hamilton everywhere.
Hot-path inference drift — explicitly out of scope for v1 (batch-only Evidently). Should the design include a forward-looking "v2 hot-path drift" section, or defer entirely ?
Evidently artifact storage — drift reports can be ~MB each ; store raw report in drift.events.payload JSONB, or in S3 with a pointer in payload ? Reco : payload < 100 KB inline ; ≥ 100 KB to S3.
Acceptance criteria for ADR-72 to be ratified — does committee require a working EA-S01 prototype before ratifying the ADR, or is the design alone sufficient for ratification ?

10. Acceptance criteria (Story level — for the Need / Epic when allocated)

This design is DONE when :

Committee plan_review PASSED (≥ ACCEPTED, ≥ 8.0 avg, no blockers)
Need CVN-N010 allocated via scripts/openproject_import_gh.py per ADR-69
Epic CVN-N010-EA (KPI Store) + CVN-N010-EB (Drift) created
At least the 5 Stories EA-S01 → EB-S02 created in OP, sequenced
ADR-72 (3-layer model) drafted and submitted (gate before EA-S01 starts)
documentation/architecture/workspace.dsl extended with the 3 new containers + 1 new view (post EA-S01 merge)

11. References

11.1 Existing ADRs the design builds on / extends

ADR-26 — Grafana as single entry point (extended by ADR-75, not superseded)
ADR-25 — No silent fallback (drift triggers must be loud — fail-fast on unknown KPI tags)
ADR-30 / 32 / 33 — Structured logs (technical layer, unchanged)
ADR-58 — FTF guardrails (drift on guardrail violations could feed the Drift Store)
ADR-59 — All pipeline params in PostgreSQL ftf_config (KPI store sits next to it on the same managed PG)
ADR-61 — Hamilton for batch flows (auto-emit hooks at node level — Phase 2)
ADR-62 — Unified OTel observability (technical layer, unchanged)
ADR-67 — Pluggable feature selection (drift on feature selector outputs — would catch #706 type incidents)
ADR-68 — Committee = default review channel (this design submitted as plan_review)
ADR-69 — OpenProject is the project orchestrator (this design will become a Need + Epic + Stories)
ADR-70 — MLOps readiness template (drift detection is one of the 6 mandatory sections — drift store enables it programmatically)
ADR-71 — Trading kill-switch invariants (drift events on kill-switch transitions feed Drift Store)

11.2 Existing artefacts referenced

documentation/architecture/workspace.dsl — Structurizr workspace (extension point shown in §4.5)
src/commun/observability/otel.py — existing OTel emit_event helper (Layer 1)
infra/grafana/dashboards/ — existing dashboards (datasources extended, panels added)
Issues : #608 (parent F1 mission), #707 (F1 boost epic that will benefit from drift detection)

11.3 External

Evidently AI docs : https://docs.evidentlyai.com/
TimescaleDB hypertables : https://docs.timescale.com/use-timescale/latest/hypertables/
López de Prado, Advances in Financial Machine Learning — drift detection patterns
Müller (2019), Northcutt et al. — same authors cited in Track 5 design (label noise + smoothing — drift on these signals would be Layer 3)

11.4 Committee sessions

4b508801 — plan_review round 1 (first submission of this design). PASSED OK : avg 8.46, strong consensus, 0 blockers, 2 minor dissents (ADR-72 ratification gating, hot-path drift forward-looking section), 9 recommendations, all forward-looking. All 9 recos were applied to this design before re-submission for round 2 :
reco 1 — TimescaleDB hard blocker entry in EA-S01 (§5)
reco 2 — Evidently validation phase mandate in EB-S01 (§5)
reco 3 — hot-path drift forward-looking section in ADR-72 (§8)
reco 4 — KPI tag whitelist amendment process in ADR-73 (§8)
reco 5 — ADR-72 ratify on design alone (no prototype) — accepted (§8 prelude)
reco 6 — security model section §4.7 (new) + ADR-73 / ADR-74 sections (§8)
reco 7 — runbooks for warn / crit drift events as EB-S02 deliverable (§5)
reco 8 — info-severity S3 archive cron in EB-S02 explicit (§6 R3)
reco 9 — hybrid storage 100 KB threshold validated in EA-S01 (§5)
<TBD> — plan_review round 2 (this revision). Expected verdict : ACCEPTED or ACCEPTED-WITH-CHANGES, no blockers (round 1 raised none, and all 9 recos are applied).
