0061 batch dataflow dag pattern hamilton for feature computation

ADR-61 — Batch Dataflow DAG Pattern (Hamilton for Feature Computation)¶

Status: Decided (2026-04-17)

Context: EnrichmentAPI and FeatureEngineeringAPI are ~800 lines of imperative DataFrame operations. Adding a new technical indicator requires editing a monolithic process() method. Feature dependencies are implicit (reading the code to find that atr_normalized depends on atr_14). There is no lineage, no caching of intermediate computations, no way to visualize the feature graph. Labelling (triple barrier) has the same imperative structure. These are batch computations that fit naturally as a declarative DAG — CandlePipeline (ADR-60) is the wrong pattern for them because it assumes sequential short-circuit per event.

Decision: Batch dataflow computations — enrichment, feature engineering (fit phase), labelling, any transformation of DataFrames into DataFrames where dependencies form a DAG — MUST use Hamilton (Stitch Fix's declarative dataflow library). Each feature/label/derived column is a pure Python function. Hamilton resolves dependencies from function signatures, caches intermediate results, and produces lineage diagrams.

Invariants:

Scope — batch DAGs only: Use Hamilton for batch DataFrame transformations where dependencies form a directed acyclic graph. Do NOT use it for the hot path (ADR-60), for workflow orchestration (Airflow stays), or for model training itself (XGBoost HPO is a workflow, not a feature DAG).
Function-per-node: Each node is a pure function. Inputs are declared via parameter names (Hamilton resolves them from other node outputs or external inputs). No hidden globals, no side effects.
Typed signatures: All function parameters and return values MUST have type annotations (pd.Series, pd.DataFrame, float). Hamilton uses these for validation.
FTF-toggleable: Feature groups use @hamilton.function_modifiers.config.when(enabled="group_name") to enable A/B testing via env var (ADR-56). Disabled features are not computed.
Lineage emitted: Every batch run MUST emit the Hamilton execution graph as an artifact (MLflow run or filesystem). dr.visualize_execution() is the canonical source of lineage.
No duplication with CandlePipeline: The same feature function is used in both pipelines — Hamilton computes it during batch (training/fit), CandlePipeline consumes the fitted artifact via FeatureEngineeringStep (streaming/transform). The fit ↔ transform contract lives in the FE pipeline, not in separate code paths.

Scope by domain:

Domain	Pattern	Rationale
ETL (OHLCV ingestion → PostgreSQL)	Airflow only	I/O bound, retry, distributed — Airflow strengths
Enrichment (batch, training)	Hamilton DAG	Declarative features with deps
Labelling (triple barrier)	Hamilton DAG	Each label depends on multiple columns
Feature engineering (fit)	Hamilton DAG	Imputer → variance → normalizer pipeline
Feature engineering (transform, per-candle)	`CandlePipeline` step (ADR-60)	Hot path, sub-ms latency
Training (XGBoost HPO)	Airflow + orchestrator	Workflow, not DAG
Candle loop (BT/paper/live)	`CandlePipeline` (ADR-60)	Hot path

Alternatives rejected: - Keep imperative code: incurs growing maintenance cost, no lineage, no caching. - Airflow task per feature: too heavy — each feature as a task adds seconds of overhead and pollutes the DAG graph. - Custom DAG framework: reinventing Hamilton, no lineage tooling, no community. - Prefect/Dagster: heavier than Hamilton, oriented workflow not dataflow.

Files (to be created Phase 5 of #568): src/commun/pipeline/features/technical.py, src/commun/pipeline/features/labelling.py, src/commun/pipeline/features/pipeline_fit.py