0061 batch dataflow dag pattern hamilton for feature computation
ADR-61 — Batch Dataflow DAG Pattern (Hamilton for Feature Computation)¶
Status: Decided (2026-04-17)
Context: EnrichmentAPI and FeatureEngineeringAPI are ~800 lines of imperative DataFrame operations. Adding a new technical indicator requires editing a monolithic process() method. Feature dependencies are implicit (reading the code to find that atr_normalized depends on atr_14). There is no lineage, no caching of intermediate computations, no way to visualize the feature graph. Labelling (triple barrier) has the same imperative structure. These are batch computations that fit naturally as a declarative DAG — CandlePipeline (ADR-60) is the wrong pattern for them because it assumes sequential short-circuit per event.
Decision: Batch dataflow computations — enrichment, feature engineering (fit phase), labelling, any transformation of DataFrames into DataFrames where dependencies form a DAG — MUST use Hamilton (Stitch Fix's declarative dataflow library). Each feature/label/derived column is a pure Python function. Hamilton resolves dependencies from function signatures, caches intermediate results, and produces lineage diagrams.
Invariants:
- Scope — batch DAGs only: Use Hamilton for batch DataFrame transformations where dependencies form a directed acyclic graph. Do NOT use it for the hot path (ADR-60), for workflow orchestration (Airflow stays), or for model training itself (XGBoost HPO is a workflow, not a feature DAG).
- Function-per-node: Each node is a pure function. Inputs are declared via parameter names (Hamilton resolves them from other node outputs or external inputs). No hidden globals, no side effects.
- Typed signatures: All function parameters and return values MUST have type annotations (
pd.Series,pd.DataFrame,float). Hamilton uses these for validation. - FTF-toggleable: Feature groups use
@hamilton.function_modifiers.config.when(enabled="group_name")to enable A/B testing via env var (ADR-56). Disabled features are not computed. - Lineage emitted: Every batch run MUST emit the Hamilton execution graph as an artifact (MLflow run or filesystem).
dr.visualize_execution()is the canonical source of lineage. - No duplication with CandlePipeline: The same feature function is used in both pipelines — Hamilton computes it during batch (training/fit),
CandlePipelineconsumes the fitted artifact viaFeatureEngineeringStep(streaming/transform). Thefit↔transformcontract lives in the FE pipeline, not in separate code paths.
Scope by domain:
| Domain | Pattern | Rationale |
|---|---|---|
| ETL (OHLCV ingestion → PostgreSQL) | Airflow only | I/O bound, retry, distributed — Airflow strengths |
| Enrichment (batch, training) | Hamilton DAG | Declarative features with deps |
| Labelling (triple barrier) | Hamilton DAG | Each label depends on multiple columns |
| Feature engineering (fit) | Hamilton DAG | Imputer → variance → normalizer pipeline |
| Feature engineering (transform, per-candle) | CandlePipeline step (ADR-60) |
Hot path, sub-ms latency |
| Training (XGBoost HPO) | Airflow + orchestrator | Workflow, not DAG |
| Candle loop (BT/paper/live) | CandlePipeline (ADR-60) |
Hot path |
Alternatives rejected: - Keep imperative code: incurs growing maintenance cost, no lineage, no caching. - Airflow task per feature: too heavy — each feature as a task adds seconds of overhead and pollutes the DAG graph. - Custom DAG framework: reinventing Hamilton, no lineage tooling, no community. - Prefect/Dagster: heavier than Hamilton, oriented workflow not dataflow.
Files (to be created Phase 5 of #568): src/commun/pipeline/features/technical.py, src/commun/pipeline/features/labelling.py, src/commun/pipeline/features/pipeline_fit.py