CVNTrade MLOps Operating Procedure
Version: 3.0
Last updated: 2026-03-28
Scope: operation of the CVNTrade MLOps platform
Operational status:
- Grafana (5 dashboards): operational
- Airflow (10 DAGs): operational
- ZenML (7 native pipelines): operational
- MLflow (model registry): operational
- Prometheus (K8s metrics): operational
- Automatic alerting (Slack/SMS): target (issue #397)
- Drift monitoring: target (not yet implemented)
- Automatic diagnostics in Grafana: target (structured logs still to be migrated)
0. CVNTrade global state
Every operations session starts by determining the global state.
| State | Definition | Response |
|---|---|---|
| GREEN | Healthy infra, healthy pipelines, at least 1 recent useful run | normal operation |
| YELLOW | Controlled drift: weak anomaly signals | heightened monitoring |
| ORANGE | Significant business or pipeline degradation | priority investigation |
| RED | Unavailability or inability to produce | immediate escalation |
Calculation rules
| Condition | State |
|---|---|
| Infra critical (pod down, PVC full, OOMKill loop) | RED |
| Pipeline success rate < 70% | ORANGE |
| Pass rate = 0 on 3 consecutive runs | ORANGE |
| Model stale > 14 days on core crypto (btc-core) | ORANGE |
| 0 useful candidates across the full universe for > 3 days | ORANGE |
| Success rate 70-90% or recurring negative Sortino | YELLOW |
| Model stale > 7 days on non-core crypto | YELLOW |
| Everything else | GREEN |
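The rules above can be read as a first-match-wins cascade from most to least severe. The sketch below is one possible encoding, not the platform's actual implementation: the `Snapshot` type and its field names are hypothetical, and the reading of the 70-90% band as inclusive on both ends is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of what the operator reads in Grafana."""
    infra_critical: bool              # pod down, PVC full, or OOMKill loop
    pipeline_success_rate: float      # rolling rate, 0.0 to 1.0
    zero_pass_rate_streak: int        # consecutive runs with pass rate = 0
    core_model_age_days: float        # stalest btc-core model
    noncore_model_age_days: float     # stalest non-core model
    days_without_candidate: float     # across the full universe
    sortino_negative_recurrent: bool


def global_state(s: Snapshot) -> str:
    """Return RED / ORANGE / YELLOW / GREEN, most severe rule first."""
    if s.infra_critical:
        return "RED"
    if (s.pipeline_success_rate < 0.70
            or s.zero_pass_rate_streak >= 3
            or s.core_model_age_days > 14
            or s.days_without_candidate > 3):
        return "ORANGE"
    if (s.pipeline_success_rate <= 0.90   # assumption: 70-90% band is inclusive
            or s.sortino_negative_recurrent
            or s.noncore_model_age_days > 7):
        return "YELLOW"
    return "GREEN"
```

Ordering the checks by severity means a run that matches both an ORANGE and a YELLOW rule is reported as ORANGE, which matches the "Everything else" fallback at the bottom of the table.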
1. Purpose
This document describes the operating procedure for:
- determining the global state of the system,
- monitoring the golden signals,
- launching the standard pipelines,
- diagnosing results,
- escalating according to severity.
Guiding principle: Grafana is the single entry point (ADR-26). The other tools are used for drill-down, depending on the nature of the problem.
2. CVNTrade golden signals
These signals are the core of supervision. Every operator view must derive from them.
Platform
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| infra_health | Saturation / OOM / PVC / availability | OOM=1, CPU>70% | OOM>2, PVC>85%, pod down |
| pipeline_success_rate | Rolling success rate (7 days) | 70-90% | < 70% |
| pipeline_latency | Rolling duration vs historical p95 | p95 + 30% | p95 + 100% |
ML (per crypto, ADR-27)
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| f1_buy_delta_vs_baseline | f1_buy - baseline_f1_buy | delta < 15% | delta <= 0 |
| screening_testing_gap | screening_f1_buy - last_fold_f1_buy | gap > 0.15 | gap > 0.30 |
| std_f1_buy | Inter-fold stability | 0.05-0.10 | > 0.10 |
Trading
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| sortino | Primary financial KPI | 0 to 0.5 | < 0 |
| sl_pct | Share of stop-loss exits | 35-50% | > 50% |
| timeout_pct | Share of timeout exits | 30-40% | > 40% |
Lifecycle
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| model_freshness | Age of the last useful model | > 7 days | > 14 days |
| challenger_count | Registered challengers not yet promoted | > 5 pending | > 10 pending |
Rule: these signals must be visible in Grafana without reading logs.
Observability debt: today, the ML and Trading signals are visible only in the Airflow logs (CORRELATION DATA section). Migrating them to Grafana is a priority target.
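Most of the golden signals share the same shape: a value, a warning bound and a critical bound. A minimal sketch of that classification follows; the `classify` helper is hypothetical, and treating the warning band as inclusive of its bounds is an assumption, since the tables do not state boundary behavior.

```python
def classify(value: float, warning: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Map a signal value to 'ok', 'warning' or 'critical'.

    For signals where low values are bad (e.g. sortino), pass
    higher_is_worse=False; the comparison is mirrored by negation.
    """
    if not higher_is_worse:
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Thresholds copied from the tables above.
print(classify(42.0, warning=35.0, critical=50.0))   # sl_pct
print(classify(0.12, warning=0.05, critical=0.10))   # std_f1_buy
print(classify(-0.2, warning=0.5, critical=0.0, higher_is_worse=False))  # sortino
```

Composite signals such as infra_health (several conditions joined by commas) do not fit this single-value shape and would need one call per underlying metric.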
3. Tool architecture
GRAFANA (grafana.cvntrade.eu) <- SINGLE ENTRY POINT
|-- global supervision (golden signals)
|-- business, pipeline, model and infra dashboards
'-- centralized alerting (target #397)
AIRFLOW (airflow.cvntrade.eu) <- EXECUTION
|-- DAG triggering
|-- run tracking
'-- detailed logs (debug only, not for decisions)
ZENML (zenml.cvntrade.eu) <- PIPELINE DRILL-DOWN
|-- lineage
|-- run history
'-- versioned pipeline artifacts
W&B (wandb.ai) <- ML DRILL-DOWN
|-- screening matrices
|-- HPO trials
'-- detailed comparison of ML runs
MLFLOW (mlflow.cvntrade.eu) <- MODEL DRILL-DOWN
|-- model registry
|-- per-run metrics
'-- model-related artifacts
PROMETHEUS (K8s internal) <- INFRA METRICS
'-- collects CPU/RAM/pod metrics, exposed in Grafana
CONSOLE-NEXT (console.cvntrade.eu) <- IDP host (CVN-N012-EA)
|-- /catalog — service inventory (CVN-N012-EA-S02, runbook §catalog-add-service)
|-- /dashboards — Grafana embed module (CVN-N012-EA-S03, planned)
'-- /tokens-preview — design token reference
4. Roles and responsibilities
| Role | Responsible for | Access |
|---|---|---|
| Operator | Morning check, standard launches, reading Grafana | Grafana (readonly), Airflow (trigger + logs) |
| MLOps maintainer | Pipeline incidents, registry, diagnostic metrics | Airflow (admin), MLflow (registry admin), ZenML |
| ML owner | Model interpretation, features, labels, HPO | W&B, MLflow (readonly), ZenML |
| Platform owner | K8s, PVC, nodes, limits, ingress, secrets | kubectl (restricted admin), Helm |
Escalation
| Situation | Decision maker | Executor | Approver |
|---|---|---|---|
| Launch a discovery | Operator | Operator | - |
| Investigate a pipeline failure | MLOps maintainer | MLOps maintainer | - |
| Change HPO / features / labels | ML owner | ML owner | MLOps maintainer |
| Challenger -> champion promotion | ML owner | MLOps maintainer | Operator (business validation) |
| Killswitch / rollback | Operator (emergency) | Platform owner | - |
| Change Helm / K8s resources | Platform owner | Platform owner | MLOps maintainer |
5. Grafana dashboards
| Dashboard | Main role | Usage |
|---|---|---|
| MLOps Overview | Executive view | Global state of models, screenings, pass rate |
| Testing & Backtest | Quality view | Testing results, WFRB, gates, rejections |
| Pipeline Health | Execution view | Run duration, success rate, HPO, anomalies |
| Model Registry | Model lifecycle view | Versions, freshness, active / stale models |
| Infra Monitoring | Platform view | Pods, CPU, RAM, PVC, OOMKill, saturation |
Recommended usage:
- start with MLOps Overview or Pipeline Health
- go to Infra Monitoring if a platform problem is suspected
- go to Testing & Backtest if the question is about the quality of the results
6. Standard procedures
6.1 Morning check
Target duration: 2 to 5 minutes
Step 1: Determine the global state
Open Grafana > Infra Monitoring, then Pipeline Health. Determine the state (GREEN / YELLOW / ORANGE / RED) using the rules in section 0. If RED or ORANGE → go straight to section 9 (incidents).
Step 2: Platform health
Grafana > Infra Monitoring. Check:
- critical pods are up,
- no recent OOMKill,
- CPU/RAM saturation below thresholds,
- PVC below the alert threshold.
Step 3: Pipeline health
Grafana > Pipeline Health. Check:
- overnight runs have finished,
- failure rate within the norm,
- no duration drift,
- no backlog.
Step 4: Business results
Grafana > MLOps Overview. Check:
- new qualified candidates,
- pass rate within the norm,
- freshness of the latest useful runs.
6.2 Launch a discovery
Tool: Airflow, DAG: launch__discovery
Group run:
Targeted run:
Extended run:
{
"group": "defi",
"horizons": "H1,H2,H3,H4,H5,H6",
"sl_range": "0.8,1.0,1.2,1.5",
"tp_range": "1.5,2.0,2.5,3.0",
"hpo_trials": 50,
"backtest_days": 60
}
Tracking: Airflow (DAG run status), Grafana (results after completion).
Note: the ML/Trading thresholds must be interpreted in the context of the run (universe, grid size, horizons).
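The extended-run configuration can also be turned into a command line. The sketch below only builds and prints the string: `airflow dags trigger ... --conf` is the same CLI form used for finetune__pte in section 6.5.1, but this document does not say whether discovery is normally triggered from the UI or the CLI, so treat the CLI route as an assumption. The helper name `discovery_trigger_command` is hypothetical.

```python
import json
import shlex

# Extended-run parameters, copied from the JSON block above.
EXTENDED_RUN = {
    "group": "defi",
    "horizons": "H1,H2,H3,H4,H5,H6",
    "sl_range": "0.8,1.0,1.2,1.5",
    "tp_range": "1.5,2.0,2.5,3.0",
    "hpo_trials": 50,
    "backtest_days": 60,
}


def discovery_trigger_command(conf: dict, dag_id: str = "launch__discovery") -> str:
    """Build a shell-safe 'airflow dags trigger' command for the given conf."""
    payload = json.dumps(conf, separators=(",", ":"))
    return f"airflow dags trigger {shlex.quote(dag_id)} --conf {shlex.quote(payload)}"


print(discovery_trigger_command(EXTENDED_RUN))
```

Serializing with `json.dumps` and quoting with `shlex.quote` avoids hand-written quoting mistakes in the `--conf` payload.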
6.3 Analyze the results
Level 1: Grafana > Testing & Backtest
Look for: best qualified strategies, rejection reasons, F1/Sortino distribution.
Level 2: Airflow logs (debug only)
Look for the structured blocks === CORRELATION DATA ===. Distinguish:
- ML quality: f1_buy vs baseline_f1_buy, precision_buy, recall_buy, buy_rate
- financial quality: Sortino, tp/sl/timeout %, gates
- consistency: screening_f1_buy vs last_fold_f1_buy (same metric, same period; ADR-28, ADR-29)
Level 3: W&B drill-down: screening matrix, HPO trials, run comparison.
6.4 Check a specific model
- Grafana > Model Registry: identify crypto / version
- MLflow: source run, metrics, artifacts, hyperparams
- ZenML: source pipeline, upstream artifacts, lineage
6.5 FTF: Fine-Tuning Framework Operations
6.5.1 Launching FTF runs
- Verify no stale runs:
  If stale runs exist → kill them first (6.5.4).
- Trigger from Airflow UI or CLI. Run-level params only (ADR-65): PTE, fold/trial counts and history months live in the Console (ftf_config.base_env), resolved at DAG parse time:
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags trigger finetune__pte \
--conf '{"factor":"calibration","crypto_group":"defi_top5","phase":"manual","power_mode":"standard"}'
To change the PTE / folds / trials / history: edit Console → Baseline
Config (CVN_DEFAULT_PTE, CVN_DEFAULT_N_FOLDS, CVN_DEFAULT_N_TRIALS,
CVN_DEFAULT_HISTORY_MONTHS) and re-trigger.
ADR-90 hyperparameter seeding (auto, issue #985). The canonical
training hyperparameters (CVN_HPO_<MODEL>_<TF>_<PARAM>, ~481 keys)
live in ftf_config.base_env and are auto-seeded on every deploy
by a Helm post-install,post-upgrade hook Job in the
cvntrade-runtime chart (templates/ftf-seed-hook-job.yaml). It is
insert-missing-only — operator Console edits are never clobbered
(no --force-overwrite). It is idempotent (already-present keys are
skipped). No operator action is required; the seed is NOT a manual
prerequisite.
- Fail-loud: if the hook Job fails, helm upgrade --wait fails →
the Deploy Runtime step fails. The deploy is blocked, never a
silently half-seeded Console.
- Recovery: inspect kubectl -n cvntrade logs job/cvntrade-runtime-ftf-seed
(look for event=seed_db_connect_failed / event=seed_apply_failed
+ the event=seed_summary line). Most failures are DB connectivity
/ cvntrade-env-secrets / cvntrade-env-config issues — fix the
secret/ConfigMap and re-run the deploy (the hook re-applies
idempotently). To preview without writing, from any pod with the
env: python scripts/seed_hyperparams_console.py --dry-run (runs
offline if the DB is unreachable — prints the full plan, never
crashes).
- Monitor the first 2 minutes and verify the sample count:
  - Expected: Train: ~1000-2000 samples (CUSUM enabled)
  - If Train: >10000 samples: STOP. The cache is stale or CUSUM is misconfigured. See 6.5.3.
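The sample-count check above can be scripted against captured pod logs. This is a sketch under assumptions: the exact log line format (`Train: <n> samples`) is inferred from the wording above, the helper name is hypothetical, and the "review" verdict for counts between the expected range and the stop threshold is not defined by this runbook.

```python
import re

# Assumed log format, e.g. "Train: 1500 samples"
TRAIN_LINE = re.compile(r"Train:\s*(\d+)\s*samples")


def check_train_samples(log_text: str,
                        expected: tuple = (1000, 2000),
                        stop_above: int = 10000) -> str:
    """Return 'ok', 'stop', 'review' or 'unknown' from the first Train line."""
    match = TRAIN_LINE.search(log_text)
    if match is None:
        return "unknown"      # no Train line seen yet
    n = int(match.group(1))
    if n > stop_above:
        return "stop"         # stale cache or CUSUM misconfigured, see 6.5.3
    if expected[0] <= n <= expected[1]:
        return "ok"
    return "review"           # assumption: in-between counts need a human look
```

For example, piping the output of `kubectl logs` for a running FTF pod into this function gives a quick verdict before the run has consumed much compute.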
6.5.2 After Helm deploy with FTF config changes
MANDATORY: pods keep the old config until killed.
- Verify new code deployed:
- Kill ALL running FTF pods (command in 6.5.4):
- Wait 30 seconds for termination.
- Trigger new runs (6.5.1).
6.5.3 Cache flush (MLflow feature store)
When: After CUSUM config change, after feature engineering change, or when sample counts are wrong.
What gets flushed: Feature selection (Level 4) and feature engineering caches. ETL, labels, and models are NOT affected.
Procedure:
- Kill all FTF pods first (command in 6.5.4):
- Flush feature selection + feature engineering cache:
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
deleted = 0
for run in runs:
    name = run.data.tags.get('mlflow.runName', '')
    if 'feature_selection' in name or 'feature_eng' in name.lower():
        client.delete_run(run.info.run_id)
        deleted += 1
print(f'Deleted {deleted} cache entries')
"
- Verify cache is empty:
kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
fs = [r for r in runs if 'feature_selection' in r.data.tags.get('mlflow.runName', '')]
print(f'Remaining feature_selection entries: {len(fs)} (should be 0)')
"
- Trigger new runs (6.5.1). Cache will regenerate automatically (~2 min extra on first run per crypto).
WARNING: Do NOT flush experiment 1 (ETL), 4 (Labels), 7 (HPO), or 8 (Models). These are independent of CUSUM config.
6.5.4 Killing stale FTF runs
# Kill all pods
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
# List runs still in running/queued state in Airflow (optional; runs whose pods were killed end up as 'failed' automatically)
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain
6.5.5 Monitoring FTF runs
# Pod status
kubectl get pods -n cvntrade --no-headers | grep finetune
# Progress per pod
for pod in $(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | awk '{print $1}'); do
count=$(kubectl logs -n cvntrade $pod --tail=5000 | grep "event=weighted_variant_evaluated" | wc -l)
crypto=$(kubectl logs -n cvntrade $pod | grep "Running factor=" | head -1 | grep -o "crypto=[A-Z]*")
echo "$pod: $crypto completed=$count"
done
# CPU/memory usage
kubectl top pods -n cvntrade --no-headers | grep finetune
Grafana: the Infrastructure Monitoring dashboard shows FTF pods, CPU/memory, throttling.
Grafana: the Fine-Tuning Results dashboard shows results as they arrive in PostgreSQL.
7. CVNTrade diagnostic taxonomy
Each testing/backtest run must produce one primary diagnostic.
Lightweight s40 audit: flag CVN_S40_SKIP_S22A1_CROSSREF (CVN-N001-EI-S07 Lever #2, default stored in PG ftf_config.base_env, Console UI only). For a validation-integrity audit (s40, X/y-only), setting the flag to on skips the run_s22a1 anchor (300-round re-proof, p50 ~4 min / p95 ~14 min), which the probes do not use. The probe verdict is unchanged, but the run no longer certifies reproduction of the S22A1 symptom: the output then carries s22a1_status=SKIPPED + s22a1_anchor_available=false. A skipped audit must never be read as a complete audit. Default: off for reproduction missions (S22→S28), on for lightweight s40 audits. The skip is refused for any diagnostic that is not xy_only (capture_requirements.py registry). Captured Parquet missing in skip mode ⇒ INCONCLUSIVE_TOOLING (never an automatic re-capture, ADR-25).
| Code | Meaning | Trigger signal |
|---|---|---|
| ML_USELESS | Model with no predictive value | f1_buy <= baseline_f1_buy (delta <= 0) |
| ML_MARGINAL | ML gain too small to survive execution | delta +0 to +5 pts vs baseline |
| ML_EXPLOITABLE | Exploitable signal, subject to a favorable envelope | delta +5 to +10 pts vs baseline |
| ML_SOLID | Solid signal | delta > +10 pts vs baseline |
| ML_UNSTABLE | Model unstable across periods | std_f1_buy > 0.10 |
| EXECUTION_MISMATCH | Unfavorable envelope: ML exploitable but negative PnL | f1_buy > baseline +5 pts and Sortino < 0 |
| SL_TOO_TIGHT | Suspected overly tight SL (dominant hypothesis, not proven) | sl_pct > 50% |
| TP_TOO_AMBITIOUS | Take-profit too ambitious | tp_pct < 20% |
| HORIZON_TOO_SHORT | Horizon too short | timeout_pct > 40% |
| SCREENING_OVERFIT | Screening/testing divergence | screening_f1_buy - last_fold_f1_buy > 0.30 |
| PIPELINE_DEGRADED | Degraded pipeline health | success_rate < 70% |
| INFRA_SATURATED | K8s / PVC / OOM saturation | OOMKill > 2, PVC > 85%, CPU > 90% |
| NO_CANDIDATES | No viable strategy | 0 candidates passed on a complete run |
Decision tree (30 seconds)
1. f1_buy vs baseline_f1_buy ?
|
|-- delta <= 0 pts --> ML_USELESS
|-- delta +0 to +5 pts --> ML_MARGINAL
|-- delta +5 to +10 pts --> ML_EXPLOITABLE, check execution:
|-- delta > +10 pts --> ML_SOLID, check execution:
|
2. Sortino ?
|
|-- Sortino > 0.5 --> strategy OK
|-- Sortino < 0 --> EXECUTION_MISMATCH, check:
|
3. Exit stats ?
|-- sl_pct > 50% --> SL_TOO_TIGHT
|-- tp_pct < 20% --> TP_TOO_AMBITIOUS
|-- timeout_pct > 40% --> HORIZON_TOO_SHORT
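The decision tree translates almost line for line into code. The sketch below is illustrative only: the function name is hypothetical, F1 values are assumed to be fractions in [0, 1] (so 1 pt = 0.01), exit shares are assumed to be percentages, and the upper bound of each delta band is treated as inclusive, which the tree does not specify.

```python
def diagnose(f1_buy: float, baseline_f1_buy: float, sortino: float,
             sl_pct: float, tp_pct: float, timeout_pct: float) -> list:
    """Walk the three levels of the tree and return the diagnostic codes."""
    delta_pts = (f1_buy - baseline_f1_buy) * 100  # assumption: F1 as a fraction

    # Level 1: ML value against the naive baseline.
    if delta_pts <= 0:
        return ["ML_USELESS"]
    if delta_pts <= 5:
        return ["ML_MARGINAL"]
    codes = ["ML_EXPLOITABLE" if delta_pts <= 10 else "ML_SOLID"]

    # Level 2: execution is only examined once the ML signal is exploitable.
    if sortino < 0:
        codes.append("EXECUTION_MISMATCH")
        # Level 3: exit statistics point at the likely envelope problem.
        if sl_pct > 50:
            codes.append("SL_TOO_TIGHT")
        if tp_pct < 20:
            codes.append("TP_TOO_AMBITIOUS")
        if timeout_pct > 40:
            codes.append("HORIZON_TOO_SHORT")
    return codes
```

The tree leaves the 0 to 0.5 Sortino band without a code, so the sketch returns only the ML code in that case. ML_UNSTABLE and SCREENING_OVERFIT come from other signals in the taxonomy table and are not part of this tree.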
8. Severity levels
| Level | Description | Examples | Response |
|---|---|---|---|
| P0 Critical | System down or corruption | Pipeline completely blocked, OOMKill loop, PVC full | Immediate escalation, platform owner |
| P1 Urgent | Significant degradation | 0 candidates on a complete run, success rate < 70% | Investigation within the hour, MLOps maintainer |
| P2 Warning | Degraded performance | Negative Sortino, high SL rate, model stale > 14 days | Investigation within the day, ML owner |
| P3 Info | Optimization | Slower run, slightly declining pass rate | Backlog, handle when available |
9. Incident playbooks
9.1 EXECUTION_MISMATCH: ML useful but negative Sortino
Symptom: f1_buy >> baseline, Sortino < 0
Impact: the model detects BUY signals but the trades lose money
Hypotheses:
- SL_TOO_TIGHT: ATR multiplier too low
- TP_TOO_AMBITIOUS: TP multiplier too high
- HORIZON_TOO_SHORT: horizon too short
Checks:
1. sl_pct, tp_pct, timeout_pct in CORRELATION DATA
2. Compare the ATR ranges between passing and failing strategies
3. Check whether the pattern is specific to one crypto or general
Action:
- If sl_pct > 50%: relaunch with sl_range: "1.2,1.5,1.8,2.0"
- If tp_pct < 20%: lower tp_range to "1.5,2.0,2.5"
- If timeout_pct > 40%: increase the horizons
Exit criterion: Sortino > 0 or sl_pct < 40%
Escalation: ML owner after 3 attempts without improvement
9.2 ML_USELESS: model with no predictive value
Symptom: f1_buy <= baseline_f1_buy
Impact: the model does no better than predicting "always BUY"
Hypotheses:
- Dataset too imbalanced (buy_rate > 50%)
- Uninformative features
- Misaligned HPO objective
Checks:
1. buy_rate in CORRELATION DATA
2. HPO action_rate (does the model predict enough BUY?)
3. Compare with other cryptos from the same run
Action:
1. Check buy_rate: if > 50%, the problem is the labeling
2. Test with a different horizon (changes the label distribution)
3. Review the HPO objective (is precision_recall_auc appropriate?)
4. As a last resort: review the feature set
Exit criterion: f1_buy > baseline + 15%
Escalation: ML owner
9.3 NO_CANDIDATES: no viable strategy
Symptom: 0 candidates passed on a complete run
Impact: no exploitable strategy for this crypto/group
Checks:
1. Gate rejection reasons in Grafana > Testing & Backtest
2. If all rejected on Sortino: PTE problem (see 9.1)
3. If all rejected on n_trades: model too conservative (theta too high)
4. If all rejected on f1: ML problem (see 9.2)
Action:
- Relaunch with a wider grid (more horizons, wider ATR ranges)
- Test another crypto from the same group
- If recurrent across a whole group: the group may not be viable
Exit criterion: at least 1 candidate passes the gates
9.4 INFRA_SATURATED: pod OOMKilled or PVC full
Symptom: OOMKill > 0 in Grafana Infra, or PVC > 85%
Impact: unstable pipeline, crashing runs
Checks:
1. Identify the affected pod in Grafana Infra
2. Check whether the problem is recurrent or a one-off
3. Check memory consumption vs limits
Action:
- If one-off OOMKill: relaunch the run
- If recurrent OOMKill: raise the memory limits in the Helm values
- If PVC > 85%: clean up old artifacts (MLflow / S3)
Exit criterion: OOMKill = 0 over 24h, PVC < 70%
Escalation: Platform owner
Last resort only:
Use only if the pod is genuinely stuck, its controller can recreate it, and the cause has been qualified.
9.5 Run too long (> 3h)
Symptom: run duration > p95 + 30% or > 3h absolute
Checks:
1. Identify the blocking step in Airflow
2. If HPO: check n_trials (50 = normal, > 100 = suspect)
3. If data fetch: check Binance API / S3 connectivity
4. If backtest: check the number of candles (60 days x 5 min = 17K = normal)
Action: depends on the identified cause
Exit criterion: duration < historical p95
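The "17K" candle figure used in the backtest check comes from simple arithmetic: 60 days of 5-minute candles, assuming a market that trades around the clock. The helper below is only a worked example of that calculation; its name is hypothetical.

```python
def candle_count(days: int, candle_minutes: int) -> int:
    """Number of candles in a 24/7 market over the given number of days."""
    minutes_per_day = 24 * 60
    return days * minutes_per_day // candle_minutes


print(candle_count(60, 5))  # 17280, i.e. the ~17K quoted above
```

A backtest reporting a count far above this for a 60-day window is a hint that the date range or candle size differs from what was intended.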
10. Available pipelines
Routine operation (operator)
| Airflow DAG | Pipeline | Role |
|---|---|---|
| launch__discovery | pte__discovery | Screen -> Test -> WFRB -> Register challenger |
| launch__backtesting | pte__backtesting | Test -> WFRB from existing candidates |
| launch__walkforward | pte__walkforward | Walk-forward validation |
| launch__retrain | pte__retrain | Retrain + register |
| launch__monitoring | pte__monitoring | Health checks |
Controlled administration (MLOps maintainer / ML owner)
| Airflow DAG | Pipeline | Role | Required approval |
|---|---|---|---|
| launch__meta_training | pte__meta_training | Meta-label training | MLOps maintainer |
| launch__promotion | pte__promotion | Challenger -> Champion | ML owner + operator |
Emergency (platform owner)
| Airflow DAG | Role | Impact |
|---|---|---|
| pte__7_killswitch | Emergency quarantine | Disables a model |
| pte__8_rollback | Return to the previous version | Replaces the champion |
| pte__10_decommission | Archiving / cleanup | Deletes a model |
Rule: never run killswitch, rollback or decommission without explicit validation.
11. Crypto groups
| Group | Indicative universe |
|---|---|
| btc-core | BTCUSDC, ETHUSDC |
| defi | SOLUSDC, ADAUSDC, BONKUSDC, XRPUSDC |
Source of truth: table cvntrade_universes in the PostgreSQL database.
Note: the ML/Trading thresholds must be interpreted in the context of the group and the crypto. A 3% pass rate on a 450-point grid for a volatile crypto is not the same as a 3% pass rate on a 50-point grid for BTC.
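To make the grid-size caveat concrete, the same pass rate can be converted into an absolute number of passing grid points. This is a worked example only, with a hypothetical helper name.

```python
def passing_points(grid_points: int, pass_rate_pct: float) -> float:
    """Expected number of grid points that pass at the given rate."""
    return grid_points * pass_rate_pct / 100


print(passing_points(450, 3))  # 13.5
print(passing_points(50, 3))   # 1.5
```

On the 450-point grid a 3% rate still leaves about a dozen candidates to compare, while on the 50-point grid it amounts to one or two, so the same percentage supports very different conclusions.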
12. Access
| Service | URL | Profile | Usage | Level |
|---|---|---|---|---|
| Grafana | grafana.cvntrade.eu | Operator | supervision | readonly |
| Airflow | airflow.cvntrade.eu | Operator / MLOps | execution + logs | trigger + readonly |
| MLflow | mlflow.cvntrade.eu | MLOps / ML owner | models | registry admin |
| ZenML | zenml.cvntrade.eu | MLOps / ML owner | lineage | readonly |
| W&B | wandb.ai | ML owner | ML analysis | organization account |
| K8s | kubectl | Platform owner | troubleshooting | restricted admin |
| Console (IDP) | console.cvntrade.eu | Operator | service catalog + dashboards | readonly (catalog edits via git PR; see runbooks/catalog-add-service.md) |
Rule: never document or share credentials in this document. All access is managed through Kubernetes secrets.
13. Operating rules (policy)
- Grafana first: Airflow logs are for debugging, not for decisions
- Airflow to execute, not to conclude
- MLflow for models, W&B for ML analysis
- ZenML for lineage
- No silent fallback in diagnostics (ADR-25)
- No destructive action without qualifying the problem first
- Every conclusion must distinguish ML, PTE/backtest and infra
- Intra-crypto comparison only (ADR-27)
- 0 SELL is normal in binary mode (ADR-28)
- Every ML metric must be compared to the naive baseline (ADR-29)
14. Observability debt
The following items currently live in the Airflow logs and must migrate to Grafana:
| Item | Current source | Target |
|---|---|---|
| f1_buy vs baseline_f1_buy | CORRELATION DATA logs | Grafana dashboard |
| screening_f1_buy vs last_fold_f1_buy | CORRELATION DATA logs | Grafana dashboard |
| tp_pct / sl_pct / timeout_pct | CORRELATION DATA logs | Grafana dashboard |
| Primary diagnostic code | Logs (verdict) | Grafana dashboard |
| CVNTrade global state | Manual computation | Grafana dashboard (automated) |
Prerequisite: write the diagnostics to PostgreSQL (not only to the logs) so that Grafana can query them.
15. Expert committee runbook (ADR-52, ADR-68)
The expert committee is the default channel for plan review (process step 3) and for review of substantial PRs (process step 8). 5 expert personas + 1 consolidator produce a structured verdict (ACCEPTED / ACCEPTED-WITH-CHANGES / REJECTED + code).
15.1 When to invoke
| Trigger | session_type | Mandatory? |
|---|---|---|
| Implementation plan (process step 3) | plan_review | yes |
| PR touching src/commun/pipeline/, finetune/, cache/, backtest/, training, labels, prod trading | pr_review | yes |
| Interpretation of an experimental run (FTF, ablation) | experiment_review | recommended |
| Open strategy question | general | optional |
Exemption: docs / dashboards / config only (cf. ADR-68).
pr_review checklist, inter-task transport (ADR-0100, anti-regression for Epic CVN-N014-ED): for any PR touching dags/**, src/commun/finetune/** or scripts/**, check that no task improvises per-task intermediate storage (np.savez / np.load / put_object / upload_fileobj / to_parquet / to_pickle / a /tmp path / an S3 manifest) to cross a task/pod boundary. Require return/xcom_pull (the object-storage backend handles the offload, ADR-0100) or the shared S3 pass-by-reference (cvntrade_s3_manager, cf. s43_io). Exception: bounded single-pod capture (producer and consumer in the same pod). This is the human counterpart of the CodeRabbit gate; guideline: process/INTER_TASK_DATA_TRANSPORT.md.
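A reviewer can pre-screen a PR for the inter-task transport smells listed above before reading it in detail. The sketch below is a hypothetical helper, not the CodeRabbit gate itself: it takes a mapping of changed file paths to their text, looks only at the watched directories, and flags lines that mention one of the listed calls. It cannot tell a legitimate single-pod capture from a violation, so every hit still needs human judgment.

```python
import fnmatch
import re

# Watched paths and call names taken from the checklist above.
WATCHED_GLOBS = ("dags/*", "src/commun/finetune/*", "scripts/*")
FORBIDDEN = re.compile(
    r"np\.savez|np\.load|put_object|upload_fileobj|to_parquet|to_pickle"
)


def flag_transport_smells(changed_files: dict) -> list:
    """Return (path, line_number, matched_call) for each suspicious line."""
    findings = []
    for path, text in changed_files.items():
        # fnmatch's '*' also matches '/', so nested files are covered.
        if not any(fnmatch.fnmatch(path, glob) for glob in WATCHED_GLOBS):
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = FORBIDDEN.search(line)
            if match:
                findings.append((path, lineno, match.group(0)))
    return findings
```

An empty result does not prove compliance (a /tmp path or an S3 manifest built by other means is not detected); it only removes the obvious cases from the manual review.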
15.2 Prepare the dossier
Convention: one self-contained markdown file in documentation/reviews/YYYY-MM-DD-<slug>.md.
Recommended sections:
1. Request to the reviewer: explicit questions to ask (3-7)
2. Project context: 1 paragraph for someone who does not know CVNTrade
3. Observed problem: numbers, not adjectives
4. What has already been tried: explicit outcomes
5. Hypothesis and proposed plan: concrete, testable
6. What was ruled out (and why): anti-suggestions
7. Identified risks and mitigations
8. Checkable success criteria
The committee has no filesystem access: anything that is not in the dossier does not exist for it.
15.3 Launch the session
source .venv_airflow/bin/activate
python scripts/expert_committee.py \
--artifact documentation/reviews/2026-04-26-track-a.md \
--question "Question concise listant les decisions a valider" \
--session-type plan_review \
--issue "#690"
Useful options:
- --experts expert-ml-engineer,expert-data-scientist: subset of experts when the question is narrow
- --no-consolidation: raw opinions only (useful for debugging)
- --dry-run: compiles and prints the prompts without calling the LLMs (free)
- --model gemini/gemini-2.5-pro: overrides the default model
Output:
- committee/sessions/{session_id}_committee.json: full verdict + opinions
- committee/sessions/{session_id}_artifact.md: archived copy of the artifact
- FinOps log in committee/finops.jsonl
Typical cost: $0.10 to $0.30 / session (Mistral large + Gemini 2.5 flash). Above $2 → shrink the dossier rather than retry.
15.4 Read the verdict
GUI: make committee-gui (port 8502) → navigate to the session by ID or date.
Key fields in the JSON:
- verdict.status: ACCEPTED / ACCEPTED_WITH_CHANGES / REJECTED
- verdict.code: structured code (e.g. METHODOLOGY_FLAW, INSUFFICIENT_EVIDENCE)
- verdict.consensus_strength: strong / weak / split
- verdict.blockers: list to address before resubmission if REJECTED
- verdict.recommendations: prioritized, numbered actions
- verdict.areas_of_dissent: where the experts do NOT agree, often more instructive than the areas of consensus
- expert_opinions[].score: 0-10 per expert; an isolated score < 4 signals either a blind spot in the dossier or a deep disagreement
- finops.totals: tokens and cost in USD
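For a quick look without the GUI, the key fields can be pulled out of the session JSON with a few lines of Python. The sketch below assumes only the field names listed above; everything else about the file layout is unknown here, so missing keys are tolerated, and the helper name is hypothetical.

```python
import json


def summarize_verdict(session: dict) -> dict:
    """Extract the fields an operator usually reads first."""
    verdict = session["verdict"]
    scores = [op["score"] for op in session.get("expert_opinions", [])]
    return {
        "status": verdict["status"],
        "code": verdict.get("code"),
        "consensus": verdict.get("consensus_strength"),
        "blocker_count": len(verdict.get("blockers", [])),
        "dissent_count": len(verdict.get("areas_of_dissent", [])),
        # Isolated low scores hint at a blind spot or a deep disagreement.
        "low_scores": [s for s in scores if s < 4],
    }


def load_session(path: str) -> dict:
    """Read a committee session file, e.g. <session_id>_committee.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Typical use would be `summarize_verdict(load_session(path))` on a file under committee/sessions/.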
15.5 After a REJECTED
Three admissible responses:
1. Address the blockers and resubmit (increment the dossier slug with -v2)
2. Reduce the scope to remove the blockers (e.g. drop the contested part of the plan and handle it elsewhere)
3. Explicit waiver in the issue with a written justification ("the committee suggests X but Y because Z")
Silence blocks the next process step (cf. ADR-68 invariant "REJECTED is not optional to address").
15.6 After an ACCEPTED / ACCEPTED-WITH-CHANGES
- Log the session_id in the implementing commit (addresses committee session b2e4c384)
- For PR reviews: copy the session JSON link into the PR description
- If ACCEPTED-WITH-CHANGES: address the high-priority recommendations before merge
15.7 Log the session as an OP Meeting (ADR-82, mandatory)
Every successful plan_review / pr_review / experiment_review MUST be logged as a Meeting object in OpenProject, immediately after the expert_committee.py command (≤ 24h at most; beyond that the review is considered non-compliant with ADR-68 + ADR-82).
# kubectl access required (POST /api/v3/meetings is not exposed through the public API in OP 17.3.1)
set -a && . .env && set +a
python scripts/op_save_committee_as_meeting.py \
--session committee/sessions/{session_id}_committee.json \
--linked-wp <wp_id>
--linked-wp is MANDATORY for sessions of type plan_review / pr_review / experiment_review (the meeting appears under the WP's "Meetings" tab). Only general sessions can do without it. Note: session_type is read from the session JSON (session["verdict"]["session_type"]); this script has no --session-type CLI flag.
The script is idempotent:
- 1st run → creates the Meeting + 5 locked-Users (one per expert) + agenda items + uploads the 2 attachments
- 2nd run on the same session → EXISTING_MEETING_ID=<n> + skip ... already attached
Expected output:
[op_save_committee_as_meeting] pod=openproject-... session=<id> linked_wp=<wp>
[attach] uploaded <id>_committee.json (attachment id=...)
[attach] uploaded <id>_artifact.md (attachment id=...)
CREATED_MEETING_ID=<n>
CREATED_MEETING_URL=https://openproject.cvntrade.eu/meetings/<n>
AGENDA_ITEMS=7
Check in the UI: https://openproject.cvntrade.eu/projects/cvntrade/meetings?upcoming=false (Past tab). The default filter ?upcoming=true does NOT show past sessions (by design: a committee is always in the past once it has been logged).
The CREATED_MEETING_URL must be pasted alongside the session JSON link in the PR description / OP Story comment (both references coexist: JSON = source of truth for content, Meeting = queryable surface).
15.8 Common errors
| Symptom | Cause | Fix |
|---|---|---|
| Systematic REJECTED verdict | dossier too short or vague questions | enrich the dossier, explicitly list the decisions to validate |
| An expert scores 0/10 for no reason | JSON parse failure (the LLM returned markdown instead of pure JSON) | check committee/sessions/parse_failures/; rerun the session |
| Cost > $2 for a session | artifact > 200k chars | trim it; the loader truncates automatically but loses context |
| Langfuse 404 on review-consolidator | prompt not synced to Langfuse | non-blocking: the local fallback takes over |
16. Sprint orchestration runbook (ADR-69)
OpenProject is the project orchestrator. Every development activity starts by selecting a Story in an open version (sprint) of the roadmap, and ends by closing the Story (and the version, when its last Story closes).
Key URLs:
- Roadmap: https://openproject.cvntrade.eu/projects/cvntrade/roadmap
- Versions admin: https://openproject.cvntrade.eu/projects/cvntrade/settings/versions
- WPs filtered by version: https://openproject.cvntrade.eu/projects/cvntrade/work_packages → Version filter
Version naming convention: <epic-shortname>-<phase>-<descriptor> (e.g. F1B-S1-QW-PhaseA). A version that mixes Stories from several Epics violates the ADR-69 invariant.
16.1 Pick a Story (start)
- Roadmap → identify the open version covering the current period
- List its Stories in New or In progress; follow the order given in the Epic doc (documentation/epics/<epic>.md §3 or canonical plan §6)
- Check the single-WIP rule: no other Story already In progress for yourself
- Click the Story → Status → In progress
- Retrieve the cvn_id (e.g. CVN-N001-EE-S01) and the github_issue_url from the Details panel
- Open / create the branch: feat/cvn-n001-ee-s01-<slug> (the cvn_id must appear)
- Start step 1 of the CLAUDE.md dev process
If the Story has no github_issue_url filled in, create the issue, then:
source .venv_airflow/bin/activate
export OPENPROJECT_API_KEY=<...>
python scripts/openproject_import_gh.py --issue <NNN> --type Story \
--cvn-id <CVN-...-S0X> --parent-cvn-id <CVN-...-EE>
16.2 During implementation¶
- For EVERY OP status transition (e.g. Specified → In progress), use scripts/op_story_transition.py --wp <id> --to "<state>" --verdict "..." --evidence "...": it applies the ADR-81 ritual (status resolved by name, guard on legal edges, mandatory verdict/evidence comment). Details: STORY_WORKFLOW §2.
- The Story's cvn_id MUST appear in: the branch name, the PR title, and the PR description
- If the Story runs past the end of the version: add an OP comment explaining the slip and move the Story to the next version (PATCH _links.version or via the UI). Do NOT leave it orphaned.
- If work must stop before completion: set the Story status In progress → New and add an OP comment explaining the interruption
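The cvn_id rule lends itself to a mechanical check. The following is a minimal sketch, not the project's actual guardrail: the function name and the case-insensitive comparison (branch names are lowercase in the examples of this runbook) are assumptions.

```python
def missing_cvn_id_locations(cvn_id: str, branch: str, pr_title: str, pr_body: str) -> list:
    """Return the places where the Story cvn_id is missing (empty list = compliant)."""
    needle = cvn_id.lower()  # branch names are lowercase, so compare case-insensitively
    places = {"branch": branch, "PR title": pr_title, "PR description": pr_body}
    return [name for name, text in places.items() if needle not in text.lower()]


print(
    missing_cvn_id_locations(
        "CVN-N001-EE-S01",
        branch="feat/cvn-n001-ee-s01-label-smoothing",
        pr_title="feat: label smoothing (CVN-N001-EE-S01)",
        pr_body="Adds the label pipeline.",
    )
)
```

An empty result means the three locations all carry the identifier; any returned name points at the place to fix before review.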
16.3 Close a Story (close)¶
Mandatory DoD before closing (cumulative over CLAUDE.md steps 12-13):
- [ ] PR merged on main
- [ ] CI green
- [ ] System tests passed (cf. CLAUDE.md step 12)
- [ ] GH issue closed with a validation comment
- [ ] (If the Story belongs to an FTF Epic) per-track gate validated: f1_buy ≥ +0.015, expectancy_net ≥ baseline, etc. (cf. the Epic doc)
Then in OP:
1. Story → Status → Closed
2. Comment with: PR link, squash-merge SHA, link to the committee pr_review session if applicable
3. Check that the Story disappears from the In progress list of the board
16.4 Close a version (sprint roll-over)¶
A version closes when all its Stories are Closed AND the version gate is validated. Procedure:
- Check completeness: every Story of the version must be in status Closed (filter the WP list by Version: <name> + Status: !Closed → must be empty)
- Check the version gate: open the associated Epic (documentation/epics/<epic>.md §4) and confirm that the acceptance criteria attributable to this version are met. For FTF Epics this is typically:
  - cumulative f1_buy lift ≥ threshold (e.g. +0.05 with CI95 excluding 0)
  - expectancy_net ≥ baseline
  - sortino ≥ baseline
  - max_drawdown ≤ baseline + 1%
- If the gate is OK:
  - OP UI → Settings → Versions → click the version → Edit
  - Status open → closed
  - Description: add a closing note in markdown:
- If the gate is KO:
  - Open a GH issue [gate-failure] <epic> <version> with the diagnosis
  - Apply §6 escalation of the Epic (e.g. F1_buy boost = "evaluate big-bet bundle if QW < +0.05")
  - Close the version with the note Gate result: FAILED → see issue #NNN
- Run the retrospective: 15 min of review, focused on:
  - Which falsifiable hypotheses held / failed (ADR-29 naive baseline)
  - Planned vs actual time per Story
  - Which ADR-58 guardrails triggered
  - A feedback_*.md memory to write if a lesson is durable
- Open the next version: if not already created, follow §16.5
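The typical FTF gate listed in the procedure can be expressed as a small check. This is a sketch under stated assumptions: the GateMetrics fields and evaluate_version_gate are hypothetical names, the default threshold is the +0.05 example, and "baseline + 1%" is read as one percentage point of drawdown. The authoritative criteria remain those of the Epic doc.

```python
from dataclasses import dataclass


@dataclass
class GateMetrics:
    f1_buy_lift: float        # cumulative lift vs baseline
    f1_buy_lift_ci95: tuple   # (low, high) bounds of the CI95 on the lift
    expectancy_net: float
    sortino: float
    max_drawdown: float       # fraction, e.g. 0.12 for 12%


def evaluate_version_gate(m, baseline_expectancy, baseline_sortino,
                          baseline_max_drawdown, lift_threshold=0.05):
    """Return one boolean per gate criterion; the gate passes if all are True."""
    low, high = m.f1_buy_lift_ci95
    return {
        # lift must reach the threshold AND its CI95 must exclude 0
        "f1_buy_lift": m.f1_buy_lift >= lift_threshold and not (low <= 0.0 <= high),
        "expectancy_net": m.expectancy_net >= baseline_expectancy,
        "sortino": m.sortino >= baseline_sortino,
        # assumption: "+ 1%" means one percentage point of drawdown
        "max_drawdown": m.max_drawdown <= baseline_max_drawdown + 0.01,
    }


metrics = GateMetrics(0.06, (0.02, 0.10), expectancy_net=1.2, sortino=0.9, max_drawdown=0.12)
checks = evaluate_version_gate(metrics, baseline_expectancy=1.0,
                               baseline_sortino=0.8, baseline_max_drawdown=0.115)
print("gate OK" if all(checks.values()) else f"gate KO: {checks}")
```

Returning one boolean per criterion, rather than a single verdict, keeps the failed criterion visible for the gate-failure issue.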
16.5 Create a new version¶
Via the API (script such as /tmp/op_sprints.py of 2026-04-27) or the UI:
1. UI: Settings → Versions → New version
2. Name following the convention <epic-shortname>-<phase>-<descriptor>
3. Start date + End date (default cadence 2 weeks, transition sprint 1 week)
4. Description: sprint goal + planned Stories + expected gate
5. Status open
6. Assign the planned Stories: for each target WP, Details panel → Version → pick the new version
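For the API route, a payload along the following lines could be sent to the versions endpoint. This is a sketch only: the content of /tmp/op_sprints.py is not reproduced here, the field names reflect my understanding of the OpenProject API v3 version resource (POST /api/v3/versions) and should be checked against the instance, and the inclusive end-date computation and the project id 1 are assumptions.

```python
import json
from datetime import date, timedelta


def build_version_payload(name, start, project_id, weeks=2, description=""):
    """Build the JSON body for a new open version lasting `weeks` weeks."""
    # Assumption: the end date is inclusive, so a 2-week sprint spans 14 days.
    end = start + timedelta(weeks=weeks) - timedelta(days=1)
    return {
        "name": name,
        "description": {"raw": description},
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "status": "open",
        "_links": {"definingProject": {"href": f"/api/v3/projects/{project_id}"}},
    }


payload = build_version_payload(
    "F1B-S1-QW-PhaseA", date(2026, 5, 4), project_id=1,
    description="Sprint goal + planned Stories + expected gate",
)
print(json.dumps(payload, indent=2))
```

Passing weeks=1 covers the transition-sprint cadence.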
16.5b Sprint version: Stories yes, Epics and Needs no¶
Canonical convention (formalized 2026-04-29 after an audit found 5 Epics, CVN-N012-EA → -EE, polluted with F1B-Backlog, created via the OP UI):
- Story.version: always set (Backlog or active sprint). It is the unit of commitment, and single-WIP applies
- Epic.version: always NONE. An Epic aggregates Stories across several sprints; assigning it a version creates a false promise of closure
- Need.version: always NONE. It is a strategic objective on a multi-sprint / multi-month timeline
Check the current state with a Python script that resolves the types and the cvn_id custom field dynamically (same patterns as scripts/openproject_import_gh.py, see #668) and paginates the work package list so that no violation beyond the first page is missed:
python3 - <<'PY'
"""Audit: Epics and Needs must never have a version assigned.

No hard-coded IDs: the Epic/Need types and the key of the cvn_id custom
field are resolved at runtime via /api/v3/types and
/api/v3/work_packages/schemas/... (robust to a re-ordering in the admin UI).
Paginates the WP list to avoid false "OK" results on projects > 1 page.
"""
import base64
import json
import os
import urllib.parse
import urllib.request

API_KEY = os.environ["OPENPROJECT_API_KEY"]
BASE = "https://openproject.cvntrade.eu"
PROJECT = "cvntrade"
PAGE_SIZE = 100  # OpenProject cap is typically 200, 100 is safe


def op_get(path: str) -> dict:
    """GET an API v3 path with API-key basic auth and return the parsed JSON."""
    req = urllib.request.Request(f"{BASE}{path}")
    auth = "Basic " + base64.b64encode(f"apikey:{API_KEY}".encode()).decode()
    req.add_header("Authorization", auth)
    with urllib.request.urlopen(req, timeout=30) as r:
        return json.load(r)


# Resolve type IDs by name (Epic, Need)
types = op_get("/api/v3/types").get("_embedded", {}).get("elements", [])
type_id_by_name = {t["name"]: t["id"] for t in types}
epic_id = type_id_by_name.get("Epic")
need_id = type_id_by_name.get("Need")
if not epic_id or not need_id:
    raise SystemExit(f"Could not resolve Epic/Need type IDs (got Epic={epic_id}, Need={need_id})")

# Resolve cvn_id customField key (use the Epic schema as a representative)
project = op_get(f"/api/v3/projects/{PROJECT}")
schema = op_get(f"/api/v3/work_packages/schemas/{project['id']}-{epic_id}")
cvn_id_key = next(
    (k for k, v in schema.items() if k.startswith("customField") and isinstance(v, dict) and v.get("name") == "cvn_id"),
    None,
)
if cvn_id_key is None:
    raise SystemExit("Custom field 'cvn_id' not found on Epic schema; fix in OP admin UI")

# Paginate through Epic + Need work packages.
# OpenProject API v3 uses `offset` as the 1-based PAGE NUMBER (not item index).
# https://www.openproject.org/docs/api/collections/
filters = json.dumps([{"type": {"operator": "=", "values": [str(epic_id), str(need_id)]}}])
violations = []
page_num = 1
seen = 0
while True:
    qs = urllib.parse.urlencode({"filters": filters, "pageSize": PAGE_SIZE, "offset": page_num})
    page = op_get(f"/api/v3/projects/{PROJECT}/work_packages?{qs}")
    elems = page.get("_embedded", {}).get("elements", [])
    if not elems:
        break
    for e in elems:
        # A work package without a version has no title on its version link.
        v = e.get("_links", {}).get("version", {}).get("title")
        if v:
            violations.append((e["id"], e.get(cvn_id_key) or "?", v))
    seen += len(elems)
    total = page.get("total", 0)
    if seen >= total:
        break
    page_num += 1

if violations:
    print("VIOLATIONS:")
    for wp_id, cvn_id, version in violations:
        print(f"  wp#{wp_id} ({cvn_id}): version={version}")
    raise SystemExit(1)
print("OK: no Epic/Need has a version assigned (audited in full, paginated)")
PY
Strip a wrongly assigned version:
WP_ID=<wp_id>
LOCK_VERSION=$(curl -s -u "apikey:$OPENPROJECT_API_KEY" \
"https://openproject.cvntrade.eu/api/v3/work_packages/$WP_ID" | jq -r .lockVersion)
curl -s -u "apikey:$OPENPROJECT_API_KEY" \
-X PATCH -H "Content-Type: application/json" \
-d "{\"lockVersion\": $LOCK_VERSION, \"_links\": {\"version\": {\"href\": null}}}" \
"https://openproject.cvntrade.eu/api/v3/work_packages/$WP_ID"
The canonical scripts/openproject_import_gh.py never sets version (correct by default). Ad-hoc scripts (/tmp/_create_*.py or others) must follow the convention: version on Story only, never on Epic / Need.
16.6 Stories by version view¶
Quick list via URL filter:
https://openproject.cvntrade.eu/projects/cvntrade/work_packages?query_props={"f":[{"n":"version","o":"=","v":["<version_id>"]}]}
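Both the UI link above and the equivalent API filter can be generated instead of hand-encoding the JSON. The sketch below uses only the standard library; the function names are hypothetical, and percent-encoding the UI query_props value is an assumption (the link above shows it unencoded).

```python
import json
from urllib.parse import quote

BASE = "https://openproject.cvntrade.eu"


def api_filter_url(version_id: str) -> str:
    """API v3 work package list filtered on one version id."""
    filters = [{"version": {"operator": "=", "values": [version_id]}}]
    # Compact separators keep the encoded string free of %20 noise.
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return f"{BASE}/api/v3/projects/cvntrade/work_packages?filters={encoded}"


def ui_filter_url(version_id: str) -> str:
    """UI work package list filtered on one version id."""
    props = {"f": [{"n": "version", "o": "=", "v": [version_id]}]}
    encoded = quote(json.dumps(props, separators=(",", ":")), safe="")
    return f"{BASE}/projects/cvntrade/work_packages?query_props={encoded}"


print(api_filter_url("42"))
print(ui_filter_url("42"))
```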
Or via the API:
curl -s -u "apikey:$OPENPROJECT_API_KEY" \
"https://openproject.cvntrade.eu/api/v3/projects/cvntrade/work_packages?filters=%5B%7B%22version%22%3A%7B%22operator%22%3A%22%3D%22%2C%22values%22%3A%5B%22<id>%22%5D%7D%7D%5D"
16.7 Common errors¶
| Symptom | Cause | Fix |
|---|---|---|
| Story without github_issue_url | importer not run OR CF not attached to the Story type | run openproject_import_gh.py ; check the CFs under Settings → Work package types |
| Several Stories In progress at the same time | single-WIP violation | move the others back to New with a comment ; keep only one |
| Version stays open after the last Story is closed | gate review step forgotten | run §16.4 (check the gate, add the closing note, close) |
| Story started without an OP entry | ADR-69 violation | create the Story retroactively in the current version before merge |
| _links.version PATCH returns 422 | stale lockVersion (concurrent edit in the UI) | re-GET the WP, take the new lockVersion, retry |
| Story pulled from the Backlog without justification | ADR-69 invariant violation | add an OP comment explaining why the priority changed ; otherwise put the Story back in its planned sprint |
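The "re-GET, take the new lockVersion, retry" fix can be wrapped in a small loop. This is a sketch with injected callables so that it stays independent of any HTTP client: patch_with_lock_retry, get_wp and patch_wp are hypothetical names, and treating 409 as a conflict alongside the 422 reported in the table is an assumption.

```python
def patch_with_lock_retry(get_wp, patch_wp, changes, max_attempts=3):
    """Retry a work package PATCH when the lockVersion went stale.

    get_wp() returns the work package as a dict containing 'lockVersion';
    patch_wp(payload) sends the PATCH and returns the HTTP status code.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        lock_version = get_wp()["lockVersion"]  # always re-read before patching
        status = patch_wp({"lockVersion": lock_version, **changes})
        if status < 400:
            return attempt
        if status not in (409, 422):  # anything else is not a lock conflict
            raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError(f"still conflicting after {max_attempts} attempts")


# Simulated concurrent edit: the first PATCH loses the race, the second wins.
state = {"lock": 7, "calls": 0}


def fake_get():
    return {"lockVersion": state["lock"]}


def fake_patch(payload):
    state["calls"] += 1
    if state["calls"] == 1:
        state["lock"] += 1  # someone edited the WP in the UI meanwhile
        return 422
    return 200 if payload["lockVersion"] == state["lock"] else 422


attempts = patch_with_lock_retry(fake_get, fake_patch, {"_links": {"version": {"href": None}}})
print(f"succeeded after {attempts} attempts")
```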
16.8 CI guardrails: bypass + audit (CVN-N011-EA-S12)¶
The workflow .github/workflows/pr-workflow-guardrails.yml enforces the 4 gates G1-G4 (PR title / Story ref / plan dossier / MLOps readiness) when each PR is opened. Bypass mechanisms, in order of preference:
- Bot auto-bypass: dependabot[bot], github-actions[bot], renovate[bot] skip automatically (the workflow does not run).
- Global kill switch (emergencies only): Settings → Variables → Actions → New repository variable with GUARDRAILS_KILL_SWITCH=true. Disables the workflow without deleting the file. Remove it immediately after the emergency.
- Per-PR waiver by label: apply the label guardrails-waiver to the PR. The main job skips, but an audit job guardrails-waiver-audit emits a ::warning:: that shows up in the check list. Obligation: add a ## Guardrails waiver section to the PR body explaining which gates are bypassed and why.
Monthly audit (committee 98ca88b1 reco #2): during the version retro (§16.4), list the PRs merged with guardrails-waiver via:
gh pr list --state merged --label guardrails-waiver --search "merged:>=YYYY-MM-DD" --json number,title,body
For each waiver, check that the justification in the body is readable and that the bypass is covered after the fact (fix Story, follow-up issue, etc.). A trend of more than 1 waiver per sprint signals a guardrail that is too strict or a process gap to reprioritize.
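The first part of that review can be pre-filtered automatically. The sketch below takes the JSON printed by the gh pr list command above and returns the PRs whose body has no "## Guardrails waiver" section. The function name is hypothetical, and the check only detects the heading: judging whether the justification is readable remains a human task.

```python
import json


def waivers_without_justification(gh_json: str) -> list:
    """Return the numbers of PRs whose body lacks a '## Guardrails waiver' section."""
    prs = json.loads(gh_json)
    return [
        pr["number"]
        for pr in prs
        # body can be null in the gh output, hence the `or ""`
        if "## guardrails waiver" not in (pr.get("body") or "").lower()
    ]


sample = json.dumps([
    {"number": 101, "title": "hotfix", "body": "## Guardrails waiver\nG3 bypassed: urgent hotfix."},
    {"number": 102, "title": "chore", "body": None},
])
print(waivers_without_justification(sample))
```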
17. Incident log¶
Append-only chronological log of incidents that affected production, FTF runs, or the operator workflow. Entries follow a fixed format so future operators (and AI assistants) can grep / scan quickly. Per committee fd9317be recommendation #4, every CRITICAL severity bug merits an entry here even if the code fix is small ; the goal is organizational learning, not blame.
Severity per §8 :
- P0 — production trading down or losing money
- P1 — production degraded, observable in prod metrics
- P2 — FTF / training pipeline broken, no live trading impact, but lock decisions blocked
- P3 — operator workflow friction, no data integrity impact
17.1 2026-04-28 — Track 5 FTF sweep type-mismatch failure (P2)¶
Time : 2026-04-28 12:25 UTC trigger → 12:27 UTC first failure observed → 14:50 UTC hotfix merged → ~14:55 UTC operator re-trigger.
Severity : P2. The FTF sweep is broken and the lock decision for Track 5 (CVN-N001-EE-S01) is blocked. No live trading impact (ADR-71 + EG-S06 flatten_all gate still in place ; Track 5 is FTF-only by design).
Symptom :
ValueError: Type requirement mismatch.
Expected X_train:<class 'numpy.ndarray'>
got [...DataFrame with columns open, xgb_accel_amplitude_ratio_24_grp2, ...]
The non-baseline variants (label_smoothing in {mild, aggressive} + cleanlab in {filter, reweight}) failed at the apply_label_pipeline call. Only the baseline identity short-circuit (label_smoothing=none × cleanlab=off) survived.
Root cause : apply_label_pipeline (Track 5 PR #734 commit 77aa6389) invokes a Hamilton driver whose nodes type X_train: np.ndarray. Hamilton's validate_inputs rejects the pd.DataFrame that prod feeds in (the XGBoost trainer receives a DataFrame from the cvntrade_autonomous_orchestrator cache layer).
Test gap : tests/unit/training/labels/test_label_pipeline.py:_make_imbalanced_dataset returned (np.ndarray, np.ndarray). No test covered the DataFrame case. The CI signal was clean (181/181 pre-hotfix) but did not cover the production codepath.
Fix : PR #751 commit 3837a886 (merged 2026-04-28 14:50 UTC). Defensive coercion pd.DataFrame → np.ndarray at the entry of apply_label_pipeline, BEFORE the Hamilton driver. Plus 4 new TestDataFrameCoercion tests that exercise the codepaths with DataFrame inputs.
Audit trail :
- GH issue : #750
- Hotfix PR : #751 (merged 3837a886)
- Hotfix dossier : documentation/reviews/2026-04-28-track5-hotfix-dataframe-coercion.md
- Committee pr_review : session fd9317be PASSED OK avg 8.1 strong, 0 blocker, 7 forward-looking recos
- Original Track 5 commits : implementation 77aa6389 (PR #734), plan dossier 1074891a
Time-to-detect (technical) : ~2 minutes (trigger 12:25 UTC → first failure log at 12:27 UTC). Logs surfaced the failure immediately ; the system's automated detection is fine.
Time-to-acknowledge (human) : ~55 minutes (operator notified Claude / flagged ~13:20 UTC after ~1h of failed-runs accumulating in Airflow). This is the real operator-loop latency — the failure was visible in logs from minute 2, but no alert paged the operator. Highlights the absence of a real-time "FTF run health" alert (cf. documentation/runbooks/cleanlab_cv_systemic_failure.md per ADR-70 §2 P1 alert wired in MLOps readiness for CVN-N001-EE-S01 — would have paged at minute 5 had it been deployed already).
Time-to-mitigate : ~10 minutes from human acknowledgement (operator stopped sweep on flag).
Time-to-fix : ~1.5h from human acknowledgement (diagnosis 5 min + code 5 min + 4 regression tests 10 min + dossier 15 min + committee submission 2 min + committee verdict ~2 min + merge ~5 min + sync delays).
Aggregate time-to-recovery (trigger → fix merged → re-trigger possible) : ~2.5h, dominated by the 55-min human-acknowledgement gap.
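The durations quoted above can be recomputed from the timestamps of this entry, which is a convenient sanity check when filling in the template for a new incident. A minimal sketch (the dictionary keys and the helper name are illustrative):

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


timeline = {
    "trigger": "12:25",
    "first_failure": "12:27",
    "human_ack": "13:20",
    "fix_merged": "14:50",
}

detect = minutes_between(timeline["trigger"], timeline["first_failure"])
ack = minutes_between(timeline["trigger"], timeline["human_ack"])
fix_from_ack = minutes_between(timeline["human_ack"], timeline["fix_merged"])
recovery = minutes_between(timeline["trigger"], timeline["fix_merged"])
print(detect, ack, fix_from_ack, recovery)
```

With these inputs the script reports 2 min to detect, 55 min to acknowledge, 90 min from acknowledgement to merged fix, and 145 min from trigger to merged fix, in line with the ~2 min, ~55 min, ~1.5h and ~2.5h figures.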
Lessons learned :
- Test fixture parity : every helper exposed to the production trainer MUST have at least one test fixture using the actual production input types (pd.DataFrame for X, pd.Series for y). Going forward, this is a checklist item in the MLOps readiness template (ADR-70 §6 ; committee fd9317be reco #2, to be applied as an ADR-70 amendment).
- Production smoke gate : substantive ML code (touching src/training/, src/commun/{pipeline,inference,filters,labels}/) should pass a 1×1×1 FTF smoke variant on the cluster as a pre-merge gate, not just the local unit + integration tests. Committee fd9317be reco #3, to be applied as an ADR-69 amendment OR a new ADR.
- Hamilton strict typing is load-bearing : the validation that broke our run is a feature, not a bug. The fix is entry-point coercion, NOT loosening the type hints (which would defer the same class of bug to deeper code). Committee fd9317be reco #5 explicitly endorses this strategy.
- Cache layer audit : the upstream cvntrade_autonomous_orchestrator propagates pandas types through the cache without explicit boundary coercion. Committee reco #6 : follow-up issue to add type validation at the cache boundary, reducing future drift surface.
Follow-up actions :
| # | Action | Status |
|---|---|---|
| F1 | Apply committee reco #2 → ADR-70 amendment for DataFrame fixture parity | open |
| F2 | Apply committee reco #3 → ADR-69 / ADR-70 amendment for production smoke gate | open |
| F3 | Open issue for cache-layer type audit (committee reco #6) | open |
17.2 2026-04-28 — Track 5 FTF sweep XGBoost feature-name mismatch (P0, incident #2 same surface as §17.1)¶
Time : trigger 2026-04-28 ~15:30 UTC → detect <60s → mitigate (operator stops sweep) ~5min → fix (PR #754 merged) ~3h.
Severity : P0. Same blocker on the same surface within 24h of incident #1, all FTF variants failing.
Symptom :
ValueError: data did not contain feature names, but the following fields are
expected: open, high, low, close, BBL_8_2_0, BBM_8_2_0, BBU_8_2_0, RSI_14, ...
The error is raised by xgb.train's _validate_features cross-check between the dtrain and dval eval pairs.
Root cause : Hotfix v1 of incident #1 (PR #751 commit 3837a886) coerced X_train from DataFrame → ndarray at the entry of apply_label_pipeline to satisfy Hamilton's validate_inputs. That fix was correct for Hamilton, but it broke an implicit, undocumented contract with XGBoost : X_val is never touched by the pipeline (still a DataFrame, with feature_names). When xgb.train was called with evals=[(dtrain, "train"), (dval, "val")], dtrain (no feature names) and dval (has feature names) were inconsistent → crash.
Pattern of failure : Hotfix v1 unit tests (TestDataFrameCoercion) covered apply_label_pipeline in isolation. They satisfied the Hamilton contract but never replayed the full trainer codepath. Methodology gap : unit tests of a transform are not enough — the integration with downstream consumers must be tested at the boundary.
Test gap : no integration test replayed xgb.DMatrix(X_train, ...) + xgb.DMatrix(X_val, ...) + xgb.train(..., evals=[...]) with both eval pairs. The new TestTrainerEndToEndDataFrame class (PR #754) does exactly this — verified to fail without the fix and pass with it.
Fix : PR #754 (commit 6156ff4d).
Layer 1 — apply_label_pipeline now PRESERVES the input type round-trip :
- True identity short-circuit BEFORE any coercion (preserves DataFrame index)
- For the Hamilton path : capture metadata (column names, Series name, index) → coerce to ndarray → re-wrap to original type with original index when row count matches
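The Layer 1 idea (capture metadata, coerce, re-wrap) can be sketched as follows. This is not the code of PR #754: with_type_round_trip is a hypothetical helper that only handles X as a DataFrame, and falling back to a fresh default index when rows were dropped is an assumption.

```python
import numpy as np
import pandas as pd


def with_type_round_trip(transform, X):
    """Run an ndarray-only transform while preserving a DataFrame input type."""
    if not isinstance(X, pd.DataFrame):
        return transform(np.asarray(X))
    columns, index = X.columns, X.index  # capture metadata before coercion
    out = transform(X.to_numpy())
    # Re-attach the original index only when the row count is unchanged.
    new_index = index if len(out) == len(index) else None
    return pd.DataFrame(out, columns=columns, index=new_index)


X = pd.DataFrame(
    {"open": [1.0, 2.0], "RSI_14": [30.0, 70.0]},
    index=pd.date_range("2026-04-28", periods=2, freq="D"),
)
same_rows = with_type_round_trip(lambda a: a * 2, X)
fewer_rows = with_type_round_trip(lambda a: a[:1], X)
print(type(same_rows).__name__, list(same_rows.columns), same_rows.index.equals(X.index))
```

Because the output keeps its column names, a DMatrix built from it carries the same feature_names as one built from the untouched X_val.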
Layer 2 — Trainer-side _assert_dmatrix_contract(dtrain, dval) invoked AFTER DMatrix construction and BEFORE xgb.train (committee reco #2). Validates feature_names presence parity, equality, num_col, feature_types. Fail-fast per ADR-25 : if a future transform regresses, the error surfaces here, immediate and readable, not deep in xgb.train.
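A fail-fast assertion of this kind could look like the sketch below. It is not the project's _assert_dmatrix_contract: it is written against the public DMatrix attributes feature_names, feature_types and num_col(), and uses a stand-in class (FakeDMatrix, hypothetical) so the example runs without xgboost installed.

```python
class FakeDMatrix:
    """Stand-in exposing the DMatrix attributes used by the check."""

    def __init__(self, feature_names, feature_types, n_cols):
        self.feature_names = feature_names
        self.feature_types = feature_types
        self._n_cols = n_cols

    def num_col(self):
        return self._n_cols


def assert_dmatrix_contract(dtrain, dval):
    """Raise a readable error if dtrain and dval disagree on their feature schema."""
    if (dtrain.feature_names is None) != (dval.feature_names is None):
        raise ValueError("feature_names present on only one of dtrain / dval")
    if dtrain.feature_names != dval.feature_names:
        raise ValueError("feature_names differ between dtrain and dval")
    if dtrain.num_col() != dval.num_col():
        raise ValueError(f"column count differs: {dtrain.num_col()} vs {dval.num_col()}")
    if dtrain.feature_types != dval.feature_types:
        raise ValueError("feature_types differ between dtrain and dval")


names = ["open", "high", "RSI_14"]
dval = FakeDMatrix(names, ["float"] * 3, 3)
assert_dmatrix_contract(FakeDMatrix(names, ["float"] * 3, 3), dval)  # passes silently

try:
    # Reproduces the §17.2 situation: dtrain lost its feature names.
    assert_dmatrix_contract(FakeDMatrix(None, None, 3), dval)
except ValueError as exc:
    print(f"contract violation: {exc}")
```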
Layer 3 — 12 new integration tests :
- TestTrainerEndToEndDataFrame (6 tests) replays the EXACT trainer codepath end-to-end
- TestSiblingFailureModes (5 tests) covers dtype preservation, sample_weights row alignment after filter, NaN handling per ADR-25, trainer assertion regression bar
- test_dataframe_index_preserved_when_row_count_unchanged locks DatetimeIndex round-trip
Audit trail :
- GH issue #753
- PR #754 (merged commit 6156ff4d)
- Committee pr_review session 0e15acc0 — verdict PASSED EXECUTION_RISK (5 experts, consensus strong)
- Dossier : documentation/reviews/2026-04-28-track5-hotfix-v2-type-preservation.md
- Story : CVN-N001-EE-S01 (OP wp#40)
Time-to-detect / acknowledge / mitigate / fix :
- Tech detect : <60s (cluster log surfaced ValueError)
- Human ack : ~5min (operator pinged after observing failed variants in Airflow UI)
- Mitigate : ~5min (operator stopped sweep)
- Fix : ~3h (PR #754 with 3 CR cycles + committee + recos #2 + #3 inline)
Lessons :
- An incident on the same surface within 24h is a methodology signal, not a bug accident. Fixing the immediate symptom (incident #1) without integration test parity (§17.1 lesson #1) was insufficient : incident #2 hit the very next operator trigger. The committee 0e15acc0 verdict captures this : "the recurrence of critical production failures on the same surface within 24 hours indicates systemic weaknesses in data contract enforcement, integration testing, and operational readiness."
- Defense-in-depth at every contract boundary. The label_pipeline → trainer boundary is now triple-protected : (a) type preservation in the transform, (b) trainer-side assertion before xgb.train, (c) integration tests that replay both. This pattern should generalize to every ML pipeline boundary (committee reco #1, OP Story CVN-N011-EA-S01, #755).
- OP-first for backlog. Per ADR-76, the 6 forward-looking recos from session 0e15acc0 were created in OP first (Need CVN-N011, Epic CVN-N011-EA, Stories S01-S06 = wp#69-#74), then GH issues #755-#760 derived from them.
Follow-up actions :
| # | Action | OP | GH |
|---|---|---|---|
| F4 | ADR explicit data contracts at ML pipeline boundaries (P1) | wp#69 | #755 |
| F5 | ADR amendment — mandate integration testing parity (P2) | wp#70 | #756 |
| F6 | Production smoke gate pre-FTF-sweep (P2) | wp#71 | #757 |
| F7 | Schema validation tooling eval (Great Expectations / Pydantic) (P3) | wp#72 | #758 |
| F8 | Systemic post-mortem — incidents §17.1 + §17.2 (P3) | wp#73 | #759 |
| F9 | Real-time observability — schema/dtypes/NaN at boundaries (P3) | wp#74 | #760 |
17.3 2026-04-28 — Track 5 FTF sweep calibration crash on soft labels (P0, incident #3 same surface as §17.1+§17.2)¶
Time : trigger 2026-04-28 ~15:30 UTC → detect <60s → mitigate (operator stops sweep) ~5min → fix (PR opened ~17:00 UTC) ~3h.
Severity : P0 — third blocker on the same surface in 24h, ~50% of FTF sweep variants crashing (all variants with eps_buy > 0 AND calibration != "none").
Symptom :
File "src/training/XGBoost/cvntrade_XGBoost_trainer.py", line 315
self._apply_calibration(X_train, y_train, config.calibration)
File ".../sklearn/calibration.py", line 319, in fit
check_classification_targets(y)
ValueError: Unknown label type: continuous. Maybe you are trying to fit a
classifier, which expects discrete classes on a regression target with
continuous values.
Root cause : Track 5 label smoothing (apply_label_pipeline with eps_buy > 0) transforms y_train into soft labels (continuous floats in [0, 1]). xgb.train(..., 'binary:logistic') accepts soft labels natively → training succeeds. But the immediately-following _apply_calibration calls sklearn's CalibratedClassifierCV.fit(X_train, y_train) which calls check_classification_targets(y) and rejects continuous targets.
This is the third incident in 24h on the same conceptual surface — apply_label_pipeline violating an undocumented contract with a downstream consumer of y_train :
| Incident | Downstream consumer | Hotfix |
|---|---|---|
| §17.1 | Hamilton validate_inputs (X type) | PR #751 |
| §17.2 | XGBoost _validate_features (feature_names parity) | PR #754 |
| §17.3 (this) | sklearn CalibratedClassifierCV (discrete targets) | PR #767 (CVN-N011-EA-S07) |
Test gap : Track 5 integration tests + the new TestTrainerEndToEndDataFrame from PR #754 only exercise the path up to xgb.train — NOT the _apply_calibration step that follows. The committee 0e15acc0 reco #4 ("mandate integration testing parity") predicted exactly this gap, but had not yet been broadly applied.
Fix : PR #767 (CVN-N011-EA-S07).
Adopted Option C (chosen after operator triage and committee 986ea335 plan_review PASSED) :
- Calibration runs on (X_val, y_val) instead of (X_train, y_train) — the val split is hard-labeled (never touched by apply_label_pipeline), so soft labels in train no longer impact calibration
- Semantically more correct than options A (round soft → hard, lossy) and B (preserve original y_train, in-sample calibration). ML best practice is to always calibrate on hold-out (Platt 1999, Niculescu-Mizil & Caruana 2005). Aligned with ADR-15 (theta calibrated OOS), an existing precedent in the project.
- Defense-in-depth : new _assert_calibration_targets_discrete() helper invoked BEFORE each CalibratedClassifierCV.fit() (per ADR-25 fail-fast). If a future transform breaks the contract, the error surfaces immediately with a readable message pointing at the suspected root cause.
- Renamed _apply_calibration(X_train, y_train, ...) → _apply_calibration(X_calib, y_calib, ...) to make the new contract explicit (same for _apply_hybrid_calibration).
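The discreteness guard can be illustrated in a few lines. This sketch is not the project's _assert_calibration_targets_discrete: the name, the binary {0, 1} label set and the error wording are assumptions.

```python
def assert_calibration_targets_discrete(y, allowed=(0, 1)):
    """Raise before calibration if y contains anything but hard class labels."""
    bad = [v for v in y if v not in allowed]
    if bad:
        raise ValueError(
            f"calibration targets must be discrete {allowed}, found e.g. {bad[:3]}; "
            "soft labels from label smoothing probably leaked into the calibration split"
        )


assert_calibration_targets_discrete([0, 1, 1, 0])  # hard labels: passes silently

try:
    # Soft labels as produced by label smoothing with eps_buy > 0.
    assert_calibration_targets_discrete([0.0, 0.7, 1.0, 0.3])
except ValueError as exc:
    print(f"rejected: {exc}")
```

Running the check just before the calibrator's fit call turns the deep sklearn "Unknown label type: continuous" error into a message that names the likely cause.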
Tests : 5 new integration tests in TestCalibrationOnSoftLabelsTrain :
- 3 parametrized variants (calibration=isotonic/sigmoid/platt) with eps_buy=0.3 end-to-end through trainer.train(...) — all 3 PASS post-fix, FAIL pre-fix with the exact production error message (regression bar verified)
- 1 test for the _assert_calibration_targets_discrete helper (healthy + continuous inputs)
- 1 baseline sanity (eps=0, calibration=none) — no regression on the no-op path
Audit trail :
- Plan dossier : documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-plan.md
- Committee plan_review session 986ea335 — PASSED EXECUTION_RISK (5 experts, consensus strong)
- GH issue #764
- PR #767
- OP Story CVN-N011-EA-S07 (wp#85)
Time-to-detect / acknowledge / mitigate / fix :
- Tech detect : <60s (cluster log surfaced ValueError on first non-baseline variant)
- Human ack : ~5min (operator pasted log dump immediately)
- Mitigate : ~5min (operator stopped sweep)
- Fix : ~3h wall-clock (plan dossier 30min + committee 5min + implementation+tests 1h + PR+commit 30min)
Lessons :
- Three same-surface incidents in 24h confirm a systemic gap, not a bug accident. The unit-test surface (label_pipeline in isolation) and even the new trainer-end-to-end test (path up to xgb.train) BOTH missed _apply_calibration because no test exercised the FULL trainer.train(...) lifecycle with soft labels. The committee 0e15acc0 reco #4 ("mandate integration testing parity") and reco #5 ("production smoke gate pre-FTF-sweep") would have caught this, so their priority is being escalated.
- Defense-in-depth at every contract boundary, generalized by ADR-mandate. We now have boundary assertions at 2 contracts (dtrain↔dval feature_names from §17.2, calibration target discreteness here). A fourth incident would hit a contract for which we do not yet have an assertion. Story CVN-N011-EA-S01 (data contracts ADR, #755) becomes the systemic answer and should be accelerated.
- Pre-existing methodological weakness exposed by the bug. Before §17.3, calibration ran on training data, which is methodologically optimistic (in-sample calibration). The fix not only solves the immediate crash but corrects this weakness by moving calibration to hold-out (X_val, y_val), aligning with the ADR-15 (theta calibrated OOS) precedent. Bugs occasionally surface latent design issues : prefer the structural fix over the tactical one when they coincide.
Follow-up actions :
| # | Action | OP | GH |
|---|---|---|---|
| F10 | Audit other y_train/y_val consumers in trainer + post-trainer for similar implicit contracts (Risk #5 of plan dossier) | TBD | TBD if found |
| F11 | Re-prioritize CVN-N011-EA-S01 (data contracts ADR) — move from P1 backlog to next sprint | wp#69 | #755 |
| F12 | Re-prioritize CVN-N011-EA-S03 (production smoke gate) — would have caught all 3 of §17.1+§17.2+§17.3 | wp#71 | #757 |
17.4 2026-04-29 — Track 6 focal_loss FTF sweep crash on missing sympy (P1)¶
Time : 11:35 trigger → 11:37 detect (operator log review) → 11:49 mitigate (PR #775 merged) → 11:52 fix (image deploy SUCCESS).
Severity : P1 — every focal_loss trial crashed identically with ModuleNotFoundError: No module named 'sympy'. 5 pods × 50 trials = 250/250 failures, zero rows persisted, best_score=-1000.0 on all variants.
Symptom : Loki query event=xgboost_training_failed error=No module named 'sympy' showed identical traceback across all 5 pods, originating at src/training/XGBoost/focal_loss.py:66 → import sympy as sp inside _build_focal_lambdas() called at module import (line 90).
Root cause : Track 6 PR #767 added focal_loss.py which imports sympy at module load time, but neither requirements.txt (root) nor airflow_docker/requirements.txt (image build) declared the dependency. Local .venv_airflow had sympy as a transitive dep so all CI tests passed silently. Same gap pattern as PR #762 (cleanlab).
Test gap : no end-to-end runtime smoke against the actual airflow image. CI uses .venv_airflow derivatives, so missing prod deps don't surface until first real trial. CodeRabbit doesn't compare module imports against requirements files.
Fix : PR #775 squash 510b10db — pinned sympy==1.14.0 in BOTH requirements.txt and airflow_docker/requirements.txt with explicit comment about runtime requirement + ADR-25 fail-loud rationale.
Audit trail : GH #776, OP wp#92 (CVN-N011-EA-S11) post-mortem Story tracking the systemic gap, PR #775 hotfix.
Time-to-detect (technical) / acknowledge (human) / mitigate / fix : 0min / 2min / 12min / 17min. Detection was fast because the operator was watching the sweep live. Without the live watch, the silent failure (250 trials with best_score=-1000.0) would have looked indistinguishable from "model converged badly" until the results dossier showed every single trial had identical impossible scores.
Lessons learned :
- Module-load-time imports are a runtime trap that the test suite cannot catch : the tests import modules in the dev environment, where transitive deps are abundant.
- The "fix" pattern (sync requirements.txt ↔ airflow_docker/requirements.txt) is recurring (#762 cleanlab, this incident sympy). A process gate is needed, not more vigilance.
- 6 CodeRabbit passes + committee pr_review on PR #767 didn't catch this because none of those layers compares declared deps to actual imports.
Follow-up actions :
| # | Action | Story | Issue |
|---|---|---|---|
| F13 | Build a CI gate that fails when a module import in src/ is not declared in either requirements file (Hypothesis 1 in S11 plan) | wp#92 | #776 |
| F14 | Add a post-build dockerized smoke test of curated entry points (Hypothesis 2 in S11 plan) | wp#92 | #776 |
| F15 | Update TEMPLATE_mlops_readiness.md §3 with a "new dep declared in BOTH requirements files" checkbox | wp#92 | #776 |
| F16 | Retroactive scan of src/ for other latent missing-dep regressions on main today | wp#92 | #776 |
17.5 2026-04-29 — Cleanlab FTF sweep gRPC fork deadlock (P1)¶
Time : 09:46 trigger → 09:47 first symptom → 10:10 detect (operator review of pod state) → 10:12 mitigate (operator killed pods) → ~12:00 root cause confirmed (focal_loss sweep ran cleanly on same stack) → fix in flight via PR (this Story CVN-N011-EA-S10).
Severity : P1 — every cleanlab FTF sweep deadlocks silently after ~4 trials per pod. Pods report Running to K8s, consume RAM at 2-3 GiB each, write zero rows to finetune_results. No alert exists today (this incident is what triggers adding one — see runbook).
Symptom : 5 pods stuck identically at cleanlab_cv_probs event after ~4 HPO trials. LDOUSDC pod showed explicit WARNING: All log messages before absl::InitializeLog() is called are written to STDERR then I0000 ... fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers then 24 minutes of complete silence.
Root cause : cleanlab.filter.find_label_issues defaulted to n_jobs=cpu_count() and spawned multiprocessing.Pool with the OS default start method (fork on Linux + Python<3.14). The forked children inherited live MLflow autolog gRPC threads in a half-locked state and hung forever on first gRPC call. Diagnostic was confirmed by the focal_loss FTF sweep (2026-04-29 11:08-11:30 UTC) which ran cleanly on the same HPO + MLflow + Optuna stack — the only difference being focal_loss doesn't call cleanlab.find_label_issues.
Test gap : no integration test exercised cleanlab CV with concurrent MLflow autolog gRPC threads alive in the parent process. The 8 unit tests in TestSuspectMaskPerClassCap (S08) covered the per-class cap logic but mocked out cleanlab so the fork pattern never fired in CI.
Fix : PR #777 CVN-N011-EA-S10 — pin n_jobs=1 on the find_label_issues call in src/training/labels/label_pipeline.py::suspect_mask (eliminates the fork) + defence-in-depth GRPC_ENABLE_FORK_SUPPORT=1 + GRPC_POLL_STRATEGY=poll env vars in Helm + docker-compose. Reproducer test in tests/integration/test_grpc_fork_deadlock_regression.py asserts the contract holds.
Audit trail : GH #774, OP wp#91 (CVN-N011-EA-S10), plan dossier 2026-04-29-grpc-fork-deadlock-plan.md, committee plan_review session 7bf612b7 PASSED OK.
Time-to-detect (technical) / acknowledge (human) / mitigate / fix : 0min (no alert wired) / 24min (operator manual review) / 2min (pod kill) / fix in flight. The 24-min technical-vs-human gap is exactly the alert-wiring debt that the new runbook + hpo_pod_stuck alert (wired in this Story's MLOps readiness §1) closes.
Lessons learned :
- Forking after gRPC threads are alive is a deterministic deadlock — must be guarded everywhere gRPC is used (i.e. the entire prod codebase via MLflow autolog).
- Cleanlab's n_jobs default is unsafe in any process that has MLflow autolog active.
- Detecting "pod stuck but Running" requires a dedicated log-progress alert. Prometheus liveness probes alone don't catch this (process is alive, just hung).
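The log-progress idea behind the last lesson reduces to comparing each pod's most recent structured event with the current time. The sketch below is illustrative only: the 10-minute silence threshold, the function name and the BTCUSDC pod are assumptions, and the real hpo_pod_stuck alert is defined in the runbook, not here.

```python
from datetime import datetime, timedelta


def stuck_pods(last_event_by_pod: dict, now: datetime,
               max_silence: timedelta = timedelta(minutes=10)) -> list:
    """Pods whose most recent log event is older than max_silence."""
    return sorted(
        pod for pod, last_event in last_event_by_pod.items()
        if now - last_event > max_silence
    )


now = datetime(2026, 4, 29, 10, 10)
last_events = {
    "LDOUSDC": now - timedelta(minutes=24),  # silent since the fork warning
    "BTCUSDC": now - timedelta(minutes=2),   # hypothetical healthy pod
}
print(stuck_pods(last_events, now))
```

A pod flagged this way can still be Running from the K8s point of view, which is exactly the case the liveness signal misses.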
Follow-up actions :
| # | Action | Story | Issue |
|---|---|---|---|
| F17 | Land the H4 + H2 fix + reproducer test on main | wp#91 | #774 |
| F18 | Wire the hpo_pod_stuck alert in Grafana per the new runbook §1 | wp#91 | #774 |
| F19 | Add gRPC client metrics (latency, errors) to OTel pipeline (deferred from committee 7bf612b7 reco #6) | CVN-N010-EA | TBD |
| F20 | Liveness probe on hpo_heartbeat events (deferred from reco #1) | CVN-N010-EA | TBD |
17.X — Template for future entries¶
### 17.X 2026-MM-DD — <one-line title> (severity Pn)
**Time** : trigger → detect → mitigate → fix.
**Severity** : Pn — short justification.
**Symptom** : observable signal (log line, metric, dashboard).
**Root cause** : 1-3 sentences naming the file:line if applicable.
**Test gap** : what the test suite was missing.
**Fix** : PR # commit, files touched, mechanism.
**Audit trail** : GH issue, PR, committee session, related ADR/Story.
**Time-to-detect (technical) / acknowledge (human) / mitigate / fix** : durations. Split tech vs human detection — they are usually different and the gap is operationally interesting (it's where alert-wiring debt lives).
**Lessons learned** : 1-N bullets, action-oriented.
**Follow-up actions** : table of issues/PRs to track each lesson.
Glossary¶
| Term | Definition |
|---|---|
| screening | Phase 1: fast preselection of candidate PTEs (grid search) |
| testing | Phase 2: multi-fold HPO validation + OOS backtest |
| WFRB | Phase 3: walk-forward rolling backtest validation |
| challenger | Model registered but not promoted, awaiting approval |
| champion | Active reference model for a given crypto |
| stale | Model not refreshed for N days |
| baseline_f1_buy | F1 score of a naive "always BUY" classifier, used as the minimum reference |
| buy_rate | Proportion of BUY labels in the test split |
| PTE | Trade execution parameters (Parametres de Trade Execution): SL, TP, horizon |
| golden signal | Key supervision metric, the core of observability |
| last_fold_f1_buy | BUY F1 of the last fold (same period as the screening) |
| CORRELATION DATA | Structured block in the test_step logs; stable interface (ADR-30) |
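The baseline_f1_buy and buy_rate entries are linked by a closed form. For an "always BUY" classifier, recall on BUY is 1 and precision equals buy_rate, so F1 = 2 × buy_rate / (1 + buy_rate). The sketch below assumes BUY is the positive class and that both values are measured on the same split; the function is illustrative, not the project's implementation.

```python
def baseline_f1_buy(buy_rate: float) -> float:
    """F1 on BUY for a classifier that predicts BUY on every sample."""
    if not 0.0 <= buy_rate <= 1.0:
        raise ValueError("buy_rate must be within [0, 1]")
    if buy_rate == 0.0:
        # No BUY label at all: precision is 0, so F1 is 0.
        return 0.0
    precision = buy_rate  # every prediction is BUY
    recall = 1.0          # every true BUY is predicted
    return 2 * precision * recall / (precision + recall)
```

With buy_rate = 0.25 this gives 0.4, so a model F1 BUY near 0.4 on such a split carries no information beyond the label distribution.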
Working with the docs site (docs.cvntrade.eu)¶
The Design System + ADRs + runbooks are published to docs.cvntrade.eu. Source lives in documentation/; build config is mkdocs.yml at the repo root. This is Phase 2 of #593 (#637 tracks the scaffolding).
Local preview¶
make docs-install # one-time: installs mkdocs + plugins into .venv_airflow
make docs-serve # hot-reload at http://127.0.0.1:8000
Edit any .md file in documentation/ — the browser reloads on save.
Local strict build (same checks as CI)¶
--strict fails on broken internal links, missing nav entries, or unknown config.
If this passes locally, CI will pass too.
Adding a page¶
- Drop the .md file in the right subfolder (needs/, epics/, adr/, …).
- Add an entry in mkdocs.yml under nav: so it's discoverable (required unless it's referenced from another page's index).
- make docs-build, then fix any broken links.
- Commit + PR. CI rebuilds on every PR that touches documentation/**.
Deploy¶
- main push → .github/workflows/docs.yml → builds strict → GitHub Pages.
- First deploy: enable GH Pages in repo settings (Source: GitHub Actions), then set CNAME docs.cvntrade.eu → dococeven.github.io at the registrar.
- Subsequent deploys are fully automated. No operator action required.
Architecture diagrams¶
documentation/architecture/workspace.dsl is the single Structurizr DSL source.
See documentation/architecture/workspace-reference.md for rendering options
(VS Code live preview, structurizr-cli, Structurizr Lite in Docker).
Troubleshooting¶
| Symptom | Cause / fix |
|---|---|
| mkdocs: command not found | Run make docs-install. |
| Strict build fails on broken link | Fix the .md link: use a relative path from the file's location. |
| New file isn't in the sidebar | Add it under nav: in mkdocs.yml. |
| README.md shows as an empty page | Expected: README.md at docs_dir root is excluded in favor of index.md. |
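For the broken-link row, the relative path can be computed rather than guessed. A minimal sketch, assuming both paths are given relative to documentation/; the helper name and the page names in the example are hypothetical.

```python
import posixpath


def relative_md_link(source_page: str, target_page: str) -> str:
    """Return the link to write in source_page so that it resolves to target_page.

    Both arguments are paths relative to the docs root, with forward slashes.
    """
    source_dir = posixpath.dirname(source_page)
    return posixpath.relpath(target_page, start=source_dir)


# A page under adr/ linking to a page under needs/ must go up one level.
link = relative_md_link("adr/ADR-26.md", "needs/CVN-N001.md")
```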
OpenProject operator playbook (#593 Phase 1)¶
URL: https://openproject.cvntrade.eu
Role: product source of truth for Needs / Epics / Stories / Releases.
Deployed via: infra/helm/openproject/ chart, Helm-managed through the deploy-k8s workflow.
Chart setup doc: infra/helm/openproject/README.md §Setup.
Login¶
- Browser → https://openproject.cvntrade.eu/login
- Username admin initially; operator account created post-setup
- If a cert warning appears, cert-manager hasn't issued the certificate yet: wait 5 min and refresh
Create a new Need¶
- Top nav → project CVNTrade → Work packages
- Create → type Need
- Subject: short title, e.g. "Reach F1=0.75 binary classification"
- Custom fields:
    - need_id = next available CVN-N<nnn> (strictly sequential)
    - github_issue_url = link to the parent GitHub issue
- Description: follow documentation/templates/TEMPLATE_need.md
- Save → OpenProject auto-generates the work package ID (ignore it, use need_id for referencing)
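The "next available CVN-N<nnn>" rule can be sketched as follows. This assumes three zero-padded digits (as in CVN-N001 and CVN-N011) and takes the highest existing number plus one; the function name is hypothetical and the list of existing IDs would come from OpenProject.

```python
import re

NEED_ID = re.compile(r"^CVN-N(\d{3})$")


def next_need_id(existing_ids: list[str]) -> str:
    """Return the next sequential need_id, ignoring anything that is not a Need ID."""
    numbers = [int(m.group(1)) for i in existing_ids if (m := NEED_ID.match(i))]
    return f"CVN-N{max(numbers, default=0) + 1:03d}"
```

Epic or Story identifiers such as CVN-N001-EB do not match the pattern and are skipped.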
Create an Epic under a Need¶
- Open the Need work package
- Relations tab → Add relation → "includes" → New work package
- Type Epic, set epic_id = <need_id>-E<letter> (A, B, C… per Need)
- Save
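The epic_id convention can be illustrated the same way. A minimal sketch, assuming single uppercase letters and that the first unused letter is taken; behaviour beyond Z is not specified in this playbook, so the hypothetical helper raises in that case.

```python
import string


def next_epic_id(need_id: str, existing_epic_ids: list[str]) -> str:
    """Return <need_id>-E<letter> using the first letter not yet used under this Need."""
    prefix = f"{need_id}-E"
    used = {e[len(prefix):] for e in existing_epic_ids if e.startswith(prefix)}
    for letter in string.ascii_uppercase:
        if letter not in used:
            return prefix + letter
    raise ValueError("more than 26 Epics under one Need is not covered by this sketch")
```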
Close an Epic¶
- Open the Epic
- Status → Closed
- Fill the epic's Closure section (template §7) in the description
- The parent Need's % complete updates automatically
Link a PR to a Story¶
The link is in both directions:
- OpenProject side: add the GitHub PR URL to the Story's Relations tab
- GitHub side: PR body must contain CVN-N<nnn>-US<m> (enforced by Phase 4 CI gate once live)
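The GitHub-side rule amounts to a pattern match on the PR body. The sketch below is an assumption about how such a check could look, not the Phase 4 CI gate itself; it assumes three digits for <nnn> and one or more digits for <m>.

```python
import re

STORY_REF = re.compile(r"\bCVN-N\d{3}-US\d+\b")


def story_refs(pr_body: str) -> list[str]:
    """Return every Story reference found in a PR body, in order of appearance."""
    return STORY_REF.findall(pr_body)


def pr_body_is_linked(pr_body: str) -> bool:
    """True when the PR body carries at least one Story reference."""
    return bool(story_refs(pr_body))
```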
Create a Release¶
- Work packages → Create → type Release
- release_id = CVN-R<yyyymmdd>-<n>
- Add all closed Epics as "part of" relations
- Attach backtest report links (URL or file)
- When deploy succeeds, set status → Deployed
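The release_id format can be generated from a date. This sketch assumes <n> is a counter starting at 1 for releases cut on the same day; the playbook does not define it further, and the function name is hypothetical.

```python
from datetime import date


def release_id(day: date, sequence: int) -> str:
    """Format a release identifier as CVN-R<yyyymmdd>-<n>."""
    if sequence < 1:
        raise ValueError("sequence is assumed to start at 1")
    return f"CVN-R{day:%Y%m%d}-{sequence}"
```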
Import a GitHub issue as an OpenProject work package¶
Use scripts/openproject_import_gh.py to lift an existing GitHub issue into OpenProject as a Need / Epic / Story / Release. The script is idempotent — re-running on the same issue updates the existing work package instead of creating a duplicate (matched by the github_issue_url custom field).
Prereqs (one-time):
- Project CVNTrade exists, types Need / Epic / Story / Release exist, and the custom fields listed in DESIGN_SYSTEM_BLUEPRINT.md §2.2 are attached to those types (done via the OpenProject admin UI; the API doesn't expose POST on types or custom fields).
- gh CLI authenticated (gh auth status returns OK).
- Admin API key generated via OpenProject UI → avatar → My account → Access tokens → API access key. Keep it local; don't commit.
Usage:
# Project convention: every Python invocation goes through .venv_airflow
# (see CLAUDE.md). The session bootstrap activates it automatically; the
# explicit ``source`` line below is the documented form when copy-pasting
# into a fresh shell.
source .venv_airflow/bin/activate
export OPENPROJECT_API_KEY=<your admin api key>
# Import a Need (top-level)
python scripts/openproject_import_gh.py --issue 608 --type Need --cvn-id CVN-N001
# Import an Epic under an existing Need
python scripts/openproject_import_gh.py --issue 624 --type Epic --cvn-id CVN-N001-EB \
--parent-cvn-id CVN-N001
# Dry-run to inspect the payload without writing
python scripts/openproject_import_gh.py --issue 608 --type Need --cvn-id CVN-N001 --dry-run
What gets populated:
- subject ← GitHub issue title
- description ← GitHub issue body (markdown preserved)
- status ← open → "In progress", closed → "Closed"
- cvn_id ← CLI arg
- github_issue_url ← issue URL (Link custom field)
- Parent relation ← via --parent-cvn-id (WP looked up by its cvn_id)
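The mapping above can be sketched as a pure function. It assumes the issue is a dict shaped like the output of gh issue view <n> --json title,body,state,url; the function name and the flat result dict are illustrative and do not reflect the internals of scripts/openproject_import_gh.py.

```python
STATUS_BY_STATE = {"open": "In progress", "closed": "Closed"}


def fields_from_issue(issue: dict, cvn_id: str) -> dict:
    """Map a GitHub issue to the fields populated on import."""
    state = issue["state"].lower()  # gh reports OPEN / CLOSED in upper case
    return {
        "subject": issue["title"],
        "description": issue["body"],  # markdown kept as-is
        "status": STATUS_BY_STATE[state],
        "cvn_id": cvn_id,
        "github_issue_url": issue["url"],
    }
```

An unknown state raises KeyError rather than silently picking a status.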
What is NOT populated (fill via UI afterward):
problem, impact, kpis, out_of_scope, objective, arch_notes, owner, acceptance_criteria, test_plan, models_promoted, epics_closed, backtest_report_url. These are long-form / structured fields the operator owns.
Re-run safety: the script detects an existing work package by github_issue_url and PATCHes it (with the correct lockVersion) — no duplicate, no lost edits as long as your local re-import carries the same or newer content.
Status on re-run: status is set only on the initial creation. A re-run does not touch the status field, so any workflow transitions the operator made in the UI are preserved. OpenProject's workflow policy gates transitions per role, and forcing one from the script would return HTTP 422.
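The re-run behaviour described above (PATCH with the current lockVersion, status left untouched) can be sketched as a payload builder. This is an illustration under assumptions, not the script's code: the function name is hypothetical, custom fields are omitted, and the description shape follows the OpenProject API v3 formattable convention (format plus raw).

```python
def build_patch_payload(existing_wp: dict, fields: dict) -> dict:
    """Build the PATCH body for a work package that already exists.

    existing_wp is the work package as last fetched from the API; its
    lockVersion must be echoed back so that concurrent edits are rejected
    instead of being overwritten.
    """
    return {
        "lockVersion": existing_wp["lockVersion"],
        "subject": fields["subject"],
        "description": {"format": "markdown", "raw": fields["description"]},
        # Status is deliberately omitted: it is set only on initial creation,
        # so workflow transitions made in the UI are preserved.
    }
```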
Backups¶
OpenProject DB carries all operator-entered data. Losses are unrecoverable without a backup.
- If running on a dedicated Scaleway PG instance: daily snapshots are configured at the PG level (7-day retention by default).
- If reusing the champollion instance: see infra/helm/openproject/README.md §Backup for the pg_dump CronJob option.