
CVNTrade MLOps Operating Procedure

Version: 3.0
Last updated: 2026-03-28
Scope: operation of the CVNTrade MLOps platform

Operational status:
- Grafana (5 dashboards): operational
- Airflow (10 DAGs): operational
- ZenML (7 native pipelines): operational
- MLflow (model registry): operational
- Prometheus (K8s metrics): operational
- Automatic alerting (Slack/SMS): target (issue #397)
- Drift monitoring: target (not yet implemented)
- Automatic diagnostics in Grafana: target (structured logs still to be migrated)


0. CVNTrade global state

Every operating session starts by determining the global state.

| State | Definition | Response |
|---|---|---|
| GREEN | Healthy infra, healthy pipelines, at least 1 recent useful run | normal operation |
| YELLOW | Controlled drift: weak anomaly signals | reinforced monitoring |
| ORANGE | Significant business or pipeline degradation | priority investigation |
| RED | Unavailability or inability to produce | immediate escalation |

Computation rules

| Condition | State |
|---|---|
| Infra critical (pod down, PVC full, OOMKill loop) | RED |
| Pipeline success rate < 70% | ORANGE |
| Pass rate = 0 on 3 consecutive runs | ORANGE |
| Model stale > 14 days on core crypto (btc-core) | ORANGE |
| 0 useful candidates on the full universe for > 3 days | ORANGE |
| Success rate 70-90% or recurring negative Sortino | YELLOW |
| Model stale > 7 days on non-core crypto | YELLOW |
| Everything else | GREEN |
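The rules above can be sketched as a small function. This is only an illustration, not the platform's implementation: the `Snapshot` type and its field names are hypothetical, the most severe matching rule is assumed to win, and a success rate of exactly 90% is assumed to count as GREEN.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for the inputs of the section 0 rules."""
    infra_critical: bool              # pod down, PVC full, or OOMKill loop
    pipeline_success_rate: float      # rolling rate, 0.0 to 1.0
    zero_pass_runs: int               # consecutive runs with pass rate = 0
    core_model_age_days: float        # age of the btc-core model
    noncore_model_age_days: float     # age of the oldest non-core model
    days_without_candidate: float     # on the full universe
    recurring_negative_sortino: bool


def global_state(s: Snapshot) -> str:
    # Assumption: rules are checked from most to least severe.
    if s.infra_critical:
        return "RED"
    if (s.pipeline_success_rate < 0.70
            or s.zero_pass_runs >= 3
            or s.core_model_age_days > 14
            or s.days_without_candidate > 3):
        return "ORANGE"
    if (s.pipeline_success_rate < 0.90
            or s.recurring_negative_sortino
            or s.noncore_model_age_days > 7):
        return "YELLOW"
    return "GREEN"


if __name__ == "__main__":
    print(global_state(Snapshot(False, 0.95, 0, 2, 9, 0, False)))
```

Keeping the rules in one ordered function makes the precedence explicit, which is what the target "automated global state" dashboard would also need.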

1. Purpose

This document describes the operating procedure for:
- determining the global state of the system,
- supervising the golden signals,
- launching the standard pipelines,
- diagnosing results,
- escalating according to severity.

Guiding principle: Grafana is the single entry point (ADR-26). The other tools are used for drill-down depending on the nature of the problem.


2. CVNTrade golden signals

These signals are the core of supervision. Every operator view must derive from them.

Platform

| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| infra_health | Saturation / OOM / PVC / availability | OOM=1, CPU>70% | OOM>2, PVC>85%, pod down |
| pipeline_success_rate | Rolling success rate (7 days) | 70-90% | < 70% |
| pipeline_latency | Rolling duration vs historical p95 | p95 + 30% | p95 + 100% |

ML (per crypto, ADR-27)

| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| f1_buy_delta_vs_baseline | f1_buy - baseline_f1_buy | delta < 15% | delta <= 0 |
| screening_testing_gap | screening_f1_buy - last_fold_f1_buy | gap > 0.15 | gap > 0.30 |
| std_f1_buy | Inter-fold stability | 0.05-0.10 | > 0.10 |

Trading

| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| sortino | Main financial KPI | 0 to 0.5 | < 0 |
| sl_pct | Share of stop-loss exits | 35-50% | > 50% |
| timeout_pct | Share of timeout exits | 30-40% | > 40% |

Lifecycle

| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| model_freshness | Age of the last useful model | > 7 days | > 14 days |
| challenger_count | Registered challengers not yet promoted | > 5 pending | > 10 pending |

Rule: these signals must be visible in Grafana without reading logs.

Observability debt: today, the ML and Trading signals are only visible in the Airflow logs (CORRELATION DATA section). Migrating them to Grafana is a priority target.
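As an illustration of how three of these thresholds translate into levels, here is a minimal sketch. The function names are hypothetical, and boundary values (for example a Sortino of exactly 0.5) are assumed to fall in the warning band.

```python
def latency_level(duration_s: float, p95_s: float) -> str:
    """pipeline_latency: warning above p95 + 30%, critical above p95 + 100%."""
    if duration_s > p95_s * 2.0:
        return "critical"
    if duration_s > p95_s * 1.3:
        return "warning"
    return "ok"


def sortino_level(sortino: float) -> str:
    """sortino: critical below 0, warning from 0 to 0.5 (boundary assumed)."""
    if sortino < 0:
        return "critical"
    if sortino <= 0.5:
        return "warning"
    return "ok"


def screening_gap_level(screening_f1_buy: float, last_fold_f1_buy: float) -> str:
    """screening_testing_gap: warning above 0.15, critical above 0.30."""
    gap = screening_f1_buy - last_fold_f1_buy
    if gap > 0.30:
        return "critical"
    if gap > 0.15:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(latency_level(4200, 3000), sortino_level(0.3),
          screening_gap_level(0.80, 0.60))
```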


3. Tool architecture

GRAFANA (grafana.cvntrade.eu)          <- SINGLE ENTRY POINT
  |-- global supervision (golden signals)
  |-- business, pipeline, model, and infra dashboards
  '-- centralized alerting (target #397)

AIRFLOW (airflow.cvntrade.eu)          <- EXECUTION
  |-- DAG trigger
  |-- run tracking
  '-- detailed logs (debug only, not for decisions)

ZENML (zenml.cvntrade.eu)              <- PIPELINE DRILL-DOWN
  |-- lineage
  |-- run history
  '-- versioned pipeline artifacts

W&B (wandb.ai)                         <- ML DRILL-DOWN
  |-- screening matrices
  |-- HPO trials
  '-- detailed comparison of ML runs

MLFLOW (mlflow.cvntrade.eu)            <- MODEL DRILL-DOWN
  |-- model registry
  |-- per-run metrics
  '-- model-related artifacts

PROMETHEUS (K8s internal)              <- INFRA METRICS
  '-- collects CPU/RAM/pod metrics, exposed in Grafana

CONSOLE-NEXT (console.cvntrade.eu)     <- IDP host (CVN-N012-EA)
  |-- /catalog       — service inventory (CVN-N012-EA-S02, runbook §catalog-add-service)
  |-- /dashboards    — Grafana embed module (CVN-N012-EA-S03, planned)
  '-- /tokens-preview — design token reference

4. Roles and responsibilities

| Role | Responsible for | Access |
|---|---|---|
| Operator | Morning check, standard launches, reading Grafana | Grafana (readonly), Airflow (trigger + logs) |
| MLOps maintainer | Pipeline incidents, registry, diagnostic metrics | Airflow (admin), MLflow (registry admin), ZenML |
| ML owner | Interpretation of models, features, labels, HPO | W&B, MLflow (readonly), ZenML |
| Platform owner | K8s, PVC, nodes, limits, ingress, secrets | kubectl (restricted admin), Helm |

Escalation

| Situation | Decision maker | Executor | Approver |
|---|---|---|---|
| Launch a discovery | Operator | Operator | - |
| Investigate a pipeline failure | MLOps maintainer | MLOps maintainer | - |
| Modify HPO / features / labels | ML owner | ML owner | MLOps maintainer |
| Promote challenger -> champion | ML owner | MLOps maintainer | Operator (business validation) |
| Killswitch / rollback | Operator (emergency) | Platform owner | - |
| Modify Helm / K8s resources | Platform owner | Platform owner | MLOps maintainer |

5. Grafana dashboards

| Dashboard | Main role | Usage |
|---|---|---|
| MLOps Overview | Executive view | Global state of models, screenings, pass rate |
| Testing & Backtest | Quality view | Testing results, WFRB, gates, rejections |
| Pipeline Health | Execution view | Run duration, success rate, HPO, anomalies |
| Model Registry | Model lifecycle view | Versions, freshness, active / stale models |
| Infra Monitoring | Platform view | Pods, CPU, RAM, PVC, OOMKill, saturation |

Recommended usage:
- start with MLOps Overview or Pipeline Health
- go to Infra Monitoring if a platform problem is suspected
- go to Testing & Backtest if the question is about result quality


6. Standard procedures

6.1 Morning check

Target duration: 2 to 5 minutes

Step 1: Determine the global state

Open Grafana > Infra Monitoring, then Pipeline Health. Determine the state (GREEN / YELLOW / ORANGE / RED) using the rules in section 0. If RED or ORANGE → go straight to section 9 (incidents).

Step 2: Platform health

Grafana > Infra Monitoring. Check that:
- critical pods are running,
- there is no recent OOMKill,
- CPU/RAM saturation is below thresholds,
- PVC usage is below the alert threshold.

Step 3: Pipeline health

Grafana > Pipeline Health. Check that:
- overnight runs have finished,
- the failure rate is within the norm,
- run duration is not drifting,
- there is no backlog.

Step 4: Business results

Grafana > MLOps Overview. Check:
- newly qualified candidates,
- pass rate within the norm,
- freshness of the latest useful runs.

6.2 Launching a discovery

Tool: Airflow. DAG: launch__discovery

Group run:

{"group": "defi"}

Targeted run:

{"group": "defi", "crypto": "SOLUSDC"}

Extended run:

{
  "group": "defi",
  "horizons": "H1,H2,H3,H4,H5,H6",
  "sl_range": "0.8,1.0,1.2,1.5",
  "tp_range": "1.5,2.0,2.5,3.0",
  "hpo_trials": 50,
  "backtest_days": 60
}

Tracking: Airflow (DAG run status), Grafana (results once the run completes).

Note: the ML/Trading thresholds are interpreted in the context of the run (universe, grid size, horizons).
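For scripted launches, the run configuration can be validated and wrapped before it is sent. The sketch below only builds the request; it does not call Airflow. It assumes the Airflow 2.x stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`) is enabled on this deployment, which this document does not state, and the `ALLOWED_KEYS` set is inferred from the three examples above rather than from the DAG definition.

```python
from __future__ import annotations

import json

# Assumption: the keys seen in the examples of section 6.2 are the accepted ones.
ALLOWED_KEYS = {
    "group", "crypto", "horizons", "sl_range",
    "tp_range", "hpo_trials", "backtest_days",
}


def build_discovery_request(conf: dict) -> tuple[str, str]:
    """Return (path, JSON body) for triggering launch__discovery."""
    if "group" not in conf:
        raise ValueError("'group' is required")
    unknown = set(conf) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    path = "/api/v1/dags/launch__discovery/dagRuns"
    body = json.dumps({"conf": conf}, sort_keys=True)
    return path, body


if __name__ == "__main__":
    print(build_discovery_request({"group": "defi", "crypto": "SOLUSDC"}))
```

Rejecting unknown keys early avoids a run that silently ignores a mistyped parameter.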

6.3 Analyzing results

Level 1: Grafana > Testing & Backtest

Look for: best qualified strategies, rejection reasons, F1/Sortino distribution.

Level 2: Airflow logs (debug only)

Look for the structured blocks === CORRELATION DATA ===. Distinguish:
- ML quality: f1_buy vs baseline_f1_buy, precision_buy, recall_buy, buy_rate
- financial quality: Sortino, tp/sl/timeout %, gates
- consistency: screening_f1_buy vs last_fold_f1_buy (same metric, same period; ADR-28, ADR-29)

Level 3: W&B drill-down: screening matrix, HPO trials, run comparison.

6.4 Checking a specific model

  1. Grafana > Model Registry: identify crypto / version
  2. MLflow: source run, metrics, artifacts, hyperparams
  3. ZenML: source pipeline, upstream artifacts, lineage

6.5 FTF — Fine-Tuning Framework Operations

6.5.1 Launching FTF runs

  1. Verify no stale runs:

    SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
    kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain
    
    If stale runs exist → kill them first (6.5.4).

  2. Trigger from Airflow UI or CLI. Run-level params only (ADR-65) — PTE, fold/trial counts and history months live in the Console (ftf_config.base_env), resolved at DAG parse time:

kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags trigger finetune__pte \
  --conf '{"factor":"calibration","crypto_group":"defi_top5","phase":"manual","power_mode":"standard"}'

To change the PTE / folds / trials / history: edit Console → Baseline Config (CVN_DEFAULT_PTE, CVN_DEFAULT_N_FOLDS, CVN_DEFAULT_N_TRIALS, CVN_DEFAULT_HISTORY_MONTHS) and re-trigger.

ADR-90 hyperparameter seeding (auto, issue #985). The canonical training hyperparameters (CVN_HPO_<MODEL>_<TF>_<PARAM>, ~481 keys) live in ftf_config.base_env and are auto-seeded on every deploy by a Helm post-install,post-upgrade hook Job in the cvntrade-runtime chart (templates/ftf-seed-hook-job.yaml). The seed is insert-missing-only, so operator Console edits are never clobbered (no --force-overwrite), and it is idempotent (already-present keys are skipped). No operator action is required; the seed is NOT a manual prerequisite.

- Fail-loud: if the hook Job fails, helm upgrade --wait fails → the Deploy Runtime step fails. The deploy is blocked; the Console is never left silently half-seeded.
- Recovery: inspect kubectl -n cvntrade logs job/cvntrade-runtime-ftf-seed (look for event=seed_db_connect_failed / event=seed_apply_failed + the event=seed_summary line). Most failures are DB connectivity / cvntrade-env-secrets / cvntrade-env-config issues: fix the secret/ConfigMap and re-run the deploy (the hook re-applies idempotently).
- Preview: to preview without writing, from any pod with the env: python scripts/seed_hyperparams_console.py --dry-run (runs offline if the DB is unreachable: prints the full plan, never crashes).

  3. Monitor the first 2 minutes and verify the sample count:

    POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune-pte-run-factor-crypto.*Running" | head -1 | awk '{print $1}')
    kubectl logs -n cvntrade "$POD" --tail=20 | grep "Train:.*samples"
    
     - Expected: Train: ~1000-2000 samples (CUSUM enabled)
     - If Train: >10000 samples: STOP. The cache is stale or CUSUM is misconfigured. See 6.5.3.
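The sample-count check can be automated on the captured log text. This is a sketch under an assumption: the log line is taken to look like `Train: 1432 samples`, while the document only shows the grep pattern `Train:.*samples`. The "review" outcome for counts between the two documented bands is also an assumption.

```python
import re

# Assumed log shape: "Train: <integer> samples" (optionally with thousands commas).
TRAIN_RE = re.compile(r"Train:\s*([\d,]+)\s*samples")


def check_train_samples(log_text: str) -> str:
    """Classify the first 'Train: N samples' line found in the log text."""
    match = TRAIN_RE.search(log_text)
    if match is None:
        return "unknown"            # line not found yet, or format differs
    n = int(match.group(1).replace(",", ""))
    if n > 10000:
        return "stop"               # stale cache or CUSUM misconfigured (6.5.3)
    if 1000 <= n <= 2000:
        return "ok"                 # expected range with CUSUM enabled
    return "review"                 # outside both documented bands


if __name__ == "__main__":
    print(check_train_samples("INFO Train: 1432 samples"))
```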

6.5.2 After Helm deploy with FTF config changes

MANDATORY — pods keep old config until killed.

  1. Verify new code deployed:

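    SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
    kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...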
    

  2. Kill ALL running FTF pods:

    kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
    

  3. Wait 30 seconds for termination.

  4. Trigger new runs (6.5.1).

6.5.3 Cache flush (MLflow feature store)

When: After CUSUM config change, after feature engineering change, or when sample counts are wrong.

What gets flushed: Feature selection (Level 4) and feature engineering caches. ETL, labels, and models are NOT affected.

Procedure:

  1. Kill all FTF pods first:

    kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
    

  2. Flush feature selection + feature engineering cache:

    SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
    kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
    import os, sys
    sys.path.insert(0, '/opt/airflow/src')
    import mlflow
    mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
    client = mlflow.tracking.MlflowClient()
    runs = client.search_runs(experiment_ids=['3'], max_results=500)
    deleted = 0
    for run in runs:
        name = run.data.tags.get('mlflow.runName', '')
        if 'feature_selection' in name or 'feature_eng' in name.lower():
            client.delete_run(run.info.run_id)
            deleted += 1
    print(f'Deleted {deleted} cache entries')
    "
    

  3. Verify cache is empty:

    kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
    import os, sys
    sys.path.insert(0, '/opt/airflow/src')
    import mlflow
    mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
    client = mlflow.tracking.MlflowClient()
    runs = client.search_runs(experiment_ids=['3'], max_results=500)
    names = [r.data.tags.get('mlflow.runName', '') for r in runs]
    left = [n for n in names if 'feature_selection' in n or 'feature_eng' in n.lower()]
    print(f'Remaining cache entries: {len(left)} (should be 0)')
    "
    

  4. Trigger new runs (6.5.1). Cache will regenerate automatically (~2 min extra on first run per crypto).

WARNING: Do NOT flush experiment 1 (ETL), 4 (Labels), 7 (HPO), or 8 (Models). These are independent of CUSUM config.

6.5.4 Killing stale FTF runs

# Kill all pods
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

# List runs still marked running/queued (this command only lists them; it does not change their state)
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain

6.5.5 Monitoring FTF runs

# Pod status
kubectl get pods -n cvntrade --no-headers | grep finetune

# Progress per pod
for pod in $(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | awk '{print $1}'); do
  # Count evaluated variants in the last 5000 log lines
  count=$(kubectl logs -n cvntrade "$pod" --tail=5000 | grep -c "event=weighted_variant_evaluated")
  # The crypto is taken from the first "Running factor=" line of the full log
  crypto=$(kubectl logs -n cvntrade "$pod" | grep -m1 "Running factor=" | grep -o "crypto=[A-Z]*")
  echo "$pod: $crypto completed=$count"
done

# CPU/memory usage
kubectl top pods -n cvntrade --no-headers | grep finetune

Grafana:
- Infrastructure Monitoring dashboard: shows FTF pods, CPU/memory, throttling.
- Fine-Tuning Results dashboard: shows results as they arrive in PostgreSQL.


7. CVNTrade diagnostic taxonomy

Every testing/backtest run must produce a primary diagnostic.

Lightweight s40 audit, flag CVN_S40_SKIP_S22A1_CROSSREF (CVN-N001-EI-S07 Lever #2, default stored in PG ftf_config.base_env, Console UI only): for a validation-integrity audit (s40, X/y-only), setting the flag to on skips the run_s22a1 anchor (300-round re-proof, p50 ~4 min / p95 ~14 min), which the probes do not use. The probe verdict is unchanged, but the run no longer certifies reproduction of the S22A1 symptom: the output then carries s22a1_status=SKIPPED + s22a1_anchor_available=false. A skipped audit must never be read as a full audit. Default: off for reproduction missions (S22→S28), on for lightweight s40 audits. The skip is refused for any diagnostic that is not xy_only (capture_requirements.py registry). A missing captured Parquet in skip mode ⇒ INCONCLUSIVE_TOOLING (never an automatic re-capture, ADR-25).

| Code | Meaning | Trigger signal |
|---|---|---|
| ML_USELESS | Model with no predictive value | f1_buy <= baseline_f1_buy (delta <= 0) |
| ML_MARGINAL | ML gain too small to survive execution | delta +0 to +5 pts vs baseline |
| ML_EXPLOITABLE | Exploitable signal, provided the envelope is favorable | delta +5 to +10 pts vs baseline |
| ML_SOLID | Solid signal | delta > +10 pts vs baseline |
| ML_UNSTABLE | Model unstable across periods | std_f1_buy > 0.10 |
| EXECUTION_MISMATCH | Unfavorable envelope: ML exploitable but negative PnL | f1_buy > baseline +5 pts and Sortino < 0 |
| SL_TOO_TIGHT | Suspected stop-loss too tight (dominant hypothesis, not proven) | sl_pct > 50% |
| TP_TOO_AMBITIOUS | Take-profit too ambitious | tp_pct < 20% |
| HORIZON_TOO_SHORT | Horizon too short | timeout_pct > 40% |
| SCREENING_OVERFIT | Screening/testing divergence | screening_f1_buy - last_fold_f1_buy > 0.30 |
| PIPELINE_DEGRADED | Degraded pipeline health | success_rate < 70% |
| INFRA_SATURATED | K8s / PVC / OOM saturation | OOMKill > 2, PVC > 85%, CPU > 90% |
| NO_CANDIDATES | No viable strategy | 0 candidates passed on a full run |

Decision tree (30 seconds)

1. f1_buy vs baseline_f1_buy?
   |
   |-- delta <= 0 pts           --> ML_USELESS
   |-- delta +0 to +5 pts       --> ML_MARGINAL
   |-- delta +5 to +10 pts      --> ML_EXPLOITABLE, check execution (step 2)
   |-- delta > +10 pts          --> ML_SOLID, check execution (step 2)
       |
       2. Sortino?
          |
          |-- Sortino > 0.5     --> strategy OK
          |-- Sortino 0 to 0.5  --> warning zone (section 2 threshold)
          |-- Sortino < 0       --> EXECUTION_MISMATCH, check:
              |
              3. Exit stats?
                 |-- sl_pct > 50%      --> SL_TOO_TIGHT
                 |-- tp_pct < 20%      --> TP_TOO_AMBITIOUS
                 |-- timeout_pct > 40% --> HORIZON_TOO_SHORT
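The tree can be read as a short function. A minimal sketch, assuming f1 values are on a 0-1 scale (so one point is 0.01) and exit stats are percentages on a 0-100 scale; the function name is hypothetical, and several exit-stat codes may be returned together since the tree does not say they are exclusive.

```python
def diagnose(f1_buy: float, baseline_f1_buy: float, sortino: float,
             sl_pct: float, tp_pct: float, timeout_pct: float) -> list:
    """Return the diagnostic codes reached by walking the decision tree."""
    delta_pts = (f1_buy - baseline_f1_buy) * 100   # assumption: f1 in [0, 1]
    if delta_pts <= 0:
        return ["ML_USELESS"]
    if delta_pts <= 5:
        return ["ML_MARGINAL"]
    codes = ["ML_EXPLOITABLE" if delta_pts <= 10 else "ML_SOLID"]
    if sortino < 0:
        codes.append("EXECUTION_MISMATCH")
        if sl_pct > 50:
            codes.append("SL_TOO_TIGHT")
        if tp_pct < 20:
            codes.append("TP_TOO_AMBITIOUS")
        if timeout_pct > 40:
            codes.append("HORIZON_TOO_SHORT")
    return codes


if __name__ == "__main__":
    print(diagnose(0.62, 0.50, -0.3, sl_pct=58, tp_pct=25, timeout_pct=17))
```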

8. Severity levels

| Level | Description | Examples | Response |
|---|---|---|---|
| P0 Critical | System down or corruption | Pipeline completely blocked, OOMKill loop, PVC full | Immediate escalation, platform owner |
| P1 Urgent | Significant degradation | 0 candidates on a full run, success rate < 70% | Investigation within the hour, MLOps maintainer |
| P2 Warning | Degraded performance | Negative Sortino, high SL rate, model stale > 14 days | Investigation within the day, ML owner |
| P3 Info | Optimization | Slower run, slightly declining pass rate | Backlog, handle when available |

9. Incident playbooks

9.1 EXECUTION_MISMATCH: ML useful but Sortino negative

Symptom: f1_buy >> baseline, Sortino < 0

Impact: the model detects BUY signals but the trades lose money

Hypotheses:
- SL_TOO_TIGHT: ATR multiplier too low
- TP_TOO_AMBITIOUS: TP multiplier too high
- HORIZON_TOO_SHORT: horizon too short

Checks:
1. sl_pct, tp_pct, timeout_pct in CORRELATION DATA
2. Compare the ATR ranges of passing and failing strategies
3. Check whether the pattern is specific to one crypto or general

Action:
- If sl_pct > 50%: relaunch with sl_range: "1.2,1.5,1.8,2.0"
- If tp_pct < 20%: reduce tp_range to "1.5,2.0,2.5"
- If timeout_pct > 40%: increase the horizons

Exit criterion: Sortino > 0 or sl_pct < 40%

Escalation: ML owner after 3 attempts without improvement


9.2 ML_USELESS: model with no predictive value

Symptom: f1_buy <= baseline_f1_buy

Impact: the model does no better than predicting "always BUY"

Hypotheses:
- Dataset too imbalanced (buy_rate > 50%)
- Uninformative features
- Misaligned HPO objective

Checks:
1. buy_rate in CORRELATION DATA
2. HPO action_rate (does the model predict enough BUYs?)
3. Compare with other cryptos from the same run

Action:
1. Check buy_rate: if > 50%, the problem is the labeling
2. Test with a different horizon (this changes the label distribution)
3. Review the HPO objective (is precision_recall_auc appropriate?)
4. As a last resort: review the feature set

Exit criterion: f1_buy > baseline + 15%

Escalation: ML owner


9.3 NO_CANDIDATES: no viable strategy

Symptom: 0 candidates passed on a full run

Impact: no exploitable strategy for this crypto/group

Checks:
1. Gate rejection reasons in Grafana > Testing & Backtest
2. If all rejected on Sortino: PTE problem (see 9.1)
3. If all rejected on n_trades: model too conservative (theta too high)
4. If all rejected on f1: ML problem (see 9.2)

Action:
- Relaunch with a wider grid (more horizons, wider ATR ranges)
- Test another crypto from the same group
- If recurring across a whole group: the group may not be viable

Exit criterion: at least 1 candidate passes the gates


9.4 INFRA_SATURATED: pod OOMKilled or PVC full

Symptom: OOMKill > 0 in Grafana > Infra Monitoring, or PVC > 85%

Impact: unstable pipeline, crashing runs

Checks:
1. Identify the affected pod in Grafana > Infra Monitoring
2. Check whether the problem is recurring or one-off
3. Compare memory consumption against limits

Action:
- One-off OOMKill: relaunch the run
- Recurring OOMKill: raise the memory limits in the Helm values
- PVC > 85%: clean up old artifacts (MLflow / S3)

Exit criterion: OOMKill = 0 over 24h, PVC < 70%

Escalation: Platform owner

Last resort only:

kubectl delete pod <name> -n cvntrade
Use only if the pod is genuinely stuck, the controller can recreate it, and the cause has been qualified.


9.5 Run too long (> 3h)

Symptom: run duration > p95 + 30%, or > 3h in absolute terms

Checks:
1. Identify the blocking step in Airflow
2. If HPO: check n_trials (50 = normal, > 100 = suspect)
3. If data fetch: check Binance API / S3 connectivity
4. If backtest: check the number of candles (60 days x 5 min candles = 17,280, about 17K = normal)

Action: depends on the identified cause

Exit criterion: duration < historical p95
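The candle count in check 4 and the symptom rule are simple arithmetic. A small sketch, with hypothetical function names:

```python
def expected_candles(backtest_days: int, candle_minutes: int) -> int:
    """Number of candles in a backtest window of the given length."""
    return backtest_days * 24 * 60 // candle_minutes


def run_too_long(duration_h: float, p95_h: float) -> bool:
    """Symptom from 9.5: duration above p95 + 30%, or above 3h absolute."""
    return duration_h > 3.0 or duration_h > p95_h * 1.3


if __name__ == "__main__":
    # 60 days of 5-minute candles: 60 * 288 = 17280, the "17K" quoted above.
    print(expected_candles(60, 5))
    print(run_too_long(2.8, 2.0))
```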


10. Available pipelines

Day-to-day operation (operator)

| Airflow DAG | Pipeline | Role |
|---|---|---|
| launch__discovery | pte__discovery | Screen -> Test -> WFRB -> Register challenger |
| launch__backtesting | pte__backtesting | Test -> WFRB from existing candidates |
| launch__walkforward | pte__walkforward | Walk-forward validation |
| launch__retrain | pte__retrain | Retrain + register |
| launch__monitoring | pte__monitoring | Health checks |

Controlled administration (MLOps maintainer / ML owner)

| Airflow DAG | Pipeline | Role | Approval required |
|---|---|---|---|
| launch__meta_training | pte__meta_training | Meta-label training | MLOps maintainer |
| launch__promotion | pte__promotion | Challenger -> Champion | ML owner + operator |

Emergency (platform owner)

| Airflow DAG | Role | Impact |
|---|---|---|
| pte__7_killswitch | Emergency quarantine | Disables a model |
| pte__8_rollback | Return to previous version | Replaces the champion |
| pte__10_decommission | Archiving / cleanup | Deletes a model |

Rule: never run killswitch, rollback, or decommission without explicit validation.


11. Crypto groups

| Group | Indicative universe |
|---|---|
| btc-core | BTCUSDC, ETHUSDC |
| defi | SOLUSDC, ADAUSDC, BONKUSDC, XRPUSDC |

Source of truth: the cvntrade_universes table in the PostgreSQL database.

Note: the ML/Trading thresholds are interpreted in the context of the group and the crypto. A 3% pass rate on a 450-point grid for a volatile crypto is not the same thing as a 3% pass rate on a 50-point grid for BTC.
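To make the comparison concrete, the same pass rate yields very different absolute numbers of passing grid points. A minimal arithmetic sketch (the function name is hypothetical):

```python
def passing_points(pass_rate_pct: float, grid_points: int) -> float:
    """Average number of grid points that pass at the given rate."""
    return grid_points * pass_rate_pct / 100


if __name__ == "__main__":
    print(passing_points(3, 450))  # 13.5 passing points on the large grid
    print(passing_points(3, 50))   # 1.5 passing points on the small grid
```

With about 13 passing points there is material to compare strategies; with 1 or 2, a single lucky configuration can account for the whole pass rate.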


12. Access

| Service | URL | Profile | Usage | Level |
|---|---|---|---|---|
| Grafana | grafana.cvntrade.eu | Operator | supervision | readonly |
| Airflow | airflow.cvntrade.eu | Operator / MLOps | execution + logs | trigger + readonly |
| MLflow | mlflow.cvntrade.eu | MLOps / ML owner | models | registry admin |
| ZenML | zenml.cvntrade.eu | MLOps / ML owner | lineage | readonly |
| W&B | wandb.ai | ML owner | ML analysis | organization account |
| K8s | kubectl | Platform owner | troubleshooting | restricted admin |
| Console (IDP) | console.cvntrade.eu | Operator | service catalog + dashboards | readonly (catalog edits via git PR, see runbooks/catalog-add-service.md) |

Rule: never document or share credentials in this document. All access is managed through Kubernetes secrets.


13. Operating rules (policy)

  1. Grafana first: Airflow logs are for debugging, not for decisions
  2. Airflow to execute, not to conclude
  3. MLflow for models, W&B for ML analysis
  4. ZenML for lineage
  5. No silent fallback in diagnostics (ADR-25)
  6. No destructive action before the problem has been qualified
  7. Every conclusion must distinguish ML, PTE/backtest, and infra
  8. Intra-crypto comparison only (ADR-27)
  9. 0 SELL is normal in binary mode (ADR-28)
  10. Every ML metric must be compared to the naive baseline (ADR-29)

14. Observability debt

The following items currently live in the Airflow logs and must migrate to Grafana:

| Item | Current source | Target |
|---|---|---|
| f1_buy vs baseline_f1_buy | CORRELATION DATA logs | Grafana dashboard |
| screening_f1_buy vs last_fold_f1_buy | CORRELATION DATA logs | Grafana dashboard |
| tp_pct / sl_pct / timeout_pct | CORRELATION DATA logs | Grafana dashboard |
| Primary diagnostic code | Logs (verdict) | Grafana dashboard |
| CVNTrade global state | Computed by hand | Grafana dashboard (automated) |

Prerequisite: write the diagnostics to PostgreSQL (not only to the logs) so that Grafana can query them.


15. Expert committee runbook (ADR-52, ADR-68)

The expert committee is the default channel for plan review (process step 3) and substantial PR review (process step 8). 5 expert personas + 1 consolidator produce a structured verdict (ACCEPTED / ACCEPTED-WITH-CHANGES / REJECTED + code).

15.1 When to invoke

| Trigger | session_type | Mandatory? |
|---|---|---|
| Implementation plan (process step 3) | plan_review | yes |
| PR touching src/commun/pipeline/, finetune/, cache/, backtest/, training, labels, prod trading | pr_review | yes |
| Interpretation of an experimental run (FTF, ablation) | experiment_review | recommended |
| Open strategy question | general | optional |

Exemption: docs / dashboards / config only (cf. ADR-68).

pr_review checklist, inter-task transport (ADR-0100, anti-regression for Epic CVN-N014-ED): for any PR touching dags/**, src/commun/finetune/** or scripts/**, check that no task improvises per-task intermediate storage (np.savez/np.load/put_object/upload_fileobj/to_parquet/to_pickle/a /tmp path/an S3 manifest) to cross a task/pod boundary. Require return/xcom_pull (the object-storage backend handles the offload, ADR-0100) or the shared S3 pass-by-reference (cvntrade_s3_manager, cf. s43_io). Exception: bounded single-pod capture (producer and consumer in the same pod). This is the human counterpart of the CodeRabbit gate; guideline: process/INTER_TASK_DATA_TRANSPORT.md.
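A reviewer can pre-screen a changed file for the calls listed in the checklist before reading it in detail. This is only a sketch of such a helper, not the CodeRabbit gate: the function name is hypothetical, it matches plain substrings, and every hit still needs human judgment (for example to recognize the single-pod exception).

```python
import re

# Call names taken from the checklist above.
SUSPECT_CALLS = [
    "np.savez", "np.load", "put_object",
    "upload_fileobj", "to_parquet", "to_pickle",
]
SUSPECT_RE = re.compile("|".join(re.escape(name) for name in SUSPECT_CALLS))


def flag_lines(source: str) -> list:
    """Return (line_number, stripped_line) for each line using a suspect call."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if SUSPECT_RE.search(line)
    ]


if __name__ == "__main__":
    sample = "import numpy as np\nnp.savez('/tmp/x.npz', a=a)\nreturn result\n"
    print(flag_lines(sample))
```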

15.2 Preparing the dossier

Convention: one self-contained markdown file in documentation/reviews/YYYY-MM-DD-<slug>.md.

Recommended sections:
1. Request to the reviewer: explicit questions to ask (3-7)
2. Project context: 1 paragraph for someone who does not know CVNTrade
3. Observed problem: figures, not adjectives
4. What has already been tried: explicit outcomes
5. Hypothesis and proposed plan: concrete, testable
6. What was ruled out (and why): anti-suggestions
7. Identified risks and mitigations
8. Checkable success criteria

The committee has no filesystem access: anything that is not in the dossier does not exist for it.

15.3 Launching the session

source .venv_airflow/bin/activate
python scripts/expert_committee.py \
  --artifact documentation/reviews/2026-04-26-track-a.md \
  --question "Question concise listant les decisions a valider" \
  --session-type plan_review \
  --issue "#690"

Useful options:
- --experts expert-ml-engineer,expert-data-scientist: subset of experts when the question is narrow
- --no-consolidation: raw opinions only (useful for debugging)
- --dry-run: compiles and prints the prompts without calling the LLMs (free)
- --model gemini/gemini-2.5-pro: overrides the default model

Output:
- committee/sessions/{session_id}_committee.json: full verdict + opinions
- committee/sessions/{session_id}_artifact.md: archived copy of the artifact
- FinOps log in committee/finops.jsonl

Typical cost: $0.10 to $0.30 per session (Mistral large + Gemini 2.5 flash). Above $2 → shrink the dossier rather than retry.

15.4 Reading the verdict

GUI: make committee-gui (port 8502) → navigate to the session by ID or date.

Key fields in the JSON:
- verdict.status: ACCEPTED / ACCEPTED_WITH_CHANGES / REJECTED
- verdict.code: structured code (e.g. METHODOLOGY_FLAW, INSUFFICIENT_EVIDENCE)
- verdict.consensus_strength: strong / weak / split
- verdict.blockers: list to address before resubmission if REJECTED
- verdict.recommendations: prioritized, numbered actions
- verdict.areas_of_dissent: where the experts do NOT agree, often more instructive than the areas of consensus
- expert_opinions[].score: 0-10 per expert; an isolated score < 4 signals either a blind spot in the dossier or a deep disagreement
- finops.totals: tokens and cost in USD
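The fields above are enough to script a first read of a session file. A minimal sketch: the field names come from the list above, but the exact JSON layout beyond those names, the `summarize_session` helper, and the sample values are assumptions for illustration only.

```python
def summarize_session(session: dict) -> dict:
    """Extract what an operator needs first from a committee session dict."""
    verdict = session["verdict"]
    scores = [op["score"] for op in session.get("expert_opinions", [])]
    low_scores = [s for s in scores if s < 4]
    return {
        "status": verdict["status"],
        # Blockers only matter before resubmission of a REJECTED dossier.
        "must_address": verdict.get("blockers", [])
        if verdict["status"] == "REJECTED" else [],
        # Exactly one expert below 4 among several: blind spot or deep disagreement.
        "isolated_low_score": len(low_scores) == 1 and len(scores) > 1,
        "dissent": verdict.get("areas_of_dissent", []),
    }


if __name__ == "__main__":
    # Made-up sample, in practice loaded with json.load() from the session file.
    sample = {
        "verdict": {"status": "REJECTED", "blockers": ["no baseline"],
                    "areas_of_dissent": ["label horizon"]},
        "expert_opinions": [{"score": 7}, {"score": 6}, {"score": 2}],
    }
    print(summarize_session(sample))
```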

15.5 After a REJECTED

Three admissible responses:
1. Address the blockers and resubmit (increment the dossier slug with -v2)
2. Reduce the scope to remove the blockers (e.g. drop the contested part of the plan and handle it elsewhere)
3. Explicit waiver in the issue with a written justification ("the committee suggests X but Y because Z")

Silence blocks the next process step (cf. ADR-68 invariant "REJECTED is not optional to address").

15.6 After an ACCEPTED / ACCEPTED-WITH-CHANGES

  • Log the session_id in the implementing commit (addresses committee session b2e4c384)
  • For PR reviews: copy the session JSON link into the PR description
  • If ACCEPTED-WITH-CHANGES: address the high-priority recommendations before merge

15.7 Logging the session as an OP Meeting (ADR-82, mandatory)

Every successful plan_review / pr_review / experiment_review MUST be logged as a Meeting object in OpenProject, immediately after the expert_committee.py command (24h at most; beyond that the review is considered non-compliant with ADR-68 + ADR-82).

# kubectl access required (POST /api/v3/meetings is not exposed through the public API in OP 17.3.1)
set -a && . .env && set +a
python scripts/op_save_committee_as_meeting.py \
  --session committee/sessions/{session_id}_committee.json \
  --linked-wp <wp_id>

--linked-wp is MANDATORY for sessions of type plan_review / pr_review / experiment_review (the meeting appears under the "Meetings" tab of the WP). Only general sessions can omit it. Note: session_type is read from the session JSON (session["verdict"]["session_type"]); this script has no --session-type CLI flag.

The script is idempotent:
- 1st run → creates the Meeting + 5 locked Users (one per expert) + agenda items, and uploads the 2 attachments
- 2nd run on the same session → EXISTING_MEETING_ID=<n> + skip ... already attached

Expected output:

[op_save_committee_as_meeting] pod=openproject-... session=<id> linked_wp=<wp>
[attach] uploaded <id>_committee.json (attachment id=...)
[attach] uploaded <id>_artifact.md (attachment id=...)
CREATED_MEETING_ID=<n>
CREATED_MEETING_URL=https://openproject.cvntrade.eu/meetings/<n>
AGENDA_ITEMS=7

Check in the UI: https://openproject.cvntrade.eu/projects/cvntrade/meetings?upcoming=false (Past tab). The default filter ?upcoming=true does NOT show past sessions (a committee session is always in the past once logged).

The CREATED_MEETING_URL must be pasted alongside the session JSON link in the PR description / OP Story comment. The two references coexist: the JSON is the source of truth for content, the Meeting is the queryable surface.

15.8 Common errors

| Symptom | Cause | Fix |
|---|---|---|
| Systematic REJECTED verdict | dossier too short or vague questions | enrich the dossier, explicitly list the decisions to validate |
| An expert scores 0/10 for no reason | JSON parse failure (the LLM returned markdown instead of pure JSON) | check committee/sessions/parse_failures/; relaunch the session |
| Cost > $2 for one session | artifact > 200k chars | trim it by hand; the loader truncates automatically but loses context |
| Langfuse 404 on review-consolidator | prompt not synchronized to Langfuse | non-blocking: the local fallback takes over |

16. Sprint orchestration runbook (ADR-69)

OpenProject est l'orchestrateur du projet. Toute activité de dev démarre en sélectionnant une Story dans une version (sprint) open de la roadmap, et se termine en fermant la Story (et la version quand sa dernière Story ferme).

URLs clés : - Roadmap : https://openproject.cvntrade.eu/projects/cvntrade/roadmap - Versions admin : https://openproject.cvntrade.eu/projects/cvntrade/settings/versions - WPs filtrés par version : https://openproject.cvntrade.eu/projects/cvntrade/work_packages → filtre Version

Convention de nommage des versions : <epic-shortname>-<phase>-<descriptor> (ex: F1B-S1-QW-PhaseA). Une version mélangeant des Stories de plusieurs Epics viole l'invariant ADR-69.

16.1 Pick a Story (start)

  1. Roadmap → identify the open version covering the current period
  2. List its Stories in New or In progress; follow the order given in the Epic doc (documentation/epics/<epic>.md §3 or canonical plan §6)
  3. Check the single-WIP rule: no other Story already In progress for yourself
  4. Click the Story → Status → In progress
  5. Retrieve the cvn_id (e.g. CVN-N001-EE-S01) and the github_issue_url from the Details panel
  6. Open / create the branch: feat/cvn-n001-ee-s01-<slug> (the cvn_id must appear)
  7. Start step 1 of the CLAUDE.md dev process
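
As an illustration of step 6, here is a minimal sketch of a helper that derives the branch name from a cvn_id. The function name `branch_name` and the slug normalisation rule are assumptions made for this example; the only part taken from the runbook is the `feat/<lower-cased cvn_id>-<slug>` shape.

```python
import re


def branch_name(cvn_id: str, slug: str, prefix: str = "feat") -> str:
    """Build a branch name that embeds the lower-cased cvn_id (hypothetical helper)."""
    # Normalise the free-text slug: lower-case, runs of other characters become one hyphen.
    clean_slug = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    if not clean_slug:
        raise ValueError("slug must contain at least one alphanumeric character")
    return f"{prefix}/{cvn_id.lower()}-{clean_slug}"


print(branch_name("CVN-N001-EE-S01", "Label smoothing sweep"))
```

Keeping the cvn_id as a literal prefix of the branch makes it easy to grep later from the PR or from CI.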

If the Story's github_issue_url is empty, create the issue, then:

source .venv_airflow/bin/activate
export OPENPROJECT_API_KEY=<...>
python scripts/openproject_import_gh.py --issue <NNN> --type Story \
  --cvn-id <CVN-...-S0X> --parent-cvn-id <CVN-...-EE>

16.2 During implementation

  • For EVERY OP status transition (e.g. Specified → In progress), use scripts/op_story_transition.py --wp <id> --to "<state>" --verdict "..." --evidence "...": it applies the ADR-81 ritual (status resolved by name, guard on legal edges, mandatory verdict/evidence comment). Details: STORY_WORKFLOW §2.
  • The Story's cvn_id MUST appear in: the branch name, the PR title, and the PR description
  • If the Story runs past the end of the version: add an OP comment explaining the slip, and move the Story to the next version (PATCH _links.version or via the UI). Do NOT leave it orphaned.
  • If work must stop before completion: Story status In progress → New, plus an OP comment explaining the interruption
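
The cvn_id rule (branch name, PR title, PR description) can be checked mechanically. The sketch below is only an illustration: the function name `missing_cvn_id_refs` and the case-insensitive matching are assumptions, not something the runbook or the CI workflow defines.

```python
def missing_cvn_id_refs(cvn_id: str, branch: str, pr_title: str, pr_body: str) -> list[str]:
    """Return the names of the places where the cvn_id does not appear (hypothetical helper)."""
    needle = cvn_id.lower()
    fields = {"branch": branch, "pr_title": pr_title, "pr_body": pr_body}
    # Case-insensitive match, because branch names carry the id in lower case.
    return [name for name, text in fields.items() if needle not in text.lower()]


missing = missing_cvn_id_refs(
    "CVN-N001-EE-S01",
    branch="feat/cvn-n001-ee-s01-label-smoothing",
    pr_title="feat(labels): CVN-N001-EE-S01 label smoothing sweep",
    pr_body="Implements the sweep described in the plan dossier.",
)
print(missing)
```

An empty list means the three references are in place; anything else names the field to fix before requesting review.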

16.3 Close a Story (close)

Mandatory DoD before closing (combined CLAUDE.md steps 12-13):
  - [ ] PR merged to main
  - [ ] CI green
  - [ ] System tests passed (cf. CLAUDE.md step 12)
  - [ ] GH issue closed with a validation comment
  - [ ] (If the Story belongs to an FTF Epic) per-track gate validated: f1_buy ≥ +0.015, expectancy_net ≥ baseline, etc. (cf. the Epic doc)

Then in OP:
  1. Story → Status → Closed
  2. Comment with: PR link, squash-merge SHA, committee pr_review session link if applicable
  3. Check that the Story disappears from the In progress list on the board

16.4 Close a version (sprint roll-over)

A version closes when all its Stories are Closed AND the version gate is validated. Procedure:

  1. Check completeness: every Story in the version must be in status Closed (filter the WP list by Version: <name> + Status: !Closed → must be empty)
  2. Check the version gate: open the associated Epic (documentation/epics/<epic>.md §4) and confirm that the acceptance criteria attributable to this version are met. For FTF Epics this is typically:
     • cumulative f1_buy lift ≥ threshold (e.g. +0.05 with CI95 excluding 0)
     • expectancy_net ≥ baseline
     • sortino ≥ baseline
     • max_drawdown ≤ baseline + 1%
  3. If the gate passes:
     • OP UI → Settings → Versions → click the version → Edit
     • Status open → closed
     • Description: add a closure note in markdown:
        ## Outcome (closed YYYY-MM-DD)
        - Stories shipped: <list cvn_ids>
        - Gate result: PASSED
        - Evidence: <MLflow run_id, baseline ftf_…, committee session id>
        - Lessons: <1-3 bullets>

  4. If the gate fails:
     • Open a GH issue [gate-failure] <epic> <version> with the diagnosis
     • Apply the Epic's §6 escalation (e.g. F1_buy boost = "evaluate big-bet bundle if QW < +0.05")
     • Close the version with the note Gate result: FAILED → see issue #NNN
  5. Run the retrospective: 15 min of review, focusing on:
     • Which falsifiable hypotheses held / failed (ADR-29 naive baseline)
     • Planned vs actual time per Story
     • Which ADR-58 guardrails fired
     • A feedback_*.md memory to write if a lesson is durable
  6. Open the next version: if not already created, follow §16.5
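
The typical FTF gate criteria listed in the procedure can be written down as a small check. This is a sketch under stated assumptions: the `Metrics` dataclass, the function name `version_gate`, the default threshold of +0.05, and the reading of "baseline + 1%" as one percentage point of drawdown (0.01 as a fraction) are all illustrative. The authoritative thresholds live in the Epic doc.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    expectancy_net: float
    sortino: float
    max_drawdown: float  # expressed as a fraction, e.g. 0.12 for 12 %


def version_gate(
    candidate: Metrics,
    baseline: Metrics,
    f1_buy_lift: float,
    f1_buy_ci95: tuple[float, float],
    lift_threshold: float = 0.05,   # assumed default, taken from the "+0.05" example
    dd_tolerance: float = 0.01,     # assumed reading of "baseline + 1%"
) -> dict[str, bool]:
    """Evaluate each gate criterion separately so a failure is easy to attribute."""
    ci_low, ci_high = f1_buy_ci95
    ci_excludes_zero = not (ci_low <= 0.0 <= ci_high)
    return {
        "f1_buy_lift": f1_buy_lift >= lift_threshold and ci_excludes_zero,
        "expectancy_net": candidate.expectancy_net >= baseline.expectancy_net,
        "sortino": candidate.sortino >= baseline.sortino,
        "max_drawdown": candidate.max_drawdown <= baseline.max_drawdown + dd_tolerance,
    }


checks = version_gate(
    candidate=Metrics(expectancy_net=0.004, sortino=1.3, max_drawdown=0.115),
    baseline=Metrics(expectancy_net=0.003, sortino=1.2, max_drawdown=0.11),
    f1_buy_lift=0.06,
    f1_buy_ci95=(0.02, 0.10),
)
print("PASSED" if all(checks.values()) else f"FAILED: {checks}")
```

Returning one boolean per criterion, not a single verdict, gives the closure note or the [gate-failure] issue something concrete to quote.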

16.5 Create a new version

Via the API (a script like /tmp/op_sprints.py from 2026-04-27) or the UI:
  1. UI: Settings → Versions → New version
  2. Name following the convention <epic-shortname>-<phase>-<descriptor>
  3. Start date + End date (default cadence 2 weeks, transition sprint 1 week)
  4. Description: sprint goal + planned Stories + expected gate
  5. Status open
  6. Assign the planned Stories: for each target WP, Details panel → Version → choose the new version
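
For the API route, the sketch below only builds the request body; it does not reproduce /tmp/op_sprints.py, which is not shown in this runbook. The field names (`name`, `description.raw`, `startDate`, `endDate`, `status`, `_links.definingProject`) follow my understanding of the OpenProject API v3 version resource and should be checked against the instance's API docs before use. The function name and the project id `1` are hypothetical.

```python
from datetime import date, timedelta


def version_payload(name: str, start: date, project_id: int, description: str, weeks: int = 2) -> dict:
    """Build a JSON body for creating an open version (assumed API v3 field names)."""
    end = start + timedelta(weeks=weeks)  # default cadence: 2 weeks, 1 for a transition sprint
    return {
        "name": name,
        "description": {"raw": description},
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "status": "open",
        "_links": {"definingProject": {"href": f"/api/v3/projects/{project_id}"}},
    }


payload = version_payload(
    name="F1B-S1-QW-PhaseA",
    start=date(2026, 5, 4),
    project_id=1,  # hypothetical numeric project id
    description="Sprint goal + planned Stories + expected gate",
)
print(payload["startDate"], "→", payload["endDate"])
```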

16.5b Sprint version: Stories yes, Epics and Needs no

Canonical convention (formalized 2026-04-29 after an audit found 5 Epics, CVN-N012-EA → -EE, polluted with F1B-Backlog, created via the OP UI):

  • Story.version: always set (Backlog or active sprint). It is the unit of commitment; single-WIP applies
  • Epic.version: always NONE. An Epic aggregates Stories across several sprints; assigning it a version creates a false promise of closure
  • Need.version: always NONE. It is a strategic objective on a multi-sprint / multi-month timeline

Check the current state with a Python script that resolves the types and the cvn_id custom field dynamically (same patterns as scripts/openproject_import_gh.py, see #668) and paginates the work-package list so that no violation beyond the first page is missed:

python3 - <<'PY'
"""Audit: Epics and Needs must never have a version assigned.

No hard-coded IDs: the Epic/Need types and the cvn_id custom-field key
are resolved at run time via /api/v3/types and /api/v3/work_packages/schemas/...
(robust to re-ordering in the admin UI). Paginates the WP list to avoid
a false "OK" on projects larger than one page.
"""
import base64
import json
import os
import urllib.parse
import urllib.request

API_KEY = os.environ["OPENPROJECT_API_KEY"]
BASE = "https://openproject.cvntrade.eu"
PROJECT = "cvntrade"
PAGE_SIZE = 100  # OpenProject cap is typically 200, 100 is safe

def op_get(path: str) -> dict:
    """GET an API v3 path with API-key Basic auth and return the parsed JSON."""
    req = urllib.request.Request(f"{BASE}{path}")
    auth = "Basic " + base64.b64encode(f"apikey:{API_KEY}".encode()).decode()
    req.add_header("Authorization", auth)
    with urllib.request.urlopen(req, timeout=30) as r:
        return json.load(r)

# Resolve type IDs by name (Epic, Need)
types = op_get("/api/v3/types").get("_embedded", {}).get("elements", [])
type_id_by_name = {t["name"]: t["id"] for t in types}
epic_id = type_id_by_name.get("Epic")
need_id = type_id_by_name.get("Need")
if not epic_id or not need_id:
    raise SystemExit(f"Could not resolve Epic/Need type IDs (got Epic={epic_id}, Need={need_id})")

# Resolve cvn_id customField key (use the Epic schema as a representative)
project = op_get(f"/api/v3/projects/{PROJECT}")
schema = op_get(f"/api/v3/work_packages/schemas/{project['id']}-{epic_id}")
cvn_id_key = next(
    (k for k, v in schema.items() if k.startswith("customField") and isinstance(v, dict) and v.get("name") == "cvn_id"),
    None,
)
if cvn_id_key is None:
    raise SystemExit("Custom field 'cvn_id' not found on Epic schema — fix in OP admin UI")

# Paginate through Epic + Need work packages.
# OpenProject API v3 uses `offset` as the 1-based PAGE NUMBER (not item index).
# https://www.openproject.org/docs/api/collections/
filters = json.dumps([{"type": {"operator": "=", "values": [str(epic_id), str(need_id)]}}])
violations = []
page_num = 1
seen = 0
while True:
    qs = urllib.parse.urlencode({"filters": filters, "pageSize": PAGE_SIZE, "offset": page_num})
    page = op_get(f"/api/v3/projects/{PROJECT}/work_packages?{qs}")
    elems = page.get("_embedded", {}).get("elements", [])
    if not elems:
        break
    for e in elems:
        v = e.get("_links", {}).get("version", {}).get("title")
        if v:
            violations.append((e["id"], e.get(cvn_id_key) or "?", v))
    seen += len(elems)
    total = page.get("total", 0)
    if seen >= total:
        break
    page_num += 1

if violations:
    print("VIOLATIONS:")
    for wp_id, cvn_id, version in violations:
        print(f"  wp#{wp_id} ({cvn_id}): version={version}")
    raise SystemExit(1)
print("OK — no Epic/Need has a version assigned (audited in full, paginated)")
PY

Strip a wrongly assigned version:

WP_ID=<wp_id>
LOCK_VERSION=$(curl -s -u "apikey:$OPENPROJECT_API_KEY" \
  "https://openproject.cvntrade.eu/api/v3/work_packages/$WP_ID" | jq -r .lockVersion)
curl -s -u "apikey:$OPENPROJECT_API_KEY" \
  -X PATCH -H "Content-Type: application/json" \
  -d "{\"lockVersion\": $LOCK_VERSION, \"_links\": {\"version\": {\"href\": null}}}" \
  "https://openproject.cvntrade.eu/api/v3/work_packages/$WP_ID"

The canonical scripts/openproject_import_gh.py never sets version (correct by default). Ad-hoc scripts (/tmp/_create_*.py or others) must follow the convention: version on Stories only, never on Epics / Needs.

16.6 Stories-by-version view

Quick list via URL filter:

https://openproject.cvntrade.eu/projects/cvntrade/work_packages?query_props={"f":[{"n":"version","o":"=","v":["<version_id>"]}]}

Or via the API:

curl -s -u "apikey:$OPENPROJECT_API_KEY" \
  "https://openproject.cvntrade.eu/api/v3/projects/cvntrade/work_packages?filters=%5B%7B%22version%22%3A%7B%22operator%22%3A%22%3D%22%2C%22values%22%3A%5B%22<id>%22%5D%7D%7D%5D"

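The percent-encoded filters parameter in the API call is easier to maintain when generated than when edited by hand. A minimal sketch, where the function name `stories_by_version_url` is hypothetical and the filter structure is the one already used in the curl example:

```python
import json
from urllib.parse import quote

BASE = "https://openproject.cvntrade.eu"


def stories_by_version_url(version_id: int, project: str = "cvntrade") -> str:
    """Return the API v3 work-package URL filtered on a single version id."""
    filters = [{"version": {"operator": "=", "values": [str(version_id)]}}]
    # Compact separators keep the encoded form identical to the hand-written one.
    compact = json.dumps(filters, separators=(",", ":"))
    return f"{BASE}/api/v3/projects/{project}/work_packages?filters={quote(compact, safe='')}"


print(stories_by_version_url(42))
```

The version id `42` is a placeholder; the real id is visible in the version's URL in the OP admin UI.
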
16.7 Common errors

| Symptom | Cause | Fix |
| --- | --- | --- |
| Story without github_issue_url | importer not run OR custom field not attached to the Story type | run openproject_import_gh.py; check the CFs under Settings → Work package types |
| Several Stories In progress at once | single-WIP violation | move the others back to New with a comment; keep only one |
| Version stays open after the last Story is closed | gate review step skipped | run §16.4 (check the gate, add the closure note, close) |
| Story started without an OP entry | ADR-69 violation | create the Story retroactively in the current version before merge |
| _links.version PATCH returns 422 | stale lockVersion (concurrent edit in the UI) | re-GET the WP, take the new lockVersion, retry |
| Story pulled from the Backlog without justification | ADR-69 invariant violation | add an OP comment explaining why the priority changed; otherwise put the Story back in its planned sprint |
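
The "re-GET, take the new lockVersion, retry" fix for a stale lockVersion can be sketched as a small retry loop. Everything named here is hypothetical: `StaleLockVersion`, `patch_version_with_retry`, and the two injected callables stand in for whatever HTTP client wraps the GET and PATCH calls. Injecting them keeps the sketch runnable without network access.

```python
from typing import Callable, Optional


class StaleLockVersion(Exception):
    """Raised by the injected patch callable when the server rejects the lockVersion."""


def patch_version_with_retry(
    get_lock_version: Callable[[int], int],
    patch: Callable[[int, dict], None],
    wp_id: int,
    version_href: Optional[str],
    max_attempts: int = 3,
) -> int:
    """PATCH _links.version, re-reading lockVersion before every attempt.

    Returns the number of attempts used; re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        body = {
            "lockVersion": get_lock_version(wp_id),  # always a fresh value
            "_links": {"version": {"href": version_href}},  # None strips the version
        }
        try:
            patch(wp_id, body)
            return attempt
        except StaleLockVersion:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

Passing `version_href=None` reproduces the strip operation from §16.5b; passing a version href moves the Story to that version.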

16.8 Guardrails CI — bypass + audit (CVN-N011-EA-S12)

The .github/workflows/pr-workflow-guardrails.yml workflow enforces the 4 gates G1-G4 (PR title / Story ref / plan dossier / MLOps readiness) when each PR is opened. Bypass mechanisms, in order of preference:

  1. Bot auto-bypass: dependabot[bot], github-actions[bot], renovate[bot] are skipped automatically (the workflow does not run).
  2. Global kill switch (emergencies only): Settings → Variables → Actions → New repository variable with GUARDRAILS_KILL_SWITCH=true. Disables the workflow without deleting the file. Remove it immediately after the emergency.
  3. Per-PR waiver by label: apply the guardrails-waiver label to the PR. The main job is skipped, but a guardrails-waiver-audit audit job emits a ::warning:: that appears in the checks list. Obligation: add a ## Guardrails waiver section to the PR body explaining which gates are bypassed and why.

Monthly audit (committee 98ca88b1 reco #2): during the version retro (§16.4), list the PRs merged with guardrails-waiver via:

gh pr list --state merged --label guardrails-waiver --search "merged:>=YYYY-MM-DD" --json number,title,body

For each waiver, check that the justification in the body is readable and that the bypass is covered retroactively (fix Story, follow-up issue, etc.). A trend of more than 1 waiver per sprint signals a guardrail that is too strict or a process gap to reprioritize.
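
A first pass of that review can be automated on the JSON printed by the gh command. The sketch below is illustrative: the function name is hypothetical, and it only checks that the body contains the ## Guardrails waiver heading, not that the justification is convincing.

```python
import json

WAIVER_HEADING = "## guardrails waiver"


def waivers_missing_justification(gh_json: str) -> list[int]:
    """Return the PR numbers whose body lacks the waiver section heading."""
    prs = json.loads(gh_json)
    return [
        pr["number"]
        for pr in prs
        # gh may return a null body for PRs opened without a description.
        if WAIVER_HEADING not in (pr.get("body") or "").lower()
    ]


sample = json.dumps([
    {"number": 801, "title": "hotfix A", "body": "## Guardrails waiver\nG3 bypassed: hotfix."},
    {"number": 802, "title": "hotfix B", "body": "quick fix"},
    {"number": 803, "title": "hotfix C", "body": None},
])
print(waivers_missing_justification(sample))
```

The PR numbers in `sample` are made up for the example; in practice the input is the stdout of the gh command.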


17. Incident log

Append-only chronological log of incidents that affected production, FTF runs, or the operator workflow. Entries follow a fixed format so future operators (and AI assistants) can grep / scan quickly. Per committee fd9317be recommendation #4, every CRITICAL severity bug merits an entry here even if the code fix is small ; the goal is organizational learning, not blame.

Severity per §8 :

  • P0 — production trading down or losing money
  • P1 — production degraded, observable in prod metrics
  • P2 — FTF / training pipeline broken, no live trading impact, but lock decisions blocked
  • P3 — operator workflow friction, no data integrity impact

17.1 2026-04-28 — Track 5 FTF sweep type-mismatch failure (P2)

Time : 2026-04-28 12:25 UTC trigger → 12:27 UTC first failure observed → 14:50 UTC hotfix merged → ~14:55 UTC operator re-trigger.

Severity : P2. FTF sweep broken, lock decision for Track 5 (CVN-N001-EE-S01) blocked. No live trading impact (ADR-71 + EG-S06 flatten_all gate still in place; Track 5 is FTF-only by design).

Symptom :

ValueError: Type requirement mismatch.
Expected X_train:<class 'numpy.ndarray'>
got [...DataFrame with columns open, xgb_accel_amplitude_ratio_24_grp2, ...]

About 75% of the 125 expected rows (all Hamilton-driven variants: label_smoothing in {mild, aggressive} + cleanlab in {filter, reweight}) failed at the apply_label_pipeline call. Only the baseline identity short-circuit (label_smoothing=none × cleanlab=off) survived.

Root cause : apply_label_pipeline (Track 5 PR #734 commit 77aa6389) invokes a Hamilton driver whose nodes type X_train: np.ndarray. Hamilton's validate_inputs rejects the pd.DataFrame that prod feeds (the XGBoost trainer receives a DataFrame from the cvntrade_autonomous_orchestrator cache layer).

Test gap : tests/unit/training/labels/test_label_pipeline.py:_make_imbalanced_dataset returned (np.ndarray, np.ndarray). No test covered the DataFrame case. The CI signal was clean (181/181 pre-hotfix) but did not cover the production codepath.

Fix : PR #751 commit 3837a886 (merged 2026-04-28 14:50 UTC). Defensive pd.DataFrame → np.ndarray coercion at the entry of apply_label_pipeline, BEFORE the Hamilton driver. Plus 4 new TestDataFrameCoercion tests that exercise the codepaths with DataFrame inputs.

Audit trail :
  • GH issue : #750
  • Hotfix PR : #751 (merged 3837a886)
  • Hotfix dossier : documentation/reviews/2026-04-28-track5-hotfix-dataframe-coercion.md
  • Committee pr_review : session fd9317be PASSED OK avg 8.1 strong, 0 blocker, 7 forward-looking recos
  • Original Track 5 commits : implementation 77aa6389 (PR #734), plan dossier 1074891a

Time-to-detect (technical) : ~2 minutes (trigger 12:25 UTC → first failure log at 12:27 UTC). Logs surfaced the failure immediately ; the system's automated detection is fine.

Time-to-acknowledge (human) : ~55 minutes (operator notified Claude / flagged ~13:20 UTC after ~1h of failed-runs accumulating in Airflow). This is the real operator-loop latency — the failure was visible in logs from minute 2, but no alert paged the operator. Highlights the absence of a real-time "FTF run health" alert (cf. documentation/runbooks/cleanlab_cv_systemic_failure.md per ADR-70 §2 P1 alert wired in MLOps readiness for CVN-N001-EE-S01 — would have paged at minute 5 had it been deployed already).

Time-to-mitigate : ~10 minutes from human acknowledgement (operator stopped sweep on flag).

Time-to-fix : ~1.5h from human acknowledgement (diagnosis 5 min + code 5 min + 4 regression tests 10 min + dossier 15 min + committee submission 2 min + committee verdict ~2 min + merge ~5 min + sync delays).

Aggregate time-to-recovery (trigger → fix merged → re-trigger possible) : ~2.5h, dominated by the 55-min human-acknowledgement gap.

Lessons learned :

  1. Test fixture parity : every helper exposed to the production trainer MUST have at least one test fixture using the actual production input types (pd.DataFrame for X, pd.Series for y). Going forward, this is a checklist item in MLOps readiness template (ADR-70 §6 — committee fd9317be reco #2 to be applied as ADR-70 amendment).
  2. Production smoke gate : substantive ML code (touching src/training/, src/commun/{pipeline,inference,filters,labels}/) should pass a 1×1×1 FTF smoke variant on the cluster as a pre-merge gate, not just the local unit + integration tests. Committee fd9317be reco #3 — to be applied as ADR-69 amendment OR new ADR.
  3. Hamilton strict typing is load-bearing : the validation that broke our run is a feature, not a bug. The fix is entry-point coercion, NOT loosening the type hints (which would defer the same class of bug to deeper code). Committee fd9317be reco #5 explicitly endorses this strategy.
  4. Cache layer audit : the upstream cvntrade_autonomous_orchestrator propagates pandas types through the cache without explicit boundary coercion. Committee reco #6 — follow-up issue to add type validation at the cache boundary, reducing future drift surface.

Follow-up actions :

| # | Action | Status |
| --- | --- | --- |
| F1 | Apply committee reco #2 → ADR-70 amendment for DataFrame fixture parity | open |
| F2 | Apply committee reco #3 → ADR-69 / ADR-70 amendment for production smoke gate | open |
| F3 | Open issue for cache-layer type audit (committee reco #6) | open |

17.2 2026-04-28 — Track 5 FTF sweep XGBoost feature-name mismatch (P0, incident #2 same surface as §17.1)

Time : trigger 2026-04-28 ~15:30 UTC → detect <60s → mitigate (operator stops sweep) ~5min → fix (PR #754 merged) ~3h.

Severity : P0. Same blocker on the same surface within 24h of incident #1, all FTF variants failing.

Symptom :

ValueError: data did not contain feature names, but the following fields are
expected: open, high, low, close, BBL_8_2_0, BBM_8_2_0, BBU_8_2_0, RSI_14, ...

Raised inside xgb.train's _validate_features cross-check between dtrain and dval eval pairs.

Root cause : Hotfix v1 of incident #1 (PR #752 commit 3837a886) coerced X_train from DataFrame → ndarray at the entry of apply_label_pipeline to satisfy Hamilton's validate_inputs. That fix was correct for Hamilton, but it broke an implicit, undocumented contract with XGBoost : the trainer never touches X_val (still DataFrame, with feature_names). When xgb.train was called with evals=[(dtrain, "train"), (dval, "val")], dtrain (no feature names) and dval (has feature names) were inconsistent → crash.

Pattern of failure : Hotfix v1 unit tests (TestDataFrameCoercion) covered apply_label_pipeline in isolation. They satisfied the Hamilton contract but never replayed the full trainer codepath. Methodology gap : unit tests of a transform are not enough — the integration with downstream consumers must be tested at the boundary.

Test gap : no integration test replayed xgb.DMatrix(X_train, ...) + xgb.DMatrix(X_val, ...) + xgb.train(..., evals=[...]) with both eval pairs. The new TestTrainerEndToEndDataFrame class (PR #754) does exactly this — verified to fail without the fix and pass with it.

Fix : PR #754 (commit 6156ff4d).

Layer 1 : apply_label_pipeline now PRESERVES the input type round-trip :
  • True identity short-circuit BEFORE any coercion (preserves DataFrame index)
  • For the Hamilton path : capture metadata (column names, Series name, index) → coerce to ndarray → re-wrap to original type with original index when row count matches

Layer 2 : Trainer-side _assert_dmatrix_contract(dtrain, dval) invoked AFTER DMatrix construction and BEFORE xgb.train (committee reco #2). Validates feature_names presence parity, equality, num_col, feature_types. Fail-fast per ADR-25 : if a future transform regresses, the error surfaces here, immediate and readable, not deep in xgb.train.

Layer 3 : 12 new integration tests :
  • TestTrainerEndToEndDataFrame (6 tests) replays the EXACT trainer codepath end-to-end
  • TestSiblingFailureModes (5 tests) covers dtype preservation, sample_weights row alignment after filter, NaN handling per ADR-25, trainer assertion regression bar
  • test_dataframe_index_preserved_when_row_count_unchanged locks DatetimeIndex round-trip
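
To make the Layer 2 idea concrete, here is a minimal sketch of such a boundary assertion. It is not the code from PR #754: the function body is an assumption written from the four checks listed above, and it relies only on the `feature_names` and `feature_types` attributes and the `num_col()` method that `xgboost.DMatrix` exposes, so any object with the same shape works.

```python
def assert_dmatrix_contract(dtrain, dval) -> None:
    """Fail fast if the train and validation matrices disagree on their feature schema."""
    train_names, val_names = dtrain.feature_names, dval.feature_names
    # Presence parity: either both matrices carry feature names or neither does.
    if (train_names is None) != (val_names is None):
        raise ValueError(
            "feature_names presence mismatch: "
            f"dtrain has names={train_names is not None}, dval has names={val_names is not None}; "
            "an upstream transform probably changed the container type of X_train"
        )
    if train_names != val_names:
        raise ValueError("feature_names differ between dtrain and dval")
    if dtrain.num_col() != dval.num_col():
        raise ValueError(f"num_col mismatch: {dtrain.num_col()} vs {dval.num_col()}")
    if dtrain.feature_types != dval.feature_types:
        raise ValueError("feature_types differ between dtrain and dval")
```

Called right after both DMatrix objects are built, a check like this turns the incident's deep xgb.train traceback into a message that names the suspected cause.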

Audit trail :
  • GH issue #753
  • PR #754 (merged commit 6156ff4d)
  • Committee pr_review session 0e15acc0 : verdict PASSED EXECUTION_RISK (5 experts, consensus strong)
  • Dossier : documentation/reviews/2026-04-28-track5-hotfix-v2-type-preservation.md
  • Story : CVN-N001-EE-S01 (OP wp#40)

Time-to-detect / acknowledge / mitigate / fix :
  • Tech detect : <60s (cluster log surfaced ValueError)
  • Human ack : ~5min (operator pinged after observing failed variants in Airflow UI)
  • Mitigate : ~5min (operator stopped sweep)
  • Fix : ~3h (PR #754 with 3 CR cycles + committee + recos #2 + #3 inline)

Lessons :

  1. An incident on the same surface within 24h is a methodology signal, not a bug accident. Fixing the immediate symptom (incident #1) without integration test parity (§17.1 lesson #1) was insufficient: incident #2 hit the very next operator trigger. The committee 0e15acc0 verdict captures this : "the recurrence of critical production failures on the same surface within 24 hours indicates systemic weaknesses in data contract enforcement, integration testing, and operational readiness."

  2. Defense-in-depth at every contract boundary. The label_pipeline → trainer boundary is now triple-protected : (a) type preservation in the transform, (b) trainer-side assertion before xgb.train, (c) integration tests that replay both. This pattern should generalize to every ML pipeline boundary (committee reco #1, OP Story CVN-N011-EA-S01 #755).

  3. OP-first for backlog. Per ADR-76, the 6 forward-looking recos from session 0e15acc0 were created in OP first (Need CVN-N011, Epic CVN-N011-EA, Stories S01-S06 = wp#69-#74), then GH issues #755-#760 derived from them.

Follow-up actions :

| # | Action | OP | GH |
| --- | --- | --- | --- |
| F4 | ADR explicit data contracts at ML pipeline boundaries (P1) | wp#69 | #755 |
| F5 | ADR amendment: mandate integration testing parity (P2) | wp#70 | #756 |
| F6 | Production smoke gate pre-FTF-sweep (P2) | wp#71 | #757 |
| F7 | Schema validation tooling eval (Great Expectations / Pydantic) (P3) | wp#72 | #758 |
| F8 | Systemic post-mortem: incidents §17.1 + §17.2 (P3) | wp#73 | #759 |
| F9 | Real-time observability: schema/dtypes/NaN at boundaries (P3) | wp#74 | #760 |

17.3 2026-04-28 — Track 5 FTF sweep calibration crash on soft labels (P0, incident #3 same surface as §17.1+§17.2)

Time : trigger 2026-04-28 ~15:30 UTC → detect <60s → mitigate (operator stops sweep) ~5min → fix (PR opened ~17:00 UTC) ~3h.

Severity : P0. Third blocker on the same surface in 24h, ~50% of FTF sweep variants crashing (all variants with eps_buy > 0 AND calibration != "none").

Symptom :

File "src/training/XGBoost/cvntrade_XGBoost_trainer.py", line 315
  self._apply_calibration(X_train, y_train, config.calibration)
File ".../sklearn/calibration.py", line 319, in fit
  check_classification_targets(y)
ValueError: Unknown label type: continuous. Maybe you are trying to fit a
classifier, which expects discrete classes on a regression target with
continuous values.

Root cause : Track 5 label smoothing (apply_label_pipeline with eps_buy > 0) transforms y_train into soft labels (continuous floats in [0, 1]). xgb.train(..., 'binary:logistic') accepts soft labels natively → training succeeds. But the immediately-following _apply_calibration calls sklearn's CalibratedClassifierCV.fit(X_train, y_train) which calls check_classification_targets(y) and rejects continuous targets.

This is the third incident in 24h on the same conceptual surface — apply_label_pipeline violating an undocumented contract with a downstream consumer of y_train :

| Incident | Downstream consumer | Hotfix |
| --- | --- | --- |
| §17.1 | Hamilton validate_inputs (X type) | PR #752 |
| §17.2 | XGBoost _validate_features (feature_names parity) | PR #754 |
| §17.3 (this) | sklearn CalibratedClassifierCV (discrete targets) | PR #767 (CVN-N011-EA-S07) |

Test gap : Track 5 integration tests + the new TestTrainerEndToEndDataFrame from PR #754 only exercise the path up to xgb.train — NOT the _apply_calibration step that follows. The committee 0e15acc0 reco #4 ("mandate integration testing parity") predicted exactly this gap, but had not yet been broadly applied.

Fix : PR #767 (CVN-N011-EA-S07).

Adopted Option C (chosen after operator triage and committee 986ea335 plan_review PASSED) :
  • Calibration runs on (X_val, y_val) instead of (X_train, y_train) : the val split is hard-labeled (never touched by apply_label_pipeline), so soft labels in train no longer impact calibration
  • Semantically more correct than options A (round soft → hard, lossy) and B (preserve original y_train, in-sample calibration). ML best practice : always calibrate on hold-out (Platt 1999, Niculescu-Mizil & Caruana 2005). Aligns with ADR-15 (theta calibrated OOS), an existing precedent in the project.
  • Defense-in-depth : new _assert_calibration_targets_discrete() helper invoked BEFORE each CalibratedClassifierCV.fit() (per ADR-25 fail-fast). If a future transform breaks the contract, the error surfaces immediately with a readable message pointing at the suspected root cause.
  • Renamed _apply_calibration(X_train, y_train, ...) → _apply_calibration(X_calib, y_calib, ...) to make the new contract explicit (same for _apply_hybrid_calibration).
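
As an illustration of the defense-in-depth helper, a minimal sketch of a discreteness check follows. It is an assumption about what such a helper could look like, not the implementation from PR #767: the function name drops the leading underscore, and the allowed label set (0, 1) is assumed from the binary:logistic objective mentioned in the root cause.

```python
from collections.abc import Iterable


def assert_calibration_targets_discrete(y: Iterable, allowed: tuple = (0, 1)) -> None:
    """Raise a readable error if y contains anything other than hard class labels."""
    # Collect the distinct offending values; soft labels such as 0.85 end up here.
    offending = sorted({float(v) for v in y if v not in allowed})
    if offending:
        raise ValueError(
            f"calibration targets must be discrete labels in {allowed}, "
            f"found continuous values such as {offending[:5]}; "
            "check whether label smoothing was applied to the calibration split"
        )


assert_calibration_targets_discrete([0, 1, 1, 0])        # hard labels pass silently
try:
    assert_calibration_targets_discrete([0.15, 0.85, 1.0, 0.0])
except ValueError as exc:
    print(exc)
```

Raising before CalibratedClassifierCV.fit() replaces sklearn's generic "Unknown label type: continuous" with a message that points at the likely upstream transform.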

Tests : 5 new integration tests in TestCalibrationOnSoftLabelsTrain :
  • 3 parametrized variants (calibration=isotonic/sigmoid/platt) with eps_buy=0.3 end-to-end through trainer.train(...) : all 3 PASS post-fix, FAIL pre-fix with the exact production error message (regression bar verified)
  • 1 test for the _assert_calibration_targets_discrete helper (healthy + continuous inputs)
  • 1 baseline sanity (eps=0, calibration=none) : no regression on the no-op path

Audit trail :
  • Plan dossier : documentation/reviews/2026-04-28-track5-bug1-calibration-refactor-plan.md
  • Committee plan_review session 986ea335 : PASSED EXECUTION_RISK (5 experts, consensus strong)
  • GH issue #764
  • PR #767
  • OP Story CVN-N011-EA-S07 (wp#85)

Time-to-detect / acknowledge / mitigate / fix :
  • Tech detect : <60s (cluster log surfaced ValueError on first non-baseline variant)
  • Human ack : ~5min (operator pasted log dump immediately)
  • Mitigate : ~5min (operator stopped sweep)
  • Fix : ~3h wall-clock (plan dossier 30min + committee 5min + implementation+tests 1h + PR+commit 30min)

Lessons :

  1. Three same-surface incidents in 24h confirms a systemic gap, not a bug accident. The unit-test surface (label_pipeline in isolation) and even the new trainer-end-to-end test (path up to xgb.train) BOTH missed _apply_calibration because no test exercised the FULL trainer.train(...) lifecycle with soft labels. The committee 0e15acc0 reco #4 ("mandate integration testing parity") and reco #5 ("production smoke gate pre-FTF-sweep") would have caught this — escalating their priority.

  2. Defense-in-depth at every contract boundary, generalized by ADR-mandate. We now have boundary assertions at 2 contracts (dtrain↔dval feature_names from §17.2, calibration target discreteness here). The next 4th incident is the one we don't yet have an assertion for. Story CVN-N011-EA-S01 (data contracts ADR, #755) becomes the systemic answer — accelerate.

  3. Pre-existing methodological weakness exposed by the bug. Before §17.3, calibration ran on training data — methodologically optimistic (in-sample calibration). The fix not only solves the immediate crash but corrects this weakness by moving calibration to hold-out (X_val, y_val), aligning with ADR-15 (theta calibré OOS) precedent. Bugs occasionally surface latent design issues — prefer the structural fix over the tactical one when they coincide.

Follow-up actions :

| # | Action | OP | GH |
| --- | --- | --- | --- |
| F10 | Audit other y_train/y_val consumers in trainer + post-trainer for similar implicit contracts (Risk #5 of plan dossier) | TBD | TBD if found |
| F11 | Re-prioritize CVN-N011-EA-S01 (data contracts ADR) : move from P1 backlog to next sprint | wp#69 | #755 |
| F12 | Re-prioritize CVN-N011-EA-S03 (production smoke gate) : would have caught all 3 of §17.1+§17.2+§17.3 | wp#71 | #757 |

17.4 2026-04-29 — Track 6 focal_loss FTF sweep crash on missing sympy (P1)

Time : 11:35 trigger → 11:37 detect (operator log review) → 11:49 mitigate (PR #775 merged) → 11:52 fix (image deploy SUCCESS).

Severity : P1. Every focal_loss trial crashed identically with ModuleNotFoundError: No module named 'sympy'. 5 pods × 50 trials = 250/250 failures, zero rows persisted, best_score=-1000.0 on all variants.

Symptom : Loki query event=xgboost_training_failed error=No module named 'sympy' showed identical traceback across all 5 pods, originating at src/training/XGBoost/focal_loss.py:66 → import sympy as sp inside _build_focal_lambdas() called at module import (line 90).

Root cause : Track 6 PR #767 added focal_loss.py which imports sympy at module load time, but neither requirements.txt (root) nor airflow_docker/requirements.txt (image build) declared the dependency. Local .venv_airflow had sympy as a transitive dep so all CI tests passed silently. Same gap pattern as PR #762 (cleanlab).

Test gap : no end-to-end runtime smoke against the actual airflow image. CI uses .venv_airflow derivatives, so missing prod deps don't surface until first real trial. CodeRabbit doesn't compare module imports against requirements files.

Fix : PR #775 squash 510b10db : pinned sympy==1.14.0 in BOTH requirements.txt and airflow_docker/requirements.txt with explicit comment about runtime requirement + ADR-25 fail-loud rationale.

Audit trail : GH #776, OP wp#92 (CVN-N011-EA-S11) post-mortem Story tracking the systemic gap, PR #775 hotfix.

Time-to-detect (technical) / acknowledge (human) / mitigate / fix : 0min / 2min / 12min / 17min. Detection was fast because the operator was watching the sweep live. Without the live watch, the silent failure (250 trials with best_score=-1000.0) would have looked indistinguishable from "model converged badly" until the results dossier showed every single trial had identical impossible scores.

Lessons learned :

  • Module-load-time imports are a runtime trap that the test suite cannot catch — the test framework imports happen in dev environment, where transitive deps are abundant.
  • The "fix" pattern (sync requirements.txt ↔ airflow_docker/requirements.txt) is recurring (#762 cleanlab, this incident sympy) — process gate is needed, not more vigilance.
  • 6 CodeRabbit passes + committee pr_review on PR #767 didn't catch this because none of those layers compares declared deps to actual imports.

Follow-up actions :

| # | Action | Story | Issue |
| --- | --- | --- | --- |
| F13 | Build a CI gate that fails when a module import in src/ is not declared in either requirements file (Hypothesis 1 in S11 plan) | wp#92 | #776 |
| F14 | Add a post-build dockerized smoke test of curated entry points (Hypothesis 2 in S11 plan) | wp#92 | #776 |
| F15 | Update TEMPLATE_mlops_readiness.md §3 with a "new dep declared in BOTH requirements files" checkbox | wp#92 | #776 |
| F16 | Retroactive scan of src/ for other latent missing-dep regressions on main today | wp#92 | #776 |

17.5 2026-04-29 — Cleanlab FTF sweep gRPC fork deadlock (P1)

Time : 09:46 trigger → 09:47 first symptom → 10:10 detect (operator review of pod state) → 10:12 mitigate (operator killed pods) → ~12:00 root cause confirmed (focal_loss sweep ran cleanly on same stack) → fix in flight via PR (this Story CVN-N011-EA-S10).

Severity : P1. Every cleanlab FTF sweep deadlocks silently after ~4 trials per pod. Pods report Running to K8s, consume RAM at 2-3 GiB each, write zero rows to finetune_results. No alert exists today (this incident is what triggers adding one, see runbook).

Symptom : 5 pods stuck identically at cleanlab_cv_probs event after ~4 HPO trials. LDOUSDC pod showed explicit WARNING: All log messages before absl::InitializeLog() is called are written to STDERR then I0000 ... fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers then 24 minutes of complete silence.

Root cause : cleanlab.filter.find_label_issues defaulted to n_jobs=cpu_count() and spawned multiprocessing.Pool with the OS default start method (fork on Linux + Python<3.14). The forked children inherited live MLflow autolog gRPC threads in a half-locked state and hung forever on first gRPC call. Diagnostic was confirmed by the focal_loss FTF sweep (2026-04-29 11:08-11:30 UTC) which ran cleanly on the same HPO + MLflow + Optuna stack, the only difference being that focal_loss doesn't call cleanlab.find_label_issues.

Test gap : no integration test exercised cleanlab CV with concurrent MLflow autolog gRPC threads alive in the parent process. The 8 unit tests in TestSuspectMaskPerClassCap (S08) covered the per-class cap logic but mocked out cleanlab so the fork pattern never fired in CI.

Fix : PR #777 CVN-N011-EA-S10 : pin n_jobs=1 on the find_label_issues call in src/training/labels/label_pipeline.py::suspect_mask (eliminates the fork) + defence-in-depth GRPC_ENABLE_FORK_SUPPORT=1 + GRPC_POLL_STRATEGY=poll env vars in Helm + docker-compose. Reproducer test in tests/integration/test_grpc_fork_deadlock_regression.py asserts the contract holds.

Audit trail : GH #774, OP wp#91 (CVN-N011-EA-S10), plan dossier 2026-04-29-grpc-fork-deadlock-plan.md, committee plan_review session 7bf612b7 PASSED OK.

Time-to-detect (technical) / acknowledge (human) / mitigate / fix : 0min (no alert wired) / 24min (operator manual review) / 2min (pod kill) / fix in flight. The 24-min technical-vs-human gap is exactly the alert-wiring debt that the new runbook + hpo_pod_stuck alert (wired in this Story's MLOps readiness §1) closes.

Lessons learned :

  • Forking after gRPC threads are alive is a deterministic deadlock — must be guarded everywhere gRPC is used (i.e. the entire prod codebase via MLflow autolog).
  • Cleanlab's n_jobs default is unsafe in any process that has MLflow autolog active.
  • Detecting "pod stuck but Running" requires a dedicated log-progress alert: process-liveness checks (Kubernetes probes, Prometheus pod-state metrics) don't catch this on their own (process is alive, just hung).
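
The first two lessons can be expressed as a cheap pre-flight check before any process pool is created. The sketch below is an illustration, not the code shipped in the fix: `fork_is_unsafe` and `current_process_fork_is_unsafe` are hypothetical helpers, and "more than one live Python thread" is only a rough proxy for "gRPC threads are alive" (native gRPC threads are not counted by `threading`).

```python
import multiprocessing
import threading


def fork_is_unsafe(start_method: str, live_threads: int) -> bool:
    """True when a fork-based pool would copy a multi-threaded parent."""
    return start_method == "fork" and live_threads > 1


def current_process_fork_is_unsafe() -> bool:
    # Note: get_start_method() fixes the platform default as the start
    # method if none has been chosen yet in this process.
    method = multiprocessing.get_start_method()
    # active_count() only sees Python-level threads (hypothetical proxy).
    return fork_is_unsafe(method, threading.active_count())


if __name__ == "__main__":
    if current_process_fork_is_unsafe():
        print("fork start method with live threads: keep n_jobs=1")
```

A guard like this does not replace pinning n_jobs=1; it only makes the risky combination visible in logs.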

Follow-up actions :

| # | Action | Story | Issue |
|---|---|---|---|
| F17 | Land the H4 + H2 fix + reproducer test on main | wp#91 | #774 |
| F18 | Wire the hpo_pod_stuck alert in Grafana per the new runbook §1 | wp#91 | #774 |
| F19 | Add gRPC client metrics (latency, errors) to OTel pipeline (deferred from committee 7bf612b7 reco #6) | CVN-N010-EA | TBD |
| F20 | Liveness probe on hpo_heartbeat events (deferred from reco #1) | CVN-N010-EA | TBD |
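
The logic behind a "stuck but Running" alert (F18, F20) reduces to comparing the timestamp of the last progress event with the current time. The following is a minimal sketch under stated assumptions: `is_stuck` and `STALL_THRESHOLD` are hypothetical names, and the 10-minute threshold is a placeholder, not the value defined in the runbook.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the real value belongs in the alert rule.
STALL_THRESHOLD = timedelta(minutes=10)


def is_stuck(last_event_at: datetime, now: datetime,
             threshold: timedelta = STALL_THRESHOLD) -> bool:
    """True when nothing has been logged for longer than `threshold`."""
    return now - last_event_at > threshold


# Same silence duration as the LDOUSDC pod in this incident (24 minutes).
now = datetime.now(timezone.utc)
last_log = now - timedelta(minutes=24)
print(is_stuck(last_log, now))  # True
```

The check depends only on log progress, so it fires even when the pod still reports Running.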

17.X — Template for future entries

### 17.X 2026-MM-DD — <one-line title> (severity Pn)

**Time** : trigger → detect → mitigate → fix.
**Severity** : Pn — short justification.
**Symptom** : observable signal (log line, metric, dashboard).
**Root cause** : 1-3 sentences naming the file:line if applicable.
**Test gap** : what the test suite was missing.
**Fix** : PR # commit, files touched, mechanism.
**Audit trail** : GH issue, PR, committee session, related ADR/Story.
**Time-to-detect (technical) / acknowledge (human) / mitigate / fix** : durations. Split tech vs human detection — they are usually different and the gap is operationally interesting (it's where alert-wiring debt lives).
**Lessons learned** : 1-N bullets, action-oriented.
**Follow-up actions** : table of issues/PRs to track each lesson.
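
As a worked example of the duration split, the timeline of the gRPC fork deadlock entry (09:46 trigger, 10:10 detect, 10:12 mitigate) gives the 24 min and 2 min figures reported there, with the acknowledge time measured from the trigger. `minutes_between` is a hypothetical helper that assumes all timestamps fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


timeline = {"trigger": "09:46", "detect": "10:10", "mitigate": "10:12"}
acknowledge = minutes_between(timeline["trigger"], timeline["detect"])
mitigate = minutes_between(timeline["detect"], timeline["mitigate"])
print(f"acknowledge={acknowledge}min mitigate={mitigate}min")
# acknowledge=24min mitigate=2min
```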

Glossary

| Term | Definition |
|---|---|
| screening | Phase 1: fast preselection of candidate PTEs (grid search) |
| testing | Phase 2: multi-fold HPO validation + OOS backtest |
| WFRB | Phase 3: walk-forward rolling backtest validation |
| challenger | Model registered but not promoted, awaiting approval |
| champion | Active reference model for a crypto |
| stale | Model not refreshed for N days |
| baseline_f1_buy | F1 score of a naive "always BUY" classifier, used as the minimum reference |
| buy_rate | Proportion of BUY labels in the test split |
| PTE | Trade execution parameters (Parametres de Trade Execution): SL, TP, horizon |
| golden signal | Key supervision metric, the core of observability |
| last_fold_f1_buy | BUY F1 of the last fold (same period as the screening) |
| CORRELATION DATA | Structured block in the test_step logs; stable interface (ADR-30) |

Working with the docs site (docs.cvntrade.eu)

The Design System + ADRs + runbooks are published to docs.cvntrade.eu. Source lives in documentation/; build config is mkdocs.yml at the repo root. The site is Phase 2 of #593 (#637 tracks the scaffolding).

Local preview

make docs-install      # one-time: installs mkdocs + plugins into .venv_airflow
make docs-serve        # hot-reload at http://127.0.0.1:8000

Edit any .md file in documentation/ — the browser reloads on save.

Local strict build (same checks as CI)

make docs-build

--strict fails on broken internal links, missing nav entries, or unknown config. If this passes locally, CI will pass too.

Adding a page

  1. Drop the .md file in the right subfolder (needs/, epics/, adr/, …).
  2. Add an entry in mkdocs.yml under nav: so it's discoverable (required unless it's referenced from another page's index).
  3. make docs-build — fix any broken links.
  4. Commit + PR. CI rebuilds on every PR that touches documentation/**.

Deploy

  • main push → .github/workflows/docs.yml → builds strict → GitHub Pages.
  • First deploy: enable GH Pages in repo settings (Source: GitHub Actions), then set CNAME docs.cvntrade.eu → dococeven.github.io at the registrar.
  • Subsequent deploys are fully automated. No operator action required.

Architecture diagrams

documentation/architecture/workspace.dsl is the single Structurizr DSL source. See documentation/architecture/workspace-reference.md for rendering options (VS Code live preview, structurizr-cli, Structurizr Lite in Docker).

Troubleshooting

| Symptom | Cause / fix |
|---|---|
| mkdocs: command not found | Run make docs-install. |
| Strict build fails on broken link | Fix the .md link: use the relative path from the file's location. |
| New file isn't in the sidebar | Add it under nav: in mkdocs.yml. |
| README.md shows as an empty page | Expected: README.md at docs_dir root is excluded in favor of index.md. |
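
To find a broken relative link before running the strict build, a small script can resolve each link target from the folder of the file that contains it. This is an approximation written for illustration, not MkDocs's own validation: `broken_relative_links` is a hypothetical helper and the regex only covers inline `[text](target)` links.

```python
import re
from pathlib import Path

# Matches [text](target); external URLs and pure anchors are filtered below.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(md_file: Path) -> list[str]:
    """Return link targets in `md_file` that do not resolve to an existing path."""
    broken = []
    for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        path_part = target.split("#", 1)[0]  # drop any in-page anchor
        # Relative links resolve from the linking file's folder, not the repo root.
        if not (md_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

Calling it on every .md file under documentation/ gives a quick list of candidates to fix.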

OpenProject operator playbook (#593 Phase 1)

  • URL: https://openproject.cvntrade.eu
  • Role: product source of truth for Needs / Epics / Stories / Releases.
  • Deployed via: infra/helm/openproject/ chart, Helm-managed through the deploy-k8s workflow.
  • Chart setup doc: infra/helm/openproject/README.md §Setup.

Login

  1. Browser → https://openproject.cvntrade.eu/login
  2. Username admin initially; operator account created post-setup
  3. If a certificate warning appears, cert-manager has not issued the certificate yet; wait 5 min and refresh

Create a new Need

  1. Top nav → project CVNTrade → Work packages
  2. Create → type Need
  3. Subject: short title, e.g. "Reach F1=0.75 binary classification"
  4. Custom fields:
       • need_id = next available CVN-N<nnn> (strictly sequential)
       • github_issue_url = link to the parent GitHub issue
  5. Description: follow documentation/templates/TEMPLATE_need.md
  6. Save → OpenProject auto-generates the work package ID (ignore it, use need_id for referencing)
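
Picking the next sequential need_id can be done from the list of IDs already in use. The sketch below is illustrative: `next_need_id` is a hypothetical helper, and it reads `<nnn>` as exactly three zero-padded digits, which is an assumption based on the examples in this document.

```python
import re

# <nnn> read as exactly three digits (assumption).
NEED_ID_RE = re.compile(r"^CVN-N(\d{3})$")


def next_need_id(existing: list[str]) -> str:
    """Return the next sequential Need ID, ignoring Epic/Story/Release IDs."""
    numbers = [int(m.group(1)) for i in existing if (m := NEED_ID_RE.match(i))]
    return f"CVN-N{max(numbers, default=0) + 1:03d}"


print(next_need_id(["CVN-N001", "CVN-N010", "CVN-N011", "CVN-N001-EB"]))
# CVN-N012
```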

Create an Epic under a Need

  1. Open the Need work package
  2. Relations tab → Add relation → "includes" → New work package
  3. Type Epic, set epic_id = <need_id>-E<letter> (A, B, C… per Need)
  4. Save

Close an Epic

  1. Open the Epic
  2. Status → Closed
  3. Fill the epic's Closure section (template §7) in the description
  4. The parent Need's % complete updates automatically

Link a PR to a Story

The link is in both directions:

  • OpenProject side: add GitHub PR URL to the Story's Relations tab
  • GitHub side: PR body must contain CVN-N<nnn>-US<m> (enforced by Phase 4 CI gate once live)
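
The GitHub-side rule amounts to a pattern search on the PR body. The following sketch is not the Phase 4 CI gate itself: `story_refs` and `pr_body_is_valid` are hypothetical names, and the regex reads `<nnn>` as three digits and `<m>` as one or more digits, both assumptions.

```python
import re

# <nnn> read as exactly three digits, <m> as one or more digits (assumption).
STORY_REF_RE = re.compile(r"\bCVN-N\d{3}-US\d+\b")


def story_refs(pr_body: str) -> list[str]:
    """Return every Story reference found in the PR body, in order."""
    return STORY_REF_RE.findall(pr_body)


def pr_body_is_valid(pr_body: str) -> bool:
    """True when the PR body names at least one Story."""
    return bool(story_refs(pr_body))


print(pr_body_is_valid("Implements CVN-N001-US3: pin n_jobs"))  # True
print(pr_body_is_valid("Fixes #774"))                           # False
```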

Create a Release

  1. Work packages → Create → type Release
  2. release_id = CVN-R<yyyymmdd>-<n>
  3. Add all closed Epics as "part of" relations
  4. Attach backtest report links (URL or file)
  5. When deploy succeeds, set status → Deployed
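
The release_id format from step 2 can be generated rather than typed by hand. In this sketch, `release_id` is a hypothetical helper; it assumes `<yyyymmdd>` is the release date and `<n>` is a per-day sequence number, neither of which is spelled out in this playbook.

```python
from datetime import date


def release_id(release_date: date, n: int) -> str:
    """Format CVN-R<yyyymmdd>-<n>; n is assumed to be a per-day counter."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return f"CVN-R{release_date:%Y%m%d}-{n}"


print(release_id(date(2026, 3, 28), 1))  # CVN-R20260328-1
```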

Import a GitHub issue as an OpenProject work package

Use scripts/openproject_import_gh.py to lift an existing GitHub issue into OpenProject as a Need / Epic / Story / Release. The script is idempotent — re-running on the same issue updates the existing work package instead of creating a duplicate (matched by the github_issue_url custom field).

Prereqs (one-time):

  1. Project CVNTrade exists, types Need/Epic/Story/Release exist, and the custom fields listed in DESIGN_SYSTEM_BLUEPRINT.md §2.2 are attached to those types (done via OpenProject admin UI — the API doesn't expose POST on types or custom fields).
  2. gh CLI authenticated (gh auth status returns OK).
  3. Admin API key generated via OpenProject UI → avatar → My account → Access tokens → API access key. Keep it local; don't commit.

Usage:

# Project convention: every Python invocation goes through .venv_airflow
# (see CLAUDE.md). The session bootstrap activates it automatically; the
# explicit ``source`` line below is the documented form when copy-pasting
# into a fresh shell.
source .venv_airflow/bin/activate
export OPENPROJECT_API_KEY=<your admin api key>

# Import a Need (top-level)
python scripts/openproject_import_gh.py --issue 608 --type Need --cvn-id CVN-N001

# Import an Epic under an existing Need
python scripts/openproject_import_gh.py --issue 624 --type Epic --cvn-id CVN-N001-EB \
  --parent-cvn-id CVN-N001

# Dry-run to inspect the payload without writing
python scripts/openproject_import_gh.py --issue 608 --type Need --cvn-id CVN-N001 --dry-run

What gets populated:

  • subject ← GitHub issue title
  • description ← GitHub issue body (markdown preserved)
  • status ← open → "In progress", closed → "Closed"
  • cvn_id ← CLI arg
  • github_issue_url ← issue URL (Link custom field)
  • Parent relation ← via --parent-cvn-id (WP looked up by its cvn_id)

What is NOT populated (fill via UI afterward):

problem, impact, kpis, out_of_scope, objective, arch_notes, owner, acceptance_criteria, test_plan, models_promoted, epics_closed, backtest_report_url. These are long-form / structured fields the operator owns.

Re-run safety: the script detects an existing work package by github_issue_url and PATCHes it (with the correct lockVersion) — no duplicate, no lost edits as long as your local re-import carries the same or newer content.

Status on re-run: status is set only on the initial creation. A re-run does not touch the status field, so any workflow transitions the operator made in the UI are preserved. OpenProject's workflow policy gates transitions per role, and forcing one from the script would return HTTP 422.
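
The re-run and status rules above can be summarized as one decision: create with a status, or patch with the stored lockVersion and no status. This is a sketch of that logic only, not the code of scripts/openproject_import_gh.py: `plan_upsert` is a hypothetical function and the flat payload keys are simplified placeholders, not the actual OpenProject API field layout.

```python
from typing import Optional

STATUS_MAP = {"open": "In progress", "closed": "Closed"}


def plan_upsert(issue: dict, existing: Optional[dict]) -> tuple[str, dict]:
    """Decide between creating and patching a work package.

    `issue` holds title/body/state/url from GitHub; `existing` is the work
    package already carrying the same github_issue_url, or None.
    """
    payload = {
        "subject": issue["title"],
        "description": issue["body"],
        "github_issue_url": issue["url"],
    }
    if existing is None:
        # Status is set only on the initial creation.
        payload["status"] = STATUS_MAP[issue["state"]]
        return "POST", payload
    # Re-run: send back the stored lockVersion and leave status untouched,
    # so operator workflow transitions made in the UI are preserved.
    payload["lockVersion"] = existing["lockVersion"]
    return "PATCH", payload
```

Keeping the decision in a pure function makes it easy to unit-test without a live OpenProject instance.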

Backups

OpenProject DB carries all operator-entered data. Losses are unrecoverable without a backup.

  • If running on a dedicated Scaleway PG instance: daily snapshots are configured at the PG level (7-day retention by default).
  • If reusing the champollion instance: see infra/helm/openproject/README.md §Backup for the pg_dump CronJob option.