CVNTrade MLOps Operating Procedure

Version: 3.0
Last updated: 2026-03-28
Scope: operation of the CVNTrade MLOps platform

Operational status:
- Grafana (5 dashboards): operational
- Airflow (10 DAGs): operational
- ZenML (7 native pipelines): operational
- MLflow (model registry): operational
- Prometheus (K8s metrics): operational
- Automatic alerting (Slack/SMS): target (issue #397)
- Drift monitoring: target (not yet implemented)
- Automatic diagnosis in Grafana: target (structured logs still to be migrated)

0. CVNTrade global state

Every operations session starts by determining the global state.

State	Definition	Response
GREEN	Healthy infra, healthy pipelines, at least 1 recent useful run	normal operation
YELLOW	Controlled drift: weak anomaly signals	heightened monitoring
ORANGE	Significant business or pipeline degradation	priority investigation
RED	Unavailability or inability to produce	immediate escalation

Computation rules

Condition	State
Infra critical (pod down, PVC full, OOMKill loop)	RED
Pipeline success rate < 70%	ORANGE
Pass rate = 0 on 3 consecutive runs	ORANGE
Model stale > 14d on core crypto (btc-core)	ORANGE
0 useful candidates on the full universe for > 3 days	ORANGE
Success rate 70-90% or recurrent negative Sortino	YELLOW
Model stale > 7d on non-core crypto	YELLOW
Everything else	GREEN
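
The rule table above can be read as an ordered check where the most severe matching condition wins. The sketch below is one possible encoding, not the platform's implementation: the `Snapshot` fields are hypothetical names for values an operator reads off Grafana, the success rate is assumed to be a fraction, and the boundary handling (exactly 90% counted as GREEN) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Inputs read from Grafana; every field name here is hypothetical."""
    infra_critical: bool = False              # pod down, PVC full, or OOMKill loop
    pipeline_success_rate: float = 1.0        # rolling rate as a fraction (0.0-1.0)
    zero_pass_rate_streak: int = 0            # consecutive runs with pass rate = 0
    core_model_age_days: float = 0.0          # age of the btc-core model
    noncore_model_age_days: float = 0.0       # age of the oldest non-core model
    days_without_candidate: float = 0.0       # days with 0 useful candidates, full universe
    sortino_negative_recurrent: bool = False  # judged by the operator


def global_state(s: Snapshot) -> str:
    """Return RED / ORANGE / YELLOW / GREEN; the most severe match wins."""
    if s.infra_critical:
        return "RED"
    if (s.pipeline_success_rate < 0.70
            or s.zero_pass_rate_streak >= 3
            or s.core_model_age_days > 14
            or s.days_without_candidate > 3):
        return "ORANGE"
    if (s.pipeline_success_rate < 0.90
            or s.sortino_negative_recurrent
            or s.noncore_model_age_days > 7):
        return "YELLOW"
    return "GREEN"


print(global_state(Snapshot(pipeline_success_rate=0.82)))  # YELLOW
```

Checking RED first and GREEN last mirrors the "everything else" row: a snapshot only reaches GREEN once every degraded condition has been ruled out.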

1. Purpose

This document describes the operating procedure for:
- determining the global state of the system,
- monitoring the golden signals,
- launching the standard pipelines,
- diagnosing results,
- escalating according to severity.

Guiding principle: Grafana is the single entry point (ADR-26). The other tools are used for drill-down, depending on the nature of the problem.

2. CVNTrade golden signals

These signals are the core of supervision. Every operator view must derive from them.

Platform

Signal	Definition	Warning threshold	Critical threshold
`infra_health`	Saturation / OOM / PVC / availability	OOM=1, CPU>70%	OOM>2, PVC>85%, pod down
`pipeline_success_rate`	Rolling success rate (7d)	70-90%	< 70%
`pipeline_latency`	Rolling duration vs historical p95	p95 + 30%	p95 + 100%

ML (per crypto, ADR-27)

Signal	Definition	Warning threshold	Critical threshold
`f1_buy_delta_vs_baseline`	`f1_buy - baseline_f1_buy`	delta < 15%	delta <= 0
`screening_testing_gap`	`screening_f1_buy - last_fold_f1_buy`	gap > 0.15	gap > 0.30
`std_f1_buy`	Inter-fold stability	0.05-0.10	> 0.10

Trading

Signal	Definition	Warning threshold	Critical threshold
`sortino`	Primary financial KPI	0 to 0.5	< 0
`sl_pct`	Share of stop-loss exits	35-50%	> 50%
`timeout_pct`	Share of timeout exits	30-40%	> 40%

Lifecycle

Signal	Definition	Warning threshold	Critical threshold
`model_freshness`	Age of the last useful model	> 7d	> 14d
`challenger_count`	Registered challengers not yet promoted	> 5 pending	> 10 pending

Rule: these signals must be visible in Grafana without reading logs.
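
Most of the thresholds in these tables follow the same two-level pattern, so a single helper can map a value to ok / warning / critical. This is a minimal sketch under stated assumptions: the function name `classify` and the inclusive treatment of band edges (for example, `sl_pct` of exactly 50 counted as warning) are illustrative choices, not something the tables specify.

```python
def classify(value: float, warning: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Map a signal value to 'ok', 'warning', or 'critical'.

    `warning` is where the warning band starts and `critical` is the level
    beyond which the signal is critical. For signals where low values are
    bad (e.g. sortino), pass higher_is_worse=False.
    """
    if not higher_is_worse:
        # Flip the axis so the comparisons below stay the same.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Thresholds copied from the tables above.
print(classify(55, warning=35, critical=50))      # sl_pct      -> critical
print(classify(0.07, warning=0.05, critical=0.10))  # std_f1_buy  -> warning
print(classify(0.3, warning=0.5, critical=0.0, higher_is_worse=False))  # sortino -> warning
```

Compound signals such as `infra_health`, which combine several conditions, do not fit this helper and would need their own rule.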

Observability debt: today, the ML and Trading signals are only visible in the Airflow logs (CORRELATION DATA section). Migrating them to Grafana is a priority target.

3. Tool architecture

GRAFANA (grafana.cvntrade.eu)          <- SINGLE ENTRY POINT
  |-- global supervision (golden signals)
  |-- business, pipeline, model, infra dashboards
  '-- centralized alerting (target #397)

AIRFLOW (airflow.cvntrade.eu)          <- EXECUTION
  |-- DAG trigger
  |-- run tracking
  '-- detailed logs (debug only, not for decisions)

ZENML (zenml.cvntrade.eu)              <- PIPELINE DRILL-DOWN
  |-- lineage
  |-- run history
  '-- versioned pipeline artifacts

W&B (wandb.ai)                         <- ML DRILL-DOWN
  |-- screening matrices
  |-- HPO trials
  '-- detailed comparison of ML runs

MLFLOW (mlflow.cvntrade.eu)            <- MODEL DRILL-DOWN
  |-- model registry
  |-- per-run metrics
  '-- model-related artifacts

PROMETHEUS (K8s internal)              <- INFRA METRICS
  '-- collects CPU/RAM/pods, exposed in Grafana

4. Roles and responsibilities

Role	Responsible for	Access
Operator	Morning check, standard launches, reading Grafana	Grafana (readonly), Airflow (trigger + logs)
MLOps maintainer	Pipeline incidents, registry, diagnostic metrics	Airflow (admin), MLflow (registry admin), ZenML
ML owner	Model interpretation, features, labels, HPO	W&B, MLflow (readonly), ZenML
Platform owner	K8s, PVC, nodes, limits, ingress, secrets	kubectl (restricted admin), Helm

Escalation

Situation	Decision-maker	Executor	Approver
Launch a discovery	Operator	Operator	-
Investigate a pipeline failure	MLOps maintainer	MLOps maintainer	-
Modify HPO / features / labels	ML owner	ML owner	MLOps maintainer
Challenger -> champion promotion	ML owner	MLOps maintainer	Operator (business validation)
Killswitch / rollback	Operator (emergency)	Platform owner	-
Modify Helm / K8s resources	Platform owner	Platform owner	MLOps maintainer

5. Grafana dashboards

Dashboard	Main role	Usage
MLOps Overview	Executive view	Global state of models, screenings, pass rate
Testing & Backtest	Quality view	Testing results, WFRB, gates, rejections
Pipeline Health	Execution view	Run duration, success rate, HPO, anomalies
Model Registry	Model lifecycle view	Versions, freshness, active / stale models
Infra Monitoring	Platform view	Pods, CPU, RAM, PVC, OOMKill, saturation

Recommended usage:
- start with MLOps Overview or Pipeline Health,
- go to Infra Monitoring if a platform problem is suspected,
- go to Testing & Backtest if the question is about result quality.

6. Standard procedures

6.1 Morning check

Target duration: 2 to 5 minutes

Step 1: Determine the global state

Open Grafana > Infra Monitoring, then Pipeline Health. Determine the state (GREEN / YELLOW / ORANGE / RED) using the rules in section 0. If RED or ORANGE → go straight to section 9 (incidents).

Step 2: Platform health

Grafana > Infra Monitoring. Check:
- critical pods are up,
- no recent OOMKill,
- CPU/RAM saturation below thresholds,
- PVC below the alert threshold.

Step 3: Pipeline health

Grafana > Pipeline Health. Check:
- overnight runs have finished,
- failure rate within the norm,
- no duration drift,
- no backlog.

Step 4: Business results

Grafana > MLOps Overview. Check:
- new qualified candidates,
- pass rate within the norm,
- freshness of the latest useful runs.

6.2 Launching a discovery

Tool: Airflow, DAG: launch__discovery

Group run:

{"group": "defi"}

Targeted run:

{"group": "defi", "crypto": "SOLUSDC"}

Extended run:

{
  "group": "defi",
  "horizons": "H1,H2,H3,H4,H5,H6",
  "sl_range": "0.8,1.0,1.2,1.5",
  "tp_range": "1.5,2.0,2.5,3.0",
  "hpo_trials": 50,
  "backtest_days": 60
}

Tracking: Airflow (DAG run status), Grafana (results after completion).

Note: ML/Trading thresholds must be interpreted in the context of the run (universe, grid size, horizons).
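
Since thresholds depend on grid size, it can help to estimate that size before triggering a run. The sketch below assumes the grid is the Cartesian product of `horizons`, `sl_range`, and `tp_range`, which this document does not confirm; keys absent from the conf are counted as a single value because the DAG defaults are not documented here. `grid_size` is a hypothetical helper name.

```python
import json


def grid_size(conf: dict) -> int:
    """Estimate the number of grid points per crypto for a discovery conf.

    Assumption: the grid is horizons x sl_range x tp_range. A missing key
    contributes a factor of 1 (the real DAG default is unknown here).
    """
    size = 1
    for key in ("horizons", "sl_range", "tp_range"):
        values = [v for v in conf.get(key, "").split(",") if v.strip()]
        size *= max(len(values), 1)
    return size


# The extended run conf shown above.
conf = json.loads("""
{
  "group": "defi",
  "horizons": "H1,H2,H3,H4,H5,H6",
  "sl_range": "0.8,1.0,1.2,1.5",
  "tp_range": "1.5,2.0,2.5,3.0",
  "hpo_trials": 50,
  "backtest_days": 60
}
""")
print(grid_size(conf))  # 6 * 4 * 4 = 96
```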

6.3 Analyzing results

Level 1: Grafana > Testing & Backtest

Look for: best qualified strategies, rejection reasons, F1/Sortino distribution.

Level 2: Airflow logs (debug only)

Look for the structured === CORRELATION DATA === blocks. Distinguish:
- ML quality: f1_buy vs baseline_f1_buy, precision_buy, recall_buy, buy_rate
- financial quality: Sortino, tp/sl/timeout %, gates
- consistency: screening_f1_buy vs last_fold_f1_buy (same metric, same period; ADR-28, ADR-29)

Level 3: W&B drill-down: screening matrix, HPO trials, run comparison.

6.4 Checking a specific model

Grafana > Model Registry: identify crypto / version
MLflow: source run, metrics, artifacts, hyperparams
ZenML: source pipeline, upstream artifacts, lineage

6.5 FTF: Fine-Tuning Framework Operations

6.5.1 Launching FTF runs

Verify no stale runs:

SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | head -1 | awk '{print $1}')
kubectl exec -n cvntrade "$SCHED" -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain

If stale runs exist → kill them first (6.5.4).

Trigger from Airflow UI or CLI:

kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags trigger finetune__pte \
  --conf '{"pte":"ATR1.5_3.0_H5","factor":"calibration","crypto_group":"defi_top5","n_folds":5,"n_trials":30,"history_months":24,"phase":"manual"}'

Monitor first 2 minutes — verify sample count:

POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune-pte-run-factor-crypto.*Running" | head -1 | awk '{print $1}')
kubectl logs -n cvntrade "$POD" --tail=20 | grep "Train:.*samples"

Expected: Train: ~1000-2000 samples (CUSUM enabled)
If Train: >10000 samples: STOP — cache is stale or CUSUM misconfigured. See 6.5.3.
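
The sample-count check can also be scripted on captured pod logs. The sketch below is illustrative only: the log line shape (`Train: <n> samples`) is inferred from the grep pattern above and is an assumption, as is the "review" verdict for counts that fall outside both documented ranges. `check_train_samples` is a hypothetical helper name.

```python
import re

# Assumed log line shape, inferred from the grep pattern: "Train: 1532 samples"
TRAIN_RE = re.compile(r"Train:\s*(\d+)\s+samples")


def check_train_samples(log_text: str) -> str:
    """Return 'ok', 'stop', 'review', or 'unknown' for a chunk of pod logs."""
    match = TRAIN_RE.search(log_text)
    if match is None:
        return "unknown"   # no sample count found yet
    n = int(match.group(1))
    if n > 10_000:
        return "stop"      # stale cache or CUSUM misconfigured (see 6.5.3)
    if 1_000 <= n <= 2_000:
        return "ok"        # expected range with CUSUM enabled
    return "review"        # outside both documented ranges: check manually


print(check_train_samples("INFO Train: 1532 samples"))   # ok
print(check_train_samples("INFO Train: 48211 samples"))  # stop
```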

6.5.2 After Helm deploy with FTF config changes

MANDATORY — pods keep old config until killed.

Verify new code deployed:

kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...

Kill ALL running FTF pods:

kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

Wait 30 seconds for termination.
Trigger new runs (6.5.1).

6.5.3 Cache flush (MLflow feature store)

When: After CUSUM config change, after feature engineering change, or when sample counts are wrong.

What gets flushed: Feature selection (Level 4) and feature engineering caches. ETL, labels, and models are NOT affected.

Procedure:

Kill all FTF pods first:

kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

Flush feature selection + feature engineering cache:

SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
deleted = 0
for run in runs:
    name = run.data.tags.get('mlflow.runName', '')
    if 'feature_selection' in name or 'feature_eng' in name.lower():
        client.delete_run(run.info.run_id)
        deleted += 1
print(f'Deleted {deleted} cache entries')
"

Verify cache is empty:

kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
names = [r.data.tags.get('mlflow.runName', '') for r in runs]
left = [n for n in names if 'feature_selection' in n or 'feature_eng' in n.lower()]
print(f'Remaining cache entries: {len(left)} (should be 0)')
"

Trigger new runs (6.5.1). Cache will regenerate automatically (~2 min extra on first run per crypto).

WARNING: Do NOT flush experiment 1 (ETL), 4 (Labels), 7 (HPO), or 8 (Models). These are independent of CUSUM config.

6.5.4 Killing stale FTF runs

# Kill all pods
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

# Check that no run is still running/queued (optional: runs whose pods were killed end up as 'failed' automatically)
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain

6.5.5 Monitoring FTF runs

# Pod status
kubectl get pods -n cvntrade --no-headers | grep finetune

# Progress per pod
for pod in $(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | awk '{print $1}'); do
  count=$(kubectl logs -n cvntrade "$pod" --tail=5000 | grep -c "event=weighted_variant_evaluated")
  crypto=$(kubectl logs -n cvntrade "$pod" | grep "Running factor=" | head -1 | grep -o "crypto=[A-Z]*")
  echo "$pod: $crypto completed=$count"
done

# CPU/memory usage
kubectl top pods -n cvntrade --no-headers | grep finetune

Grafana: the Infrastructure Monitoring dashboard shows FTF pods, CPU/memory, and throttling.
Grafana: the Fine-Tuning Results dashboard shows results as they arrive in PostgreSQL.

7. CVNTrade diagnostic taxonomy

Every testing/backtest run must produce one primary diagnosis.

Code	Meaning	Trigger signal
`ML_USELESS`	Model with no predictive value	f1_buy <= baseline_f1_buy (delta <= 0)
`ML_MARGINAL`	ML gain too small to survive execution	delta +0 to +5 pts vs baseline
`ML_EXPLOITABLE`	Exploitable signal, provided the execution envelope is favorable	delta +5 to +10 pts vs baseline
`ML_SOLID`	Solid signal	delta > +10 pts vs baseline
`ML_UNSTABLE`	Model unstable across periods	std_f1_buy > 0.10
`EXECUTION_MISMATCH`	Unfavorable envelope: ML exploitable but negative PnL	f1_buy > baseline +5pts and Sortino < 0
`SL_TOO_TIGHT`	Suspected overly tight SL (dominant hypothesis, not proven)	sl_pct > 50%
`TP_TOO_AMBITIOUS`	Take-profit too ambitious	tp_pct < 20%
`HORIZON_TOO_SHORT`	Horizon too short	timeout_pct > 40%
`SCREENING_OVERFIT`	Screening/testing divergence	screening_f1_buy - last_fold_f1_buy > 0.30
`PIPELINE_DEGRADED`	Degraded pipeline health	success_rate < 70%
`INFRA_SATURATED`	K8s / PVC / OOM saturation	OOMKill > 2, PVC > 85%, CPU > 90%
`NO_CANDIDATES`	No viable strategy	0 candidates passed on a full run

Decision tree (30 seconds)

1. f1_buy vs baseline_f1_buy?
   |
   |-- delta <= 0 pts           --> ML_USELESS
   |-- delta +0 to +5 pts       --> ML_MARGINAL
   |-- delta +5 to +10 pts      --> ML_EXPLOITABLE, check execution:
   |-- delta > +10 pts          --> ML_SOLID, check execution:
       |
       2. Sortino?
          |
          |-- Sortino > 0.5     --> strategy OK
          |-- Sortino < 0       --> EXECUTION_MISMATCH, check:
              |
              3. Exit stats?
                 |-- sl_pct > 50%      --> SL_TOO_TIGHT
                 |-- tp_pct < 20%      --> TP_TOO_AMBITIOUS
                 |-- timeout_pct > 40% --> HORIZON_TOO_SHORT
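
The tree above translates almost directly into code. The sketch below is a hedged illustration rather than the pipeline's actual verdict logic: it assumes `f1_buy` and `baseline_f1_buy` are fractions in [0, 1] so that one point equals 0.01, it places the exact boundaries (5 and 10 points) in the lower band, and it returns the ML verdict first followed by any execution findings. The tree does not say what to do for a Sortino between 0 and 0.5, so the sketch adds nothing in that case.

```python
def diagnose(f1_buy: float, baseline_f1_buy: float, sortino: float,
             sl_pct: float, tp_pct: float, timeout_pct: float) -> list[str]:
    """Walk the decision tree; the first code is the ML verdict."""
    delta_pts = (f1_buy - baseline_f1_buy) * 100  # assumed: F1 given as a fraction
    if delta_pts <= 0:
        return ["ML_USELESS"]
    if delta_pts <= 5:
        return ["ML_MARGINAL"]
    codes = ["ML_EXPLOITABLE" if delta_pts <= 10 else "ML_SOLID"]
    # Execution is only checked once the ML signal is exploitable or solid.
    if sortino < 0:
        codes.append("EXECUTION_MISMATCH")
        if sl_pct > 50:
            codes.append("SL_TOO_TIGHT")
        if tp_pct < 20:
            codes.append("TP_TOO_AMBITIOUS")
        if timeout_pct > 40:
            codes.append("HORIZON_TOO_SHORT")
    return codes


print(diagnose(0.60, 0.45, -0.3, sl_pct=58, tp_pct=25, timeout_pct=17))
# ['ML_SOLID', 'EXECUTION_MISMATCH', 'SL_TOO_TIGHT']
```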

8. Severity levels

Level	Description	Examples	Response
P0 Critical	System down or corruption	Pipeline completely blocked, OOMKill loop, PVC full	Immediate escalation, platform owner
P1 Urgent	Significant degradation	0 candidates on a full run, success rate < 70%	Investigation within the hour, MLOps maintainer
P2 Warning	Degraded performance	Negative Sortino, high SL rate, model stale > 14d	Investigation within the day, ML owner
P3 Info	Optimization	Slower run, slight drop in pass rate	Backlog, handle when available

9. Incident playbooks

9.1 EXECUTION_MISMATCH: ML useful but negative Sortino

Symptom: f1_buy >> baseline, Sortino < 0

Impact: the model detects BUY signals but the trades lose money

Hypotheses:
- SL_TOO_TIGHT: ATR multiplier too low
- TP_TOO_AMBITIOUS: TP multiplier too high
- HORIZON_TOO_SHORT: insufficient horizon

Checks:
1. sl_pct, tp_pct, timeout_pct in CORRELATION DATA
2. Compare the ATR ranges of passing and failing strategies
3. Check whether the pattern is specific to one crypto or general

Action:
- If sl_pct > 50%: relaunch with sl_range: "1.2,1.5,1.8,2.0"
- If tp_pct < 20%: reduce tp_range: "1.5,2.0,2.5"
- If timeout_pct > 40%: increase horizons

Exit criterion: Sortino > 0 or sl_pct < 40%

Escalation: ML owner after 3 attempts without improvement

9.2 ML_USELESS: model with no predictive value

Symptom: f1_buy <= baseline_f1_buy

Impact: the model does no better than predicting "always BUY"

Hypotheses:
- Dataset too imbalanced (buy_rate > 50%)
- Uninformative features
- Misaligned HPO objective

Checks:
1. buy_rate in CORRELATION DATA
2. HPO action_rate (does the model predict enough BUYs?)
3. Compare with other cryptos from the same run

Action:
1. Check buy_rate: if > 50%, the problem is the labeling
2. Test with a different horizon (changes the label distribution)
3. Review the HPO objective (is precision_recall_auc appropriate?)
4. As a last resort: review the feature set

Exit criterion: f1_buy > baseline + 15%

Escalation: ML owner

9.3 NO_CANDIDATES: no viable strategy

Symptom: 0 candidates passed on a full run

Impact: no exploitable strategy for this crypto/group

Checks:
1. Gate rejection reasons in Grafana > Testing & Backtest
2. If all rejected on Sortino: PTE problem (see 9.1)
3. If all rejected on n_trades: model too conservative (theta too high)
4. If all rejected on f1: ML problem (see 9.2)

Action:
- Relaunch with a wider grid (more horizons, wider ATR ranges)
- Test another crypto from the same group
- If recurrent across a whole group: the group may not be viable

Exit criterion: at least 1 candidate passes the gates

9.4 INFRA_SATURATED: pod OOMKilled or PVC full

Symptom: OOMKill > 0 in Grafana > Infra Monitoring, or PVC > 85%

Impact: unstable pipeline, runs that crash

Checks:
1. Identify the affected pod in Grafana > Infra Monitoring
2. Check whether the problem is recurrent or a one-off
3. Check memory consumed vs limits

Action:
- If one-off OOMKill: relaunch the run
- If recurrent OOMKill: raise memory limits in the Helm values
- If PVC > 85%: clean up old artifacts (MLflow / S3)

Exit criterion: OOMKill = 0 over 24h, PVC < 70%

Escalation: Platform owner

Last resort only:

kubectl delete pod <name> -n cvntrade

Use only if the pod is genuinely stuck, its controller can recreate it, and the cause has been qualified.

9.5 Run too long (> 3h)

Symptom: run duration > p95 + 30%, or > 3h absolute

Checks:
1. Identify the blocking step in Airflow
2. If HPO: check n_trials (50 = normal, > 100 = suspect)
3. If data fetch: check Binance API / S3 connectivity
4. If backtest: check the number of candles (60 days of 5-minute candles is about 17K, which is normal)

Action: depends on the identified cause

Exit criterion: duration < historical p95
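
Both numeric checks in this playbook are simple arithmetic. The sketch below shows them under stated assumptions: the helper names are hypothetical, durations are expressed in hours, and "p95 + 30%" is read as 1.30 times the historical p95.

```python
def expected_candles(days: int, interval_minutes: int) -> int:
    """Number of candles in a backtest window of `days` at a given interval."""
    return days * 24 * 60 // interval_minutes


def run_too_long(duration_h: float, p95_h: float) -> bool:
    """True if the run exceeds 3h absolute or the historical p95 by 30%."""
    return duration_h > 3.0 or duration_h > p95_h * 1.30


print(expected_candles(60, 5))      # 17280, the "17K" quoted in the checks
print(run_too_long(2.0, p95_h=1.5)) # True: 2.0h > 1.95h
```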

10. Available pipelines

Routine operation (operator)

Airflow DAG	Pipeline	Role
`launch__discovery`	`pte__discovery`	Screen -> Test -> WFRB -> Register challenger
`launch__backtesting`	`pte__backtesting`	Test -> WFRB from existing candidates
`launch__walkforward`	`pte__walkforward`	Walk-forward validation
`launch__retrain`	`pte__retrain`	Retrain + register
`launch__monitoring`	`pte__monitoring`	Health checks

Controlled administration (MLOps maintainer / ML owner)

Airflow DAG	Pipeline	Role	Approval required
`launch__meta_training`	`pte__meta_training`	Meta-label training	MLOps maintainer
`launch__promotion`	`pte__promotion`	Challenger -> Champion	ML owner + operator

Emergency (platform owner)

Airflow DAG	Role	Impact
`pte__7_killswitch`	Emergency quarantine	Disables a model
`pte__8_rollback`	Return to previous version	Replaces the champion
`pte__10_decommission`	Archiving / cleanup	Removes a model

Rule: never run killswitch, rollback, or decommission without explicit validation.

11. Crypto groups

Group	Indicative universe
`btc-core`	BTCUSDC, ETHUSDC
`defi`	SOLUSDC, ADAUSDC, BONKUSDC, XRPUSDC

Source of truth: the cvntrade_universes table in the PostgreSQL database.

Note: ML/Trading thresholds must be interpreted in the context of the group and the crypto. A 3% pass rate on a 450-point grid for a volatile crypto is not the same as a 3% pass rate on a 50-point grid for BTC.

12. Access

Service	URL	Profile	Usage	Level
Grafana	`grafana.cvntrade.eu`	Operator	supervision	readonly
Airflow	`airflow.cvntrade.eu`	Operator / MLOps	execution + logs	trigger + readonly
MLflow	`mlflow.cvntrade.eu`	MLOps / ML owner	models	registry admin
ZenML	`zenml.cvntrade.eu`	MLOps / ML owner	lineage	readonly
W&B	`wandb.ai`	ML owner	ML analysis	organization account
K8s	kubectl	Platform owner	troubleshooting	restricted admin

Rule: never document or share credentials in this document. All access is managed through Kubernetes secrets.

13. Operating rules (policy)

Grafana first: Airflow logs are for debugging, not for decisions
Airflow is for executing, not for concluding
MLflow for models, W&B for ML analysis
ZenML for lineage
No silent fallback in diagnostics (ADR-25)
No destructive action without qualifying the problem first
Every conclusion must distinguish ML, PTE/backtest, and infra
Intra-crypto comparison only (ADR-27)
0 SELL is normal in binary mode (ADR-28)
Every ML metric must be compared to the naive baseline (ADR-29)

14. Observability debt

The following items currently live in the Airflow logs and must migrate to Grafana:

Item	Current source	Target
`f1_buy vs baseline_f1_buy`	`CORRELATION DATA` logs	Grafana dashboard
`screening_f1_buy vs last_fold_f1_buy`	`CORRELATION DATA` logs	Grafana dashboard
`tp_pct / sl_pct / timeout_pct`	`CORRELATION DATA` logs	Grafana dashboard
Primary diagnostic code	Logs (verdict)	Grafana dashboard
CVNTrade global state	Human computation	Grafana dashboard (automated)

Prerequisite: write the diagnostics to PostgreSQL (not only to the logs) so that Grafana can query them.
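
As a rough illustration of that prerequisite, the sketch below persists one diagnostic row per run and crypto. Everything in it is an assumption: the `run_diagnostics` table, its columns, and the `record_diagnostic` helper are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical schema; the real PostgreSQL table design is not defined in this document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS run_diagnostics (
    run_id          TEXT NOT NULL,
    crypto          TEXT NOT NULL,
    diagnostic_code TEXT NOT NULL,
    f1_buy          REAL,
    baseline_f1_buy REAL,
    sortino         REAL,
    recorded_at     TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (run_id, crypto)
)
"""


def record_diagnostic(conn, run_id, crypto, code, f1_buy, baseline_f1_buy, sortino):
    """Upsert one diagnostic row so a dashboard query can read it later."""
    with conn:  # commits on success, rolls back on error
        # SQLite syntax; on PostgreSQL this would be INSERT ... ON CONFLICT DO UPDATE.
        conn.execute(
            "INSERT OR REPLACE INTO run_diagnostics "
            "(run_id, crypto, diagnostic_code, f1_buy, baseline_f1_buy, sortino) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (run_id, crypto, code, f1_buy, baseline_f1_buy, sortino),
        )


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute(SCHEMA)
record_diagnostic(conn, "run-001", "SOLUSDC", "EXECUTION_MISMATCH", 0.58, 0.50, -0.2)
rows = conn.execute("SELECT crypto, diagnostic_code FROM run_diagnostics").fetchall()
print(rows)
```

Keying the row on run and crypto keeps the "one primary diagnosis per run" rule enforceable at the storage level, and a re-run of the same step overwrites rather than duplicates.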

Glossary

Term	Definition
screening	Phase 1: fast preselection of candidate PTEs (grid search)
testing	Phase 2: multi-fold HPO validation + OOS backtest
WFRB	Phase 3: walk-forward rolling backtest validation
challenger	Model registered but not promoted, awaiting approval
champion	Active reference model for a crypto
stale	Model not refreshed for N days
baseline_f1_buy	F1 score of a naive "always BUY" classifier, the minimum reference
buy_rate	Proportion of BUY labels in the test split
PTE	Trade execution parameters ("Parametres de Trade Execution"): SL, TP, horizon
golden signal	Key supervision metric, the core of observability
last_fold_f1_buy	BUY F1 of the last fold (same period as the screening)
CORRELATION DATA	Structured block in the test_step logs, a stable interface (ADR-30)

Working with the docs site (docs.cvntrade.eu)

The Design System, ADRs, and runbooks are published to docs.cvntrade.eu. Source lives in documentation/; build config is mkdocs.yml at the repo root. This is Phase 2 of #593 (#637 tracks the scaffolding).

Local preview

make docs-install      # one-time: installs mkdocs + plugins into .venv_airflow
make docs-serve        # hot-reload at http://127.0.0.1:8000

Edit any .md file in documentation/ — the browser reloads on save.

Local strict build (same checks as CI)

make docs-build

--strict fails on broken internal links, missing nav entries, or unknown config. If this passes locally, CI will pass too.

Adding a page

Drop the .md file in the right subfolder (needs/, epics/, adr/, …).
Add an entry in mkdocs.yml under nav: so it's discoverable (required unless it's referenced from another page's index).
make docs-build — fix any broken links.
Commit + PR. CI rebuilds on every PR that touches documentation/**.

Deploy

main push → .github/workflows/docs.yml → builds strict → GitHub Pages.
First deploy: enable GH Pages in repo settings (Source: GitHub Actions), then set CNAME docs.cvntrade.eu → dococeven.github.io at the registrar.
Subsequent deploys are fully automated. No operator action required.

Architecture diagrams

documentation/architecture/workspace.dsl is the single Structurizr DSL source. See documentation/architecture/workspace-reference.md for rendering options (VS Code live preview, structurizr-cli, Structurizr Lite in Docker).

Troubleshooting

Symptom	Cause / fix
`mkdocs: command not found`	Run `make docs-install`.
Strict build fails on `broken link`	Fix the `.md` link — relative path from the file's location.
New file isn't in the sidebar	Add it under `nav:` in `mkdocs.yml`.
`README.md` shows as an empty page	Expected — `README.md` at docs_dir root is excluded in favor of `index.md`.

OpenProject operator playbook (#593 Phase 1)

URL: https://openproject.cvntrade.eu
Role: product source of truth for Needs / Epics / Stories / Releases.
Deployed via: infra/helm/openproject/ chart, Helm-managed through the deploy-k8s workflow.
Chart setup doc: infra/helm/openproject/README.md §Setup.

Browser → https://openproject.cvntrade.eu/login
Username admin initially; operator account created post-setup
If cert warning appears, cert-manager hasn't yet issued — wait 5 min and refresh

Create a new Need

Top nav → project CVNTrade → Work packages
Create → type Need
Subject: short title, e.g. "Reach F1=0.75 binary classification"
Custom fields:
need_id = next available CVN-N<nnn> (strictly sequential)
github_issue_url = link to the parent GitHub issue
Description: follow documentation/templates/TEMPLATE_need.md
Save → OpenProject auto-generates the work package ID (ignore it, use need_id for referencing)

Create an Epic under a Need

Open the Need work package
Relations tab → Add relation → "includes" → New work package
Type Epic, set epic_id = <need_id>-E<letter> (A, B, C… per Need)
Save

Close an Epic

Open the Epic
Status → Closed
Fill the epic's Closure section (template §7) in the description
The parent Need's % complete updates automatically

Link a PR to a Story

The link must be made in both directions:

OpenProject side: add GitHub PR URL to the Story's Relations tab
GitHub side: PR body must contain CVN-N<nnn>-US<m> (enforced by Phase 4 CI gate once live)

Create a Release

Work packages → Create → type Release
release_id = CVN-R<yyyymmdd>-<n>
Add all closed Epics as "part of" relations
Attach backtest report links (URL or file)
When deploy succeeds, set status → Deployed
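
The identifier conventions used in these procedures lend themselves to a quick format check before saving a work package or opening a PR. The sketch below is an assumption-laden illustration: it reads `<nnn>` as exactly three digits, `<letter>` as one uppercase letter, and `<yyyymmdd>` as eight digits without validating the date; the pattern and function names are hypothetical and unrelated to the Phase 4 CI gate.

```python
import re

# Assumed readings of the placeholders: <nnn> = 3 digits, <letter> = A-Z,
# <m> and <n> = one or more digits, <yyyymmdd> = 8 digits (date not validated).
NEED_ID = re.compile(r"CVN-N\d{3}")
EPIC_ID = re.compile(r"CVN-N\d{3}-E[A-Z]")
RELEASE_ID = re.compile(r"CVN-R\d{8}-\d+")
STORY_REF = re.compile(r"\bCVN-N\d{3}-US\d+\b")


def is_valid_id(pattern: re.Pattern, value: str) -> bool:
    """True if the whole string matches the identifier pattern."""
    return pattern.fullmatch(value) is not None


def pr_references_story(pr_body: str) -> bool:
    """True if the PR body mentions at least one Story reference."""
    return STORY_REF.search(pr_body) is not None


print(is_valid_id(NEED_ID, "CVN-N012"))                         # True
print(is_valid_id(EPIC_ID, "CVN-N012-EA"))                      # True
print(is_valid_id(RELEASE_ID, "CVN-R20260328-1"))               # True
print(pr_references_story("Implements CVN-N012-US3: cache flush"))  # True
```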

Backups

OpenProject DB carries all operator-entered data. Losses are unrecoverable without a backup.

If running on a dedicated Scaleway PG instance: daily snapshots are configured at the PG level (7-day retention by default).
If reusing the champollion instance: see infra/helm/openproject/README.md §Backup for the pg_dump CronJob option.