CVNTrade MLOps Operations Runbook¶
Version: 3.0. Last updated: 2026-03-28. Scope: operating the CVNTrade MLOps platform.
Operational status:
- Grafana (5 dashboards): operational
- Airflow (10 DAGs): operational
- ZenML (7 native pipelines): operational
- MLflow (model registry): operational
- Prometheus (K8s metrics): operational
- Automatic alerting (Slack/SMS): target (issue #397)
- Drift monitoring: target (not yet implemented)
- Automatic diagnosis in Grafana: target (structured logs still to migrate)
0. CVNTrade global state¶
Every operating session starts by determining the global state.
| State | Definition | Reaction |
|---|---|---|
| GREEN | Healthy infra, healthy pipelines, at least 1 recent useful run | normal operations |
| YELLOW | Controlled drift: weak anomaly signals | reinforced monitoring |
| ORANGE | Significant business or pipeline degradation | priority investigation |
| RED | Outage or inability to produce | immediate escalation |
Calculation rules¶
| Condition | State |
|---|---|
| Critical infra (pod down, PVC full, OOMKill loop) | RED |
| Pipeline success rate < 70% | ORANGE |
| Pass rate = 0 on 3 consecutive runs | ORANGE |
| Model stale > 14d on a core crypto (btc-core) | ORANGE |
| 0 useful candidates on the full universe for > 3 days | ORANGE |
| Success rate 70-90% or recurring negative Sortino | YELLOW |
| Model stale > 7d on a non-core crypto | YELLOW |
| Everything else | GREEN |
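These rules can be sketched as a small evaluation function. This is an illustrative sketch only: the function name and input fields are hypothetical, not an existing CVNTrade module, and the real state is currently computed by hand (see section 14).

```python
# Illustrative sketch of the Section 0 state rules.
# All field names in the input dict are hypothetical.
def global_state(m: dict) -> str:
    """Return GREEN/YELLOW/ORANGE/RED from observed metrics."""
    if m.get("infra_critical"):                        # pod down, PVC full, OOMKill loop
        return "RED"
    if (m.get("success_rate", 1.0) < 0.70
            or m.get("zero_pass_runs", 0) >= 3         # pass rate = 0 on 3 consecutive runs
            or m.get("core_model_age_days", 0) > 14    # stale model on btc-core
            or m.get("days_without_candidate", 0) > 3):
        return "ORANGE"
    if (m.get("success_rate", 1.0) < 0.90
            or m.get("sortino_negative_recurrent", False)
            or m.get("noncore_model_age_days", 0) > 7):
        return "YELLOW"
    return "GREEN"
```

For example, `global_state({"success_rate": 0.65})` yields `"ORANGE"`, while an empty metrics dict falls through to `"GREEN"`.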
1. Purpose¶
This document describes the operating procedure for:
- determining the global system state,
- supervising the golden signals,
- launching the standard pipelines,
- diagnosing the results,
- escalating according to severity.
Guiding principle: Grafana is the single entry point (ADR-26). The other tools are for drill-down, depending on the nature of the problem.
2. CVNTrade Golden Signals¶
These signals are the core of supervision. Every operator view must derive from them.
Platform¶
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| `infra_health` | Saturation / OOM / PVC / availability | OOM=1, CPU>70% | OOM>2, PVC>85%, pod down |
| `pipeline_success_rate` | Rolling success rate (7d) | 70-90% | < 70% |
| `pipeline_latency` | Rolling duration vs historical p95 | p95 + 30% | p95 + 100% |
ML (per crypto — ADR-27)¶
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| `f1_buy_delta_vs_baseline` | `f1_buy - baseline_f1_buy` | delta < 15% | delta <= 0 |
| `screening_testing_gap` | `screening_f1_buy - last_fold_f1_buy` | gap > 0.15 | gap > 0.30 |
| `std_f1_buy` | Inter-fold stability | 0.05-0.10 | > 0.10 |
Trading¶
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| `sortino` | Primary financial KPI | 0 to 0.5 | < 0 |
| `sl_pct` | Share of stop-loss exits | 35-50% | > 50% |
| `timeout_pct` | Share of timeout exits | 30-40% | > 40% |
Lifecycle¶
| Signal | Definition | Warning threshold | Critical threshold |
|---|---|---|---|
| `model_freshness` | Age of the latest useful model | > 7d | > 14d |
| `challenger_count` | Registered challengers not yet promoted | > 5 pending | > 10 pending |
Rule: these signals must be visible in Grafana without reading logs.
Observability debt: today, the ML and Trading signals are only visible in the Airflow logs (CORRELATION DATA section). Migrating them to Grafana is a priority target.
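The warning/critical bands in the tables above all follow the same pattern: a signal is critical past one threshold and warning past another, with the direction depending on whether higher is better (Sortino) or worse (`sl_pct`). A generic helper can keep the tables as the single source of truth; the function below is a hypothetical sketch, not an existing CVNTrade utility.

```python
# Hypothetical helper: classify one golden signal against its
# warning/critical thresholds from the Section 2 tables.
def classify(value, warn, crit, higher_is_better=True):
    """Return 'critical', 'warning' or 'ok' for a scalar signal."""
    if higher_is_better:
        if value < crit:
            return "critical"
        if value < warn:
            return "warning"
    else:
        if value > crit:
            return "critical"
        if value > warn:
            return "warning"
    return "ok"

# From the tables: sortino is warning in [0, 0.5), critical below 0;
# sl_pct (in %) is warning above 35, critical above 50.
assert classify(0.3, warn=0.5, crit=0.0) == "warning"
assert classify(55, warn=35, crit=50, higher_is_better=False) == "critical"
```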
3. Tool architecture¶
```text
GRAFANA (grafana.cvntrade.eu) <- SINGLE ENTRY POINT
|-- global supervision (golden signals)
|-- business, pipeline, model and infra dashboards
'-- centralized alerting (target, #397)

AIRFLOW (airflow.cvntrade.eu) <- EXECUTION
|-- trigger DAGs
|-- run tracking
'-- detailed logs (debug only, not decisions)

ZENML (zenml.cvntrade.eu) <- PIPELINE DRILL-DOWN
|-- lineage
|-- run history
'-- versioned pipeline artifacts

W&B (wandb.ai) <- ML DRILL-DOWN
|-- screening matrices
|-- HPO trials
'-- detailed comparison of ML runs

MLFLOW (mlflow.cvntrade.eu) <- MODEL DRILL-DOWN
|-- model registry
|-- per-run metrics
'-- model-related artifacts

PROMETHEUS (internal to K8s) <- INFRA METRICS
'-- collects CPU/RAM/pod metrics, exposed in Grafana
```
4. Roles and responsibilities¶
| Role | Responsible for | Access |
|---|---|---|
| Operator | Morning check, standard launches, reading Grafana | Grafana (readonly), Airflow (trigger + logs) |
| MLOps maintainer | Pipeline incidents, registry, diagnostic metrics | Airflow (admin), MLflow (registry admin), ZenML |
| ML owner | Model interpretation, features, labels, HPO | W&B, MLflow (readonly), ZenML |
| Platform owner | K8s, PVCs, nodes, limits, ingress, secrets | kubectl (restricted admin), Helm |
Escalation¶
| Situation | Decides | Executes | Approves |
|---|---|---|---|
| Launch a discovery | Operator | Operator | - |
| Investigate a pipeline failure | MLOps maintainer | MLOps maintainer | - |
| Change HPO / features / labels | ML owner | ML owner | MLOps maintainer |
| Promote challenger -> champion | ML owner | MLOps maintainer | Operator (business validation) |
| Killswitch / rollback | Operator (emergency) | Platform owner | - |
| Change Helm / K8s resources | Platform owner | Platform owner | MLOps maintainer |
5. Grafana dashboards¶
| Dashboard | Primary role | Usage |
|---|---|---|
| MLOps Overview | Executive view | Global state of models, screenings, pass rate |
| Testing & Backtest | Quality view | Testing results, WFRB, gates, rejections |
| Pipeline Health | Execution view | Run durations, success rate, HPO, anomalies |
| Model Registry | Model lifecycle view | Versions, freshness, active / stale models |
| Infra Monitoring | Platform view | Pods, CPU, RAM, PVCs, OOMKills, saturation |
Good practice:
- start with MLOps Overview or Pipeline Health
- go to Infra Monitoring if a platform problem is suspected
- go to Testing & Backtest if the question is about result quality
6. Standard procedures¶
6.1 Morning check¶
Target duration: 2 to 5 minutes
Step 1 — Determine the global state
Open Grafana > Infra Monitoring, then Pipeline Health. Determine the state (GREEN / YELLOW / ORANGE / RED) using the Section 0 rules. If RED or ORANGE → go straight to section 9 (incidents).
Step 2 — Platform health
Grafana > Infra Monitoring. Check:
- critical pods are up,
- no recent OOMKills,
- CPU/RAM saturation below thresholds,
- PVCs below the alert threshold.
Step 3 — Pipeline health
Grafana > Pipeline Health. Check:
- overnight runs have finished,
- failure rate within the norm,
- no duration drift,
- no backlog.
Step 4 — Business results
Grafana > MLOps Overview. Check:
- new qualified candidates,
- pass rate within the norm,
- freshness of the latest useful runs.
6.2 Launching a discovery¶
Tool: Airflow — DAG: `launch__discovery`
Group run:
Targeted run:
Extended run:
```json
{
  "group": "defi",
  "horizons": "H1,H2,H3,H4,H5,H6",
  "sl_range": "0.8,1.0,1.2,1.5",
  "tp_range": "1.5,2.0,2.5,3.0",
  "hpo_trials": 50,
  "backtest_days": 60
}
```
Monitoring: Airflow (DAG run status), Grafana (results after completion).
Note: ML/Trading thresholds are interpreted in the context of the run (universe, grid size, horizons).
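Besides the Airflow UI, a run can be triggered programmatically through Airflow 2's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). The sketch below only builds the request; the endpoint path is standard Airflow, but auth handling is deliberately omitted here since all credentials live in Kubernetes secrets (section 12).

```python
# Sketch: build a DAG-run trigger request against Airflow's stable REST API.
# The conf mirrors the extended-run example above; auth headers are omitted.
import json
from urllib import request

def build_dagrun(dag_id: str, conf: dict) -> request.Request:
    """Prepare POST /api/v1/dags/{dag_id}/dagRuns with a JSON conf."""
    url = f"https://airflow.cvntrade.eu/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = build_dagrun("launch__discovery", {
    "group": "defi",
    "horizons": "H1,H2,H3,H4,H5,H6",
    "hpo_trials": 50,
})
# request.urlopen(req) would submit it, once auth headers are added.
```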
6.3 Analyzing results¶
Level 1 — Grafana > Testing & Backtest
Look for: best qualified strategies, rejection reasons, F1/Sortino distributions.
Level 2 — Airflow logs (debug only)
Look for the structured `=== CORRELATION DATA ===` blocks. Distinguish:
- ML quality: `f1_buy` vs `baseline_f1_buy`, `precision_buy`, `recall_buy`, `buy_rate`
- financial quality: Sortino, tp/sl/timeout %, gates
- consistency: `screening_f1_buy` vs `last_fold_f1_buy` (same metric, same period — ADR-28, ADR-29)
Level 3 — W&B drill-down: screening matrix, HPO trials, comparison of ML runs.
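When working at Level 2, the CORRELATION DATA block can be parsed into a dict instead of being read by eye. The sketch below assumes simple `key=value` lines inside the block; the real log layout (ADR-30) may differ, so treat this as illustrative only.

```python
# Sketch: extract the === CORRELATION DATA === block from raw log text.
# Assumes 'key=value' lines; the real block layout may differ.
import re

def parse_correlation_data(log_text: str) -> dict:
    m = re.search(r"=== CORRELATION DATA ===\n(.*?)(?:\n===|\Z)",
                  log_text, re.S)
    if m is None:
        return {}
    out = {}
    for line in m.group(1).splitlines():
        kv = re.match(r"\s*(\w+)\s*=\s*(\S+)", line)
        if kv:
            key, val = kv.groups()
            try:
                out[key] = float(val)   # numeric fields (f1_buy, sortino, ...)
            except ValueError:
                out[key] = val          # keep non-numeric values as strings
    return out

sample = """=== CORRELATION DATA ===
f1_buy=0.62
baseline_f1_buy=0.48
sortino=-0.3
sl_pct=0.55
=== END ==="""
data = parse_correlation_data(sample)
```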
6.4 Checking a specific model¶
- Grafana > Model Registry — identify the crypto / version
- MLflow — source run, metrics, artifacts, hyperparameters
- ZenML — source pipeline, upstream artifacts, lineage
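The `model_freshness` lifecycle signal from section 2 (warning > 7d, critical > 14d) is easy to check while inspecting a model. A minimal sketch, with a hypothetical function name and inputs rather than an existing CVNTrade helper:

```python
# Illustrative freshness check using the Section 2 lifecycle thresholds.
from datetime import datetime, timezone

def freshness_status(last_useful_model: datetime, now: datetime) -> str:
    """model_freshness: 'warning' past 7 days, 'critical' past 14 days."""
    age_days = (now - last_useful_model).days
    if age_days > 14:
        return "critical"
    if age_days > 7:
        return "warning"
    return "ok"

now = datetime(2026, 3, 28, tzinfo=timezone.utc)
freshness_status(datetime(2026, 3, 25, tzinfo=timezone.utc), now)  # 3 days old
```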
6.5 FTF — Fine-Tuning Framework Operations¶
6.5.1 Launching FTF runs¶
1. Verify there are no stale runs (see 6.5.5 for the status commands). If stale runs exist → kill them first (6.5.4).
2. Trigger from the Airflow UI or CLI.
3. Monitor the first 2 minutes — verify the sample count:
   - Expected: `Train: ~1000-2000 samples` (CUSUM enabled)
   - If `Train: >10000 samples`: STOP — the cache is stale or CUSUM is misconfigured. See 6.5.3.
6.5.2 After Helm deploy with FTF config changes¶
MANDATORY — pods keep old config until killed.
1. Verify the new code is deployed.
2. Kill ALL running FTF pods (command in 6.5.4).
3. Wait 30 seconds for termination.
4. Trigger new runs (6.5.1).
6.5.3 Cache flush (MLflow feature store)¶
When: After CUSUM config change, after feature engineering change, or when sample counts are wrong.
What gets flushed: Feature selection (Level 4) and feature engineering caches. ETL, labels, and models are NOT affected.
Procedure:
1. Kill all FTF pods first (command in 6.5.4).
2. Flush the feature selection + feature engineering cache:

```shell
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
deleted = 0
for run in runs:
    name = run.data.tags.get('mlflow.runName', '')
    if 'feature_selection' in name or 'feature_eng' in name.lower():
        client.delete_run(run.info.run_id)
        deleted += 1
print(f'Deleted {deleted} cache entries')
"
```

3. Verify the cache is empty:

```shell
kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
import os, sys
sys.path.insert(0, '/opt/airflow/src')
import mlflow
mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=['3'], max_results=500)
fs = [r for r in runs if 'feature_selection' in r.data.tags.get('mlflow.runName', '')]
print(f'Remaining feature_selection entries: {len(fs)} (should be 0)')
"
```

4. Trigger new runs (6.5.1). The cache regenerates automatically (~2 min extra on the first run per crypto).
WARNING: Do NOT flush experiment 1 (ETL), 4 (Labels), 7 (HPO), or 8 (Models). These are independent of CUSUM config.
6.5.4 Killing stale FTF runs¶
```shell
# Kill all pods
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

# List stuck runs in Airflow (optional check; they'll be marked 'failed' automatically)
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain
```
6.5.5 Monitoring FTF runs¶
```shell
# Pod status
kubectl get pods -n cvntrade --no-headers | grep finetune

# Progress per pod
for pod in $(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | awk '{print $1}'); do
  count=$(kubectl logs -n cvntrade $pod --tail=5000 | grep "event=weighted_variant_evaluated" | wc -l)
  crypto=$(kubectl logs -n cvntrade $pod | grep "Running factor=" | head -1 | grep -o "crypto=[A-Z]*")
  echo "$pod: $crypto completed=$count"
done

# CPU/memory usage
kubectl top pods -n cvntrade --no-headers | grep finetune
```
Grafana: the Infra Monitoring dashboard shows FTF pods, CPU/memory and throttling. The Fine-Tuning Results dashboard shows results as they arrive in PostgreSQL.
7. CVNTrade diagnosis taxonomy¶
Every testing/backtest run must produce a primary diagnosis.
| Code | Meaning | Triggering signal |
|---|---|---|
| `ML_USELESS` | Model with no predictive value | `f1_buy <= baseline_f1_buy` (delta <= 0) |
| `ML_MARGINAL` | ML gain too small to survive execution | delta +0 to +5 pts vs baseline |
| `ML_EXPLOITABLE` | Exploitable signal, provided the execution envelope is favorable | delta +5 to +10 pts vs baseline |
| `ML_SOLID` | Solid signal | delta > +10 pts vs baseline |
| `ML_UNSTABLE` | Model unstable across periods | `std_f1_buy` > 0.10 |
| `EXECUTION_MISMATCH` | Unfavorable envelope: exploitable ML but negative PnL | `f1_buy` > baseline +5pts and Sortino < 0 |
| `SL_TOO_TIGHT` | Suspected over-tight SL (dominant hypothesis, not proven) | `sl_pct` > 50% |
| `TP_TOO_AMBITIOUS` | Take-profit too ambitious | `tp_pct` < 20% |
| `HORIZON_TOO_SHORT` | Horizon too short | `timeout_pct` > 40% |
| `SCREENING_OVERFIT` | Screening/testing divergence | `screening_f1_buy - last_fold_f1_buy` > 0.30 |
| `PIPELINE_DEGRADED` | Degraded pipeline health | `success_rate` < 70% |
| `INFRA_SATURATED` | K8s / PVC / OOM saturation | OOMKill > 2, PVC > 85%, CPU > 90% |
| `NO_CANDIDATES` | No viable strategy | 0 candidates passed on a full run |
Decision tree (30 seconds)¶
```text
1. f1_buy vs baseline_f1_buy ?
   |
   |-- delta <= 0 pts      --> ML_USELESS
   |-- delta +0 to +5 pts  --> ML_MARGINAL
   |-- delta +5 to +10 pts --> ML_EXPLOITABLE, check execution:
   |-- delta > +10 pts     --> ML_SOLID, check execution:
   |
2. Sortino ?
   |
   |-- Sortino > 0.5 --> strategy OK
   |-- Sortino < 0   --> EXECUTION_MISMATCH, check:
   |
3. Exit stats ?
   |-- sl_pct > 50%      --> SL_TOO_TIGHT
   |-- tp_pct < 20%      --> TP_TOO_AMBITIOUS
   |-- timeout_pct > 40% --> HORIZON_TOO_SHORT
```
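The same tree can be sketched as a function returning a single diagnosis code. This is illustrative, not an existing CVNTrade module: deltas are expressed in F1 points (0.05 = 5 pts) and rates as fractions in [0, 1].

```python
# Sketch of the 30-second decision tree, using the CORRELATION DATA
# field names from section 6.3. Inputs and defaults are illustrative.
def diagnose(f1_buy, baseline_f1_buy, sortino,
             sl_pct=0.0, tp_pct=1.0, timeout_pct=0.0) -> str:
    delta = f1_buy - baseline_f1_buy
    if delta <= 0:
        return "ML_USELESS"
    if delta <= 0.05:
        return "ML_MARGINAL"
    # ML is exploitable or solid: check execution next.
    if sortino < 0:
        if sl_pct > 0.50:
            return "SL_TOO_TIGHT"
        if tp_pct < 0.20:
            return "TP_TOO_AMBITIOUS"
        if timeout_pct > 0.40:
            return "HORIZON_TOO_SHORT"
        return "EXECUTION_MISMATCH"
    return "ML_SOLID" if delta > 0.10 else "ML_EXPLOITABLE"
```

For example, `diagnose(0.65, 0.5, -0.2, sl_pct=0.6)` routes straight to `SL_TOO_TIGHT`, matching playbook 9.1.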
8. Severity levels¶
| Level | Description | Examples | Reaction |
|---|---|---|---|
| P0 Critical | System down or corruption | Pipeline fully blocked, OOMKill loop, PVC full | Immediate escalation, platform owner |
| P1 Urgent | Significant degradation | 0 candidates on a full run, success rate < 70% | Investigate within the hour, MLOps maintainer |
| P2 Warning | Degraded performance | Negative Sortino, high SL rate, model stale > 14d | Investigate within the day, ML owner |
| P3 Info | Optimization | Slower run, slightly declining pass rate | Backlog, handle when available |
9. Incident playbooks¶
9.1 EXECUTION_MISMATCH — useful ML but negative Sortino¶
Symptom: `f1_buy` >> baseline, Sortino < 0
Impact: the model detects BUYs but the trades lose money
Hypotheses:
- SL_TOO_TIGHT: ATR multiplier too low
- TP_TOO_AMBITIOUS: TP multiplier too high
- HORIZON_TOO_SHORT: insufficient horizon
Checks:
1. `sl_pct`, `tp_pct`, `timeout_pct` in CORRELATION DATA
2. Compare ATR ranges between passing and failing strategies
3. Check whether the pattern is specific to one crypto or general
Action:
- If `sl_pct` > 50%: relaunch with `sl_range: "1.2,1.5,1.8,2.0"`
- If `tp_pct` < 20%: reduce `tp_range: "1.5,2.0,2.5"`
- If `timeout_pct` > 40%: increase the horizons
Exit criterion: Sortino > 0 or `sl_pct` < 40%
Escalation: ML owner after 3 attempts without improvement
9.2 ML_USELESS — model with no predictive value¶
Symptom: `f1_buy <= baseline_f1_buy`
Impact: the model does no better than predicting "always BUY"
Hypotheses:
- Dataset too imbalanced (`buy_rate` > 50%)
- Uninformative features
- Misaligned HPO objective
Checks:
1. `buy_rate` in CORRELATION DATA
2. HPO `action_rate` (does the model predict enough BUYs?)
3. Compare with other cryptos from the same run
Action:
1. Check `buy_rate` — if > 50%, the problem is the labeling
2. Try a different horizon (changes the label distribution)
3. Review the HPO objective (is `precision_recall_auc` appropriate?)
4. As a last resort: review the feature set
Exit criterion: `f1_buy` > baseline + 15%
Escalation: ML owner
9.3 NO_CANDIDATES — no viable strategy¶
Symptom: 0 candidates passed on a full run
Impact: no exploitable strategy for this crypto/group
Checks:
1. Gate rejection reasons in Grafana > Testing & Backtest
2. If all rejected by Sortino: PTE problem (see 9.1)
3. If all rejected by n_trades: model too conservative (theta too high)
4. If all rejected by F1: ML problem (see 9.2)
Action:
- Relaunch with a wider grid (more horizons, wider ATR ranges)
- Try another crypto from the same group
- If recurrent across a whole group: the group may simply not be viable
Exit criterion: at least 1 candidate passes the gates
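Check 1 above (tallying which gate rejected each candidate) is a simple aggregation, sketched below. The candidate dict shape and `rejected_by` field are hypothetical; the real rejection data lives in Grafana > Testing & Backtest.

```python
# Illustrative tally of gate rejection reasons, to route the playbook:
# sortino -> 9.1, n_trades -> theta too high, f1 -> 9.2.
from collections import Counter

def rejection_breakdown(candidates):
    """Count rejections per gate; candidate shape is hypothetical."""
    return Counter(c["rejected_by"] for c in candidates if c.get("rejected_by"))

stats = rejection_breakdown([
    {"id": 1, "rejected_by": "sortino"},
    {"id": 2, "rejected_by": "sortino"},
    {"id": 3, "rejected_by": "f1"},
])
```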
9.4 INFRA_SATURATED — pod OOMKilled or PVC full¶
Symptom: OOMKill > 0 in Grafana Infra, or PVC > 85%
Impact: unstable pipeline, crashing runs
Checks:
1. Identify the affected pod in Grafana Infra
2. Check whether the problem is recurrent or one-off
3. Check memory consumed vs limits
Action:
- One-off OOMKill: relaunch the run
- Recurrent OOMKill: raise the memory limits in the Helm values
- PVC > 85%: clean up old artifacts (MLflow / S3)
Exit criterion: OOMKill = 0 over 24h, PVC < 70%
Escalation: Platform owner
Last resort only:
Use only if the pod is genuinely stuck, the controller can recreate it, and the cause has been qualified.
9.5 Run too long (> 3h)¶
Symptom: run duration > p95 + 30%, or > 3h absolute
Checks:
1. Identify the blocking step in Airflow
2. If HPO: check `n_trials` (50 = normal, > 100 = suspect)
3. If data fetch: check Binance API / S3 connectivity
4. If backtest: check the number of candles (60d x 5min = 17K = normal)
Action: depends on the identified cause
Exit criterion: duration < historical p95
10. Available pipelines¶
Routine operation (operator)¶
| Airflow DAG | Pipeline | Role |
|---|---|---|
| `launch__discovery` | `pte__discovery` | Screen -> Test -> WFRB -> Register challenger |
| `launch__backtesting` | `pte__backtesting` | Test -> WFRB from existing candidates |
| `launch__walkforward` | `pte__walkforward` | Walk-forward validation |
| `launch__retrain` | `pte__retrain` | Retrain + register |
| `launch__monitoring` | `pte__monitoring` | Health checks |
Controlled administration (MLOps maintainer / ML owner)¶
| Airflow DAG | Pipeline | Role | Approval required |
|---|---|---|---|
| `launch__meta_training` | `pte__meta_training` | Meta-label training | MLOps maintainer |
| `launch__promotion` | `pte__promotion` | Challenger -> Champion | ML owner + operator |
Emergency (platform owner)¶
| Airflow DAG | Role | Impact |
|---|---|---|
| `pte__7_killswitch` | Emergency quarantine | Disables a model |
| `pte__8_rollback` | Revert to the previous version | Replaces the champion |
| `pte__10_decommission` | Archiving / cleanup | Removes a model |
Rule: never run killswitch, rollback or decommission without explicit approval.
11. Crypto groups¶
| Group | Indicative universe |
|---|---|
| `btc-core` | BTCUSDC, ETHUSDC |
| `defi` | SOLUSDC, ADAUSDC, BONKUSDC, XRPUSDC |
Source of truth: the `cvntrade_universes` table in PostgreSQL.
Note: ML/Trading thresholds are interpreted in the context of the group and the crypto. A 3% pass rate on a 450-point grid for a volatile crypto is not the same as a 3% pass rate on a 50-point grid for BTC.
12. Access¶
| Service | URL | Profile | Usage | Level |
|---|---|---|---|---|
| Grafana | grafana.cvntrade.eu | Operator | supervision | readonly |
| Airflow | airflow.cvntrade.eu | Operator / MLOps | execution + logs | trigger + readonly |
| MLflow | mlflow.cvntrade.eu | MLOps / ML owner | models | registry admin |
| ZenML | zenml.cvntrade.eu | MLOps / ML owner | lineage | readonly |
| W&B | wandb.ai | ML owner | ML analysis | organization account |
| K8s | kubectl | Platform owner | troubleshooting | restricted admin |
Rule: never document or share credentials in this document. All access is managed through Kubernetes secrets.
13. Operating rules (policy)¶
- Grafana first — Airflow logs are for debugging, not decisions
- Airflow is for executing, not for concluding
- MLflow for models, W&B for ML analysis
- ZenML for lineage
- No silent fallbacks in diagnoses (ADR-25)
- No destructive action without qualifying the problem
- Every conclusion must distinguish ML, PTE/backtest and infra
- Intra-crypto comparison only (ADR-27)
- 0 SELL is normal in binary mode (ADR-28)
- Every ML metric must be compared to the naive baseline (ADR-29)
14. Observability debt¶
The following elements currently live in the Airflow logs and must migrate to Grafana:
| Element | Current source | Target |
|---|---|---|
| `f1_buy` vs `baseline_f1_buy` | CORRELATION DATA logs | Grafana dashboard |
| `screening_f1_buy` vs `last_fold_f1_buy` | CORRELATION DATA logs | Grafana dashboard |
| `tp_pct` / `sl_pct` / `timeout_pct` | CORRELATION DATA logs | Grafana dashboard |
| Primary diagnosis code | Logs (verdict) | Grafana dashboard |
| CVNTrade global state | Computed by hand | Grafana dashboard (automated) |
Prerequisite: write the diagnoses to PostgreSQL (not only to the logs) so Grafana can query them.
Glossary¶
| Term | Definition |
|---|---|
| screening | Phase 1: fast preselection of PTE candidates (grid search) |
| testing | Phase 2: multi-fold HPO validation + OOS backtest |
| WFRB | Phase 3: walk-forward rolling backtest validation |
| challenger | Registered but unpromoted model — awaiting approval |
| champion | Active reference model for a crypto |
| stale | Model not refreshed for N days |
| baseline_f1_buy | F1 score of a naive "always BUY" classifier — the minimal reference |
| buy_rate | Proportion of BUY labels in the test split |
| PTE | Trade Execution Parameters: SL, TP, horizon |
| golden signal | Key supervision metric — the core of observability |
| last_fold_f1_buy | BUY F1 of the last fold (same period as the screening) |
| CORRELATION DATA | Structured block in the test_step logs — a stable interface (ADR-30) |
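The naive baseline required by ADR-29 has a closed form: an always-BUY classifier has precision equal to `buy_rate` and recall 1, so its F1 is `2 * buy_rate / (buy_rate + 1)`. A one-function sketch (the function name is illustrative):

```python
# F1 of the naive always-BUY classifier (precision = buy_rate, recall = 1).
def baseline_f1_buy(buy_rate: float) -> float:
    if buy_rate == 0:
        return 0.0
    return 2 * buy_rate / (buy_rate + 1)

# e.g. a buy_rate of 0.30 gives a baseline F1 of ~0.46:
# any model must clearly beat that before its f1_buy means anything.
```

This also explains the ML_USELESS playbook (9.2): at `buy_rate` > 50% the baseline already exceeds 0.66, which is very hard for a real model to beat.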
Working with the docs site (docs.cvntrade.eu)¶
The Design System, ADRs and runbooks are published to docs.cvntrade.eu. Source lives in `documentation/`; the build config is `mkdocs.yml` at the repo root. This is Phase 2 of #593 (#637 tracks the scaffolding).
Local preview¶
```shell
make docs-install   # one-time: installs mkdocs + plugins into .venv_airflow
make docs-serve     # hot-reload at http://127.0.0.1:8000
```
Edit any `.md` file in `documentation/` — the browser reloads on save.
Local strict build (same checks as CI)¶
Run `make docs-build` (step 3 of "Adding a page" below). The `--strict` flag fails on broken internal links, missing nav entries, or unknown config. If this passes locally, CI will pass too.
Adding a page¶
1. Drop the `.md` file in the right subfolder (`needs/`, `epics/`, `adr/`, …).
2. Add an entry in `mkdocs.yml` under `nav:` so it's discoverable (required unless it's referenced from another page's index).
3. `make docs-build` — fix any broken links.
4. Commit + PR. CI rebuilds on every PR that touches `documentation/**`.
Deploy¶
- `main` push → `.github/workflows/docs.yml` → builds strict → GitHub Pages.
- First deploy: enable GH Pages in repo settings (Source: GitHub Actions), then set CNAME `docs.cvntrade.eu` → `dococeven.github.io` at the registrar.
- Subsequent deploys are fully automated. No operator action required.
Architecture diagrams¶
documentation/architecture/workspace.dsl is the single Structurizr DSL source.
See documentation/architecture/workspace-reference.md for rendering options
(VS Code live preview, structurizr-cli, Structurizr Lite in Docker).
Troubleshooting¶
| Symptom | Cause / fix |
|---|---|
| `mkdocs: command not found` | Run `make docs-install`. |
| Strict build fails on a broken link | Fix the `.md` link — relative path from the file's location. |
| New file isn't in the sidebar | Add it under `nav:` in `mkdocs.yml`. |
| `README.md` shows as an empty page | Expected — `README.md` at `docs_dir` root is excluded in favor of `index.md`. |
OpenProject operator playbook (#593 Phase 1)¶
URL: https://openproject.cvntrade.eu
Role: product source of truth for Needs / Epics / Stories / Releases.
Deployed via: `infra/helm/openproject/` chart, Helm-managed through the deploy-k8s workflow.
Chart setup doc: `infra/helm/openproject/README.md` §Setup.
Login¶
- Browser → `https://openproject.cvntrade.eu/login`
- Username `admin` initially; operator account created post-setup
- If a cert warning appears, cert-manager hasn't issued yet — wait 5 min and refresh
Create a new Need¶
- Top nav → project `CVNTrade` → Work packages
- Create → type `Need`
- Subject: short title, e.g. "Reach F1=0.75 binary classification"
- Custom fields:
  - `need_id` = next available `CVN-N<nnn>` (strictly sequential)
  - `github_issue_url` = link to the parent GitHub issue
- Description: follow `documentation/templates/TEMPLATE_need.md`
- Save → OpenProject auto-generates the work package ID (ignore it, use `need_id` for referencing)
Create an Epic under a Need¶
- Open the Need work package
- Relations tab → Add relation → "includes" → New work package
- Type `Epic`, set `epic_id = <need_id>-E<letter>` (A, B, C… per Need)
- Save
Close an Epic¶
- Open the Epic
- Status → `Closed`
- Fill the Epic's Closure section (template §7) in the description
- The parent Need's % complete updates automatically
Link a PR to a Story¶
The link goes in both directions:
- OpenProject side: add the GitHub PR URL to the Story's `Relations` tab
- GitHub side: the PR body must contain `CVN-N<nnn>-US<m>` (enforced by the Phase 4 CI gate once live)
Create a Release¶
- Work packages → Create → type `Release`
- `release_id = CVN-R<yyyymmdd>-<n>`
- Add all closed Epics as "part of" relations
- Attach backtest report links (URL or file)
- When the deploy succeeds, set status → `Deployed`
Backups¶
The OpenProject DB carries all operator-entered data. Losses are unrecoverable without a backup.
- If running on a dedicated Scaleway PG instance: daily snapshots are configured at the PG level (7-day retention by default).
- If reusing the `champollion` instance: see `infra/helm/openproject/README.md` §Backup for the `pg_dump` CronJob option.