
CVNTrade MLOps Operating Procedures

Version: 3.0. Last updated: 2026-03-28. Scope: operation of the CVNTrade MLOps platform.

Operational status:
  • Grafana (5 dashboards): operational
  • Airflow (10 DAGs): operational
  • ZenML (7 native pipelines): operational
  • MLflow (model registry): operational
  • Prometheus (K8s metrics): operational
  • Automatic alerting (Slack/SMS): target state (issue #397)
  • Drift monitoring: target state (not yet implemented)
  • Automatic diagnostics in Grafana: target state (structured logs still to migrate)


0. CVNTrade global state

Every operations session starts by determining the global state.

State Definition Reaction
GREEN Healthy infra, healthy pipelines, at least 1 recent useful run normal operations
YELLOW Controlled drift (weak anomaly signals) increased monitoring
ORANGE Significant business or pipeline degradation priority investigation
RED Outage or inability to produce immediate escalation

Computation rules

Condition State
Critical infra (pod down, PVC full, looping OOMKill) RED
Pipeline success rate < 70% ORANGE
Pass rate = 0 over 3 consecutive runs ORANGE
Model stale > 14d on a core crypto (btc-core) ORANGE
0 useful candidates on the full universe for > 3 days ORANGE
Success rate 70-90% or recurring negative Sortino YELLOW
Model stale > 7d on a non-core crypto YELLOW
Everything else GREEN
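The rules above can be sketched as a shell function. This is an illustrative sketch only: the function name, the argument order, and the input variables (infra-critical flag, rolling success rate, consecutive zero-pass runs, core and non-core staleness in days, days without a useful candidate, recurring-negative-Sortino flag) are hypothetical, not an existing CVNTrade tool.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the section-0 rules. Arguments (all integers):
#   $1 infra_critical    1 if pod down / PVC full / looping OOMKill
#   $2 success_rate      pipeline success rate in %, 7d rolling
#   $3 zero_pass_runs    consecutive runs with pass rate = 0
#   $4 core_stale_days   age in days of the newest btc-core model
#   $5 noncore_stale_days  age in days of the newest non-core model
#   $6 days_no_candidate   days without a useful candidate on the full universe
#   $7 sortino_neg_recurrent  1 if Sortino is recurrently negative
cvntrade_state() {
  local infra=$1 sr=$2 zp=$3 core=$4 noncore=$5 nocand=$6 sneg=$7
  if [ "$infra" -eq 1 ]; then echo RED; return; fi
  if [ "$sr" -lt 70 ] || [ "$zp" -ge 3 ] || [ "$core" -gt 14 ] || [ "$nocand" -gt 3 ]; then
    echo ORANGE; return
  fi
  # 70-90% success, recurring negative Sortino, or non-core model stale > 7d
  if [ "$sr" -lt 90 ] || [ "$sneg" -eq 1 ] || [ "$noncore" -gt 7 ]; then
    echo YELLOW; return
  fi
  echo GREEN
}

cvntrade_state 0 95 0 2 3 1 0    # healthy -> GREEN
cvntrade_state 0 95 0 20 3 1 0   # core model stale 20d -> ORANGE
```

Precedence matters: RED short-circuits everything, ORANGE short-circuits YELLOW, mirroring how the table is ordered.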

1. Purpose

This document describes the operating procedures to:
  • determine the global state of the system,
  • supervise the golden signals,
  • launch the standard pipelines,
  • diagnose the results,
  • escalate according to severity.

Guiding principle: Grafana is the single entry point (ADR-26). The other tools are used for drill-down depending on the nature of the problem.


2. CVNTrade Golden Signals

These signals are the core of supervision. Every operator view must derive from them.

Platform

Signal Definition Warning threshold Critical threshold
infra_health Saturation / OOM / PVC / availability OOM=1, CPU>70% OOM>2, PVC>85%, pod down
pipeline_success_rate Rolling success rate (7d) 70-90% < 70%
pipeline_latency Rolling duration vs historical p95 p95 + 30% p95 + 100%

ML (per crypto, ADR-27)

Signal Definition Warning threshold Critical threshold
f1_buy_delta_vs_baseline f1_buy - baseline_f1_buy delta < 15% delta <= 0
screening_testing_gap screening_f1_buy - last_fold_f1_buy gap > 0.15 gap > 0.30
std_f1_buy Inter-fold stability 0.05-0.10 > 0.10

Trading

Signal Definition Warning threshold Critical threshold
sortino Main financial KPI 0 to 0.5 < 0
sl_pct Share of stop-loss exits 35-50% > 50%
timeout_pct Share of timeout exits 30-40% > 40%

Lifecycle

Signal Definition Warning threshold Critical threshold
model_freshness Age of the last useful model > 7d > 14d
challenger_count Registered challengers not yet promoted > 5 pending > 10 pending

Rule: these signals must be visible in Grafana without reading any logs.

Observability debt: today, the ML and Trading signals are only visible in the Airflow logs (CORRELATION DATA section). Migrating them to Grafana is a priority target.


3. Tool architecture

GRAFANA (grafana.cvntrade.eu)          <- SINGLE ENTRY POINT
  |-- global supervision (golden signals)
  |-- business, pipeline, model and infra dashboards
  '-- centralized alerting (target, #397)

AIRFLOW (airflow.cvntrade.eu)          <- EXECUTION
  |-- trigger DAGs
  |-- run tracking
  '-- detailed logs (debug only, not decision-making)

ZENML (zenml.cvntrade.eu)              <- PIPELINE DRILL-DOWN
  |-- lineage
  |-- run history
  '-- versioned pipeline artifacts

W&B (wandb.ai)                         <- ML DRILL-DOWN
  |-- screening matrices
  |-- HPO trials
  '-- detailed comparison of ML runs

MLFLOW (mlflow.cvntrade.eu)            <- MODEL DRILL-DOWN
  |-- model registry
  |-- per-run metrics
  '-- model-related artifacts

PROMETHEUS (K8s internal)              <- INFRA METRICS
  '-- collects CPU/RAM/pod metrics, exposed in Grafana

4. Roles and responsibilities

Role Responsible for Access
Operator Morning check, standard launches, Grafana reading Grafana (readonly), Airflow (trigger + logs)
MLOps maintainer Pipeline incidents, registry, diagnostic metrics Airflow (admin), MLflow (registry admin), ZenML
ML owner Model interpretation, features, labels, HPO W&B, MLflow (readonly), ZenML
Platform owner K8s, PVC, nodes, limits, ingress, secrets kubectl (restricted admin), Helm

Escalation

Situation Decider Executor Approver
Launch a discovery Operator Operator -
Investigate a pipeline failure MLOps maintainer MLOps maintainer -
Modify HPO / features / labels ML owner ML owner MLOps maintainer
Challenger -> champion promotion ML owner MLOps maintainer Operator (business validation)
Killswitch / rollback Operator (emergency) Platform owner -
Modify Helm / K8s resources Platform owner Platform owner MLOps maintainer

5. Grafana dashboards

Dashboard Main role Usage
MLOps Overview Executive view Global state of models, screenings, pass rate
Testing & Backtest Quality view Testing results, WFRB, gates, rejections
Pipeline Health Execution view Run durations, success rate, HPO, anomalies
Model Registry Model lifecycle view Versions, freshness, active / stale models
Infra Monitoring Platform view Pods, CPU, RAM, PVC, OOMKill, saturation

Good practice:
  • start with MLOps Overview or Pipeline Health
  • go to Infra Monitoring if a platform problem is suspected
  • go to Testing & Backtest if the question is about result quality


6. Standard procedures

6.1 Morning check

Target duration: 2 to 5 minutes

Step 1 — Determine the global state

Open Grafana > Infra Monitoring, then Pipeline Health. Determine the state (GREEN / YELLOW / ORANGE / RED) using the rules in section 0. If RED or ORANGE → go straight to section 9 (incidents).

Step 2 — Platform health

Grafana > Infra Monitoring. Check:
  • critical pods are up,
  • no recent OOMKill,
  • CPU/RAM saturation below thresholds,
  • PVC usage below the alert threshold.

Step 3 — Pipeline health

Grafana > Pipeline Health. Check:
  • overnight runs have finished,
  • failure rate within the norm,
  • no duration drift,
  • no backlog.

Step 4 — Business results

Grafana > MLOps Overview. Check:
  • new qualified candidates,
  • pass rate within the norm,
  • freshness of the latest useful runs.

6.2 Launching a discovery

Tool: Airflow — DAG: launch__discovery

Group run:

{"group": "defi"}

Targeted run:

{"group": "defi", "crypto": "SOLUSDC"}

Extended run:

{
  "group": "defi",
  "horizons": "H1,H2,H3,H4,H5,H6",
  "sl_range": "0.8,1.0,1.2,1.5",
  "tp_range": "1.5,2.0,2.5,3.0",
  "hpo_trials": 50,
  "backtest_days": 60
}

Tracking: Airflow (DAG run status), Grafana (results after completion).

Note: the ML/Trading thresholds are interpreted in the context of the run (universe, grid size, horizons).
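For operators who prefer the CLI to the Airflow UI, a discovery can in principle be triggered through the scheduler pod with the same kubectl pattern shown for FTF in section 6.5.1. The sketch below only assembles and echoes the command; whether `airflow dags trigger launch__discovery` accepts exactly this conf is an assumption based on the examples above, so the actual kubectl exec line is left commented out.

```shell
#!/usr/bin/env bash
# Sketch: assemble an extended launch__discovery trigger command.
# The conf keys mirror the "Extended run" example in this section.
CONF='{"group":"defi","horizons":"H1,H2,H3,H4,H5,H6","sl_range":"0.8,1.0,1.2,1.5","tp_range":"1.5,2.0,2.5,3.0","hpo_trials":50,"backtest_days":60}'

CMD="airflow dags trigger launch__discovery --conf '$CONF'"
echo "$CMD"

# To actually run it, resolve the scheduler pod as in 6.5.1, then:
# SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
# kubectl exec -n cvntrade "$SCHED" -c scheduler -- airflow dags trigger launch__discovery --conf "$CONF"
```

Building the conf in a variable first makes it easy to eyeball the JSON before anything is submitted.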

6.3 Analyzing the results

Level 1 — Grafana > Testing & Backtest

Look for: best qualified strategies, rejection reasons, F1/Sortino distribution.

Level 2 — Airflow logs (debug only)

Look for the structured === CORRELATION DATA === blocks. Distinguish:
  • ML quality: f1_buy vs baseline_f1_buy, precision_buy, recall_buy, buy_rate
  • financial quality: Sortino, tp/sl/timeout %, gates
  • consistency: screening_f1_buy vs last_fold_f1_buy (same metric, same period — ADR-28, ADR-29)

Level 3 — W&B drill-down: screening matrix, HPO trials, run comparison.

6.4 Checking a specific model

  1. Grafana > Model Registry — identify crypto / version
  2. MLflow — source run, metrics, artifacts, hyperparameters
  3. ZenML — source pipeline, upstream artifacts, lineage

6.5 FTF — Fine-Tuning Framework Operations

6.5.1 Launching FTF runs

  1. Verify no stale runs:

    SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
    kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain
    
    If stale runs exist → kill them first (6.5.4).

  2. Trigger from Airflow UI or CLI:

    kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags trigger finetune__pte \
      --conf '{"pte":"ATR1.5_3.0_H5","factor":"calibration","crypto_group":"defi_top5","n_folds":5,"n_trials":30,"history_months":24,"phase":"manual"}'
    

  3. Monitor first 2 minutes — verify sample count:

    POD=$(kubectl get pods -n cvntrade --no-headers | grep "finetune-pte-run-factor-crypto.*Running" | head -1 | awk '{print $1}')
    kubectl logs -n cvntrade "$POD" --tail=20 | grep "Train:.*samples"
    

  4. Expected: Train: ~1000-2000 samples (CUSUM enabled)
  5. If Train: >10000 samples: STOP — cache is stale or CUSUM misconfigured. See 6.5.3.

6.5.2 After Helm deploy with FTF config changes

MANDATORY — pods keep old config until killed.

  1. Verify new code deployed:

    kubectl exec -n cvntrade $SCHED -c scheduler -- grep "<KEY_CHANGE>" /opt/airflow/src/...
    

  2. Kill ALL running FTF pods:

    kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
    

  3. Wait 30 seconds for termination.

  4. Trigger new runs (6.5.1).

6.5.3 Cache flush (MLflow feature store)

When: After CUSUM config change, after feature engineering change, or when sample counts are wrong.

What gets flushed: Feature selection (Level 4) and feature engineering caches. ETL, labels, and models are NOT affected.

Procedure:

  1. Kill all FTF pods first:

    kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force
    

  2. Flush feature selection + feature engineering cache:

    SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
    kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
    import os, sys
    sys.path.insert(0, '/opt/airflow/src')
    import mlflow
    mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
    client = mlflow.tracking.MlflowClient()
    runs = client.search_runs(experiment_ids=['3'], max_results=500)
    deleted = 0
    for run in runs:
        name = run.data.tags.get('mlflow.runName', '')
        if 'feature_selection' in name or 'feature_eng' in name.lower():
            client.delete_run(run.info.run_id)
            deleted += 1
    print(f'Deleted {deleted} cache entries')
    "
    

  3. Verify cache is empty:

    kubectl exec -n cvntrade $SCHED -c scheduler -- python3 -c "
    import os, sys
    sys.path.insert(0, '/opt/airflow/src')
    import mlflow
    mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://mlflow:5000'))
    client = mlflow.tracking.MlflowClient()
    runs = client.search_runs(experiment_ids=['3'], max_results=500)
    fs = [r for r in runs if 'feature_selection' in r.data.tags.get('mlflow.runName', '')]
    print(f'Remaining feature_selection entries: {len(fs)} (should be 0)')
    "
    

  4. Trigger new runs (6.5.1). Cache will regenerate automatically (~2 min extra on first run per crypto).

WARNING: Do NOT flush experiment 1 (ETL), 4 (Labels), 7 (HPO), or 8 (Models). These are independent of CUSUM config.

6.5.4 Killing stale FTF runs

# Kill all pods
kubectl delete pods -n cvntrade -l dag_id=finetune__pte --force

# Mark failed runs in Airflow (optional — they'll stay as 'failed' automatically)
SCHED=$(kubectl get pods -n cvntrade --no-headers | grep scheduler | grep Running | awk '{print $1}')
kubectl exec -n cvntrade $SCHED -c scheduler -- airflow dags list-runs -d finetune__pte --state running --state queued -o plain

6.5.5 Monitoring FTF runs

# Pod status
kubectl get pods -n cvntrade --no-headers | grep finetune

# Progress per pod
for pod in $(kubectl get pods -n cvntrade --no-headers | grep "finetune.*Running" | awk '{print $1}'); do
  count=$(kubectl logs -n cvntrade $pod --tail=5000 | grep "event=weighted_variant_evaluated" | wc -l)
  crypto=$(kubectl logs -n cvntrade $pod | grep "Running factor=" | head -1 | grep -o "crypto=[A-Z]*")
  echo "$pod: $crypto completed=$count"
done

# CPU/memory usage
kubectl top pods -n cvntrade --no-headers | grep finetune

Grafana: the Infra Monitoring dashboard shows FTF pods, CPU/memory and throttling; the Fine-Tuning Results dashboard shows results as they arrive in PostgreSQL.


7. CVNTrade diagnostic taxonomy

Every testing/backtest run must produce a primary diagnostic.

Code Meaning Triggering signal
ML_USELESS Model with no predictive value f1_buy <= baseline_f1_buy (delta <= 0)
ML_MARGINAL ML gain too small to survive execution delta +0 to +5 pts vs baseline
ML_EXPLOITABLE Exploitable signal, provided the execution envelope is favorable delta +5 to +10 pts vs baseline
ML_SOLID Solid signal delta > +10 pts vs baseline
ML_UNSTABLE Model unstable across periods std_f1_buy > 0.10
EXECUTION_MISMATCH Unfavorable envelope: ML exploitable but negative PnL f1_buy > baseline +5pts and Sortino < 0
SL_TOO_TIGHT Suspected stop-loss too tight (dominant hypothesis, not proven) sl_pct > 50%
TP_TOO_AMBITIOUS Take-profit too ambitious tp_pct < 20%
HORIZON_TOO_SHORT Horizon too short timeout_pct > 40%
SCREENING_OVERFIT Screening/testing divergence screening_f1_buy - last_fold_f1_buy > 0.30
PIPELINE_DEGRADED Degraded pipeline health success_rate < 70%
INFRA_SATURATED K8s / PVC / OOM saturation OOMKill > 2, PVC > 85%, CPU > 90%
NO_CANDIDATES No viable strategy 0 candidates pass on a full run

Decision tree (30 seconds)

1. f1_buy vs baseline_f1_buy ?
   |
   |-- delta <= 0 pts           --> ML_USELESS
   |-- delta +0 to +5 pts       --> ML_MARGINAL
   |-- delta +5 to +10 pts      --> ML_EXPLOITABLE, check execution:
   |-- delta > +10 pts          --> ML_SOLID, check execution:
       |
       2. Sortino ?
          |
          |-- Sortino > 0.5     --> strategy OK
          |-- Sortino < 0       --> EXECUTION_MISMATCH, check:
              |
              3. Exit stats ?
                 |-- sl_pct > 50%      --> SL_TOO_TIGHT
                 |-- tp_pct < 20%      --> TP_TOO_AMBITIOUS
                 |-- timeout_pct > 40% --> HORIZON_TOO_SHORT
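The 30-second tree above can be sketched as a shell function. The function name and argument encoding are hypothetical (integer percentage points for the delta, Sortino scaled by 100 to avoid floating point in bash), and two choices are interpretations rather than part of the documented tree: the exit-stat branch returns its sub-code directly, and a Sortino between 0 and 0.5 falls through to a warning message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the section-7 decision tree. Arguments (integers):
#   $1 delta_pts     f1_buy - baseline_f1_buy, in percentage points
#   $2 sortino_x100  Sortino ratio times 100 (e.g. -30 for -0.30)
#   $3 sl_pct  $4 tp_pct  $5 timeout_pct
diagnose() {
  local delta=$1 sortino=$2 sl=$3 tp=$4 timeout=$5
  if [ "$delta" -le 0 ]; then echo ML_USELESS; return; fi
  if [ "$delta" -le 5 ]; then echo ML_MARGINAL; return; fi
  # delta > +5 pts: ML_EXPLOITABLE or ML_SOLID, so check execution
  if [ "$sortino" -gt 50 ]; then echo "strategy OK"; return; fi
  if [ "$sortino" -lt 0 ]; then
    # EXECUTION_MISMATCH: refine with the exit stats
    if [ "$sl" -gt 50 ]; then echo SL_TOO_TIGHT; return; fi
    if [ "$tp" -lt 20 ]; then echo TP_TOO_AMBITIOUS; return; fi
    if [ "$timeout" -gt 40 ]; then echo HORIZON_TOO_SHORT; return; fi
    echo EXECUTION_MISMATCH; return
  fi
  echo "borderline Sortino (0 to 0.5): warning zone, keep under watch"
}

diagnose -2 120 30 40 20   # no edge over the baseline
diagnose 12 -80 60 10 20   # solid ML, losses dominated by stop-loss exits
```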

8. Severity levels

Level Description Examples Reaction
P0 Critical System down or corruption Pipeline fully blocked, looping OOMKill, PVC full Immediate escalation, platform owner
P1 Urgent Significant degradation 0 candidates on a full run, success rate < 70% Investigation within the hour, MLOps maintainer
P2 Warning Degraded performance Negative Sortino, high SL rate, model stale > 14d Investigation within the day, ML owner
P3 Info Optimization Slower runs, slightly declining pass rate Backlog, handle when available

9. Incident playbooks

9.1 EXECUTION_MISMATCH — useful ML but negative Sortino

Symptom: f1_buy >> baseline, Sortino < 0

Impact: the model detects BUYs but the trades lose money

Hypotheses:
  • SL_TOO_TIGHT: ATR multiplier too low
  • TP_TOO_AMBITIOUS: TP multiplier too high
  • HORIZON_TOO_SHORT: horizon too short

Checks:
  1. sl_pct, tp_pct, timeout_pct in CORRELATION DATA
  2. Compare ATR ranges between passing and failing strategies
  3. Check whether the pattern is specific to one crypto or generalized

Action:
  • If sl_pct > 50%: relaunch with sl_range: "1.2,1.5,1.8,2.0"
  • If tp_pct < 20%: reduce tp_range: "1.5,2.0,2.5"
  • If timeout_pct > 40%: increase the horizons

Exit criterion: Sortino > 0 or sl_pct < 40%

Escalation: ML owner if 3 attempts bring no improvement
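The action table of this playbook can be sketched as a small lookup. The helper name is hypothetical; the suggested parameter strings are the ones listed above, and the fallback message when no exit pattern dominates is an assumption, not a documented rule.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map 9.1 exit stats to the suggested relaunch parameter.
# Arguments: $1 sl_pct  $2 tp_pct  $3 timeout_pct  (integers)
suggest_relaunch() {
  local sl_pct=$1 tp_pct=$2 timeout_pct=$3
  if [ "$sl_pct" -gt 50 ]; then
    echo '"sl_range": "1.2,1.5,1.8,2.0"'      # widen the stop-loss range
  elif [ "$tp_pct" -lt 20 ]; then
    echo '"tp_range": "1.5,2.0,2.5"'          # lower the take-profit range
  elif [ "$timeout_pct" -gt 40 ]; then
    echo 'increase the horizons'
  else
    echo 'no dominant exit pattern: escalate to ML owner'
  fi
}

suggest_relaunch 62 25 18
```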


9.2 ML_USELESS — model with no predictive value

Symptom: f1_buy <= baseline_f1_buy

Impact: the model does no better than predicting "always BUY"

Hypotheses:
  • Dataset too imbalanced (buy_rate > 50%)
  • Non-informative features
  • Misaligned HPO objective

Checks:
  1. buy_rate in CORRELATION DATA
  2. HPO action_rate (does the model predict enough BUYs?)
  3. Compare with other cryptos from the same run

Action:
  1. Check buy_rate — if > 50%, the problem is the labeling
  2. Try a different horizon (it changes the label distribution)
  3. Review the HPO objective (is precision_recall_auc appropriate?)
  4. As a last resort: review the feature set

Exit criterion: f1_buy > baseline + 15%

Escalation: ML owner


9.3 NO_CANDIDATES — no viable strategy

Symptom: 0 candidates pass on a full run

Impact: no exploitable strategy for this crypto/group

Checks:
  1. Gate rejection reasons in Grafana > Testing & Backtest
  2. If all rejected on Sortino: PTE problem (see 9.1)
  3. If all rejected on n_trades: model too conservative (theta too high)
  4. If all rejected on f1: ML problem (see 9.2)

Action:
  • Relaunch with a wider grid (more horizons, wider ATR ranges)
  • Try another crypto from the same group
  • If recurrent across a whole group: the group may simply not be viable

Exit criterion: at least 1 candidate passes the gates


9.4 INFRA_SATURATED — OOMKilled pod or full PVC

Symptom: OOMKill > 0 in Grafana Infra, or PVC > 85%

Impact: unstable pipeline, crashing runs

Checks:
  1. Identify the affected pod in Grafana Infra
  2. Check whether the problem is recurrent or one-off
  3. Compare memory consumed vs limits

Action:
  • One-off OOMKill: relaunch the run
  • Recurrent OOMKill: raise the memory limits in the Helm values
  • PVC > 85%: clean up old artifacts (MLflow / S3)

Exit criterion: OOMKill = 0 over 24h, PVC < 70%

Escalation: Platform owner

Last resort only:

kubectl delete pod <name> -n cvntrade

Use only if the pod is genuinely stuck, the controller can recreate it, and the cause has been qualified.
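The triage logic of this playbook can be sketched as a shell function. The function name, argument encoding, and exact wording of the verdicts are hypothetical; the thresholds come from the golden-signal table in section 2 (OOM>2, PVC>85%, CPU>90% critical; OOM=1, CPU>70% warning).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of 9.4 triage. Arguments (integers):
#   $1 oomkills  OOMKill count over the observation window
#   $2 pvc_pct   PVC usage in %
#   $3 cpu_pct   CPU usage in %
infra_verdict() {
  local oomkills=$1 pvc_pct=$2 cpu_pct=$3
  if [ "$oomkills" -gt 2 ] || [ "$pvc_pct" -gt 85 ] || [ "$cpu_pct" -gt 90 ]; then
    echo "INFRA_SATURATED (critical): escalate to platform owner"
  elif [ "$oomkills" -ge 1 ] || [ "$cpu_pct" -gt 70 ]; then
    echo "warning: relaunch the run and watch for recurrence"
  else
    echo "infra healthy"
  fi
}

infra_verdict 3 60 50   # looping OOMKill
# Real inputs would come from Grafana Infra, or e.g.:
#   kubectl top pods -n cvntrade --no-headers
```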


9.5 Run too long (> 3h)

Symptom: run duration > p95 + 30%, or > 3h in absolute terms

Checks:
  1. Identify the blocking step in Airflow
  2. If HPO: check n_trials (50 = normal, > 100 = suspicious)
  3. If data fetch: check Binance API / S3 connectivity
  4. If backtest: check the number of candles (60d x 5min = 17K = normal)

Action: depends on the identified cause

Exit criterion: duration < historical p95
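The latency thresholds (p95 + 30% warning, p95 + 100% critical from section 2, plus the 3h absolute ceiling from this playbook) can be sketched as a pure-shell check. The function name and the minute-based encoding are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify a run duration against the historical p95.
# Arguments: $1 duration_min  $2 p95_min  (integers, minutes)
latency_state() {
  local dur=$1 p95=$2
  # critical at p95 + 100%, or over the 180-min absolute ceiling
  if [ "$dur" -ge $((p95 * 2)) ] || [ "$dur" -gt 180 ]; then echo critical
  # warning at p95 + 30% (integer math: dur*10 >= p95*13)
  elif [ $((dur * 10)) -ge $((p95 * 13)) ]; then echo warning
  else echo ok; fi
}

latency_state 100 90   # under p95 + 30%
latency_state 130 90   # over p95 + 30%
```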


10. Available pipelines

Day-to-day operations (operator)

Airflow DAG Pipeline Role
launch__discovery pte__discovery Screen -> Test -> WFRB -> Register challenger
launch__backtesting pte__backtesting Test -> WFRB from existing candidates
launch__walkforward pte__walkforward Walk-forward validation
launch__retrain pte__retrain Retrain + register
launch__monitoring pte__monitoring Health checks

Controlled administration (MLOps maintainer / ML owner)

Airflow DAG Pipeline Role Approval required
launch__meta_training pte__meta_training Meta-label training MLOps maintainer
launch__promotion pte__promotion Challenger -> Champion ML owner + operator

Emergency (platform owner)

Airflow DAG Role Impact
pte__7_killswitch Emergency quarantine Disables a model
pte__8_rollback Revert to the previous version Replaces the champion
pte__10_decommission Archiving / cleanup Deletes a model

Rule: never run killswitch, rollback or decommission without explicit approval.


11. Crypto groups

Group Indicative universe
btc-core BTCUSDC, ETHUSDC
defi SOLUSDC, ADAUSDC, BONKUSDC, XRPUSDC

Source of truth: the cvntrade_universes table in the PostgreSQL database.

Note: the ML/Trading thresholds are interpreted in the context of the group and the crypto. A 3% pass rate on a 450-point grid for a volatile crypto is not the same thing as a 3% pass rate on a 50-point grid for BTC.
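The arithmetic behind that note is worth making explicit: the same pass rate corresponds to very different amounts of absolute evidence depending on grid size. A tiny illustrative helper (name is hypothetical):

```shell
#!/usr/bin/env bash
# Illustrative: absolute candidate count behind a pass rate.
# Arguments: $1 grid_points  $2 pass_rate_pct  (integers)
pass_count() { echo $(( $1 * $2 / 100 )); }

pass_count 450 3   # 13 candidates on a 450-point grid
pass_count 50 3    # 1 candidate on a 50-point grid
```

Thirteen surviving candidates support a much stronger conclusion than a single one, even though both read as "3%".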


12. Access

Service URL Profile Usage Level
Grafana grafana.cvntrade.eu Operator supervision readonly
Airflow airflow.cvntrade.eu Operator / MLOps execution + logs trigger + readonly
MLflow mlflow.cvntrade.eu MLOps / ML owner models registry admin
ZenML zenml.cvntrade.eu MLOps / ML owner lineage readonly
W&B wandb.ai ML owner ML analysis organizational account
K8s kubectl Platform owner troubleshooting restricted admin

Rule: never document or share credentials in this document. All access is managed through Kubernetes secrets.


13. Operating rules (policy)

  1. Grafana first: Airflow logs are for debugging, not for decisions
  2. Airflow is for executing, not for concluding
  3. MLflow for models, W&B for ML analysis
  4. ZenML for lineage
  5. No silent fallbacks in diagnostics (ADR-25)
  6. No destructive action without qualifying the problem
  7. Every conclusion must distinguish ML, PTE/backtest and infra
  8. Intra-crypto comparison only (ADR-27)
  9. 0 SELL is normal in binary mode (ADR-28)
  10. Every ML metric must be compared to the naive baseline (ADR-29)

14. Observability debt

The following items currently live in the Airflow logs and must migrate to Grafana:

Item Current source Target
f1_buy vs baseline_f1_buy CORRELATION DATA logs Grafana dashboard
screening_f1_buy vs last_fold_f1_buy CORRELATION DATA logs Grafana dashboard
tp_pct / sl_pct / timeout_pct CORRELATION DATA logs Grafana dashboard
Primary diagnostic code Logs (verdict) Grafana dashboard
CVNTrade global state Human computation Grafana dashboard (automated)

Prerequisite: write the diagnostics to PostgreSQL (not only to the logs) so that Grafana can query them.
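As a sketch of that prerequisite, a minimal diagnostics table could look like the DDL below. The table and column names are hypothetical, not an existing schema, and the `CVNTRADE_DSN` environment variable in the comment is an assumption; the real schema should be designed around the Grafana queries it must serve.

```shell
#!/usr/bin/env bash
# Hypothetical DDL for a diagnostics table Grafana could query.
# Apply with e.g.: psql "$CVNTRADE_DSN" -f /tmp/diagnostics.sql
cat > /tmp/diagnostics.sql <<'SQL'
CREATE TABLE IF NOT EXISTS run_diagnostics (
    run_id          text        NOT NULL,
    crypto          text        NOT NULL,
    diagnosed_at    timestamptz NOT NULL DEFAULT now(),
    diagnostic      text        NOT NULL,  -- taxonomy code from section 7
    f1_buy          numeric,
    baseline_f1_buy numeric,
    sortino         numeric,
    sl_pct          numeric,
    tp_pct          numeric,
    timeout_pct     numeric,
    PRIMARY KEY (run_id, crypto)
);
SQL
echo "wrote /tmp/diagnostics.sql"
```

One row per (run, crypto) is what ADR-27 intra-crypto comparison implies; Grafana panels would then filter on `crypto` and order by `diagnosed_at`.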


Glossary

Term Definition
screening Phase 1: fast preselection of candidate PTEs (grid search)
testing Phase 2: multi-fold HPO validation + OOS backtest
WFRB Phase 3: walk-forward rolling backtest validation
challenger Registered model not yet promoted, awaiting approval
champion Active reference model for a crypto
stale Model not refreshed for N days
baseline_f1_buy F1 score of a naive "always BUY" classifier, the minimal reference
buy_rate Proportion of BUY labels in the test split
PTE Trade Execution Parameters: SL, TP, horizon
golden signal Key supervision metric, the core of observability
last_fold_f1_buy BUY F1 of the last fold (same period as the screening)
CORRELATION DATA Structured block in the test_step logs, a stable interface (ADR-30)

Working with the docs site (docs.cvntrade.eu)

The Design System + ADRs + runbooks are published to docs.cvntrade.eu. Source lives in documentation/; the build config is mkdocs.yml at the repo root. This is Phase 2 of #593 (#637 tracks the scaffolding).

Local preview

make docs-install      # one-time: installs mkdocs + plugins into .venv_airflow
make docs-serve        # hot-reload at http://127.0.0.1:8000

Edit any .md file in documentation/ — the browser reloads on save.

Local strict build (same checks as CI)

make docs-build

--strict fails on broken internal links, missing nav entries, or unknown config. If this passes locally, CI will pass too.

Adding a page

  1. Drop the .md file in the right subfolder (needs/, epics/, adr/, …).
  2. Add an entry in mkdocs.yml under nav: so it's discoverable (required unless it's referenced from another page's index).
  3. make docs-build — fix any broken links.
  4. Commit + PR. CI rebuilds on every PR that touches documentation/**.

Deploy

  • main push → .github/workflows/docs.yml → builds strict → GitHub Pages.
  • First deploy: enable GH Pages in repo settings (Source: GitHub Actions), then create a CNAME record at the registrar pointing docs.cvntrade.eu to dococeven.github.io.
  • Subsequent deploys are fully automated. No operator action required.

Architecture diagrams

documentation/architecture/workspace.dsl is the single Structurizr DSL source. See documentation/architecture/workspace-reference.md for rendering options (VS Code live preview, structurizr-cli, Structurizr Lite in Docker).

Troubleshooting

Symptom Cause / fix
mkdocs: command not found Run make docs-install.
Strict build fails on broken link Fix the .md link — relative path from the file's location.
New file isn't in the sidebar Add it under nav: in mkdocs.yml.
README.md shows as an empty page Expected — README.md at docs_dir root is excluded in favor of index.md.

OpenProject operator playbook (#593 Phase 1)

URL: https://openproject.cvntrade.eu Role: product source of truth for Needs / Epics / Stories / Releases. Deployed via: infra/helm/openproject/ chart, Helm-managed through the deploy-k8s workflow. Chart setup doc: infra/helm/openproject/README.md §Setup.

Login

  1. Browser → https://openproject.cvntrade.eu/login
  2. Username admin initially; operator account created post-setup
  3. If cert warning appears, cert-manager hasn't yet issued — wait 5 min and refresh

Create a new Need

  1. Top nav → project CVNTrade → Work packages
  2. Create → type Need
  3. Subject: short title, e.g. "Reach F1=0.75 binary classification"
  4. Custom fields:
     • need_id = next available CVN-N<nnn> (strictly sequential)
     • github_issue_url = link to the parent GitHub issue
  5. Description: follow documentation/templates/TEMPLATE_need.md
  6. Save → OpenProject auto-generates the work package ID (ignore it; use need_id for referencing)

Create an Epic under a Need

  1. Open the Need work package
  2. Relations tab → Add relation → "includes" → New work package
  3. Type Epic, set epic_id = <need_id>-E<letter> (A, B, C… per Need)
  4. Save

Close an Epic

  1. Open the Epic
  2. Status → Closed
  3. Fill the epic's Closure section (template §7) in the description
  4. The parent Need's % complete updates automatically

Link a PR to a Story

The link is in both directions:

  • OpenProject side: add GitHub PR URL to the Story's Relations tab
  • GitHub side: PR body must contain CVN-N<nnn>-US<m> (enforced by Phase 4 CI gate once live)

Create a Release

  1. Work packages → Create → type Release
  2. release_id = CVN-R<yyyymmdd>-<n>
  3. Add all closed Epics as "part of" relations
  4. Attach backtest report links (URL or file)
  5. When deploy succeeds, set status → Deployed

Backups

OpenProject DB carries all operator-entered data. Losses are unrecoverable without a backup.

  • If running on a dedicated Scaleway PG instance: daily snapshots are configured at the PG level (7-day retention by default).
  • If reusing the champollion instance: see infra/helm/openproject/README.md §Backup for the pg_dump CronJob option.