Diagnosing Learning Problems in Machine Learning Models

A generic scientific method for distinguishing missing signal, broken model and corrupted pipeline

Document type: Strategic and technical white paper · Version: 1.0 · Date: 2026-05-20

Audience: ML engineers, data scientists, MLOps and platform teams, technical leadership. The paper is methodological, not framework-specific — every example would apply equally to tabular ML, time-series, or any predictive system where a "the model is not learning" verdict can plausibly come from any of three very different root causes.

Scope & disclaimer: This is a vendor-neutral methodological essay. Code snippets use scikit-learn and LightGBM as illustrative implementations of universal concepts (logistic regression, shallow trees, mutual information, bootstrap confidence intervals); substitute your stack of choice. The discipline being described is the substantive content; the libraries are incidental.

Contents

Executive summary
Why ML investigations so often go wrong
The three axes of causality
Warning signs that should trigger a structured diagnostic
Step 1: Establish a healthy reference
Step 2: Use simple models as scientific probes
Step 3: Check whether a signal even exists
Step 4: Introduce negative controls
Step 5: Audit the splits and the drift
Step 6: Pre-specify the conclusions
Step 7: Quantify uncertainty
Step 8: Apply a diagnostic decision tree
The real paradigm shift
Five foundational lessons
Conclusion

1. Executive summary

When a machine-learning model "isn't learning", the reflex is almost always the same. Tune the hyperparameters. Add a few features. Change the architecture. Re-run with another seed. Stare at the validation curve. Re-run again. After two days of this, the team has produced a great deal of activity and very little knowledge — usually because most of the experiments were testing several hypotheses at once.

The fundamental problem is structural, not numerical. An underperforming model can fail for three genuinely different reasons: the data may contain no exploitable signal; the model may be unable to learn from data that does contain signal; or the experimental pipeline may be quietly corrupted by leakage, a row mismatch, or an invalid split. Those three causes are profoundly different and require profoundly different remedies, yet they produce symptoms that look almost identical on a dashboard. That is what makes diagnosis hard.

This paper proposes a generic, reproducible, scientifically rigorous method for investigating learning failures. The objective is not to squeeze a few more points of AUC out of a hyperparameter sweep. The objective is the much harder one of restoring scientific confidence in the experimental system itself — making sure that when the system says a model learns, it actually does, for the reason it claims to.

2. Why ML investigations so often go wrong

In many organisations, an ML investigation looks like an iterative escalation : the model fails, hyperparameters get tweaked, features get added, the model architecture gets replaced, a different seed is tried, and somewhere around the fifth re-run the team can no longer say what they were originally testing.

flowchart TD
    A[Model is not learning] --> B[Tune hyperparameters]
    B --> C[Add features]
    C --> D[Swap model]
    D --> E[Try another seed]
    E --> F[Retrain]
    F --> G[Total confusion]

The problem is not laziness or inexperience. The problem is that the team has unintentionally conflated four very different activities: debugging, which seeks the cause of a specific malfunction; optimisation, which seeks the best value of a metric within a fixed setup; scientific validation, which seeks to confirm or refute a hypothesis; and exploratory research, which seeks to map a poorly understood phenomenon. Each requires a different discipline. Conflating them produces experiments nobody can interpret, shifting hypotheses, narrative conclusions, and eventually pipelines that have lost the ability to claim that anything has been tested.

The deepest ML crises rarely come from a bad model. They come from an experimental system that has become impossible to reason about. The remedy is not more compute — it is more discipline.

3. The three axes of causality

The single most important rule of the method proposed here is this : never mix multiple possible causes in the same experiment. Every investigation must be decomposed into three axes that are tested separately, in sequence, and ideally with explicit pre-registered decision criteria.

flowchart TD
    A[Learning failure observed]
    A --> B[Axis 1 — Can the model learn at all ?]
    A --> C[Axis 2 — Does the data actually contain signal ?]
    A --> D[Axis 3 — Is the experimental pipeline corrupted ?]

Axis 1 asks whether the learner itself is healthy. The relevant evidence lives in the optimisation : does the loss decrease at a reasonable rate, do the gradients behave coherently, is the convergence non-trivial, are the training dynamics stable across folds and seeds ? An Axis-1 failure typically points at a hyperparameter regime that prevents learning (catastrophically large learning rate, mis-set class weights), or at a learner that is simply ill-suited to the geometry of the problem.

Axis 2 asks whether the data carries a predictive structure at all. Notice that this question does not mention the model. It is a statement about the joint distribution of features and target. Are there simple predictive baselines that detect any structure at all? Does the signal vanish under standard perturbations (label shuffling, feature permutation), as a real signal must? Do train and validation behave like draws from the same distribution? An Axis-2 failure means the problem is not solvable in the form it is currently posed; better engineering will not save it.

Axis 3 is the most insidious. The system can appear excellent precisely because something has gone subtly wrong : future labels have leaked into past features, labels are misaligned with their feature rows by an off-by-one, a feature implicitly encodes the target, an invalid shuffle=True split has destroyed the chronological order required for honest evaluation. An Axis-3 failure produces models that look spectacularly successful and are completely scientifically invalid. The history of applied ML is littered with such examples.

The principle of three-axis separation is what gives the rest of this method its force. Once an axis has been tested in isolation, with the other two held constant, the resulting evidence actually means something.

4. Warning signs that should trigger a structured diagnostic

A few characteristic behaviours should function as alarm bells — the moment any one of them appears, the team is no longer in optimisation territory and must shift to structured diagnosis.

The first is abnormally fast convergence. When a gradient-boosted ensemble reports best_iter = 1 or a deep model essentially stops improving after the first epoch, something is wrong. Real predictive problems of any non-trivial complexity require several rounds of incremental improvement. A model that converges instantly is rarely converging — it is most often hitting either a degenerate optimum (no signal to fit), a leakage path that lets it solve the problem trivially, or a pipeline defect that aborts the learning curve before it begins.

flowchart LR
    A[Healthy model] --> B[Progressive learning over many iterations]
    C[Suspect pipeline] --> D[Immediate convergence ; best_iter = 1]

The second is near-constant predicted probabilities. A model that returns 0.5001, 0.4998, 0.5002 on a balanced binary task has not learned to separate the classes; its predictions have collapsed to the class prior. This often coexists with a flat validation loss and is a signature of either an absent signal or a learner stuck in a flat region of the loss landscape.

The third is train and validation losses that move together almost identically. In a healthy run, the two curves separate gradually as the model starts to memorise the training set. When they overlap throughout the run, either the model is not learning anything specific (no signal), or the pipeline is feeding train and validation the same thing in a way that collapses the diagnostic distinction.

The fourth is enormous variance between cross-validation folds. AUC ranging from 0.45 to 0.78 across folds, on the same data, is not a model property — it is a sample-size or contamination property, sometimes a localised leakage that triggers only in some folds. The methodology of the rest of this paper is what discriminates between those possibilities.
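
The fold-variance warning sign can be turned into a routine check. The sketch below is illustrative rather than prescriptive: it runs on synthetic data from scikit-learn's `make_classification`, and the `MAX_SPREAD` tolerance of 0.10 is an assumed value, not one taken from this paper. For time-indexed data, a chronological splitter should replace the shuffled folds used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic i.i.d. stand-in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)

# Score the same simple model on each of five stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

spread = float(fold_aucs.max() - fold_aucs.min())
MAX_SPREAD = 0.10  # assumed tolerance; fix it before looking at the results

print("fold AUCs:", np.round(fold_aucs, 3))
print(f"spread = {spread:.3f} -> {'ALARM' if spread > MAX_SPREAD else 'ok'}")
```

Reporting the minimum, maximum and spread alongside the mean keeps a single lucky or unlucky fold from hiding inside an average.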

5. Step 1: Establish a healthy reference

Before investigating anything, prove that something can learn. Without a healthy reference, every subsequent comparison is ambiguous : you cannot tell whether a new finding is informative or whether the entire experimental setup is broken in a way that makes everything look broken. The reference does not need to be the production model — it should be the simplest reasonable thing that works.

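from lightgbm import LGBMClassifier, early_stopping
from sklearn.metrics import roc_auc_score

model = LGBMClassifier(n_estimators=300, learning_rate=0.05, num_leaves=31)
# Early stopping is what makes best_iteration_ meaningful: without it the
# model simply trains all n_estimators rounds.
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[early_stopping(stopping_rounds=50)],  # patience is an illustrative choice
)

proba = model.predict_proba(X_val)[:, 1]
print(f"AUC: {roc_auc_score(y_val, proba):.3f}")
print(f"Best iteration: {model.best_iteration_}")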

A healthy reference exhibits four signals together. Loss decreases progressively over many iterations rather than collapsing at iteration one. Predicted probabilities show genuine dispersion rather than crowding around the prior. Train and validation losses follow coherent trajectories that diverge in the expected way. The best iteration is somewhere reasonable — not 1, not the maximum allowed (which would suggest the model wants more rounds), but a finite point in between that says learning genuinely happened. If you cannot get a healthy reference even once, on any subset of the data, with any reasonable configuration, then no subsequent diagnostic can be interpreted. Stop the investigation and re-examine the data and the harness before going any further.
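
The four signals can be checked mechanically once the loss histories are available. The following is a minimal sketch under stated assumptions: `reference_health` is a hypothetical helper, the loss curves are made-up numbers, and the 0.01 dispersion threshold is illustrative.

```python
import numpy as np

def reference_health(train_loss, val_loss, proba, best_iter, max_iter,
                     min_proba_std=0.01):
    """Return one boolean per healthy-reference signal (hypothetical helper)."""
    train_loss = np.asarray(train_loss, dtype=float)
    val_loss = np.asarray(val_loss, dtype=float)
    return {
        # Validation loss keeps improving after the first iteration.
        "progressive_loss": bool(val_loss.argmin() > 0),
        # Predictions are spread out instead of crowding around the prior.
        "dispersed_probabilities": bool(np.std(proba) > min_proba_std),
        # Training loss ends at or below validation loss (the expected gap).
        "coherent_gap": bool(train_loss[-1] <= val_loss[-1]),
        # Best iteration is neither the first round nor the cap.
        "interior_best_iter": bool(1 < best_iter < max_iter),
    }

# Made-up curves for illustration only.
healthy = reference_health(
    train_loss=[0.69, 0.60, 0.52, 0.45, 0.40],
    val_loss=[0.69, 0.62, 0.57, 0.55, 0.56],
    proba=[0.10, 0.35, 0.60, 0.85],
    best_iter=4, max_iter=300,
)
suspect = reference_health(
    train_loss=[0.69, 0.69, 0.69],
    val_loss=[0.69, 0.70, 0.71],
    proba=[0.5001, 0.4998, 0.5002],
    best_iter=1, max_iter=300,
)
print("healthy reference:", all(healthy.values()))
print("suspect reference:", all(suspect.values()))
```

Returning one flag per signal, rather than a single pass or fail, shows which of the four checks broke when the reference is not healthy.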

6. Step 2: Use simple models as scientific probes

Modern ML toolkits are dominated by powerful, flexible models — gradient-boosted forests, deep networks, large transformers. These are often the wrong instrument for diagnosis. A complex model can mask the problem you are trying to find : it can memorise leakage, exploit residual correlations no one knew were there, or fit noise so successfully that you mistake it for signal. Simple models, by contrast, are scientifically useful precisely because they cannot do those things. They are the multimeter of ML diagnosis : not the production tool, but the instrument that tells you the truth about the wiring.

A logistic regression is the cleanest probe for monotonic signal. If the data contains a univariate or additive structure in which larger values of some feature make the positive class more likely, logistic regression will find it. If logistic regression cannot get above 0.55 AUC on a hold-out, the data probably does not carry a strong monotonic signal, though it may still carry interaction-heavy or non-linear signal that a tree would find.

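from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise first so the solver converges and no feature dominates by scale.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
clf.fit(X_train, y_train)
logreg_proba = clf.predict_proba(X_val)[:, 1]
logreg_auc = roc_auc_score(y_val, logreg_proba)
print(f"LogReg AUC: {logreg_auc:.3f}")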

A shallow tree is the cleanest probe for non-monotonic or interaction signal. A depth-3 decision tree cannot memorise: it has at most eight leaves, so it must pick a small number of threshold-based splits that genuinely separate the classes. If a shallow tree finds signal where logistic regression does not, you are looking at a non-monotonic or interaction-heavy structure (very common in financial time series and other complex domains). If neither finds signal but a deep gradient-boosted model claims a 0.95 AUC, you should be sceptical of the deep model rather than impressed by it.

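from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Depth 3 keeps the tree too small to memorise the training set.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
tree_proba = tree.predict_proba(X_val)[:, 1]
tree_auc = roc_auc_score(y_val, tree_proba)
print(f"Tree AUC: {tree_auc:.3f}")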

The discipline is to read the two AUCs side by side. Both above 0.55 ⇒ signal exists, in a form both linear and tree-discoverable. Logistic regression above 0.55 but tree below ⇒ monotonic-only signal, the data carries direction but not interaction structure. Tree above 0.55 but logistic regression below ⇒ non-monotonic / interaction-heavy signal, which is the regime where deep models earn their keep but also where leakage is hardest to detect. Both below ⇒ either no signal or a signal that lives in a representation neither probe can express, in which case Step 3 is needed before drawing any conclusion.
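
The four readings above can be written as a small lookup so that the interpretation is fixed before the numbers arrive. This is a sketch only: `read_probes` is a hypothetical helper, the 0.55 threshold is the one used in the text, and the two labels marked as hypothetical are placeholders introduced here.

```python
def read_probes(logreg_auc: float, tree_auc: float, threshold: float = 0.55) -> str:
    """Map the two probe AUCs to one of the four readings described above."""
    linear_hit = logreg_auc > threshold
    tree_hit = tree_auc > threshold
    if linear_hit and tree_hit:
        return "SIGNAL_BOTH_PROBES"       # hypothetical label
    if linear_hit:
        return "MONOTONIC_ONLY"           # direction without interaction structure
    if tree_hit:
        return "NON_MONOTONIC"            # interaction-heavy or non-monotonic signal
    return "NO_SIGNAL_OR_UNEXPRESSED"     # hypothetical label; continue with Step 3

# Made-up AUC pairs, one per reading.
for pair in [(0.63, 0.61), (0.60, 0.52), (0.51, 0.62), (0.50, 0.51)]:
    print(pair, "->", read_probes(*pair))
```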

7. Step 3: Check whether a signal even exists

The two probes from Step 2 measure what a model can do with the data. Step 3 asks a deeper question : independent of any model, does the data carry information about the target ?

The classical tool here is mutual information — Shannon's measure of how much knowing one variable reduces the uncertainty about another, in bits or nats. Mutual information is non-parametric, captures arbitrary dependence (not just linear correlation), and is zero exactly when the variables are statistically independent. For each feature, computing MI(feature, target) produces a ranked table that tells you which features individually carry signal and which are noise.

from sklearn.feature_selection import mutual_info_classif

# One MI estimate per feature column, reported by scikit-learn in nats.
mi = mutual_info_classif(X_train, y_train, random_state=42)
max_mi = mi.max()
print(f"max MI = {max_mi:.4f}  |  features with MI > 0.05 : {(mi > 0.05).sum()}")

The interesting case — and the one most often misread — is when the max per-feature MI is low (say below 0.05) and the multivariate models from Step 2 nevertheless detect signal above 0.55 AUC. This combination is the signature of a diffuse, multivariate signal : no single feature carries enough information to be individually predictive, but the model exploits subtle joint structure across many of them. It is extremely common in trading, time-series forecasting and any genuinely complex system. The naïve diagnosis ("max-MI is low ⇒ no signal") is wrong in this regime, which is why the multi-tool approach matters. You need the model-free MI scan and the simple-model probes ; neither alone is enough.

A subtler probe is to look at the k-nearest-neighbour AUC on a PCA projection that retains 95 % of the feature-space variance. A standard kNN suffers from the curse of dimensionality once you exceed roughly fifty features ; PCA-then-kNN is the cleaner variant. When all three (logistic regression, shallow tree, PCA-kNN) agree that AUC is at chance level, the data really does not carry exploitable signal. When the three disagree, the disagreement itself is informative about the shape of the underlying structure.
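
A PCA-then-kNN probe of this kind might look like the sketch below. The data is synthetic, and the neighbour count of 25 is an assumption chosen for illustration; the random stratified split is only appropriate because the synthetic rows are i.i.d., and a time-indexed dataset would need the chronological split from Step 5.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 60 features (assumption for illustration).
X, y = make_classification(
    n_samples=800, n_features=60, n_informative=8,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

probe = make_pipeline(
    StandardScaler(),                            # PCA and kNN are both scale-sensitive
    PCA(n_components=0.95, svd_solver="full"),   # keep 95 % of the variance
    KNeighborsClassifier(n_neighbors=25),        # assumed neighbourhood size
)
probe.fit(X_train, y_train)

pca_knn_auc = roc_auc_score(y_val, probe.predict_proba(X_val)[:, 1])
n_kept = probe.named_steps["pca"].n_components_
print(f"components kept: {n_kept}  |  PCA-kNN AUC: {pca_knn_auc:.3f}")
```

Fitting the scaler and the PCA inside the pipeline means both are estimated on the training rows only, so the probe does not itself introduce a leak.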

8. Step 4: Introduce negative controls

This is, scientifically speaking, the most important step in the method. A reliable system is not just a system that works ; it is a system that fails exactly when it should. The way to demonstrate that is to deliberately break the relationship the model is supposed to be learning and verify that performance collapses to chance. If it does not, something else is doing the work — and that something is almost always a leak.

The first negative control is label shuffling. Replace the training labels with a random permutation of themselves, keeping the marginal class distribution intact, and refit the model. If the model can no longer learn — AUC collapses to roughly 0.50 on validation — then your training is genuinely picking up the label/feature relationship. If the model still performs well on shuffled labels, then a feature in your training set is encoding the label by some other path, and you have a leakage problem of the most dangerous kind : invisible until you look for it.

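import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Permuting keeps the class balance but breaks the label/feature link.
y_shuffled = rng.permutation(np.asarray(y_train))
# Fit a fresh copy so the reference model from Step 1 is left untouched.
shuffled_model = clone(model).fit(X_train, y_shuffled)
proba_shuffled = shuffled_model.predict_proba(X_val)[:, 1]
print(f"Shuffled-label AUC : {roc_auc_score(y_val, proba_shuffled):.3f}")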

The second negative control is feature permutation, applied to whole groups of related features rather than to one feature at a time (otherwise correlated features mask each other's contribution). Permute the rows of a feature group within the training set so that the group's alignment with the labels and with the other features is destroyed, then refit and evaluate. If validation AUC barely moves, the features in that group are not what the model is relying on; something else is. Permuting different groups in turn produces a fingerprint of which feature families genuinely contribute. A model whose AUC barely moves no matter which group you scramble is almost certainly exploiting an artefact.
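
A group-wise permutation control could be sketched as follows. Everything specific here is an assumption made for illustration: the data is synthetic, the two feature families in `groups` are hypothetical, and logistic regression stands in for whatever model is under investigation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data; shuffle=False keeps the 4 informative columns first.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
groups = {"informative": [0, 1, 2, 3], "noise": [4, 5, 6, 7]}  # hypothetical families

def val_auc(estimator, X_fit):
    """Refit a fresh copy on X_fit and score it on the untouched validation set."""
    fitted = clone(estimator).fit(X_fit, y_train)
    return roc_auc_score(y_val, fitted.predict_proba(X_val)[:, 1])

rng = np.random.default_rng(0)
estimator = LogisticRegression(max_iter=1000)
baseline_auc = val_auc(estimator, X_train)

permuted_auc = {}
for name, cols in groups.items():
    X_perm = X_train.copy()
    row_order = rng.permutation(len(X_perm))
    # One shared row order per group: within-group structure survives,
    # but the group's alignment with the labels is destroyed.
    X_perm[:, cols] = X_train[row_order][:, cols]
    permuted_auc[name] = val_auc(estimator, X_perm)

print(f"baseline AUC = {baseline_auc:.3f}")
for name, auc in permuted_auc.items():
    print(f"{name:>12} permuted -> AUC = {auc:.3f}")
```

Using a single row order for the whole group is the design choice that matters: permuting each column independently would also destroy the correlations inside the group, which is a different experiment.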

A third, less obvious negative control swaps the roles of training and validation : retrain on what was previously the validation set, evaluate on what was previously the training set. The resulting AUC should be similar to the canonical one — if it is dramatically different (in either direction), the two splits are not statistically equivalent, which itself is a strong signal that the construction of the splits is biased.
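
The role-swap control needs only a few lines. In this sketch the data is synthetic, `auc_when_trained_on` is a hypothetical helper, and the `MAX_GAP` tolerance of 0.05 is an assumed value that a real dossier would fix in advance.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_when_trained_on(estimator, X_fit, y_fit, X_eval, y_eval):
    """Fit a fresh copy on one split and report AUC on the other."""
    fitted = clone(estimator).fit(X_fit, y_fit)
    return roc_auc_score(y_eval, fitted.predict_proba(X_eval)[:, 1])

# Synthetic i.i.d. data cut into two equal halves (assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=1,
)
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1
)

estimator = LogisticRegression(max_iter=1000)
canonical_auc = auc_when_trained_on(estimator, X_a, y_a, X_b, y_b)
swapped_auc = auc_when_trained_on(estimator, X_b, y_b, X_a, y_a)
gap = abs(canonical_auc - swapped_auc)

MAX_GAP = 0.05  # assumed tolerance
print(f"canonical = {canonical_auc:.3f}, swapped = {swapped_auc:.3f}, gap = {gap:.3f}")
print("splits look equivalent" if gap <= MAX_GAP else "splits differ: audit the split design")
```

On a chronological split the two directions are not expected to match exactly, since one of them trains on the future; there the size of the gap is evidence about drift as well as about split construction.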

These controls take time to run. They are not optional. A serious investigation that skips them is a narrative dressed up as science.

9. Step 5: Audit the splits and the drift

The splits are where most pipeline failures hide, especially in time-series contexts. The single most destructive mistake is temporal leakage — when the model has access, at training time, to information from the future. It is so easy to introduce that it deserves a permanent place in any code review checklist : a train_test_split(X, y, shuffle=True) on a time-indexed dataset is the canonical example. The model can implicitly learn from data points that lie chronologically after the validation examples, and the resulting AUC is meaningless.

The honest pattern is a strict chronological split, with the validation set strictly later than the training set and ideally separated from it by a small purge window to defeat label leakage from the labelling horizon :

# split_date and purge_window are set by the caller, e.g. a pandas Timestamp and Timedelta.
train_mask = X["date"] < split_date
val_mask = X["date"] >= split_date + purge_window

Beyond the basic split, you should also test for distributional drift between training and validation. Compute, for each feature, the mean and standard deviation on each split, and verify that the deltas are within tolerance. Large drifts indicate either a regime change in the data (a real signal that the world has moved) or contamination (a leakage that has corrupted the comparison). The serious version of this check uses statistical distance metrics — KL divergence, Wasserstein distance, or population stability index — but even a basic moment check catches most catastrophic cases.
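
Both levels of the drift check fit in a few lines of NumPy. The sketch below is one possible implementation, not a reference one: the helper names are hypothetical, the PSI binning (ten bins on training-set quantiles) is an assumption, and the features are simulated.

```python
import numpy as np

def standardized_mean_shift(train_col, val_col):
    """Absolute difference of means, in units of the pooled standard deviation."""
    diff = abs(np.mean(val_col) - np.mean(train_col))
    pooled_std = np.sqrt(0.5 * (np.var(train_col) + np.var(val_col)))
    if pooled_std == 0:
        return float("inf") if diff > 0 else 0.0
    return float(diff / pooled_std)

def population_stability_index(train_col, val_col, n_bins=10, eps=1e-6):
    """PSI with bin edges taken from training-set quantiles."""
    interior_edges = np.quantile(train_col, np.linspace(0, 1, n_bins + 1)[1:-1])
    # searchsorted assigns each value a bin index in 0 .. n_bins - 1.
    p = np.bincount(np.searchsorted(interior_edges, train_col), minlength=n_bins) / len(train_col)
    q = np.bincount(np.searchsorted(interior_edges, val_col), minlength=n_bins) / len(val_col)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)  # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))

# Simulated feature: one validation sample is stable, one has shifted by one sigma.
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
stable_val = rng.normal(0.0, 1.0, size=5000)
drifted_val = rng.normal(1.0, 1.0, size=5000)

for name, val in [("stable", stable_val), ("drifted", drifted_val)]:
    shift = standardized_mean_shift(train_feature, val)
    psi = population_stability_index(train_feature, val)
    print(f"{name:>8}: mean shift = {shift:.3f} sd, PSI = {psi:.3f}")
```

The tolerances applied to these numbers belong in the pre-specified dossier; the code deliberately reports the statistics without judging them.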

For a properly chronological time-series setup, walk-forward validation is the gold standard : the training window slides forward through time, with each step producing a fresh model evaluated on the next out-of-sample window. The structure prevents any reuse of future information and exposes regime changes naturally. The cost is computational, the discipline is real, but the alternative — a single static train/val split with optimistic AUC — is a fictional benchmark, not science.

flowchart LR
    A[Past data] --> B[Training window]
    B --> C[Validation in the strict future]
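
Walk-forward validation can be sketched with scikit-learn's `TimeSeriesSplit`. The assumptions are stated in the comments: the rows are simulated and taken to be already sorted by time, the `gap` of 10 rows plays the role of the purge window, and logistic regression is a stand-in model. Note that `TimeSeriesSplit` grows the training window by default; passing `max_train_size` caps it and gives a sliding window.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Simulated rows, assumed to be in chronological order.
rng = np.random.default_rng(0)
n = 1200
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

splitter = TimeSeriesSplit(n_splits=5, gap=10)  # gap = assumed purge window, in rows
window_aucs = []
for train_idx, val_idx in splitter.split(X):
    # Every training row precedes every validation row.
    assert train_idx.max() < val_idx.min()
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    window_aucs.append(roc_auc_score(y[val_idx], proba))

print("per-window AUC:", np.round(window_aucs, 3))
print(f"mean = {np.mean(window_aucs):.3f}, min = {np.min(window_aucs):.3f}")
```

Reading the per-window values in order, rather than only their mean, is what exposes a regime change: a sequence that degrades steadily tells a different story from one that is merely noisy.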

10. Step 6: Pre-specify the conclusions

This is the discipline that separates investigation from rationalisation. Before running the experiments described in Steps 1 through 5, write down explicitly what each possible outcome will mean. The format is mechanical: if pattern X appears, then conclusion Y and action Z. The exact thresholds, the exact tests, and the exact follow-up decisions must be committed in writing (ideally in a peer-reviewed dossier) before any number is observed.

The reason this matters has been documented in dozens of fields. Without pre-specification, humans rationalise. The brain notices a partial pattern in the data, constructs a plausible narrative around it, and the narrative becomes the conclusion. By the time the team writes up the result, the hypothesis has quietly moved to fit the evidence. The published claim is no longer a test ; it is a story dressed in numbers.

Pre-specification breaks the feedback loop. If the dossier said "if max-MI is below 0.05 and both simple-model AUCs are below 0.55, conclude NO_SIGNAL and escalate to data-design review", and the observed pattern is exactly that, then the conclusion follows without negotiation. If the dossier said "if max-MI is below 0.05 but both simple models are above 0.55, conclude MULTIVARIATE_DIFFUSE and proceed to out-of-sample stability checks", and that is what the data shows, the conclusion is again automatic. The thresholds were not chosen to fit the result. They were chosen before the result existed.
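
One lightweight way to make "written before the result existed" verifiable is to serialise the dossier and record its hash before any experiment runs. The sketch below assumes a dictionary layout and a `fingerprint` helper that are both hypothetical; the two rules reuse verdict names and thresholds that appear elsewhere in this paper.

```python
import hashlib
import json

# Hypothetical dossier: thresholds and rules are written down before any run.
dossier = {
    "thresholds": {"auc": 0.55, "max_mi": 0.05},
    "rules": [
        {
            "pattern": "max_mi < 0.05 and logreg_auc < 0.55 and tree_auc < 0.55",
            "conclusion": "NO_SIGNAL",
            "action": "escalate to data-design review",
        },
        {
            "pattern": "shuffled_label_auc >= original_auc - 0.05",
            "conclusion": "LEAKAGE_PROBABLE",
            "action": "feature-by-feature leakage hunt",
        },
    ],
}

def fingerprint(spec: dict) -> str:
    """SHA-256 of a canonical JSON serialisation of the dossier."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

committed = fingerprint(dossier)  # record this before observing any metric
print("committed dossier fingerprint:", committed[:16], "...")

# At sign-off: any edit to a threshold or a rule would change the fingerprint.
assert fingerprint(dossier) == committed
```

Storing the fingerprint somewhere the team cannot quietly rewrite, such as a reviewed commit or a ticket, is what turns the dossier into an auditable trace.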

This discipline transforms what was intuitive debugging into controlled scientific investigation. The dossier is the trace. The trace is what makes the conclusion auditable.

11. Step 7: Quantify uncertainty

A point estimate without an uncertainty bound is more dangerous than no estimate at all, because it invites interpretation it cannot support. An AUC of 0.62 with a 95 % confidence interval of [0.60, 0.64] is a finding ; an AUC of 0.62 with an interval of [0.48, 0.76] is a measurement-error report dressed as a finding. The two patterns require radically different responses, and dashboards rarely show the second.

Three uncertainty quantification techniques cover most of what you need. Cross-validation variance captures sensitivity to the specific train/val cut by running the whole pipeline on multiple folds and reporting the spread of the metric. Bootstrap confidence intervals capture sampling uncertainty on a fixed evaluation set by resampling the validation rows with replacement and recomputing the metric many times :

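import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = np.asarray(y_val)
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample rows with replacement
    if np.unique(y_true[idx]).size < 2:
        continue  # AUC is undefined when a resample contains a single class
    scores.append(roc_auc_score(y_true[idx], proba[idx]))
ci_low, ci_high = np.percentile(scores, [2.5, 97.5])
print(f"AUC = {np.mean(scores):.3f}  95% CI = [{ci_low:.3f}, {ci_high:.3f}]")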

Seed sensitivity captures variability introduced by training randomness — re-run the full pipeline with several random seeds, and verify that the metric does not depend on the specific draw. If three seeds produce AUCs of 0.55, 0.71 and 0.49, the model is not really learning at AUC 0.58 ; it is unstable, and the mean is misleading. The right reaction is to widen the seed pool and report the dispersion, not to pick the favourable run.
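
A seed-sensitivity run can be sketched as below. The model (a small random forest), the five seeds and the synthetic data are all assumptions made for illustration; `summarise_seed_runs` is a hypothetical helper. The split is held fixed so that the only thing varying is the training randomness.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def summarise_seed_runs(aucs):
    """Report dispersion next to the mean so the mean cannot be read alone."""
    aucs = np.asarray(aucs, dtype=float)
    return {
        "mean": float(aucs.mean()),
        "std": float(aucs.std(ddof=1)),
        "min": float(aucs.min()),
        "max": float(aucs.max()),
    }

# The three AUCs quoted in the text: the mean hides a 0.22 range.
text_example = summarise_seed_runs([0.55, 0.71, 0.49])
print("text example:", {k: round(v, 3) for k, v in text_example.items()})

# Synthetic data with one fixed split (assumption for illustration).
X, y = make_classification(n_samples=800, n_features=12, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

seed_aucs = []
for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=seed)
    model.fit(X_train, y_train)
    seed_aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

print("seed runs:   ", {k: round(v, 3) for k, v in summarise_seed_runs(seed_aucs).items()})
```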

12. Step 8: Apply a diagnostic decision tree

The final step assembles the evidence from Steps 1 to 7 into a structured verdict. The decision tree must be pre-specified (Step 6), exhaustive (every plausible outcome falls under some branch), and machine-applicable (no narrative reinterpretation at sign-off time). A typical structure :

if logreg_auc < 0.55 and tree_auc < 0.55 and pca_knn_auc < 0.55 and max_mi < 0.05:
    verdict = "NO_SIGNAL"           # data does not carry exploitable structure

elif (logreg_auc > 0.55 or tree_auc > 0.55) and best_iter <= 3:
    verdict = "PIPELINE_SUSPECT"    # simple models find signal, but full model converges instantly

elif shuffled_label_auc >= original_auc - 0.05:
    verdict = "LEAKAGE_PROBABLE"    # model performs almost as well when labels are random

elif logreg_auc > 0.55 and tree_auc > 0.55 and max_mi < 0.05:
    verdict = "MULTIVARIATE_DIFFUSE"  # signal exists but no individual feature is informative

elif logreg_auc > 0.55 and tree_auc < 0.55:
    verdict = "MONOTONIC_ONLY"

elif logreg_auc < 0.55 and tree_auc > 0.55:
    verdict = "NON_MONOTONIC"

else:
    verdict = "INCONCLUSIVE_UNCOVERED"  # no pre-specified branch matches;
                                       # escalate with the full evidence dossier

Each terminal verdict carries a routing : NO_SIGNAL ⇒ data engineering review (the problem is not solvable as posed) ; PIPELINE_SUSPECT ⇒ feature-leak audit + harness row-mismatch check ; LEAKAGE_PROBABLE ⇒ feature-by-feature leakage hunt ; MULTIVARIATE_DIFFUSE ⇒ expected on complex domains, proceed with the production model but verify out-of-sample stability ; MONOTONIC_ONLY and NON_MONOTONIC ⇒ feature-engineering signal, route accordingly ; INCONCLUSIVE_UNCOVERED ⇒ committee escalation with the full evidence dossier rather than autonomous re-interpretation.

The point is not the specific thresholds, which depend on the domain. The point is that the verdict is mechanical. The tree was written before the data was observed ; the data either matches one of the branches or it does not, and "matches none of the branches" is itself a conclusion with a defined escalation path. Nobody re-interprets the table after the fact.

13. The real paradigm shift

Most ML organisations spend almost all of their experimental energy on optimisation : another sweep, another architecture, another loss function, another set of features. The method described here proposes something much more uncomfortable. Before you optimise, you must first prove that the experimental system itself is credible. Not just that it produces numbers, but that the numbers mean what the dashboard says they mean.

This is not pedantry. It is the difference between an ML organisation that compounds knowledge over years and one that re-runs the same investigation every quarter because nobody remembered why the previous answer was wrong. The cost of the methodology is real — a serious diagnostic takes days of disciplined work for every "the model isn't learning" episode, rather than hours of intuitive sweeps. The dividend is that, when the methodology is followed, the conclusions actually hold.

It is also the discipline that makes ML reviewable by humans who are not the original authors. A committee, a regulator, an audit, a new team member six months later — anyone can re-derive the conclusion from the dossier and the data, because the conclusion was constrained by a pre-specified rule rather than emerging from a narrative. In domains where the cost of being silently wrong is high (medicine, finance, safety-critical systems), this auditability is the prerequisite for deployment, not a luxury.

14. Five foundational lessons

Across enough investigations, this method distils into a small number of principles, each of which is more substantive than it looks.

Simple models are scientific instruments. A logistic regression and a depth-three decision tree, used as probes rather than products, often diagnose a pipeline failure faster and more reliably than a gradient-boosted forest with two hundred trees. The complex model is too clever to be diagnostic. The simple model tells the truth about the data.

Negative controls are not optional. A pipeline that does not fail under label shuffling and feature permutation is not a pipeline that has been validated ; it is a pipeline that has not been tested. The cost of running the controls is small compared to the cost of deploying a model that depends on a leak.

ML pipelines must be auditable. Every model artefact must be traceable to a feature version, a label definition, a split design, a hyperparameter resolution, and a code commit. The traceability is what makes the verdict reproducible, and reproducibility is what makes science possible.

Conclusions must be pre-specified. The decision tree, the thresholds, the escalation paths, all written down before the experiment runs. The dossier is the contract between the team and itself.

Modern ML systems are experimental systems. They are not engineered artefacts in the way a database or a web service is. They are scientific apparatuses with the operational characteristics of production software, and they must be treated with the rigour of both. Cutting either side of that duality produces fragile systems.

Conclusion

Learning failures in machine-learning systems are rarely simple bugs. They are failures of experimental causality — situations where multiple plausible causes produce the same observable symptoms, and where intuition is reliably misleading. The only robust way to investigate them is to separate the causal axes, hold variables constant, use simple probes, introduce negative controls, quantify uncertainty, and reason like a scientist running a controlled experiment.

The deepest reward of this method is not better models. It is a different kind of confidence — the confidence that comes from a system whose claims can be checked. Once a team has lived with that confidence, the cost of returning to intuitive optimisation becomes obvious. A few extra points of AUC matter, sometimes a great deal. They matter less than knowing whether the AUC means what the dashboard says it means. In modern complex systems, that credibility is often the most valuable thing the team can build.