The operations director of a 60-staff UK financial services firm sat through a vendor pitch last week for an AI-powered credit-decision model. The slide deck reported 94 per cent accuracy on the historical test set. The procurement question on the table was whether to roll the model into the onboarding flow on Monday. She knew that 94 per cent on a curated test set rarely survives contact with live data, and she was thinking about a peer firm whose lending model dropped from 94 to 71 per cent in production within six weeks, with the lower performance falling unevenly across protected demographic groups and triggering an FCA enforcement file.
The £2 million engagement value mattered. The reputational exposure of a biased lending decision in production mattered more. Two failure modes sit behind that procurement question, and they fail in opposite directions.
What is the difference between overfitting and underfitting?
Overfitting is when a model fits the training data too closely, learning noise rather than the underlying pattern. It looks brilliant on data it has seen and collapses on data it has not. Underfitting is the opposite: the model is too simple to capture meaningful patterns, so it performs badly everywhere. The diagnostic tell is the gap between training and test accuracy: wide for overfitting, near zero for underfitting, because both numbers are poor.
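The diagnostic gap is easy to see in miniature. The sketch below uses invented data and two deliberately extreme models: a 1-nearest-neighbour classifier that memorises every training point, and a majority-class baseline that learns almost nothing. Neither is a real vendor model; the point is the shape of the two gaps.

```python
import random

random.seed(0)

def make_data(n):
    """Noisy binary labels: y = 1 when x > 0.5, with 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.2:   # label noise the memoriser will learn
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(200), make_data(200)

def knn_predict(x, train_set):
    """1-nearest-neighbour: memorises every training point, noise included."""
    return min(train_set, key=lambda p: abs(p[0] - x))[1]

def majority_predict(x, train_set):
    """Majority-class baseline: too simple to learn any pattern at all."""
    return int(sum(y for _, y in train_set) > len(train_set) / 2)

def accuracy(model, data, train_set):
    return sum(model(x, train_set) == y for x, y in data) / len(data)

knn_train = accuracy(knn_predict, train, train)    # exactly 1.0: pure memorisation
knn_test = accuracy(knn_predict, test, train)      # markedly lower on unseen data
base_train = accuracy(majority_predict, train, train)
base_test = accuracy(majority_predict, test, train)

print(f"overfit  train-test gap: {knn_train - knn_test:+.2f}")
print(f"underfit train-test gap: {base_train - base_test:+.2f}")
```

The memoriser shows a wide gap; the baseline shows almost none, with both of its numbers poor, which is the underfitting signature.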
A recruitment example sharpens the distinction. An overfit model learns that “candidate wore blue suit on Monday interviews” predicts success, only because one top performer happened to wear blue. The model aces the test data and rejects qualified candidates in production. An underfit model predicts success based only on years of experience, ignoring education, industry, and problem-solving signal. It misses the obvious everywhere.
For a UK SME working with fewer than 10,000 labelled examples, the structural risk weights heavily towards overfitting, because complex models have too much freedom on too little data.
When you’re seeing overfitting: the model that works too well
Overfitting announces itself through one specific pattern: spectacular performance on training data followed by abrupt collapse on new, unseen data. A model might report 98 per cent accuracy on validation and drop to 65 per cent on last month’s live transactions. That divergence is the signature, the signal that the model has memorised specific details rather than learned general principles. Three further symptoms, each one an SME owner can recognise, tend to appear alongside it.
The model produces unrealistically confident predictions. Every classification carries near-certainty scores. Real patterns include ambiguity, so a model that never expresses doubt has likely learned training-set quirks.
Accuracy degrades quickly as live data drifts from training conditions. A fraud model works for six weeks then fails on transactions outside its original distribution. A churn model misses signals for customer segments underrepresented during training.
Feature-importance patterns look implausible. A recruitment tool overweights candidate email domain. A sales model overweights CRM-update frequency rather than engagement signals. AIE Works documented a fintech case where a fraud model hit 99.8 per cent accuracy in testing and saw actual fraud losses jump 300 per cent within 24 hours of production deployment.
When you’re seeing underfitting: the model that learns nothing
Underfitting is the inverse failure. The model performs poorly even on training data, and performance stays equally poor on new data. There is no divergence between train and test accuracy because both are bad. The giveaway is uniformity: predictions cluster in a narrow band. A churn model predicts roughly the same probability for every customer; a forecasting model returns similar revenue figures for every prospect. Patterns business experts can articulate remain invisible to the model.
Three causes recur in SME settings. The model is too simple for the actual relationships: a linear model on a non-linear problem, or a decision tree capped at three levels. Feature engineering is insufficient: the model was never given the right input variables, often because the team built on demographics when the real signal lives in behavioural data. Or the model was undertrained, stopped before it had time to learn.
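The first cause can be made concrete in one dimension, where a single-threshold rule stands in for a linear decision boundary. Everything below is invented for illustration: the true pattern is a band, which no single threshold can represent, so the stump scores poorly on training and test alike, while a model whose shape matches the pattern separates cleanly.

```python
import random

random.seed(1)

def make_data(n):
    """Non-monotonic pattern: y = 1 only inside the band 0.3 < x < 0.7."""
    return [(x, int(0.3 < x < 0.7)) for x in (random.random() for _ in range(n))]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

def fit_stump(train_set):
    """Exhaustive search for the single best threshold-and-direction rule,
    the one-dimensional analogue of a linear boundary."""
    best, best_acc = None, -1.0
    for t, _ in train_set:
        for hi in (0, 1):
            predict = lambda x, t=t, hi=hi: hi if x > t else 1 - hi
            acc = accuracy(predict, train_set)
            if acc > best_acc:
                best, best_acc = predict, acc
    return best

train, test = make_data(300), make_data(300)
stump = fit_stump(train)
band = lambda x: int(0.3 < x < 0.7)   # a model whose shape matches the pattern

stump_train, stump_test = accuracy(stump, train), accuracy(stump, test)
print(f"stump: train {stump_train:.2f}, test {stump_test:.2f}")  # mediocre, no gap
print(f"band:  train {accuracy(band, train):.2f}, test {accuracy(band, test):.2f}")
```

The stump is not wrong because it was trained badly; it is wrong because its shape cannot express the relationship, which is exactly what "too simple for the actual relationships" means.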
A care services case illustrates the cost. The team built a model to predict client hospitalisation risk. Constrained to simple linear relationships, it hit 62 per cent accuracy on both training and test data. A human clinician using the same features achieved 78 per cent. The business deployed it anyway, reasoning that 62 per cent beat random. The uniform risk scores prevented the clinical team from prioritising interventions, and preventable hospitalisations rose.
What it costs to misdiagnose
The two failures generate different cost profiles, and the asymmetry is what owners need to internalise. An underfit model fails visibly, staff learn to ignore it, and decisions revert to spreadsheets. For a £3 million services business, the lost-opportunity cost runs to £50,000 to £150,000 a year in forgone optimisation of resource allocation, pricing, or customer targeting. The wider response is to lose faith in AI as a category.
An overfit model is more dangerous because the initial performance is deceptively good. The team deploys with confidence. Production data arrives with its own distribution, accuracy collapses, and the team often does not notice because the API still returns predictions and the system still looks operational. If the model drives customer-facing pricing, recommendations, or eligibility, customers experience abrupt changes in behaviour and complaints rise before anyone diagnoses the cause. Total exposure can run to £50,000 to £200,000 in lost margin and churn before remediation completes.
The 2026 compliance pressure adds a third layer. The EU AI Act now bites for UK firms with EU customer or output exposure, and HMRC real-time compliance applies to payroll and HR systems. An overfit payroll classifier that learned quirks specific to past employees can systematically misclassify new ones, triggering remediation duties and penalties. The financial services case at the top of this post, 94 per cent test-set accuracy dropping to 71 per cent in production with adverse impact on protected groups, is the canonical example of overfitting becoming a regulatory file.
How to decide for your business: the ten procurement questions
The procurement gate is where this risk gets managed. Ten questions, each requiring evidence rather than vendor reassurance, surface whether a model has been honestly validated or merely benchmarked against itself. They are not technical, but they are specific, and the answers separate vendors who have done the work from vendors who have not.
First, what is the gap between training accuracy and test accuracy? Demand specific numbers: a gap above five to seven percentage points is a red flag for classification tasks. Second, how was the dataset split? The vendor should describe train, validation, and test splits explicitly, with the test set held out from the start. Third, what validation technique was used? For SME data sizes, repeated k-fold cross-validation (five folds, repeated ten times) is the standard. Fourth, has the model been tested on data from a different time period, segment, or channel than the training data? At least one out-of-distribution test is the minimum. Fifth, what regularisation techniques were used? L1 or L2 penalties, dropout for neural networks, early stopping where applicable. A complex model on small data without regularisation is the textbook overfitting setup.
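Of these, the validation protocol in question three is concrete enough to sketch in plain Python. The synthetic data, the toy `fit_threshold` model, and the seed below are all illustrative placeholders, not a recommendation; the point is the shape of the protocol a vendor should be describing.

```python
import random
from statistics import mean, stdev

def repeated_kfold(data, fit, k=5, repeats=10, seed=0):
    """Five-fold cross-validation repeated ten times with fresh shuffles.

    Returns per-fold accuracies; mean and spread across these folds is
    what a vendor should quote, not one lucky train/test split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            held_out = folds[i]
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            predict = fit(train)   # refit on each fold's training portion
            scores.append(sum(predict(x) == y for x, y in held_out) / len(held_out))
    return scores

def fit_threshold(train):
    """Toy model: threshold halfway between the two class means."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    t = (m0 + m1) / 2
    return (lambda x: int(x > t)) if m1 > m0 else (lambda x: int(x < t))

random.seed(2)
data = [(random.gauss(y, 0.5), y) for y in [0, 1] * 100]   # two overlapping classes
scores = repeated_kfold(data, fit_threshold, k=5, repeats=10)
print(f"{mean(scores):.2f} ± {stdev(scores):.2f} over {len(scores)} folds")
```

A vendor answering question three well will report exactly this kind of mean-plus-spread figure, with the model refitted inside every fold.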
Sixth, was there any data leakage in feature engineering? Any feature transformation that learns from the data (scaling, encoding, imputation) must be fitted on the training split only, never derived from the full dataset. Seventh, how will performance be monitored in production? Specific metrics, alert thresholds, escalation paths. Eighth, what is the rollback procedure if performance degrades? Fallback models or human escalation should be defined before deployment. Ninth, was the final evaluation run on a fully independent holdout the team has never touched? Nested cross-validation or a true holdout, not the same test set used for model selection. Tenth, would a simpler baseline perform nearly as well? If logistic regression or a shallow decision tree sits within two to three percentage points on test accuracy, use the simpler model.
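The leakage rule in question six has an exact shape in code: split first, then fit any learned transformation on the training rows only. The `Scaler` class below is an assumed helper written for this sketch, not a real library class, and the data is invented.

```python
import random
from statistics import mean, stdev

random.seed(3)
raw = [random.gauss(50, 10) for _ in range(100)]

# Split FIRST; only then derive learned transformations from the training part.
train, test = raw[:80], raw[80:]

class Scaler:
    """Standardiser whose parameters come only from the data it is fitted on."""
    def fit(self, xs):
        self.mu, self.sigma = mean(xs), stdev(xs)
        return self
    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]

scaler = Scaler().fit(train)            # correct: test rows contribute nothing
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # test is transformed with train statistics

# The leaky version fits on ALL the data, letting test-set statistics bleed
# into the training features and quietly inflating reported accuracy.
leaky = Scaler().fit(raw)
print(f"honest mu={scaler.mu:.2f}  leaky mu={leaky.mu:.2f}")
```

The two means differ, which is precisely the information that leaks when the split happens after feature engineering instead of before.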
Two safeguards follow from those questions. Default to simpler, interpretable models on small data: the team can read the rules, and the model generalises better. Invest in monitoring infrastructure from day one: learning curves, cross-validation, a true holdout, production monitoring. The cost is small compared with an undetected overfit model failing in production for three months. If you want to walk through whether a specific vendor pitch passes these ten questions for your firm, book a conversation.
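The monitoring safeguard can start very small. The sketch below is a rolling-window accuracy tracker; the `window` and `threshold` values are illustrative placeholders, and a real deployment would set them from the model's validated baseline and wire the alert into an escalation path.

```python
from collections import deque

class AccuracyMonitor:
    """Rolling-window accuracy tracker with an alert threshold.

    Window size and threshold here are illustrative defaults, not
    recommendations; derive them from the model's validated baseline."""
    def __init__(self, window=100, threshold=0.80):
        self.outcomes = deque(maxlen=window)   # True/False per prediction
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def alert(self):
        """Fire only once the window is full, so early noise doesn't page anyone."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.threshold

monitor = AccuracyMonitor(window=50, threshold=0.80)
for i in range(50):
    monitor.record(predicted=1, actual=1 if i % 3 else 0)  # two in three correct
print(monitor.accuracy, monitor.alert())
```

This is the cheapest possible version of question seven's answer: a metric, a threshold, and an unambiguous signal that accuracy has degraded, which is exactly what an overfit model's silent collapse otherwise hides.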