What is overfitting vs underfitting? The procurement gate before deployment

TL;DR

Overfitting is when a model memorises training-set noise and collapses on real data. Underfitting is when the model is too simple to learn the pattern in the first place. UK SMEs working with fewer than 10,000 labelled examples sit in the overfitting risk zone, and most vendor pitches obscure it. Ten procurement questions, each backed by evidence the vendor must produce, surface the failure mode before deployment, not after.

Key takeaways

- Overfit models look great in testing and collapse on live data. Underfit models look bad everywhere. The diagnostic difference is the gap between training accuracy and test accuracy.
- UK SMEs typically work with fewer than 10,000 labelled examples, often fewer than 1,000, which makes overfitting the dominant structural risk when complex models are applied without regularisation.
- The asymmetric cost: an underfit model fails visibly and gets ignored. An overfit model fails silently in production and can cost £50,000 to £200,000 in margin loss before anyone notices.
- Procurement defence is ten specific questions, each requiring evidence: train-test gap, dataset split, validation technique, drift testing, regularisation, leakage, monitoring, rollback, independent holdout, simpler-baseline comparison.
- On small data, default to simpler interpretable models and invest in monitoring from day one. A logistic regression you can read beats a neural network you cannot, and the cost of monitoring is a fraction of the cost of an undetected production failure.

The operations director of a 60-staff UK financial services firm sat through a vendor pitch last week for an AI-powered credit-decision model. The slide deck reported 94 per cent accuracy on the historical test set. The procurement question on the table was whether to roll the model into the onboarding flow on Monday. She knew that 94 per cent on a curated test set rarely survives contact with live data, and she was thinking about a peer firm whose lending model dropped from 94 to 71 per cent in production within six weeks, with the lower performance falling unevenly across protected demographic groups and triggering an FCA enforcement file.

The £2 million engagement value mattered. The reputational exposure of a biased lending decision in production mattered more. Two failure modes sit behind that procurement question, and they fail in opposite directions.

What is the difference between overfitting and underfitting?

Overfitting is when a model fits training data too closely, learning noise rather than the pattern. It looks brilliant on data it has seen and collapses on data it has not. Underfitting is the opposite: the model is too simple to capture meaningful patterns, so it performs badly everywhere. The diagnostic tell is the gap between training and test accuracy: wide for overfitting, near zero for underfitting because both numbers are poor.

A recruitment example sharpens it. An overfit model learns that “candidate wore blue suit on Monday interviews” predicts success, only because one top performer happened to wear blue. The model aces the test data and rejects qualified candidates in production. An underfit model predicts success based only on years of experience, ignoring education, industry, and problem-solving signal. It misses the obvious everywhere.

For a UK SME working with fewer than 10,000 labelled examples, the structural risk skews heavily towards overfitting, because complex models have too much freedom on too little data.
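A short illustration of that gap, sketched with scikit-learn on a synthetic dataset (the data, models, and numbers are illustrative, not drawn from any case in this post): an unconstrained decision tree posts near-perfect training accuracy and a weaker test score, while a plain logistic regression keeps the two close.

```python
# Minimal sketch: compare the train/test accuracy gap of an overfit-prone model
# against a simple baseline on a small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    ("deep tree (overfit-prone)", DecisionTreeClassifier(max_depth=None, random_state=0)),
    ("logistic regression (simple)", LogisticRegression(max_iter=1000)),
]
for name, model in models:
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A gap above roughly five to seven percentage points is the red flag described below.
    print(f"{name}: train {train_acc:.2%}, test {test_acc:.2%}, gap {train_acc - test_acc:.1%}")
```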

When you’re seeing overfitting: the model that works too well

Overfitting announces itself through one specific pattern: spectacular performance on training data followed by abrupt collapse on new, unseen data. A model might report 98 per cent accuracy on validation and drop to 65 per cent on last month’s live transactions. That divergence is the signature, the signal that the model has memorised specific details rather than learned general principles. Three further symptoms an SME owner can recognise tend to appear alongside it.

The model produces unrealistically confident predictions. Every classification carries near-certainty scores. Real patterns include ambiguity, so a model that never expresses doubt has likely learned training-set quirks.
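One quick check, sketched below with synthetic scikit-learn data (the 99 per cent threshold is an illustrative assumption, not a standard): look at the distribution of predicted-class confidence scores and ask how often the model claims near-certainty.

```python
# Minimal sketch: measure how often a fitted classifier returns near-certainty scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

confidence = model.predict_proba(X).max(axis=1)  # confidence of the predicted class per row
print(f"median confidence: {np.median(confidence):.2f}")
print(f"share of predictions above 99% confidence: {np.mean(confidence > 0.99):.1%}")
```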

Accuracy degrades quickly as live data drifts from training conditions. A fraud model works for six weeks then fails on transactions outside its original distribution. A churn model misses signals for customer segments underrepresented during training.
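Drift of this kind can be checked directly. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy to compare one feature's training-time distribution against recent live values; the simulated shift and the 0.05 cut-off are illustrative assumptions.

```python
# Minimal sketch: flag a shift in one feature's distribution between training data
# and recent live data using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_values = rng.normal(loc=100, scale=15, size=2000)  # e.g. transaction value at training time
live_values = rng.normal(loc=120, scale=25, size=500)    # recent live data, simulated shift

stat, p_value = ks_2samp(train_values, live_values)
if p_value < 0.05:
    print(f"distribution shift detected (KS statistic {stat:.2f}); retest the model on recent data")
else:
    print("no significant shift detected in this feature")
```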

Feature-importance patterns look implausible. A recruitment tool overweights candidate email domain. A sales model overweights CRM-update frequency rather than engagement signals. AIE Works documented a fintech case where a fraud model hit 99.8 per cent accuracy in testing and saw actual fraud losses jump 300 per cent within 24 hours of production deployment.
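Feature-importance sanity checks do not need a data scientist. The sketch below, with hypothetical recruitment feature names and synthetic data, prints a fitted tree model's importances so that an implausible driver like email domain stands out.

```python
# Minimal sketch: list a fitted tree model's feature importances for a plausibility check.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["years_experience", "interview_score", "email_domain",
                 "cv_word_count", "referral_flag"]  # hypothetical labels for illustration
X, y = make_classification(n_samples=600, n_features=5, n_informative=3, random_state=2)
model = RandomForestClassifier(random_state=2).fit(X, y)

importances = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)  # if email_domain tops the list, question what the model actually learned
```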

When you’re seeing underfitting: the model that learns nothing

Underfitting is the inverse failure. The model performs poorly even on training data, and performance stays equally poor on new data. There is no divergence between train and test accuracy because both are bad. The giveaway is uniformity: predictions cluster in a narrow band, a churn model predicts roughly the same probability for every customer, and a forecasting model returns similar revenue predictions for every prospect. Patterns business experts can articulate remain invisible to the model.
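The uniformity symptom is measurable. The sketch below, using illustrative NumPy numbers rather than real churn scores, compares the spread of an underfit model's predictions with a healthier, differentiated set.

```python
# Minimal sketch: an underfit model's scores barely vary across customers,
# so their spread (standard deviation) is tiny compared with a healthy model's.
import numpy as np

rng = np.random.default_rng(3)
underfit_scores = rng.normal(loc=0.22, scale=0.01, size=1000).clip(0, 1)  # everyone looks the same
healthy_scores = rng.beta(a=2, b=5, size=1000)                            # genuine differentiation

for name, scores in [("underfit", underfit_scores), ("healthy", healthy_scores)]:
    print(f"{name}: min {scores.min():.2f}, max {scores.max():.2f}, std {scores.std():.3f}")
```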

Three causes recur in SME settings. The model is too simple for the actual relationships: a linear model on a non-linear problem, or a decision tree capped at three levels. Feature engineering is insufficient: the model was never given the right input variables, often because the team built on demographics when the real signal lives in behavioural data. Or the model was undertrained and stopped before it had time to learn.
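The first cause is easy to demonstrate. In the sketch below, on a synthetic circular pattern from scikit-learn, a linear model underfits no matter how long it trains, while a modest decision tree captures the shape; the dataset and models are illustrative.

```python
# Minimal sketch: a linear model cannot separate a circular pattern, a depth-5 tree can.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

for name, model in [("linear (underfits here)", LogisticRegression()),
                    ("depth-5 tree", DecisionTreeClassifier(max_depth=5, random_state=4))]:
    model.fit(X_train, y_train)
    print(f"{name}: train {model.score(X_train, y_train):.2%}, test {model.score(X_test, y_test):.2%}")
```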

A care services case illustrates the cost. The team built a model to predict client hospitalisation risk. Constrained to simple linear relationships, it hit 62 per cent accuracy on both training and test data. A human clinician using the same features achieved 78 per cent. The business deployed it anyway, reasoning that 62 per cent beat random. The uniform risk scores prevented the clinical team from prioritising interventions, and preventable hospitalisations rose.

What it costs to misdiagnose

The two failures generate different cost profiles, and the asymmetry is what owners need to internalise. An underfit model fails visibly, staff learn to ignore it, and decisions revert to spreadsheets. For a £3 million services business, the lost-opportunity cost runs to £50,000 to £150,000 a year in forgone optimisation of resource allocation, pricing, or customer targeting. The wider response is to lose faith in AI as a category.

An overfit model is more dangerous because the initial performance is deceptively good. The team deploys with confidence. Production data arrives with its own distribution, accuracy collapses, and the team often does not notice because the API still returns predictions and the system still looks operational. If the model drives customer-facing pricing, recommendations, or eligibility, customers experience abrupt changes in behaviour and complaints rise before anyone diagnoses the cause. Total exposure can run to £50,000 to £200,000 in lost margin and churn before remediation completes.

The 2026 compliance pressure adds a third layer. The EU AI Act now bites for UK firms with EU customer or output exposure, and HMRC real-time compliance applies to payroll and HR systems. An overfit payroll classifier that learned quirks specific to past employees can systematically misclassify new ones, triggering remediation duties and penalties. The financial services case at the top of this post, 94 per cent test-set accuracy dropping to 71 per cent in production with adverse impact on protected groups, is the canonical example of overfitting becoming a regulatory file.

How to decide for your business: the ten procurement questions

The procurement gate is where this risk gets managed. Ten questions, each requiring evidence rather than vendor reassurance, surface whether a model has been honestly validated or benchmarked against itself. They are not technical, but they are specific, and the answers separate vendors who have done the work from vendors who have not.

First, what is the gap between training accuracy and test accuracy? Demand specific numbers; above five to seven percentage points is a red flag for classification. Second, how was the dataset split? The vendor should describe train, validation, and test splits explicitly, with the test set held out from the start. Third, what validation technique was used? For SME data sizes, repeated k-fold cross-validation (five-fold, repeated ten times) is the standard. Fourth, has the model been tested on data from a different time period, segment, or channel than the training data? At least one out-of-distribution test is the minimum. Fifth, what regularisation techniques were used? L1 or L2 penalties, dropout for neural networks, early stopping where applicable. A complex model on small data without regularisation is the textbook overfitting setup.
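For question three, repeated k-fold is straightforward to demand and to verify. A minimal sketch with scikit-learn on synthetic data, reporting the mean and spread across folds rather than one headline number; the model and dataset are illustrative.

```python
# Minimal sketch: five-fold cross-validation repeated ten times on a small dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1500, n_features=20, n_informative=6, random_state=5)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

# Report the spread as well as the mean: a single headline number hides instability.
print(f"accuracy {scores.mean():.2%} ± {scores.std():.2%} across {len(scores)} folds")
```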

Sixth, was there any data leakage in feature engineering? Features must be created after the train-test split, not before. Seventh, how will performance be monitored in production? Specific metrics, alert thresholds, escalation paths. Eighth, what is the rollback procedure if performance degrades? Fallback models or human escalation should be defined before deployment. Ninth, was the final evaluation run on a fully independent holdout set the team has not touched? Nested cross-validation or a true holdout, not the same test set used for model selection. Tenth, would a simpler baseline perform nearly as well? If logistic regression or a shallow decision tree sits within two to three percentage points on test accuracy, use the simpler model.
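Questions six and ten are also easy to evidence. A minimal sketch with scikit-learn on synthetic data: preprocessing lives inside a pipeline so it is fitted on training data only, and a simple baseline is scored alongside a more complex model on the same held-out test set. The models and split are illustrative.

```python
# Minimal sketch: leakage-safe preprocessing via a Pipeline, plus a simpler-baseline comparison.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=25, n_informative=8, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6)

candidates = {
    "logistic regression (baseline)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient boosting (complex)": make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=6)),
}
for name, pipe in candidates.items():
    pipe.fit(X_train, y_train)  # scaler statistics come from training data only, so no leakage
    print(f"{name}: test accuracy {pipe.score(X_test, y_test):.2%}")
```

If the two test scores sit within two or three percentage points of each other, the simpler model wins on interpretability alone.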

Two safeguards follow from those questions. Default to simpler, interpretable models on small data: the team can read the rules and the model generalises better. Invest in monitoring infrastructure from day one: learning curves, cross-validation, a held-out test set, and production monitoring. The cost is small compared with an undetected overfit model failing in production for three months. If you want to walk through whether a specific vendor pitch passes these ten questions for your firm, book a conversation.
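What day-one monitoring can look like, in a deliberately minimal plain-Python sketch; the baseline figure, the five-point threshold, and the rollback action are illustrative assumptions a vendor would replace with real values.

```python
# Minimal sketch: compare measured live accuracy against the accuracy accepted at
# sign-off and flag a rollback when the drop exceeds an agreed threshold.
BASELINE_ACCURACY = 0.88  # accuracy accepted at procurement sign-off (illustrative)
ALERT_DROP = 0.05         # alert if live accuracy falls more than 5 points below baseline

def check_weekly_accuracy(live_accuracy: float) -> str:
    """Return an action for this week's measured live accuracy."""
    if live_accuracy < BASELINE_ACCURACY - ALERT_DROP:
        return "ALERT: trigger rollback or human-review fallback"
    return "OK: continue monitoring"

print(check_weekly_accuracy(0.86))  # OK
print(check_weekly_accuracy(0.71))  # ALERT
```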

Sources

Coursera (2024). Overfitting vs Underfitting: What's the Difference? The canonical plain-English framing of the bias-variance trade-off used here for the recruitment-tool examples and the diagnostic distinction. https://www.coursera.org/articles/overfitting-vs-underfitting

AIE Works (2024). Day 41: Overfitting and Underfitting. The fintech case study where a fraud detection model achieved 99.8 per cent accuracy in testing and saw fraud losses jump 300 per cent within 24 hours of production deployment. https://aieworks.substack.com/p/day-41-overfitting-and-underfitting

Milvus (2024). How do you handle overfitting in small datasets? The reference for why complex models with thousands of parameters memorise noise on SME-scale data and why default complexity settings are the wrong starting point. https://milvus.io/ai-quick-reference/how-do-you-handle-overfitting-in-small-datasets

Splunk (2024). What is Model Drift? The framing for data drift versus concept drift used here to argue for out-of-distribution testing as a procurement requirement. https://www.splunk.com/en_us/blog/learn/model-drift.html

InterviewNode (2024). How Machine Learning Models Fail in Production and What to Do About It. Source for the silent-error and monitoring-gap framing, used here for the financial services 94-to-71 per cent drop and the regulatory consequences. https://www.interviewnode.com/post/how-machine-learning-models-fail-in-production-and-what-to-do-about-it

PMC (2022). A study on cross-validation for small datasets. Peer-reviewed source for the repeated k-fold technique recommended for SME-scale data. https://pmc.ncbi.nlm.nih.gov/articles/PMC8905023/

Built In (2024). Model Validation: Train, Validation, Test Split. The reference for the 70-15-15 split discipline and the rule that the test set must be data the model has never seen. https://builtin.com/data-science/model-validation-test

DotData (2024). Preventing Data Leakage in Feature Engineering. The reference for why features must be created after the train-test split, used here for the data-leakage procurement question. https://dotdata.com/blog/preventing-data-leakage-in-feature-engineering-strategies-and-solutions/

Employment Hero (2026). Why AI Fails UK SMEs in 2026 and How to Fix It. The 35 per cent UK SME adoption figure and the "broken foundations" diagnosis used here for the monitoring-infrastructure argument. https://employmenthero.com/uk/news/why-ai-fails-uk-smes-2026-how-to-fix/

Reiter (2022). Simple vs Complex Models. The argument that simpler models often outperform complex ones on small datasets, used here for the baseline-comparison procurement question. https://ehudreiter.com/2022/10/26/simple-vs-complex-models/

Frequently asked questions

How do I know if a vendor's accuracy claim is overfit?

Ask for the gap between training accuracy and test accuracy as specific numbers. If the vendor reports 94 per cent on testing but cannot or will not state training accuracy, that is the warning sign. A gap of more than five to seven percentage points is a red flag for classification problems. Then ask how the test set was held out and whether the model has been evaluated on data from a different time period or customer segment than the training data. Vendors who have done the work answer fluently.

We only have about 1,500 labelled examples. Is AI even viable for us?

Yes, but the model architecture matters more than the algorithm. On a 1,500-row dataset a logistic regression or shallow decision tree will often generalise better than a neural network, and your team can read the rules. If the use case suits a foundation model with prompt engineering, that bypasses the small-data problem entirely. Where you do build a custom model, insist on repeated k-fold cross-validation and a held-out test set the team has not touched.

What is the single most important question to ask before signing?

How will you monitor performance in production, and what triggers a rollback? An overfit model can fail silently for weeks before anyone notices. If the vendor cannot describe the metrics tracked continuously, the alert thresholds, and the fallback path when accuracy degrades, you are buying a system that has no early warning. Monitoring is not a nice-to-have; it is the only way to detect the failure mode that actually costs you money.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation


If any of this sounds familiar, let's talk.

The next step is a conversation. No pitch, no pressure. Just an honest discussion about where you are and whether I can help.

Book a conversation