The slide said 95 per cent accuracy. The vendor was confident, the demo was smooth, and the founder signed the contract. Three months later, the tool was flagging one in five inbound enquiries as high priority when they were straightforwardly junk, and the team had learned to ignore the flags entirely. The headline number was real. It just wasn’t describing anything that mattered for this business.
Accuracy is one of the most-cited metrics in AI product pitches and one of the most casually misunderstood. Understanding what it actually measures, when it is a reasonable proxy for quality, and when it isn’t, is the difference between a useful vendor conversation and an expensive experiment.
What is accuracy in AI evaluation?
Accuracy measures how often an AI system produces the correct answer across the full set of predictions it makes. The standard formula is correct predictions divided by all predictions: for a yes-or-no classification task, that is true positives plus true negatives divided by the total count. If a model reviews 60 records and gets 52 right, its accuracy is roughly 87 per cent. That figure describes average performance, but it says nothing about which 8 records the model got wrong, or why.
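As a rough illustration of that formula, here is a minimal Python sketch. The function name and the way the 52 correct and 8 wrong answers are split between the four outcome types are assumptions made for the example, not figures from any real tool.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Illustrative split only: 60 records, 52 correct, 8 wrong.
# How the 52 and the 8 divide between outcome types is an assumption.
score = accuracy(tp=20, tn=32, fp=5, fn=3)
print(f"accuracy: {score:.1%}")  # 52 / 60, roughly 87 per cent
```

Any other split that adds up to 52 correct out of 60 gives the same score, which is exactly why the single figure cannot tell you which kind of mistake the model is making.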
The figure is satisfying because it is easy to read. One number, clear scale, higher is better. The problem is that accuracy is a summary, and summaries flatten things that should not be flattened.
The example machine learning courses return to repeatedly is fraud detection. If 98 per cent of transactions are legitimate, a model that predicts “not fraud” for every single transaction achieves 98 per cent accuracy without ever catching a fraudulent payment. By the accuracy measure alone, it looks strong. For the fraud team, it is worthless.
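The fraud example can be reproduced in a few lines. This is a sketch on made-up data: the 1,000 transactions and the variable names are assumptions chosen to match the 98 per cent figure above.

```python
# Hypothetical data: 1,000 transactions, 2 per cent of them fraudulent.
actual = [1] * 20 + [0] * 980      # 1 = fraud, 0 = legitimate
predicted = [0] * len(actual)      # the model says "not fraud" every time

correct = sum(a == p for a, p in zip(actual, predicted))
baseline_accuracy = correct / len(actual)

fraud_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fraud_recall = fraud_caught / sum(actual)

print(f"accuracy: {baseline_accuracy:.0%}")
print(f"fraud caught: {fraud_caught} of {sum(actual)}")
```

A model that does no work at all scores 98 per cent here, so any vendor figure on imbalanced data is worth comparing against this kind of do-nothing baseline.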
Accuracy is a starting point. The right question is always what you need alongside it.
Why does accuracy matter for your business?
Accuracy matters for your business because it is the most commonly cited metric in vendor pitches, and accepting it at face value leaves you at a disadvantage. The headline figure describes average performance across whatever dataset the vendor used to test the model. It says nothing about performance on your data, your workflows, or the specific mistakes that carry the highest cost in your context.
The data gap is significant. If you run a small professional services firm and you are evaluating a tool that categorises client documents, the vendor’s test data was almost certainly not drawn from your sector, your client base, or your document formats. A figure derived from generic web content or a US legal corpus tells you very little about how the model will perform on the contracts, briefs, and correspondence your team handles every day.
There is also a regulatory dimension. The ICO’s guidance on AI and data protection, updated in May 2024, expects organisations to be able to explain how AI decisions are made and to demonstrate that outputs are accurate and fair. That obligation sits with you, not with the vendor. A 95 per cent headline figure does not discharge it.
Monitoring matters as much as the launch figure. Accuracy can degrade quietly as customer language shifts, document formats change, or conditions evolve. A model that performed well at go-live may be failing six months later, and without a review cadence you will not know.
Where will you actually meet accuracy as a metric?
For small service firms, accuracy comes up mainly at three moments: evaluating an off-the-shelf tool during procurement, reviewing a vendor’s pitch materials, and checking whether a live tool is still performing well enough to keep. Each setting calls for a slightly different question, but the underlying concern is the same: accurate on what, tested on what data, and what happens when it is wrong.
During procurement, ask the vendor to break down accuracy by class, meaning separately for the cases that really were positive and the cases that really were negative. If a lead-scoring tool is 95 per cent accurate overall but only 60 per cent accurate at identifying leads likely to convert, the headline figure is hiding the problem that matters. Ask what the false positive rate is, and what the false negative rate is, for your specific use case.
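To see how a 95 per cent headline can sit on top of a 60 per cent hit rate for converting leads, here is a sketch with invented counts. The 1,000 leads, the 100 converters, and the function name `rates` are all assumptions for illustration.

```python
def rates(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Break one accuracy figure into per-class rates."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "positive_class_accuracy": tp / (tp + fn),  # also called recall
        "negative_class_accuracy": tn / (tn + fp),  # also called specificity
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (tn + fp),
    }


# Hypothetical: 1,000 leads, 100 of which go on to convert.
# The tool spots 60 of the converters and wrongly flags 10 non-converters.
lead_scores = rates(tp=60, tn=890, fp=10, fn=40)
for name, value in lead_scores.items():
    print(f"{name}: {value:.1%}")
```

Because non-converting leads outnumber converters nine to one in this example, doing well on them is enough to carry the overall figure, even though four in ten real opportunities are missed.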
When reviewing pitch materials without the vendor present, treat any standalone accuracy figure as incomplete. Look for the evaluation dataset description. Was it independent data the model had not seen before? Was it drawn from your sector? The Google Machine Learning Crash Course describes accuracy as a coarse-grained metric that becomes less informative as class imbalance increases, and the class balance in your actual workflows is likely different from whatever the vendor used for testing.
For tools already running in production, build a simple review cadence. Sample a batch of outputs monthly, check them against the ground truth, and track whether accuracy is holding or drifting. Production accuracy matters more than benchmark accuracy.
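A monthly check along these lines could be as simple as the sketch below. The function names, the sample size of 50, the five-point tolerance, and the go-live figure are all assumptions; the records are invented pairs of what the tool said and what a human reviewer decided.

```python
import random


def sampled_accuracy(records, sample_size, seed=None):
    """Accuracy on a random sample of (tool_output, reviewed_label) pairs."""
    if not records:
        raise ValueError("no records to sample")
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    correct = sum(output == truth for output, truth in sample)
    return correct / len(sample)


def is_drifting(current, baseline, tolerance=0.05):
    """True when accuracy has dropped more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


# Hypothetical month: 200 outputs, 164 of which a reviewer agreed with.
month_records = [("flag", "flag")] * 164 + [("flag", "pass")] * 36
GO_LIVE_ACCURACY = 0.92  # assumed figure measured at launch

monthly = sampled_accuracy(month_records, sample_size=50, seed=1)
drifting = is_drifting(monthly, GO_LIVE_ACCURACY)
print(f"sampled accuracy: {monthly:.0%}, drifting: {drifting}")
```

A sample of 50 gives a noisy estimate, so a single bad month is a prompt to look closer rather than proof of failure. The trend across several months is the more reliable signal.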
When should you ask about accuracy, and when should you set it aside?
Accuracy is the right metric when the AI is doing a classification job: accept or reject, flag or pass, match or no match. If the task is binary, the data is reasonably balanced, and both types of mistake carry similar costs, accuracy gives you a clear and useful picture. Ask for it directly in any vendor conversation where the tool is making a yes-or-no judgement on something that affects your business.
Set it aside when the task is generative. If the AI is writing email drafts, summarising documents, or suggesting responses to customer enquiries, there is no single correct answer to measure against. Asking for an accuracy figure in a generative context will produce either a meaningless number or a metric that has been redefined to sound more like accuracy than it is. Factuality, consistency, and structured human review are more relevant controls for generative tools.
The same applies to AI agents: tools that carry out multi-step workflows, search databases, and take actions on your behalf. For agentic systems, what matters is whether the agent chose the right action at each step and completed the task safely. A headline accuracy figure tells you none of that.
Also set accuracy aside when false positives and false negatives carry very different costs. A spam filter usually has to be cautious, because blocking a real email tends to cost more than letting a little junk through. A compliance alert in a legal firm may need to bias in the opposite direction, flagging generously so that fewer real issues are missed. In these cases, ask for precision and recall separately rather than accepting a single accuracy number.
What other metrics sit alongside accuracy?
Three companion metrics explain what accuracy leaves out. Precision measures how often the AI is right when it flags something: 80 per cent precision means two in every ten flags are wrong. Recall measures how many of the real cases the AI caught: 70 per cent recall means three in ten genuine positives slipped through. The F1 score combines the two as their harmonic mean, which is useful when false alarms and missed cases matter in equal measure.
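The three definitions translate directly into code. This sketch uses invented counts chosen so that precision comes out at 80 per cent and recall at 70 per cent, matching the figures above; the function names are assumptions for the example.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged, the share that was a real case."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all real cases, the share that was flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 70 flags raised, 56 of them correct, 24 real cases missed.
p = precision(tp=56, fp=14)   # 56 / 70
r = recall(tp=56, fn=24)      # 56 / 80
score_f1 = f1(p, r)
print(f"precision {p:.0%}, recall {r:.0%}, F1 {score_f1:.1%}")
```

Note that true negatives appear in none of these three formulas, which is why they stay informative when the uninteresting class dominates the data.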
These numbers interact in ways specific to your business. A compliance-checking tool at a small law firm faces a high cost for false negatives, missing a real compliance issue, and a lower cost for false positives, flagging something that turns out to be fine. That firm should weight recall heavily and accept lower precision as a trade-off. A marketing qualification tool faces roughly the opposite balance: a false positive wastes a sales conversation, while a false negative loses a lead.
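One way to make that trade-off concrete is to put a rough price on each kind of mistake. Everything in the sketch below is hypothetical: the two tool settings, the error counts, and the costs in pounds are invented to show the arithmetic, not drawn from any real firm.

```python
def error_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of mistakes, pricing each error type separately."""
    return fp * cost_fp + fn * cost_fn


def accuracy_from_errors(fp: int, fn: int, total: int) -> float:
    """Accuracy when only the error counts and the total are known."""
    return (total - fp - fn) / total


TOTAL_DOCS = 1000
COST_FALSE_ALARM = 20     # assumed: reviewer time to dismiss a wrong flag
COST_MISSED_ISSUE = 2000  # assumed: cost of a compliance issue going unnoticed

# Setting A flags sparingly; setting B flags generously.
accuracy_a = accuracy_from_errors(fp=5, fn=10, total=TOTAL_DOCS)
accuracy_b = accuracy_from_errors(fp=40, fn=2, total=TOTAL_DOCS)
cost_a = error_cost(fp=5, fn=10, cost_fp=COST_FALSE_ALARM, cost_fn=COST_MISSED_ISSUE)
cost_b = error_cost(fp=40, fn=2, cost_fp=COST_FALSE_ALARM, cost_fn=COST_MISSED_ISSUE)

print(f"A: accuracy {accuracy_a:.1%}, error cost {cost_a}")
print(f"B: accuracy {accuracy_b:.1%}, error cost {cost_b}")
```

Under these assumed costs, the setting with the lower accuracy is the cheaper one to run, because it trades many inexpensive false alarms for fewer expensive misses.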
The EU AI Act, which entered into force in August 2024, classifies AI systems by risk level and expects organisations using high-risk systems to carry out proper testing and maintain technical documentation. For firms in regulated sectors, the FCA has made clear that governance, testing, and monitoring of AI-supported processes are the firm’s responsibility, not the vendor’s. Accuracy, precision, and recall are the metrics that make that monitoring concrete and auditable.
For generative AI, other measures apply. Factuality, faithfulness to source material, and consistency across repeated queries are more relevant than classification accuracy. Periodic sampling, comparison against human-verified outputs, and structured review cycles are the practical controls. The right metric is always the one that reflects what the tool is actually doing, not the one that makes the vendor’s pitch deck look impressive.



