What accuracy means when evaluating AI outputs and tools

TL;DR

Accuracy tells you how often an AI gets the answer right, but it is a summary metric that can look strong even when the system is failing at the cases that carry the highest cost for your firm. For owner-operated service businesses evaluating AI tools, accuracy is most meaningful for classification tasks with balanced data. For generative AI and agentic systems, precision, recall, F1, and human review carry more weight than any headline percentage.

Key takeaways

- Accuracy measures how often an AI gets the right answer, but as a single summary figure it hides how errors are distributed across different prediction types.
- A high accuracy score can mask poor performance on the cases that matter most to your firm, particularly when one class of outcome is far rarer than the other in your data.
- Ask vendors for precision, recall, and F1 alongside accuracy, and require a description of the evaluation dataset, not just the headline number.
- For generative AI tools such as drafting assistants and summarisers, accuracy is the wrong frame; factuality, consistency, and structured human review are more relevant controls.
- UK regulators including the ICO and FCA expect firms to govern AI outputs, which means tracking performance metrics, documenting tests, and keeping a human in the loop for high-stakes decisions.

The slide said 95 per cent accuracy. The vendor was confident, the demo was smooth, and the founder signed the contract. Three months later, the tool was flagging one in five inbound enquiries as high priority when they were straightforwardly junk, and the team had learned to ignore the flags entirely. The headline number was real. It just wasn’t describing anything that mattered for this business.

Accuracy is one of the most-cited metrics in AI product pitches and one of the most casually misunderstood. Understanding what it actually measures, when it is a reasonable proxy for quality, and when it isn’t, is the difference between a useful vendor conversation and an expensive experiment.

What is accuracy in AI evaluation?

Accuracy measures how often an AI system produces the correct answer across the full set of predictions it makes. The standard formula is correct predictions divided by all predictions: for classification tasks, that is true positives plus true negatives divided by the total count. If a model reviews 60 records and gets 52 right, its accuracy is 87 per cent. That figure describes average performance, but nothing about which 8 it got wrong or why.
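The arithmetic above can be sketched in a few lines of Python. The function name `accuracy` and the way the 52 correct answers are split into true positives and true negatives are assumptions for illustration; the example in the text only gives the 52-of-60 total.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical split of the 60 records:
# 52 correct (20 true positives + 32 true negatives),
# 8 wrong (3 false positives + 5 false negatives).
score = accuracy(tp=20, tn=32, fp=3, fn=5)
print(f"{score:.1%}")  # 86.7%, which rounds to the 87 per cent quoted above
```

Note that any other split adding up to 52 correct and 8 wrong gives the same score, which is exactly why the figure says nothing about which 8 were wrong.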

The figure is satisfying because it is easy to read. One number, clear scale, higher is better. The problem is that accuracy is a summary, and summaries flatten things that should not be flattened.

The example machine learning courses return to repeatedly is fraud detection. If 98 per cent of transactions are legitimate, a model that predicts “not fraud” for every single transaction achieves 98 per cent accuracy without ever catching a fraudulent payment. By the accuracy measure alone, it looks strong. For the fraud team, it is worthless.
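A minimal sketch of that fraud example, assuming a made-up batch of 1,000 transactions of which 2 per cent are fraudulent. The variable names and batch size are illustrative only.

```python
# Hypothetical batch of 1,000 transactions, 2 per cent of them fraudulent.
actual = [True] * 20 + [False] * 980   # True means "fraud"
predicted = [False] * len(actual)      # the model always says "not fraud"

# Count predictions that match reality, and fraud cases that were flagged.
correct = sum(a == p for a, p in zip(actual, predicted))
caught = sum(a and p for a, p in zip(actual, predicted))

baseline_accuracy = correct / len(actual)  # 980 / 1000
fraud_recall = caught / sum(actual)        # 0 / 20

print(f"accuracy: {baseline_accuracy:.0%}, fraud caught: {fraud_recall:.0%}")
# accuracy: 98%, fraud caught: 0%
```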

Accuracy is a starting point. The right question is always what you need alongside it.

Why does accuracy matter for your business?

Accuracy matters for your business because it is the most commonly cited metric in vendor pitches, and accepting it at face value leaves you at a disadvantage. The headline figure describes average performance across whatever dataset the vendor used to test the model. It says nothing about performance on your data, your workflows, or the specific mistakes that carry the highest cost in your context.

The data gap is significant. If you run a small professional services firm and you are evaluating a tool that categorises client documents, the vendor’s test data was almost certainly not drawn from your sector, your client base, or your document formats. A figure derived from generic web content or a US legal corpus tells you very little about how the model will perform on the contracts, briefs, and correspondence your team handles every day.

There is also a regulatory dimension. The ICO’s guidance on AI and data protection, updated in May 2024, expects organisations to be able to explain how AI decisions are made and to demonstrate that outputs are accurate and fair. That obligation sits with you, not with the vendor. A 95 per cent headline figure does not discharge it.

Monitoring matters as much as the launch figure. Accuracy can degrade quietly as customer language shifts, document formats change, or conditions evolve. A model that performed well at go-live may be failing six months later, and without a review cadence you will not know.

Where will you actually meet accuracy as a metric?

For small service firms, accuracy comes up mainly at three moments: evaluating an off-the-shelf tool during procurement, reviewing a vendor’s pitch materials, and checking whether a live tool is still performing well enough to keep. Each setting calls for a slightly different question, but the underlying concern is the same: accurate on what, tested on what data, and what happens when it is wrong.

During procurement, ask the vendor to break down accuracy by class, meaning separately for positive and negative predictions. If a lead-scoring tool is 95 per cent accurate overall but only 60 per cent accurate at identifying leads likely to convert, the headline figure is hiding the problem that matters. Ask what the false positive rate is, and what the false negative rate is, for your specific use case.
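One way to see how those two figures can coexist is to work through a confusion matrix. The sketch below assumes a hypothetical test set of 1,000 leads with 100 converters; the counts are chosen only so that they reproduce the 95 per cent and 60 per cent figures in the example, and are not drawn from any real tool.

```python
# Hypothetical test set of 1,000 leads, 100 of which went on to convert.
tp, fn = 60, 40    # converting leads: 60 spotted, 40 missed
tn, fp = 890, 10   # non-converting leads: 890 correctly passed over, 10 wrongly flagged

overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
converting_accuracy = tp / (tp + fn)       # accuracy on leads that convert
non_converting_accuracy = tn / (tn + fp)   # accuracy on leads that do not
false_negative_rate = fn / (tp + fn)       # converters the tool missed
false_positive_rate = fp / (tn + fp)       # non-converters the tool flagged

print(f"overall: {overall_accuracy:.0%}")              # overall: 95%
print(f"converting leads: {converting_accuracy:.0%}")  # converting leads: 60%
print(f"non-converting leads: {non_converting_accuracy:.1%}")
print(f"false negative rate: {false_negative_rate:.0%}")
print(f"false positive rate: {false_positive_rate:.1%}")
```

Because non-converting leads dominate the test set, strong performance on them carries the headline figure even though four in ten converting leads are missed.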

When reviewing pitch materials without the vendor present, treat any standalone accuracy figure as incomplete. Look for the evaluation dataset description. Was it independent data the model had not seen before? Was it drawn from your sector? The Google Machine Learning Crash Course describes accuracy as a coarse-grained metric that becomes less informative as class imbalance increases, and the class balance in your actual workflows is likely different from whatever the vendor used for testing.

For tools already running in production, build a simple review cadence. Sample a batch of outputs monthly, check them against the ground truth, and track whether accuracy is holding or drifting. Production accuracy matters more than benchmark accuracy.
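A review cadence like this can be tracked in a spreadsheet, but a small script makes the logic explicit. The sketch below is one possible shape, not a prescribed method: the `MonthlyReview` class, the `drifting` function, the 5-point tolerance, and every figure in the log are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonthlyReview:
    month: str
    sampled: int   # outputs checked by a human reviewer
    correct: int   # outputs that matched the ground truth

    @property
    def accuracy(self) -> float:
        return self.correct / self.sampled


def drifting(reviews: list[MonthlyReview], baseline: float,
             tolerance: float = 0.05) -> list[str]:
    """Return the months where sampled accuracy fell more than
    `tolerance` below the agreed baseline."""
    return [r.month for r in reviews if r.accuracy < baseline - tolerance]


# Hypothetical review log: 50 outputs sampled each month.
log = [
    MonthlyReview("2025-01", sampled=50, correct=47),  # 94%
    MonthlyReview("2025-02", sampled=50, correct=46),  # 92%
    MonthlyReview("2025-03", sampled=50, correct=42),  # 84%
]
flagged = drifting(log, baseline=0.94)
print(flagged)  # ['2025-03']
```

With samples this small, a single month's dip can be noise, so the tolerance and sample size are worth agreeing in advance rather than after a bad month.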

When should you ask about accuracy, and when should you set it aside?

Accuracy is the right metric when the AI is doing a classification job: accept or reject, flag or pass, match or no match. If the task is binary, the data is reasonably balanced, and both types of mistake carry similar costs, accuracy gives you a clear and useful picture. Ask for it directly in any vendor conversation where the tool is making a yes-or-no judgement on something that affects your business.

Set it aside when the task is generative. If the AI is writing email drafts, summarising documents, or suggesting responses to customer enquiries, there is no single correct answer to measure against. Asking for an accuracy figure in a generative context will either produce a meaningless number or a metric that has been redefined to sound more like accuracy than it is. Factuality, consistency, and structured human review are more relevant controls for generative tools.

The same applies for AI agents: tools that carry out multi-step workflows, search databases, and take actions on your behalf. For agentic systems, what matters is whether the agent chose the right action at each step and completed the task safely. A headline accuracy figure tells you none of that.

Also set accuracy aside when the costs of false positives and false negatives are very different. A spam filter can afford to be aggressive because the cost of letting junk through outweighs the cost of occasionally blocking a real email. A compliance alert in a legal firm may need to bias in the opposite direction. In these cases, ask for precision and recall separately rather than accepting a single accuracy number.
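To make the cost asymmetry concrete, the sketch below compares two hypothetical tools that share the same accuracy but make opposite kinds of mistake. The function `error_cost`, the error counts, and the pound figures are all assumptions for illustration; real costs would need to come from your own firm.

```python
def error_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of mistakes, pricing the two error types separately."""
    return fp * cost_fp + fn * cost_fn


# Two hypothetical tools, each wrong 50 times in 1,000 decisions
# (so both report 95 per cent accuracy).
tool_a = {"fp": 45, "fn": 5}   # errs towards false alarms
tool_b = {"fp": 5, "fn": 45}   # errs towards missed cases

# Assumed costs for a compliance alert: a missed issue is far more
# expensive than a false alarm that someone has to check.
costs = {"cost_fp": 20.0, "cost_fn": 2000.0}

cost_a = error_cost(**tool_a, **costs)
cost_b = error_cost(**tool_b, **costs)
print(cost_a, cost_b)  # 10900.0 90100.0
```

Under these assumed costs the two tools look identical on accuracy yet differ roughly eightfold in what their mistakes cost, which is the gap that separate precision and recall figures expose.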

What other metrics sit alongside accuracy?

Three companion metrics explain what accuracy leaves out. Precision measures how often the AI is right when it flags something: 80 per cent precision means two in every ten flags are wrong. Recall measures how many real cases the AI caught: 70 per cent recall means three in ten genuine positives slipped through. The F1 score balances both, useful when false alarms and missed cases matter in equal measure.
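The three definitions can be written out directly. In this sketch the counts are hypothetical, chosen so that they reproduce the 80 per cent precision and 70 per cent recall figures used above; the function names are illustrative and the code does not guard against empty inputs.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything the tool flagged, the share that was genuinely positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of all genuine positives, the share the tool caught."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 70 items flagged, 56 of them correctly;
# 80 genuine positives in total, so 24 were missed.
tp, fp, fn = 56, 14, 24

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision: {p:.0%}")   # precision: 80%
print(f"recall: {r:.0%}")      # recall: 70%
print(f"F1: {f1(p, r):.2f}")   # F1: 0.75
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two figures than a simple average would.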

These numbers interact in ways specific to your business. A compliance-checking tool at a small law firm faces a high cost for false negatives, missing a real compliance issue, and a lower cost for false positives, flagging something that turns out to be fine. That firm should weight recall heavily and accept lower precision as a trade-off. A marketing qualification tool faces roughly the opposite balance: a false positive costs a sales conversation, while a false negative costs a lead.

The EU AI Act, which entered into force on 1 August 2024, classifies AI systems by risk level and expects organisations using high-risk systems to carry out proper testing and maintain technical documentation. For firms in regulated sectors, the FCA has made clear that governance, testing, and monitoring of AI-supported processes are the firm’s responsibility, not the vendor’s. Accuracy, precision, and recall are the metrics that make that monitoring concrete and auditable.

For generative AI, other measures apply. Factuality, faithfulness to source material, and consistency across repeated queries are more relevant than classification accuracy. Periodic sampling, comparison against human-verified outputs, and structured review cycles are the practical controls. The right metric is always the one that reflects what the tool is actually doing, not the one that makes the vendor’s pitch deck look impressive.
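Consistency across repeated queries is the easiest of these to approximate in code. The sketch below is a deliberately crude proxy that only suits short factual answers: it asks what share of repeated answers agree with the most common one after trimming spaces and ignoring case. The function name `consistency` and the sample answers are hypothetical, and longer free-text outputs would still need human review.

```python
from collections import Counter


def consistency(answers: list[str]) -> float:
    """Share of repeated answers that agree with the most common answer,
    ignoring case and surrounding whitespace."""
    normalised = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalised).most_common(1)[0]
    return top_count / len(normalised)


# Hypothetical answers from asking a tool the same question five times.
answers = ["30 days", "30 days", "30 Days ", "14 days", "30 days"]
print(consistency(answers))  # 0.8
```

A high score here only shows the tool is repeatable, not that it is right, so it complements sampling against human-verified outputs rather than replacing it.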

Sources

- ICO (2024). AI and data protection. UK data protection regulator's guidance for organisations deploying AI under UK GDPR, covering transparency, fairness, accuracy, and governance requirements. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- ICO (2024). Guidance on AI and data protection. Detailed ICO guidance published May 2024 on legal obligations when using AI, including the accuracy principle and explainability requirements under UK GDPR. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/guidance-on-ai-and-data-protection/
- FCA (2024). AI for firms. Financial Conduct Authority guidance for regulated firms on using AI responsibly, covering outsourcing governance, consumer harm, and operational resilience obligations. https://www.fca.org.uk/firms/technology-data-innovation/ai
- NCSC (2023). Guidelines for secure AI system development. National Cyber Security Centre guidance published October 2023 on threats including data poisoning, prompt injection, and model manipulation that affect real-world AI accuracy. https://www.ncsc.gov.uk/guidance/secure-ai-systems
- European Parliament and Council of the EU (2024). Regulation (EU) 2024/1689, the EU AI Act. Risk-based framework for AI systems, published in the Official Journal in July 2024 and in force from 1 August 2024, requiring technical documentation and testing for high-risk AI. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- Google (2024). Machine Learning Crash Course: Classification accuracy, precision, recall. Google's reference course describing accuracy as a coarse-grained metric that becomes less informative with class imbalance, with the standard TP/TN/FP/FN formula and worked examples. https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
- Evidently AI (2024). Accuracy, precision, recall in classification. Practitioner guide including a worked example of 52 out of 60 correct predictions yielding 87 per cent accuracy, with guidance on when precision and recall add value. https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall
- Galileo AI (2024). Accuracy metrics for AI evaluation. Analysis of when accuracy is the right evaluation metric and when factuality, faithfulness, and human review better describe generative AI performance. https://www.galileo.ai/blog/accuracy-metrics-ai-evaluation
- Testing Experts (2024). AI agent evaluation. Research on evaluation dimensions for agentic AI systems, covering task completion, reliability, and safety as more meaningful measures than point-in-time classification accuracy. https://www.testingxperts.com/blog/ai-agent-evaluation/

Frequently asked questions

What does an AI accuracy score actually mean?

An accuracy score is the percentage of predictions an AI system gets right across all the predictions it makes. If a model reviews 100 records and correctly classifies 87, its accuracy is 87 per cent. The figure is easy to read but limited: it averages across all prediction types and says nothing about whether the errors are distributed in a way that matters for your specific use case.

Why is accuracy not always enough when evaluating an AI tool?

Accuracy alone fails when the classes in your data are imbalanced, meaning one outcome is much rarer than the other. In a fraud detection scenario where 98 per cent of transactions are legitimate, a model that always predicts "not fraud" achieves 98 per cent accuracy without ever catching a fraud case. Precision, recall, and F1 reveal this problem where a single accuracy figure hides it.

What should I ask an AI vendor about accuracy before signing a contract?

Ask for the evaluation dataset description, not just the headline number. Find out whether the test data was independent, drawn from your sector, and whether the class balance matches your actual workflows. Ask for precision and recall separately, and define the cost of false positives and false negatives in your context. Also ask how performance will be monitored after go-live, and what the vendor's obligations are if accuracy drops.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
