Three months after signing off on an AI document tool, the operations manager at a 20-person firm flagged a problem. An invoice had been processed automatically, the captured amount was wrong, and the tool had marked the extraction as “high confidence.” When she dug into the settings, she found a field labelled confidence threshold. Nobody had touched it. Nobody had explained what the number meant or what it should be set to.
That gap is common. Confidence scores appear in a wide range of AI tools that owner-managed businesses are buying and deploying now. This post explains what the number actually means, where you will meet it, and what it should change about how you run your deployment.
What is an AI confidence score?
A confidence score is the AI system’s estimate of how likely its output is to be correct, expressed on a scale of 0 to 1 or as a percentage. A score of 0.8, or 80 per cent, means the model puts the probability that its output is correct at roughly 80 per cent. That correspondence holds only when the vendor has calibrated the scores against real-world data similar to yours, and many have not.
Microsoft Azure’s Q&A service returns a confidence score between 0 and 100 for each answer it generates from a knowledge base, with higher scores indicating a better match to the query. ServiceNow’s Document Intelligence displays results as colour-coded bands: green for scores between 76 and 100, yellow for 50 to 75, and red for anything below 50. Box AI Extract returns per-field probability scores whenever it processes a document, using them to flag which extractions warrant a second look.
The important caveat is calibration. Interactions, a conversational AI provider, explains that a well-calibrated 80 per cent confidence should correspond to the system being right about 80 per cent of the time in that score band. If the vendor has not validated their model against data similar to yours, the score may be a less reliable guide than it appears.
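To make the calibration idea concrete, here is a minimal sketch in Python of how you might check it on your own reviewed outputs. It assumes you have a list of pairs recording the score the tool reported and whether a person found the output correct. The function name, the ten-band split, and the sample data are illustrative assumptions, not part of any vendor’s product.

```python
from collections import defaultdict


def calibration_table(records, n_bands=10):
    """Group (score, was_correct) pairs into equal-width score bands
    and report how often the outputs in each band were actually right."""
    counts = defaultdict(lambda: [0, 0])  # band index -> [correct, total]
    for score, was_correct in records:
        # A score of exactly 1.0 belongs in the top band.
        idx = min(int(score * n_bands), n_bands - 1)
        counts[idx][0] += int(was_correct)
        counts[idx][1] += 1

    table = []
    for idx in sorted(counts):
        correct, total = counts[idx]
        table.append({
            "band": (idx / n_bands, (idx + 1) / n_bands),
            "count": total,
            "accuracy": correct / total,
        })
    return table


# Hypothetical review results: (score reported by the tool, was it right?)
sample = [
    (0.85, True), (0.82, True), (0.88, False), (0.81, True), (0.86, True),
    (0.95, True), (0.97, True),
    (0.55, False), (0.52, True),
]

for row in calibration_table(sample):
    low, high = row["band"]
    print(f"{low:.1f}-{high:.1f}: {row['count']} outputs, "
          f"{row['accuracy']:.0%} correct")
```

If the scores are well calibrated for your data, the observed accuracy in each band should sit close to the band itself. A real check needs far more than nine records per run; the sample here only shows the shape of the calculation.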
Why should it change what you choose to automate?
The practical power of a confidence score is the threshold decision it enables. You can configure many AI tools to auto-accept outputs above a defined score and route everything below it to a human reviewer. That single setting determines how many documents, transactions, or enquiries move through your business without any human check, and what the error rate in that unreviewed stream will be.
LlamaIndex describes this threshold as a “quality gate” between automated and manual processing. Infrrd, a document AI platform, recommends accepting LLM outputs above a pre-defined level of around 80 per cent and routing everything below that to manual review. For a firm processing 500 invoices a month, the threshold setting has direct business implications: it determines how many invoices a person looks at and how many pass straight to payment.
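The quality gate described above can be sketched in a few lines of Python. This is an illustration of the routing logic only, not any vendor’s API: the field names, the invoice records, and the choice to treat a score exactly at the threshold as accepted are all assumptions.

```python
def route_by_confidence(items, threshold=0.80):
    """Split items into those accepted automatically and those sent
    to a human reviewer, based on each item's confidence score."""
    auto_accept, needs_review = [], []
    for item in items:
        # Assumption: a score exactly at the threshold is accepted.
        if item["confidence"] >= threshold:
            auto_accept.append(item)
        else:
            needs_review.append(item)
    return auto_accept, needs_review


# Hypothetical extracted invoices with per-document confidence scores.
invoices = [
    {"id": "INV-001", "amount": "1250.00", "confidence": 0.97},
    {"id": "INV-002", "amount": "89.90", "confidence": 0.62},
    {"id": "INV-003", "amount": "4300.00", "confidence": 0.80},
]

auto, review = route_by_confidence(invoices, threshold=0.80)
print("Straight to payment:", [inv["id"] for inv in auto])
print("Sent for review:", [inv["id"] for inv in review])
```

Raising the threshold moves documents from the first list to the second, which is the trade between reviewer workload and unreviewed errors that the setting controls.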
Mindee recommends testing different threshold levels on your actual data before settling on one, because the cost of a wrong extraction on your documents may differ from what the vendor assumed when they set their defaults. Errors in customer-facing processing carry different consequences than errors in internal records.
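One way to run that kind of test is to take a sample of outputs a person has already checked and replay it at several candidate thresholds. The Python sketch below does this under stated assumptions: the labelled sample, the cost of an unreviewed error, and the cost of a manual review are all made-up figures that you would replace with your own.

```python
def sweep_thresholds(labelled, thresholds, cost_per_error, cost_per_review):
    """For each candidate threshold, count what would be auto-accepted,
    how many of those would be wrong, and the resulting total cost."""
    rows = []
    for t in thresholds:
        accepted = [ok for score, ok in labelled if score >= t]
        errors = sum(1 for ok in accepted if not ok)
        reviewed = len(labelled) - len(accepted)
        rows.append({
            "threshold": t,
            "auto_accepted": len(accepted),
            "unreviewed_errors": errors,
            "reviewed": reviewed,
            "cost": errors * cost_per_error + reviewed * cost_per_review,
        })
    return rows


# Hypothetical labelled sample: (score, was the extraction correct?)
labelled = [
    (0.95, True), (0.92, True), (0.88, True), (0.85, False),
    (0.78, True), (0.74, False), (0.65, True), (0.55, False),
]

# Assumed costs in pounds: a wrong payment is far dearer than a manual check.
results = sweep_thresholds(labelled, [0.6, 0.7, 0.8, 0.9],
                           cost_per_error=50, cost_per_review=2)
for row in results:
    print(row)

best = min(results, key=lambda r: r["cost"])
print("Lowest-cost threshold on this sample:", best["threshold"])
```

The cheapest threshold depends entirely on the two cost figures. If errors are cheap to fix and reviews are expensive, the same sample will favour a lower threshold, which is why the vendor’s default may not suit your documents.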
The Financial Conduct Authority’s 2023 AI discussion paper makes a parallel point for regulated firms: businesses must understand how AI errors affect consumers and ensure appropriate oversight is in place. Even for firms outside financial services, that principle is sound. What threshold you have set, and why, is a governance question you may one day need to answer.
Where will you actually meet confidence scores?
Confidence scores appear most commonly in document processing tools, conversational AI platforms, and customer support systems. In each category, the score plays a similar role: it tells the system whether to handle the output automatically or escalate it to a person. Knowing that this control exists, and understanding what it governs, is what separates a managed deployment from one where the defaults are quietly doing the work.
In document processing, ServiceNow’s Document Intelligence assigns a numeric confidence to every field it extracts from a form or invoice. Box AI Extract returns per-field scores. Both use these values to guide which outputs move straight into a workflow and which get flagged for review.
Conversational AI tools used in customer support or query triage also use confidence scores to decide when to hand off to a human agent. Interactions explains that their system derives scores from posterior probabilities in the model and uses them to route conversations accordingly.
One challenge for buyers is that some tools score internally but surface only a traffic-light indicator or a pass/fail label, without exposing the underlying number. Where that is the case, you cannot set your own threshold or examine the distribution of scores, so you are relying entirely on the vendor’s defaults. If you are evaluating any AI tool for production use, asking “do you expose confidence scores, and on what scale?” is a useful early question.
When is the score a reliable signal, and when is it not?
Confidence scores are most reliable for tasks with a clear correct answer: extracting a specific field from a document, classifying an enquiry, or answering a question from a defined knowledge base. They are less useful for open-ended generation tasks, where there is no single right answer against which the model’s probability estimate can be verified, so the number is more speculative than it appears.
Mindee notes that confidence scoring requires calibration to specific business goals and works best when predictions can be checked against labelled data. Using AI to write a client summary or generate a proposal draft involves no clear ground truth, so a high confidence score in those settings may reflect the model’s internal calculation while carrying little reliable information about whether the output is genuinely good.
Domain transfer is a second risk. A confidence model built and validated on one category of data may not be well-calibrated on yours. A tool trained primarily on US invoice formats may give misleading confidence signals when applied to UK documents, different date conventions, or sector-specific terminology. If the vendor’s validation set differs significantly from your real data, recalibrate before trusting the defaults.
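A rough way to spot this kind of mismatch is to compare the average score the tool reports with the accuracy you actually observe, separately for each category of document. The Python sketch below assumes you have tagged each reviewed output with a segment label. The segment names and figures are hypothetical and carry no claim about any particular tool.

```python
from collections import defaultdict


def calibration_gap_by_segment(records):
    """records: iterable of (segment, score, was_correct).
    Returns, per segment, the mean reported score, the observed
    accuracy, and the gap between them (positive = overconfident)."""
    totals = defaultdict(lambda: {"n": 0, "score_sum": 0.0, "correct": 0})
    for segment, score, was_correct in records:
        t = totals[segment]
        t["n"] += 1
        t["score_sum"] += score
        t["correct"] += int(was_correct)

    result = {}
    for segment, t in totals.items():
        mean_score = t["score_sum"] / t["n"]
        accuracy = t["correct"] / t["n"]
        result[segment] = {
            "n": t["n"],
            "mean_score": round(mean_score, 3),
            "accuracy": round(accuracy, 3),
            "gap": round(mean_score - accuracy, 3),
        }
    return result


# Hypothetical reviewed outputs, tagged by document origin.
reviewed = [
    ("US", 0.92, True), ("US", 0.88, True), ("US", 0.90, True), ("US", 0.90, False),
    ("UK", 0.91, True), ("UK", 0.89, False), ("UK", 0.90, False), ("UK", 0.90, True),
]

for segment, stats in calibration_gap_by_segment(reviewed).items():
    print(segment, stats)
```

A much larger gap in one segment than another suggests the scores mean something different on those documents, and that a single threshold across both may not be safe.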
The ICO’s AI and data protection risk toolkit emphasises that organisations should monitor AI system performance over time and have mechanisms to detect and address errors. A confidence score you cannot see or configure cannot support that monitoring.
What governance sits alongside confidence scores?
Confidence thresholds are one part of a broader set of controls you need when deploying AI for anything that affects customers or processes personal data. The ICO’s guidance, the FCA’s oversight expectations, and UK GDPR Article 22 all point in the same direction: meaningful human review, documented policies, and the ability to explain how the AI arrives at its output and how errors are caught.
Under UK GDPR Article 22, using AI to auto-approve or reject decisions that significantly affect individuals may constitute automated decision-making. Where that applies, you need transparency in how the decision is made and, in some cases, a mechanism for human review. If your AI is scoring customer enquiries or approving applications, check whether Article 22 applies before setting any threshold that removes human review entirely.
Confidence-based sampling, where low-confidence outputs are always reviewed by a person, generates the performance data you need for ICO-style monitoring and gives you material to show an auditor if questions arise.
A practical starting point is a short written policy: all low-confidence outputs must be checked before action is taken, outputs that look obviously wrong must be overridden and flagged regardless of confidence level, and patterns of systematic error must be reported to whoever owns the deployment. Paired with an instrumented pilot where you record the AI’s score, the actual accuracy, and the cost of corrections, that policy gives you a calibration picture within weeks.
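The instrumented pilot can be as simple as a log with one row per output. The Python sketch below shows one possible shape for that log and a summary split at your chosen threshold. The record fields, the use of minutes as the cost measure, and the sample entries are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PilotRecord:
    doc_id: str
    score: float             # confidence reported by the tool, 0 to 1
    was_correct: bool        # outcome of the human check
    correction_minutes: int  # time spent fixing the output, 0 if none


def summarise(records, threshold=0.80):
    """Summarise pilot results above and below the threshold."""
    groups = {
        "above": [r for r in records if r.score >= threshold],
        "below": [r for r in records if r.score < threshold],
    }
    summary = {}
    for label, group in groups.items():
        total = len(group)
        correct = sum(1 for r in group if r.was_correct)
        summary[label] = {
            "count": total,
            "accuracy": correct / total if total else None,
            "correction_minutes": sum(r.correction_minutes for r in group),
        }
    return summary


# Hypothetical pilot log entries.
pilot_log = [
    PilotRecord("A", 0.95, True, 0),
    PilotRecord("B", 0.91, True, 0),
    PilotRecord("C", 0.84, False, 12),
    PilotRecord("D", 0.88, True, 0),
    PilotRecord("E", 0.62, False, 8),
    PilotRecord("F", 0.70, True, 0),
]

for label, stats in summarise(pilot_log).items():
    print(label, stats)
```

During the pilot every output is checked, including the high-confidence ones, so the accuracy in the "above" group tells you what error rate you would accept by switching off review for that stream.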
If you would like help building an evaluation framework that includes these questions, book a conversation.



