AI escalation thresholds: design on confidence, not hope

The operations lead I spoke with last month had a problem she could not quite name. Her team was using an AI tool to draft client responses, summarise meeting notes, and flag invoice anomalies. Sometimes they acted on suggestions that turned out to be wrong. Sometimes they ignored suggestions that turned out to be right. When I asked her how the team decided which AI outputs to check and which to trust, she said, “We check the ones that look off.” That is the rule almost every owner-managed business is running today, and it is the wrong rule.

The reason it is the wrong rule is simple. “Looks off” only catches the AI outputs that happen to catch someone’s eye. The outputs that matter are the ones that look right and are wrong. A confidently worded response that misquotes a policy. A summary that gets the figures slightly off. An invoice flag that should have been raised and was not. Those are the failures that cost money, lose clients, or attract a regulator’s attention, and they are the failures a “looks off” rule reliably misses.

There is a better rule, and it is now possible to design it rather than guess at it. Recent research on AI confidence calibration, plus what is already known from human factors work in regulated industries, gives owner-managers a practical way to set escalation thresholds that match the level of human checking to the risk of each AI output. The rest of this piece walks through what an escalation threshold actually is, why your current implicit rule fails, where you will meet thresholds in practice, when to insist on them, and what to read next.

What is an AI escalation threshold?

An AI escalation threshold is the explicit rule that decides when an AI output goes to a human for review before the business acts on it. It draws on three inputs together: the model’s calibrated confidence score, behavioural signals from your team (review latency, edit rates, overrides), and the business consequence of the output being wrong. The threshold is the line at which those three, combined, trigger a check.
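To make the definition concrete, here is a minimal sketch of one way the three inputs could be combined. Everything in it is an assumption for illustration: the function name `needs_human_review`, the consequence labels, and the numeric cut-offs are hypothetical and would need to be set from your own data, not copied.

```python
# Illustrative cut-offs only: the confidence an output must reach
# before it can skip review, by consequence level (assumed values).
REQUIRED_CONFIDENCE = {"low": 0.60, "medium": 0.85}

def needs_human_review(confidence: float, edit_rate: float, consequence: str) -> bool:
    """Return True when an output should go to a person before it is acted on.

    confidence  -- calibrated model confidence, 0.0 to 1.0
    edit_rate   -- share of recent AI drafts that staff edited, 0.0 to 1.0
    consequence -- "low", "medium" or "high" (hypothetical labels)
    """
    if consequence == "high":
        return True  # high-consequence outputs are always checked
    required = REQUIRED_CONFIDENCE[consequence]
    if edit_rate > 0.30:
        # Behavioural signal: staff are correcting a lot, so raise the bar.
        required += 0.10
    return confidence < required
```

The design choice worth noting is that no single input decides the outcome on its own, except at the high-consequence end, where the check is unconditional.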

The key point is that confidence and correctness are not the same thing. A 2024 method from MIT called Thermometer addressed this directly, after researchers found that the confidence scores produced by large language models out of the box were often badly calibrated against actual correctness. A model might say it was 95% confident in an answer that was wrong four times out of ten. The Thermometer technique recalibrates that score so the number is meaningful. You do not need to implement Thermometer yourself, but you do need to know whether your vendor has done anything equivalent.
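If your tool lets you log its stated confidence alongside whether each output later proved correct, you can run a rough calibration check yourself. The sketch below is not the Thermometer method and does not recalibrate anything; it only groups logged outputs into ten confidence bands and compares stated confidence with observed accuracy. The function name `calibration_table` and the record format are assumptions.

```python
from collections import defaultdict

def calibration_table(records):
    """Compare stated confidence with observed accuracy.

    records -- iterable of (stated_confidence, was_correct) pairs,
               with stated_confidence between 0.0 and 1.0.
    Returns {band_start_percent: (mean_confidence, accuracy, count)}.
    """
    bands = defaultdict(list)
    for confidence, was_correct in records:
        index = min(int(confidence * 10), 9)  # 1.0 falls into the top band
        bands[index].append((confidence, was_correct))

    table = {}
    for index in sorted(bands):
        items = bands[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table[index * 10] = (mean_confidence, accuracy, len(items))
    return table
```

Fed ten outputs that each claimed 95% confidence, four of which were wrong, the 90 band would show a mean confidence of about 0.95 against an observed accuracy of 0.6. A gap of that size is the signal to ask your vendor about calibration.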

Why does this matter for your business?

It matters because the dominant failure mode in human-AI teams is overconfidence, not under-use. Research published in Harvard Business School Working Knowledge in 2024 found that when an AI flagged its own outputs as in-range or out-of-range, user error rates dropped by roughly half. Same outputs, same team. The only change was the cue telling the human where to look.

That finding has a direct read-across for SMEs. If your team is acting on AI outputs without any consequence-weighted check, you may be carrying roughly twice the preventable error you need to. The UK Information Commissioner’s Office made the regulatory point in its 2026 guidance on AI and data protection: when AI outputs influence decisions about individuals, the business is expected to document the calibration and oversight processes it applies, not merely the tool it uses. That documentation is the escalation policy, and it is no longer optional in sectors where personal data is in play.

Where will you actually meet escalation thresholds?

You meet escalation thresholds anywhere an AI output is about to be acted on. In a professional services firm that is client communications, draft advice, billing entries, and meeting summaries. In an operational firm it is scheduling, inventory decisions, supplier selection, and document classification. Anywhere an AI agent acts automatically, the threshold is the rule deciding whether a human sees the output first.

The simplest place to meet your first threshold is in a tool you already use. If your AI assistant exposes a confidence indicator, look for it. If it does not, ask your vendor whether one is available and whether it has been calibrated. If neither is on offer, you can still build escalation from behavioural signal alone. Track how often your team edits the AI’s first draft before sending, how often they override its recommendation, and how long they spend reviewing. Those patterns are your threshold inputs even when the model itself is silent.
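The behavioural signals named above can be tallied from a very simple log. This is a sketch under stated assumptions: the `ReviewEvent` record and the `behavioural_signals` function are hypothetical names, and how you capture each event (a spreadsheet, a form, a tool export) is left open.

```python
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    edited: bool           # staff changed the AI draft before using it
    overridden: bool       # staff rejected the AI recommendation
    review_seconds: float  # time spent reviewing the output

def behavioural_signals(events):
    """Summarise a list of ReviewEvent records; returns None if the log is empty."""
    if not events:
        return None
    total = len(events)
    return {
        "edit_rate": sum(e.edited for e in events) / total,
        "override_rate": sum(e.overridden for e in events) / total,
        "mean_review_seconds": sum(e.review_seconds for e in events) / total,
    }
```

Run weekly per use case, a summary like this shows whether edit and override rates are drifting upward, which is the cue to tighten checking even when the model reports no confidence at all.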

When to insist on thresholds and when to leave well alone

Sort your use cases into three consequence tiers and the question answers itself. Tier one is irreversible, regulated, or client-facing output (tax filings, contract clauses, regulatory submissions). Always check, no exceptions. Tier two is operational and reversible (scheduling, draft summaries, supplier shortlists). Apply sampled checks, with the rate set higher when staff edit rates climb. Tier three is routine and low-impact. Trust by default and review monthly.
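The three tiers translate into a short rule. The sketch below assumes a 10% base sampling rate for tier two that rises one-for-one with the staff edit rate; both numbers, and the names `sample_rate` and `should_check`, are illustrative assumptions, not recommended settings.

```python
import random

def sample_rate(edit_rate, base_rate=0.10, max_rate=1.0):
    """Tier two: check a base share of outputs, plus more as edit rates climb."""
    return min(max_rate, base_rate + edit_rate)

def should_check(tier, edit_rate=0.0, rng=random):
    """Decide whether a single output is reviewed before use.

    tier      -- 1 (irreversible, regulated, client-facing),
                 2 (operational and reversible), 3 (routine, low-impact)
    edit_rate -- share of recent drafts staff edited, 0.0 to 1.0
    rng       -- any object with a random() method (defaults to the random module)
    """
    if tier == 1:
        return True  # always check, no exceptions
    if tier == 2:
        return rng.random() < sample_rate(edit_rate)
    if tier == 3:
        return False  # trusted by default; covered by the monthly review
    raise ValueError(f"unknown tier: {tier}")
```

Passing a seeded `random.Random` instance as `rng` makes the tier two sampling repeatable, which helps if you ever need to show how an output was selected for review.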

The tier framework also tells you when to leave well alone. If you find yourself writing escalation rules for tier three outputs, you have over-engineered the policy and your team will route around it. The point of an escalation threshold is to make the riskiest outputs the ones that get checked. If a low-impact use case has the same checking burden as a regulated one, the regulated one will end up under-checked when the team runs out of time. Match the rigour to the consequence and the policy will hold.

Related concepts to read next

Escalation thresholds sit alongside a small family of ideas about reading AI output well. The two-person review threshold extends the logic when single-check evaluation is not enough. The veto check covers outputs where no model confidence justifies action. The owner’s two-question evaluation method is the faster informal version. The wider problem of plausible nonsense in AI numbers is why confidence calibration matters at all.

If you want to take the work out of the threshold-setting itself, the practical move is to write one paragraph per use case naming the consequence tier, the threshold rule, the cue your team will use, and the person who is the human in the loop. One page covers the typical owner-managed firm. If that exercise feels like a stretch, that is the signal to book a conversation. The cost of getting escalation wrong is higher than the cost of designing it properly, and the design itself is a one-afternoon job, not a multi-month programme.

