Designing AI escalation thresholds on confidence, not hope

TL;DR

An AI escalation threshold is the explicit rule that says when an AI output goes to a human before it acts on the business. The reliable design uses three inputs together: the model's calibrated confidence score, behavioural signals from your team, and the business consequence of getting that decision wrong. Built that way, the riskiest AI outputs are the ones most likely to be checked.

Key takeaways

- Model self-confidence and actual correctness are different things, and recent calibration research shows the gap is wide enough to halve user error when alerts are added.
- Three signals belong in any escalation rule: model-reported confidence, team behaviour (latency, edits, overrides), and the business consequence of being wrong.
- Sort use cases into three consequence tiers (always-check, sampled-check, trust by default) before you write any policy.
- You do not need to implement calibration techniques yourself, but you do need to ask vendors whether their confidence scores have been calibrated and against what.
- An escalation policy that fits on one page per use case is better than a thirty-page framework that nobody reads.

The operations lead I spoke with last month had a problem she could not quite name. Her team was using an AI tool to draft client responses, summarise meeting notes, and flag invoice anomalies. Sometimes they acted on suggestions that turned out to be wrong. Sometimes they ignored suggestions that turned out to be right. When I asked her how the team decided which AI outputs to check and which to trust, she said, “We check the ones that look off.” That is the rule almost every owner-managed business is running today, and it is the wrong rule.

The reason it is the wrong rule is simple. “Looks off” only catches the AI outputs that happen to catch someone’s eye. The outputs you most need to catch are the ones that look right and are wrong. A confidently worded response that misquotes a policy. A summary that gets the figures slightly off. An invoice flag that should have been raised and was not. Those are the failures that cost money, lose clients, or attract a regulator’s attention, and they are the failures a “looks off” rule reliably misses.

There is a better rule, and it is now possible to design rather than guess. Recent research on AI confidence calibration, plus what is already known from human factors work in regulated industries, gives owner-managers a practical way to set escalation thresholds that match the risk of each AI output to the level of human checking applied to it. The rest of this piece walks through what an escalation threshold actually is, why your current implicit rule fails, where you will meet thresholds in practice, when to insist on them, and what to read next.

What is an AI escalation threshold?

An AI escalation threshold is the explicit rule that decides when an AI output goes to a human for review before it acts on the business. It uses three inputs together: the model’s calibrated confidence score, behavioural signal from your team (latency, edit rates, overrides), and the business consequence of the output being wrong. The threshold is the line where those three combined trip a check.
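As a rough illustration, the three inputs can be combined in a few lines. This is a minimal sketch, not any vendor's API: the `Output` fields, the tier names, and the 0.85 and 0.30 cut-offs are all assumptions picked for the example, and the edit rate stands in for the wider set of behavioural signals.

```python
from dataclasses import dataclass


@dataclass
class Output:
    confidence: float  # calibrated model confidence, 0.0 to 1.0 (assumed to be available)
    edit_rate: float   # share of recent outputs on this task that staff edited, 0.0 to 1.0
    tier: str          # "always_check", "sampled_check" or "trust_by_default"


def needs_human(output: Output,
                min_confidence: float = 0.85,   # illustrative cut-off
                max_edit_rate: float = 0.30) -> bool:  # illustrative cut-off
    """Return True when the output should go to a human before it is acted on."""
    if output.tier == "always_check":
        return True   # the consequence alone trips the check
    if output.tier == "trust_by_default":
        return False  # left to periodic review instead
    # Middle tier: check when the model is unsure or the team keeps correcting it.
    return output.confidence < min_confidence or output.edit_rate > max_edit_rate


print(needs_human(Output(confidence=0.95, edit_rate=0.50, tier="sampled_check")))
```

In this sketch a high-confidence output is still escalated when staff have been editing that task heavily, which is the point of using the inputs together rather than relying on the model's number alone.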

The crucial point is that confidence and correctness are not the same thing. A 2024 method from MIT called Thermometer addressed this directly, after researchers found that the confidence scores produced by large language models out of the box were often badly calibrated against actual correctness. A model might report 95% confidence across a set of answers and still be wrong on four in ten of them. The Thermometer technique recalibrates that score so the number is meaningful. You do not need to implement Thermometer yourself, but you do need to know whether your vendor has done anything equivalent.
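One way to see the gap for yourself is to compare stated confidence with how often the output was actually right. The sketch below assumes you keep a log of (confidence, was_correct) pairs from your own reviews. The function name and the five-bucket split are illustrative, and this is a simple bucketed comparison, not the Thermometer method.

```python
from collections import defaultdict


def calibration_table(records, n_buckets=5):
    """Group (confidence, was_correct) pairs into confidence buckets and return
    (average stated confidence, observed accuracy, count) for each bucket."""
    buckets = defaultdict(list)
    for confidence, was_correct in records:
        # Clamp so a confidence of exactly 1.0 lands in the top bucket.
        index = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[index].append((confidence, was_correct))

    table = []
    for index in sorted(buckets):
        rows = buckets[index]
        stated = sum(c for c, _ in rows) / len(rows)
        observed = sum(1 for _, ok in rows if ok) / len(rows)
        table.append((round(stated, 2), round(observed, 2), len(rows)))
    return table


# Hypothetical log: ten outputs reported at 95% confidence, four of them wrong.
log = [(0.95, True)] * 6 + [(0.95, False)] * 4
print(calibration_table(log))
```

A well-calibrated tool shows stated confidence and observed accuracy close together in every bucket. A large gap in the high-confidence buckets is the pattern to raise with your vendor.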

Why does this matter for your business?

It matters because the dominant failure mode in human-AI teams is overconfidence, not under-use. Research published in Harvard Business School Working Knowledge in 2024 found that when an AI flagged its own outputs as in-range or out-of-range, user error rates dropped by roughly half. Same outputs, same team. The only change was the cue telling the human where to look.

That finding has a direct read-across for SMEs. If your team is acting on AI outputs without any consequence-weighted check, you are sitting on a known doubling of preventable error. The UK Information Commissioner’s Office made the regulatory point in its 2026 guidance on AI and data protection: when AI outputs influence decisions about individuals, the business is expected to document the calibration and oversight processes it applies, not merely the tool it uses. That documentation is the escalation policy, and it is no longer optional in sectors where personal data is in play.

Where will you actually meet escalation thresholds?

You meet escalation thresholds anywhere an AI output is about to be acted on. In a professional services firm that is client communications, draft advice, billing entries, and meeting summaries. In an operational firm it is scheduling, inventory decisions, supplier selection, and document classification. Anywhere an AI agent acts automatically, the threshold is the rule deciding whether a human sees the output first.

The simplest place to meet your first threshold is in a tool you already use. If your AI assistant exposes a confidence indicator, look for it. If it does not, ask your vendor whether one is available and whether it has been calibrated. If neither is on offer, you can still build escalation from behavioural signal alone. Track how often your team edits the AI’s first draft before sending, how often they override its recommendation, and how long they spend reviewing. Those patterns are your threshold inputs even when the model itself is silent.
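If you want to track those behavioural signals in something more structured than a spreadsheet, a sketch might look like this. The `ReviewEvent` record and its field names are assumptions for the example; your own tool's logs will look different.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ReviewEvent:
    edited: bool           # staff changed the AI's draft before sending
    overridden: bool       # staff rejected the AI's recommendation
    review_seconds: float  # time spent reviewing the output


def behavioural_signals(events):
    """Summarise a list of ReviewEvent records into three threshold inputs."""
    if not events:
        raise ValueError("need at least one review event")
    total = len(events)
    return {
        "edit_rate": sum(e.edited for e in events) / total,
        "override_rate": sum(e.overridden for e in events) / total,
        "median_review_seconds": median(e.review_seconds for e in events),
    }


# Hypothetical week of reviews for one task.
events = [
    ReviewEvent(edited=True, overridden=False, review_seconds=40),
    ReviewEvent(edited=False, overridden=False, review_seconds=20),
    ReviewEvent(edited=True, overridden=True, review_seconds=90),
    ReviewEvent(edited=False, overridden=False, review_seconds=30),
]
print(behavioural_signals(events))
```

The median is used for review time so that one unusually long review does not distort the picture. Computing the summary per task, rather than across the whole tool, keeps a problem in one use case from being averaged away.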

When to insist on thresholds and when to leave well alone

Sort your use cases into three consequence tiers and the question answers itself. Tier one is irreversible, regulated, or client-facing output (tax filings, contract clauses, regulatory submissions): always-check, no exceptions. Tier two is operational and reversible (scheduling, draft summaries, supplier shortlists): sampled checks, with the rate set higher when staff edit rates climb. Tier three is routine and low-impact: trust by default and review monthly.
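The tier logic above can be sketched as a sampling rule. The 10% base rate, the 50% cap, and the simple "base plus edit rate" formula are all assumptions for illustration; the source only says the tier-two rate should rise as edit rates climb.

```python
import random

BASE_SAMPLE_RATE = 0.10  # assumed starting rate for sampled checks
MAX_SAMPLE_RATE = 0.50   # assumed cap so tier two never becomes check-everything


def sample_rate(tier: str, edit_rate: float) -> float:
    """Share of outputs in this tier that should go to a human."""
    if tier == "always_check":
        return 1.0
    if tier == "trust_by_default":
        return 0.0  # no per-output checks; covered by the monthly review
    if tier != "sampled_check":
        raise ValueError(f"unknown tier: {tier}")
    # Tier two: the base rate rises with the staff edit rate, up to the cap.
    return min(MAX_SAMPLE_RATE, BASE_SAMPLE_RATE + edit_rate)


def should_check(tier: str, edit_rate: float, rng=random.random) -> bool:
    """Randomly pick outputs for review at the tier's sample rate."""
    return rng() < sample_rate(tier, edit_rate)


print(sample_rate("sampled_check", edit_rate=0.25))
```

Passing `rng` as a parameter is only there so the rule can be tested with a fixed value. In day-to-day use the default random draw is enough.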

The tier framework also tells you when to leave well alone. If you find yourself writing escalation rules for tier three outputs, you have over-engineered the policy and your team will quietly route around it. The point of an escalation threshold is to make the riskiest outputs the most checked. If a low-impact use case has the same checking burden as a regulated one, the regulated one will end up under-checked when the team runs out of time. Match the rigour to the consequence and the policy will hold.

Escalation thresholds sit alongside a small family of ideas about reading AI output well. The two-person review threshold extends the logic when single-check evaluation is not enough. The veto check covers outputs where no model confidence justifies action. The owner’s two-question evaluation method is the faster informal version. The wider problem of plausible nonsense in AI numbers is why confidence calibration matters at all.

If you want to take the work out of the threshold-setting itself, the practical move is to write one paragraph per use case naming the consequence tier, the threshold rule, the cue your team will use, and the person who is the human in the loop. One page covers the typical owner-managed firm. If that exercise feels like a stretch, that is the signal to book a conversation. The cost of getting escalation wrong is higher than the cost of designing it properly, and the design itself is a one-afternoon job, not a multi-month programme.

Sources

- MIT News (2024). New method for AI calibration helps users know when to trust a model's predictions. Coverage of the Thermometer technique for LLM confidence calibration, the source for the "confidence score does not equal correctness" claim. https://news.mit.edu/2024/thermometer-prevents-ai-hallucination-0729
- Harvard Business School Working Knowledge (2024). How decision makers can catch generative AI's bad advice. Source for the in-range and out-of-range alerts halving user error rate in human-AI decisions. https://www.library.hbs.edu/working-knowledge/how-decision-makers-can-catch-generative-ais-bad-advice
- UK Information Commissioner's Office (2026). Guidance on AI and data protection. Source for the requirement to document calibration and decision processes when AI outputs affect individuals. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/
- Financial Conduct Authority (2024). AI Live Testing programme guidance. UK regulator-supervised testing environment that gives a working example of staged escalation thresholds in regulated firms. https://www.fca.org.uk/firms/innovation/ai-live-testing
- UK Government (2025). AI Opportunities Action Plan. Sets the UK policy context for sector-specific AI assurance and oversight expectations on SMEs. https://www.gov.uk/government/publications/ai-opportunities-action-plan
- EU AI Act, Article 14 (2024). Human oversight requirements for high-risk AI systems. Source for the always-check tier on regulated or rights-affecting outputs. https://artificialintelligenceact.eu/article/14/
- National Cyber Security Centre (2025). Guidelines for secure AI system development. UK NCSC guidance on monitoring, audit trails, and human oversight in AI deployments. https://www.ncsc.gov.uk/collection/guidelines-secure-ai-system-development
- NIST AI Risk Management Framework (2023). Generative AI Profile, AI 600-1. Cited for the consequence-tiering approach and the recommendation to document confidence and oversight criteria. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
- Institute of Chartered Accountants in England and Wales (2026). AI Assurance Framework guidance. Source for the practical pattern of tiered verification protocols in professional services SMEs. https://www.icaew.com/technical/audit-and-assurance/ai-assurance-framework

Frequently asked questions

Is the confidence score the AI tool shows me actually trustworthy?

Often not, by default. Research from MIT in 2024 on a method called Thermometer found that out-of-the-box confidence scores from large language models are frequently miscalibrated, which means a 90% confidence reading does not reliably mean a 90% chance of being right. Calibration techniques can fix this, but only if the vendor has applied them. Ask before you trust the number.

How do I set the threshold number itself?

Start with the consequence, not the number. For irreversible or regulated outputs, the threshold is effectively 100%: every output goes to a human. For reversible operational outputs, set the threshold where the cost of a missed check is roughly equal to the cost of an unnecessary one. For routine low-impact outputs, trust by default and sample.
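For the middle case, one reading of "where the two costs are equal" is a simple break-even: escalate when the chance of being wrong, multiplied by the cost of the error, exceeds the cost of the check. The sketch below assumes the confidence score is calibrated, so that one minus the confidence is a fair estimate of the chance of error. The function name and the example figures are hypothetical.

```python
def break_even_confidence(review_cost: float, error_cost: float) -> float:
    """Confidence below which a human check pays for itself.

    Escalate when (1 - confidence) * error_cost > review_cost,
    which rearranges to confidence < 1 - review_cost / error_cost.
    """
    if review_cost <= 0 or error_cost <= 0:
        raise ValueError("costs must be positive")
    # If a check costs more than the error it prevents, the result is 0.0:
    # on cost grounds alone, no confidence level justifies a check.
    return max(0.0, 1.0 - review_cost / error_cost)


# Hypothetical figures: a check costs 5 in staff time, an uncaught error costs 200.
threshold = break_even_confidence(review_cost=5, error_cost=200)
print(f"Escalate outputs below {threshold:.1%} confidence")
```

With those figures the break-even sits at 97.5%, which shows how quickly the threshold climbs once an error is much more expensive than a check. Treat the result as a starting point to adjust, since both costs are usually rough estimates.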

What if our AI tool does not expose a confidence score at all?

You can still build escalation. Use behavioural signals instead: response latency, how often staff edit the output before sending, and override rates. If a team member is consistently rewriting the AI's first draft on a particular task, that is your signal. Add a consequence tier and you have a working policy without needing the model's own confidence score at all.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
