The two-person review threshold: when single-check AI evaluation is not enough

TL;DR

Single-check evaluation of AI output works most of the time and fails predictably in four situations: external attribution, financial commitment, regulatory representation, and decisions about named people. Two-person review is the right threshold for those, with one named alternate per category, a short pre-set trigger list, and a brief written exception protocol for when the alternate is unavailable. The list is short, the discipline is simple, and the value is asymmetric.

Key takeaways

- Single-check works for routine AI output and breaks predictably when output is externally attributed, commits the firm financially, represents a regulatory position, or decides something about a named individual.
- The four trigger situations should be a written list, not a judgement call, so a tired team member can recognise one without permission or interpretation.
- The second reviewer is checking a specific risk the first reviewer is not: authenticity of voice, completeness of cost base, defensibility of a regulatory claim, or freedom from algorithmic bias against a protected characteristic.
- A University of Washington study of state-of-the-art language models found they favoured white-associated names eighty-five per cent of the time when ranking otherwise-equivalent CVs and never preferred Black-male-associated names over white-male-associated names, which is the kind of pattern a single hiring manager will rarely catch alone.
- The discipline lasts longer when the team understands the failure mode it answers, the model's tendency to produce plausible-sounding falsehoods that pass single review without leaving a seam, rather than receiving the rule as headquarters policy.

The owner I am thinking of found out at half past four on a Tuesday afternoon. Her marketing lead had sent a press release to a small list of trade journalists an hour earlier, with what she now realised was a quote attributed to her, the founder, that she had never said and would not say. The quote was reasonable. It was on-message. It was in roughly her register. The marketing lead had asked an AI tool to draft a release, the tool had produced a usable draft with a founder quote built in, and the marketing lead had single-checked the document and sent it. Nobody else had read it before it went out.

The release was correctable. Two journalists had not yet opened the email. The third had, and the founder spent the next forty minutes on the phone explaining. The damage was small but real. What stayed with her afterwards was not the quote itself. It was the gap between her assumption, that obvious external-facing content with her name on it would be read by more than one person before it went out, and the marketing lead’s assumption, that the review the lead had done was the review the document needed. Neither of them had agreed in advance which situations needed a second pair of eyes and which did not. The default was single check, and the default had broken predictably.

What is the two-person review threshold for AI output?

It is the named line at which single-check evaluation of AI output stops being adequate and a second reviewer is required. Single review works for routine output where the reviewer has domain knowledge and errors are reversible. Two-person review applies to a short, pre-agreed list of situations where the cost of an undetected error is asymmetric.

The four situations on that list, in practice, are external attribution (quotes, named statements, public claims), financial commitment (cost estimates, pricing decisions, contract terms above a threshold), regulatory representation (compliance disclosures, audit responses, statements to oversight bodies), and decisions about named individuals (hiring, performance, discipline, access to a service). Everything else stays on single review.
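If it helps to make the list operational rather than remembered, it is small enough to encode directly. The sketch below is in Python, and everything in it is hypothetical illustration, the category names, the example values, the `needs_two_person_review` helper; it shows one way a firm might write the list down as data, not a prescribed implementation.

```python
from enum import Enum

class Trigger(Enum):
    """The four pre-agreed situations where single review stops being adequate."""
    EXTERNAL_ATTRIBUTION = "quotes, named statements, public claims"
    FINANCIAL_COMMITMENT = "cost estimates, pricing, contract terms above a threshold"
    REGULATORY_REPRESENTATION = "compliance disclosures, audit responses, statements to oversight bodies"
    DECISION_ABOUT_NAMED_INDIVIDUAL = "hiring, performance, discipline, access to a service"

def needs_two_person_review(triggers: set[Trigger]) -> bool:
    """Single review is the default; any trigger on the list escalates to two."""
    return bool(triggers)

# A press release with a founder quote hits exactly one trigger.
assert needs_two_person_review({Trigger.EXTERNAL_ATTRIBUTION})
# A routine internal summary hits none and stays on single review.
assert not needs_two_person_review(set())
```

The design point is that escalation becomes a lookup against a fixed list, not a judgement made under deadline.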

Why does single-check evaluation predictably fail in those four situations?

Because the failure mode of large language models is fluent plausibility, and these four situations are where a fluent error is least likely to be caught by the reviewer closest to the work. A BBC and European Broadcasting Union study published in 2025 found that around forty-five per cent of AI assistants' answers to news questions contained at least one significant issue, with the errors expressed in the same authoritative tone as the correct answers.

The first reviewer is usually the person closest to the work: a marketing lead, a finance analyst, a hiring manager. They are reading for the things their role typically checks: message fit, internal logic, surface accuracy. They are not usually reading for the things the model is likely to get wrong in that domain: attribution of a quote to a person, omission of a cost category specific to the firm, mischaracterisation of a regulatory exemption, algorithmic bias against a protected characteristic. The blind spot is structural, not a question of competence. A University of Washington study of state-of-the-art language models ranking otherwise-equivalent CVs found the models favoured white-associated names eighty-five per cent of the time and never preferred Black-male-associated names over white-male-associated names. A single hiring manager reading the ranking will rarely catch that pattern alone.

Where will you actually meet the four trigger situations?

In the regular rhythm of work for any owner-operated services firm. External attribution comes up every time AI helps draft a release, a website page, a partner statement, or a post under the founder’s name. Financial commitment comes up whenever AI estimates, prices, or models a number the firm will commit to. Regulatory representation comes up in compliance responses. Decisions about people come up in hiring and performance reviews.

The exposure is highest where the team is busiest and AI is saving the largest share of hours, which is the same reason the second-check discipline matters there. The Information Commissioner's Office is explicit in its UK GDPR guidance on AI that the deployer remains accountable for the output, and Article 22 of UK GDPR preserves the right to meaningful human review of significant automated decisions, a safeguard the EU AI Act reinforces for high-risk systems through the human-oversight requirements of its Article 14. The regulatory frame is already there. The internal discipline either matches it or leaves a gap.

When should the alternate reviewer step in and what are they specifically checking?

When the work falls into a trigger situation, the named alternate reads the document with a specific check in mind, not a general quality pass. For external attribution, the alternate checks authenticity of voice and position: has this person actually said this, would they say it, is there a documented basis for it? The right alternate is the person being quoted, or someone authorised to represent that relationship.

For financial commitment, the alternate checks completeness against the firm's actual cost base: does this include the categories of expense we typically incur, do the assumptions match our documented history? The alternate is someone with budget accountability in that category. For regulatory representation, the alternate is a compliance specialist or external advisor who reads for whether the statement accurately characterises the regulation and whether it creates latent exposure if a regulator audits it. ICAEW and ACCA both treat AI-supported output going to a regulator or an audit as requiring qualified human review, not as something the model's confidence stands in for. For decisions about people, the alternate is someone outside the direct reporting line, trained in employment law or bias-audit principles, checking whether the recommendation can be defended as job-relevant and free of pattern bias against protected characteristics. Naming the alternate per category, in advance, stops the role defaulting to whoever happens to be available, which is what produces rubber-stamp approval.
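Carrying on from the hypothetical sketch above, the pairing of category, alternate, and check can live in the same place as the trigger list. The role descriptions and check questions below simply restate this section as data; a real firm would substitute its own wording.

```python
# Extends the hypothetical Trigger enum from the earlier sketch.
# Each entry pairs the kind of alternate with the single check they own.
SECOND_CHECK: dict[Trigger, tuple[str, str]] = {
    Trigger.EXTERNAL_ATTRIBUTION: (
        "the person quoted, or someone authorised to represent them",
        "Has this person said this, would they say it, is there a documented basis?",
    ),
    Trigger.FINANCIAL_COMMITMENT: (
        "someone with budget accountability in that category",
        "Are the firm's usual expense categories included, and do assumptions match history?",
    ),
    Trigger.REGULATORY_REPRESENTATION: (
        "a compliance specialist or external advisor",
        "Does this accurately characterise the regulation, and would it survive an audit?",
    ),
    Trigger.DECISION_ABOUT_NAMED_INDIVIDUAL: (
        "someone outside the reporting line, trained in employment law or bias audit",
        "Is the recommendation job-relevant and free of pattern bias?",
    ),
}
```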

How do you operationalise this without creating bureaucracy?

Three elements, all of them light. A one-page written trigger list of the four situations, in language the team recognises without interpreting. A named primary and backup reviewer per category, visible in a shared document. A brief written exception protocol for when both are unavailable: the person documents why it cannot wait, names a temporary substitute, and commits to a retrospective review the next business day.
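As a sketch of how light the shared document can be, again building on the hypothetical Trigger enum from earlier, the named reviewers and the exception log might amount to no more than this; every name and field below is invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ReviewerSlot:
    primary: str  # a named person, not a role that defaults to whoever is free
    backup: str

# Hypothetical assignments; the value is that they are agreed in advance and visible.
REVIEWERS = {
    Trigger.EXTERNAL_ATTRIBUTION: ReviewerSlot("Founder", "Comms lead"),
    Trigger.FINANCIAL_COMMITMENT: ReviewerSlot("Finance lead", "Owner"),
    Trigger.REGULATORY_REPRESENTATION: ReviewerSlot("Compliance advisor", "External accountant"),
    Trigger.DECISION_ABOUT_NAMED_INDIVIDUAL: ReviewerSlot("HR advisor", "Non-line manager"),
}

@dataclass
class ExceptionRecord:
    """One entry in the written exception log, used when both reviewers are unavailable."""
    trigger: Trigger
    why_it_cannot_wait: str
    temporary_substitute: str
    raised_on: date

    @property
    def retrospective_due(self) -> date:
        """Retrospective review is committed for the next business day."""
        next_day = self.raised_on + timedelta(days=1)
        while next_day.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
            next_day += timedelta(days=1)
        return next_day
```

An exception raised on a Friday therefore falls due on Monday, and a log that fills up quickly is the early signal, as the FAQ below notes, that the trigger list is too broad or urgency is being used to dodge the review.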

The discipline lasts longer when the team understands the failure mode it answers, rather than receiving the rule as headquarters policy. A short conversation works better than a memo: AI tools produce fluent output that occasionally invents quotes, omits cost categories, mischaracterises regulations, or ranks candidates against patterns nobody authorised, and here are the four named situations where a second pair of eyes catches what the first pair will predictably miss. Briefed that way, the discipline is a tool the team is using to protect itself and the firm, not a constraint imposed from above. Research on shared mental models in teams consistently finds that adherence improves and errors decrease when the team understands the why behind a process, not only the what.

If you would like to map your firm’s four trigger situations and your named reviewers in a single sitting, book a conversation.

Sources

- European Parliament (2024). EU AI Act, Article 14 on human oversight of high-risk AI systems, a regulatory foundation for layered human review in high-risk settings. https://artificialintelligenceact.eu/article/14/
- Information Commissioner's Office (2024). Guidance on AI and data protection, on the UK deployer's accountability for AI output and the UK GDPR Article 22 right to meaningful human review of significant automated decisions. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- National Institute of Standards and Technology (2023). AI Risk Management Framework, NIST AI 100-1, on the principle of layered human oversight calibrated to the impact of the decision. https://www.nist.gov/itl/ai-risk-management-framework
- University of Washington (2024). AI tools show biases in ranking job applicants' names, the study of state-of-the-art language models finding eighty-five per cent preference for white-associated names and zero per cent preference for Black-male-associated names over white-male-associated names. https://www.washington.edu/news/2024/10/31/ai-bias-resume-screening-race-gender/
- BBC and European Broadcasting Union (2025). News integrity in AI assistants, the study finding approximately forty-five per cent of AI assistants' answers to news questions contained at least one significant issue. https://www.bbc.co.uk/aboutthebbc/documents/bbc-research-into-ai-assistants.pdf
- Harvard Kennedy School Misinformation Review (2024). New sources of inaccuracy, a conceptual framework for studying AI hallucinations, on the distinction between AI hallucination and traditional misinformation and the reviewer's blind spot under fluent output. https://misinforeview.hks.harvard.edu/article/new-sources-of-inaccuracy-a-conceptual-framework-for-studying-ai-hallucinations/
- ICAEW (2024). The role of AI in audit and assurance, on the institute's position that AI-generated output supporting regulatory or audit claims requires human review by a qualified professional. https://www.icaew.com/technical/audit-and-assurance/the-future-of-audit/audit-and-technology/artificial-intelligence-and-audit
- ACCA (2024). Machine learning, more science than fiction, the institute's view on professional accountability and dual review of AI-supported financial output. https://www.accaglobal.com/gb/en/professional-insights/technology/machine-learning.html
- National Association of Corporate Directors (2024). Director essentials, governing the use of AI, on board-level expectations for layered review of AI output that materially affects the organisation. https://www.nacdonline.org/insights/publications/director-essentials/
- CIO.com (2024). Five famous analytics and AI disasters, including the Sports Illustrated AI-bylined articles incident, the MyCity legal-misinformation chatbot, and other cases where single-check approval allowed AI output into circulation that a second reviewer would have caught. https://www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html

Frequently asked questions

Won't two-person review just slow the team down on everything?

Only if the trigger list is too broad. The discipline applies to a short, named list of situations: external attribution, financial commitment, regulatory representation, decisions about people. Routine output keeps single review. A second reviewer who knows what they are checking for can complete a review in five to ten minutes, because they are not re-reading the whole document; they are checking one specific dimension. If the trigger list is producing more than a handful of second reviews a week in a small firm, the list is wrong, not the discipline.

What if the named alternate is on leave when the work needs to go out?

Use a brief written exception protocol. The person seeking the exception documents in writing why the second review cannot wait, names a temporary substitute with relevant background, and commits to a retrospective review the next business day. The exception log itself is useful data. If exceptions become common, either the trigger list is too broad or the team is using urgency to avoid the review, and the pattern is worth surfacing rather than letting the discipline erode quietly.

Is this the same as the four-eyes principle in regulated industries?

Adjacent, not the same. The four-eyes principle in pharmaceuticals or banking mandates dual verification by regulation, for every transaction or batch above a threshold. The two-person threshold for AI output is narrower and more operational, four named situations where the model's specific failure modes are concentrated, not a blanket rule across the firm. The principle is the same, two pairs of eyes catch what one pair misses, but the application is targeted to where the asymmetry actually lives.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
