The owner I am thinking of found out at half past four on a Tuesday afternoon. Her marketing lead had sent a press release to a small list of trade journalists an hour earlier, with what she now realised was a quote attributed to her, the founder, that she had never said and would not say. The quote was reasonable. It was on-message. It was in roughly her register. The marketing lead had asked an AI tool to draft a release, the tool had produced a usable draft with a founder quote built in, and the marketing lead had single-checked the document and sent it. Nobody else had read it before it went out.
The release was correctable. Two journalists had not yet opened the email. The third had, and the founder spent the next forty minutes on the phone explaining. The damage was small but real. What stayed with her afterwards was not the quote itself. It was the gap between her assumption, that obvious external-facing content with her name on it would be read by more than one person before it went out, and the marketing lead’s assumption, that the review the lead had done was the review the document needed. Neither of them had agreed in advance which situations needed a second pair of eyes and which did not. The default was single check, and the default had broken predictably.
What is the two-person review threshold for AI output?
It is the named line at which single-check evaluation of AI output stops being adequate and a second reviewer is required. Single review works for routine output where the reviewer has domain knowledge and errors are reversible. Two-person review applies to a short, pre-agreed list of situations where the cost of an undetected error is asymmetric.
The four situations on that list, in practice, are external attribution (quotes, named statements, public claims), financial commitment (cost estimates, pricing decisions, contract terms above a threshold), regulatory representation (compliance disclosures, audit responses, statements to oversight bodies), and decisions about named individuals (hiring, performance, discipline, access to a service). Everything else stays on single review.
Why does single-check evaluation predictably fail in those four situations?
Because the failure mode of large language models is fluent plausibility, and these four situations are where a fluent error is least likely to be caught by the reviewer closest to the work. A study published by the BBC and the European Broadcasting Union in 2025 found that around forty-five per cent of AI assistants' answers to news questions contained at least one significant issue, delivered in the same authoritative tone as the accurate ones.
The first reviewer is usually the person closest to the work: a marketing lead, a finance analyst, a hiring manager. They are reading for the things their role typically checks: message fit, internal logic, surface accuracy. They are not usually reading for the things the model is likely to get wrong in that domain: attribution of a quote to a person, omission of a cost category specific to the firm, mischaracterisation of a regulatory exemption, algorithmic bias against a protected characteristic. The blind spot is structural, not a question of competence. A University of Washington study of state-of-the-art language models ranking otherwise-equivalent CVs found the models favoured white-associated names eighty-five per cent of the time and never preferred names associated with Black men over names associated with white men. A single hiring manager reading the ranking will rarely catch that pattern alone.
Where will you actually meet the four trigger situations?
In the regular rhythm of work for any owner-operated services firm. External attribution comes up every time AI helps draft a release, a website page, a partner statement, or a post under the founder’s name. Financial commitment comes up whenever AI estimates, prices, or models a number the firm will commit to. Regulatory representation comes up in compliance responses. Decisions about people come up in hiring and performance reviews.
The exposure is highest where the team is busiest and AI is saving the largest share of hours, which is the same reason the second-check discipline matters most there. The Information Commissioner’s Office is explicit in its guidance on AI and data protection that the organisation deploying the tool remains accountable for the output, and Article 22 of UK GDPR (mirroring Article 22 of the EU GDPR, with the EU AI Act’s Article 14 imposing a parallel human-oversight requirement on high-risk systems) preserves the right to a human review of significant automated decisions. The regulatory frame is already there. The internal discipline either matches it or leaves a gap.
When should the alternate reviewer step in and what are they specifically checking?
When the work falls into a trigger situation, the named alternate reads the document with a specific check in mind, not a general quality pass. For external attribution, the alternate checks authenticity of voice and position: has this person actually said this, would they say it, is there a documented basis for it? The right alternate is the person being quoted, or someone authorised to represent that relationship.
For financial commitment, the alternate checks completeness against the firm’s actual cost base: does this include the categories of expense we typically incur, do the assumptions match our documented history? The alternate is someone with budget accountability in that category.

For regulatory representation, the alternate is a compliance specialist or external advisor who reads for whether the statement accurately characterises the regulation and whether it creates latent exposure if a regulator audits it. ICAEW and ACCA both treat AI-supported output going to a regulator or into an audit as requiring qualified human review, not as something the model’s confidence can stand in for.

For decisions about people, the alternate is someone outside the direct reporting line, trained in employment law or bias-audit principles, checking whether the recommendation can be defended as job-relevant and free of pattern bias against protected characteristics. Naming the alternate per category, in advance, stops the role defaulting to whoever happens to be available, which is what produces rubber-stamp approval.
How do you operationalise this without creating bureaucracy?
Three elements, all of them light. A one-page written trigger list of the four situations, in language the team recognises without needing to interpret it. A named primary and backup reviewer per category, visible in a shared document. A brief written exception protocol for when both are unavailable: the person documents why it cannot wait, names a temporary substitute, and commits to a retrospective review the next business day.
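For a firm that tracks approvals in software, the same one-page protocol fits in a few lines of data. The sketch below is illustrative only: the category keys and reviewer names are hypothetical placeholders rather than a recommendation of any tool, and the next-business-day retrospective is simplified to the next calendar day.

```python
# A minimal sketch of the trigger list as routing data. All names are
# hypothetical placeholders; only the standard library is used.
from datetime import date, timedelta

TRIGGER_LIST = {
    "external_attribution":      {"primary": "founder",            "backup": "comms_director"},
    "financial_commitment":      {"primary": "finance_lead",       "backup": "managing_partner"},
    "regulatory_representation": {"primary": "compliance_advisor", "backup": "external_counsel"},
    "decisions_about_people":    {"primary": "hr_lead",            "backup": "non_line_manager"},
}

def route_for_review(category: str, unavailable: set[str]) -> dict:
    """Decide the review path for a document in the given category."""
    if category not in TRIGGER_LIST:
        # Everything outside the four trigger situations stays on single review.
        return {"review": "single"}
    reviewers = TRIGGER_LIST[category]
    for role in ("primary", "backup"):
        if reviewers[role] not in unavailable:
            return {"review": "two_person", "second_reviewer": reviewers[role]}
    # Exception protocol: both named reviewers are out, so the sender must
    # document why it cannot wait, name a temporary substitute, and commit
    # to a retrospective review (simplified here to the next calendar day).
    return {
        "review": "exception",
        "requires": ["written_reason", "temporary_substitute"],
        "retrospective_due": str(date.today() + timedelta(days=1)),
    }
```

Calling route_for_review("external_attribution", unavailable={"founder"}) routes the draft to the backup reviewer; the point of the sketch is only that the category, not whoever happens to be free, decides whether a second reviewer is required.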
The discipline lasts longer when the team understands the failure mode it answers, rather than receiving it as headquarters policy. A short conversation works better than a memo: AI tools produce fluent output that occasionally invents quotes, omits cost categories, mischaracterises regulations, or ranks candidates against patterns nobody authorised, and here are the four named situations where a second pair of eyes catches what the first pair will predictably miss. Briefed that way, the discipline is a tool the team uses to protect itself and the firm, not a constraint imposed from above. Research on shared mental models in teams consistently finds that adherence improves and errors decrease when the team understands the why behind a process, not only the what.
If you would like to map your firm’s four trigger situations and your named reviewers in a single sitting, book a conversation.