The owner's two-question evaluation method for AI output

A woman pausing at her desk, reading back an email on screen before clicking send
TL;DR

Feel-based evaluation of AI output fails predictably because fluent prose and accurate prose are not the same thing. The simplest discipline that works at SME scale is two questions, asked before any AI output is used: would I sign my name to this as it stands, and what would have to change for the answer to be yes. Embed them at the moment of sending, pasting, or paying.

Key takeaways

- Fluent AI output and accurate AI output are uncorrelated, which is why intuition-based review fails worst on the work that matters most.
- Question one, would I sign my name to this as it stands, forces a binary commitment that exposes the difference between intellectual acceptance and practical confidence.
- Question two, what would have to change for the answer to be yes, surfaces three categories of gap that otherwise stay invisible: gaps in evidence, gaps in tone, and gaps in context.
- The two questions fit existing workflows when embedded at three natural decision points: the moment of sending, the moment of pasting, and the moment of paying.
- Three categories still warrant heavier review: regulated output, work with material financial or legal exposure, and external public-facing claims where reputation risk is substantial.

Picture an operations manager at 4:55pm on a Thursday, finishing a client email her AI tool drafted in twelve seconds. She skims it, the prose reads cleanly, the numbers look about right, and she clicks send. By Monday she is on a call apologising for two figures that were close but wrong, and a tone that landed flatter than she would have written herself. The output was not bad. It was confidently presented and unverified, which at SME scale is a more expensive failure mode than owners often realise.

This post is for owners whose teams are producing more work with AI than ever, whose error count is also creeping up, and who need a discipline that does not slow the team down but does not let confident-wrong output through either. That discipline is two questions. Both take a few seconds. Both work across content, numbers, and recommendations. Neither requires you to redo the work yourself.

Why does feel-based evaluation of AI output fail so reliably?

Fluency and accuracy are uncorrelated in AI output, which means the cue the human reviewer is using, how the text reads, is the wrong signal. A model trained to be persuasive will produce confident-sounding work whether the underlying claim is true or fabricated. The ICO and the NIST AI Risk Management Framework both flag this as a structural risk for firms relying on reviewer intuition.

The failures cluster in three predictable places. The first is anywhere the output contains factual claims a reviewer cannot verify without redoing the work: citations, statistics, regulatory interpretations, candidate summaries, tax positions. The second is anywhere time pressure compresses review to a speed-of-reading scan rather than a depth check, which is the default state of any small team at 4:55pm. The third is anywhere AI output becomes the input to a further decision: a financial model, a project plan, a compliance position. Errors compound at each stage and become harder to trace back to the original drafting moment. None of this is a competence failure on the part of the team. It is what happens when any subjective review system meets output designed to be persuasive.

What does the first question actually do?

The first question is, would I sign my name to this as it stands. It shifts the frame from “does this look okay” to “would I stake my reputation on this”, and that shift is what makes it work. A binary yes-or-no gate forces conscious commitment at the moment the team member is about to act, which is precisely when momentum makes drift most likely.

The question generalises because it does not require domain expertise. A finance professional reviewing AI-drafted tax notes does not have to verify every citation to know whether they would sign the summary. A delivery manager reviewing AI-generated project assumptions does not need to recalculate every estimate to know whether they would defend those assumptions as their own recommendation. The reviewer’s confidence becomes the explicit subject of the assessment rather than an unconscious by-product of reading fluent prose. Owners who introduce the question report a common discovery in the first fortnight: the team had been sending or pasting work they would not actually have signed if asked directly.

What does the second question add?

The second question is, what would have to change for the answer to be yes. Where the first question is a gate, the second is a diagnostic. It surfaces three categories of gap: evidence the reviewer cannot stand behind, tone or precision that does not fit the audience, and context the output lacks for the situation it is being used in.

In practice the second question turns a vague uncomfortable feeling into something actionable. “I would not quite sign this” becomes “I would sign this if the citations were verified”, or “I would sign this if the tone matched our usual voice”, or “I would sign this if I flagged which version of the guidance it covers”. Each of those edits is small and fast, often under two minutes. Without the second question the reviewer either accepts the output anyway or rejects it without knowing why, and the team learns nothing about how to brief the tool better next time.

Where do the two questions actually fit in the working day?

The two questions only work if they sit inside existing workflows rather than as a new review step. Owners who have tried formal review queues report the same pattern: the queue gets skipped under pressure or performed perfunctorily, which defeats the point. The cleaner approach is to embed the questions at three natural decision points where the team is already pausing: the moment of sending, the moment of pasting, and the moment of paying.

The moment of sending is the obvious point for outbound communication. Before any AI-assisted client email, proposal, or message goes out, the sender asks both questions. It adds 20 to 45 seconds and catches the cases where the message reads well enough to send but would not survive being signed for. The moment of pasting applies whenever AI-generated text is being moved into a client deliverable, a financial model, or an internal document. The pause is already there, the questions just make it conscious. The moment of paying applies at invoicing or sign-off, when a deliverable that included AI output is about to be billed. The first question becomes, would I sign my name to this as billable work that meets the quality we promised. Owners who have run this for a quarter typically report fewer client comebacks, less rework, and a quieter inbox on Monday mornings.

Implementation does not require a new tool. A line in a team standard, a five-minute conversation at a Monday huddle, and a fortnight of the owner asking the questions out loud when they sign off work are enough to make the practice stick. After two to three weeks the questions stop feeling like an addition and become how the team works.

When are two questions not enough?

The discipline is right for the majority of AI output, but it is not a substitute for structured review on three categories: regulated output, work with material financial or legal exposure, and external public-facing claims. The two questions are an in-flow gate, not an audit trail, and on the heavier categories the audit trail is what regulators and reputation actually require.

Regulated output is the first category. Anything forming part of compliance documentation, financial statements subject to professional standards, FCA-regulated client advice, legal opinions, or accounting work subject to ICAEW or ACCA standards needs documented review. SRA guidance is explicit that AI-generated legal output should be reviewed by a qualified lawyer with documented sign-off, and ICAEW and ACCA say the same for AI-supported financial work.

The second category is output with material financial or legal implications: scope statements above a threshold the owner has set, contract interpretations that change obligations, risk assessments that drive commitments. The heavier check is manager sign-off with the threshold named in advance, so the team is not making a judgement call about escalation in the moment. The third category is external public-facing claims: marketing copy, published research, social posts representing the firm. Run these past someone with authority to speak for the firm before they go out.

Name the three categories explicitly to the team, specify what “heavier” looks like in each case, and let everything else run through the two questions. That separation is what lets the discipline scale at SME size without becoming bureaucracy.

Confidence is not a quality signal in AI output, which means the work of evaluation has to be done deliberately or it will not be done at all. Two questions, embedded at the right moments, do most of the work most of the time. The rest needs structured review, and naming which is which in advance is the owner’s job.

If you would like to think through how to make this stick across your team, book a conversation.

Sources

- Information Commissioner's Office (2024). AI and Data Protection guidance, on systematic checks at the point of use for AI-generated output. https://ico.org.uk/about-the-ico/ai-and-data-protection/
- National Institute of Standards and Technology (2024). AI Risk Management Framework, on the structural unreliability of subjective assessment for fluent AI output. https://www.nist.gov/itl/ai-risk-management-framework
- National Institute of Standards and Technology (2024). Generative AI Profile for Risk Management, on documented review for higher-stakes use cases. https://www.nist.gov/itl/ai-risk-management-framework/generative-ai-profile
- Solicitors Regulation Authority (2024). Use of artificial intelligence by law firms, recommending qualified review of AI-generated legal output with clear documentation. https://www.sra.org.uk/sra/guidance/artificial-intelligence/
- ICAEW (2024). AI governance for accounting firms, recommending documented review protocols where AI feeds regulated output. https://www.icaew.com/insights/viewpoints-on-the-news/2024-jan-to-dec/jan-2024/ai-governance-accounting
- ACCA (2024). Machine learning in financial services, on review obligations for AI-derived advice and analysis. https://www.accaglobal.com/uk/en/professional-insights/technology/ai-financial-services.html
- European Commission (2024). EU AI Act Article 14, on the human oversight requirement for higher-risk AI systems. https://artificialintelligenceact.eu/article/14/
- MIT Sloan Management Review (2024). The reality of managing AI-generated content, on error-discovery rates in firms relying on intuition versus structured review. https://sloanreview.mit.edu/article/the-reality-of-managing-ai-generated-content/
- Stanford Institute for Human-Centered AI (2024). AI Index Report 2024, on hallucination rates and the gap between fluency and accuracy in generative models. https://hai.stanford.edu/ai-index/
- Harvard Business Review (2024). How to evaluate AI output in professional services, on point-of-use checks as the most reliable gate for SME workflows. https://hbr.org/2024/01/evaluating-ai

Frequently asked questions

My team is already overstretched. Won't two extra questions slow them down?

In practice the time cost is 20 to 45 seconds per decision because the questions land at moments where the team is already pausing: before sending, before pasting, before paying. After two or three weeks the questions become automatic. The throughput trade is usually positive once you count the rework, the client follow-ups, and the corrections that no longer need to happen.

How is this different from a formal review process?

A formal review introduces a second person and a queue, which gets skipped under pressure. The two-question method stays inside the original person's workflow and only adds reflection at the point of use. It catches the everyday output. Formal review still applies to the three heavier categories: regulated work, material financial or legal commitments, and external claims that carry reputational weight.

What if my team says yes to the first question without really meaning it?

That is the most common failure mode and it is fixable. Either spot-check a small sample of outputs each week and ask the person who signed off to walk you through their reasoning, or move to a written gate where the answer is logged briefly next to the output. Both surface drift quickly. Most teams who go through a fortnight of spot-checks stop rubber-stamping because the cost of getting caught is higher than the cost of pausing for ten seconds.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
