The owner's two-question evaluation method for AI output

A woman pausing at her desk, reading back an email on screen before clicking send
TL;DR

Feel-based evaluation of AI output fails predictably because fluent prose and accurate prose are not the same thing. The simplest discipline that works at SME scale is two questions, asked before any AI output is used: would I sign my name to this as it stands, and what would have to change for the answer to be yes. Embed them at the moment of sending, pasting, or paying.

Key takeaways

- Fluent AI output and accurate AI output are uncorrelated, which is why intuition-based review fails worst on the work that matters most.
- Question one, would I sign my name to this as it stands, forces a binary commitment that exposes the difference between intellectual acceptance and practical confidence.
- Question two, what would have to change for the answer to be yes, surfaces three categories of gap that otherwise stay invisible: gaps in evidence, gaps in tone, and gaps in context.
- The two questions fit existing workflows when embedded at three natural decision points: the moment of sending, the moment of pasting, and the moment of paying.
- Three categories still warrant heavier review: regulated output, work with material financial or legal exposure, and external public-facing claims where reputation risk is substantial.

Picture an operations manager at 4:55pm on a Thursday, finishing a client email her AI tool drafted in twelve seconds. She skims it, the prose reads cleanly, the numbers look about right, and she clicks send. By Monday she is on a call apologising for two figures that were close but wrong, and a tone that landed flatter than she would have written herself. The output was not bad. It was confidently presented and unverified, which at SME scale is a more expensive failure mode than owners often realise.

This post is for owners whose teams are producing more work with AI than ever, whose error count is also creeping up, and who need a discipline that does not slow the team down but does not let confident-wrong output through either. That discipline is two questions. Both take a few seconds. Both work across content, numbers, and recommendations. Neither requires you to redo the work yourself.

Why does feel-based evaluation of AI output fail so reliably?

Fluency and accuracy are uncorrelated in AI output, which means the cue the human reviewer is using, how the text reads, is the wrong signal. A model trained to be persuasive will produce confident-sounding work whether the underlying claim is true or fabricated. The ICO and the NIST AI Risk Management Framework both flag this as a structural risk for firms relying on reviewer intuition.

The failures cluster in three predictable places. The first is anywhere the output contains factual claims a reviewer cannot verify without redoing the work: citations, statistics, regulatory interpretations, candidate summaries, tax positions. The second is anywhere time pressure compresses review to a speed-of-reading scan rather than a depth check, which is the default state of any small team at 4:55pm. The third is anywhere AI output becomes the input to a further decision: a financial model, a project plan, a compliance position. Errors compound at each stage and become harder to trace back to the original drafting moment. None of this is a competence failure on the part of the team. It is what happens when any subjective review system meets output designed to be persuasive.

What does the first question actually do?

The first question is, would I sign my name to this as it stands. It shifts the frame from “does this look okay” to “would I stake my reputation on this”, and that shift is what makes it work. A binary yes-or-no gate forces conscious commitment at the moment the team member is about to act, which is precisely when momentum makes drift most likely.

The question generalises because it does not require domain expertise. A finance professional reviewing AI-drafted tax notes does not have to verify every citation to know whether they would sign the summary. A delivery manager reviewing AI-generated project assumptions does not need to recalculate every estimate to know whether they would defend those assumptions as their own recommendation. The reviewer’s confidence becomes the explicit subject of the assessment rather than an unconscious by-product of reading fluent prose. Owners who introduce the question report a common discovery in the first fortnight: the team had been sending or pasting work they would not actually have signed if asked directly.

What does the second question add?

The second question is, what would have to change for the answer to be yes. Where the first question is a gate, the second is a diagnostic. It surfaces three categories of gap: evidence the reviewer cannot stand behind, tone or precision that does not fit the audience, and context the output lacks for the situation it is being used in.

In practice the second question turns a vague uncomfortable feeling into something actionable. “I would not quite sign this” becomes “I would sign this if the citations were verified”, or “I would sign this if the tone matched our usual voice”, or “I would sign this if I flagged which version of the guidance it covers”. Each of those edits is small and fast, often under two minutes. Without the second question the reviewer either accepts the output anyway or rejects it without knowing why, and the team learns nothing about how to brief the tool better next time.

Where do the two questions actually fit in the working day?

The two questions only work if they sit inside existing workflows rather than as a new review step. Owners who have tried formal review queues report the same pattern: the queue gets skipped under pressure or performed perfunctorily, which defeats the point. The cleaner approach is to embed the questions at three natural decision points where the team is already pausing: the moment of sending, the moment of pasting, and the moment of paying.

The moment of sending is the obvious point for outbound communication. Before any AI-assisted client email, proposal, or message goes out, the sender asks both questions. It adds 20 to 45 seconds and catches the cases where the message reads well enough to send but would not survive being signed for. The moment of pasting applies whenever AI-generated text is being moved into a client deliverable, a financial model, or an internal document. The pause is already there, the questions just make it conscious. The moment of paying applies at invoicing or sign-off, when a deliverable that included AI output is about to be billed. The first question becomes, would I sign my name to this as billable work that meets the quality we promised. Owners who have run this for a quarter typically report fewer client comebacks, less rework, and a quieter inbox on Monday mornings.

Implementation does not require a new tool. A line in a team standard, a five-minute conversation at a Monday huddle, and a fortnight of the owner asking the questions out loud when they sign off work are enough to make the practice stick. After two to three weeks the questions stop feeling like an addition and become how the team works.

When are two questions not enough?

The discipline is right for the majority of AI output, but it is not a substitute for structured review on three categories: regulated output, work with material financial or legal exposure, and external public-facing claims. The two questions are an in-flow gate, not an audit trail, and on the heavier categories the audit trail is what regulators and reputation actually require.

Regulated output is the first category. Anything forming part of compliance documentation, financial statements subject to professional standards, FCA-regulated client advice, legal opinions, or accounting work subject to ICAEW or ACCA standards needs documented review. SRA guidance is explicit that AI-generated legal output should be reviewed by a qualified lawyer with documented sign-off, and ICAEW and ACCA say the same for AI-supported financial work.

The second category is output with material financial or legal implications: scope statements above a threshold the owner has set, contract interpretations that change obligations, risk assessments that drive commitments. The heavier check is manager sign-off with the threshold named in advance, so the team is not making a judgement call about escalation in the moment. The third category is external public-facing claims: marketing copy, published research, social posts representing the firm. Run these past someone with authority to speak for the firm before they go out.

Name the three categories explicitly to the team, specify what “heavier” looks like in each case, and let everything else run through the two questions. That separation is what lets the discipline scale at SME size without becoming bureaucracy.

Confidence is not a quality signal in AI output, which means the work of evaluation has to be done deliberately or it will not be done at all. Two questions, embedded at the right moments, do most of the work most of the time. The rest needs structured review, and naming which is which in advance is the owner’s job.

If you would like to think through how to make this stick across your team, book a conversation.

Sources

- Information Commissioner's Office (2024). AI and Data Protection guidance, on systematic checks at the point of use for AI-generated output. https://ico.org.uk/about-the-ico/ai-and-data-protection/
- National Institute of Standards and Technology (2024). AI Risk Management Framework, on the structural unreliability of subjective assessment for fluent AI output. https://www.nist.gov/itl/ai-risk-management-framework
- National Institute of Standards and Technology (2024). Generative AI Profile for Risk Management, on documented review for higher-stakes use cases. https://www.nist.gov/itl/ai-risk-management-framework/generative-ai-profile
- Solicitors Regulation Authority (2024). Use of artificial intelligence by law firms, recommending qualified review of AI-generated legal output with clear documentation. https://www.sra.org.uk/sra/guidance/artificial-intelligence/
- ICAEW (2024). AI governance for accounting firms, recommending documented review protocols where AI feeds regulated output. https://www.icaew.com/insights/viewpoints-on-the-news/2024-jan-to-dec/jan-2024/ai-governance-accounting
- ACCA (2024). Machine learning in financial services, on review obligations for AI-derived advice and analysis. https://www.accaglobal.com/uk/en/professional-insights/technology/ai-financial-services.html
- European Commission (2024). EU AI Act Article 14, on the human oversight requirement for higher-risk AI systems. https://artificialintelligenceact.eu/article/14/
- MIT Sloan Management Review (2024). The reality of managing AI-generated content, on error-discovery rates in firms relying on intuition versus structured review. https://sloanreview.mit.edu/article/the-reality-of-managing-ai-generated-content/
- Stanford Institute for Human-Centered AI (2024). AI Index Report 2024, on hallucination rates and the gap between fluency and accuracy in generative models. https://hai.stanford.edu/ai-index/
- Harvard Business Review (2024). How to evaluate AI output in professional services, on point-of-use checks as the most reliable gate for SME workflows. https://hbr.org/2024/01/evaluating-ai

Frequently asked questions

My team is already overstretched. Won't two extra questions slow them down?

In practice the time cost is 20 to 45 seconds per decision because the questions land at moments where the team is already pausing: before sending, before pasting, before paying. After two or three weeks the questions become automatic. The throughput trade is usually positive once you count the rework, the client follow-ups, and the corrections that no longer need to happen.

How is this different from a formal review process?

A formal review introduces a second person and a queue, which gets skipped under pressure. The two-question method stays inside the original person's workflow and only adds reflection at the point of use. It catches the everyday output. Formal review still applies to the three heavier categories: regulated work, material financial or legal commitments, and external claims that carry reputational weight.

What if my team says yes to the first question without really meaning it?

That is the most common failure mode and it is fixable. Either spot-check a small sample of outputs each week and ask the person who signed off to walk you through their reasoning, or move to a written gate where the answer is logged briefly next to the output. Both surface drift quickly. Most teams who go through a fortnight of spot-checks stop rubber-stamping because the cost of getting caught is higher than the cost of pausing for ten seconds.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
