Reviewing AI's output: the editor's eye you still need

TL;DR

AI drafts look plausible at first read, which is precisely what makes them risky. The editor's eye is a four-pass review that owner-operators apply before any AI-drafted artefact leaves their desk: verify the specifics, check the nuance, listen for tonal drift, and confirm the structure earns its length. The skill is learnable, gets faster with use, and is what separates AI as a genuine lever from AI as a quiet liability.

Key takeaways

- AI output is dangerous because it is plausible, not because it is obviously wrong. Hallucinated citations, false certainty, and tonal drift all pass an inattentive first read, which is why editorial discipline has to be deliberate rather than reflexive.
- The Mata v. Avianca sanctions case in 2023 is the canonical worked example: two New York attorneys filed a brief with six fabricated case citations generated by ChatGPT, were sanctioned by Judge Castel, and the pattern has now been documented in over 50 US cases and a growing UK High Court list.
- Modern hallucination rates remain material even on frontier models. The Vectara Hallucination Leaderboard's 2026 dataset shows the best model at 3.3 percent, with Gemini-3-pro at 13.6 percent and most reasoning models above 10 percent.
- The four-pass review covers claims (verify against primary sources), nuance (read once aloud), tone (does it sound like you in British English), and structure (does it earn its length). It takes 10 minutes on a two-page document and gets faster with practice.
- UK regulators including the ICO, the FCA (in the Mills Review), the ICAEW, and the Bar Council have been explicit: meaningful human review of AI output is non-negotiable in any regulated context, and organisations are expected to have documented their oversight before something lands wrong.

A founder I spoke with last month sent a client email she had drafted in ChatGPT and lightly polished. It contained a specific number, a market-share figure, that she had not asked for, had not verified, and that turned out to be invented. The client noticed. She is still slightly mortified about it, and rightly so. The email read perfectly. That was the problem.

This is the discipline the AI-on-your-desk conversation quietly demands you sharpen, not soften. AI output looks plausible. The editor’s eye is the bit that still has to be yours.

What is the editor’s eye, and why does AI need it?

The editor’s eye is the small set of habits a careful writer applies to any draft before it leaves their desk: check the specifics, listen to the tone, ask whether each claim earns its place, read once aloud. It is what fact-checkers at The New Yorker and the FT have done for decades. The reason AI needs it is that AI is now fluent enough to disguise its own errors as competent prose.

That is a different problem from earlier machine output. A clumsy AI draft was its own warning. A fluent AI draft is not. Operators who get burned are usually the ones who treated fluency as a proxy for accuracy. The Vectara Hallucination Leaderboard’s 2026 dataset, run across 7,700 articles, puts even the best summarisation model at a 3.3 percent hallucination rate, with named reasoning systems including Claude Sonnet 4.5, GPT-5, Grok-4, and Deepseek-R1 all sitting above 10 percent. Stanford’s HELM benchmark and Anthropic’s published system cards point the same way.

Why does it matter for your business?

Because the consequences of plausible-but-wrong scale with the seriousness of the artefact. The canonical example is Mata v. Avianca, where two New York attorneys filed a brief in 2023 containing six fabricated court citations generated by ChatGPT. Judge Castel sanctioned them, fined the firm 5,000 dollars, and ordered them to send the false affidavit to the judges whose names had been misappropriated in the invented opinions.

The pattern has now spread well beyond that one case. The Charlotin AI Hallucination Cases Database tracks over 50 US court cases and a growing UK High Court list in which parties were found to have relied on hallucinated content. Deloitte Australia delivered a 440,000 dollar government report containing fabricated academic references and a misattributed quote from a Federal Court judge, and offered a partial refund. GPTZero, analysing 4,000 NeurIPS 2025 papers, found over 100 hallucinated citations across 50 papers that had each cleared three to five expert reviewers. The implication for an SME founder is direct. If peer-reviewed academic conferences and named consulting firms are missing AI-fabricated specifics at scale, an unaided 10pm review of a Tuesday client email is not going to catch them either.

UK regulators have been explicit about what is now expected. The ICO’s accuracy guidance under UK GDPR, the FCA’s Mills Review of AI in retail financial services, the ICAEW’s audit-work guidance, and the Bar Council’s 2025 update all converge on the same point: meaningful human review of AI output is non-negotiable in regulated contexts, and organisations are expected to have documented their oversight before something lands wrong.

Where will you actually meet it?

You will meet it as one of three predictable failure modes, each of which passes a casual first read. The first is invented specifics: a confidently formatted citation, a precise statistic, a dated quote, all wearing the texture of authentic professional writing because the AI has learned the shape of citations more thoroughly than their substance. Mata’s fabricated cases included docket numbers and reporter pages. They looked entirely real until counsel checked the database.

The second is missed nuance, which is the more insidious of the three. AI tends to compress genuine disagreement, jurisdictional variation, or live evidence into a single confident narrative. A draft client advisory that quietly resolves a debate the field has not actually resolved commits you to a position you may not have taken unaided. A board memo that reads as clean consensus when the underlying evidence is split is the same failure mode in a different format.

The third is tonal drift, often toward a flattened North American business register. The Max Planck Institute, studying 740,000 hours of content, has documented a measurable rise in ChatGPT’s preferred vocabulary in everyday writing since 2023, including the words a USC writing-variation study tracked under the same heading. The em dash has become a strong enough AI tell that some careful writers now self-consciously avoid it. For a UK owner-operator writing in British English to UK clients and regulators, drift toward Americanised flatness reads as carelessness, whether or not the content is accurate. It signals to the reader that nobody finished the job.

When should you polish, and when should you throw it out?

A polish pass is the right move when three conditions hold. The claims are verifiable from sources you already have to hand. The tone reads as yours within one read-aloud. Nothing in the draft commits you to a position you would not have taken unaided. If those three are clean, the draft is a starting point worth keeping.

The four-pass review is the working tool. Claims: verify each specific against a primary source. Nuance: read once aloud and listen for false certainty or compressed disagreement. Tone: ask directly whether this sounds like you in British English. Structure: ask whether the draft earns its length, or whether the model has padded the middle to look thorough.
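If you like your checklists executable, the four passes can be sketched as a small Python structure. The pass names and questions come straight from the article; the function and its ship-or-rework verdict are purely illustrative, a minimal sketch rather than a prescribed tool.

```python
# A minimal sketch of the four-pass review as a checklist.
# Pass names and questions mirror the article; the rest is illustrative.

FOUR_PASS_REVIEW = [
    ("Claims", "Is every specific (number, citation, quote) verified against a primary source?"),
    ("Nuance", "Read once aloud: any false certainty or compressed disagreement?"),
    ("Tone", "Does this sound like you, in British English?"),
    ("Structure", "Does the draft earn its length, or is the middle padded?"),
]

def run_review(answers):
    """answers maps each pass name to True (clean) or False (failed)."""
    failed = [name for name, _ in FOUR_PASS_REVIEW if not answers.get(name, False)]
    return "ship" if not failed else "rework: " + ", ".join(failed)
```

Used on a draft that clears all four passes, `run_review` returns "ship"; any failed pass names come back in the rework verdict, which is the point of making the checklist explicit rather than mental.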

You throw it out when the verification pass turns up two or more invented specifics, when the read-aloud produces a flatness you cannot localise to one paragraph, or when the draft has resolved a nuance you were not yet ready to resolve. Throwing out is faster than rewriting from a corrupted base. The judgement question that catches the in-between cases is this: would a thoughtful peer who reads my work catch that this is not me? If the answer is yes, the draft is not ready. The discipline overlaps with the practice of drafting first passes with AI. The first-pass habit and the editor’s eye are the two halves of the same skill.

How does the editor’s eye build over time?

It compounds. The first ten reviews are slower because you are building a personal catalogue of your own AI’s failure patterns: which kinds of citation it tends to fabricate, which words signal tonal drift in your voice, which structural moves it defaults to when it does not know what you want. By the thirtieth review, the catalogue is fast and largely automatic, and you are spending the time on substance rather than on detection.

The supporting habits are mechanical. Read aloud, because the auditory cortex catches rhythm and false certainty that silent reading does not. Verify every quantitative claim against a primary source before sending; if you cannot find the source in 60 seconds, the claim is not solid enough to ship. Run a brief premortem on regulatory or client-facing documents, the technique Gary Klein published in HBR in 2007: assume the document will be scrutinised and identify in advance which claims are most likely to be questioned, then verify those first.

The voice work is the part that takes longest, because it is the part the AI is best at imitating shallowly. A useful test, taught in the Roy Peter Clark “Writing Tools” tradition, is to read the draft aloud and listen for sentence-length variety. Frontier models default to a metronome rhythm that drones once you hear it. Your own voice does not. The wider context for this discipline lives in the cluster pillar, AI for your own work, not just your business. The editor’s eye is the foundational discipline that makes everything else in personal AI practice safe to ship, which is the part of the conversation that does not get talked about enough.
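For those who want a mechanical backstop to the read-aloud test, sentence-length spread is easy to measure. A minimal Python sketch using only the standard library; the sentence-splitting heuristic is crude and any threshold is yours to calibrate, so treat a low spread as a prompt to re-read aloud, not as a verdict.

```python
import re
import statistics

def sentence_length_variety(text):
    """Rough proxy for the rhythm check: word counts per sentence and their
    spread. A low standard deviation suggests the metronome rhythm the
    article describes; a human ear still makes the final call."""
    # Split after terminal punctuation followed by whitespace (crude heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    spread = statistics.pstdev(lengths) if len(lengths) > 1 else 0.0
    return lengths, spread

lengths, spread = sentence_length_variety(
    "Short one. This sentence runs a fair bit longer than the first. Then short again."
)
```

A draft with healthy variety produces a visibly mixed list of lengths and a non-trivial spread; AI-metronome prose clusters tightly around one length.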

Sources

- Berkeley Law (2023). Mata v. Avianca, Inc., Judge P. Kevin Castel sanctions opinion. Source for the canonical hallucinated-citation case where ChatGPT-generated fake court decisions were filed in federal court. Cited as the worked example for invented specifics. https://www.law.berkeley.edu/wp-content/uploads/archive/2025/12/Mata-v-Avianca-Inc.pdf
- Bar Council of England and Wales (2025). Updated guidance on generative AI for the Bar. UK barristers' professional-conduct guidance covering hallucinations, anthropomorphism, and the duty of accuracy in court filings. Cited as the UK regulator anchor. https://www.barcouncil.org.uk/resource/updated-guidance-on-generative-ai-for-the-bar.html
- Information Commissioner's Office (2024). Guidance on AI and data protection: accuracy and statistical accuracy. UK regulator's position that the accuracy principle of UK GDPR applies to all personal data input to or output by an AI system. Cited as the regulatory standard for editorial review. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/what-do-we-need-to-know-about-accuracy-and-statistical-accuracy/
- Vectara (2026). Next generation Hallucination Leaderboard. Standardised summarisation evaluation across 7,700 articles using the HHEM detection model, showing best-in-class at 3.3 percent and most reasoning models above 10 percent. Cited as the evidence base for hallucination rates. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- Stanford CRFM (2024). Holistic Evaluation of Language Models (HELM). Living benchmark for transparency in language models, used as a reference standard for cross-model performance comparison. Cited as the academic evaluation framework. https://crfm.stanford.edu/helm/
- Anthropic (2025). Claude Opus 4.5 system card. Published evaluation of failure modes, hallucination rates, and harmlessness across task types for the Claude model family. Cited as the model-vendor evaluation precedent. https://www.anthropic.com/claude-opus-4-5-system-card
- Financial Conduct Authority (2025). Mills Review: long-term impact of AI on retail financial services. UK regulator's review identifying hallucinatory advice, opaque decision-making, and erosion of consumer trust as growing AI risks under the Consumer Duty. Cited as the financial-services oversight anchor. https://www.fca.org.uk/publications/calls-input/review-long-term-impact-ai-retail-financial-services-mills-review
- ICAEW (2025). Artificial intelligence in audit work: managing the risks. Institute of Chartered Accountants in England and Wales guidance requiring documented policies, human-in-the-loop review, and alertness to unauthorised AI use in audit teams. Cited as the accountancy-profession anchor. https://www.icaew.com/regulation/working-in-the-regulated-area-of-audit/audit-regulations-and-guidance/artificial-intelligence-in-audit-work-managing-the-risks
- Fortune (2026). NeurIPS research papers contained over 100 AI-hallucinated citations, GPTZero analysis. Reporting on 50 NeurIPS 2025 papers found to contain fabricated citations after peer review by three to five reviewers, illustrating that even expert review misses AI hallucinations at scale. Cited as the academic-publishing evidence. https://fortune.com/2026/01/21/neurips-ai-conferences-research-papers-hallucinations/
- Charlotin, Damien (2026). AI Hallucination Cases Database. Continuously updated tracker of court cases in which parties were found to have relied on hallucinated AI content, now over 50 US cases plus a growing UK list. Cited as the live evidence base for professional consequences. https://www.damiencharlotin.com/hallucinations/

Frequently asked questions

How long does the four-pass review actually take?

For a two-page document, around 10 minutes once you have done it 30 or 40 times. The first ten attempts are slower, maybe 20 minutes, because you are building the catalogue of things to check for. The pace picks up sharply once you have caught your own AI's typical failure modes a handful of times. It gets faster, not slower.

When is an AI draft good enough to send with just a polish pass?

When the claims are verifiable from sources you already have to hand, the tone reads as yours within one read-aloud, and nothing in it commits you to a position you would not have taken unaided. If any of those three is missing, throw the draft out and start from a tighter brief. The polish pass is for cosmetic work, not substantive repair.

Is the editor's eye really a competitive advantage, or just defensive?

Both. The Vectara leaderboard puts even the best frontier model at a 3.3 percent hallucination rate, and most reasoning models above 10 percent. Operators who can spot a fabricated specific or a tonal drift inside 30 seconds will ship more, and ship cleaner, than peers who either trust the output or refuse to use AI at all. It is a learnable edge.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
