Factual accuracy in AI writing: the small check that catches most errors

TL;DR

The factual error rate in AI-drafted SME writing is rarely the headline-grabbing hallucination. It is the date that is one year off, the job title from the wrong company, the regulatory reference that mostly exists, the price that was true six months ago. A disciplined five-minute pass on four specific claim types (dates, named attributions, regulatory references, and current-state assertions) catches the errors that damage trust before they reach a client or prospect.

Key takeaways

- Real SME writing errors are not dramatic fabrications; they are small drifts in four predictable claim types: dates, names and roles, regulatory references, and current-state assertions like prices and market positions.
- The BBC and European Broadcasting Union tested four major chatbots on news queries and found roughly 45% of responses contained errors, with citation the worst-performing task family at 12.4% hallucination on average.
- The five-minute check is a triage, not full fact-checking: scan only the four claim types, verify two or three of each against current sources, and stop there.
- The check pays for itself on external, deliverable, high-stakes content. It is overkill on internal drafts, illustrative examples, and background context.
- Error rates fall when teams apply the check consistently for three months, then drift back unless the cadence is treated as non-negotiable and assigned to a named person.

A sales lead at a small services firm sent out an AI-drafted introductory email last quarter. The structure was clean, the tone right, the call to action clear. The prospect replied with three small corrections inside the first paragraph. A regulatory deadline was one year off. The job title attributed to her opposite number was from a previous company. An industry statistic was eighteen months out of date. None of these were dramatic fabrications. All of them were plausible enough to slip past the read-through. All of them eroded credibility before the conversation had started.

The owner asked the obvious question. How did the email get through? The answer is the subject of this post.

What does AI actually get wrong in SME writing?

The errors that matter are rarely dramatic. They are quieter. A date that is one year off in a smooth narrative. A job title attributed to someone who has moved firms. A regulatory reference that exists but applies to a different jurisdiction. A price that was accurate six months ago. The BBC and European Broadcasting Union tested four major chatbots on news queries and found roughly 45% of responses contained errors, often outdated details framed as current.

The same pattern shows up in citation tasks, which sit closest to what SME writers actually do when they reference research or regulation. Vectara’s hallucination leaderboard ranks citation as the worst-performing task family across frontier models, averaging 12.4% hallucination. A separate study of reference accuracy found models inventing DOIs, paper titles, author names, and journal references at rates of 6.8% to 19.1%. For a proposal claiming “research by X published in Y found Z”, the chance that at least one of those elements is subtly wrong is material.

Citation hallucination has been documented in court at scale. By April 2026, the AI Hallucination Cases Database had tracked 1,174 decisions worldwide in which judges engaged with hallucinated content in AI-generated filings. The fabrications were dangerous because they appeared credible, not because they were obviously false.

Why do the headline benchmark numbers mislead?

Benchmarks test isolated factual recall against a curated ground-truth set, with frontier models scoring as low as 4.2% hallucination on those simplified tasks. Real SME writing is harder. Each email or proposal carries multiple claims that must all be simultaneously accurate, contextually appropriate, and plausible-sounding. A model generating text from statistical patterns has no way to distinguish a current fact from a historical fact that happens to appear in training data.

The Stanford HAI 2026 AI Index added a more uncomfortable finding. Error rates within widely used benchmarks themselves reach 42%, and the Chatbot Arena Leaderboard may partly reflect adaptation to its own format rather than general capability. If the measurement of progress is itself compromised, the headline accuracy numbers deserve scepticism.

The practical implication is straightforward. A model that scores well on a published benchmark may still fail predictably on the specific writing tasks your firm cares about, because the benchmark does not test what you need tested. The defence is verification discipline that works regardless of which model the team is using this quarter.

What are the four claim types worth checking?

Four claim categories produce the bulk of credibility-damaging errors in AI-drafted SME writing. Knowing the categories lets you build a five-minute pass that is specific enough to catch real errors and cheap enough to sustain. The pass is a triage, not a full review. You scan only for these four types, spot-check two or three of each, and stop.

Dates and temporal references. Any statement about when something happened, when a deadline applies, or when a rule took effect. A one-year drift in a historical reference reads as carelessness when caught. For outreach into regulated sectors, a misquoted compliance deadline can flag the message to a prospect’s internal audit team. Check the date against the source document or the regulator’s current calendar.

Named attributions and roles. Any claim about who said something, who holds a position, or which firm someone works for. LinkedIn profiles refresh within weeks, AI training data is often 18 months stale, and a prospect who moved firms six months ago may still appear with the old employer. Verify the top one or two named individuals against current public sources before delivery.

Regulatory references and compliance claims. Any statement about which rule applies, which authority has jurisdiction, or what a regulator has decided. The regulator’s existence is not the point of the check. The point is whether the specific reference is current, applies to the prospect’s jurisdiction, and has not been amended or superseded. The EU AI Act, for example, saw most of its provisions, including the bulk of the high-risk system rules, become applicable on 2 August 2026, with some obligations phasing in through 2027.

Current-state assertions. Prices, availability, market positions, competitive claims. What was true six months ago has often shifted. A single out-of-date price point in a proposal erodes negotiation credibility immediately. Ask one question of each such claim: is this still true today, or based on data that has gone stale?

The MIT Sloan finding here is useful. Adding small, deliberate friction to AI review processes increases accuracy without significantly increasing review time. The five-minute pass is exactly that kind of friction.
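To make that friction concrete, here is a minimal sketch, in Python, of what a pre-delivery flagging pass could look like. Everything in it is hypothetical: the patterns are deliberately crude starting points rather than a tested ruleset, and the script only surfaces candidate sentences for a human to verify, it does not verify anything itself.

```python
import re

# Crude, illustrative patterns for the four claim types. These are
# hypothetical starting points, not a tested ruleset: real drafts need
# patterns tuned to your sector and house style.
CLAIM_PATTERNS = {
    "date / temporal": r"\b(19|20)\d{2}\b|\bdeadline\b|\beffective\b",
    "named attribution": r"\b(CEO|CFO|director|head of|partner at)\b",
    "regulatory reference": r"\b(Act|Regulation|Directive|GDPR)\b",
    "current-state": r"[£$€]\s?\d|\bper (month|year)\b|\bmarket leader\b",
}

def flag_claims(draft: str) -> list[tuple[str, str]]:
    """Return (claim type, sentence) pairs a reviewer should spot-check."""
    flagged = []
    # Naive sentence split: good enough for a five-minute triage.
    for sentence in re.split(r"(?<=[.!?])\s+", draft):
        for claim_type, pattern in CLAIM_PATTERNS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                flagged.append((claim_type, sentence.strip()))
    return flagged

draft = (
    "The FCA deadline applies from 31 March 2025. "
    "Jane Smith, CFO at Acme Ltd, confirmed pricing of £4,000 per month."
)
for claim_type, sentence in flag_claims(draft):
    print(f"[{claim_type}] {sentence}")
```

A paper checklist does the same job. The value is the forced pause on each flagged sentence, not the tooling.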

Where does the check pay off, and where is it overkill?

The return on five minutes is not uniform. For external, customer-facing content (proposals, prospect outreach, regulator submissions, published thought leadership), the pass justifies itself the first time a corrected error would have damaged a relationship. For internal communications (team updates, draft notes circulated to staff), the cost outweighs the value: colleagues will catch factual errors in conversation and the cost of being wrong is contained.

The hierarchy matters by stake, not just by audience. High-stakes claims (pricing, compliance deadlines, named-individual attributions, financial figures) warrant the check every time. Low-stakes claims (illustrative examples, descriptive background, general framing) do not. This mirrors the principle in the NIST AI Risk Management Framework: governance must match risk, with documented oversight on the decisions that affect customers, compliance, or revenue, and lighter controls on the rest.

The practical rule for SMEs is to define which claim types matter for your business and apply the pass consistently to those, on customer-facing work, before delivery. Not every output, not every paragraph, not every word. The discipline survives because it is narrow.
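For teams that prefer the rule written down, here is a minimal sketch of that kind of policy, with entirely hypothetical claim types and values; each firm defines its own table:

```python
# A hypothetical policy table mapping claim types to when the pass applies.
# The entries are illustrative; each firm defines its own list and stakes.
CHECK_POLICY = {
    "pricing": "always",
    "compliance deadline": "always",
    "named attribution": "always",
    "financial figure": "always",
    "illustrative example": "skip",
    "descriptive background": "skip",
}

def needs_check(claim_type: str, customer_facing: bool) -> bool:
    """Apply the pass only to high-stakes claims on customer-facing work.
    Unknown claim types default to "always", erring on the safe side."""
    return customer_facing and CHECK_POLICY.get(claim_type, "always") == "always"

print(needs_check("pricing", customer_facing=True))                 # True
print(needs_check("pricing", customer_facing=False))                # False: internal draft
print(needs_check("descriptive background", customer_facing=True))  # False: low stakes
```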

What happens after three months of doing the check?

Teams that apply the check consistently see initial improvements. Error rates on client-facing content drop. Prospect corrections fall. Credibility rises. Then, around month three or four, something shifts. The check feels less necessary because the visible failures have stopped. Exceptions accumulate (“this one looks fine”, “we are running late”). Within weeks, the discipline has eroded and the error rate climbs back toward baseline.

The pattern is predictable. As salience falls, deliberate practice slips into muscle memory and then into optional habit. The fix is unglamorous and durable: cadence. The check is scheduled, assigned to a named person, and applied to customer-facing content before delivery without exception. The moment exceptions are permitted, the discipline erodes. The response to visible drift is not more sophisticated tooling; it is re-establishing the rule and naming the owner.

For SMEs measuring whether the discipline is working, the metric is simple. Track prospect corrections received on factual points over a 60-day baseline. Apply the pass to all customer-facing AI-drafted content for the next 90 days. Track corrections again. The signal is in the direction. Corrections continuing to fall means the check is holding. Corrections climbing again means the team has drifted, and process correction, not new tooling, is what gets you back.
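A minimal sketch of that measurement, assuming only a count of documents sent and factual corrections received in each window. The numbers are hypothetical, and corrections are normalised per 100 documents so the 60-day and 90-day windows compare fairly:

```python
# Compare a 60-day baseline against a 90-day window with the check applied.
# Counts are hypothetical; in practice they come from whatever log the
# named owner keeps of documents sent and factual corrections received.

def correction_rate(corrections: int, documents_sent: int) -> float:
    """Factual corrections received per 100 documents sent."""
    return 100 * corrections / documents_sent

baseline = correction_rate(corrections=9, documents_sent=120)     # 60-day baseline
with_check = correction_rate(corrections=3, documents_sent=180)   # 90 days with the pass

print(f"Baseline:   {baseline:.1f} corrections per 100 documents")
print(f"With check: {with_check:.1f} corrections per 100 documents")
print("Holding" if with_check < baseline else "Drifting: re-establish the rule")
```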

If you want help building the five-minute check into a quality standard your team will actually maintain after the novelty fades, book a conversation.

Sources

- Bersin, J. (2025). BBC finds 45% of AI queries produce erroneous answers. Headline rate from the BBC and European Broadcasting Union test of four major chatbots on news queries. https://joshbersin.com/2025/10/bbc-finds-that-45-of-ai-queries-produce-erroneous-answers/
- DISCO (2026). AI hallucinations and legal decisions, trend watch. The AI Hallucination Cases Database has tracked 1,174 court and tribunal decisions worldwide in which judges found hallucinated content in AI-generated filings. https://csdisco.com/blog/ai-hallucinations-legal-decisions-trends
- Vectara (2025). Hallucination Leaderboard, next generation. Frontier model rates as low as 4.2% on factual recall, with citation the worst-performing task family at 12.4% average hallucination. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- Stanford HAI (2026). 2026 AI Index Report, Technical Performance chapter. Benchmark error rates reaching 42% on widely used evaluations, with concerns that the Chatbot Arena Leaderboard partly reflects adaptation to format rather than capability. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance
- Information Commissioner's Office (2024). Governance and accountability in AI. UK guidance that organisations must formally document AI-related risks and track them at corporate level. https://ico.org.uk/for-organisations/advice-and-services/audits/data-protection-audit-framework/toolkits/artificial-intelligence/governance-and-accountability-in-ai/
- MIT Sloan Management Review (2024). Nudge users to catch generative AI errors. Adding small, deliberate friction to AI review processes increased accuracy without significantly increasing review time. https://sloanreview.mit.edu/article/nudge-users-to-catch-generative-ai-errors/
- McKinsey & Company (2025). The state of AI, global survey 2025. Organisations seeing the most value from AI redesign workflows around where AI amplifies human capability rather than replacing judgment. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- Palo Alto Networks (2024). NIST AI Risk Management Framework explained. The principle that AI governance must match risk: high-stakes decisions require documented oversight, lower-risk flows can use lighter controls. https://www.paloaltonetworks.com/cyberpedia/nist-ai-risk-management-framework
- PMC / NCBI (2024). Hallucination rates and reference accuracy of ChatGPT and Bard. Models invent DOIs, paper titles, author names, and journal references at rates of 6.8% to 19.1% across citation tasks. https://pmc.ncbi.nlm.nih.gov/articles/PMC11153973/
- Advocate Magazine (2026). How AI introduces errors into your documents. Legal-profession case studies of fabricated citations that followed proper formatting and seemed perfectly suited to the argument. https://www.advocatemagazine.com/article/2026-may/before-you-buy-legal-ai-learn-to-use-the-ai-you-already-have-copy

Frequently asked questions

How is this different from checking every fact in the document?

Comprehensive fact-checking is expensive, and many teams will not sustain it. The five-minute check is deliberately narrower. You scan the document only for the four claim types where AI fails most reliably, you spot-check two or three of each against a current source, and you stop. The discipline trades coverage for sustainability. Five minutes per document, applied to every piece of customer-facing writing, catches more errors over a quarter than an exhaustive review applied to one document in five.

Which claim types fail most often in AI-drafted writing?

Four cluster reliably. Dates and temporal references, where the model picks a year that sounds right but is a year off. Named attributions, where a job title is current to the training data rather than to today. Regulatory references, where the rule exists but applies to a different jurisdiction or has been amended. Current-state assertions, where prices, market positions, and availability claims are quietly stale. These four account for the bulk of credibility-damaging errors in SME outreach.

What changes after three months of doing this?

Error rates fall, then plateau, then drift back up unless the cadence is maintained. The pattern is predictable. As prospect corrections decline, the salience of the check fades, exceptions accumulate, and the discipline erodes. The fix is unglamorous: name the person responsible, treat the check as non-negotiable on customer-facing content, and re-establish the rule the moment drift shows up. The check is a process, not a one-off improvement.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
