Factual accuracy in AI writing: the small check that catches most errors

TL;DR

The factual error rate in AI-drafted SME writing is rarely the headline-grabbing hallucination. It is the date that is one year off, the job title from the wrong company, the regulatory reference that mostly exists, the price that was true six months ago. A disciplined five-minute pass on four specific claim types (dates, named attributions, regulatory references, and current-state assertions) catches the errors that damage trust before they reach a client or prospect.

Key takeaways

- Real SME writing errors are not dramatic fabrications; they are small drifts in four predictable claim types: dates, names and roles, regulatory references, and current-state assertions like prices and market positions.
- The BBC and European Broadcasting Union tested four major chatbots on news queries and found roughly 45% of responses contained errors, with citation the worst-performing task family at 12.4% hallucination on average.
- The five-minute check is a triage, not full fact-checking: scan only the four claim types, verify two or three of each against current sources, and stop there.
- The check pays for itself on external, deliverable, high-stakes content. It is overkill on internal drafts, illustrative examples, and background context.
- Error rates fall when teams apply the check consistently for three months, then drift back unless the cadence is treated as non-negotiable and assigned to a named person.

A sales lead at a small services firm sent out an AI-drafted introductory email last quarter. The structure was clean, the tone right, the call to action clear. The prospect replied with three small corrections inside the first paragraph. A regulatory deadline was one year off. The job title attributed to her opposite number was from a previous company. An industry statistic was eighteen months out of date. None of these were dramatic fabrications. All of them were plausible enough to slip past the read-through. All of them eroded credibility before the conversation had started.

The owner asked the obvious question. How did the email get through? The answer is the subject of this post.

What does AI actually get wrong in SME writing?

The errors that matter are rarely dramatic. They are quieter. A date that is one year off in a smooth narrative. A job title attributed to someone who has moved firms. A regulatory reference that exists but applies to a different jurisdiction. A price that was accurate six months ago. The BBC and European Broadcasting Union tested four major chatbots on news queries and found roughly 45% of responses contained errors, often outdated details framed as current.

The same pattern shows up in citation tasks, which sit closest to what SME writers actually do when they reference research or regulation. Vectara’s hallucination leaderboard ranks citation as the worst-performing task family across frontier models, averaging 12.4% hallucination. A separate study of reference accuracy found models inventing DOIs, paper titles, author names, and journal references at rates of 6.8% to 19.1%. For a proposal claiming “research by X published in Y found Z”, the chance that at least one of those elements is subtly wrong is material.

Citation hallucination has been documented in court at scale. By April 2026, the AI Hallucination Cases Database had tracked 1,174 decisions worldwide in which judges engaged with hallucinated content in AI-generated filings. The fabrications were dangerous because they appeared credible, not because they were obviously false.

Why do the headline benchmark numbers mislead?

Benchmarks test isolated factual recall against a curated ground-truth set, with frontier models scoring as low as 4.2% hallucination on those simplified tasks. Real SME writing is harder. Each email or proposal carries multiple claims that must all be simultaneously accurate, contextually appropriate, and plausible-sounding. A model generating text from statistical patterns has no way to distinguish a current fact from a historical fact that happens to appear in training data.

The Stanford HAI 2026 AI Index added a more uncomfortable finding. Error rates within widely used benchmarks themselves reach 42%, and the Chatbot Arena Leaderboard may partly reflect adaptation to its own format rather than general capability. If the measurement of progress is itself compromised, the headline accuracy numbers deserve scepticism.

The practical implication is straightforward. A model that scores well on a published benchmark may still fail predictably on the specific writing tasks your firm cares about, because the benchmark does not test what you need tested. The defence is verification discipline that works regardless of which model the team is using this quarter.

What are the four claim types worth checking?

Four claim categories produce the bulk of credibility-damaging errors in AI-drafted SME writing. Knowing the categories lets you build a five-minute pass that is specific enough to catch real errors and cheap enough to sustain. The pass is a triage, not a full review. You scan only for these four types, spot-check two or three of each, and stop.

Dates and temporal references. Any statement about when something happened, when a deadline applies, or when a rule took effect. A one-year drift in a historical reference reads as carelessness when caught. For outreach into regulated sectors, a misquoted compliance deadline can flag the message to a prospect’s internal audit team. Check the date against the source document or the regulator’s current calendar.

Named attributions and roles. Any claim about who said something, who holds a position, or which firm someone works for. LinkedIn profiles refresh within weeks, AI training data is often 18 months stale, and a prospect who moved firms six months ago may still appear with the old employer. Verify the top one or two named individuals against current public sources before delivery.

Regulatory references and compliance claims. Any statement about which rule applies, which authority has jurisdiction, or what a regulator has decided. The regulator’s existence is not the point of the check. The point is whether the specific reference is current, applies to the prospect’s jurisdiction, and has not been amended or superseded. The EU AI Act, for example, saw most of its provisions, including the bulk of the high-risk system rules, become applicable on 2 August 2026, with some obligations phasing in through 2027.

Current-state assertions. Prices, availability, market positions, competitive claims. What was true six months ago has often shifted. A single out-of-date price point in a proposal erodes negotiation credibility immediately. Ask one question of each such claim: is this still true today, or based on data that has gone stale?

The MIT Sloan finding here is useful. Adding small, deliberate friction to AI review processes increases accuracy without significantly increasing review time. The five-minute pass is exactly that kind of friction.
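To make that friction concrete, here is a minimal sketch, in Python, of what a pre-delivery flagging pass could look like. Everything in it is hypothetical: the patterns are deliberately crude starting points rather than a tested ruleset, and the script only surfaces candidate sentences for a human to verify, it does not verify anything itself.

```python
import re

# Crude, illustrative patterns for the four claim types. These are
# hypothetical starting points, not a tested ruleset: real drafts need
# patterns tuned to your sector and house style.
CLAIM_PATTERNS = {
    "date / temporal": r"\b(19|20)\d{2}\b|\bdeadline\b|\beffective\b",
    "named attribution": r"\b(CEO|CFO|director|head of|partner at)\b",
    "regulatory reference": r"\b(Act|Regulation|Directive|GDPR)\b",
    "current-state": r"[£$€]\s?\d|\bper (month|year)\b|\bmarket leader\b",
}

def flag_claims(draft: str) -> list[tuple[str, str]]:
    """Return (claim type, sentence) pairs a reviewer should spot-check."""
    flagged = []
    # Naive sentence split: good enough for a five-minute triage.
    for sentence in re.split(r"(?<=[.!?])\s+", draft):
        for claim_type, pattern in CLAIM_PATTERNS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                flagged.append((claim_type, sentence.strip()))
    return flagged

draft = (
    "The FCA deadline applies from 31 March 2025. "
    "Jane Smith, CFO at Acme Ltd, confirmed pricing of £4,000 per month."
)
for claim_type, sentence in flag_claims(draft):
    print(f"[{claim_type}] {sentence}")
```

A paper checklist does the same job. The value is the forced pause on each flagged sentence, not the tooling.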

Where does the check pay off, and where is it overkill?

The return on five minutes is not uniform. For external, customer-facing content (proposals, prospect outreach, regulator submissions, published thought leadership), the pass justifies itself the first time a corrected error would have damaged a relationship. For internal communications (team updates, draft notes circulated to staff), the cost outweighs the value: colleagues will catch factual errors in conversation and the cost of being wrong is contained.

The hierarchy matters by stake, not just by audience. High-stakes claims (pricing, compliance deadlines, named-individual attributions, financial figures) warrant the check every time. Low-stakes claims (illustrative examples, descriptive background, general framing) do not. This mirrors the principle in the NIST AI Risk Management Framework: governance must match risk, with documented oversight on the decisions that affect customers, compliance, or revenue, and lighter controls on the rest.

The practical rule for SMEs is to define which claim types matter for your business and apply the pass consistently to those, on customer-facing work, before delivery. Not every output, not every paragraph, not every word. The discipline survives because it is narrow.
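For teams that prefer the rule written down, here is a minimal sketch of that kind of policy, with entirely hypothetical claim types and values; each firm defines its own table:

```python
# A hypothetical policy table mapping claim types to when the pass applies.
# The entries are illustrative; each firm defines its own list and stakes.
CHECK_POLICY = {
    "pricing": "always",
    "compliance deadline": "always",
    "named attribution": "always",
    "financial figure": "always",
    "illustrative example": "skip",
    "descriptive background": "skip",
}

def needs_check(claim_type: str, customer_facing: bool) -> bool:
    """Apply the pass only to high-stakes claims on customer-facing work.
    Unknown claim types default to "always", erring on the safe side."""
    return customer_facing and CHECK_POLICY.get(claim_type, "always") == "always"

print(needs_check("pricing", customer_facing=True))                 # True
print(needs_check("pricing", customer_facing=False))                # False: internal draft
print(needs_check("descriptive background", customer_facing=True))  # False: low stakes
```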

What happens after three months of doing the check?

Teams that apply the check consistently see initial improvements. Error rates on client-facing content drop. Prospect corrections fall. Credibility rises. Then, around month three or four, something shifts. The check feels less necessary because the visible failures have stopped. Exceptions accumulate (“this one looks fine”, “we are running late”). Within weeks, the discipline has eroded and the error rate climbs back toward baseline.

The pattern is predictable. As salience falls, deliberate practice slips into muscle memory and then into optional habit. The fix is unglamorous and durable: cadence. The check is scheduled, assigned to a named person, and applied to customer-facing content before delivery without exception. The moment exceptions are permitted, the discipline erodes. The response to visible drift is not more sophisticated tooling; it is re-establishing the rule and naming the owner.

For SMEs measuring whether the discipline is working, the metric is simple. Track prospect corrections received on factual points over a 60-day baseline. Apply the pass to all customer-facing AI-drafted content for the next 90 days. Track corrections again. The signal is in the direction. Corrections continuing to fall means the check is holding. Corrections climbing again means the team has drifted, and process correction, not new tooling, is what gets you back.
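A minimal sketch of that measurement, assuming only a count of documents sent and factual corrections received in each window. The numbers are hypothetical, and corrections are normalised per 100 documents so the 60-day and 90-day windows compare fairly:

```python
# Compare a 60-day baseline against a 90-day window with the check applied.
# Counts are hypothetical; in practice they come from whatever log the
# named owner keeps of documents sent and factual corrections received.

def correction_rate(corrections: int, documents_sent: int) -> float:
    """Factual corrections received per 100 documents sent."""
    return 100 * corrections / documents_sent

baseline = correction_rate(corrections=9, documents_sent=120)     # 60-day baseline
with_check = correction_rate(corrections=3, documents_sent=180)   # 90 days with the pass

print(f"Baseline:   {baseline:.1f} corrections per 100 documents")
print(f"With check: {with_check:.1f} corrections per 100 documents")
print("Holding" if with_check < baseline else "Drifting: re-establish the rule")
```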

If you want help building the five-minute check into a quality standard your team will actually maintain after the novelty fades, book a conversation.

Sources

- Bersin, J. (2025). BBC finds 45% of AI queries produce erroneous answers. Headline rate from the BBC and European Broadcasting Union test of four major chatbots on news queries. https://joshbersin.com/2025/10/bbc-finds-that-45-of-ai-queries-produce-erroneous-answers/
- DISCO (2026). AI hallucinations and legal decisions, trend watch. The AI Hallucination Cases Database has tracked 1,174 court and tribunal decisions worldwide in which judges found hallucinated content in AI-generated filings. https://csdisco.com/blog/ai-hallucinations-legal-decisions-trends
- Vectara (2025). Hallucination Leaderboard, next generation. Frontier model rates as low as 4.2% on factual recall, with citation the worst-performing task family at 12.4% average hallucination. https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- Stanford HAI (2026). 2026 AI Index Report, Technical Performance chapter. Benchmark error rates reaching 42% on widely used evaluations, with concerns that the Chatbot Arena Leaderboard partly reflects adaptation to format rather than capability. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance
- Information Commissioner's Office (2024). Governance and accountability in AI. UK guidance that organisations must formally document AI-related risks and track them at corporate level. https://ico.org.uk/for-organisations/advice-and-services/audits/data-protection-audit-framework/toolkits/artificial-intelligence/governance-and-accountability-in-ai/
- MIT Sloan Management Review (2024). Nudge users to catch generative AI errors. Adding small, deliberate friction to AI review processes increased accuracy without significantly increasing review time. https://sloanreview.mit.edu/article/nudge-users-to-catch-generative-ai-errors/
- McKinsey & Company (2025). The state of AI, global survey 2025. Organisations seeing the most value from AI redesign workflows around where AI amplifies human capability rather than replacing judgment. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- Palo Alto Networks (2024). NIST AI Risk Management Framework explained. The principle that AI governance must match risk: high-stakes decisions require documented oversight, lower-risk flows can use lighter controls. https://www.paloaltonetworks.com/cyberpedia/nist-ai-risk-management-framework
- PMC / NCBI (2024). Hallucination rates and reference accuracy of ChatGPT and Bard. Models invent DOIs, paper titles, author names, and journal references at rates of 6.8% to 19.1% across citation tasks. https://pmc.ncbi.nlm.nih.gov/articles/PMC11153973/
- Advocate Magazine (2026). How AI introduces errors into your documents. Legal-profession case studies of fabricated citations that followed proper formatting and seemed perfectly suited to the argument. https://www.advocatemagazine.com/article/2026-may/before-you-buy-legal-ai-learn-to-use-the-ai-you-already-have-copy

Frequently asked questions

How is this different from checking every fact in the document?

Comprehensive fact-checking is expensive, and many teams will not sustain it. The five-minute check is deliberately narrower. You scan the document only for the four claim types where AI fails most reliably, you spot-check two or three of each against a current source, and you stop. The discipline trades coverage for sustainability. Five minutes per document, applied to every piece of customer-facing writing, catches more errors over a quarter than an exhaustive review applied to one document in five.

Which claim types fail most often in AI-drafted writing?

Four cluster reliably. Dates and temporal references, where the model picks a year that sounds right but is a year off. Named attributions, where a job title is current to the training data rather than to today. Regulatory references, where the rule exists but applies to a different jurisdiction or has been amended. Current-state assertions, where prices, market positions, and availability claims are quietly stale. These four account for the bulk of credibility-damaging errors in SME outreach.

What changes after three months of doing this?

Error rates fall, then plateau, then drift back up unless the cadence is maintained. The pattern is predictable. As prospect corrections decline, the salience of the check fades, exceptions accumulate, and the discipline erodes. The fix is unglamorous: name the person responsible, treat the check as non-negotiable on customer-facing content, and re-establish the rule the moment drift shows up. The check is a process, not a one-off improvement.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
