Why AI produces plausible but false answers

A director at a professional services firm asks an AI chatbot to summarise a recent regulatory change for a client briefing. The output is fluent, well-structured, and mentions a named enforcement case that supports the argument. The director sends it without checking. A week later, the client asks for the full case reference. The enforcement case does not exist. The AI invented it, complete with a plausible docket number and a realistic-sounding outcome.

This failure mode is common enough that researchers have given it its own name: hallucination.

What is a generative AI hallucination?

Generative AI systems such as ChatGPT, Copilot, and Gemini are built to predict the next word that best fits the pattern of everything that came before. They have no built-in mechanism for verifying facts. When pattern-matching produces a statement that sounds right but is not grounded in reality, the result is called a hallucination. MIT’s Generative AI Working Group makes this point directly: these tools generate plausible content, not verified content.
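A toy example may make the mechanism concrete. The sketch below is not how any commercial model is built; it is a deliberately tiny next-word predictor, assuming an invented three-sentence corpus with placeholder "facts". It only records which word tends to follow which, so it can stitch together a fluent sentence that appears nowhere in its training text.

```python
import random
from collections import Counter, defaultdict

# Tiny invented corpus; the statements are placeholders, not real events.
corpus = (
    "the regulator fined the firm in 2021 . "
    "the regulator fined the bank in 2019 . "
    "the regulator warned the firm in 2020 ."
).split()

# Count which word follows which. This is all the "model" knows.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def generate(start, length, seed=0):
    """Build a sentence by repeatedly sampling a likely next word."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = follows[words[-1]]
        if not options:
            break
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the", 6))
```

Every word pair in the output was seen in training, yet the sentence as a whole may pair a firm with a year or an action that the corpus never stated. Nothing in the code checks whether the combination is true, which is the same gap, at a vastly smaller scale, that produces hallucinations.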

Accuracy, when it occurs, is a side effect of plausibility rather than the result of any fact-checking. The training data compounds the problem. Generative AI is trained on internet-scale datasets that contain accurate information alongside outdated content, falsehoods, and societal biases. Because models learn correlations rather than ground truth, they can reproduce those inaccuracies with high confidence. Josh Bersin’s analysis of a 2025 BBC and European Broadcasting Union study uses the phrase “poisoned corpus” to describe this problem. When training data contains flawed or exaggerated information, the model’s outputs reflect those flaws, often persuasively.

A specific form of this is what researchers at the University of Maryland call ghost citations. AI tools can invent academic articles that do not exist, pairing real authors with fabricated titles, journals, and publication dates. The output looks credible. The source is imaginary.

Why does this matter for your firm?

UK service businesses face a particular risk: staff trusting AI output more than it deserves. The 2025 BBC and European Broadcasting Union study found that roughly 45 per cent of answers from mainstream chatbots contained errors. Copilot wrongly claimed a bird flu vaccine trial was under way in Oxford, citing a BBC article from 2006. MIT’s research notes that the apparent objectivity of AI tools makes people less willing to question incorrect outputs.

The regulatory exposure compounds this. The Information Commissioner’s Office makes clear that organisations cannot avoid accountability by attributing an incorrect output to a third-party AI vendor. Under UK GDPR, you remain responsible for the accuracy of data used in decisions about individuals. The ICO’s guidance on the accuracy principle states that organisations must take reasonable steps to ensure accuracy of outputs used in decision-making, regardless of how those outputs were generated.

The Financial Conduct Authority has reinforced this position in its AI commentary. Existing conduct rules, including the requirement to provide information that is clear, fair, and not misleading, apply to AI-assisted communications and advice. Consumer Duty obligations do not pause because a chatbot was involved.

Where will you actually meet it?

Hallucinations surface in the everyday tasks small firms reach for AI to help with: summarising a contract clause, checking what a regulation says, drafting a client briefing, or answering a staff question about policy. Stanford’s Institute for Human-Centered AI tested general-purpose chatbots on legal research queries and found hallucination rates of 58 to 82 per cent, high enough to make unchecked AI output a genuine liability in professional settings.

The pattern extends to more conversational uses. A 2024 peer-reviewed paper in the journal Patterns documented AI systems engaging in sycophancy, generating answers that match what the user appears to want to hear rather than what is accurate. When a team member asks AI to validate a decision already made, the tool is predisposed to agree. The NCSC advises treating generative AI outputs as untrusted until independently verified, particularly in contexts where the cost of getting it wrong is high.

Internal knowledge bases carry the same risk. Stanford’s research found that even specialised tools using retrieval-augmented generation, grounded in a curated set of company documents, still hallucinated on more than 17 per cent of professional queries.

When should you check AI output and when can you trust it?

Risk scales with the consequences of getting it wrong. Using AI to draft an internal memo or a marketing description is low-stakes, because a human edits it before it goes anywhere. Using AI to answer a client question about their legal position, their tax exposure, or their regulatory obligations is different. The output becomes advice, and the ICO, FCA, and sector regulators make clear that you remain accountable for it, regardless of which tool generated it.

A practical framework divides uses into two groups. The first group, allowed with review, covers marketing drafts, rough proposals, meeting summaries based on non-sensitive notes, and first-pass internal documentation. The second group, tightly controlled or prohibited, covers anything involving regulated advice, HR decisions, client eligibility assessments, and any output that could be treated as a statement of fact in a dispute or audit.
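For firms that want the two groups written down somewhere more enforceable than a memo, the framework can be expressed as a simple lookup. This is a minimal sketch, not a recommended tool: the task labels are taken from the examples above, and the choice to treat any unlisted task as restricted is an assumption added here for illustration.

```python
from enum import Enum

class Tier(Enum):
    ALLOWED_WITH_REVIEW = "allowed with review"
    CONTROLLED_OR_PROHIBITED = "tightly controlled or prohibited"

# Illustrative task labels based on the examples in the text.
POLICY = {
    "marketing draft": Tier.ALLOWED_WITH_REVIEW,
    "rough proposal": Tier.ALLOWED_WITH_REVIEW,
    "meeting summary": Tier.ALLOWED_WITH_REVIEW,
    "internal documentation": Tier.ALLOWED_WITH_REVIEW,
    "regulated advice": Tier.CONTROLLED_OR_PROHIBITED,
    "hr decision": Tier.CONTROLLED_OR_PROHIBITED,
    "client eligibility assessment": Tier.CONTROLLED_OR_PROHIBITED,
}

def tier_for(task):
    """Look up a task's tier.

    Assumption: tasks nobody has classified yet fall into the
    stricter tier until someone reviews them.
    """
    return POLICY.get(task.strip().lower(), Tier.CONTROLLED_OR_PROHIBITED)

print(tier_for("Marketing draft").value)
print(tier_for("tax advice").value)
```

Defaulting unknown tasks to the stricter group means a new use of AI has to be consciously approved rather than silently permitted.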

One rule applies across both groups. Any specific claim in an AI output, including a number, a law reference, a regulation, or a case, should be verified against a primary source before it goes anywhere. If a member of staff cannot find the original source independently, the claim should be treated as unverified, and the text should not be used as written.
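A reviewer can be helped to find the claims that need checking. The sketch below assumes three rough patterns (numbers, references to a section, article, or regulation, and "X v Y" case names) and uses an invented sentence as input. It only highlights candidates for a human to verify; it cannot tell whether any of them is true, and the patterns will miss claims phrased in other ways.

```python
import re

# Illustrative patterns only; a real checklist would be tuned to the firm's work.
CLAIM_PATTERNS = {
    "number": re.compile(r"\b\d+(?:[.,]\d+)*(?:\s*(?:per cent|%))?"),
    "legal reference": re.compile(
        r"\b(?:section|article|regulation)\s+\d+\w*", re.IGNORECASE
    ),
    "case name": re.compile(r"\b[A-Z]\w+ v\.? [A-Z]\w+"),
}

def flag_claims(text):
    """Return (label, matched text) pairs to check against a primary source."""
    found = []
    for label, pattern in CLAIM_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

# Invented example sentence; the case and figures are placeholders.
draft = "In Smith v Jones the firm was fined 45 per cent under Article 5."
for label, claim in flag_claims(draft):
    print(f"VERIFY ({label}): {claim}")
```

The same figure can be flagged under more than one label, which is harmless here: the aim is to over-flag and let the reviewer decide.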

Two approaches reduce hallucination risk without abandoning the productivity gains. The first is retrieval-augmented generation (RAG); the second is human-in-the-loop review. RAG grounds the AI’s answers in documents you supply, rather than letting it draw freely from general training data. A chatbot built on your vetted policies, contracts, and FAQs is less likely to invent content than one drawing from the open internet. Some residual risk remains even then, which is why human review of critical outputs stays necessary.
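The retrieval step in RAG can be sketched in a few lines. The example below is a simplification under stated assumptions: the two policy documents are invented, matching is by crude word overlap rather than the embedding search real systems typically use, and no AI model is called. It shows only the grounding idea, which is to find a vetted source first and decline to answer when none is found.

```python
import re

# Hypothetical vetted documents; in practice these would be the firm's own.
DOCUMENTS = {
    "leave-policy": "Staff accrue 25 days of annual leave per year, plus bank holidays.",
    "expenses-policy": "Expenses claims must be submitted within 30 days with receipts.",
}

def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, min_overlap=2):
    """Return (doc_id, text) for the best word-overlap match, or None."""
    q = tokens(question)
    best_id, best_score = None, 0
    for doc_id, text in DOCUMENTS.items():
        score = len(q & tokens(text))
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score < min_overlap:
        return None
    return best_id, DOCUMENTS[best_id]

def build_prompt(question):
    """Build a grounded prompt, or return None when no source supports an answer."""
    hit = retrieve(question)
    if hit is None:
        return None  # Nothing to ground the answer in; escalate to a person.
    doc_id, text = hit
    return (
        "Answer using only the source below. "
        "If the source does not contain the answer, say so.\n"
        f"Source [{doc_id}]: {text}\n"
        f"Question: {question}"
    )

print(build_prompt("How many days of annual leave do staff get?"))
print(build_prompt("What is the dress code?"))
```

The second question returns nothing because no supplied document covers it. Even with a source attached, the model writing the final answer can still misread or embellish it, so the output remains subject to review.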

Human-in-the-loop review means requiring a named person to sign off any AI output before it reaches a client or drives a decision. The ICO recommends meaningful human involvement in automated processes: not rubber-stamping, but genuine checking by someone with the expertise to spot an error. For a small firm, the practical version is straightforward: assign a named reviewer for each content type, with a short checklist covering the facts that matter most.

A short written AI policy, two or three pages, should cover which tasks AI is approved for, what review is required, and what is out of scope entirely. The FCA’s AI commentary and ICO accountability guidance both point to documented controls as a baseline expectation when something goes wrong. The NCSC recommends staff training so that your team can recognise plausible but false content and knows not to rely on a source it cannot verify. The time to put that in place is before a problem surfaces, not after.
