A small-firm owner I spoke with last month had a customer call her in the afternoon to say the chatbot on her website had told them she offered a service she does not offer. The bot had answered fluently, with a price, with a turnaround time, with a small caveat about VAT. The customer had taken a screenshot. She now had to decide whether her firm meant what the bot said.
That moment is where the word “hallucination” stops being an industry term and becomes a question an owner has to answer this quarter. Not in theory. Not after the next model release. Now.
What is an AI hallucination?
An AI hallucination is a fluent, confident output from a language model that is simply not true. The model invents a fact, a citation, a policy, or a product detail and presents it with the same calm authority it uses for things it knows. The trap is the fluency. There is no warning tone, no caveat, no internal “I am unsure” flag.
Researchers split hallucinations into two flavours. Factuality hallucinations contradict the world, like inventing a court case or stating a wrong date. Faithfulness hallucinations contradict a source the model was given, like adding a clause to a policy document it was asked to summarise. The distinction matters because the controls for each are different, but the harm is the same: a confident wrong answer that someone acted on.
Why does it happen?
Language models hallucinate because they predict text; they do not look it up. When you ask a model a question, it generates an answer one word at a time, choosing the most likely next word given what came before. It does not ask whether the claim is true. There is no separate truth-check. The machine is optimised to produce plausible continuations, and a plausible-sounding lie is, by definition, plausible.
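If you want to see that mechanism with the fluency stripped away, here is a deliberately toy sketch in Python. A lookup table of word frequencies stands in for the neural network, and the context, words, and numbers are invented for illustration, but the selection step is the same in spirit: pick whatever continuation is statistically likely, with no check on whether it is true.

```python
import random

# Toy illustration of next-word prediction. Real models use neural networks
# over tens of thousands of tokens, but the selection logic is the same idea:
# choose a continuation that is statistically likely, with no truth check.
continuations = {
    ("the", "turnaround", "time", "is"): {"three": 0.5, "five": 0.3, "ten": 0.2},
}

def next_word(context, table):
    """Sample the next word in proportion to how often it followed this context."""
    options = table[tuple(context)]
    words = list(options)
    weights = [options[w] for w in words]
    # Nothing here asks whether "three" days is the firm's actual policy.
    return random.choices(words, weights=weights, k=1)[0]

print(next_word(["the", "turnaround", "time", "is"], continuations))
```

Swap the toy table for a model trained on a large slice of the public internet and the output becomes far more convincing, but the question the code never asks stays the same.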
OpenAI’s own technical write-up on this is unusually clear. Pretraining sees only positive examples of fluent language, so the model learns statistical patterns rather than facts. Spelling and grammar follow consistent patterns and the model learns them well. Arbitrary low-frequency facts, like a specific person’s date of birth or a specific case citation, cannot be predicted from patterns. The model fills the gap with a guess that sounds like the kind of thing that would be true.
There is a second factor that matters for owners evaluating vendor claims. Reinforcement learning from human feedback, the training step where humans rate model answers, tends to reward confident, agreeable output. A model that hedges gets marked down and a model that asserts gets marked up. No model release in 2026 has fully solved this. The best summarisation models on the Vectara leaderboard still hallucinate roughly one answer in fifty.
Where this bites a small business
Four places, in rough order of pain. Customer-facing claims first: a chatbot or AI-drafted email that states a price, a policy, or a warranty term the firm did not authorise. The customer takes a screenshot. The firm has to decide whether to honour the bot or argue with the customer. Regulated advice second: financial, tax, or legal answers where a wrong reply creates duty-of-care exposure under the FCA’s Consumer Duty.
Contract and document drafting third. AI-drafted contracts that cite cases that do not exist. AI-drafted compliance documents that invent regulatory requirements. AI-summarised policies that quietly add a clause the source does not contain. The harm here is that the error survives the draft because the reviewer trusts the fluency. Internal policy and decision support fourth: AI-drafted staff handbooks, AI-summarised board papers, AI-prepared briefings. The audience is internal, but a wrong figure in a board pack is a wrong figure.
The unifying feature across all four is that the model’s confidence is unrelated to whether the model is right. A wrong answer reads exactly like a right one. That is the design of the technology, not a defect in any particular product.
What the cases tell you
Three cases show courts in three jurisdictions starting to expect verification. In Moffatt v Air Canada, a Canadian small-claims tribunal held the airline liable for negligent misrepresentation when its chatbot stated a bereavement-fare policy that did not exist. The tribunal rejected the argument that the chatbot was a separate legal entity. It was part of the firm’s website, the customer relied on it, and the firm was responsible.
In Mata v Avianca, a US federal court sanctioned two attorneys jointly after they submitted a brief containing ChatGPT-invented case citations. In Harber v HMRC, the UK First-tier Tribunal flagged that an appellant had submitted ChatGPT-fabricated case law and noted the citations were “plausible but incorrect”. The Solicitors Regulation Authority’s joint guidance with the NCSC puts the principle bluntly: LLM output “sounds right rather than is right”, and a solicitor who relies on it without verifying is in breach of professional duty.
The takeaway is not that any one of these binds a UK SME directly. It is that the direction of travel is consistent. Courts and regulators in three jurisdictions are now treating AI-generated content as the firm’s own statement, with the firm responsible for its accuracy. This is awareness, not legal advice. Speak to your solicitor and your professional indemnity broker about your specific facts and your specific cover.
What to do about it
Classify your AI uses by what a hallucination would cost. Brainstorming, internal first drafts, and rough summaries can tolerate the occasional invented answer, because a human will read and rewrite anyway. Customer-facing claims, regulated advice, contract drafting, and anything cited externally cannot. Match the controls to the cost. The highest-value control is human review before output leaves the firm in those high-cost categories.
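One way to make that exercise concrete is to write the classification down somewhere your team can see it. The sketch below is illustrative only, and the use cases and controls are examples rather than a standard taxonomy, but even a simple register like this forces the conversation about which outputs need a human in front of them.

```python
# Illustrative only: a minimal register of AI uses, what a hallucination would
# cost, and the control each use requires. Categories here are examples.
AI_USE_REGISTER = {
    "brainstorming and internal first drafts": {
        "hallucination_cost": "low",
        "required_control": "human reads and rewrites before any use",
    },
    "customer-facing chatbot answers": {
        "hallucination_cost": "high",
        "required_control": "human review before output leaves the firm",
    },
    "contract and compliance drafting": {
        "hallucination_cost": "high",
        "required_control": "verify every citation and clause against the source",
    },
}

for use, policy in AI_USE_REGISTER.items():
    print(f"{use}: cost={policy['hallucination_cost']}, control={policy['required_control']}")
```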
The technical mitigations are useful but partial. Retrieval-augmented generation anchors the model in your actual documents and reduces hallucination, though Stanford’s audit showed even sophisticated RAG-grounded legal tools still hallucinated more than 17% of the time. Prompt design can ask the model to flag uncertainty or cite sources, which makes errors easier to spot. Structured output, where the model returns a fixed schema rather than free text, removes degrees of freedom and shrinks the surface area for invention. Each helps. None solve the underlying issue, which is that the model is guessing.
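To make “structured output” less abstract, here is a vendor-neutral sketch of the pattern. The model is asked to return a fixed JSON shape rather than free text, and the firm’s own code checks every field against an authorised list before anything reaches a customer. The field names, services, and prices are invented for the example; the same idea can be built with whichever tooling your vendor provides.

```python
import json

# Illustrative sketch of the "structured output" idea: the model must return a
# fixed JSON shape, and the firm's own code rejects anything outside the values
# it has authorised. Services and prices below are invented for the example.
AUTHORISED_SERVICES = {"payroll": 120.00, "vat_return": 250.00}

def validate_reply(raw_model_output: str) -> dict:
    """Accept the model's reply only if it quotes an authorised service and price."""
    reply = json.loads(raw_model_output)          # must be valid JSON at all
    service = reply.get("service")
    quoted = reply.get("price_gbp")
    if service not in AUTHORISED_SERVICES:
        raise ValueError(f"Unknown service {service!r}: route to a human")
    if quoted != AUTHORISED_SERVICES[service]:
        raise ValueError(f"Quoted price {quoted} does not match the price list")
    return reply

# A hallucinated quote for a service the firm does not offer fails the check
# instead of reaching the customer as a screenshot-able promise.
print(validate_reply('{"service": "payroll", "price_gbp": 120.0}'))
```

The point of the pattern is not that the model stops guessing; it is that the guess has to pass through a gate the firm controls before it becomes the firm’s statement.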
Treat any vendor claim of “hallucination-free” output with scepticism. Ask for their measured hallucination rate on your specific use case, the methodology, the SLA, and what recourse you have when the system fails. If the vendor cannot answer, the phrase in their pitch deck is doing work the product cannot do.

The decision worth making this quarter is not whether to use AI. It is which uses you trust to a model alone, which you put a human in front of, and which you keep out of AI’s hands until the controls catch up. If you’d like a second pair of eyes on that classification for your own firm, book a conversation.



