A professional services firm set up a chatbot to handle routine client questions. The answers sounded authoritative. Several were wrong. One quoted a fee structure that had changed six months earlier. The client acted on it. The firm spent an afternoon managing the fallout. The model had found an old version of the fee schedule and returned it as current.
That gap between fluent and accurate is the central problem with LLM-powered chatbots, and one that business owners tend to underestimate on first deployment.
What does chatbot accuracy actually mean?
Accuracy for a business chatbot means the same thing it means for a member of staff: the answer is factually correct, current, and relevant to the question. A member of staff who is unsure will pause and check. A large language model will produce a fluent, confident answer whether the underlying information is solid or not.
A 2024 study in BMJ Public Health, reported across industry press, found that ChatGPT and similar tools regularly gave inaccurate, incomplete, or potentially unsafe answers to consumer health queries. Health questions are an extreme case, but the failure mode applies broadly: the model optimises for fluency, not truth.
The UK government’s public Ask GOV.UK pilot illustrated how much design choices matter. Before the team tuned prompts and model configuration, accuracy sat at around 76%. After deliberate optimisation, it rose to roughly 90%. That gain came from decisions about what the bot was allowed to do and how it was set up, not from a newer model alone.
Why does accuracy matter for your firm?
A wrong answer from a chatbot is more than an embarrassment. If a customer acts on incorrect information from a channel your firm controls, you carry the legal exposure. UK risk consultancy URM Consulting notes that firms can face misrepresentation and negligence risk where customers rely on inaccurate AI-generated advice, particularly when personal or financial decisions are involved.
The FCA’s Consumer Duty makes this explicit for regulated firms. Communications must be fair, clear, and not misleading. Using an LLM to generate those communications does not shift the responsibility to the model provider. The firm remains accountable for what the chatbot says.
Internal chatbots carry a different but real risk. If a member of staff acts on incorrect HR policy guidance from an internal tool, the error still belongs to the business. The ICO’s guidance on AI and data protection stresses that organisations must test, monitor, and maintain the accuracy of AI-generated outputs, not just at launch but on an ongoing basis.
The practical upside is that accuracy is largely controllable. A chatbot designed with the right constraints performs reliably within its scope. The problems arise when scope is left undefined, when source documents are out of date, or when no one is checking what the bot actually says.
Where do accuracy problems actually come from?
Accuracy failures in business chatbots typically trace back to one of three sources. First, the model hallucinates because no relevant document exists in the knowledge base. Second, the model draws on a document that is out of date or conflicts with a newer one. Third, the bot’s scope is too broad, so it attempts to answer questions it was never equipped to handle.
The UK government’s Department for Science, Innovation and Technology (DSIT) addressed all three when building its internal Ask Ops chatbot. The team scraped vetted intranet documents into a vector database and instructed the model to query only that corpus. They also set the model temperature to 0.1, which keeps answers close to deterministic rather than varied from one run to the next. When no relevant document is retrieved, the model is instructed to return “No answer found” rather than attempt a response.
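The retrieve-or-refuse pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DSIT’s implementation: the keyword-overlap score stands in for a vector database query, and the names `Document`, `retrieve`, `answer`, `MIN_SCORE`, and the `generate` callable are all hypothetical. Only the temperature of 0.1 and the “No answer found” fallback come from the account above.

```python
from dataclasses import dataclass

NO_ANSWER = "No answer found"
TEMPERATURE = 0.1  # value reported for Ask Ops; passed on to whichever LLM client you use
MIN_SCORE = 0.5    # hypothetical relevance threshold; tune it on your own questions


@dataclass
class Document:
    title: str
    text: str


def _words(text):
    """Lower-case the text and strip basic punctuation from each word."""
    words = {w.strip("?.,").lower() for w in text.split()}
    words.discard("")
    return words


def score(question, doc):
    """Toy relevance score: the share of question words found in the document.
    A real deployment would use vector similarity instead (assumption)."""
    q_words = _words(question)
    if not q_words:
        return 0.0
    return len(q_words & _words(doc.text)) / len(q_words)


def retrieve(question, corpus):
    """Return the best-matching vetted document, or None if nothing is relevant enough."""
    best = max(corpus, key=lambda d: score(question, d), default=None)
    if best is None or score(question, best) < MIN_SCORE:
        return None
    return best


def answer(question, corpus, generate):
    """Answer only from a retrieved document; otherwise refuse.
    `generate` is a stand-in for your LLM call: generate(question, doc, temperature)."""
    doc = retrieve(question, corpus)
    if doc is None:
        return NO_ANSWER
    return generate(question, doc, TEMPERATURE)
```

The point of the design is that the refusal happens in ordinary code, before the model is called at all, so the fallback does not depend on the model choosing to follow an instruction.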
That “no answer found” policy matters more than it sounds. A chatbot that admits it cannot help is far less dangerous than one that fills the gap with a confident approximation. The NCSC, in its guidance on using AI safely, explicitly warns that LLMs “can confidently state incorrect information as fact” and recommends that organisations constrain models to vetted sources and make it easy for users to flag errors.
Out-of-date documents create a quieter failure mode. If your knowledge base contains a 2022 pricing page alongside a 2024 pricing page, the model may retrieve the older version and present it as current. DSIT tackled this by adding conflict-resolution rules to its prompts: the model is told how to handle superseded or contradictory guidance.
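As a rough sketch of what prompt-level conflict rules might look like, the Python below builds a prompt that lists sources newest first and tells the model how to treat contradictions. The wording of `RULES`, the `Source` type, and `build_prompt` are assumptions made for illustration; the actual DSIT prompt text is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Source:
    title: str
    effective: date
    text: str


# Hypothetical rule wording; adapt it to your own documents and test it.
RULES = (
    "Answer only from the sources below.\n"
    "If two sources conflict, use the one with the latest effective date "
    "and state that date in your answer.\n"
    "If a source says it has been superseded, ignore it.\n"
    'If the sources do not answer the question, reply "No answer found".'
)


def build_prompt(question, sources):
    """Assemble the rules, the sources (newest first, each labelled with its
    effective date), and the question into one prompt string."""
    ordered = sorted(sources, key=lambda s: s.effective, reverse=True)
    blocks = [
        f"[{s.title} | effective {s.effective.isoformat()}]\n{s.text}"
        for s in ordered
    ]
    return "\n\n".join([RULES, *blocks, f"Question: {question}"])
```

Labelling every source with its effective date gives the model something concrete to apply the rule to. Without the date in the prompt, “prefer the newer document” is an instruction it has no way to follow.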
What controls should you put in place?
The controls that work are design decisions about scope, source quality, model configuration, and human review. In sequence: define what the bot is allowed to answer, build a small clean knowledge base, configure the model conservatively, specify when a human takes over, and monitor outputs over time. Each step reduces a distinct failure mode.
Start with scope. Define the topics the bot is authorised to answer and set a standard fallback message for anything outside that list. Follow the DSIT pattern: give the model a clear role and restrict it to the provided documents. Anything outside that scope routes to a person.
Knowledge base quality is the second lever. Start with a small, verified set: current service descriptions, pricing, FAQs, key policies. Remove outdated versions before ingestion. Version your documents so that retrieval can prefer the most recent one. The DSIT experience reinforces a broader principle: how you structure and ingest source material into the knowledge base matters more to accuracy than the choice of underlying LLM.
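Removing outdated versions before ingestion can be as simple as keeping one entry per document. The sketch below assumes each document carries a stable identifier and a last-updated date; `DocVersion` and `latest_versions` are hypothetical names, and a real pipeline would read these fields from your document store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocVersion:
    doc_id: str    # stable identifier shared by every version of a document
    updated: date  # date this version took effect
    text: str


def latest_versions(docs):
    """Keep only the most recent version of each document, sorted by doc_id."""
    latest = {}
    for doc in docs:
        current = latest.get(doc.doc_id)
        if current is None or doc.updated > current.updated:
            latest[doc.doc_id] = doc
    return sorted(latest.values(), key=lambda d: d.doc_id)
```

Run on a folder holding both a 2022 and a 2024 pricing page under the same `doc_id`, this passes only the 2024 page on to ingestion, so the older one can never be retrieved.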
Human escalation paths matter as much as any technical control. Identify the categories that warrant a handoff: complaints, regulatory language, expressions of dissatisfaction, requests for refunds. When these appear, the bot routes to a person and the conversation is flagged for review. Click4Assistance, which builds chatbots for UK contact centres, emphasises that pre-approved scripted answers and full interaction logging are compliance controls, not just quality improvements.
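A handoff rule of this kind can be sketched as a simple pattern check that runs before the bot replies. The trigger phrases, the `TRIGGERS` table, and the `handoff_category` and `handle` functions below are illustrative assumptions, not any vendor’s implementation; a production system would need a broader phrase list and regular testing against real conversations.

```python
import re

# Hypothetical trigger phrases for the four categories named above.
TRIGGERS = {
    "complaint": [r"\bcomplain(t|ts|ing)?\b"],
    "regulatory": [r"\bombudsman\b", r"\bregulator\b", r"\bfca\b"],
    "dissatisfaction": [r"\bunacceptable\b", r"\bnot happy\b", r"\bunhappy\b"],
    "refund": [r"\brefunds?\b", r"\bmoney back\b"],
}

HANDOFF_MESSAGE = "I'm passing this to a member of our team, who will reply shortly."


def handoff_category(message):
    """Return the first matching escalation category, or None if the bot may answer."""
    text = message.lower()
    for category, patterns in TRIGGERS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return category
    return None


def handle(message, bot_reply, review_log):
    """Route to a person and flag the conversation when a trigger appears;
    otherwise let the bot answer. `bot_reply` stands in for your chatbot call."""
    category = handoff_category(message)
    if category is None:
        return bot_reply(message)
    review_log.append({"message": message, "category": category})
    return HANDOFF_MESSAGE
```

Because the check sits outside the model, the escalation rule behaves the same way after a model upgrade, and the review log gives a record of every conversation that was handed off.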
On monitoring, run a weekly test of ten to fifteen standard questions and score the answers for accuracy. Store interaction logs. Add a simple thumbs-up or thumbs-down rating. When accuracy drops after a model or document update, you want to find out before a customer does.
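The weekly check can be partly automated. The sketch below assumes that a correct answer can be recognised by phrases it must contain, which is a simplification: it catches a stale fee or a missing policy name, but a person should still read the answers. `Case`, `run_checks`, and the `bot` callable are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Case:
    question: str
    must_contain: list[str]  # phrases a correct answer is expected to include


def run_checks(bot, cases):
    """Ask the bot every standard question and return (accuracy, failed questions).
    `bot` is any callable that takes a question and returns the reply text."""
    failures = []
    for case in cases:
        reply = bot(case.question).lower()
        if not all(phrase.lower() in reply for phrase in case.must_contain):
            failures.append(case.question)
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return accuracy, failures
```

Running the same cases before and after every model or document update, and keeping the scores, shows whether a change made answers worse before customers see it.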
What do UK regulations require on accuracy?
Three sources of UK regulation and guidance bear directly on chatbot accuracy. The UK GDPR accuracy principle, overseen by the ICO, requires that personal data used or generated by AI systems is accurate and correctable. The FCA’s Consumer Duty requires fair, clear, and not misleading customer communications regardless of what technology delivers them. The NCSC advises treating all LLM outputs as unverified and applying human oversight to anything consequential.
For financial services firms, the FCA has been direct: AI does not reduce your obligations under Consumer Duty. You must have systems to test and monitor output accuracy, correct issues when they arise, and record how you do so. In practice, that means a named person responsible for the chatbot’s content, not just the technology running it.
The EU AI Act adds a layer for UK firms with customers in the European Union. For limited-risk chatbots, the Act requires that users are told they are interacting with AI. For high-risk use cases such as credit decisions or employment screening, requirements for accuracy documentation, human oversight, and output traceability are considerably more demanding.
The CMA has also signalled a direction of travel through its AI foundation models programme. SMEs deploying consumer-facing chatbots built on large foundation models are expected to be transparent and avoid misleading outputs. That expectation is likely to become more formal over time.
For the typical owner-managed UK firm, the practical starting point is NCSC and ICO guidance. Treat model outputs as drafts to be validated. Document your controls. Review them when the underlying model, the document set, or the scope of the bot changes.



