How to use AI safely when accuracy matters

TL;DR

AI language models generate plausible-sounding text rather than retrieving verified facts, so some of what they produce is simply wrong: measured hallucination rates range from 3 to 27 percent of outputs, depending on task and prompt design. For owner-operated services firms, this creates professional liability risk: existing UK regulatory obligations, UK GDPR accuracy requirements, and professional body guidance all apply to AI-generated content you send to clients. Four controls close most of the gap: a classification framework, retrieval-augmented generation, mandatory human sign-off, and output logging.

Key takeaways

- LLMs predict statistically plausible text rather than retrieving verified facts, generating hallucinations in 3 to 27 percent of outputs depending on task type and prompt design.
- In a services firm, liability for AI-generated content sent to clients sits with the firm. UK GDPR accuracy requirements, FCA SM&CR, and professional body obligations all apply regardless of how the content was produced.
- Classifying output types before deployment, not case by case once AI is in use, is the operational safeguard that prevents high-risk outputs reaching clients unchecked.
- Retrieval-augmented generation (RAG) significantly reduces hallucination risk by grounding model responses in your own verified documents rather than general training data.
- A four-control framework covering deployment (RAG), prompting (require uncertainty disclosure and source citation), review (checklist sign-off), and logging (record who checked what) gives a defensible position with regulators and clients.

In February 2024, a tribunal in British Columbia ruled against Air Canada after its customer service chatbot wrongly told a customer he could claim a bereavement fare refund after travelling, which the airline's actual policy did not allow. The airline argued the chatbot was effectively a separate entity responsible for its own statements. The tribunal disagreed: Air Canada remained accountable for what appeared on its platform, regardless of how the content was produced.

A year earlier, a New York law firm used ChatGPT to research legal precedents for a court submission. Several of the cited cases did not exist. The judge imposed a $5,000 sanction and described the citations as fabricated.

Both situations have the same operating problem at their core. The AI produced wrong information confidently, with no outward signal to distinguish it from correct information. For any owner-operated services firm considering AI, that problem does not disappear by choosing a better tool. It has to be managed.

What makes an AI produce information that is simply wrong?

An AI language model does not retrieve verified facts from a database. It generates text by predicting the most plausible word sequence based on patterns in its training data. This means it can produce a citation, a figure, or a piece of advice that reads as authoritative but has no factual grounding at all. Researchers have measured hallucination rates between 3 and 27 percent, varying by task and prompt design.

Both OpenAI and Anthropic acknowledge this in their own documentation. OpenAI warns that models “may make up facts, figures, or citations”. Anthropic describes the issue as outputs “not grounded in the provided data or reality”. These are structural features of how large language models work, present even in the most capable models available today.

A peer-reviewed evaluation of LLMs answering medical questions, published in npj Digital Medicine, found clinically significant errors in 16 to 27 percent of responses. That error rate would be considered unacceptable in any clinical workflow without human review.

The practical implication is straightforward. AI is a drafting tool, not a source of truth. The quality of its output depends on the material you feed it, how you ask the question, and the judgement you apply when reviewing the answer.

Why does this create more risk in a services firm than elsewhere?

A services business stakes its reputation on the accuracy of what it delivers. If an AI-drafted document contains a wrong figure, a fabricated precedent, or advice that does not reflect current law, the client acts on it as if a qualified professional signed it off. The liability sits with the firm. UK professional regulations and UK GDPR make this explicit and apply regardless of how the content was produced.

The FCA has confirmed that the Senior Managers and Certification Regime applies to AI-related decisions, meaning a named senior individual is personally accountable for AI-assisted outcomes in their area. The SRA and Bar Standards Board have both issued guidance confirming that solicitors and barristers must verify AI-generated content before use. The GMC holds clinicians accountable for clinical decisions regardless of which tools helped generate them.

UK GDPR adds a separate layer. Accuracy is a core data protection principle under Article 5, requiring personal data to be accurate and, where necessary, kept up to date. Where solely automated decisions have legal or similarly significant effects on individuals, Article 22 adds further safeguards, including the right to human review. The ICO has made clear that firms remain accountable for AI-assisted decisions even when partly generated by a third-party model.

For a smaller regulated firm, the message is consistent: existing professional and data protection obligations apply to AI output as much as to any other form of client-facing work.

Where in a services business will your team actually encounter it?

The highest-exposure points are the places where AI is most attractive to deploy: drafting client correspondence, answering regulatory queries, producing figures for reports, summarising long documents, and running customer-facing chatbots. These are high-volume tasks where the efficiency gain is real. They are also the places where an inaccurate output travels furthest and fastest toward a client.

A customer-facing chatbot inventing a policy is a customer service and legal liability risk. A legal research tool fabricating case citations is a professional conduct risk. An AI-drafted financial report containing invented figures is an audit risk. These examples have already reached courts and tribunals. In each case, the failure arose from AI-generated content reaching a client without adequate verification.

The exposure is proportional to what happens downstream when the AI is wrong. An internal draft used as the basis for a strategic decision carries consequences if the underlying data is fabricated. A client-facing document, a signed report, or a customer chatbot response carries greater ones.

The highest-risk combination is high trust with low oversight. If a team member assumes AI output is verified because the tool sounds confident, and no review step exists, the inaccuracy reaches the client every time.

When does a human need to check, and when can you let it run?

The practical rule is to classify your output types before you deploy, not case by case once AI is already in use. Three categories cover the working territory. Client-facing outputs with factual claims need verification by a qualified person before they leave the firm. AI-assisted internal drafts need a reviewer before they drive a decision. Low-risk automation of structured data, formatting, or template population can usually run without per-output review.
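The three categories above can be written down as a simple lookup, so the review rule is decided once rather than per document. This is a minimal sketch, not a prescribed implementation: the task names in `TASK_CLASSIFICATION` are hypothetical examples, and defaulting unclassified tasks to the strictest category is an assumption made here for safety.

```python
from enum import Enum


class OutputType(Enum):
    CLIENT_FACING_FACTUAL = "client-facing output with factual claims"
    INTERNAL_DRAFT = "AI-assisted internal draft"
    LOW_RISK_AUTOMATION = "structured data, formatting, or template population"


# Review rule for each of the three categories described in the text.
REVIEW_RULES = {
    OutputType.CLIENT_FACING_FACTUAL: "verify by a qualified person before it leaves the firm",
    OutputType.INTERNAL_DRAFT: "review before it drives a decision",
    OutputType.LOW_RISK_AUTOMATION: "no per-output review required",
}

# Hypothetical task names; replace with the tasks your own team performs.
TASK_CLASSIFICATION = {
    "client letter": OutputType.CLIENT_FACING_FACTUAL,
    "chatbot reply": OutputType.CLIENT_FACING_FACTUAL,
    "internal strategy memo": OutputType.INTERNAL_DRAFT,
    "invoice template fill": OutputType.LOW_RISK_AUTOMATION,
}


def required_review(task: str) -> str:
    """Return the review rule for a task.

    Assumption: any task that has not been classified yet is treated as
    the strictest category until someone classifies it.
    """
    output_type = TASK_CLASSIFICATION.get(task, OutputType.CLIENT_FACING_FACTUAL)
    return REVIEW_RULES[output_type]
```

Calling `required_review("internal strategy memo")` returns the internal-draft rule, while a task nobody has classified yet falls back to full verification.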

The UK Government AI Playbook describes this as defining “no-mistake zones”: processes where an error could materially harm a client. For those, a qualified professional reviews and signs off the final output. AI may assist with research or drafting, but a human owns the conclusion.

The NCSC is direct about the principle: treat AI outputs as you would information from an untrusted source. For important content, verify against trusted primary sources before acting on them.

A one-page AI use policy for your team makes the classification explicit. It states which tasks allow AI-assisted drafting, which require sign-off before any client contact, and which categories should not use AI-generated output at all. The ICO expects firms to demonstrate accountability over how automated tools are used where personal data is involved and where decisions significantly affect individuals. A written policy that staff have seen is your starting point for that accountability.

What controls actually reduce the hallucination risk?

Four controls work together at SME scale: how you deploy the model, how you prompt it, how you review its output, and how you log what happened. None of them requires a technical team or a significant budget. Together they give you a defensible position if a client dispute, a regulatory enquiry, or a data subject complaint ever asks how you managed accuracy.

The first control is deployment. Retrieval-augmented generation (RAG) grounds model responses in your own verified documents rather than general training data. It does not eliminate errors, but it significantly narrows the space in which they can occur. Many enterprise tools now use RAG by default. Deployment decisions can also set process conditions: BT’s AI pilots built human-in-the-loop review into every customer-facing workflow as a condition of wider deployment.
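To make the retrieval step concrete, here is a toy sketch of the "retrieve first, then answer" idea. It is an illustration under stated assumptions, not how any particular product works: the documents in `VERIFIED_DOCS` are invented, and the crude word-overlap scoring stands in for the embedding-based search that real RAG tools typically use.

```python
import re

# Hypothetical verified firm documents; in practice these are your own files.
VERIFIED_DOCS = {
    "fees-policy.txt": "Our standard engagement fee is fixed and agreed in writing before work begins.",
    "complaints-policy.txt": "Complaints are acknowledged within five working days.",
    "holiday-policy.txt": "The office is closed on public holidays.",
}


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Return up to top_k (name, text) pairs that share words with the question.

    Documents with no overlapping words are left out, so the model is only
    ever shown material that plausibly relates to the question.
    """
    question_words = tokenize(question)
    scored = []
    for name, text in docs.items():
        overlap = len(question_words & tokenize(text))
        if overlap > 0:
            scored.append((overlap, name, text))
    # Highest overlap first; ties broken alphabetically by document name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, text) for _, name, text in scored[:top_k]]
```

With these sample documents, `retrieve("How quickly are complaints acknowledged?", VERIFIED_DOCS)` returns only the complaints policy. An empty result is itself useful: it signals that the firm's own documents do not cover the question, so nothing grounded can be drafted.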

The second is prompting. Ask the model to flag uncertainty and cite its source document for every substantive claim. An instruction along the lines of “if you cannot find support for this in the provided documents, say you don’t know” reduces hallucinations by design. It also makes review easier: the reviewer can check each cited source rather than searching for evidence from scratch.
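One way to apply this consistently is to build every prompt from the same template. The sketch below is illustrative only: the function name `build_grounded_prompt` and the exact instruction wording are assumptions, and the text it produces would be passed to whichever AI tool the firm uses.

```python
def build_grounded_prompt(question: str, sources: dict[str, str]) -> str:
    """Assemble a prompt that asks for citations and uncertainty disclosure.

    `sources` maps a document name to its text. Both the names and the
    instruction wording here are illustrative, not a vendor-recommended format.
    """
    source_block = "\n\n".join(f"[{name}]\n{text}" for name, text in sources.items())
    return (
        "Answer using only the source documents below.\n"
        "After every substantive claim, cite the source document name in square brackets.\n"
        "If you cannot find support for a claim in the provided documents, say you don't know.\n"
        "Flag anything you are uncertain about.\n\n"
        f"Source documents:\n{source_block}\n\n"
        f"Question: {question}\n"
    )
```

Because each claim should come back tagged with a document name, the reviewer can check the answer against a short list of named sources.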

The third is review. A short checklist for the sign-off person, covering whether facts and citations have been verified against primary sources and whether the content reflects current regulations, takes less than five minutes on a standard client document.

The fourth is logging. Record when AI was used, who reviewed the output, and what was changed before it reached the client. A dated note in the file is sufficient at SME scale. The NPCC AI Playbook for Policing recommends this level of documentation for any AI-assisted decision. It establishes that a human was accountable, which is what regulators look for.
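A dated note in the file is enough, but if you prefer a structured record, the same three facts fit in one line of a log file. This is a minimal sketch under assumed names: `AIUseRecord`, its fields, and the one-JSON-object-per-line format are illustrative choices, not a regulatory requirement.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AIUseRecord:
    """One entry per AI-assisted document. Field names are illustrative."""
    document: str      # what was produced
    ai_tool: str       # which tool was used
    reviewer: str      # who checked the output
    review_date: str   # ISO date of the review, e.g. "2025-01-31"
    changes_made: str  # what was changed before it reached the client


def append_record(record: AIUseRecord, log_path: str) -> None:
    """Append the record to the log file as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(asdict(record)) + "\n")
```

A reviewer would create an `AIUseRecord` after sign-off and call `append_record(record, "ai_use_log.jsonl")`, where the file name is whatever suits your filing system. Appending rather than overwriting keeps earlier entries intact, which matters if the log is ever used as evidence of review.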

The underlying principle behind all four is the one the NCSC has stated clearly since 2023: AI outputs are untrusted until a qualified person says otherwise. Getting that sequencing right (AI first, human second, sign-off third) is what responsible deployment looks like at any firm size. If you want help designing these controls for your specific firm, book a conversation.

Sources

- OpenAI (2024). Safety Best Practices. Warns that models may generate fabricated facts, figures, or citations and recommends human review for accuracy-critical outputs. https://platform.openai.com/docs/guides/safety-best-practices
- NCSC (2023). Using Public Generative AI. Advises organisations to treat AI outputs as untrusted and verify important content against reliable sources before acting on them. https://www.ncsc.gov.uk/collection/guidelines-for-secure-ai-system-development/using-public-generative-ai
- UK Government (2025). Artificial Intelligence Playbook for the UK Government. Requires validation checks for generative AI outputs and meaningful human oversight at high-risk decision points. https://assets.publishing.service.gov.uk/media/67aca2f7e400ae62338324bd/AI_Playbook_for_the_UK_Government__12_02_.pdf
- ICO (2023). AI and Data Protection. Covers the UK GDPR accuracy principle and accountability requirements for AI-assisted processing of personal data. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection/
- ICO (2023). Explaining Decisions Made with AI. Covers Article 22 automated-decision rights including the right to human review for decisions with significant individual impact. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/explaining-decisions-made-with-ai/
- Manakul et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Peer-reviewed study documenting hallucination rates up to 27 percent on tested tasks, validating the need for verification workflows. https://arxiv.org/abs/2305.13534
- Kung et al. (2023). Performance of ChatGPT on USMLE. npj Digital Medicine. Found clinically significant errors in 16 to 27 percent of LLM medical answers, supporting mandatory human oversight in regulated sectors. https://www.nature.com/articles/s41746-023-00858-w
- FCA (2022). Discussion Paper DP5/22: Artificial Intelligence and Machine Learning. Confirms the Senior Managers and Certification Regime applies to AI-related decisions and stresses data quality and model risk management. https://www.fca.org.uk/publications/discussion-papers/dp5-22-artificial-intelligence-and-machine-learning
- SRA (2023). Guidance on Generative Artificial Intelligence. Confirms solicitors must verify AI-generated content before use and maintain client confidentiality when using AI tools. https://www.sra.org.uk/solicitors/guidance/guidance-on-generative-artificial-intelligence/
- Civil Resolution Tribunal (2024). Moffatt v Air Canada. Ruled Air Canada liable for chatbot inaccuracies, establishing that operators cannot disclaim responsibility for AI-generated customer communications. https://decisions.civilresolutionbc.ca/crt/moffatt-v-air-canada

Frequently asked questions

How do I know if an AI output is accurate enough to send to a client?

Verify any factual claims, figures, or regulatory references against primary sources before the output leaves the firm. AI outputs that involve specific legal, financial, clinical, or technical claims should be signed off by a qualified professional. If you cannot point to a verifiable source for every substantive statement in the document, it is not ready to send.

Does UK law hold my firm responsible if AI produces wrong information that harms a client?

Yes. The FCA, ICO, SRA, and courts have all confirmed that existing professional obligations and UK GDPR accuracy requirements apply regardless of whether a human or an AI tool generated the content. The 2024 Air Canada ruling, though a Canadian tribunal decision rather than UK authority, shows how the argument plays out: a firm cannot shift liability to its chatbot. Your professional indemnity cover and regulatory standing are at risk if inaccurate AI outputs reach clients unchecked.

Does retrieval-augmented generation (RAG) solve the hallucination problem?

RAG significantly reduces hallucination risk by grounding model responses in your own verified documents rather than general training data. It does not eliminate the risk entirely. Models can still misread documents, omit important caveats, or misapply context. RAG is one control rather than a complete solution. It works best alongside prompts that require uncertainty disclosure, a human review step, and a logging record of what was checked.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
