How to use AI safely when accuracy matters

In February 2024, a tribunal in British Columbia ruled against Air Canada after its customer service chatbot invented a bereavement discount that did not exist. The airline argued the chatbot was effectively a separate entity responsible for its own statements. The tribunal disagreed. Air Canada remained accountable for what appeared on its platform, regardless of how the content was produced.

A year earlier, a New York law firm used ChatGPT to research legal precedents for a court submission. Several of the cited cases did not exist. The judge imposed a $5,000 sanction and described the citations as fabricated.

Both situations have the same operating problem at their core. The AI produced wrong information confidently, with no outward signal to distinguish it from correct information. For any owner-operated services firm considering AI, that problem does not disappear by choosing a better tool. It has to be managed.

What makes an AI produce information that is simply wrong?

An AI language model does not retrieve verified facts from a database. It generates text by predicting the likeliest word sequence based on patterns in its training data. This means it can produce a citation, a figure, or a piece of advice that reads as authoritative but has no factual grounding at all. Researchers have measured hallucination rates between 3 and 27 percent, varying by task and prompt design.

Both OpenAI and Anthropic acknowledge this in their own documentation. OpenAI warns that models “may make up facts, figures, or citations”. Anthropic describes the issue as outputs “not grounded in the provided data or reality”. These are structural features of how large language models work, present even in today’s best models.

A peer-reviewed evaluation of LLMs answering medical questions, published in npj Digital Medicine, found clinically significant errors in 16 to 27 percent of responses. That error rate would be considered unacceptable in any clinical workflow without human review.

The practical implication is straightforward. AI is a drafting tool, not a source of truth. The quality of its output depends on the material you feed it, how you ask the question, and the judgement you apply when reviewing the answer.

Why does this create more risk in a services firm than elsewhere?

A services business stakes its reputation on the accuracy of what it delivers. If an AI-drafted document contains a wrong figure, a fabricated precedent, or advice that does not reflect current law, the client acts on it as if a qualified professional signed it off. The liability sits with the firm. UK professional regulations and UK GDPR make this explicit and apply regardless of how the content was produced.

The FCA has confirmed that the Senior Managers and Certification Regime applies to AI-related decisions, meaning a named senior individual is personally accountable for AI-assisted outcomes in their area. The SRA and Bar Standards Board have both issued guidance confirming that solicitors and barristers must verify AI-generated content before use. The GMC holds clinicians accountable for clinical decisions regardless of which tools helped generate them.

UK GDPR adds a separate layer. Accuracy is a core data protection principle under Article 5, requiring personal data to be correct and up to date. Where automated processing significantly affects individuals, additional safeguards and the right to human review apply. The ICO has made clear that firms remain accountable for AI-assisted decisions even when partly generated by a third-party model.

For a smaller regulated firm, the message is consistent. Existing professional and data protection obligations apply to AI output as much as to any other form of client-facing work.

Where in a services business will your team actually encounter it?

The highest-exposure points are the places where the case for deploying AI is strongest: drafting client correspondence, answering regulatory queries, producing figures for reports, summarising long documents, and running customer-facing chatbots. These are high-volume tasks where the efficiency gain is real. They are also the places where an inaccurate output travels furthest and fastest toward a client.

A customer-facing chatbot inventing a policy is a customer service and legal liability risk. A legal research tool fabricating case citations is a professional conduct risk. An AI-drafted financial report containing invented figures is an audit risk. These examples have already reached courts and tribunals. In each case, the failure arose from AI-generated content reaching a client without adequate verification.

The exposure is proportional to what happens downstream when the AI is wrong. An internal draft used as the basis for a strategic decision carries consequences if the underlying data is fabricated. A client-facing document, a signed report, or a customer chatbot response carries greater ones.

The highest-risk combination is high trust with low oversight. If a team member assumes AI output is verified because the tool sounds confident, and no review step exists, the inaccuracy reaches the client every time.

When does a human need to check, and when can you let it run?

The practical rule is to classify your output types before you deploy, not case by case once AI is already in use. Three categories cover the working territory. Client-facing outputs with factual claims need verification by a qualified person before they leave the firm. AI-assisted internal drafts need a reviewer before they drive a decision. Low-risk automation of structured data, formatting, or template population can usually run without per-output review.
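The three categories above can be written down as a simple lookup, so the decision is made once rather than per document. The sketch below is illustrative only: the output type names are hypothetical, and defaulting unclassified work to the strictest category is an assumption, not something the guidance prescribes.

```python
from enum import Enum


class Review(Enum):
    QUALIFIED_SIGN_OFF = "verified by a qualified person before it leaves the firm"
    REVIEWER_BEFORE_DECISION = "checked by a reviewer before it drives a decision"
    NO_PER_OUTPUT_REVIEW = "can usually run without per-output review"


# Hypothetical output types; replace with the ones your firm actually produces.
POLICY = {
    "client_letter": Review.QUALIFIED_SIGN_OFF,
    "client_report": Review.QUALIFIED_SIGN_OFF,
    "internal_draft": Review.REVIEWER_BEFORE_DECISION,
    "template_population": Review.NO_PER_OUTPUT_REVIEW,
}


def required_review(output_type: str) -> Review:
    # Assumption: anything not yet classified gets the strictest treatment.
    return POLICY.get(output_type, Review.QUALIFIED_SIGN_OFF)


print(required_review("internal_draft").value)
```

The same table works just as well in a spreadsheet or on the one-page policy; the point is that every output type has an answer before anyone opens the tool.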

The UK Government AI Playbook describes this as defining “no-mistake zones”, meaning processes where an error could materially harm a client. For those, a qualified professional reviews and signs off the final output. AI may assist with research or drafting, but a human owns the conclusion.

The NCSC is direct about the principle. Treat AI outputs as you would information from an untrusted source. For important content, verify it against trusted primary sources before acting on it.

A one-page AI use policy for your team makes the classification explicit. It states which tasks allow AI-assisted drafting, which require sign-off before any client contact, and which categories should not use AI-generated output at all. The ICO expects firms to demonstrate accountability over how automated tools are used where personal data is involved and where decisions significantly affect individuals. A written policy that staff have seen is your starting point for that accountability.

What controls actually reduce the hallucination risk?

Four controls work together at SME scale, covering how you deploy the model, how you prompt it, how you review its output, and how you log what happened. None of them requires a technical team or a significant budget. Together they give you a defensible position if a client dispute, a regulatory enquiry, or a data subject complaint ever asks how you managed accuracy.

The first control is deployment. Retrieval-augmented generation (RAG) grounds model responses in your own verified documents rather than general training data. It does not eliminate errors, but it significantly narrows the space in which they can occur. Many enterprise tools now use RAG by default. BT’s AI pilots, for instance, built human-in-the-loop review into every customer-facing workflow as a condition of wider deployment.
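To make the retrieval idea concrete, here is a toy sketch of the retrieval step on its own. It is not how any particular product works: real RAG systems typically use semantic search rather than the plain word overlap used here, and the function names and sample documents are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    # Lowercase words and numbers, ignoring punctuation.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Return IDs of the documents sharing the most words with the question."""
    question_words = tokens(question)
    scored = [(len(question_words & tokens(text)), doc_id)
              for doc_id, text in documents.items()]
    # Drop documents with no overlap; rank the rest by overlap, then by ID.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


# Hypothetical firm documents that have already been checked by a person.
verified_docs = {
    "fees": "Our bereavement fares policy and refund terms",
    "hours": "Office opening hours and holidays",
}
print(retrieve("What is the bereavement refund policy?", verified_docs))
```

Only the retrieved documents are then passed to the model alongside the question. If nothing relevant is found, the list is empty, and that is the cue for the system to say it does not know rather than improvise.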

The second is prompting. Ask the model to flag uncertainty and cite its source document for every substantive claim. An instruction along the lines of “if you cannot find support for this in the provided documents, say you don’t know” gives the model an explicit alternative to guessing, which reduces the room for hallucination without removing it. It also makes review easier, because the reviewer can check each cited source rather than searching for evidence from scratch.
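One way to apply this consistently is to build the prompt from a fixed template instead of retyping the instructions each time. The sketch below assumes the documents are already held as plain text keyed by an ID; the function name and the bracketed ID convention are illustrative choices, not a vendor requirement.

```python
def build_grounded_prompt(question: str, documents: dict[str, str]) -> str:
    """Combine standing instructions, source documents, and the question."""
    sources = "\n\n".join(f"[{doc_id}]\n{text}" for doc_id, text in documents.items())
    return (
        "Answer using only the documents below.\n"
        "Cite the document ID in square brackets after every substantive claim.\n"
        "If you cannot find support for this in the provided documents, "
        "say you don't know.\n\n"
        f"Documents:\n{sources}\n\n"
        f"Question: {question}"
    )


# Hypothetical example document and question.
prompt = build_grounded_prompt(
    "What is the refund window?",
    {"policy-01": "Refunds are available within 30 days of purchase."},
)
print(prompt)
```

The resulting text can be pasted into whichever tool the firm uses. The instruction does not guarantee compliance by the model, so the review step still applies.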

The third is review. A short checklist for the sign-off person, covering whether facts and citations have been verified against primary sources and whether the content reflects current regulations, takes less than five minutes on a standard client document.
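The checklist can be as simple as a fixed list that must be fully ticked before anything is sent. The sketch below uses only the two checks named above; the identifiers are hypothetical, and a real checklist would likely add firm-specific items.

```python
# The two checks described in the text; extend with firm-specific items.
CHECKLIST = (
    "facts_and_citations_verified_against_primary_sources",
    "content_reflects_current_regulations",
)


def outstanding_items(ticked: set[str]) -> list[str]:
    """Return the checklist items the reviewer has not yet confirmed."""
    return [item for item in CHECKLIST if item not in ticked]


def ready_to_send(ticked: set[str]) -> bool:
    # The document is ready only when nothing is outstanding.
    return not outstanding_items(ticked)


print(outstanding_items({"facts_and_citations_verified_against_primary_sources"}))
```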

The fourth is logging. Record when AI was used, who reviewed the output, and what was changed before it reached the client. A dated note in the file is sufficient at SME scale. The NPCC AI Playbook for Policing recommends this level of documentation for any AI-assisted decision. It establishes that a human was accountable, which is what regulators look for.
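For firms that prefer a structured record to a free-text note, the three facts above fit in a single line per document. This is a minimal sketch: the field names, the sample values, and the choice of JSON are assumptions, not a format that any regulator prescribes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class AIUseRecord:
    document: str
    ai_used: bool
    reviewer: str
    changes_made: str
    reviewed_on: str  # ISO date, e.g. "2025-01-15"


def to_log_line(record: AIUseRecord) -> str:
    """Serialise one record as a single JSON line for the client file."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example entry.
record = AIUseRecord(
    document="client-report-draft",
    ai_used=True,
    reviewer="J. Smith",
    changes_made="Corrected two figures against the source ledger",
    reviewed_on=date(2025, 1, 15).isoformat(),
)
print(to_log_line(record))
```

Appending one such line to a file each time AI-assisted work is signed off gives a dated trail that can be searched later.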

The underlying principle behind all four is the one the NCSC has stated clearly since 2023. AI outputs are untrusted until a qualified person says otherwise. Getting that sequence right (AI first, human second, sign-off third) is responsible deployment at any firm size. If you want help designing these controls for your specific firm, book a conversation.
