When an AI system goes wrong in production

A senior account manager at a professional services firm noticed the problem on a Thursday afternoon. A client had replied to say their summary report contained figures from a different engagement. The AI drafting tool had been pulling context from a previous session for three weeks, and nobody in the team had caught it. The immediate priority was clear enough. What do they do right now, and in what order?

What does it mean for an AI system to go wrong in production?

An AI system goes wrong in production when it is live, connected to real workflows, and producing incorrect, harmful, or non-compliant outputs. That covers a wide range, from a language model that hallucinates client advice to an automated tool that acts on the wrong data or a workflow system that routes information to the wrong recipient. The type of failure shapes your response, so naming it clearly is the first practical move.

Production failures are not always dramatic. Many begin as silent quality degradation: outputs drifting from accurate to plausible-but-wrong, or automations carrying out the right action on the wrong record. The Latitude production failure guide identifies goal drift, context loss, and cascading errors in multi-agent systems among the common patterns. These are workflow failures as much as model failures, which means treating the failure as a prompt-tuning problem rarely addresses the root cause.

The NCSC’s incident management framework puts containment before diagnosis. When you spot a problem, stop the system or throttle it first. Disable the risky workflow, revoke API keys if the system is making external calls, or switch to human-only processing for that task. Investigating while the system is still running risks extending the damage before you have any picture of what went wrong.
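A containment switch can be very small. The sketch below is illustrative only: the KillSwitch class, the "report_drafting" workflow name, and the draft_report function are hypothetical names, and a real switch would sit in shared configuration rather than in memory. It shows the idea of checking the switch before every AI call and falling back to human-only processing when the workflow is disabled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class KillSwitch:
    """Hypothetical in-memory switch; assume shared config in a real setup."""
    disabled: dict = field(default_factory=dict)  # workflow name -> reason

    def disable(self, workflow, reason):
        # Record when and why the workflow was stopped, for the incident log.
        stamp = datetime.now(timezone.utc).isoformat()
        self.disabled[workflow] = f"{stamp} {reason}"

    def is_enabled(self, workflow):
        return workflow not in self.disabled


def draft_report(switch, client_id, ai_draft):
    """Route to a human queue when the drafting workflow is switched off."""
    if not switch.is_enabled("report_drafting"):
        return {"client_id": client_id, "route": "human", "draft": None}
    return {"client_id": client_id, "route": "ai", "draft": ai_draft(client_id)}
```

The point of the design is that disabling the workflow takes one call and no deployment, so whoever spots the problem can stop it before anyone starts diagnosing.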

Why does your response sequence matter more than the fix?

The order in which you respond to an AI incident affects how bad the consequences turn out to be. Containment first, then classification, then evidence preservation, then remediation. Reversing that sequence, by trying to fix the root cause while the system is still running, is how small incidents become large ones.

Once the system is stopped, classify what happened. Was this a data breach? Did personal data go to the wrong person? Did the AI produce advice that a client acted on? Did it trigger actions in downstream systems? Classification determines who else needs to be involved. A data breach involving personal information triggers the ICO’s 72-hour notification window if the breach is likely to result in a risk to individuals’ rights and freedoms. A failure touching regulated financial services activity brings FCA governance expectations into play. A cyber incident activates your insurer and, in some cases, your solicitor.
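The classification questions above can be written down as a short triage function. This is a sketch under assumptions, not legal advice: the IncidentFacts fields, the party labels, and the mapping from answers to parties are illustrative choices, and the deadline helper assumes the 72-hour clock starts when the firm becomes aware of the breach.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentFacts:
    """Hypothetical yes/no answers to the classification questions."""
    personal_data_involved: bool
    likely_risk_to_individuals: bool
    touches_regulated_activity: bool
    client_acted_on_output: bool
    downstream_actions_triggered: bool


def who_to_involve(facts):
    """Return an ordered list of parties to bring in (illustrative mapping)."""
    parties = []
    if facts.personal_data_involved:
        parties.append("data protection lead")
        if facts.likely_risk_to_individuals:
            parties.append("ICO notification assessment")
    if facts.touches_regulated_activity:
        parties.append("compliance (FCA expectations)")
    if facts.client_acted_on_output or facts.downstream_actions_triggered:
        parties.append("insurer or solicitor")
    return parties


def ico_deadline(became_aware: datetime) -> datetime:
    """72 hours from awareness; assumes the clock starts at that moment."""
    return became_aware + timedelta(hours=72)
```

For the Thursday-afternoon example, a breach noticed at 15:00 on a Thursday would give a deadline of 15:00 on the following Sunday, which is why classification cannot wait until Monday.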

Preserve evidence before you begin any repair work. Save prompts, tool call logs, outputs, timestamps, and user actions so you can reconstruct what happened. For a small firm, this usually means screenshots, export files, and a short written account of when the problem was first noticed and what the symptoms were. That record matters if a client asks questions, a regulator makes enquiries, or your insurer needs to assess the claim.
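Where the tool does expose its logs, the same record can be captured in a structured form. The sketch below is a minimal illustration: the field names and both function names are assumptions, not any vendor's export format. It bundles the prompt, output, tool calls, and user actions with timestamps, and adds a SHA-256 digest so that later edits to the record can be detected.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(fields):
    # Stable serialisation so the same content always hashes the same way.
    body = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_evidence_record(prompt, output, tool_calls, user_actions,
                          noticed_at, symptoms):
    """Bundle what happened into one record (hypothetical field names)."""
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "noticed_at": noticed_at,
        "symptoms": symptoms,
        "prompt": prompt,
        "output": output,
        "tool_calls": tool_calls,
        "user_actions": user_actions,
    }
    record["sha256"] = _digest(record)
    return record


def verify_evidence_record(record):
    """Return True if the record still matches its stored digest."""
    fields = {k: v for k, v in record.items() if k != "sha256"}
    return _digest(fields) == record["sha256"]
```

Writing the result to a file with json.dump and storing it outside the affected system keeps the record available even if the tool itself is reset during remediation.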

Where do production AI failures show up in a services firm?

Production AI failures in a services firm tend to cluster around four points: outbound communications, client-facing documents, integrations with external systems, and automated decision steps. Each exposes a different type of risk. Outbound failures tend to be visible quickly. Integration failures can go unnoticed for days or weeks. The type of failure determines both the urgency and the remediation path.

Outbound communications are the failure point that shows up first. An AI that drafts client emails, writes summaries, or generates reports is only one session-context error away from including the wrong client’s information. The failure is immediate and obvious, but by the time it is noticed, the harm has already reached the client.

Integrations with external tools carry a subtler risk. When an AI agent is connected to CRM, accounting, or ticketing systems and authorised to take actions, a loop error or a context misread can create bad records, trigger wrong transactions, or send data outside the intended boundary. The NimbleBrain production AI guide notes that systems built without recovery paths, retry logic, and human escalation points for unknown situations are particularly exposed. For a small firm, the integration was often set up quickly and the failure modes were never mapped.

Document generation is the third common failure point. AI tools that produce contracts, reports, or proposals from template logic can propagate the same error across every document in a batch before anyone notices. One bad assumption in the context window is all it takes.

When does an AI incident become a regulatory matter?

An AI incident in a UK services firm becomes a regulatory matter when personal data is involved, when the system touches regulated activities, or when a client has suffered a concrete harm. Three bodies are most relevant in the UK: the ICO for data protection, the FCA for regulated financial services, and your insurer or solicitor for contractual or legal exposure. Classification happens before remediation.

The ICO is the first checkpoint. If your AI system accessed, disclosed, or processed personal data incorrectly, you need to assess whether the incident is likely to result in a risk to individuals’ rights and freedoms. If it is, the ICO expects notification within 72 hours. The ICO’s guidance on AI and data protection is clear that AI is not exempt from UK GDPR principles. Fairness, transparency, data minimisation, and accuracy all apply, even when the processing is automated.

The FCA is relevant if your firm works in or for regulated financial services, or if the AI touched client onboarding, advice workflows, or operational processes that fall under FCA oversight. The FCA’s 2024 AI update made clear that firms remain responsible for outcomes and are expected to maintain governance and control even when AI is doing the work. A failure in that context is a conduct, record-keeping, and controls matter, not only an IT incident.

If neither applies, which is true for a firm using AI only for internal drafting with no client data and no external integrations, the response burden is substantially lower. Fix the problem, document what happened, and update your runbook.

What should you put in place before the next incident?

The practical lesson from every AI production failure is that the response playbook should exist before the incident, not be written during it. A short runbook covering four things is sufficient for a 5 to 50 person firm: who can switch the system off, what counts as a reportable incident, where the logs live, and what "safe enough to re-enable" means.

The UK government’s AI Playbook frames this as planning for errors, failure, and human override. In practice, the whole thing fits on one page with a named person assigned to each decision point.
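One way to keep that page honest is to hold it as structured data and check it for gaps. The sketch below is only an illustration: the Runbook fields are hypothetical names for the four runbook items (who can switch the system off, what counts as reportable, where the logs live, what makes it safe to re-enable), and the check simply reports which of them are still blank.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical one-page runbook; every field starts empty."""
    off_switch_owner: str = ""
    reportable_incident_definition: str = ""
    log_locations: list = field(default_factory=list)
    re_enable_criteria: list = field(default_factory=list)


def unanswered(runbook):
    """List the runbook questions that have no answer yet."""
    gaps = []
    if not runbook.off_switch_owner.strip():
        gaps.append("who can switch the system off")
    if not runbook.reportable_incident_definition.strip():
        gaps.append("what counts as a reportable incident")
    if not runbook.log_locations:
        gaps.append("where the logs live")
    if not runbook.re_enable_criteria:
        gaps.append("what safe enough to re-enable means")
    return gaps
```

An empty result from unanswered means each decision point has at least a first answer; it says nothing about whether the answer is a good one, which still needs a person to review.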

The second piece is turning each incident into a regression test. Latitude’s four-step framework for AI production failures works well at small scale: collect the trace, cluster the failure type, identify the root cause, and create an evaluation case that catches the same failure in future. A firm that treats incidents as a source of evaluation data builds more reliable AI over time.
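The last step, creating an evaluation case, might look like the sketch below for a wrong-client leak. It is an assumption-laden illustration rather than Latitude's implementation: the trace keys (incident_id, prompt, leaked_terms), the EvalCase class, and the generate callable standing in for the drafting tool are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """A replayable check built from one incident (hypothetical shape)."""
    name: str
    prompt: str
    forbidden_terms: list  # strings that must not appear in the output


def case_from_incident(trace):
    """Turn a collected trace into a regression case."""
    return EvalCase(
        name=f"regression-{trace['incident_id']}",
        prompt=trace["prompt"],
        forbidden_terms=list(trace["leaked_terms"]),
    )


def run_case(case, generate):
    """Replay the prompt and fail if any forbidden term reappears."""
    output = generate(case.prompt)
    leaked = [t for t in case.forbidden_terms if t.lower() in output.lower()]
    return {"case": case.name, "passed": not leaked, "leaked": leaked}
```

Running every stored case before re-enabling the workflow, and again after any change to prompts or integrations, gives a concrete meaning to "safe enough to re-enable".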

The third point is contractual. Do not assume your vendor will handle liability or notification for you. The ICO and FCA frameworks place responsibility on the deploying organisation, regardless of where the model lives. Read the contract before an incident happens, understand what your vendor does and does not cover in a failure scenario, and know which notifications you are responsible for making yourself.

If you want to think through your firm’s exposure before something goes wrong, book a conversation and we can work through it together.
