Six months after a successful AI launch, the tool is running, the team has adapted, and attention has moved on to the next priority. That is usually when the problems begin. The model was validated against a dataset that is now six months old. The world it operates in has shifted, and the tool has no mechanism to notice.
What is AI model degradation?
A deployed machine learning model does not hold its launch performance indefinitely. The data it encounters in production gradually drifts away from the data it was trained on, and accuracy, consistency, or both begin to slip. Research by academics from MIT, Harvard, and the University of Monterrey found that 91% of machine learning models degrade in performance after deployment, a pattern the researchers term AI ageing.
This is not a flaw that better implementation would have prevented. Machine learning systems learn from historical data, then operate in a world that keeps changing. Customer language shifts, product catalogues update, query patterns evolve across the seasons. The model does not update automatically to reflect any of this. Every day it runs, the gap between what it was trained on and what it encounters in production grows a little wider.
The research documents two things happening in parallel: overall error rates climbing as model outputs diverge from ground truth, and the spread between best and worst outputs widening. That second signal matters because erratic performance is harder to catch than uniformly poor performance. A consistently wrong system gets flagged by the people using it. An erratic one gets explained away.
Why does this matter if the tool is still running?
A tool that is technically running but degrading below its launch performance creates risks that do not appear on a system dashboard. The team compensates for errors rather than reporting them, and the signal that would trigger a review disappears into workarounds. By the time the degradation is visible externally, whether in a customer complaint or a compliance flag, the early intervention window has closed.
MIT Sloan Management Review frames AI maintenance through the lens of technical debt, and the analogy is useful for non-technical leaders. There is a principal (the accumulated gap between where the model is and where it should be), interest (the extra work the team does to compensate for poor outputs), and opportunity cost (the decisions made on flawed AI outputs that cannot be reversed). All three accrue without showing up in any obvious report.
The compounding effect is what makes early monitoring pay. At launch, an error rate within tolerance is manageable. Twelve months in, if error rates are rising and output variability is growing, the tool is creating a materially different risk profile to the one the original business case priced in. The approved decision and the actual behaviour of the system have separated, and the gap is invisible unless someone is looking for it.
What signals should you be watching for?
You don’t need data science capability to run a useful monitoring regime. The three signals that matter are error rate trends tracked over time rather than single snapshots, output quality spot-checks by someone who knows what a good output looks like, and the spread between the best and worst outputs the tool produces in any given week. All three are trackable without dedicated technical infrastructure and belong in a monthly review.
Error rate trends are the most reliable of the three. A single snapshot tells you nothing about direction; a chart over eight weeks tells you whether performance is stable, declining gradually, or dropping sharply. If your vendor or implementation team cannot produce this chart on request, that itself is diagnostic information about the monitoring you do not have.
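For anyone who keeps the weekly error rates in a spreadsheet, the eight-week chart can be reduced to a single trend figure. The following is a minimal sketch, not a prescribed method: it fits a straight line through the weekly readings and labels the result. The function names and the two thresholds (0.1 and 0.5 percentage points per week) are illustrative assumptions and should be set against your own tolerance.

```python
from statistics import mean


def weekly_slope(error_rates):
    """Least-squares slope of the error rate, in units per week.

    error_rates: one reading per week, oldest first (e.g. percentages).
    """
    n = len(error_rates)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    weeks = range(n)
    x_bar = mean(weeks)
    y_bar = mean(error_rates)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, error_rates))
    denominator = sum((x - x_bar) ** 2 for x in weeks)
    return numerator / denominator


def classify_trend(error_rates, gradual=0.1, sharp=0.5):
    """Label the trend. Thresholds are assumed values, not from the research."""
    slope = weekly_slope(error_rates)
    if slope >= sharp:
        return "dropping sharply"      # error rate rising fast
    if slope >= gradual:
        return "declining gradually"   # error rate creeping up
    return "stable"


# Eight weekly error rates (percent), oldest first.
readings = [4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4]
print(classify_trend(readings))  # declining gradually
```

A fitted slope is less sensitive to one unusual week than a comparison of first and last readings, which is why it is used here.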
Output quality spot-checks are the human layer that data alone misses. Take ten outputs per week at random, review them against a simple quality rubric, and score them. This does not need to be a data scientist’s job. It needs to be someone’s job, with a named owner and a standing slot in a regular meeting.
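The weekly draw of ten outputs can be made genuinely random, so the reviewer is not unconsciously picking the easy cases. This is a small sketch under assumptions: the output identifiers, the rubric criteria, and the function names are all hypothetical placeholders for whatever your own tool and rubric use.

```python
import random


def draw_spot_check(output_ids, sample_size=10, seed=None):
    """Pick outputs at random for review, without repeats.

    Passing a seed makes the draw repeatable, which helps if the
    selection ever needs to be audited.
    """
    rng = random.Random(seed)
    ids = list(output_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def rubric_score(checks):
    """Fraction of rubric criteria an output passed.

    checks: mapping of criterion name to True/False, filled in by the reviewer.
    """
    if not checks:
        raise ValueError("rubric has no criteria")
    return sum(checks.values()) / len(checks)


# Hypothetical week: 200 outputs, identified by number.
this_week = draw_spot_check(range(1, 201), seed=2024)

# Hypothetical three-point rubric for one reviewed output.
score = rubric_score({"accurate": True, "complete": True, "right_tone": False})
```

The scores from each week's ten reviews are what feed the trend and spread checks, so it is worth recording them in the same place every week.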
The spread between best and worst outputs is the leading indicator the research highlights specifically. When the best outputs still look good but the worst are getting noticeably worse, the model is entering a degradation phase before average quality visibly drops. Catching the widening spread catches the problem earlier, and that timing difference is where the cost savings live.
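The widening spread can be checked from the same weekly spot-check scores. The sketch below is one simple way to do it, assuming each output has been given a score out of ten: it measures the gap between the best and worst score each week and flags a run of consecutive increases. The three-week run length is an assumed setting, not a figure from the research.

```python
def spread(scores):
    """Gap between the best and worst score in one week."""
    return max(scores) - min(scores)


def is_widening(weekly_scores, weeks=3):
    """True if the spread grew in each of the last `weeks` week-on-week steps.

    weekly_scores: list of weekly score lists, oldest week first.
    """
    spreads = [spread(week) for week in weekly_scores]
    if len(spreads) < weeks + 1:
        return False  # not enough history to judge
    recent = spreads[-(weeks + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Illustrative scores: the best output stays at 9 while the worst slips.
history = [
    [9, 8, 8, 7],
    [9, 8, 8, 6],
    [9, 8, 7, 5],
    [9, 8, 7, 4],
]
print([spread(week) for week in history])  # [2, 3, 4, 5]
print(is_widening(history))                # True
```

In this example the weekly average falls only from 8 to 7 while the spread more than doubles, which is the pattern described above: the spread moves before the average does.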
When should you act, and what are the options?
Act when a trend becomes clear, not when the tool fails completely. Three options are available to a non-technical owner facing a confirmed trend: retrain the model on fresh data, re-scope it to a narrower task where performance is still reliable, or retire and replace it. Waiting for a visible breakdown makes all three options more expensive and harder to negotiate.
Retraining requires a pre-agreed protocol. If the vendor or implementation team has not built one into the contract, you will be negotiating it under pressure after the degradation is already externally visible. The time to establish the retraining workflow and its cost is at launch, or immediately after the first monitoring review surfaces a declining trend.
Re-scoping is the option that gets underused. A model deployed across a broad range of customer queries may still perform well on a specific subset of them. Narrowing the scope, routing the harder queries to human handlers, and retraining only within that narrowed set is often faster and cheaper than full retraining, and it reduces the risk exposure while the longer-term fix is prepared.
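A re-scoping decision can be drawn up from spot-check results grouped by query type. The sketch below is illustrative only: the category names, the accuracy figures, and the 90% cut-off are assumptions made for the example, and the right cut-off depends on what your business case can tolerate.

```python
def rescope(accuracy_by_category, min_accuracy=0.9):
    """Split query categories into those the model keeps and those sent to people.

    accuracy_by_category: mapping of category name to observed accuracy (0 to 1).
    """
    keep = sorted(c for c, acc in accuracy_by_category.items() if acc >= min_accuracy)
    to_humans = sorted(c for c, acc in accuracy_by_category.items() if acc < min_accuracy)
    return keep, to_humans


def route_query(category, keep):
    """Send a query to the model only if its category is still in scope."""
    return "model" if category in keep else "human"


# Hypothetical accuracy by query type, taken from recent spot-checks.
observed = {
    "order_status": 0.97,
    "returns": 0.93,
    "billing_disputes": 0.78,
    "product_advice": 0.84,
}
keep, to_humans = rescope(observed)
print(keep)       # ['order_status', 'returns']
print(to_humans)  # ['billing_disputes', 'product_advice']
```

Any category the model has not been assessed on is routed to a person by default, which keeps the narrowed scope conservative while the longer-term fix is prepared.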
MIT Sloan’s PAID framework offers a useful triage for deciding how urgently to act. When technical debt is high and business impact is high, immediate remediation is required. When debt is high but business impact is still low, schedule the fix before impact rises. The framework converts a technical assessment into a business prioritisation decision, which is the conversation that actually needs to happen between the person who owns the tool and the person who owns the budget.
What else connects to this?
AI model degradation is one piece of a broader set of post-deployment challenges that many owner-managed businesses have not built processes for. Related areas include vendor dependency risk, the wider category of AI technical debt, and the EU AI Act’s lifecycle requirements for systems in the high-risk classification. Understanding where model decay sits within that picture helps you ask better questions of whoever manages your AI tools.
Vendor dependency is where the same pattern plays out at a larger scale. If your AI tool relies on a single provider and that provider is acquired, pivots, or discontinues the product, you need to know whether your data is portable and whether the system can be maintained without them. Consumer Reports Innovation documented what this looks like in practice with the Humane AI Pin, where an asset sale to HP left owners with devices their warranties could not save. Enterprise AI tools follow the same logic on a longer timeline.
The EU AI Act is worth knowing about even if your systems do not fall into the high-risk category. Its requirements for high-risk systems, including lifecycle monitoring, performance recordkeeping, and human oversight provisions, describe what good post-deployment practice looks like regardless of regulatory obligation. Operators preparing for an exit are adopting these standards voluntarily, because acquirers increasingly treat documented AI maintenance as a signal of operational maturity rather than a regulatory compliance box.
The monitoring regime does not need to be complex. It needs to be consistent, assigned to a named owner, and connected to a decision framework that tells you when a trend has become a problem worth acting on. That is the difference between a deployed AI tool and one that is actually maintained.
If you want to audit what you have in place before the next performance review, a conversation is a good place to start.