Is your AI getting worse after launch? What to monitor and when to act

TL;DR

AI tools don't hold their launch performance indefinitely. Research shows 91% of machine learning models degrade after deployment as real-world data drifts away from training conditions. A non-technical owner can catch this early by tracking error rate trends, running output quality spot-checks, and watching for widening output variability in a monthly review. Clear ownership, a retraining protocol, and pre-agreed vendor exit terms complete the picture.

Key takeaways

- Research from MIT, Harvard, and the University of Monterrey shows 91% of machine learning models degrade in performance after deployment, as production data drifts away from training conditions in a pattern researchers call AI ageing.
- The key early warning sign is widening variability between best and worst outputs, not just a drop in average quality. Erratic systems get explained away; consistently wrong ones get flagged and fixed.
- An effective non-technical monitoring regime assigns error rate trend tracking, output quality spot-checks, and a watch on output variability spread to a named owner, reviewed monthly.
- The decision to retrain, re-scope, or retire should be triggered by a clear trend, not by complete failure. Waiting for a visible breakdown makes all three options more expensive and harder to negotiate.
- Vendor dependency belongs on the same checklist as performance monitoring. Data portability and exit terms should be in the contract before the tool goes live, not negotiated after a problem forces the conversation.

Six months after a successful AI launch, the tool is running, the team has adapted, and attention has moved on to the next priority. That is usually when the problems begin. The model was validated against a dataset that is now six months old. The world it operates in has shifted, and the tool has no mechanism to notice.

What is AI model degradation?

A deployed machine learning model does not hold its launch performance indefinitely. The data it encounters in production gradually drifts away from what it was trained on, and accuracy, consistency, or both, begin to slip. Research by academics from MIT, Harvard, and the University of Monterrey found that 91% of machine learning models degrade in performance after deployment, in a pattern the researchers term AI ageing.

This is not a flaw that better implementation would have prevented. Machine learning systems learn from historical data, then operate in a world that keeps changing. Customer language shifts, product catalogues update, query patterns evolve across the seasons. The model does not update automatically to reflect any of this. Every day it runs, the gap between what it was trained on and what it encounters in production grows a little wider.

The research documents two things happening in parallel: overall error rates climbing as model outputs diverge from ground truth, and the spread between best and worst outputs widening. That second signal matters because erratic performance is harder to catch than uniformly poor performance. A consistently wrong system gets flagged by the people using it. An erratic one gets explained away.

Why does this matter if the tool is still running?

A tool that is technically running but degrading below its launch performance creates risks that do not appear on a system dashboard. The team compensates for errors rather than reporting them, and the signal that would trigger a review disappears into workarounds. By the time the degradation is visible externally, whether in a customer complaint or a compliance flag, the early intervention window has closed.

MIT Sloan Management Review frames AI maintenance through the lens of technical debt, and the analogy is useful for non-technical leaders. There is a principal (the accumulated gap between where the model is and where it should be), interest (the extra work the team does to compensate for poor outputs), and opportunity cost (the decisions made on flawed AI outputs that cannot be reversed). All three accrue without showing up in any obvious report.

The compounding effect is what makes early monitoring pay. At launch, an error rate within tolerance is manageable. Twelve months in, if error rates are rising and output variability is growing, the tool is creating a materially different risk profile to the one the original business case priced in. The approved decision and the actual behaviour of the system have separated, and the gap is invisible unless someone is looking for it.

What signals should you be watching for?

You don’t need data science capability to run a useful monitoring regime. The three signals that matter are error rate trends tracked over time rather than single snapshots, output quality spot-checks by someone who knows what a good output looks like, and the spread between the best and worst outputs the tool produces in any given week. All three are trackable without dedicated technical infrastructure and belong in a monthly review.

Error rate trends are the most reliable of the three. A single snapshot tells you nothing about direction; a chart over eight weeks tells you whether performance is stable, declining gradually, or dropping sharply. If your vendor or implementation team cannot produce this chart on request, that itself is diagnostic information about the monitoring you do not have.
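
If the chart is not available, the same question can be answered from a short list of weekly error rates. The sketch below is one way to do it, not a standard method: it fits a straight line through the weekly readings and labels the direction. The function names and the two thresholds (`GRADUAL`, `SHARP`) are assumptions made for illustration, so set them to match your own tolerance.

```python
from statistics import mean

# Hypothetical thresholds, in error-rate points per week (0.001 = 0.1 percentage points).
GRADUAL = 0.001
SHARP = 0.01


def weekly_slope(error_rates):
    """Least-squares slope of the error rate per week. Positive means errors are rising."""
    n = len(error_rates)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    weeks = range(n)
    week_mean = mean(weeks)
    rate_mean = mean(error_rates)
    covariance = sum((w - week_mean) * (r - rate_mean) for w, r in zip(weeks, error_rates))
    variance = sum((w - week_mean) ** 2 for w in weeks)
    return covariance / variance


def classify_trend(error_rates, gradual=GRADUAL, sharp=SHARP):
    """Label the direction of travel. A flat or improving series counts as stable."""
    slope = weekly_slope(error_rates)
    if slope >= sharp:
        return "dropping sharply"
    if slope >= gradual:
        return "declining gradually"
    return "stable"


# Eight weekly error rates, as fractions of outputs judged wrong (made-up numbers).
eight_weeks = [0.040, 0.041, 0.043, 0.044, 0.047, 0.049, 0.052, 0.055]
print(classify_trend(eight_weeks))
```

A fitted line is deliberately crude. It smooths over a single bad week, which is the point of looking at a trend rather than a snapshot.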

Output quality spot-checks are the human layer that data alone misses. Take ten outputs per week at random, review them against a simple quality rubric, and score them. This does not need to be a data scientist’s job. It needs to be someone’s job, with a named owner and a standing slot in a regular meeting.
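
For anyone who wants the weekly draw to be genuinely random rather than "the ten I happened to see", a few lines of code will do it. This is a minimal sketch under stated assumptions: the three rubric criteria and the function names are hypothetical, and the output IDs stand in for whatever identifier your tool attaches to each response.

```python
import random

# Hypothetical rubric. Replace with the criteria that define a good output for your tool.
RUBRIC = ("accurate", "complete", "on_tone")


def pick_spot_check(output_ids, sample_size=10, seed=None):
    """Draw a random sample of outputs to review, without repeats."""
    ids = list(output_ids)
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    return rng.sample(ids, min(sample_size, len(ids)))


def score_output(checks):
    """Fraction of rubric criteria passed, given a dict of criterion -> True/False."""
    missing = [c for c in RUBRIC if c not in checks]
    if missing:
        raise ValueError(f"rubric criteria not scored: {missing}")
    return sum(bool(checks[c]) for c in RUBRIC) / len(RUBRIC)


# Example: this week's outputs are numbered 0 to 499 (made-up IDs).
to_review = pick_spot_check(range(500))
print(to_review)

# The reviewer records a pass or fail per criterion for each sampled output.
print(score_output({"accurate": True, "complete": True, "on_tone": False}))
```

Keeping the weekly scores in one place, even a spreadsheet, is what turns the spot-check into a trend you can review monthly.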

The spread between best and worst outputs is the leading indicator the research highlights specifically. When the best outputs still look good but the worst are getting noticeably worse, the model is entering a degradation phase before average quality visibly drops. Catching the widening spread catches the problem earlier, and that timing difference is where the cost savings live.
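
The spread can be tracked from the same spot-check scores. The sketch below is an illustration rather than the method used in the research: it takes each week's scores, measures the gap between the best and the worst, and compares the later weeks with the earlier ones. The function names and the `min_increase` threshold are assumptions, and the scores are made up.

```python
from statistics import mean


def weekly_spread(scores):
    """Gap between the best and worst score in one week's spot-check."""
    return max(scores) - min(scores)


def spread_is_widening(weekly_scores, min_increase=0.1):
    """Compare the average spread in the later half of the period with the earlier half.

    min_increase is a hypothetical threshold on the same scale as the scores.
    With an odd number of weeks, the middle week is left out of both halves.
    """
    spreads = [weekly_spread(week) for week in weekly_scores]
    if len(spreads) < 4:
        raise ValueError("need at least four weeks of scores")
    half = len(spreads) // 2
    early = mean(spreads[:half])
    late = mean(spreads[-half:])
    return late - early >= min_increase


# Four weeks of rubric scores between 0 and 1. The best output stays at 0.9
# while the worst slides from 0.8 to 0.5.
four_weeks = [
    [0.90, 0.80, 0.85],
    [0.90, 0.80, 0.90],
    [0.90, 0.60, 0.80],
    [0.90, 0.50, 0.85],
]
print(spread_is_widening(four_weeks))
```

In the example the weekly best never moves, so a check on top-line quality alone would miss the change that the spread picks up.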

When should you act, and what are the options?

Act when a trend becomes clear, not when the tool fails completely. Three options are available to a non-technical owner facing a confirmed trend: retrain the model on fresh data, re-scope it to a narrower task where performance is still reliable, or retire and replace it. Waiting for a visible breakdown makes all three options more expensive and harder to negotiate.

Retraining requires a pre-agreed protocol. If the vendor or implementation team has not built one into the contract, you will be negotiating it under pressure after the degradation is already externally visible. The time to establish the retraining workflow and its cost is at launch, or immediately after the first monitoring review surfaces a declining trend.

Re-scoping is the option that gets underused. A model deployed across a broad range of customer queries may still perform well on a specific subset of them. Narrowing the scope, routing the harder queries to human handlers, and retraining only within that narrowed set is often faster and cheaper than full retraining, and it reduces the risk exposure while the longer-term fix is prepared.

MIT Sloan’s PAID framework offers a useful triage for deciding how urgently to act. When technical debt is high and business impact is high, immediate remediation is required. When debt is high but business impact is still low, schedule the fix before impact rises. The framework converts a technical assessment into a business prioritisation decision, which is the conversation that actually needs to happen between the person who owns the tool and the person who owns the budget.

What else connects to this?

AI model degradation is one piece of a broader set of post-deployment challenges that many owner-managed businesses have not built processes for. Related areas include vendor dependency risk, the wider category of AI technical debt, and the EU AI Act’s lifecycle requirements for systems in the high-risk classification. Understanding where model decay sits within that picture helps you ask better questions of whoever manages your AI tools.

Vendor dependency is where the same pattern plays out at a larger scale. If your AI tool relies on a single provider and that provider is acquired, pivots, or discontinues the product, you need to know whether your data is portable and whether the system can be maintained without them. Consumer Reports Innovation documented what this looks like in practice with the Humane AI Pin, where an asset sale to HP left owners with devices their warranties could not save. Enterprise AI tools follow the same logic on a longer timeline.

The EU AI Act is worth knowing about even if your systems do not fall into the high-risk category. Its requirements for high-risk systems, including lifecycle monitoring, performance recordkeeping, and human oversight provisions, describe what good post-deployment practice looks like regardless of regulatory obligation. Some operators preparing for an exit adopt these standards voluntarily, because acquirers can treat documented AI maintenance as a signal of operational maturity rather than a regulatory compliance box.

The monitoring regime does not need to be complex. It needs to be consistent, assigned to a named owner, and connected to a decision framework that tells you when a trend has become a problem worth acting on. That is the difference between a deployed AI tool and one that is actually maintained.

If you want to audit what you have in place before the next performance review, a conversation is a good place to start. Book a conversation.

Sources

- NannyML (2025). AI ageing: 91% of ML models degrade in performance after deployment. Documents rising error rates and widening output variability in production as key degradation signals, drawing on research from MIT, Harvard, and the University of Monterrey. https://www.nannyml.com/blog/91-of-ml-perfomance-degrade-in-time
- MIT Sloan Management Review (2022). How to manage tech debt in the AI era. Framework for treating AI maintenance obligations as financial debt with principal, interest, liabilities, and opportunity cost; includes the PAID prioritisation quadrant for non-technical leaders. https://sloanreview.mit.edu/article/how-to-manage-tech-debt-in-the-ai-era/
- European Commission (2024). Regulatory framework for AI (EU AI Act). Primary source for lifecycle monitoring requirements, performance recordkeeping obligations, and human oversight provisions applicable to high-risk AI systems throughout their operational life. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
- WilmerHale Privacy and Cybersecurity Law (2024). What are high-risk AI systems within the EU AI Act and what requirements apply to them? Analysis of classification thresholds under Article 6 and the ongoing monitoring and traceability obligations that follow. https://www.wilmerhale.com/en/insights/blogs/wilmerhale-privacy-and-cybersecurity-law/20240717-what-are-highrisk-ai-systems-within-the-meaning-of-the-eus-ai-act-and-what-requirements-apply-to-them
- McKinsey & Company (2024). Four critical strategies for sustainable gen AI adoption. Covers the need for ongoing AI management beyond initial deployment, including monitoring culture and clear ownership of AI systems after go-live. https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-organization-blog/four-critical-strategies-for-sustainable-gen-ai-adoption
- Berkman Klein Centre for Internet and Society, Harvard University (2022). Examining AI failures and lessons learned. Case analysis of AI system failures attributable to inadequate post-deployment oversight, including the absence of performance monitoring regimes. https://www.ethics.harvard.edu/blog/post-8-abyss-examining-ai-failures-and-lessons-learned
- Consumer Reports Innovation (2025). Should an asset sale orphan an IoT device? Analyses the Humane AI Pin case and vendor lock-in risk when an asset sale removes product support obligations, with implications for enterprise AI dependency on single providers. https://innovation.consumerreports.org/should-humanes-asset-sale-orphan-an-iot-device/
- Witness.ai (2024). Model monitoring: best practices for post-deployment AI. Covers complementary monitoring layers across data drift, performance drift, and operational metrics for production machine learning systems in business settings. https://witness.ai/blog/model-monitoring/

Frequently asked questions

How quickly do AI models typically degrade after deployment?

The timeline varies by application and by how far real-world data drifts from the training set. Research documents both gradual degradation over months and sharper drops when data distribution shifts significantly, such as after a product change or a shift in customer behaviour. Degradation is rarely announced. It shows up as rising error rates and widening output variability before average quality visibly drops. A monthly monitoring rhythm gives you a good chance of catching it before it reaches customers.

Who should own AI monitoring in a business without a data science team?

Ownership should sit with the person accountable for the business outcome the AI tool supports, not the person who implemented it. If the tool handles customer queries, the customer experience lead owns it. If it supports financial reporting, the finance lead owns it. The role needs a monthly review slot, a basic quality rubric, and a clear escalation path to whoever can authorise retraining or replacement. Technical knowledge helps but is not required to run this regime effectively.

What should be in an AI monitoring agreement before a tool goes live?

At minimum, the agreement should specify who is responsible for monitoring after go-live, what a retraining workflow looks like and at what cost, how data drift is measured and reported, and what the exit terms are if the vendor is acquired or discontinues the product. Many businesses find they are negotiating these terms after a problem appears, when their bargaining position is weak and urgency is high. Establishing them before launch is the lower-cost path by some distance.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
