The practice manager at a twelve-person professional services firm has spent six months quietly fine-tuning an AI assistant on five years of client emails. It drafts responses well. The partners are pleased. Last week a new enterprise client sent a vendor questionnaire asking for the firm’s AI data retention policy, the lawful basis for training on customer data, and the deletion schedule for any model artefacts. There is no policy. There is no deletion schedule. There is a Friday afternoon spent reading the EU AI Act on her phone in the kitchen.
That moment is now common. The tools have moved faster than the policies behind them, and the questionnaires from larger customers have caught up. The good news is that the rules, while real, are tractable for a small firm if you separate the questions properly at the start.
What is the actual rule on AI and data retention?
There is no single AI data retention rule. There are two regimes pulling in opposite directions on different categories of data, and the practical answer for any small firm sits in the gap between them. Both apply at once, and neither cancels the other out.
GDPR Article 5(1)(e), the storage limitation principle, requires personal data to be kept no longer than necessary for the purpose it was collected. The ICO restates this in plain language for small organisations and asks you to set a retention period, justify it, and securely delete or anonymise data at the end of it. The EU AI Act, in force and applying in phases through 2026, requires providers of high-risk AI systems to keep technical documentation and quality management records for ten years after the system is placed on the market or put into service, and to keep automatically generated logs for at least six months.
Specialist commentary describes these as two regulatory clocks running against each other, and the resolution is hierarchical. Deletion of raw personal data must happen first, leaving behind only non-personal documentation and anonymised records to satisfy the AI Act’s longer archival duty. Both clocks can be honoured if the data is sorted into the right buckets at the start, which is the part that gets skipped when a firm builds an AI tool first and writes the policy second.
Why does it matter for your business?
The gap between the two regimes is where reputational and contractual risk sits, even for firms not in scope of the AI Act’s high-risk provisions. Enterprise buyers are already asking for AI data retention policies as standard procurement questions. Insurers are starting to ask the same. The ICO can act on GDPR breaches regardless of whether you have heard of the AI Act, and storage limitation has been a live enforcement area for years.
If you cannot answer a vendor questionnaire about how long you keep training data, what happens when a client asks to be forgotten, and who owns each dataset that feeds your AI tools, you will lose deals before you ever face a fine. That is the more immediate cost. The deals lost this way are usually the larger, slower-closing ones, the enterprise contracts that take three months of procurement and a clean answer to the data section to convert. To a procurement team, a firm without a policy looks indistinguishable from a firm with bad practice.
Where will you actually meet it?
You meet it in three places, none of which announce themselves as AI data retention questions. Each surfaces in a different week, from a different person, and each needs the same underlying policy to answer cleanly. If the policy is not written down somewhere a colleague can find it without asking you, the answer will land late or land wrong.
The first place is the data processing addendum on any AI tool that handles client information, where the small print says the vendor may retain logs and metadata for stated periods and may use anonymised inputs for product improvement. The second is procurement, where a new enterprise customer sends a security questionnaire that includes specific questions on AI use, training data, and deletion schedules. The third is internal, when a former employee asks for their data to be removed and you realise their emails are already woven into the model that drafts the firm’s client correspondence. The third one is the hardest, because the cost of unpicking it after the fact is usually a full retrain.
When to ask versus when to ignore
There are categories of AI use where the retention question is small enough to handle with common sense, and others where it deserves serious attention. The test is whether the data leaves its original system, whether identifiers travel with it, and whether the model or the vendor retains anything derived from it. If any of those three is yes, treat retention as load-bearing rather than admin.
If your team is using a hosted AI tool for drafting individual emails or summarising a meeting transcript that nobody is storing centrally, the retention question is essentially the vendor’s data processing addendum and a sensible internal rule about what you paste into prompts. If you are fine-tuning a model on customer data, building a retrieval system over client files, or using AI on health information, financial records, or anything that would count as special category data under GDPR, the question changes shape. It becomes a design decision, not a policy footnote, and the cost of getting it wrong is usually borne by the next deal cycle or the next subject access request, whichever arrives first.
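The three-part test above is simple enough to write down as a checklist. The following is a minimal sketch in Python, not a compliance tool: the class name, field names, and the two example uses are hypothetical, chosen only to illustrate how "any yes means load-bearing" works in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AIUse:
    """One way the firm uses AI, described by the three test questions."""
    name: str
    data_leaves_source_system: bool
    identifiers_travel_with_data: bool
    derived_artefacts_retained: bool  # retained by the model or by the vendor


def retention_is_load_bearing(use: AIUse) -> bool:
    """Return True if any one of the three questions is answered yes."""
    return (
        use.data_leaves_source_system
        or use.identifiers_travel_with_data
        or use.derived_artefacts_retained
    )


# Hypothetical examples, for illustration only.
uses = [
    AIUse("fine-tune on client emails", True, True, True),
    AIUse("on-device spell check", False, False, False),
]

for use in uses:
    label = "load-bearing" if retention_is_load_bearing(use) else "admin"
    print(f"{use.name}: {label}")
```

A single yes is enough to move a use out of the admin column, which is why the function uses `or` rather than counting answers.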
How does a small firm actually run this?
The practical answer is to treat retention as four separate questions rather than one. The four categories each have different rules, different retention clocks, and different audit consequences, which is why bundling them under a single policy line tends to produce a document that nobody can actually apply. Sort the data first, then write the rule for each bucket.
Raw personal data, the original emails, records, or transcripts, gets the shortest retention period, set per use case and justified by the lawful basis you relied on to collect it. Pseudonymised data, where identifiers have been replaced with tokens but the mapping still exists, stays under GDPR and inherits the same deletion clock. Anonymised data, where re-identification is genuinely not feasible, sits outside GDPR and can be kept for model auditing or retraining. Peer-reviewed work in digital health has shown how synthetic datasets that replicate statistical properties without identifiers can support AI training while staying compliant. System logs and model documentation, the material the AI Act requires you to keep, should be non-personal by design so that they can be retained on a longer cycle, up to the ten years the Act sets for technical documentation.
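One way to make the four buckets concrete is to encode the sorting questions directly. This is a rough sketch under stated assumptions: the enum, the `classify` function, and its four yes/no inputs are hypothetical names, and the rule that anything still re-identifiable is treated as pseudonymised is a deliberately conservative simplification rather than a legal definition.

```python
from enum import Enum


class DataCategory(Enum):
    RAW_PERSONAL = "raw personal data"
    PSEUDONYMISED = "pseudonymised data"
    ANONYMISED = "anonymised data"
    LOGS_AND_DOCS = "system logs and model documentation"


# Buckets that stay under GDPR and share the short deletion clock.
GDPR_SCOPED = {DataCategory.RAW_PERSONAL, DataCategory.PSEUDONYMISED}


def classify(
    *,
    is_system_record: bool,
    has_direct_identifiers: bool,
    token_mapping_exists: bool,
    reidentification_feasible: bool,
) -> DataCategory:
    """Sort a dataset into one of the four buckets (illustrative rules only)."""
    if is_system_record:
        # Assumes the logs and documentation hold no personal data;
        # verify that for real logs before relying on this bucket.
        return DataCategory.LOGS_AND_DOCS
    if has_direct_identifiers:
        return DataCategory.RAW_PERSONAL
    if token_mapping_exists or reidentification_feasible:
        # Conservative: if it can still be tied back to a person, keep it under GDPR.
        return DataCategory.PSEUDONYMISED
    return DataCategory.ANONYMISED


def under_gdpr(category: DataCategory) -> bool:
    """True if the bucket inherits the GDPR deletion clock."""
    return category in GDPR_SCOPED


source_emails = classify(
    is_system_record=False,
    has_direct_identifiers=True,
    token_mapping_exists=False,
    reidentification_feasible=True,
)
print(source_emails.value, "| under GDPR:", under_gdpr(source_emails))
```

The order of the checks matters: a tokenised dataset only counts as anonymised once both the mapping is gone and re-identification is not feasible.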
Wrap those four categories in a short policy that names an owner per dataset, sets a retention period in months, schedules the deletion or anonymisation action, and logs that the action happened. Add a flow-down clause to your vendor contracts that mirrors your own policy. Review quarterly. The discipline is not glamorous. It is also the thing that turns a Friday-afternoon vendor questionnaire into a fifteen-minute answer with a document attached.
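The policy described above can live in a spreadsheet, but its shape is easy to see in code. The sketch below assumes a hypothetical register with one entry per dataset; the field names, example datasets, and dates are invented for illustration, and the code only works out what is due and records that an action was taken. Performing the actual deletion or anonymisation is left to whatever system holds the data.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetPolicy:
    dataset: str
    owner: str
    retention_months: int
    collected_on: date
    action: str  # "delete" or "anonymise"


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`,
    clamping the day to the last day of a shorter month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def due_date(policy: DatasetPolicy) -> date:
    """Date on which the scheduled action falls due."""
    return add_months(policy.collected_on, policy.retention_months)


def actions_due(policies: list[DatasetPolicy], today: date) -> list[DatasetPolicy]:
    """Policies whose retention period has expired as of `today`."""
    return [p for p in policies if due_date(p) <= today]


def log_action(audit_log: list[dict], policy: DatasetPolicy, done_on: date) -> None:
    """Append a record that the scheduled action was carried out."""
    audit_log.append({
        "dataset": policy.dataset,
        "owner": policy.owner,
        "action": policy.action,
        "due": due_date(policy).isoformat(),
        "done_on": done_on.isoformat(),
    })


# Hypothetical register entries.
register = [
    DatasetPolicy("client emails (raw)", "practice manager", 12, date(2024, 1, 31), "delete"),
    DatasetPolicy("support transcripts (raw)", "office lead", 60, date(2024, 1, 31), "anonymise"),
]

audit_log: list[dict] = []
review_day = date(2025, 2, 1)
for policy in actions_due(register, review_day):
    # The real deletion or anonymisation would happen here, then be logged.
    log_action(audit_log, policy, review_day)

print(audit_log)
```

Running `actions_due` at each quarterly review and keeping `audit_log` alongside the policy gives the written evidence a vendor questionnaire usually asks for.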
If the practice manager at the start had had that document six months ago, she would still have built the assistant. She would have done it with a defined retention period on the source emails, an anonymised retraining set kept for audit, a deletion schedule logged in writing, and a one-paragraph answer ready for any client who asked. The work is the same. The exposure is different.



