Many owners buying AI tools this year have not asked a question that is working its way through US courts right now: what was this model trained on, and did the people who created that content agree to it? The Authors Guild, major news publishers and code contributors are pressing that question hard against OpenAI and Microsoft. The answer will shape the risk profile for businesses using those tools, not just the firms that built them.
What are training data lawsuits?
Training data lawsuits are copyright claims brought by authors, publishers and rights holders who allege that AI developers copied their work without permission to train models. The best-known cases target OpenAI and Microsoft. The Authors Guild sued OpenAI in 2023, claiming books by George R.R. Martin, John Grisham and others were ingested wholesale. That case was later consolidated with twelve other complaints from news publishers and code contributors.
The US Copyright Office’s 2025 report on generative AI training concluded that training AI models typically involves copying protected works, and that wholesale copying of entire texts “ordinarily weighs against fair use.” Over forty AI-copyright cases are now active in US courts alone, with actions filed in at least eight other countries. Courts have let the core infringement claims survive early motions to dismiss, which signals that this wave of litigation is not going to resolve quickly.
A separate set of cases targets code repositories. Open-source developers filed suit against Microsoft, GitHub and OpenAI over code scraped from open-source repositories to train GitHub Copilot. That litigation runs in parallel and raises the same core question: does copying content at scale to build a commercial AI product require a licence from the original authors?
Why does this matter for your business?
You are not a defendant in these lawsuits, but the UK regulators watching them are drawing the same inference as the courts. If training AI models on third-party content carries infringement risk, businesses that assemble datasets, fine-tune models, or deploy tools built on unverified data share some of that exposure. The ICO has stated that organisations cannot outsource legal responsibility simply by purchasing a third-party AI tool.
There is also the UK GDPR angle. If an AI tool you deploy is trained on your clients’ data or your employees’ personal information, and there was no explicit authorisation for that use, you may have a data protection problem that sits entirely with your business. The ICO’s enforcement findings against Experian, which involved repurposing personal data for profiling and analytics without proper transparency, are the most relevant UK precedent for owner-managed businesses thinking about AI training decisions.
The EU AI Act adds a further layer. Providers of general-purpose AI models are now required to document their training data, including whether it contains copyrighted material, under Article 53 of the Act. If you are using EU-marketed AI tools, that transparency should flow to you as a customer. If it does not, that gap is worth raising in your vendor contract.
Where will you actually meet this risk?
The risk appears in three places for owner-managed businesses. First, in the AI tools you buy off the shelf: if your vendor’s model was trained on unlicensed content, exposure from a future settlement or enforcement action sits partly with you as a commercial user. Second, in any fine-tuning or custom model work you commission. Third, in any dataset you have assembled yourself from scraped web content or copied client documents.
The NCSC’s guidance on secure AI development is instructive here. It frames training data as an attack surface rather than just a legal issue. Data poisoning, where malicious content is introduced into a training corpus to skew model behaviour, and training data exfiltration are both documented risks. An owner-managed business that uses a vendor tool without understanding its training data provenance has opened a governance gap that sits between legal and security and gets owned by neither.
For any fine-tuning scenario, the UK GDPR angle sharpens further. If you have ever uploaded client documents, call transcripts, or employee records to improve a model’s performance on your specific domain, you may have created a new data processing purpose. That typically requires a fresh lawful basis, updated privacy notices, and a data protection impact assessment. The ICO has explicitly listed AI and large-scale data re-use among its regulatory priorities for the coming year.
When does this become your problem to manage?
The answer for many businesses is: right now. Three conditions put an owner-managed business inside this governance picture. You are using any AI tool connected to client data, even inside a CRM or reporting platform. You have ever sent personal, confidential or proprietary material through an AI tool without reviewing the vendor’s data-use terms. You have ever asked a team member to gather or prepare data for use in a model.
The practical moves are not dramatic. Four things are worth doing before the next AI vendor renewal. Ask for written confirmation of what data the vendor uses for training and whether your firm’s data is ever included. Check your standard vendor contracts for any clause that grants the provider rights to use your inputs to improve their models. Review whether any personal data flowing through your AI tools has a lawful basis for that processing under UK GDPR. Then assess whether your firm would need a data protection impact assessment before your next AI deployment.
The FCA and Bank of England’s AI Public-Private Forum report is relevant here if your business operates in financial services. Its findings stress that boards must understand how AI models are trained and validated, particularly where those models rely on third-party or unverified data. The governance expectation is the same across sectors: AI training data provenance is a question for the leadership team, not only for whoever handles the tech procurement.
What does this mean for your AI governance approach?
Training data lawsuits signal that AI providers cannot treat third-party content as a free resource, and that businesses using tools built on uncertain data share some responsibility for understanding what that data was. The practical governance question is direct: do you know what your AI tools were trained on, and can you demonstrate that the data handling meets your obligations under UK copyright and data protection law?
Four governance adjustments apply to any owner-managed business deploying AI at meaningful scale. Add training data transparency to your vendor due diligence list: before signing an AI tool contract, ask whether the model uses your data for training, what rights the provider claims over inputs, and whether the training corpus included licensed content. Write protection into the contract: your agreements should include a warranty that the provider has cleared IP rights over training data and an indemnity if a third party later sues over that content. Build a data classification step into your AI intake process, so that personal and confidential data is identified before it enters any model. Finally, follow the case outcomes: the OpenAI multidistrict litigation and the GitHub Copilot cases will define industry norms over the next two to three years, and a settlement requiring licensing fees would change the cost structure of AI tools significantly.
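For teams with some technical support, the data classification step can start as a simple automated check that runs before any text is sent to an AI tool. The sketch below is illustrative only: the pattern list, the marker words and the function names classify and cleared_for_ai are assumptions made for this example, not part of any regulator's guidance or vendor product. Pattern matching of this kind catches only obvious cases and does not replace a proper review.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; a real check would need a much
# broader list and would still miss personal data that has no fixed format.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ni_number": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}

# Hypothetical marker words that suggest a document is confidential.
CONFIDENTIAL_MARKERS = ("confidential", "privileged", "not for distribution")


@dataclass
class Classification:
    label: str  # "personal", "confidential" or "public"
    reasons: list[str] = field(default_factory=list)


def classify(text: str) -> Classification:
    """Label text by the most sensitive category it appears to contain."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    if hits:
        return Classification("personal", hits)
    lowered = text.lower()
    markers = [m for m in CONFIDENTIAL_MARKERS if m in lowered]
    if markers:
        return Classification("confidential", markers)
    return Classification("public")


def cleared_for_ai(text: str) -> bool:
    """Allow only text that triggered no personal or confidential flag."""
    return classify(text).label == "public"
```

Anything that fails the check would go to a person for review rather than straight into the tool. The value of the step is that the decision is made, and recorded, before the data leaves your control.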
The CMA’s review of AI foundation models, which raised concerns about incumbents controlling training data at scale, is pointing in the same direction. Concentrated training data is a competition and governance concern, not just a copyright one. Owner-managed businesses that build good data hygiene habits now, before the litigation settles, will find the adjustments required later are smaller.



