What training data lawsuits mean for AI governance

TL;DR

Training data lawsuits, led by the Authors Guild and others against OpenAI and Microsoft, signal that AI models built on unlicensed content carry copyright and data protection risk that flows downstream to the businesses using those tools. UK business owners need to understand what their AI tools were trained on, review vendor contracts for data-use rights, and apply UK GDPR obligations to any personal data used in AI training or fine-tuning.

Key takeaways

- The Authors Guild, news publishers and code contributors have brought more than forty copyright cases against AI firms including OpenAI and Microsoft, alleging their content was copied without permission to train models.
- The US Copyright Office concluded in 2025 that wholesale copying of entire works for AI training ordinarily weighs against fair use, giving these lawsuits legal weight beyond a niche US dispute.
- UK businesses cannot outsource legal responsibility for AI training data by purchasing a third-party tool: the ICO's guidance makes the organisation deploying the AI responsible for understanding how its tools were trained.
- Any personal data used to fine-tune or train an AI model requires a fresh lawful basis, updated privacy notices, and very likely a data protection impact assessment under UK GDPR.
- Four governance steps apply now: ask vendors about training data provenance, add IP indemnity clauses to AI contracts, classify data before it enters a model, and monitor how the OpenAI and GitHub Copilot litigation resolves.

Many owners buying AI tools this year have not asked a question that is working its way through US courts right now: what was this model trained on, and did the people who created that content agree to it? The Authors Guild, major news publishers and code contributors are pressing that question hard, against OpenAI and Microsoft. The answer will shape the risk profile for businesses using those tools, not just the firms that built them.

What are training data lawsuits?

Training data lawsuits are copyright claims brought by authors, publishers and rights holders who allege that AI developers copied their work without permission to train models. The best-known cases target OpenAI and Microsoft. The Authors Guild sued OpenAI in 2023, claiming books by George R.R. Martin, John Grisham and others were ingested wholesale. That case was later consolidated with twelve other complaints from news publishers and code contributors.

The US Copyright Office’s 2025 report on generative AI training concluded that training AI models typically involves copying protected works, and that wholesale copying of entire texts “ordinarily weighs against fair use.” Over forty AI-copyright cases are now active in US courts alone, with actions filed in at least eight other countries. Courts have let the core infringement claims survive early motions to dismiss, which signals that this wave of litigation is not going to resolve quickly.

A separate set of cases targets code repositories. GitHub Copilot users filed suit against Microsoft, GitHub and OpenAI over code scraped from open-source repositories to train the product. That litigation runs in parallel and raises the same core question: does copying content at scale to build a commercial AI product require a licence from the original authors?

Why does this matter for your business?

You are not a defendant in these lawsuits, but the UK regulators watching them are drawing the same inference as the courts. If training AI models on third-party content carries infringement risk, businesses that assemble datasets, fine-tune models, or deploy tools built on unverified data share some of that exposure. The ICO has stated that organisations cannot outsource legal responsibility simply by purchasing a third-party AI tool.

There is also the UK GDPR angle. If an AI tool you deploy is trained on your clients’ data or your employees’ personal information, and there was no explicit authorisation for that use, you may have a data protection problem that sits entirely with your business. The ICO’s enforcement findings against Experian, which involved repurposing personal data for profiling and analytics without proper transparency, are the most relevant UK precedent for owner-managed businesses thinking about AI training decisions.

The EU AI Act adds a further layer. Providers of general-purpose AI models are now required to document their training data, including whether it contains copyrighted material, under Article 53 of the Act. If you are using EU-marketed AI tools, that transparency should flow to you as a customer. If it does not, that gap is worth raising in your vendor contract.

Where will you actually meet this risk?

The risk appears in three places for owner-managed businesses. First, in the AI tools you buy off the shelf: if your vendor’s model was trained on unlicensed content, exposure from a future settlement or enforcement action sits partly with you as a commercial user. Second, in any fine-tuning or custom model work you commission. Third, in any dataset you have assembled yourself from scraped web content or copied client documents.

The NCSC’s guidance on secure AI development is instructive here. It frames training data as an attack surface rather than just a legal issue. Data poisoning, where malicious content is introduced into a training corpus to skew model behaviour, and training data exfiltration are both documented risks. An owner-managed business that uses a vendor tool without understanding its training data provenance has opened a governance gap that sits between legal and security and gets owned by neither.

For any fine-tuning scenario, the UK GDPR angle sharpens further. If you have ever uploaded client documents, call transcripts, or employee records to improve a model’s performance on your specific domain, you may have created a new data processing purpose. That typically requires a fresh lawful basis, updated privacy notices, and a data protection impact assessment. The ICO has explicitly listed AI and large-scale data re-use among its regulatory priorities for the coming year.

When does this become your problem to manage?

The answer for many businesses is: right now. Any one of three conditions puts an owner-managed business inside this governance picture. You are using any AI tool connected to client data, even inside a CRM or reporting platform. You have ever sent personal, confidential or proprietary material through an AI tool without reviewing the vendor’s data-use terms. You have ever asked a team member to gather or prepare data for use in a model.

The practical moves are not dramatic. Four things are worth doing before the next AI vendor renewal. Ask for written confirmation of what data the vendor uses for training and whether your firm’s data is ever included. Check your standard vendor contracts for any clause that grants the provider rights to use your inputs to improve their models. Review whether any personal data flowing through your AI tools has a lawful basis for that processing under UK GDPR. Then assess whether your firm would need a data protection impact assessment before your next AI deployment.
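One way to keep those four checks from slipping between renewals is to record them per vendor. The following is a minimal sketch in Python, not a prescribed method: the class name `VendorReview`, its field names, and the vendor name `ExampleAI` are all hypothetical and chosen only to mirror the four checks described above.

```python
from dataclasses import dataclass, fields


@dataclass
class VendorReview:
    """Hypothetical record of the four pre-renewal checks for one AI vendor."""

    vendor: str
    training_data_confirmed_in_writing: bool = False
    contract_checked_for_input_training_rights: bool = False
    lawful_basis_reviewed: bool = False
    dpia_need_assessed: bool = False

    def open_items(self) -> list[str]:
        """Return the names of the checks that are still outstanding."""
        return [
            f.name
            for f in fields(self)
            # Only the boolean fields are checks; 'vendor' is a plain label.
            if isinstance(getattr(self, f.name), bool)
            and not getattr(self, f.name)
        ]


# Example: one check done, three still open for a made-up vendor.
review = VendorReview(vendor="ExampleAI", lawful_basis_reviewed=True)
for item in review.open_items():
    print(f"{review.vendor}: outstanding -> {item}")
```

A spreadsheet would serve the same purpose. What matters is that each check has a named owner and a visible status before the contract renews.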

The FCA and Bank of England’s AI Public-Private Forum report is relevant here if your business operates in financial services. Its findings stress that boards must understand how AI models are trained and validated, particularly where those models rely on third-party or unverified data. The governance expectation is the same across sectors: AI training data provenance is a question for the leadership team, not only for whoever handles the tech procurement.

What does this mean for your AI governance approach?

Training data lawsuits signal that AI providers cannot treat third-party content as a free resource, and that businesses using tools built on uncertain data share some responsibility for understanding what that data was. The practical governance question is direct: do you know what your AI tools were trained on, and can you demonstrate that the data handling meets your obligations under UK copyright and data protection law?

Four governance adjustments apply to any owner-managed business deploying AI at meaningful scale. Add training data transparency to your vendor due diligence list: before signing an AI tool contract, ask whether the model uses your data for training, what rights the provider claims over inputs, and whether the training corpus included licensed content. Write protection into the contract: your agreements should include a warranty that the provider has cleared IP rights over training data and an indemnity if a third party later sues over that content. Build a data classification step into your AI intake process, so that personal and confidential data is identified before it enters any model. Finally, follow the case outcomes: the OpenAI multidistrict litigation and the GitHub Copilot cases will define industry norms over the next two to three years, and a settlement requiring licensing fees would change the cost structure of AI tools significantly.
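To make the data classification step concrete, here is a rough sketch of a pre-submission check in Python. It is an illustration under stated assumptions, not a compliance tool: the function names `classify` and `cleared_for_ai`, the labels, and the three patterns are all hypothetical, and simple pattern matching will miss most real personal data (names, health details, account numbers and so on).

```python
import re

# Illustrative patterns only. Real personal-data detection needs far more
# than this, and a match here is a prompt for human review, not a verdict.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
    "confidential_marker": re.compile(r"\bconfidential\b", re.IGNORECASE),
}


def classify(text: str) -> set[str]:
    """Return the label of every pattern found in the text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


def cleared_for_ai(text: str) -> bool:
    """Allow text into an AI tool only when no pattern matched."""
    return not classify(text)


draft = "Contact jane@example.com about the CONFIDENTIAL report"
if not cleared_for_ai(draft):
    print("Hold for review:", sorted(classify(draft)))
```

The design choice worth copying is the gate rather than the patterns: text is checked before it reaches any model, and anything flagged is held for a person to decide.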

The CMA’s review of AI foundation models, which raised concerns about incumbents controlling training data at scale, is pointing in the same direction. Concentrated training data is a competition and governance concern, not just a copyright one. Owner-managed businesses that build good data hygiene habits now, before the litigation settles, will find the adjustments required later are smaller.

Sources

- US Copyright Office (2025). Report on Copyright and Artificial Intelligence: Generative AI Training. Concludes that wholesale copying of protected works for AI training ordinarily weighs against fair use. https://www.wiley.law/alert-Copyright-Office-Issues-Key-Guidance-on-Fair-Use-in-Generative-AI-Training
- Skadden, Arps, Slate, Meagher & Flom LLP (2025). Summary of US Copyright Office AI Training Report. Covers the prima facie infringement finding and fair use framework for AI training data decisions. https://www.skadden.com/insights/publications/2025/05/copyright-office-report
- UK Information Commissioner's Office (2023). Guidance on AI and Data Protection. Sets out that organisations deploying AI cannot outsource legal responsibility and must understand training data provenance. https://ico.org.uk/for-organisations/guide-to-data-protection/key-dp-themes/artificial-intelligence/
- EU Artificial Intelligence Act (2024). Official text, Article 53. Requires general-purpose AI providers to document training data including copyright status and share that documentation with regulators. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- National Cyber Security Centre (2023). Secure AI System Development Guidance. Identifies training data as a key attack surface, including data poisoning and exfiltration risks that affect any organisation using AI tools. https://www.ncsc.gov.uk/collection/developing-secure-artificial-intelligence
- Bank of England and Financial Conduct Authority (2022). AI Public-Private Forum Final Report. States that boards must engage with AI risk management including data provenance and third-party model dependencies. https://www.bankofengland.co.uk/report/2022/artificial-intelligence-public-private-forum-final-report
- Competition and Markets Authority (2023). AI Foundation Models Initial Report. Raises concerns about training data concentration and signals the need for fair, transparent access and licensing arrangements. https://www.gov.uk/government/publications/ai-foundation-models-initial-report
- BFV Law (2024). Training Data or Taking Data? How AI Copyright Lawsuits Are Reshaping Creative Rights. Describes the In re OpenAI multidistrict litigation and what courts have allowed to survive early dismissal motions. https://www.bfvlaw.com/training-data-or-taking-data-how-ai-copyright-lawsuits-are-reshaping-creative-rights/
- UK Information Commissioner's Office (2020). Enforcement findings on Experian data practices. Establishes the precedent for repurposing personal data for new analytical or AI-adjacent purposes without proper transparency and lawful basis. https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2020/10/ico-finds-concerns-with-how-data-brokers-use-personal-data/

Frequently asked questions

Do training data lawsuits apply to UK businesses or is this just a US issue?

The active litigation is primarily in the US, but the risk is not geographically limited. UK businesses deploying AI tools built on unlicensed training data face exposure under UK copyright law and UK GDPR, regardless of where the AI provider is based. The ICO and the CMA have both signalled active interest in AI training data governance, and the EU AI Act's training data transparency requirements apply to any tools marketed in the EU that UK firms may also use.

Does this affect me if I only use off-the-shelf AI tools and do not build my own?

Yes, though the exposure differs. If you are using a SaaS AI tool, liability for the model's training data rests primarily with the vendor. However, if your firm's client data or employee information has been used to improve or adapt the model under terms that your contract or privacy notices did not cover, you are the data controller responsible for that processing. Reviewing vendor data-use terms at your next contract renewal is the practical starting point.

What should I do if my AI vendor cannot tell me what their model was trained on?

Treat that gap as a meaningful signal. Vendors of well-governed AI tools should be able to provide at least a general description of their training data and confirm that your inputs are not used to train shared models. If a vendor cannot or will not answer this question, factor that into your procurement decision. For any tool handling personal data, the absence of this information makes it considerably harder to complete a data protection impact assessment properly.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.
