What happens when AI is trained on copyrighted data?

A client sends back a services agreement with a new clause highlighted. They want a warranty that any AI-generated content in the deliverables was produced using tools trained only on licensed material. You read it twice. You use two or three AI tools regularly. You have never checked what they were trained on, and until this moment you have not needed to. This post explains what the risk actually is, so you can decide where to focus.

What is copyright risk when AI uses protected training data?

The risk operates on two levels. At the training level, an AI model is built by processing vast volumes of content, often scraped from the web. If that content included copyrighted material taken without authorisation, the rights holders can argue infringement. At the output level, the model’s generated text or images may reproduce protected material closely enough to infringe copyright or trade marks.

In November 2025, the UK High Court ruled in Getty Images v Stability AI that training the Stable Diffusion image model on copyrighted photographs did not, on the evidence before it, amount to making infringing copies under UK copyright law. The court accepted expert evidence that the trained model learns statistical patterns and does not store copies of the original images. That finding reduces the training-side liability risk for UK-based model developers.

But the case had a twist. Getty succeeded on limited trade mark infringement because AI-generated images reproduced the Getty Images and iStock watermarks. The training itself was not the infringement; the output was. That distinction matters because it means the legal exposure for a business using an AI image tool can sit with what comes out of the model, not how the model was built.

UK copyright law also has a narrow text and data mining exception under Section 29A of the Copyright, Designs and Patents Act 1988, but it covers non-commercial research only. Commercial AI training in the UK still requires licensing or a defensible fair-dealing argument. The UK government confirmed in 2024 that it will not extend this exception to commercial use.

Why does this matter for your business?

If you buy and use mainstream AI tools as a service, you sit a step removed from whoever trained the model. That does not make you entirely clear. Many vendors’ standard contracts disclaim responsibility for IP infringement in outputs and pass that liability to the customer. If your firm has ever fine-tuned a model on third-party content, you step directly into the training-side risk.

Some major providers have responded with explicit IP indemnity programmes. Microsoft’s Copilot Copyright Commitment, published in 2023, promises to defend and compensate enterprise customers facing copyright claims arising from use of its Copilots, provided they use Microsoft’s built-in filters and safety systems. Similar commitments exist from other large vendors. These indemnities cover the training data and the model’s output; they do not cover what you put in.

The EU AI Act, formally adopted in 2024, adds a further layer for firms selling into or operating across EU markets. General-purpose AI model providers must now publish a sufficiently detailed summary of the content used to train their models, specifically to give copyright holders a route to enforce their rights. If your firm uses a large foundation model deployed on the EU market, you can expect more disclosure about its training provenance, and with it more scrutiny from rights holders.

Where will you actually meet it?

For a services firm, the three most likely encounter points are client contracts, image and design work, and decisions about fine-tuning on specialised content. Client procurement teams at larger organisations now routinely ask for IP warranties on AI-assisted deliverables. The image and design encounter carries the highest tested litigation risk, given the Getty v Stability AI case.

Generating commercial images with an AI tool, whether for marketing materials, client presentations, or website visuals, sits in heavily litigated territory. AI image models have been trained on enormous corpora of web images, many of which were copyrighted. The Getty case showed both sides of what courts may hold. Training was not found infringing under UK law, but outputs that reproduced identifiable trade mark elements were.

Fine-tuning is the second high-risk scenario. If your firm, or a developer working for you, uploads client documents, licensed database extracts, or third-party content to train or customise a model, you have moved from consumer to producer. That triggers copyright exposure on the training side and, where that content includes personal data, a UK GDPR issue under the ICO’s AI guidance framework.

The third encounter is in proposals, reports, and client deliverables, particularly in marketing, legal, and professional services work. The risk is low if the tool’s terms include an output indemnity and your staff review AI output before it goes out. The risk rises when you use a tool with no indemnity and distribute AI-generated text without any editorial review.

When should you ask harder questions, and when can you move on?

For off-the-shelf SaaS tools from large providers, the training-data copyright risk belongs to the vendor. Your practical exposure sits on the output side, and reputable enterprise plans cover that with indemnity commitments. Ask harder questions when fine-tuning on third-party content, when your work involves commercial image or logo generation, or when a client contract includes explicit IP warranties you are being asked to provide.
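If it helps to make that triage concrete, the three triggers above can be written down as a simple check. This is a minimal sketch, not legal advice: the `Engagement` record, its field names, and the function name are all hypothetical, chosen only to mirror the three situations just described.

```python
from dataclasses import dataclass


@dataclass
class Engagement:
    """Hypothetical description of one piece of AI-assisted work."""
    fine_tunes_on_third_party_content: bool = False
    generates_commercial_images_or_logos: bool = False
    client_contract_has_ip_warranty: bool = False


def reasons_to_ask_harder_questions(work: Engagement) -> list[str]:
    """Return whichever of the three triggers apply to this piece of work."""
    reasons = []
    if work.fine_tunes_on_third_party_content:
        reasons.append("fine-tuning on third-party content")
    if work.generates_commercial_images_or_logos:
        reasons.append("commercial image or logo generation")
    if work.client_contract_has_ip_warranty:
        reasons.append("client contract asks for an IP warranty")
    return reasons


# Example: the opening scenario, where a client has added a warranty clause.
opening_scenario = Engagement(client_contract_has_ip_warranty=True)
print(reasons_to_ask_harder_questions(opening_scenario))
```

An empty list means none of the three triggers applies, which is the "move on" case for an off-the-shelf tool on a plan with an indemnity.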

Three questions are worth asking any AI vendor. Does the platform provide an IP indemnity covering copyright claims on outputs, and on what conditions? Does using the service constitute consent to use your inputs for further training? And what happens to any fine-tuned model if the vendor relationship ends?

The last question matters for anyone who has customised a tool on proprietary or client data. The second matters more widely, because some platforms train on user prompts and inputs, particularly on free or basic tiers. If those inputs include client content, the situation intersects with your confidentiality obligations.

The low-risk baseline has four parts: a reputable SaaS tool, an enterprise or paid tier with a documented indemnity, human review before AI output goes to a client, and no upload of licensed database content or third-party material for training. US law is still unsettled on fair use for AI training, and the UK is in an active consultation period. That does not mean holding back; it means knowing which side of these distinctions you currently sit on.
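One way to see which side of the baseline a given tool sits on is to record the answers per tool and list the gaps. The sketch below is illustrative only: the `ToolSetup` record and its field names are assumptions made for the example, and whether a vendor counts as "reputable" is a judgement it deliberately leaves to you.

```python
from dataclasses import dataclass


@dataclass
class ToolSetup:
    """Hypothetical record of how one AI tool is used in the firm."""
    name: str
    paid_or_enterprise_tier: bool
    documented_indemnity: bool
    human_review_before_delivery: bool
    uploads_third_party_material_for_training: bool


def baseline_gaps(tool: ToolSetup) -> list[str]:
    """List the parts of the low-risk baseline this setup does not meet."""
    gaps = []
    if not tool.paid_or_enterprise_tier:
        gaps.append("not on a paid or enterprise tier")
    if not tool.documented_indemnity:
        gaps.append("no documented indemnity")
    if not tool.human_review_before_delivery:
        gaps.append("no human review before output reaches a client")
    if tool.uploads_third_party_material_for_training:
        gaps.append("third-party or licensed material uploaded for training")
    return gaps


# Example with made-up values: a free-tier tool whose output is reviewed.
example = ToolSetup(
    name="example-writing-assistant",  # placeholder, not a real product
    paid_or_enterprise_tier=False,
    documented_indemnity=False,
    human_review_before_delivery=True,
    uploads_third_party_material_for_training=False,
)
for gap in baseline_gaps(example):
    print(f"{example.name}: {gap}")
```

Keeping one such record per tool also gives you something concrete to point to when a client or insurer asks how AI is used in your deliverables.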

What else connects to this risk?

Copyright risk from AI training sits alongside three issues owners encounter in the same conversation. Data protection engages when training involves personal data, which overlaps with copyright where client documents contain both. Trade mark law applies separately, as the Getty case showed when AI outputs reproduced brand watermarks. Insurers are now asking about AI IP exposure on proposal forms, and many firms cannot yet answer.

On data protection, the ICO is clear that training AI on personal data requires a lawful basis, proper data minimisation, and a Data Protection Impact Assessment for high-risk uses. If you are fine-tuning a model on client files, HR records, or customer communications, you have a copyright question and a UK GDPR question on your hands simultaneously.

On trade marks, the Getty v Stability AI ruling confirmed that even where training is held not to infringe copyright, outputs reproducing trade marks remain actionable. For a services firm generating marketing materials or client-facing designs with AI, reviewing output for brand identifiers before publishing is basic hygiene.

On insurance, UK legal commentary identifies IP exposure from generative AI as an emerging claims category. Insurers are beginning to include AI-related IP questions on proposal forms. A firm that cannot describe its AI IP controls clearly may face gaps in existing technology errors and omissions or cyber cover. That is worth checking before the next renewal.

The regulatory picture is still developing. The UK IPO has an active code of practice consultation on AI and copyright. The EU AI Act’s transparency requirements for general-purpose AI models have been in force since August 2025. WIPO is running international consultations on AI and IP norms. The direction of travel is towards more disclosure, not less.

The practical discipline is not to wait for the law to settle. Know which data goes through which tools, have IP terms in your vendor and client contracts that reflect your actual practice, and require human review of AI output before anything reaches a client.
