AI training data copyright claims

Picture a small design agency spending an afternoon using an AI image generator to produce social media graphics for a client launch. The outputs come back broadly clean. Then one arrives with a faint watermark visible in the corner, the ghost of a Getty Images credit that has somehow come through the generation process. Nobody in the agency is quite sure whether to delete it and try again, or whether it signals a bigger problem that needs a proper conversation.

That scenario sits at the centre of a live legal debate. The lawsuits are real, the UK High Court has ruled on part of the question, and the practical implications for owner-managed businesses are narrower and more manageable than the headlines tend to suggest.

What are copyright claims over AI training data?

These claims centre on what AI models were trained on, not what they produce for you. Companies including Getty Images and the New York Times argue that the firms building large AI models scraped millions of their images or articles without permission, then used them to train systems sold commercially. The 2025 UK High Court ruling in Getty Images v Stability AI is the leading UK test case resolved so far.

In that case, Mrs Justice Joanna Smith found that the Stable Diffusion model did not itself constitute an infringing copy under the Copyright, Designs and Patents Act 1988, on the basis that the model does not store or reproduce the original Getty images in recoverable form. However, the court did find limited trade mark infringement where AI-generated outputs visibly reproduced the Getty watermark. The model cleared one copyright hurdle; certain outputs it produced did not clear the next one.

This distinction matters for anyone following the debate. UK law has not yet decided whether training on copyrighted works within the UK, without a licence, is itself lawful. The UK IPO’s 2021-22 consultation acknowledged the area is “disputed”, and the government stepped back from a proposed broader text-and-data-mining exception after pressure from creative sectors. University of Cambridge researchers have warned that an opt-out approach, where all works can be scraped unless creators proactively object, risks giving “carte blanche” to AI firms at the expense of UK creators.

Why does this matter for your business?

The practical risk for an owner-managed business sits in the outputs, not the training. You didn’t build or train these models. You are deploying outputs from models whose training data is legally contested, and if those outputs closely resemble protected works or reproduce watermarks and logos, a claim can follow. The Getty case found trade mark liability where watermarks appeared in AI-generated images, even as the broader copyright question went the other way.

A 2023 study commissioned by the UK IPO found that 26% of UK creative businesses were “very concerned” about unlicensed use of their works in AI training, with a further 35% “somewhat concerned”. Those numbers signal the direction of future claims. Rights-holders are watching, and active litigation in the US and EU is testing theories of liability that UK courts will eventually have to address.

For an owner-managed business, the exposure concentrates in specific areas. Marketing materials, website imagery, product design, and content created with AI for commercial delivery carry the highest risk. Internal drafting, meeting summaries, and administrative work carry much less. The question the court asked in Getty, whether the output substantially reproduces protected content, is also the question you should ask before anything AI-generated goes to a client or appears publicly.

Where does this risk actually show up?

The highest-exposure area for owner-managed businesses is generative image tools used for marketing. If you use an image generator to create website banners, social media graphics, or product visuals, the output travels directly to customers, often without anyone checking whether it is substantially similar to a protected work. Code generators carry a related risk when you ship the output as part of a commercial product without review.

Text models carry a lower but still real risk in content-heavy businesses. Asking a language model to reproduce sections of a specific article, or to closely mirror a distinctive copyrighted text, is a direct route to a potential claim. UK legal commentary puts it plainly: treat any AI output the same way you would treat work from a freelancer. Read it, own it, and check it before it leaves the building.

A separate but related issue arises if your team uses AI tools that fine-tune models on your customer data or on data containing personal information. UK GDPR applies the moment personal data is involved, and the ICO’s guidance on AI and data protection is clear that a lawful basis is required, alongside a data protection impact assessment and clear contractual controls on what the vendor does with that data. For many owner-managed businesses, fine-tuning is not something you are doing directly, but knowing whether your vendor does it on your inputs is worth confirming.

When is the risk real, and when can you set it aside?

For typical day-to-day AI use, the copyright risk is low enough to set aside without much analysis. Drafting emails, summarising documents, generating meeting notes, and brainstorming with a language model do not produce the kind of output likely to attract a copyright claim. The risk concentrates where output is visual, distinctive, creative, and sent directly to customers or published publicly.

The clearest filter at the moment is whether your vendor offers an IP indemnity. In September 2023, Microsoft announced a Copilot Copyright Commitment, agreeing to defend enterprise customers against copyright claims arising from Copilot outputs, provided those customers use the tool within intended scenarios and respect content filters. Similar commitments are appearing from other enterprise vendors. This shifts a meaningful portion of the exposure from user to vendor.

Where a vendor offers no indemnity, that does not automatically mean serious risk, but it does mean you carry more of the uncertainty. For owner-managed businesses using general-purpose image generators without enterprise agreements, a human review step before any output goes to a client or appears on your site is the proportionate response. That step catches watermarks, obvious copying, and outputs that look too close to known works.

Sector also matters. If creative work for clients is your core deliverable, as it is for a design agency, a training provider, or a content studio, the risk profile is higher. Choosing tools that use licensed datasets, or that offer fine-tuning on your own licensed content, is worth the additional sourcing effort.

What related terms come up in these conversations?

Three concepts come up repeatedly in the AI training data debate, and knowing what they mean saves time when a vendor or solicitor uses them. IP indemnity carries the greatest practical weight. It is a vendor’s contractual promise to defend you if a third party claims your use of their AI output infringes copyright or trade mark rights. A growing number of enterprise vendors now offer some version of this.

The text-and-data-mining exception is worth understanding. UK copyright law has a narrow exception allowing data mining of works you have lawful access to for non-commercial research. The government considered expanding this in 2022 to cover commercial AI training and stepped back after pushback from creative sectors. The exception is frequently cited in the debate but offers limited practical shelter for commercial AI use, and none for output-level claims.

EU AI Act transparency obligations add a new layer. From August 2025, providers of general-purpose AI models covered by the Act must publish sufficiently detailed summaries of training data, including whether copyright-protected content was used. Many models used by UK owner-managed businesses fall within scope, which means documentation now exists that you can reasonably ask to see before signing any agreement. The CMA’s initial report on AI foundation models noted that opaque training data practices may also raise consumer protection concerns, giving regulators another angle to work from.

The practical conclusion is simple. If your vendor cannot show you a training data summary, does not offer an IP indemnity, and you are creating customer-facing content with their tool, that is an unmanaged risk. Either change the tool or add a human review step before any output goes anywhere public. That step costs very little at the scale of an owner-managed business, and it closes the gap the current legal uncertainty leaves open.
