What is synthetic data? Why it matters for your business

TL;DR

Synthetic data is artificially generated data that mimics the statistical patterns of real data, produced by models rather than collected from real people. It is genuinely useful for testing, development, and privacy-preserving collaboration. It is not, however, automatically anonymous under UK GDPR. The ICO's March 2025 guidance is explicit that identifiability sits on a spectrum, and most synthetic datasets in practice are pseudonymised at best. The owner's job is to recognise the fidelity-utility-privacy trilemma and to ask for independent validation rather than vendor assertion.

Key takeaways

- Synthetic data mimics the statistical patterns of real data without containing real records, and comes in three flavours: fully synthetic, partially synthetic, and hybrid.
- It is not automatically anonymous under UK GDPR. The ICO's March 2025 guidance treats identifiability as a spectrum and requires periodic review as attack methods evolve.
- You cannot maximise fidelity, utility, and privacy at once. A vendor claiming all three is selling, not measuring.
- Independent validation tools exist. Anonymeter, reviewed positively by France's CNIL, quantifies singling out, linkability, and inference risk. "Will you run our data through Anonymeter?" is a fair question.
- Synthetic data is the right tool for testing and proof of concept. It is the wrong tool for "avoid GDPR" and for production models that decide things about real people.

A vendor was on the screen demoing a tool that, they said, would let the owner train models on customer records without the GDPR headache. “It generates synthetic data, so it’s fully GDPR compliant.” The owner asked how that had been validated. The vendor talked about privacy by design and a proprietary algorithm. The owner sent the deck to her DPO that evening, who replied with five questions the vendor could not answer in writing.

That gap, between “synthetic data so it’s GDPR-clean” and what the ICO actually says, is what this post is about. The term is everywhere in 2026 vendor pitches. The legal claim sitting next to it is usually shakier than the slide makes it sound.

What is synthetic data?

Synthetic data is artificially generated data that mimics the statistical patterns and structure of real data, produced by an algorithm trained on real records rather than collected from real events. The point is that no single record in a fully synthetic dataset corresponds to a specific person, even though the overall distributions, correlations, and patterns reflect the original. It looks like the real thing. It is not the real thing.

There are three flavours, and the difference matters legally. Fully synthetic data contains no real records at all, generated end-to-end from a model that has learned the original data’s properties. Partially synthetic data keeps real records but replaces sensitive fields with artificial values, so a patient dataset might keep clinical codes but swap names and addresses. Hybrid combines the two. Partial and hybrid keep more of the original signal, which is useful for analysis and dangerous for re-identification.
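The difference between the flavours is easiest to see in code. Here is a minimal sketch of partial synthesis, with made-up field names and a deliberately naive replacement strategy; it is an illustration of the concept, not a production method:

```python
import random

FAKE_NAMES = ["Alex Doe", "Sam Roe", "Jo Bloggs"]

def partially_synthesise(record, rng):
    """Return a copy of a record with sensitive fields swapped for artificial values."""
    synthetic = dict(record)
    synthetic["name"] = rng.choice(FAKE_NAMES)
    synthetic["address"] = f"{rng.randint(1, 99)} Example Street"
    # The clinical code is deliberately kept: that retained real signal is
    # what makes partial synthesis useful for analysis, and riskier for
    # re-identification than fully synthetic data.
    return synthetic

rng = random.Random(42)
patient = {"name": "Jane Smith", "address": "12 Real Road", "clinical_code": "E11.9"}
print(partially_synthesise(patient, rng))
```

The point of the toy: the output record still carries a real person's clinical value, which is exactly why partial and hybrid approaches need sharper legal scrutiny than the fully synthetic variety.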

Why does it matter for your business?

It matters because synthetic data unlocks work that is otherwise blocked by privacy law. A marketing tool can be tested on synthetic customer messages before it ever sees a real one. A healthcare vendor can evaluate a clinical product against synthetic patient records without negotiating individual data sharing agreements. Early-stage development gets faster. Vendor proof of concepts get safer. Research collaborations get unstuck.

It also matters because the legal claim sitting on top of it is often wrong. The ICO’s March 2025 guidance on anonymisation is explicit that synthetic data is not automatically anonymous, that identifiability sits on a spectrum, and that periodic review is needed because attack methods evolve. The European Data Protection Supervisor reinforces the same line. In practice, most synthetic datasets generated from personal data are pseudonymised at best, which is a security measure, not a GDPR exemption. The vendor saying “synthetic so it’s compliant” has skipped the part where someone validates the claim.

The core trade-off worth holding in mind is the trilemma. Fidelity is how closely the synthetic data matches the original’s statistical properties. Utility is whether it actually performs for the downstream task you need. Privacy is whether real individuals can be re-identified or have attributes inferred from it. You cannot maximise all three at once. Higher fidelity tends to mean higher privacy risk, because the synthetic dataset gets close enough to the original that linkage attacks become feasible. Stronger privacy degrades fidelity and utility. A vendor claiming all three is selling, not measuring.

Where will you actually meet it?

You will meet synthetic data first in vendor pitches, framed as the answer to whatever data risk the owner is worrying about. Tools like Mostly AI, Tonic.ai, Gretel.ai, Hazy in the UK, and MDClone in healthcare all sell synthetic data generation as a core product. The pitch is consistent. Train models on data that feels real, ship faster, sleep better.

The differentiator is meant to be privacy, though the validation evidence behind the privacy claim varies sharply between vendors. MDClone, for example, describes its method as the only anonymisation approach that fully prevents re-identification, a claim the academic literature does not support and that the buyer should ask to see evidence for.

You will meet it second inside UK regulated sectors where the regulator is now operating its own programme. NHS England’s synthetic data programme is live and free to NHS staff, including the SynAE Accident and Emergency dataset for proof-of-concept analysis. The Financial Conduct Authority’s Digital Sandbox makes synthetic financial data available for innovation testing, and the FCA’s joint work with the Alan Turing Institute on a synthetic AML dataset is the canonical UK case study for regulator-led synthetic data. If you operate in health or financial services, synthetic data is now a normal part of testing workflows rather than an exotic option.

You will meet it third inside the products you may already be using. Cloud providers offer synthetic data generation alongside their machine learning services. Vendor proof-of-concept environments increasingly default to synthetic versions of customer data to avoid the data sharing agreement step. Academic-industry research collaborations now commonly request synthetic versions of enterprise data rather than the real thing.

When to ask, when to ignore

Ask about synthetic data when the use case is testing, integration, or proof of concept, and when fidelity to rare edge cases is not load-bearing. Ask harder when the synthetic dataset will train a production model that makes decisions about real people, because that is the point at which the fidelity-utility-privacy trilemma stops being theoretical.

The right questions for the vendor are simple. What flavour is it: fully synthetic or partial? What privacy validation has been performed, including membership inference and attribute inference testing? Will you run the dataset through Anonymeter or an equivalent independent framework? The answers tell you whether you are buying a product or a slide.
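One crude pre-screen a buyer can ask about, before commissioning full attack testing, is a distance-to-closest-record comparison: if synthetic rows sit much closer to the training records than to held-out records from the same population, the generator may be memorising. The sketch below is purely illustrative (toy numeric records, plain L1 distance) and is not a substitute for membership inference testing or a framework like Anonymeter:

```python
def distance(a, b):
    """Simple L1 distance over numeric fields."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_distance(row, reference):
    """Distance from one synthetic row to its nearest real record."""
    return min(distance(row, ref) for ref in reference)

def mean_dcr(synthetic, reference):
    """Mean distance-to-closest-record across a synthetic dataset."""
    return sum(closest_distance(r, reference) for r in synthetic) / len(synthetic)

# Toy (age, income) records. The holdout rows come from the same population
# but were never shown to the generator.
train   = [(35, 52000), (41, 61000), (29, 38000)]
holdout = [(33, 50000), (44, 64000), (27, 36000)]

# A "synthetic" set that has near-copied a training row:
synthetic = [(35, 52001), (40, 60000)]

dcr_train = mean_dcr(synthetic, train)
dcr_holdout = mean_dcr(synthetic, holdout)
print(dcr_train, dcr_holdout)  # much smaller against train: a memorisation red flag
```

A large gap between the two averages does not prove a privacy breach, but it is the kind of concrete, checkable evidence the vendor's slide should be replaced with.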

Ignore the term when the vendor is using it as a substitute for proper data governance. “Synthetic so it’s GDPR-clean” is not a finished sentence. The ICO has issued enforcement actions against organisations that claimed data was anonymised when re-identification remained technically possible, and the burden of proof sits with the data controller, not the vendor. If the actual problem is “we want to share data with a known partner”, a data sharing agreement with confidentiality, access controls, and liability allocation may be simpler and lower-risk than generating a synthetic dataset, validating its privacy, and maintaining governance over it. Synthetic data is a real tool. It is not a shortcut to the conversation with your DPO.

Five terms worth knowing

Anonymisation under UK GDPR is the high bar where data can no longer relate to an identified or identifiable person and falls outside GDPR scope entirely. The Article 29 Working Party's three tests of singling out, linkability, and inference remain the operative framework. Most synthetic datasets do not clear this bar without independent validation.

Pseudonymisation is what synthetic data typically achieves in practice. The data has been processed so individual identities are not directly visible, but re-identification remains technically possible with auxiliary information. Pseudonymised data is still personal data under UK GDPR, and the ICO is explicit that pseudonymisation is a security measure rather than an exemption.
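Why pseudonymisation is not an exemption is easiest to show with a toy linkage attack. Everything below is illustrative, with made-up records and quasi-identifiers:

```python
# Pseudonymised rows: names stripped, but quasi-identifiers remain.
pseudonymised = [
    {"id": "a91f", "postcode": "SW1A 1AA", "birth_year": 1984, "diagnosis": "E11.9"},
    {"id": "c02b", "postcode": "M1 1AE", "birth_year": 1990, "diagnosis": "J45.9"},
]

# Auxiliary information an attacker might already hold
# (electoral roll, marketing list, a breach dump).
auxiliary = [
    {"name": "Jane Smith", "postcode": "SW1A 1AA", "birth_year": 1984},
]

def link(pseudo_rows, aux_rows):
    """Re-identify pseudonymised rows by joining on quasi-identifiers."""
    matches = []
    for p in pseudo_rows:
        for a in aux_rows:
            if (p["postcode"], p["birth_year"]) == (a["postcode"], a["birth_year"]):
                matches.append((a["name"], p["diagnosis"]))
    return matches

print(link(pseudonymised, auxiliary))  # [('Jane Smith', 'E11.9')]
```

The join takes four lines and no special tooling, which is the practical reason the ICO treats pseudonymised data as still personal data.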

Differential privacy is the formal mathematical technique that adds calibrated noise to data or models to bound the information any single record contributes. It is the only method shown by independent research to consistently defend against attribute inference attacks on synthetic data, and it comes with a real utility cost. Vendors who claim “differential privacy by default” should be asked for their epsilon and delta values.
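The mechanics can be sketched in a few lines. This is a minimal illustration of the Laplace mechanism for a count query, not a production implementation (real systems track budget composition across queries, clamp contributions, and use vetted noise samplers):

```python
import math
import random

def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Epsilon-differentially-private count via the Laplace mechanism.

    A count has sensitivity 1: adding or removing one person changes the
    answer by at most 1. The noise scale is sensitivity / epsilon, so a
    smaller epsilon (stronger privacy) means more noise and lower utility.
    """
    rng = rng or random.Random()
    return true_count + laplace_sample(sensitivity / epsilon, rng)

print(dp_count(1000, epsilon=1.0, rng=random.Random(7)))  # roughly 1000, give or take a few
```

The epsilon a vendor should be able to quote is exactly the parameter setting that noise scale; "differential privacy by default" with no numbers attached is an adjective, not a guarantee.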

Anonymeter is the open-source privacy attack framework developed by Statice, reviewed positively by France’s CNIL data protection authority. It quantifies singling out, linkability, and inference risk in a synthetic dataset against the Article 29 Working Party criteria. Asking a vendor to run their dataset through it is a fair question. Refusing to is also an answer.

Model collapse is the 2026 concern for organisations recycling synthetic data through training pipelines. When language models are trained on the synthetic outputs of other language models, quality degrades over generations as the synthetic data drifts further from the real distribution it was supposed to mimic. The UK government’s own AI Insights guidance warns that synthetic data is just as vulnerable to weakness, bias, and omission as real data, and must be evaluated the same way.

The honest test for any synthetic data product is the privacy validation question. The vendor who answers in evidence, including independent attack testing and named privacy parameters, is selling something defensible. The vendor who answers in adjectives is not.

Sources

- Information Commissioner's Office (2025). Anonymisation, pseudonymisation, and privacy-enhancing technologies guidance. The March 2025 guidance applying directly to synthetic data, including the spectrum-of-identifiability framing and the requirement for periodic review. https://ico.org.uk/about-the-ico/media-centre/events-and-webinars/2025/03/anonymisation-and-pseudonymisation-guidance/
- Information Commissioner's Office (2024). What is personal data? The identifiability test rather than the presence of a name. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/
- Article 29 Working Party (2014). Opinion 05/2014 on Anonymisation Techniques. The three risks any anonymisation must guard against: singling out, linkability, inference. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
- European Data Protection Supervisor (2024). TechSonar: synthetic data. The line that privacy assurance assessment is required to confirm synthetic data is not actual personal data. https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data
- Shokri, Stronati, Song, Shmatikov (2017). Membership Inference Attacks Against Machine Learning Models. The canonical attack showing ML models leak whether a record was in the training set. https://arxiv.org/abs/1610.05820
- Annamalai, Gadotti, Rocher (2023). A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. Shows attribute inference succeeds against state-of-the-art generation methods, with differential privacy the only consistent defence. https://arxiv.org/abs/2301.10053
- Statice (ongoing). Anonymeter: open-source privacy attack framework reviewed positively by CNIL for quantifying singling out, linkability, and inference risk in synthetic datasets. https://github.com/statice/anonymeter
- NHS England (ongoing). Synthetic data programme and the SynAE Accident and Emergency synthetic dataset. The working UK case study for regulator-supported synthetic data at scale. https://nhsx.github.io/AnalyticsUnit/synthetic.html
- Financial Conduct Authority and Alan Turing Institute (2024). Synthetic data and anti-money laundering: research note and Digital Sandbox dataset. The financial-services equivalent of the NHS programme. https://www.fca.org.uk/publications/research-notes/research-note-synthetic-data-anti-money-laundering-project-report
- UK Government (2024). AI Insights: synthetic data. Guidance, including the warning that synthetic data is just as vulnerable to weakness, bias, and omission as real data. https://www.gov.uk/government/publications/ai-insights/ai-insights-synthetic-data-html

Frequently asked questions

Does synthetic data take us outside UK GDPR?

Only if it is genuinely anonymous, and the ICO's bar for that is high. If individuals can be re-identified from the synthetic dataset, or if attributes about real people can be inferred from it, the data remains personal data and all GDPR obligations continue to apply. Most synthetic datasets in practice are pseudonymised at best, which is a security measure rather than a GDPR exemption. Route the question to your DPO and the ICO guidance, not the vendor's brochure.

A vendor told us their synthetic data is "100 per cent privacy-safe". Is that true?

It is a marketing claim, not a legal one. Independent research has shown that synthetic data generated without differential privacy can leak attributes about real individuals across the standard generation methods. The honest version is that privacy, fidelity, and utility trade against each other and there is no single metric that guarantees privacy. Ask the vendor for the validation evidence, including any membership inference and attribute inference testing. Refusing to provide it is also an answer.

When should we actually use synthetic data?

When the use case is testing, integration, or proof of concept, when fidelity to rare edge cases is not load-bearing, and when you are willing to commission privacy validation. It is the wrong tool when you are using it to avoid compliance, when the dataset will train a production model that makes decisions about real people, or when a data sharing agreement with a known partner would be simpler. The post's job is to surface the question. The decision sits with the DPO.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
