What is synthetic data? Why it matters for your business

TL;DR

Synthetic data is artificially generated data that mimics the statistical patterns of real data, produced by models rather than collected from real people. It is genuinely useful for testing, development, and privacy-preserving collaboration. It is not, however, automatically anonymous under UK GDPR. The ICO's March 2025 guidance is explicit that identifiability sits on a spectrum, and most synthetic datasets in practice are pseudonymised at best. The owner's job is to recognise the fidelity-utility-privacy trilemma and to ask for independent validation rather than vendor assertion.

Key takeaways

- Synthetic data mimics the statistical patterns of real data without containing real records, and comes in three flavours: fully synthetic, partially synthetic, and hybrid.
- It is not automatically anonymous under UK GDPR. The ICO's March 2025 guidance treats identifiability as a spectrum and requires periodic review as attack methods evolve.
- You cannot maximise fidelity, utility, and privacy at once. A vendor claiming all three is selling, not measuring.
- Independent validation tools exist. Anonymeter, reviewed positively by France's CNIL, quantifies singling out, linkability, and inference risk. "Will you run our data through Anonymeter?" is a fair question.
- Synthetic data is the right tool for testing and proof of concept. It is the wrong tool for "avoid GDPR" and for production models that decide things about real people.

A vendor was on the screen demoing a tool that, they said, would let the owner train models on customer records without the GDPR headache. “It generates synthetic data, so it’s fully GDPR compliant.” The owner asked how that had been validated. The vendor talked about privacy by design and a proprietary algorithm. The owner sent the deck to her DPO that evening, who replied with five questions the vendor could not answer in writing.

That gap, between “synthetic data so it’s GDPR-clean” and what the ICO actually says, is what this post is about. The term is everywhere in 2026 vendor pitches. The legal claim sitting next to it is usually shakier than the slide makes it sound.

What is synthetic data?

Synthetic data is artificially generated data that mimics the statistical patterns and structure of real data, produced by an algorithm trained on real records rather than collected from real events. The point is that no single record in a fully synthetic dataset corresponds to a specific person, even though the overall distributions, correlations, and patterns reflect the original. It looks like the real thing. It is not the real thing.

There are three flavours, and the difference matters legally. Fully synthetic data contains no real records at all, generated end-to-end from a model that has learned the original data’s properties. Partially synthetic data keeps real records but replaces sensitive fields with artificial values, so a patient dataset might keep clinical codes but swap names and addresses. Hybrid combines the two. Partial and hybrid keep more of the original signal, which is useful for analysis and dangerous for re-identification.
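The difference between the flavours is easiest to see in code. Here is a minimal sketch of partial synthesis, with made-up field names and a deliberately naive replacement strategy; it is an illustration of the concept, not a production method:

```python
import random

FAKE_NAMES = ["Alex Doe", "Sam Roe", "Jo Bloggs"]

def partially_synthesise(record, rng):
    """Return a copy of a record with sensitive fields swapped for artificial values."""
    synthetic = dict(record)
    synthetic["name"] = rng.choice(FAKE_NAMES)
    synthetic["address"] = f"{rng.randint(1, 99)} Example Street"
    # The clinical code is deliberately kept: that retained real signal is
    # what makes partial synthesis useful for analysis, and riskier for
    # re-identification than fully synthetic data.
    return synthetic

rng = random.Random(42)
patient = {"name": "Jane Smith", "address": "12 Real Road", "clinical_code": "E11.9"}
print(partially_synthesise(patient, rng))
```

The point of the toy: the output record still carries a real person's clinical value, which is exactly why partial and hybrid approaches need sharper legal scrutiny than the fully synthetic variety.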

Why does it matter for your business?

It matters because synthetic data unlocks work that is otherwise blocked by privacy law. A marketing tool can be tested on synthetic customer messages before it ever sees a real one. A healthcare vendor can evaluate a clinical product against synthetic patient records without negotiating individual data sharing agreements. Early-stage development gets faster. Vendor proof of concepts get safer. Research collaborations get unstuck.

It also matters because the legal claim sitting on top of it is often wrong. The ICO’s March 2025 guidance on anonymisation is explicit that synthetic data is not automatically anonymous, that identifiability sits on a spectrum, and that periodic review is needed because attack methods evolve. The European Data Protection Supervisor reinforces the same line. In practice, most synthetic datasets generated from personal data are pseudonymised at best, which is a security measure, not a GDPR exemption. The vendor saying “synthetic so it’s compliant” has skipped the part where someone validates the claim.

The core trade-off worth holding in mind is the trilemma. Fidelity is how closely the synthetic data matches the original’s statistical properties. Utility is whether it actually performs for the downstream task you need. Privacy is whether real individuals can be re-identified or have attributes inferred from it. You cannot maximise all three at once. Higher fidelity tends to mean higher privacy risk, because the synthetic dataset gets close enough to the original that linkage attacks become feasible. Stronger privacy degrades fidelity and utility. A vendor claiming all three is selling, not measuring.

Where will you actually meet it?

You will meet synthetic data first in vendor pitches, framed as the answer to whatever data risk the owner is worrying about. Tools like Mostly AI, Tonic.ai, Gretel.ai, Hazy in the UK, and MDClone in healthcare all sell synthetic data generation as a core product. The pitch is consistent. Train models on data that feels real, ship faster, sleep better.

The differentiator is meant to be privacy, though the validation evidence behind the privacy claim varies sharply between vendors. MDClone, for example, describes its method as the only anonymisation approach that fully prevents re-identification, a claim the academic literature does not support and that the buyer should ask to see evidence for.

You will meet it second inside UK regulated sectors where the regulator is now operating its own programme. NHS England’s synthetic data programme is live and free to NHS staff, including the SynAE Accident and Emergency dataset for proof-of-concept analysis. The Financial Conduct Authority’s Digital Sandbox makes synthetic financial data available for innovation testing, and the FCA’s joint work with the Alan Turing Institute on a synthetic AML dataset is the canonical UK case study for regulator-led synthetic data. If you operate in health or financial services, synthetic data is now a normal part of testing workflows rather than an exotic option.

You will meet it third inside the products you may already be using. Cloud providers offer synthetic data generation alongside their machine learning services. Vendor proof-of-concept environments increasingly default to synthetic versions of customer data to avoid the data sharing agreement step. Academic-industry research collaborations now commonly request synthetic versions of enterprise data rather than the real thing.

When to ask, when to ignore

Ask about synthetic data when the use case is testing, integration, or proof of concept, and when fidelity to rare edge cases is not load-bearing. Ask harder when the synthetic dataset will train a production model that makes decisions about real people, because that is the point at which the fidelity-utility-privacy trilemma stops being theoretical.

The right questions for the vendor are simple. What flavour is it: fully synthetic or partial? What privacy validation has been performed, including membership inference and attribute inference testing? Will you run the dataset through Anonymeter or an equivalent independent framework? The answers tell you whether you are buying a product or a slide.
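One crude pre-screen a buyer can ask about, before commissioning full attack testing, is a distance-to-closest-record comparison: if synthetic rows sit much closer to the training records than to held-out records from the same population, the generator may be memorising. The sketch below is purely illustrative (toy numeric records, plain L1 distance) and is not a substitute for membership inference testing or a framework like Anonymeter:

```python
def distance(a, b):
    """Simple L1 distance over numeric fields."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_distance(row, reference):
    """Distance from one synthetic row to its nearest real record."""
    return min(distance(row, ref) for ref in reference)

def mean_dcr(synthetic, reference):
    """Mean distance-to-closest-record across a synthetic dataset."""
    return sum(closest_distance(r, reference) for r in synthetic) / len(synthetic)

# Toy (age, income) records. The holdout rows come from the same population
# but were never shown to the generator.
train   = [(35, 52000), (41, 61000), (29, 38000)]
holdout = [(33, 50000), (44, 64000), (27, 36000)]

# A "synthetic" set that has near-copied a training row:
synthetic = [(35, 52001), (40, 60000)]

dcr_train = mean_dcr(synthetic, train)
dcr_holdout = mean_dcr(synthetic, holdout)
print(dcr_train, dcr_holdout)  # much smaller against train: a memorisation red flag
```

A large gap between the two averages does not prove a privacy breach, but it is the kind of concrete, checkable evidence the vendor's slide should be replaced with.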

Ignore the term when the vendor is using it as a substitute for proper data governance. “Synthetic so it’s GDPR-clean” is not a finished sentence. The ICO has issued enforcement actions against organisations that claimed data was anonymised when re-identification remained technically possible, and the burden of proof sits with the data controller, not the vendor. If the actual problem is “we want to share data with a known partner”, a data sharing agreement with confidentiality, access controls, and liability allocation may be simpler and lower-risk than generating a synthetic dataset, validating its privacy, and maintaining governance over it. Synthetic data is a real tool. It is not a shortcut to the conversation with your DPO.

Five terms worth knowing

Anonymisation under UK GDPR is the high bar where data can no longer relate to an identified or identifiable person and falls outside GDPR scope entirely. The Article 29 Working Party's three tests of singling out, linkability, and inference remain the operative framework. Most synthetic datasets do not clear this bar without independent validation.

Pseudonymisation is what synthetic data typically achieves in practice. The data has been processed so individual identities are not directly visible, but re-identification remains technically possible with auxiliary information. Pseudonymised data is still personal data under UK GDPR, and the ICO is explicit that pseudonymisation is a security measure rather than an exemption.
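Why pseudonymisation is not an exemption is easiest to show with a toy linkage attack. Everything below is illustrative, with made-up records and quasi-identifiers:

```python
# Pseudonymised rows: names stripped, but quasi-identifiers remain.
pseudonymised = [
    {"id": "a91f", "postcode": "SW1A 1AA", "birth_year": 1984, "diagnosis": "E11.9"},
    {"id": "c02b", "postcode": "M1 1AE", "birth_year": 1990, "diagnosis": "J45.9"},
]

# Auxiliary information an attacker might already hold
# (electoral roll, marketing list, a breach dump).
auxiliary = [
    {"name": "Jane Smith", "postcode": "SW1A 1AA", "birth_year": 1984},
]

def link(pseudo_rows, aux_rows):
    """Re-identify pseudonymised rows by joining on quasi-identifiers."""
    matches = []
    for p in pseudo_rows:
        for a in aux_rows:
            if (p["postcode"], p["birth_year"]) == (a["postcode"], a["birth_year"]):
                matches.append((a["name"], p["diagnosis"]))
    return matches

print(link(pseudonymised, auxiliary))  # [('Jane Smith', 'E11.9')]
```

The join takes four lines and no special tooling, which is the practical reason the ICO treats pseudonymised data as still personal data.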

Differential privacy is the formal mathematical technique that adds calibrated noise to data or models to bound the information any single record contributes. It is the only method shown by independent research to consistently defend against attribute inference attacks on synthetic data, and it comes with a real utility cost. Vendors who claim “differential privacy by default” should be asked for their epsilon and delta values.
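The mechanics can be sketched in a few lines. This is a minimal illustration of the Laplace mechanism for a count query, not a production implementation (real systems track budget composition across queries, clamp contributions, and use vetted noise samplers):

```python
import math
import random

def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Epsilon-differentially-private count via the Laplace mechanism.

    A count has sensitivity 1: adding or removing one person changes the
    answer by at most 1. The noise scale is sensitivity / epsilon, so a
    smaller epsilon (stronger privacy) means more noise and lower utility.
    """
    rng = rng or random.Random()
    return true_count + laplace_sample(sensitivity / epsilon, rng)

print(dp_count(1000, epsilon=1.0, rng=random.Random(7)))  # roughly 1000, give or take a few
```

The epsilon a vendor should be able to quote is exactly the parameter setting that noise scale; "differential privacy by default" with no numbers attached is an adjective, not a guarantee.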

Anonymeter is the open-source privacy attack framework developed by Statice, reviewed positively by France’s CNIL data protection authority. It quantifies singling out, linkability, and inference risk in a synthetic dataset against the Article 29 Working Party criteria. Asking a vendor to run their dataset through it is a fair question. Refusing to is also an answer.

Model collapse is the 2026 concern for organisations recycling synthetic data through training pipelines. When language models are trained on the synthetic outputs of other language models, quality degrades over generations as the synthetic data drifts further from the real distribution it was supposed to mimic. The UK government’s own AI Insights guidance warns that synthetic data is just as vulnerable to weakness, bias, and omission as real data, and must be evaluated the same way.

The honest test for any synthetic data product is the privacy validation question. The vendor who answers in evidence, including independent attack testing and named privacy parameters, is selling something defensible. The vendor who answers in adjectives is not.

Sources

- Information Commissioner's Office (2025). Anonymisation, pseudonymisation, and privacy-enhancing technologies guidance. The March 2025 guidance applying directly to synthetic data, including the spectrum-of-identifiability framing and the requirement for periodic review. https://ico.org.uk/about-the-ico/media-centre/events-and-webinars/2025/03/anonymisation-and-pseudonymisation-guidance/
- Information Commissioner's Office (2024). What is personal data? The identifiability test rather than the presence of a name. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/
- Article 29 Working Party (2014). Opinion 05/2014 on Anonymisation Techniques. The three risks any anonymisation must guard against: singling out, linkability, inference. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
- European Data Protection Supervisor (2024). TechSonar: synthetic data. The line that privacy assurance assessment is required to confirm synthetic data is not actual personal data. https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data
- Shokri, Stronati, Song, Shmatikov (2017). Membership Inference Attacks Against Machine Learning Models. The canonical attack showing ML models leak whether a record was in the training set. https://arxiv.org/abs/1610.05820
- Annamalai, Gadotti, Rocher (2023). A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. Shows attribute inference succeeds against state-of-the-art generation methods, with differential privacy the only consistent defence. https://arxiv.org/abs/2301.10053
- Statice (ongoing). Anonymeter: open-source privacy attack framework reviewed positively by CNIL for quantifying singling out, linkability, and inference risk in synthetic datasets. https://github.com/statice/anonymeter
- NHS England (ongoing). Synthetic data programme and the SynAE Accident and Emergency synthetic dataset. The working UK case study for regulator-supported synthetic data at scale. https://nhsx.github.io/AnalyticsUnit/synthetic.html
- Financial Conduct Authority and Alan Turing Institute (2024). Synthetic data and anti-money laundering: research note and Digital Sandbox dataset. The financial-services equivalent of the NHS programme. https://www.fca.org.uk/publications/research-notes/research-note-synthetic-data-anti-money-laundering-project-report
- UK Government (2024). AI Insights: synthetic data. Guidance, including the warning that synthetic data is just as vulnerable to weakness, bias, and omission as real data. https://www.gov.uk/government/publications/ai-insights/ai-insights-synthetic-data-html

Frequently asked questions

Does synthetic data take us outside UK GDPR?

Only if it is genuinely anonymous, and the ICO's bar for that is high. If individuals can be re-identified from the synthetic dataset, or if attributes about real people can be inferred from it, the data remains personal data and all GDPR obligations continue to apply. Most synthetic datasets in practice are pseudonymised at best, which is a security measure rather than a GDPR exemption. Route the question to your DPO and the ICO guidance, not the vendor's brochure.

A vendor told us their synthetic data is "100 per cent privacy-safe". Is that true?

It is a marketing claim, not a legal one. Independent research has shown that synthetic data generated without differential privacy can leak attributes about real individuals across the standard generation methods. The honest version is that privacy, fidelity, and utility trade against each other and there is no single metric that guarantees privacy. Ask the vendor for the validation evidence, including any membership inference and attribute inference testing. Refusing to provide it is also an answer.

When should we actually use synthetic data?

When the use case is testing, integration, or proof of concept, when fidelity to rare edge cases is not load-bearing, and when you are willing to commission privacy validation. It is the wrong tool when you are using it to avoid compliance, when the dataset will train a production model that makes decisions about real people, or when a data sharing agreement with a known partner would be simpler. The post's job is to surface the question. The decision sits with the DPO.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
