Practical rules for deduplicating data without causing harm

TL;DR

Data deduplication means merging duplicate records into one authoritative entry, but done carelessly it can delete consent timestamps and compliance records that UK GDPR requires you to keep. A safe deduplication exercise starts with a full backup, uses conservative similarity thresholds of at least 0.8, and includes human review before any merges are applied. For owner-managed businesses, the goal is cleaner data and lower compliance risk, achieved through a staged, documented process rather than a bulk delete.

Key takeaways

- Data deduplication merges duplicate records into a single authoritative entry; done without a safe process, it can delete consent timestamps and compliance records that UK GDPR requires you to keep.
- Before running any deduplication exercise, take a full backup of the affected data and document your merge rules, including which field takes precedence when two records conflict.
- Use a conservative similarity threshold of at least 0.8 and review proposed merges manually before applying them to production data; lower thresholds generate false positives that can conflate two entirely different people.
- Under UK GDPR, a bad merge that exposes one client's records to another client can constitute a personal data breach reportable to the ICO within 72 hours.
- The concepts connected to deduplication include "golden record" (the merged authoritative entry), "data minimisation" (the UK GDPR principle it supports), and "entity resolution" (the technical problem of identifying whether two records refer to the same real-world person).

When a small management consultancy ran its first proper client satisfaction survey last autumn, the owner exported his CRM contact list and found 340 names. He was not expecting to find the same person listed four times, with slightly different email formats and company name spellings. His office manager spent two hours tidying the list before the survey went out. A week later, a client called to ask why their newsletter consent had disappeared. The timestamps were gone. The original opt-in record had been sitting in one of the entries she had marked as a duplicate.

That is a deduplication exercise that caused harm. Not through negligence exactly, but through doing the job without a documented, staged process.

What is data deduplication, and why does the “without harm” part need its own rule?

Data deduplication is identifying records that refer to the same real-world entity and merging them into one authoritative entry. Data8 describes the goal as a “single customer view”; Tamr calls it a “golden record.” The harm arrives when a merge removes records that carry legal weight: consent timestamps, transaction histories, or compliance check results a regulator or client may ask you to produce.

The distinction matters because “deduplication” sounds like tidying. In practice it is a data modification operation that changes what you hold and what you can prove. UK GDPR requires personal data to be accurate and, under the data minimisation principle, limited to what is necessary. Removing genuine duplicates satisfies minimisation. Removing the wrong records breaches the accuracy obligation. Those two outcomes are separated by one careless click.

Research on customer data deduplication, published in a 2022 peer-reviewed paper through CEUR-WS, describes the process as a four-step pipeline: blocking to group likely duplicates, filtering out clear non-matches, running similarity scoring, and then clustering matched records into groups. Each step in the pipeline produces a list of candidate matches for human review. A safe deduplication exercise includes that review step. An unsafe one skips it.
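To make the four steps concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the record fields (`id`, `name`), the first-letter blocking key, the length-difference filter, and the use of the standard library's `difflib` for scoring are all hypothetical choices made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Score two names between 0.0 and 1.0 using difflib (an illustrative choice)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def propose_clusters(records, threshold=0.8):
    """Return groups of record ids that look like duplicates, for human review."""
    # Step 1, blocking: only compare records that share a cheap key
    # (here the first letter of the name, a hypothetical blocking key).
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"].strip().lower()[:1]].append(rec)

    matched_pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Step 2, filtering: drop clear non-matches before scoring
            # (a crude, assumed rule based on name length).
            if abs(len(a["name"]) - len(b["name"])) > 10:
                continue
            # Step 3, similarity scoring against the threshold.
            if name_similarity(a["name"], b["name"]) >= threshold:
                matched_pairs.append((a["id"], b["id"]))

    # Step 4, clustering: join matched pairs into connected groups (union-find).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a_id, b_id in matched_pairs:
        parent[find(a_id)] = find(b_id)

    clusters = defaultdict(list)
    for rec_id in list(parent):
        clusters[find(rec_id)].append(rec_id)
    return sorted(sorted(ids) for ids in clusters.values())
```

The function only proposes clusters. Nothing is merged or deleted, which leaves room for the manual review step before any change is applied.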

Why does safe deduplication matter for owner-managed businesses?

Owner-managed businesses accumulate duplicates through staff turnover, CRM imports, and multi-channel lead capture. The direct cost is inflated marketing spend and missed contacts. The compliance exposure is sharper. Under UK GDPR, if a bad merge causes one client’s documents or correspondence to be visible to another client, that is a personal data breach. Where the breach is likely to pose a risk to the people affected, you must notify the ICO within 72 hours of becoming aware of it.

Data8 notes that duplicate records drive unnecessary mailings and inflate costs through multiple contacts to the same person. That is the straightforward operational argument. The compliance argument is harder. The ICO’s 2020 enforcement action against Experian found that credit reference agencies were processing and profiling data in ways consumers had not consented to, and without the documentation to justify their decisions. Deduplication was not the direct cause in that case, but the pattern it exposed is directly relevant: when you modify personal data at scale, especially data connected to consent and lawful basis, you need documented logic that you can show to a regulator. “We were tidying the list” is not documented logic.

The EDPB’s 2023 Data Protection Guide for Small Business makes the same point directly. Owner-managed businesses must keep track of when and how consent was obtained and must be able to respond to subject access requests. A deduplication exercise that strips the original consent record makes that impossible. The ICO’s guidance on the right to rectification reinforces it: if you have merged records incorrectly, you must be able to identify the error and reverse it. An audit trail of what was merged, when, and by which rule is the minimum you need.
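As a rough illustration of what such an audit trail could look like, the sketch below appends one JSON line per merge to a log file. The function names, the field names, and the JSON Lines format are assumptions made for this example; most CRMs will have their own export or logging mechanism.

```python
import json
from datetime import datetime, timezone


def build_audit_entry(kept_id, merged_ids, rule, before_snapshots):
    """Describe one merge: what was merged, when, and by which rule."""
    return {
        "merged_at": datetime.now(timezone.utc).isoformat(),
        "kept_id": kept_id,
        "merged_ids": merged_ids,
        "rule": rule,
        # Full copies of the records as they were before the merge,
        # so an incorrect merge can be identified and reversed.
        "before": before_snapshots,
    }


def append_audit_entry(path, entry):
    """Append the entry as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Storing the pre-merge snapshots alongside the rule name is the design choice that matters here: it is what lets you put the original consent record back if a merge turns out to be wrong.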

Where will you actually meet duplicate data in your operation?

Duplicates cluster in three places in an owner-managed business: your CRM, your email marketing platform, and your billing system. In the CRM, they form when different staff enter the same contact with slightly varying spellings or email formats. In marketing platforms, they come from imported spreadsheet lists that were never reconciled with existing records. In billing, they appear when a client trades under multiple names.

The NCSC’s small business security guidance recommends mapping your data flows as a baseline hygiene measure. Applied to deduplication, this means building a picture of where each system stores contact or customer information before you start cleaning, not during. If your accounts software, your CRM, and your email platform all hold a version of the same client, you need to decide which one is the primary record before any merging begins. Deduplicate without that decision made, and you are as likely to overwrite the correct record with an incorrect one as the reverse.

Deduplication tools in this space typically offer a preview of proposed merges, with the ability to approve or reject each one before it is applied. The preview step is not optional. Applying merges directly to production data without a review pass is a mistake many cleaning exercises make, and it is the step that makes recovery difficult when something goes wrong.

When should you run a deduplication exercise, and when should you hold off?

The right time to deduplicate is when three things are in place: a backup, documented merge rules stating which field wins when records conflict, and someone to review proposed matches manually before they are applied. Run deduplication before a CRM migration, an AI rollout, or a compliance audit. Hold off when any one of those conditions is missing.
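Documented merge rules can be as simple as a short function that anyone in the business can read. The sketch below assumes hypothetical field names (`updated_at`, `consent_at`) and two illustrative rules: ordinary fields take the most recently updated non-empty value, and consent timestamps are always carried forward rather than overwritten. Your own rules may differ; the point is that they are written down before the merge runs.

```python
def merge_records(records):
    """Build a golden record from duplicates using two documented rules."""
    # ISO-format date strings sort correctly as plain text.
    newest_first = sorted(records, key=lambda r: r["updated_at"], reverse=True)

    golden = {}
    # Rule 1: for ordinary fields, the most recently updated non-empty value wins.
    for field in ("name", "email", "company"):
        golden[field] = next(
            (r[field] for r in newest_first if r.get(field)), None
        )

    # Rule 2: consent evidence is never overwritten. Every timestamp found
    # on any of the duplicates is kept, oldest first.
    golden["consent_history"] = sorted(
        r["consent_at"] for r in records if r.get("consent_at")
    )

    # Record which source entries went into the golden record.
    golden["merged_from"] = [r["id"] for r in records]
    return golden
```

With these rules, a newer record with a blank company field does not wipe the company name held on the older record, and the original opt-in timestamp survives even when it sits on the entry that looks like the duplicate.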

A useful precedent for cautious thresholds comes from the OpenSanctions project, which deduplicates sanctions and politically exposed persons lists for compliance purposes. Their published methodology uses conservative similarity scoring and manual curation because a false positive match in their context carries serious consequences for individuals and businesses. In a commercial CRM, the same logic applies: a false positive that conflates two different clients and routes correspondence to the wrong person creates exactly the unauthorised disclosure risk the ICO’s breach guidance describes.

The CEUR-WS research paper recommends starting with a similarity threshold of 0.8 or higher for the initial match pass, then reviewing proposed merges before applying them. Lower thresholds catch more candidates but generate more false positives. Multi-field similarity, comparing name, email, phone, and company name together, outperforms single-field matching. If your tool can only compare email addresses, it will miss pairs where the email format differs but the person is identical, and flag pairs where the email domain matches but the record belongs to two different contacts at the same company.
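A minimal sketch of multi-field scoring follows. The weights, the field names, and the use of Python's standard `difflib.SequenceMatcher` are assumptions chosen for illustration; the paper does not prescribe them, and a commercial tool will use its own scoring method.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the overall score.
WEIGHTS = {"name": 0.4, "email": 0.3, "phone": 0.2, "company": 0.1}


def field_similarity(a, b):
    """Return a 0.0 to 1.0 score for one field; missing values score 0.0."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2):
    """Weighted average of the per-field scores."""
    return sum(
        weight * field_similarity(r1.get(field), r2.get(field))
        for field, weight in WEIGHTS.items()
    )


def is_candidate(r1, r2, threshold=0.8):
    """True when the pair should be queued for human review, not auto-merged."""
    return record_similarity(r1, r2) >= threshold
```

Because the score combines four fields, a shared email domain and company name are not enough on their own to push two different colleagues over the 0.8 threshold, while the same person with a reformatted email address still scores highly.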

For firms in regulated sectors, there is an additional check worth making. FCA-regulated businesses hold KYC and AML check results alongside client records. A merge that overwrites the KYC status of a high-risk contact with data from a lower-risk record creates a potential regulatory failure, one that is difficult to unpick without an audit trail of the merge.
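One way to guard against that is a check that runs before any merge is approved. The sketch below is an assumption-laden illustration, not regulatory guidance: the `kyc_risk` field and the low/medium/high scale are hypothetical, and the rule of keeping the highest rating while flagging any conflict for review is one cautious option among several.

```python
# Hypothetical risk scale; a real KYC system may use different labels.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def merged_kyc_risk(records):
    """Return (rating_to_keep, needs_review) for a group of duplicates."""
    ratings = {r["kyc_risk"] for r in records if r.get("kyc_risk")}
    # Any disagreement between records is sent to a person, never auto-resolved.
    needs_review = len(ratings) > 1
    # Keep the highest rating so a merge cannot silently lower a risk status.
    highest = max(ratings, key=RISK_ORDER.__getitem__, default=None)
    return highest, needs_review
```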

What concepts connect to deduplication?

Deduplication touches several related data quality terms. “Golden record” is the authoritative merged entry your deduplication pipeline produces. “Data minimisation” is the UK GDPR principle that deduplication directly supports. “Entity resolution” is the technical name for deciding whether two records refer to the same real-world person. “Master data management” is the broader discipline of keeping reference data consistent across systems over time.

For AI readiness, the connection is direct. An AI tool reading from your CRM will produce different outputs depending on whether it encounters one clean record or three conflicting ones with different email addresses and company name spellings. Cleaning your data for AI and cleaning it for GDPR compliance are the same project with two different motivations behind it. Both require the same starting point: a backup, a data map, documented merge rules, a conservative similarity threshold, and a human in the review loop before anything is applied to production.

The owner who lost his client’s consent record had none of those in place. Getting them in place before the next cleaning exercise is the practical move.

Sources

- CEUR-WS (2022). On Customer Data Deduplication: Lessons Learned from a R&D Project. Peer-reviewed paper describing a four-step deduplication pipeline, including similarity scoring and manual review, with a threshold of 0.8 for candidate identification. https://ceur-ws.org/Vol-3135/darliap_paper6.pdf
- ICO. Guide to the UK GDPR: Key Principles. Sets out the data minimisation and accuracy obligations that deduplication must support without undermining. https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-uk-gdpr/key-principles/
- ICO. Personal Data Breaches. Defines reportable incidents including unauthorised disclosure and sets out the 72-hour notification requirement relevant to bad merge outcomes. https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-uk-gdpr/personal-data-breaches/
- ICO. Security of Processing. Requires organisations to be able to restore availability and integrity of personal data after accidental loss or alteration; the basis for mandatory pre-deduplication backups. https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-uk-gdpr/security/
- ICO. Individual Rights: Right to Rectification. Sets out the obligation to correct inaccurate data and the need for audit trails to identify and reverse incorrect merges. https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-uk-gdpr/individual-rights/right-to-rectification/
- ICO (2020). ICO publishes findings into data broking industry (Experian enforcement). Documents how large-scale data processing without transparent consent and documentation creates regulatory exposure. https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2020/10/ico-publishes-findings-into-data-broking-industry/
- EDPB (2023). Data Protection Guide for Small Business. Advises owner-managed businesses on maintaining consent records and responding to subject access requests; directly relevant to safe deduplication practice. https://www.edpb.europa.eu/news/news/2023/edpb-launches-data-protection-guide-small-business_en
- NCSC. Small Business Guide. Recommends mapping data flows and applying access controls as baseline hygiene; informs the pre-deduplication data inventory step. https://www.ncsc.gov.uk/collection/small-business-guide
- OpenSanctions (2021). How we deduplicate companies and people across data sources. Published methodology for cautious threshold-based entity matching with human review; illustrates the risk of false positives in high-stakes deduplication. https://www.opensanctions.org/articles/2021-11-11-deduplication/
- Data8. What is Data Deduplication and Why is it Important? Explains the "single customer view" concept and the operational costs of duplicate customer records for owner-managed businesses. https://www.data-8.co.uk/what-is-data-deduplication-and-why-is-it-important/

Frequently asked questions

Do I need to back up my data before running a deduplication job?

Yes, every time. The ICO's security of processing guidance requires you to be able to restore personal data after accidental loss or alteration. If a deduplication job merges the wrong records, your only route to fixing it is a clean backup. Take a full export of the affected system before any deduplication runs, and confirm you can actually restore from it before you delete or merge anything.
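If your system can export contacts to a spreadsheet, a restore check can be very simple. The sketch below writes records to a CSV file and then reads the file back to confirm nothing was lost. The function names and the CSV format are assumptions for this example; use whatever export and restore route your own CRM provides, and test that route instead.

```python
import csv


def export_backup(records, path, fieldnames):
    """Write every record to a CSV file before any merge is applied."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def verify_backup(records, path, fieldnames):
    """Read the backup back and confirm it matches the live records."""
    with open(path, newline="", encoding="utf-8") as fh:
        restored = list(csv.DictReader(fh))
    # CSV stores everything as text, so compare against text versions.
    expected = [
        {f: "" if r.get(f) is None else str(r[f]) for f in fieldnames}
        for r in records
    ]
    return restored == expected
```

Only start the deduplication run once `verify_backup` returns `True` for the export you have just taken.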

Can a deduplication exercise create a GDPR breach?

Yes, in two ways. First, if a bad merge causes one person's records or correspondence to be visible to another person, that is an unauthorised disclosure that may be reportable to the ICO within 72 hours. Second, if the deduplication removes original consent timestamps or lawful basis records, you lose the ability to demonstrate compliant processing, which undermines your accountability obligations under UK GDPR. Preserving legal and consent records is non-negotiable during any data cleaning exercise.

What similarity threshold should I use to identify duplicates safely?

Research on customer data deduplication recommends starting with a threshold of 0.8 or higher for the first match pass, then reviewing proposed merges manually before applying them. Lower thresholds catch more duplicates but generate more false positives, and a false positive merge can conflate two different real people. Start conservative, review the results, and only adjust the threshold downward after confirming the initial pass is producing accurate matches.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
