What deduplication is and how systems remove repeated data

TL;DR

Deduplication is the technique that spots repeated chunks of data and stores them once, then points everything else at that single copy. In backups it can cut storage by 50 to 60 per cent on typical file shares, and by up to 90 per cent on highly repetitive workloads. In CRMs it removes duplicate customer rows so each real person appears once. For UK service firms it lowers costs, improves data accuracy, and supports ICO obligations on accuracy and security.

Key takeaways

- Deduplication stores one copy of repeated data and replaces every other instance with a small pointer. This differs from compression, which shrinks the contents of individual files.
- Microsoft reports typical storage savings of 50 to 60 per cent on general Windows Server file shares and up to 90 per cent on highly repetitive workloads.
- In CRMs and contact databases, deduplication removes duplicate customer rows so each real person appears once, which directly supports the UK GDPR accuracy principle in Article 5(1)(d).
- The technique only delivers strong savings on repetitive data; media files, design assets, and already-compressed archives deduplicate poorly.
- The NCSC's ransomware guidance still applies: a single deduplicated backup repository is one location, and you still need multiple independent copies.

A founder I spoke to last month was staring at two backup quotes. The on-premise option claimed 12 terabytes of storage; the cloud option claimed 1.2 terabytes. He was being told that the cloud version would back up the same files. Same files, ten times less storage, same vendor’s marketing department signing the slide. He wanted to know whether the cloud quote was lying. The honest answer is that it almost certainly was not, and once you understand deduplication the maths makes more sense than the sales call did.

This post is the plain-English version. No hashing algorithms unless they earn their place, just what the word means, where it actually helps your business, and the questions to put to a supplier before you sign anything.

What is data deduplication?

Deduplication is the technique of spotting repeated chunks of data, storing one copy, and replacing every other instance with a small pointer back to that copy. Oracle and Supermicro both describe it the same way. The system breaks files into blocks, generates a unique identifier for each block, and when two blocks share an identifier it keeps one and references the rest. The application still behaves as if every copy exists.

The textbook example is an email server holding a hundred copies of the same one-megabyte attachment. A naive backup stores 100 megabytes. A deduplicated backup stores one megabyte plus 99 tiny pointers. The data on disk drops by roughly a hundred to one for that file. In databases and CRMs the same idea applies at row level, removing duplicate customer records so each real person appears once.
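For readers who like to see the arithmetic, here is a minimal Python sketch of that email example. It assumes whole-file matching on a SHA-256 fingerprint. The function name dedupe_files and the mailbox paths are invented for illustration, and for simplicity it keeps a pointer for every copy, including the first.

```python
import hashlib

def dedupe_files(files):
    """Store each distinct file body once; map every file name to its fingerprint."""
    store = {}      # fingerprint -> file contents, held once
    pointers = {}   # file name -> fingerprint (the "small pointer")
    for name, body in files.items():
        fingerprint = hashlib.sha256(body).hexdigest()
        store.setdefault(fingerprint, body)  # only the first copy is kept
        pointers[name] = fingerprint
    return store, pointers

# One hundred mailboxes, each holding the same one-megabyte attachment.
attachment = b"x" * 1_000_000
mailboxes = {f"user{i}/report.pdf": attachment for i in range(100)}

store, pointers = dedupe_files(mailboxes)
logical_bytes = sum(len(body) for body in mailboxes.values())
stored_bytes = sum(len(body) for body in store.values())
print(logical_bytes, stored_bytes, len(pointers))  # 100000000 1000000 100
```

The application still sees a hundred files through the pointers, while the disk holds the attachment once.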

How do systems actually remove the repeated data?

Storage-level deduplication typically splits files into chunks and calculates a hash, a short fingerprint, for each chunk. When two fingerprints match, the system stores one chunk and points the duplicates at it. The work happens either at source, before data crosses the network, or at target, on the backup appliance itself. Source deduplication saves bandwidth; target deduplication keeps the load off the live server.
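The chunk-and-fingerprint idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the ChunkStore class, the 4,096-byte chunk size, and the file names are all hypothetical, and SHA-256 stands in for whatever fingerprint a real system uses.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration

class ChunkStore:
    """Toy deduplicating store: unique chunks plus a per-file list of fingerprints."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.recipes = {}  # file name -> ordered fingerprints needed to rebuild it

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)  # skip chunks already held
            recipe.append(fingerprint)
        self.recipes[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

# Monday's backup: ten distinct 4,096-byte chunks. Tuesday's: only the last chunk changed.
monday = b"".join(hashlib.sha256(str(i).encode()).digest() * 128 for i in range(10))
tuesday = monday[:-CHUNK_SIZE] + b"\x00" * CHUNK_SIZE

store = ChunkStore()
store.write("monday.bak", monday)
store.write("tuesday.bak", tuesday)
print(len(monday) + len(tuesday), store.stored_bytes())  # 81920 45056
```

Two backups of 40 KB each are held as eleven unique chunks, and read() rebuilds either file from its recipe, which is why the application behaves as if every copy exists.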

There are three common chunking methods. File-level deduplication only matches whole identical files, which is simple but misses a lot of repetition. Block-level deduplication compares fixed-size segments inside each file and catches far more. Variable-length chunking uses smarter boundaries and typically delivers the best savings, at the cost of more CPU at backup time.
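To see why variable-length chunking tends to win, the sketch below compares the two approaches on a file that has had a single byte inserted at the start. It is a simplified stand-in: real products use proper rolling hashes with minimum and maximum chunk sizes, whereas here a CRC32 of the last 16 bytes decides where to cut. The function names, window size, and mask are assumptions made for the example.

```python
import random
import zlib

def fixed_chunks(data, size=512):
    """Block-level: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, mask=0xFF):
    """Variable-length: cut wherever a checksum of the last `window` bytes hits a pattern."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def shared(old_chunks, new_chunks):
    """Count chunks of the new version that are already stored from the old one."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)

rng = random.Random(7)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"!" + original  # one byte inserted at the very start

shared_fixed = shared(fixed_chunks(original), fixed_chunks(edited))
shared_variable = shared(content_defined_chunks(original), content_defined_chunks(edited))
print(shared_fixed, shared_variable)
```

With fixed blocks the inserted byte shifts every boundary, so no chunk of the edited file matches one already stored. With content-defined boundaries only the first chunk changes and the rest line up again, at the price of computing a checksum at every position.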

In CRMs and contact databases the mechanism is different. Tools like HubSpot or Loqate look at fields such as email, name, and address, score the similarity, and flag possible duplicates for review. The human in the loop is deliberate. Aggressive auto-merging is where this technique most often goes wrong, especially with families sharing one address or contacts who have changed surname.
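A rough Python sketch of that scoring step follows. It does not reflect how HubSpot or Loqate actually match records; the field weights, the 0.9 threshold, and the function names are assumptions chosen for illustration, and the standard library's difflib stands in for a real matching engine. Note that it only flags pairs for a person to review and never merges anything itself.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative weights only; a real tool would tune these against your data.
WEIGHTS = {"email": 0.5, "name": 0.3, "address": 0.2}

def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def score_pair(rec_a, rec_b):
    """Weighted similarity across the fields, between 0 and 1."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def flag_for_review(records, threshold=0.9):
    """Return candidate duplicate pairs for a human to approve or reject."""
    flagged = []
    for a, b in combinations(records, 2):
        score = score_pair(a, b)
        if score >= threshold:
            flagged.append((a["id"], b["id"], round(score, 2)))
    return flagged

records = [
    {"id": 1, "name": "Jane Smith", "email": "jane.smith@example.com",
     "address": "12 High Street, Leeds"},
    {"id": 2, "name": "Jane  Smith", "email": "Jane.Smith@example.com",
     "address": "12 High Street, Leeds"},
    {"id": 3, "name": "Tom Smith", "email": "tom.smith@example.com",
     "address": "12 High Street, Leeds"},
]
flagged = flag_for_review(records)
print(flagged)
```

Records 1 and 2 differ only in capitalisation and spacing, so they are flagged. Record 3 shares the surname and the address but scores below the threshold, which is the family-at-one-address case a conservative rule is meant to leave alone.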

Where does deduplication actually pay back?

Deduplication pays back hardest on repetitive data: nightly backups of the same fileshare, virtual machine images, email attachment hoards, document management systems with multiple draft copies of the same files, and any kind of regular snapshot regime. Microsoft reports typical Windows Server file-share savings of 50 to 60 per cent and up to 90 per cent on highly repetitive workloads such as virtual desktop libraries. Backup appliance vendors routinely report ten-to-one ratios on production backups.
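Vendors quote savings sometimes as a percentage and sometimes as a ratio, which makes quotes hard to compare. The two are the same number written differently, and the conversion is simple enough to sketch in Python. The function names are invented for this example.

```python
def ratio_from_savings(savings_percent):
    """Turn 'saves N per cent' into an 'X-to-one' ratio."""
    return 100 / (100 - savings_percent)

def savings_from_ratio(ratio):
    """Turn an 'X-to-one' ratio into a percentage saved."""
    return 100 * (ratio - 1) / ratio

print(ratio_from_savings(50))  # 2.0
print(ratio_from_savings(90))  # 10.0
print(savings_from_ratio(10))  # 90.0
```

So a 50 per cent saving is two-to-one, and a ten-to-one ratio is a 90 per cent saving. The founder's two quotes, 12 terabytes against 1.2, describe a ten-to-one ratio, which sits at the top of the range Microsoft reports rather than outside it.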

It does not pay back well on already-unique or already-compressed data. Photo libraries, video archives, design files, ZIP and MP4 files, and media-heavy storage will commonly show modest savings because the internal repetition has already been removed by their own compression. Oracle is explicit that the benefit varies widely by workload, and many vendors will not promise a ratio without seeing your data first.

In a CRM, the payback is less about disk and more about data quality. Each customer appears once, marketing campaigns stop double-mailing, billing stops getting confused between two records for the same client, and your reports start matching reality. For a 5 to 50 person service firm, this is usually the higher-value place to invest the time.

When should you ask hard, and when can you ignore it?

Ask hard when you are choosing a backup provider, when a regulatory audit is on the calendar, when your CRM is throwing duplicate-contact errors, or when you are FCA-regulated. The FCA’s FG16/5 guidance is clear that a regulated firm remains responsible for data integrity and availability when storage is outsourced. The accuracy principle in Article 5(1)(d) of UK GDPR, on which the ICO publishes guidance, expects reasonable steps to keep personal data current. Deduplication touches both duties.

The questions are not technical. Does your backup use block-level or variable-chunk deduplication, and what ratio do you typically see for our type of data? How long does a full restore actually take from your deduplicated repository? Where does the deduplicated data physically live, and is it encrypted at rest? What is your matching logic in our CRM, and who approves merges? Four questions, all of which a competent supplier can answer in one meeting.

Ignore the question when your data is mostly unique creative work, when your CRM volume is so low that one cleanup spreadsheet a quarter does the job, or when you are paying for a managed service whose contract already commits to specific recovery times and data accuracy outcomes. At that point what you care about is the outcome, not the technique that delivers it.

Compression and deduplication are the two storage-saving techniques most often confused. Compression shrinks each file by encoding repeated patterns inside the file. Deduplication works across files and records, replacing whole repeated chunks with a single shared copy. Modern backup systems usually run both at once. The combined saving is often larger than either alone, though the two ratios do not simply multiply.

Data redundancy is the other word you will hear in the same conversation. Redundancy is repetition by design: the same data held in multiple places on purpose for resilience. The 3-2-1 backup rule recommended by the NCSC is a deliberate redundancy pattern: three copies, two storage types, one off-site. Deduplication reduces accidental repetition inside one repository. Redundancy keeps deliberate repetition across repositories. You want both, for different reasons.
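As a quick way to sanity-check a backup setup against that rule, here is a small Python sketch. The BackupCopy fields and the example copies are hypothetical, and the check only counts copies, storage types, and off-site locations; it says nothing about whether a restore actually works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # storage type, e.g. "nas", "cloud", "tape"
    offsite: bool   # True if held away from the main premises

def meets_3_2_1(copies):
    """Three copies, at least two storage types, at least one off-site."""
    return (
        len(copies) >= 3
        and len({copy.media for copy in copies}) >= 2
        and any(copy.offsite for copy in copies)
    )

# A single deduplicated repository, however efficient, is one copy in one place.
single_repo = [BackupCopy("dedupe appliance", "nas", offsite=False)]

# Live data, a local backup, and a cloud backup.
layered = [
    BackupCopy("live file server", "disk", offsite=False),
    BackupCopy("dedupe appliance", "nas", offsite=False),
    BackupCopy("cloud backup", "cloud", offsite=True),
]

print(meets_3_2_1(single_repo), meets_3_2_1(layered))  # False True
```

Deduplication can run inside any of those copies without changing the count. It makes each copy smaller; it does not make one copy into three.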

Master data management sits above all of this. Where deduplication removes duplicate rows in a single system, master data management defines which system holds the authoritative version of each customer, product, or supplier record so the other systems can sync to it. For a service firm with a CRM, an accounts package, and a project tool, the right sequence is usually to deduplicate inside each, then agree which one is the source of truth, then build the sync.

If you want to think this through against your own backup and CRM rather than in the abstract, book a conversation. The right answer depends on data volume, regulatory frame, and what your current supplier is already doing. An hour is usually enough to know whether deduplication is something to push your supplier on, or something to leave well alone.

Sources

- Information Commissioner's Office. The accuracy principle under UK GDPR. Article 5(1)(d) requirement that personal data be accurate and kept up to date, with guidance on reasonable steps to erase or rectify inaccurate data. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-protection-principles/a-guide-to-the-data-protection-principles/accuracy/
- National Cyber Security Centre. Small Business Guide, backing up your data. UK government guidance on backup principles for small organisations, including testing restores and protecting against ransomware. https://www.ncsc.gov.uk/collection/small-business-guide/backing-up-your-data
- National Cyber Security Centre. Offline backups in an online world. Whitepaper on maintaining multiple immutable backup copies as defence against ransomware affecting connected primary systems. https://www.ncsc.gov.uk/whitepaper/offline-backups-in-an-online-world
- Microsoft Learn. Data Deduplication overview for Windows Server. Vendor documentation reporting typical storage savings of 50 to 60 per cent on general file servers and up to 90 per cent on VDI workloads. https://learn.microsoft.com/en-us/windows-server/storage/data-deduplication/overview
- Oracle (2024). What is data deduplication? Vendor primer on chunking, hashing, source versus target dedupe, and the trade-off between storage savings and CPU overhead. https://www.oracle.com/data-deduplication/
- Financial Conduct Authority (2016). FG16/5, Guidance for firms outsourcing to the cloud and other third-party IT services. Regulated firms remain responsible for data integrity and availability when using third-party storage. https://www.fca.org.uk/publications/finalised-guidance/fg16-5-guidance-firms-outsourcing-cloud-and-other-third-party-it-arrangements
- EU AI Act (2024). Regulation (EU) 2024/1689, Article 10 on data governance. Providers of high-risk AI systems must address quality issues including errors and duplicates in training data. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- Information Commissioner's Office. Direct marketing guidance. Expectation that organisations maintain accurate suppression lists and avoid sending duplicate or unwanted messages. https://ico.org.uk/for-organisations/direct-marketing/
- Loqate (GBG plc). What is data deduplication and what are the benefits? UK-headquartered data quality vendor on record-level deduplication for customer databases. https://www.loqate.com/en-gb/blog/what-is-data-deduplication-and-what-are-the-benefits/
- Taylor Wessing (2021). The accuracy principle under the GDPR. UK legal commentary on the compliance risk of holding multiple inconsistent copies of personal data. https://www.taylorwessing.com/en/insights-and-events/insights/2021/03/the-accuracy-principle-under-the-gdpr

Frequently asked questions

What is the difference between deduplication and compression?

Compression shrinks individual files by encoding repeated patterns inside the file. Deduplication works across many files or records and replaces whole repeated chunks with a single shared copy. A backup system can do both at once. Compression typically shrinks a single file to somewhere between a half and a third of its original size. Deduplication can cut storage by a factor of ten or more when the underlying data is highly repetitive, such as nightly backups of the same fileshare or hundreds of copies of the same email attachment.

Will deduplicating my CRM accidentally merge different customers?

It can if the matching rules are too aggressive. A family sharing one email address, two contacts at the same company with similar names, or a customer who has changed surname can all get wrongly merged. The Information Commissioner's Office accuracy principle expects you to take reasonable steps to keep personal data accurate, which includes not merging people who are not actually duplicates. Tune your matching rules conservatively, review flagged matches before merging, and keep an audit trail.

Does deduplication replace the need for multiple backups?

No. The NCSC's ransomware guidance recommends multiple independent backup copies in different locations and timeframes, often summarised as the 3-2-1 rule. A deduplicated backup is highly efficient, but it is still one location. If that one repository is encrypted by ransomware or corrupted by a hardware failure, deduplication does not help you recover. Treat deduplication as a storage efficiency technique, not as a substitute for backup independence.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.
