A founder I spoke to last month was staring at two backup quotes. The on-premise option claimed 12 terabytes of storage; the cloud option claimed 1.2 terabytes. He was being told that the cloud version would back up the same files. Same files, ten times less storage, same vendor’s marketing department signing the slide. He wanted to know whether the cloud quote was lying. The honest answer is that it almost certainly was not, and once you understand deduplication the maths makes more sense than the sales call did.
This post is the plain-English version. No hashing algorithms unless they earn their place, just what the word means, where it actually helps your business, and the questions to put to a supplier before you sign anything.
What is data deduplication?
Deduplication is the technique of spotting repeated chunks of data, storing one copy, and replacing every other instance with a small pointer back to that copy. Oracle and Supermicro both describe it the same way. The system breaks files into blocks, generates an identifier from the contents of each block, and when two blocks produce the same identifier it keeps one and references the rest. The application still behaves as if every copy exists.
The textbook example is an email server holding a hundred copies of the same one-megabyte attachment. A naive backup stores 100 megabytes. A deduplicated backup stores one megabyte plus 99 tiny pointers. The data on disk drops by roughly a hundred to one for that file. In databases and CRMs the same idea applies at row level, removing duplicate customer records so each real person appears once.
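For readers who like to see the arithmetic run, here is a minimal Python sketch of that email example. The function name `dedup_store` and the use of SHA-256 as the fingerprint are assumptions for illustration, not how any particular mail or backup product works. The sketch keeps one pointer per mailbox, including the first.

```python
import hashlib

def dedup_store(items):
    """Store each distinct payload once; return (store, pointers)."""
    store = {}      # fingerprint -> the single stored copy
    pointers = []   # one small reference per original item
    for payload in items:
        fingerprint = hashlib.sha256(payload).hexdigest()
        store.setdefault(fingerprint, payload)   # keep only the first copy
        pointers.append(fingerprint)
    return store, pointers

attachment = b"x" * 1_000_000        # stand-in for a one-megabyte attachment
mailboxes = [attachment] * 100       # a hundred mailboxes hold the same file

store, pointers = dedup_store(mailboxes)
naive_bytes = sum(len(m) for m in mailboxes)
stored_bytes = sum(len(p) for p in store.values())

print(naive_bytes)     # 100000000
print(stored_bytes)    # 1000000
print(len(pointers))   # 100
```

Each pointer here is a 64-character fingerprint, which is the "tiny" part of the bargain compared with a megabyte of attachment.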
How do systems actually remove the repeated data?
Storage-level deduplication typically splits files into chunks and calculates a hash, a short fingerprint, for each chunk. When two fingerprints match, the system stores one chunk and points the duplicates at it. The work happens either at source, before data crosses the network, or at target, on the backup appliance itself. Source deduplication saves bandwidth; target deduplication keeps the load off the live server.
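The chunk-and-fingerprint step can be sketched in a few lines of Python. This is an illustration under stated assumptions: the 4,096-byte chunk size, the SHA-256 fingerprint, and the helper names `block`, `backup` and `restore` are all invented for the example. The same logic could run at source or at target; only where it runs changes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for illustration

def block(label):
    """Build one deterministic 4,096-byte chunk of sample data."""
    return hashlib.sha256(label.encode()).digest() * 128

def backup(data, store):
    """Split data into chunks, store unseen ones, return the list of fingerprints."""
    recipe = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in store:
            store[fingerprint] = chunk       # first time this chunk is seen
        recipe.append(fingerprint)           # pointer, whether new or repeated
    return recipe

def restore(recipe, store):
    """Rebuild the original bytes by following the pointers in order."""
    return b"".join(store[f] for f in recipe)

monday = block("a") + block("b") + block("c") + block("d")
tuesday = block("a") + block("b") + block("c2") + block("d")   # one chunk edited

store = {}
monday_recipe = backup(monday, store)
chunks_after_monday = len(store)
tuesday_recipe = backup(tuesday, store)

print(chunks_after_monday)                         # 4
print(len(store))                                  # 5
print(restore(tuesday_recipe, store) == tuesday)   # True
```

Tuesday's backup adds one chunk rather than four, and both days can still be restored in full.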
There are three common chunking methods. File-level deduplication only matches whole identical files, which is simple but misses a lot of repetition. Block-level deduplication compares fixed-size segments inside each file and catches far more. Variable-length chunking uses smarter boundaries and typically delivers the best savings, at the cost of more CPU at backup time.
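To see why the smarter boundaries matter, the sketch below inserts a single byte at the front of a file and compares how many chunks each method can reuse. It is a toy under stated assumptions: the CRC-based boundary rule, the 48-byte window and the 1,024 divisor are illustrative choices, and real products use faster rolling hashes with minimum and maximum chunk sizes.

```python
import hashlib
import zlib

def sample_data(n_blocks):
    """Deterministic pseudo-random bytes, so the example is repeatable."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(n_blocks))

def fixed_chunks(data, size=1024):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=48, divisor=1024):
    """Cut wherever a checksum of the last `window` bytes is a multiple of `divisor`."""
    chunks, start = [], 0
    for end in range(window, len(data)):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks

def shared_fraction(old_chunks, new_chunks):
    """Share of the new version's chunks that the store already holds."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    return hits / len(new_chunks)

original = sample_data(2048)     # 64 KiB of sample data
edited = b"!" + original         # one byte inserted at the very start

fixed_reuse = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
variable_reuse = shared_fraction(variable_chunks(original), variable_chunks(edited))

print(f"fixed-size reuse:      {fixed_reuse:.0%}")
print(f"variable-length reuse: {variable_reuse:.0%}")
```

With fixed-size chunks the inserted byte shifts every boundary, so nothing matches the earlier backup. With content-defined boundaries the cuts move with the data, so only the chunk containing the edit is new. The price is a checksum at every byte position, which is the extra CPU at backup time.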
In CRMs and contact databases the mechanism is different. Tools like HubSpot or Loqate look at fields such as email, name, and address, score the similarity, and flag possible duplicates for review. The human in the loop is deliberate. Aggressive auto-merging is where this technique most often goes wrong, especially with families sharing one address or contacts who have changed surname.
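A rough Python sketch of score-and-flag matching is below. It is not how HubSpot or Loqate work internally; the field weights, the 0.85 threshold and the function names are assumptions made up for the example, and the similarity measure is the standard library's `difflib.SequenceMatcher`. Nothing is merged: the output is a queue for a person to review.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalise(value):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(value.lower().split())

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def score(rec_a, rec_b):
    """Weighted similarity across fields; the weights are illustrative assumptions."""
    return (0.5 * similarity(rec_a["email"], rec_b["email"])
            + 0.3 * similarity(rec_a["name"], rec_b["name"])
            + 0.2 * similarity(rec_a["address"], rec_b["address"]))

def review_queue(records, threshold=0.85):
    """Return candidate pairs for a human to approve; nothing is merged here."""
    queue = []
    for a, b in combinations(records, 2):
        s = score(a, b)
        if s >= threshold:
            queue.append((a["id"], b["id"], round(s, 2)))
    return queue

records = [
    {"id": 1, "name": "Jane Smith", "email": "jane.smith@example.com",
     "address": "12 High Street, Leeds"},
    {"id": 2, "name": "Jane  Smith", "email": "Jane.Smith@example.com",
     "address": "12 High St, Leeds"},
    {"id": 3, "name": "Tom Smith", "email": "tom.smith@example.com",
     "address": "12 High Street, Leeds"},
]

queue = review_queue(records)
for first, second, s in queue:
    print(f"Review records {first} and {second} (score {s})")
```

The two Jane Smith records are flagged, while Tom Smith at the same address falls below the threshold. He does not fall far below it, which is exactly why a lower threshold combined with automatic merging is risky for households.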
Where does deduplication actually pay back?
Deduplication pays back hardest on repetitive data: nightly backups of the same fileshare, virtual machine images, email attachment hoards, document management systems holding multiple drafts of the same files, and any regular snapshot regime. Microsoft reports typical Windows Server file-share savings of 50 to 60 per cent and up to 90 per cent on highly repetitive workloads such as virtual desktop libraries. Backup appliance vendors routinely report ten-to-one ratios on production backups.
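Vendors quote savings as percentages in one breath and as ratios in the next, so it helps to be able to convert between them. The short Python sketch below does the conversion; the function names are invented for the example.

```python
def ratio_from_saving(saving_pct):
    """A 90 per cent saving means 10 units of data fit in 1 unit of storage."""
    return 100 / (100 - saving_pct)

def saving_from_ratio(ratio):
    """A 10:1 ratio means 9 units in every 10 are no longer stored."""
    return 100 * (1 - 1 / ratio)

for pct in (50, 60, 90):
    print(f"{pct}% saving = {ratio_from_saving(pct):.1f}:1")
print(f"10:1 ratio = {saving_from_ratio(10):.0f}% saving")
print(f"12 TB at 10:1 needs about {12 / 10:.1f} TB")
```

A 50 per cent saving is only two-to-one, while 90 per cent is ten-to-one. That last figure matches the founder's two quotes: 12 terabytes of files at ten-to-one comes to about 1.2 terabytes stored.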
It does not pay back well on already-unique or already-compressed data. Photo libraries, video archives, design files, ZIP and MP4 files, and media-heavy storage will commonly show modest savings because the internal repetition has already been removed by their own compression. Oracle is explicit that the benefit varies widely by workload, and many vendors will not promise a ratio without seeing your data first.
In a CRM, the payback is less about disk and more about data quality. Each customer appears once, marketing campaigns stop double-mailing, billing stops getting confused between two records for the same client, and your reports start matching reality. For a 5 to 50 person service firm, this is usually the higher-value place to invest the time.
When should you ask hard, and when can you ignore it?
Ask hard when you are choosing a backup provider, when a regulatory audit is on the calendar, when your CRM is throwing duplicate-contact errors, or when you are FCA-regulated. The FCA’s FG16/5 guidance is clear that a regulated firm remains responsible for data integrity and availability when storage is outsourced. The accuracy principle in Article 5(1)(d) of UK GDPR, which the ICO enforces, expects reasonable steps to keep personal data accurate and, where necessary, up to date. Deduplication touches both duties.
The questions are not technical. Does your backup use block-level or variable-chunk deduplication, and what ratio do you typically see for our type of data? How long does a full restore actually take from your deduplicated repository? Where does the deduplicated data physically live, and is it encrypted at rest? What is your matching logic in our CRM, and who approves merges? Four questions, all of which a competent supplier can answer in one meeting.
Ignore the question when your data is mostly unique creative work, when your CRM volume is so low that one cleanup spreadsheet a quarter does the job, or when you are paying for a managed service whose contract already commits to specific recovery times and data accuracy outcomes. At that point what you care about is the outcome, not the technique that delivers it.
Related concepts worth knowing
Compression and deduplication are the two storage-saving techniques most often confused. Compression shrinks each file by encoding repeated patterns inside the file. Deduplication works across files and records, replacing whole repeated chunks with a single shared copy. Modern backup systems usually run both at once. The combined saving is often larger than either alone, though the two ratios do not simply multiply.
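The difference is easy to see with ten identical copies of one document. The Python sketch below uses `zlib` for compression and a SHA-256 fingerprint for whole-file deduplication; the sample text and variable names are assumptions made up for the example.

```python
import hashlib
import zlib

document = ("Quarterly report. Revenue up, costs flat, outlook steady. " * 200).encode()
files = [document] * 10      # ten identical copies across a fileshare

raw = sum(len(f) for f in files)

# Compression only: each file shrinks, but all ten are still stored.
compressed_only = sum(len(zlib.compress(f)) for f in files)

# Deduplication only: one full-size copy is kept.
unique = {hashlib.sha256(f).digest(): f for f in files}
dedup_only = sum(len(f) for f in unique.values())

# Both: one copy is kept, and that copy is compressed.
both = sum(len(zlib.compress(f)) for f in unique.values())

print(f"raw:              {raw} bytes")
print(f"compression only: {compressed_only} bytes")
print(f"dedup only:       {dedup_only} bytes")
print(f"both:             {both} bytes")
```

In this toy case the repetition inside the file and the repetition across files are entirely separate, so the two savings stack neatly. With real data they tend to overlap, because both techniques are chasing some of the same repeated bytes.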
Data redundancy is the other word you will hear in the same conversation. Redundancy is repetition by design, the same data held in multiple places on purpose for resilience. The 3-2-1 backup rule recommended by the NCSC is a deliberate redundancy pattern, three copies, two storage types, one off-site. Deduplication reduces accidental repetition inside one repository. Redundancy keeps deliberate repetition across repositories. You want both, for different reasons.
Master data management sits above all of this. Where deduplication removes duplicate rows in a single system, master data management defines which system holds the authoritative version of each customer, product, or supplier record so the other systems can sync to it. For a service firm with a CRM, an accounts package, and a project tool, the right sequence is usually to deduplicate inside each, then agree which one is the source of truth, then build the sync.
If you want to think this through against your own backup and CRM rather than in the abstract, book a conversation. The right answer depends on data volume, regulatory frame, and what your current supplier is already doing. An hour is usually enough to know whether deduplication is something to push your supplier on, or something to leave well alone.



