Deduplication vs compression: what each one actually does

A founder I work with noticed her cloud backup invoice had crept up steadily for eighteen months. When she finally looked at the settings, her backup tool offered two separate options, “Enable deduplication” and “Enable compression.” Both listed storage savings as the benefit. She turned them both on. Within a week, her backup window had doubled and the first restore test she ran timed out.

She had two problems. She had applied compression first, which randomised the data patterns that deduplication relies on to find matches. She had also applied both techniques to an archive dominated by video files, where neither delivers meaningful savings.

Deduplication and compression are frequently presented as equivalent storage-reduction tools, but the two techniques solve different problems, perform differently depending on data type, and interact in ways that matter if you apply them without understanding which one your workload actually needs.

What choice are you actually facing?

Both techniques reduce storage costs, but they target different kinds of waste. Compression shrinks individual files by encoding repeated patterns within a single data stream. Backup compression is lossless, so files restore exactly as saved. Deduplication works across your whole storage estate, finding identical blocks across many files and keeping one copy while replacing the rest with pointers. Which tool fits depends on where your workload’s waste actually sits.

The scope is the key distinction. Compression operates within one file or data stream at a time, so it works whether or not any other copy of the file exists anywhere. Deduplication requires a pool of stored data to compare against, which is why it returns its biggest savings on backup repositories and file shares where the same content appears repeatedly.
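A small experiment makes the difference in scope concrete. The sketch below is illustrative only: it assumes fixed 4 KiB blocks and SHA-256 block hashes, uses Python's zlib as a stand-in for whatever codec a backup product actually uses, and the function names are hypothetical rather than taken from any tool.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products often use variable-size chunks


def compressed_size(files):
    """Total bytes if every file is compressed on its own, with no shared state."""
    return sum(len(zlib.compress(data)) for data in files)


def deduplicated_size(files, block_size=BLOCK_SIZE):
    """Total bytes if each distinct block across ALL files is stored once."""
    seen = set()
    stored = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return stored


# Workload 1: five identical copies of a file whose content is random internally.
rng = random.Random(42)
unique_blob = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
copies = [unique_blob] * 5
raw_copies = sum(len(f) for f in copies)

# Workload 2: one log file, repetitive inside, with no second copy anywhere.
log_file = "".join(
    f"2024-01-01 INFO request id={i} status=200\n" for i in range(5000)
).encode()

print("copies: raw", raw_copies,
      "compressed", compressed_size(copies),
      "deduplicated", deduplicated_size(copies))
print("log:    raw", len(log_file),
      "compressed", compressed_size([log_file]),
      "deduplicated", deduplicated_size([log_file]))
```

On the five copies, compression saves nothing because each copy looks random on its own, while deduplication stores the content once. On the single log file the result flips: compression shrinks it substantially, and deduplication has nothing to match against.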

When deduplication is the right lever

Deduplication delivers the biggest savings when large volumes of similar data exist across many files or backup sets. Back up 20 laptops running the same Windows build and the same operating system files appear 20 times in every full backup, and thousands of times over a retention period. Microsoft reported savings of 30% to 95% on user documents and virtualisation libraries when it built deduplication into Windows Server. Those ratios only appear when the underlying data is genuinely repetitive.

Virtual machine images are another strong candidate. A hypervisor running ten guest VMs built from the same base image stores nearly all OS blocks once after deduplication, not ten times over. Backup vendors such as Veeam and Dell EMC position deduplication as the primary reduction technique for VM backup storage, with Data Domain systems reporting effective capacity gains of 10:1 to 30:1 across mixed workloads.
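The arithmetic behind those ratios is worth doing on your own numbers. The function below is a rough back-of-envelope model, not a vendor formula: it assumes the shared base image deduplicates perfectly and that each guest's unique data does not deduplicate at all. The 40 GB and 2 GB figures are hypothetical.

```python
def dedup_ratio(images: int, shared_gb: float, unique_gb: float) -> float:
    """Logical size divided by stored size when shared blocks are kept once."""
    logical = images * (shared_gb + unique_gb)
    stored = shared_gb + images * unique_gb
    return logical / stored


# Hypothetical estate: ten guests, 40 GB of common base image, 2 GB unique each.
print(f"{dedup_ratio(10, 40, 2):.1f}:1")   # 420 GB logical / 60 GB stored -> 7.0:1

# Same base image, but each guest carries 20 GB of its own data.
print(f"{dedup_ratio(10, 40, 20):.1f}:1")  # 600 GB logical / 240 GB stored -> 2.5:1
```

The second case shows how quickly the ratio falls once unique data starts to outweigh the shared base.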

Source deduplication, where data is deduplicated before it travels across the network, also cuts bandwidth significantly. For a business sending daily backups to a cloud target over a standard broadband uplink, this can matter as much as the storage saving itself.
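As a rough illustration of why source deduplication saves bandwidth, the sketch below has the client hash its blocks, ask the repository which hashes it lacks, and upload only those. The `BackupTarget` class, the fixed 4 KiB blocks, and the function names are all assumptions made for the example, not any product's protocol. It also ignores the small cost of exchanging the hashes themselves.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def split_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BackupTarget:
    """Hypothetical repository that stores blocks keyed by their SHA-256 digest."""

    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        return {d for d in digests if d not in self.blocks}

    def upload(self, digest, block):
        self.blocks[digest] = block


def source_side_backup(data, target):
    """Send only blocks the target lacks; return the payload bytes uploaded."""
    blocks = {hashlib.sha256(b).digest(): b for b in split_blocks(data)}
    sent = 0
    for digest in target.missing(blocks.keys()):
        target.upload(digest, blocks[digest])
        sent += len(blocks[digest])
    return sent


# 100 distinct 4 KiB blocks standing in for Monday's data.
monday = b"".join(hashlib.sha256(str(i).encode()).digest() * 128 for i in range(100))
# Tuesday: the same data with exactly one block changed.
tuesday = monday[:10 * BLOCK_SIZE] + b"\x00" * BLOCK_SIZE + monday[11 * BLOCK_SIZE:]

target = BackupTarget()
first = source_side_backup(monday, target)
second = source_side_backup(tuesday, target)
print(first, second)  # 409600 4096
```

The first backup uploads everything. The second uploads a single block, which is the behaviour that matters on a constrained uplink.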

The workload where deduplication does worst is a large collection of unique files. A law firm with a document archive of individual client files will see modest ratios. A media business storing raw video files will see almost none.

When compression is the right lever

Compression works best where data is unique across files but internally repetitive. Databases, log archives, CSV exports, and structured text are the prime candidates. A SQL database holding customer records contains thousands of repeated field names, common values, and formatting patterns that compression encodes tightly within each stream. Typical ratios run 2:1 to 4:1 on business documents, lower on already-compressed formats, and negligible on encrypted data.

Primary databases and transactional applications benefit from compression specifically because the content is unique across records but the structure is highly repetitive. Microsoft SQL Server offers row-level and page-level compression tuned for exactly this kind of workload, and applying it does not create the cross-file dependency that deduplication introduces.

Already-compressed and encrypted formats deserve a clear call-out. Video files (MP4), JPEG images, and encrypted archives look essentially random to compression algorithms, and deduplication finds matches in them only where whole identical copies exist. Applying either technique to a library of unique media files or an encrypted data store burns CPU while returning near-zero savings. If your data is dominated by these formats, cheaper raw storage plus a sensible lifecycle policy often beats either technique.
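One inexpensive way to check whether a data set falls into this category is to compress a sample and look at the ratio before enabling anything. The probe below is a minimal sketch using Python's zlib. The 1.1 threshold and the function names are arbitrary assumptions, random bytes stand in for encrypted or media content, and the synthetic CSV sample is far more repetitive than real business data would be.

```python
import os
import zlib


def compression_ratio(sample: bytes) -> float:
    """Original size divided by zlib-compressed size for one sample."""
    if not sample:
        return 1.0
    return len(sample) / len(zlib.compress(sample))


def worth_compressing(sample: bytes, threshold: float = 1.1) -> bool:
    """True if the sample shrinks by enough to justify the CPU cost (assumed threshold)."""
    return compression_ratio(sample) >= threshold


csv_like = b"customer_id,country,status\n" + b"1001,GB,active\n" * 2000
random_like = os.urandom(len(csv_like))  # stands in for encrypted or media data

for name, sample in [("csv_like", csv_like), ("random_like", random_like)]:
    print(name, round(compression_ratio(sample), 2), worth_compressing(sample))
```

Running the same probe over a few hundred megabytes of real files per share or backup job gives a far better basis for a decision than a format list alone.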

What it costs to get this wrong

Budgets built on vendor demo ratios rather than measurements on your actual data are the most common failure mode. A 10:1 deduplication ratio sounds compelling. If your backup estate has high file uniqueness, you may see 2:1 in practice. Sizing your retention window around the optimistic number means hitting your storage ceiling early and facing either emergency spend or fewer backup snapshots than your operations actually need.
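The gap is easy to put into numbers. The function below is a deliberately crude model with hypothetical figures: it assumes every day adds the same logical volume and that the reduction ratio stays constant, neither of which holds exactly in practice.

```python
import math


def retention_days(capacity_tb: float, daily_logical_tb: float, ratio: float) -> int:
    """Whole days of backups that fit in the repository at a given reduction ratio."""
    return math.floor(capacity_tb * ratio / daily_logical_tb)


# Hypothetical repository: 20 TB usable, 0.5 TB of logical backup data per day.
print(retention_days(20, 0.5, 10))  # 400 days at the vendor's 10:1
print(retention_days(20, 0.5, 2))   # 80 days at a measured 2:1
```

A plan built on 400 days of retention that actually delivers 80 is the storage ceiling arriving five times sooner than budgeted.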

There is a less visible risk around data integrity. Deduplication relies on block hashing to identify duplicates. If the deduplication metadata becomes corrupted, large volumes of data can be affected by a single fault. In 2019, Code42 disclosed that a bug in its backup deduplication system caused certain Mac files to be omitted silently, a gap that only appeared when users attempted to restore.
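To see why one fault can spread so widely, consider a toy block store in which each file is just an ordered list of block hashes. The `DedupStore` class below is a simplified illustration with assumed names and fixed 4 KiB blocks, not how any particular product lays out its metadata.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


class DedupStore:
    def __init__(self):
        self.blocks = {}   # digest -> block bytes, stored once
        self.recipes = {}  # file name -> ordered list of digests (the pointers)

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.recipes[name] = recipe

    def restore(self, name):
        """Reassemble a file, re-hashing every block to catch silent corruption."""
        out = []
        for digest in self.recipes[name]:
            block = self.blocks[digest]
            if hashlib.sha256(block).hexdigest() != digest:
                raise ValueError(f"block {digest[:12]} failed verification in {name}")
            out.append(block)
        return b"".join(out)

    def damaged_files(self):
        bad = []
        for name in self.recipes:
            try:
                self.restore(name)
            except ValueError:
                bad.append(name)
        return bad


store = DedupStore()
base = b"A" * BLOCK_SIZE  # block shared by every image
for name in ("a.img", "b.img", "c.img"):
    store.put(name, base + name.encode() * 100)

# Simulate a single corrupted block: the shared one.
store.blocks[hashlib.sha256(base).hexdigest()] = b"X" * BLOCK_SIZE
print(store.damaged_files())  # ['a.img', 'b.img', 'c.img']
```

One damaged block breaks all three restores, which is why regular verified restore tests matter more on deduplicated storage than on plain copies.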

The UK GDPR adds a compliance dimension. Article 5 requires personal data to be kept no longer than necessary, which aligns with both techniques in principle. Reducing backup volume does not, however, reduce your legal obligations. You still need processes to locate and, where permissible, erase personal data on request, even if that data sits inside a deduplicated or compressed backup set. The ICO’s guidance confirms that the right to erasure applies to backups, with limited exemptions.

Order of operations also matters when combining these techniques with encryption. The NCSC recommends encrypting data at rest, including cloud backups. Both deduplication and compression must run before encryption is applied, because encrypted data looks random to both algorithms. Applying them in the wrong sequence burns CPU while producing no meaningful storage reduction.
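If your tooling lets you configure the pipeline yourself, a simple check can catch a bad sequence before it reaches production. The sketch below encodes the order described in this article (deduplicate, then compress, then encrypt). The stage names and the function are hypothetical and do not correspond to any real product's configuration format.

```python
from itertools import combinations

# Assumed stage names, listed in the order they should run.
REQUIRED_ORDER = ["deduplicate", "compress", "encrypt"]


def check_stage_order(stages):
    """Return a list of ordering problems; an empty list means the order is sensible."""
    positions = {stage: index for index, stage in enumerate(stages)}
    problems = []
    for earlier, later in combinations(REQUIRED_ORDER, 2):
        if earlier in positions and later in positions:
            if positions[earlier] > positions[later]:
                problems.append(f"{earlier} should run before {later}")
    return problems


print(check_stage_order(["deduplicate", "compress", "encrypt"]))  # []
print(check_stage_order(["compress", "deduplicate", "encrypt"]))
# ['deduplicate should run before compress']
print(check_stage_order(["encrypt", "compress"]))
# ['compress should run before encrypt']
```

Stages that are not configured at all are simply skipped, so the check works whether you use one technique or all three.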

What to ask before you decide

Before committing to a storage or backup platform’s deduplication or compression settings, six questions will separate a well-fitted choice from a vendor-pitched one. The questions apply whether you are buying a new backup tool, reviewing what you already have, or being asked by your IT supplier to upgrade a storage tier. Start with your data, not with the feature sheet.

Ask the vendor what efficiency ratios they can demonstrate on data similar to yours, based on measurement rather than a benchmark from their demo environment. Insisting on a pilot against a sample of your real backup data is reasonable, and reputable vendors will accommodate it.

Ask where deduplication runs: at the source, before data leaves your machines, or at the target, after it arrives at the backup repository. Source deduplication cuts network bandwidth. Target deduplication is simpler to manage but transmits more data.

Restore performance is the third question to pin down. Deduplicated and compressed backups must be reassembled on restore, sometimes called rehydration. For systems with tight recovery time objectives, get tested numbers on full restore times under realistic conditions, not a theoretical figure from a pre-sales demo.

Encryption fit matters equally. Confirm that deduplication and compression run before encryption is applied, that keys are held by you rather than solely by the vendor, and that the provider’s security posture is independently certified.

Check what portability looks like at exit. Many backup platforms store deduplicated data in proprietary formats, so migration requires rehydrating the backup, converting it back to full copies before it can be moved. The FCA’s operational resilience guidance encourages documented exit plans from third-party services; the principle applies to any SME, not only regulated firms.

Finally, ask whether any stored logs or training data feed AI features in your business. The EU AI Act requires high-risk AI systems to use datasets subject to appropriate data governance, so deduplication or compression applied to those datasets must not introduce silent data loss that undermines traceability.

In short, if you back up many similar systems or run workloads with heavily shared content, deduplication is the first lever to pull. If you run databases and log-heavy applications, compression works harder. If your data is predominantly video or already encrypted, spend the budget on cheaper raw storage and a sensible retention policy instead.
