Deduplication versus compression: what each one actually does

TL;DR

Deduplication and compression both reduce storage costs but solve different problems. Deduplication removes duplicate blocks across your whole backup estate and delivers its biggest savings on similar-system backups and virtual machine images. Compression shrinks individual files internally and works best on databases, logs, and structured text. Apply both before encrypting, and ask vendors for measured ratios on your actual data before committing to any platform's defaults.

Key takeaways

- Compression shrinks individual files by encoding repeated patterns within a single data stream; deduplication removes duplicate blocks across your whole storage estate and replaces them with pointers to one stored copy.
- Deduplication delivers its biggest savings on backup repositories with many similar systems, virtual machine images, and long-retention archives where identical blocks appear repeatedly across files.
- Compression works best on databases, log files, and structured text where content is unique across files but internally repetitive; it delivers little on already-compressed formats such as MP4 or JPEG.
- Both deduplication and compression must run before encryption is applied; encrypted data looks random to both algorithms and applying them in the wrong sequence wastes CPU for no meaningful storage gain.
- Before committing to any platform, ask vendors for measured ratios on your actual data, confirm restore performance under realistic conditions, and check what portability looks like when you want to exit.

A founder I work with noticed her cloud backup invoice had crept up steadily for eighteen months. When she finally looked at the settings, her backup tool offered two separate options: “Enable deduplication” and “Enable compression.” Both listed storage savings as the benefit. She turned them both on. Within a week, her backup window had doubled and the first restore test she ran timed out.

She had two problems. She had applied compression first, which randomised the data patterns that deduplication relies on to find matches. She had also applied both techniques to an archive dominated by video files, where neither delivers meaningful savings.

Deduplication and compression are frequently presented as equivalent storage-reduction tools, but the two techniques solve different problems, perform differently depending on data type, and interact in ways that matter if you apply them without understanding which one your workload actually needs.

What choice are you actually facing?

Both techniques reduce storage costs, but they target different kinds of waste. Compression shrinks individual files by encoding repeated patterns within a single data stream. It is lossless, so you can restore files exactly as saved. Deduplication works across your whole storage estate, finding identical blocks across many files and keeping one copy while replacing the rest with pointers. Which tool fits depends on where your workload’s waste actually sits.

The scope is the key distinction. Compression operates within one file or data stream at a time, so it works whether or not any other copy of the file exists anywhere. Deduplication requires a pool of stored data to compare against, which is why it returns its biggest savings on backup repositories and file shares where the same content appears repeatedly.
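To make the difference in scope concrete, here is a minimal sketch in Python. It assumes fixed-size 4 KiB blocks and SHA-256 hashes, which is a simplification (many products use variable-size chunks), and the three "laptop" images are made-up test data, not output from any real backup tool. The `dedupe` and `restore` helpers are hypothetical names for illustration.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed block size; real products often chunk differently


def dedupe(files: dict[str, bytes], chunk_size: int = CHUNK_SIZE):
    """Keep one copy of each distinct block; each file becomes a list of pointers."""
    store: dict[str, bytes] = {}        # block hash -> the single stored copy
    recipes: dict[str, list[str]] = {}  # file name -> ordered block hashes
    for name, data in files.items():
        pointers = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # only stored the first time it is seen
            pointers.append(digest)
        recipes[name] = pointers
    return store, recipes


def restore(name: str, store: dict[str, bytes], recipes: dict[str, list[str]]) -> bytes:
    """Reassemble a file by following its pointers back into the block store."""
    return b"".join(store[d] for d in recipes[name])


# Hypothetical data: 16 KiB of incompressible bytes standing in for shared OS files,
# followed by a few bytes that differ per machine.
os_image = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))
files = {f"laptop-{n}.img": os_image + f"user data {n}".encode() for n in range(3)}

store, recipes = dedupe(files)
logical = sum(len(d) for d in files.values())
stored = sum(len(c) for c in store.values())
compressed_per_file = sum(len(zlib.compress(d)) for d in files.values())

print(f"logical size        : {logical} bytes")
print(f"after deduplication : {stored} bytes")
print(f"zlib, file by file  : {compressed_per_file} bytes")
```

Compressing each file on its own cannot see that the three images share the same blocks, so it saves nothing here, while deduplication stores the shared blocks once. On internally repetitive data the picture would reverse.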

When deduplication is the right lever

Deduplication delivers the biggest savings when large volumes of similar data exist across many files or backup sets. Back up 20 laptops running the same Windows build and the same operating system files land in your repository 20 times, then again with every backup generation you retain. Microsoft reported savings ranging from about 30% on user documents to as much as 95% on virtualisation libraries when it built deduplication into Windows Server. Those ratios only appear when the underlying data is genuinely repetitive.

Virtual machine images are another strong candidate. A hypervisor running ten guest VMs built from the same base image stores most OS blocks once after deduplication, not ten times over. Backup vendors such as Veeam and Dell EMC position deduplication as the primary reduction technique for VM backup storage, with Data Domain systems reporting effective capacity gains of 10:1 to 30:1 across mixed workloads.

Source deduplication, where data is deduplicated before it travels across the network, also cuts bandwidth significantly. For a business sending daily backups to a cloud target over a standard broadband uplink, this can matter as much as the storage saving itself.

The workload where deduplication does worst is a large collection of unique files. A law firm with a document archive of individual client files will see modest ratios. A media business storing raw video files will see almost none.

When compression is the right lever

Compression works best where data is unique across files but internally repetitive: databases, log archives, CSV exports, and structured text. A SQL database holding customer records contains thousands of repeated field names, common values, and formatting patterns that compression encodes tightly within each stream. Typical ratios run 2:1 to 4:1 on business documents, lower on already-compressed formats, and negligible on encrypted data.

Primary databases and transactional applications benefit from compression specifically because the content is unique across records but the structure is highly repetitive. Microsoft SQL Server offers row-level and page-level compression tuned for exactly this kind of workload, and applying it does not carry the cross-file dependency that deduplication requires.

One category deserves a clear call-out: already-compressed and encrypted formats. Video files (MP4), JPEG images, and encrypted archives look essentially random at the byte level. Compression finds no repeated patterns to encode, and deduplication finds few matching blocks unless exact copies of the same file are stored more than once. Applying either to a media library or an encrypted data store burns CPU while returning near-zero savings. If your data is dominated by these formats, cheaper raw storage plus a sensible lifecycle policy often beats either technique.

What it costs to get this wrong

Budgets built on vendor demo ratios rather than measurements on your actual data are the most common failure mode. A 10:1 deduplication ratio sounds compelling. If your backup estate has high file uniqueness, you may see 2:1 in practice. Sizing your retention window around the optimistic number means hitting your storage ceiling early and facing either emergency spend or fewer backup snapshots than your operations actually need.
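The arithmetic is worth running before you sign anything. This is a deliberately simple model, assuming every retained backup shrinks by the same ratio; real deduplication ratios vary with retention length and change rate. The 10 TB capacity and 2 TB backup size are hypothetical figures, and `snapshots_that_fit` is an illustrative helper, not part of any product.

```python
import math


def snapshots_that_fit(capacity_tb: float, logical_backup_tb: float, reduction_ratio: float) -> int:
    """Number of retained backups that fit, assuming each shrinks by reduction_ratio."""
    return math.floor(capacity_tb * reduction_ratio / logical_backup_tb)


capacity_tb = 10        # hypothetical purchased capacity
logical_backup_tb = 2   # hypothetical size of one backup before reduction

promised = snapshots_that_fit(capacity_tb, logical_backup_tb, reduction_ratio=10)
measured = snapshots_that_fit(capacity_tb, logical_backup_tb, reduction_ratio=2)

print(f"at the demo ratio of 10:1    : {promised} backups fit")
print(f"at a measured ratio of 2:1   : {measured} backups fit")
```

In this example the same storage holds a fifth of the backups you planned for, which is the gap that turns into emergency spend.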

There is a less visible risk around data integrity. Deduplication relies on block hashing to identify duplicates. If the deduplication metadata becomes corrupted, large volumes of data can be affected by a single fault. In 2019, Code42 disclosed that a bug in its backup deduplication system caused certain Mac files to be omitted silently, a gap that only appeared when users attempted to restore.
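A small sketch shows why one fault can spread so far. It assumes a block store keyed by SHA-256 hash and a per-file list of pointers, which is one common design rather than any specific vendor's format. The file names and the `verify_store` and `affected_files` helpers are hypothetical.

```python
import hashlib


def sha(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def verify_store(store: dict[str, bytes]) -> list[str]:
    """Return the keys of blocks whose content no longer matches their hash."""
    return [digest for digest, chunk in store.items() if sha(chunk) != digest]


def affected_files(bad: list[str], recipes: dict[str, list[str]]) -> list[str]:
    """List every file whose pointers include at least one damaged block."""
    bad_set = set(bad)
    return sorted(name for name, pointers in recipes.items() if bad_set & set(pointers))


# Hypothetical repository: three images share one block, one file is unique.
shared, unique = b"shared OS block", b"unique notes"
store = {sha(shared): shared, sha(unique): unique}
recipes = {
    "a.img": [sha(shared)],
    "b.img": [sha(shared)],
    "c.img": [sha(shared), sha(unique)],
    "d.txt": [sha(unique)],
}

store[sha(shared)] = b"bit rot"  # simulate corruption of the single stored copy

damaged = verify_store(store)
print("damaged blocks :", len(damaged))
print("files affected :", affected_files(damaged, recipes))
```

One damaged block takes three files with it, which is why regular verification and restore tests matter more once deduplication is switched on.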

The UK GDPR adds a compliance dimension. Article 5 requires personal data to be kept no longer than necessary. Shrinking a backup is not the same as deleting what is in it, so reducing backup volume does not reduce your legal obligations. You still need processes to locate and, where permissible, erase personal data on request, even if that data sits inside a deduplicated or compressed backup set. The ICO’s guidance confirms that the right to erasure applies to backups, with limited exemptions.

Order of operations also matters when combining these techniques with encryption. The NCSC recommends encrypting data at rest, including cloud backups. Both deduplication and compression must run before encryption is applied, because encrypted data looks random to both algorithms. Applying them in the wrong sequence burns CPU while producing no meaningful storage reduction.
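The effect of sequence is easy to demonstrate. In the sketch below, `toy_encrypt` is a made-up XOR keystream built from SHA-256. It is NOT secure encryption and stands in for a real cipher only because it produces random-looking output using the standard library alone. The sample data and key are hypothetical.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, not secure.

    Applying it twice with the same key returns the original bytes.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        keystream = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], keystream))
    return bytes(out)


# Hypothetical repetitive export, the kind of data compression handles well.
data = b"customer_id,status,region\n" + b"1001,active,UK\n" * 4000
key = b"hypothetical-key"

compress_then_encrypt = toy_encrypt(zlib.compress(data), key)
encrypt_then_compress = zlib.compress(toy_encrypt(data, key))

print(f"original              : {len(data)} bytes")
print(f"compress then encrypt : {len(compress_then_encrypt)} bytes")
print(f"encrypt then compress : {len(encrypt_then_compress)} bytes")
```

Compressing first keeps the saving and still ends with protected data. Encrypting first leaves the compressor nothing to work with.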

What to ask before you decide

Before committing to a storage or backup platform’s deduplication or compression settings, six questions will separate a well-fitted choice from a vendor-pitched one. The questions apply whether you are buying a new backup tool, reviewing what you already have, or being asked by your IT supplier to upgrade a storage tier. Start with your data, not with the feature sheet.

Ask the vendor what efficiency ratios they can demonstrate on data similar to yours, based on measurement rather than a benchmark from their demo environment. Insisting on a pilot against a sample of your real backup data is reasonable, and reputable vendors will accommodate it.

Ask where deduplication runs: at the source before data leaves your machines, or at the target after it arrives at the backup repository. Source deduplication cuts network bandwidth. Target deduplication is simpler to manage but transmits more data.
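The bandwidth difference can be sketched as follows. This is a simplified model, assuming fixed 4 KiB blocks and ignoring the small cost of exchanging hashes over the network. The two "days" of backup data are invented, and the function names are hypothetical rather than any vendor's API.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size


def chunks_by_hash(data: bytes) -> dict[str, bytes]:
    """Split data into blocks and key each one by its SHA-256 hash."""
    blocks = (data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE))
    return {hashlib.sha256(b).hexdigest(): b for b in blocks}


def source_side_upload(data: bytes, repository: dict[str, bytes]) -> int:
    """Send only blocks the repository lacks; return bytes sent over the network."""
    sent = 0
    for digest, chunk in chunks_by_hash(data).items():
        if digest not in repository:  # in practice this check is a remote lookup
            repository[digest] = chunk
            sent += len(chunk)
    return sent


def target_side_upload(data: bytes, repository: dict[str, bytes]) -> int:
    """Send everything; the repository discards duplicates after arrival."""
    for digest, chunk in chunks_by_hash(data).items():
        repository.setdefault(digest, chunk)
    return len(data)


def block(seed: int) -> bytes:
    """Build a distinct 4 KiB test block from a seed."""
    return hashlib.sha256(str(seed).encode()).digest() * 128


# Hypothetical backups: Tuesday differs from Monday in one block out of ten.
monday = b"".join(block(i) for i in range(10))
tuesday = monday[:3 * CHUNK_SIZE] + block(99) + monday[4 * CHUNK_SIZE:]

source_repo: dict[str, bytes] = {}
target_repo: dict[str, bytes] = {}
source_sent = [source_side_upload(day, source_repo) for day in (monday, tuesday)]
target_sent = [target_side_upload(day, target_repo) for day in (monday, tuesday)]

print("source-side bytes sent per day :", source_sent)
print("target-side bytes sent per day :", target_sent)
```

Both repositories end up storing the same blocks. The difference is how much had to cross the network on the second day.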

Restore performance is the third question to pin down. Deduplicated and compressed backups must be reassembled on restore, a step sometimes called rehydration. For systems with tight recovery time objectives, get tested numbers on full restore times under realistic conditions, not a theoretical figure from a pre-sales demo.

Encryption fit matters equally. Confirm that deduplication and compression run before encryption is applied, that keys are held by you rather than solely by the vendor, and that the provider’s security posture is independently certified.

Check what portability looks like at exit. Many backup platforms store deduplicated data in proprietary formats, so migration means rehydrating the backup first: converting it back to full copies before you can move it. The FCA’s operational resilience guidance encourages documented exit plans from third-party services; the principle applies to any SME, not only regulated firms.

If stored logs or training data feed AI features in your business, the EU AI Act requires high-risk AI systems to use datasets subject to appropriate data governance. Deduplication or compression applied to those datasets must not introduce silent data loss that undermines traceability.

The short version: if you back up many similar systems or run workloads with heavily shared content, deduplication is the first lever to pull. If you run databases and log-heavy applications, compression works harder. If your data is predominantly video or already encrypted, spend the budget on cheaper raw storage and a sensible retention policy instead.

Sources

- TechTarget (2023). Compression vs deduplication: overview and use cases. Covers when each technique is most effective, CPU overhead trade-offs, and suitability by workload type. https://www.techtarget.com/searchdatabackup/tip/Compression-deduplication-and-encryption-Whats-the-difference
- Microsoft (2023). Windows Server data deduplication overview. Reports space savings of 30% to 95% on user documents and virtualisation libraries with Windows Server deduplication. https://learn.microsoft.com/en-us/windows-server/storage/data-deduplication/overview
- Datacore (2023). Inline vs post-process deduplication and compression. Explains inline and post-process dedupe modes, the 10:1 ratio example, and how change rates affect processing overhead. https://www.datacore.com/blog/inline-vs-post-process-deduplication-compression/
- MinIO (2023). Myths about deduplication and compression. Explains order-of-operations implications and why compressed or encrypted data resists both techniques. https://www.min.io/blog/myths-about-deduplication-and-compression
- UK Government (2016, retained in UK law). UK GDPR, Article 5: principles relating to processing of personal data. Sets out data minimisation and storage limitation requirements relevant to backup data retention. https://www.legislation.gov.uk/eur/2016/679/article/5
- ICO (2023). Storage limitation: guidance for organisations. Explains the ICO's position on data retention in backup systems and handling erasure requests for backed-up personal data. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-protection-principles/storage-limitation/
- NCSC (2023). Protecting data in the cloud. Sets out NCSC guidance on encryption at rest for cloud-stored data, including the recommendation to apply encryption after deduplication and compression. https://www.ncsc.gov.uk/guidance/protecting-data-in-the-cloud
- FCA (2016). FG16/5: Guidance for firms outsourcing to the cloud and other third-party IT services. Recommends documented and tested exit plans from third-party storage and backup services. https://www.fca.org.uk/publication/finalised-guidance/fg16-5.pdf
- European Parliament (2024). EU AI Act (Regulation 2024/1689). Article 10 requires high-risk AI systems to use training and testing datasets subject to appropriate data governance. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- Microsoft (2023). SQL Server data compression overview. Covers row and page compression for transactional workloads where compression outperforms deduplication. https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/data-compression?view=sql-server-ver16

Frequently asked questions

What is the difference between deduplication and compression?

Compression shrinks individual files by encoding repeated patterns within a single data stream. It is lossless, so files can be restored exactly. Deduplication works across your whole storage estate, identifying identical blocks in multiple files and keeping one copy while replacing the rest with pointers. Compression operates within one file; deduplication requires a pool of stored data to compare against.

When should I use deduplication over compression for my backups?

Deduplication is the better option when you back up many similar systems, run virtual machine images from a shared base, or hold large archives where identical files appear repeatedly. Microsoft reported savings of 30% to 95% on user documents and virtualisation libraries with Windows Server deduplication. Compression works better for databases and structured text where files are unique but internally repetitive.

Do deduplication and compression work on encrypted or already-compressed files?

Neither delivers meaningful savings on encrypted data or already-compressed formats such as MP4, JPEG, or ZIP files. Both algorithms look for patterns in the data; encrypted and compressed files appear essentially random, so the algorithms find almost nothing to reduce. Apply deduplication and compression before encrypting, and consider simpler storage lifecycle policies for media-heavy archives.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.
