Deduplication is the process of removing duplicate files, or even duplicate blocks of data within files, to use storage space more efficiently. Originally, the processing it required confined deduplication to backup systems. However, flash is now fast enough that deduplication can be used in main storage architectures, and in the process, the effective capacity of SSD-based systems has increased, making them more competitive with HDD-based systems.

Getting Some Background

Deduplication as a concept was originally inspired by email. When someone sent an email with a file attached to hundreds of people, the email server would be swamped with copies of the same attachment. To avoid this, systems would instead send out small emails with a link to the attached file, so that the file would only be downloaded when a recipient opened the message and double-clicked on the attachment.

Initial versions of deduplicated backups also worked at the file level. As files were backed up, the system looked for duplicate files, and any file that had already been backed up would not be backed up again. Steps were taken to ensure that the file was actually identical to the one already backed up, not just a file with the same name and size.
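The file-level approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular backup product works: the class name FileLevelBackup, its methods, and the choice of SHA-256 as the content fingerprint are all assumptions made for the example. The point it shows is that identity is decided by file content, not by name or size.

```python
import hashlib


class FileLevelBackup:
    """Hypothetical file-level deduplicating backup store (illustration only)."""

    def __init__(self):
        self.store = {}    # content digest -> file content, stored once
        self.catalog = {}  # backed-up path -> content digest

    def backup(self, path: str, data: bytes) -> bool:
        """Back up one file. Returns True only if new content had to be stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.store
        if is_new:
            self.store[digest] = data
        elif self.store[digest] != data:
            # Byte-for-byte check: a matching digest alone is not trusted.
            raise ValueError("digest collision for " + path)
        self.catalog[path] = digest
        return is_new

    def restore(self, path: str) -> bytes:
        """Return the content that was backed up under this path."""
        return self.store[self.catalog[path]]

    def stored_bytes(self) -> int:
        """Space actually consumed by unique file contents."""
        return sum(len(content) for content in self.store.values())


backup = FileLevelBackup()
backup.backup("alice/doc1.docx", b"quarterly report")
backup.backup("bob/doc1.docx", b"quarterly report")    # duplicate content, not stored again
backup.backup("carol/doc1.docx", b"quarterly REPORT")  # same name and size, different content
```

Carol's file has the same name and size as the others, but because its content differs it is stored as a separate file.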

An example that might come to mind here is a file such as doc1.docx that is replicated in multiple users’ home directories. Deduplication reaches much higher levels of efficiency, however, when the full operating system and application directories of each PC are backed up. Many, or even most, of the files on every PC running the same version of Windows, amounting to dozens or hundreds of megabytes, will be identical from one system to the next. Because most of the files are identical, a full backup of 10 PCs will use only a little more space than a full backup of one.

Updating Hardware for In-Line Deduplication

Because of the time required to search through data already stored on the backup tapes or hard drives, the original deduplication systems used post-processing. This process backed up data to a landing zone, removed duplicates, and then rewrote the backup without the files that had already been backed up.

As systems became more efficient and processors became more powerful, in-line processing became possible, reducing the storage required for a landing zone. However, taking deduplication down to the block or even sub-block level, so that small parts within files could be deduplicated, increased the amount of processing again.
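To make the block-level idea concrete, here is a rough Python sketch. It assumes fixed-size 4,096-byte blocks and SHA-256 fingerprints, and the BlockStore class and its methods are hypothetical names chosen for the example; real systems may use different block sizes, variable-size chunking, or other fingerprinting schemes. Each file is stored as a list of block fingerprints, and a block that has been seen before is not stored again.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Hypothetical in-line, block-level deduplicating store (illustration only)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}  # block digest -> block content

    def write(self, data: bytes) -> List[str]:
        """Split data into blocks, store only unseen blocks, return the file's recipe."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if the block already exists
            recipe.append(digest)
        return recipe

    def read(self, recipe: List[str]) -> bytes:
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[digest] for digest in recipe)

    def stored_bytes(self) -> int:
        """Space actually consumed by unique blocks."""
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()
version_1 = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
version_2 = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE  # one block changed
recipe_1 = store.write(version_1)
recipe_2 = store.write(version_2)
```

File-level deduplication would store both versions in full because the files differ. Here only the one changed block is added, at the cost of hashing every block as it is written, which is the extra processing the paragraph above refers to.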

The next big step in deduplication came with SSDs. Because SSDs are so much faster than hard drives or tapes, in-line deduplication became possible not only for backups but for online data as well. This allows extra copies to be removed in real time, yielding the same kinds of savings in storage capacity for front-line systems. Even Windows Server 2016 now includes deduplication features.

Increasing Efficiencies

Some types of data can yield spectacular savings. As with the example above, operating system and application directories are largely the same from one system to the next. Virtual desktop infrastructure systems and server virtualization systems can therefore hold very large numbers of virtual systems that use little more space than a single system would.

On the other hand, data that has very little in common from one dataset to the next, such as compressed graphics files, or databases with encryption of fields throughout the database, will see relatively little benefit from deduplication.
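The contrast between these two kinds of data can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the 4,096-byte block size, the helper names, and the synthetic data, in which ten virtual machine images share a common base plus one unique block each, while pseudo-random bytes stand in for compressed or encrypted content. The ratio computed is simply the total number of blocks divided by the number of unique blocks.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size for this sketch


def pseudo_random_bytes(length: int, seed: int) -> bytes:
    """Deterministic stand-in for compressed or encrypted (high-entropy) data."""
    return random.Random(seed).getrandbits(8 * length).to_bytes(length, "big")


def dedup_ratio(data: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Total blocks divided by unique blocks; 1.0 means nothing was saved."""
    digests = [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]
    if not digests:
        return 1.0
    return len(digests) / len(set(digests))


# Ten synthetic VM images: the same 16-block base plus one unique block each.
base_image = pseudo_random_bytes(16 * BLOCK_SIZE, seed=0)
vm_images = b"".join(
    base_image + pseudo_random_bytes(BLOCK_SIZE, seed=100 + i) for i in range(10)
)

# The same amount of data with no repeated blocks at all.
high_entropy = pseudo_random_bytes(len(vm_images), seed=1)

print("VM images:   ", round(dedup_ratio(vm_images), 2))
print("High entropy:", round(dedup_ratio(high_entropy), 2))
```

The cloned images reduce to the shared base plus the ten unique blocks, while the high-entropy data has no repeated blocks for deduplication to remove.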

Still, in an age of rapid data generation, deduplication gives organizations a real-time way to keep databases clean and easy to use, and to increase accessibility across the entire business.
