SIS Versus Block-Based Deduplication: The Forgotten Deduplication Debate
As the deduplication debates rage on, it is easy to forget that they are fought on many fronts. The one that tends to receive the most attention is whether "inline" or "post-processing" deduplication is better for disk-based backup appliances. However, in the "Which deduplication is best?" debate, an angle that tends to get overlooked is file-based deduplication, such as that used by the CommVault® Simpana® software suite, versus block-based deduplication.
Known as single instance store (SIS), file-based deduplication can also reduce the massive amounts of data found in today's storage environments in much the same way that block-based deduplication does. However, there are subtle but important differences in how these two deduplication methods work that will influence a company's decision as to which is best for it.
Block-based deduplication generally provides the greatest levels of data reduction. If reducing backup data stores is a company's sole objective, block-based deduplication generally does this better than SIS. Block-based deduplication analyzes the data in backup streams, breaks these streams up into smaller chunks and then compares each chunk with the chunks it has already stored to determine whether they match. If the algorithm encounters a chunk that is the same as an existing chunk, it creates an index entry and pointer to the existing chunk and discards the duplicate.
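The chunk-and-compare process described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB chunk size, the SHA-256 fingerprint, the in-memory dictionary standing in for the chunk store, and the function names `dedupe_stream` and `restore_stream` are all assumptions made for the example. Commercial products commonly use variable-size chunking and on-disk indexes.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def dedupe_stream(stream: bytes, store: Dict[str, bytes],
                  chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split a backup stream into chunks and keep one copy of each unique chunk.

    Returns the ordered list of chunk fingerprints (the "pointers") needed
    to rebuild the stream later.
    """
    pointers = []
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:
            # First time this chunk is seen: store it.
            store[fingerprint] = chunk
        # Duplicate or not, record only a pointer in the stream's index.
        pointers.append(fingerprint)
    return pointers


def restore_stream(pointers: List[str], store: Dict[str, bytes]) -> bytes:
    """Reassemble a stream by following its pointers back to stored chunks."""
    return b"".join(store[p] for p in pointers)
```

Backing up a stream made of two identical 4 KB chunks followed by one different chunk would leave two chunks in the store and three pointers in the index, which is where the space saving comes from.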
The trouble with using block-based deduplication for data reduction is that it ends up functioning as a band-aid or quick fix for the company's backup problems and may not address all of the concerns that a company has around managing archived and backup data. Deduplication is typically introduced as part of a backup appliance and presented to existing archiving and backup software as either a file server or a virtual tape library (VTL). Though it enables a company to archive or back up data to disk and reduces its storage requirements and backup times, block-based deduplication can end up creating a data silo on vendor-specific storage devices.
CommVault's SIS provides a way to complement, or even replace, the block-based deduplication solutions found on backup appliances. CommVault incorporates SIS across its Simpana software suite and, more specifically, in its data protection software module.
SIS does not break a file into chunks and store those chunks separately as block-based deduplication does. Instead, during backups, CommVault's backup software module uses SIS to compare files against previously backed-up files. When it encounters a duplicate file, it indexes the file and creates pointers within the index back to the original, but it does not back up the duplicate again. In this way, SIS works in a manner similar to block-based deduplication in that it stores only one unique copy of each file, which puts SIS on par with block-based deduplication in terms of backup speeds.
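File-level single instancing can be sketched the same way. The example below is a generic illustration and makes no claim about how Simpana implements SIS internally: the whole-file SHA-256 comparison, the in-memory `store` and `index` dictionaries, and the function name `backup_files` are assumptions made for the example.

```python
import hashlib
from typing import Dict


def backup_files(files: Dict[str, bytes], store: Dict[str, bytes],
                 index: Dict[str, str]) -> int:
    """Back up files, storing each unique file once.

    `files` maps a path to its content. `store` holds one whole copy of
    each unique file, keyed by fingerprint. `index` maps every path to the
    fingerprint of the stored copy. Returns the bytes actually written.
    """
    bytes_written = 0
    for path, content in files.items():
        fingerprint = hashlib.sha256(content).hexdigest()
        if fingerprint not in store:
            # New file: store it whole, in one piece.
            store[fingerprint] = content
            bytes_written += len(content)
        # Duplicate or not, the index points this path at the stored copy.
        index[path] = fingerprint
    return bytes_written
```

Note that in this sketch a file that differs from an earlier version by a single byte gets a new fingerprint and is stored again in full, which is one reason file-level deduplication usually reduces data less than block-based deduplication.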
The advantages that SIS offers show up most clearly when it comes time to perform restores. A key problem with block-based deduplication is that it creates fragmented data sets as it deduplicates the data. Storing single chunks of data drives up compression ratios, but companies take a performance and time hit when they go to recover this chunked-up data.
Depending on the amount of data they need to recover, it could take tens or even hundreds of hours to reassemble this data, which eliminates one of the primary benefits of using disk: faster restores. SIS avoids this typical downside of block-based deduplication because it stores all of the blocks of a file together, so there is nothing to reassemble. So while SIS may only deliver 80% of the reduction benefits during backup that block-based deduplication provides, SIS will continue to deliver the full benefits that one expects from disk during recoveries.
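A rough way to see where the reassembly cost comes from is to count how many separately stored pieces a restore has to locate. The sketch below is back-of-the-envelope arithmetic only: the function name `chunks_to_reassemble` and the 8 KB chunk size in the usage note are assumptions, and real restore times also depend on disk layout, caching and the index design.

```python
import math


def chunks_to_reassemble(file_size: int, chunk_size: int) -> int:
    """Number of stored chunks a block-based restore must look up for one file."""
    return max(1, math.ceil(file_size / chunk_size))


# With file-level SIS, the same file is a single stored object.
SIS_OBJECTS_PER_FILE = 1
```

Under these assumptions, restoring a 1 GiB file from 8 KB chunks means following 131,072 pointers, each of which may land on a different part of the disk, while the SIS copy is read as one object.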
Including SIS in backup software also frees companies to use any vendor's disk, whether it is raw disk, a VTL or a network filer. This option opens up new possibilities for tiered storage management that deduplication cannot offer on its own. For instance, companies can store data on multiple tiers of disk, from high-performance Fibre Channel or SAS drives to slower Serial ATA disk drives.
Companies have a lot of choices when it comes to deduplication, and with all of the hoopla around "inline" versus "post-processing", it is easy to let that debate take your eye off the ball with regard to the more fundamental questions that companies should be asking. Block-based deduplication found on backup appliances addresses immediate corporate concerns about minimizing data stores on disk and shortening backup windows. However, when it comes to expediting recoveries while using any vendor's disk, CommVault SIS gives companies a compelling reason to look at it as an alternative to using only block-based deduplication.