Reverse Deduplication: Optimizing for Fast Restore

Data deduplication has become an important part of the data storage industry, with most major companies providing products in the space. As additional data is added to a deduplicated storage system, the number of shared data chunks increases. This fragments the stored data, which in turn increases seek operations and decreases performance. The challenge for all companies is to provide high performance both during data ingest and during data retrieval. In many cases, the primary use of deduplicating storage systems is to provide an alternative to tape-based backup. For these systems, ingest performance is important, and the most common retrieval case is restoring the most recent backup. But because existing deduplication algorithms store each new backup as references to chunks written by earlier backups, the most recent backup is also the most fragmented, and restoring it is slow. We propose to address this issue by changing the way deduplication is done. We developed algorithms that eliminate much of the fragmentation in the most common case, restoring from the most recent backup.
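The fragmentation effect described above can be illustrated with a toy model. The sketch below is not the paper's algorithm: it assumes an append-only chunk store, whole-chunk matching by identifier, and counts a "seek" whenever consecutive chunks of a backup are not stored in adjacent slots. The function names (count_seeks, forward_dedup, newest_contiguous) and the sample backups are hypothetical. The second layout is only one possible reading of "reverse" deduplication, included for contrast; the abstract does not specify the actual method.

```python
def count_seeks(locations):
    """Count non-contiguous jumps when reading slots in the given order."""
    seeks = 1 if locations else 0  # the initial positioning counts as one seek
    for prev, cur in zip(locations, locations[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks


def forward_dedup(backups):
    """Conventional layout (assumed): a chunk is stored when first seen,
    and later backups reference the existing copy."""
    store = {}  # chunk id -> slot in the append-only store
    layouts = []
    for backup in backups:
        slots = []
        for chunk in backup:
            if chunk not in store:
                store[chunk] = len(store)
            slots.append(store[chunk])
        layouts.append(slots)
    return layouts


def newest_contiguous(backups):
    """Hypothetical contrast: the newest backup is laid out contiguously
    and older backups reference its chunks."""
    store = {}
    for backup in reversed(backups):
        for chunk in backup:
            if chunk not in store:
                store[chunk] = len(store)
    return [[store[chunk] for chunk in backup] for backup in backups]


# Three successive backups of the same data, each changing one chunk.
backups = [
    ["a", "b", "c", "d", "e", "f"],
    ["a", "X", "c", "d", "e", "f"],
    ["a", "X", "c", "Y", "e", "f"],
]

print([count_seeks(s) for s in forward_dedup(backups)])      # [1, 3, 5]
print([count_seeks(s) for s in newest_contiguous(backups)])  # [5, 3, 1]
```

In this toy model the conventional layout makes the newest backup the most expensive to read, while the contrasting layout shifts that cost onto the oldest backup, which is restored least often.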