Deduplication for large-scale backup and archival storage

This dissertation provides scalable solutions to problems unique to chunk-based deduplication, which backup and archival storage systems use to reduce storage space requirements. We show how to conduct similarity-based searches over large repositories and how to scale out these searches as the repository grows; how to deduplicate low-locality file-based workloads and how to scale out deduplication through parallelization and through data and index organization; how to build a unified deduplication solution that adapts to both tape-based and file-based workloads; and how to introduce strategic redundancies in deduplicated data to improve the overall robustness of the system.

Given an object, our scalable similarity-based search solution finds highly similar objects in a large store by examining only a small subset of the object's features. We show how to partition the feature index to scale out the search, and how to select a small subset of the partitions (less than 3%), independent of object size and based on the content of the query object alone, for conducting distributed similarity-based searches.

We show how to deduplicate low-locality file-based workloads using Extreme Binning. Extreme Binning uses file similarity to find duplicates accurately and makes only one disk access per file for chunk lookup, which yields reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the data size. Each backup node is autonomous: there is no dependency between nodes, which makes housekeeping tasks robust and low-overhead.

We build a "unified deduplication" solution that can adapt to and deduplicate a variety of workloads. Some workloads consist of large byte streams with high locality; others are made up of files of varying sizes with no locality between them. Separate deduplication solutions exist for each kind of workload, but so far there has been no unified solution that works well for all of them. Our unified deduplication solution simplifies administration, because organizations do not have to deploy a dedicated solution for each kind of workload, and it yields better storage space savings than dedicated solutions because it deduplicates across workloads.

Deduplication reduces storage space requirements by allowing common chunks to be shared between similar objects. This sharing reduces the reliability of the storage system, because the loss of a few shared chunks can lead to the loss of many objects. We show how to eliminate this problem by choosing for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. Experiments show that this technique can achieve significantly higher robustness than a conventional approach combining data mirroring and Lempel-Ziv compression, while requiring about half the storage space.
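
The per-chunk replication idea can be illustrated with a minimal sketch. The abstract states only that the replication level is a function of the amount of data that would be lost with the chunk; the logarithmic policy below, the function names `data_at_risk` and `replication_level`, the parameters `unit` and `max_copies`, and the dictionary layout for objects are all assumptions made for illustration, not the method used in the dissertation.

```python
from collections import defaultdict


def data_at_risk(objects):
    """Map each chunk ID to the total size of all objects that reference it.

    `objects` maps an object name to {"size": bytes, "chunks": [chunk IDs]}.
    Losing a chunk is assumed to make every referencing object unrecoverable.
    """
    at_risk = defaultdict(int)
    for obj in objects.values():
        for chunk_id in set(obj["chunks"]):  # count each object once per chunk
            at_risk[chunk_id] += obj["size"]
    return dict(at_risk)


def replication_level(lost_bytes, unit=1024, max_copies=4):
    """Hypothetical policy: number of copies grows with log2 of data at risk.

    Up to 2*unit - 1 bytes at risk gives one copy; each further doubling
    adds one copy, capped at max_copies.
    """
    doublings = (lost_bytes // unit).bit_length()
    return min(max_copies, max(1, doublings))


def replication_plan(objects, unit=1024, max_copies=4):
    """Assign a replication level to every chunk in the store."""
    return {
        chunk_id: replication_level(lost, unit, max_copies)
        for chunk_id, lost in data_at_risk(objects).items()
    }
```

With this sketch, a chunk shared by many large objects receives more copies than a chunk referenced by a single small file, so the extra storage is spent where a loss would be most damaging.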