Improved Deduplication through Parallel Binning

Many modern storage systems use deduplication to compress data by avoiding storing the same data twice. Deduplication must compare incoming data against data stored in the past, but consulting information about all stored data can cause a severe bottleneck. Similarity-based deduplication accesses information only on past data that is likely to be similar to the incoming data and is therefore more likely to yield good deduplication. We present an adaptive deduplication strategy that extends Extreme Binning by accessing additional bins, and we investigate both theoretically and experimentally the effects of these additional bin accesses.
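The trade-off between extra bin accesses and deduplication quality can be illustrated with a small sketch. It is not the strategy of the paper: it assumes fixed-size chunking, assumes that the additional bins are selected by the k smallest chunk IDs of a file, and assumes that a file's chunks are recorded only in the bin of its smallest chunk ID. The names `BinStore`, `chunk_ids`, and the parameter `k` are hypothetical.

```python
import hashlib


def chunk_ids(data: bytes, size: int = 4) -> list[str]:
    """Split data into fixed-size chunks (a simplification) and hash each one."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


class BinStore:
    """Toy similarity-based store: one bin per representative chunk ID."""

    def __init__(self, k: int = 1):
        self.k = k                # bins consulted per file (k = 1: single bin)
        self.bins = {}            # representative chunk ID -> set of chunk IDs
        self.bin_accesses = 0     # counts bin lookups, the cost being traded

    def store(self, data: bytes) -> int:
        """Store a file and return how many distinct chunks had to be written."""
        ids = set(chunk_ids(data))
        if not ids:
            return 0
        # Assumption: the k smallest chunk IDs choose the bins to consult.
        reps = sorted(ids)[:self.k]
        known = set()
        for rep in reps:
            self.bin_accesses += 1
            known |= self.bins.get(rep, set())
        new_chunks = ids - known
        # Assumption: the file is recorded only in the bin of its smallest ID.
        self.bins.setdefault(reps[0], set()).update(ids)
        return len(new_chunks)


if __name__ == "__main__":
    first = b"aaaabbbbccccdddd"
    second = b"aaaabbbbcccceeee"   # shares three of four chunks with `first`
    for k in (1, 3):
        store = BinStore(k=k)
        written = [store.store(f) for f in (first, second, first)]
        print(f"k={k}: chunks written {written}, bin accesses {store.bin_accesses}")
```

Because the bin contents in this sketch do not depend on `k`, consulting more bins can only reduce the number of chunks written, at the price of proportionally more bin accesses.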
