SBBS: A sliding blocking algorithm with backtracking sub-blocks for duplicate data detection

With the explosive growth of data, storage systems face heavy storage pressure from the large volume of redundant data produced by duplicate copies or duplicate regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating redundant copies and storing only unique data. Data deduplication rests on duplicate data detection techniques, which divide files into parts, compare the corresponding parts of different files by their hash values, and identify the redundant data. This paper proposes SBBS, an efficient sliding blocking algorithm with backtracking sub-blocks for duplicate data detection. SBBS improves the duplicate detection precision of the traditional sliding blocking (SB) algorithm by backtracking the left/right 1/4 and 1/2 sub-blocks in segments where matching failed. Experimental results show that, on average, SBBS improves duplicate detection precision by 6.5% compared with the traditional SB algorithm and by 16.5% compared with the content-defined chunking (CDC) algorithm, and that it adds little extra storage overhead when files are divided into equal chunks of 8 KB.
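As a rough illustration of the traditional sliding blocking (SB) approach that SBBS builds on, the following Python sketch hashes the fixed-size blocks of a reference file and then slides a window over a target file, jumping a whole block on a match and one byte on a miss. The function names, the use of SHA-1, and the byte-by-byte slide are assumptions made for illustration only. The backtracking of 1/4 and 1/2 sub-blocks is not reproduced, because the abstract does not specify its details.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> set:
    """Hash every aligned, full-size block of the reference data."""
    return {
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data) - block_size + 1, block_size)
    }


def sliding_block_duplicates(reference: bytes, target: bytes, block_size: int) -> int:
    """Return the number of target bytes detected as duplicates of reference blocks.

    Simplified SB sketch (assumption): on a hash match the window jumps by one
    block; on a miss it slides forward by a single byte.
    """
    known = block_hashes(reference, block_size)
    pos = 0
    duplicate_bytes = 0
    while pos + block_size <= len(target):
        window = target[pos:pos + block_size]
        if hashlib.sha1(window).digest() in known:
            duplicate_bytes += block_size  # whole block is redundant
            pos += block_size
        else:
            pos += 1  # no match: slide the window by one byte
    return duplicate_bytes


if __name__ == "__main__":
    reference = bytes(range(64))
    # Three inserted bytes shift every block boundary in the target.
    target = b"\xff\xff\xff" + reference
    print(sliding_block_duplicates(reference, target, block_size=8))
```

Because the window slides byte by byte, a small insertion at the start of the target does not prevent the later blocks from being matched, which is the case where comparing only aligned fixed-size blocks would miss them. A production implementation would normally use a rolling checksum to avoid rehashing the full window at every offset.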
