A Fast Dual-level Fingerprinting Scheme for Data Deduplication

Data deduplication has recently attracted considerable interest in the research community. Several approaches have been proposed that eliminate duplicate data first at the file level and then at the chunk level to reduce duplicate-lookup complexity. To meet high-throughput requirements, this paper proposes a fast dual-level fingerprinting (FDF) scheme that fingerprints a dataset at both the file level and the chunk level in a single scan of the contents. FDF breaks the fingerprinting process into task segments and leverages the computing resources of modern multi-core CPUs to pipeline the time-consuming operations. The proposed FDF scheme has been evaluated in an experimental data-backup network with real-world datasets and compared against an alternative two-stage approach. Experimental results show that, when fully pipelined, FDF sustains a fingerprinting throughput of over 100 MB/s, matching the bandwidth of a gigabit network adapter.
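
To make the single-scan idea concrete, the sketch below computes chunk-level and file-level fingerprints in one pass over the data. It is a minimal illustration, not the paper's implementation: SHA-1 as the fingerprint function, a Rabin-Karp-style rolling hash for content-defined chunk boundaries, and the window/size parameters are all assumptions. In FDF the chunking and hashing stages would additionally be pipelined across CPU cores; here they run inline for clarity.

```python
import hashlib
import os

# Assumed parameters (not from the paper): rolling-hash window,
# boundary mask (~8 KiB average chunks), and chunk size limits.
WINDOW = 48
MASK = (1 << 13) - 1
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
B, M = 257, (1 << 31) - 1      # rolling-hash base and modulus
BW = pow(B, WINDOW, M)         # B^WINDOW mod M, to drop the outgoing byte

def dual_level_fingerprints(data: bytes):
    """One pass over `data`: returns (file_fingerprint, chunk_fingerprints).

    The file-level digest consumes the same bytes the chunker emits,
    so the contents are scanned only once.
    """
    file_hash = hashlib.sha1()
    chunk_fps = []
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        # Rabin-Karp rolling hash over a sliding WINDOW-byte window.
        rolling = (rolling * B + byte) % M
        if i >= WINDOW:
            rolling = (rolling - data[i - WINDOW] * BW) % M
        length = i - start + 1
        boundary = length >= MIN_CHUNK and (rolling & MASK) == MASK
        if boundary or length >= MAX_CHUNK or i == len(data) - 1:
            chunk = data[start:i + 1]
            chunk_fps.append(hashlib.sha1(chunk).hexdigest())  # chunk level
            file_hash.update(chunk)                            # file level
            start = i + 1
    return file_hash.hexdigest(), chunk_fps

if __name__ == "__main__":
    payload = os.urandom(1 << 20)  # 1 MiB of synthetic test data
    file_fp, chunk_fps = dual_level_fingerprints(payload)
    print(file_fp, len(chunk_fps))
```

Because the per-chunk hashing dominates the cost, a pipelined variant would hand each completed chunk to a worker core while the scan continues, which is the kind of task segmentation the paper exploits to sustain gigabit-rate throughput.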
