Research on a Clustering Data De-Duplication Mechanism Based on Bloom Filter

Data de-duplication, an emerging technology, has recently received broad attention from both academia and industry. Some research focuses on eliminating more redundant data, while other work investigates how to perform de-duplication at high speed. In this paper, we aim to reduce both the time and space requirements of data de-duplication. We describe a clustering architecture in which multiple nodes perform chunk-level data de-duplication in parallel, improving performance noticeably. We also propose a new technique called "Fingerprint Summary": each node keeps a compact in-memory summary of the chunk fingerprints of every other node. When checking for duplicate chunks, a node first queries its local chunk hash database and then, if necessary, the Fingerprint Summary to eliminate inter-node redundant chunks, substantially reducing the overall storage capacity requirement.
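
The paper does not include code, but the Fingerprint Summary can be modeled as a per-node Bloom filter over chunk fingerprints. The sketch below is a minimal illustration under that assumption; the class names (BloomFilter, DedupNode), the parameters (num_bits, num_hashes), and the dict-based local chunk index are hypothetical, not the authors' implementation.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership test with a tunable false-positive rate."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions by salting the fingerprint with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


class DedupNode:
    """One cluster node: a local fingerprint index plus compact
    Fingerprint Summaries (Bloom filters) of every other node."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.local_index = {}         # fingerprint -> chunk location (illustrative)
        self.summary = BloomFilter()  # summary of this node's own fingerprints
        self.peer_summaries = {}      # peer node_id -> that peer's BloomFilter

    def store_chunk(self, chunk: bytes) -> None:
        fp = hashlib.sha1(chunk).digest()  # chunk fingerprint
        if fp in self.local_index:
            return                          # intra-node duplicate: skip
        for peer_id, summary in self.peer_summaries.items():
            if summary.may_contain(fp):
                # Probable inter-node duplicate; a real system would confirm
                # with the owning peer before discarding the chunk.
                return
        self.local_index[fp] = len(self.local_index)  # placeholder "location"
        self.summary.add(fp)
```

Because a Bloom filter hit is only probabilistic, a deployed system would verify a match against the owning node's chunk hash database before dropping the chunk; the memory cost and false-positive rate trade off through the choice of num_bits and num_hashes.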
