An Efficient Data Fingerprint Query Algorithm Based on Two-Leveled Bloom Filter

The function of the comparing fingerprints algorithm was to judge whether a new partitioned data chunk was in a storage system a decade ago. At present, in the most de-duplication backup system the fingerprints of the big data chunks are huge and cannot be stored in the memory completely. The performance of the system is unavoidably retarded by data chunks accessing the storage system at the querying stage. Accordingly, a new query mechanism namely Two-stage Bloom Filter (TBF) mechanism is proposed. Firstly, as a representation of the entirety for the first grade bloom filter, each bit of the second grade bloom filter in the TBF represents the chunks having the identical fingerprints reducing the rate of false positives. Secondly, a two-dimensional list is built corresponding to the two grade bloom filter for the absolute addresses of the data chunks with the identical fingerprints. Finally, a new hash function class with the strong global random characteristic is set up according to the data fingerprints’ random characteristics. To reduce the comparing data greatly, TBF decreases the number of accessing disks, improves the speed of detecting the redundant data chunks, and reduces the rate of false positives which helps the improvement of the overall performance of system.

[1]  Michael Mitzenmacher,et al.  Compressed bloom filters , 2001, PODC '01.

[2]  Cezary Dubnicki,et al.  Bimodal Content Defined Chunking for Backup Streams , 2010, FAST.

[3]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[4]  Jie Wu,et al.  The Dynamic Bloom Filters , 2010, IEEE Transactions on Knowledge and Data Engineering.

[5]  Suresh Jagannathan,et al.  Improving duplicate elimination in storage systems , 2006, TOS.

[6]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[7]  Mark Lillibridge,et al.  Extreme Binning: Scalable, parallel deduplication for chunk-based file backup , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[8]  Ankur Narang,et al.  High throughput data redundancy removal algorithm with scalable performance , 2011, HiPEAC.

[9]  Larry Carter,et al.  Universal Classes of Hash Functions , 1979, J. Comput. Syst. Sci..

[10]  Mark Lillibridge,et al.  Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality , 2009, FAST.

[11]  Kai Li,et al.  Avoiding the Disk Bottleneck in the Data Domain Deduplication File System , 2008, FAST.

[12]  Andrei Broder,et al.  Network Applications of Bloom Filters: A Survey , 2004, Internet Math..

[13]  Jie Wu,et al.  Theory and Network Applications of Dynamic Bloom Filters , 2006, Proceedings IEEE INFOCOM 2006. 25TH IEEE International Conference on Computer Communications.

[14]  Li Fan,et al.  Summary cache: a scalable wide-area web cache sharing protocol , 2000, TNET.

[15]  Michael Dahlin,et al.  TAPER: tiered approach for eliminating redundancy in replica synchronization , 2005, FAST'05.

[16]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[17]  Michael A. Bender,et al.  Don't Thrash: How to Cache Your Hash on Flash , 2011, Proc. VLDB Endow..

[18]  Ni Lar Thein,et al.  An Efficient Indexing Mechanism for Data Deduplication , 2009, 2009 International Conference on the Current Trends in Information Technology (CTIT).

[19]  Sean Quinlan,et al.  Venti: A New Approach to Archival Storage , 2002, FAST.

[20]  David Hung-Chang Du,et al.  BloomFlash: Bloom Filter on Flash-Based Storage , 2011, 2011 31st International Conference on Distributed Computing Systems.

[21]  David Hung-Chang Du,et al.  Frequency Based Chunking for Data De-Duplication , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.