A Progressive Method for Detecting Duplication Entities Based on Bloom Filters

As the volume of data grows rapidly, the cost of detecting duplicate entities during data cleaning has increased significantly. However, some real-time applications only need to identify as many duplicate entities as possible within a limited time, rather than all of them. Existing work uses sorting to divide similar records into blocks and then arranges the processing order of the blocks so that duplicate entities are detected progressively. This approach works well only when the attributes of the records are suitable for sorting. This paper therefore proposes a novel progressive deduplication method for records that cannot be sorted by their attributes. The method distributes records into blocks based on their features and generates a modified Bloom filter index for each block. It then uses the Bloom filter to predict the probability that a block contains duplicate entities, and this probability determines the processing order of the blocks so that duplicates are detected more quickly. Comprehensive experiments show that, within a limited time, the algorithm detects far more duplicates than the other algorithms from related work.
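The pipeline outlined in the abstract (block the records, index each block with a Bloom filter, rank blocks by predicted duplicate likelihood) can be illustrated with a minimal Python sketch. The abstract does not specify the blocking features, how the Bloom filter is modified, or how the probability is computed, so everything here is an assumption made for illustration: `block_key` (first character of the normalized record), `features` (lowercased whitespace tokens), and the score (the fraction of records in a block whose features all appeared to be present already when inserted) are hypothetical stand-ins, not the paper's algorithm.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """A plain Bloom filter (not the paper's modified variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Insert item; return True if it already appeared to be present."""
        positions = list(self._positions(item))
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            self.bits[p] = 1
        return seen


def features(record):
    # Hypothetical feature extraction: lowercased whitespace tokens.
    return set(record.lower().split())


def block_key(record):
    # Hypothetical blocking rule: first character of the normalized record.
    return record.strip().lower()[:1]


def rank_blocks(records, num_bits=1024, num_hashes=3):
    """Return (block key, score) pairs, most duplicate-prone block first."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    scored = []
    for key, members in blocks.items():
        bloom = BloomFilter(num_bits, num_hashes)
        hits = 0
        for rec in members:
            # Insert every feature; the list forces all insertions to run.
            already_seen = [bloom.add(tok) for tok in features(rec)]
            if already_seen and all(already_seen):
                hits += 1  # every feature was (probably) seen before
        scored.append((key, hits / len(members)))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


records = [
    "john smith london", "John Smith London", "jane doe paris",
    "jim beam york", "alice wong rome", "alice wong rome", "alice rome wong",
]
for key, score in rank_blocks(records):
    print(key, round(score, 2))
```

A progressive detector would then run the expensive pairwise comparison on the blocks in this order and stop when the time budget is exhausted. Because Bloom filters admit false positives, the score is only an estimate of how many duplicates a block holds.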
