Efficient Duplicate Record Detection Based on Similarity Estimation

In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of the matching. However, this intuitive method has two shortcomings. In terms of efficiency, it needs to compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to determine whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
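
The matching-based similarity and the range-based decision described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token-level Jaccard measure `attr_sim`, the normalization by the larger attribute count, the brute-force search over matchings, and the names `record_similarity` and `classify` are all assumptions made for illustration. The abstract does not say how the O(1) range is obtained, so `classify` simply takes the lower and upper estimates as inputs.

```python
from itertools import permutations


def attr_sim(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (assumed attribute measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1, r2) -> float:
    """Weight of the optimal bipartite matching between two records' attribute
    values, divided by the larger attribute count (assumed normalization).

    Brute force over all matchings; only practical for a handful of attributes.
    """
    small, large = sorted((list(r1), list(r2)), key=len)
    if not large:
        return 1.0
    best = 0.0
    # Each permutation assigns every attribute of the smaller record
    # to a distinct attribute of the larger record.
    for perm in permutations(range(len(large)), len(small)):
        weight = sum(attr_sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, weight)
    return best / len(large)


def classify(lo: float, hi: float, threshold: float) -> str:
    """Decide from an estimated similarity range [lo, hi] (hypothetical helper)."""
    if lo >= threshold:
        return "duplicate"   # even the lowest possible similarity passes
    if hi < threshold:
        return "distinct"    # even the highest possible similarity fails
    return "undecided"       # the range straddles the threshold


# Two records with different schemas: attribute order and count differ.
r1 = ["John Smith", "12 Main St", "Boston"]
r2 = ["Boston", "Smith John", "12 Main St", "MA"]
similarity = record_similarity(r1, r2)
```

For records with many attributes, a polynomial-time assignment solver such as the Hungarian method would replace the brute-force loop. A pair that `classify` leaves undecided could then be resolved by computing the exact matching weight.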
