Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases

The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field’s domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.

[1]  H B NEWCOMBE,et al.  Automatic linkage of vital records. , 1959, Science.

[2]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[3]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[4]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[5]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[6]  Editors , 1986, Brain Research Bulletin.

[7]  Lawrence R. Rabiner,et al.  A tutorial on Hidden Markov Models , 1986 .

[8]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[9]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[10]  Lawrence B. Holder,et al.  Substructure Discovery Using Minimum Description Length and Background Knowledge , 1993, J. Artif. Intell. Res..

[11]  William E. Winkler,et al.  Advanced Methods For Record Linkage , 1994 .

[12]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[13]  Charles Elkan,et al.  The Field Matching Problem: Algorithms and Applications , 1996, KDD.

[14]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[15]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[16]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[18]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[19]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[20]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[21]  Un Yong Nahm and Raymond J. Mooney,et al.  Using Information Extraction to Aid the Discovery of Prediction Rules from Text , 2000 .

[22]  Henry A. Kautz,et al.  Hardening soft information sources , 2000, KDD '00.

[23]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[24]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[25]  Seán Slattery,et al.  Data Mining on Symbolic Knowledge Extracted from the Web , 2000 .

[26]  William W. Cohen,et al.  Learning to Match and Cluster Entity Names , 2001 .

[27]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[28]  Andrew McCallum,et al.  Semi-Supervised Clustering with User Feedback , 2003 .

[29]  King-Sun Fu,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Publication Information , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.