Probabilistic Iterative Duplicate Detection

The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain dependent approaches.

[1]  H B NEWCOMBE,et al.  Automatic linkage of vital records. , 1959, Science.

[2]  Daniela Florescuand An Extensible Framework for Data Cleaning , 2000, ICDE 2000.

[3]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[4]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[5]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[6]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[7]  Pradeep Ravikumar,et al.  A Hierarchical Graphical Model for Record Linkage , 2004, UAI.

[8]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[9]  Peter Fankhauser,et al.  A Precise Blocking Method for Record Linkage , 2005, DaWaK.

[10]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[11]  Matthew A. Jaro,et al.  Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida , 1989 .

[12]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[14]  Pedro M. Domingos Multi-Relational Record Linkage , 2003 .

[15]  William W. Cohen,et al.  A Comparison of String Metrics for Matching Names and Records , 2003 .

[16]  W. Winkler USING THE EM ALGORITHM FOR WEIGHT COMPUTATION IN THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE , 2000 .

[17]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[18]  R. Mooney,et al.  Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases , 2002 .

[19]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[20]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[21]  Dennis Shasha,et al.  An extensible Framework for Data Cleaning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).