Iterative record linkage for cleaning and integration

Record linkage, the problem of determining when two records refer to the same entity, has applications in both data cleaning (deduplication) and the integration of data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples whose similarity scores exceed a given threshold are declared matches. This method can perform well in many domains, particularly those with little noise in the data, but in some domains looking only at tuple values is not enough. By also examining the context of a tuple, i.e. the other tuples to which it is linked, we can make a more accurate linkage decision. This additional accuracy comes at a price: to find all duplicates correctly, we may need to make multiple passes over the data, because each linkage that is discovered may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of using join information when comparing records.
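
The multi-pass idea can be illustrated with a minimal sketch. It is not the method evaluated in this work; it only shows why context can require more than one pass. Everything in it is an assumption made for illustration: the function names, the token-level Jaccard similarity for attribute values, the Jaccard overlap of linked clusters as the context score, the weight `alpha`, the threshold, and the all-pairs comparison loop.

```python
from itertools import combinations


def find(parent, x):
    """Return the cluster representative of record x (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def value_sim(a, b):
    """Jaccard overlap of lowercase word tokens (a stand-in for any attribute similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def context_sim(parent, links, i, j):
    """Jaccard overlap of the clusters that records i and j are linked to."""
    ci = {find(parent, n) for n in links[i]}
    cj = {find(parent, n) for n in links[j]}
    if not ci and not cj:
        return 0.0
    return len(ci & cj) / len(ci | cj)


def iterative_linkage(values, links, threshold, alpha=0.5, max_passes=10):
    """Merge records whose combined score exceeds the threshold, repeating
    passes until a full pass discovers no new linkage (or max_passes is hit).

    Returns (cluster representative per record, number of passes made).
    """
    parent = list(range(len(values)))
    passes = 0
    for _ in range(max_passes):
        passes += 1
        merged = False
        for i, j in combinations(range(len(values)), 2):
            ri, rj = find(parent, i), find(parent, j)
            if ri == rj:
                continue  # already declared the same entity
            score = (alpha * value_sim(values[i], values[j])
                     + (1 - alpha) * context_sim(parent, links, i, j))
            if score > threshold:
                parent[rj] = ri  # record the linkage
                merged = True
        if not merged:
            break  # fixed point: no pass can find anything new
    return [find(parent, i) for i in range(len(values))], passes


# Hypothetical toy data: two person records, each linked to one lab record.
values = ["J Smith", "John Smith", "Data Cleaning Lab", "Data Cleaning Lab"]
links = [{2}, {3}, {0}, {1}]

clusters, passes = iterative_linkage(values, links, threshold=0.4)
print(clusters, passes)
```

In this toy input the two lab records match on values alone during the first pass. Only after that linkage exists do the two person records share a linked cluster, so they are merged in the second pass; a third pass confirms that nothing else changes. With `alpha=1.0` (values only) the person records are never linked.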
