Learning to Match and Cluster Entity Names

Information retrieval is, in large part, the study of methods for assessing the similarity of pairs of documents. Document similarity metrics have been used for many tasks including ad hoc document retrieval, text classification [YC1994], and summarization [GC1998,SSMB1997]. Another problem area in which similarity metrics are central is record linkage (e.g., [KA1985]), where one wishes to determine if two database records taken from different source databases refer to the same entity. For instance, one might wish to determine if two database records from two different hospitals, each containing a patient’s name, address, and insurance information, refer to the same person; as another example, one might wish to determine if two bibliography records, each containing a paper title, list of authors, and journal name, refer to the same publication.
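To make the record linkage setting concrete, the sketch below scores pairs of bibliography records by field-wise token overlap. It is only an illustration under stated assumptions: the Jaccard measure, the unweighted average over fields, the helper names (`tokens`, `jaccard`, `record_similarity`), and the three sample records are all hypothetical and are not taken from the paper, which does not commit to a particular similarity metric in this passage.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of the token sets of two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty fields are treated as identical
    return len(ta & tb) / len(ta | tb)

def record_similarity(r1, r2, fields):
    """Unweighted mean of per-field Jaccard scores (an assumed combination rule)."""
    return sum(jaccard(r1[f], r2[f]) for f in fields) / len(fields)

FIELDS = ("title", "authors", "journal")

# Hypothetical records: rec_a and rec_b describe the same paper with
# different formatting; rec_c describes an unrelated paper.
rec_a = {"title": "Matching Entity Names in Databases",
         "authors": "J. Smith, A. Jones",
         "journal": "Journal of Hypothetical Data"}
rec_b = {"title": "Matching entity names in databases.",
         "authors": "Smith, J. and Jones, A.",
         "journal": "J. Hypothetical Data"}
rec_c = {"title": "Clustering Weather Stations",
         "authors": "B. Lee",
         "journal": "Hypothetical Weather Letters"}

same = record_similarity(rec_a, rec_b, FIELDS)
different = record_similarity(rec_a, rec_c, FIELDS)
print(f"a vs b: {same:.3f}")
print(f"a vs c: {different:.3f}")
```

A linkage procedure would then declare two records co-referent when their score exceeds some threshold. Choosing that threshold, and the relative weight of each field, is where a hand-built metric of this kind tends to need tuning.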

[1] Ivan P. Fellegi et al. A Theory for Record Linkage, 1969.

[2] Michael R. Anderberg et al. Cluster Analysis for Applications, 1973.

[3] Michael J. Fischer et al. The String-to-String Correction Problem, 1974, JACM.

[4] W. Bruce Croft et al. Using Probabilistic Models of Document Retrieval without Relevance Information, 1979, J. Documentation.

[5] Gerard Salton et al. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, 1989.

[6] Joachim M. Buhmann et al. Central and Pairwise Data Clustering by Competitive Neural Networks, 1993, NIPS.

[7] Charles Elkan et al. The Field Matching Problem: Algorithms and Applications, 1996, KDD.

[8] Yoram Singer et al. Learning to Order Things, 1997, NIPS.

[9] Gerard Salton et al. Automatic Text Structuring and Summarization, 1997, Inf. Process. Manag.

[10] Ralph Grishman et al. NYU: Description of the MENE Named Entity System as Used in MUC-7, 1998, MUC.

[11] Michael Werman et al. A Randomized Algorithm for Pairwise Clustering, 1998, NIPS.

[12] Jaime G. Carbonell et al. The Use of MMR and Diversity-Based Reranking in Document Reranking and Summarization, 1998.

[13] C. Lee Giles et al. Autonomous citation matching, 1999, AGENTS '99.

[14] William W. Cohen. Data integration using similarity joins and a word-based information representation language, 2000, TOIS.

[15] Dennis Shasha et al. AJAX: an extensible data cleaning tool, 2000, SIGMOD '00.

[16] Andrew McCallum et al. Efficient clustering of high-dimensional data sets with application to reference matching, 2000, KDD '00.

[17] Andrew McCallum et al. Automating the Construction of Internet Portals with Machine Learning, 2000, Information Retrieval.