A Model-based K-means Algorithm for Name Disambiguation

Unambiguous identities of resources are an important aspect of the Semantic Web. This paper addresses the personal identity issue in the context of bibliographies. Because names are abbreviated or misspelled in publications and bibliographies, one author may appear under multiple names, and multiple authors may share the same name. Such name ambiguity degrades the performance of identity matching, document retrieval and database federation, and causes improper attribution of research credit. This paper describes a new K-means clustering algorithm, based on an extensible Naïve Bayes probability model, that disambiguates authors who share the same first-name initial and last name in bibliographies and proposes a canonical name. The model captures three types of bibliographic information: coauthor names, the title of the paper, and the title of the journal or proceedings. The algorithm achieves best accuracies of 70.1% and 73.6% in disambiguating 6 different "J Anderson"s and 9 different "J Smith"s, respectively, based on citations collected from researchers' publication web pages.
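
The clustering idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `cluster_citations`, the seeding of each cluster with one of the first k citations, the Laplace smoothing parameter `alpha`, and the toy citations are all assumptions made for illustration. Each citation is treated as three bags of tokens (coauthors, paper title, venue title), and each K-means-style iteration re-estimates a per-cluster Naive Bayes model and then reassigns every citation to its most probable cluster.

```python
import math
from collections import Counter

ATTRIBUTES = ("coauthors", "title", "venue")  # the three information types


def cluster_citations(citations, k, iterations=20, alpha=1.0):
    """Hard-assignment clustering under a Naive Bayes model (illustrative)."""
    vocab = {a: {t for c in citations for t in c[a]} for a in ATTRIBUTES}
    # Assumption: seed cluster z with citation z; the rest start unassigned.
    labels = list(range(k)) + [None] * (len(citations) - k)
    for _ in range(iterations):
        # Re-estimate cluster sizes and per-attribute token counts.
        sizes = Counter(z for z in labels if z is not None)
        assigned = sum(sizes.values())
        counts = [{a: Counter() for a in ATTRIBUTES} for _ in range(k)]
        for c, z in zip(citations, labels):
            if z is not None:
                for a in ATTRIBUTES:
                    counts[z][a].update(c[a])
        # Reassign each citation to the cluster with the highest log-posterior.
        new_labels = []
        for c in citations:
            scores = []
            for z in range(k):
                s = math.log((sizes[z] + alpha) / (assigned + alpha * k))
                for a in ATTRIBUTES:
                    total = sum(counts[z][a].values()) + alpha * len(vocab[a])
                    for t in c[a]:
                        s += math.log((counts[z][a][t] + alpha) / total)
                scores.append(s)
            new_labels.append(max(range(k), key=lambda z: scores[z]))
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
    return labels


# Hypothetical citations for two different authors sharing one ambiguous name.
citations = [
    {"coauthors": ["m lee"], "title": ["protein", "folding"], "venue": ["bioinformatics"]},
    {"coauthors": ["r patel"], "title": ["query", "optimization"], "venue": ["vldb"]},
    {"coauthors": ["m lee"], "title": ["protein", "structure"], "venue": ["bioinformatics"]},
    {"coauthors": ["m lee", "k wong"], "title": ["protein", "docking"], "venue": ["bioinformatics"]},
    {"coauthors": ["r patel"], "title": ["query", "indexing"], "venue": ["vldb"]},
    {"coauthors": ["r patel"], "title": ["query", "caching"], "venue": ["vldb"]},
]
labels = cluster_citations(citations, k=2)
```

In this toy run the citations that share coauthors, title words and venue end up with the same label. Real data would need name normalization, stop-word handling and a more careful initialization, none of which is shown here.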
