A hierarchical naive Bayes mixture model for name disambiguation in author citations

Because of name variations, an author may have multiple names, and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for name disambiguation in author citations. The method partitions a collection of citations into clusters, with each cluster containing only citations by the same author, thus disambiguating authorship in citations to induce author name identities. Three types of citation features are used: co-author names, paper title words, and journal or proceedings title words. The approach is illustrated with 16 name datasets constructed from publication lists collected from author homepages and the DBLP computer science bibliography.
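To make the clustering idea concrete, the following is a minimal sketch of a flat multinomial naive Bayes mixture fit with EM. It is not the hierarchical model of the paper, whose details are not given in the abstract. The function name `fit_nb_mixture`, the feature prefixes (`co:`, `t:`, `v:` for co-author, title word, and venue word), the smoothing constant `alpha`, the seed-based initialization, and the toy citations are all assumptions made for illustration, and the number of clusters `k` is assumed to be known.

```python
import math
from collections import Counter


def fit_nb_mixture(docs, k, iters=30, alpha=0.1):
    """Cluster token lists with a multinomial naive Bayes mixture fit by EM.

    Illustrative only: a flat mixture, not the paper's hierarchical model.
    Requires 1 <= k <= len(docs).
    """
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)

    # Assumed initialization: choose k seed citations with little feature
    # overlap, give each seed its own cluster, leave the rest uniform.
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(n) if i not in seeds]
        seeds.append(min(
            rest,
            key=lambda i: sum(len(set(docs[i]) & set(docs[s])) for s in seeds),
        ))
    resp = [[1.0 / k] * k for _ in range(n)]
    for c, s in enumerate(seeds):
        resp[s] = [1.0 if j == c else 0.0 for j in range(k)]

    for _ in range(iters):
        # M-step: smoothed cluster priors and per-cluster feature probabilities.
        log_prior, log_theta = [], []
        for c in range(k):
            mass = sum(resp[i][c] for i in range(n))
            log_prior.append(math.log((mass + alpha) / (n + k * alpha)))
            word_mass = {w: alpha for w in vocab}
            for i in range(n):
                for w, cnt in counts[i].items():
                    word_mass[w] += resp[i][c] * cnt
            total = sum(word_mass.values())
            log_theta.append({w: math.log(m / total) for w, m in word_mass.items()})

        # E-step: posterior cluster responsibilities for each citation,
        # computed in log space and normalized.
        for i in range(n):
            scores = [
                log_prior[c]
                + sum(cnt * log_theta[c][w] for w, cnt in counts[i].items())
                for c in range(k)
            ]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            z = sum(weights)
            resp[i] = [w / z for w in weights]

    # Hard assignment: most probable cluster per citation.
    return [max(range(k), key=lambda c: resp[i][c]) for i in range(n)]


# Hypothetical citations sharing one ambiguous author name.
citations = [
    ["co:j_smith", "t:bayes", "t:mixture", "v:jcdl"],
    ["co:j_smith", "t:bayes", "t:clustering", "v:jcdl"],
    ["co:m_garcia", "t:protein", "t:folding", "v:bioinformatics"],
    ["co:m_garcia", "t:protein", "t:structure", "v:bioinformatics"],
]
labels = fit_nb_mixture(citations, k=2)
print(labels)
```

Each cluster label stands for one inferred author identity. Treating all three feature types as one bag of tokens is a simplification; a model closer to the abstract's description would presumably handle co-authors, title words, and venue words as separate feature groups.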
