Two supervised learning approaches for name disambiguation in author citations

Because of name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may appear under multiple names, and multiple authors may share the same name. Such name ambiguity degrades the performance of document retrieval, Web search, and database integration, and may cause work to be attributed to the wrong author. We investigate two supervised learning approaches to disambiguating authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) [V. Vapnik (1995)], a discriminative model, together with the vector space representation of citations. Both approaches use three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceedings. We illustrate the two approaches on two types of data: one collected from the Web, mainly publication lists from homepages, and the other collected from the DBLP citation databases.
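The generative approach can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with Laplace smoothing over attribute-tagged tokens; the abstract does not specify the exact estimation procedure, so this is one plausible reading rather than the authors' implementation. The class name `NaiveBayesDisambiguator`, the helper `features`, the author labels, and the toy citations are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(citation):
    """Turn a citation into a bag of tokens tagged by attribute type
    (coauthor name, paper-title word, venue-title word)."""
    feats = [("coauthor", name.lower()) for name in citation["coauthors"]]
    feats += [("title", w) for w in citation["title"].lower().split()]
    feats += [("venue", w) for w in citation["venue"].lower().split()]
    return feats


class NaiveBayesDisambiguator:
    """Hypothetical multinomial naive Bayes with Laplace smoothing (assumption)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()               # citations per author
        self.feature_counts = defaultdict(Counter)  # author -> feature counts
        self.vocab = set()

    def fit(self, citations, author_ids):
        for citation, author in zip(citations, author_ids):
            self.class_counts[author] += 1
            for f in features(citation):
                self.feature_counts[author][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, citation):
        total = sum(self.class_counts.values())
        # Features never seen in training are skipped.
        feats = [f for f in features(citation) if f in self.vocab]
        best_author, best_score = None, -math.inf
        for author, n in self.class_counts.items():
            counts = self.feature_counts[author]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy training data: two distinct authors who share the name "J. Smith".
train = [
    ({"coauthors": ["A. Kumar", "B. Lee"],
      "title": "Kernel methods for text classification",
      "venue": "Machine Learning Journal"}, "smith_ml"),
    ({"coauthors": ["A. Kumar"],
      "title": "Support vector learning for document ranking",
      "venue": "Conference on Machine Learning"}, "smith_ml"),
    ({"coauthors": ["C. Ortiz"],
      "title": "Protein folding simulation on clusters",
      "venue": "Journal of Computational Biology"}, "smith_bio"),
    ({"coauthors": ["C. Ortiz", "D. Wong"],
      "title": "Molecular dynamics of protein binding",
      "venue": "Bioinformatics Letters"}, "smith_bio"),
]

model = NaiveBayesDisambiguator().fit(
    [c for c, _ in train], [a for _, a in train]
)

query = {"coauthors": ["B. Lee"],
         "title": "Text classification with kernels",
         "venue": "Machine Learning Journal"}
predicted = model.predict(query)
print(predicted)
```

The discriminative approach would start from the same three attribute types, encoded as a sparse vector per citation, and train an SVM on those vectors; that variant is not shown here.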

[1]  G. Sayeed Choudhury,et al.  Automated Name Authority Control and Enhanced Searching in the Levy Collection , 2001, D-Lib Mag.

[2]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, Joint Conference on Digital Libraries.

[3]  Chris H. Q. Ding,et al.  Bipartite graph partitioning and data clustering , 2001, CIKM '01.

[4]  Suzanne H. Brewster.  Personal Name Formation of Victorian Era Painters: A Comparison of Scholar-Created Bibliographic Tools and the Library of Congress Name Authority File , 1996 .

[5]  P. Gillman.  National name authority file: report to the national council on archives , 1998 .

[6]  Y. Yao,et al.  A NON-NUMERIC APPROACH TO UNCERTAIN REASONING , 1995 .

[7]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[8]  Ted Pedersen,et al.  An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet , 2002, CICLing.

[9]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[10]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[11]  Mark Craven,et al.  Hierarchical Hidden Markov Models for Information Extraction , 2003, IJCAI.

[12]  David J. DeWitt,et al.  Duplicate record elimination in large data files , 1983, TODS.

[13]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[14]  Ido Dagan,et al.  Similarity-Based Estimation of Word Cooccurrence Probabilities , 1994, ACL.

[15]  James W. Warner,et al.  Automated name authority control , 2001, JCDL '01.

[16]  W. Hersh,et al.  Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries , 2002 .

[17]  Amy Friedlander,et al.  D-Lib Magazine: Publishing as the Honest Broker , 1998 .

[18]  C. Lee Giles,et al.  CiteSeer: an automatic citation indexing system , 1998, DL '98.

[19]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[20]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[21]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[22]  Henry A. Kautz,et al.  Hardening soft information sources , 2000, KDD '00.

[23]  Jun'ichi Tsujii,et al.  Training a Naive Bayes Classifier via the EM Algorithm with a Class Distribution Constraint , 2003, CoNLL.

[24]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[25]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[26]  Nigel Collier,et al.  Use of Support Vector Machines in Extended Named Entity Recognition , 2002, CoNLL.

[27]  Craig A. Knoblock,et al.  Learning domain-independent string transformation weights for high accuracy object identification , 2002, KDD.

[28]  Karl Branting,et al.  Name-Matching Algorithms for Legal Case-Management Systems , 2002, J. Inf. Law Technol..

[29]  Y. Bar-Shalom.  Tracking and data association , 1988 .

[30]  Thorsten Joachims,et al.  A Statistical Learning Model of Text Classification for Support Vector Machines. , 2001, SIGIR 2002.

[31]  Roni Rosenfeld,et al.  Learning Hidden Markov Model Structure for Information Extraction , 1999 .

[32]  Pradeep Ravikumar,et al.  Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[33]  Hui Han,et al.  A Model-based K-means Algorithm for Name Disambiguation , 2003 .

[34]  Xuegong Zhang,et al.  Recursive Sample Classification and Gene Selection based on SVM: Method and Software Description , 2001 .

[35]  Hui Han,et al.  eBizSearch: an OAI-compliant digital library for ebusiness , 2003, Joint Conference on Digital Libraries.

[36]  Atsuhiro Takasu,et al.  Bibliographic attribute extraction from erroneous references based on a statistical model , 2003, Joint Conference on Digital Libraries.

[37]  Patrick Pantel,et al.  Concept Discovery from Text , 2002, COLING.

[38]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[39]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[40]  Lluís Màrquez i Villodre,et al.  Naive Bayes and Exemplar-based Approaches to Word Sense Disambiguation Revisited , 2000, ECAI.

[41]  Inderjit S. Dhillon,et al.  Generative model-based clustering of directional data , 2003, KDD '03.

[42]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[43]  Dustin Boswell,et al.  Introduction to Support Vector Machines , 2002 .

[44]  Robert Krovetz,et al.  Viewing morphology as an inference process , 1993, Artif. Intell..

[45]  Thorsten Joachims,et al.  A statistical learning model of text classification for support vector machines , 2001, SIGIR '01.

[46]  Tok Wang Ling,et al.  IntelliClean: a knowledge-based intelligent data cleaner , 2000, KDD '00.