Two supervised learning approaches for name disambiguation in author citations

Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity degrades the performance of document retrieval, Web search, and database integration, and may cause work to be attributed to the wrong author. We investigate two supervised learning approaches to disambiguating authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) [V. Vapnik (1995)], a discriminative model, together with a vector space representation of citations. Both approaches use three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceedings. We illustrate the two approaches on two types of data: one collected from the Web, mainly publication lists from homepages, and the other collected from the DBLP citation databases.
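To make the generative approach concrete, here is a minimal sketch of a naive Bayes classifier over the three citation attributes. The abstract does not specify the model details, so everything below is an assumption for illustration: a multinomial distribution per attribute, add-one smoothing, independence across attributes, and the class name `NaiveBayesAuthorModel`, the attribute keys, and the toy citations, which are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical attribute keys mirroring the three attribute types in the abstract.
ATTRIBUTES = ("coauthors", "title_words", "venue_words")


class NaiveBayesAuthorModel:
    """Illustrative naive Bayes model: one candidate author identity per class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed, not from the paper)
        self.citation_counts = Counter()
        self.token_counts = defaultdict(lambda: {a: Counter() for a in ATTRIBUTES})
        self.vocab = {a: set() for a in ATTRIBUTES}

    def fit(self, citations, authors):
        # Count how often each token appears per author and per attribute.
        for citation, author in zip(citations, authors):
            self.citation_counts[author] += 1
            for attr in ATTRIBUTES:
                tokens = citation.get(attr, [])
                self.token_counts[author][attr].update(tokens)
                self.vocab[attr].update(tokens)
        return self

    def log_score(self, citation, author):
        # Unnormalized log posterior: log prior plus smoothed log likelihoods.
        total = sum(self.citation_counts.values())
        score = math.log(self.citation_counts[author] / total)
        for attr in ATTRIBUTES:
            counts = self.token_counts[author][attr]
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(counts.values()) + self.alpha * (len(self.vocab[attr]) + 1)
            for token in citation.get(attr, []):
                score += math.log((counts[token] + self.alpha) / denom)
        return score

    def predict(self, citation):
        # Pick the author identity with the highest score.
        return max(self.citation_counts, key=lambda a: self.log_score(citation, a))


# Toy data: two different people who both publish as "J. Smith" (made up).
train = [
    {"coauthors": ["a. gupta"], "title_words": ["query", "optimization"], "venue_words": ["vldb"]},
    {"coauthors": ["a. gupta", "l. chen"], "title_words": ["index", "structures"], "venue_words": ["sigmod"]},
    {"coauthors": ["m. ortiz"], "title_words": ["protein", "folding"], "venue_words": ["bioinformatics"]},
]
labels = ["smith_db", "smith_db", "smith_bio"]

model = NaiveBayesAuthorModel().fit(train, labels)
query = {"coauthors": ["l. chen"], "title_words": ["query", "index"], "venue_words": ["vldb"]}
print(model.predict(query))
```

The discriminative approach would instead map each citation to a feature vector (for example, token counts over the same three attributes) and train an SVM on those vectors; that part is omitted here to keep the sketch dependency-free.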
