Name disambiguation in author citations using a K-way spectral clustering method

An author may appear under multiple names, and multiple authors may share the same name, because of name abbreviations, identical names, or name misspellings in publications or bibliographies (citations). The resulting name ambiguity can degrade the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach that uses K-way spectral clustering to disambiguate authors in citations. The approach utilizes three types of citation attributes: co-author names, paper titles, and publication venue titles. It is illustrated on 16 name datasets whose citations were collected from the DBLP bibliography database and author home pages, and the results show that name disambiguation can be achieved using these citation attributes.
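
To make the idea concrete, the following is a minimal sketch of K-way spectral clustering over citations that share one ambiguous author name. It is not the authors' implementation. The abstract does not specify the details, so several choices here are assumptions: the bag-of-tokens features and cosine similarity, the degree-normalized affinity matrix, the orthogonal iteration used to find eigenvectors, and the greedy pivot-based assignment. The function names, dictionary keys, and toy citations are hypothetical.

```python
import math
import random
from collections import Counter

def citation_features(citation):
    """Bag of tokens from the three attributes, tagged by attribute type (assumed encoding)."""
    tokens = [f"coauthor:{name.lower()}" for name in citation["coauthors"]]
    tokens += [f"title:{w}" for w in citation["title"].lower().split()]
    tokens += [f"venue:{w}" for w in citation["venue"].lower().split()]
    return Counter(tokens)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    shared = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return shared / norm if norm else 0.0

def normalized_affinity(citations):
    """Cosine similarity matrix S scaled to D^(-1/2) S D^(-1/2) (an assumed normalization)."""
    feats = [citation_features(c) for c in citations]
    sim = [[cosine(u, v) for v in feats] for u in feats]
    deg = [sum(row) or 1.0 for row in sim]
    n = len(sim)
    return [[sim[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def top_k_eigenvectors(matrix, k, iterations=200, seed=0):
    """Orthogonal iteration: approximates the k dominant eigenvectors of a symmetric PSD matrix."""
    n = len(matrix)
    rng = random.Random(seed)
    basis = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        basis = [[dot(row, vec) for row in matrix] for vec in basis]
        for a in range(k):  # Gram-Schmidt re-orthonormalization
            for b in range(a):
                proj = dot(basis[a], basis[b])
                basis[a] = [x - proj * y for x, y in zip(basis[a], basis[b])]
            norm = math.sqrt(dot(basis[a], basis[a]))
            basis[a] = [x / norm for x in basis[a]]
    return basis

def assign_by_pivots(rows, k):
    """Greedily pick k mutually dissimilar rows, then label each row by its closest pivot."""
    residual = [list(r) for r in rows]
    pivots = []
    for _ in range(k):
        p = max(range(len(rows)), key=lambda i: dot(residual[i], residual[i]))
        pivots.append(p)
        norm = math.sqrt(dot(residual[p], residual[p]))
        direction = [x / norm for x in residual[p]]
        residual = [[x - dot(r, direction) * d for x, d in zip(r, direction)]
                    for r in residual]
    return [max(range(k), key=lambda c: abs(dot(row, rows[pivots[c]]))) for row in rows]

def cluster_citations(citations, k):
    """Return one cluster label per citation; k is the assumed number of distinct authors."""
    vectors = top_k_eigenvectors(normalized_affinity(citations), k)
    rows = []
    for i in range(len(citations)):
        row = [vec[i] for vec in vectors]
        norm = math.sqrt(dot(row, row)) or 1.0
        rows.append([x / norm for x in row])  # unit-length spectral embedding
    return assign_by_pivots(rows, k)

# Hypothetical citations that all list the same ambiguous author name.
citations = [
    {"coauthors": ["A. Gupta"], "title": "Query optimization in relational databases", "venue": "VLDB"},
    {"coauthors": ["M. Rossi"], "title": "Image segmentation with graph cuts", "venue": "CVPR"},
    {"coauthors": ["A. Gupta", "L. Chen"], "title": "Adaptive query optimization", "venue": "VLDB"},
    {"coauthors": ["M. Rossi"], "title": "Texture segmentation for natural scenes", "venue": "CVPR"},
    {"coauthors": ["L. Chen"], "title": "Indexing relational databases", "venue": "SIGMOD"},
]
labels = cluster_citations(citations, k=2)
print(labels)
```

In this toy input the database-oriented citations (indices 0, 2, 4) share no tokens with the vision-oriented ones (indices 1, 3), so the two groups should receive different labels. The sketch takes K as given; with real data the number of distinct authors behind a name would have to be known or estimated, and cross-group overlap in common title words would make the separation less clean.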
