Topic Representation of Researchers' Interests in a Large-Scale Academic Database and Its Application to Author Disambiguation

Analyzing academic databases is crucial for promoting interdisciplinary research and recommending collaborators from different research fields. This paper addresses the problem of characterizing researchers’ interests with a set of diverse research topics found in a large-scale academic database. Specifically, we first use latent Dirichlet allocation to extract topics, represented as distributions over words, from a training dataset. We then convert the textual features of each of a researcher’s publications into a topic vector and calculate the centroid of these vectors to summarize the researcher’s interests as a single vector. In experiments conducted on CiNii Articles, the largest academic database in Japan, we show that the extracted topics reflect the diversity of the research fields in the database. The experimental results also indicate that the proposed topic representation is applicable to the author disambiguation problem.

key words: researcher analysis, academic database, topic model, author disambiguation
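The centroid step described in the abstract can be illustrated with a minimal sketch. It assumes each publication has already been converted to a topic vector (for example, by a trained LDA model); the three-topic vectors, the function names `centroid` and `cosine`, and the use of cosine similarity to compare two researchers are illustrative assumptions, not details taken from the paper.

```python
from math import sqrt


def centroid(topic_vectors):
    """Average per-publication topic vectors component-wise."""
    if not topic_vectors:
        raise ValueError("need at least one publication")
    n = len(topic_vectors)
    dim = len(topic_vectors[0])
    return [sum(v[k] for v in topic_vectors) / n for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical topic vectors (3 topics) for the publications of two authors.
author_a = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
author_b = [[0.1, 0.1, 0.8]]

# One interest vector per researcher.
interest_a = centroid(author_a)
interest_b = centroid(author_b)

# Assumed comparison: a low similarity suggests different research interests.
similarity = cosine(interest_a, interest_b)
print(interest_a, interest_b, similarity)
```

Because each input vector is a probability distribution over topics, the centroid is itself a distribution, so researchers with different numbers of publications remain directly comparable.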
