Using semi-structured data for assessing research paper similarity

Assessing the similarity of research papers is of interest in a variety of application contexts. It is a challenging task, however, because the full text of the papers is often unavailable, so similarity must be determined from the papers' abstracts and a few additional features such as their authors, keywords, and the journals in which they were published. Our work explores several methods to exploit this information, first using methods based on the vector space model and then adapting language modeling techniques to this end. In the first case, in addition to a number of standard approaches, we experiment with a form of explicit semantic analysis. In the second case, the basic strategy we pursue is to augment the information contained in the abstract by interpolating the corresponding language model with language models for the authors, keywords, and journal of the paper. This strategy is then extended by revealing the latent topic structure of the collection using an adaptation of Latent Dirichlet Allocation, in which the keywords provided by the authors are used to guide the process. Experimental analysis shows that a well-considered use of these techniques significantly improves on the results of the standard vector space model approach.
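
The interpolation strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes unigram language models estimated by maximum likelihood, linear interpolation with fixed weights, and a comparison of the resulting models by Kullback-Leibler divergence. The function names, the weights 0.7 and 0.3, the floor value `epsilon`, and the toy data are all hypothetical choices made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linear interpolation: P(w) = sum_i weight_i * P_i(w).

    The weights are assumed to sum to 1 so the result is a distribution.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocabulary = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(weights, models))
        for word in vocabulary
    }


def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q); words missing from q are floored at epsilon (an assumption)."""
    return sum(
        p_w * math.log(p_w / max(q.get(word, 0.0), epsilon))
        for word, p_w in p.items()
        if p_w > 0.0
    )


# Toy paper: an abstract plus author-provided keywords (hypothetical data).
abstract_lm = unigram_model("topic models for document similarity".split())
keyword_lm = unigram_model(["lda", "similarity"])

# Hypothetical weights: 0.7 for the abstract, 0.3 for the keywords.
paper_lm = interpolate([abstract_lm, keyword_lm], [0.7, 0.3])

# A second toy paper built from its abstract only.
other_lm = unigram_model("document similarity with topic models".split())

# Lower divergence means the second paper's model explains the first better.
print(kl_divergence(paper_lm, other_lm))
```

In this sketch the keyword "lda" receives probability mass in `paper_lm` even though it never occurs in the abstract, which is the intended effect of augmenting the abstract with the other features. Language models for authors and journals could be added as further entries in `models` with their own weights.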
