DiSCern: A diversified citation recommendation system for scientific queries

Performing a literature survey for scholarly activities has become a challenging and time-consuming task due to the rapid growth in the number of scientific articles. Thus, automatic recommendation of high-quality citations for a given scientific query topic is immensely valuable. The state of the art in citation recommendation suffers from three limitations. First, most existing approaches require input in the form of either the full article or a seed set of citations, or both. However, obtaining citation recommendations from a set of keywords alone is extremely useful for many scientific purposes. Second, existing techniques aim at suggesting prestigious and well-cited articles. However, we often need diversified citations on the given query topic; for instance, diversified recommendations help authors write survey papers and help scholars get a broad view of the key problems on a topic. Third, keyword-based citation recommendation typically misses semantically correlated articles that do not use exactly the same keywords. To the best of our knowledge, no known citation recommendation system in the literature addresses these three limitations simultaneously. In this paper, we propose a novel citation recommendation system called DiSCern to address this research gap. DiSCern finds relevant and diversified citations in response to a search query, expressed as one or more keywords describing the query topic, while using only the citation graph and the keywords associated with the articles, and no latent information. DiSCern uses a novel keyword expansion step, inspired by community finding in social network analysis, to ensure that semantically correlated articles are also included in the results.
Our proposed approach builds primarily on the Vertex Reinforced Random Walk (VRRW) to balance prestige and diversity in the recommended citations. We demonstrate the efficacy of DiSCern empirically on two datasets: a large publication dataset of more than 1.7 million articles in the computer science domain and a dataset of more than 29,000 articles in the theoretical high-energy physics domain. The experimental results show that our proposed approach is efficient and outperforms the state-of-the-art algorithms in terms of both relevance and diversity.
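
To make the role of the vertex-reinforced walk more concrete, the following is a minimal sketch of one common deterministic approximation of such a walk on a small graph. It is not DiSCern's actual algorithm, which the abstract does not specify. The function name `vrrw_scores`, the `restart` parameter, the uniform prior, the added self-loops, and the toy adjacency matrix are all assumptions made for illustration. The idea shown is only that each transition into a node is re-weighted by the mass that node has already accumulated, so a strong node tends to absorb score from its neighbours and the top-ranked nodes end up spread across different regions of the graph.

```python
import numpy as np

def vrrw_scores(adj, restart=0.1, iters=100):
    """Approximate a vertex-reinforced random walk on a small graph.

    adj[u][v] > 0 means u links to v. All names and defaults here are
    illustrative assumptions, not taken from the paper.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]

    # Base transition matrix: add self-loops so every row is non-zero,
    # then normalise each row to sum to 1.
    base = adj + np.eye(n)
    base /= base.sum(axis=1, keepdims=True)

    prior = np.full(n, 1.0 / n)   # uniform restart distribution (assumed)
    scores = prior.copy()

    for _ in range(iters):
        # Reinforcement: weight each transition u -> v by the current
        # mass at v, standing in for the visit count of v.
        reinforced = base * scores[np.newaxis, :]
        reinforced /= reinforced.sum(axis=1, keepdims=True)

        # Mix with the restart distribution so every score stays positive.
        transition = restart * prior[np.newaxis, :] + (1.0 - restart) * reinforced
        scores = scores @ transition

    return scores

# Hypothetical toy graph: two small clusters joined by one edge (0-3).
toy = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
toy_scores = vrrw_scores(toy)
ranking = np.argsort(-toy_scores)  # node indices, highest score first
```

In a keyword-driven setting, the uniform prior would presumably be replaced by a distribution concentrated on articles matching the (expanded) query keywords; that substitution is also an assumption here.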

[1]  Matthias Hagen,et al.  Query segmentation revisited , 2011, WWW.

[2]  M. M. Kessler Bibliographic coupling between scientific papers , 1963 .

[3]  Jingrui He,et al.  Diversified ranking on large graphs: an optimization viewpoint , 2011, KDD.

[4]  Xiaojin Zhu,et al.  Improving Diversity in Ranking using Absorbing Random Walks , 2007, NAACL.

[5]  Jian Pei,et al.  Citation recommendation without author supervision , 2011, WSDM '11.

[6]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[7]  Sihem Amer-Yahia,et al.  Efficient Computation of Diverse Query Results , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[8]  Jon M. Kleinberg,et al.  The link-prediction problem for social networks , 2007, J. Assoc. Inf. Sci. Technol..

[9]  Sean M. McNee,et al.  Enhancing digital libraries with TechLens , 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004..

[10]  Jiang Li,et al.  ArticleRank: a PageRank-based alternative to numbers of citations for analysing citation networks , 2009, Aslib Proc..

[11]  Yixin Chen,et al.  Ranking on Data Manifold with Sink Points , 2013, IEEE Transactions on Knowledge and Data Engineering.

[12]  Hiep Phuc Luong,et al.  Concept-Based Document Recommendations for CiteSeer Authors , 2008, AH.

[13]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[14]  Ümit V. Çatalyürek,et al.  Diversifying Citation Recommendations , 2012, ACM Trans. Intell. Syst. Technol..

[15]  Bernhard Schölkopf,et al.  Learning from labeled and unlabeled data on a directed graph , 2005, ICML.

[16]  Filip Radlinski,et al.  Recommending related papers based on digital library access records , 2007, JCDL '07.

[17]  Thorsten Joachims,et al.  Identifying the original contribution of a document via language modeling , 2009, ECML/PKDD.

[18]  Sean M. McNee,et al.  On the recommending of citations for research papers , 2002, CSCW '02.

[19]  Pierre Tarres,et al.  Dynamic of vertwx-reinforced random walks. , 2008, 0809.2739.

[20]  Henry G. Small,et al.  Co-citation in the scientific literature: A new measure of the relationship between two documents , 1973, J. Am. Soc. Inf. Sci..

[21]  C. Lee Giles,et al.  CiteSeer: an automatic citation indexing system , 1998, DL '98.

[22]  Mark E. J. Newman,et al.  Community detection and graph partitioning , 2013, ArXiv.

[23]  Michel Bena Dynamics of Vertex-Reinforced Random Walks , 2009 .

[24]  Michael Ubell The Intelligent Database Machine (IDM) , 1985, Query Processing in Database Systems.

[25]  Christos Faloutsos,et al.  Automatic multimedia cross-modal correlation discovery , 2004, KDD.

[26]  Sean M. McNee,et al.  Improving recommendation lists through topic diversification , 2005, WWW '05.

[27]  Marco Gori,et al.  Recommender Systems : A Random-Walk Based Approach , 2006 .

[28]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[29]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[30]  Ümit V. Çatalyürek,et al.  Direction Awareness in Citation Recommendation , 2012 .

[31]  Sean M. McNee,et al.  Enhancing digital libraries with TechLens+ , 2004, JCDL.

[32]  Sreenivas Gollapudi,et al.  An axiomatic approach for result diversification , 2009, WWW '09.

[33]  Cornelia Caragea,et al.  Can't see the forest for the trees?: a citation recommendation system , 2013, JCDL '13.

[34]  ChengXiang Zhai,et al.  A general optimization framework for smoothing language models on graph structures , 2008, SIGIR '08.

[35]  Carl Gutwin,et al.  KEA: practical automatic keyphrase extraction , 1999, DL '99.

[36]  Pablo Castells,et al.  Novelty and diversity metrics for recommender systems: Choice, discovery and relevance , 2011 .

[37]  Jeffrey Xu Yu,et al.  Scalable Diversified Ranking on Large Graphs , 2011, IEEE Transactions on Knowledge and Data Engineering.

[38]  W. Bruce Croft,et al.  Recommending citations for academic papers , 2007, SIGIR.

[39]  Srinivasan Parthasarathy,et al.  Local graph sparsification for scalable clustering , 2011, SIGMOD '11.

[40]  Jade Goldstein-Stewart,et al.  The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries , 1998, SIGIR Forum.

[41]  Taher H. Haveliwala Topic-sensitive PageRank , 2002, IEEE Trans. Knowl. Data Eng..

[42]  Daniel Kifer,et al.  Context-aware citation recommendation , 2010, WWW '10.

[43]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[44]  R. Pemantle Vertex-reinforced random walk , 1992, math/0404041.

[45]  Marcos André Gonçalves,et al.  A source independent framework for research paper recommendation , 2011, JCDL '11.