On combining text-based and link-based similarity measures for scientific papers

When computing the similarity of scientific papers, text-based measures consider only the content of a paper, and link-based measures consider only its citations. In this paper, we propose a new approach that computes the similarity of scientific papers more accurately by combining text-based and link-based similarity measures. Our method considers the content and citations of scientific papers simultaneously and uses SVMrank to combine the content-based and citation-based similarity scores. We demonstrate the effectiveness of the method through extensive experiments on a real-world dataset of scientific papers. The results show that our approach improves accuracy by more than 20% compared with previous methods.
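The combination step described above can be illustrated with a minimal sketch. It assumes a linear scoring function, which is the form a linear ranking SVM such as SVMrank learns. The abstract does not name the individual measures, so cosine similarity over term frequencies and Jaccard overlap of reference lists are stand-ins chosen for illustration. The function names, the paper dictionary layout, and the weights `w_text` and `w_link` are all hypothetical; in the proposed method the weights would be learned from training data rather than fixed by hand.

```python
import math
from collections import Counter


def cosine_text_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors (stand-in text measure)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_link_similarity(refs_a, refs_b):
    """Jaccard overlap of two reference lists (stand-in link measure)."""
    set_a, set_b = set(refs_a), set(refs_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def combined_similarity(paper_a, paper_b, w_text=0.5, w_link=0.5):
    """Weighted sum of the two scores.

    w_text and w_link are placeholder values, not learned weights.
    """
    text_score = cosine_text_similarity(paper_a["text"], paper_b["text"])
    link_score = jaccard_link_similarity(paper_a["refs"], paper_b["refs"])
    return w_text * text_score + w_link * link_score


# Toy papers, invented purely for illustration.
paper_x = {"text": "graph similarity search", "refs": ["r1", "r2"]}
paper_y = {"text": "graph similarity ranking", "refs": ["r1", "r3"]}
paper_z = {"text": "protein folding", "refs": ["r9"]}

score_xy = combined_similarity(paper_x, paper_y)
score_xz = combined_similarity(paper_x, paper_z)
```

Here `paper_x` and `paper_y` share both terms and a reference, so `score_xy` is larger than `score_xz`, where neither the text nor the references overlap.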
