S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently

Content analysis and link analysis each have advantages in measuring relationships among documents. In this paper, we propose a method that combines the two to compute the similarity of research papers, so that the papers can be clustered more accurately. To improve the efficiency of similarity calculation, we develop a strategy that handles the relationship graph separately without affecting accuracy. We also design an approach that assigns different weights to the different links of a paper, which improves the accuracy of similarity calculation. Experimental results on the ACM data set show that our new algorithm, S-SimRank, outperforms other algorithms.
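To make the general idea of combining content and link information concrete, the following is a minimal sketch and not the S-SimRank algorithm itself, whose details are not given here. It assumes a plain SimRank iteration over citation in-links, a bag-of-words cosine similarity for content, and a simple linear blend of the two. The function names, the decay factor `c`, the blend weight `alpha`, and the toy papers are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def simrank(in_links, c=0.8, iters=10):
    """Plain SimRank. in_links maps every node to the list of nodes citing it."""
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iters):
        new = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0
                continue
            iu, iv = in_links[u], in_links[v]
            if not iu or not iv:
                # A node with no in-links has zero similarity to other nodes.
                new[(u, v)] = 0.0
                continue
            total = sum(sim[(a, b)] for a in iu for b in iv)
            new[(u, v)] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


def combined(text_sim, link_sim, alpha=0.5):
    """Hypothetical linear blend of content and link similarity."""
    return alpha * text_sim + (1 - alpha) * link_sim


# Toy data (assumed): p3 cites both p1 and p2.
in_links = {"p1": ["p3"], "p2": ["p3"], "p3": []}
texts = {
    "p1": Counter("link analysis clustering".split()),
    "p2": Counter("content analysis clustering".split()),
}

link_sim = simrank(in_links)[("p1", "p2")]
text_sim = cosine(texts["p1"], texts["p2"])
score = combined(text_sim, link_sim)
```

In this toy graph, `p1` and `p2` share a single citing paper, so their link similarity equals the decay factor `c`, while their content similarity comes from the two shared terms. A per-link weighting scheme like the one the abstract mentions would replace the uniform average inside `simrank`; how the authors weight links is not specified here.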
