Link-Based Similarity Measures Using Reachability Vectors

We present a novel approach for computing link-based similarities among objects accurately by utilizing the link information pertaining to those objects. We discuss the problems with previous link-based similarity measures and propose a measure that does not suffer from them. In the proposed approach, each target object is represented by a vector. The vector has one element for every object in the given data, and the value of each element denotes the weight of the corresponding object. For this weight, we propose to use the probability of reaching the corresponding object from the target object, computed using the “Random Walk with Restart” strategy. We then define the similarity between two objects as the cosine similarity of their two vectors. We provide examples to show that our approach does not suffer from the aforementioned problems. We also evaluate the proposed methods against existing link-based measures, qualitatively and quantitatively, on two kinds of data sets: scientific papers and Web documents. Our experimental results indicate that the proposed methods significantly outperform the existing measures.
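
The idea described above (one reachability vector per object, compared by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the restart probability of 0.15, the fixed iteration count, the handling of objects without out-links, and the toy graph are all assumptions made for illustration only.

```python
import math


def rwr_vector(adj, start, restart=0.15, iters=100):
    """Approximate Random Walk with Restart probabilities from `start`.

    `adj` maps every node to the list of nodes it links to; every linked
    node must also appear as a key. `restart` and `iters` are assumed
    illustrative defaults, not values taken from the paper.
    """
    nodes = sorted(adj)
    p = {v: 0.0 for v in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                # Assumption: a walker stuck at a node without out-links
                # jumps back to the start node.
                nxt[start] += (1 - restart) * p[u]
                continue
            share = (1 - restart) * p[u] / len(out)
            for v in out:
                nxt[v] += share
        # With probability `restart` the walker returns to the start node.
        nxt[start] += restart
        p = nxt
    return [p[v] for v in nodes]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def rwr_similarity(adj, a, b, **kwargs):
    """Similarity of objects a and b as the cosine of their RWR vectors."""
    return cosine(rwr_vector(adj, a, **kwargs), rwr_vector(adj, b, **kwargs))


# Hypothetical toy citation graph: papers "a" and "b" cite the same two
# papers, while "e" cites an unrelated paper.
graph = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
    "e": ["f"],
    "f": [],
}

sim_ab = rwr_similarity(graph, "a", "b")
sim_ae = rwr_similarity(graph, "a", "e")
```

In this toy graph, `a` and `b` reach the same objects and therefore receive a positive similarity, whereas `a` and `e` reach disjoint sets of objects and receive a similarity of zero. A real data set would call for a sparse-matrix formulation and a convergence test rather than a fixed number of iterations.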
