A novel way of computing dissimilarities between nodes of a graph

This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted, undirected graph. It is based on a Markov-chain model of a random walk through the database. More precisely, we compute quantities (the average commute time, the pseudoinverse of the Laplacian matrix of the graph, etc.) that provide similarities between any pair of nodes, with the useful property of increasing when the number of paths connecting those elements increases and when the “length” of paths decreases. It turns out that the square root of the average commute time is a Euclidean distance and that the pseudoinverse of the Laplacian matrix is a kernel (it contains inner products closely related to commute times). A procedure for computing the subspace projection of the node vectors of the graph that preserves as much variance as possible in terms of the commute-time distance, that is, a principal components analysis (PCA) of the graph, is also introduced. This graph PCA provides a natural interpretation of the “Fiedler vector”, widely used for graph partitioning. The model is evaluated on a collaborative-recommendation task where suggestions are made about which movies people should watch based upon what they watched in the past. Experimental results on the MovieLens database show that the Laplacian-based similarities perform well in comparison with other methods. The model, which fits into the so-called “statistical relational learning” framework, could also be used to compute document or word similarities and, more generally, could be applied to machine-learning and pattern-recognition tasks involving a database.

François Fouss, Alain Pirotte and Marco Saerens are with the Information Systems Research Unit (ISYS), IAG, Université catholique de Louvain, Place des Doyens 1, B-1348 Louvain-la-Neuve, Belgium. Email: {saerens, pirotte, fouss}@isys.ucl.ac.be.
Jean-Michel Renders is with the Xerox Research Center Europe, Chemin de Maupertuis 6, 38240 Meylan (Grenoble), France. Email: jean-michel.renders@xrce.xerox.com.
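
The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy graph, and the dense `pinv` computation are assumptions made for illustration, and the graph is assumed to be connected. The sketch uses the standard identity n(i, j) = V_G (l+_ii + l+_jj - 2 l+_ij), where l+_ij are the entries of the pseudoinverse L+ of the Laplacian L = D - A and V_G is the sum of all node degrees, and it obtains node coordinates from the eigendecomposition of L+.

```python
import numpy as np

def laplacian_pseudoinverse(A):
    """Moore-Penrose pseudoinverse L+ of the graph Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    L_plus = np.linalg.pinv(L)
    # Symmetrize to remove round-off asymmetry before any eigendecomposition.
    return (L_plus + L_plus.T) / 2.0

def average_commute_times(A):
    """Matrix of n(i, j) = V_G * (l+_ii + l+_jj - 2 * l+_ij)."""
    A = np.asarray(A, dtype=float)
    L_plus = laplacian_pseudoinverse(A)
    volume = A.sum()  # V_G: sum of all node degrees
    diag = np.diag(L_plus)
    return volume * (diag[:, None] + diag[None, :] - 2.0 * L_plus)

def graph_pca(A, n_components=2):
    """Node coordinates from the leading eigenvectors of the kernel L+.

    Squared Euclidean distances between the full set of coordinates equal
    the commute times divided by V_G; keeping only the leading components
    retains the directions of largest variance.
    """
    L_plus = laplacian_pseudoinverse(A)
    eigvals, eigvecs = np.linalg.eigh(L_plus)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3
# attached to node 2. All edge weights are 1.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

if __name__ == "__main__":
    n = average_commute_times(A)
    print("commute time 0 <-> 1:", n[0, 1])
    print("commute time 2 <-> 3:", n[2, 3])
    print("2-D coordinates:\n", graph_pca(A, n_components=2))
```

In this toy graph, nodes 0 and 1 are adjacent and also connected through node 2, while nodes 2 and 3 are joined by a single edge; the commute time between 0 and 1 comes out smaller, which matches the behaviour the abstract describes. For a connected graph, the first coordinate returned by `graph_pca` is proportional to the eigenvector of L with the smallest non-zero eigenvalue, that is, the Fiedler vector. The dense pseudoinverse is only practical for small graphs.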
