Fast Algorithms for Proximity Search on Large Graphs Purnamrita

The main focus of this proposal is on understanding and analyzing entity relationships in large social networks. Graph-based learning problems have a broad range of applications, including collaborative filtering in recommender networks, link prediction in social networks (e.g., predicting future links from the current snapshot of a graph), fraud detection, and personalized graph search. In all these real-world problems the main question is: given a node, which other nodes are similar to it? We use proximity measures based on short-term random walks, namely truncated hitting and commute times, to compute approximate nearest neighbors in large graphs. We show that these measures are local in nature, and we design algorithms that adaptively find the neighborhood containing the potential nearest neighbors of a query node and prune away the rest of the graph. To achieve this, we combine sampling with deterministic branch-and-bound techniques to retrieve the top-k neighbors of a query. This enables local search without caching information about all nodes in the graph. Our algorithms can answer nearest-neighbor queries on keyword-author-citation graphs from Citeseer with 600,000 nodes in 3 seconds on average on a single-CPU machine. We have shown that on several link prediction tasks on real-world datasets these measures outperform other popular proximity measures (e.g., personalized PageRank) in terms of predictive power. In this thesis we propose to analyze the short-term behavior of random walks to improve our algorithm, to investigate other graph-based proximity measures, and to apply these algorithms to real-world tasks such as semi-supervised learning, information retrieval, recommender networks, and a meta-learning approach for link prediction.
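To make the truncated measures concrete, here is a minimal sketch. It assumes the usual recursive definition of the T-truncated hitting time: h^T(i, j) = 0 if i = j or T = 0, and otherwise 1 plus the average of h^(T-1)(k, j) over the neighbors k of i. It also takes the truncated commute time to be h^T(i, j) + h^T(j, i). The function names, the adjacency-list representation, and the toy graph are illustrative assumptions, not the proposal's implementation, and the branch-and-bound pruning is not shown. The first function computes the exact values by dynamic programming; the second estimates one value by sampling walks of at most T steps.

```python
import random
from typing import Dict, List

# Hypothetical representation: node -> list of neighbors.
# Assumes every node has at least one neighbor.
Graph = Dict[int, List[int]]


def truncated_hitting_times(graph: Graph, target: int, horizon: int) -> Dict[int, float]:
    """Exact h^T(i, target) for every node i, with T = horizon, by dynamic programming."""
    h = {node: 0.0 for node in graph}  # h^0 is zero everywhere
    for _ in range(horizon):
        nxt = {}
        for node, nbrs in graph.items():
            if node == target:
                nxt[node] = 0.0
            else:
                # One step, then the average remaining time over uniform neighbors.
                nxt[node] = 1.0 + sum(h[k] for k in nbrs) / len(nbrs)
        h = nxt
    return h


def sampled_hitting_time(graph: Graph, source: int, target: int,
                         horizon: int, num_walks: int, seed: int = 0) -> float:
    """Monte Carlo estimate of h^T(source, target) from walks truncated at T steps."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_walks):
        node, steps = source, 0
        while node != target and steps < horizon:
            node = rng.choice(graph[node])
            steps += 1
        total += steps  # a walk that never hits the target contributes T
    return total / num_walks


def truncated_commute_time(graph: Graph, i: int, j: int, horizon: int) -> float:
    """Assumed definition: h^T(i, j) + h^T(j, i)."""
    return (truncated_hitting_times(graph, j, horizon)[i]
            + truncated_hitting_times(graph, i, horizon)[j])


# Toy undirected path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
print(truncated_hitting_times(path, target=2, horizon=2))
print(sampled_hitting_time(path, source=1, target=2, horizon=2, num_walks=1000))
print(truncated_commute_time(path, 1, 2, horizon=2))
```

Because every value is capped at T, nodes farther than T hops from the target all receive the same value T. This is one way to see why such measures are local and why a search can ignore most of the graph.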
