Symmetrizations for clustering directed graphs

Graph clustering has generally concerned itself with clustering undirected graphs; however, the graphs from a number of important domains are essentially directed, e.g., networks of web pages, research papers, and Twitter users. This paper investigates various ways of symmetrizing a directed graph into an undirected graph so that previous work on clustering undirected graphs can subsequently be leveraged. Recent work on clustering directed graphs has looked at generalizing objective functions such as conductance to directed graphs and minimizing such objective functions using spectral methods. We show that more meaningful clusters (as measured by an external ground truth criterion) can be obtained by symmetrizing the graph using measures that capture in- and out-link similarity, such as bibliographic coupling and co-citation strength. However, direct application of these similarity measures to modern large-scale power-law networks is problematic because of the presence of hub nodes, which become connected to the vast majority of the network in the transformed undirected graph. We carefully analyze this problem and propose a Degree-discounted similarity measure that is much more suitable for large-scale networks. We support these claims with extensive empirical validation.
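
To make the symmetrizations mentioned above concrete, here is a minimal NumPy sketch. It assumes the directed graph is given as a dense adjacency matrix `A` with `A[i, j] = 1` for an edge from node `i` to node `j`. The first function sums bibliographic coupling (`A A^T`, shared out-links) and co-citation (`A^T A`, shared in-links). The second function illustrates one plausible way to discount by degree: the exponents `alpha` and `beta`, their default of 0.5, and the function names are assumptions made for illustration, since the abstract does not state the exact formula.

```python
import numpy as np


def bibliometric_symmetrization(A):
    """Sum of bibliographic coupling (A A^T) and co-citation (A^T A)."""
    A = np.asarray(A, dtype=float)
    return A @ A.T + A.T @ A


def _inverse_power(degrees, exponent):
    """Return degrees ** -exponent, using 0 for zero-degree nodes."""
    scaled = np.zeros_like(degrees, dtype=float)
    nonzero = degrees > 0
    scaled[nonzero] = degrees[nonzero] ** -exponent
    return scaled


def degree_discounted_symmetrization(A, alpha=0.5, beta=0.5):
    """Hypothetical degree-discounted variant (alpha, beta are assumptions).

    Each shared neighbour contributes less when it has a high degree,
    and each pair of nodes is scaled down by their own degrees, so hub
    nodes no longer dominate the symmetrized graph.
    """
    A = np.asarray(A, dtype=float)
    d_out = np.diag(_inverse_power(A.sum(axis=1), alpha))  # out-degree discount
    d_in = np.diag(_inverse_power(A.sum(axis=0), beta))    # in-degree discount
    coupling = d_out @ A @ d_in @ A.T @ d_out
    cocitation = d_in @ A.T @ d_out @ A @ d_in
    return coupling + cocitation


# Toy graph: nodes 0 and 1 both link to nodes 2 and 3.
A = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
U = bibliometric_symmetrization(A)
U_d = degree_discounted_symmetrization(A)
```

In this toy graph, nodes 0 and 1 become connected in `U` because they share out-links, and nodes 2 and 3 become connected because they share in-links, even though no edge joins either pair in `A`. The dense matrices are only for readability; a large network would need sparse matrices and some pruning of small similarity values.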
