Local graph sparsification for scalable clustering

In this paper we study how to sparsify a graph, i.e., how to reduce the edge set while keeping the nodes intact, so as to enable faster graph clustering without sacrificing quality. The main idea behind our approach is to preferentially retain edges whose two endpoints are likely to belong to the same cluster. We propose to rank edges using a simple similarity-based heuristic, which we compute efficiently by comparing the minhash signatures of the two nodes incident to each edge. For each node, we select its top few edges under this ranking to be retained in the sparsified graph. Extensive empirical results on several real networks, using four state-of-the-art graph clustering and community discovery algorithms, show that our approach achieves excellent speedups (often by a factor of 10 to 50) with little or no deterioration in the quality of the resulting clusters. In fact, for at least two of the four clustering algorithms, our sparsification consistently yields higher clustering accuracy.
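The ranking-and-selection idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; it rests on several assumptions that the abstract does not state: node ids are non-negative integers, similarity is estimated between the neighbor sets of the two endpoints, and the "top few" edges per node is a fixed count. The names `minhash_signatures`, `estimated_similarity`, `sparsify`, `keep`, and `num_hashes` are hypothetical and chosen only for illustration.

```python
import random

PRIME = 2_147_483_647  # modulus for the linear hash functions (assumption)


def minhash_signatures(adj, num_hashes=64, seed=0):
    """Return one minhash signature per node, computed over its neighbor set.

    adj maps each node id (a non-negative int below PRIME) to a set of neighbors.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    return {
        node: [min(((a * v + b) % PRIME for v in nbrs), default=PRIME)
               for a, b in params]
        for node, nbrs in adj.items()
    }


def estimated_similarity(sig_u, sig_v):
    """Fraction of signature positions that agree (estimates set overlap)."""
    matches = sum(1 for x, y in zip(sig_u, sig_v) if x == y)
    return matches / len(sig_u)


def sparsify(adj, keep=2, num_hashes=64, seed=0):
    """Keep, for every node, its `keep` highest-ranked incident edges.

    Returns a set of undirected edges (u, v) with u < v. An edge survives
    if at least one of its endpoints selects it.
    """
    sigs = minhash_signatures(adj, num_hashes, seed)
    kept = set()
    for u, nbrs in adj.items():
        # Rank by descending estimated similarity; break ties by node id.
        ranked = sorted(
            nbrs,
            key=lambda v: (-estimated_similarity(sigs[u], sigs[v]), v),
        )
        for v in ranked[:keep]:
            kept.add((min(u, v), max(u, v)))
    return kept


if __name__ == "__main__":
    # Two 4-cliques joined by a single bridge edge (3, 4).
    adj = {
        0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
        4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
    }
    print(sorted(sparsify(adj, keep=2)))
```

In this toy graph the endpoints of the bridge edge share no neighbors, so their estimated similarity is zero and the edge ranks low for both endpoints, whereas edges inside each clique connect nodes with overlapping neighborhoods. Whether a fixed `keep` or a degree-dependent count is more appropriate is a design choice the abstract leaves open.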
