Geographically Organized Small Communities and the Hardness of Clustering Social Networks

Spectral clustering, while perhaps the most efficient heuristics for graph partitioning, has recently gathered bad reputation for failure over large-scale power law graphs. In this chapter we identify the abundance of small-size communities connected by long tentacles as the major obstacle for spectral clustering. These subgraphs hide the higher level structure and result in a highly degenerate adjacency matrix with several hundreds of eigenvalues very close to 1. Our results on clustering social networks, telephone call graphs, and Web graphs are twofold. (1) We show that graphs generated by existing social network models are not as difficult to cluster as they are in the real world. For this end we give a new combined model that yields degenerate adjacency matrices and hard-to-partition graphs. (2) We give heuristics for spectral clustering for large-scale real-world social networks that handle tentacles and small dense communities. Our algorithm outperforms all previous methods for power law graph partitioning both in speed and in cluster quality. In a combination of heuristics for the contraction of tentacles as well as the removal of community cores that involve the recent SCAN (Structural Clustering Algorithm for Networks) algorithm, we are able to efficiently find balanced partitioning of over 10 million edge power law graphs. In particular, our heuristics promise similar or better performance than semidefinite relaxation with orders of magnitude lower running time.

[1]  Michael W. Berry,et al.  SVDPACK: A Fortran-77 Software Library for the Sparse Singular Value Decomposition , 1992 .

[2]  Jon M. Kleinberg,et al.  The small-world phenomenon: an algorithmic perspective , 2000, STOC '00.

[3]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[4]  Ichigaku Takigawa,et al.  A spectral clustering approach to optimally combining numericalvectors with a modular network , 2007, KDD '07.

[5]  Miklos Kurucz,et al.  Spectral clustering in telephone call graphs , 2007, WebKDD/SNA-KDD '07.

[6]  András A. Benczúr,et al.  Large-scale principal component analysis on LiveJournal friends network , 2008 .

[7]  Renato D. C. Monteiro,et al.  A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization , 2003, Math. Program..

[8]  Eli Upfal,et al.  Stochastic models for the Web graph , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[9]  Andrew B. Kahng,et al.  Recent directions in netlist partitioning: a survey , 1995, Integr..

[10]  Matthew Richardson,et al.  Mining knowledge-sharing sites for viral marketing , 2002, KDD.

[11]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[12]  Shlomo Moran,et al.  The stochastic approach for link-structure analysis (SALSA) and the TKC effect , 2000, Comput. Networks.

[13]  Michael I. Jordan,et al.  Link Analysis, Eigenvectors and Stability , 2001, IJCAI.

[14]  Pavel Zakharov Structure of LiveJournal social network , 2007, SPIE International Symposium on Fluctuations and Noise.

[15]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Tamás Sarlós,et al.  Improved Approximation Algorithms for Large Matrices via Random Projections , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[17]  M. Fiedler Algebraic connectivity of graphs , 1973 .

[18]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  C. University,et al.  Finding Patterns in Blog Shapes and Blog Evolution , 2015 .

[20]  Sebastiano Vigna,et al.  UbiCrawler: a scalable fully distributed Web crawler , 2004, Softw. Pract. Exp..

[21]  David M. Pennock,et al.  Winners don't take all: A model of web link accumulation , 2000 .

[22]  Charles J. Alpert,et al.  Spectral Partitioning: The More Eigenvectors, The Better , 1995, 32nd Design Automation Conference.

[23]  Kevin J. Lang Fixing two weaknesses of the Spectral Method , 2005, NIPS.

[24]  Santosh S. Vempala,et al.  A divide-and-merge methodology for clustering , 2005, PODS '05.

[25]  Robert E. Tarjan,et al.  Graph Clustering and Minimum Cut Trees , 2004, Internet Math..

[26]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[27]  Mark Newman,et al.  Detecting community structure in networks , 2004 .

[28]  András A. Benczúr,et al.  Methods for large scale SVD with missing values , 2007 .

[29]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[30]  Xiaowei Xu,et al.  SCAN: a structural clustering algorithm for networks , 2007, KDD '07.

[31]  Martine D. F. Schlag,et al.  Spectral K-Way Ratio-Cut Partitioning and Clustering , 1993, 30th ACM/IEEE Design Automation Conference.

[32]  András A. Benczúr,et al.  Telephone Call Network Data Mining: A Survey with Experiments , 2008 .

[33]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[34]  Petros Drineas,et al.  FAST MONTE CARLO ALGORITHMS FOR MATRICES II: COMPUTING A LOW-RANK APPROXIMATION TO A MATRIX∗ , 2004 .

[35]  A. Barabasi,et al.  Scale-free characteristics of random networks: the topology of the world-wide web , 2000 .

[36]  Bart Selman,et al.  Natural communities in large linked networks , 2003, KDD '03.

[37]  David M. Pennock,et al.  Winners don't take all: Characterizing the competition for links on the web , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[38]  Allan Borodin,et al.  Finding authorities and hubs from link structures on the World Wide Web , 2001, WWW '01.

[39]  Hector Garcia-Molina,et al.  Web Content Categorization Using Link Information , 2006 .

[40]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.