FAST COMMUNITY DETECTION BY SCORE

Consider a network where the nodes split into $K$ different communities. The community labels of the nodes are unknown, and it is of major interest to estimate them (i.e., community detection). The Degree Corrected Block Model (DCBM) is a popular network model. How to detect communities under the DCBM is an interesting problem, where the main challenge lies in the degree heterogeneity. We propose a new approach to community detection, which we call Spectral Clustering On Ratios-of-Eigenvectors (SCORE). Compared to classical spectral methods, the main innovation is to use the entry-wise ratios between the first leading eigenvector and each of the other leading eigenvectors for clustering. Let $X$ be the adjacency matrix of the network. We first obtain the $K$ leading eigenvectors, say $\hat{\eta}_1, \ldots, \hat{\eta}_K$, and let $\hat{R}$ be the $n \times (K-1)$ matrix such that $\hat{R}(i,k) = \hat{\eta}_{k+1}(i)/\hat{\eta}_1(i)$, $1 \le i \le n$, $1 \le k \le K-1$. We then use $\hat{R}$ for clustering by applying the k-means method. The central surprise is that the effect of degree heterogeneity is largely ancillary and can be effectively removed by taking entry-wise ratios between $\hat{\eta}_{k+1}$ and $\hat{\eta}_1$, $1 \le k \le K-1$. The method is successfully applied to the web blogs data and the karate club data, with error rates of 58/1222 and 1/34, respectively. These results are much more satisfactory than those of the classical spectral methods. Also, compared to modularity methods, SCORE is computationally much faster and has smaller error rates. We develop a theoretical framework in which we show that, under mild conditions, SCORE stably yields successful community detection. At the core of the analysis are recent developments in Random Matrix Theory (RMT), where the matrix-form Bernstein inequality is especially helpful.
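The steps described in the abstract (leading eigenvectors, entry-wise ratios, k-means) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes NumPy, an undirected connected network (so the first eigenvector has no zero entries), "leading" taken to mean largest eigenvalue in magnitude, and a plain Lloyd-style k-means. The function names `score`, `kmeans`, and `simulate_dcbm`, and all parameters of the synthetic two-community DCBM, are hypothetical choices made for this example.

```python
import numpy as np


def kmeans(points, K, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd k-means with random restarts (an assumed, simple variant)."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _restart in range(n_init):
        centers = points[rng.choice(len(points), size=K, replace=False)]
        for _step in range(n_iter):
            dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        cost = dist.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels


def score(adj, K):
    """Cluster nodes using entry-wise ratios of the K leading eigenvectors."""
    adj = np.asarray(adj, dtype=float)
    vals, vecs = np.linalg.eigh(adj)          # adj is symmetric
    top = np.argsort(-np.abs(vals))[:K]       # assumption: order by |eigenvalue|
    eta = vecs[:, top]
    # R(i, k) = eta_{k+1}(i) / eta_1(i); assumes eta_1 has no zero entries.
    ratios = eta[:, 1:] / eta[:, [0]]
    return kmeans(ratios, K)


def simulate_dcbm(n=300, seed=1):
    """Synthetic two-community DCBM; all parameter values are made up."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n // 2)
    theta = rng.uniform(0.3, 1.0, size=n)     # heterogeneous degree parameters
    B = np.array([[0.8, 0.2], [0.2, 0.8]])
    prob = np.outer(theta, theta) * B[labels][:, labels]
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    adj = (upper | upper.T).astype(float)     # symmetric, no self-loops
    return adj, labels


adj, truth = simulate_dcbm()
est = score(adj, K=2)
# Cluster labels are only defined up to a swap, so take the better matching.
err = min(np.mean(est != truth), np.mean(est == truth))
print(f"misclassification rate on synthetic DCBM: {err:.3f}")
```

Because only ratios of eigenvector entries are clustered, the arbitrary sign that `np.linalg.eigh` assigns to each eigenvector does not change the resulting partition.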
