Robust and Computationally Feasible Community Detection in the Presence of Arbitrary Outlier Nodes

Community detection, which aims to cluster $N$ nodes in a given graph into $r$ distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM), which allows for adversarial outlier nodes that are connected to the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure that uses convex optimization followed by the $k$-means algorithm with $k=r$. Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to detect the communities accurately, with a small misclassification rate, in the setting where the number of clusters can grow with $N$. This theoretical result matches the best-known result in the literature on computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, whereas some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters because of a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors' knowledge, our result is the first in the literature on clustering under the GSBM, in the presence of a portion of arbitrary outlier nodes, when the number of communities grows quickly with $N$.
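To make the GSBM setting concrete, the following is a minimal sketch of how such a graph might be simulated. It is an illustration only, not the procedure analyzed in the paper: the function name `sample_gsbm`, the edge probabilities `p` (within a cluster) and `q` (between clusters), the balanced cluster sizes, and the rule used to wire the outliers are all assumptions made here. Since the model allows outliers to connect arbitrarily, the random-density rule below is just one possible choice.

```python
import numpy as np

def sample_gsbm(n_inliers, r, n_outliers, p, q, rng):
    """Sample a symmetric 0/1 adjacency matrix with r planted clusters plus outliers.

    All parameter names and the outlier wiring rule are hypothetical choices
    made for illustration.
    """
    labels = np.arange(n_inliers) % r                 # balanced cluster labels (assumption)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)                      # SBM edge probabilities among inliers

    n = n_inliers + n_outliers
    A = np.zeros((n, n), dtype=int)

    # Draw the strict upper triangle for inliers, then symmetrize.
    upper = np.triu(rng.random((n_inliers, n_inliers)) < probs, k=1)
    A[:n_inliers, :n_inliers] = upper | upper.T

    # Outliers: each one gets its own random edge density (one arbitrary rule
    # among many that the model would permit).
    for j in range(n_inliers, n):
        row = rng.random(n) < rng.random()
        row[j] = False                                # no self-loops
        A[j, :] = row
        A[:, j] = row
    return A, labels

rng = np.random.default_rng(0)
A, labels = sample_gsbm(n_inliers=60, r=3, n_outliers=5, p=0.5, q=0.05, rng=rng)
```

The last `n_outliers` rows and columns of `A` correspond to the outlier nodes, and `labels` covers only the inliers. A matrix of this kind could serve as input when comparing a robust method against plain spectral clustering.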
