Community Detection in General Stochastic Block Models: Fundamental Limits and Efficient Algorithms for Recovery

New phase transition phenomena have recently been discovered for the stochastic block model in the special case of two non-overlapping symmetric communities. These thresholds give rise, in particular, to new algorithmic challenges. This paper investigates whether a general phenomenon takes place for multiple communities, without imposing symmetry. In the general stochastic block model SBM(n,p,W), n vertices are split into k communities of relative sizes {p_i}_{i∈[k]}, and vertices in communities i and j connect independently with probability W_ij, where W = {W_ij}_{i,j∈[k]}. The paper studies partial and exact recovery of communities in the general SBM (in the constant and logarithmic degree regimes), and uses the generality of the results to tackle overlapping communities. The contributions of the paper are: (i) an explicit characterization of the recovery threshold in the general SBM in terms of a new f-divergence function D_+, which generalizes the Hellinger and Chernoff divergences and which gives an operational meaning to a divergence function, analogous to the KL-divergence in the channel coding theorem; (ii) an algorithm that recovers the communities all the way down to the optimal threshold and runs in quasi-linear time, showing that exact recovery has no information-theoretic to computational gap for multiple communities; (iii) an efficient algorithm that detects communities in the constant degree regime with an explicit accuracy bound that can be made arbitrarily close to 1 when a prescribed signal-to-noise ratio (defined in terms of the spectrum of diag(p)W) tends to infinity.
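The definition of SBM(n,p,W) given in the abstract can be illustrated with a short sampling sketch. It is not taken from the paper: the function name `sample_sbm` and the example parameters are hypothetical, and it assumes that each vertex draws its community independently with probabilities p (so community sizes are only approximately n·p_i), that W is symmetric, and that the graph is undirected with no self-loops.

```python
import random


def sample_sbm(n, p, W, seed=None):
    """Draw community labels and an undirected edge list from SBM(n, p, W).

    Assumptions (not stated in the source): labels are i.i.d. with
    distribution p, W is symmetric, and there are no self-loops.
    """
    rng = random.Random(seed)
    k = len(p)
    # Each vertex joins community i with probability p[i].
    labels = rng.choices(range(k), weights=p, k=n)
    edges = []
    # Every unordered pair {u, v} is connected independently
    # with probability W[label(u)][label(v)].
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < W[labels[u]][labels[v]]:
                edges.append((u, v))
    return labels, edges


# Hypothetical example: three asymmetric communities, denser inside than across.
p = [0.5, 0.3, 0.2]
W = [[0.30, 0.02, 0.02],
     [0.02, 0.30, 0.02],
     [0.02, 0.02, 0.30]]
labels, edges = sample_sbm(200, p, W, seed=0)
```

The double loop takes time quadratic in n, which is fine for small illustrations. The constant and logarithmic degree regimes studied in the paper correspond to entries of W that shrink with n, where a sparse sampling strategy would be preferable.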
