Information limits for recovering a hidden community

We study the problem of recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i$ and $j$ both belong to the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$ depending on $n$. If $P=\mathrm{Bern}(p)$ and $Q=\mathrm{Bern}(q)$ with $p>q$, this reduces to the problem of finding a densely connected $K$-subgraph planted in a large Erdős–Rényi graph; if $P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ with $\mu>0$, it corresponds to the problem of locating a $K \times K$ principal submatrix of elevated means in a large Gaussian random matrix.
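As an illustration of the model just described, the following is a minimal sampling sketch, not code from the paper. The function name `sample_hidden_community` and the sampler callbacks are hypothetical, the community is assumed to be chosen uniformly at random, and the diagonal (which the model leaves unspecified) is set to zero.

```python
import numpy as np

def sample_hidden_community(n, K, sample_P, sample_Q, rng):
    """Draw (A, in_comm) from the hidden community model (illustrative sketch).

    sample_P / sample_Q are hypothetical callbacks taking (shape, rng) and
    returning an array of i.i.d. draws from P and Q respectively.
    """
    # Assumption: the community is a uniformly random K-subset of {0, ..., n-1}.
    members = rng.choice(n, size=K, replace=False)
    in_comm = np.zeros(n, dtype=bool)
    in_comm[members] = True

    # both_in[i, j] is True exactly when i and j both belong to the community.
    both_in = in_comm[:, None] & in_comm[None, :]

    # Draw a P-value and a Q-value for every entry, then select by membership.
    A = np.where(both_in, sample_P((n, n), rng), sample_Q((n, n), rng))

    # Keep the strict upper triangle and mirror it so that A is symmetric.
    # Assumption: diagonal entries are set to 0.
    A = np.triu(A, k=1)
    return A + A.T, in_comm

rng = np.random.default_rng(0)

# Bernoulli case: a dense K-subgraph planted in an Erdos-Renyi graph.
A_bern, comm_bern = sample_hidden_community(
    200, 20,
    lambda shape, r: r.binomial(1, 0.8, size=shape),  # P = Bern(p), p = 0.8
    lambda shape, r: r.binomial(1, 0.1, size=shape),  # Q = Bern(q), q = 0.1
    rng,
)

# Gaussian case: a K x K principal submatrix with elevated mean mu.
A_gauss, comm_gauss = sample_hidden_community(
    200, 20,
    lambda shape, r: r.normal(1.5, 1.0, size=shape),  # P = N(mu, 1), mu = 1.5
    lambda shape, r: r.normal(0.0, 1.0, size=shape),  # Q = N(0, 1)
    rng,
)
```

The parameter values above are arbitrary and chosen only to make the planted structure visible at small $n$.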
We focus on two types of asymptotic recovery guarantees as $n \to \infty$: 1) weak recovery: the expected number of classification errors is $o(K)$; and 2) exact recovery: the probability of classifying all indices correctly converges to one. Under mild assumptions on $P$ and $Q$, and allowing the community size to scale sublinearly with $n$, we derive a set of sufficient conditions and a set of necessary conditions for recovery, which are asymptotically tight with sharp constants. The results hold, in particular, for the Gaussian case, and for the case of bounded log likelihood ratio, including the Bernoulli case whenever $p/q$ and $(1-p)/(1-q)$ are bounded away from zero and infinity. Previous work has shown that if weak recovery is achievable, then exact recovery is achievable in linear additional time by a simple voting procedure. We provide a converse, showing that the condition for the voting procedure to succeed is almost necessary for exact recovery.
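The two recovery guarantees can be written out explicitly. In the sketch below, $C^*$ (the true community) and $\widehat{C}$ (the estimate computed from $A$) are notation assumed here for illustration, and a classification error is taken to mean an index lying in exactly one of the two sets.

```latex
% Assumed notation: C^* \subset \{1,\dots,n\} with |C^*| = K is the true community,
% \widehat{C} = \widehat{C}(A) is an estimator, and \triangle is the symmetric difference.
\begin{align*}
  \text{weak recovery:}  \quad & \mathbb{E}\bigl[\,\lvert C^* \,\triangle\, \widehat{C} \rvert\,\bigr] = o(K), \\
  \text{exact recovery:} \quad & \mathbb{P}\bigl\{\widehat{C} = C^*\bigr\} \to 1,
  \qquad \text{as } n \to \infty .
\end{align*}
```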
