Multiple Hypothesis Testing To Estimate The Number of Communities in Sparse Stochastic Block Models

Network-based clustering methods frequently require the number of communities to be specified a priori. Moreover, most existing methods for estimating the number of communities assume that this number is fixed and does not scale with the network size n. The few methods that allow the number of communities to increase with n are valid only when the average degree d of the network grows at least as fast as O(n) (i.e., the dense case) or lies within a narrow range. This presents a challenge in clustering large-scale network data, particularly when the average degree d grows more slowly than O(n) (i.e., the sparse case). To address this problem, we propose a new sequential procedure that uses multiple hypothesis tests and the spectral properties of Erdős–Rényi graphs to estimate the number of communities in sparse stochastic block models (SBMs). We prove the consistency of our method for sparse SBMs over a broad range of the sparsity parameter. As a consequence, we find that our method can estimate the number of communities K_⋆, with K_⋆^(n) increasing at a rate as high as O(n^((1−3γ)/(4−3γ))), where d = O(n^(1−γ)). Moreover, we show that our method can be adapted as a stopping rule for estimating the number of communities in binary tree stochastic block models. We benchmark the performance of our method against competing methods on six reference single-cell RNA-sequencing datasets. Finally, we demonstrate the usefulness of our method through numerical simulations and by using it to cluster real single-cell RNA-sequencing datasets.
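To make the stated rate concrete, the short sketch below evaluates the exponent (1−3γ)/(4−3γ) for a few values of γ. It is only an illustration of the formula quoted in the abstract: the function name is hypothetical, the restriction of γ to [0, 1) is an assumption, and because the rate is given in O(·) notation the constants are unknown, so only the exponents are meaningful.

```python
def community_growth_exponent(gamma: float) -> float:
    """Exponent a in the rate O(n^a), with a = (1 - 3*gamma) / (4 - 3*gamma).

    gamma is the sparsity parameter from d = O(n^(1 - gamma)).
    The range check below is an assumption made for this sketch only.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is assumed to lie in [0, 1)")
    return (1.0 - 3.0 * gamma) / (4.0 - 3.0 * gamma)


# Tabulate degree growth against the largest community growth rate covered.
for gamma in (0.0, 0.1, 0.2, 0.3):
    a = community_growth_exponent(gamma)
    print(f"gamma={gamma:.1f}  degree ~ n^{1 - gamma:.1f}  communities up to ~ n^{a:.3f}")
```

Reading the formula directly, γ = 0 (the dense case) gives an exponent of 1/4, and the exponent shrinks to zero as γ approaches 1/3, so sparser networks support slower growth in the number of communities. The precise admissible range of γ is set by the paper's theorems and is not stated in the abstract.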
