Simultaneous Dimensionality and Complexity Model Selection for Spectral Graph Clustering

We consider the problem of clustering the vertices of a graph by identifying its underlying community structure. Among vertex clustering approaches, spectral clustering is one of the most popular because it is easy to implement and often outperforms more traditional clustering algorithms. Spectral clustering nevertheless involves two inherent model selection problems: choosing the embedding dimension and choosing the number of clusters. This paper addresses both by establishing a novel model selection framework designed specifically for vertex clustering on graphs under a stochastic block model. The first contribution is a probabilistic model that approximates the distribution of the extended spectral embedding of a graph. The model rests on a theoretical result establishing asymptotic normality of the informative part of the embedding, and on a simulation result that motivates a conjecture for the limiting behavior of the redundant part of the embedding. The second contribution is a simultaneous model selection framework: in contrast with traditional approaches, our procedure estimates the embedding dimension and the number of clusters simultaneously. Under the conjectured distributional model, we present a theorem on the consistency of the model parameter estimates, which supports the validity of our method. We propose algorithms for simultaneous model selection in vertex clustering and find that they show superior performance in simulation experiments. We illustrate the method by applying it to a collection of brain graphs.
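To make the terms "extended spectral embedding", "informative part", and "redundant part" concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a two-block stochastic block model with a hypothetical block probability matrix `B`, an extended dimension `D = 6` chosen arbitrarily, and an embedding built from the `D` largest-magnitude eigenvalues of the adjacency matrix. The function names `sample_sbm` and `extended_spectral_embedding` are illustrative only. The split at two columns uses the known rank of `B` in this toy setting; in practice that dimension is unknown, which is the model selection problem described above.

```python
import numpy as np


def sample_sbm(block_sizes, B, rng):
    """Sample a symmetric adjacency matrix with zero diagonal from a stochastic block model."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    P = B[labels][:, labels]  # edge probability for every vertex pair
    upper = np.triu(rng.random(P.shape) < P, k=1)  # independent edges above the diagonal
    A = (upper | upper.T).astype(float)  # mirror to make the graph undirected
    return A, labels


def extended_spectral_embedding(A, D):
    """Return eigenvectors scaled by sqrt(|eigenvalue|) for the D largest-magnitude eigenvalues."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(vals))[:D]
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.abs(vals)), vals


rng = np.random.default_rng(0)
B = np.array([[0.5, 0.1], [0.1, 0.5]])  # assumed block probabilities (rank 2)
A, labels = sample_sbm([100, 100], B, rng)

# Embed into D = 6 dimensions, deliberately more than the rank of B.
X, eigvals = extended_spectral_embedding(A, D=6)

# In this toy setting the first two columns carry the block signal;
# the remaining columns are the redundant part of the extended embedding.
informative, redundant = X[:, :2], X[:, 2:]
print(np.round(np.abs(eigvals), 1))
```

In a run of this sketch, the first two eigenvalue magnitudes should stand well apart from the rest, and only the informative columns should separate the two blocks. A simultaneous procedure would score candidate pairs of embedding dimension and cluster count against a joint model of both parts, rather than fixing the dimension first.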
