Estimation and selection for the latent block model on categorical data

This paper addresses estimation and model selection in the Latent Block Model (LBM) for categorical data. First, after providing sufficient conditions for the identifiability of the model, we generalise estimation procedures and model selection criteria originally derived for binary data. Second, we develop Bayesian inference through Gibbs sampling with a well-calibrated non-informative prior distribution in order to obtain the maximum a posteriori (MAP) estimator; this is proved to avoid the traps encountered by the LBM under the maximum likelihood methodology. Third, model selection criteria are presented; in particular, we derive an exact expression of the integrated completed likelihood (ICL) criterion that requires no asymptotic approximation. Finally, numerical experiments on both simulated and real data sets highlight the appeal of the proposed estimation and model selection procedures.
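
To make the idea of an exact integrated completed likelihood concrete, the following is a minimal sketch rather than the paper's implementation. It assumes symmetric Dirichlet priors on the row proportions, the column proportions, and each block's categorical parameter, so that the parameters integrate out in closed form. The function names, the argument layout, and the default hyperparameters `a` and `b` are illustrative assumptions, not values taken from the paper.

```python
from math import lgamma


def log_dirichlet_marginal(counts, conc):
    """Log probability of a label sequence with the given category counts,
    after integrating the category probabilities against a symmetric
    Dirichlet(conc) prior (assumed prior form)."""
    k = len(counts)
    total = sum(counts)
    return (lgamma(k * conc) - k * lgamma(conc)
            + sum(lgamma(c + conc) for c in counts)
            - lgamma(total + k * conc))


def exact_icl(x, z, w, g, m, r, a=1.0, b=1.0):
    """Closed-form log p(x, z, w) for a categorical latent block model.

    x: list of rows, entries in 0..r-1; z: row labels in 0..g-1;
    w: column labels in 0..m-1. a and b are hypothetical hyperparameters
    for the proportions and the block parameters respectively.
    """
    row_sizes = [0] * g
    for zi in z:
        row_sizes[zi] += 1
    col_sizes = [0] * m
    for wj in w:
        col_sizes[wj] += 1

    # block[k][l][h]: number of cells in block (k, l) taking category h
    block = [[[0] * r for _ in range(m)] for _ in range(g)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            block[z[i]][w[j]][value] += 1

    icl = log_dirichlet_marginal(row_sizes, a)
    icl += log_dirichlet_marginal(col_sizes, a)
    for k in range(g):
        for l in range(m):
            icl += log_dirichlet_marginal(block[k][l], b)
    return icl


if __name__ == "__main__":
    data = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [1, 1, 0, 0],
            [1, 1, 0, 0]]
    aligned = exact_icl(data, [0, 0, 1, 1], [0, 0, 1, 1], g=2, m=2, r=2)
    scrambled = exact_icl(data, [0, 1, 0, 1], [0, 1, 0, 1], g=2, m=2, r=2)
    print(aligned > scrambled)
```

In this toy example the partition aligned with the block structure receives a higher score than a scrambled one. In practice the criterion would be evaluated at estimated partitions for several candidate pairs `(g, m)`, and the pair with the largest value retained.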
