Identifying the number of clusters in discrete mixture models

Research on cluster analysis for categorical data continues to develop, and new clustering algorithms are regularly proposed. However, in this context, the determination of the number of clusters is rarely addressed. In this paper, we propose a new approach in which clustering of categorical data and estimation of the number of clusters are carried out simultaneously. Assuming that the data originate from a finite mixture of multinomial distributions, we develop a method to select the number of mixture components based on a minimum message length (MML) criterion and implement a new expectation-maximization (EM) algorithm to estimate all the model parameters. Rather than selecting one among a set of pre-estimated candidate models, which requires running EM several times, the proposed EM-MML approach seamlessly integrates estimation and model selection in a single algorithm. The performance of the proposed approach is compared with that of well-known criteria, such as the Bayesian information criterion (BIC), using synthetic data and two real applications from the European Social Survey. The computation time of EM-MML is a clear advantage of the proposed method. Moreover, the real-data solutions are much more parsimonious than those provided by competing methods, which reduces the risk of model-order overestimation and improves interpretability.
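To make the mechanism concrete, the sketch below illustrates how estimation and model selection can be merged in a single EM run for a multinomial mixture: at each M-step, the effective sample size of every component is reduced by half its number of free parameters, and components driven to zero are pruned, so the number of clusters is decided while the model is being fitted. This is only a minimal illustration of MML-style component annihilation under assumed conventions (the count-matrix input X, the upper bound k_max, and the function name em_mml_multinomial are ours); it is not the exact EM-MML procedure proposed in the paper.

import numpy as np
from scipy.special import logsumexp

def em_mml_multinomial(X, k_max=10, max_iter=200, tol=1e-6, seed=0):
    """Sketch: EM for a multinomial mixture with MML-style component
    annihilation. X is an (n, d) array of category counts per observation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    half_params = (d - 1) / 2.0          # free parameters per component / 2

    theta = rng.dirichlet(np.ones(d), size=k_max)   # component probabilities
    pi = np.full(k_max, 1.0 / k_max)                # mixing weights
    prev_ll = -np.inf

    for _ in range(max_iter):
        # E-step: log p(x_i, component k), up to a term constant in k
        # (the multinomial coefficient), so responsibilities are exact.
        log_comp = X @ np.log(theta.T) + np.log(pi)           # (n, k)
        log_norm = logsumexp(log_comp, axis=1, keepdims=True)  # (n, 1)
        r = np.exp(log_comp - log_norm)                        # responsibilities

        # M-step with the MML penalty: effective counts are reduced by half
        # the number of free parameters; components driven to zero are pruned.
        nk = r.sum(axis=0)
        pi = np.maximum(nk - half_params, 0.0)
        keep = pi > 0
        pi, r = pi[keep] / pi[keep].sum(), r[:, keep]

        # Update the multinomial parameters of the surviving components.
        theta = r.T @ X + 1e-10                                # avoid log(0)
        theta /= theta.sum(axis=1, keepdims=True)

        # Convergence check on the (unpenalized) log-likelihood surrogate.
        ll = log_norm.sum()
        if np.abs(ll - prev_ll) < tol * np.abs(prev_ll):
            break
        prev_ll = ll

    return pi, theta, r

Because pruning happens inside the EM loop, a single run started from k_max components returns both the fitted parameters and a selected number of clusters, which is what makes an integrated approach faster than estimating and comparing a set of separate candidate models.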
