Addressing overfitting and underfitting in Gaussian model-based clustering

Abstract

The expectation–maximization (EM) algorithm is a common approach to parameter estimation in cluster analysis with finite mixture models. This approach suffers from the well-known issue of convergence to local maxima, but also from the less obvious problem of overfitting. These combined, and competing, concerns are illustrated through simulation and then addressed by introducing an algorithm that augments the traditional EM with the nonparametric bootstrap. Further simulations and applications to real data support the use of this bootstrap-augmented EM-style algorithm to avoid both overfitting and local maxima.
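
The abstract describes the proposed algorithm only at a high level. As a rough illustration of how a nonparametric bootstrap can be woven into EM for Gaussian mixtures, the sketch below implements a bootstrap-restarting scheme: EM is first run on the original data, then repeatedly restarted from parameters fitted to bootstrap resamples, keeping whichever solution attains the highest likelihood on the original data. This is a minimal sketch under stated assumptions, not the authors' method; the helper name bootstrap_em and its parameters are hypothetical, and scikit-learn's GaussianMixture stands in for the EM steps.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def bootstrap_em(X, n_components, n_boot=20, random_state=0):
        """Bootstrap-restarted EM for a Gaussian mixture (hypothetical helper).

        Fits EM on the original data, then repeatedly (i) refits on a
        nonparametric bootstrap resample and (ii) restarts EM on the
        original data from the resample's solution, keeping the fit with
        the highest log-likelihood on the original data.
        """
        rng = np.random.default_rng(random_state)
        best = GaussianMixture(n_components=n_components,
                               covariance_type="full",
                               random_state=random_state).fit(X)
        best_ll = best.score(X)  # mean log-likelihood per observation

        for _ in range(n_boot):
            # Nonparametric bootstrap: resample the rows of X with replacement.
            idx = rng.integers(0, X.shape[0], size=X.shape[0])
            boot = GaussianMixture(n_components=n_components,
                                   covariance_type="full",
                                   weights_init=best.weights_,
                                   means_init=best.means_,
                                   precisions_init=best.precisions_).fit(X[idx])
            # Restart EM on the original data from the bootstrap solution.
            cand = GaussianMixture(n_components=n_components,
                                   covariance_type="full",
                                   weights_init=boot.weights_,
                                   means_init=boot.means_,
                                   precisions_init=boot.precisions_).fit(X)
            if cand.score(X) > best_ll:
                best, best_ll = cand, cand.score(X)
        return best

For example, `bootstrap_em(X, n_components=3).predict(X)` would return cluster labels under this scheme. The bootstrap perturbation serves both aims named in the abstract: restarting from resample-based solutions helps EM escape spurious local maxima, while solutions that owe their high likelihood to overfitting particular observations tend not to survive refitting on resampled data.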
