Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models

Model-based clustering using a family of Gaussian mixture models, with parsimonious factor analysis-like covariance structure, is described and an efficient algorithm for its implementation is presented. This algorithm uses the alternating expectation-conditional maximization (AECM) variant of the expectation-maximization (EM) algorithm. Two central issues around the implementation of this family of models, namely model selection and convergence criteria, are discussed. These central issues also have implications for other model-based clustering techniques and for the implementation of techniques like the EM algorithm, in general. The Bayesian information criterion (BIC) is used for model selection and Aitken's acceleration, which is shown to outperform the lack of progress criterion, is used to determine convergence. A brief introduction to parallel computing is then given before the implementation of this algorithm in parallel is facilitated within the master-slave paradigm. A simulation study is then carried out to confirm the effectiveness of this parallelization. The resulting software is applied to two data sets to demonstrate its effectiveness when compared to existing software.
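The Aitken-based stopping rule mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, tolerance, and its value are illustrative. The standard form uses the Aitken acceleration a_k = (l_{k+1} - l_k)/(l_k - l_{k-1}) computed from three successive log-likelihood values to estimate the asymptotic log-likelihood, and stops when that estimate is sufficiently close to the current value.

```python
def aitken_converged(loglik, eps=1e-2):
    """Aitken acceleration-based convergence check for EM-type algorithms.

    loglik: successive log-likelihood values l_0, l_1, ..., l_{k+1};
    at least three values are needed before the check can be applied.
    eps: tolerance (illustrative value).
    """
    if len(loglik) < 3:
        return False
    l_km1, l_k, l_kp1 = loglik[-3], loglik[-2], loglik[-1]
    # Aitken acceleration at iteration k
    a_k = (l_kp1 - l_k) / (l_k - l_km1)
    # Asymptotic estimate of the log-likelihood
    l_inf = l_k + (l_kp1 - l_k) / (1.0 - a_k)
    # Stop when the estimated limit is close to the current value
    return abs(l_inf - l_kp1) < eps
```

Unlike the lack-of-progress criterion, which stops as soon as l_{k+1} - l_k falls below a threshold, this rule stops only when the sequence is estimated to be close to its limit, which avoids premature termination when the likelihood is increasing slowly but steadily.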
