Inference in model-based cluster analysis

A new approach to cluster analysis has been introduced based on parsimonious geometric modelling of the within-group covariance matrices in a mixture of multivariate normal distributions, using hierarchical agglomeration and iterative relocation. It works well and is widely used via the MCLUST software available in S-PLUS and StatLib. However, it has several limitations: there is no assessment of the uncertainty about the classification, the partition can be suboptimal, parameter estimates are biased, the shape matrix has to be specified by the user, prior group probabilities are assumed to be equal, the method for choosing the number of groups is based on a crude approximation, and no formal way of choosing between the various possible models is included. Here, we propose a new approach which overcomes all these difficulties. It consists of exact Bayesian inference via Gibbs sampling, and the calculation of Bayes factors (for choosing the model and the number of groups) from the output using the Laplace–Metropolis estimator. It works well in several real and simulated examples.
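As a rough illustration of how Gibbs sampling yields an assessment of classification uncertainty and allows unequal group probabilities, the sketch below fits a univariate normal mixture. It is a deliberately simplified stand-in, not the method of the paper: the within-group variance is assumed known and common to all groups, the means get a hypothetical N(0, prior_var) prior, the mixing proportions get a uniform Dirichlet prior, and neither the geometric covariance models nor the Laplace–Metropolis step is covered. The name `gibbs_mixture` and all of its parameters are assumptions made for this example.

```python
import math
import random


def gibbs_mixture(x, k=2, n_iter=600, burn_in=200,
                  sigma2=1.0, prior_var=100.0, seed=0):
    """Gibbs sampler for a univariate k-group normal mixture.

    Simplifying assumptions (not from the paper): known common variance
    sigma2, N(0, prior_var) prior on each mean, Dirichlet(1, ..., 1) prior
    on the mixing proportions.
    Returns (membership probabilities, posterior mean of the group means,
    posterior mean of the mixing proportions).
    """
    rng = random.Random(seed)
    n = len(x)
    xs = sorted(x)
    # Start the means at spread-out order statistics of the data.
    mu = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    w = [1.0 / k] * k
    member = [[0] * k for _ in range(n)]
    mu_sum = [0.0] * k
    w_sum = [0.0] * k
    kept = n_iter - burn_in

    for it in range(n_iter):
        # Step 1: draw each label from its full conditional distribution.
        counts = [0] * k
        sums = [0.0] * k
        for i, xi in enumerate(x):
            logp = [math.log(w[j]) - 0.5 * (xi - mu[j]) ** 2 / sigma2
                    for j in range(k)]
            top = max(logp)  # subtract the maximum to avoid underflow
            p = [math.exp(v - top) for v in logp]
            zi = rng.choices(range(k), weights=p)[0]
            counts[zi] += 1
            sums[zi] += xi
            if it >= burn_in:
                member[i][zi] += 1

        # Step 2: draw mixing proportions from Dirichlet(1 + counts),
        # generated as normalised gamma variates.
        g = [rng.gammavariate(1.0 + counts[j], 1.0) for j in range(k)]
        total = sum(g)
        w = [gj / total for gj in g]

        # Step 3: draw each mean from its conjugate normal posterior.
        for j in range(k):
            prec = counts[j] / sigma2 + 1.0 / prior_var
            mu[j] = rng.gauss((sums[j] / sigma2) / prec,
                              math.sqrt(1.0 / prec))

        if it >= burn_in:
            for j in range(k):
                mu_sum[j] += mu[j]
                w_sum[j] += w[j]

    probs = [[c / kept for c in row] for row in member]
    return probs, [s / kept for s in mu_sum], [s / kept for s in w_sum]
```

Each row of the returned `probs` gives the fraction of retained iterations in which an observation was assigned to each group, which serves as an estimate of its posterior membership probabilities. The sketch makes no attempt to handle label switching, so it is only meaningful when the groups are well separated.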
