An Experimental Comparison of Model-Based Clustering Methods

We compare the three basic algorithms for model-based clustering on high-dimensional discrete-variable datasets. All three algorithms use the same underlying model: a naive-Bayes model with a hidden root node, also known as a multinomial-mixture model. In the first part of the paper, we perform an experimental comparison of three batch algorithms that learn the parameters of this model: the Expectation–Maximization (EM) algorithm, a "winner-take-all" version of the EM algorithm reminiscent of the K-means algorithm, and model-based agglomerative clustering. We find that the EM algorithm significantly outperforms the other methods, and we proceed to investigate the effect of various initialization methods on the final solution produced by the EM algorithm. The initializations we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of agglomerative clustering. Although these initialization methods are substantially different, they lead to learned models of similar quality.
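
To make the model and the main algorithm concrete, here is a minimal sketch of batch EM for this mixture, specialized to binary variables for brevity (the paper's datasets are more generally discrete). This is not the authors' implementation: the random initialization, the fixed iteration count, and all names are illustrative assumptions.

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iters=100, seed=0, eps=1e-9):
    """EM for a naive-Bayes mixture with a hidden root (cluster) node.

    X: (N, D) array of 0/1 observations; K: number of clusters.
    Returns mixing weights pi (K,) and per-cluster Bernoulli
    parameters theta (K, D).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, size=(K, D))  # illustrative init

    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] proportional to
        # pi[k] * P(x_n | theta_k), computed in log space for stability.
        log_r = (np.log(pi + eps)
                 + X @ np.log(theta + eps).T
                 + (1 - X) @ np.log(1 - theta + eps).T)   # (N, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the soft counts.
        Nk = r.sum(axis=0)                       # expected cluster sizes
        pi = Nk / N
        theta = (r.T @ X) / (Nk[:, None] + eps)  # (K, D)

    return pi, theta
```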

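The "winner-take-all" variant mentioned above (often called classification EM, or CEM) differs from EM only in the E-step: each data point is assigned wholly to its most probable cluster before the M-step, which is why it behaves much like K-means. A sketch of the replacement E-step, reusing the names from the code above:

```python
# Winner-take-all E-step: replace the soft responsibilities with a
# hard, one-hot assignment (drop-in substitute for the E-step above).
z = log_r.argmax(axis=1)        # most probable cluster per point
r = np.zeros_like(log_r)
r[np.arange(N), z] = 1.0        # one-hot responsibilities
# The M-step is unchanged; with hard assignments it reduces to
# per-cluster frequency counting, as in the K-means update.
```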
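
The first two initializations also admit a short sketch; initialization (3) requires a full agglomerative-clustering pass and is omitted here. The functional forms below (a uniform prior for (1), additive uniform noise for (2)) are assumptions for illustration, not the paper's exact recipes:

```python
def init_theta(X, K, method="marginal", noise=0.05, seed=0):
    """Illustrative versions of initializations (1) and (2).

    X: (N, D) 0/1 data; returns (K, D) initial Bernoulli parameters.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    if method == "prior":
        # (1) Parameters sampled from an uninformative (uniform) prior.
        theta0 = rng.uniform(0.0, 1.0, size=(K, D))
    elif method == "marginal":
        # (2) Random perturbations of the marginal distribution of the
        # data: each cluster starts near the empirical per-variable mean.
        theta0 = X.mean(axis=0) + rng.uniform(-noise, noise, size=(K, D))
    else:
        raise ValueError(f"unknown method: {method}")
    return np.clip(theta0, 1e-3, 1.0 - 1e-3)  # keep probabilities interior
```

In either case the mixing weights can simply start uniform, pi = 1/K, as in the EM sketch above.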