Joint Parsimonious Modeling and Model Order Selection for Multivariate Gaussian Mixtures

Multivariate Gaussian mixture models (GMMs) are widely used for density estimation, model-based data clustering, and statistical classification. A difficult problem is estimating the model order, i.e., the number of mixture components, and the model structure. Use of full covariance matrices, with a number of parameters quadratic in the feature dimension, entails high model complexity and thus may lead to underestimation of the order, while naive Bayes mixtures may introduce model bias and lead to order overestimation. We develop a parsimonious modeling and model order and structure selection method for GMMs that allows for, and optimizes over, parameter-tying configurations across mixture components applied to each individual parameter, including the covariances. We derive a generalized Expectation-Maximization algorithm for Bayesian information criterion (BIC)-based penalized-likelihood minimization. This algorithm, coupled with sequential model order reduction, forms our joint learning and model selection method. Our method searches over a rich space of models and, consistent with minimizing BIC, achieves fine-grained matching of model complexity to the available data. We have found our method to be effective and largely robust in learning accurate model orders and parameter-tying structures for simulated ground-truth mixtures. We compared against naive Bayes and standard full-covariance GMMs on several criteria: 1) model order and structure accuracy (for synthetic data sets); 2) test set log-likelihood; 3) unsupervised classification accuracy; and 4) accuracy when class-conditional mixtures are used in a plug-in Bayes classifier. Our method, which chooses model orders intermediate between those of standard and naive Bayes GMMs, gives improved accuracy with respect to each of these performance measures.
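To make the selection criterion concrete, the sketch below is a simplified illustration of BIC-based model order selection over the two baseline model families compared in the abstract, not the paper's joint tying method (the search over per-parameter tying configurations and the sequential order reduction are omitted). It fits standard full-covariance and naive Bayes (diagonal-covariance) GMMs over a range of candidate orders and keeps the model minimizing the BIC cost, -2 log L + k log n, where k is the number of free parameters and n is the sample size. Use of scikit-learn's GaussianMixture, and the synthetic two-component data, are assumptions for illustration only.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Illustrative synthetic data: two well-separated Gaussian clusters in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=-2.0, scale=0.5, size=(200, 2)),
        rng.normal(loc=2.0, scale=1.0, size=(200, 2)),
    ])

    best = None
    for cov_type in ("full", "diag"):    # standard GMM vs. naive Bayes GMM
        for order in range(1, 8):        # candidate model orders
            gmm = GaussianMixture(n_components=order,
                                  covariance_type=cov_type,
                                  random_state=0).fit(X)
            bic = gmm.bic(X)             # BIC cost: -2 log L + k log n
            if best is None or bic < best[0]:
                best = (bic, cov_type, order)

    print("Selected: covariance=%s, order=%d, BIC=%.1f" % (best[1], best[2], best[0]))

Because the full-covariance family pays a quadratic-in-dimension parameter penalty per component while the diagonal family pays only a linear one, BIC will typically tolerate more diagonal components than full ones on the same data; the paper's contribution is to search the intermediate space between these two extremes via per-parameter tying.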
