High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length

We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. The data are represented by a finite mixture model based on the generalized Dirichlet distribution, which has a more general covariance structure than the Dirichlet distribution and can flexibly approximate both symmetric and asymmetric distributions, making it more practical for modeling. An important problem in mixture modeling is determining the number of clusters: a mixture with too many or too few components may approximate the true model poorly. We apply the minimum message length (MML) principle to this problem, deriving an MML criterion that selects the number of clusters in the mixture model that best describes the data. We compare this criterion with other selection criteria. The validation involves synthetic data, real data clustering, and two real applications: classification of Web pages and texture database summarization for efficient retrieval.
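
To make the flexibility claim concrete, the following minimal sketch evaluates the log-density of a generalized Dirichlet distribution. The abstract does not state the density, so the code assumes the standard Connor and Mosimann parameterization with parameter vectors `alpha` and `beta`; the function name `generalized_dirichlet_logpdf` is hypothetical, and this is not the paper's estimation or MML procedure.

```python
import math


def generalized_dirichlet_logpdf(x, alpha, beta):
    """Log-density of a generalized Dirichlet distribution at x.

    Assumed parameterization (Connor-Mosimann form):
      p(x) = prod_i Gamma(a_i + b_i) / (Gamma(a_i) Gamma(b_i))
             * x_i^(a_i - 1) * (1 - x_1 - ... - x_i)^(g_i)
      g_i = b_i - a_{i+1} - b_{i+1} for i < d, and g_d = b_d - 1.
    x holds d positive proportions whose sum is strictly below 1.
    """
    d = len(x)
    if not (len(alpha) == len(beta) == d):
        raise ValueError("x, alpha and beta must have the same length")

    log_p = 0.0
    cumulative = 0.0
    for i in range(d):
        cumulative += x[i]
        if x[i] <= 0.0 or cumulative >= 1.0:
            raise ValueError("x must lie strictly inside the simplex")
        # Exponent of the remaining mass after the first i + 1 components.
        if i < d - 1:
            gamma_i = beta[i] - alpha[i + 1] - beta[i + 1]
        else:
            gamma_i = beta[i] - 1.0
        # Log of the Beta-type normalizing constant for component i.
        log_p += (math.lgamma(alpha[i] + beta[i])
                  - math.lgamma(alpha[i]) - math.lgamma(beta[i]))
        log_p += (alpha[i] - 1.0) * math.log(x[i])
        log_p += gamma_i * math.log(1.0 - cumulative)
    return log_p
```

Under this parameterization, setting `beta[i] = alpha[i + 1] + beta[i + 1]` for every `i < d - 1` makes the intermediate exponents vanish and recovers the ordinary Dirichlet density. The d extra free parameters are what give the generalized form its more general covariance structure.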
