Discrete data clustering using finite mixture models

Finite mixture models have been applied to a variety of computer vision, image processing, and pattern recognition tasks. Most work on finite mixture models has focused on mixtures for continuous data. However, many applications involve or generate discrete data, for which discrete mixtures are better suited. In this paper, we investigate discrete data modeling using finite mixture models. We propose a novel, well-motivated mixture that we call the multinomial generalized Dirichlet mixture, and we compare it with other discrete mixtures. Experiments on the modeling and summarization of spatial color image databases and on text classification show the robustness, flexibility, and merits of our approach.
