Inference and evaluation of the multinomial mixture model for text clustering

In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, and information extraction. Probabilistic clustering models, which build "soft" theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We show empirically that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, owing to the specific profile of word count distributions. Using the fact that the model parameters can be integrated out analytically, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach.
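To make the setup concrete, here is a minimal EM sketch for the multinomial mixture, written in Python with NumPy. Everything here is illustrative rather than the paper's exact procedure: the function name `em_multinomial_mixture`, the single Laplace-style smoothing constant `alpha`, and the random initialization are all assumptions.

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=0.01, seed=0):
    """EM for a mixture of multinomials over word counts (illustrative sketch).

    X     : (D, V) document-term count matrix
    K     : number of mixture components (themes)
    alpha : Laplace-style smoothing added to the word-probability estimates
    """
    rng = np.random.default_rng(seed)
    D, V = X.shape
    # Random initialization; the paper stresses that EM is sensitive to it.
    pi = np.full(K, 1.0 / K)
    theta = rng.dirichlet(np.ones(V), size=K)                 # (K, V)

    for _ in range(n_iter):
        # E-step: responsibilities, computed in the log domain to avoid underflow.
        log_r = np.log(pi)[None, :] + X @ np.log(theta).T    # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: smoothed maximum-likelihood updates.
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha                              # (K, V)
        theta = counts / counts.sum(axis=1, keepdims=True)

    return pi, theta, r
```

The collapsed Gibbs sampler mentioned at the end of the abstract exploits Dirichlet-multinomial conjugacy: with symmetric Dirichlet priors on the mixing weights and on the component word distributions, both parameter sets can be integrated out in closed form, leaving a sampler over the cluster labels alone. The sketch below is again an assumption-laden illustration; the hyperparameters `alpha` and `beta`, the sweep count, and the initialization are placeholders, not the paper's settings.

```python
import numpy as np
from scipy.special import gammaln

def gibbs_multinomial_mixture(X, K, n_sweeps=200, alpha=1.0, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for the multinomial mixture (illustrative sketch):
    pi and theta are integrated out under symmetric Dirichlet priors, so only
    the per-document cluster labels z are sampled."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    z = rng.integers(K, size=D)
    n = np.bincount(z, minlength=K).astype(float)   # documents per cluster
    c = np.zeros((K, V))                            # word counts per cluster
    for d in range(D):
        c[z[d]] += X[d]
    t = c.sum(axis=1)                               # total words per cluster
    doc_len = X.sum(axis=1)

    for _ in range(n_sweeps):
        for d in range(D):
            # Remove document d from the sufficient statistics.
            k_old = z[d]
            n[k_old] -= 1; c[k_old] -= X[d]; t[k_old] -= doc_len[d]
            # Dirichlet-multinomial predictive for assigning d to each cluster.
            logp = (np.log(n + alpha)
                    + gammaln(t + V * beta) - gammaln(t + V * beta + doc_len[d])
                    + (gammaln(c + beta + X[d]) - gammaln(c + beta)).sum(axis=1))
            p = np.exp(logp - logp.max())
            p /= p.sum()
            # Reassign d and restore the statistics.
            k_new = rng.choice(K, p=p)
            z[d] = k_new
            n[k_new] += 1; c[k_new] += X[d]; t[k_new] += doc_len[d]
    return z
```

In both sketches, `X` is a document-term count matrix of shape (D, V); running either function on such a matrix yields a soft (responsibilities `r`) or hard (labels `z`) clustering that can then be scored against reference labels within an evaluation framework of the kind the paper proposes.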
