Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution

The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Pólya distribution, is a model for text documents that accounts for burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. We derive a new family of distributions that approximate DCM distributions and, unlike DCM distributions, constitute an exponential family. We use these so-called EDCM distributions to gain insight into the properties of DCM distributions, and then derive an algorithm for EDCM maximum-likelihood training that is many times faster than the corresponding method for DCM distributions. Next, we investigate expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. Experiments show that the new algorithm is competitive with the best methods in the literature, and superior at finding models with low perplexity.
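To make the relationship between the two families concrete, the following is a minimal sketch, not the paper's implementation. It evaluates the standard DCM log-probability of a word-count vector and a small-parameter approximation of it. The approximation is written here under the assumption that Γ(x + α)/Γ(α) ≈ Γ(x)·α for x ≥ 1 and small α, which is the usual route to an EDCM-style form; the function names `dcm_log_prob` and `edcm_log_prob` and the parameter values are illustrative and do not come from the text above.

```python
import math


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM (multivariate Polya).

    p(x | alpha) = n! / prod(x_w!) * Gamma(s) / Gamma(s + n)
                   * prod(Gamma(x_w + alpha_w) / Gamma(alpha_w)),
    where n = sum(x) and s = sum(alpha).
    """
    n = sum(counts)
    s = sum(alpha)
    logp = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    logp += math.lgamma(s) - math.lgamma(s + n)
    logp += sum(math.lgamma(x + a) - math.lgamma(a)
                for x, a in zip(counts, alpha))
    return logp


def edcm_log_prob(counts, beta):
    """Small-parameter approximation of the DCM log-probability.

    Assumed form (illustrative): each factor
    Gamma(x + a) / (Gamma(a) * x!) is replaced by a / x for x >= 1,
    so only words that actually occur contribute a term.
    """
    n = sum(counts)
    s = sum(beta)
    logq = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)
    logq += sum(math.log(b) - math.log(x)
                for x, b in zip(counts, beta) if x >= 1)
    return logq


if __name__ == "__main__":
    # Hypothetical three-word vocabulary with small parameters.
    params = [0.01, 0.02, 0.005]
    doc = [2, 1, 0]
    print("DCM  log-prob:", dcm_log_prob(doc, params))
    print("EDCM log-prob:", edcm_log_prob(doc, params))
```

In this sketch the approximate form depends on each count only through whether the word occurs and through a term that does not involve the parameters, which is what makes an exponential-family treatment plausible. With small parameters the two log-probabilities should be close; the gap grows as the parameters get larger.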
