Modeling word burstiness using the Dirichlet distribution

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.

[1]  George Kingsley Zipf,et al.  Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology , 2012 .

[2]  Yuen Ren Chao,et al.  Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology , 1950 .

[3]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[4]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[5]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[6]  Kenneth Ward Church,et al.  Poisson mixtures , 1995, Natural Language Engineering.

[7]  Slava M. Katz Distribution of content words and phrases in text and language modelling , 1996, Natural Language Engineering.

[8]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[9]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[10]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[11]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[12]  Akiko Aizawa,et al.  An information-theoretic perspective of tf-idf measures , 2003, Inf. Process. Manag..

[13]  Martin Jansche,et al.  Parametric Models of Linguistic Count Data , 2003, ACL.

[14]  David R. Karger,et al.  Tackling the Poor Assumptions of Naive Bayes Text Classifiers , 2003, ICML.

[15]  David R. Karger,et al.  Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model , 2003, SIGIR.

[16]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[17]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.