A Bayesian Mixture Model for Term Re-occurrence and Burstiness

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameters are estimated within a Bayesian framework, which allows a flexible model to be fitted. The model provides measures of a term's re-occurrence rate and within-document burstiness. It applies to all kinds of terms, whether rare content words, medium-frequency terms, or frequent function words. A measure of a term's importance, based on its distribution pattern in the corpus, is also proposed.
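As a rough illustration of the quantities the abstract refers to, the sketch below extracts the gaps between successive occurrences of a term and evaluates an exponential mixture density at a given gap. It is not the paper's implementation. The abstract does not state the number of mixture components, so the two-component form is an assumption, as are the function names `occurrence_gaps` and `mixture_density`, the parameters `p`, `rate_1`, and `rate_2`, and the choice to measure gaps in tokens. The Bayesian parameter estimation is not shown.

```python
import math


def occurrence_gaps(tokens, term):
    """Return the gaps, in tokens, between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def mixture_density(gap, p, rate_1, rate_2):
    """Density of a two-component exponential mixture at `gap`.

    Assumed form: p * rate_1 * exp(-rate_1 * gap)
                  + (1 - p) * rate_2 * exp(-rate_2 * gap)
    """
    return (p * rate_1 * math.exp(-rate_1 * gap)
            + (1 - p) * rate_2 * math.exp(-rate_2 * gap))


# Toy example: "the" occurs at token positions 0, 3, 6, and 9.
tokens = "the model fits the gaps between the occurrences of the term".split()
gaps = occurrence_gaps(tokens, "the")

# Illustrative parameter values only; they are not estimates from any corpus.
densities = [mixture_density(g, p=0.5, rate_1=1.0, rate_2=0.1) for g in gaps]
```

Under this assumed form, a component with a large rate concentrates mass on short gaps (close repetitions, which suggest burstiness), while a component with a small rate covers the long gaps between separate bursts.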
