Paper information - A Bayesian Mixture Model for Term Re-occurrence and Burstiness

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameters are estimated within a Bayesian framework, which allows a flexible model to be fitted. The model provides measures of a term's re-occurrence rate and within-document burstiness. It works for all kinds of terms, whether a rare content word, a medium-frequency term, or a frequent function word. A measure is also proposed to capture a term's importance based on its distribution pattern in the corpus.
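To make the gap-based idea concrete, the following is a minimal sketch, not the authors' implementation. It extracts the gaps between successive occurrences of a term from a token sequence and evaluates the log-likelihood of those gaps under an exponential mixture. The abstract does not state the number of components, the priors, or how document boundaries are handled, so the two-component form, the function names, and the parameter values below are all illustrative assumptions. The Bayesian estimation step is not shown.

```python
import math


def occurrence_gaps(tokens, term):
    """Return the gaps (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def mixture_density(gap, p, rate_1, rate_2):
    """Density of a two-component exponential mixture at `gap`.

    Assumption: two components with mixing weight `p`; the abstract only
    says "a mixture of exponential distributions".
    """
    first = p * rate_1 * math.exp(-rate_1 * gap)
    second = (1.0 - p) * rate_2 * math.exp(-rate_2 * gap)
    return first + second


def log_likelihood(gaps, p, rate_1, rate_2):
    """Sum of log mixture densities over all observed gaps."""
    return sum(math.log(mixture_density(g, p, rate_1, rate_2)) for g in gaps)


# Toy data, purely illustrative.
tokens = "the cat sat on the mat and the cat slept".split()
gaps = occurrence_gaps(tokens, "the")

# Hypothetical parameter values: one fast-decaying component for short gaps
# and one slow-decaying component for long gaps.
score = log_likelihood(gaps, p=0.5, rate_1=0.5, rate_2=0.05)
print(gaps, score)
```

In a sketch like this, a component with a large rate puts most of its mass on short gaps and one with a small rate on long gaps, which is one plausible way to read the abstract's distinction between within-document burstiness and the overall re-occurrence rate.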

Paul H. Garthwaite | Anne N. De Roeck | Avik Sarkar
