Paper information - The beta-binomial mixture model for word frequencies in documents with applications to information retrieval

This paper describes a continuous-mixture statistical model for word occurrence frequencies in documents and applies that model to the DARPA-sponsored TDT topic identification tasks [1]. L. Gillick originally proposed the model in 1990 [2] to account for variation in word frequencies across documents more accurately than the binomial model does. This paper develops the model further mathematically and uses it to implement a topic-tracking system. We report this system's performance on the Tracking Task in the December 1998 DARPA TDT Evaluation and compare it with Dragon’s existing, more complex multinomial-model-based system. (Results from other systems applied to this task are available in [3].) We conclude with plans for further development.

Stephen A. Lowe
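
The abstract's claim that a continuous mixture captures word-frequency variation better than a plain binomial can be illustrated with a minimal sketch. It uses the textbook beta-binomial distribution, in which a word's per-document occurrence rate is drawn from a Beta(alpha, beta) prior instead of being fixed. The parameter names `alpha` and `beta`, the document length `n`, and the numeric values are illustrative assumptions; the paper's own parameterization and estimation procedure are not given in this summary.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_pmf(k, n, p):
    """P(word occurs k times in n tokens) with a fixed rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(word occurs k times in n tokens) when the rate is Beta(alpha, beta).

    Textbook form: C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta).
    """
    log_ratio = log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return comb(n, k) * exp(log_ratio)


def mean_and_variance(pmf, n):
    """Mean and variance of a count distribution supported on 0..n."""
    mean = sum(k * pmf(k) for k in range(n + 1))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(n + 1))
    return mean, var


# Illustrative values only (assumptions, not taken from the paper).
n = 200                    # document length in tokens
alpha, beta = 0.5, 49.5    # prior on the word's per-document rate
p = alpha / (alpha + beta) # matching fixed rate for the binomial

bin_mean, bin_var = mean_and_variance(lambda k: binomial_pmf(k, n, p), n)
bb_mean, bb_var = mean_and_variance(
    lambda k: beta_binomial_pmf(k, n, alpha, beta), n
)

print("binomial      mean, variance:", bin_mean, bin_var)
print("beta-binomial mean, variance:", bb_mean, bb_var)
```

Both distributions share the same mean count, but the beta-binomial has a larger variance, which is the extra document-to-document variation that the mixture is meant to absorb.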

[1] Stephen E. Robertson, et al. Relevance weighting of search terms, 1976, J. Am. Soc. Inf. Sci.

[2] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing, 1974.

[3] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing. Part I. On the Distribution of Specialty Words in a Technical Literature, 1975, J. Am. Soc. Inf. Sci.

[4] Stephen A. Lowe. The Beta-Binomial Mixture Model and Its Application to TDT Tracking and Detection, 1999.

[5] Janet M. Baker, et al. Application of large vocabulary continuous speech recognition to topic and speaker identification using telephone speech, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Janet M. Baker, et al. Topic and Speaker Identification via Large Vocabulary Continuous Speech Recognition, 1993, HLT.

[7] J. Rice. Mathematical Statistics and Data Analysis, 1988.

[8] Sean Connolly, et al. Improvements in switchboard recognition and topic identification, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[9] Quentin Burrell, et al. A Simple Stochastic Model for Library Loans, 1980, J. Documentation.

[10] Jonathan Yamron, et al. Topic Tracking in a News Stream, 1999.