Language Models for Topic Tracking

Generative unigram language models have proven to be a simple though effective model for information retrieval tasks. In contrast to ad-hoc retrieval, topic tracking requires that matching scores are comparable across topics. Several ranking functions based on generative language models: straight likelihood, likelihood ratio, normalized likelihood ratio, and the related Kullback-Leibler divergence are evaluated in two orientations. Best performance is achieved by the models based on a normalized log-likelihood ratio. Key component of these models is the a-priori probability of a story with respect to a common reference distribution.

[1]  James Allan,et al.  Relevance models for topic detection and tracking , 2002 .

[2]  M. E. Maron,et al.  On Relevance, Probabilistic Indexing and Information Retrieval , 1960, JACM.

[3]  Mark Liberman,et al.  Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts , 2000, LREC.

[4]  Charles L. Wayne Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation , 2000, LREC.

[5]  W. Bruce Croft,et al.  Workshop on language modeling and information retrieval , 2001, SIGF.

[6]  Avi Arampatzis,et al.  The score-distributional threshold optimization for adaptive binary classification tasks , 2001, SIGIR '01.

[7]  R. Manmatha,et al.  Modeling score distributions for combining the outputs of search engines , 2001, SIGIR '01.

[8]  Wessel Kraaij,et al.  Using language models for tracking events of interest over time , 2001 .

[9]  Richard M. Schwartz,et al.  A hidden Markov model information retrieval system , 1999, SIGIR '99.

[10]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[11]  Wessel Kraaij,et al.  Unsupervised Event Clustering in Multilingual News Streams , 2002 .

[12]  Stephen E. Robertson,et al.  A probabilistic model of information retrieval: development and comparative experiments - Part 1 , 2000, Inf. Process. Manag..

[13]  Ellen M. Voorhees,et al.  The seventh text REtrieval conference (TREC-7) , 1999 .

[14]  W. Bruce Croft,et al.  Predicting query performance , 2002, SIGIR '02.

[15]  Christoph Baumgarten,et al.  A probabilistic model for distributed information retrieval , 1997, Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

[16]  Djoerd Hiemstra,et al.  Twenty-One at TREC7: Ad-hoc and Cross-Language Track , 1998, TREC.

[17]  Christoph Baumgarten,et al.  A probabilistic solution to the selection and fusion problem in distributed information retrieval , 1999, SIGIR '99.

[18]  Djoerd Hiemstra,et al.  A Linguistically Motivated Probabilistic Model of Information Retrieval , 1998, ECDL.

[19]  Jean-Pierre Chevallet,et al.  About Retrieval Models and Logic , 1992, Comput. J..

[20]  Djoerd Hiemstra,et al.  Twenty-One at TREC-8: using Language Technology for Information Retrieval , 1999, TREC.

[21]  Richard M. Schwartz,et al.  Topic tracking for radio, TV broadcast, and newswire , 1999, EUROSPEECH.

[22]  Norbert Fuhr,et al.  Probabilistic Models in Information Retrieval , 1992, Comput. J..

[23]  Kenney Ng A Maximum Likelihood Ratio Information Retrieval Model , 1999, TREC.

[24]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[25]  S. Robertson The probability ranking principle in IR , 1997 .

[26]  W. Bruce Croft,et al.  Cross-lingual relevance models , 2002, SIGIR '02.

[27]  Djoerd Hiemstra,et al.  The Importance of Prior Probabilities for Entry Page Search , 2002, SIGIR '02.

[28]  James P. Callan,et al.  Experiments Using the Lemur Toolkit , 2001, TREC.

[29]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..