Contributions of Language Modeling to the Theory and Practice of Information Retrieval

This paper analyzes what language modeling (LM) means in the context of information retrieval (IR). We argue that the language modeling approach makes two principal contributions. First, it brings the thinking, theory, and practical knowledge of research in related fields to bear on the retrieval problem. Second, it makes plain that parameter estimation is important for probabilistic IR approaches. In particular, it has drawn the IR community's attention to the need to consider variance reduction explicitly when designing statistical estimators. We describe a simulation environment developed for studying theoretical issues in information retrieval. Results from the simulation show quantitatively how variance reduction techniques applied to parameter estimation can improve performance on the ad hoc retrieval task.
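To make the idea of variance reduction in parameter estimation concrete, the following is a minimal illustrative sketch, not the simulation environment described in this paper. It assumes a single term whose occurrences in a document are independent draws with a fixed probability, and it compares the maximum-likelihood estimate with a linearly interpolated estimate that shrinks toward a collection-wide probability. The function names, the mixing weight `lam`, and all numeric settings are hypothetical choices made for illustration.

```python
import random
import statistics


def mle(term_count, doc_length):
    """Maximum-likelihood estimate of a term's probability in a document."""
    return term_count / doc_length


def interpolated(term_count, doc_length, collection_prob, lam):
    """Linear interpolation of the MLE with a collection-wide probability.

    lam is an assumed mixing weight in [0, 1]; larger values shrink the
    estimate more strongly toward collection_prob.
    """
    return (1.0 - lam) * mle(term_count, doc_length) + lam * collection_prob


def simulate(true_prob=0.02, collection_prob=0.01, doc_length=50,
             lam=0.5, trials=2000, seed=0):
    """Sample many short documents and compare the spread of both estimators."""
    rng = random.Random(seed)
    mle_estimates = []
    smoothed_estimates = []
    for _ in range(trials):
        # Each token is the term of interest with probability true_prob.
        count = sum(1 for _ in range(doc_length) if rng.random() < true_prob)
        mle_estimates.append(mle(count, doc_length))
        smoothed_estimates.append(
            interpolated(count, doc_length, collection_prob, lam))
    return {
        "mle_mean": statistics.mean(mle_estimates),
        "mle_variance": statistics.pvariance(mle_estimates),
        "smoothed_mean": statistics.mean(smoothed_estimates),
        "smoothed_variance": statistics.pvariance(smoothed_estimates),
    }


result = simulate()
print(result)
```

Because the interpolated estimate is a fixed affine function of the MLE, its variance is scaled by `(1 - lam) ** 2`, at the cost of a bias toward the collection probability. Whether that trade improves retrieval performance is the empirical question the simulation results address; this sketch shows only the variance effect.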
