Statistical language modeling for information retrieval

Abstract: This chapter reviews research and applications in statistical language modeling for information retrieval (IR), which has emerged within the past several years as a new probabilistic framework for describing information retrieval processes. Generally speaking, statistical language modeling, or more simply language modeling (LM), refers to the task of estimating a probability distribution that captures statistical regularities of natural language use. Applied to information retrieval, language modeling refers to the problem of estimating the likelihood that a query and a document could have been generated by the same language model, given the language model of the document and with or without a language model of the query.
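
As a rough illustration of the retrieval formulation described in the abstract, the sketch below ranks documents by the log-likelihood that a unigram document language model generated the query. The abstract does not specify any of the details: the unigram assumption, the linear interpolation with a collection model, the weight `lam`, the function name `query_log_likelihood`, and the toy documents are all assumptions made for this example, not the chapter's method.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | document model) for a unigram model.

    Assumption: the document model is linearly interpolated with a
    collection-wide model so that unseen query terms do not get zero
    probability. `lam` is the (hypothetical) weight on the collection model.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)

    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term occurs nowhere in the collection: likelihood is zero.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data, invented for illustration only.
docs = {
    "d1": "language models estimate word probabilities".split(),
    "d2": "retrieval systems rank documents for a query".split(),
}
collection = [term for tokens in docs.values() for term in tokens]
query = "language models".split()

# Rank documents by how likely their model is to have generated the query.
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], collection),
    reverse=True,
)
```

Here `ranking` places `d1` first, since both query terms occur in it, while `d2` receives only the smoothed collection probability for each term.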
