An Approach to Information Retrieval Based on Statistical Model Selection

Abstract. Building on previous work on language modeling for information retrieval (IR), this paper proposes a novel approach to document ranking based on statistical model selection. The approach makes two main contributions. First, we posit the notion of a document’s “null model,” a language model that conditions our assessment of the document model’s significance with respect to the query. Second, we introduce an information-theoretic model complexity penalty into document ranking. We rank documents by a penalized log-likelihood ratio that compares the likelihood that each document model generated the query with the likelihood that a corresponding null model generated it. Each model is assessed by the Akaike information criterion (AIC), which estimates the expected Kullback-Leibler divergence between the observed model (null or non-null) and the underlying model that generated the data. We report experimental results in which the model selection approach improves on traditional language modeling (LM) retrieval.
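The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the null model is built or how model complexity is counted, so everything below beyond the standard AIC formula (2k minus twice the log-likelihood) is an assumption made for illustration: the collection language model stands in for the null model, the document model uses Dirichlet smoothing, and the parameter counts `k_doc` and `k_null` are supplied by the caller. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def aic(log_likelihood, num_params):
    """Standard AIC: 2k - 2 ln L. Lower values indicate a preferred model."""
    return 2.0 * num_params - 2.0 * log_likelihood


def query_log_likelihood(query_terms, term_probs, floor=1e-12):
    """Log-likelihood of the query under a unigram model.

    The floor guards against log(0) for unseen terms (an assumption of this sketch).
    """
    return sum(math.log(max(term_probs.get(t, 0.0), floor)) for t in query_terms)


def dirichlet_model(doc_terms, collection_probs, mu=1000.0):
    """Dirichlet-smoothed document model (assumed here; the abstract names no smoother)."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    return {t: (counts.get(t, 0) + mu * p) / (n + mu)
            for t, p in collection_probs.items()}


def penalized_score(query_terms, doc_terms, collection_probs,
                    k_doc, k_null=0, mu=1000.0):
    """AIC(null) - AIC(document): a penalized log-likelihood ratio.

    Equals 2 * (ll_doc - ll_null) - 2 * (k_doc - k_null). Higher is better.
    The collection model plays the role of the null model (assumption).
    """
    doc_probs = dirichlet_model(doc_terms, collection_probs, mu)
    ll_doc = query_log_likelihood(query_terms, doc_probs)
    ll_null = query_log_likelihood(query_terms, collection_probs)
    return aic(ll_null, k_null) - aic(ll_doc, k_doc)


if __name__ == "__main__":
    # Toy collection model and documents, invented for illustration only.
    collection = {"model": 0.1, "selection": 0.1, "retrieval": 0.2,
                  "cat": 0.3, "dog": 0.3}
    query = ["model", "selection"]
    docs = {
        "a": ["model", "selection", "model", "retrieval"],
        "b": ["cat", "dog", "cat", "dog"],
    }
    # Hypothetical complexity measure: number of distinct terms in the document.
    scores = {name: penalized_score(query, terms, collection,
                                    k_doc=len(set(terms)), mu=10.0)
              for name, terms in docs.items()}
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```

Passing the parameter counts in explicitly keeps the sketch neutral about how complexity should be measured; the distinct-term count in the usage example is only one possible choice.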
