Paper information - An Approach to Information Retrieval Based on Statistical Model Selection

An Approach to Information Retrieval Based on Statistical Model Selection

Abstract: Building on previous work in language modeling (LM) information retrieval (IR), this paper proposes a novel approach to document ranking based on statistical model selection. The approach makes two main contributions. First, we posit the notion of a document’s “null model,” a language model that conditions our assessment of the document model’s significance with respect to the query. Second, we introduce an information-theoretic model complexity penalty into document ranking. We rank documents by a penalized log-likelihood ratio that compares the likelihood that each document model generated the query with the likelihood that a corresponding “null” model generated it. Each model is assessed by the Akaike information criterion (AIC), which estimates the expected Kullback-Leibler divergence between the observed model (null or non-null) and the underlying model that generated the data. We report experimental results in which the model selection approach improves on traditional LM retrieval.
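To make the ranking idea in the abstract concrete, here is a minimal sketch of a penalized log-likelihood ratio built from the standard AIC formula, AIC = 2k - 2 ln L. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not say how the null model is constructed or how parameters are counted, so the sketch assumes a smoothed collection model as the null model, a Jelinek-Mercer mixture as the document model, and parameter counts `k_doc` and `k_null` supplied by the caller. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def aic(log_likelihood, num_params):
    """Akaike information criterion, 2k - 2 ln L (lower is better)."""
    return 2.0 * num_params - 2.0 * log_likelihood


def query_log_likelihood(query, term_prob):
    """Log-probability of the query terms under a unigram model."""
    return sum(math.log(term_prob(t)) for t in query)


def penalized_llr(query, doc, collection, k_doc, k_null, lam=0.5):
    """Score a non-empty document by half the AIC difference (null minus document).

    Algebraically this is (ln L_doc - ln L_null) - (k_doc - k_null):
    a log-likelihood ratio minus a complexity penalty.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)
    vocab_size = len(coll_tf)

    def p_null(term):
        # ASSUMPTION: the null model is an add-one smoothed collection model.
        return (coll_tf[term] + 1.0) / (coll_len + vocab_size + 1.0)

    def p_doc(term):
        # ASSUMPTION: the document model is a Jelinek-Mercer mixture of the
        # document's maximum-likelihood estimate and the null model.
        return lam * doc_tf[term] / doc_len + (1.0 - lam) * p_null(term)

    aic_doc = aic(query_log_likelihood(query, p_doc), k_doc)
    aic_null = aic(query_log_likelihood(query, p_null), k_null)
    return 0.5 * (aic_null - aic_doc)


if __name__ == "__main__":
    collection = ["language", "model", "retrieval", "ranking",
                  "query", "document", "model", "selection"]
    query = ["model", "selection"]
    docs = {
        "doc_a": ["model", "selection", "ranking"],
        "doc_b": ["document", "ranking", "query"],
    }
    for name, tokens in docs.items():
        score = penalized_llr(query, tokens, collection, k_doc=1, k_null=1)
        print(name, round(score, 3))
```

When `k_doc` equals `k_null` the penalty cancels and the score reduces to a plain log-likelihood ratio. A larger `k_doc` lowers the score by the difference in parameter counts, which is how the complexity penalty enters the ranking in this sketch.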

Miles Efron
