A Theoretical Analysis of Pseudo-Relevance Feedback Models

Our goal in this study is to compare several widely used pseudo-relevance feedback (PRF) models and to understand what explains their respective behavior. To do so, we first analyze how different PRF models behave through the characteristics of the terms they select and through their performance on two widely used test collections. This analysis reveals that, surprisingly, several well-known models tend to select very common terms with low IDF (inverse document frequency). We then introduce several conditions that PRF models should satisfy, regarding both the terms they select and the way they weight them, before studying whether standard PRF models satisfy these conditions. This study reveals that most models are deficient with respect to at least one condition, and that these deficiencies explain the results of our analysis of the models' behavior, as well as some previously reported results on the relative performance of PRF models. Based on these conditions, we finally propose possible corrections to the simple mixture model. The corrected PRF models outperform their standard versions and yield state-of-the-art performance, confirming the validity of our theoretical analysis.
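For concreteness, the simple mixture model referred to above (model-based feedback in the language-modeling approach, due to Zhai and Lafferty) estimates a feedback language model by assuming each term occurrence in the pseudo-relevant documents is generated either by an unknown feedback model or by the background collection model, and fitting the feedback model with EM. The following is a minimal Python sketch of that estimation under stated assumptions: `feedback_docs` is a list of per-document term-count dicts, `p_coll` maps every term to its collection probability (assumed nonzero), and `lam` is the fixed mixture weight; all of these names and defaults are hypothetical, chosen for illustration rather than taken from the paper.

```python
# Minimal sketch of EM estimation for the simple mixture feedback model.
# Assumptions (hypothetical, for illustration): feedback_docs is a list of
# {term: count} dicts for the pseudo-relevant documents, p_coll gives the
# collection probability of every term (nonzero), lam is a fixed mixture
# weight between the feedback model and the collection model.
from collections import Counter

def mixture_feedback_model(feedback_docs, p_coll, lam=0.5, iters=30):
    # Pool term counts over the pseudo-relevant (feedback) documents.
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc)
    vocab = list(counts)

    # Initialize the feedback model uniformly over the pooled vocabulary.
    p_fb = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(iters):
        # E-step: posterior probability that an occurrence of w was
        # generated by the feedback model rather than the collection model.
        t = {w: lam * p_fb[w] / (lam * p_fb[w] + (1 - lam) * p_coll[w])
             for w in vocab}
        # M-step: re-estimate the feedback model from the expected counts.
        norm = sum(counts[w] * t[w] for w in vocab)
        p_fb = {w: counts[w] * t[w] / norm for w in vocab}
    return p_fb
```

One way to reproduce the low-IDF observation discussed above is to take the top-weighted terms of `p_fb` and inspect their IDF values; terms whose estimated feedback probability is high but whose IDF is low are exactly the common terms the analysis flags.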
