Estimation and use of uncertainty in pseudo-relevance feedback

Existing pseudo-relevance feedback methods typically perform averaging over the top-retrieved documents but ignore an important statistical dimension: the risk or variance associated with either the individual document models or their combination. Treating the baseline feedback method as a black box and the output feedback model as a random variable, we estimate a posterior distribution for the feedback model by resampling a given query's top-retrieved documents, using the posterior mean or mode as the enhanced feedback model. We then perform model combination over several enhanced models, each based on a slightly modified query sampled from the original query. We find that resampling documents helps increase individual feedback model precision by removing noise terms, while sampling from the query improves robustness (worst-case performance) by emphasizing terms related to multiple query aspects. The result is a meta-feedback algorithm that is both more robust and more precise than the original strong baseline method.
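The document-resampling step can be sketched as a simple bootstrap over the top-retrieved documents. The sketch below is illustrative only: `baseline_feedback` is a hypothetical stand-in for the black-box feedback method (here a toy relative-frequency relevance model), and `resampled_feedback` approximates the posterior mean by averaging the feedback models fitted to bootstrap resamples.

```python
import random
from collections import defaultdict

def baseline_feedback(docs):
    """Hypothetical black-box baseline: a toy feedback model that
    weights each term by its relative frequency across the given docs."""
    counts = defaultdict(float)
    total = 0
    for doc in docs:
        for term in doc.split():
            counts[term] += 1
            total += 1
    return {t: c / total for t, c in counts.items()}

def resampled_feedback(top_docs, n_samples=50, seed=0):
    """Bootstrap the top-retrieved documents: run the black-box feedback
    method on each resample, then average the resulting term weights as
    an estimate of the posterior mean of the feedback model."""
    rng = random.Random(seed)
    accum = defaultdict(float)
    for _ in range(n_samples):
        # Sample documents with replacement (one bootstrap replicate).
        sample = [rng.choice(top_docs) for _ in top_docs]
        for term, weight in baseline_feedback(sample).items():
            accum[term] += weight
    return {t: w / n_samples for t, w in accum.items()}
```

Noise terms that appear in only a few top documents are absent from many resamples, so their averaged weight shrinks relative to terms that occur consistently, which is the precision gain the abstract attributes to document resampling.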
