A cluster-based resampling method for pseudo-relevance feedback

Typical pseudo-relevance feedback methods assume that the top-retrieved documents are relevant and select expansion terms from these pseudo-relevant documents. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method for selecting better pseudo-relevant documents within the relevance-model framework. The main idea is to use document clusters to find dominant documents in the initial retrieval set, and to feed those documents repeatedly into the feedback model so that the core topics of a query are emphasized. Experimental results on large-scale TREC web collections show significant improvements over the relevance model. To justify the resampling approach, we examine the relevance density of the feedback documents: a higher relevance density yields greater retrieval accuracy, ultimately approaching true relevance feedback. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy for pseudo-relevance feedback. These results indicate that the proposed method is effective for pseudo-relevance feedback.
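The main idea can be sketched as a toy implementation. Everything here is an illustrative assumption rather than the paper's exact method: documents are term-count dictionaries, overlapping clusters are formed by pairing each top-retrieved document with its nearest neighbours, a document's resampling weight is simply the number of clusters it falls into, and the feedback model mixes document language models by that weight.

```python
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two term-count vectors (dicts)."""
    num = sum(d1[t] * d2[t] for t in set(d1) & set(d2))
    n1 = sum(v * v for v in d1.values()) ** 0.5
    n2 = sum(v * v for v in d2.values()) ** 0.5
    return num / (n1 * n2) if n1 and n2 else 0.0

def cluster_resample_feedback(docs, top_ids, cluster_size=3, num_terms=5):
    """Select feedback terms by resampling dominant documents.

    Builds one overlapping cluster per top-retrieved document (the
    document plus its nearest neighbours); a document's resampling
    weight is the number of clusters it appears in, so documents close
    to the core topic of the retrieved set dominate the feedback model.
    """
    clusters = []
    for d in top_ids:
        neighbours = sorted((i for i in top_ids if i != d),
                            key=lambda i: cosine(docs[d], docs[i]),
                            reverse=True)
        clusters.append({d, *neighbours[:cluster_size - 1]})
    weight = Counter(i for c in clusters for i in c)
    # Feedback model: p(w) proportional to sum_d weight(d) * tf(w, d) / |d|
    model = Counter()
    for d, w in weight.items():
        length = sum(docs[d].values())
        for term, tf in docs[d].items():
            model[term] += w * tf / length
    norm = sum(model.values())
    probs = {t: v / norm for t, v in model.items()}
    terms = [t for t, _ in model.most_common(num_terms)]
    return terms, probs

# Toy retrieval set for the query "jaguar": three documents about the
# car sense and one noisy document about the animal sense.
docs = {
    0: {"jaguar": 2, "car": 3, "speed": 1},
    1: {"jaguar": 1, "car": 2, "engine": 2},
    2: {"jaguar": 2, "cat": 3, "jungle": 1},  # off-topic noise
    3: {"car": 3, "engine": 1, "speed": 2},
}
terms, probs = cluster_resample_feedback(docs, [0, 1, 2, 3])
```

In this toy run the noisy document falls into only one cluster while the on-topic documents overlap across all of them, so the off-topic terms ("cat", "jungle") are down-weighted relative to the dominant car-topic expansion terms.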
