Query dependent pseudo-relevance feedback based on Wikipedia

Pseudo-relevance feedback (PRF) via query expansion has proven effective in many information retrieval (IR) tasks. In most existing work, the top-ranked documents from an initial search are assumed to be relevant and are used for PRF. One problem with this approach is that one or more of the top retrieved documents may be non-relevant, which can introduce noise into the feedback process. Moreover, existing methods generally do not take into account the significantly different types of queries that are often entered into an IR system. Intuitively, Wikipedia can be seen as a large, manually edited document collection that could be exploited to improve document retrieval effectiveness within PRF. It is not obvious how best to utilize information from Wikipedia in PRF, and to date the potential of Wikipedia for this task has been largely unexplored. In our work, we present a systematic exploration of the utilization of Wikipedia in PRF for query-dependent expansion. Specifically, we classify TREC topics into three categories based on Wikipedia: 1) entity queries, 2) ambiguous queries, and 3) broader queries. We propose and study the effectiveness of three methods for expansion term selection, each modeling the Wikipedia-based pseudo-relevance information from a different perspective. We incorporate the expansion terms into the original query and use language modeling IR to evaluate these methods. Experiments on four TREC test collections, including the large web collection GOV2, show that retrieval performance for each type of query can be improved. In addition, we demonstrate that the proposed method outperforms the baseline relevance model in terms of precision and robustness.
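To make the general idea of PRF-based query expansion concrete, the following is a minimal Python sketch under stated assumptions. It is not one of the three term selection methods studied in this work, whose details the abstract does not give. It assumes that the feedback documents are already tokenized, scores candidate terms by their average relative frequency, and interpolates them with the original query. The function name `expand_query` and the parameters `num_terms` and `orig_weight` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def expand_query(query_terms, feedback_docs, num_terms=5, orig_weight=0.5):
    """Return a dict of term weights for an expanded query.

    query_terms: list of distinct query tokens (assumed non-empty).
    feedback_docs: list of token lists that are *assumed* relevant,
        e.g. top-ranked documents or, hypothetically, Wikipedia articles.
    """
    # Score each term by its relative frequency averaged over feedback docs.
    scores = Counter()
    for doc in feedback_docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            scores[term] += count / len(doc) / len(feedback_docs)

    # Keep the best-scoring terms that are not already in the query.
    candidates = [(t, s) for t, s in scores.items() if t not in query_terms]
    candidates.sort(key=lambda ts: (-ts[1], ts[0]))
    top = candidates[:num_terms]
    total = sum(s for _, s in top)

    # Without usable expansion terms, fall back to the original query.
    if total == 0:
        return {t: 1.0 / len(query_terms) for t in query_terms}

    # Original terms share orig_weight; expansion terms share the remainder.
    weighted = {t: orig_weight / len(query_terms) for t in query_terms}
    for term, score in top:
        weighted[term] = (1 - orig_weight) * score / total
    return weighted
```

Calling `expand_query(["jaguar"], docs, num_terms=2)` on two short feedback documents about cars would keep `jaguar` at weight 0.5 and split the remaining 0.5 between the two most frequent non-query terms. A noisy feedback set shifts that remainder toward off-topic terms, which is the failure mode the abstract describes.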
