Query Aspect Based Term Weighting Regularization in Information Retrieval

Traditional retrieval models assume that query terms are independent and rank documents primarily using term weighting strategies such as TF-IDF and document length normalization. However, query terms are often related, and groups of semantically related query terms may form query aspects. Intuitively, the relations among query terms could be used to identify hidden query aspects and to promote documents that cover more of those aspects. Despite its importance, the use of semantic relations among query terms for term weighting regularization has been under-explored in information retrieval. In this paper, we study how to incorporate query term relations into existing retrieval models, focusing on one challenge: how to regularize the weights of terms in different query aspects to improve retrieval performance. Specifically, we first develop a general strategy that systematically integrates a term weighting regularization function into existing retrieval functions, and then propose two specific regularization functions guided by constraint analysis. Experiments on eight standard TREC data sets show that the proposed methods improve retrieval accuracy.
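
The idea of regularizing term weights by query aspect can be illustrated with a minimal sketch. The abstract does not specify the regularization functions, so everything below is an assumption made for illustration only: the aspect grouping is supplied by hand, the base weight is a simple log-TF times IDF, and the hypothetical `aspect_regularizer` divides each term's weight by the size of its aspect so that every aspect contributes roughly equally. It is not the method proposed in the paper.

```python
import math
from collections import Counter

def idf(term, docs):
    # Smoothed inverse document frequency over a tiny in-memory collection.
    df = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (df + 0.5))

def aspect_regularizer(term, aspects):
    # Hypothetical regularizer (an assumption, not from the paper):
    # scale a term's weight by 1 / (number of terms in its aspect).
    for aspect in aspects:
        if term in aspect:
            return 1.0 / len(aspect)
    return 1.0

def score(doc, aspects, docs, regularize=True):
    # Sum log-TF * IDF over matched query terms, optionally regularized.
    tf = Counter(doc)
    total = 0.0
    for aspect in aspects:
        for term in aspect:
            if tf[term] == 0:
                continue
            weight = (1.0 + math.log(tf[term])) * idf(term, docs)
            if regularize:
                weight *= aspect_regularizer(term, aspects)
            total += weight
    return total

# Hand-built query aspects (assumed): {car, automobile} and {theft}.
aspects = [["car", "automobile"], ["theft"]]

d1 = ["car", "automobile", "dealer"]   # two terms, but only one aspect
d2 = ["car", "theft", "report"]        # one term from each aspect
d3 = ["weather", "report"]             # no query terms
docs = [d1, d2, d3]

plain_d1 = score(d1, aspects, docs, regularize=False)
plain_d2 = score(d2, aspects, docs, regularize=False)
reg_d1 = score(d1, aspects, docs)
reg_d2 = score(d2, aspects, docs)
```

In this toy collection the unregularized scores of `d1` and `d2` tie, because each document matches two query terms with the same weights. With the assumed regularizer, `d2` scores higher because it covers both aspects, which is the behaviour the abstract motivates.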
