Viewing Term Proximity from a Different Perspective

This paper extends the state-of-the-art probabilistic model BM25 to utilize term proximity from a new perspective. Most previous work only consider dependencies between pairs of terms, and regard phrases as additional independent evidence. It is difficult to estimate the importance of a phrase and its extra contribution to a relevance score, as the phrase actually overlaps with the component terms. This paper proposes a new approach. First, query terms are grouped locally into non-overlapping phrases that may contain one or more query terms. Second, these phrases are not scored independently but are instead treated as providing a context for the component query terms. The relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are. Third, we replace term frequency by the accumulated relevance contribution. Consequently, term proximity is easily integrated into the probabilistic model. Experimental results on TREC-10 and TREC-11 collections show stable improvements in terms of average precision and significant improvements in terms of top precisions.

[1]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[2]  Van Rijsbergen,et al.  A theoretical basis for the use of co-occurence data in information retrieval , 1977 .

[3]  C. J. van Rijsbergen,et al.  An Evaluation of feedback in Document Retrieval using Co‐Occurrence Data , 1978, J. Documentation.

[4]  Gerard Salton,et al.  Research and Development in Information Retrieval , 1982, Lecture Notes in Computer Science.

[5]  G. Salton,et al.  A Generalized Term Dependence Model in Information Retrieval , 1983 .

[6]  W. Bruce Croft Boolean Queries and Term Dependencies in Probabilistic Retrieval Models. , 1986 .

[7]  Joel L. Fagan,et al.  Automatic Phrase Indexing for Document Retrieval: An Examination of Syntactic and Non-Syntactic Methods , 1987, SIGIR.

[8]  Christopher J. Fox,et al.  A stop list for general text , 1989, SIGF.

[9]  W. Bruce Croft,et al.  The use of phrases and structured queries in information retrieval , 1991, SIGIR '91.

[10]  Robert M. Losee Term Dependence: Truncating the Bahadur Lazarsfeld Expansion , 1994, Inf. Process. Manag..

[11]  David Hawking,et al.  Proximity Operators - So Near And Yet So Far , 1995, TREC.

[12]  Charles L. A. Clarke,et al.  Shortest Substring Ranking (MultiText Experiments for TREC-4) , 1995, TREC.

[13]  Donna K. Harman,et al.  Overview of the Fourth Text REtrieval Conference (TREC-4) , 1995, TREC.

[14]  David Hawking,et al.  Relevance weighting using distance between term occurrences , 1996 .

[15]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[16]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[17]  W. Bruce Croft,et al.  A general language model for information retrieval , 1999, CIKM '99.

[18]  Stephen E. Robertson,et al.  Experimentation as a way of life: Okapi at TREC , 2000, Inf. Process. Manag..

[19]  Charles L. A. Clarke,et al.  Relevance ranking for one to three term queries , 1997, Inf. Process. Manag..

[20]  Amanda Spink,et al.  Searching the Web: the public and their queries , 2001 .

[21]  James Allan,et al.  Capturing term dependencies using a language model based on sentence trees , 2002, CIKM '02.

[22]  Rohini K. Srihari,et al.  Biterm language models for document retrieval , 2002, SIGIR '02.

[23]  Jacques Savoy,et al.  Term Proximity Scoring for Keyword-Based Retrieval Systems , 2003, ECIR.

[24]  Jianfeng Gao,et al.  Dependence language model for information retrieval , 2004, SIGIR '04.

[25]  Gilad Mishne,et al.  Boosting Web Retrieval through Query Operations , 2005, BNAIC.

[26]  W. Bruce Croft,et al.  A Markov random field model for term dependencies , 2005, SIGIR '05.

[27]  Charles L. A. Clarke,et al.  Term proximity scoring for ad-hoc retrieval on very large text collections , 2006, SIGIR.

[28]  Peter Ingwersen,et al.  Developing a Test Collection for the Evaluation of Integrated Search , 2010, ECIR.