Global Statistics in Proximity Weighting Models

Information retrieval systems often use proximity or term dependence models to increase the effectiveness of document retrieval. Many existing proximity models examine document-level local statistics, such as the frequency with which pairs of query terms occur within fixed-size windows of each document, before applying standard or adapted weighting functions, for instance Markov Random Fields. Term weighting models use Inverse Document Frequency (IDF) to control the influence of occurrences of different query terms in documents. Similarly, some proximity models also take into account the frequency of pairs of query terms in the entire corpus of documents. However, pair frequency is an expensive statistic to pre-compute at indexing time, or to compute at retrieval time before scoring documents. In this work, we examine, in a uniform setting, the importance of such global statistics for proximity weighting. We investigate two sources of global statistics, namely the target corpus and the entire Web. Experiments are conducted using the TREC GOV2 and ClueWeb09 test collections. Our results show that local statistics alone are sufficient for effective retrieval, and that global statistics usually do not bring any significant improvement in effectiveness compared to the same proximity approaches without them.
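To make the distinction between local and global statistics concrete, the following is a minimal sketch of how a pair frequency might be counted. It assumes sliding windows over a tokenized document and counts each window that contains both terms; the function names, the default window size, and this particular window definition are illustrative assumptions and are not taken from the models evaluated in this work.

```python
from typing import Iterable, Sequence


def window_pair_frequency(
    tokens: Sequence[str], t1: str, t2: str, window_size: int = 5
) -> int:
    """Local statistic: number of sliding windows of `window_size` tokens
    in one document that contain both t1 and t2 (hypothetical definition)."""
    if window_size < 2 or len(tokens) == 0:
        return 0
    # A document shorter than the window is treated as a single window.
    n_windows = max(len(tokens) - window_size + 1, 1)
    count = 0
    for start in range(n_windows):
        window = tokens[start:start + window_size]
        if t1 in window and t2 in window:
            count += 1
    return count


def corpus_pair_frequency(
    corpus: Iterable[Sequence[str]], t1: str, t2: str, window_size: int = 5
) -> int:
    """Global statistic: the same count summed over every document.
    This requires a pass over the whole corpus for each term pair,
    which illustrates why it is costly to obtain."""
    return sum(window_pair_frequency(doc, t1, t2, window_size) for doc in corpus)
```

The local count needs only the document being scored, whereas the global count touches every document for each query term pair, which is the cost the abstract refers to.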
