Paper Information - Global Statistics in Proximity Weighting Models

Global Statistics in Proximity Weighting Models

Information retrieval systems often use proximity or term dependence models to increase the effectiveness of document retrieval. Many of the existing proximity models examine document-level local statistics, such as the frequencies with which pairs of query terms occur within fixed-size windows of each document, before applying standard or adapted weighting functions, for instance Markov Random Fields. Term weighting models use Inverse Document Frequency (IDF) to control the influence of occurrences of different query terms in documents. Similarly, some proximity models also take into account the frequency of pairs of query terms in the entire corpus of documents. However, pair frequency is an expensive statistic to pre-compute at indexing time, or to compute at retrieval time before scoring documents. In this work, we examine, in a uniform setting, the importance of such global statistics for proximity weighting. We investigate two sources of global statistics, namely the target corpus and the entire Web. Experiments are conducted using the TREC GOV2 and ClueWeb09 test collections. Our results show that local statistics alone are sufficient for effective retrieval, and that global statistics usually do not bring any significant improvement in effectiveness compared to the same proximity approaches without these global statistics.

Iadh Ounis | Craig Macdonald

[1] Xiaolong Li, et al. An Overview of Microsoft Web N-gram Corpus and Applications, 2010, NAACL.

[2] Krysta Marie Svore, et al. How good is a span of terms?: exploiting proximity to improve web retrieval, 2010, SIGIR.

[3] Iadh Ounis, et al. University of Glasgow at TREC 2006: Experiments in Terabyte and Enterprise Tracks with Terrier, 2006, TREC.

[4] Iadh Ounis, et al. Multinomial Randomness Models for Retrieval with Document Fields, 2007, ECIR.

[5] W. Bruce Croft, et al. A Markov random field model for term dependencies, 2005, SIGIR '05.

[6] Gianni Amati, et al. Probability models for information retrieval based on divergence from randomness, 2003.

[7] Andrei Z. Broder, et al. Efficient query evaluation using a two-level retrieval process, 2003, CIKM '03.

[8] Claudio Carpineto, et al. Italian Monolingual Information Retrieval with PROSIT, 2002, CLEF.

[9] Howard R. Turtle, et al. Query Evaluation: Strategies and Optimizations, 1995, Inf. Process. Manag.

[10] Stephen E. Robertson, et al. Okapi at TREC-3, 1994, TREC.

[11] Nenghai Yu, et al. Can phrase indexing help to process non-phrase queries?, 2008, CIKM '08.