Lessons Learned From Indexing Close Word Pairs

Abstract: We describe experiments with proximity-aware ranking functions that rely on an index of word pairs. Our goal is to evaluate a method of "mild" pruning of proximity information that would suit a moderately loaded retrieval system, e.g., an enterprise search engine. We create an index that stores occurrences of close word pairs in which one of the words is frequent. This index makes it possible to efficiently restore relative positional information for all non-stop words within a certain distance of each other, and to answer phrase queries promptly. We evaluate relevance with two functions: a modification of a classic proximity-aware function, and a logistic function applied to a linear combination of relevance features. Additionally, we use the spam scores provided by the University of Waterloo.
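
To make the idea of a close-word-pair index concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the index format used in the experiments: the names `build_pair_index` and `phrase_docs`, the `max_dist` parameter, and the in-memory dictionary layout are all hypothetical, and "one of the words is frequent" is read here as "at least one word of the pair is in a given set of frequent words".

```python
from collections import defaultdict


def build_pair_index(docs, frequent, max_dist=3):
    """Index ordered word pairs that occur within max_dist positions.

    docs:     dict mapping a document id to its list of tokens.
    frequent: set of words treated as frequent (assumed to be given).
    A pair is stored only if at least one of its words is frequent
    (an assumption made for this sketch).
    Returns a dict mapping (w1, w2) to a list of (doc_id, pos1, pos2).
    """
    index = defaultdict(list)
    for doc_id, toks in docs.items():
        for i, w1 in enumerate(toks):
            # Look only at the next max_dist positions after i.
            last = min(i + max_dist, len(toks) - 1)
            for j in range(i + 1, last + 1):
                w2 = toks[j]
                if w1 in frequent or w2 in frequent:
                    index[(w1, w2)].append((doc_id, i, j))
    return dict(index)


def phrase_docs(index, w1, w2):
    """Return ids of documents where w2 immediately follows w1."""
    return {doc_id for doc_id, i, j in index.get((w1, w2), []) if j - i == 1}


docs = {
    1: "the cat sat on the mat".split(),
    2: "cat on mat".split(),
}
index = build_pair_index(docs, frequent={"the", "on"}, max_dist=2)
print(index[("cat", "on")])
print(phrase_docs(index, "cat", "on"))
```

Because each posting keeps both positions, the relative offset between the two words can be recovered directly from the pair entry, and a two-word phrase query reduces to checking for an offset of one. A production index would store compressed posting lists rather than Python tuples.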
