A constraint to automatically regulate document-length normalisation

Retrieval functions in information retrieval (IR) are fundamental to the effectiveness of search systems. However, considerable parameter tuning is often needed to achieve good retrieval effectiveness. Document length normalisation is one such aspect: for many retrieval functions it requires tuning on a per-query and per-collection basis. In this paper, we develop an approach that regulates the level of normalisation applied on a per-query basis. We formally describe the interaction between query terms and document length normalisation using a constraint. We then develop a general pre-retrieval approach that adapts a number of state-of-the-art ranking functions so that they adhere to the constraint. Finally, we empirically demonstrate, on a number of datasets, that the adapted retrieval functions outperform default versions of the original retrieval functions and perform at least comparably to tuned versions of the original functions. In essence, the approach regulates the normalisation parameter of these retrieval functions on a per-query basis in a principled manner.
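To make concrete what a normalisation parameter is, the following minimal sketch shows the well-known BM25 term score, in which the parameter b controls how strongly document length is normalised. It is an illustration only and not the constraint or the adaptation developed in this paper. The function name `bm25_term_score`, the defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant are assumptions chosen for the example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Score a single query term for a single document with BM25.

    Hypothetical helper for illustration; b is the length-normalisation
    parameter (b = 0 ignores document length, b = 1 normalises fully).
    """
    # Length normalisation factor: equals 1.0 when the document has average length.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Non-negative IDF variant (assumption; other variants exist).
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Same term frequency in a short and a long document, under three values of b.
for b in (0.0, 0.5, 1.0):
    short = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, df=50, num_docs=10000, b=b)
    long_ = bm25_term_score(tf=3, doc_len=400, avg_doc_len=200, df=50, num_docs=10000, b=b)
    print(f"b={b}: short={short:.3f} long={long_:.3f}")
```

With b = 0 the two documents receive the same score, and as b grows the longer document is penalised more heavily. A fixed default such as b = 0.75 therefore applies one level of normalisation to every query, which is the setting that per-query regulation aims to improve on.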
