A Study of Retrieval Models for Long Documents and Queries in Information Retrieval

Recent research has shown that long documents are unfairly penalised by a number of current retrieval methods. In this paper, we formally analyse two important but distinct reasons for normalising documents with respect to length, namely verbosity and scope, and discuss the practical implications of failing to normalise for each. We review a number of language modelling approaches and a range of recently developed retrieval methods, and show that most do not correctly model both phenomena, which limits their retrieval effectiveness in certain situations. Furthermore, the retrieval characteristics of long natural language queries have not traditionally received the same attention as those of short keyword queries. We develop a new discriminative query language modelling approach that improves performance on long verbose queries by appropriately weighting the salient aspects of the query. When combined with query expansion, we show that our new approach yields state-of-the-art performance for long verbose queries.
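
For readers unfamiliar with document-length normalisation, the sketch below illustrates pivoted term-frequency normalisation in the spirit of Singhal et al.'s pivoted document length normalisation. It is a minimal background sketch, not the model developed in this paper; the function name, signature, and default slope value are illustrative choices.

    def pivoted_tf(tf: int, doc_len: int, avg_doc_len: float, s: float = 0.2) -> float:
        """Pivoted term-frequency normalisation (background sketch).

        Raw term frequency is divided by a length factor that equals 1.0
        for a document of average length, so longer-than-average documents
        are penalised and shorter ones are boosted. The slope s controls
        how strongly length influences the score (s = 0 disables
        normalisation entirely).
        """
        length_factor = (1.0 - s) + s * (doc_len / avg_doc_len)
        return tf / length_factor

    # A verbose document repeats terms, so raw tf grows with length even
    # though the information content is unchanged; dividing by the length
    # factor corrects for this.
    print(pivoted_tf(tf=4, doc_len=1000, avg_doc_len=500.0))  # ~3.33, penalised
    print(pivoted_tf(tf=4, doc_len=250, avg_doc_len=500.0))   # ~4.44, boosted

A uniform correction of this kind suits verbosity (the same content stated at greater length); a document that is long because of broad scope instead covers more distinct material, and, as the abstract argues, treating the two cases identically is one source of the unfair penalty on long documents.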
