A Study of Retrieval Models for Long Documents and Queries in Information Retrieval

Recent research has shown that long documents are unfairly penalised by a number of current retrieval methods. In this paper, we formally analyse two important but distinct reasons for normalising documents with respect to length, namely verbosity and scope, and discuss the practical implications of failing to normalise for each. We review a number of language modelling approaches and a range of recently developed retrieval methods, and show that most do not correctly model both phenomena, which limits their retrieval effectiveness in certain situations. Furthermore, the retrieval characteristics of long natural language queries have not traditionally received the same attention as those of short keyword queries. We develop a new discriminative query language modelling approach that improves performance on long verbose queries by appropriately weighting the salient aspects of the query. When combined with query expansion, we show that our new approach yields state-of-the-art performance for long verbose queries.
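
For readers unfamiliar with document-length normalisation, the sketch below illustrates pivoted term-frequency normalisation in the spirit of Singhal et al.'s pivoted document length normalisation. It is a minimal background sketch, not the model developed in this paper; the function name, signature, and default slope value are illustrative choices.

    def pivoted_tf(tf: int, doc_len: int, avg_doc_len: float, s: float = 0.2) -> float:
        """Pivoted term-frequency normalisation (background sketch).

        Raw term frequency is divided by a length factor that equals 1.0
        for a document of average length, so longer-than-average documents
        are penalised and shorter ones are boosted. The slope s controls
        how strongly length influences the score (s = 0 disables
        normalisation entirely).
        """
        length_factor = (1.0 - s) + s * (doc_len / avg_doc_len)
        return tf / length_factor

    # A verbose document repeats terms, so raw tf grows with length even
    # though the information content is unchanged; dividing by the length
    # factor corrects for this.
    print(pivoted_tf(tf=4, doc_len=1000, avg_doc_len=500.0))  # ~3.33, penalised
    print(pivoted_tf(tf=4, doc_len=250, avg_doc_len=500.0))   # ~4.44, boosted

A uniform correction of this kind suits verbosity (the same content stated at greater length); a document that is long because of broad scope instead covers more distinct material, and, as the abstract argues, treating the two cases identically is one source of the unfair penalty on long documents.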
