A systematic approach to normalization in probabilistic models

Every information retrieval (IR) model embeds a form of term frequency (TF) quantification in its scoring function. The contribution of the term frequency is determined by two things: the properties of the function chosen for TF quantification, and its TF normalization. The former defines how independent the occurrences of multiple terms are, while the latter mitigates the a priori probability of observing a high term frequency in a document, an estimate usually based on document length. New test collections from different domains (e.g., medical, legal) provide evidence that not only document length but also document verboseness should be considered explicitly. We therefore propose and investigate a systematic combination of document verboseness and length. To justify the combination theoretically, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections, across a well-defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. In our experiments, the new models never underperform existing models and sometimes yield statistically significant improvements, at no additional computational cost.
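To make the two ingredients concrete, the following minimal sketch separates a saturating, BM25-style TF quantification from the normalizer that is plugged into it. It is an illustration under stated assumptions, not the formulation proposed in this work: verboseness is assumed to be document length divided by the number of distinct terms (the abstract does not define it), the pivoted form and the parameters `k1` and `b` follow the common BM25 convention, and all function names are hypothetical.

```python
def pivoted_norm(value, average, b=0.75):
    """Pivoted normalizer: 1.0 for an average document, larger above average."""
    return (1.0 - b) + b * (value / average)


def verboseness(doc_len, n_distinct_terms):
    """Assumed definition: average number of repetitions per distinct term."""
    return doc_len / n_distinct_terms


def bm25_tf(tf, norm, k1=1.2):
    """Saturating TF quantification; a larger normalizer lowers the weight."""
    return tf / (tf + k1 * norm)


# Two hypothetical documents of equal length but different verboseness.
avg_len, avg_verb = 200.0, 4.0
doc_a = {"len": 200, "distinct": 100}  # verboseness 2.0
doc_b = {"len": 200, "distinct": 20}   # verboseness 10.0

# Length normalization cannot tell the two documents apart ...
len_norm_a = pivoted_norm(doc_a["len"], avg_len)
len_norm_b = pivoted_norm(doc_b["len"], avg_len)

# ... whereas a verboseness-based normalizer can.
verb_norm_a = pivoted_norm(verboseness(doc_a["len"], doc_a["distinct"]), avg_verb)
verb_norm_b = pivoted_norm(verboseness(doc_b["len"], doc_b["distinct"]), avg_verb)

# The same raw term frequency is then weighted differently in each document.
weight_a = bm25_tf(4, verb_norm_a)
weight_b = bm25_tf(4, verb_norm_b)
```

In this toy setting both documents receive the same length normalizer, so length alone treats a term frequency of 4 identically in each. The verboseness normalizer discounts the more repetitive document, which is the kind of signal the abstract argues should be considered alongside length. How the two normalizers are best combined is deliberately left out of the sketch.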
