Score distribution models: assumptions, intuition, and robustness to score manipulation

Inferring the score distributions of relevant and non-relevant documents is an essential task for many IR applications (e.g. information filtering, recall-oriented IR, meta-search, distributed IR). Any such inference rests on an accurate model of the score distributions, and numerous models have been proposed in the literature, most of them on the basis of empirical evidence and goodness-of-fit. In this work, we model score distributions in a different, systematic manner. We start with a basic assumption on the distribution of terms in a document. Following the transformations applied to term frequencies by two basic ranking functions, BM25 and Language Models, we derive the distribution of the scores produced for all documents. We then turn to the relevant documents and detach our analysis from particular ranking functions. Instead, we consider a model for precision-recall curves and present a general mathematical framework which, given any score distribution for all retrieved documents, produces an analytical formula for the score distribution of relevant documents that is consistent with precision-recall curves following that model. In particular, assuming a Gamma distribution for all retrieved documents, we show that the derived distribution for the relevant documents resembles a Gaussian distribution with a heavy right tail.
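
The link between a score distribution for all documents and a precision-recall curve can be illustrated numerically. The sketch below is not the framework of the paper; it relies only on the definitional identity that, at a score threshold s, precision equals prior * S_R(s) / S(s) and recall equals S_R(s), where S and S_R are the survival functions of all and of relevant documents and prior is the fraction of relevant documents. The shape-2 Gamma distribution, the precision-recall form 1 / (1 + a r), the parameter values, and all function names are illustrative assumptions.

```python
import math


def gamma2_survival(s, scale=1.0):
    """P(score > s) for a Gamma distribution with shape 2 (closed form).

    Shape 2 is an assumption chosen so that no external library is needed.
    """
    x = s / scale
    return math.exp(-x) * (1.0 + x)


def invert_survival(target, scale=1.0, upper=100.0, iters=200):
    """Find the threshold s with gamma2_survival(s) == target by bisection."""
    lo, hi = 0.0, upper * scale
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gamma2_survival(mid, scale) > target:
            lo = mid  # survival still too large: move the threshold up
        else:
            hi = mid
    return 0.5 * (lo + hi)


def precision_at_recall(r, a=9.0):
    """Hypothetical precision-recall model (an assumption, not the paper's)."""
    return 1.0 / (1.0 + a * r)


def relevant_survival_points(prior=0.01, a=9.0, scale=1.0, steps=20):
    """Return (threshold, recall) pairs, i.e. points on the survival
    function of relevant-document scores implied by the two models."""
    points = []
    for i in range(1, steps + 1):
        r = i / steps
        # Fraction of all documents that must score above the threshold
        # for precision and recall to agree at this operating point.
        retrieved_fraction = prior * r / precision_at_recall(r, a)
        s = invert_survival(retrieved_fraction, scale)
        points.append((s, r))
    return points


points = relevant_survival_points()
```

Each pair in `points` gives a score threshold and the fraction of relevant documents scoring above it, so differencing consecutive recall values yields a discrete approximation of the relevant-score density under these assumptions.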
