From Retrieval Status Values to Probabilities of Relevance for Advanced IR Applications

Information retrieval systems typically sort results by document retrieval status values (RSVs). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs increase monotonically with the probabilities of relevance (as, e.g., for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a normalisation function (a “mapping function”) which maps the RSV onto the probability of relevance. In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (merging only, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
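To make the two kinds of mapping function concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes the linear mapping has the form c0 + c1 * RSV (clipped to [0, 1]) and the logistic mapping has the form 1 / (1 + exp(-(b0 + b1 * RSV))). The parameter names, the fitting procedures (least squares and plain gradient ascent) and the toy data are all illustrative assumptions.

```python
import math


def linear_mapping(rsv, c0, c1):
    """Assumed linear form c0 + c1 * rsv, clipped to [0, 1]."""
    return min(1.0, max(0.0, c0 + c1 * rsv))


def logistic_mapping(rsv, b0, b1):
    """Assumed logistic form, computed in a numerically stable way."""
    z = b0 + b1 * rsv
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_linear(rsvs, labels):
    """Ordinary least-squares fit of binary relevance labels against RSVs."""
    n = len(rsvs)
    mean_x = sum(rsvs) / n
    mean_y = sum(labels) / n
    var_x = sum((x - mean_x) ** 2 for x in rsvs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rsvs, labels))
    c1 = cov_xy / var_x
    c0 = mean_y - c1 * mean_x
    return c0, c1


def fit_logistic(rsvs, labels, steps=5000, lr=0.5):
    """Maximum-likelihood fit by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(rsvs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(rsvs, labels):
            err = y - logistic_mapping(x, b0, b1)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical toy data: RSVs with binary relevance judgements (1 = relevant).
rsvs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1]

c0, c1 = fit_linear(rsvs, labels)
b0, b1 = fit_logistic(rsvs, labels)

# Estimated probabilities of relevance for each RSV under both mappings.
linear_probs = [linear_mapping(x, c0, c1) for x in rsvs]
logistic_probs = [logistic_mapping(x, b0, b1) for x in rsvs]
```

In this sketch both fitted mappings are increasing in the RSV, so they leave the ranking unchanged. What they add is a probability estimate that can be compared across collections, which is what merging in distributed retrieval needs.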
