Modeling score distributions for combining the outputs of search engines

In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per-query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data not only for probabilistic search engines like INQUERY but also for vector space search engines like SMART, for English. We have also used this model to fit the output of other search engines, such as LSI search engines and search engines indexing other languages like Chinese. It is then shown that, given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics. This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
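
As a rough illustration of the idea described above, the following sketch fits a two-component mixture (exponential for non-relevant documents, normal for relevant documents) to the scores returned for one query, maps a score to a probability of relevance, and averages those probabilities across engines. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the abstract does not say how the mixture is fitted, so expectation-maximization (EM) is assumed here, the initialization is an ad hoc choice, scores are assumed to be non-negative, and the names `fit_mixture`, `relevance_probability`, and `combine_by_averaging` are hypothetical.

```python
import math


def exp_pdf(x, lam):
    """Exponential density, used for the non-relevant documents."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def norm_pdf(x, mu, sigma):
    """Normal density, used for the relevant documents."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def relevance_probability(score, params):
    """Posterior probability that a document with this score is relevant."""
    w, lam, mu, sigma = params
    rel = w * norm_pdf(score, mu, sigma)
    non = (1.0 - w) * exp_pdf(score, lam)
    total = rel + non
    return rel / total if total > 0 else 0.0


def fit_mixture(scores, iters=200):
    """Fit p(s) = (1 - w) * Exp(s; lam) + w * Normal(s; mu, sigma) by EM.

    Assumption: EM with a simple initialization; no relevance labels are used.
    Returns the tuple (w, lam, mu, sigma).
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    top = sorted(scores)[-max(1, n // 10):]          # top decile of scores
    # Initial guess: a small relevant component centred on the top scores.
    params = (0.1, 1.0 / max(mean, 1e-12), sum(top) / len(top),
              max(math.sqrt(var), 1e-3))
    for _ in range(iters):
        # E-step: responsibility of the normal (relevant) component.
        r = [relevance_probability(s, params) for s in scores]
        r_sum = max(sum(r), 1e-12)
        # M-step: re-estimate the mixing weight and component parameters.
        w = min(max(r_sum / n, 1e-6), 1.0 - 1e-6)
        mu = sum(ri * s for ri, s in zip(r, scores)) / r_sum
        sigma = math.sqrt(
            sum(ri * (s - mu) ** 2 for ri, s in zip(r, scores)) / r_sum)
        sigma = max(sigma, 1e-3)                     # avoid a collapsed component
        non_mass = sum((1.0 - ri) * s for ri, s in zip(r, scores))
        lam = (n - r_sum) / max(non_mass, 1e-12)
        params = (w, lam, mu, sigma)
    return params


def combine_by_averaging(probabilities):
    """Combine per-engine relevance probabilities for one document."""
    return sum(probabilities) / len(probabilities)
```

In use, `fit_mixture` would be called once per engine and per query on that engine's returned scores, after which `relevance_probability` puts the engines' scores on a common scale so that `combine_by_averaging` can merge them. Selecting the best engine per query, the other application mentioned above, would need an additional criterion that this sketch does not cover.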
