Generative model-based metasearch for data fusion in information retrieval

"Data fusion" refers to the problem in information retrieval (IR) in which several lists of documents, each ranked against a query, are to be merged into a single ranked list for presentation to a user. Data fusion is also known as "metasearch." In a digital library setting, data fusion may support operations such as federated search based on multiple repository representations. This paper presents a novel approach to the fusion problem: generative model-based metasearch (GeM). We suggest viewing the appearance of documents in a return set as the outcome of a probabilistic process; some documents are likely to occur under the model, while others are unlikely. Using Bayesian parameter estimation to fit a multinomial distribution to the return sets to be merged, GeM produces a final ranking by listing documents in decreasing order of their probability of generation under the induced model. We also introduce what we call the "impatient reader" approach to normalizing document ranks in service of the fusion operation. We report results from several experiments on TREC data suggesting that GeM, informed by impatient reader document scores, operates at state-of-the-art levels of effectiveness.
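
The fusion step described above can be illustrated with a minimal sketch. It is not the paper's implementation, and several pieces are assumptions made only for illustration: the function name `fuse`, the symmetric Dirichlet prior with hyperparameter `alpha`, the use of the posterior mean as the Bayesian estimate, and the reciprocal-rank score, which stands in as a placeholder for the impatient reader scores (their definition is not given in this abstract).

```python
from collections import defaultdict


def fuse(ranked_lists, alpha=1.0):
    """Merge ranked lists by fitting a multinomial over documents.

    Assumptions (not taken from the paper): a symmetric Dirichlet(alpha)
    prior, the posterior-mean estimate, and reciprocal rank as a
    placeholder for the "impatient reader" document score.
    """
    pseudo_counts = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            # Placeholder rank normalization: earlier ranks count for more.
            pseudo_counts[doc] += 1.0 / rank

    vocab_size = len(pseudo_counts)
    total = sum(pseudo_counts.values())

    # Posterior mean of a multinomial under a symmetric Dirichlet prior.
    probs = {
        doc: (count + alpha) / (total + alpha * vocab_size)
        for doc, count in pseudo_counts.items()
    }

    # Final ranking: decreasing probability of generation; ties broken by id.
    ranking = sorted(probs, key=lambda d: (-probs[d], d))
    return ranking, probs


# Three hypothetical return sets for one query.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
fused, probs = fuse(runs)
print(fused)
```

In this toy input, documents that appear near the top of several lists accumulate more pseudo-count mass and therefore receive a higher probability under the fitted model. The prior keeps every observed document's probability strictly positive.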
