Probabilistic data fusion on a large document collection

Data fusion is the process of combining the outputs of a number of Information Retrieval (IR) algorithms into a single result set in order to achieve greater retrieval performance. ProbFuse is a data fusion algorithm that uses the history of the underlying IR algorithms to estimate the probability that subsequent result sets include relevant documents in particular positions. In a number of previous experiments it has been shown to outperform CombMNZ, the standard baseline against which data fusion algorithms are compared. This paper builds upon that work and applies probFuse to the much larger Web Track document collection from the 2004 Text REtrieval Conference (TREC). The performance of probFuse is compared against that of CombMNZ using a number of evaluation measures, and probFuse is shown to achieve substantial performance improvements.
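For readers unfamiliar with the CombMNZ baseline named above, the following is a minimal sketch of its usual formulation: each document's normalized scores are summed across the input result sets, and the sum is multiplied by the number of result sets that returned the document. The function names, the dictionary-based input format, and the choice of min-max normalization are assumptions made for illustration; the abstract does not specify how the experiments implemented it.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every returned document as top-scored.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(result_sets):
    """Fuse result sets given as {doc_id: raw_score} dicts, one per system.

    Returns (doc_id, fused_score) pairs, best first. Ties are broken by
    doc_id so that the output order is deterministic.
    """
    totals = defaultdict(float)  # sum of normalized scores per document
    hits = defaultdict(int)      # number of systems that returned the document
    for scores in result_sets:
        if not scores:
            continue  # a system that returned nothing contributes nothing
        for doc, s in min_max_normalize(scores).items():
            totals[doc] += s
            hits[doc] += 1
    fused = {doc: totals[doc] * hits[doc] for doc in totals}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from two systems for the same query.
system_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
system_b = {"d3": 10.0, "d4": 5.0}
ranked = comb_mnz([system_a, system_b])
```

In this toy input, "d3" is the only document returned by both systems, so its summed score is doubled and it ranks first. This reward for agreement between systems relies only on scores, whereas probFuse, as described above, relies on historical evidence about the positions at which each system tends to return relevant documents.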
