Predicting the effectiveness of naïve data fusion on the basis of system characteristics

Effective automation of the information retrieval task has long been an active area of research, leading to sophisticated retrieval models. With many IR schemes available, researchers have begun to investigate the benefits of combining the results of different IR schemes to improve performance, a process called “data fusion.” Many successful data fusion experiments are reported in the IR literature, but there are also cases in which fusion did not work well. Thus, it would be quite valuable to have a theory that can predict, in advance, whether fusion of two or more retrieval schemes will be worth doing. In a previous study (Ng & Kantor, 1998), we identified two predictive variables for the effectiveness of fusion: (a) a list‐based measure of output dissimilarity, and (b) a pair‐wise measure of the similarity of performance of the two schemes. In this article we investigate the predictive power of these two variables in simple symmetrical data fusion. We use the IR systems participating in the TREC 4 routing task to train a model that predicts the effectiveness of data fusion, and use the IR systems participating in the TREC 5 routing task to test that model. The model asks, “when will fusion perform better than an oracle who uses the best scheme from each pair?” We explore statistical techniques for fitting the model to the training data and use the receiver operating characteristic (ROC) curve of signal detection theory to represent the power of the resulting models. The trained prediction methods predict whether fusion will beat an oracle at levels much higher than could be achieved by chance.
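
As a rough illustration of the setup described above, the following Python sketch computes two pair-level predictors and an ROC curve for a score that predicts whether fusion beats the better scheme of a pair. The abstract does not define the two measures, so everything here is an assumption made for illustration: `output_dissimilarity` uses one minus the Jaccard overlap of the top-k documents, `performance_ratio` uses the worse score divided by the better score, and the function names and inputs are hypothetical. It is not the authors' method.

```python
def output_dissimilarity(ranked_a, ranked_b, k=10):
    """Assumed list-based measure: 1 - Jaccard overlap of the top-k documents."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return 1.0 - len(top_a & top_b) / len(union)


def performance_ratio(perf_a, perf_b):
    """Assumed pair-wise measure: worse score / better score, in [0, 1]."""
    better = max(perf_a, perf_b)
    return min(perf_a, perf_b) / better if better > 0 else 1.0


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: predicted propensity that fusion beats the oracle, one per pair.
    labels: 1 if fusion actually beat the oracle for that pair, else 0.
    Requires at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score downward.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points; 0.5 corresponds to chance."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

In a setup like the one in the abstract, the scores would come from a model fitted on TREC 4 pairs and the ROC curve would be drawn from TREC 5 pairs; an area well above 0.5 would indicate prediction better than chance.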
