Predicting the effectiveness of Naïve data fusion on the basis of system characteristics

Effective automation of the information retrieval task has long been an active area of research, leading to sophisticated retrieval models. With many IR schemes available, researchers have begun to investigate the benefits of combining the results of different IR schemes to improve performance, in a process called “data fusion.” Many successful data fusion experiments are reported in the IR literature, but there are also cases in which fusion did not work well. Thus, it would be quite valuable to have a theory that can predict, in advance, whether fusion of two or more retrieval schemes will be worth doing. In a previous study (Ng & Kantor, 1998), we identified two predictive variables for the effectiveness of fusion: (a) a list-based measure of output dissimilarity, and (b) a pair-wise measure of the similarity of performance of the two schemes. In this article we investigate the predictive power of these two variables in simple symmetrical data fusion. We use the IR systems participating in the TREC 4 routing task to train a model that predicts the effectiveness of data fusion, and use the IR systems participating in the TREC 5 routing task to test that model. The model asks, “when will fusion perform better than an oracle who uses the best scheme from each pair?” We explore statistical techniques for fitting the model to the training data and use the receiver operating characteristic curve of signal detection theory to represent the power of the resulting models. The trained prediction methods predict whether fusion will beat an oracle at levels much higher than could be achieved by chance.
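The abstract does not give formulas for the two predictive variables, so the following Python sketch uses illustrative stand-ins only: a top-k overlap measure for output dissimilarity, a min/max ratio for similarity of performance, and a plain threshold sweep for the receiver operating characteristic curve. The function names, the choice of k, and both formulas are assumptions made for illustration and are not taken from the paper.

```python
def output_dissimilarity(ranked_a, ranked_b, k):
    """Assumed stand-in: 1 minus the fraction of documents shared by
    the top-k of two ranked lists. 0 = identical sets, 1 = disjoint."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    return 1.0 - len(top_a & top_b) / k


def performance_ratio(perf_a, perf_b):
    """Assumed stand-in: weaker score divided by stronger score.
    Values near 1 mean the two schemes perform similarly."""
    low, high = sorted((perf_a, perf_b))
    return low / high if high > 0 else 0.0


def roc_points(scores, labels):
    """Sweep a decision threshold from high to low over predictor scores.

    labels[i] is 1 if fusion actually beat the oracle for pair i, else 0.
    Returns (false positive rate, true positive rate) points.
    Requires at least one positive and one negative label; tied scores
    are processed in sort order rather than grouped.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Toy data (hypothetical document ids and scores, not TREC results).
dissim = output_dissimilarity(["d1", "d2", "d3", "d4"],
                              ["d3", "d4", "d5", "d6"], k=4)
ratio = performance_ratio(0.2, 0.4)
curve = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

In a setup like the one described, the two variables would be computed for each pair of systems in the training collection, a fitted model would turn them into a score, and the curve would be drawn from those scores on the test collection. The fitting step is omitted here because the abstract does not specify it.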

Paul B. Kantor | Kwong Bor Ng

[1] John G. Kemeny, et al. Finite Markov chains, 1960.

[2] M. Kendall. Rank Correlation Methods, 1949.

[3] Donna K. Harman, et al. Overview of the Fifth Text REtrieval Conference (TREC-5), 1996, TREC.

[4] Garrison W. Cottrell, et al. Automatic combination of multiple ranked retrieval systems, 1994, SIGIR '94.

[5] Edward A. Fox, et al. Combination of Multiple Searches, 1993, TREC.

[6] Donna K. Harman, et al. Overview of the Fourth Text REtrieval Conference (TREC-4), 1995, TREC.

[7] Paul B. Kantor, et al. Two Experiments on Retrieval With Corrupted Data and Clean Queries in the TREC-4 Adhoc Task Environment: Data Fusion and Pattern Scanning, 1995, TREC.

[8] Garrison W. Cottrell, et al. Latent semantic indexing is an optimal special case of multidimensional scaling, 1992, SIGIR '92.

[9] John A. Swets, et al. Evaluation of diagnostic systems: methods from signal detection theory, 1982.

[10] Cyril Cleverdon, et al. The Cranfield tests on index language devices, 1997.

[11] Kwong Bor Ng, et al. An investigation of the conditions for effective data fusion in information retrieval, 1998.