Paper information: The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics

The Voorhees/Buckley swap method is useful for comparing the discrimination power of Information Retrieval (IR) and Question Answering (QA) metrics. Given a test collection, a set of runs and an evaluation metric, it derives the swap rate: the chance of observing inconsistent outcomes when two completely different topic sets are used to compare a pair of runs. Recently, however, Sanderson and Zobel claimed that the method overestimates swap rates because it samples topics without replacement. The main question we address in this paper is whether sampling with replacement and sampling without replacement produce different results for the purpose of comparing the sensitivity of different metrics. Our IR and QA experiments show that the two methods generally yield similar results, which suggests that the original Voorhees/Buckley method is valid.
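As a rough illustration of the idea in the abstract, the Python sketch below estimates a single overall swap rate from a table of per-topic scores, with topics sampled either without or with replacement. It is a simplification and not the procedure used in the paper: the published swap method also conditions swap rates on the size of the observed performance difference, which is omitted here. The function name `swap_rate`, its parameters, and the toy scores are all assumptions made for this example.

```python
import random
from itertools import combinations


def swap_rate(scores, subset_size, trials=1000, replace=False, seed=0):
    """Fraction of run-pair comparisons whose sign flips between two topic subsets.

    scores      -- dict: run name -> list of per-topic scores (equal lengths)
    subset_size -- number of topics in each of the two subsets
    replace     -- False: two disjoint subsets drawn without replacement
                   True:  two subsets drawn independently with replacement
    All names here are hypothetical; this is a simplified sketch.
    """
    rng = random.Random(seed)
    runs = sorted(scores)
    n_topics = len(scores[runs[0]])
    if not replace and 2 * subset_size > n_topics:
        raise ValueError("need at least 2 * subset_size topics for disjoint subsets")

    swaps = 0
    comparisons = 0
    for _ in range(trials):
        if replace:
            # Each subset is an independent sample with replacement.
            a = [rng.randrange(n_topics) for _ in range(subset_size)]
            b = [rng.randrange(n_topics) for _ in range(subset_size)]
        else:
            # Draw 2 * subset_size distinct topics and split them in half.
            drawn = rng.sample(range(n_topics), 2 * subset_size)
            a, b = drawn[:subset_size], drawn[subset_size:]

        for x, y in combinations(runs, 2):
            # Summed differences have the same sign as mean differences.
            diff_a = sum(scores[x][t] - scores[y][t] for t in a)
            diff_b = sum(scores[x][t] - scores[y][t] for t in b)
            comparisons += 1
            if diff_a * diff_b < 0:  # the two subsets disagree on the better run
                swaps += 1
    return swaps / comparisons


if __name__ == "__main__":
    # Toy per-topic scores for three made-up runs over 20 topics.
    toy_rng = random.Random(42)
    toy = {name: [toy_rng.random() for _ in range(20)] for name in ("A", "B", "C")}
    print("without replacement:", swap_rate(toy, 10, replace=False))
    print("with replacement:   ", swap_rate(toy, 10, replace=True))
```

Running the two variants on the same score table, once per metric, gives a simple way to check whether the choice of sampling scheme changes which metric appears more sensitive, which is the kind of comparison the abstract describes.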

Tetsuya Sakai

[1] Mark Sanderson, et al. Information retrieval system evaluation: effort, sensitivity, and reliability, 2005, SIGIR '05.

[2] Tetsuya Sakai, et al. A Note on the Reliability of Japanese Question Answering Evaluation, 2004.

[3] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[4] Tetsuya Sakai. Ranking the NTCIR Systems Based on Multigrade Relevance, 2004, AIRS.

[5] Ellen M. Voorhees, et al. Overview of the TREC 2004 Robust Retrieval Track, 2004.

[6] G. Casella, et al. Statistical Inference, 2003, Encyclopedia of Social Network Analysis and Mining.

[7] Tetsuya Sakai, et al. The Reliability of Metrics Based on Graded Relevance, 2005, AIRS.

[8] M. Kenward, et al. An Introduction to the Bootstrap, 2007.

[9] Ian Soboroff. On evaluating web search with very few relevant documents, 2004, SIGIR '04.

[10] Jacques Savoy, et al. Statistical inference in retrieval effectiveness evaluation, 1997, Inf. Process. Manag.

[11] Tetsuya Sakai, et al. New Performance Metrics Based on Multigrade Relevance: Their Application to Question Answering, 2004, NTCIR.

[12] Ellen M. Voorhees, et al. The effect of topic set size on retrieval experiment error, 2002, SIGIR '02.

[13] Tetsuya Sakai, et al. ASKMi: A Japanese Question Answering System based on Semantic Role Analysis, 2004, RIAO.