The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics

The Voorhees/Buckley swap method is useful for comparing the discrimination power of Information Retrieval (IR) and Question Answering (QA) metrics. Given a test collection, a set of runs and an evaluation metric, it derives the swap rate: the chance of observing inconsistent results when a pair of runs is compared using two disjoint topic sets. Recently, however, Sanderson and Zobel claimed that the method overestimates swap rates because it samples topics without replacement. The main question we address in this paper is whether sampling with replacement and sampling without replacement lead to different conclusions when comparing the sensitivity of different metrics. Our IR and QA experiments show that the two sampling methods generally yield similar results, which suggests that the original Voorhees/Buckley method is valid for this purpose.
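To make the swap rate concrete, the following is a minimal sketch of how it might be estimated under either sampling scheme. It is a simplification and not the exact procedure of Voorhees and Buckley: it reports a single overall swap rate rather than binning run pairs by their performance difference. The function name `swap_rate`, its parameters, and the `scores` layout (run name mapped to a list of per-topic scores) are assumptions made for illustration only.

```python
import random
from itertools import combinations


def swap_rate(scores, set_size, trials=1000, with_replacement=False, seed=0):
    """Estimate an overall swap rate (hypothetical helper, simplified).

    scores: dict mapping run name -> list of per-topic scores,
            with every list using the same topic order.
    set_size: number of topics in each of the two sampled topic sets.
    """
    rng = random.Random(seed)
    n_topics = len(next(iter(scores.values())))
    if not with_replacement and 2 * set_size > n_topics:
        raise ValueError("two disjoint sets need at least 2 * set_size topics")

    swaps = 0
    comparisons = 0
    for _ in range(trials):
        if with_replacement:
            # Two independent samples; topics may repeat or be shared.
            set_a = [rng.randrange(n_topics) for _ in range(set_size)]
            set_b = [rng.randrange(n_topics) for _ in range(set_size)]
        else:
            # One draw without replacement, split into two disjoint sets.
            drawn = rng.sample(range(n_topics), 2 * set_size)
            set_a, set_b = drawn[:set_size], drawn[set_size:]

        for x, y in combinations(scores, 2):
            # Summed per-topic differences have the same sign as mean differences.
            diff_a = sum(scores[x][t] - scores[y][t] for t in set_a)
            diff_b = sum(scores[x][t] - scores[y][t] for t in set_b)
            comparisons += 1
            if diff_a * diff_b < 0:
                # The two topic sets disagree on which run is better.
                swaps += 1
    return swaps / comparisons
```

Calling `swap_rate(scores, set_size)` once with `with_replacement=False` and once with `with_replacement=True` for each metric gives two sets of estimates whose agreement can then be compared across metrics.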