On the Contributions of Topics to System Evaluation

We consider the selection of good subsets of topics for system evaluation. It has previously been suggested that some individual topics and some subsets of topics are better for system evaluation than others: given limited resources, choosing the best subset of topics may give significantly better prediction of overall system effectiveness than (for example) choosing a random subset. Earlier experimental results are extended, with particular reference to generalisation: the ability of a subset of topics selected on the basis of one collection of system runs to perform well in evaluating another collection of system runs. It turns out to be hard to establish generalisability; it is not at all clear that it is possible to identify subsets of topics that are good for general evaluation.
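
To make the setup concrete, the following minimal sketch scores a candidate topic subset by the Kendall's tau correlation between the system ranking it induces and the ranking induced by the full topic set, and then checks generalisation by reusing a subset selected on one collection of runs to rank a second, held-out collection. The greedy selection heuristic, the array shapes and all names here are illustrative assumptions, not the paper's actual procedure.

# A minimal sketch, assuming per-topic effectiveness scores (e.g. average
# precision) are available as a systems-by-topics matrix for each collection
# of runs. Subset quality is measured as rank correlation with the full-set
# system ordering; this is one common proxy, not necessarily the paper's.

import numpy as np
from scipy.stats import kendalltau


def subset_quality(scores, topic_subset):
    """Kendall's tau between the system ranking over `topic_subset`
    and the system ranking over the full topic set."""
    full_means = scores.mean(axis=1)
    subset_means = scores[:, topic_subset].mean(axis=1)
    tau, _ = kendalltau(full_means, subset_means)
    return tau


def greedy_select(scores, k):
    """Greedily grow a topic subset of size k that maximises
    subset_quality on the given collection of runs (one simple
    selection heuristic, assumed for illustration)."""
    n_topics = scores.shape[1]
    chosen = []
    for _ in range(k):
        best_topic, best_tau = None, -2.0
        for t in range(n_topics):
            if t in chosen:
                continue
            tau = subset_quality(scores, chosen + [t])
            if tau > best_tau:
                best_topic, best_tau = t, tau
        chosen.append(best_topic)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical collections of runs over the same 50 topics.
    runs_a = rng.uniform(0, 1, size=(30, 50))   # used to select the subset
    runs_b = rng.uniform(0, 1, size=(25, 50))   # used to test generalisation

    subset = greedy_select(runs_a, k=10)
    print("tau on selection runs: %.3f" % subset_quality(runs_a, subset))
    print("tau on held-out runs:  %.3f" % subset_quality(runs_b, subset))
    print("tau, random subset:    %.3f"
          % subset_quality(runs_b, list(rng.choice(50, 10, replace=False))))

In this sketch a subset "generalises" to the extent that its tau on the held-out runs exceeds that of a random subset of the same size; the abstract's point is that such an advantage proves difficult to establish.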