On Using Fewer Topics in Information Retrieval Evaluations

The possibility of using fewer topics in TREC and TREC-like initiatives has been studied recently, with encouraging results: even when the number of topics is reduced considerably (for example, using a topic subset of cardinality 10 in place of the usual 50), it is possible, at least potentially, to obtain similar results when evaluating system effectiveness. However, the generality of this approach has been questioned, since a topic subset selected on one system population does not seem adequate for evaluating other systems. In this paper we reconsider that generality issue: we point out some limitations of the previous work and present experimental results that are more positive. These results support the hypothesis that, with some care, the few topics selected on the basis of a given system population are also adequate to evaluate a different system population.
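The general methodology behind such studies can be illustrated with a minimal sketch (not the paper's actual procedure): given a systems-by-topics matrix of average-precision scores, a small topic subset is selected on one system population, and the question is how well system rankings computed on that subset agree (e.g., via Kendall's tau) with rankings computed on the full topic set, both for the population used for selection and for a held-out population. The function names, the greedy selection heuristic, and the random data below are illustrative assumptions.

```python
# Illustrative sketch only: greedily select a small topic subset on one system
# population, then test whether it also preserves the full-set ranking of a
# different system population.
import numpy as np
from scipy.stats import kendalltau


def rank_correlation(ap_matrix, topic_subset):
    """Kendall's tau between system rankings on a topic subset vs. all topics.

    ap_matrix: systems x topics array of per-topic average precision.
    """
    full_map = ap_matrix.mean(axis=1)                      # MAP over all topics
    subset_map = ap_matrix[:, topic_subset].mean(axis=1)   # MAP over the subset
    tau, _ = kendalltau(full_map, subset_map)
    return tau


def greedy_topic_subset(ap_matrix, target_size):
    """Greedily add the topic that best preserves the full-set system ranking."""
    n_topics = ap_matrix.shape[1]
    chosen = []
    while len(chosen) < target_size:
        best_topic, best_tau = None, -2.0
        for t in range(n_topics):
            if t in chosen:
                continue
            tau = rank_correlation(ap_matrix, chosen + [t])
            if tau > best_tau:
                best_topic, best_tau = t, tau
        chosen.append(best_topic)
    return chosen


# Hypothetical usage: random scores stand in for two system populations that
# ran the same 50 topics (e.g., two halves of a TREC pool).
rng = np.random.default_rng(0)
pop_a = rng.random((40, 50))   # 40 systems used to select the subset
pop_b = rng.random((30, 50))   # 30 held-out systems on the same topics
subset = greedy_topic_subset(pop_a, target_size=10)
print("tau on selection population:", rank_correlation(pop_a, subset))
print("tau on held-out population: ", rank_correlation(pop_b, subset))
```

In real experiments the held-out correlation is the quantity of interest: a subset that ranks the selection population well but the held-out population poorly would indicate the generality problem discussed above.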
