On Using Fewer Topics in Information Retrieval Evaluations

The possibility of using fewer topics in TREC and TREC-like initiatives has been studied recently, with encouraging results: even when the number of topics is reduced considerably (for example, using a topic subset of cardinality 10 in place of the usual 50), it is possible, at least potentially, to obtain similar results when evaluating system effectiveness. However, the generality of this approach has been questioned, since a topic subset selected on one system population does not seem adequate for evaluating other systems. In this paper we reconsider that generality issue: we point out some limitations of the previous work and present experimental results that are more positive. These results support the hypothesis that, with some care, the few topics selected on the basis of a given system population are also adequate to evaluate a different system population.
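The general methodology behind such studies can be illustrated with a minimal sketch (not the paper's actual procedure): given a systems-by-topics matrix of average-precision scores, a small topic subset is selected on one system population, and the question is how well system rankings computed on that subset agree (e.g., via Kendall's tau) with rankings computed on the full topic set, both for the population used for selection and for a held-out population. The function names, the greedy selection heuristic, and the random data below are illustrative assumptions.

```python
# Illustrative sketch only: greedily select a small topic subset on one system
# population, then test whether it also preserves the full-set ranking of a
# different system population.
import numpy as np
from scipy.stats import kendalltau


def rank_correlation(ap_matrix, topic_subset):
    """Kendall's tau between system rankings on a topic subset vs. all topics.

    ap_matrix: systems x topics array of per-topic average precision.
    """
    full_map = ap_matrix.mean(axis=1)                      # MAP over all topics
    subset_map = ap_matrix[:, topic_subset].mean(axis=1)   # MAP over the subset
    tau, _ = kendalltau(full_map, subset_map)
    return tau


def greedy_topic_subset(ap_matrix, target_size):
    """Greedily add the topic that best preserves the full-set system ranking."""
    n_topics = ap_matrix.shape[1]
    chosen = []
    while len(chosen) < target_size:
        best_topic, best_tau = None, -2.0
        for t in range(n_topics):
            if t in chosen:
                continue
            tau = rank_correlation(ap_matrix, chosen + [t])
            if tau > best_tau:
                best_topic, best_tau = t, tau
        chosen.append(best_topic)
    return chosen


# Hypothetical usage: random scores stand in for two system populations that
# ran the same 50 topics (e.g., two halves of a TREC pool).
rng = np.random.default_rng(0)
pop_a = rng.random((40, 50))   # 40 systems used to select the subset
pop_b = rng.random((30, 50))   # 30 held-out systems on the same topics
subset = greedy_topic_subset(pop_a, target_size=10)
print("tau on selection population:", rank_correlation(pop_a, subset))
print("tau on held-out population: ", rank_correlation(pop_b, subset))
```

In real experiments the held-out correlation is the quantity of interest: a subset that ranks the selection population well but the held-out population poorly would indicate the generality problem discussed above.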
