The effect of pooling and evaluation depth on IR metrics

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics, and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as $k=20$ or $k=100$, without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and to document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources, including newswire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
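To make the two metric families and the role of the evaluation depth $k$ concrete, the following is a minimal illustrative sketch, not the paper's experimental code: it computes rank-biased precision (a utility-based metric, with a residual that bounds how much deeper judgments could still change the score) and truncated average precision (a recall-based metric, which additionally requires the collection-wide count of relevant documents) over a hypothetical binary relevance vector. The relevance vector, the persistence parameter $p=0.8$, and the relevant-document count are invented for illustration.

```python
# Illustrative sketch: how evaluation depth k interacts with a utility-based
# metric (RBP) and a recall-based metric (AP). All inputs are hypothetical.

def rbp_at_k(rels, k, p=0.8):
    """Rank-biased precision truncated at depth k (binary relevance).
    Returns (score, residual); the residual is the score mass attributable
    to the unexamined ranks below depth k."""
    score = sum((1 - p) * r * p ** i for i, r in enumerate(rels[:k]))
    residual = p ** k
    return score, residual

def ap_at_k(rels, k, num_relevant):
    """Average precision truncated at depth k; needs the number of relevant
    documents in the whole collection (the recall component)."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels[:k], start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # hypothetical binary judgments
for k in (5, 10):
    rbp, res = rbp_at_k(ranking, k)
    ap = ap_at_k(ranking, k, num_relevant=4)
    print(f"k={k}: RBP={rbp:.3f} (residual {res:.3f}), AP@{k}={ap:.3f}")
```

Running the sketch shows the pattern the abstract refers to: the top-heavy RBP score changes relatively little as $k$ grows (and its residual quantifies the remaining uncertainty), whereas the recall-based AP score depends strongly both on the truncation depth and on the assumed count of relevant documents, which is itself a product of the pooling process.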
