The effect of pooling and evaluation depth on IR metrics

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics, and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as $k=20$ or $k=100$, without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and to document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources, including newswire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
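To make the two metric families and the role of the evaluation depth $k$ concrete, the following is a minimal illustrative sketch, not the paper's experimental code: it computes rank-biased precision (a utility-based metric, with a residual that bounds how much deeper judgments could still change the score) and truncated average precision (a recall-based metric, which additionally requires the collection-wide count of relevant documents) over a hypothetical binary relevance vector. The relevance vector, the persistence parameter $p=0.8$, and the relevant-document count are invented for illustration.

```python
# Illustrative sketch: how evaluation depth k interacts with a utility-based
# metric (RBP) and a recall-based metric (AP). All inputs are hypothetical.

def rbp_at_k(rels, k, p=0.8):
    """Rank-biased precision truncated at depth k (binary relevance).
    Returns (score, residual); the residual is the score mass attributable
    to the unexamined ranks below depth k."""
    score = sum((1 - p) * r * p ** i for i, r in enumerate(rels[:k]))
    residual = p ** k
    return score, residual

def ap_at_k(rels, k, num_relevant):
    """Average precision truncated at depth k; needs the number of relevant
    documents in the whole collection (the recall component)."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels[:k], start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # hypothetical binary judgments
for k in (5, 10):
    rbp, res = rbp_at_k(ranking, k)
    ap = ap_at_k(ranking, k, num_relevant=4)
    print(f"k={k}: RBP={rbp:.3f} (residual {res:.3f}), AP@{k}={ap:.3f}")
```

Running the sketch shows the pattern the abstract refers to: the top-heavy RBP score changes relatively little as $k$ grows (and its residual quantifies the remaining uncertainty), whereas the recall-based AP score depends strongly both on the truncation depth and on the assumed count of relevant documents, which is itself a product of the pooling process.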
