Rank-biased precision for measurement of retrieval effectiveness

A range of methods for measuring the effectiveness of information retrieval systems has been proposed. These are typically intended to provide a quantitative single-value summary of a document ranking relative to a query. However, many of these measures have failings. For example, recall is not well founded as a measure of satisfaction, since the user of an actual system cannot judge recall. Average precision is derived from recall, and suffers from the same problem. In addition, average precision lacks key stability properties that are needed for robust experiments. In this article, we introduce a new effectiveness metric, rank-biased precision, that avoids these problems. Rank-biased precision is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
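As a concrete reference point, the minimal sketch below computes the usual rank-biased precision formulation, RBP = (1 - p) * sum over ranks i of r_i * p^(i-1), where p is the persistence parameter of the underlying user model and r_i is the relevance of the document at rank i; the geometric tail p^d bounds how much documents beyond the evaluation depth d (unjudged or unretrieved) could still add to the score. The persistence value and example judgments used here are illustrative only.

def rank_biased_precision(relevance, p=0.95):
    """Compute RBP for a ranked list of binary relevance judgments.

    relevance -- iterable of 0/1 values, ordered by rank (rank 1 first)
    p         -- persistence: probability that the user moves from one
                 document in the ranking to the next
    Returns (rbp, residual): the RBP score over the judged prefix, and an
    upper bound on the score mass that deeper documents could contribute.
    """
    rbp = 0.0
    for i, rel in enumerate(relevance):   # i = 0 corresponds to rank 1
        rbp += rel * p ** i               # weight p^(rank-1) per document
    rbp *= (1.0 - p)
    residual = p ** len(relevance)        # geometric tail beyond depth d
    return rbp, residual

if __name__ == "__main__":
    # Hypothetical judgments to depth 10; relevant at ranks 1, 3, 4, and 7.
    ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
    score, residual = rank_biased_precision(ranking, p=0.8)
    print(f"RBP(p=0.8) = {score:.4f}, residual <= {residual:.4f}")

Because the per-rank weights form a convergent geometric series, the score cannot change by more than the residual however far the ranking is extended, which is what makes the metric stable under partial judgments.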
