An Interval-Like Scale Property for IR Evaluation Measures

Evaluation measures play an important role in IR experimental evaluation, and their properties determine the kinds of statistical analyses we can conduct. Previous work has questioned whether IR effectiveness measures are on an interval scale; if they are not, computing means and variances is not a permissible operation. In this paper, we investigate whether the definition of interval scale can be relaxed slightly by introducing the notion of an interval-like scale, and to what extent IR effectiveness measures comply with this relaxed definition.
