Are IR Evaluation Measures on an Interval Scale?

In this paper, we formally investigate whether IR evaluation measures are on an interval scale, a property needed to safely compute the basic statistics, such as mean and variance, that we use daily to compare IR systems. We address this issue within the framework of the representational theory of measurement, relying on the notion of difference structure, i.e. a total equi-spaced ordering on the system runs. We found that the most popular set-based measures, namely precision, recall, and F-measure, are on an interval scale. For rank-based measures under a strongly top-heavy ordering, we found that only RBP with p = 1/2 is on an interval scale, while RBP for other values of p, AP, DCG, and ERR are not. Moreover, under a weakly top-heavy ordering, we found that none of RBP, AP, DCG, and ERR is on an interval scale.
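To make the equi-spacing requirement concrete, the following minimal sketch enumerates all binary runs of a fixed length and checks whether their RBP scores are evenly spaced. It assumes binary relevance and the standard RBP definition, (1 - p) times the sum of r_i p^(i-1) over ranks i. The function names, the run length of 4, and the comparison value p = 4/5 are illustrative choices, not taken from the paper. The check looks only at the numeric spacing of the scores; it does not reproduce the paper's orderings on runs or its formal argument.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """RBP of a binary run: (1 - p) * sum_i r_i * p**(i - 1), ranks from 1."""
    # enumerate() starts at 0, which matches the exponent i - 1.
    return (1 - p) * sum(r * p ** i for i, r in enumerate(run))


def rbp_scores(n, p):
    """RBP scores of all 2**n binary runs of length n (hypothetical helper)."""
    return [rbp(run, p) for run in product((0, 1), repeat=n)]


def is_equispaced(values):
    """True if the distinct values, once sorted, have a single common gap."""
    ordered = sorted(set(values))
    gaps = {b - a for a, b in zip(ordered, ordered[1:])}
    return len(gaps) == 1


# Exact rational arithmetic avoids floating-point noise in the gaps.
half = rbp_scores(4, Fraction(1, 2))
other = rbp_scores(4, Fraction(4, 5))

print(is_equispaced(half))   # True: scores are k/16 for k = 0..15
print(is_equispaced(other))  # False: gaps between adjacent scores differ
```

With p = 1/2 each run maps to a binary fraction, so the 16 runs of length 4 land on the multiples of 1/16. For p = 4/5 the rank weights are 125/625, 100/625, 80/625, and 64/625, whose subset sums are not evenly spaced.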

[1]  Stefano Mizzaro,et al.  Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics , 2013, ICTIR.

[2]  Nicola Ferro,et al.  Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness , 2015, ICTIR.

[3]  D-IOOO Berlin,et al.  TWO AXIOMS FOR EVALUATION MEASURES IN INFORMATION RETRIEVAL , 2001 .

[4]  Alistair Moffat,et al.  Seven Numeric Properties of Effectiveness Metrics , 2013, AIRS.

[5]  Norbert Fuhr Salton award lecture: information retrieval as engineering science , 2012, SIGIR 2012.

[6]  C. J. van Rijsbergen,et al.  FOUNDATION OF EVALUATION , 1974 .

[7]  Jaana Kekäläinen,et al.  Cumulated gain-based evaluation of IR techniques , 2002, TOIS.

[8]  Jaap Van Brakel,et al.  Foundations of measurement , 1983 .

[9]  Giovanni Battista Rossi Measurement and Probability: A Probabilistic Theory of Measurement with Applications , 2014 .

[10]  Jonathan Barzilai,et al.  On the foundations of measurement , 2001, 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236).

[11]  William Q. Meeker,et al.  Assumptions for statistical inference , 1993 .

[12]  Stephan Foldes On distances and metrics in discrete ordered sets , 2013 .

[13]  Alistair Moffat,et al.  Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.

[14]  R. Stanley Enumerative Combinatorics: Volume 1 , 2011 .

[15]  Donald E. Knuth,et al.  The art of computer programming. Vol.2: Seminumerical algorithms , 1981 .

[16]  Peter Bollmann-Sdorra,et al.  Measurement-theoretical investigation of the MZ-metric , 1980, SIGIR '80.

[17]  S S Stevens,et al.  On the Theory of Scales of Measurement. , 1946, Science.

[18]  A. Tversky,et al.  Foundations of Measurement, Vol. I: Additive and Polynomial Representations , 1991 .

[19]  Anna Gavling,et al.  The ART at , 2008 .

[20]  Fabrizio Sebastiani,et al.  An Axiomatically Derived Measure for the Evaluation of Classification Algorithms , 2015, ICTIR.

[21]  Paul H. Edelman,et al.  Hyperplane arrangements with a lattice of regions , 1990, Discret. Comput. Geom..

[22]  Sadaaki Miyamoto,et al.  Generalizations of multisets and rough approximations , 2004, Int. J. Intell. Syst..

[23]  Stephen E. Robertson,et al.  On GMAP: and other transformations , 2006, CIKM '06.

[24]  Julio Gonzalo,et al.  A general evaluation measure for document organization tasks , 2013, SIGIR.

[25]  Leonard R. Sussman,et al.  Nominal, Ordinal, Interval, and Ratio Typologies are Misleading , 1993 .

[26]  Olivier Chapelle,et al.  Expected reciprocal rank for graded relevance , 2009, CIKM.