A General Theory of IR Evaluation Measures

Several basic descriptive statistics, such as mean and variance, assume interval scales, as do many of the statistical significance tests routinely used in Information Retrieval (IR) to compare systems. Unfortunately, there has so far been no systematic and formal study of the actual scale properties of IR measures. In this paper, we therefore develop a theory of IR evaluation measures, based on the representational theory of measurement, to determine whether and when IR measures are interval scales. We find that common set-based retrieval measures (namely Precision, Recall, and F-measure) are always interval scales in the case of binary relevance; in the case of multi-graded relevance, they are interval scales only when the relevance degrees themselves are on a ratio scale and we define a specific partial order among systems. Among rank-based retrieval measures (namely AP, gRBP, DCG, and ERR), only gRBP is an interval scale, and only when we choose a specific value of the parameter $p$ and define a specific total order among systems; none of the other measures is an interval scale. Besides the formal framework itself and the proofs of the scale properties of several commonly used IR measures, the paper also defines new set-based and rank-based IR evaluation measures that are guaranteed to be interval scales.
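
As an informal illustration of why the choice of $p$ matters, the sketch below enumerates all binary runs of a fixed depth, scores them with the standard binary RBP formula $(1-p)\sum_i r_i p^{i-1}$, and checks whether consecutive scores are equally spaced. The value $p = 1/2$, the comparison value $p = 4/5$, the depth of 3, and the helper names `rbp` and `gaps` are assumptions made for this example; the abstract does not state the value of $p$, and equal spacing is only an intuition for the interval-scale property, not the formal argument developed in the paper.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """Binary rank-biased precision: (1 - p) * sum_i r_i * p**(i - 1).

    `run` is a tuple of 0/1 relevance labels, rank 1 first.
    enumerate() starts at 0, so p**i already equals p**(rank - 1).
    """
    return (1 - p) * sum(r * p**i for i, r in enumerate(run))


def gaps(p, depth=3):
    """Differences between consecutive sorted RBP scores over all binary runs."""
    scores = sorted(rbp(run, p) for run in product((0, 1), repeat=depth))
    return [b - a for a, b in zip(scores, scores[1:])]


# Exact rational arithmetic avoids floating-point noise in the comparison.
half_gaps = gaps(Fraction(1, 2))   # illustrative choice of p (assumption)
other_gaps = gaps(Fraction(4, 5))  # a different, arbitrary p for contrast

print("p = 1/2 equally spaced:", len(set(half_gaps)) == 1)
print("p = 4/5 equally spaced:", len(set(other_gaps)) == 1)
```

With $p = 1/2$ every run of depth 3 maps to a distinct multiple of $1/8$, so the sorted scores are evenly spaced; with $p = 4/5$ the gaps between neighbouring scores differ.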
