Evaluating Evaluation Measure Stability

This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters rely on, such as the rule that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as the assumption that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods, before concluding that the two methods are actually different.
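
One way to make the notion of a measure's error rate concrete is to estimate how often two retrieval methods swap their relative order when they are scored on different query sets of a given size. The sketch below is a minimal illustration of that idea, not the paper's exact protocol: the `scores` mapping, the `fuzz` tie threshold, and the random-sampling scheme are all illustrative assumptions.

```python
import itertools
import random


def error_rate(scores, num_queries=25, trials=1000, fuzz=0.05, seed=0):
    """Estimate how often a measure's verdict flips between query sets.

    `scores` is a hypothetical dict mapping system name -> list of
    per-query scores (same query order for every system, and at least
    2 * num_queries queries in total).  Each trial draws two disjoint
    query sets of size `num_queries`; for every pair of systems we check
    whether the two sets agree on which system scores higher.  Mean
    differences smaller than `fuzz` (relative to the larger mean) are
    treated as ties rather than swaps.
    """
    rng = random.Random(seed)
    systems = list(scores)
    total_queries = len(next(iter(scores.values())))
    swaps = comparisons = 0

    for _ in range(trials):
        sample = rng.sample(range(total_queries), 2 * num_queries)
        set_a, set_b = sample[:num_queries], sample[num_queries:]
        for s1, s2 in itertools.combinations(systems, 2):
            verdicts = []
            for qset in (set_a, set_b):
                m1 = sum(scores[s1][q] for q in qset) / num_queries
                m2 = sum(scores[s2][q] for q in qset) / num_queries
                if abs(m1 - m2) <= fuzz * max(m1, m2, 1e-9):
                    verdicts.append(0)          # difference too small: tie
                else:
                    verdicts.append(1 if m1 > m2 else -1)
            comparisons += 1
            if verdicts[0] * verdicts[1] == -1:  # the two query sets disagree
                swaps += 1

    return swaps / comparisons if comparisons else 0.0
```

Running such an estimate for several measures over the same systems and query pool allows a direct stability comparison, and raising either `num_queries` or the required score difference drives the estimated error rate down, mirroring the trade-off described in the abstract.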
