Statistical inference in retrieval effectiveness evaluation

Evaluation methodology, and in particular the statistical tests associated with it, plays a central role in information retrieval, a field with a strong empirical tradition. To evaluate the retrieval effectiveness of a search algorithm, this paper focuses on average precision over a fixed set of recall values. After reviewing traditional evaluation methodology through examples, this study suggests applying another statistical inference methodology, the bootstrap, which requires no particular assumption about the distribution of the observations. Moreover, this scheme may be used to assess the accuracy of virtually any statistic, to build approximate confidence intervals, and to verify whether a statistically significant difference exists between two retrieval schemes, even when dealing with a relatively small sample size. This study also suggests using the sample median rather than the sample mean when evaluating retrieval effectiveness, a choice justified by the nature of information retrieval data.
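
As a rough illustration of the bootstrap idea described in the abstract, the following Python sketch resamples per-query average precision values with replacement and builds a percentile confidence interval for the median, then does the same for the difference in medians between two retrieval schemes evaluated on the same queries. The percentile method, the function names, the number of resamples, and the sample scores are all assumptions made for this example; the abstract does not specify which bootstrap variant the paper uses.

```python
import random
import statistics


def _percentile_interval(replicates, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile interval of the replicates."""
    ordered = sorted(replicates)
    last = len(ordered) - 1
    lower_index = min(last, max(0, int((alpha / 2) * len(ordered))))
    upper_index = min(last, max(0, int((1 - alpha / 2) * len(ordered)) - 1))
    return ordered[lower_index], ordered[upper_index]


def bootstrap_ci(scores, statistic=statistics.median,
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of one sample."""
    rng = random.Random(seed)
    n = len(scores)
    # Each replicate is the statistic of n values drawn with replacement.
    replicates = [statistic(rng.choices(scores, k=n))
                  for _ in range(n_resamples)]
    return _percentile_interval(replicates, alpha)


def bootstrap_difference_ci(scores_a, scores_b, statistic=statistics.median,
                            n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(A) - statistic(B).

    Queries are resampled jointly, so each resample keeps the pairing
    between the two schemes on the same query.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both schemes must be scored on the same queries")
    rng = random.Random(seed)
    n = len(scores_a)
    replicates = []
    for _ in range(n_resamples):
        picked = rng.choices(range(n), k=n)  # query indices, with replacement
        resampled_a = [scores_a[i] for i in picked]
        resampled_b = [scores_b[i] for i in picked]
        replicates.append(statistic(resampled_a) - statistic(resampled_b))
    return _percentile_interval(replicates, alpha)


if __name__ == "__main__":
    # Hypothetical average precision values for ten queries (illustrative only).
    scheme_a = [0.12, 0.35, 0.41, 0.08, 0.77, 0.52, 0.30, 0.65, 0.19, 0.44]
    scheme_b = [0.10, 0.30, 0.45, 0.05, 0.70, 0.50, 0.22, 0.60, 0.21, 0.40]
    print("95% interval for the median of A:", bootstrap_ci(scheme_a))
    print("95% interval for median(A) - median(B):",
          bootstrap_difference_ci(scheme_a, scheme_b))
```

Under this sketch, an interval for the difference that excludes zero would be read as evidence of a difference between the two schemes at the chosen level. Passing `statistics.mean` as `statistic` gives the corresponding intervals for the sample mean.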
