Cumulated gain-based evaluation of IR techniques

Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devalue late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed, and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of the statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user's point of view.
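The three measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's reference implementation: the abstract does not give the formulas, so the logarithmic discount with base 2 (applied only from rank 2 onward), the function names, and the sample gain values are all assumptions made here. The ideal ranking is approximated by sorting the same result list in descending order of gain, whereas a full evaluation would build it from all judged documents for the query.

```python
import math
from typing import List, Sequence


def cumulated_gain(gains: Sequence[float]) -> List[float]:
    """Running sum of graded relevance scores down the ranked list."""
    total = 0.0
    out = []
    for g in gains:
        total += g
        out.append(total)
    return out


def discounted_cumulated_gain(gains: Sequence[float], base: float = 2.0) -> List[float]:
    """Running sum where the gain at rank i is divided by log_base(i).

    Assumption: ranks below `base` are not discounted, which also avoids
    dividing by log(1) = 0 at rank 1.
    """
    total = 0.0
    out = []
    for rank, g in enumerate(gains, start=1):
        if rank < base:
            total += g
        else:
            total += g / math.log(rank, base)
        out.append(total)
    return out


def normalized(actual: Sequence[float], ideal: Sequence[float]) -> List[float]:
    """Position-by-position ratio of an actual gain vector to the ideal one."""
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


# Hypothetical relevance scores on a four-point scale (0..3), in ranked order.
gains = [3, 2, 3, 0, 1, 2]
# Simplified ideal ranking: the same scores sorted from most to least relevant.
ideal_gains = sorted(gains, reverse=True)

cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ncg = normalized(cg, cumulated_gain(ideal_gains))
ndcg = normalized(dcg, discounted_cumulated_gain(ideal_gains))
```

With these sample scores, `cg` is `[3, 5, 8, 8, 9, 11]`, and `ncg` reaches 1.0 at the last rank because the actual and ideal lists contain the same documents. A larger `base` discounts late ranks more gently, which is one way to model a more patient user.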
