Paper information - Test theory for evaluating reliability of IR test collections

Classical test theory offers theoretically derived reliability measures such as Cronbach's alpha, which can be applied to measure the reliability of a set of Information Retrieval test results. The theory also supports item analysis, which identifies queries that are hampering the test's reliability, and which may be candidates for refinement or removal. A generalization of Classical Test Theory, called Generalizability Theory, provides an even richer set of tools. It allows us to estimate the reliability of a test as a function of the number of queries, assessors (relevance judges), and other aspects of the test's design. One novel aspect of Generalizability Theory is that it allows this estimation of reliability even before the test collection exists, based purely on the numbers of queries and assessors that it will contain. These calculations can help test designers in advance, by allowing them to compare the reliability of test designs with various numbers of queries and relevance assessors, and to spend their limited budgets on a design that maximizes reliability. Empirical analysis shows that in cases for which our data is representative, having more queries is more helpful for reliability than having more assessors. It also suggests that reliability may be improved with a per-document performance measure, as opposed to a document-set based performance measure, where appropriate. The theory also clarifies the implicit debate in IR literature regarding the nature of error in relevance judgments.
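The two reliability measures named in the abstract can be sketched in a few lines. This is a minimal illustration using the standard textbook formulas, not the paper's own code. It assumes that retrieval systems play the role of examinees and queries the role of test items, and that the Generalizability Theory design is fully crossed (systems x queries x assessors). The function names, argument names, and variance-component values are hypothetical and chosen only for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a systems-by-queries score matrix.

    scores[i][j] is the effectiveness score of system i on query j
    (assumption: systems act as examinees, queries as test items).
    Raises ZeroDivisionError if all systems have the same total score.
    """
    n_systems = len(scores)
    n_queries = len(scores[0])
    if n_systems < 2 or n_queries < 2:
        raise ValueError("need at least 2 systems and 2 queries")
    # Sample variance of each query's scores across systems.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_queries)]
    # Sample variance of each system's total score over all queries.
    total_var = variance([sum(row) for row in scores])
    return n_queries / (n_queries - 1) * (1 - sum(item_vars) / total_var)


def g_coefficient(var_s, var_sq, var_sa, var_res, n_queries, n_assessors):
    """Relative generalizability coefficient for a crossed s x q x a design.

    var_s   : variance component for systems (the object of measurement)
    var_sq  : system-by-query interaction component
    var_sa  : system-by-assessor interaction component
    var_res : residual (three-way interaction confounded with error)
    The components would normally be estimated from a pilot study;
    here they are supplied by the caller.
    """
    relative_error = (
        var_sq / n_queries
        + var_sa / n_assessors
        + var_res / (n_queries * n_assessors)
    )
    return var_s / (var_s + relative_error)


if __name__ == "__main__":
    # Hypothetical scores: 3 systems evaluated on 3 queries.
    demo_scores = [[0.2, 0.4, 0.3], [0.5, 0.6, 0.4], [0.7, 0.9, 0.8]]
    print("alpha:", round(cronbach_alpha(demo_scores), 3))

    # Hypothetical variance components, used to compare candidate designs.
    components = dict(var_s=0.04, var_sq=0.08, var_sa=0.01, var_res=0.02)
    for n_q, n_a in [(25, 1), (50, 1), (25, 2)]:
        rho = g_coefficient(**components, n_queries=n_q, n_assessors=n_a)
        print(f"queries={n_q} assessors={n_a} reliability={rho:.3f}")
```

Because the second function needs only variance components and the planned numbers of queries and assessors, it can be evaluated for a design before any test collection is built, which is the planning use the abstract describes. Whether adding queries or adding assessors helps more depends entirely on the relative sizes of the components supplied.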

David Bodoff
