Assessor error in stratified evaluation

Several important information retrieval tasks, including those in medicine, law, and patent review, have an authoritative standard of relevance and place a premium on retrieval completeness. When retrieval effectiveness is evaluated in these domains, assessors make errors in applying the standard of relevance, and the impact of these errors, particularly on estimates of recall, is of crucial concern. Using data from the interactive task of the TREC Legal Track, this paper investigates how reliably the yield of relevant documents can be estimated from sampled assessments in the presence of assessor error, particularly where sampling is stratified based on the results of participating retrieval systems. We show that assessor error is in general a greater source of inaccuracy than sampling error. A process of appeal and adjudication, such as that used in the interactive task, is found to be effective at locating many assessment errors, but the process is expensive if complete and biased if incomplete. An unbiased double-sampling method for resolving assessment error is proposed and shown, on representative data, to be more efficient and accurate than appeal-based adjudication.
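The abstract refers to two estimation steps: extrapolating the yield of relevant documents from stratified sampled assessments, and correcting that estimate for assessor misclassification via double sampling. As a rough illustration only, the Python sketch below implements a generic stratified yield estimator and a Tenenbein-style double-sampling correction; the function names, stratum sizes, sample counts, and error rates are illustrative assumptions, not the paper's actual procedure or TREC Legal Track data.

```python
def stratified_yield(strata):
    """Estimate total relevant documents from per-stratum samples.

    Each stratum is a dict with:
      N -- number of documents in the stratum
      n -- number of sampled (assessed) documents
      r -- number of sampled documents judged relevant
    The in-sample prevalence r/n is extrapolated to all N documents
    in the stratum, and the per-stratum estimates are summed.
    """
    return sum(s["N"] * s["r"] / s["n"] for s in strata)


def double_sampling_yield(N, n1, r1, rejudged):
    """Tenenbein-style double-sampling estimate of relevant documents.

    N        -- collection (or stratum) size
    n1       -- size of the large first-phase sample judged by fallible assessors
    r1       -- first-phase documents the fallible assessors called relevant
    rejudged -- list of (fallible_label, authoritative_label) pairs for a
                smaller second-phase subsample re-judged authoritatively
                (1 = relevant, 0 = not relevant)

    The second-phase pairs estimate how often each fallible label is
    correct; those rates re-weight the cheap first-phase counts.
    """
    pos = [t for f, t in rejudged if f == 1]  # authoritative labels where fallible said relevant
    neg = [t for f, t in rejudged if f == 0]  # authoritative labels where fallible said not relevant
    p_rel_given_pos = sum(pos) / len(pos) if pos else 0.0
    p_rel_given_neg = sum(neg) / len(neg) if neg else 0.0

    # Misclassification-corrected prevalence in the first-phase sample,
    # extrapolated to the full collection.
    prevalence = (r1 * p_rel_given_pos + (n1 - r1) * p_rel_given_neg) / n1
    return N * prevalence


if __name__ == "__main__":
    # Illustrative numbers only.
    strata = [
        {"N": 50_000, "n": 500, "r": 200},   # highly-ranked stratum
        {"N": 450_000, "n": 500, "r": 15},   # lower-ranked stratum
    ]
    print("Stratified yield estimate:", stratified_yield(strata))

    # 1,000 fallible judgments, 300 called relevant; 100 of them re-judged.
    rejudged = [(1, 1)] * 25 + [(1, 0)] * 5 + [(0, 0)] * 68 + [(0, 1)] * 2
    print("Double-sampling yield estimate:",
          double_sampling_yield(N=500_000, n1=1_000, r1=300, rejudged=rejudged))
```

The double-sampling correction is unbiased in the sense that the expensive authoritative judgments are taken on a random subsample rather than only on appealed documents, which is the contrast with appeal-based adjudication that the abstract draws.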
