A comparison of pooled and sampled relevance judgments

Test collections are most useful when they are reusable, that is, when they can be reliably used to rank systems that did not contribute to the pools. Pooled relevance judgments for very large collections may not be reusable for two reasons: they will be very sparse and not sufficiently complete, and they may be biased in the sense that they will unfairly rank some class of systems. The TREC 2006 terabyte track judged both a pool and a deep random sample in order to measure the effects of sparseness and bias.
