Evaluation over thousands of queries

Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track along with deeper analysis: investigating tradeoffs between the number of queries and the number of judgments per query shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective than, and as reliable as, evaluation over fewer queries with more judgments. Total assessor effort can be reduced by 95% with no appreciable increase in evaluation errors.
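
The abstract does not describe the track's two selection algorithms in detail, but the underlying idea of estimating a measure from a sparse, sampled set of judgments can be sketched. The Python snippet below is a minimal, hypothetical illustration, not the track's actual method: it samples documents for judging with rank-dependent inclusion probabilities and then estimates precision@10 with inverse-inclusion-probability (Horvitz-Thompson) weighting. All names and the toy data (`sample_judgments`, `estimate_precision_at_k`, the synthetic qrels) are assumptions introduced here for illustration.

```python
import random

def sample_judgments(ranked_docs, inclusion_prob):
    """Poisson sampling: each document is judged independently with its own
    inclusion probability, so deep-ranked documents are judged only rarely."""
    return {d for d in ranked_docs if random.random() < inclusion_prob[d]}

def estimate_precision_at_k(ranking, k, judged, qrels, inclusion_prob):
    """Horvitz-Thompson estimate of precision@k from a partial judgment set.
    Each judged relevant document in the top k contributes 1 / p(d), which
    corrects (in expectation) for the documents that were never judged."""
    total = 0.0
    for d in ranking[:k]:
        if d in judged and qrels.get(d, 0) > 0:
            total += 1.0 / inclusion_prob[d]
    return total / k

if __name__ == "__main__":
    # Hypothetical toy data: one query, one system ranking; the full qrels are
    # known here only so the estimate can be compared against the true value.
    ranking = [f"doc{i}" for i in range(100)]
    qrels = {f"doc{i}": 1 for i in range(0, 100, 3)}  # every third doc relevant
    # Judge top-ranked documents with high probability, the tail only rarely.
    inclusion_prob = {d: max(0.05, 1.0 - rank / 50) for rank, d in enumerate(ranking)}

    random.seed(0)
    judged = sample_judgments(ranking, inclusion_prob)
    true_p10 = sum(qrels.get(d, 0) for d in ranking[:10]) / 10
    est_p10 = estimate_precision_at_k(ranking, 10, judged, qrels, inclusion_prob)
    print(f"judged {len(judged)} of {len(ranking)} documents")
    print(f"true P@10 = {true_p10:.2f}, estimated P@10 = {est_p10:.2f}")
```

Because judging effort concentrates near the top of the ranking, a design like this judges only a fraction of each ranked list per query, which is what makes spreading a fixed assessing budget over many more queries feasible.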
