A Practical Sampling Strategy for Efficient Retrieval Evaluation

We consider the problem of large-scale retrieval evaluation, focusing on the considerable effort required to judge tens of thousands of documents under traditional test collection construction methodologies. Two methods based on random sampling have recently been proposed to alleviate this burden: the first, proposed by Aslam et al., is accurate and efficient but quite complex; the second, proposed by Yilmaz et al., is relatively simple but significantly less accurate and efficient. In this work, we propose a new sampling-based method for large-scale retrieval evaluation that combines the strengths of both: it retains the simplicity of the Yilmaz et al. method while matching the performance of the Aslam et al. method. Furthermore, we demonstrate that this new sampling method can be adapted to incorporate both randomly sampled and fixed relevance judgments, such as those available in the most recent TREC Terabyte track.
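The flavor of such sampling-based evaluation can be illustrated with a small sketch. The code below is not the authors' method; it is a minimal illustration, under stated assumptions, of the two shared ingredients: documents are selected for judging with known, nonuniform inclusion probabilities (e.g., proportional to a prior estimate of relevance, in the spirit of sampling with probability proportional to size), and a Horvitz-Thompson style estimator then up-weights each sampled judgment by the reciprocal of its inclusion probability to recover unbiased estimates of quantities such as the number of relevant documents or precision at a cutoff. The Poisson sampling design and all function names here are illustrative assumptions.

```python
import random

def poisson_sample(prior, budget, rng=random):
    """Include each document independently with probability pi_d
    proportional to its prior relevance weight (capped at 1), so the
    expected number of judged documents is roughly `budget`.
    Returns a dict mapping each sampled doc to its inclusion
    probability pi_d, which the estimators below require."""
    total = sum(prior.values())
    pi = {d: min(1.0, budget * w / total) for d, w in prior.items()}
    return {d: p for d, p in pi.items() if rng.random() < p}

def estimate_num_relevant(sample_pi, qrels):
    """Horvitz-Thompson estimate of R, the total number of relevant
    documents: each sampled relevant document counts 1/pi_d."""
    return sum(1.0 / p for d, p in sample_pi.items() if qrels.get(d, 0) > 0)

def estimate_precision_at_k(ranking, k, sample_pi, qrels):
    """Estimate a system's precision@k from the judged sample alone:
    unsampled documents contribute nothing, and each sampled relevant
    document in the top k is up-weighted by 1/pi_d to compensate."""
    weighted_rel = sum(1.0 / sample_pi[d]
                       for d in ranking[:k]
                       if d in sample_pi and qrels.get(d, 0) > 0)
    return weighted_rel / k

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical data: a rank-biased relevance prior over 100 docs,
    # and ground-truth judgments for every third document.
    prior = {f"d{i}": 1.0 / (i + 1) for i in range(100)}
    qrels = {f"d{i}": 1 for i in range(0, 100, 3)}
    sample = poisson_sample(prior, budget=30, rng=rng)
    print(estimate_num_relevant(sample, qrels))        # true value: 34
    ranking = [f"d{i}" for i in range(100)]
    print(estimate_precision_at_k(ranking, 10, sample, qrels))
```

Poisson sampling is chosen here only because it makes the inclusion probabilities exactly known and the estimators trivially unbiased; the methods discussed above use more refined designs (e.g., stratified sampling without replacement) precisely to reduce the variance that a naive design like this one incurs.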

[1] W. L. Stevens, et al. Sampling Without Replacement with Probability Proportional to Size, 1958.

[2] J. Rice. Mathematical Statistics and Data Analysis, 1988.

[3] Donna K. Harman, et al. Overview of the Third Text REtrieval Conference (TREC-3), 1995, TREC.

[4] Cyril Cleverdon. The Cranfield tests on index language devices, 1997.

[5] Charles L. A. Clarke, et al. Efficient construction of large test collections, 1998, SIGIR '98.

[6] Donna K. Harman, et al. Overview of the Eighth Text REtrieval Conference (TREC-8), 1999, TREC.

[7] Ian Soboroff, et al. Ranking retrieval systems without relevance judgments, 2001, SIGIR '01.

[8] Susan T. Dumais, et al. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004, SIGIR '04.

[9] Javed A. Aslam, et al. A unified model for metasearch, pooling, and system evaluation, 2003, CIKM '03.

[10] N. Butt. Sampling with Unequal Probabilities, 2003.

[11] Ellen M. Voorhees, et al. Retrieval evaluation with incomplete information, 2004, SIGIR '04.

[12] Charles L. A. Clarke, et al. The TREC 2005 Terabyte Track, 2005, TREC.

[13] Emine Yilmaz, et al. Measure-based metasearch, 2005, SIGIR '05.

[14] Charles L. A. Clarke, et al. The TREC terabyte retrieval track, 2005, SIGIR Forum.

[15] James Allan, et al. Minimal test collections for retrieval evaluation, 2006, SIGIR '06.

[16] Emine Yilmaz, et al. Estimating average precision with incomplete and imperfect judgments, 2006, CIKM '06.

[17] Emine Yilmaz, et al. A statistical method for system evaluation using incomplete judgments, 2006, SIGIR '06.

[18] José Luis Vicedo González, et al. TREC: Experiment and evaluation in information retrieval, 2007, J. Assoc. Inf. Sci. Technol.

[19] Javed A. Aslam, et al. Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions, 2007, ECIR.

[20] Chris P. Tsokos, et al. Mathematical Statistics with Applications, 2009.