Prioritizing relevance judgments to improve the construction of IR test collections

We consider the problem of optimally allocating a fixed budget to construct a test collection with associated relevance judgments, such that it can (i) accurately evaluate the relative performance of the participating systems, and (ii) generalize to new, previously unseen systems. We propose a two-stage approach. For a given set of queries, we adopt the traditional pooling method and use a portion of the budget to evaluate a set of documents retrieved by the participating systems. Next, we analyze the relevance judgments to prioritize the queries and the remaining pooled documents for further relevance assessments. The query prioritization is formulated as a convex optimization problem, which permits an efficient solution and provides a flexible framework for incorporating various constraints. Query-document pairs with the highest priority scores are evaluated using the remaining budget. We evaluate our resource optimization approach on the TREC 2004 Robust track collection and demonstrate that our optimization techniques are cost-efficient and yield a significant improvement in the reusability of the test collections.
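
The abstract does not specify the exact objective used in the convex query-prioritization step, so the following is only a minimal sketch of the general idea: allocate the remaining assessment budget across queries by maximizing an assumed concave (diminishing-returns) utility of additional judgments, subject to a linear budget constraint and per-query pool limits. The utility weights, unit costs, and the use of the cvxpy solver are all illustrative assumptions, not the paper's actual formulation.

```python
# Sketch of budget allocation via convex optimization (assumed objective, not the paper's).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_queries = 50
utility_weight = rng.uniform(0.5, 2.0, n_queries)    # assumed per-query value of extra judgments
cost_per_doc = np.ones(n_queries)                     # assumed unit assessment cost
pool_remaining = rng.integers(20, 100, n_queries)     # unjudged pooled documents per query
budget = 500                                          # remaining assessment budget

x = cp.Variable(n_queries, nonneg=True)               # extra judgments allocated to each query
# Concave sqrt models diminishing returns, so maximizing it is a convex problem.
objective = cp.Maximize(utility_weight @ cp.sqrt(x))
constraints = [cost_per_doc @ x <= budget, x <= pool_remaining]
cp.Problem(objective, constraints).solve()

priority = np.argsort(-x.value)                       # queries ranked by allocated budget
print("Top-5 prioritized queries:", priority[:5])
```

In practice, the per-query weights would be derived from the first-stage relevance judgments (e.g., how uncertain the system ranking still is for each query), and the highest-priority query-document pairs would then be sent for assessment.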
