Pooling-based continuous evaluation of information retrieval systems

The dominant approach to evaluating the effectiveness of information retrieval (IR) systems is to use reusable test collections built following the Cranfield paradigm. In this paper, we propose a new IR evaluation methodology based on pooled test collections and on the continuous use of either crowdsourcing or professional editors to obtain relevance judgements. Instead of building a static collection for a finite set of systems known a priori, we propose an IR evaluation paradigm in which retrieval approaches are evaluated iteratively on the same collection. Each new retrieval technique obtains the relevance judgements it is missing and thereby augments the overall set of relevance judgements of the collection. We also propose two metrics, the Fairness Score and the opportunistic number of relevant documents, which we then use to define new pooling strategies. The goal of this work is to study the behavior of standard IR metrics, of IR system rankings, and of several pooling techniques in a continuous evaluation context by comparing continuous and non-continuous evaluation results on classic test collections. We use both standard and crowdsourced relevance judgements, and we run an actual continuous evaluation campaign over several existing IR systems.
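
As a rough illustration of the iterative process just described, the Python sketch below evaluates a newly submitted run against a shared, growing set of judgements: the top-ranked documents of the new run are pooled, missing judgements are obtained (e.g. via crowdsourcing or professional editors) and added to the shared qrels so that later systems can reuse them. All names (continuous_evaluation, judge, precision_at_k) and the fixed-depth pooling are illustrative assumptions, not the paper's actual implementation.

    # A minimal sketch, assuming a shared qrels dictionary and a judge()
    # callable (e.g. a crowdsourcing task or a professional editor).
    # These names are hypothetical and not taken from the paper.

    def precision_at_k(ranking, topic_qrels, k=10):
        """Fraction of the top-k retrieved documents judged relevant (label > 0)."""
        return sum(1 for doc_id in ranking[:k] if topic_qrels.get(doc_id, 0) > 0) / k

    def continuous_evaluation(new_run, qrels, judge, metric=precision_at_k, pool_depth=100):
        """Evaluate a newly submitted run against a shared, growing judgement pool.

        new_run    : dict topic_id -> ranked list of doc_ids
        qrels      : dict (topic_id, doc_id) -> relevance label, shared across systems
        judge      : callable (topic_id, doc_id) -> relevance label
        pool_depth : number of top-ranked documents pooled per topic
        """
        scores = {}
        for topic_id, ranking in new_run.items():
            # Pool the top of the new run and obtain any judgement that is still
            # missing; the new labels are added to the shared qrels so that later
            # systems can reuse them at no extra assessment cost.
            for doc_id in ranking[:pool_depth]:
                if (topic_id, doc_id) not in qrels:
                    qrels[(topic_id, doc_id)] = judge(topic_id, doc_id)

            topic_qrels = {d: r for (t, d), r in qrels.items() if t == topic_id}
            scores[topic_id] = metric(ranking, topic_qrels)

        # Mean effectiveness over topics; qrels has been augmented in place.
        return sum(scores.values()) / len(scores)

In this sketch the augmented qrels simply carry over to the next evaluated system, which is the essence of the continuous paradigm; the pooling step (here a fixed-depth pool) is the component that measures such as the Fairness Score and the opportunistic number of relevant documents are intended to guide.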
