Low-cost evaluation techniques for information retrieval systems: A review

In system-based information retrieval (IR) evaluation, building test collections remains a costly undertaking. Producing relevance judgments is an expensive, time-consuming task that must be performed by human assessors, and for large collections it is not feasible to assess the relevance of every document in the corpus against every topic. In experimental settings, partial judgments obtained through pooling are therefore used as a substitute for complete relevance assessment. As the numbers of documents, topics, and retrieval systems continue to grow, performing low-cost evaluations that still yield reliable results has become essential. Researchers have sought to reduce the cost of experimental IR evaluation by reducing the number of relevance judgments required, or by eliminating them altogether, while preserving reliability. This paper reviews state-of-the-art approaches to low-cost retrieval evaluation under the following categories: selecting the best sets of documents to judge; evaluation measures robust to incomplete judgments; statistical inference of evaluation metrics; inference of relevance judgments; query selection; techniques for testing the reliability of an evaluation and the reusability of the constructed collections; and alternative methods to pooling. The paper is intended to link the reader to the corpus of ‘must read’ papers in the area of low-cost evaluation of IR systems.
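
To make the pooling idea concrete, the sketch below (Python) builds a depth-k pool from a set of system runs: for each topic, the top-k documents from every contributing system are merged, and only this pooled subset is handed to assessors. This is a minimal illustration under the assumption of TREC-style run files; the function name build_pool and the example file names are hypothetical and do not correspond to any specific procedure surveyed in this review.

    from collections import defaultdict

    def build_pool(run_files, depth=100):
        # Depth-k pooling: for each topic, take the union of the top-k
        # documents retrieved by every contributing system. Only pooled
        # documents are judged; unpooled documents are treated as
        # non-relevant. Assumes TREC-style run lines
        # ("topic Q0 docid rank score tag") listed in rank order per topic.
        pool = defaultdict(set)          # topic -> docids to judge
        for path in run_files:
            seen = defaultdict(int)      # documents counted so far per topic
            with open(path) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 6:
                        continue
                    topic, docid = parts[0], parts[2]
                    seen[topic] += 1
                    if seen[topic] <= depth:
                        pool[topic].add(docid)
        return pool

    # Example (hypothetical run files):
    #   pool = build_pool(["bm25.run", "lm.run"], depth=100)
    # Judging cost is sum(len(d) for d in pool.values()) assessments,
    # far fewer than |topics| x |corpus| for a large collection.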
