CHEERS: CHeap & Engineered Evaluation of Retrieval Systems

In test-collection-based evaluation of retrieval effectiveness, much research has investigated directions for a more economical and semi-automatic evaluation of retrieval systems. Although several methods have been proposed and experimentally evaluated, their accuracy still appears limited. In this paper we present our proposal for a more engineered approach to information retrieval evaluation.
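For context, the sketch below illustrates the conventional test-collection evaluation that cheaper, semi-automatic methods aim to approximate: a system is scored against relevance judgments (qrels) by mean average precision over a set of topics. This is a minimal illustration under assumed toy data, not the CHEERS method itself; the topic and document identifiers are invented for the example.

```python
# Minimal sketch of test-collection-based evaluation: given relevance
# judgments (qrels) and a ranked run per topic, score a system by mean
# average precision (MAP). The example data below is purely illustrative.

def average_precision(ranked_docs, relevant_docs):
    """Average precision of one ranked list against a set of relevant docs."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs) if relevant_docs else 0.0

def mean_average_precision(run, qrels):
    """MAP over all judged topics; topics missing from the run score 0."""
    scores = [average_precision(run.get(topic, []), rel_docs)
              for topic, rel_docs in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    qrels = {"t1": {"d1", "d3"}, "t2": {"d2"}}            # topic -> relevant docs
    run = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d2"]}  # topic -> ranked docs
    print(f"MAP = {mean_average_precision(run, qrels):.3f}")
```

Low-cost evaluation research tries to recover system rankings close to the ones such judged-by-hand scoring would produce, while reducing the judging effort it requires.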
