Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments

The evaluation of retrieval effectiveness by means of test collections is a commonly used methodology in the information retrieval field. Some researchers have addressed the intriguing research question of whether effectiveness can be evaluated completely automatically, without human relevance assessments. Since human relevance assessment is one of the main costs of building a test collection, in both time and money, this rather ambitious goal would have practical impact. In this article, we reproduce the main results on evaluating information retrieval systems without relevance judgments; furthermore, we generalize that work to analyze the effect of test collections, evaluation metrics, and pool depth. We also extend the idea to semi-automatic evaluation and to the estimation of topic difficulty. Our results show that (i) previous work is overall reproducible, although some specific results are not; (ii) collection, metric, and pool depth all affect the automatic evaluation of systems, which is nevertheless accurate in several cases; (iii) semi-automatic evaluation is an effective methodology; and (iv) automatic evaluation can, to some extent, be used to predict topic difficulty.
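The line of work reproduced here includes the random pseudo-relevance-judgment approach of Soboroff et al. (SIGIR 2001): pool the documents retrieved by the runs, randomly mark a fraction of the pool as if it were relevant, score every run against these pseudo judgments, and use the resulting scores to rank the systems. The Python sketch below is only a minimal illustration of that pipeline, not the article's implementation; the data structures, function names, P@k as the metric, and the sampling rate are assumptions of this example.

```python
import random
from collections import defaultdict

def pseudo_qrels(runs, pool_depth=100, sample_rate=0.1, seed=42):
    """Build pseudo relevance judgments by pooling the runs and randomly
    marking a fraction of each topic's pool as relevant.
    `runs` maps run_id -> {topic_id -> ranked list of doc_ids}."""
    rng = random.Random(seed)
    qrels = defaultdict(set)
    topics = {t for ranking in runs.values() for t in ranking}
    for topic in topics:
        pool = set()
        for ranking in runs.values():
            pool.update(ranking.get(topic, [])[:pool_depth])
        if not pool:
            continue
        n_sampled = max(1, int(sample_rate * len(pool)))
        qrels[topic].update(rng.sample(sorted(pool), n_sampled))
    return qrels

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are (pseudo) relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def rank_systems(runs, qrels, k=10):
    """Score each run by its mean P@k over topics under the pseudo judgments
    and return the runs sorted from most to least effective."""
    scores = {}
    for run_id, ranking in runs.items():
        per_topic = [precision_at_k(ranking.get(t, []), rel, k)
                     for t, rel in qrels.items()]
        scores[run_id] = sum(per_topic) / len(per_topic)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with two hypothetical runs and two topics.
runs = {
    "runA": {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d2", "d5"]},
    "runB": {"t1": ["d3", "d7", "d1", "d8"], "t2": ["d5", "d6", "d9"]},
}
qrels = pseudo_qrels(runs, pool_depth=10, sample_rate=0.5)
print(rank_systems(runs, qrels, k=3))
```

In a study like the one summarized above, the ranking produced from pseudo judgments would then be compared against the official ranking obtained from human assessments, typically with a rank correlation coefficient such as Kendall's tau; the sketch stops at producing the pseudo ranking.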
