Effectiveness evaluation without human relevance judgments: A systematic analysis of existing methods and of their combinations

Abstract: In test-collection-based evaluation of retrieval effectiveness, it has been suggested that human relevance judgments can be avoided entirely. Although several methods have been proposed for this purpose, their accuracy is still limited. In this paper we present two overall contributions. First, we provide a systematic comparison of the most widely adopted previous approaches on a large set of 14 TREC collections. We analyze the methods in a homogeneous and complete way, in terms of both the accuracy measures used and the datasets selected, showing that considerably different results can be obtained depending on the method, dataset, and measure considered. Second, we study the combination of such methods, which, to the best of our knowledge, has not been investigated so far. Our experimental results show that simple combination strategies based on data fusion techniques are usually not effective and can even be harmful. However, more sophisticated solutions based on machine learning are indeed effective and often outperform all individual methods. Moreover, they are more stable, as they show smaller variation across datasets. Our results have the practical implication that, when trying to automatically evaluate retrieval effectiveness, researchers should not rely on a single method, but on a (machine-learning-based) combination of them.
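The abstract does not spell out how the machine-learning-based combination works; the sketch below is only an illustration of the general idea, not the authors' actual pipeline. It assumes (my assumption) that each judgment-free method produces a score per retrieval system, trains a regressor (here a random forest, via scikit-learn) on collections where true effectiveness from real relevance judgments is available, and then checks how well the combined score ranks systems on a held-out collection using Kendall's tau. All data in the example is synthetic.

```python
# Hedged sketch (not the paper's actual method): learn to combine the scores
# produced by several judgment-free evaluation methods, then evaluate the
# combination as a ranking of systems via Kendall's tau.
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: rows = retrieval systems, columns = scores assigned by
# individual judgment-free methods (e.g., pseudo-judgment or overlap-based
# scores). y holds the "true" effectiveness (e.g., MAP computed with real
# relevance judgments) on the collections used for training.
rng = np.random.default_rng(0)
X_train = rng.random((100, 5))                    # 100 systems x 5 methods
y_train = X_train @ rng.random(5) + 0.1 * rng.random(100)

X_test = rng.random((30, 5))                      # systems from a held-out collection
y_test = X_test @ rng.random(5)                   # ground truth (unknown in practice)

# Learn how to combine the individual methods' scores.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# What matters is the induced system ranking, not the absolute scores.
tau, _ = kendalltau(y_test, y_pred)
print(f"Kendall's tau between predicted and true system rankings: {tau:.3f}")
```

In this setup, training and test systems come from different collections, mirroring the cross-collection evaluation implied by the abstract's claim that the learned combination is more stable across datasets than any individual method.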
