Reproduce and Improve

Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive in terms of resources, time, and money, so many researchers have proposed methods for cheaper evaluation. One particular approach, on which we focus in this article, is to use fewer topics: in TREC-like initiatives, system effectiveness is usually evaluated as the average effectiveness over a set of n topics (typically n=50, although sets of more than 1,000 topics have also been used); instead of using the full set, it has been proposed to find small subsets of a few good topics that evaluate the systems as similarly as possible to the full set. The computational complexity of this task has so far limited the analyses that could be performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our new implementation allows us to reproduce some notable results on topic set reduction, as well as to run new experiments that generalize and improve on those results. We show that our approach both reproduces the main state-of-the-art results and lets us analyze the effect of the collection, the metric, and the pool depth used for the evaluation. Finally, unlike previous studies, which have been mainly theoretical, we also discuss some practical topic selection strategies that integrate the results of automatic evaluation approaches.
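To make the topic set reduction task concrete, the following minimal Python sketch (not the authors' implementation) illustrates the kind of search involved: given a systems-by-topics matrix of effectiveness scores, an evolutionary loop looks for a small subset of topics whose induced system ranking agrees, via Kendall's tau, with the ranking produced by the full topic set. The function names, the fixed subset cardinality, and the simple mutation-plus-elitism scheme are all illustrative assumptions; the approach described above is multi-objective, whereas this sketch fixes the subset size and optimizes ranking agreement only.

    # Illustrative sketch (not the paper's code): evolutionary search for a
    # small topic subset whose system ranking agrees with the full-set ranking.
    # Assumptions: per-topic effectiveness scores are given as a
    # (systems x topics) matrix, agreement is measured with Kendall's tau,
    # and the subset cardinality is fixed rather than a second objective.
    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(42)

    def subset_fitness(scores, subset):
        """Kendall's tau between system means on the subset and on all topics."""
        full_ranking = scores.mean(axis=1)
        subset_ranking = scores[:, subset].mean(axis=1)
        tau, _ = kendalltau(full_ranking, subset_ranking)
        return tau

    def evolve_topic_subset(scores, cardinality, pop_size=50, generations=200):
        """Simple elitist evolutionary search over topic subsets of a fixed size."""
        n_topics = scores.shape[1]
        # Each individual is an array of topic indices of the requested cardinality.
        population = [rng.choice(n_topics, size=cardinality, replace=False)
                      for _ in range(pop_size)]
        for _ in range(generations):
            offspring = []
            for parent in population:
                child = parent.copy()
                # Mutation: swap one selected topic for a currently unselected one.
                out = rng.integers(cardinality)
                candidates = np.setdiff1d(np.arange(n_topics), child)
                child[out] = rng.choice(candidates)
                offspring.append(child)
            # Elitist survivor selection: keep the best pop_size subsets.
            merged = population + offspring
            merged.sort(key=lambda s: subset_fitness(scores, s), reverse=True)
            population = merged[:pop_size]
        best = population[0]
        return np.sort(best), subset_fitness(scores, best)

    # Toy usage: 20 systems evaluated on 50 topics with synthetic scores.
    scores = rng.random((20, 50))
    subset, tau = evolve_topic_subset(scores, cardinality=8)
    print(f"selected topics: {subset}, tau vs. full set: {tau:.3f}")

Treating the cardinality as a second objective rather than a fixed parameter would yield a Pareto front of subsets, one per size, which is closer to the multi-objective formulation mentioned in the abstract.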
