Fewer topics? A million topics? Both?! On topics subsets in test collections

When evaluating IR run effectiveness using a test collection, a key question is: which search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, all of which feature more than 50 topics; collections of this size have not been examined in past work. Our analysis finds that a subset of topics can be found that ranks runs as accurately as the full topic set. Further, we show that the size of this subset, relative to the full topic set, can be substantially smaller than past work suggested. We also study topic subsets in the context of the power of statistical significance tests, and find a trade-off: using such subsets means some significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest that we still lack methodologies that provide stability guarantees for topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy for finding and characterizing good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
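To make "subset accuracy" concrete: in this line of work, a topic subset is typically judged by how well the run ranking it induces correlates (commonly via Kendall's tau) with the ranking produced by the full topic set. The following Python sketch illustrates that comparison under stated assumptions; it is not the paper's code, and the score matrix is randomly generated purely as a stand-in for real per-topic effectiveness scores (e.g., average precision).

```python
# Minimal sketch (not the paper's code): compare the run ranking induced by
# a topic subset against the ranking from the full topic set.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_runs, n_topics = 30, 150                   # illustrative collection size
scores = rng.random((n_runs, n_topics))      # stand-in for per-topic scores

# Mean effectiveness per run over the full topic set.
full_means = scores.mean(axis=1)

# Mean effectiveness per run over a (here: random) topic subset.
subset_size = 20
subset = rng.choice(n_topics, size=subset_size, replace=False)
subset_means = scores[:, subset].mean(axis=1)

# Rank correlation between the two run orderings.
tau, _ = kendalltau(full_means, subset_means)
print(f"Kendall's tau between full-set and subset rankings: {tau:.3f}")
```

In practice, the random subset above would be replaced by a carefully selected one, and the correlation compared against that of random subsets of the same size.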
