Ranking Retrieval Systems without Relevance Assessments: Revisited

We re-examine the problem of ranking retrieval systems without relevance assessments in the context of collaborative evaluation forums such as TREC and NTCIR. The problem was first tackled by Soboroff, Nicholas and Cahan in 2001, using data from TRECs 3-8 [7]. Our long-term goal is to semi-automate repeated evaluation of search engines; our short-term goal is to provide NTCIR participants with a “system ranking forecast” prior to conducting manual relevance assessments, thereby reducing researchers’ idle time and accelerating research. Our extensive experiments using graded-relevance test collections from TREC and NTCIR compare several existing methods for ranking systems without relevance assessments. We show that (a) the simplest method of forming “pseudo-qrels” based on how many systems returned each pooled document performs as well as any other existing method; and that (b) the NTCIR system rankings tend to be easier to predict than the TREC robust track system rankings, and moreover, the NTCIR pseudo-qrels yield fewer false alarms than the TREC pseudo-qrels do in statistical significance testing. These differences between TREC and NTCIR may arise because TREC sorts pooled documents by document ID before relevance assessment, whereas NTCIR sorts them primarily by the number of systems that returned each document. However, we show that, even for the TREC robust track data, documents returned by many systems are indeed more likely to be relevant than those returned by fewer systems.
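The “pseudo-qrel” method summarised above can be sketched in a few lines: for each topic, count how many contributing runs returned each pooled document, and treat documents returned by at least a chosen number of systems as pseudo-relevant, with the vote count acting as a pseudo relevance grade. The sketch below is a minimal illustration under assumed names and inputs (the function build_pseudo_qrels, the pool_depth and vote_threshold parameters, and the run format are hypothetical), not the exact procedure evaluated in the paper.

```python
from collections import defaultdict

def build_pseudo_qrels(runs, pool_depth=100, vote_threshold=1):
    """Form pseudo-qrels by counting how many runs returned each pooled document.

    runs: dict mapping a run name to {topic_id: ranked list of doc_ids}.
    Documents ranked within pool_depth by at least vote_threshold runs are
    treated as pseudo-relevant; the vote count serves as a pseudo grade.
    (Illustrative sketch only; parameter names are assumptions.)
    """
    votes = defaultdict(lambda: defaultdict(int))  # topic_id -> doc_id -> votes
    for ranked_lists in runs.values():
        for topic_id, docs in ranked_lists.items():
            for doc_id in docs[:pool_depth]:
                votes[topic_id][doc_id] += 1

    return {
        topic_id: {doc_id: n for doc_id, n in doc_votes.items() if n >= vote_threshold}
        for topic_id, doc_votes in votes.items()
    }


# Toy usage: two runs, one topic; document "D1" is returned by both runs.
runs = {
    "runA": {"T1": ["D1", "D2", "D3"]},
    "runB": {"T1": ["D1", "D4"]},
}
print(build_pseudo_qrels(runs, pool_depth=2, vote_threshold=2))
# -> {'T1': {'D1': 2}}
```

Evaluating each run against such pseudo-qrels with a standard effectiveness measure then yields the “system ranking forecast” described in the abstract, to be compared later against the ranking produced by real relevance assessments.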

[1] Stephen E. Robertson et al. A new rank correlation coefficient for information retrieval, 2008, SIGIR '08.

[2] Alan Halverson et al. Generating labels from clicks, 2009, WSDM '09.

[3] Alistair Moffat et al. Improvements that don't add up: ad-hoc retrieval results since 1998, 2009, CIKM.

[4] Rabia Nuray-Turan et al. Automatic ranking of information retrieval systems using data fusion, 2006, Inf. Process. Manag.

[5] Ellen M. Voorhees et al. Overview of the TREC 2004 Robust Retrieval Track, 2004.

[6] Noriko Kando et al. Ranking the NTCIR ACLIA IR4QA Systems without Relevance Assessments, 2009.

[7] Ian Soboroff et al. Ranking retrieval systems without relevance judgments, 2001, SIGIR '01.

[8] Ellen M. Voorhees et al. The Philosophy of Information Retrieval Evaluation, 2001, CLEF.

[9] J. Ioannidis. Why Most Published Research Findings Are False, 2005, PLoS Medicine.

[10] James Allan et al. Evaluation over thousands of queries, 2008, SIGIR '08.

[11] Mark Sanderson et al. Forming test collections with no system pooling, 2004, SIGIR '04.

[12] Ben Carterette et al. On rank correlation and the distance between rankings, 2009, SIGIR.

[13] Stephen E. Robertson et al. A few good topics: Experiments in topic set reduction for retrieval evaluation, 2009, TOIS.

[14] Anselm Spoerri et al. Using the structure of overlap between search results to rank retrieval systems without relevance judgments, 2007, Inf. Process. Manag.

[15] Ellen M. Voorhees et al. Variations in relevance judgments and the measurement of retrieval effectiveness, 1998, SIGIR '98.

[16] Noriko Kando et al. Overview of the NTCIR-7 ACLIA IR4QA Task, 2008, NTCIR.

[17] Noriko Kando et al. On information retrieval metrics designed for evaluation with incomplete relevance assessments, 2008, Information Retrieval.

[18] Hsin-Hsi Chen et al. Overview of CLIR Task at the Sixth NTCIR Workshop, 2005, NTCIR.

[19] Shengli Wu et al. Methods for ranking information retrieval systems without relevance judgments, 2003, SAC '03.

[20] Javed A. Aslam et al. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments, 2003, SIGIR.

[21] Abbe Mowshowitz et al. Assessing bias in search engines, 2002, Inf. Process. Manag.