Score Aggregation Techniques in Retrieval Experimentation

Comparative evaluations of information retrieval systems are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgements can be generated, and that systems can be sensibly compared based on their aggregate performance over the selected topic set. This paper considers the role of the third of these assumptions -- that the performance of a system on a set of topics can be represented by a single overall performance score such as the average, or some other central statistic. In particular, we experiment with score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. Using past TREC runs we show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization (Webber et al., SIGIR 2008) achieves the same outcome in a more consistent manner.
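
As a concrete illustration of the aggregation statistics discussed above, the following Python sketch computes the arithmetic mean, an epsilon-adjusted geometric mean, a similarly adjusted harmonic mean, and the median over a set of per-topic scores, and also sketches per-topic standardization in the spirit of Webber et al. (SIGIR 2008). This is a minimal sketch: the function names, the example scores, and the epsilon value of 1e-5 are illustrative assumptions, not the exact settings used in the paper.

```python
import math
import statistics
from typing import Sequence

EPS = 1e-5  # illustrative smoothing constant; not necessarily the paper's value

def arithmetic_mean(scores: Sequence[float]) -> float:
    """Plain average of per-topic scores (MAP, when the scores are AP values)."""
    return sum(scores) / len(scores)

def adjusted_geometric_mean(scores: Sequence[float], eps: float = EPS) -> float:
    """Geometric mean with each score offset by eps, so that near-zero topics
    do not drive the aggregate to zero (the GMAP-style adjustment)."""
    mean_log = sum(math.log(s + eps) for s in scores) / len(scores)
    return math.exp(mean_log) - eps

def adjusted_harmonic_mean(scores: Sequence[float], eps: float = EPS) -> float:
    """Harmonic mean with the same eps offset, so zero scores are tolerated."""
    return len(scores) / sum(1.0 / (s + eps) for s in scores) - eps

def median_score(scores: Sequence[float]) -> float:
    """Median per-topic score; insensitive to a handful of extreme topics."""
    return statistics.median(scores)

def standardize_topic(scores_for_topic: Sequence[float]) -> list[float]:
    """Per-topic standardization in the spirit of Webber et al. (2008):
    each system's score on a topic becomes a z-score relative to all systems'
    scores on that topic, then is mapped through the standard normal CDF
    so that the result lies in [0, 1]."""
    mu = statistics.mean(scores_for_topic)
    sigma = statistics.stdev(scores_for_topic)
    return [0.5 * (1.0 + math.erf((s - mu) / (sigma * math.sqrt(2.0))))
            for s in scores_for_topic]

if __name__ == "__main__":
    # One strong topic and several near-failures for a single system.
    per_topic_ap = [0.82, 0.35, 0.04, 0.01, 0.00]
    print(f"arithmetic mean        : {arithmetic_mean(per_topic_ap):.4f}")
    print(f"adjusted geometric mean: {adjusted_geometric_mean(per_topic_ap):.4f}")
    print(f"adjusted harmonic mean : {adjusted_harmonic_mean(per_topic_ap):.4f}")
    print(f"median                 : {median_score(per_topic_ap):.4f}")

    # Five systems' scores on one topic, standardized against each other.
    print(standardize_topic([0.10, 0.22, 0.35, 0.48, 0.60]))
```

On this kind of skewed score profile, the arithmetic mean is dominated by the one strong topic, whereas the adjusted geometric and harmonic means are pulled down by the near-zero topics, which is the behaviour the aggregation comparison in the paper examines.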

[1] Ellen M. Voorhees, et al. Overview of the TREC 2004 Robust Retrieval Track, 2004.

[2] Mark Sanderson, et al. The good and the bad system: does the test collection predict users' effectiveness?, 2008, SIGIR '08.

[3] Alistair Moffat, et al. Precision-at-ten considered redundant, 2008, SIGIR '08.

[4] Stephen E. Robertson, et al. A new rank correlation coefficient for information retrieval, 2008, SIGIR '08.

[5] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[6] Gordon V. Cormack, et al. Validity and power of t-test for comparing MAP and GMAP, 2007, SIGIR '07.

[7] Stephen E. Robertson, et al. On GMAP: and other transformations, 2006, CIKM '06.

[8] M. Kendall. Rank Correlation Methods, 1949.

[9] Chris Buckley, et al. Topic prediction based on comparative retrieval rankings, 2004, SIGIR '04.

[10] Ronald Fagin, et al. Efficient similarity search and classification via rank aggregation, 2003, SIGMOD '03.

[11] Stefano Mizzaro, et al. The Good, the Bad, the Difficult, and the Easy: Something Wrong with Information Retrieval Evaluation?, 2008, ECIR.

[12] Mark T. Keane, et al. Modeling user behavior using a search-engine, 2007, IUI '07.

[13] Alistair Moffat, et al. Rank-biased precision for measurement of retrieval effectiveness, 2008, TOIS.

[14] Justin Zobel, et al. How reliable are the results of large-scale information retrieval experiments?, 1998, SIGIR '98.

[15] Mark Sanderson, et al. Information retrieval system evaluation: effort, sensitivity, and reliability, 2005, SIGIR '05.

[16] R. Forthofer. Rank Correlation Methods, 1981.

[17] Chris Buckley. Why current IR engines fail, 2004, SIGIR '04.

[18] Alistair Moffat, et al. Score standardization for inter-collection comparison of retrieval systems, 2008, SIGIR '08.

[19] Giorgio Maria Di Nunzio, et al. How robust are multilingual information retrieval systems?, 2008, SAC '08.

[20] Stephen E. Robertson, et al. HITS hits TREC: exploring IR evaluation results with network analysis, 2007, SIGIR '07.

[21] Stephen E. Robertson, et al. A new interpretation of average precision, 2008, SIGIR '08.