Forming test collections with no system pooling

Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test collections without system pooling. First, a collection formation technique that combines manual relevance feedback with multiple systems is adapted to work with a single retrieval system. Second, an existing method based on pooling the output of multiple manual searches is re-examined, testing a wider range of searchers and retrieval systems than has been studied before. Third, a new approach is explored in which the ranked output of a single automatic search on a single retrieval system is assessed for relevance, with no pooling whatsoever. Using established techniques for evaluating the quality of relevance judgments, it is shown that in all three cases the resulting test collections are as good as those of TREC.
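
The "established techniques" referred to above typically compare the ranking of retrieval systems produced under two different sets of relevance judgments, most commonly with Kendall's tau rank correlation. The sketch below illustrates that comparison; it is not code from the paper, and the run names and MAP scores are invented for illustration.

```python
# Minimal sketch (not from the paper): assess an alternative set of relevance
# judgments by checking whether it ranks retrieval runs the same way the
# official pooled judgments do, using Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical mean average precision scores for the same runs under the
# official (pooled) judgments and under an alternative, non-pooled set.
map_official    = {"runA": 0.31, "runB": 0.27, "runC": 0.24, "runD": 0.19}
map_alternative = {"runA": 0.29, "runB": 0.28, "runC": 0.22, "runD": 0.18}

runs = sorted(map_official)  # fix a common run order for both score lists
tau, p_value = kendalltau([map_official[r] for r in runs],
                          [map_alternative[r] for r in runs])
print(f"Kendall's tau between system rankings: {tau:.2f} (p = {p_value:.3f})")
# A tau close to 1.0 means the alternative judgments rank systems almost
# identically to the official pooled judgments.
```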
