Improving test collection pools with machine learning

IR experiments typically rely on test collections for evaluation. Such collections are formed by judging a pool of documents retrieved, for each topic, by a combination of automatic and manual runs. The proportion of relevant documents found for each topic depends on the diversity of the submitted runs and on the depth to which they are assessed (the pool depth). Manual runs are commonly believed to reduce bias in test collections when evaluating new IR systems. In this work, we explore alternative ways of improving test collection reliability. We show that fully automated methods can recognise a large portion of the relevant documents that would normally be found only through manual runs. Our approach combines simple fusion methods with machine learning, and it demonstrates the potential to find many more relevant documents than traditional pooling does. These initial results are promising and can be extended in future studies to help test collection curators ensure that judgment coverage is maintained across the entire document collection.
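To make the general idea concrete, the sketch below shows one way such a pipeline could look: automatic runs are fused with reciprocal rank fusion (RRF), and a simple linear classifier trained on already-judged documents is then used to surface likely-relevant documents that the fused pool missed. This is a minimal illustration, not the paper's exact method; the RRF constant k=60, the TF-IDF features, and the scikit-learn LogisticRegression with the liblinear solver are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' exact pipeline):
# (1) fuse automatic runs with reciprocal rank fusion, and
# (2) train a simple linear classifier on already-judged documents
#     to rank unjudged documents for additional assessment.

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def rrf_fuse(runs, k=60):
    """Fuse ranked lists: score(d) = sum over runs of 1 / (k + rank).

    `runs` is a list of rankings, each a list of doc ids, best first.
    The constant k=60 is a common default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in runs:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def suggest_extra_judgments(doc_texts, judged, unjudged, top_n=100):
    """Rank unjudged documents by predicted relevance.

    `doc_texts` maps doc id -> text, `judged` maps doc id -> label
    (1 = relevant, 0 = not relevant; both labels must be present),
    and `unjudged` is a list of candidate doc ids outside the pool.
    """
    vec = TfidfVectorizer(stop_words="english")
    X_train = vec.fit_transform(doc_texts[d] for d in judged)
    y_train = [judged[d] for d in judged]
    clf = LogisticRegression(solver="liblinear")  # simple linear model
    clf.fit(X_train, y_train)

    X_unjudged = vec.transform(doc_texts[d] for d in unjudged)
    probs = clf.predict_proba(X_unjudged)[:, 1]
    ranked = sorted(zip(unjudged, probs), key=lambda p: p[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]
```

In this sketch, the fused ranking would be judged to the chosen pool depth, and the classifier's top-ranked unjudged documents would then be handed to assessors as candidate additions to the pool.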
