An Easter Egg Hunting Approach to Test Collection Building in Dynamic Domains

Test collections for offline evaluation remain crucial for information retrieval research and industrial practice, yet the classical Sparck Jones and Van Rijsbergen approach to test collection building, based on pooling runs over a large collection, is expensive and is being pushed beyond its limits by the ever-increasing size and dynamic nature of today's collections. We experiment with a novel approach to reusable test collection building in which we inject judged pages into an existing corpus and have systems retrieve pages from the extended corpus, with the aim of creating a reusable test collection. In a metaphorical way, we hide the Easter eggs for systems to retrieve. Our experiments exploit the unique setup of the TREC Contextual Suggestion Track, which allowed submissions both from a fixed corpus (ClueWeb12) and from the open web. We conduct an extensive analysis of the reusability of the test collection based on ClueWeb12, and find it too low for reliable offline testing. We then detail the expansion with judged pages from the open web, perform an extensive analysis of the reusability of the resulting expanded test collection, and observe a dramatic increase in reusability. Our approach offers novel and cost-effective ways to build new test collections, and to refresh and update existing ones, opening up effective maintenance of offline test collections for dynamic domains such as the web.
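To make the two steps of the abstract concrete, the sketch below (not the authors' code) illustrates, under simplified assumptions, (1) extending a fixed corpus with judged open-web pages, i.e. "hiding the Easter eggs", and (2) a leave-one-group-out reusability check that correlates system rankings produced with the full judgments against rankings produced with only the judgments contributed by the other groups. Runs are reduced to flat lists of document ids for a single topic, and `evaluate_run` is a hypothetical placeholder for any evaluation measure (e.g. P@5).

```python
from itertools import combinations


def extend_corpus(corpus: dict, judged_pages: dict) -> dict:
    """Inject judged open-web pages into an existing corpus (e.g. ClueWeb12)."""
    extended = dict(corpus)          # keep the original documents
    extended.update(judged_pages)    # add the judged pages under new doc ids
    return extended


def kendall_tau(ranking_a: list, ranking_b: list) -> float:
    """Kendall's tau (tau-a, ties ignored) between two rankings of the same systems."""
    pos_a = {s: i for i, s in enumerate(ranking_a)}
    pos_b = {s: i for i, s in enumerate(ranking_b)}
    concordant = discordant = 0
    for s1, s2 in combinations(ranking_a, 2):
        agree = (pos_a[s1] - pos_a[s2]) * (pos_b[s1] - pos_b[s2])
        concordant += agree > 0
        discordant += agree < 0
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) / 2
    return (concordant - discordant) / max(n_pairs, 1)


def leave_one_group_out(groups: dict, qrels: dict, evaluate_run) -> dict:
    """For each group, compare the ranking of all systems under the full qrels
    with the ranking under qrels restricted to documents contributed by the
    *other* groups; a high tau suggests the collection remains usable for
    systems that did not contribute to the pool."""
    def rank(systems: dict, judged: dict) -> list:
        scores = {name: evaluate_run(run, judged) for name, run in systems.items()}
        return sorted(scores, key=scores.get, reverse=True)

    all_runs = {name: run for runs in groups.values() for name, run in runs.items()}
    taus = {}
    for held_out, _ in groups.items():
        contributed = {doc for other, runs in groups.items() if other != held_out
                       for run in runs.values() for doc in run}
        reduced_qrels = {doc: rel for doc, rel in qrels.items() if doc in contributed}
        taus[held_out] = kendall_tau(rank(all_runs, qrels),
                                     rank(all_runs, reduced_qrels))
    return taus
```

The same leave-one-group-out correlation can be computed before and after the open-web expansion; the paper's "dramatic increase in reusability" corresponds to a jump in these correlations once the extra judged pages are part of the collection.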
