Constructing test collections by inferring document relevance via extracted relevant information

The goal of a typical information retrieval system is to satisfy a user's information need, e.g., by providing an answer or information "nugget", while its actual search space consists of documents, i.e., collections of nuggets. In this paper, we characterize this relationship between nuggets and documents and discuss its applications to system evaluation. In particular, for the problem of constructing test collections for IR system evaluation, we demonstrate a highly efficient algorithm that simultaneously obtains both relevant documents and relevant information. Our technique exploits the mutually reinforcing relationship between relevant documents and relevant information, yielding document-based test collections whose efficiency and efficacy exceed those of typical Cranfield-style test collections, while also producing sets of highly relevant information.
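To make the mutually reinforcing loop described above concrete, the following Python sketch shows one plausible reading of it: judged-relevant documents yield nuggets of relevant information, and those nuggets are matched against unjudged documents both to prioritize further assessment and to infer relevance once the judging budget is exhausted. This is a minimal, hypothetical illustration only; the function names (`covers`, `build_judgments`), the token-overlap matcher, the greedy selection rule, and the 0.6 threshold are assumptions for exposition and are not the algorithm evaluated in the paper.

```python
def covers(nugget, document, threshold=0.6):
    """Crude token-overlap proxy for 'this document contains this nugget'."""
    n_tokens = set(nugget.lower().split())
    d_tokens = set(document.lower().split())
    if not n_tokens:
        return False
    return len(n_tokens & d_tokens) / len(n_tokens) >= threshold


def build_judgments(documents, assess, extract_nuggets, budget):
    """
    documents:               dict mapping doc_id -> document text
    assess(doc_id):          human assessor oracle returning True/False relevance
    extract_nuggets(doc_id): relevant passages pulled from a judged-relevant document
    budget:                  number of documents the assessor will judge
    """
    nuggets = []          # relevant information gathered so far
    judged = {}           # doc_id -> assessor-provided relevance
    unjudged = set(documents)

    for _ in range(budget):
        if not unjudged:
            break
        # Prioritize the unjudged document covered by the most known nuggets.
        doc_id = max(
            unjudged,
            key=lambda d: sum(covers(n, documents[d]) for n in nuggets),
        )
        unjudged.remove(doc_id)
        relevant = assess(doc_id)
        judged[doc_id] = relevant
        if relevant:
            # Relevant documents yield new nuggets, which in turn help locate
            # (and later infer) more relevant documents.
            nuggets.extend(extract_nuggets(doc_id))

    # Infer relevance of the remaining documents from nugget coverage alone.
    inferred = {
        d: any(covers(n, documents[d]) for n in nuggets)
        for d in unjudged
    }
    return judged, inferred, nuggets
```

In this sketch each assessor judgment does double duty: a relevant judgment enlarges the nugget pool, and the nugget pool both steers subsequent judging and labels the remaining documents without further assessor effort, which is where the efficiency gain over a purely document-by-document Cranfield-style pooling process would come from.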
