Paper Information - Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks

For TREC 2009, we exhaustively classified every document in each corpus, using machine learning methods that had previously been shown to work well for email spam [9, 3]. We treated each document as a sequence of bytes, with no tokenization or parsing of tags or meta-information. This approach was used exclusively for the adhoc web, diversity and relevance feedback tasks, as well as for the batch legal task: the ClueWeb09 and Tobacco collections were processed end-to-end and never indexed. We did the interactive legal task in two phases: first, we used interactive search and judging to find a large and diverse set of training examples; then we used an active learning process, similar to what we used for the other tasks, to find more relevant documents. Finally, we fitted a censored (i.e., truncated) mixed normal distribution to estimate recall and the cutoff that optimizes F1, the principal effectiveness measure.
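
The idea of classifying documents as raw byte sequences, with no tokenization or tag parsing, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the features or the learner, so the overlapping byte 4-grams, the hashed feature space, the online logistic regression update, and every name and constant below (`byte_ngram_features`, `OnlineLogisticRegression`, `NUM_BUCKETS`, `LEARNING_RATE`) are hypothetical choices, not the authors' implementation. The toy documents are made up for the example.

```python
import math
import zlib

NUM_BUCKETS = 1 << 20  # assumed size of the hashed feature space
NGRAM = 4              # assumed n-gram length, in bytes
LEARNING_RATE = 0.1    # assumed gradient step size


def byte_ngram_features(doc: bytes) -> set:
    """Hash every overlapping byte n-gram of the raw document into a bucket id.

    The document is never tokenized or parsed; markup is just more bytes.
    """
    return {
        zlib.crc32(doc[i:i + NGRAM]) % NUM_BUCKETS
        for i in range(len(doc) - NGRAM + 1)
    }


class OnlineLogisticRegression:
    """Logistic regression over binary features, trained one example at a time."""

    def __init__(self) -> None:
        self.weights = {}  # bucket id -> weight; absent buckets weigh 0.0

    def score(self, features: set) -> float:
        """Return the log-odds that the document is relevant."""
        return sum(self.weights.get(f, 0.0) for f in features)

    def probability(self, features: set) -> float:
        z = max(-30.0, min(30.0, self.score(features)))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features: set, label: int) -> None:
        """Take one gradient step toward the label (1 = relevant, 0 = not)."""
        error = label - self.probability(features)
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + LEARNING_RATE * error


# Toy training examples (invented for illustration).
training = [
    (b"<html>tobacco advertising memo</html>", 1),
    (b"<html>cheap pills buy now</html>", 0),
]

model = OnlineLogisticRegression()
for _ in range(20):  # a few passes over the tiny training set
    for doc, label in training:
        model.update(byte_ngram_features(doc), label)

# Score every candidate document and rank by decreasing log-odds.
candidates = [b"buy cheap pills today", b"tobacco advertising budget"]
ranked = sorted(
    candidates,
    key=lambda d: model.score(byte_ngram_features(d)),
    reverse=True,
)
```

In this sketch, scoring every document and sorting by log-odds stands in for the exhaustive classification of a corpus; no index is built at any point.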

Gordon V. Cormack | Mona Mojdeh

[1] Gary Robinson, et al. A statistical approach to the spam problem, 2003.

[2] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness, 2000.

[3] Stephen E. Robertson, et al. On Term Selection for Query Expansion, 1991, J. Documentation.

[4] Gordon V. Cormack. University of Waterloo Participation in the TREC 2007 Spam Track, 2007, TREC.

[5] Charles L. A. Clarke, et al. Information Retrieval - Implementing and Evaluating Search Engines, 2010.

[6] Gordon V. Cormack, et al. On-line spam filter fusion, 2006, SIGIR.

[7] Charles L. A. Clarke, et al. Efficient construction of large test collections, 1998, SIGIR '98.

[8] Ellen M. Voorhees, et al. Variations in relevance judgments and the measurement of retrieval effectiveness, 1998, SIGIR '98.

[9] Douglas W. Oard, et al. Overview of the TREC 2009 Legal Track, 2009, TREC.

[10] Charles L. A. Clarke, et al. Reciprocal rank fusion outperforms condorcet and individual rank learning methods, 2009, SIGIR.