TREC 2010 Web Track Notebook: Term Dependence, Spam Filtering and Quality Bias

Many existing retrieval approaches treat all documents in the collection equally and do not take into account the content quality of the retrieved documents. In our submissions for the TREC 2010 Web Track, we use quality-biased ranking methods that aim to promote documents likely to contain high-quality content and to penalize spam and low-quality documents. Our experiments with the ad hoc web topics from TREC 2010 show that features such as the spamminess of the document (as computed by the Waterloo team [6]) and the readability of the document (modeled by the fraction of stopwords in the document) are very important for improving precision at the top ranks. Promoting high-quality Wikipedia pages leads to further improvements in retrieval performance. In addition, we found that using Wikipedia as a high-quality document collection for query expansion can ameliorate some of the negative effects of performing pseudo-relevance feedback from a noisy web collection such as ClueWeb09.
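As an illustration of the readability feature mentioned in the abstract, the fraction of stopwords in a document could be computed roughly as follows. This is a minimal sketch, not the authors' implementation: the stopword list, the tokenization rule, and the function name `stopword_fraction` are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small stopword list for illustration only;
# the list actually used in the paper is not given in this abstract.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def stopword_fraction(text: str) -> float:
    """Return the share of tokens in `text` that are stopwords.

    Tokenization (lowercased runs of letters, digits, and apostrophes)
    is an assumption of this sketch. Empty input yields 0.0.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    stopword_count = sum(token in STOPWORDS for token in tokens)
    return stopword_count / len(tokens)
```

Under this sketch, natural prose tends to score higher than keyword-stuffed text, since a page that is mostly a list of keywords contains few function words. How such a score is combined with other quality features in the ranking function is not specified in the abstract.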

W. Bruce Croft | David Fisher | Michael Bendersky

[1] Charles L. A. Clarke, et al. Experiments with ClueWeb09: Relevance Feedback and Web Tracks, 2009, TREC.

[2] Tapas Kanungo, et al. Predicting the readability of short web summaries, 2009, WSDM '09.

[3] Katja Hofmann, et al. Heuristic Ranking and Diversification of Web Documents, 2009, TREC.

[4] Ben Carterette, et al. Million Query Track 2007 Overview, 2008, TREC.

[5] W. Bruce Croft, et al. Relevance Models in Information Retrieval, 2003.

[6] W. Bruce Croft, et al. A Markov random field model for term dependencies, 2005, SIGIR '05.

[7] W. Bruce Croft, et al. Indri: A language-model based search engine for complex queries (extended version), 2005.

[8] Marti A. Hearst, et al. Improving Web Site Design, 2002, IEEE Internet Comput.

[9] Charles L. A. Clarke, et al. Overview of the TREC 2004 Terabyte Track, 2004, TREC.

[10] Marc Najork, et al. Detecting spam web pages through content analysis, 2006, WWW '06.

[11] Jimmy J. Lin, et al. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search, 2009, TREC.

[12] W. Bruce Croft, et al. Quality-biased ranking of web documents, 2011, WSDM '11.

[13] Charles L. A. Clarke, et al. Efficient and effective spam filtering and re-ranking for large web datasets, 2010, Information Retrieval.

[14] James Allan, et al. INQUERY and TREC-8, 1998, TREC.

[15] W. Bruce Croft, et al. Document quality models for web ad hoc retrieval, 2005, CIKM '05.

[16] Robert Krovetz, et al. Viewing morphology as an inference process, 1993, Artif. Intell.