Modern Web crawlers seek to visit high-quality documents first and to re-visit them more frequently than other documents. As a result, the first-tier crawl of a Web corpus is typically of higher quality than subsequent crawls. In this paper, we investigate the impact of first-tier documents on ad hoc retrieval performance. In particular, we analyse the retrieval performance of runs submitted to the adhoc task of the TREC 2009 Web track, in terms of how they rank first-tier documents and how those documents contribute to each run's performance. Our results show that the performance of these runs is heavily dependent on their ability to rank first-tier documents. Moreover, we show that, unlike leading Web search engines, these runs almost always lose performance when they attempt to go beyond the first tier. Finally, we show that selectively removing spam from different tiers is a promising direction for fully exploiting documents beyond the first tier.