Graph-based seed selection for web-scale crawlers
One of the most important steps in web crawling is seed selection: determining the pages from which the crawl starts. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a non-trivial and important problem. Selecting proper seeds can increase the number of pages a crawler discovers, and can result in a repository with more "good" pages and fewer "bad" pages. We propose a graph-based framework for crawler seed selection and present several algorithms within this framework. An evaluation on real web data shows significant improvements over heuristic seed selection approaches.
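The abstract does not describe the algorithms themselves, so the following is only an illustrative sketch of what graph-based seed selection can look like, not the paper's method. It assumes the web is given as an adjacency mapping from each page to the pages it links to, and it uses a simple greedy coverage heuristic: repeatedly pick the page whose reachable set adds the most pages not yet covered. The names `reachable`, `greedy_seeds`, `toy_web`, and the `max_depth` parameter are hypothetical and chosen for this example.

```python
from collections import deque


def reachable(graph, start, max_depth=None):
    """Return the set of pages reachable from `start` by following links.

    `graph` maps a page to an iterable of pages it links to (an assumption
    of this sketch). `max_depth` optionally bounds the number of link hops.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def greedy_seeds(graph, k, max_depth=None):
    """Greedily pick up to `k` seeds that maximize the number of covered pages.

    Returns (seeds, covered). Candidates are the keys of `graph`, visited in
    sorted order so that ties are broken deterministically.
    """
    cover = {page: reachable(graph, page, max_depth) for page in sorted(graph)}
    seeds, covered = [], set()
    for _ in range(k):
        candidates = [p for p in cover if p not in seeds]
        if not candidates:
            break
        # max() returns the first candidate with the largest marginal gain.
        best = max(candidates, key=lambda p: len(cover[p] - covered))
        if not cover[best] - covered:
            break  # no remaining candidate discovers a new page
        seeds.append(best)
        covered |= cover[best]
    return seeds, covered


# A toy link graph (hypothetical data): two components plus an isolated page.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["e"],
    "e": ["d"],
    "f": [],
}

if __name__ == "__main__":
    seeds, covered = greedy_seeds(toy_web, k=2)
    print(seeds, sorted(covered))
```

In this toy graph, a heuristic that picked two pages from the same component (say "a" and "b") would discover three pages, whereas the greedy choice covers five. Precomputing every reachable set is only practical for small graphs; at web scale some approximation of reachability would be needed.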