Due to the explosive growth in the number of web pages, centralized crawlers can no longer crawl the web efficiently. Many distributed crawlers are in wide use; however, none of them is suitable for template-customized vertical crawling. In this paper, we present a distributed template-customized vertical crawler designed specifically for crawling Internet forums. We describe in detail the client-server architecture of the system and the function of every module; the design can be extended easily to other fields. We also propose a crawling-period based distribution strategy, with which the crawler manager can balance the number of crawling tasks against the resources of each crawler, and the crawlers can flexibly handle websites with different update frequencies. In addition, we define a communication protocol between the crawlers and the crawler manager and describe how to solve the duplicate-crawling problem in the distributed system. The experiment compares the performance of a centralized vertical crawler with that of the distributed vertical crawler. The experimental results demonstrate that running all the crawlers in the distributed system in parallel greatly improves crawling efficiency.
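To make the idea of a crawling-period based distribution strategy more concrete, the following is a minimal sketch and not the system's actual algorithm, which the abstract does not specify. It assumes that each forum task carries a crawling period in hours, that a task's cost is its crawl rate (1 / period), and that the manager greedily gives each task to the crawler with the lowest resulting load relative to its capacity. The names `Crawler`, `assign_tasks`, and `SeenRegistry`, as well as the `capacity` parameter, are hypothetical. The small registry illustrates one possible way a manager could prevent duplicate crawling.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Hypothetical crawler node as seen by the crawler manager."""
    name: str
    capacity: float              # assumed: crawls per hour the node can sustain
    load: float = 0.0            # crawls per hour currently assigned
    tasks: list = field(default_factory=list)


def assign_tasks(tasks, crawlers):
    """Greedily assign (url, period_hours) tasks to crawlers.

    Tasks with the shortest period (highest crawl rate) are placed first.
    Each task goes to the crawler whose relative load would be lowest
    after taking it; ties go to the earlier crawler in the list.
    """
    for url, period_hours in sorted(tasks, key=lambda t: t[1]):
        rate = 1.0 / period_hours
        target = min(crawlers, key=lambda c: (c.load + rate) / c.capacity)
        target.load += rate
        target.tasks.append(url)
    return crawlers


class SeenRegistry:
    """Assumed manager-side record of URLs that were already claimed."""

    def __init__(self):
        self._seen = set()

    def claim(self, url):
        """Return True if the caller may crawl url, False if it is a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


# Placeholder forum URLs with assumed crawling periods in hours.
tasks = [
    ("forum-a.example/board", 1),
    ("forum-b.example/board", 2),
    ("forum-c.example/board", 2),
    ("forum-d.example/board", 4),
]
crawlers = assign_tasks(tasks, [Crawler("c1", 1.0), Crawler("c2", 1.0)])
registry = SeenRegistry()
```

In this sketch, frequently updated forums contribute more load than rarely updated ones, so a crawler holding one fast-changing forum receives fewer additional tasks. A real deployment would also need to handle crawler failures and changing periods, which are outside the scope of this illustration.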