The Implementation of a Web Crawler URL Filter Algorithm Based on Caching

In large-scale Web information collection, the URL filter module plays an important role in the Web crawler, which is a central component of a search engine. The performance of the URL filter module directly influences the efficiency of the entire collection system. This paper introduces a URL filter algorithm based on caching and describes its implementation. Experiments on websites with large numbers of pages verify the algorithm's stability and its behavior under parallel execution. The results show that the proposed algorithm achieves satisfactory performance once some of its parameters are reasonably adjusted, and that it is especially well suited to filtering URLs from websites that contain many page-navigation links and index pages.
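As a rough illustration of the general idea of cache-based URL filtering, and not the algorithm proposed in this paper, the following minimal sketch places a bounded least-recently-used cache in front of a full set of seen URLs. The class name `CachedUrlFilter`, the `cache_size` parameter, and the in-memory `store` set standing in for a slower backing store are all assumptions made for illustration.

```python
from collections import OrderedDict


class CachedUrlFilter:
    """Hypothetical sketch: a bounded LRU cache in front of a full 'seen' set."""

    def __init__(self, cache_size=1000):
        self.cache_size = cache_size  # assumed tunable parameter
        self.cache = OrderedDict()    # recently seen URLs, oldest first
        self.store = set()            # stand-in for a slower full store
        self.cache_hits = 0           # lookups answered by the cache alone

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        if url in self.cache:
            # Fast path: the URL was seen recently, so skip the full store.
            self.cache.move_to_end(url)
            self.cache_hits += 1
            return False
        # Slow path: consult the full store, then remember the URL.
        seen_before = url in self.store
        self.store.add(url)
        self.cache[url] = True
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used URL
        return not seen_before
```

Navigation links and index pages tend to repeat the same URLs on every page of a site, so under this sketch most duplicate lookups for such a site would be answered on the fast path. The cache size is the kind of parameter that would need tuning against the site being crawled.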
