Identification of malicious web pages for crawling based on network-related attributes of web server

In this paper, we propose an algorithm that identifies likely malicious Web pages for crawlers, which collect Web pages so that a later task can detect malicious pages based on their content. Some organizations need to crawl Web pages automatically so that humans can check them later. However, because manual checking is expensive, the total cost would be enormous if the crawlers collected Web pages indiscriminately. Automatic checking systems can make the human task more efficient, but they cannot increase the number of malicious Web pages that are collected. To solve these problems, we propose an efficient algorithm that determines, at crawling time, whether a site is likely to include malicious or dangerous content. The key feature of the algorithm is that it estimates the probability of a site being malicious or harmless from network-related attributes of the Web server, which are derived from the URL string. These attributes are the domain name, the directory name, and the IP (Internet Protocol) address of the router nearest to the Web server. To confirm the effectiveness of the proposed algorithm, we conducted an evaluation experiment in a simulated environment, in which we compared the number of malicious Web pages collected by the proposed algorithm with the number collected by a random sampling algorithm. Under stable conditions, the proposed algorithm collected up to 82.8% more malicious Web pages than random sampling. We also show an example of crawling trajectories for the proposed algorithm and for conventional crawling algorithms. In this example, the proposed algorithm collected more malicious Web pages than the conventional algorithms.
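
As an illustration of scoring a URL from its network-related attributes, the following Python sketch may help. It is not the algorithm evaluated in the paper: the names `url_attributes` and `AttributeScorer`, the add-one smoothing, and the simple averaging of per-attribute estimates are all assumptions made for this example. The nearest-router address is passed in as a hypothetical `router_ip` argument, since obtaining it would require a separate route measurement.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def url_attributes(url, router_ip=None):
    """Return (kind, value) pairs for the attributes named in the abstract."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: a directory is identified by host plus path directory.
    directory = host + (posixpath.dirname(parts.path) or "/")
    attrs = [("domain", host), ("directory", directory)]
    if router_ip is not None:
        # Hypothetical input: the router address must be measured elsewhere.
        attrs.append(("router", router_ip))
    return attrs


class AttributeScorer:
    """Keeps malicious/harmless counts per attribute value (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # [malicious, harmless]

    def observe(self, url, malicious, router_ip=None):
        """Record the checked label of a page under each of its attributes."""
        for attr in url_attributes(url, router_ip):
            self.counts[attr][0 if malicious else 1] += 1

    def score(self, url, router_ip=None):
        """Estimate the probability that a page is malicious.

        Assumption: each attribute gives an add-one smoothed estimate,
        and the estimates are averaged with equal weight.
        """
        estimates = []
        for attr in url_attributes(url, router_ip):
            bad, good = self.counts.get(attr, (0, 0))
            estimates.append((bad + 1) / (bad + good + 2))
        return sum(estimates) / len(estimates)


scorer = AttributeScorer()
scorer.observe("http://bad.example/a/x.html", True, router_ip="192.0.2.1")
scorer.observe("http://good.example/b/y.html", False, router_ip="198.51.100.1")

candidates = [
    ("http://good.example/b/z.html", "198.51.100.1"),
    ("http://bad.example/a/z.html", "192.0.2.1"),
]
# Visit the candidate with the highest estimated probability first.
ranked = sorted(candidates, key=lambda c: scorer.score(*c), reverse=True)
```

Under these assumptions, a crawler would fetch the pages in `ranked` order, so that URLs sharing a domain, directory, or nearest router with previously confirmed malicious pages are visited before the others.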