In this paper, we propose an algorithm that identifies malicious Web pages for crawlers, which collect Web pages so that a later task can detect malicious pages based on their content. Some organizations need to crawl Web pages automatically so that humans can check them later. However, because manually checking Web pages is expensive, the total cost would be enormous if the crawlers collected Web pages indiscriminately. Automatic checking systems can make the human task more efficient, but they cannot increase the number of malicious Web pages that are collected. To solve these problems, we propose an efficient algorithm that determines, at crawl time, whether a site is likely to include malicious or dangerous content. The distinguishing feature of the algorithm is that it estimates the probability of a site being malicious or harmless from network-related attributes of the Web server that are derived from the URL string. These attributes are the domain name, the directory name, and the IP (Internet Protocol) address of the router nearest to the Web server. To confirm the effectiveness of the proposed algorithm, we conducted an evaluation experiment in a simulated environment, in which we compared the number of malicious Web pages collected by the proposed algorithm with the number collected by a random sampling algorithm. Under stable conditions, the proposed algorithm collected up to 82.8% more malicious Web pages. We also show an example of the crawling trajectories produced by the proposed algorithm and by conventional crawling algorithms. In this example, the proposed algorithm collected more malicious Web pages than the conventional algorithms.
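To make the idea of scoring a URL from its network-related attributes concrete, the following is a minimal sketch and not the estimator used in the paper, which the abstract does not specify. It assumes a smoothed frequency estimate per attribute value and a plain average across attributes. The names `url_attributes` and `AttributeScorer`, the `prior` and `strength` parameters, and the optional `router_ip` argument (which a real crawler would have to obtain separately, for example from a route trace) are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath


def url_attributes(url, router_ip=None):
    """Derive the attributes named in the abstract from a URL string.

    router_ip is an assumption: it must be supplied by the caller,
    because it cannot be read from the URL itself.
    """
    parts = urlparse(url)
    attrs = {
        "domain": parts.hostname or "",
        "directory": posixpath.dirname(parts.path) or "/",
    }
    if router_ip is not None:
        attrs["router_ip"] = router_ip
    return attrs


class AttributeScorer:
    """Hypothetical scorer: smoothed malicious rate per attribute value."""

    def __init__(self, prior=0.5, strength=2.0):
        self.prior = prior        # assumed malicious rate for unseen values
        self.strength = strength  # weight of the prior, in pseudo-observations
        # (attribute kind, value) -> [malicious count, total count]
        self.counts = defaultdict(lambda: [0, 0])

    def observe(self, attrs, is_malicious):
        """Record the label of a page that has already been checked."""
        for key in attrs.items():
            entry = self.counts[key]
            entry[0] += int(is_malicious)
            entry[1] += 1

    def score(self, attrs):
        """Average the smoothed malicious rates of all attribute values."""
        estimates = []
        for key in attrs.items():
            malicious, total = self.counts.get(key, (0, 0))
            smoothed = (malicious + self.prior * self.strength) / (total + self.strength)
            estimates.append(smoothed)
        return sum(estimates) / len(estimates) if estimates else self.prior


if __name__ == "__main__":
    scorer = AttributeScorer()
    scorer.observe(url_attributes("http://bad.example/warez/a.html"), True)
    scorer.observe(url_attributes("http://good.example/docs/a.html"), False)
    for url in ("http://bad.example/warez/b.html", "http://good.example/docs/b.html"):
        print(url, round(scorer.score(url_attributes(url)), 3))
```

A crawler could use such a score to order its frontier, fetching URLs with higher estimated probability first. How the paper combines the three attributes may differ from the simple average assumed here.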