An Unsupervised-Learning Based Method for Detecting Groups of Malicious Web Crawlers in Internet

Malicious web crawlers have become a serious threat to the security and performance of web servers on the Internet. Typically, a malicious web crawler systematically retrieves large numbers of web pages without authorization, which may amount to theft of data assets. In this paper, we propose an unsupervised-learning-based method for detecting malicious web crawlers. The method consists of three phases. First, it generates a representative vector for each client by combining statistics of the client's visiting behavior with its page request stream. Second, a new subspace clustering algorithm clusters the clients into groups. Finally, four metrics are designed to identify which groups consist of malicious web crawlers. The proposed method is validated on a real data set consisting of 580 thousand access requests. Experimental results show that the method accurately detects malicious web crawlers, with a high TPR (true positive rate) of 91.0% and a low FPR (false positive rate) of 1.3%.
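
For readers unfamiliar with the two reported rates, the following minimal sketch shows how TPR and FPR are conventionally computed from per-client ground-truth labels and detector decisions. The function name `tpr_fpr` and the toy labels are illustrative assumptions; they are not taken from the paper's data set or implementation.

```python
from typing import Sequence, Tuple


def tpr_fpr(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR), where True means 'malicious crawler'.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # malicious, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # malicious, missed
    fp = sum(1 for t, p in pairs if not t and p)      # benign, flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # benign, not flagged

    # TPR = TP / (TP + FN); FPR = FP / (FP + TN). Return 0.0 when a class is empty.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy example with made-up labels: 2 malicious and 4 benign clients.
truth = [True, True, False, False, False, False]
flagged = [True, False, True, False, False, False]
rates = tpr_fpr(truth, flagged)
```

A high TPR means most malicious crawlers are caught, while a low FPR means few legitimate clients are wrongly flagged; the two should be read together, since either one alone can be made to look good trivially.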