Finding effective classifier for malicious URL detection

Malicious URLs are an important Internet security issue with a significant economic impact, and detecting them remains a challenging problem. In this paper, we propose that combining statistical analysis of website URLs with machine learning techniques gives a more accurate classification of malicious URLs. We focus on the character features of malicious URLs, using statistical methods to obtain character distribution features and structural features. Then, in order to find an effective classifier for malicious URL detection, we use six different classifiers to perform cross training. The experimental results on our data set demonstrate that the combination of the URL features extracted in this paper and the Random Forest classification algorithm can achieve 99.7% precision with a false positive rate of less than 0.4%. We also show that these features perform better than previously used features that combine lexical and structural features, and yield results similar to those of N-Gram or TF-IDF based features. In addition, we adjust the number of iterations of the random forest and the number of randomly selected features in our experiments.
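To make the feature idea concrete, the following is a minimal sketch of how character distribution features and structural features might be computed from a URL string. The abstract does not list the exact features, so the alphabet, the six structural features, and the function names `char_distribution`, `structural_features`, and `extract_features` are all illustrative assumptions rather than the feature set used in this paper.

```python
from collections import Counter
from urllib.parse import urlsplit
import string

# Assumed alphabet for the distribution: lowercase letters and digits.
ALPHABET = string.ascii_lowercase + string.digits


def char_distribution(url):
    """Relative frequency of each letter/digit in the lowercased URL."""
    counts = Counter(ch for ch in url.lower() if ch in ALPHABET)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[ch] / total for ch in ALPHABET]


def structural_features(url):
    """A few hypothetical structural features of a URL."""
    # Add a scheme if missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    return [
        float(len(url)),                            # total URL length
        float(len(host)),                           # host name length
        float(host.count(".")),                     # dots in the host
        float(len(segments)),                       # path depth
        float(sum(ch.isdigit() for ch in url)),     # digit count
        float(sum(not ch.isalnum() for ch in url)), # special characters
    ]


def extract_features(url):
    """Concatenate both feature groups into one numeric vector."""
    return char_distribution(url) + structural_features(url)
```

Vectors of this kind could then be passed to any random forest implementation. In scikit-learn, for example, the `n_estimators` and `max_features` parameters of `RandomForestClassifier` appear to correspond roughly to the two settings adjusted in the experiments, although that mapping is an assumption.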