Exploring and Identifying Malicious Sites in Dark Web Using Machine Learning

In recent years, web-based attacks such as drive-by download attacks have become a serious threat. To protect legitimate users, it is important to collect information on malicious sites that can be supplied to blacklist-based detection software. In this study, we propose a system that collects URLs of malicious sites on the dark web. The proposed system automatically crawls dark web sites and collects the URLs that VirusTotal and the Gred engine judge to be malicious. We also predict the dangerous categories of collected web sites that are potentially malicious, using document embeddings with a gradient boosting decision tree model. In our experiments, we demonstrate that the proposed system predicts dangerous site categories with an F1-score of 0.82.
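For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score can be computed for a multi-class category predictor. The abstract does not say how the score is averaged across categories, so macro-averaging is an assumption here, and the function name `f1_scores` and the category labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1 dict, macro-averaged F1) for two label sequences.

    Macro-averaging is an assumption; the abstract does not state which
    averaging scheme the reported 0.82 uses.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrongly assigned
            fn[truth] += 1   # true class was missed

    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0

    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, macro


# Hypothetical category labels, not taken from the paper.
y_true = ["phishing", "malware", "benign", "malware"]
y_pred = ["phishing", "benign", "benign", "malware"]
per_class, macro = f1_scores(y_true, y_pred)
print(per_class, macro)
```

F1 combines precision and recall into a single number, so it is less misleading than plain accuracy when site categories are imbalanced.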