Detecting Malicious URLs Based on Machine Learning Algorithms and Word Embeddings

Choosing appropriate features is essential in classification models for malware detection, for several reasons: dealing with class imbalance, detecting zero-day malware samples, and preventing attackers from reverse engineering the classification process and changing nonessential feature values to avoid detection. In this paper, we propose a method that combines word embeddings with "classical", domain-engineered features to obtain reliable classification models for malicious URL detection. Additionally, we explore several traditional techniques for addressing class imbalance, such as synthetic oversampling and cost-sensitive learning, as well as several classification techniques. We find that the best overall results are obtained with a cost-sensitive neural network, which achieves a precision that exceeds 99% and an accuracy above 90%, while maintaining a recall above 89%. We also analyze the importance of the proposed features and find that, while word embeddings produce better results than bigram-based features, domain-specific features are necessary for obtaining high precision in detecting malicious URLs.
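As a rough illustration of the two ideas in the abstract, the sketch below concatenates an averaged token embedding with a few handcrafted lexical features, and derives per-class weights for cost-sensitive training. Everything in it is an assumption made for illustration: the toy embedding table, the four lexical features, and the inverse-frequency weighting are not taken from the paper, which does not list its features or its cost scheme here.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical 3-dimensional token embeddings (illustrative values only);
# a real system would load vectors trained on a large corpus.
TOY_EMBEDDINGS = {
    "login": [0.9, 0.1, 0.0],
    "secure": [0.8, 0.2, 0.1],
    "docs": [0.0, 0.9, 0.1],
}
DIM = 3


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def embedding_features(url):
    """Average the embeddings of known tokens; zeros if none are known."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokenize(url) if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lexical_features(url):
    """Assumed domain-engineered features: length, digits, dots, raw-IP host."""
    host = urlparse(url).hostname or ""
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return [
        float(len(url)),
        float(sum(c.isdigit() for c in url)),
        float(host.count(".")),
        1.0 if is_ip else 0.0,
    ]


def combined_features(url):
    """Concatenate embedding-based and handcrafted features into one vector."""
    return embedding_features(url) + lexical_features(url)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}
```

The resulting weights could be passed to any classifier that accepts per-class costs, so that errors on the rarer malicious class are penalized more heavily than errors on the benign class.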
