An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models

World Wide Web services are essential to daily life and are reached through Uniform Resource Locators (URLs). Attackers exploit this channel by crafting malicious URLs that lead to deceptive websites and domains, which they use to conduct fraudulent activities. Such threats open the door to many critical attacks, such as spam, spyware, phishing, and malware. Detecting malicious URLs is therefore crucial for preventing many cybercriminal activities. In this study, we examined a set of machine learning (ML) and deep learning (DL) models for detecting malicious websites using a dataset of 66,506 URL records. We engineered three types of features: lexical-based, network-based, and content-based. To extract the most discriminative features in the dataset, we applied several feature selection algorithms, namely correlation analysis, Analysis of Variance (ANOVA), and chi-square. Finally, we conducted a comparative performance evaluation of several ML and DL models against a set of criteria commonly used to evaluate such models. The results showed that Naïve Bayes (NB) was the best model for detecting malicious URLs on this dataset, with an accuracy of 96%. This research contributes to the field by conducting extensive feature engineering and analysis to identify the best features for malicious URL prediction, comparing different models, and achieving high accuracy on a large new URL dataset.
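
To make the notion of lexical-based features concrete, the following is a minimal Python sketch of how a few such features might be derived from a raw URL string. The function name `lexical_features` and the specific features chosen are illustrative assumptions, not the feature set used in the study; network-based and content-based features would require DNS/WHOIS lookups and page retrieval and are not shown.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    Hypothetical helper: the feature names are assumptions chosen for
    illustration and are not taken from the study's feature list.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is None for malformed URLs
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Dots in the host beyond the one separating domain and TLD.
        "num_subdomain_labels": max(host.count(".") - 1, 0),
    }


if __name__ == "__main__":
    example = "http://login.example-bank.com/verify?id=123"
    for name, value in lexical_features(example).items():
        print(f"{name}: {value}")
```

Feature vectors of this kind could then be passed to a feature selection step (for example ANOVA or chi-square scoring) and to a classifier such as Naïve Bayes, which is the general pipeline the abstract describes.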
