Lightweight URL-based phishing detection using natural language processing transformers for mobile devices

Abstract: Hackers are increasingly launching phishing attacks via SMS and social media, and games and dating apps introduce yet another attack vector. However, current deep learning-based phishing detection applications are impractical on mobile devices because of their computational burden. We propose a lightweight phishing detection algorithm for use on mobile devices that distinguishes phishing from legitimate websites solely from their URLs. As a baseline, we apply Artificial Neural Networks (ANNs) to URL-based and HTML-based website features. A model search yields 15 ANN models with accuracies >96%, comparable to state-of-the-art approaches. Next, we test deep ANNs on URL-based features only; all of these models perform poorly, with a highest accuracy of 86.2%, indicating that URL-based features alone are not adequate to detect phishing websites even with deep ANNs. Since language transformers learn to represent context-dependent text sequences, we hypothesize that they can learn directly from the text in URLs to distinguish between legitimate and malicious websites. We apply two state-of-the-art deep transformers (BERT and ELECTRA) to phishing detection. Testing custom and standard vocabularies, we find that pre-trained transformers available for immediate use (with fine-tuning) outperform the model trained with the custom URL-based vocabulary. Using pre-trained transformers to predict phishing websites from URLs alone has four advantages: 1) it requires little training time (~8 minutes); 2) it is more easily updatable than feature-based approaches because no pre-processing of URLs is required; 3) it is safer to use because phishing websites can be predicted without visiting the malicious sites; and 4) it is easily deployable for real-time detection and can run on mobile devices.
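To make the idea of a custom URL-based vocabulary concrete, the following is a minimal Python sketch of how raw URL text might be turned into the integer token sequence a transformer consumes. It is an illustration under stated assumptions, not the authors' tokenizer: the splitting rule, the function names (`split_url`, `build_vocab`, `encode`), the special tokens, and the `max_len` value are all hypothetical, and the example URLs are made up.

```python
import re
from collections import Counter

# Hypothetical special tokens, following the usual BERT-style convention.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def split_url(url):
    """Split a URL into lowercase alphanumeric runs and single punctuation marks.

    Assumed splitting rule for illustration; the abstract does not specify one.
    """
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", url.lower())


def build_vocab(urls, min_count=1):
    """Build a token -> id mapping from a list of training URLs."""
    counts = Counter(tok for url in urls for tok in split_url(url))
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, n in sorted(counts.items()):
        if n >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(url, vocab, max_len=16):
    """Map a URL to a fixed-length list of token ids, truncating or padding."""
    tokens = ["[CLS]"] + split_url(url)[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids


if __name__ == "__main__":
    # Made-up example URLs, used only to populate the vocabulary.
    train_urls = [
        "http://example.com/login",
        "http://paypa1-secure.example.net/verify",
    ]
    vocab = build_vocab(train_urls)
    print(split_url("http://example.com/login"))
    print(encode("http://example.com/reset", vocab))
```

Tokens never seen while building the vocabulary map to `[UNK]`, which is one plausible reason a small custom vocabulary can underperform a standard pre-trained subword vocabulary: a subword vocabulary can still break an unseen string into known pieces.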
