HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis

Developing and deploying phishing attacks now requires little technical skill and low cost. This low barrier has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become essential. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end approach to automatic phishing web page classification. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed concatenation of word and character embeddings allows our model to handle new features and extrapolate easily to test data. We conduct comprehensive experiments on a dataset of more than 50,000 HTML documents whose ratio of phishing to benign web pages reflects the distribution found in the real world; on this dataset, HTMLPhish yields over 93% accuracy and true positive rate. HTMLPhish is also a completely language-independent, client-side strategy and can therefore detect phishing web pages regardless of their textual language.
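To make the role of the two input views concrete, the following is a minimal sketch of how an HTML document might be turned into word-level and character-level ID sequences before any embedding layer. It is an illustration under stated assumptions, not the authors' preprocessing: the tokenizer regex, the reserved `PAD` and `UNK` IDs, the sequence lengths, and the helper names `tokenize`, `build_vocab`, and `encode` are all hypothetical. The embedding and CNN layers themselves are omitted.

```python
import re

PAD, UNK = 0, 1  # assumed reserved IDs: padding and unknown item


def tokenize(html):
    # Assumed tokenizer: runs of word characters, plus each punctuation
    # mark as its own token, so markup such as '<' and '=' is preserved.
    return re.findall(r"\w+|[^\w\s]", html.lower())


def build_vocab(items):
    # Assign IDs in order of first appearance; 0 and 1 stay reserved.
    vocab = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab) + 2
    return vocab


def encode(items, vocab, max_len):
    # Map items to IDs, truncate to max_len, then right-pad with PAD.
    ids = [vocab.get(item, UNK) for item in items[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Toy documents (hypothetical); the test page contains one unseen word.
train_html = '<form action="login.php"><input type="password"></form>'
test_html = '<form action="signin.php"><input type="password"></form>'

word_vocab = build_vocab(tokenize(train_html))
char_vocab = build_vocab(train_html.lower())

word_ids = encode(tokenize(test_html), word_vocab, max_len=32)
char_ids = encode(list(test_html.lower()), char_vocab, max_len=64)
```

In this toy case the unseen word `signin` falls back to the unknown ID in the word-level sequence, while every one of its characters already has an ID in the character-level sequence. This is one plausible reading of why combining the two views helps the model cope with tokens it did not see during training.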
