Look Before You Leap: Detecting Phishing Web Pages by Exploiting Raw URL And HTML Characteristics

Cybercriminals use phishing as a simple and cost-effective way to mount cyber-attacks on today's Internet. Recent phishing-detection studies increasingly adopt automated feature selection over traditional, manually engineered features, because the traditional methods fail to generalize what they learn to new data. In this paper, we propose WebPhish, a deep learning technique that automatically selects features from the raw URL and HTML of a web page. This approach is the first of its kind to feed the concatenation of URL and HTML embedding feature vectors into a Convolutional Neural Network model to detect phishing web pages. Extensive experiments on a real-world dataset yielded an accuracy of 98 percent, outperforming other state-of-the-art techniques. WebPhish is also a client-side strategy that is completely language-independent: it performs lightweight phishing detection regardless of the web page's textual language.
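The pipeline described above (embed the raw URL and HTML, concatenate the embeddings, and pass them through a convolutional model) can be illustrated with a minimal, untrained sketch. It is not the WebPhish implementation. Every name and setting here is an assumption made for illustration: character-level encoding, the sequence lengths, the embedding size, the kernel width, concatenation along the sequence axis, and the helper names `encode_chars`, `init_params`, and `phishing_score`. The weights are random, so the score carries no meaning until the model is trained.

```python
import math
import random


def encode_chars(text, max_len, pad_id=0):
    """Map characters to integer IDs in 1..256 and pad or truncate to max_len."""
    ids = [min(ord(ch), 255) + 1 for ch in text[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


def init_params(seed=0, dim=8, url_len=32, html_len=64, n_kernels=4, width=3):
    """Build random (untrained) weights; all sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def table():
        # One embedding row per ID: 0 is padding, 1..256 are characters.
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(257)]

    kernels = [
        [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(width)]
        for _ in range(n_kernels)
    ]
    return {
        "url_len": url_len,
        "html_len": html_len,
        "url_table": table(),
        "html_table": table(),
        "kernels": kernels,
        "out_weights": [rng.uniform(-1.0, 1.0) for _ in range(n_kernels)],
        "out_bias": 0.0,
    }


def conv1d_maxpool(seq, kernels):
    """Valid 1D convolution with ReLU, then global max pooling per kernel."""
    pooled = []
    for kernel in kernels:
        width = len(kernel)
        best = 0.0  # ReLU floor: negative activations never exceed this
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            act = sum(
                w * x
                for k_row, s_row in zip(kernel, window)
                for w, x in zip(k_row, s_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def phishing_score(url, html, params):
    """Return a value in (0, 1); meaningless until the weights are trained."""
    url_emb = [params["url_table"][i] for i in encode_chars(url, params["url_len"])]
    html_emb = [params["html_table"][i] for i in encode_chars(html, params["html_len"])]
    combined = url_emb + html_emb  # assumed: concatenate along the sequence axis
    features = conv1d_maxpool(combined, params["kernels"])
    logit = sum(w * f for w, f in zip(params["out_weights"], features))
    logit += params["out_bias"]
    return 1.0 / (1.0 + math.exp(-logit))


params = init_params(seed=0)
score = phishing_score(
    "http://example.com/login",
    "<html><body><form action='/submit'></form></body></html>",
    params,
)
print(f"untrained score: {score:.3f}")
```

Because the sketch works on raw characters and uses no word lists or hand-crafted features, it shows in miniature why such a model does not depend on the page's textual language. A practical version would use a deep learning framework and train the weights on labeled data.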
