An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment

Abstract: Phishing has become a favored method of data theft among attackers and continues to evolve. As long as phishing websites keep operating, more people and companies will suffer privacy leaks or financial losses, so the demand for fast and accurate phishing website detection keeps growing. However, existing phishing detection methods do not fully analyze the features of phishing, and their reported performance and efficiency hold only on certain limited datasets; they need improvement before they can be applied to the real web environment. This paper takes the social engineering principles of phishing into account, proposes a comprehensive and interpretable CASE feature framework, and designs a multistage phishing detection model to detect phishing sites effectively, especially in the real web environment, where high efficiency, high performance, and extremely low false alarm rates are required. To verify the proposed method, two kinds of experiments were carried out. The first was a set of comparative experiments on a constructed complex dataset, comparing different features and different detection models built on CASE and covering both classic machine learning and deep learning algorithms. The second was a one-year phishing discovery experiment in the real web environment. The proposed method achieves better detection results while significantly shortening execution time, and it works well in real phishing discovery, which demonstrates its practicality.
