PhishHaven—An Efficient Real-Time AI Phishing URLs Detection System

Various machine learning and deep learning-based approaches have been proposed for designing defensive mechanisms against phishing attacks. Recently, researchers showed that phishing attacks can be performed using DeepPhish, a deep neural network-based system that generates phishing URLs. To counter this kind of attack, we design PhishHaven, an ensemble machine learning-based detection system that identifies both AI-generated and human-crafted phishing URLs. To the best of our knowledge, this is the first study to consider detecting phishing attacks by both AI and human attackers. PhishHaven employs lexical analysis for feature extraction. To further enhance lexical analysis, we introduce URL HTML Encoding, which classifies URLs on the fly and works proactively in comparison with some of the existing methods. We also introduce a URL Hit approach to deal with tiny URLs, the detection of which is an open problem. Moreover, PhishHaven makes the final classification of each URL through an unbiased voting mechanism, which aims to avoid misclassification when the number of votes is equal. To speed up the ensemble of machine learning models, PhishHaven employs a multi-threading approach that executes the classifiers in parallel, leading to real-time detection. Theoretical analysis of our solution shows that (1) it can always detect tiny URLs, and (2) it can detect future AI-generated phishing URLs based on our selected lexical features with 100% accuracy. In experiments, we evaluate our solution on a benchmark dataset of 100,000 phishing and normal URLs. The results show that PhishHaven achieves 98.00% accuracy, outperforming the existing lexical-based systems for detecting human-crafted phishing URLs.
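The pipeline described above (lexical feature extraction, classifiers run in parallel threads, and a vote that still yields a decision on a tie) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not list the lexical features, the ensemble members, or the tie-breaking rule, so the feature names, the four toy rule-based "classifiers", and the fallback to summed confidence are hypothetical stand-ins rather than PhishHaven's actual design.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def lexical_features(url):
    """Extract a few hypothetical lexical features from the URL string alone."""
    host = urlparse(url).netloc
    return {
        "length": len(url),
        "dots_in_host": host.count("."),
        "hyphens_in_host": host.count("-"),
        "has_at": "@" in url,
        "digits": sum(ch.isdigit() for ch in url),
    }


# Toy stand-ins for trained models (assumption). Each returns (label, confidence).
def rule_length(f):
    return ("phishing", 0.7) if f["length"] > 60 else ("benign", 0.6)


def rule_host(f):
    suspicious = f["dots_in_host"] > 3 or f["hyphens_in_host"] > 1
    return ("phishing", 0.8) if suspicious else ("benign", 0.7)


def rule_symbols(f):
    return ("phishing", 0.9) if f["has_at"] else ("benign", 0.55)


def rule_digits(f):
    return ("phishing", 0.65) if f["digits"] > 8 else ("benign", 0.6)


CLASSIFIERS = [rule_length, rule_host, rule_symbols, rule_digits]


def decide(votes):
    """Majority vote; on an equal split, fall back to summed confidence.

    The tie-breaking rule is an assumption: the abstract only states that
    the voting mechanism handles the case of equal votes.
    """
    counts = {"phishing": 0, "benign": 0}
    weight = {"phishing": 0.0, "benign": 0.0}
    for label, confidence in votes:
        counts[label] += 1
        weight[label] += confidence
    if counts["phishing"] != counts["benign"]:
        return max(counts, key=counts.get)
    return "phishing" if weight["phishing"] >= weight["benign"] else "benign"


def classify(url):
    """Run every classifier in its own thread, then combine the votes."""
    features = lexical_features(url)
    with ThreadPoolExecutor(max_workers=len(CLASSIFIERS)) as pool:
        votes = list(pool.map(lambda clf: clf(features), CLASSIFIERS))
    return decide(votes)


if __name__ == "__main__":
    print(classify("https://example.com/"))
    print(classify(
        "http://secure-login-update.example.com.account.verify.example.net"
        "/signin@user/0123456789"
    ))
```

With an even number of voters, as here, a two-against-two split is possible, which is why the sketch needs an explicit tie rule. In CPython, threads only give a real speedup when the underlying models release the global interpreter lock during inference, so the thread pool here shows the structure rather than a guaranteed performance gain.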
