Classifying phishing URLs using recurrent neural networks

As the technical skill and cost required to deploy phishing attacks decrease, scams are reaching unprecedented levels, increasing the need for better methods to detect phishing threats proactively. In this work, we explore the use of URLs as input to machine learning models for phishing site prediction. We compare a feature-engineering approach followed by a random forest classifier against a novel method based on recurrent neural networks. We find that the recurrent neural network approach achieves an accuracy of 98.7% without any manual feature creation, outperforming the random forest method by 5%. This makes it a scalable, fast-acting proactive detection system that does not require full content analysis.
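The two kinds of input compared above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set in `lexical_features`, the character vocabulary, and the sequence length `MAX_LEN` are all assumptions chosen for illustration. The first function builds a few hand-crafted features of the sort a random forest could consume; the second turns the raw URL into a fixed-length sequence of character ids, the form a recurrent network typically takes, with no manual feature design.

```python
import string

# Hypothetical vocabulary: printable ASCII, with id 0 reserved for padding.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
UNK_ID = len(VOCAB) + 1  # id for any character outside the vocabulary
MAX_LEN = 150            # assumed sequence length, not taken from the abstract


def lexical_features(url):
    """Illustrative hand-crafted features (an assumed set, not the paper's)."""
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_sign": "@" in url,
    }


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of integer character ids.

    URLs longer than max_len are truncated; shorter ones are left-padded
    with zeros so every sequence has the same length.
    """
    ids = [VOCAB.get(ch, UNK_ID) for ch in url[:max_len]]
    return [0] * (max_len - len(ids)) + ids


if __name__ == "__main__":
    example = "http://example.com/login"
    print(lexical_features(example))
    print(encode_url(example)[-10:])
```

The encoded sequence would then be fed to a recurrent model (for instance through an embedding layer), whereas the feature dictionary would be passed to a classifier such as a random forest.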
