URL Based Gateway Side Phishing Detection Method

Phishing attacks have become the most dangerous form of fraud to hit online and mobile businesses. In this paper, we reveal new aspects of the features common to phishing URLs and introduce a statistical machine learning classifier that relies on these selected features to detect phishing sites. Unlike previous studies, we do not use a single model for different regions, because our analysis shows that the features in different phishing domains have mismatched distributions. Because we cannot collect enough new data to rebuild the models, we instead adjust the existing model with a transfer learning algorithm. Comprehensive experiments show that the proposed method achieves more than 93% accuracy on a balanced dataset and an error rate below 1% in a simulated real-world phishing scenario. Moreover, the strong performance in the target domain demonstrates that using a transfer learning algorithm in the anti-phishing scenario is feasible.
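To make the idea of URL-based features concrete, the following is a minimal sketch of how lexical features might be extracted from a URL before being passed to a statistical classifier. The function name `url_features` and the specific features chosen here are illustrative assumptions; the abstract does not list the paper's actual feature set.

```python
from urllib.parse import urlparse
import ipaddress


def url_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL.

    The feature names below are hypothetical examples, not the
    feature set selected in the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Check whether the host is a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "host_is_ip": host_is_ip,
        "has_at_sign": "@" in url,
        # Number of non-empty path segments.
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "num_digits": sum(c.isdigit() for c in url),
    }


if __name__ == "__main__":
    for example in ("https://www.example.com/",
                    "http://192.168.0.1/login/verify.php"):
        print(example, url_features(example))
```

Because such features depend only on the URL string, they can be computed at a gateway without fetching page content; the resulting vectors could then be fed to any statistical classifier.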
