A comparison of machine learning techniques for phishing detection

Many applications are available for phishing detection. However, in contrast to spam prediction, only a few studies compare machine learning techniques for predicting phishing. The present study compares the predictive accuracy of several machine learning methods for predicting phishing emails: Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet). The comparison uses a data set of 2889 phishing and legitimate emails, with 43 features used to train and test the classifiers.
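A comparison of this kind can be sketched as a small evaluation harness. The sketch below is illustrative only and is not the study's code. It assumes k-fold cross-validation as the evaluation protocol, which the abstract does not specify. It generates synthetic binary features in place of the real email data. It also uses two toy classifiers (`MajorityClass` and `NearestCentroid`, both hypothetical names) as stand-ins for the six methods listed above. Only the data set shape, 2889 emails by 43 features, is taken from the text.

```python
import random


def make_synthetic_emails(n=2889, n_features=43, seed=0):
    """Hypothetical stand-in data: binary features, label 1 = phishing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        # Assumption: phishing rows switch each feature on slightly more often.
        p = 0.35 if label else 0.25
        features = [1 if rng.random() < p else 0 for _ in range(n_features)]
        rows.append((features, label))
    return rows


class MajorityClass:
    """Baseline: always predicts the most common training label."""

    def fit(self, rows):
        ones = sum(label for _, label in rows)
        self.guess = 1 if 2 * ones >= len(rows) else 0
        return self

    def predict(self, features):
        return self.guess


class NearestCentroid:
    """Toy classifier: predicts the class whose feature mean is closest."""

    def fit(self, rows):
        self.centroids = {}
        for cls in (0, 1):
            members = [f for f, label in rows if label == cls]
            self.centroids[cls] = [sum(col) / len(col) for col in zip(*members)]
        return self

    def predict(self, features):
        def squared_distance(cls):
            centroid = self.centroids[cls]
            return sum((x - c) ** 2 for x, c in zip(features, centroid))

        return min((0, 1), key=squared_distance)


def cross_validated_accuracy(make_model, rows, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = make_model().fit(train)
        correct = sum(model.predict(f) == label for f, label in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def compare(rows, k=10):
    """Evaluate every candidate on the same shuffled folds."""
    models = {"majority": MajorityClass, "nearest_centroid": NearestCentroid}
    return {
        name: cross_validated_accuracy(factory, rows, k)
        for name, factory in models.items()
    }


if __name__ == "__main__":
    for name, accuracy in compare(make_synthetic_emails()).items():
        print(f"{name}: {accuracy:.3f}")
```

Evaluating every candidate on the same shuffled folds keeps the comparison paired, so differences in accuracy are not caused by different train/test splits. Replacing the toy classes with real implementations of the six methods only requires objects that expose the same `fit` and `predict` interface.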
