Hardening Email Security via Bayesian Additive Regression Trees

The changing structure and variability of email attacks render current email filtering solutions ineffective. Consequently, new techniques are needed to strengthen the protection of users' security and privacy. Email attacks of many kinds, spam in particular, damage network infrastructure and expose users to new attack vectors daily. Spam is unsolicited email that targets users with various types of commercial messages or advertisements. Pornographic content, including explicit material and advertisements involving exploited children, is also a major trend in these messages. Spam also wastes network bandwidth because of the large number of messages sent, and handling it requires complex hardware, software, network resources, and human effort. Recently, security researchers have noticed an increase in malicious content delivered by these messages, which raises security concerns because of its attack potential.
More seriously, phishing attacks have been on the rise for the past couple of years. Phishing is the act of sending a forged email to a recipient, falsely mimicking a legitimate establishment in an attempt to scam the recipient into divulging private information such as credit card numbers or bank account passwords (James, 2005). Phishing attacks have recently become a major concern to financial institutions and law enforcement because of the heavy monetary losses involved. According to a survey by the Gartner Group, approximately 3.25 million victims were deceived by phishing attacks in 2006, and in 2007 the number increased by almost 1.3 million victims. Furthermore, monetary losses related to phishing attacks in 2007 were estimated at $3.2 billion. All of these concerns call for new detection mechanisms to thwart email attacks in their various forms.
Despite the abundance of applications available for phishing detection, few studies, in contrast to spam classification, compare machine learning techniques for predicting phishing emails (Abu-Nimeh et al., 2007). We describe a new version of Bayesian Additive Regression Trees (BART) and apply it to phishing detection. A phishing dataset is constructed from 1409 raw phishing emails and 5152 legitimate emails, and 71 features (variables) are used for training and testing the classifiers. The variables consist of both textual and structural features extracted from the raw emails. The performance of six classifiers on this dataset is compared using the area under the curve (AUC) (Huang & Ling, 2005). The classifiers include Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random
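Since the classifiers are compared by AUC, a minimal sketch of how that metric can be computed from classifier scores may help. It uses the rank interpretation of AUC: the probability that a randomly chosen phishing email receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name `auc` and the toy labels and scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (phishing) or 0 (legitimate); assumed encoding.
    scores: classifier outputs, higher meaning "more likely phishing".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of emails, which is acceptable for a sketch; a sort-based rank computation would be the usual choice for a dataset of several thousand messages.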

[1] H. Chipman, et al. Bayesian CART Model Search, 1998.

[2] Jasvinder S. Kandola, et al. Interpretable modelling with sparse kernels, 2001.

[3] Trevor Hastie, et al. The Elements of Statistical Learning, 2001.

[4] Geoffrey E. Hinton, et al. Bayesian Learning for Neural Networks, 1995.

[5] Georgios Paliouras, et al. A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists, 2004, Information Retrieval.

[6] Leo Breiman, et al. Classification and Regression Trees, 1984.

[7] Michael W. Berry, et al. Survey of Text Mining, 2003, Springer New York.

[8] Tong Zhang, et al. Text Mining: Predictive Methods for Analyzing Unstructured Information, 2004.

[9] Norman M. Sadeh, et al. Learning to detect phishing emails, 2007, WWW '07.

[10] H. Chipman, et al. BART: Bayesian Additive Regression Trees, 2008, 0806.3286.

[11] Suku Nair, et al. A comparison of machine learning techniques for phishing detection, 2007, eCrime '07.

[12] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[13] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[14] Lance James, et al. Phishing exposed, 2005.

[15] Hans C. van Houwelingen, et al. The Elements of Statistical Learning, Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, New York, 2001. No. of pages: xvi+533. ISBN 0‐387‐95284‐5, 2004.

[16] Jon Rigelsford, et al. Pattern Recognition: Concepts, Methods and Applications, 2002.

[17] Michael W. Berry, et al. Survey of Text Mining: Clustering, Classification, and Retrieval, 2007.

[18] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[19] F. Wilcoxon. Individual Comparisons by Ranking Methods, 1945.

[20] Janez Demsar, et al. Statistical Comparisons of Classifiers over Multiple Data Sets, 2006, J. Mach. Learn. Res.

[21] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[22] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[23] Vangelis Karkaletsis, et al. A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists, 2003.

[24] Charles X. Ling, et al. Using AUC and accuracy in evaluating learning algorithms, 2005, IEEE Transactions on Knowledge and Data Engineering.

[25] Barton C. Massey, et al. Learning Spam: Simple Techniques For Freely-Available Software, 2003, USENIX Annual Technical Conference, FREENIX Track.

[26] S. Chib, et al. Bayesian analysis of binary and polychotomous response data, 1993.

[27] Tianshun Yao, et al. An evaluation of statistical spam filtering techniques, 2004, TALIP.

[28] Suku Nair, et al. Bayesian Additive Regression Trees-Based Spam Detection for Enhanced Email Privacy, 2008, 2008 Third International Conference on Availability, Reliability and Security.