Spam Sender Detection with Classification Modeling on Highly Imbalanced Mail Server Behavior Data

Unsolicited commercial or bulk emails, and emails containing viruses, pose a great threat to the utility of email communication. A recent filtering solution is the reputation system, which assigns a trust value to each IP address that sends email messages. By analyzing the query patterns of each node that uses reputation information, a reputation system can calculate a reputation score for each queried IP address. In this research, we explore a behavioral classification approach based on features extracted from such global messaging patterns. Because of the large number of bad senders, this classification task has to cope with highly imbalanced data. First, for each observed sender, we calculate periodicity properties using a discrete Fourier transform, together with global breadth information reflecting message volume and recipient distribution. Next, a Granular Support Vector Machine - Boundary Alignment algorithm (GSVM-BA) is implemented to address the class imbalance problem and is compared with cost-sensitive learning. Finally, we evaluate the performance of support vector machine, C4.5 decision tree, naïve Bayesian decision tree, and multinomial logistic regression classifiers on the resulting data set. The best performance is obtained by using GSVM-BA for rebalancing and then an SVM for classification.
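
As an illustration of the periodicity step, the following is a minimal sketch of how a discrete Fourier transform could turn a sender's per-interval message counts into periodicity features. It is not the feature extraction used in this work: the function names, the choice of features (`dominant_period`, `peak_power_ratio`), and the fixed-width time bins are all assumptions made for the example.

```python
import cmath
import math


def dft_power(series):
    """Power at frequencies 1..n//2 of a mean-removed real series (naive O(n^2) DFT)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the DC component
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    return power


def periodicity_features(series):
    """Hypothetical features: strongest period (in bins) and its share of spectral power."""
    power = dft_power(series)
    total = sum(power)
    if total == 0:
        # A constant series has no periodic component.
        return {"dominant_period": None, "peak_power_ratio": 0.0}
    best = max(range(len(power)), key=power.__getitem__)
    k_best = best + 1  # power[0] corresponds to frequency index 1
    return {
        "dominant_period": len(series) / k_best,
        "peak_power_ratio": power[best] / total,
    }


# Example: a sender whose hourly volume repeats every 6 hours over 48 hours.
hourly_counts = [5 + 5 * math.cos(2 * math.pi * t / 6) for t in range(48)]
features = periodicity_features(hourly_counts)
```

A sender with a strongly regular sending schedule would show a high `peak_power_ratio`, while irregular traffic spreads power across many frequencies. Such values could then be combined with the breadth features as classifier inputs.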
