Extracting discriminative information from e-mail for spam detection inspired by Immune System

Inspired by the biological immune system, we propose a local concentration (LC) based feature extraction approach for anti-spam filtering. A general anti-spam model is built to integrate the LC approach with term selection methods and classifiers. In the LC model, each message is divided into local areas by a sliding window. In each area, a two-dimensional feature is constructed by calculating the spam concentration and the legitimate-email concentration. The features of all areas are then combined into a single feature vector. Experiments are conducted on four benchmark corpora using 10-fold cross-validation. The results show that the LC approach can extract effective position-correlated information from messages. Compared with the prevalent Bag-of-Words approach, the LC approach performs better in terms of both accuracy and F1 measure. Most significantly, it greatly reduces feature dimensionality and runs much faster.
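
The windowed feature construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define how a concentration is computed, so the sketch assumes that the concentration in an area is the fraction of its distinct terms found in a spam term set or a legitimate term set, and that the message is split into a fixed number of non-overlapping areas. The function name `local_concentration_features` and the parameters `spam_terms`, `ham_terms`, and `num_areas` are hypothetical.

```python
from typing import List, Sequence, Set


def local_concentration_features(
    tokens: Sequence[str],
    spam_terms: Set[str],
    ham_terms: Set[str],
    num_areas: int = 3,
) -> List[float]:
    """Return [spam_1, ham_1, ..., spam_k, ham_k] for k = num_areas.

    Assumption: a "concentration" is the share of an area's distinct
    terms that appear in the given term set. The term sets would come
    from a term selection step that is not shown here.
    """
    features: List[float] = []
    n = len(tokens)
    for i in range(num_areas):
        # Assumed windowing: non-overlapping areas of near-equal size.
        start = i * n // num_areas
        end = (i + 1) * n // num_areas
        distinct = set(tokens[start:end])
        if not distinct:
            # An empty area contributes a zero feature pair.
            features.extend([0.0, 0.0])
            continue
        spam_conc = len(distinct & spam_terms) / len(distinct)
        ham_conc = len(distinct & ham_terms) / len(distinct)
        features.extend([spam_conc, ham_conc])
    return features


# Toy term sets, invented for illustration only.
SPAM_TERMS = {"free", "money", "now"}
HAM_TERMS = {"hello", "meeting", "tomorrow"}

message = "free money now hello meeting tomorrow".split()
vector = local_concentration_features(message, SPAM_TERMS, HAM_TERMS)
```

Under these assumptions the vector length is `2 * num_areas` regardless of vocabulary size, which is one way to read the dimensionality reduction claimed relative to Bag-of-Words.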
