A Local-Concentration-Based Feature Extraction Approach for Spam Filtering

Inspired by the biological immune system, we propose a local concentration (LC)-based feature extraction approach for spam filtering. The LC approach is designed to extract position-correlated information from messages effectively by transforming each area of a message into a corresponding LC feature. Two implementation strategies of the LC approach are designed, one using a fixed-length sliding window and the other a variable-length sliding window. To incorporate the LC approach into the whole process of spam filtering, a generic LC model is designed. In the LC model, two types of detector sets are first generated using term selection methods and a well-defined tendency threshold. A sliding window is then used to divide the message into individual local areas. After the message is segmented, the concentration of detectors is calculated for each local area and taken as that area's feature. Finally, the features of all local areas are combined into a feature vector for the message. To evaluate the proposed LC model, several experiments are conducted on five benchmark corpora using cross-validation. The results show that the LC approach works well with three term selection methods, which makes it flexibly applicable in practice. Compared with the global-concentration-based approach and the prevalent bag-of-words approach, the LC approach performs better in terms of both accuracy and F1 measure. The LC approach is also shown to be robust to variation in message length.
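The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the tendency of a term is estimated as the difference between its document frequencies in spam and in legitimate mail, no separate term selection step is applied, each message is split into a fixed number of areas of roughly equal size (one possible reading of the variable-length window strategy), and the concentration of an area is the fraction of its distinct terms matched by a detector set. The names `build_detector_sets`, `local_concentrations`, `tendency_threshold`, and `n_areas` are hypothetical.

```python
from collections import Counter


def build_detector_sets(spam_docs, ham_docs, tendency_threshold=0.3):
    """Build spam and ham detector sets from tokenized training messages.

    Assumption: tendency(term) = P(term | spam) - P(term | ham), with both
    probabilities estimated from document frequencies.
    """
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    spam_detectors, ham_detectors = set(), set()
    for term in set(spam_df) | set(ham_df):
        tendency = spam_df[term] / len(spam_docs) - ham_df[term] / len(ham_docs)
        if tendency > tendency_threshold:
            spam_detectors.add(term)
        elif tendency < -tendency_threshold:
            ham_detectors.add(term)
    return spam_detectors, ham_detectors


def local_concentrations(tokens, spam_detectors, ham_detectors, n_areas=3):
    """Return [spam_lc_1, ham_lc_1, ..., spam_lc_n, ham_lc_n] for one message.

    Assumption: the message is cut into n_areas contiguous areas, and each
    concentration is (distinct terms matched by detectors) / (distinct terms).
    """
    features = []
    for i in range(n_areas):
        start = i * len(tokens) // n_areas
        end = (i + 1) * len(tokens) // n_areas
        distinct = set(tokens[start:end])
        if not distinct:
            # An empty area contributes zero concentration for both sets.
            features.extend([0.0, 0.0])
            continue
        features.append(len(distinct & spam_detectors) / len(distinct))
        features.append(len(distinct & ham_detectors) / len(distinct))
    return features


# Toy training data (invented for this sketch, not from any benchmark corpus).
spam_docs = [["buy", "cheap", "pills"], ["cheap", "offer", "now"]]
ham_docs = [["meeting", "at", "noon"], ["project", "meeting", "notes"]]
spam_set, ham_set = build_detector_sets(spam_docs, ham_docs)

message = ["cheap", "pills", "now", "see", "meeting", "notes"]
print(local_concentrations(message, spam_set, ham_set))
# prints [1.0, 0.0, 0.5, 0.0, 0.0, 1.0]
```

Because the number of areas is fixed, every message yields a feature vector of the same length regardless of how long the message is, and that vector can be passed to any standard classifier.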
