On Some Feature Selection Strategies for Spam Filter Design

Feature selection is an important research problem in many statistical learning tasks, including text categorization applications such as spam email classification. In designing spam filters, we often represent an email using the vector space model (VSM), i.e., every email is treated as a vector of word terms. Since emails contain many different terms, and not all classifiers can handle such a high dimension, only the most discriminative terms should be used. Moreover, some features may not be influential and may carry redundant information that can confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features. Many feature selection strategies can be applied to produce the resulting feature set. In this paper, we investigate the use of hill climbing, simulated annealing, and threshold accepting optimization techniques as feature selection algorithms. We also compare the performance of these three techniques with linear discriminant analysis. Our experimental results show that all these techniques can be used not only to reduce the dimensionality of the e-mail representation, but also to improve the performance of the classification filter. Among all the strategies, simulated annealing performs best, reaching a classification accuracy of 95.5%.
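To make the idea of optimization-based feature selection concrete, the following is a minimal sketch of simulated annealing over binary feature masks. It is not the implementation evaluated in the paper: the scoring function (training accuracy of a nearest-centroid classifier), the geometric cooling schedule, the function names, and the four-term toy data are all assumptions chosen for illustration.

```python
import math
import random


def nearest_centroid_accuracy(X, y, mask):
    """Hypothetical scoring function: training accuracy of a
    nearest-centroid classifier restricted to the selected features."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in idx]
    correct = 0
    for x, t in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(idx, centroids[c])),
        )
        correct += pred == t
    return correct / len(y)


def anneal_features(X, y, score, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing over feature masks (assumed schedule and move set)."""
    rng = random.Random(seed)
    n = len(X[0])
    current = [rng.random() < 0.5 for _ in range(n)]  # random starting subset
    cur_score = score(X, y, current)
    best, best_score = current[:], cur_score
    temp = t0
    for _ in range(steps):
        cand = current[:]
        j = rng.randrange(n)
        cand[j] = not cand[j]  # neighbour move: flip one feature in or out
        cand_score = score(X, y, cand)
        delta = cand_score - cur_score
        # Always accept non-worsening moves; accept worsening moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current[:], cur_score
        temp = max(temp * cooling, 1e-12)  # geometric cooling with a floor
    return best, best_score


# Toy term-count vectors; the columns stand for four hypothetical terms.
X = [[3, 0, 1, 2], [2, 0, 0, 1], [0, 2, 1, 2], [0, 3, 0, 1]]
y = ["spam", "spam", "ham", "ham"]
best, best_score = anneal_features(X, y, nearest_centroid_accuracy, steps=200)
print(best, best_score)
```

In this sketch, accepting only non-worsening moves would turn the loop into hill climbing, and replacing the probabilistic test with a fixed bound on the allowed loss would give a threshold accepting variant. In practice the score would be computed on held-out data rather than on the training set.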
