Two-step based hybrid feature selection method for spam filtering

Feature selection, which reduces the dimensionality of the vector space without sacrificing classifier performance, is commonly used in spam filtering. Because many classifiers cannot handle high-dimensional feature spaces, noisy, irrelevant, and redundant features should be removed. In this paper, we propose a two-step hybrid feature selection method called TFSM. First, we select the most discriminative features using an existing document frequency based feature selection method (called ODFFS). Second, we select the remaining features by combining ODFFS with a newly proposed term frequency based feature selection method (called NTFFS). Moreover, we propose a new meta-heuristic optimization method, called GOPSO, to improve the convergence rate of standard particle swarm optimization. In the experiments, Support Vector Machine (SVM) and Naive Bayesian (NB) classifiers are evaluated on four corpora: PU2, PU3, Enron-spam, and Trec2007. The experimental results show that TFSM is significantly superior to information gain, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, and the improved term frequency-inverse document frequency method on all four corpora, with both SVM and NB.
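
The two-step structure described above can be illustrated with a minimal sketch. The abstract does not define the ODFFS or NTFFS scoring formulas, so the two scoring functions below are simple placeholders (class-wise differences in document rate and in relative term frequency), and the parameters `first_share` and `alpha` are hypothetical names introduced only for this illustration. It is not the authors' implementation and does not include GOPSO.

```python
from collections import Counter


def doc_freq_scores(docs, labels):
    """Placeholder for ODFFS (assumption): absolute difference between the
    fraction of spam documents and ham documents that contain each term."""
    n_spam = sum(1 for y in labels if y == 1)
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # set(tokens) counts each term at most once per document
        (spam_df if y == 1 else ham_df).update(set(tokens))
    vocab = set(spam_df) | set(ham_df)
    return {t: abs(spam_df[t] / max(n_spam, 1) - ham_df[t] / max(n_ham, 1))
            for t in vocab}


def term_freq_scores(docs, labels):
    """Placeholder for NTFFS (assumption): absolute difference in relative
    term frequency between the spam and ham classes."""
    spam_tf, ham_tf = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (spam_tf if y == 1 else ham_tf).update(tokens)
    spam_total = max(sum(spam_tf.values()), 1)
    ham_total = max(sum(ham_tf.values()), 1)
    vocab = set(spam_tf) | set(ham_tf)
    return {t: abs(spam_tf[t] / spam_total - ham_tf[t] / ham_total)
            for t in vocab}


def two_step_select(docs, labels, k, first_share=0.5, alpha=0.5):
    """Select k terms in two steps.

    Step 1: take the top-scoring terms by document frequency score alone.
    Step 2: fill the remaining slots by a weighted combination of the
            document frequency and term frequency scores.
    first_share and alpha are hypothetical tuning parameters.
    """
    df = doc_freq_scores(docs, labels)
    tf = term_freq_scores(docs, labels)
    k1 = int(k * first_share)

    # Step 1: most discriminative terms by document frequency score
    # (ties broken alphabetically so the result is deterministic).
    first = sorted(df, key=lambda t: (-df[t], t))[:k1]
    chosen = set(first)

    # Step 2: rank the remaining terms by the combined score.
    rest = [t for t in df if t not in chosen]
    combined = {t: alpha * df[t] + (1 - alpha) * tf[t] for t in rest}
    second = sorted(rest, key=lambda t: (-combined[t], t))[:k - k1]
    return first + second


if __name__ == "__main__":
    # Toy tokenized corpus; label 1 = spam, 0 = legitimate mail.
    docs = [["free", "money", "now"], ["free", "offer", "free"],
            ["meeting", "tomorrow", "now"], ["project", "meeting", "notes"]]
    labels = [1, 1, 0, 0]
    print(two_step_select(docs, labels, k=4))
```

In this toy corpus, "now" appears equally often in both classes, so both placeholder scores assign it zero and it is ranked last. How the real method splits the feature budget between the two steps and weights the two scores is not stated in the abstract.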
