A Comparative Study of Various Supervised Feature Selection Methods for Spam Classification

Classifying spam within a collection of email files is a challenging research area in the text mining domain, and machine learning approaches have been widely experimented with in the literature, with considerable success. For classifiers to learn effectively, a small set of informative features is important. This research presents a comparative study of several supervised feature selection methods: Document Frequency (DF), Chi-Squared (χ2), Information Gain (IG), Gain Ratio (GR), Relief-F (RF), and One-R (OR). Two corpora (Enron and SpamAssassin) are selected for this study, where Enron is the main corpus and SpamAssassin is used to validate the results. A Bayesian classifier is used to classify the given corpora with the features selected by the above techniques. The results of this study show that Relief-F is the best feature selection technique in terms of classification accuracy and false positive rate, whereas DF and χ2 were less effective. The Bayesian classifier has proven its worth in this study, achieving good classification accuracy with a low false positive rate.
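The general pipeline the abstract describes (score terms with a supervised feature selection method, keep the top-scoring ones, then train a Bayesian classifier) can be sketched as follows. This is an illustrative sketch using scikit-learn's chi-squared scorer and multinomial naive Bayes, not the authors' exact pipeline; the toy messages, labels, and `k=5` are made-up placeholders.

```python
# Sketch of supervised feature selection feeding a naive Bayes spam classifier.
# Texts, labels, and k are illustrative placeholders, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win free money now claim prize",
    "meeting agenda attached for monday",
    "free prize winner claim cash now",
    "project update and quarterly report",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Vectorize into term counts, keep the k terms with the highest
# chi-squared scores against the class labels, then classify.
pipeline = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),
    MultinomialNB(),
)
pipeline.fit(texts, labels)
print(pipeline.predict(["claim your free cash prize"]))  # should flag spam
```

Swapping `chi2` for another scoring function (e.g. `mutual_info_classif` for an Information Gain-style score) changes only the `SelectKBest` argument, which is what makes this pipeline convenient for comparing feature selection methods under a fixed classifier.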
