A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable step in text categorization, therefore also plays a key role in spam filtering. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability of a feature belonging to the spam class satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (PU1, PU2, PU3, PUA, Ling-Spam, and CSDMC2010) using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, the χ²-statistic, the improved Gini index, and the Poisson distribution). The experiments show that, in terms of the F1 measure, Bi-Test performs significantly better than the χ²-statistic and the Poisson distribution and produces performance comparable to information gain and the improved Gini index when the Naive Bayes classifier is used; with the SVM classifier it achieves performance comparable to all four methods. Moreover, Bi-Test executes faster than the other four algorithms.
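The core idea, as the abstract describes it, is a per-feature binomial hypothesis test: a term is kept if the proportion of spam among the documents containing it significantly exceeds a threshold. The sketch below is an illustrative reconstruction, not the paper's exact formulation; the document counts, the null proportion `p0`, and the significance level `alpha` are all hypothetical assumptions.

```python
import math

def binomial_pvalue(k, n, p0):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

def bi_test_select(term_stats, p0=0.5, alpha=0.05):
    """Binomial-test feature selection (illustrative sketch).

    term_stats maps each term to (spam_doc_count, total_doc_count),
    i.e. how many documents containing the term are spam, out of all
    documents containing it. A term is selected when its spam
    proportion is significantly greater than p0 at level alpha.
    """
    selected = []
    for term, (k, n) in term_stats.items():
        if n > 0 and binomial_pvalue(k, n, p0) < alpha:
            selected.append(term)
    return selected

# Hypothetical counts: term -> (spam docs containing term, all docs containing term)
stats = {"viagra": (48, 50), "meeting": (5, 50), "free": (35, 50)}
print(bi_test_select(stats))  # terms whose spam proportion clears the test
```

Computing the exact binomial tail keeps the sketch dependency-free; in practice the same test is available as `scipy.stats.binomtest` with `alternative="greater"`.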
