The assessment of feature selection methods on agglutinative language for spam email detection: A special case for Turkish

This study assesses three feature selection methods, Information Gain (IG), Gini Index (GI), and chi-square (CHI2), using two popular pattern classifiers, Artificial Neural Network (ANN) and Decision Tree (DT), for the classification of Turkish e-mails. Feature vectors are constructed with the bag-of-words feature extraction method. The paper focuses on Turkish because it is one of the most widely used agglutinative languages in the world. The results indicate that the CHI2 and GI feature selection methods are more effective than the IG method for Turkish.
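To make the three scoring criteria concrete, the following is a minimal sketch of how terms from a binary bag-of-words representation could be ranked by CHI2, IG, and GI. It is not the authors' implementation. The toy corpus, the function names, and the specific formulations (the standard 2x2 contingency chi-square, entropy-based information gain over term presence, and a Gini index of the form sum over classes of P(t|c)^2 P(c|t)^2) are assumptions chosen for illustration; the study may use different variants, preprocessing, and feature counts.

```python
import math

# Hypothetical toy corpus (not from the study): (text, label) pairs.
DOCS = [
    ("bedava hediye kazan hemen", "spam"),
    ("bedava kredi hemen al", "spam"),
    ("proje rapor ekte", "legit"),
    ("proje teslim tarihi hafta sonu", "legit"),
]


def bag_of_words(docs):
    """Return one token set per document and the sorted vocabulary."""
    token_sets = [set(text.split()) for text, _ in docs]
    vocab = sorted(set().union(*token_sets))
    return token_sets, vocab


def contingency(term, token_sets, labels, cls):
    """Document counts: a = term & cls, b = term & other classes,
    c = no term & cls, d = no term & other classes."""
    a = b = c = d = 0
    for tokens, label in zip(token_sets, labels):
        present, in_cls = term in tokens, label == cls
        if present and in_cls:
            a += 1
        elif present:
            b += 1
        elif in_cls:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(term, token_sets, labels):
    """Maximum over classes of the 2x2 chi-square statistic."""
    best = 0.0
    for cls in set(labels):
        a, b, c, d = contingency(term, token_sets, labels, cls)
        n = a + b + c + d
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        if denom:
            best = max(best, n * (a * d - b * c) ** 2 / denom)
    return best


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(term, token_sets, labels):
    """H(C) minus the expected class entropy after observing term presence."""
    classes = sorted(set(labels))
    n = len(labels)
    with_t = [l for toks, l in zip(token_sets, labels) if term in toks]
    without_t = [l for toks, l in zip(token_sets, labels) if term not in toks]

    def class_dist(subset):
        return [subset.count(c) / len(subset) for c in classes] if subset else []

    h_cond = sum(len(s) / n * entropy(class_dist(s)) for s in (with_t, without_t))
    return entropy(class_dist(list(labels))) - h_cond


def gini_index(term, token_sets, labels):
    """Sum over classes of P(term | class)^2 * P(class | term)^2."""
    score = 0.0
    for cls in set(labels):
        a, b, c, _ = contingency(term, token_sets, labels, cls)
        if a:
            score += (a / (a + c)) ** 2 * (a / (a + b)) ** 2
    return score


def select_top_k(score_fn, docs, k):
    """Rank the vocabulary by score_fn and keep the k best terms."""
    token_sets, vocab = bag_of_words(docs)
    labels = [label for _, label in docs]
    ranked = sorted(vocab, key=lambda t: score_fn(t, token_sets, labels), reverse=True)
    return ranked[:k]
```

Calling `select_top_k(chi_square, DOCS, 3)`, and likewise with `information_gain` or `gini_index`, yields a reduced vocabulary that could then be used to build the feature vectors fed to a classifier such as an ANN or a DT. On a corpus this small the three criteria agree on the best terms; differences of the kind the study reports would only appear on realistic data.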