Genetic Algorithm based Feature Selection in High Dimensional Text Dataset Classification

The bag-of-words language model over a vector space is commonly used to represent the documents in a corpus. However, this representation requires a high-dimensional input feature space, containing many irrelevant and redundant features, in order to cover all corpus files. Removing redundant features from the input space improves the generalization ability of a classifier. In this study, we develop a new objective function based on the model's F1 score and the feature subset size. We present a genetic algorithm for feature selection that reduces the modeling complexity and training time of the classification algorithms used in the text classification task, applying this genetic-algorithm-based meta-heuristic optimization to improve the F1 score of the classifier hypothesis. First, (i) we develop a new objective function to maximize; (ii) we then choose candidate features for the classification algorithm; and (iii) finally, support vector machine (SVM), maximum entropy (MaxEnt), and stochastic gradient descent (SGD) classifiers are used to build classification models on publicly available datasets.

Key-Words: Feature selection, support vector machines, logistic regression, stochastic gradient descent, document classification
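The abstract's pipeline can be illustrated with a minimal sketch: a genetic algorithm evolves binary feature masks whose fitness combines a classifier's F1 score with a penalty on subset size. The paper does not specify its GA operators or penalty weight, so the nearest-centroid classifier, the `alpha` penalty, the crossover/mutation scheme, and all function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    # F1 for the positive class (label 1)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def centroid_classify(X_tr, y_tr, X_te, mask):
    # Stand-in classifier: nearest centroid on the selected features only.
    # (The paper uses SVM / MaxEnt / SGD; any classifier fits this slot.)
    cols = np.flatnonzero(mask)
    c0 = X_tr[y_tr == 0][:, cols].mean(axis=0)
    c1 = X_tr[y_tr == 1][:, cols].mean(axis=0)
    d0 = np.linalg.norm(X_te[:, cols] - c0, axis=1)
    d1 = np.linalg.norm(X_te[:, cols] - c1, axis=1)
    return (d1 < d0).astype(int)

def fitness(mask, X_tr, y_tr, X_te, y_te, alpha=0.1):
    # Objective in the spirit of the abstract: F1 minus a subset-size
    # penalty; alpha is an assumed trade-off weight.
    if mask.sum() == 0:
        return 0.0
    f1 = f1_score_binary(y_te, centroid_classify(X_tr, y_tr, X_te, mask))
    return f1 - alpha * mask.mean()

def ga_select(X_tr, y_tr, X_te, y_te, pop=20, gens=30, pmut=0.05, seed=0):
    # Simple elitist GA over binary feature masks with one-point
    # crossover and bit-flip mutation.
    rng = np.random.default_rng(seed)
    n = X_tr.shape[1]
    popu = rng.integers(0, 2, size=(pop, n))
    for _ in range(gens):
        scores = np.array([fitness(m, X_tr, y_tr, X_te, y_te) for m in popu])
        popu = popu[np.argsort(scores)[::-1]]
        elite = popu[: pop // 2]
        children = []
        while len(children) < pop - len(elite):
            a, b = elite[rng.integers(len(elite), size=2)]
            cut = rng.integers(1, n)                       # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < pmut                    # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        popu = np.vstack([elite] + children)
    scores = np.array([fitness(m, X_tr, y_tr, X_te, y_te) for m in popu])
    return popu[np.argmax(scores)]
```

A selected mask can then be used to restrict the term-document matrix (`X[:, mask == 1]`) before training the final SVM, MaxEnt, or SGD model.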
