Novel artificial bee colony based feature selection method for filtering redundant information

Feature selection, which reduces the dimensionality of the feature space without sacrificing classifier performance, is an effective technique for text classification. Because many classifiers cannot cope with high-dimensional feature spaces, filtering redundant information out of the original feature space has become one of the core goals of feature selection. In this paper, the concept of an equivalence word set is introduced, and a collection of equivalence word sets (denoted EWS1) is constructed from the rich semantic information in the Open Directory Project (ODP). On this basis, an artificial bee colony based feature selection method is proposed to filter out redundant information, and a feature subset FS is obtained using an optimal feature selection (OFS) method together with two predetermined thresholds. To determine the best values for these thresholds, an improved memory-based artificial bee colony method (IABCM) is proposed. In the experiments, fuzzy support vector machine (FSVM) and Naïve Bayes (NB) classifiers are evaluated on six datasets: LingSpam, WebKB, SpamAssassin, 20-Newsgroups, Reuters-21578 and TREC 2007. The experimental results verify that, with both FSVM and NB, the proposed method is efficient and achieves better accuracy than several representative feature selection methods.
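
As a rough illustration of the wrapper-style search underlying this kind of approach, the sketch below applies a generic binary artificial bee colony loop to feature subset selection, scoring each candidate subset with a Naive Bayes classifier. The dataset, parameter values and fitness definition are assumptions made for the example only; it does not reproduce the paper's equivalence word sets, the OFS thresholds, or the memory-based IABCM refinement.

# Illustrative sketch only: a generic binary artificial bee colony (ABC) wrapper
# for feature selection, scored with Naive Bayes. All parameter choices here
# (n_bees, limit, max_iter, the synthetic dataset) are assumptions for the demo.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_redundant=12, random_state=0)
n_feat = X.shape[1]

def fitness(mask):
    """Cross-validated NB accuracy on the selected features (empty subset scores 0)."""
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

def neighbour(mask):
    """Flip one randomly chosen feature bit to explore a nearby subset."""
    new = mask.copy()
    new[rng.integers(n_feat)] ^= True
    return new

n_bees, limit, max_iter = 20, 10, 30
# Each food source is a binary mask over the feature space.
sources = rng.random((n_bees, n_feat)) < 0.5
scores = np.array([fitness(s) for s in sources])
trials = np.zeros(n_bees, dtype=int)

for _ in range(max_iter):
    # Employed-bee phase: each bee searches around its own food source.
    for i in range(n_bees):
        cand = neighbour(sources[i])
        f = fitness(cand)
        if f > scores[i]:
            sources[i], scores[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1
    # Onlooker-bee phase: better sources attract proportionally more searches.
    probs = scores / scores.sum()
    for i in rng.choice(n_bees, size=n_bees, p=probs):
        cand = neighbour(sources[i])
        f = fitness(cand)
        if f > scores[i]:
            sources[i], scores[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1
    # Scout-bee phase: abandon sources that have not improved within `limit` trials.
    for i in np.where(trials > limit)[0]:
        sources[i] = rng.random(n_feat) < 0.5
        scores[i] = fitness(sources[i])
        trials[i] = 0

best = scores.argmax()
print("selected features:", np.flatnonzero(sources[best]))
print("NB accuracy on subset: %.3f" % scores[best])

The employed/onlooker/scout phases follow the standard ABC scheme; in the paper, an additional memory mechanism (IABCM) is used to tune the two thresholds rather than the raw feature mask shown here.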
