Feature selection method based on statistics of compound words for Arabic text classification

One of the main problems in text classification is the high dimensionality of the feature space. Feature selection methods are commonly used to reduce the dimensionality of datasets in order to improve classification performance, reduce processing time, or both. To improve the performance of text classification, this paper presents a feature selection algorithm that reduces the dimensionality of the feature space by using terminology extracted from the statistics of compound words. The proposed method is evaluated both as a standalone method and in combination with other feature selection methods (a two-stage method). Its performance is compared with that of six well-known feature selection methods: Information Gain, Chi-Square, Gini Index, Support Vector Machine-Based, Principal Components Analysis, and Symmetric Uncertainty. A wide range of comparative experiments was conducted on three standard Arabic datasets with three classification algorithms. The experimental results show that the proposed method outperforms the compared methods both as a standalone method and in the two-stage scenario. In terms of classification accuracy, it improves on the traditional approaches with a 6-10% gain in macro-averaged F1.
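The reported gain is stated in terms of macro-averaged F1, which weights every class equally regardless of how many documents it contains. The following is a minimal sketch of that metric in Python, not code from the paper; the function name `macro_f1` and the example labels are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of the per-class F1 scores.

    Illustrative helper (an assumption, not taken from the paper).
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four documents in three classes.
y_true = ["sport", "sport", "politics", "economy"]
y_pred = ["sport", "politics", "politics", "economy"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, a feature selection method that helps small classes can raise this score even when overall accuracy changes little.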
