Feature selection using an improved Chi-square for Arabic text classification

Abstract In text mining, feature selection (FS) is a common method for reducing the dimensionality of the feature space and improving classification accuracy. In this paper, we propose an improved Chi-square feature selection method (hereafter referred to as ImpCHI) to enhance the performance of Arabic text classification. We also compare ImpCHI with three traditional feature selection metrics, namely mutual information, information gain and Chi-square. Building on our previous work, we extend the assessment of the method to additional evaluation measures using an SVM classifier. For this purpose, a dataset of 5070 Arabic documents is classified into six independent classes. The experimental findings show that combining the ImpCHI method with the SVM classifier outperforms the other combinations in terms of precision, recall and F-measure. This combination significantly improves the performance of the Arabic text classification model. The best F-measure obtained for this model is 90.50%, when the number of features is 900.
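
For orientation, the traditional Chi-square metric that serves as one of the baselines can be sketched as follows. This is a minimal illustration of the standard term-versus-class Chi-square score, not the ImpCHI variant, whose modification is not specified in the abstract. The function names (`chi_square`, `select_features`), the max-over-classes ranking rule, and the toy English tokens are assumptions made for the example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard Chi-square score of a term for a class (2x2 contingency table).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no information
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest Chi-square score.

    Assumption: each term is scored by its maximum over all classes;
    ties are broken alphabetically so the result is deterministic.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    term_df = Counter()        # document frequency per term
    term_class_df = Counter()  # document frequency per (term, class)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[(term, label)] += 1

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = term_class_df[(term, cls)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical, pre-tokenized documents)
docs = [["goal", "match"], ["goal", "team"], ["vote", "party"], ["vote", "law"]]
labels = ["sport", "sport", "politics", "politics"]
top_terms = select_features(docs, labels, 2)
```

In a setup like the one described, the selected terms (for example, the top 900) would then define the feature vectors passed to the SVM classifier.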
