Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization

The growing volume of digital text from sources such as customer reviews, news, and social media has made text categorization crucial for managing this enormous amount of data. Because these texts are high-dimensional, a preliminary feature selection step is needed to reduce the feature space, which can also increase prediction accuracy. In this study, an ensemble feature selection method, namely majority vote rank allocation, was developed for Turkish text categorization. The method uses a majority voting ensemble strategy together with a rank allocation approach to combine weak filters such as information gain, symmetric uncertainty, relief, and correlation-based feature selection. The proposed method thus measures the quality of each feature relative to all other features using the majority votes of the filters and the allocated ranks. The feature selection efficacy of the method was tested on two datasets, one from the literature and one newly collected. The effect of the selected features on classification performance was evaluated with the naive Bayes, support vector machine, J48, and random forest algorithms. It was empirically observed that the developed method improved the prediction accuracies of the classifiers compared to the individual filters listed above. The statistical significance of the experimental results was also validated with a two-way analysis of variance test.
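The abstract does not give the voting threshold or the exact rank allocation rule, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes that each filter produces one score per feature (higher is better), that a filter "votes" for the features in its own top `top_k`, that a feature is kept when more than half of the filters vote for it, and that kept features are ordered by vote count and then by summed rank. The function names, the `top_k` parameter, and the toy scores are all hypothetical. Note that correlation-based feature selection normally evaluates subsets rather than single features, so the `cfs_like` column is an illustrative stand-in.

```python
from typing import Dict, List


def ranks_from_scores(scores: List[float]) -> List[int]:
    """Convert per-feature scores into ranks (1 = best, higher score is better)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, feature in enumerate(order, start=1):
        ranks[feature] = position
    return ranks


def majority_vote_rank_allocation(
    filter_scores: Dict[str, List[float]], top_k: int
) -> List[int]:
    """Return indices of features backed by a strict majority of filters.

    Assumed rule (not taken from the paper): a filter votes for a feature
    if the feature is within that filter's top_k ranks. Kept features are
    ordered by number of votes (descending), then by summed rank (ascending).
    """
    all_ranks = [ranks_from_scores(s) for s in filter_scores.values()]
    n_filters = len(all_ranks)
    n_features = len(all_ranks[0])

    kept = []
    for feature in range(n_features):
        votes = sum(1 for ranks in all_ranks if ranks[feature] <= top_k)
        if votes > n_filters / 2:  # strict majority of the filters
            rank_sum = sum(ranks[feature] for ranks in all_ranks)
            kept.append((-votes, rank_sum, feature))

    kept.sort()
    return [feature for _, _, feature in kept]


# Made-up scores for five features; these are not results from the study.
toy_scores = {
    "info_gain": [0.9, 0.1, 0.5, 0.3, 0.7],
    "symmetric_uncertainty": [0.8, 0.2, 0.6, 0.1, 0.4],
    "relief": [0.2, 0.9, 0.7, 0.1, 0.8],
    "cfs_like": [0.6, 0.3, 0.9, 0.2, 0.1],
}

selected = majority_vote_rank_allocation(toy_scores, top_k=3)
print(selected)  # [2, 0, 4]
```

In this toy run, feature 2 is in the top three of all four filters and comes first; features 0 and 4 each receive three votes and are separated by their summed ranks (8 versus 12). Features 1 and 3 fail to reach a majority and are dropped.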
