An Effective Method of Feature Selection in Persian Text for Improving the Accuracy of Detecting Request in Persian Messages on Telegram

In recent years, the volume of data generated on social media has increased exponentially, and these platforms have become valuable sources of information for analysts and for businesses seeking to grow. Automatic document classification is an essential step in extracting knowledge from these sources. In automatic text classification, words are treated as a set of features; selecting the useful features from each text reduces the size of the feature vector and improves classification performance. Many algorithms have been applied to automatic text classification. Although the methods proposed for other languages are applicable and comparable, classification and feature selection in Persian text have not been studied sufficiently. The present research is conducted on Persian text, and the introduction of a Persian dataset is part of its contribution. This article presents an approach to improving the performance of Persian text classification. The authors extracted 85,000 Persian messages from the Idekav-system, a Telegram search engine. The approach processes and classifies this textual data in three steps. First, the feature vector is expanded by adding selected features obtained with the most widely used feature selection methods based on local and global filters. Second, the new feature vector is filtered by a secondary feature selection, which keeps the more appropriate features among those added in the first step in order to enhance the effect of wrapper methods on classification performance. Third, the combined filter-based methods and a combination of the results of different learning algorithms are used to achieve higher accuracy. At the end of the three selection stages, the proposed method increased accuracy up to 0.945 and reduced training time and computation on the Persian dataset.
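
To make the idea of local and global filter-based feature selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes the chi-square statistic as the filter metric (the abstract does not name one), scores each term per class from document frequencies (the local score), and takes the maximum over classes as the global score. The function names `chi_square`, `local_scores`, and `select_global` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: docs in the class containing the term
    n10: docs outside the class containing the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def local_scores(docs, labels):
    """Return {term: {class: chi2}}, the per-class (local) scores."""
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # number of docs containing each term
    df_in_class = Counter()  # (term, class) -> number of docs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term, total in df.items():
        scores[term] = {}
        for cls in classes:
            n11 = df_in_class[(term, cls)]
            n10 = total - n11
            n01 = class_sizes[cls] - n11
            n00 = n_docs - n11 - n10 - n01
            scores[term][cls] = chi_square(n11, n10, n01, n00)
    return scores


def select_global(scores, k):
    """Rank terms by their maximum local score (an assumed globalization
    policy) and keep the top k; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda t: (-max(scores[t].values()), t))
    return ranked[:k]
```

A usage example on toy, already tokenized messages (invented here for illustration, not taken from the dataset):

```python
docs = [
    ["please", "send", "file"],
    ["please", "help", "me"],
    ["nice", "weather", "today"],
    ["good", "morning", "today"],
]
labels = ["request", "request", "other", "other"]
selected = select_global(local_scores(docs, labels), k=2)
```

Taking the maximum over classes is only one way to turn local scores into a global ranking; a class-prior-weighted average is a common alternative, and the secondary selection and ensemble stages described in the abstract are not covered by this sketch.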
