A Two-stage Text Feature Selection Algorithm for Improving Text Classification

As the number of digital text documents grows daily, text classification is becoming an increasingly challenging task. Each text document consists of a large number of words (or features), and this large feature set drives down the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm that reduces the number of features in order to improve the accuracy of text classification. The proposed algorithm combines noun-based filtering with word ranking to enhance the performance of the text classification algorithm. Experiments are carried out on three benchmark datasets, and the proposed algorithm is compared with Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The results show that the proposed algorithm achieves the highest classification accuracy among the compared methods.
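
To make the idea of filtering followed by ranking concrete, the following is a minimal sketch and not the article's actual method. It assumes a tiny hand-written noun lexicon (`NOUNS`) in place of a real part-of-speech tagger, and it uses the chi-square statistic purely as a stand-in ranking criterion, since the abstract does not specify how words are ranked. The names `noun_filter`, `chi_square`, and `select_features` are hypothetical.

```python
# Hypothetical noun lexicon; a real system would use a part-of-speech tagger.
NOUNS = {"match", "team", "goal", "market", "stock", "price", "player", "bank"}


def noun_filter(tokens, nouns=NOUNS):
    """Stage 1: keep only tokens that the assumed lexicon marks as nouns."""
    return [t for t in tokens if t in nouns]


def chi_square(term, docs, labels, target):
    """Chi-square score of a term for one class, from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label == target:
            a += 1  # term present, document in target class
        elif present:
            b += 1  # term present, document in another class
        elif label == target:
            c += 1  # term absent, document in target class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Stage 2: rank the surviving nouns and keep the top k.

    Chi-square is an assumption made for illustration only.
    """
    filtered = [noun_filter(doc) for doc in docs]
    vocab = sorted({t for doc in filtered for t in doc})
    classes = sorted(set(labels))
    scores = {
        t: max(chi_square(t, filtered, labels, c) for c in classes)
        for t in vocab
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    docs = [
        "the team scored a late goal in the match".split(),
        "the player left the team quickly".split(),
        "the stock price fell as the market closed".split(),
        "the bank raised the price of the stock".split(),
    ]
    labels = ["sport", "sport", "finance", "finance"]
    print(select_features(docs, labels, k=3))
```

In this toy corpus the first stage discards every non-noun before any statistics are computed, so the ranking stage only has to score a much smaller vocabulary.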
