A novel filter feature selection method using rough set for short text data

Abstract The high-dimensionality problem is an important concern for short text classification because of its effect on the computational cost and accuracy of classifiers. Short text data, besides being high dimensional, also has an incomplete, inconsistent, and sparse structure. Selecting important features that provide a better representation is one solution to the high-dimensionality problem. In this study, we developed a novel filter feature selection method, Proportional Rough Feature Selector (PRFS), which uses rough sets to make a regional distinction, according to the value set of a term, between documents that certainly belong to a class and documents that only possibly belong to it. Documents that possibly belong to a class are penalized by multiplying by a coefficient named α. Additionally, the effect of sparsity in the term vector space is calculated with the help of rough sets. PRFS is compared with state-of-the-art filter feature selection methods such as the Gini index, information gain, the distinguishing feature selector, and the recently proposed max–min ratio and normalized difference measure. The comparison is carried out using various feature sizes on four different short text datasets, with Macro-F1 as the success measure. Experimental results demonstrated that PRFS offers either better or competitive performance with respect to the other feature selection methods in terms of Macro-F1. This study may be a pioneering one in this research field, as it proposes a novel feature selection method for short text classification based on rough set theory.
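The regional distinction described in the abstract can be illustrated with the standard rough-set lower and upper approximations. The sketch below is only an illustration under stated assumptions, not the PRFS formula, which the abstract does not give: documents are grouped by how often a term occurs in them (one possible reading of "value set of the term"), the lower approximation collects documents that certainly belong to the class, and the boundary region (upper minus lower) collects those that only possibly belong. The function names `approximations` and `toy_score`, the default `alpha=0.5`, and the scoring rule itself are hypothetical.

```python
from collections import defaultdict


def approximations(docs, labels, term, target):
    """Rough-set lower and upper approximations of class `target`.

    Assumption: documents are indiscernible when `term` occurs the same
    number of times in them, so each distinct count forms one block.
    """
    blocks = defaultdict(set)
    for i, tokens in enumerate(docs):
        blocks[tokens.count(term)].add(i)

    target_docs = {i for i, y in enumerate(labels) if y == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target_docs:   # every document in the block is in the class
            lower |= block
        if block & target_docs:    # at least one document in the block is in the class
            upper |= block
    return lower, upper


def toy_score(docs, labels, term, target, alpha=0.5):
    """Hypothetical score: certain documents count fully, boundary
    documents are down-weighted by alpha. Not the PRFS formula."""
    lower, upper = approximations(docs, labels, term, target)
    boundary = upper - lower
    return (len(lower) + alpha * len(boundary)) / len(docs)


docs = [
    ["cheap", "pills"],
    ["cheap", "cheap", "offer"],
    ["meeting", "today"],
    ["cheap", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

lower, upper = approximations(docs, labels, "cheap", "spam")
score = toy_score(docs, labels, "cheap", "spam", alpha=0.5)
```

In this toy corpus, the document containing "cheap" twice is the only one that certainly belongs to "spam", while the two documents containing it once fall in the boundary region because they carry different labels.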
