Feature Selection with Structural Sparse Model for Text Categorization

Grouped structure has been successfully embedded in sparse models for feature selection; however, groups generated by clustering methods can be hard to interpret semantically when a group contains a very large number of words. This paper proposes a novel approach in which a group structure is constructed and the corresponding sparse model is used to select features for text categorization. After variable preselection, an algorithm generates groups such that each group contains only two or three closely related words, so that each group reflects a more essential semantic meaning. Finally, the structural sparse model is used to select features in a wrapper manner. Experimental results demonstrate that the proposed method achieves comparable precision and considerably improves sparsity, which means that the model has better interpretability.
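To illustrate how a sparse model can select or discard small word groups as a unit, the following is a minimal sketch, assuming the structural sparse model uses a group-lasso style penalty (the abstract does not state the exact penalty). It applies the standard group soft-thresholding step: a group whose weight norm falls below the threshold is zeroed as a whole. The function names, vocabulary, groups, weights, and the threshold `lam` are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Proximal step for the penalty lam * sum_g ||w_g||_2 (group lasso).

    weights: list of floats, one per word feature
    groups:  list of index lists; each index belongs to exactly one group
    lam:     non-negative threshold (hypothetical value in the example)
    """
    out = list(weights)
    for idx in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        # Shrink the whole group; it becomes exactly zero when norm <= lam.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * weights[i]
    return out


def selected_groups(weights, groups):
    """Return the groups that still have at least one non-zero weight."""
    return [g for g in groups if any(weights[i] != 0.0 for i in g)]


# Hypothetical vocabulary split into groups of two or three related words.
vocab = ["stock", "share", "market", "goal", "match", "the", "of"]
groups = [[0, 1, 2], [3, 4], [5, 6]]
w = [0.9, 0.7, 0.8, 0.3, 0.4, 0.05, 0.02]  # assumed classifier weights

w_new = group_soft_threshold(w, groups, lam=0.2)
kept = selected_groups(w_new, groups)
print([[vocab[i] for i in g] for g in kept])
```

In this toy setting the low-weight group is removed entirely while the other groups are only shrunk, which is the sense in which group-level sparsity can make the selected features easier to read: each surviving group is a small set of related words.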
