Improved Document Feature Selection with Categorical Parameter for Text Classification

Social networks are developing rapidly, and thousands of new data items appear on the Internet every day. Classification technology is key to organizing big data, and Feature Selection (FS) is a direct way to improve classification efficiency. FS reduces the size of the feature subset while preserving classification accuracy by ranking features according to scores computed by an FS method. Most previous FS studies emphasized precision, while time efficiency was commonly ignored. In this study, we first propose a method named CDFDC, which combines CDF and Category-Frequency. Second, we compare DF, CDF, CHI, IG, CDFP_VM and CDFDC to examine the relationships among algorithm complexity, time efficiency and classification accuracy. The experiments use the 20-news-group data set and an NB classifier. The performance of the FS methods is evaluated on seven measures: precision, Micro F1, Macro F1, feature-selection time, document-conversion time, training time and classification time. The results show that the proposed method performs well in both efficiency and accuracy when the feature subset contains more than 3,000 features. We also find that an FS algorithm’s complexity is unrelated to accuracy, but that complexity can ensure time stability and predictability.
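Two of the seven measures, Micro F1 and Macro F1, differ only in how per-category counts are averaged. The following is a minimal sketch of the standard definitions for single-label classification, not the evaluation code used in the study; the function name `micro_macro_f1` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    Hypothetical helper for illustration; it follows the textbook
    definitions, not any implementation from the study.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-category true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no counts.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro F1: average the per-category F1 scores (each category weighs equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


if __name__ == "__main__":
    y_true = ["a", "a", "a", "b"]
    y_pred = ["a", "a", "b", "b"]
    print(micro_macro_f1(y_true, y_pred))
```

Because Macro F1 weighs every category equally, it is more sensitive to errors on small categories, whereas Micro F1 is dominated by the large ones; reporting both gives a fuller picture of accuracy.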
