Application of knowledge gain on multi-type feature space in microblog user classification

Feature selection plays an important role in text categorization. Classic feature selection methods such as document frequency (DF), information gain (IG), and mutual information (MI) are commonly applied, but they usually take only plain text into account. Knowledge Gain (KG) is a feature selection method, proposed in our previous paper, that measures an attribute's importance based on rough set theory. Experiments showed that it performs well in traditional text classification and has a clear advantage in recall rate on unbalanced corpora. Unlike traditional text, microblogs are characterized by short texts and special network structures, including the user social network and the behavior network. As a result, microblog users offer less textual information and more behavioral and social information, so the classic feature selection algorithms, which were designed for text features, are not applicable. In this paper, we validate that KG, being based on rough set knowledge, consistently selects optimal features in the multi-type feature space of microblog user classification. Experiments show that it performs better in multi-type feature selection than the other classic feature selection methods.
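To make the notion of a multi-type feature space concrete, the following minimal sketch ranks a text feature, a behavior feature, and a social feature with information gain (IG), one of the classic baselines named above. It is not the KG method, whose rough-set definition is not given in this abstract. The feature names (`text_keyword`, `posts_bucket`, `follows_hub`), the class labels, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Score every feature in the rows and sort by IG, highest first."""
    names = list(rows[0].keys())
    scores = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical users: one text, one behavior, and one social feature each.
rows = [
    {"text_keyword": 1, "posts_bucket": "high", "follows_hub": 1},
    {"text_keyword": 1, "posts_bucket": "low", "follows_hub": 1},
    {"text_keyword": 0, "posts_bucket": "high", "follows_hub": 1},
    {"text_keyword": 0, "posts_bucket": "low", "follows_hub": 0},
]
labels = ["sports", "sports", "news", "news"]  # hypothetical user classes

ranked = rank_features(rows, labels)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

In this toy data the text feature separates the two classes perfectly and the behavior feature carries no information, so the ranking only shows how features of different types can be scored on a common scale once they are discretized.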