Feature selection plays an important role in text categorization. Classic feature selection methods such as document frequency (DF), information gain (IG), and mutual information (MI) are commonly applied, but they usually take only plain text into account. Knowledge Gain (KG) is a feature selection method, proposed in our previous paper, that measures an attribute's importance based on rough set theory. Experiments showed that it performs well in traditional text classification and that it has a clear advantage in recall on unbalanced corpora. Unlike traditional text, microblogs are characterized by short text and by special network structures, including the user social network and the behavior network. As a result, microblog users provide less text information and more behavioral and social information, and the classic feature selection algorithms, which were designed for text features, are not applicable. In this paper, we validate that KG, which is based on rough set knowledge, consistently selects optimal features in the multi-type feature space of microblog user classification. Experiments show that it performs better in multi-type feature selection than the other classic feature selection methods.
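For readers unfamiliar with the classic text-based criteria named above, the following is a minimal sketch of document frequency and information gain for a binary "term present / term absent" feature. It is not the KG method and is not taken from the paper; the function names, the toy documents, and the labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def document_frequency(docs, term):
    """Number of documents that contain the term."""
    return sum(1 for d in docs if term in d)


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether the term occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"vote", "party"},
    {"vote", "team"},
]
labels = ["sports", "sports", "politics", "politics"]

for term in ("goal", "team"):
    print(term, document_frequency(docs, term), information_gain(docs, labels, term))
```

In this toy corpus, "goal" and "team" have the same document frequency, yet "goal" separates the two classes perfectly while "team" carries no class information. Both criteria look only at term occurrence, which is the limitation the abstract points to for microblog users, whose features also include behavioral and social attributes.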