Feature selection is an important task in Chinese text categorization. However, traditional Chinese feature selection methods rely on a conditional independence assumption, so the resulting feature subsets contain many redundant features. In this paper, a combined feature selection method for Chinese text is proposed, built from regularized mutual information (RMI) and Distribute Information among Classes (DI). Feature selection proceeds in two steps: in the first step, the Distribute Information algorithm removes features that are irrelevant to the text category; in the second step, redundant features are eliminated by regularized mutual information. Experimental results show that this combined feature selection method improves the quality of classification.

Keywords: feature selection; regularized mutual information; distribute information among classes; feature redundancy; Chinese text categorization
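The abstract does not reproduce the paper's exact DI and RMI formulas, so the following is only an illustrative sketch of the two-step pipeline under stated assumptions: plain term/class mutual information stands in for the DI relevance score, and mutual information normalized by the geometric mean of the two term entropies (one common "regularized" form) stands in for the RMI redundancy measure. The corpus, function names, and thresholds are all hypothetical.

```python
import math

# Hypothetical toy corpus: each document is (set of terms, class label).
DOCS = [
    ({"sports", "ball", "team"}, "sport"),
    ({"sports", "team", "win"}, "sport"),
    ({"stock", "market", "win"}, "finance"),
    ({"stock", "market", "bank"}, "finance"),
    ({"vote", "election"}, "politics"),
    ({"vote", "election", "win"}, "politics"),
]

def entropy(p):
    """Binary entropy (in bits) of a term-presence probability."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def term_class_mi(term, docs):
    """I(T; C): MI between a term's presence and the class label.
    Used here as a stand-in for the paper's DI relevance score."""
    n = len(docs)
    classes = {c for _, c in docs}
    mi = 0.0
    for present in (True, False):
        p_t = sum((term in d) == present for d, _ in docs) / n
        for c in classes:
            p_c = sum(cl == c for _, cl in docs) / n
            p_tc = sum((term in d) == present and cl == c for d, cl in docs) / n
            if p_tc > 0:
                mi += p_tc * math.log2(p_tc / (p_t * p_c))
    return mi

def term_term_nmi(t1, t2, docs):
    """I(T1;T2) / sqrt(H(T1) * H(T2)): normalized MI between two terms'
    presence indicators -- a stand-in for the paper's RMI."""
    n = len(docs)
    p1 = sum(t1 in d for d, _ in docs) / n
    p2 = sum(t2 in d for d, _ in docs) / n
    h1, h2 = entropy(p1), entropy(p2)
    if h1 == 0.0 or h2 == 0.0:
        return 0.0  # a constant term carries no information to share
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum((t1 in d) == a and (t2 in d) == b for d, _ in docs) / n
            pa = p1 if a else 1 - p1
            pb = p2 if b else 1 - p2
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (pa * pb))
    return mi / math.sqrt(h1 * h2)

def select_features(docs, relevance_min=0.1, redundancy_max=0.8):
    """Two-step selection: drop class-irrelevant terms, then greedily drop
    terms too redundant with an already-kept, higher-scoring term."""
    vocab = set().union(*(d for d, _ in docs))
    # Step 1: keep only terms whose class relevance clears the threshold.
    scored = [(t, term_class_mi(t, docs)) for t in vocab]
    scored = sorted(((t, s) for t, s in scored if s >= relevance_min),
                    key=lambda ts: (-ts[1], ts[0]))
    # Step 2: greedy redundancy filter over the relevance-ranked survivors.
    kept = []
    for t, _ in scored:
        if all(term_term_nmi(t, k, docs) < redundancy_max for k in kept):
            kept.append(t)
    return kept

print(select_features(DOCS))
```

On this toy corpus, "win" occurs evenly across all classes and is removed by the relevance step, while pairs with identical occurrence patterns (e.g. "sports"/"team") lose one member to the redundancy step — the behavior the two-step design targets.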