A Combined Feature Selection Method for Chinese Text Categorization

Feature selection is an important application in the field of Chinese text categorization. However, the traditional Chinese feature selection methods are based on conditional independence assumption; therefore there are many redundancies in feature subsets. In this paper a combined feature selection method of Chinese text is proposed and this method is designed by the regularized mutual information (RMI) and Distribute Information among Classes (DI). It takes two steps to execute feature selection. In the first step, Distribute Information algorithm is used to remove features which are irrelevant of text category and redundant features are eliminated by regularized mutual information in the second step. The experimental results show that this combined feature selection method can improve the quality of classification. Keywordsfeature selection; regularized mutual information; distribute information among class; feature redundancy; Chinese text categorization