Study on Feature Selection in Chinese Text Categorization

This paper introduces and compares eight feature selection methods for text categorization. Two of the eight are proposed here: Multi-Class Odds Ratio (MC-OR), a variant of the Odds Ratio measure commonly used in binary classification, and a new feature selection method based on Class Discriminating Words (CDW). Each method is combined with the classic VSM classifier based on cosine similarity and with the Naive Bayes classifier, and training and testing are carried out on two text sets with different class distributions. The results indicate that MC-OR and CDW achieve the best feature selection performance.
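For orientation, the standard binary Odds Ratio that the multi-class variant builds on can be sketched as follows. This is a minimal illustration, not the method proposed in the paper: the function names `odds_ratio` and `top_k_terms`, the smoothing constant `eps`, and the representation of documents as token sets are all assumptions made for the example, and the multi-class extension itself is not reproduced here.

```python
import math

def odds_ratio(docs_pos, docs_neg, term, eps=0.5):
    """Binary Odds Ratio of `term` for the positive class.

    docs_pos and docs_neg are lists of token sets (hypothetical input format).
    eps is an assumed smoothing constant that keeps the ratio finite.
    """
    # Smoothed document-frequency estimates of P(t | c) and P(t | not c)
    p_pos = (sum(term in d for d in docs_pos) + eps) / (len(docs_pos) + 2 * eps)
    p_neg = (sum(term in d for d in docs_neg) + eps) / (len(docs_neg) + 2 * eps)
    # log [ P(t|c) (1 - P(t|~c)) / ((1 - P(t|c)) P(t|~c)) ]
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))

def top_k_terms(docs_pos, docs_neg, k):
    """Return the k terms with the highest Odds Ratio for the positive class."""
    vocab = set().union(*docs_pos, *docs_neg)
    # Sort by descending score, breaking ties alphabetically for determinism
    ranked = sorted(vocab, key=lambda t: (-odds_ratio(docs_pos, docs_neg, t), t))
    return ranked[:k]

if __name__ == "__main__":
    finance = [{"stock", "market"}, {"stock", "bank"}]
    sports = [{"football", "match"}, {"football", "market"}]
    print(top_k_terms(finance, sports, 2))
```

A term that occurs equally often in both classes scores zero, while terms concentrated in the positive class score highest, which is why the measure is used to rank candidate features.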